RATE CONTROL TECHNIQUES FOR H.264/AVC VIDEO WITH
ENHANCED RATE-DISTORTION MODELING
by
Do-Kyoung Kwon
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2006
Copyright 2006 Do-Kyoung Kwon
Dedication
This dissertation is dedicated with loving appreciation to my family. Without their
constant support and unwavering belief in my abilities, this successful work would
never have been achieved.
Acknowledgements
First of all, I would like to thank my advisor, C.-C. Jay Kuo, for his inspiring and
encouraging way to guide me to a deeper understanding of knowledge work, and his
invaluable comments during the whole work with this dissertation. He taught me
how to ask questions, how to approach different solutions, and how to express
my ideas and work. He was always there to listen and to give advice.
I also would like to thank the rest of my thesis committee: Antonio Ortega and
Ulrich Neumann, who reviewed my work and gave insightful comments.
I am very grateful to my sincere friends, Kwansik Kim, Young-Jae Kim and
Woo-Young Jeong, who have taken care of me in many ways since I first came
to USC. A special thanks goes to my friends, Yongho Han and Segi Hong. They
shared joy and sorrow with me throughout my academic years. They were truly
my reliable friends as life advisors as well as academic colleagues.
Let me also express my deep appreciation to all my friends at USC, Jun-Seong
Park, Jae-Joon Lee, Yongjin Cho, Jong-Dae Oh, Byung-Ho Cha, Hyo-Ryol Cho,
Junghun Park, Tae-Hoon Shin, Zungho Zun and Jonghye Woo. Without their
sincere care and consideration, I could never have succeeded in my Ph.D. program.
Finally, to my parents and my wife, Yeonjoo, thanks for supporting me with your
love and unwavering belief in my abilities, and to my brothers and sisters-in-law,
thanks also for your consideration and patience.
Table of Contents
Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Introduction
1.1 Significance of the Research
1.2 Previous Work
1.3 Contributions of the Research
1.4 Organization of the Dissertation
2 Background Review
2.1 Block-Based Video Coding
2.2 New Features in H.264
2.3 Constraints on Video Coding System
2.4 Digital Video Applications
2.5 Review of Rate Control Algorithms
2.5.1 Rate Control for Conversational Applications
2.5.2 Rate Control for Non-Conversational Applications
2.5.2.1 Rate Control with Fixed GOP
2.5.2.2 Rate Control with Adaptive GOP
3 Rate Control for H.264 Video with Enhanced Rate and Distortion Models
3.1 Introduction
3.2 Proposed Encoder Structure
3.3 Enhanced Rate and Distortion Models
3.3.1 Rate Model for Header Bits
3.3.2 Rate Model for Source Bits
3.3.3 Distortion Model
3.3.4 Block Type Identification
3.4 Two-Stage Rate Control for H.264 Baseline-Profile Encoder
3.5 Experimental Results
3.6 Conclusion
4 Two-Pass Frame-Layer Bit Allocation for H.264 Video
4.1 Introduction
4.2 Problem Formulation
4.3 Two-Pass Frame-Layer Bit Allocation
4.3.1 First-Pass Encoding
4.3.2 Frame-Layer Bit Allocation
4.3.3 Second-Pass Encoding
4.3.4 Two-Pass Rate Control with Bit Allocation
4.4 Experimental Results
4.5 Conclusion
5 Frame-Layer Bit Allocation for H.264 Video via GOP Rate Modeling
5.1 Introduction
5.2 One-Pass Frame-Layer Bit Allocation
5.2.1 Research Motivation
5.2.2 GOP Complexity Measure
5.2.3 GOP Rate Modeling
5.2.3.1 The I-P-P-P Case
5.2.3.2 The I-B-B-P Case
5.2.4 Summary of Algorithm
5.3 Simplified One-Pass Rate Control
5.4 Experimental Results
5.5 Conclusion
6 Rate Control for H.264 Video with Adaptive GOP Structure
6.1 Introduction
6.2 Rate Control with Adaptive GOP
6.2.1 Overview of Algorithm
6.2.2 I-Frame Decision
6.2.3 P-Frame Decision
6.2.3.1 Problem Formulation
6.2.3.2 Enhanced GOP Rate and Distortion Models
6.2.3.3 Joint GOP Structure Decision and Frame-Layer Bit Allocation
6.2.3.4 Rate Control Algorithm with Adaptive GOP Structure
6.3 GOP-Layer Bit Allocation
6.4 Experimental Results
6.5 Conclusion
7 Conclusion and Future Work
7.1 Summary of the Research
7.2 Future Research Directions
Reference List
List Of Tables
3.1 Performance comparison for P frames of the header rate models.
3.2 Performance comparison for P frames of different source rate models.
3.3 Performance comparison for P frames of different distortion models.
3.4 Performance of the proposed rate control algorithms with and without bit
allocation. The target bit rates are 48, 64 and 96 Kbps for all sequences.
4.1 Performance of the proposed two-pass algorithm for the I-P-P-P case with
various sequences.
4.2 Performance of the proposed two-pass algorithm for the I-B-B-P case with
various sequences.
5.1 Performances by the proposed one-pass algorithms for the I-P-P-P case with
the same sequences and at the same bit rates in Table 4.1.
5.2 Performances by the proposed one-pass algorithms for the I-B-B-P case with
the same sequences and at the same bit rates in Table 4.2.
6.1 Performances by the different rate control schemes for QCIF sequences.
6.2 Composite QCIF sequences for the second experiment.
6.3 Performances by the different rate control schemes for the composite
sequences in Table 6.2.
List Of Figures
2.1 The block diagram of the standard compatible encoder.
2.2 A video encoder and decoder (codec) system.
2.3 Exemplary GOP structures: (a) fixed GOP and (b) variable GOP.
3.1 The R-D curves by setting $QP_2$ values in quantization to (a) $QP_1$ and
$QP_1 \pm 3$, and (b) $QP_1$ and $QP_1 \pm 1$, where $QP_1$ is used for RDO.
3.2 Overview of the proposed encoder structure, where RDO is performed for all
MBs using $QP_1$ at the first stage and the residual signal of a MB is then
quantized using $QP_2$ at the second stage.
3.3 (a) The average percentages of header bits (including motion vectors and
modes) and source bits (including residual and CBP) of P frames at various
quantization parameters, and (b) the source bits and header bits of a P frame
when QP = 35 as a function of the frame number.
3.4 The relationship between $R_{hdr,inter}$ and $(N_{nzMVe} + \omega \cdot N_{MV})$
for three test sequences: (a) P frames and (b) B frames.
3.5 Verification of the source rate model given in (3.12), where the horizontal
axis is $SATD_c(Q)/Q^p$ and the vertical axis is the source rate for (a) I
frames ($p = 0.8$), (b) P frames ($p = 1$) and (c) B frames ($p = 1$).
3.6 The relationship between the distortion of coded blocks and
$SATD_c(Q) \cdot Q^p$ for three test sequences in (a) I frames, (b) P frames
and (c) B frames.
3.7 The relationship between the actual total distortion and the estimated
total for several test sequences in (a) I frames, (b) P frames and (c) B
frames.
3.8 The variation of quantization parameters in a frame by the proposed rate
control scheme without bit allocation (solid line) and the rate control in
JM8.1a (dashed line): (a)-(b) the 10-th and 50-th frames of “Mother & Daughter”
at 48 Kbps; (c)-(d) the 10-th and 50-th frames of “Silent” at 64 Kbps; and
(e)-(f) the 10-th and 50-th frames of “Salesman” at 96 Kbps.
3.9 The distributions of $QP_1 - QP_2$ by the proposed MB layer rate control
without bit allocation for the QCIF sequences, (a) “Foreman” and (b) “News” at
64 Kbps.
3.10 Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Mother &
daughter” at 48 Kbps: (a) the number of allocated bits per frame and (b) the
PSNR value per frame.
3.11 Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Silent” at 64
Kbps: (a) the number of allocated bits per frame and (b) the PSNR value per
frame.
3.12 Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Salesman” at 96
Kbps: (a) the number of allocated bits per frame and (b) the PSNR value per
frame.
4.1 The variations of PSNRs of (a) “News” (QCIF, 64 Kbps) and (b) “Carphone”
(CIF, 128 Kbps) for the I-P-P-P case.
4.2 The variations of PSNRs of (a) “Paris” (QCIF, 1024 Kbps) and (b) “Football”
(SIF, 1024 Kbps) for the I-P-P-P case.
4.3 The variations of PSNRs of (a) “Foreman” (QCIF, 64 Kbps) and (b)
“Coastguard” (QCIF, 128 Kbps) for the I-B-B-P case.
4.4 The variations of PSNRs of (a) “Flower” (CIF, 512 Kbps) and (b) “Stefan”
(SIF, 1024 Kbps) for the I-B-B-P case.
5.1 $QP_1$ of each frame for (a) the I-P-P-P case and (b) the I-B-B-P cases in
the one-pass algorithm based on the Lagrange optimization framework.
5.2 The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of the first,
third and fifth GOPs for (a) the “News” (QCIF), (b) the “Table Tennis” (QCIF),
(c) the “Stefan” (SIF) and (d) the “Paris” (CIF) sequences in the I-P-P-P case.
5.3 The relationships between the GOP rate $R_{GOP}$ and $S/\sqrt{\lambda}$ of
the first, third and fifth GOPs for (a) the “News” (QCIF), (b) the “Table
Tennis” (QCIF), (c) the “Stefan” (SIF) and (d) the “Paris” (CIF) sequences in
the I-P-P-P case.
5.4 The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of the first,
third and fifth GOPs for (a) the “Carphone” (QCIF), (b) the “Trevor” (QCIF),
(c) the “Football” (SIF) and (d) the “Bus” (CIF) sequences in the I-B-B-P case.
5.5 The relationships between the GOP rate $R_{GOP}$ and $S/\sqrt{\lambda}$ of
the first, third and fifth GOPs for (a) the “Carphone” (QCIF), (b) the “Trevor”
(QCIF), (c) the “Football” (SIF) and (d) the “Bus” (CIF) sequences in the
I-B-B-P case.
5.6 The variations of PSNRs of (a) “News” (QCIF, 64 Kbps) and (b) “Carphone”
(QCIF, 128 Kbps) for the I-P-P-P case.
5.7 The variations of PSNRs of (a) “Paris” (CIF, 1024 Kbps) and (b) “Football”
(SIF, 1024 Kbps) for the I-P-P-P case.
5.8 The variations of PSNRs of (a) “Foreman” (QCIF, 64 Kbps) and (b)
“Coastguard” (QCIF, 128 Kbps) for the I-B-B-P case.
5.9 The variations of PSNRs of (a) “Flower” (CIF, 512 Kbps) and (b) “Stefan”
(SIF, 1024 Kbps) for the I-B-B-P case.
6.1 Flowchart of rate control with adaptive GOP structure.
6.2 The variation of MAD between down-sampled frames in the QCIF sequences, (a)
“Table Tennis” and (b) “Trevor”.
6.3 The GOP structure in display order.
6.4 The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of (a) the
“Carphone” (QCIF), (b) the “Trevor” (QCIF), (c) the “Football” (SIF) and (d)
the “Paris” (CIF) sequences.
6.5 The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of (a) the
“Trevor” (QCIF) and (b) the “Paris” (CIF) sequences when the GOP complexity
measure in Chapter 5 is employed.
6.6 The relationships between the GOP distortion $D_{GOP}$ and $Q$ of (a) the
“Carphone” (QCIF), (b) the “Trevor” (QCIF), (c) the “Football” (SIF) and (d)
the “Paris” (CIF) sequences.
6.7 The variations of PSNRs by different rate control schemes for (a)
“Container” (QCIF, 64 Kbps) and (b) “Akiyo” (QCIF, 128 Kbps).
6.8 The variations of PSNRs by different rate control schemes for (a) “Sequence
1” (QCIF, 64 Kbps) and (b) “Sequence 2” (QCIF, 64 Kbps).
6.9 The variations of PSNRs by different rate control schemes for (a) “Sequence
3” (QCIF, 128 Kbps) and (b) “Sequence 4” (QCIF, 128 Kbps).
Abstract
In this research, we propose rate control algorithms with enhanced rate and distor-
tion modeling for H.264 video in various applications.
We first propose an enhanced rate control scheme for real-time conversational
applications. Compared with existing H.264 rate control, the proposed scheme
offers several new features. First, the inter-dependency between RDO and rate
control is resolved by allowing different quantization parameter values at the RDO
process and the quantization process, respectively. Second, to address the increased
importance of header bits, a header rate model is established so as to estimate
header bits more accurately. To be more specific, the number of header bits is
modeled as a function of the number of non-zero MV elements and the number of
MVs. Third, a new source rate model and a distortion model are proposed. For
this purpose, coded 4×4 blocks are identified and the number of source bits and
distortion are modeled as functions of the quantization stepsize and the complexity
of coded 4×4 blocks. Built upon the above ideas, a rate control algorithm is
developed for real-time conversational applications under the CBR constraint.
For non-conversational H.264 video that has a fixed GOP (Group of Pictures)
structure, we propose frame-layer bit allocation algorithms. Under the
assumption that frames are independent, a two-pass algorithm based on the
Lagrange optimization framework is proposed first as a fundamental study. Then,
to reduce the encoding complexity, a one-pass algorithm via GOP-based rate
modeling is proposed, also based on the Lagrange optimization framework.
Instead of estimating the R-D data of future frames directly, the one-pass
algorithm estimates the Lagrange multiplier using GOP rate models, which
characterize the number of bits consumed by a GOP. For this purpose, GOP-based
R-Q and R-λ models are investigated. Finally, we propose a simplified one-pass
algorithm by exploiting the monotonicity property. The simplified algorithm
does not require any frame rate and distortion model. Thus, the rate control
process can be greatly simplified.
The GOP structure may change adaptively according to the spatial and tempo-
ral scene contexts. We address the rate control problem with varying GOP struc-
tures as well. We point out an important issue in this case, the inter-dependency
of frame-layer bit allocation and GOP structure decision, and resolve this problem
using the simplified bit allocation scheme and GOP rate and distortion models.
Finally, we propose a GOP-layer bit allocation algorithm using GOP rate and dis-
tortion models. This algorithm achieves higher average quality as well as smoother
visual quality variation.
Chapter 1
Introduction
1.1 Significance of the Research
Aiming at reducing the number of bits required to represent source video
signals at a given level of quality, many video coding standards have been
developed over the last two decades to facilitate digital video technologies.
Examples include ITU-T H.261 [12], H.263 [13], ISO/IEC MPEG-1, MPEG-2 [30] and
MPEG-4 [44]. Remarkable advances in digital video technologies have made it
easy to create, store and transfer visual information in various applications
such as video streaming, recording and playback.
Most video coding standards employ the block-based hybrid coding scheme to
achieve high coding efficiency. Under such a scheme, a frame is divided into
smaller blocks of the same size, and the spatial redundancy within each block
is removed using the discrete cosine transform (DCT). After the transform, the
DCT coefficients are quantized and variable-length coded by an entropy coder.
Coding efficiency can be further improved by removing the temporal redundancy
between frames using motion-compensated prediction (MCP). A difference signal,
i.e., the motion-compensated residual signal, which has significantly reduced
energy, is produced by predicting an area of a frame from previously coded
frames. The residual signal is then coded through DCT and variable-length
coding so as to remove its remaining spatial redundancy.
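The hybrid coding loop described above can be illustrated with a minimal
sketch. It assumes a floating-point orthonormal DCT on a small square block and
a uniform quantizer with a single step size; H.264 itself uses an integer
transform, and all function names here are hypothetical and chosen only for
illustration. Entropy coding and motion search are omitted, so the prediction
block is simply given as an input.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_dct(block):
    """Separable 2-D DCT: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_dct(coeffs):
    """Inverse of forward_dct: C^T * coeffs * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def quantize(coeffs, step):
    """Uniform quantization of transform coefficients to integer levels."""
    return [[int(round(v / step)) for v in row] for row in coeffs]


def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]


def code_block(current, prediction, step):
    """Code one block: residual -> DCT -> quantize, then reconstruct."""
    residual = [[c - p for c, p in zip(cur_row, pred_row)]
                for cur_row, pred_row in zip(current, prediction)]
    levels = quantize(forward_dct(residual), step)
    rec_residual = inverse_dct(dequantize(levels, step))
    reconstruction = [[p + r for p, r in zip(pred_row, res_row)]
                      for pred_row, res_row in zip(prediction, rec_residual)]
    return levels, reconstruction
```

When the prediction matches the current block exactly, every quantized level is
zero, which is the case that a good motion-compensated prediction tries to
approach.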
Video coding standards define only the bitstream syntax and the decoding
process of syntax elements. An encoder only needs to produce a bitstream that
conforms to the syntax defined by the standard so that every
standard-compatible decoder can decode it. However, simply conforming to the
bitstream syntax of a video coding standard does not necessarily yield high
video quality. Therefore, most standard-compatible encoders employ their own
encoder optimization techniques to maximize the rate-distortion (R-D)
performance and thereby improve the quality of coded video in various
applications.
A rate control process constitutes an important function block in an encoder
even though it is not mandated by any video coding standard. When optimizing
the R-D performance of an encoder, we should first ensure that the encoded
bitstream does not violate transmission constraints such as the channel rate,
the encoder/decoder buffer sizes and the end-to-end delay for smooth video
playback. There are several coding parameters to be adjusted to improve coding
efficiency, e.g., the frame rate, the frame type, the macroblock (MB) mode and
the quantization parameter (QP). Among them, the quantization parameter that
regulates the encoded bitstream plays a key role in rate control. Generally
speaking, a rate control algorithm determines quantization parameters in
creating an encoded bitstream such that coding efficiency is maximized while
all constraints are met at the same time. The other coding parameters must also
be chosen so that no imposed constraint is violated. This implies that they
should be optimized jointly with the quantization parameters. For example,
joint MB mode and QP selection was studied in [31,41]. Selection of the frame
type and the frame rate was studied jointly with rate control in [23,55] and
[26,45], respectively.
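To make the buffer constraint mentioned above concrete, the following is a
minimal sketch of an encoder-side buffer check. It assumes a simple leaky-bucket
model in which each coded frame adds its bits to the buffer and the channel
drains a constant number of bits per frame interval. The function name, its
parameters and the model itself are illustrative assumptions and are not taken
from this dissertation.

```python
def simulate_encoder_buffer(frame_bits, channel_rate, frame_rate, buffer_size):
    """Track encoder buffer fullness (in bits) for a sequence of coded frames.

    frame_bits   : bits produced for each frame, in coding order
    channel_rate : constant channel rate in bits per second
    frame_rate   : frames per second
    buffer_size  : encoder buffer capacity in bits
    Returns the fullness after each frame and whether the buffer ever overflowed.
    """
    drain_per_frame = channel_rate / frame_rate
    fullness = 0.0
    trace = []
    overflow = False
    for bits in frame_bits:
        fullness += bits                      # the coded frame enters the buffer
        if fullness > buffer_size:
            overflow = True                   # constraint violated for this frame
        fullness = max(0.0, fullness - drain_per_frame)  # channel drains the buffer
        trace.append(fullness)
    return trace, overflow
```

A rate control algorithm would use such a check to reject or adjust a
quantization parameter whose predicted frame size would overflow the buffer.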
An emerging video coding standard, H.264, has recently been developed jointly
by ITU-T and ISO/IEC. Following a framework similar to that of previous
standards, it adopts several new techniques to improve the coding performance.
Its coding efficiency is significantly better than that of previous video
coding standards in a wide range of applications [50,51]. The substantial
coding gain achieved by H.264 has drawn a lot of attention. As a consequence,
it becomes important to develop an efficient rate control algorithm for an
H.264 encoder.
The main focus of this research is to develop efficient rate control algorithms
via rate and distortion modeling for H.264 encoders in real-time conversational
applications as well as non-conversational applications. At the same time, we
are interested in developing algorithms for joint selection of the quantization
parameter, the frame type and the MB mode in different types of applications.
Among the new coding techniques adopted in H.264, the R-D optimized motion
estimation and MB mode decision (RDO) with various intra- and inter-prediction
modes and multiple reference frames contributes considerably to high coding
efficiency. For this reason, we also adopt the RDO technique for MB mode
decision so that the proposed algorithms remain compatible with the standard
H.264 encoder. However, depending on the type of application, there are still
quite a few challenging research problems to be addressed.
For both conversational and non-conversational applications, the following prob-
lems should be addressed in model-based H.264 rate control algorithms.
• While the RDO process can provide a high-quality video at a reduced bit rate,
as observed in [14,19,24,47,52,53], it makes H.264 rate control more difficult
and complicated. For example, when a model-based rate control approach is
applied, residual information such as the mean absolute difference (MAD) or the
variance is required to determine the proper quantization parameter. However,
the residual information is available only after the RDO process, which in turn
needs a quantization parameter pre-determined by rate control to generate it.
This inter-dependency of RDO and rate control, which was described as a
“chicken and egg” dilemma in [24], makes rate control in an H.264 encoder more
challenging than in the previous standards.
One possible solution to the inter-dependency problem between RDO and rate
control is to employ a multiple encoding approach. That is, each frame or MB is
compressed multiple times using admissible quantization parameters and the best
quantization parameter is selected. However, this is a computationally
intensive procedure, which cannot be applied in real-time conversational
applications such as video conferencing. Even for non-conversational
applications such as video streaming and video on a storage medium, where a
video sequence can be encoded off-line, this approach is hardly used in a
light-weight encoder that has a strict complexity constraint. We should note
that even one additional RDO process increases the encoder complexity
substantially. A novel method that decouples the inter-dependency of RDO and
rate control is proposed in Chapter 3.
• Accurate source rate and distortion modeling is critical to successful
model-based rate control algorithms. Thus, accurate source rate and distortion
models are necessary for an H.264 encoder. In addition, a header rate model is
necessary to estimate header bits more precisely because of the increased
importance of header bits in H.264. To be more specific, since various MB
coding modes with multiple reference frames are allowed, the number of header
bits associated with header information such as MB modes, motion vectors (MVs)
and reference frames varies considerably from frame to frame. Sometimes header
bits may even occupy a larger portion of the total bit budget than source bits,
and their impact is more obvious at low bit rates. The header information is
thus as important as the residual source signal to rate control of H.264.
Header rate modeling as well as enhanced source rate and distortion modeling
are addressed in Chapter 3.
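The multiple encoding approach mentioned in the first item above can be
sketched as an exhaustive search over admissible quantization parameters. The
sketch assumes a caller-supplied encode function that returns the bits and
distortion of one full encoding pass; the toy encoder below is purely
hypothetical (rate inversely proportional to the quantization step size and a
uniform-quantizer distortion of step squared over 12) and only serves to make
the example runnable. The step size formula is the usual approximation in which
the step doubles for every increase of 6 in QP.

```python
def pick_qp_by_multiple_encoding(encode, candidate_qps, bit_budget):
    """Encode once per candidate QP and keep the lowest-distortion result
    whose bit count fits the budget. Returns (best, passes), where best is
    (qp, bits, distortion) or None if no candidate fits."""
    best = None
    passes = 0
    for qp in candidate_qps:
        bits, distortion = encode(qp)   # one full (and costly) encoding pass
        passes += 1
        if bits <= bit_budget and (best is None or distortion < best[2]):
            best = (qp, bits, distortion)
    return best, passes


def toy_encode(qp):
    """Hypothetical stand-in for a real encoder, for illustration only."""
    qstep = 0.625 * 2.0 ** (qp / 6.0)   # approximate H.264 step size
    bits = 20000.0 / qstep              # assumed rate behaviour
    distortion = qstep ** 2 / 12.0      # assumed distortion behaviour
    return bits, distortion


best, passes = pick_qp_by_multiple_encoding(toy_encode, range(20, 41), 1000.0)
```

The number of passes grows linearly with the number of candidate quantization
parameters, which is why this approach is unattractive when each pass includes
a full RDO process.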
In non-conversational applications, a set of frames are usually grouped into a
GOP structure, which can be either fixed or variable. In such applications, the
following additional research problems should be addressed.
• Suppose that a set of frames are grouped into a fixed GOP structure. Here,
another key issue is to allocate bits properly among the frames so that the
encoded video quality is maximized. Different frames have different R-D
characteristics according to their spatial complexities, temporal complexities
and frame types. Thus, they should be assigned different numbers of bits to
improve the video quality. It is well known that the frame-layer bit allocation
problem can be formulated as a constrained R-D optimization problem and solved
by Lagrange optimization or dynamic programming. Even with the same problem
formulation and the same solution approach (i.e., Lagrange optimization or
dynamic programming), the encoded video quality depends on how accurately the
estimated R-D characteristics of frames in a GOP match the actual R-D
characteristics. In Chapters 4 and 5, novel methods to estimate the R-D
characteristic of a GOP are proposed for efficient frame-layer bit allocation.
• The GOP structure may change adaptively according to the spatial and temporal
scene contexts. That is, to enhance the coded video quality, we choose the
frame type adaptively and allocate bits to each frame accordingly. However, the
adaptive GOP structure makes the frame-layer bit allocation problem more
challenging due to the inter-dependency of bit allocation and GOP structure
decision. To be more specific, on one hand, since the R-D characteristics of
each frame depend on its frame type, the frame-layer bit allocation problem
depends on the GOP structure. On the other hand, since the best type of each
frame depends on the number of bits allocated to it, the GOP structure decision
also depends on frame-layer bit allocation.
A straightforward solution to this inter-dependency problem is to encode a GOP
with every candidate GOP structure, applying frame-layer bit allocation each
time, and then choose the GOP structure that gives the best R-D performance.
However, this demands multiple encoding passes, which results in an extremely
high coding complexity. An effective algorithm is proposed to solve this
problem in Chapter 6.
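The Lagrange optimization approach to frame-layer bit allocation mentioned
above can be sketched as follows, under the assumption that frames are
independent and that each frame offers a small set of (rate, distortion)
operating points. For a given multiplier, every frame independently picks the
point that minimizes distortion plus the multiplier times rate, and a bisection
search on the multiplier finds the smallest value that meets the bit budget.
This is the generic textbook procedure with hypothetical names, not the
specific algorithm of Chapters 4 to 6.

```python
def allocate_bits(frames, bit_budget, iterations=50):
    """Lagrangian bit allocation for independent frames.

    frames     : list of lists of (rate, distortion) operating points
    bit_budget : total number of bits available for all frames
    Returns (chosen_points, multiplier).
    """
    def pick(multiplier):
        # Each frame minimizes its own Lagrangian cost D + multiplier * R.
        choice = [min(points, key=lambda rd: rd[1] + multiplier * rd[0])
                  for points in frames]
        return choice, sum(rate for rate, _ in choice)

    choice, total = pick(0.0)
    if total <= bit_budget:
        return choice, 0.0            # minimum distortion already fits

    low, high = 0.0, 1.0
    while pick(high)[1] > bit_budget:  # grow the bracket until the budget is met
        high *= 2.0
        if high > 1e12:
            raise ValueError("bit budget cannot be met by any operating point")

    for _ in range(iterations):        # bisection on the multiplier
        mid = 0.5 * (low + high)
        if pick(mid)[1] > bit_budget:
            low = mid
        else:
            high = mid
    return pick(high)[0], high
```

Because each evaluation of the multiplier only needs per-frame minimizations,
the search cost is small once the operating points are known; the difficulty
addressed in the later chapters lies in obtaining those points without
encoding every frame many times.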
1.2 Previous Work
Several model-based rate control algorithms have been proposed for H.264. Most
of them are based on algorithms that were proposed for previous video coding
standards. They focus on methods to resolve the inter-dependency of RDO and
rate control while employing existing rate and distortion models.
In [24], the quadratic rate model proposed in [2] is employed to determine the
quantization parameter of a basic unit, which can be a frame, a slice or an MB.
Since the residual signal is not available before the RDO process, the MAD of
each basic unit in the current frame is estimated from the MAD of the
collocated basic unit in the previous frame using a linear model. This
algorithm is implemented in the H.264 software encoder. The same quadratic rate
model and linear MAD estimation model are used in [14]. However, to improve the
performance of [24], a new complexity measure using the MAD ratio and the PSNR
drop ratio is developed for improved bit allocation. In [53], the H.263 TMN8
rate and distortion models proposed in [38] are employed. The residual signal
of each frame is first estimated by performing the RDO process with a reduced
set of reference frames and intra- and inter-prediction modes. After that, the
standard deviation of the estimated residual signal is fed into the H.263 TMN8
rate and distortion models to refine the quantization parameter of each MB.
Then, the RDO process is performed again using the refined quantization
parameter. Other H.264 rate control algorithms, e.g., [19,47,52], were
developed by following a similar idea.
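The combination of a linear MAD prediction and a quadratic rate model described
above can be sketched as follows. The sketch assumes the form in which these
models are commonly stated in the literature: the MAD of the current unit is a
linear function of the MAD of the collocated unit in the previous frame, and
the texture bits are modeled as x1 * MAD / Q + x2 * MAD / Q^2, where Q is the
quantization step size. The parameter names and the numerical values are
hypothetical, and the regression that updates the model parameters after each
coded unit is omitted.

```python
import math


def predict_mad(previous_mad, a1=1.0, a2=0.0):
    """Linear MAD prediction from the collocated unit of the previous frame."""
    return a1 * previous_mad + a2


def quadratic_rate(qstep, mad, x1, x2):
    """Bits predicted by the quadratic rate model for a given step size."""
    return x1 * mad / qstep + x2 * mad / qstep ** 2


def solve_qstep(target_bits, mad, x1, x2):
    """Invert the quadratic rate model: find the step size that is predicted
    to produce target_bits. Takes the positive root of
    target_bits * Q^2 - x1 * mad * Q - x2 * mad = 0."""
    if target_bits <= 0 or mad <= 0:
        raise ValueError("target_bits and mad must be positive")
    b = x1 * mad
    c = x2 * mad
    discriminant = b * b + 4.0 * target_bits * c
    if discriminant < 0:
        raise ValueError("model has no real solution for this target")
    return (b + math.sqrt(discriminant)) / (2.0 * target_bits)


mad = predict_mad(4.0, a1=1.0, a2=0.5)          # hypothetical numbers
qstep = solve_qstep(2000.0, mad, x1=1000.0, x2=500.0)
```

The chosen step size is only as good as the predicted MAD, which is the
weakness of this family of algorithms when the content changes abruptly between
frames.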
The above rate control algorithms are one-pass ones that make use of heuristics
to allocate bits among frames. Instead of analyzing the R-D characteristics of
frames, the number of frame bits is determined simply according to the frame
complexity, buffer occupancy and so on. For example, preceding reference frames
are allocated more bits heuristically to improve encoded video quality. Bit
allocation methods of this kind are preferred in many encoders due to their
simplicity and acceptable video quality. However, robust performance cannot be
guaranteed for a video with varying characteristics, especially in
non-conversational applications.
It is worthwhile to point out that none of the previous algorithms addresses
the importance of header bits. They suffer from a large error in header bit
estimation since they do not provide accurate header rate models. They also
suffer from inaccurate source rate and distortion modeling since they employ
rate and distortion models that were originally developed for previous video
coding standards. For this reason, they fail to control the bit rate precisely,
which makes them inadequate for conversational applications, in which precise
bit rate control is very important.
The previous algorithms assume that frames are encoded with a fixed pattern of
frame types, i.e., a fixed GOP structure, regardless of spatial and temporal
characteristics. In [6], a frame type decision method was proposed for an H.264
encoder. However, it does not consider bit rate control at the same time. To
the best of our knowledge, an adaptive frame type or GOP structure is not
considered in any previous rate control algorithm for H.264.
1.3 Contributions of the Research
First of all, we point out two issues associated with rate control for the H.264 en-
coders in both conversational and non-conversational applications. One is the inter-
dependency problem between RDO and rate control. The other is the enhanced
frame rate and distortion modeling for efficient rate control. We have addressed
these two issues and made the following contributions in the research presented in
Chapter 3.
• To resolve the inter-dependency of RDO and rate control, we propose a
two-stage encoding scheme. It is motivated by the fact that even though the
quantization parameters at the RDO process and the quantization process are
different, the coding gain loss is not significant as long as the difference
between them is small. Thus, we divide the encoding process into two stages:
the first stage for RDO using an initial quantization parameter and the second
stage for quantization using a quantization parameter determined by a rate
control algorithm. The two-stage encoding scheme not only decouples the
inter-dependency of RDO and rate control but also makes accurate rate and
distortion modeling possible in H.264 since the motion-compensated residual
signal and the header information are available after the first stage of
encoding.
• We propose a header rate model to estimate the header bits required to encode
the header information. In previous video coding standards, there is no need to
estimate header bits using a model since the number of header bits is small and
does not vary much from frame to frame. However, this is no longer true for
H.264. The accurate estimate of header bits using the proposed header rate
model improves the rate control performance.
• We propose enhanced source rate and distortion models based on coded block
identification. They are motivated by the facts that no bit is necessary for
skipped blocks and the distortion of skipped blocks can be directly computed
from the residual signal. The accuracies of rate and distortion models are
significantly improved by identifying coded blocks and considering only those
blocks for rate and distortion modeling.
• We propose novel rate control algorithms for conversational video coding ap-
plications, which are based on the two-stage encoding scheme along with new
rate and distortion models. We demonstrate that the proposed algorithms
outperform the existing H.264 rate control algorithm in many aspects such as
the encoded video quality, the target bit rate control, etc.
For non-conversational applications where a set of frames are grouped into a
fixed GOP structure, we have made the following contributions in the research
presented in Chapters 4 and 5.
• We propose a two-pass frame-layer bit allocation algorithm in Chapter 4. As-
suming that frames in a GOP are independent, we formulate the frame-layer
bit allocation problem as a bit-budget constrained R-D optimization problem.
Then, we propose the two-pass algorithm based on the Lagrange optimiza-
tion framework. We show that the proposed two-stage encoding scheme and
frame rate and distortion models can be employed successfully to solve the
frame-layer bit allocation problem.
• Based on the Lagrange optimization framework, we propose a one-pass
frame-layer bit allocation in Chapter 5 to reduce the encoding complexity
incurred by the two-pass algorithm. In previous one-pass algorithms, the R-
D characteristic of future frames is estimated based on previously encoded
frames. However, the R-D characteristic of a frame can hardly be estimated
without any information on its residual signal. Therefore, instead of estimating
the R-D characteristic of each frame, the proposed one-pass algorithm esti-
mates the GOP rate and performs bit allocation by estimating directly the
Lagrange multiplier that satisfies the bit-budget constraint. For this purpose,
we develop two GOP rate models, namely the R-Q and R-λ models. It is
demonstrated that the rate control algorithm with the proposed bit allocation
scheme outperforms the existing H.264 rate control algorithm.
• We also propose a simplified one-pass algorithm in Chapter 5 to reduce the
encoding complexity further. The simplified algorithm determines the quan-
tization parameter of each frame based on its frame type, instead of the
Lagrange optimization framework. In this case, the number of bits allocated to each
frame is determined by its quantization parameter. The simplified algorithm
does not require any frame rate and distortion model. It is worthwhile to
point out that accurate frame rate and distortion modeling is not a trivial task
and makes an encoder complicated. Since the simplified algorithm only requires the
GOP-based R-Q model, the rate control process can be simplified significantly.
To improve the coded video quality by exploiting the spatial and temporal
scene contexts, the GOP structure is adaptively determined. We point out one
important issue related to frame-layer bit allocation, i.e., the inter-dependency
between frame-layer bit allocation and GOP structure decision. We address this
problem and make the following contributions in Chapter 6.
• We propose enhanced GOP rate and distortion models. The GOP rate and
distortion models in Chapter 6 are useful in rate control with adaptive GOP,
since they are robust against the structure change of a GOP. That is, it is
possible to estimate the rate and the distortion of a GOP accurately indepen-
dently of its structural change.
• The GOP structure is distinguished by the position of the I frame (equiv-
alently, the GOP size) and P frames (equivalently, the distance between P
frames). We propose an I frame selection method based on the mean absolute
difference (MAD) between original frames. Then, on top of the simplified bit
allocation scheme in Chapter 5, we propose an efficient joint GOP structure
decision and frame-layer bit allocation method. The inter-dependency be-
tween them is resolved using the simplified frame-layer bit allocation scheme
and the enhanced GOP rate and distortion models to avoid multiple encoding
passes. As compared with frame-layer bit allocation with a fixed GOP struc-
ture, the coded video quality is significantly improved by adaptively changing
the GOP structure and allocating bits accordingly.
1.4 Organization of the Dissertation
The rest of the dissertation is organized as follows. Some background material is
reviewed in Chapter 2. Novel two-stage rate control algorithms for conversational
video coding applications as well as enhanced rate and distortion models are pro-
posed in Chapter 3. In Chapters 4 and 5, the frame-layer bit allocation problem is
addressed by a two-pass algorithm and a one-pass algorithm, respectively, for non-
conversational video coding applications where a fixed GOP structure is adopted.
In Chapter 6, the rate control algorithm with an adaptive GOP structure is pro-
posed using GOP rate and distortion models. Finally, Chapter 7 provides some
concluding remarks and points out some future research directions.
Chapter 2
Background Review
Background knowledge related to the research in this dissertation is reviewed. The
general framework of video coding standards is presented in Sec. 2.1. Features newly
introduced in H.264 are described in Sec. 2.2. A video codec system, which
includes the encoder, the encoder buffer, the channel, the decoder buffer and the
decoder, is studied in Sec. 2.3. In Sec. 2.4, various types of video applications are
discussed. Finally, background knowledge on rate control is given in Sec. 2.5.
2.1 Block-Based Video Coding
An image sequence (or video) consists of a set of frames in a sequential order.
When an input image sequence is encoded at a certain frame rate by the standard-
compatible encoder, each frame is coded as one of three frame types, i.e., the intra
(I), the predicted (P), or the bi-directionally predicted (B) frames. An input frame
is segmented into macroblocks (MBs) of 16×16 samples.
Each MB can be either an intra block or an inter block. An intra block does
not depend on any other frames. Except for H.264, which uses intra prediction,
other video coding standards apply DCT-based encoding schemes to encode
pixel values in intra blocks directly. For an inter block, inter prediction is first
performed by the motion estimation process to search for a similar block in reference
frames. After the search, the motion compensated residual signal is obtained by
subtracting the predicted signal from the original signal. Then, the residual signal
is compressed by the DCT-based encoding scheme.
Figure 2.1: The block diagram of the standard compatible encoder.
To encode an MB using DCT, an MB is further segmented into blocks of 8×8
samples, where a block-DCT is applied (MBs can be more finely segmented into
blocks of 4×4 samples in H.264). After that, DCT coefficients are quantized by a
scalar quantizer. Then, quantized DCT coefficients are encoded by an entropy coder
using a variable-length code. In decoding, the compressed bit stream is decoded by
an entropy decoder to reconstruct quantized DCT coefficients. These coefficients go
through the inverse quantization and inverse DCT to reconstruct the pixel values
in a DCT block. The decoded I and P frames can be used as the reference frame
for other frames.
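The transform, quantization and reconstruction steps described above can be illustrated with a minimal sketch. It assumes an orthonormal floating-point DCT-II and a plain uniform quantizer with a hypothetical stepsize `step`; actual standards use their own transform precision and quantizer design, and the helper names below are ours. Entropy coding is omitted.

```python
import math
import random

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def encode_block(block, step):
    """Forward 2-D DCT (C X C^T) followed by uniform scalar quantization."""
    c = dct_matrix(len(block))
    coeffs = matmul(matmul(c, block), transpose(c))
    return [[round(v / step) for v in row] for row in coeffs]

def decode_block(levels, step):
    """Inverse quantization followed by the inverse 2-D DCT (C^T Y C)."""
    c = dct_matrix(len(levels))
    dequant = [[v * step for v in row] for row in levels]
    return matmul(matmul(transpose(c), dequant), c)

def mse(a, b):
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

# Illustrative 8x8 residual block (random values, not real video data).
random.seed(0)
residual = [[float(random.randint(-20, 20)) for _ in range(8)] for _ in range(8)]
for step in (1.0, 4.0, 16.0):
    levels = encode_block(residual, step)
    nonzero = sum(v != 0 for row in levels for v in row)
    err = mse(residual, decode_block(levels, step))
    print(f"step={step:5.1f}  nonzero levels={nonzero:2d}  MSE={err:.3f}")
```

A larger stepsize leaves fewer non-zero levels to be entropy coded and increases the reconstruction error, which is the rate-distortion trade-off that a rate control algorithm manages.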
Fig. 2.1 shows the block diagram of a block-based video encoder mentioned
above. We can see that several important coding parameters have to be specified
for an encoder. They are detailed below.
• The encoding frame rate
This is the number of frames to be encoded per second. In most cases, the
frame rate is fixed and set in advance according to applications. However, in
low bit-rate cases, the encoder performance can be improved by varying the
frame rate based on the temporal complexity [26,45].
• The frame type
The frame type of each frame should be chosen before its coding. The frame
type and its occurrence pattern in an image sequence highly depend on appli-
cation requirements. For example, for low-delay video coding such as video
conferencing, the first frame is encoded as an I frame and then its successive
frames are all encoded as P frames as long as there is no scene change. In other
applications such as video in storage devices, a set of frames is encoded using
a finite GOP (group of pictures) structure. Each frame is then coded as one
of three frame types sequentially according to the GOP structure. This kind
of frame type decision method is widely used due to its simplicity. However, it
does not consider the spatial and temporal characteristics of each frame. If a
more sophisticated frame type decision scheme is employed, coding efficiency
can be improved further [23].
• The macroblock type
There are two MB modes in general, i.e., intra block and inter block modes,
and the encoder should decide the block type for each MB. The video coding
standards specify only the restriction on MB modes in each frame type. For
example, every MB in I frames is coded as an intra block while an MB in P
frames can be coded as an intra or an inter block with forward prediction.
As to B frames, an MB can be coded as an intra block or an inter block with
forward prediction only, backward prediction only or both predictions. How
to choose the coding mode of an MB is the key issue in the encoder design
and its result will affect coding efficiency greatly [31,41].
• The quantization parameter
The scalar quantizer quantizes DCT coefficients using a stepsize which is pro-
portional to the quantization parameter. A smaller quantization parameter
yields higher video quality at the cost of a larger file size. On the contrary,
a larger quantization parameter can achieve more compression at the cost of
lower video quality. The trade-off between rate and distortion can be effec-
tively managed by adjusting the quantization parameter using a rate control
algorithm. Since most video coding standards allow a different quantization
parameter for each MB, an MB-layer rate control algorithm can be employed
when a more precise rate control is needed.
2.2 New Features in H.264
Several new video encoding techniques have been introduced in H.264 so that it can
achieve significantly improved coding efficiency over all previous coding standards.
The following are among the most important features.
• Variable block sizes and shapes for motion compensated prediction (MCP)
An MB has more flexibility in its size and shape for MCP. To be more specific,
an MB can be partitioned into smaller blocks of 16×16, 16×8, 8×16 and 8×8
samples, and the 8×8 partitions can be partitioned further into smaller blocks
of 8×8, 8×4, 4×8 and 4×4 samples. An MB is encoded using the partition
that represents the motion field optimally.
• Multiple references for MCP
More than one previous frame can be used as reference frames for forward
prediction. Each 8×8 block is allowed to have its own reference frame, which
means that up to four previous frames can be referenced by an MB at the same
time. This feature allows the encoder to exploit the long-term dependency
between frames so that the motion field can be represented more accurately.
• Intra prediction
In previous standards as shown in Fig. 2.1, an intra MB goes through the DCT
process directly without any prediction. In H.264, an MB can be predicted
from the neighboring coded blocks to exploit the spatial correlation in the
same frame. After intra prediction, the residual signal is then compressed by
the DCT process. In H.264, there are nine 4×4 intra prediction modes and
four 16×16 intra prediction modes. An intra MB is coded as the best mode
that represents the MB optimally.
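As an illustration of intra prediction, the following sketch predicts a 4×4 block from its reconstructed neighbors and keeps the mode with the smallest sum of absolute differences (SAD). It is a simplified example: only three of the nine 4×4 modes (vertical, horizontal and DC) are included, SAD stands in for the cost used by a real encoder, and the function names and sample values are ours.

```python
def predict_4x4(top, left, mode):
    """Return a 4x4 prediction built from reconstructed neighbors.

    top:  the 4 samples directly above the block
    left: the 4 samples directly to the left of the block
    """
    if mode == "vertical":
        return [list(top) for _ in range(4)]          # copy the row above downwards
    if mode == "horizontal":
        return [[left[r]] * 4 for r in range(4)]      # copy the left column rightwards
    if mode == "dc":
        dc = (sum(top) + sum(left) + 4) >> 3          # rounded mean of the 8 neighbors
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unknown mode: {mode}")

def sad(block, pred):
    return sum(abs(x - y) for rb, rp in zip(block, pred) for x, y in zip(rb, rp))

def best_intra_mode(block, top, left):
    """Pick the mode whose prediction leaves the smallest residual (by SAD)."""
    modes = ("vertical", "horizontal", "dc")
    costs = {m: sad(block, predict_4x4(top, left, m)) for m in modes}
    best = min(modes, key=costs.get)
    return best, costs[best]

# Each column of this block is constant, so vertical prediction fits exactly.
block = [[10, 20, 30, 40] for _ in range(4)]
top = [10, 20, 30, 40]
left = [12, 12, 12, 12]
print(best_intra_mode(block, top, left))  # ('vertical', 0)
```

Only the residual between the block and its prediction is passed on to the transform, which is why a well-matched mode reduces the number of bits.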
Besides prediction techniques mentioned above, other techniques such as 4x4
integer DCT, context-adaptive binary arithmetic coding (CABAC) and enhanced
in-loop deblocking filter are also adopted in H.264. Generally speaking, improved
prediction techniques are the major reason for H.264 to achieve excellent coding
efficiency. However, these techniques are computationally intensive and the encoder
complexity increases tremendously as a result.
The sophisticated prediction techniques of H.264 bring out an important issue;
namely, the optimal coding mode selection. For example, to determine the best MB
partition, the accuracy of the motion field representation and the number of bits
required to encode motion vectors (MVs) should be balanced in an optimal way.
For this purpose, the R-D optimized mode estimation and motion estimation [49],
which is called the R-D optimization (RDO) technique, is adopted by H.264 as a
non-normative recommendation. By the RDO process, the best MVs, reference
frames and MB mode are determined using the Lagrange optimization technique.
For more details of H.264, we refer to [50,51].
Figure 2.2: A video encoder and decoder (codec) system.
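The Lagrangian selection used by RDO can be sketched as follows. Each candidate carries a distortion D and a rate R, and the candidate minimizing J = D + λ·R is kept. The candidate list and its numbers are purely illustrative assumptions, not measurements, and `best_mode` is a hypothetical helper rather than part of any reference encoder.

```python
def best_mode(candidates, lam):
    """Return the candidate that minimizes the Lagrange cost J = D + lam * R."""
    return min(candidates, key=lambda c: c["D"] + lam * c["R"])

# Illustrative (made-up) distortion and rate values for three partitions.
candidates = [
    {"mode": "16x16", "D": 900.0, "R": 20},
    {"mode": "8x8", "D": 600.0, "R": 60},
    {"mode": "4x4", "D": 500.0, "R": 140},
]

for lam in (1.0, 5.0, 20.0):
    print(lam, best_mode(candidates, lam)["mode"])
# 1.0 4x4
# 5.0 8x8
# 20.0 16x16
```

A small multiplier favors fine partitions with low distortion, while a large multiplier favors coarse partitions that cost few bits. Since the multiplier depends on the quantization parameter, the mode decision and the quantization parameter are coupled.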
2.3 Constraints on Video Coding System
Fig. 2.2 shows a typical video codec system. An encoder generates a bitstream from
the input video source. The encoded bitstream is temporarily stored in an encoder
buffer before it is transmitted or stored. Likewise, the received bitstream is stored
in a decoder buffer before it is decoded by a decoder for display. There are several
factors that regulate the encoded bitstream, e.g., the channel bit rate, the encoder
and the decoder buffer sizes and the end-to-end delay. Different applications may
have different system requirements and configurations, thus leading to different
constraints on the encoded bitstream.
The available channel bandwidth sets a limit on the bit rate of the encoded bit-
stream. According to the type of transmission channels, an encoder may produce a
constant-bit-rate (CBR) bitstream or a variable-bit-rate (VBR) bitstream. For ex-
ample, CBR channels (such as dedicated connection with fixed channel bandwidth)
are often used in applications that require guaranteed transmission bandwidth and
easy network management. In contrast, VBR coding is used in applications that
demand higher video quality. Since VBR bitstreams are well matched with the
VBR characteristics of the encoded bitstream, better video quality can be obtained
with shorter delay with respect to the same average bit rate. However, in VBR
transmission channels, the channel feedback to the encoder should be provided so
as to achieve high video quality [34]. While we concentrate in the dissertation on
rate control over the CBR channel, our work could be extended to VBR transmis-
sion with slight modification.
Available encoder and decoder buffer sizes also regulate the encoded bitstream
[11,34,36]. In CBR, encoder and decoder buffers are used to smooth out the bit-rate
variation of a bitstream so that a constant bit rate (CBR) stream can be provided
to channels. In VBR, encoder and decoder buffers serve to lower the burden
of channels by shaping the traffic pattern and smoothing out the bit-rate variation.
The encoder has to regulate the encoded bitstream such that there is no buffer
overflow and underflow. In other words, the encoded bitstream is regulated by
encoder and decoder buffer sizes. Even with sufficiently large buffers, it is still
important to perform rate control on a bitstream so that constant end-to-end delay
can be maintained for smooth video playback. That means the end-to-end delay is
another factor that regulates the encoded bitstream.
Let $C_n$ be the channel rate during the $n$-th frame interval and $R_n$ be the number
of bits generated by the $n$-th frame. Let $B^e_{\max}$ and $B^d_{\max}$ be the encoder and decoder
buffer sizes, respectively. The encoder buffer occupancy $B^e_n$ at the $n$-th frame time
can be expressed as
\[ B^e_n = \begin{cases} \sum_{i=1}^{n} R_i - \sum_{i=\Delta N_e+1}^{n} C_i, & \text{if } n \ge \Delta N_e, \\ \sum_{i=1}^{n} R_i, & \text{if } n < \Delta N_e, \end{cases} \tag{2.1} \]
where $\Delta N_e$ is the initial encoder buffer delay that represents the duration the first
frame stays in the encoder buffer. Similarly, the decoder buffer occupancy $B^d_n$ at
the $n$-th frame time can be expressed as
\[ B^d_n = \begin{cases} \sum_{i=1}^{n} C_i - \sum_{i=1}^{n-\Delta N_d} R_i, & \text{if } n \ge \Delta N_d, \\ \sum_{i=1}^{n} C_i, & \text{if } n < \Delta N_d, \end{cases} \tag{2.2} \]
where $\Delta N_d$ is the initial decoder buffer delay that represents the duration the first
frame stays in the decoder buffer.
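The two occupancy expressions (2.1) and (2.2) translate directly into code. The sketch below is a literal transcription under the assumption that `R[i-1]` and `C[i-1]` hold $R_i$ and $C_i$; the frame sizes and the channel rate are illustrative values, and the function names are ours.

```python
def encoder_buffer(R, C, n, dn_e):
    """Encoder buffer occupancy B^e_n following (2.1).

    R[i-1] and C[i-1] hold R_i and C_i; dn_e is the initial encoder delay.
    """
    produced = sum(R[:n])                  # sum_{i=1}^{n} R_i
    if n < dn_e:
        return produced                    # nothing has left the buffer yet
    return produced - sum(C[dn_e:n])       # minus sum_{i=dn_e+1}^{n} C_i

def decoder_buffer(R, C, n, dn_d):
    """Decoder buffer occupancy B^d_n following (2.2)."""
    received = sum(C[:n])                  # sum_{i=1}^{n} C_i
    if n < dn_d:
        return received                    # decoding has not started yet
    return received - sum(R[:n - dn_d])    # minus sum_{i=1}^{n-dn_d} R_i

R = [12, 8, 10, 9, 11, 10]   # bits per frame (illustrative)
C = [10] * 9                 # constant channel rate per frame interval
for n in range(1, 7):
    print(n, encoder_buffer(R, C, n, dn_e=2), decoder_buffer(R, C, n, dn_d=3))
```

Tracing such a small example by hand is a convenient way to check the summation limits in (2.1) and (2.2).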
Decoder buffer underflow causes frame loss since frames cannot be decoded in
time, and decoder buffer overflow wastes channel resources. The
decoder buffer fullness depends on the encoded bit rate and the channel rate. Thus,
for the given channel rate, the initial encoder and decoder buffer delays and available
encoder and decoder buffers, an encoder has to regulate the encoded bitstream so
as to prevent a decoder buffer from underflow and overflow. For CBR channels
(i.e., $C_n = C, \forall n$), we can easily show from (2.1) and (2.2) that the decoder buffer
occupancy at the $(n+\Delta N_d)$-th frame time is written as
\[ B^d_{n+\Delta N_d} = (\Delta N_e + \Delta N_d) \cdot C - B^e_n. \tag{2.3} \]
In order to prevent the decoder underflow and overflow, we demand
\[ 0 \le B^d_{n+\Delta N_d} \le B^d_{\max}. \tag{2.4} \]
Then, it can be easily shown from (2.3) and (2.4) that $B^e_n$ should satisfy
\[ (\Delta N_e + \Delta N_d) \cdot C - B^d_{\max} \le B^e_n \le (\Delta N_e + \Delta N_d) \cdot C. \tag{2.5} \]
Suppose that the encoder buffer and the decoder buffer are of the same size, i.e.,
\[ B^e_{\max} = B^d_{\max} = (\Delta N_e + \Delta N_d) \cdot C. \tag{2.6} \]
Then, we can prevent the decoder from underflow and overflow by preventing the
encoder buffer from overflow and underflow, respectively. Moreover, the constant
end-to-end delay will not be violated by preventing the encoder from overflow with
appropriate initial delay (or initial buffer fullness, $\Delta N_e \cdot C$ and $\Delta N_d \cdot C$) for given
encoder and decoder buffer sizes. For more details on buffer dynamics, we refer to
the discussion in [11,36,37].
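For completeness, the step from (2.1) and (2.2) to (2.3) and (2.5) can be written out as below. This is our own worked derivation under the stated CBR assumption, with $n \ge \Delta N_e$ so that the first branch of (2.1) applies.

```latex
% Assumes a CBR channel (C_i = C for all i) and n >= \Delta N_e.
\begin{align*}
B^e_n &= \sum_{i=1}^{n} R_i - (n - \Delta N_e)\, C
  && \text{from (2.1)} \\
B^d_{n+\Delta N_d} &= (n + \Delta N_d)\, C - \sum_{i=1}^{n} R_i
  && \text{from (2.2)} \\
B^e_n + B^d_{n+\Delta N_d} &= (\Delta N_e + \Delta N_d)\, C
  && \text{sum of the two lines, i.e., (2.3)} \\
0 \le (\Delta N_e + \Delta N_d)\, C - B^e_n &\le B^d_{\max}
  && \text{(2.3) substituted into (2.4)}
\end{align*}
```

Rearranging the last line for $B^e_n$ gives the two bounds in (2.5).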
In the research in this dissertation, we assume that a decoder has a buffer of the
same size as that of the encoder. In addition, we do not consider the buffer constraint
and the end-to-end delay constraint. When the target bits for a set of frames,
e.g., GOP, are determined based on the channel rate, we can prevent the encoder
(decoder) from underflow (overflow), as long as the target bits can be achieved
accurately. When the target bits for a set of frames are smaller than the available
encoder buffer, we can prevent the encoder (decoder) from overflow (underflow).
By assuming that there are enough buffers and the corresponding initial delays are
allowed, we focus on the constraint imposed by the channel bit rate only.
2.4 Digital Video Applications
We classify digital video applications into two classes according to the amount of
delay allowed at the encoder side as follows.
• Conversational Applications
For this type of application, such as video conferencing, the input video source
is encoded and transmitted in real time with a strict delay constraint. Thus,
a short end-to-end delay is required and no frame is encoded as the B frame
to avoid the additional encoding delay caused by bi-directional prediction.
Accurate bit rate control is more important in this type of application, since
a large buffer size is not allowed due to the strict delay constraint. Rate
control algorithms for conversational applications are studied in Chapter 3.
• Non-conversational Applications
Applications of this type can be classified further into several classes, e.g.,
real-time video coding applications for real-time transmission such as live
video broadcasting, off-line video coding applications for real-time transmis-
sion such as video streaming, and off-line video coding applications for video
storage such as DVD. In these applications, a longer end-to-end delay is al-
lowed. For example, even for transmitting live video that requires real time
encoding and transmission, the delay constraint may not be as strict as that
in conversational applications. Hence, B frames can be used to improve cod-
ing efficiency and a set of frames are usually grouped into a GOP structure.
The GOP structure can be either fixed over an entire sequence or changed from
GOP to GOP. We propose rate control algorithms for the non-conversational
applications with fixed GOP in Chapters 4 and 5. For the applications with
adaptive GOP, we propose a rate control algorithm in Chapter 6.
2.5 Review of Rate Control Algorithms
A lot of rate control algorithms have been investigated for various video coding
standards in various video coding applications. Rate control algorithms can be
classified in different ways from different points of view. Basically, the classification can
be based on whether pre-analysis of future frames is allowed or not, or based on
the number of encoding passes. For example, all algorithms can be classified into
one-pass and multi-pass. Multi-pass algorithms [16, 23, 25, 26, 32, 35] analyze
R-D characteristics of future frames first by encoding them through multiple passes.
They are applied to the applications where a longer end-to-end delay is permitted.
In contrast, one-pass algorithms where no pre-analysis is employed [1,2,5,7,9,14,
24,33,38,39,53,55] are useful especially in the applications where a short end-to-
end delay is required. Here, we provide a review of rate control algorithms for two
different types of applications.
2.5.1 Rate Control for Conversational Applications
In conversational applications, a major constraint on rate control comes from low
end-to-end delay. Even though a larger buffer can be used at both an encoder and
a decoder, the effective buffer size is bounded by the end-to-end delay [32]. The
strict delay constraint makes it difficult to exploit sophisticated rate control schemes.
For example, pre-analysis of future frames (e.g., pre-encoding frames to collect R-D
data) is not allowed. Even pre-analysis of a current frame is not allowed due to the
additional encoding delay caused by it. Thus, model-based rate control is the most
widely used approach.
The largest coding unit in rate control for conversational applications is often a
frame. The rate control process starts by allocating the target bits to each frame
and ends by determining the quantization parameter of each MB such that the
target bits are achieved. The target bits to each frame are usually determined
based on the buffer occupancy and the channel bit rate such that the end-to-end
delay is not violated. Even though the target bits are allocated in such a way,
inaccurate bit rate control causes frequent (encoder) buffer overflow and thus causes
frequent frame skip, especially when buffers are very small. That is, accurate bit
rate control is most important in conversational applications. For this reason, rate
control algorithms in this type of applications focus on accurate rate and distortion
modeling. In [38], rate and distortion are modeled as functions of a quantization
parameter and these models are employed for rate control of H.263. The ρ-domain
rate and distortion models, where ρ is defined as the percentage of zero DCT
coefficients, were proposed in [7] and [9] for rate control of H.263 and MPEG-4.
Figure 2.3: Exemplary GOP structures: (a) fixed GOP and (b) variable GOP.
2.5.2 Rate Control for Non-Conversational Applications
In non-conversational applications, a finite GOP structure is usually imposed on the
video to be coded. A set of frames are encoded into a fixed or variable GOP
structure. The GOP structure can be characterized by the GOP size, which is
denoted by N, and the distance between P frames, which is denoted by M. Fig. 2.3
illustrates exemplary fixed and variable GOP structures. On one hand, when a
GOP structure is fixed, all GOPs have the same values of N and M. Fig. 2.3
(a) illustrates a fixed GOP structure, where N = 15 and M = 3. On the other
hand, when a GOP structure is adaptively determined, different GOPs can have
different values of N and M. Moreover, M can vary even in the same GOP as
shown in Fig. 2.3 (b).
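The role of N and M in a fixed GOP can be made concrete with a small sketch that lists the frame types of one GOP. It assumes display order with the I frame at the start of the GOP and a P frame every M-th position; `gop_pattern` is a hypothetical helper used only for illustration.

```python
def gop_pattern(n, m):
    """Frame types of one fixed GOP in display order.

    n: GOP size (distance between consecutive I frames)
    m: distance between consecutive P (or I) frames
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    types = []
    for k in range(n):
        if k == 0:
            types.append("I")      # the GOP starts with an I frame
        elif k % m == 0:
            types.append("P")      # every m-th frame is a P frame
        else:
            types.append("B")      # frames in between are B frames
    return "".join(types)

print(gop_pattern(15, 3))  # IBBPBBPBBPBBPBB
print(gop_pattern(12, 1))  # IPPPPPPPPPPP
```

With M = 1 no B frames are produced, which corresponds to the low-delay configuration used in conversational applications.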
2.5.2.1 Rate Control with Fixed GOP
In this case, an important issue is how to allocate bits among frames in a GOP to
provide higher video quality. Three representative approaches have been proposed
to deal with the optimal frame-layer bit allocation problem. They are: 1) opera-
tional R-D analysis, 2) model-based R-D analysis and 3) heuristics without R-D
analysis.
Operational R-D analysis [25,26,32,35] pre-analyzes frames in a GOP by encod-
ing them using admissible quantization parameters through multiple passes. After
that, the optimal number of bits for each frame is determined using the Lagrange
optimization or dynamic programming framework. This approach requires high en-
coding complexity due to repeated encoding passes. When dependencies between
frames are taken into consideration, the complexity increases exponentially with
respect to the number of frames in a GOP. Consequently, even though an optimal
solution can be reached by this approach, the prohibitively high complexity prevents
it from being used even for off-line video coding applications.
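For independent frames, the Lagrange optimization step of operational R-D analysis can be sketched as follows. Each frame has a list of operational points (QP, rate, distortion) collected by pre-encoding. For a given multiplier, every frame independently picks the point minimizing D + λ·R, and a bisection search finds a multiplier that meets the bit budget. The R-D points are made-up numbers and the bisection bounds are assumptions chosen for this example.

```python
def pick(points, lam):
    """Operating point (qp, rate, dist) of one frame minimizing D + lam * R."""
    return min(points, key=lambda p: p[2] + lam * p[1])

def allocate(frames, budget, lam_hi=1e6, iters=50):
    """Bisection on the Lagrange multiplier for independent frames.

    Returns (lam, choice), where the total rate of choice does not exceed budget.
    """
    lo, hi = 0.0, lam_hi
    best = [pick(f, hi) for f in frames]
    if sum(p[1] for p in best) > budget:
        raise ValueError("budget cannot be met with the given R-D points")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        choice = [pick(f, mid) for f in frames]
        if sum(p[1] for p in choice) <= budget:
            hi, best = mid, choice     # feasible: try a smaller multiplier
        else:
            lo = mid                   # too many bits: increase the multiplier
    return hi, best

# Illustrative (made-up) operational points: (QP, rate, distortion).
frames = [
    [(24, 100, 10.0), (30, 60, 25.0), (36, 30, 60.0)],
    [(24, 80, 8.0), (30, 50, 18.0), (36, 25, 45.0)],
]
lam, choice = allocate(frames, budget=120)
print(round(lam, 3), [p[0] for p in choice], sum(p[1] for p in choice))
```

Collecting the operational points is the expensive part, since every admissible quantization parameter requires an encoding pass for every frame. The search over the multiplier itself is cheap.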
Model-based R-D analysis [1,16,39] employs rate and distortion models to reduce
the complexity in collecting R-D data. This approach is generally performed in two
passes or one pass with the assumption that frames are independent. In two-
pass algorithms, the R-D data of frames in a GOP are estimated using rate and
distortion models in the first pass. After allocating the target bit rate to each frame
based on the estimated R-D data, frames are finally encoded in the second pass
to achieve their target bits. For example, in [16], all frames are encoded using a
constant quantization parameter in the first pass. The parameters of ρ-domain rate
and distortion models are estimated from the actual rate and distortion. Finally,
the optimal number of bits for each frame is determined by solving an independent
Lagrange optimization problem. In one-pass algorithms, instead of an additional
pre-encoding pass, the R-D data of previous frames is exploited to estimate the R-D
data of future frames. In [1], an exponential R-D model was employed for frame-
layer bit allocation in an embedded wavelet video coder. Without a pre-encoding
pass, the model parameters of frames in a current GOP are estimated directly from
those of frames in a previous GOP. A similar method was applied in [39], where
the complexity of a current frame is assumed to be the same as that of a previous
frame. However, as expected, this assumption does not hold in many cases and
thus makes model-based one-pass bit allocations unreliable.
Many practical standard video encoders adopt simple heuristics for frame-layer
bit allocation [24,33]. In this approach, the target bits for each frame are partially
determined by the frame complexity and buffer occupancy. To improve video qual-
ity, more weights are given to preceding frames in the bit allocation process. This
approach is preferred due to its simplicity and acceptable video quality in com-
parison with the approach based on model-based R-D analysis. However, it is less
robust in particular for a video consisting of GOPs with varying characteristics.
2.5.2.2 Rate Control with Adaptive GOP
There has been research on adaptive GOP decision [6,20,22,23,48,54,55]. How-
ever, most of them except for [23,55] focused on adaptive GOP decision without
considering rate control. A one-pass algorithm was proposed for MPEG-2 video
in [55]. The value of N is determined by detecting a scene cut based on the lumi-
nance change and motion activities. The optimal value of M is chosen from the set
of {1, 2, 3, 4} according to motion activities. Then, the rate control process is
performed using the TM5 rate control algorithm [30]. However, in this algorithm,
the GOP structure decision process and the rate control process are independent
of each other. An optimal GOP structure decision algorithm was proposed for
MPEG-2 video in [23]. For the GOP size N = 15, the optimal number of P frames
and their positions within a GOP are determined using dynamic programming. It
was shown in [23] by experimental results that, as compared with the optimal rate
control with fixed P frames of M = 3, the optimal rate control with adaptive P
frames can reduce the average distortion by around 0.2 dB. This algorithm may give
a near-optimal solution. However, this algorithm is only useful for the benchmark
purpose since its complexity is extremely high.
Chapter 3
Rate Control for H.264 Video with Enhanced
Rate and Distortion Models
3.1 Introduction
In this chapter, we propose an enhanced rate control scheme for H.264 video with
a few new features. First, the inter-dependency of RDO and rate control is re-
solved by allowing different quantization parameter values at the RDO process and
the quantization process, respectively. As long as these quantization parameters
are close to each other, we have a good approximation. Second, to address the
increased importance of header bits, a header rate model is established so as to es-
timate header bits more accurately. To be more specific, the number of header bits
is modeled as a function of the number of non-zero MV elements and the number
of MVs. Third, a new source rate model and a distortion model are proposed. For
this purpose, coded 4×4 blocks are identified and the number of source bits and
distortion are modeled as functions of the quantization stepsize and the complexity
of coded 4×4 blocks. Finally, an R-D optimized bit allocation scheme among
macroblocks (MBs) is proposed to improve picture quality. Built upon the above ideas,
a rate control algorithm is developed for the H.264 baseline-profile encoder under
the CBR constraint. It is shown by experimental results that the new algorithm can
control bit rates accurately with the R-D performance significantly better than that
of the rate control algorithm implemented in the H.264 software encoder JM8.1a.
The rest of this chapter is organized as follows. An overview of the proposed
encoder structure is presented in Sec. 3.2, where the inter-dependency between
RDO and rate control in H.264 is decoupled via quantization parameter estimation
and update. Improved rate and distortion models are presented in Sec. 3.3 while the
constant bit rate control algorithm for the H.264 baseline-profile encoder is proposed
in Sec. 3.4. Experimental results are provided in Sec. 3.5. Finally, concluding
remarks are drawn in Sec. 3.6.
3.2 Proposed Encoder Structure
The inter-dependency of RDO and rate control is the main difference between the
rate control problem for H.264 and prior standards. In this section, we propose an
encoder structure that decouples the inter-dependency between them via the use of
two quantization parameter values. In the following discussion, we use $QP_1$ and $QP_2$
to denote quantization parameters used in the RDO process and the quantization
process, respectively.
In H.264, the RDO process is performed based on the Lagrange optimization
framework. If there exists a known quantization parameter for RDO and quantization,
i.e., $QP = QP_1 = QP_2$, the best set of MVs and reference frames of
partitions and a MB mode for each MB can be determined via RDO by minimizing
the following Lagrange cost [49]:
\[ J(S_{QP}) = D(S_{QP}) + \lambda(QP) \cdot R(S_{QP}), \tag{3.1} \]
where $S_{QP}$ represents a set of MVs, reference frames and a MB mode, and $\lambda(QP)$ is
the Lagrangian multiplier. Please note that there are two Lagrange multipliers (one
for motion estimation and the other for mode decision) and both of them depend
on the pre-determined quantization parameter. Although various RDO methods
and their low complexity variants such as fast motion estimation and mode decision
have been proposed to speed up the encoding process [3,17], all these algorithms
eventually determine MVs, reference frames and a MB mode by minimizing (3.1).
However, it is difficult to find $QP = QP_1 = QP_2$ a priori due to the inter-dependency
of RDO and rate control. In practice, we have to begin with one
quantization parameter value, say $QP_1$, for the RDO process to determine the set
of MVs, reference frames and the MB mode. Then, under such a setting, we can
determine the best $QP_2$ for quantization in the rate control process. If $QP_1 = QP_2$,
the process is done. If $QP_1 \neq QP_2$, we may assign $(QP_1 + QP_2)/2$ to $QP_1^{(2)}$, where
the superscript (2) denotes that this is the second iteration. After that, with $QP_1^{(2)}$
and RDO, we can determine the corresponding MVs, reference frames and the MB
mode. Then, we can find $QP_2^{(2)}$ for quantization. Typically, the gap between $QP_1^{(k)}$
and $QP_2^{(k)}$ becomes smaller as the iteration number $k$ becomes larger. This iteration
can continue until the following stopping criterion is met:

$$|QP_1^{(k)} - QP_2^{(k)}| \leq \delta, \tag{3.2}$$

where $\delta$ is a parameter to be chosen. Since the iteration process is computationally
expensive, we would like to find good initial values of $QP_1$, $QP_2$ and the proper
value of $\delta$ to control the complexity while keeping good coding performance.
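The iteration with stopping criterion (3.2) can be sketched as follows. The callback `find_qp2` is a hypothetical stand-in for "run RDO with $QP_1$, then let rate control pick $QP_2$"; the integer rounding of the midpoint and the `max_iters` safeguard are assumptions not stated in the text.

```python
def iterate_qp(qp1, find_qp2, delta, max_iters=10):
    """Iterate QP1 <- (QP1 + QP2) / 2 until |QP1 - QP2| <= delta.

    find_qp2 is a hypothetical callback that performs RDO with qp1 and
    returns the QP2 chosen by rate control.
    Returns (qp1, qp2, number_of_iterations).
    """
    qp2 = find_qp2(qp1)
    k = 1
    while abs(qp1 - qp2) > delta and k < max_iters:
        qp1 = (qp1 + qp2) // 2  # integer midpoint; the rounding is an assumption
        qp2 = find_qp2(qp1)     # each pass repeats RDO, which is the costly part
        k += 1
    return qp1, qp2, k
```

For a toy rate control that always answers 30, starting from $QP_1 = 20$, a tolerance of 3 stops after three passes and a tolerance of 1 needs five. Since each pass repeats RDO, a larger $\delta$ directly reduces complexity.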
The following two observations provide useful guidelines in selecting $QP_1$, $QP_2$
and $\delta$.

• Observation 1:
The quantization process determines the quality of video. Therefore, for
smooth video playback, the variation between the quantization parameters of
two consecutive frames should be restricted to a small range; namely,
$$|QP_n - QP_{n-1}| \leq \Delta \quad \text{where } \Delta \leq 3,$$
where $QP_n$ and $QP_{n-1}$ are the average quantization parameters of the $n$-th
and the $(n-1)$-th frames, respectively.

• Observation 2:
The decrease of the coding gain is not much even though $QP_1$ and $QP_2$ are
different as long as their difference is restricted to a small range, i.e.,
$$|QP_1 - QP_2| \leq \Delta \quad \text{where } \Delta \leq 3.$$

[Figure 3.1: The R-D curves (PSNR in dB versus rate in Kbps, “Foreman”) obtained by setting
$QP_2$ values in quantization to (a) $QP_1$ and $QP_1 \pm 3$, and (b) $QP_1$ and $QP_1 \pm 1$,
where $QP_1$ is used for RDO.]

Observation 1 is often exploited in model-based rate control algorithms [7,14,24,
38,47,53] to smooth the quality variation between frames and between MBs. Fig. 3.1
provides the evidence for Observation 2. This figure shows R-D curves with fixed
$QP_1$ for RDO and varying $QP_2$ for quantization of the QCIF “Foreman” sequence
encoded with five reference frames. We can see that the coding gain loss is around
0.2 dB when $|QP_1 - QP_2| = 3$ and it is almost negligible when $|QP_1 - QP_2| = 1$.
Clearly, Observation 2 can be used to choose parameter $\delta$ in (3.2).
Based on the above observations, we show the proposed encoding scheme in
Fig. 3.2 that consists of the following two stages.
• Stage 1:
To encode a new frame, it is natural to choose $QP_1$ to be the average $QP_2$ of the
previous frame based on Observation 1, and perform RDO for all MBs in the
current frame to determine the residual signal and the header information.
Then, the residual signals of all MBs go through DCT/Q and IQ/IDCT to get
a reconstructed frame, which is required for intra predictions of subsequent
MBs. Throughout stage 1, only $QP_1$ is used.
• Stage 2:
Given the target bits for the frame, the $QP_2$ values of all MBs are determined
by the rate control algorithm (to be discussed in Secs. 3.3 and 3.4) and used
to quantize the residual signals. If an MB is inter-coded, its residual signal is
simply re-quantized using $QP_2$. If an MB is intra-coded, the residual signal
should be updated since its neighboring pixels can be different from those in
the first stage. For such a case, the residual signal is updated using the same
intra mode determined in the first stage of encoding. The final reconstructed
frame is obtained via IQ/IDCT for the intra prediction of subsequent MBs
and the inter prediction of subsequent frames. The output bitstream can also
be easily generated via entropy coding as shown in the figure. Throughout
stage 2, only $QP_2$ is used.

[Figure 3.2: Overview of the proposed encoder structure, where RDO is performed
for all MBs using $QP_1$ at the first stage and the residual signal of a MB is then
quantized using $QP_2$ at the second stage. Block labels in the first stage (with $QP_1$):
original signal, RDO, residual signal, DCT/Q, IQ/IDCT, reconstructed signal. Block labels
in the second stage (with $QP_2$): determine $QP_2$, DCT/Q, IQ/IDCT, final reconstructed
signal, entropy coding, output bitstream.]

After the above two-stage process, we can verify the difference between $QP_1$
and $QP_2$. If the difference is less than or equal to 3 for a MB, it is done. If not, one
possible solution is to do some iterations to narrow down the gap between them.
However, the complexity could be too high to be practical. Another solution is to
choose $QP_2$ to be either $QP_1 + 3$ or $QP_1 - 3$ and accept the consequence of R-D
performance degradation. The latter solution is adopted in this work. The proposed
two-stage encoding requires one more forward and inverse DCT and quantization
process. It should be noted that the high complexity of the H.264 encoder comes
mainly from the RDO process. Since RDO is still performed only once in the
proposed two-stage encoding, the additional relative complexity with respect to the
overall H.264 encoding complexity is not large. The additional relative complexity
depends on the complexity of RDO, the number of reference frames and so on.
For this reason, as the number of reference frames increases, the additional relative
complexity decreases further. It is also worthwhile to point out that the residual signal
and the header information are determined in the first stage, and all of them except
for residual signals of intra MBs do not change throughout the encoding process.
Therefore, the rate and the distortion can be estimated accurately based on the
information obtained at the first stage.
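The two quantization parameter choices described above (the initial $QP_1$ in stage 1 and the clamping of $QP_2$ after stage 2) can be sketched as follows. The function names are hypothetical, the rounding of the average is an assumption, and the limits 0 and 51 are the usual H.264 quantization parameter range for 8-bit video.

```python
def initial_qp1(prev_frame_qp2_values):
    """QP1 for the new frame: the average QP2 of the previous frame.

    Rounding to the nearest integer is an assumption; the text only
    says "average".
    """
    avg = sum(prev_frame_qp2_values) / len(prev_frame_qp2_values)
    return int(avg + 0.5)


def clamp_qp2(qp1, qp2, delta=3, qp_min=0, qp_max=51):
    """Restrict QP2 to [QP1 - delta, QP1 + delta] and to the valid QP range."""
    lo = max(qp_min, qp1 - delta)
    hi = min(qp_max, qp1 + delta)
    return max(lo, min(hi, qp2))
```

For example, with $QP_1 = 30$, a rate control request of 36 becomes 33, and a request of 25 becomes 27. No second RDO pass is needed in either case.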
In the next two sections, we will focus on the selection of the quantization parameter
$QP_2$ given the output from stage 1 by an improved rate control scheme.
3.3 Enhanced Rate and Distortion Models
Accurate rate and distortion models have great impact on the rate control per-
formance in model-based rate control algorithms. In this section, enhanced rate
models for header bits and source bits and a new distortion model are proposed.
They are used to estimate rates and distortion of the residual signal obtained at
the first stage of the proposed encoder given in Sec. 3.2.
3.3.1 Rate Model for Header Bits
In the design of model-based rate control algorithms for previous standards such as
MPEG-1/2, H.263 and MPEG-4, the number of bits in the header is less critical due
to two reasons. First, the amount of header bits is small as compared with that of
source bits. Second, the number of header bits is nearly constant for the same type
of frames so that it can be easily estimated by the average header bits of previous
frames. However, the situation changes for H.264. The average percentages of
header bits of P frames at various quantization parameters are shown in Fig. 3.3(a)
when the QCIF “Foreman” sequence is encoded with a single reference frame by
the H.264 software encoder. The variations of source bits and header bits when the
same sequence is encoded using QP = 35 are shown in Fig. 3.3(b). Note that both
MV bits and MB mode bits are included in the header bits. Although the coded
block pattern (CBP) is also a type of header information, we classify CBP bits as
source bits since they have a strong relationship with residual bits. For example, if
there is no residual signal to encode, the number of residual bits and the number
of CBP bits are both equal to zero.

[Figure 3.3: (a) The average percentages of header bits (including motion vectors
and modes) and source bits (including residual and CBP) of P frames at various
quantization parameters, and (b) the source bits and header bits of a P frame when
QP = 35 as a function of the frame number. Sequence: “Foreman”. Axes: (a) percentage (%)
versus QP; (b) bits versus frame number.]

As shown in Fig. 3.3(a), header bits occupy a significant portion of the total
amount of bits, and the percentage of header bits increases as the quantization
parameter value becomes larger. When a sequence is encoded at very low bit rates,
the amount of header bits may even exceed that of source bits. Furthermore, we
see from Fig. 3.3(b) that the size of header bits fluctuates a lot from frame to
frame, which means header bits cannot be simply estimated by the average value
of previous frames. It is also observed that the percentage of header bits increases
as the number of reference frames increases. As a result, it is clear that the header
information is as important as the residual signal to the rate control of H.264.
Let $R_T$, $R_{src}$ and $R_{hdr}$ denote the total number of bits allocated to a frame, and
the corresponding source and header bits, respectively. We can first estimate the
header bits after the RDO process, and then compute available source bits as

$$R_{src} = R_T - R_{hdr}. \tag{3.3}$$

Here, we consider a header rate model that estimates the total number of bits
required to encode MVs, reference frames and MB modes. Since inter MBs and
intra MBs are quite different, we estimate their header bits separately.
It is clear that the header bits of inter MBs have a strong relationship
with the number of non-zero horizontal/vertical MV elements, $N_{nzMVe}$, and the
number of MVs, $N_{MV}$. To give an example, suppose an MB is partitioned into four
8×8 blocks and the four MVs are (4, 1), (2, 0), (3, 7) and (0, 0). Then, $N_{nzMVe} = 5$
(i.e., 4, 1, 2, 3 and 7) and $N_{MV} = 4$. As the number of non-zero MV elements
increases, the number of bits required to encode their values increases. Likewise, as an MB
is finely partitioned, there are more MVs associated with it, which results in an
increase in header bits. Consequently, we model inter header bits as a function of
$N_{nzMVe}$ and $N_{MV}$.
It is observed from experiments that header bits of inter MBs in a frame can be
estimated by the following linear model:

$$R_{hdr,inter} = \gamma_1 \cdot N_{nzMVe} + \gamma_2 \cdot N_{MV}, \tag{3.4}$$

where $\gamma_1$ and $\gamma_2$ are model parameters. In addition, at the cost of little estimation
error, we observe that (3.4) can be simplified as

$$R_{hdr,inter} = \gamma \cdot (N_{nzMVe} + \omega \cdot N_{MV}), \tag{3.5}$$

where $\gamma$ is a model parameter and $\omega$ is a weighting factor that depends on the
number of reference frames used. It should be noted that the number of reference
frames is another important factor that affects the number of header bits. Clearly,
the number of bits required to encode the indices of reference frames increases as the number
of reference frames increases. Thus, if more reference frames are used for inter
prediction, a larger weight should be given to $N_{MV}$. We find empirically that the
following weights are good choices for most sequences:

$$\omega = \begin{cases} 0.5, & \text{if the no. of reference frames} \geq 5, \\ 0.4, & \text{if the no. of reference frames} \geq 3, \\ 0.3, & \text{otherwise.} \end{cases} \tag{3.6}$$

The relationship between $R_{hdr,inter}$ and $(N_{nzMVe} + \omega \cdot N_{MV})$ for three QCIF test
sequences is given in Fig. 3.4. We consider P and B frames separately and show
their results in Figs. 3.4 (a) and (b), respectively. For each plot, 30 P or B frames
are encoded using every third quantization parameter value ranging from 15 to
45 with three reference frames (i.e., $\omega = 0.4$). We can see from these plots that
header bits of all test sequences can be approximated closely by the header rate
model given in (3.5) regardless of frame types. The same result can be observed
from other sequences with different numbers of reference frames. Therefore, instead
of (3.4), we employ (3.5) to estimate the number of header bits of inter MBs for
simplicity.
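The inter header rate model (3.5) with the weights in (3.6) can be sketched as below. The function names are hypothetical, MVs are assumed to be given as (horizontal, vertical) integer pairs, and the value of $\gamma$ used in the usage example is made up; in practice it would be fitted from previously coded frames.

```python
def count_mv_stats(mvs):
    """Return (N_nzMVe, N_MV) for a list of (mvx, mvy) motion vectors."""
    n_mv = len(mvs)
    n_nz = sum(1 for mv in mvs for component in mv if component != 0)
    return n_nz, n_mv


def omega_for(num_ref_frames):
    """Weighting factor of (3.6) as a function of the number of reference frames."""
    if num_ref_frames >= 5:
        return 0.5
    if num_ref_frames >= 3:
        return 0.4
    return 0.3


def inter_header_bits(mvs, gamma, num_ref_frames):
    """Estimated inter header bits according to (3.5)."""
    n_nz, n_mv = count_mv_stats(mvs)
    return gamma * (n_nz + omega_for(num_ref_frames) * n_mv)


# The example from the text: four 8x8 partitions of one MB.
example_mvs = [(4, 1), (2, 0), (3, 7), (0, 0)]
```

For the example MB, `count_mv_stats(example_mvs)` gives $N_{nzMVe} = 5$ and $N_{MV} = 4$. With three reference frames the model argument is $5 + 0.4 \cdot 4 = 6.6$.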
[Figure 3.4: The relationship between $R_{hdr,inter}$ and $(N_{nzMVe} + \omega \cdot N_{MV})$ for three
test sequences (“Foreman”, “Mother & Daughter” and “Carphone”): (a) P frames and (b) B frames.
Horizontal axis: $N_{nzMVe} + 0.4 \cdot N_{MV}$; vertical axis: the number of header bits.]

Table 3.1 shows the estimation errors in RMSE and the $R^2$ values for P frames
using (3.4) and (3.5), respectively. $R^2$ is a quantity used to measure the degree
of data variation from a given model [4]. It is defined as

$$R^2 = 1 - \frac{\sum_i (X_i - \hat{X}_i)^2}{\sum_i (X_i - \bar{X})^2}, \tag{3.7}$$

where $X_i$ and $\hat{X}_i$ are the actual and estimated values of data point $i$, respectively,
and $\bar{X}$ is the mean of all data points. The estimate $\hat{X}_i$ is obtained using a data
model. For any reasonable model, we expect that the second term of the right-hand
side of (3.7) is less than one. That is, $R^2$ takes a value between 0 and 1.
The better the model, the closer the $R^2$ value to 1. As shown in Table 3.1, the
estimation error based on the model in (3.5) is not much larger than that based on
the model in (3.4) for most sequences.

Table 3.1: Performance comparison for P frames of the header rate models.

                  Eq. (3.4)            Eq. (3.5)
Sequence          RMSE     R^2         RMSE     R^2
News              53.2     0.9929      66.2     0.9897
Carphone          99.2     0.9971      101.7    0.9969
Table Tennis      189.7    0.9862      201.2    0.9852
Mot & Dau         48.6     0.9963      93.5     0.9873
Foreman           95.6     0.9964      97.5     0.9963
Silent            104.0    0.9922      105.4    0.9921
Coastguard        103.1    0.9968      133.9    0.9950

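The two figures of merit used in the comparison can be computed as in the sketch below. The $R^2$ function follows (3.7) directly; the RMSE definition (square root of the mean squared error) is the conventional one and is an assumption, since the text does not spell it out. The sample values in the usage note are made up.

```python
import math


def rmse(actual, estimated):
    """Root mean squared error between actual and estimated values."""
    squared = [(x - x_hat) ** 2 for x, x_hat in zip(actual, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, estimated):
    """Coefficient of determination R^2 as defined in (3.7)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((x - x_hat) ** 2 for x, x_hat in zip(actual, estimated))
    ss_tot = sum((x - mean) ** 2 for x in actual)
    return 1.0 - ss_res / ss_tot
```

For actual values (100, 200, 300, 400) and estimates (110, 190, 310, 390), every residual is ±10, so the RMSE is 10 and $R^2 = 1 - 400/50000 = 0.992$.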
For an intra MB, the header information includes only MB modes and the
residual signal usually has a larger energy, which means the amount of header bits
is much smaller than that of source bits. Thus, an exact estimation of header bits
is not as critical for intra MBs as for inter MBs. The total number of header bits
of intra MBs in a frame is simply estimated via

$$R_{hdr,intra} = N_{intra} \cdot b_{intra}, \tag{3.8}$$

where $N_{intra}$ is the number of intra MBs in a frame and $b_{intra}$ is the average number
of header bits of intra MBs in previous frames. In other words, the number of intra
header bits is estimated based on the average header bits of previous intra MBs.
Then, the total number of header bits of a frame is equal to the sum of header
bits from inter and intra MBs in the frame, i.e.,

$$R_{hdr} = R_{hdr,inter} + R_{hdr,intra} = \gamma \cdot (N_{nzMVe} + \omega \cdot N_{MV}) + N_{intra} \cdot b_{intra}. \tag{3.9}$$

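A minimal sketch of how (3.8), (3.9) and (3.3) fit together is given below. The class and function names are hypothetical. Averaging $b_{intra}$ over all previously coded intra MBs (rather than over a sliding window) is an assumption, as is the fallback value used before any intra MB has been coded.

```python
class IntraHeaderAverage:
    """Running average of header bits per intra MB (b_intra in (3.8))."""

    def __init__(self, initial=0.0):
        self.total_bits = 0.0
        self.total_mbs = 0
        self.initial = initial  # fallback before any intra MB is seen (assumption)

    def update(self, header_bits, n_intra_mbs):
        """Record the intra header bits actually spent in a coded frame."""
        self.total_bits += header_bits
        self.total_mbs += n_intra_mbs

    @property
    def value(self):
        if self.total_mbs == 0:
            return self.initial
        return self.total_bits / self.total_mbs


def frame_bit_split(r_total, r_hdr_inter, n_intra, b_intra):
    """Return (R_hdr, R_src) following (3.9) and (3.3)."""
    r_hdr = r_hdr_inter + n_intra * b_intra
    return r_hdr, r_total - r_hdr
```

For instance, after two frames with 60 bits over 5 intra MBs and 36 bits over 3 intra MBs, $b_{intra} = 12$. A frame budget of 6000 bits with 1800 estimated inter header bits and 4 intra MBs then leaves 4152 bits for the source.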
3.3.2 Rate Model for Source Bits
Two types of rate and distortion models have been considered in q-domain and
ρ-domain, respectively. In the q-domain approach, the source rate is modeled as
a function of the quantization stepsize and the complexity of the residual signal.
Several source rate models were studied in the past, e.g. [2,5,38]. In [2], the source
is assumed to be Laplacian distributed, and a quadratic function of the quantization
stepsize is employed to estimate source bits. The following quadratic model

$$R_{src}(Q) = \alpha_1 \cdot \frac{MAD}{Q} + \alpha_2 \cdot \frac{MAD}{Q^2}, \tag{3.10}$$

where $\alpha_1$ and $\alpha_2$ are model parameters and $Q$ is the quantization stepsize, has been
widely studied. It was adopted as a non-normative guidance for the rate control
implementation in several standards such as MPEG-4 [21,46] and H.264 [24].
In the ρ-domain approach, where ρ is the percentage of zero DCT coefficients,
the number of source bits is modeled as a linear function of (1− ρ) [7, 9]. To
determine the quantization stepsize Q from ρ, it is possible to find the one-to-one
correspondence between them. That is, we can build a DCT coefficient histogram
and find the relationship between ρ and Q to determine the corresponding Q. The
ρ-domain source rate model is often more accurate than the q-domain source rate
model. The inaccuracy of the q-domain source rate model is mainly due to the
rough estimate of the complexity of residual signals such as MAD given in (3.10).
The basic unit for DCT and quantization in H.264 is a 4×4 block, which can
be either coded or skipped. A 4×4 block is skipped if all of its 4×4 DCT coefficients
are zero after quantization. Otherwise, it is a coded block. It should be noted
that no bit is required for a skipped block. Therefore, in this work, we consider the
complexities of coded blocks only so as to estimate the complexity of the residual
signal more precisely. Let $SATD_c$ be the sum of absolute transform differences
(SATD) of coded 4×4 blocks. By modifying (3.10) slightly, we can derive another
source rate model:

$$R_{src}(Q) = \alpha_1 \cdot \frac{SATD_c(Q)}{Q} + \alpha_2 \cdot \frac{SATD_c(Q)}{Q^2}, \tag{3.11}$$

where $\alpha_1$ and $\alpha_2$ are model parameters and $Q$ is the quantization stepsize. It is clear
that the difference between (3.10) and (3.11) lies in the replacement of MAD with
$SATD_c(Q)$ to characterize the complexity of the residual signal. Please note that
$Q$ in parentheses implies that $SATD_c$ depends on the quantization stepsize. The MAD
of coded blocks, i.e., $MAD_c$, could be used instead of $SATD_c$ in (3.11). However,
$SATD_c$ is used because of its slightly better performance in the source rate model.
It is confirmed by experimental results that the estimation error of the quadratic
rate model in (3.10) can be significantly reduced by considering only the complexities
of coded 4×4 blocks. Furthermore, it is observed that the second order term
$Q^{-2}$ in (3.11) can be dropped without sacrificing much modeling performance.
As a result, we can obtain a simplified source rate model as

$$R_{src}(Q) = \alpha \cdot \frac{SATD_c(Q)}{Q^p}, \tag{3.12}$$

where $\alpha$ is a model parameter and $p$ is a value depending on the frame type. It
is observed from experiments that the best value of $p$ is 1.0 for P and B frames
and 0.8 for I frames. The relationship between the source rate and $SATD_c(Q)/Q^p$
for three QCIF test sequences is plotted in Fig. 3.5. For each plot, 30 frames are
encoded using every third quantization parameter value ranging from 15 to 45
with three reference frames for P and B frames. We see from Fig. 3.5 that the
linear relationship as shown in (3.12) for all frame types is confirmed.

[Figure 3.5: Verification of the source rate model given in (3.12), where the horizontal
axis is $SATD_c(Q)/Q^p$ and the vertical axis is the source rate (the number of source bits)
for (a) I frames ($p = 0.8$), (b) P frames ($p = 1$) and (c) B frames ($p = 1$). Panels:
“Foreman”, “Mother & Daughter” and “Carphone”.]

Table 3.2: Performance comparison for P frames of different source rate models.

                  Eq. (3.11)           Eq. (3.12)           ρ-domain [9]
Sequence          RMSE     R^2         RMSE     R^2         RMSE       R^2
News              140.7    0.9986      231.2    0.9964      323.1      0.9930
Carphone          196.1    0.9991      545.5    0.9937      523.6      0.9942
Table Tennis      306.6    0.9988      396.8    0.9981      803.1      0.9925
Mot & Dau         120.8    0.9986      156.4    0.9977      430.2      0.9836
Foreman           279.8    0.9991      423.4    0.9980      825.9      0.9928
Silent            167.7    0.9982      249.8    0.9962      426.0      0.9895
Coastguard        392.9    0.9992      395.0    0.9991      1,161.4    0.9931

We study the performance of different source rate models, i.e., the proposed
quadratic model in (3.11), the simplified model in (3.12) and the ρ-domain model
[7,9], by comparing the estimation errors and the $R^2$ values for P frames in Table 3.2.
We see that, for all test sequences except for “Carphone”, the simplified model gives
smaller RMSE values and larger $R^2$ values than the ρ-domain model. Moreover,
the estimation errors of the simplified model are only slightly larger than those of
the proposed quadratic model. Consequently, we adopt (3.12) to estimate source
bits in our rate control algorithms for its simplicity.
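The simplified source rate model (3.12) and a least-squares fit of its single parameter can be sketched as follows. The function names and the sample values are hypothetical. The closed form used for $\alpha$ is the ordinary least-squares solution for a line through the origin, which is one plausible reading of the "least square approximation" used for the model update.

```python
def source_bits(satd_c, q, alpha, p=1.0):
    """Estimated source bits according to (3.12)."""
    return alpha * satd_c / (q ** p)


def fit_alpha(samples, p=1.0):
    """Least-squares fit of alpha for R = alpha * x with x = SATD_c / Q^p.

    samples: iterable of (satd_c, q, actual_source_bits) from coded frames.
    """
    num = 0.0
    den = 0.0
    for satd_c, q, bits in samples:
        x = satd_c / (q ** p)
        num += x * bits
        den += x * x
    return num / den


# Made-up observations that lie exactly on R = 0.5 * SATD_c / Q.
observations = [(8000.0, 10.0, 400.0), (6000.0, 20.0, 150.0), (9000.0, 15.0, 300.0)]
```

Here `fit_alpha(observations)` recovers 0.5. A sliding window (for example a `collections.deque` with `maxlen=5`) could hold the samples if only the most recent frames should contribute.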
3.3.3 Distortion Model
Generally speaking, distortion can be approximated by an exponential function of
source bits. In [9], distortion was modeled as an exponential function of source bits
in the ρ-domain, and it was shown that the number of source bits and distortion
can be estimated more accurately in the ρ-domain than in the q-domain. However,
the inferior performance in the q-domain is mainly attributed to the poor estimation
of source bits. Since the source rate model proposed in Sec. 3.3.2 is more
accurate, the accuracy of the distortion model is enhanced accordingly. However,
in this subsection, we consider a new approach to improve the distortion estimation
further.
Let $D_c$ and $D_s$ represent the distortion measures of all coded 4×4 blocks and
skipped 4×4 blocks, respectively. Then, the total distortion $D$ is the sum of the two
distortion measures, i.e.,

$$D(Q) = D_c(Q) + D_s(Q). \tag{3.13}$$

Since the residual signal of a skipped 4×4 block is not coded, its distortion can
be computed directly from its residual signal. Therefore, we need to estimate the
distortion of coded 4×4 blocks only. The distortion of coded blocks can be estimated
via the following quadratic model:

$$D_c(Q) = \beta_1 \cdot SATD_c(Q) \cdot Q + \beta_2 \cdot SATD_c(Q) \cdot Q^2, \tag{3.14}$$

where $\beta_1$ and $\beta_2$ are model parameters. Similar to the source rate model, the
quadratic model in (3.14) can be simplified to be

$$D_c(Q) = \beta \cdot SATD_c(Q) \cdot Q^p, \tag{3.15}$$

where $\beta$ is a model parameter and $p$ is a value depending on the frame type. It
is observed that the best value of $p$ is 1.0 for P and B frames and 1.2 for I frames.
The distortion of coded 4×4 blocks, measured in terms of the sum of squared
errors (SSE), versus the quantity $SATD_c(Q) \cdot Q^p$ for three QCIF test sequences
is plotted in Fig. 3.6. The same experimental conditions as in Fig. 3.5 are applied
to Fig. 3.6. We see an approximately linear relationship between them from this
figure, which justifies the model given in (3.15). However, we should note that we
are concerned with the total distortion of the whole frame (including both skipped
and coded 4×4 blocks) rather than the distortion of the coded 4×4 blocks only.
Fig. 3.7 plots the relationship between the actual total distortion (i.e., $D_c + D_s$)
and the estimated total distortion (i.e., $\hat{D}_c + D_s$, where $\hat{D}_c$ is estimated by (3.15)).
Fig. 3.7 also shows a dotted $y = x$ line in every plot. We see that, regardless of
frame types, the total distortion can be estimated closely by the proposed estimation
method after identifying coded 4×4 blocks. The same result can be observed from
other sequences with different numbers of reference frames.

Table 3.3: Performance comparison for P frames of different distortion models.

                  Eq. (3.14)           Eq. (3.15)           ρ-domain [9]
Sequence          RMSE     R^2         RMSE     R^2         RMSE       R^2
News              4,910    0.9999      6,158    0.9999      42,608     0.9978
Carphone          5,098    0.9999      6,360    0.9999      73,942     0.9883
Table Tennis      4,926    0.9999      8,329    0.9998      52,655     0.9931
Mot & Dau         1,905    0.9999      3,306    0.9999      17,021     0.9988
Foreman           5,975    0.9999      9,879    0.9998      70,602     0.9920
Silent            10,177   0.9998      12,951   0.9997      80,906     0.9926
Coastguard        11,250   0.9999      14,204   0.9998      108,994    0.9901

In Table 3.3, we compare the performance of different distortion models, i.e., the
models given by (3.14) and (3.15) and the ρ-domain distortion model [9] in terms
of the estimation errors and the $R^2$ values for P frames. We see from Table 3.3
that, for all test sequences, the estimation errors are significantly reduced by the
proposed estimation method using the proposed distortion models of coded blocks.
The estimation errors by the simplified model in (3.15) are only slightly larger
than those of the quadratic model in (3.14). In our rate control algorithms, we
adopt (3.15) in estimating the distortion of coded blocks for its simplicity and good
performance in comparison with (3.14).
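The total distortion estimate of (3.13) with the simplified model (3.15) can be sketched as below. The function names and numbers are hypothetical. Taking the distortion of a skipped block to be the SSE of its residual samples is an interpretation of "computed directly from its residual signal": a skipped block is reconstructed from the prediction alone, so its reconstruction error equals its residual.

```python
def skipped_distortion(skipped_residual_blocks):
    """D_s: SSE of the residual samples of all skipped 4x4 blocks."""
    return sum(sample * sample
               for block in skipped_residual_blocks
               for sample in block)


def coded_distortion(satd_c, q, beta, p=1.0):
    """Estimated D_c according to the simplified model (3.15)."""
    return beta * satd_c * (q ** p)


def total_distortion(skipped_residual_blocks, satd_c, q, beta, p=1.0):
    """Estimated total distortion: estimated D_c plus directly computed D_s."""
    return coded_distortion(satd_c, q, beta, p) + skipped_distortion(skipped_residual_blocks)


# Two skipped blocks, flattened and shortened to four samples each for brevity.
skipped = [[1, -2, 0, 3], [0, 0, 1, 1]]
```

With these values, $D_s = 14 + 2 = 16$. For $SATD_c = 500$, $Q = 10$ and a made-up $\beta = 0.2$, the estimated total is $1000 + 16 = 1016$.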
[Figure 3.6: The relationship between the distortion of coded blocks and $SATD_c(Q) \cdot Q^p$
for three test sequences (“Foreman”, “Mother & Daughter” and “Carphone”) in (a) I frames,
(b) P frames and (c) B frames. Horizontal axis: $SATD_c(Q) \cdot Q^p$ (with $p = 1.2$ in (a));
vertical axis: $D_c$ (SSE).]

[Figure 3.7: The relationship between the actual total distortion and the estimated
total distortion for several test sequences (“Foreman”, “Mother & Daughter” and “Carphone”)
in (a) I frames, (b) P frames and (c) B frames. Axes: total distortion (SSE) and estimated
total distortion (SSE).]

3.3.4 Block Type Identification
To apply the proposed source rate and distortion models properly, we have to
identify whether a 4×4 block is coded or not at different quantization parameters
after DCT and quantization. In H.264, the integer DCT and the corresponding
scalar quantization are applied to a 4×4 block to avoid both multiplications and
divisions in DCT and division in quantization [28]. Let $X(i,j)$ and $X_q(i,j)$ denote
the DCT coefficients at position $(i,j)$ before and after the quantization operation,
respectively. Coefficient $X(i,j)$ is quantized via

$$X_q(i,j) = \text{sign}\{X(i,j)\} \cdot \left[ \left( |X(i,j)| \cdot A(Q_M,i,j) + f \cdot 2^{17+Q_E} \right) \gg (17 + Q_E) \right], \tag{3.16}$$

where $Q_M = QP \bmod 6$, $Q_E = \lfloor QP/6 \rfloor$, $A(Q_M,i,j)$ is a function specified by the
quantization table that gives the integer quantization scaling factor at $(i,j)$, and
$f$ controls the quantization width near the origin. As indicated above, $Q_M$ and
$Q_E$ depend on the quantization parameter and, for that reason, the quantization
process is performed differently.
Coded 4×4 blocks are identified during the DCT/quantization module at the
first stage of our proposed encoder structure shown in Fig. 3.2. Since the difference
between $QP_1$ and $QP_2$ is restricted to $\Delta$ (which is set to 3 in this work) as mentioned
before, we only need to identify block types at the quantization parameter values
from $QP_1 - \Delta$ to $QP_1 + \Delta$. To determine whether a 4×4 block is coded or not at
these admissible values, we first find the DCT coefficient $X_{max}$ that gives the maximum
value of $|X(i,j)| \cdot A(Q_M,i,j)$ with $QP_1$:

$$X_{max} = \arg\max_{1 \leq i,j \leq 4} |X(i,j)| \cdot A(Q_M,i,j). \tag{3.17}$$

After that, $X_{max}$ is re-quantized using all admissible values. Please note that, while
$A(Q_M,i,j)$ depends on the quantization parameter, their rank order remains the
same regardless of it. In other words, $X_{max}$ has the maximum value of
$|X(i,j)| \cdot A(Q_M,i,j)$ at other quantization parameters, too. Thus, by quantizing $X_{max}$
with all admissible values, we can identify whether a 4×4 block is coded or not for
this range of quantization parameters. That is, if $X_{max}$ is zero after quantization
at a particular quantization parameter, the 4×4 block is a skipped block at this
value. Otherwise, it is a coded block.
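The identification procedure can be sketched as follows, under loudly stated assumptions. The scaling function `flat_scale` is a hypothetical placeholder and NOT the H.264 quantization table; a real encoder would supply the table values for $A(Q_M,i,j)$. The shift base of 17 and the rounding control $f$ follow (3.16) as written in the text, and positions are indexed from 0 in the code.

```python
def quantize_level(abs_x, a, f, qp, shift_base=17):
    """Magnitude of the quantized coefficient following (3.16)."""
    shift = shift_base + qp // 6         # 17 + Q_E
    offset = int(f * (1 << shift))       # f * 2^(17 + Q_E)
    return (abs_x * a + offset) >> shift


def block_types(coeffs, scale, f, qp1, delta=3):
    """Map each admissible QP to True (coded) or False (skipped).

    coeffs: 4x4 list of integer DCT coefficients.
    scale:  callable scale(q_m, i, j) returning A(Q_M, i, j).
    Assumes qp1 - delta >= 0.
    """
    qm1 = qp1 % 6
    positions = [(i, j) for i in range(4) for j in range(4)]
    # X_max: the coefficient maximizing |X(i,j)| * A(Q_M,i,j) at QP1, as in (3.17).
    i_max, j_max = max(
        positions, key=lambda ij: abs(coeffs[ij[0]][ij[1]]) * scale(qm1, ij[0], ij[1])
    )
    x_abs = abs(coeffs[i_max][j_max])
    result = {}
    for qp in range(qp1 - delta, qp1 + delta + 1):
        a = scale(qp % 6, i_max, j_max)
        # The block is coded if the largest coefficient survives quantization.
        result[qp] = quantize_level(x_abs, a, f, qp) > 0
    return result


def flat_scale(q_m, i, j):
    """Hypothetical placeholder table: the same factor 2^17 everywhere."""
    return 1 << 17


block = [[10, -3, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

Only one coefficient per block is re-quantized for each admissible value, i.e., $2\Delta + 1$ quantizations instead of 16 per value. With the placeholder table and $f = 1/6$, the example block is coded for QP 21 to 23 and skipped for QP 24 to 27.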
3.4 Two-Stage Rate Control for H.264 Baseline-Profile Encoder
Although the ρ-domain rate and distortion models have been employed successfully
for various video encoders, they are difficult to apply to the H.264 encoder.
The main problem is that it is not easy to find the one-to-one correspondence
between ρ and Q in H.264 due to the complicated expression given in (3.16). To
apply the ρ-domain rate and distortion models, we may quantize all DCT coefficients
using all the admissible quantization parameters to obtain (ρ, Q) pairs. It
demands more computations than the coded 4×4 block identification method as
described above. Generally speaking, the ρ-domain approach is less direct than
the q-domain approach with the enhanced rate and distortion models discussed in
Secs. 3.3.1 - 3.3.3. In this section, in order to show the efficiency of the two-stage
encoder structure and the enhanced rate and distortion models, we propose two
constant bit rate control algorithms for the H.264 baseline-profile encoder, which
can be used in real-time conversational video coding applications.
We propose a straightforward H.264 rate control algorithm for the baseline-profile
encoder as follows.

Rate Control Algorithm without Bit Allocation

1. Initial bit allocation to a frame
Allocate a certain bit budget to the current frame, $R_T$. A simple frame-layer
bit allocation method is described in Sec. 3.5.

2. First-stage encoding
Perform the first stage of encoding using $QP_1$ to get the residual signal and
the header information of the current frame, where $QP_1$ is chosen to be the
average $QP_2$ of the previous frame. The coded 4×4 blocks are identified using
the method described in Sec. 3.3.4 and the corresponding $SATD_c(Q)$ values of all
MBs are computed for $QP \in [QP_1 - \Delta, QP_1 + \Delta]$. Estimate the required
number of header bits using (3.9) and get the available source bits $R_{src}$.

3. Second-stage encoding
Suppose that the current MB number is $m$. Let $R_{src}$ be the available source
bits before encoding the $m$-th MB. We can determine the quantization parameter
$QP_2$ for the $m$-th MB, denoted by $QP_{2,m}$, as follows. First, compute
the sum of $SATD_c(Q)$ of remaining MBs for $QP \in [QP_1 - \Delta, QP_1 + \Delta]$.
Using (3.12) and the $SATD_c(Q)$ values, we can estimate source bits $\hat{R}_{src}(Q)$ for
the quantization parameters ranging from $QP_1 - \Delta$ to $QP_1 + \Delta$, and choose
the value that minimizes the distance between the available source bits and
the estimated source bits, i.e.,

$$QP_{2,m} = \arg\min_{QP_1 - \Delta \leq q \leq QP_1 + \Delta} |\hat{R}_{src}(Q(q)) - R_{src}|,$$

where $Q(q)$ is the actual quantization stepsize with respect to a quantization
parameter $q$. Then, we can encode the residual signal of the $m$-th MB using
$QP_{2,m}$ and update $R_{src}$ by subtracting the actual source bits of the $m$-th MB
from it. The above procedure is repeated for all MBs in the current frame.

4. Model update
Update model parameters $\gamma$ and $\alpha$ and the buffer fullness. The model parameters
are updated by the least square approximation using data from the
previous 5 frames. Then, we proceed to the next frame until the last frame
is reached.

Please note that there are $(2\Delta + 1)$ values of $SATD_c(Q)$ for each MB and
$(2\Delta + 1)$ estimated source bits $\hat{R}_{src}(Q)$ corresponding to each quantization
parameter between $QP_1 - \Delta$ and $QP_1 + \Delta$. Therefore, $QP_{2,m}$ is determined
from a discrete set of values by minimizing the distance to the available source
bits rather than being computed from a closed-form source rate model directly.
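
The discrete selection rule of Step 3 can be illustrated with a minimal Python sketch. It assumes that the (2Δ + 1) source-bit estimates have already been produced by the model in (3.12); the function name `select_qp2`, the dictionary layout, the tie-breaking rule and all numbers are hypothetical and serve only as an illustration.

```python
def select_qp2(est_source_bits, available_bits):
    """Return the candidate QP whose estimated source bits are closest to the budget.

    est_source_bits maps every candidate QP between QP1 - delta and QP1 + delta
    to the estimated source bits of the remaining MBs (hypothetical values that
    stand in for the output of the source rate model).
    Ties are broken toward the smaller QP, which is an assumption of this sketch.
    """
    return min(sorted(est_source_bits),
               key=lambda q: abs(est_source_bits[q] - available_bits))


qp1, delta = 30, 3
# Hypothetical estimates for the 2 * delta + 1 = 7 candidate QPs.
estimates = {27: 5200, 28: 4700, 29: 4300, 30: 3900, 31: 3500, 32: 3200, 33: 2900}
qp2_m = select_qp2(estimates, available_bits=3600)
print(qp2_m)
```

In an encoder loop, the estimates would be recomputed for the remaining MBs after each MB is coded and the available source bits are updated.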
The above rate control algorithm without bit allocation does not take distortion
into account in determining the quantization parameter, i.e., $QP_{2,m}$ of the $m$-th
macroblock. To further improve picture quality, an R-D optimized bit allocation
method among MB classes is proposed as well by exploiting the different R-D
characteristics of different MB classes. To be more specific, let $N_c$ be the number
of classes and let $R_{src,l}$ and $D_l$ be the number of source bits and the distortion of
the $l$-th class, respectively. Then, given the available source bits $R_{src}$ for a frame,
our objective is to determine $R_{src,l}$ for all $1 \le l \le N_c$ such that the total distortion
of the frame is minimized. In other words,
$$\text{minimize} \ \sum_{l=1}^{N_c} D_l \quad \text{subject to} \quad \sum_{l=1}^{N_c} R_{src,l} \le R_{src}. \qquad (3.18)$$
Several MB classification methods have previously been proposed for various
purposes, including rate control [18]. A good MB classification method can improve
the coding performance. In this work, a simple grouping scheme is considered to
demonstrate the idea, where MBs in the same row, i.e., a group of blocks (GOB),
are classified into the same class.
To perform R-D optimized bit allocation among MB classes, a discrete set of R-D
data for the $l$-th class is estimated after the first stage of encoding for all $1 \le l \le N_c$.
For example, $SATD_c(Q)$ of each class is computed by the sum of $SATD_c(Q)$ of all
MBs in that class for the quantization parameters from $QP_1 - \Delta$ to $QP_1 + \Delta$. After
that, the number of source bits and the corresponding distortion of each class can
be estimated for each admissible quantization parameter. That is, the source bits
and distortion are estimated by the simplified source rate model in (3.12) and the
simplified distortion model in (3.15), respectively. Using the estimated sets of R-D
data, the following algorithm can be conducted, which performs R-D optimized bit
allocation among MB classes using the gradient descent method [40].
Bit Allocation among MB Classes
1. Initialization with $q_l = QP_1 - \Delta$
We initialize the rate allocation with the finest quantization parameter and
compute $\hat{R}_{src,l} = \hat{R}_{src,l}(Q(q_l))$ for all $1 \le l \le N_c$, where $Q(q_l)$ is the
quantization stepsize corresponding to $q_l$. Let $\hat{R}_{src} = \sum_{l=1}^{N_c} \hat{R}_{src,l}$ be the
total number of allocated bits. If $\hat{R}_{src} \le R_{src}$, we stop the bit allocation procedure
since the maximum bit rate under our scheme does not exceed the available source
rate. Then, we simply allocate bits to the $l$-th class via $R_{src,l} = \hat{R}_{src,l}$ for all
$1 \le l \le N_c$. Otherwise, if $\hat{R}_{src} > R_{src}$, we need to enlarge the quantization
parameter by going to the next step.
2. R-D optimization
Based on the simplified rate and distortion models, we can compute
$$\lambda(l,k) = -\frac{\hat{D}_l(Q(q_l)) - \hat{D}_l(Q(k))}{\hat{R}_{src,l}(Q(q_l)) - \hat{R}_{src,l}(Q(k))},$$
where $1 \le l \le N_c$ and $q_l + 1 \le k \le QP_1 + \Delta$, and choose the $l$ and $k$ that
give the minimum value of $\lambda(l,k)$. This means that the chosen $l$ and $k$ yield the least
quality degradation under a unit rate reduction. Then, for the chosen $l$ and $k$, we can set
$q_l = k$ and $\hat{R}_{src,l} = \hat{R}_{src,l}(Q(k))$ and update $\hat{R}_{src} = \sum_{l=1}^{N_c} \hat{R}_{src,l}$. We repeat this
step until the following stopping criterion is met.
3. Stopping Criterion
If $\hat{R}_{src} = R_{src}$ or $q_l = QP_1 + \Delta$ for all $1 \le l \le N_c$, stop Step 2 and the final
allocated bits to the $l$-th class, $R_{src,l}$, is set to $\hat{R}_{src,l}$ for all $1 \le l \le N_c$.
The basic idea of the above bit allocation method can be simply stated as follows.
First, it allocates the maximum bits to all classes using the smallest quantization
parameters. Usually, this rate will be higher than the available bit rate, so the
bits allocated to some classes have to be reduced. In Step 2, we choose the class
that gives the minimum distortion increase per unit of bit rate reduction. This process
repeats until the target bit budget is met. With MB-class bit allocation, the proposed rate
control algorithm can be performed on the MBs within each class.
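
The greedy procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the per-class rate and distortion tables are hypothetical numbers standing in for the estimates from (3.12) and (3.15), the function name `allocate_bits` is invented for this sketch, and the loop stops as soon as the allocated bits no longer exceed the budget (a "less than or equal" reading of the stopping criterion, since exact equality is rarely reached on a discrete set).

```python
def allocate_bits(rate, dist, budget):
    """Greedy R-D bit allocation among MB classes.

    rate[l][j] and dist[l][j] are the estimated source bits and distortion of
    class l at the j-th admissible QP, where j = 0 is the finest QP
    (QP1 - delta) and the last index is the coarsest QP (QP1 + delta).
    Returns the chosen QP index per class and the total allocated bits.
    """
    n_classes = len(rate)
    last = len(rate[0]) - 1
    q = [0] * n_classes                       # Step 1: finest QP everywhere
    total = sum(rate[l][0] for l in range(n_classes))
    while total > budget:                     # Step 3 (assumed "<=" stop test)
        best = None
        for l in range(n_classes):
            for k in range(q[l] + 1, last + 1):
                saved = rate[l][q[l]] - rate[l][k]
                if saved <= 0:
                    continue                  # no rate reduction, skip this move
                # lambda(l, k): distortion increase per bit saved
                lam = (dist[l][k] - dist[l][q[l]]) / saved
                if best is None or lam < best[0]:
                    best = (lam, l, k)
        if best is None:                      # all classes are at the coarsest QP
            break
        _, l, k = best                        # Step 2: apply the cheapest move
        total -= rate[l][q[l]] - rate[l][k]
        q[l] = k
    return q, total


# Hypothetical estimates for two classes and three admissible QPs.
rate = [[1000, 800, 650], [1200, 900, 700]]
dist = [[10, 14, 19], [8, 20, 35]]
q_idx, allocated = allocate_bits(rate, dist, budget=1900)
print(q_idx, allocated)
```

In this toy example the first class has the flatter distortion curve, so both reductions are taken from it and the second class keeps its finest quantization parameter.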
3.5 Experimental Results
The proposed rate control algorithms with and without bit allocation are compared
with the rate control algorithm in JM8.1a [15] for the H.264 baseline-profile encoder.
The JM8.1a rate control algorithm is based on [24]. In our experiments, all frames
are encoded as P frames except for the first I frame. Generally speaking, the coded
video quality largely depends on the frame-layer bit allocation method. More bits
are allocated to the I frame and the preceding P frames in many rate control
algorithms to improve average quality according to the monotonicity property [35].
However, since the main purpose of our experiments is to show the advantage of
the new encoder structure with the enhanced rate and distortion models, we adopt
a simple frame-based bit allocation as stated below.
Because we use the rate control scheme in JM8.1a [24] as the benchmark, the
first I frame is encoded using the quantization parameter determined by the same
rule as in JM8.1a for fair comparison. The target bits for subsequent P frames are
allocated by the following rule in our scheme. Let $N_r(n)$ and $T_r(n)$ be the number
of remaining frames and the number of remaining bits before encoding the $n$-th
frame, respectively. Then, the target bits for the $n$-th frame are simply determined
as $T_r(n)/N_r(n)$. The Δ value is set to 3 and the initial values of model parameters
α, β and γ in (3.5), (3.12) and (3.15) are set to 6.0, 0.4 and 0.04, respectively, for
the first P frame in our algorithm. For all experiments, the buffer size is set to
twice the channel rate and the initial buffer occupancy (i.e., initial delay in
terms of buffer) is set to half of the buffer size.
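
As a small illustration of the frame-layer rule $T_r(n)/N_r(n)$, the following Python sketch tracks the remaining bits and frames. The function name `frame_targets` and the numbers are hypothetical, and the separate treatment of the first I frame is left out for brevity.

```python
def frame_targets(total_bits, actual_bits_per_frame):
    """Return the target T_r(n) / N_r(n) for each frame.

    actual_bits_per_frame lists the bits actually spent on each frame
    (hypothetical values here); the remaining budget is updated after
    every frame, so an overshoot is spread over the frames that follow.
    """
    remaining_bits = total_bits
    n_frames = len(actual_bits_per_frame)
    targets = []
    for n, spent in enumerate(actual_bits_per_frame):
        remaining_frames = n_frames - n
        targets.append(remaining_bits / remaining_frames)
        remaining_bits -= spent
    return targets


# Four frames sharing 6400 bits; the first frame overshoots its target by 100 bits.
targets = frame_targets(6400, [1700, 1500, 1600, 1600])
print(targets)
```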
Several QCIF sequences are encoded using the low-complexity RDO with a single
reference frame at 30 fps. As mentioned in Sec. 3.4, the MBs in the same GOB
are classified into the same class for bit allocation. Table 3.4 shows the experimental
results by three different rate control algorithms. Note that the coding gains in
this table are all calculated with respect to the rate control scheme in JM8.1a. The
results show the proposed algorithm with or without bit allocation can meet the
target bit rate more accurately for all test sequences at all bit rates. The picture
quality is also improved. While the average coding gain and the maximal coding
gain are 0.25 dB and 0.39 dB, respectively, by the proposed rate control algorithm
without bit allocation, they can go up to 0.60 dB and 1.02 dB, respectively, by
the proposed rate control with bit allocation. The coding gain improvement is
significant.
                        JM8.1a         Ours w/o Bit Alloc.       Ours w/ Bit Alloc.
Sequence            Bitrate    PSNR   Bitrate    PSNR           Bitrate    PSNR
Mother & Daughter     48.06   36.30     48.01   36.62 (+0.32)     48.00   37.05 (+0.75)
Mother & Daughter     64.04   37.73     64.00   38.01 (+0.28)     64.00   38.40 (+0.67)
Mother & Daughter     96.12   40.14     96.00   40.38 (+0.24)     95.99   40.55 (+0.41)
Silent                48.03   32.18     48.01   32.19 (+0.01)     47.99   32.57 (+0.39)
Silent                64.03   33.54     64.00   33.56 (+0.02)     64.00   33.95 (+0.41)
Silent                96.05   36.65     96.01   36.82 (+0.17)     96.00   37.02 (+0.37)
Salesman              48.01   33.78     48.00   33.99 (+0.21)     47.99   34.66 (+0.88)
Salesman              64.04   35.36     64.01   35.59 (+0.23)     63.98   36.26 (+0.90)
Salesman              96.09   38.51     96.00   38.78 (+0.27)     95.99   39.18 (+0.67)
Carphone              48.06   31.07     48.01   31.44 (+0.37)     48.00   31.66 (+0.59)
Carphone              64.09   32.43     64.01   32.74 (+0.31)     63.99   32.94 (+0.51)
Carphone              96.11   34.41     96.01   34.67 (+0.26)     95.99   34.77 (+0.36)
Foreman               48.11   29.99     48.00   30.25 (+0.26)     48.00   30.52 (+0.53)
Foreman               64.12   31.53     64.00   31.79 (+0.26)     64.00   31.93 (+0.40)
Foreman               96.23   33.52     96.00   33.70 (+0.18)     96.00   33.83 (+0.31)
News                  48.10   32.89     48.02   33.24 (+0.35)     47.99   33.78 (+0.89)
News                  64.15   34.48     63.99   34.87 (+0.39)     63.99   35.50 (+1.02)
News                  96.21   37.79     95.99   38.02 (+0.23)     95.99   38.38 (+0.59)
Table 3.4: Performance of the proposed rate control algorithms with and without
bit allocation. The target bit rates are 48, 64 and 96 Kbps for all sequences.
Bit rates are in Kbps and PSNR values in dB; the values in parentheses are PSNR
gains over JM8.1a.
As mentioned in Sec. 3.3, the simplified rate and distortion models are employed
for the experiments. In these models, there are three parameters to be tuned: ω
in the header rate model (3.5), p in the source rate model (3.12) and p in the
distortion model (3.15). These parameters are introduced to simplify the models
given by (3.4), (3.11) and (3.14), respectively. Their values as specified in
Sec. 3.3 work well for all test sequences. In other words, they are robust with
respect to the test sequences shown in Table 3.4.
The improved coding gain by the proposed rate control scheme without bit
allocation mainly comes from the saving of bits required to encode the difference of
the quantization parameters between two successive MBs. Fig. 3.8 illustrates the
variation of quantization parameters in a frame for several sequences at different bit
rates. We see that even though each MB can have its own quantization parameter,
the variation is small due to the accurate rate models. The difference between the
largest value and the smallest value is at most 2 for most of the frames. As a
result, fewer bits are required to encode the differences. On the contrary,
the inaccurate source rate and header bit estimation methods in JM8.1a cause
a large variation in quantization parameters. Consequently, more bits are needed
to encode their differences. The large variation also causes significant quality
fluctuations within a frame. As evidence, we can infer from Fig. 3.8 that the JM8.1a rate
control algorithm produces higher quality MBs in the beginning part of a frame
and poorer quality MBs in the ending part of a frame. Furthermore, by employing
bit allocation, we can further improve the PSNR value by 0.35 dB on average
over the one without bit allocation. The amount of improvement depends on the
characteristics of video sequences. When the MBs in a frame differ more in their R-D
characteristics, the additional coding gain is larger.
Fig. 3.9 shows the distribution of $QP_1 - QP_2$ for the “Foreman” and “News”
sequences, when they are coded at 64 Kbps by the proposed rate control without bit
allocation. In the case of the “Foreman” sequence, the probability of $|QP_1 - QP_2| \le 1$
is larger than 95% and the probability of $|QP_1 - QP_2| \ge 3$ is less than 2%. In the
case of the “News” sequence, these probabilities are larger than 80% and less than
4%, respectively. According to the discussion in Sec. 3.2, the high probability of
$|QP_1 - QP_2| \le 1$ implies that the coding gain loss by the two-stage encoding is very
small. The low probability of $|QP_1 - QP_2| \ge 3$ implies that 3 is a reasonable choice
for the value of Δ. Furthermore, it is observed that the difference between these
two QP values is small for most frames in a sequence whether the sequence is of
high activity or not. The large difference between them (i.e., ≥ 3) is observed in the
scene change frames. In the “News” sequence, the probability of $|QP_1 - QP_2| \ge 3$
is larger due to the background scene changes at the 90-th, 150-th, and 240-th
frames, and we can reduce the sudden quality change at these frames by setting
Δ = 3. Generally speaking, we conclude from Fig. 3.9 that the average $QP_2$ of
the previous frame provides a good estimate of $QP_1$ of the current frame in this
application with the baseline profile.
Figure 3.8: The variation of quantization parameters in a frame by the proposed
rate control scheme without bit allocation (solid line) and the rate control in JM8.1a
(dashed line): (a)-(b) the 10-th and 50-th frames of “Mother & Daughter” at 48
Kbps; (c)-(d) the 10-th and 50-th frames of “Silent” at 64 Kbps; and (e)-(f) the
10-th and 50-th frames of “Salesman” at 96 Kbps. [Six plots of QP versus MB index.]
Figure 3.9: The distributions of $QP_1 - QP_2$ by the proposed MB layer rate control
without bit allocation for the QCIF sequences, (a) “Foreman” and (b) “News” at
64 Kbps. [Two histograms of percentage (%) versus $QP_1 - QP_2$, from -3 to 3.]
We show the number of allocated bits and the PSNR value as a function of the
frame number by the proposed rate control algorithm with bit allocation and the
rate control algorithm in JM8.1a for the three test sequences in Figs. 3.10, 3.11
and 3.12. As shown in these figures, the target bit of each frame can be achieved
very precisely by the proposed rate control algorithm. This is important when the
decoder buffer size is limited, where frame skipping may occur often if the number
of bits cannot be well controlled. Furthermore, the PSNR improvement of the
proposed rate control scheme over the rate control algorithm in JM8.1a is evident.
3.6 Conclusion
A novel model-based rate control algorithm based on the two-stage encoding was
proposed for the H.264 baseline-profile encoder in this chapter. The inter-dependency
of RDO and rate control is resolved by the two-stage encoding scheme at the cost of
acceptable extra encoding complexity. An enhanced header rate model was established
to estimate the number of header bits more accurately due to the increased
importance of header bits in H.264. Enhanced source rate and distortion models
were also proposed based on coded 4×4 block identification. It was shown by experimental
results that both rate and distortion can be well estimated by the proposed
models. In addition, an MB-based bit allocation method was proposed to improve
the picture quality. The proposed rate control algorithm can achieve the target bit
rate of a frame more accurately with improved R-D performance as compared with
the rate control algorithm in JM8.1a.
It is worthwhile to point out that we may need a more sophisticated estimation
method for $QP_2$ to improve the performance for different video coding applications,
for example, non-conversational applications that require a finite GOP structure
with B frames. This open issue will be addressed in the next chapter.
Figure 3.10: Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Mother &
Daughter” at 48 Kbps: (a) the number of allocated bits per frame and (b) the
PSNR value per frame. [Plots of bits and PSNR (dB) versus frame number.]
Figure 3.11: Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Silent” at 64
Kbps: (a) the number of allocated bits per frame and (b) the PSNR value per
frame. [Plots of bits and PSNR (dB) versus frame number.]
Figure 3.12: Performance comparison of the proposed rate control with bit allocation
(solid line) and the rate control in JM8.1a (dashed line) for “Salesman” at
96 Kbps: (a) the number of allocated bits per frame and (b) the PSNR value per
frame. [Plots of bits and PSNR (dB) versus frame number.]
Chapter 4
Two-Pass Frame-Layer Bit Allocation for H.264 Video
4.1 Introduction
The frame-layer bit allocation problem is another important issue for the non-
conversational H.264 video that has a fixed GOP structure. In this chapter, as a
fundamental study, this problem is addressed by a two-pass algorithm using the
Lagrange optimization framework. The proposed two-pass algorithm estimates the
R-D information of all frames in a GOP by pre-encoding them using a constant
quantization parameter in the first pass. The target bit number of each frame is
determined by searching the Lagrange multiplier that meets the total bit budget of
the GOP. Finally, all frames are actually encoded in the second pass.
It is worthwhile to note that most two-pass algorithms developed for previous
standards adopt a similar idea, i.e., frame analysis via pre-encoding in the first
pass and actual encoding in the second pass. However, their performance largely
depends on the accuracy of frame analysis. To the best of our knowledge, no
frame-layer bit allocation has ever been proposed for H.264 with model-based R-D
analysis. In this chapter, we show how our two-stage encoding scheme and frame rate and
distortion models in Chapter 3 can be applied successfully for non-conversational
video applications. The proposed algorithm can be applied to off-line applications
or real-time applications that allow a delay of one GOP period.
This chapter is organized as follows. We first formulate the frame-layer bit
allocation problem in Sec. 4.2. Then, a two-pass frame-layer bit allocation algorithm
is proposed in Sec. 4.3. Experimental results are provided in Sec. 4.4, which is
followed by concluding remarks in Sec. 4.5.
4.2 Problem Formulation
For a given channel bit rate, the bit allocation problem among the frames in a GOP
is considered with the following assumptions.
• A constant number of frames is grouped in a fixed structure of GOP, where
B frames are not used as reference frames. A GOP structure can be changed
adaptively depending on the spatial and temporal scene context. Moreover,
in H.264, B frames can be used as references for inter prediction. However,
we do not consider such cases.
• Coded bitstreams are delivered over CBR channels such as satellite video
broadcasting under the bit-budget constraint.
• The encoder and the decoder buffers are large enough and the corresponding
initial delays are allowed such that they do not become the constraints of the
bit allocation problem.
• Frames are independent of each other to make the bit allocation problem more
tractable, although this assumption results in some loss of coding efficiency.
Let $N$ be the number of frames in a GOP and let $R_i(Q(q_i))$ and $D_i(Q(q_i))$ be
the rate and distortion of the $i$-th frame with respect to its quantization parameter
$q_i$, where $Q(q_i)$ is the quantization stepsize corresponding to $q_i$. Under the assumption
of frame independence, the frame-layer bit allocation problem is formulated as
$$\text{minimize} \ \sum_{i=1}^{N} \omega_i \cdot D_i(Q(q_i)) \quad \text{subject to} \quad \sum_{i=1}^{N} R_i(Q(q_i)) \le R_{T,GOP}, \qquad (4.1)$$
where $R_{T,GOP}$ is the target bit budget for a GOP, which is determined by the
product of the channel rate and the duration of a GOP, and $\omega_i$ is the weight for
the $i$-th frame, which takes the corresponding frame type into account. When all
frames are I frames, their weights are the same. If there are multiple frame types,
larger weights should be given to reference frames to achieve higher video quality
according to the monotonicity property [35]. In this work, based on experimental
observations, $\omega_i$ is set to 1.5, 1.0 and 0.5 for I, P and B frames, respectively.
It is well known that the problem in (4.1) can be effectively solved by minimizing
the following Lagrange cost:
$$J = \sum_{i=1}^{N} \omega_i \cdot D_i(Q(q_i)) + \lambda \cdot \sum_{i=1}^{N} R_i(Q(q_i)), \qquad (4.2)$$
for a Lagrange multiplier λ that satisfies the bit-budget constraint [43]. For the rest
of this chapter, we consider two GOP structures, I-P-P-P and I-B-B-P, where each
GOP size is equal to 15. However, the proposed algorithms are general enough to
allow extensions to other GOP structures with slight modification.
4.3 Two-Pass Frame-Layer Bit Allocation
A two-pass bit allocation algorithm is proposed in this section.
4.3.1 First-Pass Encoding
To optimally allocate bits among the frames in a GOP, it is necessary to collect R-D
information by analyzing frames. In the proposed two-pass algorithm, this task is
carried out in the first pass by pre-encoding them using a constant quantization
parameter, which is called a “preanalysis QP”. Its value is determined by different
methods according to the GOP structure.
Let $QP_{avg}$ be the average quantization parameter of the previous GOP, and let
$QP_{p,i}$ be the preanalysis QP of the $i$-th frame in the current GOP. For the I-P-P-P
case, $QP_{p,i}$ is set by
$$QP_{p,i} = QP_{avg}, \quad \forall i = 1, \ldots, N. \qquad (4.3)$$
In general, an effective way to improve video quality is to encode B frames with a
larger quantization parameter, since B frames are not used as references to other
frames. Therefore, for the I-B-B-P case, we adopt a simple empirical rule for the
selection of the preanalysis QP values. To be more specific, if the $i$-th frame is an
I or P frame, we choose
$$QP_{p,i} = QP_{avg} - 2 \cdot \frac{N_B}{N}, \qquad (4.4)$$
where $N_B$ is the number of B frames in a GOP. Otherwise, we choose
$$QP_{p,i} = QP_{avg} - 2 \cdot \frac{N_B}{N} + 2, \qquad (4.5)$$
when the $i$-th frame is a B frame. Please note that the above rule leads to
$$\left( QP_{avg} - 2 \cdot \frac{N_B}{N} \right) \cdot \frac{N - N_B}{N} + \left( QP_{avg} - 2 \cdot \frac{N_B}{N} + 2 \right) \cdot \frac{N_B}{N} = QP_{avg},$$
where $N - N_B$ is the number of I and P frames.
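
The preanalysis QP rule in (4.3)-(4.5) and its averaging property can be checked with a short Python sketch. The function name `preanalysis_qps` and the string encoding of the GOP pattern are assumptions made for illustration; the sketch keeps fractional QP values, since rounding to an integer QP is not specified above.

```python
def preanalysis_qps(frame_types, qp_avg):
    """Return the preanalysis QP of every frame in a GOP.

    frame_types is a sequence of "I", "P" and "B" labels. With no B frames the
    offset 2 * N_B / N is zero, so the rule reduces to (4.3).
    """
    n = len(frame_types)
    n_b = sum(1 for t in frame_types if t == "B")
    base = qp_avg - 2.0 * n_b / n            # (4.4): I and P frames
    return [base + 2.0 if t == "B" else base for t in frame_types]  # (4.5): B frames


# A 15-frame I-B-B-P GOP contains 10 B frames.
gop = list("IBBPBBPBBPBBPBB")
qps = preanalysis_qps(gop, qp_avg=30)
mean_qp = sum(qps) / len(qps)
print(qps[0], qps[1], mean_qp)
```

For this GOP the I and P frames are pre-encoded about 1.33 below the previous GOP average and the B frames about 0.67 above it, while the mean over the GOP stays at the previous average.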
In the first pass, the R-D information of each frame is estimated using the
frame rate and distortion models as follows. Suppose that the $i$-th frame is the
target frame. The RDO process is performed first using $QP_{p,i}$. During the RDO
process, the coded 4×4 blocks are identified so that $SATD_{c,i}(Q)$ and $D_{s,i}(Q)$ can be
computed for quantization parameters from $QP_{p,i} - \Delta$ to $QP_{p,i} + \Delta$. After entropy
coding, based on the actual header bits, source bits and distortion with respect to
$QP_{p,i}$, we update frame-layer model parameters $\gamma_i$, $b_{intra,i}$, $\alpha_i$ and $\beta_i$ in (3.5), (3.8),
(3.12) and (3.15), respectively. Using these values of model parameters, we estimate
source bits $\hat{R}_{src,i}(Q)$ and coded distortion $\hat{D}_{c,i}(Q)$ for the quantization parameters
from $QP_{p,i} - \Delta$ to $QP_{p,i} + \Delta$. Then, the estimated total distortion of the $i$-th frame
can be written as
$$\hat{D}_i(Q(q_i)) = D_{s,i}(Q(q_i)) + \hat{D}_{c,i}(Q(q_i)), \qquad (4.6)$$
while the estimated total bits of the $i$-th frame can be written as
$$\hat{R}_i(Q(q_i)) = R_{hdr,i} + \hat{R}_{src,i}(Q(q_i)), \qquad (4.7)$$
where $R_{hdr,i}$ is the actual header bits after the RDO process with respect to $QP_{p,i}$.
The R-D characteristic of each frame depends on the quantization parameter
used in the RDO process. In the proposed algorithm, since the RDO process is
performed using $QP_{p,i}$, the rate and distortion estimated for a QP value that
differs greatly from $QP_{p,i}$ will deviate considerably from the actual rate and
distortion. For this reason, we estimate the R-D information only for a small range
of quantization parameters around $QP_{p,i}$ to reduce estimation error. Even though
the number of header bits $R_{hdr,i}$ also varies with $q_i$ in (4.7), it can be treated as
a constant without introducing a large estimation error when $q_i$ ranges only from
$QP_{p,i} - \Delta$ to $QP_{p,i} + \Delta$ and Δ is a small value.
4.3.2 Frame-Layer Bit Allocation
After the first encoding pass, the target bit budget $R_{T,GOP}$ is allocated among
the frames in a GOP using the Lagrange optimization technique. For this purpose,
a set of quantization parameters $q^* = (q_1^*, \ldots, q_N^*)$ that minimizes the following
Lagrange cost is determined. That is,
$$q^* = \arg\min_{\forall q_i \in (QP_{p,i} - \Delta, \, QP_{p,i} + \Delta)} \sum_{i=1}^{N} \omega_i \cdot \hat{D}_i(Q(q_i)) + \lambda^* \cdot \sum_{i=1}^{N} \hat{R}_i(Q(q_i)), \qquad (4.8)$$
for a Lagrange multiplier $\lambda^*$ that satisfies
$$R_{T,GOP} = \sum_{i=1}^{N} \hat{R}_i(Q(q_i^*)). \qquad (4.9)$$
The value of $\lambda^*$ can be obtained by a bisection search algorithm [32]. As a result,
the target bits for the $i$-th frame are given by $\hat{R}_i(Q(q_i^*))$.
Since the above procedure considers the R-D information only for a limited
range of QP (i.e., $QP_{p,i} - \Delta \le q_i \le QP_{p,i} + \Delta$), the selection of Δ is critical. On
one hand, it should be large enough to make bit allocation more efficient. On the
other hand, it is desirable to be small in order to minimize the R-D estimation error
and reduce the computational cost. From extensive experimental observations, we
found out that Δ = 5 is a good choice in the first pass.
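
The search for the multiplier in (4.8) and (4.9) can be sketched in Python as below. This is a minimal illustration, not the implementation used in the experiments: the function names, the initial bracket for λ, the iteration count and the per-frame R-D tables are all assumptions. Because the candidate set is discrete, the equality in (4.9) generally cannot be met exactly, so the sketch returns the allocation for the smallest multiplier found whose total rate does not exceed the budget.

```python
def pick_qps(rates, dists, weights, lam):
    """For a fixed multiplier, pick per frame the index minimising w * D + lam * R."""
    choice = []
    for r, d, w in zip(rates, dists, weights):
        costs = [w * dj + lam * rj for rj, dj in zip(r, d)]
        choice.append(costs.index(min(costs)))
    return choice


def gop_rate(rates, choice):
    """Total estimated bits of a GOP for the chosen per-frame indices."""
    return sum(r[j] for r, j in zip(rates, choice))


def allocate_gop(rates, dists, weights, budget, iters=50):
    """Bisection on the Lagrange multiplier (assumed bracket and iteration count)."""
    lo, hi = 0.0, 1.0
    # Grow the upper end until the allocation fits the budget (or give up).
    while gop_rate(rates, pick_qps(rates, dists, weights, hi)) > budget and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gop_rate(rates, pick_qps(rates, dists, weights, mid)) > budget:
            lo = mid          # too many bits: a larger multiplier is needed
        else:
            hi = mid          # within budget: try a smaller multiplier
    return pick_qps(rates, dists, weights, hi)


# Hypothetical estimates for two frames (index 0 is the finest candidate QP).
rates = [[900, 700, 500], [600, 450, 300]]
dists = [[10, 16, 30], [12, 20, 36]]
weights = [1.5, 1.0]
choice = allocate_gop(rates, dists, weights, budget=1200)
print(choice, gop_rate(rates, choice))
```

Since the frames are assumed independent, the minimization for a fixed multiplier decouples into one small search per frame, which is what keeps the bisection inexpensive.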
4.3.3 Second-Pass Encoding
To meet the allocated bits as exactly as possible, all frames in the GOP are finally
encoded in the second pass by the two-stage MB-layer rate control algorithm in
Chapter 3. In the second pass, $QP_1$ of the $i$-th frame is set to $q_i^*$ and Δ, which is the
maximum difference allowed between $QP_1$ in the first stage and $QP_2$ in the second
stage, is set to 3 such that the coding gain loss caused by the discrepancy between
$QP_1$ and $QP_2$ is restricted. Moreover, the values of model parameters estimated
for each frame in the first pass are also used to estimate rate and distortion of the
same frame in the second pass.
4.3.4 Two-Pass Rate Control with Bit Allocation
In summary, the two-pass rate control algorithm with frame-layer bit allocation is
described as follows.
1. Initial bit allocation to a GOP
Based on the frame rate $F$ and the channel rate $C$, allocate the target bit
budget to the current GOP by
$$R_{T,GOP} = N \cdot \frac{C}{F} + R_0,$$
where $R_0$ is a feedback term which compensates for the difference between
the target bits and the actual bits of the previously coded GOP.
2. First-pass encoding
Determine $QP_{p,i}$ and encode the $i$-th frame in the GOP with $QP_{p,i}$ to collect
the R-D information, for all $i = 1, \ldots, N$. Estimate the values of model
parameters $\gamma_i$, $b_{intra,i}$, $\alpha_i$ and $\beta_i$. Then, estimate $\hat{R}_i(Q)$ and $\hat{D}_i(Q)$ for
$QP \in (QP_{p,i} - \Delta, QP_{p,i} + \Delta)$, where Δ = 5.
3. Frame-layer bit allocation
Allocate the target bit budget $R_{T,GOP}$ among the frames by finding $q^*$ that
minimizes the Lagrange cost as shown in (4.8). For all $i = 1, \ldots, N$, the
target bits for the $i$-th frame are given by $\hat{R}_i(Q(q_i^*))$.
4. Second-pass encoding
All frames in the GOP are encoded by the two-stage MB-layer rate control
scheme in Chapter 3. In the second pass, Δ is set to 3. Moreover, the
model parameters estimated for each frame in the first pass are used for the
same frame. Update the encoder buffer occupancy after encoding each frame.
5. Proceed to the next GOP until we reach the last GOP.
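
The GOP-level budget of Step 1 can be illustrated with a few lines of Python. The form of the feedback term is an assumption of this sketch: $R_0$ is taken to be the target bits minus the actual bits of the previously coded GOP, which is one simple way to realize the compensation described above. The function name `gop_budget` and the numbers are hypothetical.

```python
def gop_budget(n_frames, channel_rate, frame_rate, prev_target=0.0, prev_actual=0.0):
    """Return R_T,GOP = N * C / F + R_0.

    R_0 is assumed to be the previous GOP's target minus its actual bits, so an
    overshoot in the previous GOP lowers the budget of the current one.
    """
    r0 = prev_target - prev_actual
    return n_frames * channel_rate / frame_rate + r0


# 15 frames at 30 fps over a 64 Kbps channel.
first = gop_budget(15, 64000, 30)
# The previous GOP spent 500 bits more than its target.
second = gop_budget(15, 64000, 30, prev_target=first, prev_actual=first + 500)
print(first, second)
```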
4.4 Experimental Results
In this section, the proposed two-pass algorithm is compared with the rate control
algorithm in JM8.1a [15] for the H.264 main-profile encoder. Two experiments are
performed for the I-P-P-P and the I-B-B-P cases, respectively. For both cases,
various types of sequences are encoded at target bit rates from 64 Kbps to 2048
Kbps as shown in Tables 4.1 and 4.2. For all test sequences, the two previously encoded
frames are used as references for inter prediction and 150 frames are encoded at 30
fps with a total of 10 GOPs, each of which has a size of 15 frames. We assumed in
Sec. 4.2 that both the encoder and the decoder have sufficiently large buffers. However,
for the experiments, the buffer sizes are set to twice the channel rate and
the initial buffer occupancy is set to the middle of the encoder buffer.
Table 4.1 shows the actual bit rates and the average PSNRs for the I-P-P-P case
by the proposed two-pass and the JM8.1a rate control algorithms. We see that the
two-pass algorithm can achieve higher coding gain compared with the JM8.1a rate
control algorithm. The maximum coding gain is 0.80 dB, for the QCIF “News”
sequence, and the average coding gain is 0.36 dB.
As to target bit rate control, the proposed algorithm can control the bit rates more
accurately. Table 4.2 shows the actual bit rates and the average PSNRs for the
I-B-B-P case. As compared with the JM8.1a rate control algorithm, the two-pass
algorithm achieves a maximum coding gain of 0.76 dB, for the CIF “Flower”
sequence, and an average coding gain of 0.52 dB. As to the target bit rate control,
similar observations as in the I-P-P-P case can be seen in the I-B-B-P case. From
Tables 4.1 and 4.2, we see that the coding gains are improved more in the I-B-B-P
case, especially for the sequences with high activity, such as “Flower”, “Stefan”
and “Football”. For example, in the I-B-B-P case, the average coding gains for the
CIF and the SIF sequences are improved by more than 0.60 dB. Therefore, we can say
that the proposed algorithm is more efficient in the I-B-B-P case, which is the
widely adopted GOP structure for non-conversational video applications.
                                  JM8.1a        Proposed Two-Pass
Sequence    Format    Target   Bitrate    PSNR   Bitrate    PSNR
News        QCIF       64.00     64.26   31.92     64.05   32.72 (+0.80)
Salesman    QCIF       64.00     64.13   31.86     64.01   32.53 (+0.67)
Foreman     QCIF       64.00     64.29   30.90     64.04   31.25 (+0.35)
Carphone    QCIF      128.00    128.22   36.87    128.05   37.14 (+0.27)
Coastguard  QCIF      128.00    128.13   30.46    128.03   30.75 (+0.29)
T.Tennis    QCIF      128.00    128.27   32.98    128.15   33.42 (+0.44)
Flower      CIF       512.00    513.62   27.21    512.03   27.32 (+0.11)
Mobile      CIF       512.00    512.44   26.37    512.12   26.68 (+0.31)
Paris       CIF      1024.00   1027.57   37.43   1023.79   37.83 (+0.40)
Bus         CIF      1024.00   1024.37   32.81   1024.16   32.97 (+0.16)
Football    SIF      1024.00   1023.80   29.25   1023.97   29.41 (+0.21)
Stefan      SIF      2048.00   2047.18   36.63   2048.28   36.94 (+0.31)
Table 4.1: Performance of the proposed two-pass algorithm for the I-P-P-P case
with various sequences. Bit rates are in Kbps and PSNR values in dB; the values
in parentheses are PSNR gains over JM8.1a.
                                  JM8.1a        Proposed Two-Pass
Sequence    Format    Target   Bitrate    PSNR   Bitrate    PSNR
News        QCIF       64.00     64.83   32.87     64.04   33.27 (+0.40)
Salesman    QCIF       64.00     64.00   32.95     64.02   33.39 (+0.44)
Foreman     QCIF       64.00     64.54   31.80     63.96   32.02 (+0.22)
Carphone    QCIF      128.00    127.19   37.62    127.95   38.22 (+0.60)
Coastguard  QCIF      128.00    128.06   31.23    128.03   31.81 (+0.58)
T.Tennis    QCIF      128.00    128.87   33.66    128.16   34.04 (+0.38)
Flower      CIF       512.00    518.22   27.49    512.12   28.25 (+0.76)
Mobile      CIF      1024.00   1021.02   30.72   1025.06   31.29 (+0.57)
Paris       CIF       512.00    517.25   33.55    511.94   34.17 (+0.62)
Bus         CIF      1024.00   1026.22   33.42   1023.99   33.89 (+0.47)
Football    SIF      2048.00   2040.78   33.23   2047.06   33.74 (+0.51)
Stefan      SIF      1024.00   1025.65   33.18   1025.59   33.89 (+0.71)
Table 4.2: Performance of the proposed two-pass algorithm for the I-B-B-P case
with various sequences. Bit rates are in Kbps and PSNR values in dB; the values
in parentheses are PSNR gains over JM8.1a.
For several test sequences, Figs. 4.1 to 4.4 compare the frame-by-frame PSNR
variations by the proposed two-pass algorithm with those by the JM8.1a rate control
algorithm in the I-P-P-P and the I-B-B-P cases. We can clearly see the significant
performance gain by the two-pass algorithm from these figures, especially in
the I-B-B-P case. The improved video quality certainly comes from the proposed
frame-layer bit allocation based on the Lagrange optimization framework. However,
it should be noted that the improved video quality cannot be achieved without
our accurate frame rate and distortion models. The high accuracy of our frame
rate and distortion models makes possible the effective pre-analysis of each frame as
well as the accurate bit rate control.
As specified in Sec. 4.2, the weight values in (4.8) and (5.13) are given as 1.5, 1.0
and 0.5 for I, P and B frames, respectively, in the experiments. While giving
higher weights to I and P frames can improve the average picture quality, it also
increases the quality variations between frames. That explains the quality variations
between frames in Figs. 4.1 to 4.4. This phenomenon can be reduced by
using different weight values for each frame type (e.g., the same weight to P and
B frames), but at the cost of decreased average quality. Finally, the bit allocation
algorithm among MB classes in Chapter 3 was not incorporated in the experiments.
The coding gain will be improved further by applying the bit allocation algorithm
among MB classes.
4.5 Conclusion
In this chapter, a two-pass frame-layer bit allocation algorithm based on the
Lagrange optimization technique was proposed for non-conversational video
applications. The proposed algorithm adopted the same idea as most two-pass
algorithms (i.e., frame analysis in the first pass and actual encoding in the
second pass). However, it was demonstrated by experimental results that our
two-stage encoding scheme and superior frame rate and distortion models made the
proposed algorithm more effective.
Two-pass algorithms require a pre-encoding pass. Thus, they are not suitable
for complexity-constrained applications. A simple one-pass algorithm is necessary
for such applications. The proposed two-pass algorithm serves as a fundamental
study for further simplification in the next chapter.
[Figure 4.1 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a and the
proposed two-pass algorithm; panels (a) News and (b) Carphone.]
Figure 4.1: The variations of PSNRs of (a) “News” (QCIF, 64 Kbps) and (b)
“Carphone” (QCIF, 128 Kbps) for the I-P-P-P case.
[Figure 4.2 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a and the
proposed two-pass algorithm; panels (a) Paris and (b) Football.]
Figure 4.2: The variations of PSNRs of (a) “Paris” (CIF, 1024 Kbps) and (b)
“Football” (SIF, 1024 Kbps) for the I-P-P-P case.
[Figure 4.3 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a and the
proposed two-pass algorithm; panels (a) Foreman and (b) Coastguard.]
Figure 4.3: The variations of PSNRs of (a) “Foreman” (QCIF, 64 Kbps) and (b)
“Coastguard” (QCIF, 128 Kbps) for the I-B-B-P case.
[Figure 4.4 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a and the
proposed two-pass algorithm; panels (a) Flower and (b) Stefan.]
Figure 4.4: The variations of PSNRs of (a) “Flower” (CIF, 512 Kbps) and (b)
“Stefan” (SIF, 1024 Kbps) for the I-B-B-P case.
Chapter 5
Frame-Layer Bit Allocation for H.264 Video via
GOP Rate Modeling
5.1 Introduction
In this chapter, under the same assumptions and problem formulation as in
Chapter 4, the frame-layer bit allocation problem is addressed by one-pass
algorithms via GOP rate modeling, which are more desirable for
complexity-constrained environments than the two-pass algorithm.
As shown in Chapter 4, in the Lagrange optimization framework, the Lagrange
multiplier determines the optimal target bit number of each frame. Motivated by
this fact, we propose a one-pass algorithm which estimates the Lagrange multiplier
using GOP rate models, instead of estimating the R-D information of future frames
directly. Two GOP rate models, namely the R-Q and R-λ models, are investigated.
In the R-Q model, the GOP rate is modeled as a function of the complexity and the
average quantization stepsize of a GOP. Similarly, in the R-λ model, the GOP rate
is modeled as a function of the complexity and the Lagrange multiplier of a GOP.
In both cases, the GOP complexity is computed from original frames loaded in a
look-ahead buffer memory. The proposed GOP rate models enable the one-pass
algorithm to employ the Lagrange optimization framework effectively while adapting
to varying GOP characteristics without a pre-encoding pass.
We also propose a simplified one-pass rate control algorithm. In this algorithm,
we determine the quantization parameter of each frame according to its frame type,
instead of using the Lagrange optimization framework. Therefore, the R-λ model is
not employed and only the R-Q model is required. The novelty of the simplified
algorithm is that it does not demand any frame rate and distortion model. Hence,
the rate control mechanism can be greatly simplified. The two-stage encoding
scheme is not necessary either. As a result, no coding gain is lost to the
discrepancy between $QP_1$ and $QP_2$, and the rate control mechanism is
simplified further.
This chapter is organized as follows. We first propose a one-pass frame-layer
bit allocation algorithm based on the Lagrange optimization framework in Sec. 5.2.
Then, a simplified algorithm is proposed in Sec. 5.3. Experimental results and
concluding remarks are provided in Secs. 5.4 and 5.5, respectively.
5.2 One-Pass Frame-Layer Bit Allocation
5.2.1 Research Motivation
[Figure 5.1 diagram: $QP_1$ level of each frame (frames 1, 2, 3, 4, ..., N-1, N)
for the two GOP structures.]
Figure 5.1: $QP_1$ of each frame for (a) the I-P-P-P case and (b) the I-B-B-P case
in the one-pass algorithm based on the Lagrange optimization framework.
To reduce the complexity of the two-pass algorithm proposed in the last chapter,
it is desirable to estimate the R-D information of the frames in a GOP without
involving the pre-encoding process in the first pass. One straightforward way is
to estimate the R-D information of the frames in the target GOP from those in
the previous GOP. However, this approach is not robust unless the input video is
stable. Generally speaking, it is not reliable to estimate the R-D information of a
frame without any information on that frame, for example, the motion-compensated
residual signal. For this reason, a novel one-pass frame-layer bit allocation
algorithm based on the Lagrange optimization framework is proposed via GOP rate
modeling in this section. Two GOP rate models, namely the R-Q model and the
R-λ model, are developed. Compared with the two-pass algorithm, the one-pass
algorithm in this section has the following main differences.
• The $QP_1$ value of the i-th frame, for all i = 1, ..., N, is determined by the
proposed GOP-based R-Q model.
The one-pass algorithm in this section also employs the two-stage encoding
scheme as well as the frame rate and distortion models. For conversational
applications, $QP_1$ of a frame was set to the average quantization parameter
of the previous frame. For non-conversational applications, however, frames of
the same type in a GOP have the same $QP_1$. Fig. 5.1 shows the $QP_1$ values
for the frames of a GOP in the proposed one-pass algorithm. Note that $QP_1$
of the i-th frame plays a role similar to that of $QP_{p,i}$ in the two-pass
algorithm. The frame-layer bit allocation process is performed by considering a
smaller range of quantization parameters (i.e., Δ = 3) to restrict the coding
gain loss due to the discrepancy between $QP_1$ and $QP_2$. As a consequence, a
proper value of $QP_1$ has to be chosen to improve the coding gain. In the
one-pass algorithm, given the bit budget for a GOP, $QP_1$ of all frames is
determined by the GOP-based R-Q model, which takes the complexity of the GOP
into account.
• Instead of estimating the R-D information of frames directly, a Lagrange
multiplier is estimated using the GOP-based R-λ model without an additional
encoding pass.
As shown in (4.8) and (4.9), the optimal number of bits assigned to each
frame depends on the bit budget and the Lagrange multiplier. Hence, if the
Lagrange multiplier that satisfies (4.9) is available, it is not necessary to
estimate the R-D information of all frames in advance. Therefore, it is
important to have a good estimate of the Lagrange multiplier. For this purpose,
we develop the GOP-based R-λ model, which also takes the complexity of the GOP
into account.
5.2.2 GOP Complexity Measure
The proposed GOP rate models in this chapter are functions of the GOP
complexity. We first propose a method to measure the complexity of a GOP. Since
the one-pass algorithm is not allowed to analyze frames by a pre-encoding
process, the complexity of a GOP should be computed from the original frames
loaded in a look-ahead buffer memory. Let S be the complexity of a GOP, which is
the sum of the complexities of the I, P and B frames in the GOP. That is,
$$S = S_I + S_P + S_B, \tag{5.1}$$
where $S_I$, $S_P$ and $S_B$ denote the complexities of I, P and B frames,
respectively. Let $f_i$ be the original i-th frame. Let $g_i$ denote its closest
forward reference frame when $f_i$ is a P frame. Likewise, when $f_i$ is a B
frame, let $g_i$ and $h_i$ denote the closest forward and backward reference
frames, respectively. For an I frame, $S_I$ is simply computed as the sum of
pixel values. For P frames, the complexity is computed as the sum of absolute
differences with respect to the closest forward reference frame. That is, $S_P$
is written as
$$S_P = \sum_{\forall f_i \in P} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| f_i(x,y) - g_i(x,y) \right|, \tag{5.2}$$
where W and H are the width and the height of a frame, respectively. For B
frames, the complexity is computed as the sum of the minimum absolute
differences with respect to the closest forward and backward reference frames.
That is, $S_B$ is written as
$$S_B = \sum_{\forall f_i \in B} \sum_{x=1}^{W} \sum_{y=1}^{H} \min\left( \left| f_i(x,y) - g_i(x,y) \right|, \left| f_i(x,y) - h_i(x,y) \right| \right). \tag{5.3}$$
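The complexity measure in (5.1) to (5.3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the encoder implementation: frames are plain nested lists of luma samples, the GOP is given in display order and starts with an I frame, $g_i$ is read as the nearest earlier I/P frame and $h_i$ as the nearest later I/P frame, and a B frame without a later reference in the list falls back to $g_i$ alone. The function names are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[int]]  # frame[y][x] is one luma sample


def frame_sum(f: Frame) -> int:
    """I-frame complexity: the sum of all pixel values."""
    return sum(sum(row) for row in f)


def sad(f: Frame, g: Frame) -> int:
    """Inner double sum of (5.2): sum of absolute differences of f and g."""
    return sum(abs(a - b) for rf, rg in zip(f, g) for a, b in zip(rf, rg))


def min_sad(f: Frame, g: Frame, h: Frame) -> int:
    """Inner double sum of (5.3): per-pixel minimum of |f - g| and |f - h|."""
    return sum(
        min(abs(a - b), abs(a - c))
        for rf, rg, rh in zip(f, g, h)
        for a, b, c in zip(rf, rg, rh)
    )


def gop_complexity(frames: Sequence[Frame], types: str) -> int:
    """Return S = S_I + S_P + S_B for one GOP given in display order.

    Assumptions of this sketch: types[0] is "I"; g is the nearest earlier
    I/P frame and h the nearest later I/P frame inside the same list; a B
    frame with no later reference in the list uses g for both terms.
    """
    refs = [i for i, t in enumerate(types) if t in "IP"]
    total = 0
    for i, t in enumerate(types):
        if t == "I":
            total += frame_sum(frames[i])
            continue
        g = frames[max(r for r in refs if r < i)]
        if t == "P":
            total += sad(frames[i], g)
        else:
            later = [r for r in refs if r > i]
            h = frames[later[0]] if later else g
            total += min_sad(frames[i], g, h)
    return total


i_frame = [[10, 10], [10, 10]]
b_frame = [[12, 10], [10, 13]]
p_frame = [[14, 10], [10, 13]]
print(gop_complexity([i_frame, b_frame, p_frame], "IBP"))  # 40 + 2 + 7 = 49
```

Because only original frames are compared, the measure needs no motion search and no encoding, which is what makes it usable from a look-ahead buffer.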
5.2.3 GOP Rate Modeling
In this subsection, we first derive GOP rate models from simple frame rate and
distortion models. After that, the GOP rate models are verified by experiments.
Generally speaking, the number of source bits is proportional to the complexity
of the motion-compensated residual signal (e.g., MAD) and to the reciprocal of
the quantization stepsize. Thus, although it might be inaccurate, the total
number of bits $R_i$ of the i-th frame can be represented by [27]
$$R_i = \alpha_i \cdot \frac{\mathrm{MAD}_i}{Q_i} + h_i, \tag{5.4}$$
where $\alpha_i$ is the model parameter and $h_i$ is the number of bits required
to encode header information. It is also known that the distortion is
proportional to the complexity of the residual signal and the quantization
stepsize. We can represent the distortion $D_i$ of the i-th frame as
$$D_i = \beta_i \cdot \mathrm{MAD}_i \cdot Q_i, \tag{5.5}$$
where $\beta_i$ is the model parameter.
Let $R_{GOP}$ be the GOP rate, which is the number of bits consumed by a GOP.
It can be approximated by the R-Q model as follows:
$$R_{GOP} = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \alpha_i \cdot \frac{\mathrm{MAD}_i}{Q_i} + \sum_{i=1}^{N} h_i \approx \eta_1 \cdot \frac{S}{Q} + \eta_2, \tag{5.6}$$
where $\eta_1$ and $\eta_2$ are the model parameters that make the approximation
valid and Q is the average quantization stepsize of a GOP (i.e.,
$Q = \frac{1}{N} \sum_{i=1}^{N} Q_i$).
The R-λ model is derived from the Lagrange cost as follows. For the target bit
budget $R_{T,GOP}$, the Lagrange cost J to be minimized can be expressed by
$$J = \sum_{i=1}^{N} \omega_i \cdot D_i + \lambda \cdot \left( \sum_{i=1}^{N} R_i - R_{T,GOP} \right). \tag{5.7}$$
Then, from (5.4) and (5.5), we easily get
$$J = \sum_{i=1}^{N} \omega_i \cdot \alpha_i \cdot \beta_i \cdot \frac{\mathrm{MAD}_i^2}{R_i - h_i} + \lambda \cdot \left( \sum_{i=1}^{N} R_i - R_{T,GOP} \right). \tag{5.8}$$
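By setting the partial derivative of (5.8) with respect to $R_i$ to 0, we have
$$R_i = \sqrt{\omega_i \cdot \alpha_i \cdot \beta_i} \cdot \frac{\mathrm{MAD}_i}{\sqrt{\lambda}} + h_i, \quad i = 1, \ldots, N. \tag{5.9}$$
Setting the partial derivative of (5.8) with respect to λ to 0 gives
$$R_{T,GOP} = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sqrt{\omega_i \cdot \alpha_i \cdot \beta_i} \cdot \frac{\mathrm{MAD}_i}{\sqrt{\lambda}} + \sum_{i=1}^{N} h_i. \tag{5.10}$$
We can express the GOP complexity S in (5.1) as
$$S = \frac{1}{\zeta_1} \sum_{i=1}^{N} \sqrt{\omega_i \cdot \alpha_i \cdot \beta_i} \cdot \mathrm{MAD}_i, \tag{5.11}$$
with the appropriate value of $\zeta_1$. Finally, by substituting
$\sum_{i=1}^{N} h_i$ with $\zeta_2$ and $R_{T,GOP}$ with $R_{GOP}$, we get from
(5.10) the following R-λ model:
$$R_{GOP} = \zeta_1 \cdot \frac{S}{\sqrt{\lambda}} + \zeta_2. \tag{5.12}$$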
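A small numerical sketch may help to see how (5.9) and (5.10) work together: (5.10) fixes $\sqrt{\lambda}$ from the budget, and (5.9) then distributes the bits. The function below is illustrative only; its name and all parameter values are hypothetical, and it assumes the simple frame models (5.4) and (5.5) hold exactly.

```python
import math
from typing import List, Sequence, Tuple


def allocate_frame_bits(
    weights: Sequence[float],
    alphas: Sequence[float],
    betas: Sequence[float],
    mads: Sequence[float],
    header_bits: Sequence[float],
    target_bits: float,
) -> Tuple[float, List[float]]:
    """Solve (5.10) for lambda, then evaluate (5.9) for every frame."""
    # k_i = sqrt(w_i * alpha_i * beta_i) * MAD_i, the numerator in (5.9)
    k = [math.sqrt(w * a * b) * m
         for w, a, b, m in zip(weights, alphas, betas, mads)]
    source_bits = target_bits - sum(header_bits)
    if source_bits <= 0:
        raise ValueError("target budget does not even cover the header bits")
    sqrt_lam = sum(k) / source_bits            # (5.10) rearranged
    rates = [ki / sqrt_lam + h for ki, h in zip(k, header_bits)]  # (5.9)
    return sqrt_lam ** 2, rates


# Hypothetical three-frame GOP (I, P, B) with weights 1.5, 1.0 and 0.5.
lam, rates = allocate_frame_bits(
    weights=[1.5, 1.0, 0.5],
    alphas=[2.0, 1.0, 1.0],
    betas=[0.5, 0.5, 0.5],
    mads=[8.0, 4.0, 3.0],
    header_bits=[200.0, 100.0, 80.0],
    target_bits=5000.0,
)
print(lam, rates, sum(rates))
```

By construction the allocated bits add up to the target budget, and each frame satisfies the stationarity condition obtained by differentiating (5.8).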
5.2.3.1 The I-P-P-P Case
The relationship of the GOP rate $R_{GOP}$ with the quantization stepsize Q and
the Lagrange multiplier λ is verified for the I-P-P-P case through the following
experiment. A GOP is encoded several times with different values of $QP_1$ and
λ. More specifically, we choose $QP_1 = 15 + 3 \times n$, where n = 0, 1, ..., 10,
for all frames in the GOP. Several values of λ are used for each $QP_1$ to
allocate bits optimally to all frames based on the Lagrange optimization
framework. To give an example with a particular choice of $QP_1$ and λ, suppose
that the i-th frame is the target frame. After the first stage of encoding using
$QP_1$, we estimate rates and distortions with the quantization parameters from
$QP_1 - \Delta$ to $QP_1 + \Delta$, where Δ = 3. Then, $q_i^*$ that minimizes the
Lagrange cost can be expressed by
$$q_i^* = \arg\min_{q_i \in [QP_1 - \Delta,\, QP_1 + \Delta]} \; \omega_i \cdot \hat{D}_i(Q(q_i)) + \lambda \cdot \hat{R}_i(Q(q_i)), \tag{5.13}$$
where $\hat{R}_i(Q)$ and $\hat{D}_i(Q)$ are the rate and distortion estimated by
the frame rate and distortion models in Chapter 3. As a result, the target bits
and the available source bits for the i-th frame are given by
$\hat{R}_i(Q(q_i^*))$ and $\hat{R}_{src,i}(Q(q_i^*))$. Finally, to meet the
target bits closely, the second-stage encoding process described in Chapter 3 is
performed with the available source bits.
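The search in (5.13) is a small exhaustive scan over at most 2Δ + 1 quantization parameters. The sketch below shows its shape only. It makes two assumptions that are not taken from this chapter: the simple models (5.4) and (5.5) stand in for the Chapter 3 frame rate and distortion models, and the mapping from quantization parameter to stepsize is approximated by the common H.264 rule of thumb that the stepsize is 1.0 at QP 4 and doubles every 6 steps. All names and values are hypothetical.

```python
def qstep(qp: int) -> float:
    """Approximate H.264 stepsize: 1.0 at QP 4, doubling every 6 QP steps."""
    return 2.0 ** ((qp - 4) / 6.0)


def best_qp(qp1: int, lam: float, weight: float, alpha: float, beta: float,
            mad: float, header_bits: float, delta: int = 3) -> int:
    """Return the QP in [qp1 - delta, qp1 + delta] with the lowest cost (5.13)."""
    lo, hi = max(0, qp1 - delta), min(51, qp1 + delta)  # valid H.264 QP range
    best_q, best_cost = lo, float("inf")
    for q in range(lo, hi + 1):
        step = qstep(q)
        rate = alpha * mad / step + header_bits   # stand-in for R-hat, as (5.4)
        dist = beta * mad * step                  # stand-in for D-hat, as (5.5)
        cost = weight * dist + lam * rate
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q


print(best_qp(28, 256.0, 1.0, 1.0, 1.0, 10.0, 0.0))  # 28: interior optimum
print(best_qp(28, 0.0, 1.0, 1.0, 1.0, 10.0, 0.0))    # 25: lower end of window
print(best_qp(28, 1e12, 1.0, 1.0, 1.0, 10.0, 0.0))   # 31: upper end of window
```

The last two calls show the saturation at the window edges: once λ is very small or very large, the chosen quantization parameter no longer changes.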
Fig. 5.2 plots $R_{GOP}$ with respect to S/Q for various types of sequences when
their first, third and fifth GOPs are encoded with all combinations of $QP_1$
and λ. We observe from Fig. 5.2 that $R_{GOP}$ and S/Q can be related by the R-Q
model in (5.6). Moreover, we can expect that the R-Q model can be simplified to
the following linear model with a model parameter η:
$$R_{GOP} = \eta \cdot \frac{S}{Q}. \tag{5.14}$$
We also observe that the variation of each model parameter between adjacent GOPs
is generally small because adjacent GOPs are likely to have similar
characteristics as long as they are in the same scene. This property is
desirable when the model parameters are approximated using the R-D information
from the previous GOPs.
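As a rough illustration of how η in (5.14) might be estimated from previous GOPs and then used, the sketch below fits a line through the origin by least squares and inverts the model for a target budget. The through-origin fit is an assumption of this sketch (the text only states that previous GOPs are used), and the sample values are invented.

```python
from typing import Sequence


def fit_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of y = slope * x (no intercept), as in (5.14)."""
    den = sum(x * x for x in xs)
    if den == 0:
        raise ValueError("need at least one non-zero sample")
    return sum(x * y for x, y in zip(xs, ys)) / den


def stepsize_for_budget(eta: float, complexity: float, target_bits: float) -> float:
    """Invert (5.14): the average stepsize Q that spends target_bits."""
    return eta * complexity / target_bits


# Hypothetical (S/Q, R_GOP) pairs observed on three previous GOPs.
s_over_q = [1.0e5, 2.0e5, 4.0e5]
gop_bits = [2.1e4, 3.9e4, 8.0e4]
eta = fit_slope(s_over_q, gop_bits)
print(eta, stepsize_for_budget(eta, 1.0e6, 5.0e4))
```

The same `fit_slope` helper would serve for ζ in an R-λ model of the same linear form, with $S/\sqrt{\lambda}$ in place of S/Q.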
It is generally more difficult to find accurate rate and distortion models for a
small coding unit. For example, estimating the rate and distortion of an MB tends
to yield a large estimation error since MBs in the same frame may still have a
wide range of R-D characteristics. Even a singular MB can degrade the accuracy of
the model significantly. Rate and distortion modeling of a frame is more accurate
since singular MBs in a frame are averaged with other MBs in the frame. This
explains why $R_{GOP}$ can be approximated by the simple model in (5.6) or (5.14)
even with the complexity computed from original frames. Even though (5.6) and
(5.14) may not represent the R-Q relationship of a single frame accurately, they
can represent that of a set of frames in a GOP accurately.
[Figure 5.2 plots: Rate versus S/Q for the 1st, 3rd and 5th GOPs; panels (a)
News, (b) Table Tennis, (c) Stefan and (d) Paris.]
Figure 5.2: The relationships between the GOP rate $R_{GOP}$ and S/Q of the first,
third and fifth GOPs for (a) the “News” (QCIF), (b) the “Table Tennis” (QCIF),
(c) the “Stefan” (SIF) and (d) the “Paris” (CIF) sequences in the I-P-P-P case.
[Figure 5.3 plots: Rate versus $S/\sqrt{\lambda}$ for the 1st, 3rd and 5th GOPs;
panels (a) News, (b) Table Tennis, (c) Stefan and (d) Paris.]
Figure 5.3: The relationships between the GOP rate $R_{GOP}$ and
$S/\sqrt{\lambda}$ of the first, third and fifth GOPs for (a) the “News” (QCIF),
(b) the “Table Tennis” (QCIF), (c) the “Stefan” (SIF) and (d) the “Paris” (CIF)
sequences in the I-P-P-P case.
For the first, third and fifth GOPs of the same sequences shown in Fig. 5.2,
Fig. 5.3 plots $R_{GOP}$ with respect to $S/\sqrt{\lambda}$ when the average
$QP_2$ is equal to $QP_1$. We see from Fig. 5.3 that, when the average $QP_2$
and $QP_1$ are the same, $R_{GOP}$ can be approximated by the R-λ model in
(5.12). Similar to the R-Q model, we can expect that the R-λ model can be
simplified to the following linear model with a model parameter ζ:
$$R_{GOP} = \zeta \cdot \frac{S}{\sqrt{\lambda}}. \tag{5.15}$$
For a given $QP_1$, if λ is larger than a threshold value, we will get a constant
rate since $QP_2 \le QP_1 + \Delta$: all frames and MBs are then quantized with
$QP_1 + \Delta$. Similarly, if λ is smaller than another threshold value, we will
have another constant rate since $QP_2 \ge QP_1 - \Delta$: all frames and MBs are
then quantized with $QP_1 - \Delta$. Except for such cases, $R_{GOP}$ can be
approximated by (5.12) or (5.15). We observe that this relationship holds in
general for the values of λ that yield an average $QP_2$ that is equal to $QP_1$
or less than $QP_1$ by 1.
From (5.6) and (5.12), it is easy to derive the following relationship between λ
and Q:
$$\lambda = c \cdot Q^2, \tag{5.16}$$
where c corresponds to $\zeta^2/\eta^2$. In [49], a similar quadratic
relationship between λ and Q was derived and applied to the RDO process in H.264.
Hence, our proposed R-Q and R-λ models can be viewed as a generalization of the
result in [49]. The difference is that c varies according to the R-D
characteristics of GOPs in (5.16), while c is constant in the RDO process. In the
proposed algorithm, we employ the R-Q model in (5.14) instead of (5.6) and the
R-λ model in (5.15) instead of (5.12) due to their simplicity.
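The step leading to (5.16) can be written out explicitly. The short derivation below uses the simplified linear forms (5.14) and (5.15), in which the constant terms $\eta_2$ and $\zeta_2$ are dropped; this is an assumption of the sketch, and with the full models (5.6) and (5.12) the quadratic relationship holds only approximately.

```latex
\eta \cdot \frac{S}{Q} \;=\; R_{GOP} \;=\; \zeta \cdot \frac{S}{\sqrt{\lambda}}
\quad\Longrightarrow\quad
\sqrt{\lambda} \;=\; \frac{\zeta}{\eta} \cdot Q
\quad\Longrightarrow\quad
\lambda \;=\; \frac{\zeta^{2}}{\eta^{2}} \cdot Q^{2} \;=\; c \cdot Q^{2}.
```

Since η and ζ are re-estimated per GOP, c follows the content, which is the difference from the fixed constant used in the RDO process.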
5.2.3.2 The I-B-B-P Case
We conduct the same experiment to determine the R-Q and R-λ models for the
I-B-B-P case. However, in this experiment, we choose $QP_1 = 15 + 3 \times n$,
where n = 0, 1, ..., 10, for I and P frames, and $QP_1$ of B frames is set to that
of I and P frames plus 2, as shown in Fig. 5.1. For the I-B-B-P case, Fig. 5.4
plots $R_{GOP}$ with respect to S/Q for various types of sequences when their
first, third and fifth GOPs are encoded with all combinations of $QP_1$ and λ.
For the same GOPs of the same sequences, when the average $QP_2$ and the average
$QP_1$ are the same, Fig. 5.5 shows $R_{GOP}$ as a function of
$S/\sqrt{\lambda}$. It is observed from both figures that the proposed R-Q and
R-λ models in (5.14) and (5.15) can be applied to represent $R_{GOP}$ in terms of
Q and λ, even for the I-B-B-P case.
[Figure 5.4 plots: Rate versus S/Q for the 1st, 3rd and 5th GOPs; panels (a)
Carphone, (b) Trevor, (c) Football and (d) Bus.]
Figure 5.4: The relationships between the GOP rate $R_{GOP}$ and S/Q of the first,
third and fifth GOPs for (a) the “Carphone” (QCIF), (b) the “Trevor” (QCIF), (c)
the “Football” (SIF) and (d) the “Bus” (CIF) sequences in the I-B-B-P case.
[Figure 5.5 plots: Rate versus $S/\sqrt{\lambda}$ for the 1st, 3rd and 5th GOPs;
panels (a) Carphone, (b) Trevor, (c) Football and (d) Bus.]
Figure 5.5: The relationships between the GOP rate $R_{GOP}$ and
$S/\sqrt{\lambda}$ of the first, third and fifth GOPs for (a) the “Carphone”
(QCIF), (b) the “Trevor” (QCIF), (c) the “Football” (SIF) and (d) the “Bus” (CIF)
sequences in the I-B-B-P case.
5.2.4 Summary of Algorithm
We have derived the GOP rate models from frame rate and distortion models and
verified them empirically. In this subsection, using the proposed GOP rate models,
we propose a one-pass (MB-layer) rate control algorithm with frame-layer bit
allocation based on the Lagrange optimization framework. The proposed algorithm
is summarized as follows.
Algorithm 1: One-Pass Rate Control with Frame-Layer Bit Allocation
1. Initial bit allocation to a GOP
Allocate the target bit budget $R_{T,GOP}$ to the current GOP using the same
method employed in the two-pass algorithm in Chapter 4.
2. Determine $QP_1$ for all frames in the GOP
Given the target bit budget $R_{T,GOP}$, estimate the average quantization
parameter $QP_{avg}$ by (5.14). In the I-P-P-P case, $QP_1$ of all frames is set
to $QP_{avg}$. In the I-B-B-P case, $QP_1$ for I and P frames is determined
first by
$$QP_1 = QP_{avg} - 2 \cdot \frac{N_B}{N},$$
with $N_B$ denoting the number of B frames in the GOP. Then, the $QP_1$ value of
B frames is set to that of I and P frames plus 2.
3. Determine a Lagrange multiplier
Given the target bit budget $R_{T,GOP}$, determine λ by (5.15).
4. Two-stage encoding with frame-layer bit allocation
Every frame in the GOP is encoded by the two-stage MB-layer rate control.
Suppose that the i-th frame is the target frame. It is encoded as follows.
(a) Perform the first stage of encoding using $QP_1$.
(b) Estimate rates and distortions for $QP \in [QP_1 - \Delta, QP_1 + \Delta]$,
where Δ = 3. Then, allocate the target bits by determining $q_i^*$ that
minimizes the Lagrange cost in (5.13).
(c) Perform the second stage of encoding with the available source bits.
(d) Update the frame model parameters and the encoder buffer occupancy.
The model parameters are updated by least square approximation (LSA)
using data from the previous 5 frames of the same type.
5. Update the GOP rate model parameters
The GOP rate model parameters in (5.14) and (5.15) are updated by LSA
using the information from the previous 5 GOPs. The R-Q model parameter in
(5.14) is updated whenever all frames in a GOP are encoded. However, the
R-λ model parameter in (5.15) is updated only when the difference between
the average $QP_2$ and the average $QP_1$ is less than or equal to 1.
6. Proceed to the next GOP until we reach the last GOP.
Note that the initial values of the GOP rate model parameters are required for
the first GOP. Therefore, the first GOP is encoded by the two-pass algorithm so
that we can estimate the model parameters for the following GOPs.
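Steps 2 and 3 of Algorithm 1 amount to inverting the two linear GOP models. The following sketch shows one way this could look. Several details are assumptions rather than part of the algorithm as stated: the stepsize-to-QP conversion uses the common H.264 approximation (stepsize 1.0 at QP 4, doubling every 6 steps), the result is rounded to an integer, and it is clamped so that the B-frame value stays within the valid QP range. The function names and the numbers in the example are hypothetical.

```python
import math
from typing import Tuple


def stepsize_to_qp(step: float) -> float:
    """Inverse of the approximate mapping Qstep = 2 ** ((QP - 4) / 6)."""
    return 4.0 + 6.0 * math.log2(step)


def plan_gop(eta: float, zeta: float, complexity: float, target_bits: float,
             n_frames: int, n_b_frames: int) -> Tuple[int, int, float]:
    """Return (QP1 for I/P frames, QP1 for B frames, lambda) for one GOP."""
    q_avg = eta * complexity / target_bits         # invert (5.14)
    qp_avg = stepsize_to_qp(q_avg)
    qp1_ip = qp_avg - 2.0 * n_b_frames / n_frames  # Step 2, I-B-B-P formula
    qp1_ip = min(49, max(0, round(qp1_ip)))        # keep the B value <= 51
    qp1_b = qp1_ip + 2                             # only used if B frames exist
    lam = (zeta * complexity / target_bits) ** 2   # invert (5.15)
    return qp1_ip, qp1_b, lam


# 15-frame I-B-B-P GOP with 10 B frames (hypothetical model parameters).
print(plan_gop(0.25, 2.0, 1.0e6, 15625.0, 15, 10))  # (27, 29, 16384.0)
# The same budget in the I-P-P-P case: every frame starts from QP_avg.
print(plan_gop(0.25, 2.0, 1.0e6, 15625.0, 15, 0)[0])  # 28
```

With `n_b_frames = 0` the correction term vanishes, which reproduces the I-P-P-P rule of setting $QP_1$ of all frames to $QP_{avg}$.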
While the R-D characteristics of the GOPs in the same scene are very similar,
those of the GOPs in different scenes tend to be quite different. In such cases,
the model parameters estimated from previous GOPs fail to represent the R-D
characteristics of the current one, and thus degrade the rate control
performance. For this reason, whenever a P frame is encoded in Step 4, the GOP
rate model parameters are updated using the information from the coded frames in
the current GOP. Let $S_c$, $Q_c$ and $\lambda_c$ be the complexity, the average
quantization stepsize and the average Lagrange multiplier of the coded frames in
the current GOP, and let $R_c$ be the number of bits consumed to encode them.
Then, the model parameter η in (5.14) is updated using $R_c$, $S_c$ and $Q_c$.
The model parameter ζ in (5.15) is updated as well using $R_c$, $S_c$ and
$\lambda_c$. For the uncoded frames in the current GOP, $QP_1$ and λ are
determined again by (5.14) and (5.15), respectively, using the updated model
parameters.
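The in-GOP refresh can be illustrated as follows. The text does not specify how the coded-frame statistics enter the update or which budget is used for the uncoded frames, so this sketch makes two assumptions: the parameters are simply re-estimated from the single point ($R_c$, $S_c$, $Q_c$, $\lambda_c$), and the remaining frames are re-planned against the remaining bits and the remaining complexity. The function names are hypothetical.

```python
import math
from typing import Tuple


def refresh_models(bits_used: float, complexity_coded: float,
                   avg_stepsize: float, avg_lambda: float) -> Tuple[float, float]:
    """Single-point estimates of eta in (5.14) and zeta in (5.15)."""
    eta = bits_used * avg_stepsize / complexity_coded
    zeta = bits_used * math.sqrt(avg_lambda) / complexity_coded
    return eta, zeta


def replan_remaining(eta: float, zeta: float, complexity_left: float,
                     bits_left: float) -> Tuple[float, float]:
    """Average stepsize and lambda for the frames not yet coded."""
    stepsize = eta * complexity_left / bits_left
    lam = (zeta * complexity_left / bits_left) ** 2
    return stepsize, lam


# Coded part of the GOP: R_c = 8000 bits, S_c = 4e5, Q_c = 16, lambda_c = 256.
eta, zeta = refresh_models(8000.0, 4.0e5, 16.0, 256.0)
print(eta, zeta)
print(replan_remaining(eta, zeta, 6.0e5, 12000.0))
```

In this example the remaining frames have the same bits-per-complexity ratio as the coded ones, so the re-planned stepsize and λ stay close to 16 and 256.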
5.3 Simplified One-Pass Rate Control
A simplified one-pass algorithm is proposed by exploiting the monotonicity
property [35]. This property says that better qualities of reference frames
(i.e., I and P frames) lead to better overall coding efficiency. Therefore, it is
reasonable to impose the following constraint to make the quality of reference
frames better than that of predictive non-reference frames [25]:
$$q_I \le q_P \le q_B, \tag{5.17}$$
where $q_I$, $q_P$ and $q_B$ are the quantization parameters for I, P and B
frames. We should choose $q_I$, $q_P$ and $q_B$ such that they meet the
constraint in (5.17) and the bit budget constraint of a GOP. This is the main
idea of the simplified algorithm.
The simplified one-pass algorithm first estimates the average quantization
parameter of a GOP using the GOP-based R-Q model. Then, $q_I$, $q_P$ and $q_B$
are determined such that their average over all frames is equal to the estimated
GOP quantization parameter under the constraint in (5.17). Finally, all frames in
a GOP are encoded using the quantization parameter that corresponds to their
frame type. More specifically, the simplified one-pass algorithm is stated below.
Algorithm 2: Simplified One-Pass Rate Control
1. Initial bit allocation to a GOP
Allocate the target bit budget $R_{T,GOP}$ to the current GOP using the same
method employed in the two-pass algorithm in Chapter 4.
2. Determine the quantization parameter for each type of frame
Given the target bit budget, estimate the average quantization parameter
$QP_{avg}$ by (5.14). Then, in the I-P-P-P case, $q_I$ is set to $QP_{avg} - 1$
and $q_P$ is set to $QP_{avg}$. In the I-B-B-P case, $q_P$ is determined by
$$q_P = QP_{avg} - 2 \cdot \frac{N_B}{N} + \frac{1}{N},$$
and $q_I$ is set to $q_P - 1$ and $q_B$ is set to $q_P + 2$.
3. Encode all frames in a GOP
All frames are encoded using $q_I$, $q_P$ and $q_B$ according to their frame
type. As explained in Sec. 5.2, the R-Q model parameter is updated whenever a
P frame is encoded using the information from coded frames in a GOP. For
the uncoded frames, $q_P$ and $q_B$ are determined again using the updated R-Q
model parameter.
4. Update the GOP rate model parameters
The R-Q model parameter in (5.14) is updated by LSA using the information
from the previous 5 GOPs.
5. Proceed to the next GOP until we reach the last GOP.
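Step 2 of Algorithm 2 can be checked with a few lines of code. The sketch below returns the unrounded values so that the averaging property is visible; rounding to integer quantization parameters is left out and would be an additional design choice. It assumes one I frame per GOP, and the function name is hypothetical.

```python
from typing import Optional, Tuple


def frame_type_qps(qp_avg: float, n_frames: int,
                   n_b_frames: int) -> Tuple[float, float, Optional[float]]:
    """Return (q_I, q_P, q_B) following Step 2; q_B is None without B frames."""
    if n_b_frames == 0:                      # I-P-P-P case
        return qp_avg - 1.0, qp_avg, None
    q_p = qp_avg - 2.0 * n_b_frames / n_frames + 1.0 / n_frames
    return q_p - 1.0, q_p, q_p + 2.0         # I-B-B-P case


# 15-frame GOP with 1 I frame, 4 P frames and 10 B frames.
q_i, q_p, q_b = frame_type_qps(28.0, 15, 10)
average = (1 * q_i + 4 * q_p + 10 * q_b) / 15
print(q_i, q_p, q_b, average)
```

In the I-B-B-P case the frame-count-weighted average of the three values equals $QP_{avg}$, and the ordering required by (5.17) holds by construction.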
The simplified one-pass algorithm does not require the R-λ model since the
Lagrange optimization framework is not employed. Moreover, the frame rate and
distortion models are not necessary. Because it demands only the R-Q model in
(5.14), the rate control process is simplified significantly. It is worthwhile to
note that the two-stage encoding scheme is not employed either in the simplified
one-pass algorithm. Therefore, no coding gain is lost to the discrepancy between
$QP_1$ and $QP_2$, and the encoder is simplified further.
5.4 Experimental Results
In this section, the proposed one-pass algorithms are evaluated with the H.264
main-profile encoder. Two experiments are performed for the I-P-P-P and the I-B-
B-P cases, respectively, under the same experimental environments in Chapter 4.
In both cases, the same sequences are encoded at the same bit rates.
                     Algorithm No. 1            Algorithm No. 2
Sequence    Format   Bitrate   PSNR             Bitrate   PSNR
News        QCIF       63.85   32.64 (+0.72)      63.89   32.59 (+0.67)
Salesman    QCIF       64.21   32.35 (+0.49)      63.92   32.37 (+0.51)
Foreman     QCIF       64.22   31.24 (+0.34)      64.55   31.36 (+0.46)
Carphone    QCIF      127.76   37.09 (+0.22)     128.18   37.13 (+0.26)
Coastguard  QCIF      127.85   30.65 (+0.19)     127.48   30.67 (+0.21)
T.Tennis    QCIF      128.26   33.43 (+0.45)     127.31   33.36 (+0.38)
Flower      CIF       511.23   27.30 (+0.09)     512.76   27.28 (+0.07)
Mobile      CIF       511.61   26.62 (+0.25)     512.87   26.57 (+0.20)
Paris       CIF      1025.46   37.74 (+0.31)    1025.62   37.79 (+0.36)
Bus         CIF      1024.80   32.98 (+0.17)    1025.19   33.03 (+0.22)
Football    SIF      1024.69   29.45 (+0.20)    1024.77   29.42 (+0.17)
Stefan      SIF      2049.92   36.84 (+0.21)    2049.39   36.88 (+0.25)
Table 5.1: Performance of the proposed one-pass algorithms for the I-P-P-P case
with the same sequences and at the same bit rates as in Table 4.1. Bit rates are
in Kbps and PSNRs in dB; the PSNR gain over JM8.1a is given in parentheses.
For the I-P-P-P case, Table 5.1 shows the final bit rates and the average PSNRs
of the proposed one-pass algorithms. The coding gains in this table are with
respect to the JM8.1a rate control algorithm in Table 4.1. We see that both
one-pass algorithms achieve a higher coding gain than the JM8.1a rate control.
The one-pass algorithm based on the Lagrange optimization framework (Algorithm
No. 1) achieves a maximum coding gain of 0.72 dB for the QCIF “News” sequence and
an average coding gain of 0.30 dB. With the simplified one-pass algorithm
(Algorithm No. 2), the maximum coding gain is 0.67 dB for the “News” sequence and
the average coding gain is 0.31 dB. As to target bit rate control, the proposed
one-pass algorithms show performance similar to that of the JM8.1a rate control
algorithm.
Table 5.2 shows the actual bit rates and the average PSNRs of the proposed
one-pass algorithms for the I-B-B-P case. Compared with the JM8.1a rate control
in Table 4.2, “Algorithm No. 1” and “Algorithm No. 2” achieve maximum coding
gains of 0.81 dB and 0.68 dB, respectively, for the CIF “Flower” sequence. They
also achieve average coding gains of 0.49 dB and 0.42 dB, respectively. As to
target bit rate control, observations similar to those in the I-P-P-P case can be
made in the I-B-B-P case. As with the two-pass algorithm, we can see from Tables
5.1 and 5.2 that the coding gains of the proposed one-pass algorithms are larger
in the I-B-B-P case. Specifically, in the I-B-B-P case, “Algorithm No. 1” and
“Algorithm No. 2” improve the average coding gains by 0.62 dB and 0.52 dB,
respectively, for the CIF and the SIF sequences.

                     Algorithm No. 1            Algorithm No. 2
Sequence    Format   Bitrate   PSNR             Bitrate   PSNR
News        QCIF       64.38   33.28 (+0.41)      63.74   33.24 (+0.37)
Salesman    QCIF       63.87   33.16 (+0.21)      63.97   33.15 (+0.20)
Foreman     QCIF       64.00   32.05 (+0.25)      63.85   32.02 (+0.22)
Carphone    QCIF      127.44   38.13 (+0.51)     127.12   38.03 (+0.41)
Coastguard  QCIF      127.48   31.77 (+0.54)     127.04   31.75 (+0.52)
T.Tennis    QCIF      127.56   33.93 (+0.27)     127.61   33.90 (+0.24)
Flower      CIF       510.29   28.30 (+0.81)     510.98   28.17 (+0.68)
Mobile      CIF      1022.72   31.33 (+0.61)    1022.77   31.18 (+0.46)
Paris       CIF       510.94   34.24 (+0.69)     511.84   34.10 (+0.55)
Bus         CIF      1026.47   33.92 (+0.50)    1022.87   33.86 (+0.44)
Football    SIF      2048.22   33.76 (+0.53)    2048.69   33.67 (+0.45)
Stefan      SIF      1023.04   33.78 (+0.60)    1022.50   33.71 (+0.53)
Table 5.2: Performance of the proposed one-pass algorithms for the I-B-B-P case
with the same sequences and at the same bit rates as in Table 4.2. Bit rates are
in Kbps and PSNRs in dB; the PSNR gain over JM8.1a is given in parentheses.
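The averages quoted for the I-B-B-P case can be reproduced directly from the per-sequence gains listed in Table 5.2. The short script below does only that arithmetic; the list order follows the table rows, with the last six entries being the CIF and SIF sequences.

```python
from typing import Sequence

# PSNR gains (dB) over JM8.1a from Table 5.2, in table row order.
gains_alg1 = [0.41, 0.21, 0.25, 0.51, 0.54, 0.27,   # QCIF sequences
              0.81, 0.61, 0.69, 0.50, 0.53, 0.60]   # CIF and SIF sequences
gains_alg2 = [0.37, 0.20, 0.22, 0.41, 0.52, 0.24,
              0.68, 0.46, 0.55, 0.44, 0.45, 0.53]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


print(round(mean(gains_alg1), 2), round(mean(gains_alg2), 2))          # 0.49 0.42
print(round(mean(gains_alg1[6:]), 2), round(mean(gains_alg2[6:]), 2))  # 0.62 0.52
print(max(gains_alg1), max(gains_alg2))                                # 0.81 0.68
```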
For several test sequences, Figs. 5.6 to 5.9 compare the frame-by-frame PSNR
variations of the proposed one-pass algorithms with those of the JM8.1a rate
control in the I-P-P-P and the I-B-B-P cases. To keep the figures readable, the
PSNR variations of the two-pass algorithm are not repeated here; they can be
found in Figs. 4.1 to 4.4. Comparing the PSNR variations of “Algorithm No. 1”
with those of the two-pass algorithm, we can observe that they are very similar
for the same sequence. Only one or two GOPs show a large difference between
PSNRs due to the estimation error in the GOP-based R-Q and R-λ models. That is,
the proposed GOP rate models enable “Algorithm No. 1” to adapt effectively to
varying GOP characteristics without the additional encoding pass required by the
two-pass algorithm. If we compare the PSNR variations of “Algorithm No. 1” and
“Algorithm No. 2”, we can observe that they are very similar as well for the
same sequence. In other words, by exploiting the monotonicity property, the
simplified method effectively replaces the Lagrange optimization framework
without significant coding gain loss.
5.5 Conclusion
In this chapter, the one-pass frame-layer bit allocation problem was studied for
non-conversational video applications. First, a one-pass algorithm based on the
Lagrange optimization framework was proposed via GOP rate modeling. It was shown
by the experiments that the GOP-based R-Q and R-λ models make it possible
to adopt the Lagrange optimization framework successfully without a pre-encoding
pass. Second, a simplified one-pass algorithm was proposed based on the
monotonicity property. From the experiments, we observed that the rate control
process can be performed successfully even without any frame rate and distortion
model.
[Figure 5.6 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a,
Algorithm 1 and Algorithm 2; panels (a) News and (b) Carphone.]
Figure 5.6: The variations of PSNRs of (a) “News” (QCIF, 64 Kbps) and (b)
“Carphone” (QCIF, 128 Kbps) for the I-P-P-P case.
[Figure 5.7 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a,
Algorithm 1 and Algorithm 2; panels (a) Paris and (b) Football.]
Figure 5.7: The variations of PSNRs of (a) “Paris” (CIF, 1024 Kbps) and (b)
“Football” (SIF, 1024 Kbps) for the I-P-P-P case.
[Figure 5.8 plots: PSNR (dB) versus frame number (0 to 150) for JM8.1a,
Algorithm 1 and Algorithm 2; panels (a) Foreman and (b) Coastguard.]
Figure 5.8: The variations of PSNRs of (a) “Foreman” (QCIF, 64 Kbps) and (b)
“Coastguard” (QCIF, 128 Kbps) for the I-B-B-P case.
Figure 5.9: The variations of PSNRs of (a) “Flower” (CIF, 512 Kbps) and (b)
“Stefan” (SIF, 1024 Kbps) for the I-B-B-P case.
[Plots: PSNR (dB) versus frame number for JM81a, Algorithm 1 and Algorithm 2.]
Chapter 6
Rate Control for H.264 Video with Adaptive
GOP Structure
6.1 Introduction
We propose a rate control algorithm with adaptive GOP structure for the H.264
video in this chapter. Even though a fixed GOP structure over a whole sequence is
easy to implement, it prevents an encoder from adapting to the spatial and temporal
characteristics of frames. For example, higher video quality can be obtained by
placing more B frames where there is little motion and more P frames where there
is a large amount of motion. In this chapter, we present a GOP structure decision
algorithm that adapts to the characteristics of scenes.
For each frame, its frame type may change depending on the target bits for it
and the target bits also may change according to its frame type. That is, GOP
structure decision should be performed jointly with frame-layer bit allocation since
they depend on each other. In Chapter 5, we proposed the simplified one-pass rate
control algorithm with fixed GOP structure, where the target bits to each frame
are determined automatically by the quantization parameter. In this chapter, on
top of the simplified rate control algorithm, we show how GOP structures can be
determined jointly with frame-layer bit allocation scheme.
The structure of a GOP is characterized by the size N and the distance between
P frames M. The proposed algorithm consists of two steps accordingly. First, we
identify an I frame based on the mean absolute difference (MAD) between frames.
If a frame has a larger MAD with respect to a previous frame than a threshold,
it is encoded as an I frame. Second, the final GOP structure is determined by
identifying P frames jointly with frame-layer bit allocation. The inter-dependency
of bit allocation and GOP structure decision is resolved by the simplified rate
control scheme and the GOP rate and distortion modeling. The GOP rate (i.e.,
R-Q) and distortion (i.e., D-Q) models, which are invariant to the change of GOP
structures, are investigated.
Finally, we also propose a GOP-layer bit allocation algorithm using the GOP
rate and distortion models. Most previous work considers bit allocation among
small coding units such as frames and macroblocks. However, it is observed that,
by optimally allocating bits among a set of several GOPs, we can achieve not only
higher average quality but also more consistent visual quality. Frame-layer bit
allocation and GOP structure decision can be employed along with GOP-layer bit
allocation for further quality improvement. In this chapter, we show the proposed
R-Q and D-Q models can be used successfully for GOP-layer bit allocation through
two-pass encoding.
This chapter is organized as follows. The proposed rate control algorithm with
adaptive GOP structure is presented in Sec. 6.2. The two-pass GOP-layer bit
allocation algorithm is proposed in Sec. 6.3 and experimental results are provided
in Sec. 6.4. Finally, concluding remarks are given in Sec. 6.5.
Figure 6.1: Flowchart of rate control with adaptive GOP structure.
[Flowchart: Start; read and down-sample a frame; compute a MAD; if the frame is
not an I frame, read the next frame; otherwise perform the P-frame decision
(joint GOP structure decision and frame-layer bit allocation) and encode all
frames in a GOP; repeat until the end of the sequence; End.]
6.2 Rate Control with Adaptive GOP
6.2.1 Overview of Algorithm
Fig. 6.1 shows the overall flow of the proposed rate control algorithm with adaptive
GOP structure. A down-sampled frame is generated first from an input original
frame after low-pass filtering by the average filter and down-sampling by 2 in both
horizontal and vertical directions. After that, it is stored in a look-ahead buffer.
The down-sampled frame is used to reduce the computational complexity in the
subsequent I-frame and P-frame decision methods. For I-frame decision, the MAD
value between two consecutive down-sampled frames is computed and compared
with a threshold. If a frame is not an I frame, the next frame is stored in the
look-ahead buffer after low-pass filtering and down-sampling, and is examined in
the same way. If a frame is an I frame, given the target bit budget for a GOP,
the final GOP structure is determined by identifying P frames. Finally, all frames
in a GOP are encoded using the quantization stepsizes determined according to
frame types. The above procedure is continued until there is no frame to encode.
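The down-sampling step described above can be sketched as follows. This is only
an illustration: the text states that an average filter is applied before
down-sampling by 2, but not the filter size, so the 2×2 averaging window and the
function name `downsample_by_2` are assumptions.

```python
def downsample_by_2(frame):
    """Average each 2x2 block of a luma frame (a list of rows) into one pixel.

    Assumption: the average filter is a 2x2 box filter; the source does not
    state the filter size.
    """
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (frame[y][x] + frame[y][x + 1]
                     + frame[y + 1][x] + frame[y + 1][x + 1])
            row.append(total / 4.0)
        out.append(row)
    return out


# Hypothetical 4x4 frame, reduced to 2x2.
frame = [[10, 20, 30, 40],
         [10, 20, 30, 40],
         [50, 60, 70, 80],
         [50, 60, 70, 80]]
small = downsample_by_2(frame)
```

Halving both dimensions reduces the number of pixels, and hence the cost of the
later motion search, by roughly a factor of four.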
The proposed algorithm can be performed in one pass or two passes depending
on the bit allocation scheme for GOPs. If we allocate the target bit budget to
each GOP at a constant per-frame rate, the proposed algorithm is performed in one
pass. On the other hand, if we employ an optimal GOP-layer bit allocation scheme,
the proposed algorithm is performed in two passes. In this case, a set of several
GOPs is pre-encoded in the first pass to analyze their R-D characteristics. This scheme will be
useful for off-line encoders where a longer encoding delay is allowed. We present in
detail the optimal GOP-layer bit allocation algorithm in Sec. 6.3.
6.2.2 I-Frame Decision
To decide the GOP structure, the position of an I frame should be identified
first. In general, two different scenes are likely to have different characteristics
and thus two consecutive frames across a scene cut have little similarity. Since
motion-compensated prediction is not used in I frames, it is reasonable to encode
a frame just after a scene cut as an I frame. It is also preferred for the ease of
editing. We propose an I-frame decision method by a scene cut detection based on
the MAD value between two down-sampled frames.
Figure 6.2: The variation of MAD between down-sampled frames in the QCIF
sequences, (a) “Table Tennis” and (b) “Trevor”.
[Plots: $M_{n,n-1}$ versus frame number.]
Let $f_{2,n}$ be the $n$-th original frame down-sampled by 2 and let $M_{n,n-1}$
be the MAD value between $f_{2,n}$ and $f_{2,n-1}$. To compute $M_{n,n-1}$, we
first perform motion estimation for all 8×8 blocks in $f_{2,n}$ with respect to
$f_{2,n-1}$ within an 8×8 search range. After motion estimation, denote by
$f^d_{2,n-1}(x,y)$ the pixel value in $f_{2,n-1}$ to which the $(x,y)$-th pixel in
$f_{2,n}$ (i.e., $f_{2,n}(x,y)$) maps. Then, $M_{n,n-1}$ is computed as

$$M_{n,n-1} = \frac{1}{W_2 \cdot H_2} \cdot \sum_{x=1}^{W_2} \sum_{y=1}^{H_2} \left| f_{2,n}(x,y) - f^d_{2,n-1}(x,y) \right| \tag{6.1}$$

where $W_2$ and $H_2$ are the width and height of the down-sampled frames, respectively.
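A minimal sketch of the computation in (6.1) is given below, assuming a
full-search block matcher. The function names are hypothetical, and reading the
“8×8 search range” as displacements of up to ±8 pixels is an assumption; the
`search` parameter can be changed to match another interpretation. Frame
dimensions are assumed to be multiples of the block size.

```python
def block_sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and its displaced match."""
    sad = 0
    for y in range(by, by + size):
        for x in range(bx, bx + size):
            sad += abs(cur[y][x] - ref[y + dy][x + dx])
    return sad


def motion_compensated_mad(cur, ref, block=8, search=8):
    """Approximate M_{n,n-1} of Eq. (6.1) with full-search block matching.

    cur, ref -- down-sampled frames as lists of rows of pixel values
    search   -- maximum displacement in pixels (assumed interpretation)
    """
    height, width = len(cur), len(cur[0])
    total = 0
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip displacements that fall outside the reference frame.
                    if not (0 <= by + dy and by + dy + block <= height):
                        continue
                    if not (0 <= bx + dx and bx + dx + block <= width):
                        continue
                    sad = block_sad(cur, ref, bx, by, dx, dy, block)
                    if best is None or sad < best:
                        best = sad
            total += best
    return total / float(width * height)


# Identical frames give a MAD of zero; two flat frames that differ by 20
# everywhere give a MAD of 20 regardless of the motion search.
ref = [[(x * 7 + y * 13) % 256 for x in range(16)] for y in range(16)]
same = motion_compensated_mad(ref, ref)
flat = motion_compensated_mad([[10] * 16 for _ in range(16)],
                              [[30] * 16 for _ in range(16)])
```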
Fig. 6.2 shows the frame-by-frame variation of $M_{n,n-1}$ in the QCIF “Table
Tennis” and “Trevor” sequences. We can see from this figure that $M_{n,n-1}$ is
very close to 0 between two frames within the same scene. Even when there is high
activity in a scene, $M_{n,n-1}$ is usually small since it is a motion-compensated
value. However, $M_{n,n-1}$ between the frames across a scene cut is much higher
than that between the frames within the same scene. Therefore, in the proposed
method, the $n$-th frame is determined as an I frame if $M_{n,n-1}$ is higher than
a threshold $TH_M$, which is simply set to 15.

Figure 6.3: The GOP structure in display order.
[Diagram: frames $I_1$, $P_0$, $P_1$, $P_2$, ..., $P_{N_p-1}$, $P_{N_p}$ and $I_2$,
with the GOP size $N$ marked.]
As the GOP size increases, the encoding delay as well as the computational
complexity for P-frame decision also increase. To reduce the encoding delay and the
computational complexity, we force the distance between two I frames to be less
than the maximum distance $N_{max}$. That is, if the distance between the $n$-th
frame and the previous I frame is larger than $N_{max}$, the $n$-th frame is
determined as an I frame even though there is no scene cut. On the other hand, if
there are too many I frames, coding efficiency decreases significantly. For this
reason, we force the distance between I frames to be larger than the minimum
distance $N_{min}$. In the proposed method, $N_{max}$ and $N_{min}$
are set to 30 and 10, respectively.
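The I-frame decision rules of this section can be summarized in a short sketch.
The constants come from the text; the function names, the treatment of frame 0
as an I frame, and the exact handling of the boundary cases (distance equal to
$N_{min}$ or $N_{max}$) are assumptions.

```python
TH_M = 15    # MAD threshold from the text
N_MIN = 10   # minimum distance between I frames from the text
N_MAX = 30   # maximum distance between I frames from the text


def is_i_frame(mad, distance):
    """Decide whether frame n should be coded as an I frame.

    mad      -- M_{n,n-1} of the current frame
    distance -- number of frames since the previous I frame
    """
    if distance > N_MAX:
        return True       # force an I frame even without a scene cut
    if distance < N_MIN:
        return False      # too close to the previous I frame (boundary assumed)
    return mad > TH_M     # scene-cut test


def find_i_frames(mads):
    """Return the indices of I frames; frame 0 is assumed to be an I frame."""
    i_frames = [0]
    for n in range(1, len(mads)):
        if is_i_frame(mads[n], n - i_frames[-1]):
            i_frames.append(n)
    return i_frames


# Hypothetical MAD trace: a spike at frame 5 is ignored (too close to the
# previous I frame), a spike at frame 20 starts a new GOP, and frame 51 is
# forced to be an I frame by the maximum-distance rule.
mads = [0.0] * 80
mads[5] = 20.0
mads[20] = 22.0
i_frames = find_i_frames(mads)
```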
6.2.3 P-Frame Decision
Fig. 6.3 shows an exemplary GOP structure in display order. In this figure, $I_1$,
$I_2$ and $P_0$ represent the previous I frame, the newly identified I frame and
the last encoded P frame of the previous GOP, respectively. Suppose that
$P_1, \ldots, P_{N_p}$ are P frames of the current GOP, which will be identified by
a P-frame decision method. Then, the GOP size $N$ is the distance between $P_0$ and
$P_{N_p}$ in display order. We should note that B frames between $P_{N_p}$ and
$I_2$ are parts of the next GOP. In other words, if we do not know the position of
$P_{N_p}$, we do not know the GOP size $N$. In that case, we cannot allocate the
target bit budget properly to a GOP and thus cannot perform frame-layer bit
allocation effectively. For this reason, we determine in advance the position of
$P_{N_p}$ such that there are two B frames between $P_{N_p}$ and $I_2$. The
positions of the other P frames (i.e., $P_1, \ldots, P_{N_p-1}$) are determined by
the following P-frame decision method.
6.2.3.1 Problem Formulation
Let $R_i(t_i, Q(q_i))$ and $D_i(t_i, Q(q_i))$ be the rate and distortion of the
$i$-th frame with respect to its frame type $t_i$ and the quantization stepsize
$Q(q_i)$ which corresponds to the quantization parameter $q_i$. Assuming that
frames are independent of each other, we can formulate the joint GOP structure
decision and frame-layer bit allocation problem as

$$\text{minimize} \sum_{i=1}^{N} D_i(t_i, Q(q_i)) \quad \text{subject to} \quad \sum_{i=1}^{N} R_i(t_i, Q(q_i)) \le R_{T,GOP}, \tag{6.2}$$

where $R_{T,GOP}$ is the target bit budget for a GOP. The objective is to jointly
decide the frame types and the quantization stepsizes such that the average
distortion of a GOP is minimized under the bit-budget constraint.
The above problem suffers from the inter-dependency between GOP structure
decision and frame-layer bit allocation, which makes it excessively expensive to
solve. For this reason, we need to simplify the problem. In Chapter 5, we showed
that the simplified rate control algorithm works very well with a fixed GOP
structure. We extend the same idea to the adaptive GOP structure. In other words,
we determine the quantization parameter for each frame according to its frame
type. Since the target bits of a frame are determined by its quantization
parameter, frame-layer bit allocation is performed at the same time as the GOP
structure is decided. Let $q_I$, $q_P$ and $q_B$ be the quantization parameters
for I, P and B frames, respectively. In this work, as in the simplified algorithm
of Chapter 5, $q_I$ is set to $q_P - 1$ and $q_B$ is set to $q_P + 2$.
6.2.3.2 Enhanced GOP Rate and Distortion Models
In addition, we propose enhanced GOP rate and distortion models, which are in-
variant to the change of GOP structures. Without involving multiple encoding
passes, the joint decision problem in (6.2) can be solved effectively using the pro-
posed GOP rate and distortion models along with the frame-layer bit allocation
scheme based on frame types.
GOP Complexity Measure
In Chapter 5, it was shown that the GOP complexity S, which is simply computed
from original frames, can be employed successfully to model the GOP rate for fixed
GOP structure. However, such a simple measure is not sufficient to represent rates
and distortions of varying GOP structures. For joint GOP structure decision and
frame-layer bit allocation, the GOP complexity measure should be able to represent
the change of rate and distortion as exactly as possible whenever the GOP structure
changes. The GOP complexity $S$ is the sum of the complexities of I, P and B frames
in a GOP (i.e., $S = S_I + S_P + S_B$). Based on motion-compensated prediction, we
propose a new complexity measure for each type of frame so that the rate and the
distortion of a GOP can be estimated robustly regardless of the internal variation
of a GOP structure.
When the $i$-th frame $f_i$ is an I frame, $S_I$ is computed from its down-sampled
frame $f_{2,i}$ as follows. First, for all 4×4 blocks in $f_{2,i}$, we perform
intra DC prediction, i.e., all pixels in a 4×4 block are estimated by their
average value. After DC prediction, let $f^d_{2,i}(x,y)$ be the intra predicted
pixel value of $f_{2,i}(x,y)$. Then, $S_I$ is defined as the sum of absolute
differences between them, i.e.,

$$S_I = \sum_{x=1}^{W_2} \sum_{y=1}^{H_2} \left| f_{2,i}(x,y) - f^d_{2,i}(x,y) \right|. \tag{6.3}$$

When $f_i$ is an inter (P or B) frame, let $g_i$ and $h_i$ denote its closest
forward and backward reference frames. The complexity of $f_i$ is computed from the
down-sampled frame using motion-compensated prediction. We first perform motion
estimation for all 8×8 blocks in $f_{2,i}$ with respect to $g_{2,i}$ and $h_{2,i}$
within an 8×8 search range. Let $g^d_{2,i}(x,y)$ be the pixel value to which
$f_{2,i}(x,y)$ maps by the forward motion vector. Likewise, let $h^d_{2,i}(x,y)$ be
the pixel value to which $f_{2,i}(x,y)$ maps by the backward motion vector. Then,
$S_P$ is measured for all P frames in a GOP as follows:

$$S_P = \sum_{\forall f_i \in P} \sum_{x=1}^{W_2} \sum_{y=1}^{H_2} \left| f_{2,i}(x,y) - g^d_{2,i}(x,y) \right|, \tag{6.4}$$

and $S_B$ is measured for all B frames in a GOP as follows:

$$S_B = \sum_{\forall f_i \in B} \sum_{x=1}^{W_2} \sum_{y=1}^{H_2} \min\left( \left| f_{2,i}(x,y) - g^d_{2,i}(x,y) \right|, \left| f_{2,i}(x,y) - h^d_{2,i}(x,y) \right| \right). \tag{6.5}$$
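The per-frame terms of (6.3) to (6.5) can be sketched as below. The function
names are hypothetical, and the motion-compensated predictions `forward_pred`
and `backward_pred` are assumed to have been produced already by the block
matching step; frame dimensions are assumed to be multiples of the block size.

```python
def intra_complexity(frame, block=4):
    """S_I of Eq. (6.3): SAD between pixels and the mean of their 4x4 block."""
    height, width = len(frame), len(frame[0])
    total = 0.0
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            pixels = [frame[y][x]
                      for y in range(by, by + block)
                      for x in range(bx, bx + block)]
            dc = sum(pixels) / float(len(pixels))   # intra DC prediction
            total += sum(abs(p - dc) for p in pixels)
    return total


def inter_complexity(frame, forward_pred, backward_pred=None):
    """One frame's contribution to S_P (Eq. 6.4) or S_B (Eq. 6.5).

    With only a forward prediction this is the P-frame term; with both
    predictions the per-pixel minimum error is used, as for B frames.
    """
    total = 0.0
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            err = abs(pixel - forward_pred[y][x])
            if backward_pred is not None:
                err = min(err, abs(pixel - backward_pred[y][x]))
            total += err
    return total


# Hypothetical data: one 4x4 block whose mean is 4, and a 1x2 inter frame.
s_i = intra_complexity([[0, 0, 8, 8]] * 4)
s_p = inter_complexity([[10, 20]], [[12, 25]])
s_b = inter_complexity([[10, 20]], [[12, 25]], [[9, 40]])
```

The GOP complexity $S$ would then be obtained by summing these terms over the
I, P and B frames of a candidate GOP structure.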
It may require high computational complexity to measure the GOP complexity
due to motion estimation. Especially when the GOP complexity is measured for all
possible GOP structures, the computational complexity can increase significantly.
However, since we perform motion estimation using down-sampled frames, we can
reduce the computational complexity. Moreover, since we consider a small number
of candidate GOP structures, the cost of measuring the GOP complexities of the
candidates can be kept at an affordable level.
GOP Rate Model
The proposed GOP rate and distortion models are derived from the following
experiments. Different numbers of frames are grouped into a GOP and encoded
into different GOP structures with different values of $q_I$, $q_P$ and $q_B$. To
be more specific, 15 frames ($N = 15$) or 30 frames ($N = 30$) are grouped into a
GOP. The distance between reference frames $M$ is set to 2, 3, 4 and 5 such that
there are eight different GOP structures. To give an example, suppose that a set
of frames is encoded into a particular GOP structure (e.g., $N = 15$ and $M = 3$).
We encode this particular GOP several times using $q_P$ from 15 to 45, with
$q_I = q_P - 1$ and $q_B = q_P + 2$ for each $q_P$.
Fig. 6.4 shows the normalized $R_{GOP}$ (normalized by $N$) with respect to the
normalized $S/Q$ of various types of sequences when their first 15 and 30 frames are
encoded into different GOP structures. We can see from Fig. 6.4 that, even though
the GOP structure changes, the GOP rate $R_{GOP}$ and $S/Q$ can be related by the
same R-Q model in Chapter 5, i.e.,

$$R_{GOP} = \eta \cdot \frac{S}{Q}, \tag{6.6}$$

where $\eta$ is a model parameter and $Q$ is the average quantization stepsize of a GOP.
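To illustrate how (6.6) is used for rate control, the sketch below inverts the
model to obtain the average stepsize for a bit budget. The function names are
hypothetical. The conversion between quantization parameter and stepsize is the
common H.264 approximation in which the stepsize is 0.625 at QP 0 and doubles
every 6 QP steps; it is not taken from this chapter.

```python
import math


def qstep_from_qp(qp):
    """Approximate H.264 quantization stepsize (doubles every 6 QP steps)."""
    return 0.625 * 2.0 ** (qp / 6.0)


def qp_from_qstep(qstep):
    """Inverse of qstep_from_qp; returns a real-valued QP."""
    return 6.0 * math.log(qstep / 0.625, 2)


def gop_rate(eta, complexity, qstep):
    """Eq. (6.6): R_GOP = eta * S / Q."""
    return eta * complexity / qstep


def qstep_for_target(eta, complexity, target_bits):
    """Solve Eq. (6.6) for the average stepsize Q that meets a bit budget."""
    return eta * complexity / target_bits


# Hypothetical numbers: eta = 0.5, S = 40000 and a budget of 2000 bits.
q = qstep_for_target(0.5, 40000.0, 2000.0)
rate = gop_rate(0.5, 40000.0, q)
```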
The invariance of the GOP rate model to changes of the GOP structure is very
important for bit rate control as well as GOP structure decision. In fact, no
matter whether GOP structures are fixed or varying, it should be guaranteed that
the proposed algorithm can control the bit rate accurately. For this reason, the
proposed algorithm estimates the values of $q_I$, $q_P$ and $q_B$ for candidate GOP
structures so that we can meet the target bit budget for a GOP regardless of the
finally determined GOP structure. We can expect from Fig. 6.4 that the R-Q model in
(6.6) with the new complexity measure is suitable for this purpose.

Figure 6.4: The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of (a) the
“Carphone” (QCIF), (b) the “Trevor” (QCIF), (c) the “Football” (SIF) and (d)
the “Paris” (CIF) sequences.
[Plots: Rate versus S/Q for N = 15 and N = 30.]

Figure 6.5: The relationships between the GOP rate $R_{GOP}$ and $S/Q$ of (a) the
“Trevor” (QCIF) and (b) the “Paris” (CIF) sequences when the GOP complexity
measure in Chapter 5 is employed.
[Plots: Rate versus S/Q for N = 15 and N = 30.]
Fig. 6.5 shows the R-Q relationship of the “Trevor” and “Paris” sequences when
the GOP complexity measure in Chapter 5 is employed. The superiority of the
new GOP complexity over the previous GOP complexity can be clearly observed
from this figure. In the previous GOP complexity, since intra prediction is not
considered, an I frame takes a large portion of the GOP complexity. As a result,
the GOP rate model is not invariant to the GOP size. Moreover, since motion
estimation is not involved for P and B frames, the GOP rate model is less invariant
to the positions of P frames.
GOP Distortion Model
It is well known that the distortion of a coding unit such as an MB or a frame is
proportional to the quantization stepsize. We observed that this also applies to
a GOP. In other words, the distortion of a GOP is proportional to the GOP size $N$
and the average quantization stepsize $Q$. Fig. 6.6 shows the normalized $D_{GOP}$
with respect to the normalized $Q \cdot N$ for the same sequences as in Fig. 6.4.
We can observe from this figure that, without taking the GOP complexity into
account, the GOP distortion $D_{GOP}$ can be estimated by

$$D_{GOP} = \psi \cdot Q \cdot N, \tag{6.7}$$

where $\psi$ is a model parameter.
As can be seen from the “Football” sequence in Fig. 6.6, the D-Q model is less
invariant to the GOP size. However, since the proposed algorithm considers the
GOP distortion after determining the GOP size, the D-Q model does not have to
be invariant to the GOP size. For the same GOP size, the D-Q model represents
clearly the linear relationship between $D_{GOP}$ and $Q \cdot N$ even though it is
not as robust to $M$ as the R-Q model.
6.2.3.3 Joint GOP Structure Decision and Frame-Layer Bit Allocation
Given the target bit budget $R_{T,GOP}$ for a GOP from $P_0$ to $P_{N_p}$ (see
Fig. 6.3), joint frame-layer bit allocation and GOP structure decision is performed
using the proposed GOP rate and distortion models. Let
$G = \{G^{(1)}, G^{(2)}, \ldots, G^{(k)}\}$ be the candidate GOP structures. Note
that each candidate GOP structure $G^{(j)}$, for $j = 1, \ldots, k$, is
characterized by the number and positions of P frames between $P_0$ and $P_{N_p}$.
The objective is to find the optimal GOP structure $G^* \in G$ that minimizes the
GOP distortion while frame-layer bit allocation is performed based on the
simplified rate control algorithm.

Figure 6.6: The relationships between the GOP distortion $D_{GOP}$ and $Q$ of (a)
the “Carphone” (QCIF), (b) the “Trevor” (QCIF), (c) the “Football” (SIF) and (d)
the “Paris” (CIF) sequences.
[Plots: Distortion versus Q*N for N = 15 and N = 30.]
Without loss of generality, suppose that there are only two candidate GOP
structures $G = \{G^{(1)}, G^{(2)}\}$. Then, the proposed method is performed by the
following steps:
1. Compute the GOP complexities $S^{(1)}$ and $S^{(2)}$ for $G^{(1)}$ and $G^{(2)}$,
respectively. Since different GOP structures have different dependencies between
frames due to different positions of P frames, $G^{(1)}$ and $G^{(2)}$ will have
different complexities, $S^{(1)}$ and $S^{(2)}$.
2. For the target bit budget $R_{T,GOP}$ for $G^{(1)}$ and $G^{(2)}$, determine
$q^{(1)}$ and $q^{(2)}$ from $S^{(1)}$ and $S^{(2)}$ using (6.6). Here, $q^{(1)}$ and
$q^{(2)}$ are the average quantization parameters corresponding to $Q^{(1)}$ and
$Q^{(2)}$ for $S^{(1)}$ and $S^{(2)}$, respectively. From $q^{(j)}$, where $j$ = 1
and 2, the quantization parameter for P frames $q_P^{(j)}$ is determined by

$$q_P^{(j)} = q^{(j)} - \frac{2 \cdot N_B^{(j)} + 1}{N}, \tag{6.8}$$

where $N_B^{(j)}$ is the number of B frames in $G^{(j)}$. After that, $q_I^{(j)}$ is
set to $q_P^{(j)} - 1$ and $q_B^{(j)}$ is set to $q_P^{(j)} + 2$.
3. Find $G^* \in \{G^{(1)}, G^{(2)}\}$ that gives the minimum GOP distortion as follows:
• First, compare $q_I^{(1)}$ and $q_I^{(2)}$. The GOP structure which has the
smaller value is determined as $G^*$.
• If $q_I^{(1)} = q_I^{(2)}$, compute the average quantization stepsizes for
$G^{(j)}$ from $q_I^{(j)}$, $q_P^{(j)}$ and $q_B^{(j)}$, where $j$ = 1 and 2. Then,
the GOP structure which has the smaller average quantization stepsize is
determined as $G^*$.
All frames in a GOP will be encoded into $G^*$ using the corresponding $q_I$,
$q_P$ and $q_B$. It is worthwhile to note that the frame-layer bit allocation is
performed jointly by deciding $G^*$, since the target bits for each type of frames
are allocated as a result by $q_I$, $q_P$ and $q_B$.
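The three steps above can be sketched as follows for any number of candidates.
This is an illustrative reading, not the reference implementation: the function
names are hypothetical, the QP-to-stepsize conversion is the common H.264
approximation (0.625 at QP 0, doubling every 6 steps), rounding $q_P$ to an
integer and assuming exactly one I frame per GOP are assumptions, and (6.8) is
applied in the form printed above.

```python
import math


def qstep_from_qp(qp):
    """Approximate H.264 quantization stepsize (assumed mapping)."""
    return 0.625 * 2.0 ** (qp / 6.0)


def qp_from_qstep(qstep):
    """Inverse of qstep_from_qp."""
    return 6.0 * math.log(qstep / 0.625, 2)


def frame_qps(eta, complexity, target_bits, n_frames, n_b):
    """Step 2: derive (q_I, q_P, q_B) for one candidate GOP structure."""
    q_avg_step = eta * complexity / target_bits              # Eq. (6.6)
    q_avg = qp_from_qstep(q_avg_step)
    q_p = int(round(q_avg - (2.0 * n_b + 1.0) / n_frames))   # Eq. (6.8)
    return q_p - 1, q_p, q_p + 2


def average_qstep(qps, n_frames, n_b):
    """Average stepsize of a GOP, assuming one I frame per GOP."""
    q_i, q_p, q_b = qps
    n_p = n_frames - n_b - 1
    return (qstep_from_qp(q_i) + n_p * qstep_from_qp(q_p)
            + n_b * qstep_from_qp(q_b)) / float(n_frames)


def choose_gop_structure(candidates, eta, target_bits, n_frames):
    """Step 3: pick the candidate with the smallest q_I, then smallest stepsize.

    candidates -- list of (label, GOP complexity S, number of B frames)
    """
    best = None
    for label, complexity, n_b in candidates:
        qps = frame_qps(eta, complexity, target_bits, n_frames, n_b)
        key = (qps[0], average_qstep(qps, n_frames, n_b))
        if best is None or key < best[0]:
            best = (key, label, qps)
    return best[1], best[2]


# Hypothetical complexities for two candidates of a 15-frame GOP.
candidates = [("M=2", 260000.0, 7), ("M=3", 240000.0, 10)]
choice = choose_gop_structure(candidates, eta=1.0, target_bits=20000.0,
                              n_frames=15)
```

In this made-up example the second candidate has a lower complexity and more B
frames, so it reaches the same budget with a smaller $q_I$ and is selected.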
As candidate GOP structures, we can consider all possible GOP structures.
That is, given the positions of $P_0$, $I_1$ and $P_{N_p}$, we can do a full search
over all possible numbers and positions of P frames. To reduce the complexity, fast
search methods such as the method in [23] can be applied. However, in the proposed
algorithm, we consider only four GOP structures as candidates according to the
distance between P frames, $M$. More specifically, we constrain the GOP structure
such that the distance between reference frames $M$ is fixed within a GOP. Then,
we consider the GOP structures whose $M$ is 2, 3, 4 or 5 as candidates.
There are two reasons why we impose the constraint on candidate GOP struc-
tures. One is to reduce the complexity. Motion estimation for measuring the GOP
complexity causes large computations even with down-sampled frames. If we con-
sider a lot of GOP structures, the complexity increases accordingly. Therefore, we
reduce the complexity by considering a small number of GOP structures. More
importantly, the other is to reduce errors in the GOP distortion modeling. As can
be seen from Figs. 6.4 and 6.6, the D-Q model is not as accurate as the R-Q
model. This is because of the dependencies between frames (e.g., the dependency
of an inter frame on reference frames) in video coding. In general, it is difficult
to model distortion since it is affected more significantly than rate by such
dependencies. The proposed algorithm encodes frames of the same type using the
same quantization parameter. By doing so, we can reduce the unpredictability of
the effect of reference frames on the distortion of the frames predicted from
them. However, it is still not easy to characterize the GOP distortion accurately
for an arbitrary GOP structure. For this reason, we consider a small number of
well-defined GOP structures as candidates.
6.2.3.4 Rate Control Algorithm with Adaptive GOP Structure
Referring to Fig. 6.3, the rate control algorithm with adaptive GOP structure is
performed in one pass as follows.
1. Identify an I frame $I_2$ using the proposed I-frame decision method.
2. Initial bit allocation to a GOP.
Based on the frame rate $F$ and the channel rate $C$, allocate the target bit
budget to the $N$ frames between $P_0$ and $P_{N_p}$ by

$$R_{T,GOP} = N \cdot \frac{C}{F} + R_0,$$

where $R_0$ is a feedback term which compensates for the difference between
the target bits and the actual bits of the previous GOP.
3. Joint GOP structure decision and frame-layer bit allocation.
Find the GOP structure $G^*$ out of the candidate GOP structures using the
proposed P-frame decision method. The values of $q_I$, $q_P$ and $q_B$ which
correspond to $G^*$ are stored for the next step.
4. Encode a GOP.
All frames between $P_0$ and $P_{N_p}$ are encoded using one of $q_I$, $q_P$ and
$q_B$ according to their frame type. Similar to the simplified algorithm in
Chapter 5, whenever a P frame is encoded, the R-Q model parameter $\eta$ is updated
using the information from coded frames in a GOP. Then, the next P and B
frames are encoded using $q_P$ and $q_B$, which are determined by the updated
R-Q model to meet the target bits more closely.
5. Update the GOP rate model parameters.
The R-Q model parameter η is updated by LSA using the previous 5 GOPs.
We apply the above process to each new GOP until we reach the last GOP.
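The budget computation of step 2 and the model update of step 5 can be sketched
as below. Several details are assumptions rather than statements from the text:
“LSA” is read as a least-squares fit of $R = \eta \cdot S/Q$ through the origin,
the sign convention of the feedback term $R_0$ is target minus actual bits, and
all function names are hypothetical.

```python
def gop_target_bits(n_frames, channel_rate, frame_rate, feedback=0.0):
    """Step 2: R_T,GOP = N * C / F + R_0."""
    return n_frames * channel_rate / float(frame_rate) + feedback


def feedback_term(prev_target, prev_actual):
    """R_0, assumed here to be the bits left unused by the previous GOP.

    The value is negative when the previous GOP overspent its budget.
    """
    return prev_target - prev_actual


def update_eta(history, window=5):
    """Step 5: least-squares estimate of eta over the most recent GOPs.

    history -- list of (rate, complexity, qstep) tuples, oldest first
    Minimizes sum (R - eta * S / Q)^2, giving eta = sum(R*x) / sum(x*x)
    with x = S / Q.
    """
    recent = history[-window:]
    num = sum(r * (s / q) for r, s, q in recent)
    den = sum((s / q) ** 2 for r, s, q in recent)
    return num / den


# Hypothetical numbers: a 15-frame GOP at 64 Kbps and 30 fps, after a
# previous GOP that overspent its budget by 1000 bits.
budget = gop_target_bits(15, 64000.0, 30.0)
r0 = feedback_term(32000.0, 33000.0)
next_budget = gop_target_bits(15, 64000.0, 30.0, feedback=r0)
eta = update_eta([(2000.0, 40000.0, 10.0), (1000.0, 40000.0, 20.0)])
```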
6.3 GOP-Layer Bit Allocation
In this section, we address the GOP-layer bit allocation problem using the GOP rate
and distortion models. The rate control algorithm with adaptive GOP structure
can be performed in one pass, if we employ a constant GOP-layer bit allocation
method as shown in the previous section. However, a constant bit allocation method
may not be a good approach when different GOPs have different sizes. Suppose
that there are two GOPs, whose sizes are 10 and 30, respectively. By constant bit
allocation, the target bit budget for a shorter GOP is one third of that for a longer
GOP. Since an I frame demands many more bits than B and P frames, the I frame
in a shorter GOP must be encoded using a larger quantization parameter than that
in a longer GOP. As a result, the overall quality of a shorter GOP becomes much
poorer than that of a longer GOP. We propose a GOP-layer bit allocation algorithm to achieve
smoother visual quality as well as to improve overall visual quality.
Let $N_{GOP}$ be the number of GOPs being considered for bit allocation and let
$R_{T,N_{GOP}}$ be the available bit budget for these GOPs. Then, the GOP-layer bit
allocation problem can be formulated as

$$\text{minimize} \sum_{g=1}^{N_{GOP}} D_g \quad \text{subject to} \quad \sum_{g=1}^{N_{GOP}} R_g \le R_{T,N_{GOP}}, \tag{6.9}$$

where $R_g$ and $D_g$ are the rate and distortion of the $g$-th GOP. If the size of
the $g$-th GOP is $N_g$, the available bit budget $R_{T,N_{GOP}}$ is given as

$$R_{T,N_{GOP}} = \sum_{g=1}^{N_{GOP}} N_g \cdot \frac{C}{F}. \tag{6.10}$$
With the Lagrange multiplier and the proposed R-Q and D-Q models, we can
convert the above constrained problem into an unconstrained problem as follows:

$$\text{minimize } J = \sum_{g=1}^{N_{GOP}} \psi_g \cdot N_g \cdot Q_g + \lambda \cdot \left( \sum_{g=1}^{N_{GOP}} \eta_g \cdot \frac{S_g}{Q_g} - R_{T,N_{GOP}} \right). \tag{6.11}$$
This minimization problem can be easily solved by partial derivatives. To be more
specific, by setting the partial derivatives with respect to $Q_g$ and $\lambda$ to
0, the optimal bit budget $R_{T,GOP}$ for the $g$-th GOP, which is denoted by
$R^*_g$, can be calculated as

$$R^*_g = \frac{\sqrt{\eta_g \cdot \psi_g \cdot N_g \cdot S_g}}{\sum_{j=1}^{N_{GOP}} \sqrt{\eta_j \cdot \psi_j \cdot N_j \cdot S_j}} \cdot R_{T,N_{GOP}}. \tag{6.12}$$

For all $g = 1, \ldots, N_{GOP}$, it indicates that $R^*_g$ increases with the
square root of the model parameters, the size and the complexity of a GOP.
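The intermediate steps between (6.11) and (6.12) can be written out as follows.
This is a reconstruction from the R-Q model (6.6) and the D-Q model (6.7), given
here only to make the origin of the square roots explicit.

```latex
\begin{align*}
\frac{\partial J}{\partial Q_g}
  &= \psi_g N_g - \lambda \, \frac{\eta_g S_g}{Q_g^2} = 0
  \;\Rightarrow\;
  Q_g = \sqrt{\frac{\lambda \, \eta_g S_g}{\psi_g N_g}}, \\
R_g &= \eta_g \frac{S_g}{Q_g}
  = \frac{1}{\sqrt{\lambda}} \sqrt{\eta_g \psi_g N_g S_g}, \\
\frac{\partial J}{\partial \lambda}
  &= \sum_{g=1}^{N_{GOP}} R_g - R_{T,N_{GOP}} = 0
  \;\Rightarrow\;
  \frac{1}{\sqrt{\lambda}}
  = \frac{R_{T,N_{GOP}}}{\sum_{j=1}^{N_{GOP}} \sqrt{\eta_j \psi_j N_j S_j}}.
\end{align*}
```

Substituting the last expression into the second line gives (6.12).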
To apply the proposed GOP-layer bit allocation, it is required to estimate the
R-D characteristics (i.e., $\eta_g$ and $\psi_g$) of all $N_{GOP}$ GOPs. Thus, the
GOP-layer bit allocation is performed through two-pass encoding. Suppose that
$N_{GOP}$ is determined in advance. The GOP-layer bit allocation with adaptive GOP
structure is performed in two passes as stated below.
1. First-pass encoding
After identifying an I frame using the proposed I-frame decision, encode a
GOP into the IBBP (i.e., $M = 3$) structure. Then, estimate the R-Q and D-Q
model parameters and compute the GOP complexity. Iterate this procedure
until the $N_{GOP}$-th GOP is encoded.
2. GOP-layer bit allocation
Allocate the target bit budget to all $N_{GOP}$ GOPs using (6.12).
3. Second-pass encoding
Encode all $N_{GOP}$ GOPs after joint GOP structure decision and frame-layer
bit allocation using the proposed P-frame decision method.
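Step 2 of this procedure, i.e., the allocation rule (6.12) together with the
total budget (6.10), can be sketched as below. The function names and the
dictionary layout are hypothetical, and the model parameters in the example are
made-up values rather than measurements.

```python
import math


def total_budget(sizes, channel_rate, frame_rate):
    """Eq. (6.10): total bits available for a set of GOPs."""
    return sum(sizes) * channel_rate / float(frame_rate)


def allocate_gop_bits(gops, total_bits):
    """Eq. (6.12): split total_bits in proportion to sqrt(eta*psi*N*S).

    gops -- list of dicts with keys 'eta', 'psi', 'size' and 'complexity',
            as estimated in the first encoding pass
    """
    weights = [math.sqrt(g["eta"] * g["psi"] * g["size"] * g["complexity"])
               for g in gops]
    weight_sum = sum(weights)
    return [total_bits * w / weight_sum for w in weights]


# Two GOPs of sizes 10 and 30, as in the example at the start of this
# section, with hypothetical model parameters and complexities.
gops = [
    {"eta": 1.0, "psi": 1.0, "size": 10, "complexity": 40.0},
    {"eta": 1.0, "psi": 1.0, "size": 30, "complexity": 30.0},
]
budget = total_budget([10, 30], channel_rate=60000.0, frame_rate=30.0)
alloc = allocate_gop_bits(gops, budget)
```

With these numbers the shorter GOP receives 32000 of the 80000 bits, whereas
constant allocation would give it only 20000 (one third of the longer GOP's
share).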
6.4 Experimental Results
The proposed algorithm is implemented in the JM8.6 reference encoder and its
performance is evaluated through two different experiments.
In the first experiment, we encode several QCIF sequences at various bit rates
from 32 Kbps to 256 Kbps. Two reference frames are used for inter prediction and
a total of 150 frames of each sequence are encoded at 30 fps. The main purpose of
this experiment is to see the coding gain by the adaptive P-frame decision only.
Thus, the GOP size is fixed at 15 and we compare the following rate control schemes
as shown in Table 6.1: “JM86” (JM8.6 rate control with N = 15 and M = 3),
“Fixed M” (proposed with N = 15 and M = 3) and “Adaptive M” (proposed with
N = 15 and adaptive M). Since N is fixed, the I-frame decision is not employed
in either proposed scheme. The GOP-layer bit allocation is not employed either.
Therefore, both proposed schemes in this experiment are performed in one pass.
Table 6.1 shows the final bit rates and the average PSNRs by the JM8.6 and
the proposed rate control schemes. The coding gains by the proposed schemes with
respect to those by the JM8.6 rate control are also shown. We can see that the
proposed scheme with fixed M improves the average PSNR by 0.42 dB compared with
the JM8.6 rate control. If we compare the proposed schemes with fixed M and
adaptive M, adaptive M provides an additional average coding gain of 0.12 dB for
“Container” and “Akiyo”, which have low activities. Fig. 6.7 shows
the PSNR variations for the “Container” and the “Akiyo” sequences at 64 Kbps
and 128 Kbps, respectively. However, the additional coding gain using adaptive M
is not obvious for “Foreman” and “Coastguard”, which have high activities. This
is because the I-B-B-P structure (i.e., M = 3) is optimal or near-optimal for the
sequences which have high activities.

Sequence     JM86              Fixed M                  Adaptive M
             Bitrate  PSNR     Bitrate  PSNR            Bitrate  PSNR
Container     32.14   31.69     32.41   31.92 (+0.23)    32.19   31.97 (+0.28)
              62.20   35.63     64.50   36.09 (+0.46)    63.74   36.21 (+0.58)
             128.14   39.55    128.42   39.80 (+0.25)   128.29   40.06 (+0.51)
Akiyo         32.23   34.48     32.19   34.86 (+0.38)    32.19   34.91 (+0.43)
              65.27   38.99     63.76   39.68 (+0.69)    63.76   39.76 (+0.77)
             131.78   43.89    127.83   44.18 (+0.29)   127.86   44.33 (+0.44)
Foreman       64.70   31.88     64.94   32.26 (+0.38)    64.42   32.25 (+0.37)
             128.72   35.53    128.89   35.89 (+0.36)   128.81   35.89 (+0.36)
             257.18   38.74    256.58   39.14 (+0.40)   256.31   39.18 (+0.44)
Coastguard    64.16   28.96     64.07   29.21 (+0.25)    64.07   29.25 (+0.29)
             128.21   31.30    128.41   31.92 (+0.62)   128.74   31.94 (+0.64)
             253.67   34.05    256.26   34.78 (+0.73)   256.05   34.77 (+0.72)
Table 6.1: Performances by the different rate control schemes for QCIF sequences.
Bit rates are in Kbps and PSNRs in dB; gains over JM86 are given in parentheses.
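The two averages quoted above can be reproduced directly from the gains listed
in Table 6.1. The short check below uses only those table entries; the variable
names are arbitrary.

```python
# PSNR gains over JM86 (dB), copied from Table 6.1.
fixed_m = {
    "Container": [0.23, 0.46, 0.25],
    "Akiyo": [0.38, 0.69, 0.29],
    "Foreman": [0.38, 0.36, 0.40],
    "Coastguard": [0.25, 0.62, 0.73],
}
adaptive_m = {
    "Container": [0.28, 0.58, 0.51],
    "Akiyo": [0.43, 0.77, 0.44],
    "Foreman": [0.37, 0.36, 0.44],
    "Coastguard": [0.29, 0.64, 0.72],
}

# Average gain of "Fixed M" over JM86 across all twelve test points.
all_fixed = [g for gains in fixed_m.values() for g in gains]
avg_fixed_gain = sum(all_fixed) / len(all_fixed)

# Additional gain of "Adaptive M" over "Fixed M" for the low-activity sequences.
extra = [a - f
         for name in ("Container", "Akiyo")
         for a, f in zip(adaptive_m[name], fixed_m[name])]
avg_extra_gain = sum(extra) / len(extra)
```

Rounded to two decimals, `avg_fixed_gain` and `avg_extra_gain` agree with the
0.42 dB and 0.12 dB figures stated in the text.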
Sequence 1 Akiyo (47) + News (48) + Hall (55)
Sequence 2 Foreman (50) + Silent (55) + Suzie (45)
Sequence 3 Grandma (45) + Salesman (48) + Claire (57)
Sequence 4 Trevor (45) + Suzie (55) + Mother & Daughter (50)
Table 6.2: Composite QCIF sequences for the second experiment.
In the second experiment, several composite sequences shown in Table 6.2 are
encoded at various bit rates from 32 Kbps to 256 Kbps. For example, “Sequence
1” consists of 47, 48 and 55 frames of “Akiyo”, “News” and “Hall”, respectively.
As in the first experiment, two reference frames are used for inter prediction and
a total of 150 frames of each sequence are encoded at 30 fps.
In this experiment, we want to see how much we can improve coding efficiency
using adaptive N and GOP-layer bit allocation. Therefore, we compare the following
rate control schemes as shown in Table 6.3: “JM86” (JM8.6 rate control with
N = 25 and M = 3), “Fixed M” (proposed with adaptive N and M = 3), “Fixed
M + Bit Alloc.” (proposed GOP-layer bit allocation with adaptive N and M = 3)
and “Adaptive M + Bit Alloc.” (proposed GOP-layer bit allocation with adaptive N
and M). In general, coding efficiency decreases as the number of I frames increases.
Therefore, the number of I frames by all rate control schemes should be the same
for fair comparison. For this reason, N is set to 25 for “JM86” such that all
rate control schemes produce the same number of I frames.
Table 6.3 shows the final bit rates and the average PSNRs by the JM8.6 and
the proposed rate control schemes. The coding gains by the proposed schemes with
respect to those by the JM8.6 rate control are also shown. Figs. 6.8 and 6.9 show the
PSNR variations at 64 Kbps and 128 Kbps, respectively. Compared with the JM8.6
rate control, the average coding gain and the maximal coding gain by “Fixed M”
are 1.40 dB and 2.30 dB, respectively. That is, we can improve coding efficiency
significantly by choosing I frames adaptively. The additional average and maximal
coding gains by employing GOP-layer bit allocation (“Fixed M + Bit Alloc.”) are
0.1 dB and 0.24 dB, respectively. The benefit of GOP-layer bit allocation can be
seen more clearly from Figs. 6.8 and 6.9. We can observe that a smoother video
quality can be produced using GOP-layer bit allocation. However, when we compare
“Fixed M + Bit Alloc.” and “Adaptive M + Bit Alloc.”, we can conclude that the
additional coding gain using adaptive M is not obvious for the composite sequences.
Bitrates are in Kbps and PSNRs in dB; values in parentheses are PSNR gains over JM86.

Seq.    JM86                Fixed M                      Fixed M + Bit Alloc.         Adaptive M + Bit Alloc.
No.     Bitrate   PSNR      Bitrate   PSNR               Bitrate   PSNR               Bitrate   PSNR
1        36.76   32.55       32.62   33.47 (+0.92)        32.18   33.50 (+0.95)        32.02   33.50 (+0.95)
1        65.64   35.69       63.50   37.82 (+2.13)        63.59   37.88 (+2.19)        63.51   37.86 (+2.17)
1       129.75   40.52      128.19   42.13 (+1.61)       128.30   42.17 (+1.65)       128.75   42.17 (+1.65)
2        64.70   33.95       64.14   34.68 (+0.73)        64.34   34.76 (+0.81)        64.20   34.74 (+0.79)
2       128.77   37.51      128.42   38.08 (+0.57)       128.22   38.24 (+0.73)       128.60   38.25 (+0.74)
2       256.31   40.96      257.38   41.73 (+0.77)       256.95   41.80 (+0.85)       256.89   41.79 (+0.84)
3        32.15   33.14       32.05   35.35 (+2.21)        31.99   35.39 (+2.25)        32.25   35.48 (+2.34)
3        64.39   36.86       64.43   39.16 (+2.30)        64.17   39.18 (+2.32)        63.98   39.16 (+2.30)
3       128.25   41.21      130.16   42.87 (+1.66)       128.91   43.03 (+1.82)       128.97   43.08 (+1.88)
4        64.20   34.40       64.13   35.75 (+1.35)        64.41   35.87 (+1.47)        64.31   35.86 (+1.46)
4       128.07   37.91      129.05   39.28 (+1.37)       129.12   39.42 (+1.51)       128.42   39.43 (+1.52)
4       253.68   41.54      257.26   42.69 (+1.15)       256.27   42.93 (+1.39)       256.79   42.98 (+1.44)
Table 6.3: Performances by the different rate control schemes for the composite
sequences in Table 6.2.
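The average and maximal gains quoted in the discussion of Table 6.3 can be re-derived from the PSNR gains listed in the table. The following Python sketch does so; the list and function names are illustrative only, and the values are transcribed from the table in row order (Sequences 1 to 4, lowest to highest bit rate).

```python
# PSNR gains over JM8.6 in dB, transcribed from Table 6.3 in row order.
FIXED_M = [0.92, 2.13, 1.61, 0.73, 0.57, 0.77,
           2.21, 2.30, 1.66, 1.35, 1.37, 1.15]
FIXED_M_ALLOC = [0.95, 2.19, 1.65, 0.81, 0.73, 0.85,
                 2.25, 2.32, 1.82, 1.47, 1.51, 1.39]
ADAPTIVE_M_ALLOC = [0.95, 2.17, 1.65, 0.79, 0.74, 0.84,
                    2.34, 2.30, 1.88, 1.46, 1.52, 1.44]


def average_and_max(gains):
    """Return (mean, max) of a list of gains, rounded to two decimals."""
    return round(sum(gains) / len(gains), 2), round(max(gains), 2)


def additional_gain(baseline, improved):
    """Return (mean, max) of the row-by-row differences improved - baseline."""
    diffs = [round(b - a, 2) for a, b in zip(baseline, improved)]
    return average_and_max(diffs)


if __name__ == "__main__":
    print("Fixed M vs JM86:", average_and_max(FIXED_M))
    print("Bit allocation on top of Fixed M:",
          additional_gain(FIXED_M, FIXED_M_ALLOC))
    print("Adaptive M on top of bit allocation:",
          additional_gain(FIXED_M_ALLOC, ADAPTIVE_M_ALLOC))
```

The first two results reproduce the 1.40 dB / 2.30 dB and 0.1 dB / 0.24 dB figures reported in the text. The third shows that adaptive M adds only about 0.01 dB on average over the same twelve rows, which is consistent with the observation that its gain is not obvious for the composite sequences.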
6.5 Conclusion
We proposed a rate control algorithm with adaptive GOP structure in this chapter.
First, we proposed the I-frame decision method based on the MAD between original
down-sampled frames. Then, we proposed the P-frame decision method for
joint GOP structure decision and frame-layer bit allocation. The inter-dependency
of frame-layer bit allocation and GOP structure decision was addressed using the
GOP rate and distortion models along with the frame-layer bit allocation scheme.
Finally, we also proposed the GOP-layer bit allocation algorithm. We observed from
the experiments that coding efficiency can be improved considerably by changing
GOP structures adaptively. Moreover, we observed that the GOP-layer bit allocation
algorithm produces higher and more consistent video quality.
However, we found that it is difficult to improve coding efficiency noticeably with
the adaptive P-frame (i.e., adaptive M) decision method. Even for high-activity
sequences, the optimal or near-optimal value of M is 3 due to the improved
motion-compensated prediction (MCP) in H.264. Except for very low-activity sequences,
M = 3 seems optimal or near-optimal. In H.264, the optimal number and positions
of P frames depend on various factors such as the number and the positions of
reference frames. Further study of this problem is needed as future work to
improve coding efficiency.
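The I-frame decision summarized in this section can be illustrated with a minimal Python sketch. It assumes a fixed MAD threshold, 2×2 block averaging for down-sampling, and frames given as lists of luma rows. These choices and all function names are assumptions made for illustration; they are not the exact rule or parameters used in this chapter.

```python
def downsample(frame, factor=2):
    """Average non-overlapping factor x factor blocks of a 2-D luma frame."""
    rows = len(frame) // factor
    cols = len(frame[0]) // factor
    out = []
    for by in range(rows):
        row = []
        for bx in range(cols):
            block = [frame[by * factor + dy][bx * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def mad(a, b):
    """Mean absolute difference between two equally sized 2-D frames."""
    total = sum(abs(pa - pb)
                for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def detect_scene_cuts(frames, threshold=20.0, factor=2):
    """Return indices of frames whose MAD to the previous frame exceeds
    the (assumed) threshold; these are candidate I-frame positions."""
    small = [downsample(f, factor) for f in frames]
    return [i for i in range(1, len(small))
            if mad(small[i - 1], small[i]) > threshold]


if __name__ == "__main__":
    dark = [[10] * 4 for _ in range(4)]
    bright = [[200] * 4 for _ in range(4)]
    print(detect_scene_cuts([dark, dark, dark, bright, bright]))
```

Working on down-sampled frames keeps the per-frame cost low, which matters because the test is run on every input frame before encoding.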
[Figure 6.7 plots: PSNR (dB) versus frame number (0 to 150) for JM86, Fixed M and Adaptive M; panel (a) Container, panel (b) Akiyo.]
Figure 6.7: The variations of PSNRs by different rate control schemes for (a) “Container”
(QCIF, 64 Kbps) and (b) “Akiyo” (QCIF, 128 Kbps).
[Figure 6.8 plots: PSNR (dB) versus frame number (0 to 150) for JM86, Fixed M, Fixed M + Bit Alloc. and Adaptive M + Bit Alloc.; panel (a) Sequence 1, panel (b) Sequence 2.]
Figure 6.8: The variations of PSNRs by different rate control schemes for (a) “Sequence 1”
(QCIF, 64 Kbps) and (b) “Sequence 2” (QCIF, 64 Kbps).
[Figure 6.9 plots: PSNR (dB) versus frame number (0 to 150) for JM86, Fixed M, Fixed M + Bit Alloc. and Adaptive M + Bit Alloc.; panel (a) Sequence 3, panel (b) Sequence 4.]
Figure 6.9: The variations of PSNRs by different rate control schemes for (a) “Sequence 3”
(QCIF, 128 Kbps) and (b) “Sequence 4” (QCIF, 128 Kbps).
Chapter 7
Conclusion and Future Work
7.1 Summary of the Research
In this dissertation, we addressed various issues related to rate control for an
H.264 encoder over error-prone channels. H.264 video can be used in real-time
conversational applications as well as non-conversational applications. Depending
on the type of application, there exist different constraints and issues in rate control.
We studied and developed rate control algorithms via rate and distortion modeling
for both types of application.
In Chapter 3, we first pointed out that the inter-dependency between RDO
and rate control makes model-based rate control more challenging. To resolve this
problem, we investigated a two-stage encoding scheme. In the first stage, the RDO
process is performed with an initial quantization parameter. The actual quantization
parameter is then estimated using the header information and the residual signal
obtained from the RDO process. In the second stage, the input video source is
encoded using the estimated quantization parameter. It was observed that, even
when the quantization parameter for RDO in stage one and the actual quantization
parameter used in stage two are different, the decrease of the coding gain is not
significant as long as their difference is restricted to a small range.
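As a rough illustration of the second-stage decision, the following Python sketch picks the stage-two quantization parameter from a small window around the stage-one value. The window size, the exponential toy rate model and all names are hypothetical and serve only to show the structure; the actual rate models are those developed in Chapter 3.

```python
def choose_stage2_qp(qp1, target_bits, header_bits, estimate_source_bits,
                     max_delta=3, qp_min=0, qp_max=51):
    """Pick the QP within +/- max_delta of the stage-one QP whose predicted
    total bits (header + source) are closest to the frame target."""
    low = max(qp_min, qp1 - max_delta)
    high = min(qp_max, qp1 + max_delta)
    return min(range(low, high + 1),
               key=lambda qp: abs(header_bits + estimate_source_bits(qp)
                                  - target_bits))


def toy_source_bits(qp):
    """Hypothetical source-rate model: bits inversely proportional to the
    quantization step size, which roughly doubles every 6 QP in H.264."""
    return 40000.0 * 2.0 ** (-(qp - 4) / 6.0)


if __name__ == "__main__":
    # Stage one ran RDO with QP 30 and produced about 1000 header bits.
    print(choose_stage2_qp(30, target_bits=3000, header_bits=1000,
                           estimate_source_bits=toy_source_bits))
    # A very tight target is clamped to the edge of the allowed window.
    print(choose_stage2_qp(30, target_bits=1100, header_bits=1000,
                           estimate_source_bits=toy_source_bits))
```

Restricting the search to a small window reflects the observation above: the mode decisions made with the stage-one parameter remain nearly optimal only while the two parameters stay close.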
We also studied frame rate and distortion modeling in Chapter 3, since accurate
rate and distortion models have a great impact on the performance of model-based
rate control algorithms. It was observed that the exact estimation of header bits is
as important as the exact estimation of source bits in H.264 rate control. For this
reason, we proposed a header rate model to estimate header bits more accurately.
In the proposed header rate model, header bits were modeled as a function of the
number of MVs and the number of non-zero MV elements. In addition, we proposed
enhanced source rate and distortion models based on coded block identification.
The proposed source modeling was motivated by the facts that no bits are necessary
for skipped blocks and that the distortion of skipped blocks can be directly computed
from the residual signal. The model accuracy was improved considerably by
considering only coded 4×4 blocks. A two-stage rate control algorithm that adopts
the enhanced rate and distortion models was presented for conversational
applications. Moreover, a
bit allocation algorithm among the MB classes in a frame was also proposed.
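To make the header rate model concrete, the sketch below fits a two-parameter model of header bits by least squares. The linear form (header bits ≈ a · non-zero MV elements + b · MVs), the function names and the sample data are all assumptions for illustration; the summary above states only that header bits are a function of these two counts.

```python
def fit_header_model(samples):
    """Least-squares fit of header_bits ~ a * n_nonzero_mv_elems + b * n_mvs.

    samples: list of (n_nonzero_mv_elems, n_mvs, header_bits) tuples.
    Returns the coefficients (a, b) from the 2x2 normal equations.
    """
    sxx = sum(x * x for x, _, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    sxr = sum(x * r for x, _, r in samples)
    syr = sum(y * r for _, y, r in samples)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("samples do not determine both coefficients")
    a = (sxr * syy - syr * sxy) / det
    b = (syr * sxx - sxr * sxy) / det
    return a, b


def predict_header_bits(a, b, n_nonzero_mv_elems, n_mvs):
    """Evaluate the assumed linear header rate model."""
    return a * n_nonzero_mv_elems + b * n_mvs


if __name__ == "__main__":
    # Synthetic samples generated with a = 6 and b = 2 (not measured data).
    history = [(10, 20, 100), (30, 25, 230), (5, 40, 110)]
    a, b = fit_header_model(history)
    print(a, b, predict_header_bits(a, b, 12, 18))
```

In a rate controller the coefficients would typically be refreshed from previously encoded frames, so that the prediction tracks changes in motion content.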
In Chapter 4, the frame-layer bit allocation problem in non-conversational
applications was addressed by a two-pass algorithm using the Lagrange optimization
framework. Based on the assumption that frames are independent, we formulated
the bit allocation problem as a bit-budget constrained R-D optimization problem.
We demonstrated that our two-stage encoding and frame rate and distortion
models can be employed successfully for rate control of an H.264 encoder in
non-conversational applications.
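Under the independence assumption, the budget-constrained problem decouples for a fixed Lagrange multiplier: each frame minimizes D + λR on its own, and λ is adjusted until the budget is met. The following Python sketch shows this generic procedure with a bisection search on λ. The discrete operating points, the bisection strategy and all names are illustrative assumptions, not the specific algorithm of Chapter 4.

```python
def allocate(frames, lam):
    """For each frame, pick the (rate, distortion) point minimizing D + lam*R."""
    return [min(points, key=lambda p: p[1] + lam * p[0]) for points in frames]


def total_rate(allocation):
    """Sum of the rates of the chosen operating points."""
    return sum(rate for rate, _ in allocation)


def lagrangian_allocation(frames, budget, iterations=40):
    """Search for the smallest multiplier whose allocation fits the budget.

    frames: one list of (rate, distortion) operating points per frame.
    Returns (lam, allocation); the allocation never exceeds the budget.
    """
    low, high = 0.0, 1.0
    # Grow the upper bound until the allocation satisfies the budget.
    while total_rate(allocate(frames, high)) > budget:
        high *= 2.0
        if high > 1e12:
            raise ValueError("budget cannot be met by any operating point")
    # Bisection: 'high' always denotes a multiplier that meets the budget.
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if total_rate(allocate(frames, mid)) > budget:
            low = mid
        else:
            high = mid
    return high, allocate(frames, high)


if __name__ == "__main__":
    # Synthetic operating points (rate in bits, distortion as MSE).
    frames = [[(100, 10.0), (60, 20.0), (30, 45.0)],
              [(120, 8.0), (70, 18.0), (40, 40.0)]]
    lam, chosen = lagrangian_allocation(frames, budget=130)
    print(lam, chosen, total_rate(chosen))
```

Because every frame uses the same multiplier, all frames operate at the same slope of their R-D curves, which is the optimality condition for independent frames.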
In Chapter 5, one-pass frame-layer bit allocation algorithms were proposed via
GOP rate modeling. First of all, we proposed a one-pass algorithm using the
Lagrange optimization framework. Motivated by the fact that the Lagrange multiplier
determines the optimal number of bits for each frame, we investigated two GOP
rate models, i.e., R-Q and R-λ models, to estimate the Lagrange multiplier without
a pre-encoding pass. We demonstrated that the GOP rate models enable the
effective use of the Lagrange optimization framework in a one-pass rate control
scheme. In addition, motivated by the fact that better quality of reference frames
leads to better overall coding efficiency, we proposed a simplified one-pass algorithm.
In this algorithm, the quantization parameter of each frame is determined based
on its frame type. Therefore, no frame rate or distortion model is necessary;
only the GOP-based R-Q model is required. We demonstrated that the rate control
process can be simplified significantly at the cost of a negligible loss in coding efficiency.
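The simplified algorithm can be pictured as two small steps: obtain a base quantization parameter for the GOP from an R-Q model, then offset it by frame type so that reference frames are coded with finer quantization. The Python sketch below assumes a quadratic GOP R-Q form R = a/Q + b/Q², the common approximation Qstep ≈ 2^((QP − 4)/6), and hypothetical offsets of −2, 0 and +2 for I, P and B frames. None of these specifics is taken from Chapter 5.

```python
import math


def gop_base_qstep(target_bits, a, b):
    """Solve a/Q + b/Q**2 = target_bits for the positive root Q.

    Assumes a hypothetical quadratic GOP R-Q model with a, b >= 0,
    a + b > 0 and target_bits > 0.
    """
    # Equivalent quadratic: target_bits * Q**2 - a * Q - b = 0.
    disc = a * a + 4.0 * target_bits * b
    return (a + math.sqrt(disc)) / (2.0 * target_bits)


def qstep_to_qp(qstep):
    """Approximate H.264 mapping: Qstep doubles every 6 QP, Qstep = 1 at QP 4."""
    return max(0, min(51, round(4 + 6 * math.log2(qstep))))


def frame_qps(gop_types, base_qp, offsets=None):
    """Assign a QP to each frame from its type; offsets are assumed values."""
    if offsets is None:
        offsets = {"I": -2, "P": 0, "B": 2}
    return [max(0, min(51, base_qp + offsets[t])) for t in gop_types]


if __name__ == "__main__":
    qstep = gop_base_qstep(target_bits=1000, a=10000.0, b=0.0)
    base_qp = qstep_to_qp(qstep)
    print(qstep, base_qp, frame_qps("IBBP", base_qp))
```

Since only the GOP-level model has to be updated, no per-frame rate or distortion model is evaluated, which is where the reduction in complexity comes from.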
In Chapter 6, we proposed a rate control algorithm with adaptive GOP structure that
consists of two steps: I-frame selection followed by P-frame selection. First,
an I frame is selected by scene-cut detection using the MAD between frames. Second,
a final GOP structure is determined by choosing P frames jointly with frame-layer
bit allocation. We identified the inter-dependency problem of frame-layer bit
allocation and GOP structure decision. This problem was resolved by a simplified bit
allocation scheme with the GOP rate and distortion models. For this purpose, we
investigated R-Q and D-Q models that are robust against the variation of
a GOP structure. Lastly, a GOP-layer bit allocation algorithm was proposed using
the R-Q and the D-Q models. It was shown that the new rate control algorithm
can achieve higher average quality as well as smoother visual quality variation.
7.2 Future Research Directions
During the last several years, video transmission over wireless mobile channels
has received a lot of attention due to the increasing bandwidth of 3G and 4G
mobile networks and the increasing demand for video communications. As a result,
reliable transmission of H.264 video over error-prone wireless channels has become
an important research issue. Adequate error resilience features should be provided
to protect an H.264 bitstream from possible channel errors. There are two types of
error control techniques: 1) forward error correction (FEC) and 2) automatic repeat
request (ARQ) [8, 10]. The appropriate error control technique depends on the
application environment. For example, in real-time applications where the end-to-end
delay is critical, it is more appropriate to apply FEC than ARQ. For non-conversational
applications, both ARQ and FEC can be used when there exists a feedback channel.
We may consider a scenario where an unequal error protection (UEP) FEC
scheme is used for error control. When compared with the equal error protection
(EEP) FEC scheme, UEP FEC can reduce channel-induced errors by providing
more protection for more important data. For non-conversational applications,
we may consider the frame-based UEP scheme based on frame types and content
complexities. As a result, we can consider a joint source-channel frame-layer bit
allocation problem to minimize the expected distortion. Under this scenario, we
may study the adaptive P-frame decision problem. As presented in Chapter 6
and [23], if there is no transmission error, it is not easy to obtain a noticeable
coding gain by selecting P frames adaptively. However, this may be different for
error-prone channels since we may get different packet loss rates (PLR) by applying
different UEP to different frame types (e.g., a lower loss rate of a P-frame packet
than that of a B-frame packet). Thus, we may improve the streaming video quality
significantly by selecting P frames adaptively.
Recently, a scalable extension of H.264 called the scalable video model (SVM)
has been proposed to improve both coding efficiency and the scalability functionality
[42]. Moreover, multi-view video coding based on H.264 has been studied in
the last few years [29]. Efficient rate control schemes for scalable and multi-view
H.264 video are also interesting research problems.
Reference List
[1] P.-Y. Cheng, J. Li, and C.-C. J. Kuo, “Rate control for an embedded wavelet
video coder,” IEEE Trans. Circuits and Syst. Video Technol., vol. 7, pp. 696–
702, August 1997.
[2] T. Chiang and Y.-Q. Zhang, “A new rate control scheme using quadratic rate
distortion model,” IEEE Trans. Circuits and Syst. Video Technol., vol. 7, pp.
246–250, February 1997.
[3] H. Chung, D. Romancho, and A. Ortega, “Fast long-term motion estimation
for H.264 using multiresolution search,” in Proc. IEEE Intl. Conf. Image
Processing, September 2003, pp. 905–908.
[4] J. L. Devore and N. R. Farnum, Applied Statistics for Engineers and Scientists.
New York: Duxbury Press, 1999.
[5] W. Ding and B. Liu, “Rate control of MPEG video coding and recording by
rate-quantization modeling,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 6, pp. 12–20, February 1996.
[6] A. Dumitras and B. G. Haskell, “I/P/B frame type decision by collinearity
of displacements,” in Proc. IEEE Intl. Conf. Image Processing, October 2004,
pp. 2769–2772.
[7] Z. He, Y. K. Kim, and S. K. Mitra, “Low-delay rate control for DCT
video coding via ρ-domain source modeling,” IEEE Trans. Circuits and
Syst. Video Technol., vol. 12, pp. 970–982, November 2002.
[8] Z. He and S. K. Mitra, “A linear source model and a unified rate control algo-
rithm for DCT video coding,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 12, pp. 511–523, June 2002.
[9] ——, “Optimum bit allocation and accurate rate control for video coding via
ρ-domain source modeling,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 12, pp. 840–849, October 2002.
[10] C.-Y. Hsu, A. Ortega, and M. Khansari, “Rate control for robust video trans-
mission over burst-error wireless channels,” IEEE J. Select. Areas Commun.,
vol. 17, pp. 756–773, May 1999.
[11] C.-Y. Hsu, A. Ortega, and A. R. Reibman, “Joint selection of source and
channel rate for VBR video transmission under ATM policing constraints,”
IEEE J. Select. Areas Commun., vol. 15, pp. 1016–1028, August 1997.
[12] ITU-T, “Video codec for audiovisual services at p × 64 kbit/s,” ITU-T Rec.
H.261, ver. 2, 1990.
[13] ——, “Video coding for low bit rate communications,” ITU-T Rec. H.263, ver.
2, 1998.
[14] M. Jiang and N. Ling, “On enhancing H.264/AVC video rate control by
PSNR-based frame complexity estimation,” IEEE Trans. Consumer Electron-
ics, vol. 51, pp. 281–286, February 2005.
[15] Joint Video Team, “JVT Reference Software Encoder, version 8.1a,”
http://bs.hhi.de/seuhring/tml/download.
[16] J. Kai, Z. He, and C. W. Chen, “Optimal bit allocation for low bit rate video
streaming applications,” in Proc. IEEE Intl. Conf. Image Processing, Septem-
ber 2002, pp. I–73–76.
[17] C. Kim, H.-H. Shih, and C.-C. J. Kuo, “Feature-based intra-prediction mode
decision for H.264,” in Proc. IEEE Intl. Conf. Image Processing, October 2004,
pp. 769–772.
[18] H. M. Kim, “Adaptive rate control using nonlinear regression,” IEEE
Trans. Circuits and Syst. Video Technol., vol. 13, pp. 432–439, May 2003.
[19] S. Kim and Y. S. Ho, “Rate control algorithm for H.264/AVC video coding
standard based on rate-quantization model,” in Proc. IEEE Intl. Conf. Multi-
media and Expo, June 2004, pp. 165–168.
[20] A. Y. Lan, A. G. Nguyen, and J.-N. Hwang, “Scene-context-dependent
reference-frame placement for MPEG video coding,” IEEE Trans. Circuits
and Syst. Video Technol., vol. 9, pp. 478–489, April 1999.
[21] H. J. Lee, T. Chiang, and Y.-Q. Zhang, “Scalable rate control for MPEG-4
video,” IEEE Trans. Circuits and Syst. Video Technol., vol. 10, pp. 878–894,
September 2000.
[22] J. Lee and D. W. Dickinson, “Temporally adaptive motion interpolation ex-
ploiting temporal masking in visual perception,” IEEE Trans. Image Process-
ing, vol. 3, pp. 513–526, September 1994.
[23] ——, “Rate-Distortion optimized frame type selection for MPEG encoding,”
IEEE Trans. Circuits and Syst. Video Technol., vol. 7, pp. 501–510, June 1997.
[24] Z. G. Li, F. Pan, K. P. Lim, and S. Rahardja, “Adaptive rate control for H.264,”
in Proc. IEEE Intl. Conf. Image Processing, October 2004, pp. 745–748.
[25] L.-J. Lin and A. Ortega, “Bit-rate control using piecewise approximated rate-
distortion characteristics,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 8, pp. 446–459, August 1998.
[26] S. Liu and C.-C. J. Kuo, “Joint temporal-spatial bit rate control for video
coding with dependency,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 15, pp. 15–26, January 2005.
[27] S. Ma, W. Gao, and Y. Lu, “Rate-Distortion analysis for H.264/AVC video and
its application to rate control,” IEEE Trans. Circuits and Syst. Video Technol.,
vol. 15, pp. 1533–1544, December 2005.
[28] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity
transform and quantization in H.264/AVC,” IEEE Trans. Circuits and
Syst. Video Technol., vol. 13, pp. 599–603, July 2003.
[29] P. Merkle, K. Muller, A. Smolic, and T. Wiegand, “Efficient compression of
multi-view video exploiting inter-view dependencies based on H.264/MPEG4-AVC,”
in Proc. IEEE Intl. Conf. Multimedia and Expo, July 2006.
[30] MPEG-2, “Test Model 5 (TM5),” Doc. ISO/IEC JTC1/SC29/WG11/93-225b,
April 1995.
[31] D. Mukherjee and S. K. Mitra, “Combined mode selection and macroblock
quantization step adaptation for the H.263 video encoder,” in Proc. IEEE
Intl. Conf. Image Processing, October 1997, pp. 26–29.
[32] A. Ortega, K. Ramchandran, and M. Vetterli, “Optimal trellis-based buffered
compression and fast approximation,” IEEE Trans. Image Processing, vol. 3,
pp. 26–40, January 1994.
[33] F. Pan, Z. Li, K. Lim, and G. Feng, “A study of MPEG-4 rate control scheme
and its improvements,” IEEE Trans. Circuits and Syst. Video Technol., vol. 13,
pp. 440–446, May 2003.
[34] T. V. Rakshman, A. Ortega, and A. R. Reibman, “VBR video: Tradeoffs and
potentials,” Proceedings of the IEEE, vol. 86, pp. 952–973, May 1998.
[35] K. Ramchandran, A. Ortega, and M. Vetterli, “Bit allocation for dependent
quantization with applications to multiresolution and MPEG video coders,”
IEEE Trans. Image Processing, vol. 3, pp. 533–545, September 1993.
[36] A. R. Reibman and B. G. Haskell, “Constraints on variable bit-rate video for
ATM networks,” IEEE Trans. Circuits and Syst. Video Technol., vol. 2, pp.
361–372, December 1992.
[37] J. Ribas-Corbera, P. A. Chou, and S. L. Regunathan, “A generalized hy-
pothetical reference decoder for H.264/AVC,” IEEE Trans. Circuits and
Syst. Video Technol., vol. 13, pp. 674–687, June 2003.
[38] J. Ribas-Corbera and S. Lei, “Rate control in DCT video coding for low-delay
communications,” IEEE Trans. Circuits and Syst. Video Technol., vol. 9, pp.
172–185, February 1999.
[39] ——, “A frame-layer bit allocation for H.263+,” IEEE Trans. Circuits and
Syst. Video Technol., vol. 10, pp. 1154–1158, October 2000.
[40] E. A. Riskin, “Optimal bit allocation via the generalized BFOS algorithm,”
IEEE Trans. Information Theory, vol. 37, pp. 400–402, March 1991.
[41] G. Schuster and A. K. Katsaggelos, “Fast and efficient model and quantizer
selection in the rate distortion sense for H.263,” in Proc. SPIE, Conf. Visual
Communications and Image Processing, March 1996, pp. 784–795.
[42] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable
H.264/MPEG4-AVC extension,” in Proc. IEEE Intl. Conf. Image Processing,
October 2006.
[43] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of
quantizers,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1445–
1453, September 1988.
[44] T. Sikora, “The MPEG-4 video standard verification model,” IEEE Trans. Cir-
cuits and Syst. Video Technol., vol. 7, pp. 19–31, February 1997.
[45] H. Song and C.-C. J. Kuo, “Rate control for low-bit-rate video via variable-
encoding frame rates,” IEEE Trans. Circuits and Syst. Video Technol., vol. 11,
pp. 512–521, April 2001.
[46] A. Vetro, H. Sun, and Y. Wang, “MPEG-4 rate control for multiple video
objects,” IEEE Trans. Circuits and Syst. Video Technol., vol. 9, pp. 186–199,
February 1999.
[47] N. Wang and Y. He, “A new bit rate control strategy for H.264,” in Proc. IEEE
Intl. Conf. Infor. Commun. and Signal Processing, December 2003, pp. 1370–
1373.
[48] Y. L. Wang, J. X. Wang, Y. W. Lai, and A. W. Y. Su, “Dynamic GOP
structure determination for real-time MPEG-4 Advanced Simple Profile video
encoder,” in Proc. IEEE Intl. Conf. Multimedia Signal Processing, July 2005,
pp. 293–296.
[49] T. Wiegand and B. Girod, “Lagrange multiplier selection in hybrid video coder
control,” in Proc. IEEE Intl. Conf. Image Processing, October 2001, pp.
542–545.
[50] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-
constrained coder control and comparison of video coding standards,” IEEE
Trans. Circuits and Syst. Video Technol., vol. 13, pp. 688–703, July 2003.
[51] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview
of the H.264/AVC video coding standard,” IEEE Trans. Circuits and
Syst. Video Technol., vol. 13, pp. 1–19, July 2003.
[52] J. Xu and Y. He, “A novel rate control for H.264,” in Proc. IEEE
Intl. Symp. Circuits and Systems, May 2004, pp. III–809–812.
[53] P. Yin and J. Boyce, “A new rate control scheme for H.264 video coding,” in
Proc. IEEE Intl. Conf. Image Processing, October 2004, pp. 449–452.
[54] Y. Yokoyama, “Adaptive GOP structure selection for real-time MPEG-2 video
encoding,” in Proc. IEEE Intl. Conf. Image Processing, September 2000, pp.
832–835.
[55] A. Yoneyama, Y. Nakajima, H. Yanagihara, and M. Sugano, “MPEG encod-
ing algorithm with scene adaptive dynamic GOP structure,” in Proc. IEEE
Intl. Conf. Multimedia Signal Processing, September 1999, pp. 297–302.
Abstract
In this research, we propose rate control algorithms with enhanced rate and distortion modeling for H.264 video in various applications.
Asset Metadata
Creator: Kwon, Do-Kyoung (author)
Core Title: Rate control techniques for H.264/AVC video with enhanced rate-distortion modeling
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/16/2006
Defense Date: 10/18/2006
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bit allocation, H.264/AVC, OAI-PMH Harvest, rate control
Language: English
Advisor: Kuo, C.-C. Jay (committee chair), Neumann, Ulrich (committee member), Ortega, Antonio (committee member)
Creator Email: dokwon@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m159
Unique identifier: UC188426
Identifier: etd-Kwon-20061116 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-29434 (legacy record id), usctheses-m159 (legacy record id)
Legacy Identifier: etd-Kwon-20061116.pdf
Dmrecord: 29434
Document Type: Dissertation
Rights: Kwon, Do-Kyoung
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu