ALGORITHMS AND ARCHITECTURES FOR ROBUST VIDEO TRANSMISSION
by
Jin-Gyeong Kim
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2003
Copyright 2003 Jin-Gyeong Kim
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 3116729
INFORMATION TO USERS
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.
UMI
UMI Microform 3116729
Copyright 2004 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-1695
This dissertation, written by
Jin-Gyeong Kim
under the direction of his dissertation committee, and
approved by all its members, has been presented to and
accepted by the Director of Graduate and Professional
Programs, in partial fulfillment of the requirements for the
degree of
DOCTOR OF PHILOSOPHY
Director
Date: August 12, 2003
Dissertation Committee
Dedication
To my parents
Acknowledgements
First of all, I would like to express my gratitude to my advisor Professor C.-C.
Jay Kuo for his support and help. His insights always made my vague ideas clear
and meaningful. I would also like to thank him for allowing me to have broad
experiences. I would like to thank Professor Antonio Ortega for giving me lots of
advice and encouragement and for serving on my dissertation committee, Professor
Bartlett W. Mel for serving on my dissertation committee, Professor Daniel C. Lee
and Dr. Changsu Kim for serving on my qualifying committee.
I would also like to thank Professor Jonwon Kim and Dr. Ioannis Katsavounidis
for their sincere and invaluable advice, and Dr. Younggook Kim and Dr. Lifeng
Zhao, who helped me a lot, including reviewing parts of my papers.
I would like to thank MPEG-4 team members including Dr. Alexander Kanaris,
Guowei Hsui, and Ron Mayor for sharing their experiences and inspirations. I also
want to take this opportunity to thank Changki Min and Kitae Nahm for their help.
I thank my wife for her love, sacrifice and help. I remember having the happiest
moments with her and my beloved daughter and son. I want to thank my mother
and father for their endless love and support.
I would like to thank Dr. Jongseok Park and Dr. Seungjong Choi of LG Electronics
Inc. for giving me an opportunity to study and for their support and encouragement.
I would also like to thank Mr. Steve Ro and Mr. Mike Ling of InterVideo Inc. for
giving me the opportunity to join a wonderful team and gain experience in MPEG-4
chip design.
Contents

Dedication ii
Acknowledgements iii
List Of Tables viii
List Of Figures ix
Abstract xiv

1 Introduction 1
1.1 Video Delivery System 2
1.2 Robust Video Transmission Techniques 4
1.3 Joint Source and Channel Coding 6
1.4 Distortion Estimation and Corruption Model 8
1.5 MPEG-4 IP with Configurable Processor 10
1.6 Contribution of the Research 14
1.7 Outline of the Thesis 16

2 Background and Review of Previous Work 18
2.1 Introduction 18
2.2 Overview of Robust Video Transmission Techniques 19
2.3 DiffServ Networks 21
2.4 Prioritization 23
2.5 Optimal Mode Selection 26
2.6 Conclusion 32

3 A Corruption Model for Robust Video Transmissions 35
3.1 Introduction 35
3.2 MB-level Corruption Model 38
3.2.1 Investigation of MB Error Propagation 38
3.2.2 Derivation of MB-level Corruption Model 42
3.3 RPI-Based Coordination for Network Adaptation 48
3.4 RPI Generation Based on the Proposed Corruption Model 54
3.5 Experimental Results 58
3.5.1 Verification of Proposed Corruption Model 60
3.5.2 RPI-based Coordination of Network Adaptation 64
3.6 Conclusion 67

4 An Integrated AIR/UEP Scheme for Robust Video Transmission with a Corruption Model 68
4.1 Introduction 68
4.2 Transmitter Structure for AIR/UEP Coordination 71
4.3 Joint AIR/UEP with Corruption Model 73
4.3.1 Adaptive Intra-Refresh (AIR) 73
4.3.2 Joint AIR/UEP Scheme 85
4.3.3 Analysis of Computational Complexity 85
4.4 Experimental Results 86
4.4.1 Computational Complexity Evaluation 90
4.5 Conclusion 92

5 Video Packet Categorization for Priority Delivery to Enhance End-to-End QoS Performance 93
5.1 Introduction 93
5.2 System Overview 97
5.3 Packet Categorization with Corruption Model 98
5.3.1 Data Partitioning 98
5.3.2 Macroblock-based Corruption Model 102
5.4 RPI Generation with Proposed Corruption Model 105
5.5 Experimental Results 108
5.5.1 Verification of Proposed Corruption Model 108
5.5.2 RPI-based Coordination of Network Adaptation 110
5.6 Conclusion 111

6 Instruction Design for MPEG-4 Video Codec for Configurable Processor 113
6.1 Introduction 113
6.2 MPEG-4 Codec Overview 117
6.2.1 MPEG-4 Overview 117
6.2.2 Computational Components 119
6.3 Two-Step Optimization 124
6.4 Platform Independent Software Optimization 125
6.5 Optimization for Configurable Processors 127
6.5.1 Hot Spot Analysis 127
6.5.2 Processor Configuration 131
6.5.3 Optimization with TIE 132
6.6 Optimization Results 142
6.7 Conclusion 143

7 Advanced Techniques for MPEG-4 Video Encoder Performance Enhancement 145
7.1 Introduction 145
7.2 Memory System of Embedded Processor 147
7.3 New Software Architecture for Embedded MPEG-4 Encoding System 153
7.4 Motion Estimation with Low Memory Bandwidth 156
7.4.1 Memory Map 156
7.4.2 Memory Map for DataRAM 158
7.4.3 DataRAM Addressing 160
7.4.4 SAD Calculation 162
7.4.5 Motion Compensation for Luminance Component 166
7.4.6 Motion Compensation for Chrominance Component 167
7.5 Optimization of DCT Coding 168
7.5.1 Forward DCT 169
7.6 Summary of Optimized Results 172
7.7 Conclusion 176

8 Conclusion 179

Bibliography 180
List Of Tables

4.1 Macroblock encoding parameters for the corruption model 78
4.2 Parameters for the AIR simulation 87
4.3 Parameters for the CIR simulation 88
6.1 List of computation elements 124
6.2 The performance of the optimized platform independent software codec 126
6.3 C codes for SAD calculations 135
6.4 C codes for SAD calculations with TIE 136
6.5 C codes for the half-pel frame calculation 137
6.6 C codes for the half-pel frame calculation with TIE 138
6.7 The TIE optimized result (Mcycles/sec) 143
6.8 The optimized IP specification 143
7.1 Performance comparison at memory latency = 22 CPU clock cycles 174
List Of Figures

1.1 The packet video delivery system employing the RPI-based corruption model 3
2.1 Video delivery over a DiffServ network 23
2.2 An illustration of constructing a dependence graph 27
2.3 The block inter-prediction used in the computation of distortion Dp 30
2.4 Error propagation in the joint source and channel coding 34
3.1 Error propagation of a lost MB 43
3.2 Parallel propagation: (a) the trajectory and (b) the equivalent linear system 44
3.3 Cascade propagation: (a) the trajectory and (b) the equivalent linear system 45
3.4 Recursive weight calculation 47
3.5 Illustration of the reference macroblock 56
3.6 The macroblock encoding algorithm with RPI generation 57
3.7 The trace algorithm for RPI generation 59
3.8 Effects of a single GOB loss with (a) GOB=4 and (b) GOB=5 from the 111th frame, respectively, where the frame number indicates the relative frame distance from the 111th frame 61
3.9 The distortion estimation for every 5th GOB of all frames 62
3.10 Correlation between the actual MSE and the estimated MSE by using the proposed corruption model 63
3.11 The effect of prediction dependency on overall quality degradation at different multiple packet loss rates 64
3.12 The QoS mapping 65
3.13 The PSNR comparison under 5% PLR 66
3.14 Performance comparison for schemes with and without RPI 66
4.1 The packet video delivery system employing the corruption model 72
4.2 Distortion estimation 75
4.3 The rate-distortion curves 82
4.4 The quantization stepsize vs. the Lagrangian multiplier 82
4.5 The quantization stepsize vs. the bit rate 82
4.6 The timing diagram for feedback in terms of the frame number 83
4.7 Modification of the error propagation path with feedback information 84
4.8 Computation of the propagation error 86
4.9 The PSNR performance of AIR as a function of PER 88
4.10 The PSNR performance of CIR as a function of PER 88
4.11 The PSNR performance comparison of AIR and CIR 89
4.12 The gain of error tracking with feedback according to delay (for PER = 0.05) 89
4.13 The gain of the joint AIR/UEP scheme 90
4.14 The encoding speed with RPI generation 91
4.15 The encoding speed with AIR 91
5.1 The wireless packet video delivery system employing the RPI-based corruption model 97
5.2 Classification for data partitioning 100
5.3 Packetization of a partitioned video bitstream 102
5.4 The packet loss effect of a packet of Class 3 (right) and Class 4 (left) 109
5.5 MSE distribution for packet classes 110
5.6 The packet loss rate for the unit packet cost 110
5.7 The total cost versus the average packet loss rate 112
5.8 Performance comparison for data partitioned and video packets 112
6.1 Illustration of VLSI implementation methods 115
6.2 Design flow for flexibility 117
6.3 The block diagram of the MPEG-4 encoder 118
6.4 The block diagram of the MPEG-4 decoder 118
6.5 The video encoding elements 120
6.6 The motion estimation process 123
6.7 The encoder performance profile 128
6.8 The SAD calculation 134
6.9 Half-pel frame calculation with TIE instructions 137
6.10 The motion compensation and reconstruction process 139
6.11 Quantization with multipliers and TIE instructions 140
6.12 The TIE instruction partition for the YUV-to-RGB conversion 141
7.1 The memory sub-system of the Xtensa processor 148
7.2 The direct mapped cache 150
7.3 Cache miss count vs. multiple frame access 151
7.4 The original encoder architecture 154
7.5 The new encoder architecture 154
7.6 The encoding flow chart 156
7.7 Illustration of the macroblock-based frame memory map 157
7.8 The DataRAM memory map for motion estimation 159
7.9 Loading of reference macroblocks 160
7.10 Update of reference macroblocks: (a) update for edge macroblocks and (b) update for non-edge macroblocks 160
7.11 Addressing of reference macroblock in DataRAM 162
7.12 SAD16 calculation with TIE instructions 163
7.13 Cycles for half-pel motion estimation with the old architecture 165
7.14 DCT coefficient coding and reconstruction 169
7.15 TIE instructions for 1-D FDCT processing 171
7.16 Cycles for motion estimation vs. memory latency 175
7.17 Cycles for motion compensation vs. memory latency 175
7.18 Cycles for FDCT vs. memory latency 176
7.19 Cycles for VLC encoding vs. memory latency 176
7.20 Cycles for encoding process vs. memory latency 177
7.21 Cycles for missed data cache allocation vs. memory latency 177
Abstract
The demand for transmitting video through the Internet and mobile channels has
increased significantly in the last several years. A compressed video bitstream is
vulnerable to even a small number of bit errors or packet losses. Joint source and
channel (JSC) coding techniques have been proposed to protect transmitted video
data, and most of them require an estimate of the distortion caused by packet loss.
In this research, a corruption model is proposed that can serve as a distortion
estimation tool to enhance the performance of various JSC coding techniques.
In developing the corruption model, the loss impact of each macroblock is analyzed
by taking into account error concealment, temporal dependency, and the loop
filtering effect. In addition, we extend the corruption model to cover more general
error propagation behaviors, and apply it to jointly coordinate adaptive intra refresh
(AIR) and unequal error protection (UEP) for robust video transmission.

Personal mobile communication systems with video transmission capability have
been enabled by extensive research on robust video compression and transmission.
However, the complexity of the compression algorithms has also increased
significantly, and the System-on-Chip (SoC) design for mobile devices is very
difficult due to this complexity and the associated verification problem. In this
research, we present a design approach that is more flexible while achieving high
performance. The Xtensa processor is a configurable embedded processor for SoC
that allows users to design their own instructions to accelerate computation.
Custom instructions are designed for the most computationally intensive modules
first. The instructions can take the form of either Single Instruction Multiple Data
(SIMD) or Multiple Instruction Single Data (MISD) instructions. Then, the
overall encoder software structure is optimized to avoid the penalty of cache misses
in frame memory access, based on an investigation of the memory system of the
embedded platform.
The improved MPEG-4 encoder software and processor can be used practically
for SoC designs that employ external memory as the main memory, which reduces
the system cost compared to the use of a large internal memory.
Chapter 1

Introduction
The ever-increasing demand for multimedia communications via the wired/wireless
Internet faces many challenges. One of them is quality degradation due to packet
loss as well as bandwidth fluctuation. The dependency between image frames makes
a compressed video stream vulnerable to even a small number of lost packets. To
address error resilience, the latest versions of ITU-T H.263+ [14] and ISO MPEG-4
have adopted several options to alleviate the corruption of compressed video over
error-prone channels. Examples include layered representation, re-synchronization,
error tracking, and error recovery options. Most research on robust transmission
techniques considers end-to-end quality degradation due to channel errors or packet
drops, so the estimation of distortion is an important issue. However, due to the
complexity of video compression techniques, it is difficult to analyze the quality
degradation caused by packet loss. In this research, we propose a corruption model
that can be used to calculate the distortion due to error propagation.
Furthermore, the complexity of coding algorithms has been growing with the
optional tools for coding efficiency and error resilience. At the same time, there
has been an increasing demand for implementing these complex algorithms in
mobile devices, which requires System-on-Chip (SoC) modules with low power
consumption. This is another challenge in multimedia communication.

In the first part of this thesis, robust video transmission systems are examined;
the necessity of the corruption model in robust video encoding techniques is
emphasized in the first half of this chapter. In the second part, issues in MPEG-4
video codec design for System-on-Chip (SoC) implementation are addressed; the
proposed optimization method is briefly introduced in the second half of this chapter.
1.1 Video Delivery System
A packetized video delivery system with the proposed corruption model is shown in
Figure 1.1. It assumes the existence of a network that supports prioritized
variable-rate delivery and an associated pricing mechanism. Since a video codec has
several options to trade compression efficiency for flexible delay manipulation,
error resiliency, and network friendliness, the coordination framework has to provide
a simplified interaction process between the video encoder and the target network.
By utilizing the corruption model in the source coder and the network adaptation
layer in an appropriate manner, the proposed delivery system can accommodate the
demand of each packet to achieve the best end-to-end performance while adapting
to fluctuating network conditions. For given source and channel characteristics,
efficient and optimal source and channel coding can be exploited under the
coordination framework. The effectiveness of the system, however, depends on the
accurate estimation of the distortion due to channel errors or packet drops, which
is a key issue investigated in this research.
Figure 1.1: The packet video delivery system employing the RPI-based corruption
model. (Diagram: the video source feeds a source encoder with unequal error
protection and the corruption model, followed by a packetizer; packets traverse the
delivery network, which introduces packet loss and delay, and reach the
de-packetizer, error correction, and source decoder to produce the reconstructed
video. RPI and network condition signals coordinate the components.)
The proposed delivery system consists of the video encoder, the packetizer, the
corruption model association module, the network adaptation module, the
underlying delivery network, the de-packetizer, and the video decoder. Compressed
video from the video encoder is packetized (possibly multiplexed with other media)
at the packetizer. When the input video sequence is encoded, error resilience
techniques can be exploited. The resulting video packets are then associated with a
Relative Priority Index (RPI) so that their impact on the video quality can be relayed
to the network adaptation module and the underlying delivery network. In the case
of DiffServ, each packet may be mapped to a different DS level, which in turn is
characterized by its degree of loss, delay, and bandwidth. Successfully delivered
packets are de-packetized, de-multiplexed, and decoded for rendering. In general,
any reliable transmission scheme can be assumed as long as it supports the
RPI-associated differentiation. For example, the approach can be applied to robust
video transmission with adaptive FEC-based protection [21] or DiffServ packet
forwarding [29].
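To make the idea of RPI-associated differentiation concrete, the following sketch maps a packet's RPI to one of several DiffServ service levels. The enum names, the normalized RPI range, and the thresholds are all illustrative assumptions of ours; the mapping actually used in this research (the QoS mapping of Figure 3.12) is developed in Chapter 3.

```c
/* Hypothetical sketch: map a packet's Relative Priority Index (RPI)
 * to one of four DiffServ service levels.  The RPI is assumed here to
 * be normalized to [0, 1], e.g. the packet's expected distortion
 * contribution divided by the largest such contribution in the frame.
 * The level names and thresholds are invented for illustration. */
typedef enum { DS_PREMIUM, DS_GOLD, DS_SILVER, DS_BEST_EFFORT } ds_level_t;

ds_level_t map_rpi_to_ds(double rpi)
{
    if (rpi >= 0.75) return DS_PREMIUM;     /* highest loss impact */
    if (rpi >= 0.50) return DS_GOLD;
    if (rpi >= 0.25) return DS_SILVER;
    return DS_BEST_EFFORT;                  /* lowest loss impact  */
}
```

A network adaptation module would call such a function per packet and set the corresponding DS field before forwarding.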
1.2 Robust Video Transmission Techniques
Data loss is unavoidable in error-prone channels such as mobile channels and the
Internet. A bitstream of image sequences compressed with Motion Compensated
Predictive (MCP) coding is inherently vulnerable: even a small number of errors
in the compressed bitstream can cause catastrophic quality degradation. In order
to overcome this problem, many robust video transmission techniques have been
proposed. Error resiliency can be improved in source encoding and/or channel
coding.

Error-resilient source coding techniques have been developed by the expert groups
behind state-of-the-art video coding standards such as MPEG-4 and H.263+/++.
In order to reduce error propagation due to inter-frame prediction, intra-coded
frames (i.e., I frames) or intra-coded macroblocks are inserted intentionally.
Because intra-coded macroblocks can be decoded without reference data, error
propagation can be stopped. However,
can be decoded without reference data, error propagation can be stopped. However,
4
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
the intra-coded macroblock requires more bits. Therefore, the quantization step size
should be increased to increase error resiliency by increasing the number of intra
macroblocks to preserve the same amount of bit rates. As a result, the quality of
reconstructed video is sacrificed for the error free case.
When a variable length code (VLC) is decoded, the decoder must keep track of the
code start point. Once code synchronization is lost, the remaining codes cannot be
decoded until the unique code inserted for synchronization purposes is encountered
again. In order to increase error resiliency, synchronization codes can be inserted
frequently at macroblock boundaries. However, the synchronization code is usually
the longest codeword, since it must be unique. Thus, inserting too many
synchronization codes decreases the quality of the reconstructed video in the
error-free case.
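The cost of frequent resynchronization can be made concrete with a small back-of-the-envelope calculation. The function below is our own illustration, not a formula from this thesis: it computes the share of the bit budget consumed by markers, given an assumed average macroblock size. For example, a 17-bit marker (roughly the minimum length of an MPEG-4 resync marker) inserted every 10 macroblocks of 100 bits each costs under 2% of the budget, while one marker per macroblock would cost about 15%.

```c
/* Illustrative sketch (not from the dissertation): fraction of the
 * total bit budget consumed by resynchronization markers when one
 * marker of `marker_bits` bits is inserted every `mbs_per_marker`
 * macroblocks, assuming an average of `avg_mb_bits` bits per
 * macroblock.  All parameter values used with it are hypothetical. */
double sync_overhead(int marker_bits, int avg_mb_bits, int mbs_per_marker)
{
    double payload = (double)avg_mb_bits * mbs_per_marker;
    return marker_bits / (payload + marker_bits);
}
```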
It is difficult to protect transmitted data sufficiently over the mobile channel
and the Internet due to hostile channel characteristics and heavy traffic,
respectively. Fortunately, channel conditions can be monitored in most transmission
systems, and adaptive protection techniques have been proposed. For example, if
the channel condition indicates a low SNR, a higher bit error rate is expected, and
the channel coder increases the number of redundant bits to provide stronger
protection. Consequently, the bandwidth available for the source bitstream decreases.
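The adaptive protection idea above can be sketched as a simple split of a fixed channel bandwidth between source bits and FEC redundancy. The SNR breakpoints and redundancy fractions below are invented for illustration; a real system would derive them from the channel code's performance curves for the target residual error rate.

```c
/* Hypothetical sketch of adaptive channel protection: as the measured
 * channel SNR drops (higher expected bit error rate), a larger share
 * of the fixed total bandwidth is spent on FEC redundancy, leaving
 * less for the source bitstream.  Breakpoints are illustrative only. */
typedef struct {
    int source_kbps;     /* bandwidth left for the video bitstream */
    int fec_kbps;        /* bandwidth spent on redundant FEC bits  */
} rate_split_t;

rate_split_t allocate_rates(int total_kbps, double snr_db)
{
    double fec_fraction;
    if (snr_db >= 20.0)      fec_fraction = 0.05;  /* clean channel   */
    else if (snr_db >= 10.0) fec_fraction = 0.20;
    else                     fec_fraction = 0.40;  /* hostile channel */

    rate_split_t split;
    split.fec_kbps = (int)(total_kbps * fec_fraction);
    split.source_kbps = total_kbps - split.fec_kbps;
    return split;
}
```

The source encoder must then be told the reduced `source_kbps` so that its rate control can adapt, which is exactly where joint source-channel coordination enters.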
If feedback information about specific packet losses is available, the source encoder
can estimate the reconstructed video in the decoder frame memory. This possibility
enables error tracking (ET) techniques.
In error-resilient techniques, coding efficiency is sacrificed to increase error
resiliency. Traditionally, source coding and channel coding are performed
separately: it is assumed that source coding eliminates redundancy perfectly, while
the channel coder inserts redundant data to protect the data for a given channel
capacity. It is well known that the importance of each bit in the compressed
bitstream differs, and the channel coder should protect the bitstream according to
the importance of the data via unequal error protection (UEP). Channel coding
alone, however, is not sufficient in error-prone channels; the source encoder has to
contribute to error resiliency even though its coding efficiency is sacrificed. It has
been shown that end-to-end quality can be improved with combined source and
channel coding schemes [32], called joint source-channel (JSC) coding techniques.
In the following section, more details of the JSC technique will be described.
1.3 Joint Source and Channel Coding
As described in Section 1.2, there is a trade-off between error resiliency and coding efficiency. When intra macroblock refresh is adopted for error resiliency, the percentage of intra refresh macroblocks can be optimized to minimize the end-to-end quality degradation for a given channel condition. The simplest
intra refresh technique is the Cyclic Intra macroblock Refresh (CIR) technique, in which the order of intra refresh macroblocks is determined independently of the content, while each spatial position is refreshed once per cycle. The refresh cycle can only be controlled by an input parameter. However, CIR cannot stop error propagation efficiently, since the positions of the intra-coded macroblocks have no relationship with the error propagation mechanism.
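The content-independent behavior of CIR can be sketched in a few lines. The rotating raster-order window below is an illustrative assumption, not the exact scheme of any particular encoder:

```python
def cir_refresh_set(frame_index, num_mbs, refresh_cycle):
    """Macroblock indices forced to intra in the given frame.

    Minimal CIR sketch: the whole frame is refreshed once every
    `refresh_cycle` frames, so ceil(num_mbs / refresh_cycle) macroblocks
    are intra-coded per frame in a rotating raster-order window,
    regardless of where channel errors actually occur.
    """
    per_frame = -(-num_mbs // refresh_cycle)   # ceiling division
    start = (frame_index * per_frame) % num_mbs
    return [(start + i) % num_mbs for i in range(per_frame)]
```

For a QCIF frame (99 macroblocks) and a cycle of 10 frames, 10 macroblocks are refreshed per frame, and every position is refreshed at least once per cycle, whether or not it was hit by an error.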
The Adaptive Intra macroblock Refresh (AIR) technique provides a more efficient method of selecting intra refresh macroblocks, based on the expected distortion, which is measured by taking into account both the source coding error and the channel error. The expected distortion relies on the Packet Erasure Rate (PER) of the channel as well as the source characteristics. Therefore, AIR can be classified as one of the JSC techniques. The relationship between channel errors and the quality degradation of reconstructed video should be investigated to calculate the expected distortion. After the expected distortion is calculated for each macroblock, the best encoding mode is selected from the three possible coding modes (i.e. intra, inter, or skip). The optimization is performed to minimize the overall end-to-end quality degradation due to the packet loss effect in the channel.
UEP is performed based on the packet loss impact or delay sensitivity of the end-to-end quality degradation. If a packet is important, it is protected more strongly. The problem in the design of UEP schemes is how to measure the sensitivity for each
packet. Once such a measurement is available for every packet, the best protection level for each packet can be determined to minimize the end-to-end distortion.
If feedback information is available for the delivery of each packet, the reconstructed video in the decoder memory can be reproduced in the encoder by performing decoding on the erroneous bitstream. This is called the error tracking (ET) technique. If the feedback time is short enough, the encoder can refresh macroblocks corrupted by packet loss by coding them in the intra mode. However, the actual decoding has to be performed every time packet loss information arrives. As a result, ET requires a very high computational complexity and a large memory.
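The bookkeeping behind ET can be illustrated at the macroblock level. This is a simplified sketch: real ET re-runs the actual decoder on the erroneous bitstream, and the per-macroblock reference representation below is an assumption made for illustration:

```python
def track_errors(lost_mbs, mv_refs, num_frames, mbs_per_frame):
    """Propagate a per-macroblock corruption flag across frames.

    `lost_mbs[f]` is the set of macroblocks lost in frame f (known from
    feedback), and `mv_refs[f][mb]` gives the macroblock in frame f-1
    used to predict mb (None means intra-coded). A macroblock is marked
    corrupted if it was lost, or if it is inter-coded from a corrupted
    reference, so intra coding stops the propagation.
    """
    corrupted = [set() for _ in range(num_frames)]
    for f in range(num_frames):
        corrupted[f] |= lost_mbs.get(f, set())
        if f == 0:
            continue
        for mb in range(mbs_per_frame):
            ref = mv_refs[f][mb]
            if ref is not None and ref in corrupted[f - 1]:
                corrupted[f].add(mb)
    return corrupted
```

In the tiny example below, macroblock 1 is lost in frame 0 and drags its inter-coded successor in frame 1 with it; an intra refresh in frame 2 stops the propagation.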
In the various JSC techniques described above, the estimation of the distortion caused by channel errors is critical to the end-to-end quality enhancement. However, distortion estimation for packetized compressed video is not easy, since the decoding process is very complex. Besides, once an error occurs in a decoded frame, it propagates along successive frames, so that calculating the packet loss effect accurately may require a high computational complexity.
1.4 Distortion Estimation and Corruption Model
Different distortion estimation methods have been proposed. In order to prioritize
packetized video data, syntactical analysis was performed. For example, the header
information is treated as the most important data, while high-frequency AC coefficients are classified as the lowest priority. The encoded bitstream is packetized with a data partitioning method. However, the decoder has error concealment capability, in which the macroblocks of a lost packet can be reconstructed by using information from correctly received macroblock data. If the error concealment is successful, the introduced error can be small even when the important header information is lost.
Some researchers attempted to estimate the packet loss impact based on encoding parameters such as the strength of the motion vector or the temporal dependency count, i.e. the number of macroblocks that refer to the macroblock under consideration. More recently, interesting distortion estimation methods called optimal mode selection techniques have been proposed by Zhang [41] and Cote [5]. In these methods, the initial error due to packet loss is calculated by exploiting the error concealment algorithm, and the distortion due to the propagation error is also taken into account. Even though their methods provide relatively accurate estimates compared to syntactical analysis, the error propagation behavior is still not well considered. Furthermore, these methods are neither practical nor flexible due to their high complexity and limited accuracy.
1.5 MPEG-4 IP with Configurable Processor
There is an increasing demand for multimedia mobile devices such as mobile phones, Personal Digital Assistants (PDA), and Digital Still Cameras (DSC). Video compression and decompression techniques such as H.263 and MPEG-4 are used in those devices for visual communication and/or carrying video clips. Even though the supported video formats are small, such as QCIF (176x144) or CIF (352x288), the required computational power is tremendous because complex algorithms are used to achieve a high compression ratio and error resiliency. On the other hand, mobile devices have a stringent low power consumption requirement to increase the battery lifetime. The System-on-Chip (SoC) design provides a potential for low power consumption as well as a small system size. Many implementation methods have been proposed. The objective of these implementations is to design an SoC for higher performance with minimum resources such as the silicon area and the internal and external memory sizes. Thus, hardware and software optimizations are required to achieve this goal.
With the experience learned from the design of previous video compression standards such as MPEG-1, MPEG-2, and H.263/H.263+, implementational flexibility is an important factor of concern. Since the traditional hardwired design is less flexible, the processor-based implementation is a preferred choice; it is also the general trend. VLSI implementations can be categorized into three types, i.e. hardwired, DSP-based, and hybrid. To achieve a higher performance with flexibility, the hybrid
architecture has been proposed, where some operation-intensive software functions are implemented with hardwired blocks while other functions of a lower complexity are implemented in software on a general purpose processor (GPP) [26]. The development time of the hybrid architecture is, however, much longer than that of the DSP-based implementation due to the interface between hardwired blocks and the GPP.
The implementation of MPEG-4 on a high performance DSP (e.g. the TI TMS320 series) allows rapid development, since every function is implemented in software with an instruction set accelerated for multimedia processing while keeping a flexible software structure [3]. The MPEG-4 video compression standard is widely used in different applications, and its video format varies with these applications. Therefore, the best VLSI architecture for each application can be different. The number of operations increases as the amount of data for processing increases, while the types of operations and functions remain the same. For wireless applications where a small video format is used, the DSP-based architecture has more advantages, since instructions can be shared by different functions and fewer hardware interfaces are required. It is worthwhile to point out that a DSP processor like the TI TMS320C6x is not designed only for video compression. Therefore, many of its instructions are unused in MPEG-4 video processing. One potential shortcoming of this solution is that the processor may have too many unused components for the video coding application, and the cost of the high performance DSP could be too high to be attractive.
On the other hand, special instructions for multimedia processing have been added to general purpose processors. This type of processor architecture includes Intel x86's MMX [27] and Streaming SIMD [34], PowerPC's AltiVec [6], and AMD's 3DNow! [24]. The multimedia instructions in this category take advantage of subword parallelism, which performs the same operation on multiple data elements in a multimedia register in parallel. This is also called the Single Instruction Multiple Data (SIMD) architecture. Depending on how wide the multimedia register is, the speed is multiplied by the number of data elements in the register. However, since the instruction set is designed for many different kinds of multimedia processing, the operation types are limited to basic arithmetic and logical operations. In addition, there are some overhead operations for data alignment. Thus, the performance improvement is not as significant as true parallel processing despite the huge multimedia instruction set. For some applications, most multimedia instructions remain unused.
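The subword-parallelism idea can be emulated directly. The following sketch performs four independent 8-bit wrap-around additions inside a single 32-bit integer operation, which is essentially what a packed-add SIMD instruction does in one cycle:

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words, SIMD-style.

    One 32-bit addition updates all four lanes at once; the masking
    keeps each lane's carry from spilling into its neighbour
    (wrap-around, i.e. non-saturating, arithmetic).
    """
    high = 0x80808080                             # top bit of every lane
    low = (a & ~high & 0xFFFFFFFF) + (b & ~high & 0xFFFFFFFF)
    return (low ^ ((a ^ b) & high)) & 0xFFFFFFFF
```

For example, adding the lanes (0x01, 0xFF, 0x7F, 0x80) and (0x01, 0x01, 0x01, 0x80) yields (0x02, 0x00, 0x80, 0x00): the 0xFF and 0x80 lanes wrap around without disturbing their neighbours.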
To conclude, the existing DSP processors and GPPs with multimedia instruction sets may not be suitable for the SoC implementation of video compression and decompression applications, while an optimally selected or designed instruction set is desirable for SoC implementation. A new configurable embedded processor called the Xtensa has recently been developed by Tensilica [36]. The Xtensa processor has a rich set of configuration parameters such as interface options, memory subsystem options, and instruction options. In addition, the Tensilica Instruction
Extension (TIE) language is used to describe new instructions, new registers, and execution units that are then automatically added to the Xtensa processor. Thus, unlike the situation with a general DSP processor, we can add only the instructions necessary for video processing. To keep high flexibility, it is better to keep the core functions in software. Instruction-level hardware optimization with an embedded processor is adequate for SoCs for portable devices. In the second half of this thesis, we describe the implementation of the MPEG-4 video encoder and decoder using the configurable embedded processor.
Many constraints limit the performance of SoCs for mobile devices. For low power design, the internal memory size is limited, as is the external memory bus speed. Consequently, the memory bandwidth is much lower than in an SoC system without the power consumption constraint. In general, the cache size is much smaller than one frame memory of even the smallest video format (QCIF), so there is a significant number of data cache misses when video data are processed. We observed that the performance is significantly degraded by lowering the memory bandwidth. In order to solve this memory bandwidth dependency problem, we propose new software and hardware architectures. First, the software architecture is re-structured so that each function accesses a minimal number of frames. Second, a macroblock-based frame memory map is used instead of the line-based frame memory map. The macroblock-based frame memory map improves cache utilization because the encoding and the decoding processes are performed in the macroblock order, not in
the line order. Third, TIE instructions are designed to calculate addresses for the macroblock-based frame memory, which requires more complex address calculation than the line-based frame memory map. Fourth, a small internal data RAM is used to store the most frequently used data. This greatly reduces external frame memory accesses, so that the number of cache misses is reduced.
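The extra address arithmetic that the TIE instructions accelerate can be seen by comparing the two layouts. The 16x16 macroblock size is standard; the exact packing below is an illustrative assumption about the memory map:

```python
MB = 16  # macroblock width/height in pixels

def line_based_offset(x, y, width):
    """Offset of luma pixel (x, y) in a raster (line-based) frame map."""
    return y * width + x

def mb_based_offset(x, y, width):
    """Offset of luma pixel (x, y) when the frame is stored one 16x16
    macroblock at a time, each macroblock occupying 256 contiguous
    bytes. A whole macroblock then spans a few cache lines instead of
    16 widely separated rows, at the cost of the extra divisions and
    moduli below -- the part worth moving into a TIE instruction.
    """
    mbs_per_row = width // MB
    mb_index = (y // MB) * mbs_per_row + (x // MB)
    return mb_index * MB * MB + (y % MB) * MB + (x % MB)
```

In a QCIF frame (width 176), pixel (17, 1) sits at raster offset 176 + 17 = 193, but at macroblock-map offset 256 + 17 = 273, inside the second macroblock's contiguous 256-byte block.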
The number of issued instructions for MPEG-4 encoding and decoding is reduced dramatically with TIE instructions. Furthermore, cache utilization is improved significantly with the new software and hardware architecture for low memory bandwidth. With the proposed optimization schemes, the MPEG-4 encoder can run at only 34 Mcycles/sec for QCIF video at 15 frames per second. Compared to the original simulation result, this is a factor-of-10 performance improvement with the same memory bandwidth. The performance gain can be reproduced in different image and video compression and decompression standards such as JPEG, JPEG 2000, H.263, MPEG-1/2/4, and H.264.
1.6 Contribution of the Research
This research has the following contributions.
• Proposal of an accurate corruption model.
An accurate corruption model is proposed as a tool to facilitate various JSC techniques. The error introduced by packet loss is investigated by
incorporating the error concealment algorithm. Furthermore, the error propagation behavior is analyzed so that the distortion transferred from the reference macroblock to the predicted macroblock can be calculated without much loss of accuracy. The accuracy of the proposed corruption model is demonstrated by experimental results.
• Development of a low-complexity algorithm for corruption model computation.
The computational complexity of a practical corruption model cannot be high. It is shown analytically and experimentally that the computational overhead is actually small.
• Practical applications of the proposed corruption model.
The proposed corruption model can be used in various JSC coding techniques to improve the end-to-end quality. Three applications of the proposed corruption model are studied in detail:
1. UEP with the DiffServ network,
2. Optimal mode selection for AIR and ET,
3. Robust video delivery with coordinated UEP, AIR, and ET to minimize
end-to-end quality degradation.
It is demonstrated by extensive experimental results that JSC techniques in a
video delivery system can be enhanced with the proposed corruption model.
• Proposal of an optimal instruction set for the MPEG-4 video codec.
The TIE instruction set is designed to improve performance with a very small gate count. This instruction set is suitable for the SoC design of mobile devices because the power consumption can be reduced with a lower required clock speed and fewer redundant instructions.
• A new software and hardware architecture for low memory bandwidth systems.
A macroblock-based frame memory map is proposed to improve cache utilization. Some TIE instructions are designed to speed up the address calculation for the new memory map. The software architecture is re-structured to reduce the overall frame memory access. The internal data RAM is utilized to avoid frequent external memory access.
1.7 Outline of the Thesis
The rest of this thesis is organized as follows. In Chapter 2, we will survey previous work on the corruption model, the DiffServ network, and UEP. Furthermore, the joint source-channel techniques, in which distortion estimation is performed, will also be discussed. In Chapter 3, the corruption model is derived from an analytical model, and an application with UEP will be shown. In Chapter 4, AIR and ET applications with the proposed corruption model will be examined, and a coordinated UEP, AIR, and ET scheme will be described. The video packet
categorization method for a data partitioned video bitstream will be described in Chapter 5. The MPEG-4 video codec design with a configurable processor will be carried out using TIE instruction design in Chapter 6. Further improvement with the new software and hardware architecture for a low memory bandwidth system will be presented in Chapter 7. Some concluding remarks are given in Chapter 8.
Chapter 2
Background and Review of Previous Work
2.1 Introduction
This chapter provides some background for the problem of our interest and a review of previous work in this field. An overview of robust video transmission systems and techniques will be briefly given in Section 2.2. Following that, the DiffServ network will be examined in Section 2.3 as a network channel that provides UEP for each packet according to the packet importance. Then, two JSC techniques will be reviewed in detail in Sections 2.4 and 2.5 from the viewpoint of distortion estimation under packet loss. In these techniques, corruption models are used to calculate or estimate the distortion due to packet loss.
The DiffServ network has a packet forwarding mechanism that provides different levels of reliability [29]. This requires packet categorization/prioritization to exploit the differentiated services. The JSC techniques under our consideration include both UEP with packet prioritization/categorization and the optimal mode selection
with distortion estimation. In addition, error tracking techniques will be reviewed for network environments in which a feedback channel is available.
2.2 Overview of Robust Video Transmission Techniques
In this section, we provide an overview of robust video transmission systems and
techniques. More details under each topic will be reviewed in subsequent sections.
To achieve UEP, packet prioritization is the first step, and it can be performed in different ways. The layered coding technique provides the capability to generate bitstreams of different importance levels. A single-layer bitstream can also be partitioned into units of different importance by taking into account syntactical dependency and the type of parameters (e.g. the header, motion vectors, DC/AC coefficients, etc.). A more meaningful categorization can be performed by considering the end-to-end quality degradation incurred by a lost packet. The distortion can be quantified based on the motion vector strength or temporal dependency in the prediction loop. The most accurate distortion estimation can be performed via error propagation analysis.
By forcing the insertion of intra-coded macroblocks, error propagation can be mitigated. The adaptive intra macroblock refresh (AIR) method can stop error propagation more effectively via optimal mode selection based on the expected distortion due to packet loss. There are two different methods to calculate the expected
distortion. Zhang et al. proposed pixel-level distortion estimation [41], and Cote et al. investigated macroblock-based distortion estimation [5]. With the expected distortion, the rate-distortion (R-D) optimization method can be applied to select the optimal mode that minimizes quality degradation for a given channel condition. If feedback information about packet loss is available, the actual distortion can be calculated with the error tracking method. Then, the actual distortion can be used in the optimization process. However, the computational complexity increases and the accuracy decreases as the feedback time increases.
All the techniques described above take the distortion due to packet loss into account. Different distortion estimation methods are used in these techniques, and each method has different characteristics in its accuracy and computational complexity. In particular, syntactical dependency is not directly relevant to the actual packet loss effect, and the distortion estimated from the motion vector strength might differ from the actual distortion. For instance, even though the strength of the motion vector is large, the loss effect can be small if the error concealment scheme performs well. In the case of optimal mode selection, the initial error and temporal dependency have been considered. However, the prediction loop filtering has not yet been studied. To estimate the packet loss impact and the expected distortion, we should investigate how packet loss affects the reconstructed video quality at the decoder, and then derive a corruption model accordingly.
2.3 DiffServ Networks
As an advanced network system, the DiffServ network has been introduced to provide quality of service (QoS). It is difficult for the current Internet to achieve QoS due to its best effort (BE) service mechanism. In BE networks, the service quality is difficult to maintain due to unpredictably varying channel bandwidth. In order to guarantee a certain level of QoS, many advanced network models have been proposed. QoS for multimedia transmission is characterized by parameters such as the end-to-end delay (EED), the end-to-end delay variation (EEDV), and the end-to-end bit error rate (BER) [13].
The DiffServ network achieves QoS with routers that forward packets with different per-hop behaviors (PHB) [10]. The PHB is applied according to the DiffServ (DS) level carried in the Internet Protocol (IP) header of each packet. The DS level is marked when the packet enters the DiffServ network.
DiffServ networks can be categorized into absolute service differentiation and relative service differentiation. The absolute service guarantees QoS regardless of background traffic, and can be achieved with admission control. Relative service differentiation provides dynamic resource allocation to maintain a proportional quality gap among DS levels under varying traffic conditions. Relative service differentiation is preferred for video transmission over IP networks since it provides a more flexible QoS mechanism.
Video delivery over DiffServ networks has been extensively studied by Shin et al. [29]. In their proposed solution, the video gateway (VG) is placed at the entrance to the DiffServ network and provides dynamic QoS mapping onto DS levels from a relative priority index (RPI), which indicates the sensitivity of a packet to loss and/or delay. Such a video delivery system is illustrated in Figure 2.1, where the VG forwards packets with the QoS mapping to the DiffServ network. In the user's terminal, the input video is compressed and packetized. For each packet, the relative loss index (RLI) and the relative delay index (RDI) are calculated from the sensitivity measures for the loss and the delay effects, respectively. In the sensitivity measure, the source and the encoding characteristics are taken into account. The RLI and RDI values are mapped to the RPI value according to a certain mapping algorithm. Finally, the RPI is categorized into k levels.
In the VG, a packet with RPI level k is mapped to DS level q dynamically according to the distribution of k and the traffic condition. Then, the packet is marked with q and passed to the DiffServ network. In the DiffServ network, each packet is forwarded by routers with a different PHB according to its DS level. The network traffic condition can be fed back to the VG, and the mapping mechanism can adapt to the traffic condition. As a result, packets are delivered with different delay characteristics and loss rates according to their sensitivity. In order to extract accurate RLI and RDI values, the encoded bitstream should be analyzed based on the loss sensitivity and the delay sensitivity. In the following section, prioritization methods will be reviewed.
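The mapping from sensitivity indices to a categorized RPI might look as follows. The weighting and the uniform quantization are illustrative assumptions, not the actual algorithm of Shin et al.:

```python
def to_rpi_level(rli, rdi, w_loss=0.7, num_levels=4):
    """Combine normalized loss/delay sensitivities into an RPI class.

    Assumes rli and rdi are already normalized to [0, 1]. A weighted
    sum favouring loss sensitivity is quantized uniformly into
    num_levels priority classes (0 = least sensitive).
    """
    rpi = w_loss * rli + (1.0 - w_loss) * rdi
    return min(int(rpi * num_levels), num_levels - 1)
```

The VG would then map these k levels onto DS levels dynamically according to the observed traffic condition.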
Figure 2.1: Video delivery over a DiffServ network.
2.4 Prioritization
The purpose of prioritization is to provide a packet delivery scheme with the sensitivity of each packet to transmission delay or loss. The sensitivity of a packet means the end-to-end quality degradation due to its loss or delay. The provided sensitivity can be used to classify packets, and classified packets are transmitted in different manners to minimize the end-to-end quality degradation under given network resources. Network resources are limited in terms of delay, bandwidth, or loss rate, as discussed in the previous section. In this section, techniques to classify packets of a compressed video bitstream will be reviewed.
Tan and Zakhor [31] proposed video packet classification schemes for streaming MPEG video over delay- and loss-differentiated networks. They separate the sensitivity of video packets into delay sensitivity and loss sensitivity. The purpose of the classification is to provide a priority index for each packet to the DiffServ network. For the delay sensitivity, frame encoding modes are taken into account: the highest delay sensitivity is given to the intra-coded frames (I frames), the next priority is assigned to the predicted frames (P frames), and the bidirectionally interpolated frames (B frames) have the lowest priority. Under their framework, five different classification methods were proposed for loss sensitivity, as described below.
• Uniformity.
This scheme treats the whole MPEG bitstream as homogeneous and assigns equal importance to all bits. It is used as the baseline for comparison.
• Frame Level Reference.
The importance of all slices in a specific frame is equal to the number of frames that reference it. For example, every B frame has a reference count of one, since no frame besides itself refers to it. On the other hand, an I frame has a reference count equal to the sum of the numbers of P frames and B frames before the next I frame is encoded.
• Slice Level Reference.
In this scheme, if a macroblock in frame i is predicted from K macroblocks in a previously coded frame j, it gives a contribution of 1/K to each of its references.
The importance of each slice is then computed as the average reference count of its macroblocks.
• Motion.
This scheme assigns more importance to the motion information than to the
texture.
• Motion plus Slice Level Reference.
This scheme assigns motion vectors the highest importance. The importance
of the texture information is ordered by the method of Slice Level Reference.
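As an illustration, the Frame Level Reference scheme above can be sketched for a GOP pattern given in coding order. Counting a frame as referencing itself and ignoring display-order subtleties of B frames are simplifying assumptions:

```python
def frame_reference_counts(gop):
    """Frame-level reference counts for a GOP string such as 'IBBPBBP'.

    A B frame counts 1 (only itself depends on it); an I or P frame
    counts itself plus every later frame up to, but not including, the
    next I frame, since those frames are predicted from it directly or
    transitively.
    """
    counts = []
    for i, ftype in enumerate(gop):
        if ftype == 'B':
            counts.append(1)
            continue
        span = 1                      # the frame itself
        for later in gop[i + 1:]:
            if later == 'I':
                break
            span += 1
        counts.append(span)
    return counts
```

For 'IBBPBBP' this yields [7, 1, 1, 4, 1, 1, 1]: losing the I frame touches all seven frames, losing the first P frame touches four, and losing any B frame touches only itself.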
After categorization, packets are transmitted through a DiffServ network in which two DS levels are assigned 0.3% and 3% packet loss rates, respectively. Simulation experiments were performed under a low-loss traffic level, and the results show that the motion plus slice level reference categorization method gives the best performance. The results also show that a more accurate analysis can lead to a better performance in the DiffServ network.
Even though the above packet categorization schemes provide some gain in performance, there may be some difference between the actual loss sensitivity and the estimated sensitivity due to the accuracy limitation of the model. It is well known that the motion vector is more important than the texture data. However, if error concealment is performed well in the decoder, the motion vector may be derived from surrounding macroblocks, so that the loss sensitivity may not be affected by the loss of the motion vector information for some macroblocks. Since the error often
propagates to more than one frame, the one-frame slice level reference may lose accuracy as well. To increase the accuracy, the error concealment scheme in the decoder should be taken into consideration, and the number of frames used in the reference count should be increased.
The loss impact of a macroblock was measured by counting the temporal dependency in the work of Willebeek-LeMair et al. [38]. The purpose of this measurement is to selectively update the macroblocks that have the most impact on later frames; by doing so, the error recovery time can be reduced. To measure the temporal dependency, the encoded bitstream is analyzed and a dependency graph is constructed by using motion vectors, so that an importance measure can be assigned to a macroblock by counting the number of macroblocks depending on it. The temporal dependency count of macroblocks is shown in Figure 2.2. Based on this measurement, intra macroblocks can be inserted effectively. As a result, the error recovery time can be 35-65% faster than with a static update scheme.
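The dependency counting illustrated in Figure 2.2 can be sketched as a backward pass over per-macroblock reference lists; the list-of-references representation below is an assumption made for illustration:

```python
def dependency_counts(refs_per_frame):
    """Count how many later macroblocks depend on each macroblock
    transitively through motion-compensated prediction.

    `refs_per_frame[f][mb]` lists the macroblocks in frame f-1 that mb
    in frame f is predicted from (empty for intra-coded). Counts are
    accumulated back to front, as in the dependency graph of Figure 2.2.
    """
    num_frames = len(refs_per_frame)
    counts = [[0] * len(frame) for frame in refs_per_frame]
    for f in range(num_frames - 1, 0, -1):
        for mb, refs in enumerate(refs_per_frame[f]):
            for ref in refs:
                # A reference inherits this block plus everything
                # already counted as depending on it.
                counts[f - 1][ref] += 1 + counts[f][mb]
    return counts
```

Macroblocks with the largest counts are the most valuable candidates for forced intra refresh, since updating them cuts off the longest propagation chains.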
2.5 Optimal Mode Selection
The intra macroblock refresh scheme can provide error resiliency with a smaller encoder buffer size, which leads to a shorter encoding delay than intra frame refresh. Thus, intra macroblock refresh has been used in low bit rate video streaming. However, to increase error resiliency with intra macroblock refresh, more bits are needed
Figure 2.2: An illustration of constructing a dependence graph.
so that visual quality is sacrificed when there is no error. Many adaptation algorithms for intra macroblock refresh have been introduced. The latest adaptive techniques exploit rate-distortion (R-D) optimization. R-D optimization is a popular method in the image and video compression area, in which the distortion due to quantization noise and the corresponding rate are varied by selecting different quantization step sizes. For a given bit budget, the optimal bit allocation problem for each macroblock can be solved with R-D optimization [25], [30]. The rate-distortion curve gives the smallest distortion level for a given rate, and the coding parameters for each encoding unit can be determined by using the Lagrangian approach.
On the other hand, the optimal bit allocation problem over noisy channels was studied by Tanabe and Farvardin [32]. In their work, the distortion is calculated by taking the noisy channel effect into account. R-D optimization with the distortion
due to packet loss was examined by Zhang [41] and Cote [5]. They each developed their own distortion estimation methods to estimate the expected distortion for a macroblock due to the propagation error. The estimated distortion is then used to select the best macroblock encoding mode that minimizes the following Lagrangian function:

J_{mode} = (1 - p) \cdot D_p(mode) + p \cdot D_e(mode) + \lambda_{mode} \cdot R(mode),    (2.1)

where p is the packet erasure rate (PER), D_p is the distortion due to the propagation error, and D_e is the distortion due to error concealment. In both cases, the initial error is calculated by incorporating the error concealment algorithm of the decoder. However, the distortion due to the propagation error is calculated in a different manner.
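The mode decision implied by (2.1) can be sketched as follows. The per-mode statistics here are hypothetical inputs; a real encoder would obtain D_p, D_e, and R from its own measurements:

```python
def select_mode(p, lam, stats):
    """Pick the mode minimizing J = (1-p)*Dp + p*De + lam*R (Eq. 2.1).
    stats: dict mapping a mode name to the tuple (Dp, De, R)."""
    def cost(mode):
        Dp, De, R = stats[mode]
        return (1 - p) * Dp + p * De + lam * R
    return min(stats, key=cost)
```

Note how the packet erasure rate p shifts the balance: as p grows, the error-concealment distortion D_e dominates and the cheaper-but-fragile inter mode loses out to intra coding.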
In Zhang's paper [41], an algorithm was developed for recursive computation of the overall distortion of a decoder-reconstructed frame at pixel-level precision. The distortion is written as

d_n^i = E\{(f_n^i - \tilde{f}_n^i)^2\} = (f_n^i)^2 - 2 f_n^i E\{\tilde{f}_n^i\} + E\{(\tilde{f}_n^i)^2\},    (2.2)

where f_n^i is the original value of pixel i in frame n, and \tilde{f}_n^i is its reconstruction at the decoder. The value of d_n^i is calculated for the intra mode and the inter mode, respectively, for a
given packet erasure rate p. The distortion of an intra-coded macroblock can be calculated by using the following equations:

E\{\tilde{f}_n^i\} = (1 - p)\,\hat{f}_n^i + p(1 - p)\,E\{\tilde{f}_{n-1}^k\} + p^2\,E\{\tilde{f}_{n-1}^i\},    (2.3)
E\{(\tilde{f}_n^i)^2\} = (1 - p)\,(\hat{f}_n^i)^2 + p(1 - p)\,E\{(\tilde{f}_{n-1}^k)^2\} + p^2\,E\{(\tilde{f}_{n-1}^i)^2\},    (2.4)

where pixel k in the previous frame is used if the current packet is lost and the concealment motion vector associates the current pixel i with pixel k in the previous frame. If the packet that includes pixel k is lost too, then pixel i in the previous frame is used for the concealment task.
For the inter macroblock case, the propagation error should be considered even for the error-free case. If pixel i in the current frame references pixel j in the previous frame, the reconstructed pixel i will have the expected value of pixel j plus the residual error \hat{e}_n^i. Finally, the expected distortion can be calculated by using the following equations:

E\{\tilde{f}_n^i\} = (1 - p)(\hat{e}_n^i + E\{\tilde{f}_{n-1}^j\}) + p(1 - p)\,E\{\tilde{f}_{n-1}^k\} + p^2\,E\{\tilde{f}_{n-1}^i\},    (2.5)
E\{(\tilde{f}_n^i)^2\} = (1 - p)\,E\{(\hat{e}_n^i + \tilde{f}_{n-1}^j)^2\} + p(1 - p)\,E\{(\tilde{f}_{n-1}^k)^2\} + p^2\,E\{(\tilde{f}_{n-1}^i)^2\}.    (2.6)
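One step of the recursions (2.3)-(2.6) can be sketched as below; the function and variable names (E1_prev, E2_prev, the concealment index k) are illustrative choices, not taken from [41]:

```python
# E1_prev[i] and E2_prev[i] hold E{f~} and E{(f~)^2} for the previous frame.
# f_hat is the encoder reconstruction; e_hat the known prediction residual.

def intra_pixel(p, f_hat, E1_prev, E2_prev, i, k):
    """Eqs. (2.3)-(2.4): packet received (prob 1-p) -> f_hat; lost but
    concealment pixel k available -> E{f~_{n-1}^k}; both lost -> co-located
    pixel i of the previous frame."""
    e1 = (1 - p) * f_hat + p * (1 - p) * E1_prev[k] + p * p * E1_prev[i]
    e2 = (1 - p) * f_hat**2 + p * (1 - p) * E2_prev[k] + p * p * E2_prev[i]
    return e1, e2

def inter_pixel(p, e_hat, E1_prev, E2_prev, i, j, k):
    """Eqs. (2.5)-(2.6): the received case adds the known residual e_hat to
    the (random) reference pixel j of the previous frame."""
    # E{(e + f~_j)^2} = e^2 + 2*e*E{f~_j} + E{f~_j^2}, since e_hat is known
    cross = e_hat**2 + 2 * e_hat * E1_prev[j] + E2_prev[j]
    e1 = (1 - p) * (e_hat + E1_prev[j]) + p * (1 - p) * E1_prev[k] + p * p * E1_prev[i]
    e2 = (1 - p) * cross + p * (1 - p) * E2_prev[k] + p * p * E2_prev[i]
    return e1, e2
```

With these two moments, the per-pixel distortion follows directly from (2.2).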
With this distortion estimation, R-D optimization can be performed, and the end-to-end quality improvement has been demonstrated. This technique provides a more
Figure 2.3: The block inter-prediction used in the computation of distortion D_p.
accurate distortion model. However, the computational complexity is very high, since pixel-level distortion estimation demands heavy computation. In addition, it
requires a huge frame memory to store the expected distortion for every pixel in
previous frames.
Cote [5] proposed another distortion estimation method of moderate complexity, performed at the macroblock level. Instead of calculating the propagation error exactly, it uses dependency weighting to compensate for the calculation error. The dependency weight is calculated as shown in Figure 2.3.
The distortion due to the propagation error, denoted by D_p in (2.1), can be calculated as

D_p(n, mode) = D_q(n, mode) + \sum_{k=1}^{N} p(n - k) \cdot D_e(n - k),    (2.7)
where D_q is the quantization noise. If the macroblock is coded in the intra mode, D_p has only the quantization distortion term. However, if the macroblock is coded in the inter mode, the expected distortion of the reference macroblock should be
included. De is the distortion of a macroblock when the macroblock is lost and
reconstructed by using an error concealment algorithm. The value of De for every
encoded macroblock is stored in the local memory in order to recursively calculate
(2.7). The second term in (2.7) is calculated with a weighted sum of expected
distortions of surrounding blocks as shown in Figure 2.3. The distortion calculation
due to the propagation error is given by
p(n - k)\,D_e(n - k) = \sum_{l=1}^{9} w(l)\,p(n - k)\,D'_l,    (2.8)

where w(l) is the weight of the area of the reference macroblock that falls in macroblock l, and D'_l is the expected distortion due to error concealment of macroblock l.
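The recursion (2.7)-(2.8) can be sketched as follows; the data layout (a list of per-frame loss rates and, for each, the (w, D') pairs of the up-to-nine overlapped MBs) is an assumption made here for illustration:

```python
def propagated_distortion(Dq, p_hist, weights):
    """Eqs. (2.7)-(2.8): Dp(n) = Dq(n) + sum_k p(n-k) * sum_l w(l) * D'_l.
    p_hist[k] is the loss rate for frame n-k-1; weights[k] is a list of
    (w_l, D_l') pairs for the MBs that the reference area overlaps."""
    Dp = Dq
    for p, pairs in zip(p_hist, weights):
        Dp += p * sum(w * D for w, D in pairs)
    return Dp
```

Because only one stored D_e value per MB is needed, this macroblock-level recursion avoids the per-pixel bookkeeping of Zhang's model.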
The expected distortion is used in selecting the optimal encoding mode of each macroblock with the R-D optimization method. In the experiment, \lambda_{mode} for optimal mode selection is obtained as

\lambda_{mode} = 0.45 \times (2Q)^2 + \Delta(p, Q),    (2.9)
for the Foreman sequence, where Q is the quantization step size and p is the packet erasure rate. It is observed that the dependency between \lambda_{mode} and p is negligible; thus, \Delta(p, Q) is set to zero. The performance improvement is achieved by using a combined optimization method with both optimal mode selection and resynchronization marker insertion.
Cote also investigated the relationship between the error concealment algorithm
and the optimal mode selection. The best performance can be achieved when the
encoder uses the same error concealment algorithm as the decoder.
2.6 Conclusion
A video delivery system over the DiffServ network was reviewed. It is demonstrated
that if the source encoder can provide accurate sensitivity analysis of the end-to-end
quality due to the packet loss or delay, the delivery system can maximize the utility
of network resources while keeping quality degradation to the minimum.
We discussed packet prioritization schemes. Syntax-level classification has been investigated by researchers. Optimal mode selection is performed with distortion estimation by taking the initial error and error propagation into account.
The R-D optimization is carried out to minimize the end-to-end quality degradation
under a given packet erasure rate.
The DiffServ network demands packet sensitivity analysis for the loss and the delay effects. A prioritization scheme attempts to determine the sensitivity value
from the encoding parameters. However, there are few prioritization methods that take into account the actual distortion and error propagation. The distortion estimation methods used in optimal mode selection can reflect the actual error and the general error propagation behavior. However, their models differ and are restricted to this specific application. In addition, they do not take into account the prediction loop filtering effect in the error propagation behavior.
We can identify the common aspect between prioritization and distortion estimation. Prioritization is based on the sensitivity of the end-to-end quality degradation. If we are interested in the loss sensitivity, it can be measured by the packet loss impact, which is affected by the error concealment algorithm and the error propagation behavior. If we have a corruption model that can explain how the error occurs and propagates from the reference macroblock to the predicted macroblock, then the model can be used both for the loss sensitivity measurement and for the calculation of the expected distortion. Figure 2.4 shows the application of the corruption model.
Consider that we are going to encode macroblock A in Figure 2.4. Then, the
expected distortion of this macroblock can be calculated by taking the expectation
of the distortion from all reference macroblocks in the previous frames. In that case,
we know the transferred distortion from each reference macroblock to the current
macroblock A through the prediction path by using the corruption model. On the
other hand, if macroblock A will be transm itted through the network, which requires
the knowledge of the importance of a macroblock (or a packet that includes several
Figure 2.4: Error propagation in the joint source and channel coding.
macroblocks) to use that information in the differentiated forwarding mechanism,
we can calculate the distortion due to the propagation error by using the corruption
model.
To conclude, developing an accurate corruption model is valuable for a robust video delivery system. In addition to accuracy, we have to find a corruption model with a flexible choice of complexity, e.g., by adjusting model parameters such as the number of frames over which error propagation is calculated. In the next chapter, a corruption model will be described, and its application will be demonstrated in the proposed robust video delivery system.
Chapter 3
A Corruption Model for Robust Video Transmissions
3.1 Introduction
In the previous chapter, we investigated the importance of joint source-channel coding in a robust video delivery system. Due to the nature of the source encoder, each bit in the compressed bitstream has a different contribution to the quality of the reconstructed video. Furthermore, the compressed bitstream is partitioned into packets as the transmission unit. Packetization is performed with either the same number of bits or the same number of encoding units (e.g., macroblocks) per packet. Thus, the amount of information contained in each packet varies from packet to packet.
The amount of information can be measured indirectly. When a packet is lost,
quality degradation can be measured in a reconstructed sequence with respect to the
error free case. By using such quality degradation as an indicator of the impact of the
lost packet, we can measure the amount of information of a packet. Furthermore, the packet loss impact is categorized and converted to a relative priority index (RPI) to facilitate the interface between the source coder and the channel coder.
In order to assign an RPI to each video packet according to its loss propagation property, the propagation effect should be either measured or predicted with a corruption model. To avoid the time-consuming, off-line measurement-based approach, estimation of error propagation via a corruption model is adopted here. Building a corruption model for highly complex and hybrid-structured compressed video is not an easy task, since it is difficult to describe the corruption behavior due to the loss of particular packets. The dependency structure introduced by motion-compensated prediction is the major bottleneck. Despite these difficulties, several analytical models have recently been proposed to estimate the distortion under given parameters such as the bit rate, the source coding parameters, and the packet loss rate [7, 28, 41]. The former two models focused on statistical aspects of the whole sequence (i.e. the overall expected distortion), while the latter computed the pixel-level expectation, which is however very expensive. None of these schemes addresses the loss propagation effect of each individual packet, which is needed for the RPI assignment in our proposed framework.
Packetization methods can vary according to the transmission method. Packetization can be based on the number of bits or the number of macroblocks. In general, packetization is performed with a group of basic coding units to avoid the synchronization
problem, and the basic coding unit includes a macroblock and a set of encoding parameters such as the quantization step size, the coding mode, and the motion vector associated with the macroblock.
In this work, a corruption model has been developed for a lost packet which is
composed of multiple macroblocks (MBs). The initial error strength of a lost MB
is estimated based on the adopted error concealment scheme at the decoder. Then,
to capture the expected impact of the distortion on future frames (i.e. the motion-based dependency structure), a trace-back calculation for each MB is used. Also, loop-filtering and intra-MB refreshing effects are considered through the propagation path. By combining the above, the proposed corruption model can reasonably estimate the loss propagation effect for each packet and provide the expected distortion due to its loss. The desired packet-level model with RPI assignment is then constructed by merging the effects of MB-based models. Furthermore, the effect of multiple-packet loss is analyzed, and the associated RPI is calculated to improve the model accuracy at a higher packet loss rate. It is experimentally verified that the expected distortion of each packet closely follows the actual loss behavior measured in most cases.
With the derived packet-level RPI-based corruption model, coordinated delivery of packetized video over QoS networks is also investigated in this research. A per-packet optimization framework with unequal error protection (UEP) is realized to take both end-to-end video performance and pricing into account. Possible solutions
such as the Lagrangian formulation are discussed for the price-constrained mapping.
Finally, the effectiveness of the coordinated packet-level protection framework is
evaluated by simulations.
3.2 MB-level Corruption Model
3.2.1 Investigation of MB error propagation
Most state-of-the-art video compression techniques, including H.263+ and MPEG-1/2/4, are based on motion-compensated prediction (MCP). The video codec employs inter-frame prediction to remove the temporal redundancy and transform coding to reduce the spatial redundancy. MCP is performed at the MB level of 16x16 luminance pixels. For each MB, the encoder searches the previously reconstructed frame for the MB that best matches the target MB being encoded. In order to increase the estimation accuracy, sub-pixel accuracy is used for the motion vector representation, where interpolation is used to build the reference block. Residual errors are encoded by the DCT transform and quantization. Finally, all information is coded with either a fixed-length code or a variable-length code (VLC). Basically, two different coding types can be selected adaptively for each MB. One is inter-mode coding, which includes motion compensation and residual error encoding. The other is intra-mode coding, in which only the DCT and quantization are applied to the original pixels. In the initial anchor frame, or when a new object appears in a frame to be
encoded, the sum of absolute differences (SAD) of the most closely matched MB can be larger than the sum of the original pixel values. In these cases, the intra-mode MB is used. Otherwise, the inter-mode MB, which usually costs fewer bits due to temporal redundancy elimination, is used.
In the Internet environment, packets may be discarded due to the buffer overflow
at intermediate nodes of the network, or considered being lost due to long queuing
delays. VLC is very vulnerable against even one bit error. If error occurs in the bit
stream once, the VLC decoder cannot decode next codes until the decoder finds the
next synchronization point. Hence, a small error can make catastrophic distortion
on the reconstructed video sequence. Packet loss results in the loss of encoded MBs
as well as the loss of synchronization. It affects MBs in subsequent packets until
re-synchronized (or refreshed). Upon the packet loss, an error recovery action (i.e.,
error concealment) is performed at the decoder, attempting to figure out the best
alternative for the lost portion. Usually, no normative error concealment method
is defined for most MCP video compression standard. Various forms of error con
cealment schemes have been proposed [37], [8]. The temporal concealment scheme
exploits the temporal correlation in video signals by replacing a damaged MB with
the spatially corresponding MB in the previous frame. This straightforward scheme,
however, can produce adverse visual artifacts in the presence of large motion, and
the motion-compensated version of temporal concealment is usually employed with
estimated motion vectors from surrounding MBs.
However, a residual error remains even after sophisticated error concealment, and it propagates due to the recursive prediction structure. This temporal error propagation is typical for hybrid video coding that relies on MCP in the inter-frame mode. The number of times a lost MB is referenced in the future depends on the coding modes and motion vectors of MBs in subsequent frames. This indicates the importance of each MB from the viewpoint of error propagation.
While propagating temporally and spatially, the residual error after error concealment decays over time due to the leakage in the prediction loop. Leaky prediction is a well-known technique to increase the robustness of DPCM by attenuating the energy of the prediction signal. For hybrid video coding, leakage is introduced by spatial filtering operations that are performed during encoding. Spatial filtering can either be introduced by an explicit loop filter or implicitly as a side effect of half-pixel motion compensation (i.e. with bilinear interpolation). This spatial filtering effect in the decoder was analyzed by Farber et al. [7]. In their work, the loop filter is approximated as a Gaussian-shaped filter in the spatial frequency domain. It is given by

|H_t(\omega_x, \omega_y)|^2 = \exp[-(\omega_x^2 + \omega_y^2) \cdot t \cdot \sigma_f^2],    (3.1)

where \sigma_f^2 is determined by the filter shape and indicates the strength of the loop filter. As shown in Equation (3.1), the loop filter behaves like a lowpass filter, and its bandwidth is determined by t and the filter strength \sigma_f^2.
By further assuming that the error signal u(x, y) is a zero-mean stationary random process with the power spectral density (PSD)

\Phi_{uu}(\omega_x, \omega_y) = \sigma_u^2 \cdot 4\pi \sigma_g^2 \cdot \exp[-(\omega_x^2 + \omega_y^2) \cdot \sigma_g^2],    (3.2)

i.e. a separable Gaussian PSD with variance \sigma_u^2 that can be interpreted as the average energy, the shape of the PSD of the error signal is characterized by \sigma_g^2. Thus, the pair of parameters (\sigma_u^2, \sigma_g^2) determines the energy and the shape of the error signal's PSD and can be used to match Equation (3.2) with the true PSD. With these approximations, the variance of the error propagation random process v(x, y) can be derived as

\sigma_v^2[t] = \frac{\sigma_u^2}{1 + \gamma \cdot t} = \sigma_u^2 \cdot \alpha[t],    (3.3)

where \gamma = \sigma_f^2 / \sigma_g^2 is a parameter describing the efficiency of the loop filter in reducing the introduced error, and \alpha[t] is the power transfer factor after t time steps. This analytical model, as given in Equations (3.1)-(3.3), has been verified by experimental results [7], [8]. While the statistical propagation behavior is analyzed with this model, it is difficult to estimate the loss effect for each packet composed of several MBs, i.e. the GOB unit, in general. Thus, in the following section, we extend this propagation behavior analysis by incorporating additional coding parameters such as the error concealment scheme and the encoding mode, so that one can track the error propagation effect better with a moderate computational complexity.
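The leakage decay of (3.3) is simple to evaluate numerically; the following sketch just restates the formula (the parameter values used in the test are illustrative, not measurements from [7]):

```python
def leakage_gamma(sigma_f2, sigma_g2):
    """Loop-filter efficiency gamma = sigma_f^2 / sigma_g^2 (Eq. 3.3)."""
    return sigma_f2 / sigma_g2

def propagated_energy(sigma_u2, gamma, t):
    """Residual error energy after t prediction steps:
    sigma_v^2[t] = sigma_u^2 / (1 + gamma * t)."""
    return sigma_u2 / (1.0 + gamma * t)
```

Note the hyperbolic (not exponential) decay: the error never reaches zero by leakage alone, which is why intra refresh is still needed to terminate propagation.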
3.2.2 Derivation of MB-level Corruption Model
The purpose of the corruption model is to estimate the total impact of packet loss.
When one or multiple packets are lost, errors are introduced and propagated. The
impact of errors is defined as the difference between reconstructed frames with and
without packet loss, and measured in terms of mean square error (MSE). Fig. 3.1 shows the error propagation of a lost MB, in which the initial error of the corrupted MB is denoted by u(x, y) and its energy is measured by the error variance \sigma_u^2. The propagation error v(x, y) in consecutive frames has energy \sigma_v^2(m, j) in impaired MB j of frame n + m.
Figure 3.1: Error propagation of a lost MB.
The initial error due to packet loss is dependent on the error concealment scheme
adopted by the decoder. The amount of initial error can be calculated at the encoder
if the error concealment scheme used in the decoder is known a priori. For simplicity,
we assume that the decoder uses the TCON error concealment scheme as specified in
H.263+ Test Model 10 [15]. Also, at this stage, let us confine the corruption model
analysis to the case of isolated packet loss. Under a low loss rate, this corruption
model exhibits reliable accuracy. The multiple-packet loss case will be discussed in a
later section, which is needed for scenarios with a higher loss rate. Also, only MB-unit (i.e., 16x16) estimation is analyzed, excluding the advanced prediction option at present. This makes the MB the unit of the coding mode, with a unique motion vector (for inter-frame modes). From an MB with 256 pixels, we can extract the PSD of the signal by analyzing its frequency components. In addition, MB-based calculation costs much less than pixel-based computation. The MB-level corruption model will also be extended to the GOB level with some restrictions later.
Let us analyze the energy transition through an error propagation trajectory. Typical propagation trajectories are illustrated in Fig. 3.1. A trajectory is composed of two basic trajectory elements. The first one is a parallel trajectory while
Figure 3.2: Parallel propagation: (a) the trajectory and (b) the equivalent linear system.
the second one is a cascaded trajectory. Before the analysis, some assumptions are required to reduce the computational complexity. Let the decoder with the DPCM loop and the spatial filter be a linear system. Also, for each frame to be predicted, the time difference is set to 1, i.e. t = 1 in (3.3).

Fig. 3.2(a) shows the parallel propagation trajectory. The error in reference frame n is characterized by (\sigma_u^2, \sigma_g^2). Through a parallel trajectory, an error can propagate to two or more different areas in subsequent frames. For each path, a different motion vector and spatial filtering can be applied. Hence, a different filter strength can be applied, as depicted in the equivalent linear system of Fig. 3.2(b). H_a(\omega) and H_b(\omega) may have different filter functions, but both are Gaussian approximations of the spatial filter. The PSD of the error frame is transferred to the predicted frame via

\Phi_{vv}(\omega_x, \omega_y) = |H_a(\omega_x, \omega_y)|^2 \cdot \Phi_{uu}(\omega_x, \omega_y) + |H_b(\omega_x, \omega_y)|^2 \cdot \Phi_{uu}(\omega_x, \omega_y),    (3.4)
Figure 3.3: Cascaded propagation: (a) the trajectory and (b) the equivalent linear system.
and the corresponding error energy is derived as

\sigma_v^2 = \frac{\sigma_u^2}{1 + \gamma_a} + \frac{\sigma_u^2}{1 + \gamma_b},    (3.5)

where \gamma_a = \sigma_{fa}^2 / \sigma_g^2 and \gamma_b = \sigma_{fb}^2 / \sigma_g^2. Therefore, the MSE can be individually estimated and accumulated for the parallel error propagation trajectory.
In the cascaded propagation trajectory shown in Fig. 3.3(a), the initial error energy \sigma_u^2 of U in frame n is referenced by A in frame n + 1 and is then transferred to B in frame n + 2. The equivalent linear system is shown in Fig. 3.3(b). For each transition, the loop filter function is characterized by \sigma_{fa}^2 and \sigma_{fb}^2, respectively. Then, the PSD of the propagation error in frame n + 2 is given as the product of the two filter functions:

\Phi_{vv}(\omega_x, \omega_y) = |H_a(\omega_x, \omega_y)|^2 \cdot |H_b(\omega_x, \omega_y)|^2 \cdot \Phi_{uu}(\omega_x, \omega_y).    (3.6)
The equivalent error energy is derived as

\sigma_v^2 = \frac{\sigma_u^2}{1 + \gamma}, \quad \text{where } \gamma = \frac{\sigma_{fa}^2 + \sigma_{fb}^2}{\sigma_g^2}.    (3.7)

The propagation error energy from U to B is given in (3.7). It is the same as (3.3) except for the loop filter efficiency \gamma. For the cascaded propagation, the loop filter efficiency \gamma can be derived from the equivalent loop filter strength \sigma_f^2, which is the sum of the filter strengths \sigma_{fa}^2 and \sigma_{fb}^2. As a result, the equivalent filter strength for the cascaded propagation is the sum of the filter strengths along the propagation path.
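The two trajectory elements can be sketched as follows; the numeric values in the test are illustrative, and the single-reference structure per branch is an assumption of this sketch:

```python
def parallel_energy(sigma_u2, gammas):
    """Eq. (3.5): each parallel branch attenuates independently and the
    resulting MSEs add; gammas holds one loop-filter efficiency per branch."""
    return sum(sigma_u2 / (1.0 + g) for g in gammas)

def cascaded_energy(sigma_u2, sigma_f2_list, sigma_g2):
    """Eq. (3.7): the equivalent filter strength is the sum of the per-step
    strengths along the path, so gamma = sum(sigma_f^2) / sigma_g^2."""
    gamma = sum(sigma_f2_list) / sigma_g2
    return sigma_u2 / (1.0 + gamma)
```

These two primitives are all that is needed to propagate an MB's error energy along an arbitrary trajectory: cascades accumulate filter strengths, and parallel splits accumulate energies.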
Because of the motion vector (even at sub-pixel accuracy), usually only a portion of an MB is referenced by the predicted frames. In this case, not all errors of an impaired MB propagate to the following frames. Hence, we have to consider how much of an MB contributes to the next predicted frames as a reference; this quantity is denoted as the dependency weight. If we denote the ith MB of frame n as MB_{n,i}, and a portion of MB_{n,i} contributes to MB_{n+m,j}, then the dependency weight w_{n,i}(m, j) is defined as the normalized number of pixels that are transferred from MB_{n,i} to MB_{n+m,j}. If no portion of the MB is referenced by the jth MB of the (n + m)th frame, w_{n,i}(m, j) is zero. Otherwise, it has a value between zero and one. The dependency weight can be calculated recursively with stored motion vectors and MB types, as shown in Figure 3.4. However, since motion compensation is not a linear operation, we have to assume that the motion vectors of neighboring MBs are the same (or at least very similar) so that the target MB is transferred without severe breaks, as depicted in
Figure 3.4: Recursive weight calculation.
Figure 3.4. Another assumption is that the error in an MB is uniformly distributed in the spatial domain while also having a Gaussian PSD. Then, the transferred error energy from MB_{n,i} to MB_{n+m,j} can be calculated with the loop filtering effect and the dependency weight.
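For the single-step, full-pel case, the geometric part of the weight computation reduces to an area overlap, which can be sketched as below (the four-neighbour layout and full-pel restriction are simplifying assumptions of this sketch; sub-pel motion and the recursion over frames add bookkeeping but no new ideas):

```python
def dependency_weight(mvx, mvy):
    """Fraction of each of the (up to four) reference MBs that a motion-
    shifted 16x16 block overlaps: a minimal single-MB sketch of w_{n,i}(m,j).
    Returns overlap weights [aligned, right, below, diagonal], summing to 1."""
    dx, dy = abs(mvx) % 16, abs(mvy) % 16
    # pixel-count overlap with the four MBs the shifted block can straddle
    areas = [(16 - dx) * (16 - dy), dx * (16 - dy), (16 - dx) * dy, dx * dy]
    return [a / 256.0 for a in areas]
```

A zero motion vector places all the weight on one MB, while a half-MB shift in both directions splits the weight evenly over four MBs.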
Finally, to evaluate the total impact of the loss of MB_{n,i}, the weighted error variances for MBs of subsequent frames should be summed. Because the initial error can persist over a number of frames without converging to zero, we have to limit the number of frames evaluated to keep an acceptable computational complexity. Fortunately, when the intra-MB refresh technique is used at the encoder, the propagated error energy converges to zero within a fixed number of frames. Thus, in general, a pre-defined number of frames is sufficient to estimate the total impact of the MB loss. This defines an estimation window for the corruption model. The appropriate estimation window might be determined based on the strength of intra-MB refresh. As a result, the total energy of errors due to an MB loss in a sequence can be written as
as
\sigma^2 = \sigma_u^2 + \sum_{m=1}^{M} \sum_{j=1}^{N} w_{n,i}(m, j) \cdot \sigma_v^2(m, j),    (3.8)
where M is the size of the estimation window and N is the total number of MBs in a frame, respectively. Also, we have

\sigma_v^2(m, j) = \frac{\sigma_u^2}{1 + \sum_{t=1}^{m} \gamma_{t,j}},    (3.9)

where \gamma_{t,j} is the loop filter efficiency of the tth step along the propagation path toward MB j. The corruption model derived above can be viewed as an MB-level extension of the statistical error propagation model given in [7].
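The windowed accumulation of (3.8) can be sketched as follows; the data layout (per-frame lists of weight/energy pairs for the impaired MBs) is a representational choice made here, not prescribed by the model:

```python
def total_loss_impact(sigma_u2, window):
    """Eq. (3.8) sketch: total energy = initial error energy sigma_u^2 plus
    the weighted propagated energies over an M-frame estimation window.
    window[m] is a list of (w, sigma_v2) pairs for the impaired MBs of
    frame n+m+1; empty lists mean the error has been refreshed away."""
    total = sigma_u2
    for frame in window:
        total += sum(w * s for w, s in frame)
    return total
```

Truncating the window is safe exactly when intra-MB refresh guarantees that later frames contribute (near-)zero terms.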
3.3 RPI-Based Coordination for Network Adaptation
For packet video applications, the priority assignment for each packet in terms of loss and delay should reflect the influence of each packet on end-to-end video quality. In the case of delay, the classification of video streams depends more on the application context (e.g., video conferencing or video on demand (VOD)) than on the video contents within a stream. Thus, although it is worthwhile to distinguish packets based on the loss-rate/delay tuple, we focus on the packet loss priority assignment in the network adaptation module and assign a relative priority index (RPI) to each packet.
At the packet level, there are several dependency relations. We categorize them into three types: semantic dependency, prediction dependency, and redundancy dependency. The semantic packet-level dependency exists between a packet that includes header parameters and other dependent packets that need them for decoding. This dependency is usually very strong because it may affect the effectiveness of all dependent packets. The redundancy packet-level dependency is a relationship between a packet and a redundancy packet. There can be source redundancy packets such as the motion parity packet [16]. Also, there may be channel redundancy packets such as forward error correction (FEC) and automatic repeat request (ARQ) packets, which have a partial or full dependency relation with other packets [9]. Finally, the corruption of a frame portion due to packet loss can affect the decoding of packets in the prediction relation (either spatially or temporally). This is called the prediction packet-level dependency and is exclusively taken into account in this work.
RPI assignment is somewhat dependent on the employed video coding scheme. Here, we consider MCP-based video compression schemes with the proposed MB-level corruption model. Under this MB-level corruption model, the total loss impact of an MB is calculated recursively under the assumption that there is only one MB loss (called the MB-independent assumption). However, a packet usually contains more than one MB. It may contain a number of GOBs (groups of blocks) or slices, or even frames. Moreover, there can be multiple-packet losses in practice. The initial error for an MB is not affected by other MBs in the same packet, because
they are located exclusively and lost at the same time. However, the propagation path of one MB can be affected by the loss of other MBs. Thus, the packet loss estimation under the MB-independent assumption may deviate from the real situation; this deviation is assumed to be negligible here.
In this work, we assume the independent RPI assignment scheme. Under the assumption that only one packet loss occurs in the whole sequence (or in a reasonable number of frames surrounding this packet), the error propagation path and its loop filter strength are not affected by other packet losses. Then, an independent RPI can be assigned to each packet by summing only MB-level corruption effects under the MB-independent assumption with a proper normalization. In fact, if we confine ourselves to a video stream with error resilience, where prediction dependency between spatially adjacent packets has been reduced, the independent RPI can be a reasonable choice.
With the RPI-based corruption model for each packet, a coordinated effort to deliver packetized video over QoS networks is investigated in this section. In particular, a packet-based protection system with unequal error protection (UEP) is realized. This system incorporates both the end-to-end video performance and pricing. The end-to-end video performance is measured in either objective quality (e.g., the PSNR value) or subjective quality. In our approach, the impact on visual quality due to the loss of a packet is represented by RPI (independent or dependent RPI).
To evaluate the total impact of the loss, the weighted error variances for MBs of subsequent frames should be summed. Because the initial error can persist over a large number of frames without converging to zero, we have to limit the number of frames to be evaluated to keep the computational complexity acceptable. Fortunately, when the intra-MB refresh technique is used at the encoder, the propagated error energy converges to zero within a fixed number of frames. Thus, in general, a pre-defined number of frames is sufficient to estimate the total impact of an MB loss. This defines an estimation window for the corruption model. The appropriate estimation window can be determined based on the strength of the intra-MB refresh. As a result, the total energy of errors due to an MB loss in a sequence can be written as
E(i) = 10 \log \left( \sum_{n=-\infty}^{\infty} MSE(n, i) \right),  (3.10)
where
MSE(n, i) = \frac{1}{|I|} \sum_{(x,y) \in I} | R_n(x, y) - R_n^i(x, y) |^2,  (3.11)
where MSE(n, i) is the mean square error of the nth frame when the ith packet is lost, R_n(x, y) is the nth frame of the reconstructed sequence when no packet loss is introduced, and R_n^i(x, y) is the nth frame of the reconstructed sequence when the ith packet is lost. E(i) can then be easily categorized into the RPI value.
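As a concrete illustration, (3.10)-(3.11) can be sketched in a few lines of Python. This is a minimal sketch under several assumptions of this example, not of the original derivation: frames are plain lists of pixel rows, the logarithm is taken base 10 (PSNR-style), and the infinite sum is truncated to the frames inside the estimation window.

```python
import math

def mse(frame_ref, frame_lossy):
    """MSE(n, i): mean square error between the loss-free and the lossy
    reconstruction of one frame (equal-sized lists of pixel rows)."""
    total = count = 0
    for row_r, row_l in zip(frame_ref, frame_lossy):
        for a, b in zip(row_r, row_l):
            total += (a - b) ** 2
            count += 1
    return total / count

def total_loss_energy_db(ref_seq, lossy_seq):
    """E(i) = 10*log10(sum_n MSE(n, i)), summed over the estimation window."""
    return 10.0 * math.log10(sum(mse(r, l) for r, l in zip(ref_seq, lossy_seq)))
```

For instance, a constant unit error sustained over ten frames yields E(i) = 10 log10(10) = 10 dB.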
Given RPI-assigned packets, the network adaptation task can be formulated as
follows. The loss impact of each packet expressed in terms of RPI is first categorized
(i.e., normalized and quantized) among K categories according to the significance of the packet from the perspective of quality degradation. In fact, the actual categorization process may vary according to the employed network adaptation module and the underlying network service. Under the network delivery scenario, the resulting quality degradation depends on both the categorized RPI and the delivery mechanism. The delivery mechanism for each packet is in turn categorized into level q among a total of Q levels, anticipating a price p_q according to q. For example, in the DiffServ network, the DS level indicates different forwarding with which a certain level of QoS is assured.
Thus, the total quality degradation of video with N packets can be expressed as
QD = \sum_{i=1}^{N} QD(k(i), q(i)).  (3.12)
Since the total cost P for the N packets is limited and each packet i costs p_{q(i)}, the optimal assignment of q = (q(1), q(2), ..., q(N)) can be found by minimizing the total quality degradation. That is,
\min_{q} QD = \min_{q} \sum_{i=1}^{N} QD(k(i), q(i)),  (3.13)
subject to
\sum_{i=1}^{N} p_{q(i)} \le P.  (3.14)
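For small instances, the constrained assignment of (3.13)-(3.14) can be checked by exhaustive search. The sketch below assumes a simple per-packet degradation QD(k, q) = k/q and a price table indexed by service level; both are illustrative choices for this example, not the general formulation.

```python
from itertools import product

def optimal_assignment(k, levels, price, budget):
    """Exhaustively minimize sum_i k[i]/q[i] subject to sum_i price[q[i]] <= budget.
    Tractable only for a handful of packets; larger problems use the Lagrangian method."""
    best_qd, best_q = float("inf"), None
    for q in product(levels, repeat=len(k)):
        if sum(price[level] for level in q) > budget:
            continue  # assignment exceeds the total cost P
        qd = sum(ki / level for ki, level in zip(k, q))
        if qd < best_qd:
            best_qd, best_q = qd, q
    return best_q, best_qd
```

With two packets of categorized impact k = [4, 1], two service levels priced 1 and 2, and a budget of 3, the search assigns the better level to the high-impact packet: q = (2, 1).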
We can solve the problem by finding the service level mapping q that minimizes the Lagrangian formula
J_i(\lambda) = QD(k(i), q(i)) + \lambda \cdot p_{q(i)}.  (3.15)
The solution depends on QD(k(i), q(i)) and p_{q(i)}. The Lagrangian formulation of this problem is illustrated in [25]. The mapping function from the categorized RPI to the quality delivery mechanism, together with the pricing strategy, affects the mapping solution. With these two determined, the cost-to-quality-degradation function can be derived for each packet. The solution is then to set each p_{q(i)} equal to the price at which the line of slope -\lambda intersects the quality degradation curve. Since the total cost P is the sum of all packet costs, the cost constraint (3.14) is met by adjusting \lambda.
In particular, we consider the following special situation, where the quality degradation is affected by the loss effect only. The expected quality degradation QD(k(i), q(i)) can be factored into the product of the loss impact QD_{k(i)} and the packet loss rate L_{q(i)} for the packet. If the quality degradation QD_k is linearly proportional to the categorized class k of RPI (i.e., QD_k = k \cdot Q, where Q is the normalization factor), the unit price p_q for service level q is proportional to q and reciprocal to the packet loss rate L_q, so that L_q is inversely proportional to the service level q, then the quality degradation can be expressed as
QD(k(i), q(i)) = QD_{k(i)} \cdot L_{q(i)} = Q \cdot k(i) \cdot \frac{1}{q(i)}.  (3.16)
Then, the Lagrangian formula of (3.15) becomes
J_i(\lambda) = Q \cdot k(i) \cdot \frac{1}{q(i)} + \lambda \cdot q(i).  (3.17)
Solving (3.17) for q(i) yields
q(i) = \sqrt{\frac{Q \cdot k(i)}{\lambda}},  (3.18)
and parameter \lambda can then be calculated from the constraint equation (3.14) as
\lambda = \frac{Q \cdot \left( \sum_{i=1}^{N} \sqrt{k(i)} \right)^2}{P^2}.  (3.19)

3.4 RPI Generation Based on the Proposed Corruption Model

In this section, an algorithm for calculating the propagation error is described. RPI can be calculated from the error energy caused by a macroblock loss. The error energy can be expressed as the sum of error variances (under the assumption that the error has zero mean). The total error energy is equal to the sum of the variances of the initial error and the propagation error, as derived in the previous section.
To find the corrupted macroblocks and calculate the variance of the propagation error, all macroblocks in the estimation window could be examined in the forward direction to see whether each macroblock is referenced or not. However, this method
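Under these special-case assumptions (unit price p_q = q and QD_k = k·Q), the closed-form solution can be evaluated directly. The continuous relaxation below ignores the rounding of q(i) to integer service levels, which is an assumption of this sketch:

```python
import math

def service_levels(k, Q, P):
    """Continuous Lagrangian solution of (3.17): lam = Q*(sum_i sqrt(k_i))^2 / P^2
    and q_i = sqrt(Q*k_i / lam), so that the levels exactly exhaust the budget P."""
    s = sum(math.sqrt(ki) for ki in k)
    lam = Q * s * s / (P * P)
    return [math.sqrt(Q * ki / lam) for ki in k]
```

For example, k = [1, 4, 9] with Q = 1 and P = 12 yields q = [2, 4, 6]: levels grow with the square root of the loss impact, and their sum equals the budget.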
requires a very large amount of computation. Instead of finding the corrupted macroblocks in the forward direction, the reference macroblocks of the macroblock being encoded can be searched in the backward direction. To trace the error propagation in the backward direction, the encoding parameters of the macroblocks in the reference frames should be stored during the estimation window period. These parameters include the macroblock coding mode, the motion vectors, and the variance of the propagation error.
For convenience, let us call the macroblock that is being encoded the current macroblock. The reference macroblock of a current macroblock may be composed of portions of several macroblocks in the reference frame. A macroblock in the reference frame that contains a part of the reference macroblock is called a contribution macroblock. The reference macroblock can be segmented as shown in Figure 3.5 so that each segment comes from a single contribution macroblock. All contribution macroblocks of the current macroblock can be found within the estimation window by tracing backwards with the stored macroblock encoding parameters.
After one macroblock is encoded, its encoding parameters are stored for future use, and the error concealment scheme is performed for the current macroblock in the same way as the decoder does. Figure 3.6 shows the proposed algorithm for calculating the error variance due to macroblock loss. This procedure is executed as each macroblock is encoded. It is summarized as follows. First, the error variance contributed to the current macroblock from every contribution
Figure 3.5: Illustration of the reference macroblock.
macroblock is calculated and accumulated. Second, the initial error and its variance are calculated, the PSD of the initial error is approximated, and the variance of the propagation error of the current macroblock is reset. Third, if the current macroblock is not intra-coded, the recursive calculation of the propagation error is performed. The filter strength is determined by the half-pel flag of the motion vector, and the coordinates of the reference macroblock are calculated. The error propagation effect is traced recursively by calling the TRACE procedure with the corresponding parameters until the depth of the trace, measured in frames, reaches zero.
Once the TRACE procedure is called, it calculates the propagation error from each contribution macroblock to the current macroblock. Then, it calls the TRACE procedure recursively for each segment as shown in Fig. 3.5. The flow chart of the TRACE procedure is given in Fig. 3.7. For each segment, the decaying factor
Figure 3.6: The macroblock encoding algorithm with RPI generation.
γ is calculated from the stored PSD of the contribution macroblock and the filter strength. With this decaying factor and the area of the segment, the variance of the propagation error is calculated by (3.8) and accumulated. Then, if the depth parameter is larger than zero and the contribution macroblock is encoded in the inter mode, the tracing process is continued by calling the TRACE procedure with updated parameters. The filter strength is updated by incorporating the filter strength of the contribution macroblock into the transferred filter strength. This is performed for all segments. Finally, after one frame is encoded, the sum of the variances of the initial error and the propagation error of each macroblock in the earliest reference frame in the estimation window is reported as the MSE of the error due to the macroblock loss.
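The backward tracing loop can be sketched as follows. This is a simplification of the procedure of Figs. 3.6-3.7: each stored MB record carries only a coding mode, a motion vector, and a single scalar decaying factor `gamma` standing in for the PSD/filter-strength computation of (3.8). The record layout and the `segments` helper are hypothetical, introduced only for this sketch.

```python
def segments(x, y):
    """Split a 16x16 reference block anchored at pixel (x, y) into the (up to four)
    macroblocks it overlaps, each with its area fraction, as in Figure 3.5."""
    w1, h1 = 16 - x % 16, 16 - y % 16
    mbx, mby = x // 16, y // 16
    return [((mbx + dx, mby + dy), (w * h) / 256.0)
            for dx, w in ((0, w1), (1, 16 - w1))
            for dy, h in ((0, h1), (1, 16 - h1)) if w and h]

def trace(frames, n, x, y, depth, var_in, acc):
    """Accumulate propagation-error variance backwards through reference frames.
    frames[n][(mbx, mby)] = {'intra': bool, 'mv': (dx, dy), 'gamma': decay factor}."""
    if depth == 0 or n < 0:
        return
    for (mbx, mby), frac in segments(x, y):
        mb = frames[n][(mbx, mby)]
        var = mb['gamma'] * frac * var_in      # decayed share of this segment
        acc[0] += var
        if not mb['intra']:                    # intra MBs stop the propagation
            dx, dy = mb['mv']
            trace(frames, n - 1, mbx * 16 + dx, mby * 16 + dy, depth - 1, var, acc)
```

The recursion mirrors the text: segment areas weight the incoming variance, the decaying factor attenuates it at each hop, and the trace terminates at an intra-coded contribution macroblock or when the depth runs out.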
3.5 Experimental Results
The proposed corruption model is first verified with simulations. Then, the QoS mapping is performed based on the RPI that is calculated with the proposed corruption model. Simulations are carried out with the QCIF Foreman sequence, encoded by an H.263+ encoder at 30 fps and 300 kbps. Intra-MB refresh is employed to increase robustness, and a synchronization code is inserted at the beginning of every GOB, leading to GOB-based packets. To evaluate the impact of packet loss, the error energy is estimated for each MB by using the proposed corruption model. Then, the error energies of all MBs that belong to a packet are summed and averaged. Also,
Figure 3.7: The trace algorithm for RPI generation.
to simplify the acquisition of the error PSD parameters for each MB, DCT coefficients are utilized as a replacement for Fourier coefficients. The bitstream is decoded with the TCON error concealment scheme, where the motion-compensated temporal concealment method is utilized whenever possible.
3.5.1 Verification of the Proposed Corruption Model
To verify the proposed corruption model, several simulations are performed. In all cases, the distortion estimated with the corruption model is compared with the actual distortion. First, the energy of the propagation error as a function of time is studied. For a single GOB loss, the actual distortion is calculated by decoding the bitstream, where the lost GOB is discarded and error-concealed. When the corruption model is used in estimating the propagation error, the variance of the propagated error for each frame is calculated, as illustrated in Figs. 3.8(a) and (b). To emphasize the effect of the prediction filter, no intra macroblock refresh is used. Therefore, the propagation error lasts for a longer period of time. As shown in Fig. 3.8, the propagation error calculated with the proposed corruption model matches the actual result very accurately.
The single packet loss model for the independent packet RPI was verified by comparing the distortion measured under a single-packet loss with that estimated by the proposed corruption model. Four schemes were compared, including two interim schemes. First, the actual measurement provides the real impact, which
Figure 3.8: Effects of a single GOB loss with (a) GOB=4 and (b) GOB=5 from the 111th frame, respectively, where the frame number indicates the relative frame distance from the 111th frame.
serves as the reference and requires an extraordinary amount of computation time. With the proposed corruption model, one can calculate the corruption effect in one encoding or decoding pass so that the computational overhead is small. The estimation window was set to 20 frames. The two simplified versions of the proposed corruption model were: (1) a scheme with only the temporal dependency weight, where spatial filtering was not considered, and (2) a scheme with only the initial error, without temporal dependency or spatial filtering.
In Fig. 3.9, the experimental results are depicted in terms of the MSE evolution. For this figure, 200 GOBs (the 5th GOB of every frame from the 111th to the 310th) were selected from the 400-frame 'Foreman' sequence and corrupted. Propagation errors were measured over up to 20 subsequent frames, which seems reasonably long for the 11-frame intra-MB refresh interval utilized. As shown in Figure 3.9, the
Figure 3.9: The distortion estimation for every 5th GOB of all frames.
proposed corruption model is the closest one to the actual measurement. The initial-error-only model tends to over-estimate the error, since the time-decaying effect of spatial filtering is not considered. Especially in the portions of frames with smaller temporal dependency (e.g., from frame 160 to frame 200), this scheme cannot give an accurate estimation result. The scheme that uses the temporal dependency only could not track the actual distortion well either, especially for frames with lower temporal dependency (e.g., from frame 80 to frame 120).
The correlation graph gives a clearer picture of the performance of the proposed corruption model. The actual single GOB loss effect is calculated for all GOBs, while the expected GOB loss effect is calculated by using the proposed corruption model. The correlation between the actual loss effect and the estimated loss effect is illustrated in Fig. 3.10, which demonstrates that the proposed method offers very accurate results.
Figure 3.10: Correlation between the actual MSE and the estimated MSE by using
the proposed corruption model.
Finally, an experiment is performed to investigate the multiple packet loss effect. Since the proposed corruption model was derived under the assumption of a single packet loss, the multiple packet loss case may give a larger estimation error between the actual packet loss effect and the estimated packet loss effect. To measure such an error, a noise pattern is generated with a 5% GOB loss rate. Then, the erroneous bitstream is produced by discarding the lost packets from the original bitstream, and the actual distortion is calculated for each frame. On the other hand, the initial error and the propagation error of each lost GOB are calculated for each frame by using the proposed corruption model, and the variances of all lost GOBs are summed for each frame. The distortion of each frame due to the multiple GOB loss is shown in Fig. 3.11 for different GOB loss rates. The model under-estimates the GOB
Figure 3.11: The effect of prediction dependency on overall quality degradation at
different multiple packet loss rates.
loss effect because it assumes a single GOB loss. However, the difference between the actual measurement and the estimation is relatively small, and it increases proportionally to the GOB loss rate. Thus, we conclude that the proposed method can be used to approximate the multiple packet loss effect in the case of low GOB loss rates.
3.5.2 RPI-Based Coordination of Network Adaptation
The RPI-based coordination of network adaptation was evaluated under the proposed video delivery framework given in Chapter 1. The test sequence and encoding parameters were the same as those given in the previous section. The dependent RPI, which covers the multiple-packet-loss effect, was utilized for RPI categorization.
(a) Categorized RPI. (b) Unit price vs. packet loss rate. (c) QoS mapping result.
Figure 3.12: The QoS mapping.
The network delivered prioritized packets with different reliability properties corresponding to the service level, which was rated according to the packet loss rate. For the coordination, the Lagrangian-based mapping method was implemented to satisfy the given cost constraint.
With the proposed corruption model, the dependent RPI was calculated for all 3600 GOB packets of the Foreman sequence. The RPI was then categorized into 20 linearly proportional levels. The categorized RPI distribution of the test sequence is shown in Fig. 3.12(a). The packet loss rate was set to be inversely proportional to the service level (10 levels in total), with values ranging from 1.2% to 12%. The unit price was proportional to the service level, so the resulting packet loss rate was the reciprocal of the unit price, as shown in Fig. 3.12(b). The coordinated mapping from the RPI category to the service level was then performed based on the Lagrangian formula of (3.19). Fig. 3.12(c) shows the mapping result of the service level versus the categorized RPI.
Figure 3.13: The PSNR comparison under 5% PLR.
Figure 3.14: Performance comparison for schemes with and without RPI.
To simulate the error pattern of the underlying network, the Gilbert model with transition parameter 0.9 was used [40]. Also, the loss rate of each level was maintained constant throughout the evaluation to allow a fair comparison. The resulting PSNR was compared, since it provides a certain measure of the end-to-end visual quality. In Fig. 3.13, several PSNR curves are given for the case of 5% average packet loss, where the effect of RPI differentiation is illustrated. As expected, the UEP coordination based on the proposed RPI gives a performance boost compared to a scheme that does not differentiate packets. To better illustrate the gain of the proposed scheme, the average PSNR performance was compared by varying the cost constraint. The resulting average PSNR curves for different cost constraints are shown in Fig. 3.14. The advantage of RPI assignment is very obvious.
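A burst-loss pattern of this kind can be reproduced with a generic two-state Gilbert model, sketched below. The exact correspondence between its two transition probabilities and the single "transition parameter 0.9" quoted from [40] is not specified here, so the parameters in the example are illustrative only.

```python
import random

def gilbert_pattern(n, p_gb, p_bg, seed=0):
    """Two-state Gilbert loss model: in the good state a packet survives (0),
    in the bad state it is lost (1); p_gb and p_bg are the transition
    probabilities good->bad and bad->good. The long-run loss rate is
    p_gb / (p_gb + p_bg)."""
    rng = random.Random(seed)
    bad, pattern = False, []
    for _ in range(n):
        if bad:
            bad = rng.random() >= p_bg   # stay bad with probability 1 - p_bg
        else:
            bad = rng.random() < p_gb    # enter a loss burst with probability p_gb
        pattern.append(1 if bad else 0)
    return pattern
```

For instance, `gilbert_pattern(100000, 0.01, 0.19)` produces an average loss rate near 0.01/0.20 = 5%, with the losses clustered in bursts of mean length 1/0.19 ≈ 5 packets rather than scattered independently.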
3.6 Conclusion
A corruption model was developed in this chapter to estimate the total impact of a packet loss. The corruption model takes into account the initial error, the spatial filtering effect, and the temporal dependency to increase the estimation accuracy. With the proposed backward tracing algorithm, in which the proposed corruption model is utilized to calculate the variance of the propagation error from one macroblock to another, the total packet loss impact can be determined with a low computational complexity. The resulting estimation approximates the actual loss impact within a narrow margin while demanding a low computational overhead. Even though the proposed corruption model is derived under the assumption of a single packet loss, it is shown by simulation that it provides reasonable accuracy in the case of a low packet loss rate. When applied with the RPI association, the proposed corruption model provides a reasonable performance improvement for robust packetized video delivery.
Chapter 4
An Integrated AIR/UEP Scheme for Robust Video Transmission with a Corruption Model
4.1 Introduction
The ever-increasing demand for multimedia communications via wired and/or wireless IP networks faces the challenge of packet loss as well as bandwidth fluctuation. The dependency between image frames makes a compressed video stream vulnerable even to a small number of lost packets. To address error resilience, the latest versions of ITU-T H.263+ and ISO MPEG-4 have adopted several options to alleviate the corruption of compressed video over error-prone channels. Examples include layered representation (e.g., data partitioning), re-synchronization, error tracking (ET), adaptive intra refresh (AIR), and other error recovery options. The robustness issue against packet loss is most relevant to video transmission over IP networks, which is the main concern of this research.
The intra-refresh scheme increases the error resilience of the compressed bitstream. However, if we increase the number of intra-coded macroblocks (MBs), more bits are required to encode these MBs. Then, the quantization error has to increase for the other MBs under a fixed bit rate. To determine the percentage of intra-coded MBs, channel parameters such as the bandwidth and the bit error rate (BER) should be taken into account for optimization. An adaptive mode selection scheme for the coding of each macroblock based on source and channel parameters has been proposed, called the Adaptive Intra Refresh (AIR) algorithm [41], [5].
In [41], Zhang calculated the expected distortion subject to the previous packet error rate for each pixel to be encoded. Cote [5] proposed an algorithm with which the macroblock-level expected distortion can be calculated recursively. In both algorithms, the expected distortion is calculated recursively by taking into account encoding parameters, such as the prediction mode and motion vectors, and the error concealment algorithm. The optimal mode selection was performed in [41], [5] based on a rate-distortion (R-D) optimization algorithm with the calculated expected distortion. Even though the recursive distortion calculation method reflects the propagation of the error introduced by imperfect transmission, these works did not consider the spatial filtering effect in the prediction loop that results in the decay of the propagation error. This is known as the leaky prediction effect. The spatial filters applied in the prediction loop are the half-pel motion compensation and the deblocking filter. The error propagation effect in the motion-compensated
predictive coder (MCP) was analyzed and modeled by Girod [8], [7]. Girod's work takes the spatial filtering effect into account so that the leaky effect can be well explained.
A macroblock-based corruption model was proposed in our previous work and applied to prioritize packetized video bitstreams for the DiffServ network [18]. In this research, the corruption model is used to select the coding mode of a macroblock by estimating the expected distortion due to packet loss. The amount of distortion is calculated for the loss of the current macroblock as well as of the previous macroblock. Thus, with the same recursive calculation method, the expected distortion can be calculated and used in the mode decision for intra refresh. This method can also be applied to error tracking when feedback information is available. Furthermore, the proposed AIR algorithm has the ability to regulate the packet loss effect so that the range and the shape of the relative priority index (RPI) become better suited for use in the DiffServ network.
The rest of the chapter is organized as follows. The video delivery system is introduced in Section 4.2, in which the corruption model for AIR and UEP is described. With the corruption model, the mode decision for AIR and the packet prioritization with RPI are investigated in Section 4.3. The mode decision is realized by estimating the distortion that results from packet loss as well as error propagation. The feedback information is used for error tracking in the proposed system. Under the proposed
system, error resilience tools such as AIR, error tracking (ET) and UEP are evaluated by computer simulation in Section 4.4. Finally, concluding remarks are given in Section 4.5.
4.2 Transmitter Structure for AIR/UEP Coordination
A robust video delivery system was introduced in Chapter 1. In this research, joint source and channel coding is achieved by coordinating AIR in the source coding and UEP in the network adaptation. In the packetized video delivery system, the proposed corruption model is used to enhance the error resilience capability. It assumes the existence of a network that supports prioritized variable-rate delivery and an associated pricing mechanism. Since a video codec has several options to trade compression efficiency for error resiliency and network friendliness, the coordinated system has to provide a simplified interaction process between the video encoder and the target network as well as source-level error resilience with AIR. The main idea of this system is to use the corruption model to calculate the packet loss effect and the expected distortion of the underlying macroblock. Thus, the only difference from the previous video delivery system is the transmitter structure; the remaining part is exactly the same as that described in Chapter 1.
Figure 4.1: The packet video delivery system employing the corruption model.
The modified transmitter consists of the video encoder, the packetizer, and the RPI generation module, as shown in Fig. 4.1. The video encoder has the capability to insert intra macroblocks to adapt its resiliency to the network condition. In order to select intra-refresh macroblocks, the expected distortion is calculated with the proposed corruption model. When the video encoder encodes a macroblock, the expected distortion is calculated for a given packet erasure rate with the corruption model by recursively exploring previously encoded frames through the motion compensation paths within the estimation window. The estimated distortion is used for the mode selection of intra refresh. While the propagation error from each of the reference macroblocks to the current macroblock is calculated, these distortions are accumulated with respect to each of the reference macroblocks. When the encoded
bitstream leaves the encoder buffer, the bitstream is packetized and the accumulated packet loss effect is used for RPI categorization. That is, the resulting video packets are associated with RPI so that their impacts on video quality can be tied to the network adaptation module and the underlying delivery network. The network adaptation layer is responsible for mapping the RPI value to a network priority level such as the DiffServ level. Generally speaking, any reliable transmission scheme can be assumed as long as it supports RPI-associated differentiation. For example, it can be applied in robust video transmission with adaptive FEC-based protection [21] or DiffServ packet forwarding [29].
4.3 Joint AIR/UEP with the Corruption Model

4.3.1 Adaptive Intra-Refresh (AIR)
AIR is an algorithm that adaptively selects the optimal mode for each MB according to the source and channel characteristics. For its optimization, the expected distortion of each MB due to previous channel errors should be estimated based on the encoding parameters and the channel status. Based on the calculated distortion, the R-D optimization for each MB is performed over all possible encoding modes. Unlike the R-D optimization for the error-free case, the expected distortion caused by the channel error as well as the quantization should be considered simultaneously. The problem is how to calculate the expected distortion.
The expected distortion of an MB is affected by the error probability, the encoding parameters of previous frames, and the picture characteristics. In general, these characteristics should be estimated with respect to a given input image sequence. However, in order to leverage AIR, we calculate these parameters for each MB. The dependency of the current MB on the previously encoded MBs is needed in calculating the parameters for each MB. This can be achieved with a recursive calculation that requires a huge computational complexity. If we consider both the error and the error-free cases of MBs, the number of tracing paths increases so quickly that it is not applicable in practical applications. For example, when the expected distortion is calculated for an MB that is coded in a non-intra mode, the expected distortion depends on the distortion of the reference MBs of the previous frame. Furthermore, the distortion of those reference MBs depends on their own reference MBs in the frame before, and so on. Thus, an approximation is needed to simplify the analysis in practical applications.
Figure 4.2 illustrates the recursive calculation method of the expected distortion
of an MB. The expected distortion D(a) of macroblock M B (a) in frame n can be
expressed as the distortion due to the quantization error and the propagation error.
This distortion is also determined by the macroblock coding mode of M B (a). If
M B (a) is coded in the intra-mode, the quantization error is equal to the distortion.
If M B {a) is coded in the inter-mode, the distortion is the sum of the quantization
error and the propagation error. The propagation error is caused by the error of
Figure 4.2: Distortion estimation.
reference MBs in previous frames. Let MB(b) denote a reference MB of
MB(a); MB(b) is determined by the prediction motion vector of MB(a).
If the expected distortion of MB(b) is D(b), then the expected distortion of MB(a)
can be written as
D(a) = D_q(a) + D_p(a) = D_q(a) + f(D(b)),   (4.1)
where D_q(a) and D_p(a) are the quantization error and the propagation error of
MB(a), respectively, and f(·) is the power transfer function of the propagation
error due to the prediction loop filter.
As described in Chapter 3, the decay of the propagation error depends on
the prediction loop filter. When half-pel motion estimation is utilized, the filter
characteristics are determined by the half-pel flags of the motion vectors. The
expected distortion D(b) of MB(b) can be calculated via
D(b) = (1 - p) · (D_q(b) + D_p(b)) + p · D_c(b),   (4.2)
where p is the block error probability and D_c(b) is the distortion when MB(b) is
lost and reconstructed with error concealment. If MB(b) is coded in the intra mode,
D_p(b) is zero. If MB(b) is coded in the inter mode, D_p(b) is the propagated
distortion. If MB(b) is reconstructed with motion-compensated error concealment
and its reference macroblock is MB(c), then the distortion of MB(c) will affect the
expected distortion of MB(b). In this case, we would need to examine two reference
MBs for each MB, and the number of tracing paths would increase by a geometric
factor. By assuming that p is much smaller than one, we can omit the tracing of
the distortion of the concealed reference MB because the p² terms are negligible.
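Under these assumptions, the recursion of (4.1)-(4.2) can be sketched as below. This is a minimal illustration, not the thesis implementation: the record fields (d_q, d_c, a scalar decay standing in for f(·), and ref) are stand-ins for the stored per-MB parameters.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MB:
    """Hypothetical per-MB record: quantization error d_q, concealment
    error d_c, loop-filter attenuation (the f(.) of (4.1)) as a scalar
    decay, and the reference MB in the previous frame."""
    mode: str                  # 'intra' or 'inter'
    d_q: float
    d_c: float
    decay: float = 1.0
    ref: Optional['MB'] = None

def expected_distortion(mb: MB, p: float, depth: int = 0, max_depth: int = 20) -> float:
    """Eq. (4.2): D(b) = (1-p)*(Dq(b)+Dp(b)) + p*Dc(b), with the concealed
    reference path dropped since its weight is O(p^2)."""
    if mb.mode == 'intra' or mb.ref is None or depth >= max_depth:
        d_p = 0.0                                  # intra coding stops propagation
    else:
        # Dp(b) = f(D(ref)): reference distortion attenuated by the loop filter.
        d_p = mb.decay * expected_distortion(mb.ref, p, depth + 1, max_depth)
    return (1.0 - p) * (mb.d_q + d_p) + p * mb.d_c
```

The recursion terminates at intra MBs or at the estimation-window depth, which keeps the tracing cost bounded.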
With the above assumptions, we can estimate the expected distortion of each
MB under a given packet error rate (PER). The estimated distortion of each MB is
used to calculate the overall distortion, which is minimized with an optimal mode
selection method. To achieve the optimal mode selection, R-D optimization is
performed [30]. The Lagrangian cost for the mode selection of MB i can be
expressed as
J_i(QP, p, mode) = D_i(QP, p, mode) + λ_mode(QP, p) · R_i(QP, mode),   (4.3)
where
D_i(QP, p, mode) = (1 - p) · D_i^c(QP, p, mode) + p · D_i^l(QP, p, mode),   (4.4)
and where D_i^c is the distortion when this MB is transmitted without error and D_i^l is
the distortion when this MB is lost. The value of D_i^c for an intra-coded MB is given
by
D_i^c(QP, p, Intra) = D_{i,q}(QP).   (4.5)
Since an intra-coded MB is not affected by previous errors, its distortion is only
a function of the quantization step size. However, if the MB is coded in the inter
mode, its distortion can be written as
D_i^c(QP, p, Inter) = D_{i,q}(QP) + D_{i,p}(p),   (4.6)
where D_{i,p} is the distortion transferred via error propagation. Since
D_i^l(QP, p, mode) is the distortion when the corresponding MB is lost at the
decoder, it can be calculated in the same way for both intra-coded and inter-coded
MBs: the decoder does not know the coding mode of a lost MB and therefore performs
the same error concealment on it. As a result, the distortion due to the MB loss
is
D_i^l(QP, p, Intra) = D_i^l(QP, p, Inter) = D_i^l(QP, p).   (4.7)
Table 4.1: Macroblock encoding parameters for the corruption model.

    Parameters              Expression
    Mode                    Intra/Inter
    Motion Vectors          mv_x, mv_y
    Initial Error           E_I
    Decaying Factor         l_x, l_y
    Macroblock Loss Rate    p
This term is common in (4.3) to both the inter-coded and intra-coded modes. Thus,
it does not affect the mode decision, and (4.3) can be expressed as
J_i(QP, p, mode) = (1 - p) · D_i^c(QP, p, mode) + λ_mode(QP, p) · R_i(QP, mode).   (4.8)
In (4.6), D_{i,p}(p) is the distortion due to the propagation error; it can be
calculated by utilizing the proposed corruption model and the recursive algorithm
defined in (4.2).
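The resulting per-MB mode decision can be sketched as follows; the argument names are illustrative, and the common loss term p·D_i^l is dropped as in (4.8).

```python
def choose_mode(d_q_intra, r_intra, d_q_inter, d_prop, r_inter, p, lam):
    """Per-MB decision minimizing (4.8): J = (1-p)*D^c + lambda*R.
    The common loss term p*D^l of (4.4) is omitted since, by (4.7), it is
    identical for both modes and cancels in the comparison."""
    j_intra = (1.0 - p) * d_q_intra + lam * r_intra             # D^c per (4.5)
    j_inter = (1.0 - p) * (d_q_inter + d_prop) + lam * r_inter  # D^c per (4.6)
    return ('intra', j_intra) if j_intra <= j_inter else ('inter', j_inter)
```

With no packet loss and no propagated error, the cheaper inter mode wins; as the propagated distortion d_prop grows under loss, the decision flips to intra, which is exactly the balancing behavior described in the text.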
To calculate the propagation error recursively, the encoding history should be
stored in memory. That is, we have to store the encoding parameters of each MB,
including the coding mode, motion vectors, initial error, and decaying factor
as described in Chapter 3. They are summarized in Table 4.1. In addition to these
parameters, the error probability of each MB should also be stored. This
information can be used in association with the feedback information at a later
stage, as described at the end of this section.
In Table 4.1, the initial error denotes the error energy of a lost
MB, which can be calculated from the difference of the reconstructed MBs with and
without loss, respectively. The reconstruction of a lost MB is performed with an
error concealment algorithm in the decoder. The encoder can reproduce the concealed
MB if the error concealment algorithm adopted by the decoder is known. To calculate
the decaying factor, we perform the actual filtering on the initial error. Then,
the decaying factor is calculated as described in Chapter 1. The filters of our
concern are those applied when half-pel motion compensation is performed. Thus,
the horizontal and vertical decaying factors can be calculated and stored
accordingly. These decaying factors are selectively used according to the half-pel
flag of the motion vector when this MB is referenced by a future MB. The MB loss
probability p is set to the same value as the packet loss probability. Here, it is
assumed that the packet loss probability is the same for each packet, and that each
packet contains one GOB consisting of the 11 MBs in the same row.
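The stored encoding history of Table 4.1 can be sketched as a small record per MB. The field names and the half-pel-flag selection logic below are illustrative assumptions, not the thesis data layout.

```python
from dataclasses import dataclass

@dataclass
class MBHistory:
    """Stored per-MB encoding parameters in the spirit of Table 4.1."""
    mode: str          # 'intra' or 'inter'
    mv: tuple          # (mv_x, mv_y) in half-pel units
    init_error: float  # initial error energy if this MB is lost (assumed name)
    decay_h: float     # horizontal decaying factor (half-pel filtering)
    decay_v: float     # vertical decaying factor
    loss_prob: float   # MB loss probability p (= packet loss probability)

    def decay_for(self, mv_x: int, mv_y: int) -> float:
        """Combined decaying factor seen by a future MB referencing this one:
        the half-pel flag of each MV component decides whether the
        interpolation filter (and hence the decaying factor) applies."""
        dh = self.decay_h if mv_x % 2 else 1.0   # odd half-pel unit -> filtered
        dv = self.decay_v if mv_y % 2 else 1.0
        return dh * dv
```

A full-pel reference applies no interpolation filter, so the factor degenerates to 1.0 in that direction.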
The error energy propagated to the current macroblock can be calculated from the
encoding parameters of the current macroblock and its reference macroblocks. The
reference macroblocks are determined by the motion vector of the current
macroblock. In general, the reference block can straddle the boundaries of several
encoded macroblocks. Suppose the current macroblock i references a block of the
previous frame that is positioned across several encoded macroblocks. The weight
is calculated with the portion of contributed area
from the reference macroblock to the current macroblock. Then, the propagation
error energy contributed by each reference macroblock can be calculated by

σ_p(i, j) = w(i, j) · σ_v(j),   (4.9)

where w(i, j) is the area weight of reference macroblock j. The amount of
distortion that propagates to macroblock i is the sum of (4.9) as follows:

σ_v(i) = Σ_j σ_p(i, j),   (4.10)
where j is the index of the macroblocks referenced by macroblock i. The
tracing depth over which the propagation error is calculated can be limited by the
estimation window size. The estimation window size can be determined by the
portion of intra macroblocks: the more intra macroblocks in the sequence, the
smaller the estimation window that suffices, because error propagation is limited
by the intra macroblock refresh.
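The area-weight computation described above can be sketched as follows for integer-pel motion vectors; the offset layout of the up-to-four covered macroblocks is an illustrative assumption.

```python
def area_weights(mv_x: int, mv_y: int, block: int = 16) -> dict:
    """Overlap-area weights w(i, j) of the up to four stored macroblocks
    covered by a motion-compensated reference block (integer-pel MV).
    Keys are (dx, dy) offsets of the covered MB grid cells; weights sum to 1."""
    ox, oy = mv_x % block, mv_y % block
    areas = {
        (0, 0): (block - ox) * (block - oy),
        (1, 0): ox * (block - oy),
        (0, 1): (block - ox) * oy,
        (1, 1): ox * oy,
    }
    total = float(block * block)
    return {k: a / total for k, a in areas.items() if a}   # drop zero-area cells

def propagated_energy(weights: dict, ref_energy: dict) -> float:
    """Weighted sum of reference-MB error energies in the style of (4.10)."""
    return sum(w * ref_energy[k] for k, w in weights.items())
```

A block-aligned motion vector yields a single reference macroblock with weight 1, so no extra branches are created in the recursion for that case.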
The distortion calculated with the above method is substituted into (4.8), which
is then minimized for each macroblock over the candidate modes. Parameters QP and
λ determine the output rate and the expected distortion of the decoded sequence
under packet erasure rate p. If QP is small, the quantization error D_{i,q}(QP)
becomes small while the rate R_i(QP) increases. Thus, if the average QP is small,
fewer intra macroblocks are chosen and robustness is sacrificed. A smaller QP
value gives better quality under the no-loss situation. However,
quality degradation becomes larger under the packet loss condition due to the
poorer robustness.
If we choose a larger λ, the rate constraint becomes dominant and a smaller
number of intra macroblocks is preferred by the optimization under the no-loss
condition. However, if there is packet loss, the expected distortion D_{i,p}(p)
due to propagation error will increase. Consequently, the mode decision is
affected in such a way that more intra macroblocks are selected to decrease the
value of D_{i,p}(p). In the end, the number of intra macroblocks and the expected
distortion are balanced.
Based on a large number of simulations, rate-distortion curves for different packet
erasure rates can be drawn. Figure 4.3 shows the rate-distortion trade-off curves
for the Foreman sequence. The optimal Lagrangian multipliers for different packet
loss rates are given in Figure 4.4. We see from the figure that the optimal
Lagrangian multiplier does not depend much on the packet erasure rate, but rather
on the quantization step size. The optimal Lagrangian multiplier for mode selection
can be obtained from the formula λ = c · (QP)² [25]. In addition, the optimal
quantization step size depends on the packet erasure rate. To find the optimal
parameters for a given bit rate and packet erasure rate, the quantization step size
is first obtained from Figure 4.5; then, the optimal Lagrangian multiplier is
selected from the above formula. Bitstreams encoded with the optimal parameters
will be shown in Section 4.4.
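The two-step parameter selection (QP from the rate curves, then λ = c·(QP)²) can be sketched as below. The rate table and the constant c are hypothetical placeholders: the actual values come from measured curves such as those in Figures 4.4 and 4.5.

```python
def select_lambda(qp: float, c: float = 0.45) -> float:
    """Mode-selection multiplier lambda = c * QP^2.  The constant c is
    sequence-dependent and fitted from curves like Fig. 4.4; 0.45 here is
    purely an assumed placeholder."""
    return c * qp * qp

def select_qp(target_kbps: float, per: float, rate_table: dict) -> int:
    """Pick the smallest QP whose measured rate fits the budget, from
    hypothetical per-PER rate curves in the style of Fig. 4.5."""
    curve = rate_table[per]                      # {QP: measured kbps}
    feasible = [qp for qp, kbps in curve.items() if kbps <= target_kbps]
    return min(feasible) if feasible else max(curve)   # fall back to largest QP
```

The smallest feasible QP maximizes quality within the rate budget, matching the trade-off discussed above.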
Figure 4.3: The rate-distortion curves.

Figure 4.4: The quantization stepsize vs. the Lagrangian multiplier.

Figure 4.5: The quantization stepsize vs. the bit rate.
If the packet loss information is available from a feedback channel, the
reconstructed video sequence at the decoder can be estimated by the encoder. In
addition to the packet loss information, if the error concealment algorithm applied
at the decoder is known, the frames reconstructed by the decoder under a certain
packet loss rate can be predicted more accurately with error tracking, in which the
decoding task is also performed at the transmitter side. After reconstructing corrupted
frames, the encoder can evaluate the distortion of the macroblock to be encoded
by using the reconstructed frame as a reference. If the current macroblock
references a corrupted macroblock of the previous frame and the expected distortion
of the current macroblock is higher than a threshold, then the macroblock is
encoded in the intra mode to stop error propagation.
The accuracy of the prediction of the reconstructed frame with feedback relies on
the feedback delay. A short feedback delay gives a more accurate prediction due to
a smaller uncertainty region, in which packets are still traveling through the
channel and the packet loss information has not yet arrived at the transmitter.
Figure 4.6 depicts the processing of AIR in terms of the frame number at a certain
time. After a frame is encoded, the compressed bitstream traverses the encoder
buffer for a length of N_buff. Then, the bitstream is sent to the receiver, and the
receiver sends the feedback information back to the sender, taking a length of
N_RTT. The distortion estimation for AIR is performed over a length of N_est. If
the estimation window size is larger than N_RTT + N_buff, then the feedback
information is useful.
Figure 4.6: The timing diagram for feedback in terms of the frame number.
Figure 4.7: Modification of the error propagation path with feedback information.
Error tracking with feedback can be achieved through the established R-D optimization method for AIR.
After the feedback information for a certain packet arrives, the corresponding
macroblock encoding parameters given in Table 4.1 are updated. If the macroblock
is lost, the macroblock loss rate p is set to 1, the macroblock mode is set to the
inter mode, and the motion vectors are error concealed. The error concealment
algorithm adopted in [15] copies motion vectors from the upper macroblocks. When
the distortion is calculated recursively within the estimation window, the updated
macroblock parameters are used in the calculation, as shown in Figure 4.7. Since
the propagation path has been changed by the macroblock loss, a new path with the
concealed motion vector is considered. With such a method, the encoder can
calculate the decoder-side distortion from the feedback information and block error
propagation effectively by selecting the intra mode in the AIR scheme.
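The feedback-driven update of the stored parameters can be sketched as follows. The history layout and the zero-MV fallback for a top-row macroblock are assumptions of this sketch, not specifics given in the text.

```python
class MBRec:
    """Minimal stand-in for a stored MB record (Table 4.1 fields)."""
    def __init__(self, mode, mv, loss_prob):
        self.mode, self.mv, self.loss_prob = mode, mv, loss_prob

def apply_feedback_loss(history, frame, row, col):
    """On a reported macroblock loss: set the loss rate to 1, force the mode
    to inter, and replace the motion vector with the concealed one, copied
    from the MB above per the concealment rule of [15].  `history` is a
    hypothetical mapping frame -> 2-D grid of MBRec."""
    mb = history[frame][row][col]
    mb.loss_prob = 1.0
    mb.mode = 'inter'
    mb.mv = history[frame][row - 1][col].mv if row > 0 else (0, 0)
    return mb
```

Subsequent recursive distortion estimates then follow the concealed propagation path instead of the original one, as Figure 4.7 illustrates.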
In the proposed video delivery system, intra macroblock refresh with
4.3.2 Joint AIR/UEP Scheme
AIR is applied in source coding, and the encoded bitstream is packetized for
transmission. If a DiffServ network is used for transmission, the packetized
bitstream is prioritized and assigned a DS level. AIR and UEP over the DiffServ
network can be integrated into one unified system. With AIR, error propagation can
be reduced effectively by inserting intra macroblocks. As a result, it reduces the
packet loss effect as well as the difference in loss effects between packets. The
RPI values for UEP have a different range and distribution according to the
sequence characteristics and the coding parameters. When the AIR method is used in
encoding, this range and distribution can be adjusted. Therefore, the DiffServ
mapping becomes easier with the help of RPI and by using the AIR method for source
coding.
4.3.3 Analysis of Computational Complexity
The number of operations for the recursive calculation depends on the number of
branches, which in turn depends on the motion vectors. If a reference block
partially covers four macroblocks, then four new branches are created, as shown in
Fig. 4.8. In the worst case, every subblock creates four new subblocks, so the
n-th frame has 4^n nodes, requiring O(4^n) operations. However, the area decreases
whenever a new branch is created, and the probability of creating four new branches
is proportional to the area. The expected number of operations in the worst case
is shown in Fig. 4.8. As a result, the number of operations is 4(n - 1)T per
macroblock (n < 20), where T denotes the operations per node, and this value is
very small in comparison with motion estimation (e.g., 256x256 operations for a
16x16 macroblock with a half-pel search increment). Thus, our corruption model
can be used in practical applications such as on-line video streaming.
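The expected operation count derived above can be sketched as a small helper; T is a per-node cost in arbitrary units.

```python
def expected_recursion_ops(n_frames: int, t: float = 1.0) -> float:
    """Expected cost of the recursive distortion trace per MB over an
    n-frame window: a node at depth k may spawn four branches, but only
    with probability proportional to its (shrinking) area, so each depth
    contributes 4*T expected operations, giving 4*(n-1)*T in total."""
    return sum(4.0 * t for _ in range(n_frames - 1))
```

Even for the full 20-frame window, the expected cost stays orders of magnitude below the cost of a half-pel motion search for one macroblock.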
Figure 4.8: Computation of the propagation error.
4.4 Experimental Results
We established an on-line coding environment for the evaluation of the proposed
delivery system. It was designed to operate effectively regardless of the
availability of the feedback signal. The AIR and UEP schemes were tested under
various packet error rates (PER). The simulation results demonstrate that the
proposed system allows packet-level protection based on the corruption model.
Simulations were carried out for the QCIF 'Foreman' sequence, encoded by the
H.263+ encoder at a frame rate of 10 fps and a bit rate of 100 kbps. The
synchronization code was inserted at the beginning of every GOB, leading to a GOB-based
Table 4.2: Parameters for the AIR simulation.

            PER     QP     λ     PSNR (no error)
    AIR1    0.02     8    100    32.56
    AIR2    0.05    11     54    31.76
    AIR3    0.1     13     55    30.98
packet. To simulate AIR, the quantization stepsize was selected from the curve of
Fig. 4.5 at a rate of 100 kbps. With this quantization step, the Lagrangian
multiplier was chosen from Fig. 4.4. Test streams were generated with the proposed
mode decision algorithm for PERs of 0, 0.02, 0.05, and 0.1, respectively. The
error pattern used to simulate the packet loss was generated with the Gilbert
model in the range of 0 to 0.15 PER. GOB packets were discarded from the stream
according to the error pattern, and the corrupted stream was decoded with the
specified error concealment method. Under each condition, the simulation was
performed 50 times with different error patterns of the same PER. Then, the
reconstructed sequence was compared with the original sequence, and its objective
quality was measured in terms of PSNR.
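A Gilbert-model loss pattern of the kind used here can be generated as below. This sketch is not the exact simulator used in the thesis; the transition probabilities are free parameters chosen to hit a target average PER.

```python
import random

def gilbert_pattern(n: int, p_gb: float, p_bg: float, seed: int = 0) -> list:
    """Two-state Gilbert loss pattern: the good state delivers a packet, the
    bad state drops it.  p_gb and p_bg are the good->bad and bad->good
    transition probabilities; the stationary loss rate is p_gb/(p_gb+p_bg)."""
    rng = random.Random(seed)
    bad, pattern = False, []
    for _ in range(n):
        # Flip state according to the transition probability of the current state.
        bad = rng.random() < p_gb if not bad else rng.random() >= p_bg
        pattern.append(bad)          # True = packet (GOB) lost
    return pattern
```

Unlike an i.i.d. loss model, the Gilbert model produces bursty losses, which is why multiple independent patterns per PER are averaged in the simulations.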
Table 4.2 shows the parameters used in the AIR simulation. For comparison, streams
generated with the cyclic intra refresh (CIR) scheme were also used, where the
intra refresh operation was performed in a random manner. CIR streams with
different refresh ratios were encoded at the same bit rate as the AIR streams by
adjusting the quantization stepsize and the percentage of intra refresh
macroblocks. The coding parameters for the CIR streams are given in Table 4.3.
Table 4.3: Parameters for the CIR simulation.

            QP    Intra MB (%)    PSNR (no error)
    CIR1     8        5.6             32.83
    CIR2    10       20.0             31.85
    CIR3    15       50.0             30.06
The average PSNR values for the AIR streams are shown for different PER values in
Fig. 4.9. The performance of CIR is given in Fig. 4.10 for comparison. From these
results, we can verify the trade-off between error resiliency and coding efficiency
in both cases. As expected, the proposed AIR scheme provides better PSNR. This is
clearly shown in Fig. 4.11 by comparing the best performances of the two schemes:
there is almost a 1 dB gain in PSNR for most PER values.
Figure 4.9: The PSNR performance of AIR as a function of PER.

Figure 4.10: The PSNR performance of CIR as a function of PER.
To verify the performance improvement due to error tracking with the feedback
signal, the performance of AIR was studied under a round trip time (RTT) from 1 to
20 frames of delay. The test stream was generated with 0.05 PER for the AIR case.
The estimation window size was set to a period of 20 frames. Fig. 4.12 shows the
error tracking result in comparison with that of using AIR only (i.e., without
feedback). The same error pattern was used in both cases, and the corresponding
parameters were adjusted so that both bit rates were equal to 100 kbps. As shown
in Fig. 4.12, the performance is optimized by alternating the choice adaptively
according to the RTT.
Figure 4.11: The PSNR performance comparison of AIR and CIR.

Figure 4.12: The gain of error tracking with feedback according to delay (for PER = 0.05).
Finally, simulations were performed for the joint AIR/UEP scheme. Streams encoded
with 5% AIR and 20% CIR were tested for UEP and EEP, respectively. A DiffServ
network with 10 DS levels in the range from 2% to 12% PER was approximated. For
comparison, a certain DS level with the same budget as UEP was chosen for EEP. As
shown in Fig. 4.13, we see clearly an enhanced performance with UEP.
Figure 4.13: The gain of the joint AIR/UEP scheme.
4.4.1 Computational Complexity Evaluation
The computational complexity of RPI and AIR generation was evaluated using the
Foreman sequence; 134 out of 400 frames were encoded (2 frames skipped between
encoded frames). First, the encoding speed for RPI was tested. Because RPI is
calculated by tracing error propagation through the estimation window, the encoding
speed depends on the estimation window size. The intra macroblock refresh
parameter was set to 40 (2.5%). The encoding speed degradation with respect to the
estimation window size is very small, as shown in Fig. 4.14: there is only 5%
additional computational load when the estimation window size is set to 21 frames.
If more intra macroblocks are inserted, a smaller window size suffices, and the
computational load becomes negligible.
In the AIR case, there are two major added computations, i.e., distortion
estimation and R-D optimization. In our simulation, the R-D optimization was
performed by a brute-force approach, in which a total of 3 encoding modes are
evaluated for
Figure 4.14: The encoding speed with RPI generation.

Figure 4.15: The encoding speed with AIR.
each macroblock. DCT, quantization, and VLC are performed three times for the R-D
optimization. Therefore, the maximum speed is lower than with RPI generation
alone. However, this R-D optimization process can be reduced with some
approximations. Note that the speed degradation is very low (8% maximum). It can
be further reduced to less than 4%, since the same error resilience is obtained
even with a smaller estimation window size, as shown in Fig. 4.15. This can be
explained as follows. The bigger the estimation window size, the larger the
resulting estimated distortion, and thus the more intra macroblocks are inserted.
In that case, error propagation is limited to a smaller number of frames, which
means that the estimation window size can be smaller. As a result, the computation
does not increase much.
The computational overhead was evaluated via simulation. The overhead is very
small even in the worst case, which verifies the computational complexity analysis
in Section 4.3.3. The overhead can be further reduced by selecting a
proper estimation window size according to the encoding parameters. For instance,
for a higher packet erasure rate, a smaller estimation window size is enough to
evaluate error propagation due to the higher portion of intra macroblocks. An
important advantage of the proposed corruption model is that it can be used with a
flexible computational overhead according to the system computing power.
4.5 Conclusion
A corruption model was first reviewed in this research. Then, a robust video
transmission system based on the corruption model was given, where AIR, UEP, and
joint AIR/UEP schemes were proposed by exploiting such a model at a moderate
computational complexity. In the AIR scheme, the expected distortion measured with
the corruption model is used for the optimal mode selection. If feedback is
available, the reported loss is used to update the expected distortion so that the
error can be tracked more accurately. The loss impact of each macroblock is
calculated from the corruption model to prioritize each packet. The prioritized
packets are categorized and optimally assigned to a prioritized delivery network
such as the DiffServ network. Finally, it was shown that the integration of AIR
and UEP provides a better performance and allows a flexible configuration of the
proposed video delivery system.
Chapter 5
Video Packet Categorization for Priority Delivery
to Enhance End-to-End QoS Performance
5.1 Introduction
In the previous chapters, several corruption models were examined for different
applications, where video packets can be delivered with different loss rates. The
loss rate is defined as the number of lost packets over the total number of
transmitted packets. However, the size of each packet was not taken into account,
since a wired Internet connection is characterized by the packet loss rate (rather
than the packet size). The packet size can be variable, and it can be large enough
to include one or more GOB packets. In contrast, for wireless applications, the
transmission channel is characterized by the signal-to-noise ratio (SNR) or the
bit error rate (BER). In this case, the video packet size and the amount of
protection bits depend on the
channel condition. Usually, the packet size is smaller, and the loss of each packet
can be isolated from the loss of other packets.
To support error resilience for wireless transmission, fixed-size video
packetization and data partitioning are included in most video compression
standards, such as MPEG-2, H.263+, and MPEG-4. In this chapter, packet
prioritization with the corruption model is proposed for unequal error protection
in wireless video transmission.
Data partitioning has higher coding efficiency than layered coding, since it
demands only the synchronization code overhead, while layered coding requires a
whole sub-structure to make the bitstreams independent in syntax parsing. Data
partitioning has been introduced either to split one bitstream into two or more
layers, as done in MPEG-2 [12], or to isolate error propagation between bitstreams
of different categories, as done in MPEG-4 [23]. Most previous work categorizes
compressed bits according to their meaning, such as headers, motion vectors, DCT
coefficients, etc. These categorized bitstreams are prioritized by their
importance. For instance, headers are most important, motion vectors are more
important than DCT coefficients, DC coefficients are more important than AC
coefficients, and so on. To quantify the importance of the categorized bitstreams,
the loss impact of each bitstream on the decoded sequence can be measured
experimentally, and the measured value is then used to determine the importance of
each category.
There is, however, little work that quantifies the importance of an individual
partitioned packet. Even though the quantification task was performed in [2],
the issue of differentiating packets within one category was not addressed. In
[2], the average packet loss impact for categorized packets is pre-calculated with
exhaustive computation. (Note that it is difficult to measure the packet loss
impact due to the error propagation behavior of the motion compensated prediction
(MCP) present in most video coding standards.) However, this approach has
limitations. First, the quantity can vary with the sequence and the encoding
parameters. Second, differentiation of individual packets within the same class is
not possible with this pre-determined, category-based packet loss quantification
scheme. As a result, the prioritization can be inaccurate and the overall
performance can be severely degraded.
To enhance the end-to-end QoS, it is important to have a quantification scheme
that accurately reflects the packet loss impact at a moderate computational
complexity. To increase the accuracy of the packet loss impact estimation, we
should take into account many factors, such as the sequence characteristics, the
encoding parameters, and the dependency between packets. Besides accurate
estimation, the computational complexity should be kept low for practical
consideration. In this research, we propose the use of the corruption model to
estimate the packet loss impact of data partitioned video packets. The corruption
model was analytically developed to estimate macroblock-based (MB-based) error
propagation in [17]. On
one hand, the model takes into account the encoding parameters and the picture
characteristics to increase the accuracy of the estimation. On the other hand, it
has a much lower computational complexity than the actual calculation.
The goal of this research is to provide a packet categorization method for
prioritized packet delivery over the wireless network. It is designed for data
partitioned video packets. The proposed method quantifies packet importance for
prioritization based on the end-to-end quality of the partitioned classes as well
as the estimated loss impact of each individual packet. The complexity and
accuracy can be traded off according to the needs and limitations of a specific
application. Generally speaking, our method can be applied to error resilient
video transmission through unreliable channels with UEP, where the unreliable
channel can be the DiffServ network in the Internet or a wireless channel, and the
UEP can be realized via adaptive channel coding.
The rest of this chapter is organized as follows. A packet delivery system with
data partitioned video packets is described in Section 5.2. A data partitioning
method for error resilience and its associated corruption model are presented in
Section 5.3. An algorithm to estimate the distortion due to the loss of a packet
is described in Section 5.4. Experiments are performed to verify the proposed
corruption model for error resilient ITU-T H.263+ video in Section 5.5. Finally,
some concluding remarks are given in Section 7.7.
5.2 System Overview
A packetized video delivery system employing the proposed corruption model is
shown in Figure 5.1. In this system, the transmission channel is a wireless
channel, and it is assumed that different levels of error protection are available
in the channel encoder.
Figure 5.1: The wireless packet video delivery system employing the RPI-based
corruption model.
The proposed delivery system consists of the video encoder with data partitioning,
the packetizer, the corruption model association module, the channel encoder, the
underlying wireless channel, the channel decoder for error detection and
correction, the de-packetizer, and the video decoder. The video encoder generates
data partitioned bitstreams for the different layers. The bitstream in each layer
is individually packetized with a synchronization word and header information so
that each packet is independent of the loss of other packets. When the input video
is encoded, error resilient techniques can be exploited. The resulting packets are
then associated with an index, called the RPI, to account for their impact on video
quality. The RPI is generated with the corruption model. In the channel coder, each RPI
is mapped to a different error protection level that controls the BER. Transmitted
packets are received by the channel decoder. The channel decoder corrects errors
or discards a packet if the error in that packet is beyond the recoverable range.
Successfully recovered packets are de-packetized, de-multiplexed, and decoded for
rendering. Generally, any reliable transmission scheme can be assumed as long as
it supports RPI-associated differentiation. For example, the scheme can be applied to
robust video transmission with adaptive forward error correction [21] or DiffServ
packet forwarding [29].
5.3 Packet Categorization with Corruption Model
5.3.1 Data Partitioning
To apply unequal error protection to packets, the packets should have different
importance in their quality contribution. In traditional packetization schemes, such as
the group of blocks (GOB) in H.263 or the video packet in MPEG-4, we do not see a wide
range of importance differentiation. Layered coding, in turn, requires large
redundancy to make the separated bitstreams syntactically independent. That is,
each bitstream must carry the synchronization code and the header information for the
picture and macroblocks. In contrast, data partitioned bitstreams are dependent on each
other, so that no duplicate header structure for the picture and macroblocks is
required, only a small amount of extra synchronization codes. As a result, data partitioned
bitstreams still have high coding efficiency.
We adopt a data partitioning method similar to that specified by MPEG-2,
where data partitioned packets are separated, i.e., each packet contains only data
from one of the syntax categories. Syntax elements are classified into four
classes: (1) picture headers, (2) macroblock headers, (3) motion vectors and DC
coefficients, and (4) AC coefficients. As shown in Fig. 5.2, the encoded picture is
partitioned into these classes. They are then packetized into fixed-size packets
containing data from the same class. In the packetization, since the picture header is too small
to form an individual packet alone, we put picture headers and macroblock headers
together in one class so that one packet includes the picture header and the macroblock
headers. Similarly, DC coefficients and motion vectors are packetized together. AC
coefficients are packetized by themselves.
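The partitioning rule above can be sketched as follows. This is an illustrative sketch, not the H.263+/MPEG-2 syntax itself; the element names and stream keys are assumptions:

```python
# Sketch of the four-class data partitioning described above.
# Classes 1 and 2 (picture + MB headers) share packets; motion vectors
# and DC coefficients form class 3; AC coefficients form class 4.
def classify(element_type):
    """Map a syntax element to its partition class (illustrative names)."""
    mapping = {
        "picture_header": 1,
        "mb_header": 2,
        "motion_vector": 3,
        "dc_coefficient": 3,
        "ac_coefficient": 4,
    }
    return mapping[element_type]

def partition(elements):
    """Group encoded syntax elements into the three packetized streams."""
    streams = {"headers": [], "dc_mv": [], "ac": []}
    for kind, bits in elements:
        cls = classify(kind)
        key = "headers" if cls <= 2 else ("dc_mv" if cls == 3 else "ac")
        streams[key].append(bits)
    return streams
```

Each stream is then cut into fixed-size packets independently, as described next.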
When a packet is decoded, codes are demultiplexed according to the syntax from
a higher level to a lower one of the partitioned bitstreams. To decode one picture,
the picture header is first decoded, and then the macroblock header is decoded
based on the decoded picture header. Thus, if the picture header is lost, the entire
picture is lost even though lower level partition data are available. If the
macroblock header is decoded correctly, one knows which data follow for the motion
vectors (MV) and AC coefficients. For instance, if the macroblock header indicates an
[Figure: macroblock syntax elements: VOP header, MB header, DC coefficients and MVs, ACs, and EOB.]
Figure 5.2: Classification for data partitioning
inter mode macroblock with 4MV, then there will be four motion vector differences, and
AC coefficients will follow if a coded block pattern is included in the macroblock
header.
The number of AC coefficients can be known from the end of block (EOB) code
in the AC coefficient bitstream. After all necessary AC coefficients are decoded,
the next macroblock header is decoded from the MB header bitstream. As
shown in the figure, there is dependency between the different categories. Due to this
dependency, the higher level data are more important than the lower level data. DC
coefficients and MVs can be decoded without the AC coefficients in the corresponding MB. Also,
AC coefficients can be decoded without the DC coefficients and MVs if the MB header is decoded
correctly. The importance of the MV and AC coefficients can differ according
to their contents. Their dependency is determined by the error concealment
algorithm of the decoder. When a decoder can use received AC coefficients without
the MV data, the AC coefficients are not dependent on the MV data. Otherwise, the AC
coefficients depend on the reception of the MV data.
After data partitioning, packetization is performed for each partitioned data stream. In
order to isolate the decoding of each picture from others, we do not allow a packet to
contain data from different pictures. On the other hand, we let the picture header
data belong to the MB header data to avoid a very small packet. Packet sizes are
roughly equal; however, this rule is not strictly applied. If the remaining data
amount in one category is too small to match the size, it is added to the previous
packet to form a larger packet. We construct each packet to keep the MB boundary so
that each new packet starts with byte-aligned MB data. The remaining bits in a
packet are filled with stuffing bits. If the data amount of one partition is larger than
the allowed packet size, it is split into multiple packets with synchronization codes.
In addition to the synchronization code, the frame counter, the class number, and
the MB number are inserted as shown in Fig. 5.3.
In the case of packet loss, the decoder should be able to use as much received data as
possible to improve the reconstructed video quality. This may be difficult because
a lost packet can cause a synchronization problem. The decoder
should know which macroblocks correspond to the received packet and which class
the packet contains. For this purpose, the class number and the MB number follow
the synchronization code. A temporal reference can be inserted to handle heavier packet
loss or mis-ordered packet delivery that may cause mis-assignment of packets to an
incorrect frame. In this work, we adopt 17 bits for the synchronization code, 4 bits
[Figure: the partitioned streams (VOP and MB header packets, DC & MV packets, AC packets); each packet begins with sync bits, class_num, and MB_num, and ends with stuffing bits.]
Figure 5.3: Packetization of a partitioned video bitstream
for temporal reference, 2 bits for the class number and 9 bits for the MB number.
Therefore, 4 bytes are added for the synchronization purpose.
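The 4-byte resynchronization header (17 + 4 + 2 + 9 = 32 bits) can be packed and parsed with plain shifts and masks. A minimal sketch; the sync word value chosen here is an assumption, not the code word used in this work:

```python
# Pack the 4-byte resynchronization header described above:
# 17-bit sync code, 4-bit temporal reference, 2-bit class number,
# 9-bit MB number -> exactly 32 bits.
SYNC = 0b1_0000_0000_0000_0001  # 17-bit pattern (assumed, not from the text)

def pack_header(temporal_ref, class_num, mb_num):
    assert 0 <= temporal_ref < 16 and 0 <= class_num < 4 and 0 <= mb_num < 512
    word = (SYNC << 15) | (temporal_ref << 11) | (class_num << 9) | mb_num
    return word.to_bytes(4, "big")

def unpack_header(data):
    word = int.from_bytes(data, "big")
    return (word >> 11) & 0xF, (word >> 9) & 0x3, word & 0x1FF
```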
5.3.2 Macroblock-based Corruption Model
The distortion can be quantified with the mean square error (MSE) between the
decoded sequence in the decoder and the reconstructed sequence in the encoder. To
calculate the MSE, it is required to decode the sequence from an impaired bitstream,
which does not include data that belong to the lost packet. The MSE can be
calculated by comparing the impaired sequence and the decoded sequence without
packet loss (called the reference decoded sequence). We can perform this calculation
for every packet. However, this would require a huge amount of computation. To
avoid that, we propose a distortion estimation method based on a corruption model.
With this model, we can estimate the packet loss impact with reasonable accuracy
and moderate complexity.
There is dependency between packets from different categories. There are four
cases in decoding a macroblock at the decoder. In Case 1, all data are available.
In Case 2, the header data (picture and MB headers) and motion vectors are
available, but no coefficient data. In Case 3, the header data and
coefficient data are available, but no motion vectors. In Case 4, no data for the
macroblock are available. For each loss case, i.e., Cases 2 to 4, the decoder should
reconstruct the MB as well as possible with the provided information. There are
many ways to reduce the distortion caused by the loss of macroblock data. One way is
error concealment, where the reconstructed MB is chosen to be the concealed MB,
which lessens the distortion of the impaired macroblock.
With the pre-estimated concealed macroblock, we can estimate the loss impact of
each data category of the proposed scheme. From Cases 2, 3, and 4, the coefficient loss
impact, the motion vector loss impact, and the header loss impact can be estimated,
respectively.
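The four cases can be made concrete with a small dispatch. This is a toy sketch with scalar stand-ins for blocks; the zero-MV temporal-replacement concealment shown is one plausible policy rather than the only one discussed:

```python
# Sketch of the four macroblock decoding cases described above, with toy
# scalar stand-ins for blocks. The concealment policy (zero-MV temporal
# replacement) is one plausible choice, not mandated by the text.
def motion_compensate(ref_block, mv):
    return ref_block          # toy: ignore the vector, return the reference

def idct(coeffs):
    return coeffs             # toy: residual passes through

def reconstruct_mb(have_header, mv, coeffs, ref_block):
    if not have_header:                        # Case 4: nothing usable
        return motion_compensate(ref_block, (0, 0))
    if mv is not None and coeffs is not None:  # Case 1: full decode
        return motion_compensate(ref_block, mv) + idct(coeffs)
    if mv is not None:                         # Case 2: headers + MV only
        return motion_compensate(ref_block, mv)
    if coeffs is not None:                     # Case 3: headers + coeffs only
        return motion_compensate(ref_block, (0, 0)) + idct(coeffs)
    return motion_compensate(ref_block, (0, 0))
```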
For each case except Case 1, the macroblock reconstructed under the given conditions
deviates from the non-impaired macroblock. This is the first distortion due
to the packet loss; let us define this error as the initial error. For each macroblock,
three different initial errors can be calculated based on the cases defined above. The
initial errors can propagate when the macroblock is referenced in the next frame.
The propagating error in the next frame can propagate further to successive
frames. All errors initiated by the initial error can be considered the loss impact
of the first macroblock. The loss impact defined with the MSE of the impaired
sequence is equivalent to the sum of the MSE of the impaired macroblocks due to the
initial error over the decoded sequence.
The corruption model proposed in Chapter 3 is modified here to support data
partitioned video packets. According to the corruption model, the propagation behavior
differs for different initial errors due to their frequency characteristics. Thus,
the propagation error should be calculated for the different cases of packet loss. For a
data partitioned bitstream, compressed macroblock data are partitioned into different
packets and delivered in different ways. Therefore, some codes may be missing,
as described above. In each case, the initial error can be
different. That is, the energy of the initial error and the frequency characteristics
of the error can vary across the error cases. If the frequency characteristics are
different, the propagation error will also be different because the decaying factor $\gamma$ is not
the same. Thus, for one macroblock, we have three different initial error energies
and frequency characteristics, $(\sigma_{I,2}^2, \alpha_2^2)$, $(\sigma_{I,3}^2, \alpha_3^2)$, and $(\sigma_{I,4}^2, \alpha_4^2)$, for Cases
2, 3, and 4, respectively.
Note also that the initial errors depend on the error concealment algorithm. For
each error case, the decoder can do its best to conceal the artifact of the macroblock
with the received information. Different concealment macroblocks lead to different
initial errors. Finally, Eq. (3.8) can be extended to
$$\sigma_i^2 = \sigma_{I,i}^2 + \sum_{m=1}^{M} \sum_{j=1}^{N} \sigma_{v,i}^2(m, j), \qquad i = 2, 3, 4, \tag{5.1}$$

where

$$\sigma_{v,i}^2(m, j) = \frac{W(m, j)\, \sigma_{I,i}^2}{1 + \gamma_{m,j}} \quad \text{and} \quad \gamma_{m,j} = \frac{\alpha_{m,j}^2}{\sigma_g^2}, \qquad i = 2, 3, 4. \tag{5.2}$$
As shown in Eqs. (5.1) and (5.2), the same $W(m, j)$ and $\gamma_{m,j}$ are applied to
all cases. This means that the different errors propagate along the same path and
through the same prediction filter. With the proposed model, the loss impact of each
component of the encoded macroblock can be quantified with the mean square error
so that the components can be compared for prioritization.
5.4 RPI Generation with Proposed Corruption Model
The RPI generation method with the corruption model was described in Chapter 3.
In this section, an RPI is generated for packets of each case because the loss of packets
of different cases has a different impact. When one macroblock is encoded, error
concealment is performed for each error case. The difference between the concealment
macroblock and the decoded macroblock is the initial error. The variances of the initial
errors, $\sigma_{I,i}^2$, $i = 2, 3, 4$, are calculated and stored.
To take into account the error propagation behavior, the filter strength $\alpha^2$ obtained
from the power spectral density (PSD) of the initial error should be used. However,
measuring the PSD and $\alpha^2$ is complex. Instead of a direct calculation of these
values, the transition factor $\gamma_i$ is calculated for each possible filter strength. In this
work, we assume that only the half-pel motion prediction filter is used. For each of the
horizontal, vertical, and diagonal half-pel predictions, the filter is applied
to the initial error. Then, the variance of the output data is calculated. We obtain the
transition factors:
$$\gamma_i = \frac{\sigma_{v,i}^2}{\sigma_{I,i}^2}, \qquad i = 2, 3, 4, \tag{5.3}$$

where $\sigma_{I,i}^2$ is the variance of the initial error and $\sigma_{v,i}^2$ is the variance of the filtered
data for each case. We can calculate transition factors by applying the different filters.
The transition factors are $\gamma_{i,h}$ for the horizontal half-pel filter, $\gamma_{i,v}$ for the vertical
half-pel filter, and $\gamma_{i,hv}$ for the diagonal filter. These values are stored instead of storing
$\alpha_i^2$ and $\sigma_{v,i}^2$.
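The transition factor computation of Eq. (5.3) amounts to filtering the initial error block with a half-pel interpolator and taking the variance ratio. A minimal sketch under that reading (pure Python; bilinear half-pel averaging is an assumption):

```python
# Sketch of Eq. (5.3): apply a half-pel prediction filter to the initial
# error block and take the ratio of output to input variance.
def variance(block):
    flat = [v for row in block for v in row]
    mean = sum(flat) / len(flat)
    return sum((v - mean) ** 2 for v in flat) / len(flat)

def halfpel_filter(block, dx, dy):
    """Bilinear half-pel interpolation: (dx, dy) in {0, 1} selects the
    horizontal (1,0), vertical (0,1), or diagonal (1,1) filter."""
    h, w = len(block), len(block[0])
    out = []
    for y in range(h - dy):
        row = []
        for x in range(w - dx):
            taps = [block[y][x], block[y][x + dx],
                    block[y + dy][x], block[y + dy][x + dx]]
            row.append(sum(taps) / 4.0)
        out.append(row)
    return out

def transition_factor(initial_error, dx, dy):
    """gamma = var(filtered error) / var(initial error), per Eq. (5.3)."""
    return variance(halfpel_filter(initial_error, dx, dy)) / variance(initial_error)
```

A high-frequency (alternating-sign) error yields a transition factor near zero for the horizontal filter, matching the observation that high-frequency errors decay quickly under prediction filtering.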
After that, a recursive calculation of the propagation error is performed if the
encoding mode of the current macroblock is not intra. The filter type can
be determined from the half-pel flag of the motion vector. The coordinates of the
reference macroblock are then calculated. One can trace the error propagation recursively
with these parameters until the depth of the trace, in terms of the number of frames, reaches
zero. While the tracing is performed, the number of usages of each filter is counted
from the current macroblock to each contribution macroblock. Let these counts be
denoted by $n_h$, $n_v$, and $n_{hv}$ for the horizontal, the vertical, and the
diagonal half-pel filters, respectively. Then, the transition factor from a contribution
macroblock to the current macroblock can be calculated as

$$\gamma_i = n_h \cdot \gamma_{i,h} + n_v \cdot \gamma_{i,v} + n_{hv} \cdot \gamma_{i,hv}. \tag{5.4}$$

Based on $\gamma_i$ and the stored initial error $\sigma_{I,i}^2$ of the contribution macroblock, the
propagation error from the contribution macroblock to the current macroblock can
be calculated as

$$\sigma_{v,i}^2 = \gamma_i \cdot \sigma_{I,i}^2. \tag{5.5}$$

This is accumulated in the error propagation energy for the contribution macroblock.
Next, the reference macroblock of the contribution macroblock is found with the
stored motion vector of the contribution macroblock, and these steps are performed
recursively. If the parameter depth is larger than zero and the contribution macroblock is
encoded in the inter mode, the trace is continued with the recalculated parameters.
This is performed for all segments. Finally, after one frame is encoded, the sum of
variances of the initial error and the propagation error of each macroblock in the
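The recursive trace described above can be sketched as follows. The macroblock record fields (ref, half_pel, gammas, sigma_init_sq) are assumptions made for illustration; the per-hop energy is the transition factor of Eq. (5.4) times the stored initial error variance, as in Eq. (5.5):

```python
# Sketch of the recursive propagation trace: follow motion vectors back
# through the reference frames, count half-pel filter usages (Eq. 5.4),
# and accumulate the propagated energy (Eq. 5.5). The MB record fields
# are illustrative, not the thesis's actual data structures.
def trace(mb, depth, counts=(0, 0, 0)):
    """Return (ancestor, energy) pairs for each contribution macroblock."""
    if depth == 0 or mb.get("ref") is None:    # window end or intra-coded
        return []
    nh, nv, nhv = counts
    hp = mb["half_pel"]                        # "h", "v", or "hv"
    counts = (nh + (hp == "h"), nv + (hp == "v"), nhv + (hp == "hv"))
    ref = mb["ref"]
    gh, gv, ghv = ref["gammas"]                # stored per-filter factors
    gamma = counts[0] * gh + counts[1] * gv + counts[2] * ghv   # Eq. (5.4)
    energy = gamma * ref["sigma_init_sq"]                       # Eq. (5.5)
    return [(ref, energy)] + trace(ref, depth - 1, counts)
```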
earliest reference frame in the estimation window is reported as the MSE of the error
due to the macroblock loss.
5.5 Experimental Results
The proposed corruption model was first verified with simulations. Then, the QoS
mapping was performed based on the RPI calculated with the proposed corruption
model. Simulations were carried out with the QCIF Foreman sequence, encoded
by the H.263+ encoder at 30 fps and 300 kbps. Intra-MB refresh was employed
to increase robustness, and the proposed data partitioning scheme was applied to
the encoded bitstream. Classes 1 and 2 were packetized together, while Class 3 and Class 4
were packetized separately. The packet size was around 100 bytes. The
impact of packet loss was evaluated for each packet, and the error energy was estimated
for each MB by using the proposed corruption model for data partitioned packets.
In this simulation, the decoder performed error concealment for lost macroblocks.
Temporal replacement was used for error concealment to reduce the packet dependency,
which is significant in the case of motion compensated error concealment.
5.5.1 Verification of Proposed Corruption Model
The distortion capturing capability of the proposed model was verified by comparing
the error propagation behavior over time. Fig. 5.4 shows the error propagation
behavior of a single packet loss from Class 3 and from Class 4. The actual MSE was
measured by comparison with the reconstructed video without packet loss. Note
that the estimated MSE matches the actual MSE closely. In the
case of AC coefficient loss (the left of Fig. 5.4), the higher frequency components
decay rapidly over time due to the prediction filtering effect, while motion vector
loss introduces a large DC error that lasts longer.
[Figure: model-estimated versus actual MSE as a function of the frame number.]
Figure 5.4: The packet loss effect of a packet of Class 3 (right) and Class 4 (left).
Packets from the same class can have different packet loss impacts. The
distribution of packets of each class is given in Fig. 5.5. The packet loss impact was
estimated with the proposed corruption model. The loss impact of Class 4 packets
is relatively small compared to that of the other classes. However, some packets
from Class 4 have a higher packet loss effect, comparable to that of Class 3
packets and Class 1 & 2 packets. The packet loss impacts of Class 3 and Class 4 have
similar distributions. The motion vector loss has an impact on the decoded
sequence as large as that of the header data loss.
[Figures: histogram of per-packet MSE for each class (left) and packet loss rate versus unit packet cost (right).]
Figure 5.5: MSE distribution for packet classes.
Figure 5.6: The packet loss rate for the unit packet cost.
5.5.2 RPI-based Coordination of Network Adaptation
The RPI-based coordination of network adaptation was evaluated under the proposed
video delivery framework given in Fig. 5.1. The test sequence and encoding
parameters were the same as those given in the previous section. The network
delivered prioritized packets with different reliability properties corresponding to the
service level, which was rated according to the packet loss rate. For coordination,
the Lagrangian-based mapping method was implemented to satisfy the given cost
constraint.
With the proposed corruption model, the dependent RPI was calculated for all 3998
data partitioned packets of the Foreman sequence. The RPI values were then categorized into
6 different levels. Each level had a different packet loss rate and unit cost for
transmitting a packet. The relationship between the unit cost and the corresponding
packet loss rate is shown in Fig. 5.6.
For performance comparison, the video packet (VP) scheme was used as the reference.
A VP has its own synchronization code and header information. The VP stream consisted of
3702 packets, fewer than in the data partitioning case. As a result, a higher
price can be assigned to a VP than to a data partitioned packet. Given the total cost
and the pricing mechanism, optimal QoS mapping was applied to the data partitioned
packets with the Lagrangian method, whereas each VP was assigned to the level whose
packet loss rate corresponds to the same total cost as the DP packets.
Fig. 5.7 shows the average packet loss rate versus the total cost. VP has a lower
packet loss rate at the same total cost under the pricing mechanism. The end-to-end
quality is given in Fig. 5.8. The figure shows that the differentiated DP has a higher
packet loss rate but provides higher end-to-end quality over a noisy channel. The
advantage of the RPI assignment with data partitioning is obvious.
5.6 Conclusion
We proposed a corruption model for data partitioned video packets. The data partitioning
scheme provides the capability to differentiate packet importance, which
is accurately estimated with the proposed corruption model. The resulting estimate
approximates the real loss impact within a narrow margin while requiring only a small
computational overhead. When applied in association with the RPI assignment, the
[Figures: average packet loss rate versus total cost for DP and VP (left) and the corresponding end-to-end quality comparison (right).]
Figure 5.7: The total cost versus the average packet loss rate.
Figure 5.8: Performance comparison for data partitioned and video packets.
corruption model-based RPI satisfies the requirements of the proposed coordinated
packetized video delivery and provides a reasonable performance improvement.
Chapter 6
Instruction Design for MPEG-4 Video Codec for Configurable Processor
6.1 Introduction
With the growing demand for multimedia communications over wired and wireless
links, numerous SoC architectures with video processing capability have been
proposed to achieve low power consumption and cost. In such an SoC system, the
video compression IP module plays a fundamental role. The video coding technology
is essential in any multimedia processing SoC, since it can greatly reduce the
bandwidth and storage requirements for multimedia data with little sacrifice in
audio/video quality. The complexity of video codecs keeps growing to achieve higher
coding performance and to address error resiliency for wireless communications. One
good example is the latest development, MPEG-4 Part 10 (or H.264). This poses
a significant challenge to the VLSI implementation of modern video codecs. There have been
reports on the VLSI implementation of the MPEG-4 video codec for SoC that combine
hardwired blocks and a GPP. However, little work has been reported on its
implementation in an instruction-level configurable embedded processor. This is the
main contribution of our current work.
With the experience learned from the design of previous video compression standards
such as MPEG-1, MPEG-2, and H.263/+, implementational flexibility is an important
factor of concern. Since the traditional hardwired design is less flexible, the
processor-based implementation is a preferred choice. It is also the general trend. As
shown in Figure 6.1, VLSI implementations can be categorized into three types, i.e.,
hardwired, DSP-based, and hybrid. To achieve a higher performance with flexibility,
the hybrid architecture has been proposed, where some operation-intensive software
functions are implemented with hardwired blocks while other functions of a lower
complexity are implemented in software on a GPP [26]. The development time of
the hybrid architecture is, however, much longer than that of the DSP-based implementation
due to the interface between the hardwired blocks and the GPP.
The implementation of MPEG-4 on a high performance DSP (e.g., the TI TMS320
series) allows rapid development, since every function is implemented in software
with an instruction set accelerated for multimedia processing while a flexible
software structure is kept [3]. The MPEG-4 video compression standard is widely used
in different applications, and its video format varies with these applications. Therefore,
the best VLSI architecture for each application can be different. The number
[Figure: the DSP-based, hybrid, and hardwired implementation styles, with the software/hardware boundary drawn at the API, core function, and instruction levels.]
Figure 6.1: Illustration of VLSI implementation methods.
of operations increases as the amount of data to process increases, while the types
of operations and functions remain the same. For wireless applications, where a
small video format is used, the DSP-based architecture has more advantages, since
instructions can be shared by different functions and fewer hardware interfaces are
required. It is worthwhile to point out that a DSP processor like the TI TMS320C6x is not
designed only for video compression; therefore, many of its instructions go unused
in MPEG-4 video processing. One potential shortcoming of this solution is that
the processor may contain too many components unused by the video coding application,
and the IP cost of a high performance DSP could be too high to be attractive.
In this work, we describe the implementation of the MPEG-4 video encoder
and decoder using a configurable embedded processor. A new configurable
embedded processor called the Xtensa has recently been developed by Tensilica [36].
The Xtensa processor has a rich set of configuration parameters such as interface
options, memory subsystem options, and instruction options. In addition, the Tensilica
Instruction Extension (TIE) language is used to describe new instructions,
registers, and execution units, which are then automatically added to the Xtensa
processor. Thus, unlike the situation with a general DSP processor, we can add
instructions that are necessary only for video processing. To keep flexibility high,
it is better to keep the core functions in software. Instruction-level hardware
optimization with an embedded processor is adequate for SoCs for portable devices.
For the MPEG-4 video codec IP, we propose a two-step optimization as shown
in Figure 6.2. For portability, we first optimize the software only, applying the
general optimization rules for a GPP. The software optimization includes algorithmic
optimization and optimization for the GPP. The optimized software is written in C
or C++ and can be ported to any platform. In our current case, we
compile the program for the Xtensa processor Instruction Set Simulator (ISS). From
the cycle-based simulation, we can identify the most cycle-intensive functions. Then, we
design new instructions that help reduce the number of operations. Because
each step of this design involves far less interfacing than a hybrid architecture,
design and verification time can be reduced significantly.
This chapter is organized as follows. In Section 6.2, we give an overview of the MPEG-4
software. In Section 6.3, we describe the proposed two-step optimization in detail.
Finally, we summarize our optimization results and outline the future work.
[Figure: two-step design flow, platform-independent S/W optimization followed by platform-dependent S/W and H/W optimization.]
Figure 6.2: Design flow for flexibility.
6.2 MPEG-4 Codec Overview
6.2.1 MPEG-4 Overview
MPEG-4 is an ISO/IEC standard developed by the Moving Picture Experts Group
(MPEG) [20] for multimedia compression and delivery. It was adopted as an International
Standard (IS) in January 1999. This standard has a variety of application
areas such as digital television, interactive graphics, and interactive multimedia.
Part 2 of the MPEG-4 standard describes visual object coding technologies, where
the objects include natural video, still pictures, and 3D objects. The natural video
coding techniques consist of extended video coding tools in addition to those used
in its predecessor standards, such as MPEG-1 and MPEG-2.
The MPEG-4 natural video compression standard supports bit rates ranging
from 5 kbps to 10 Mbps for a wide range of resolutions, from Sub-QCIF (128x96) to
beyond HDTV. It provides higher coding efficiency than MPEG-1 and MPEG-2
with sophisticated encoding tools.
Figure 6.3: The block diagram of the MPEG-4 encoder.
Figure 6.4: The block diagram of the MPEG-4 decoder.
Figures 6.3 and 6.4 show the block diagrams of the MPEG-4 encoder and decoder,
respectively. In the encoding process, block-based motion estimation is performed to
find the best matched macroblock in the reference frame. The reference block is
positioned with the motion vector, which is the difference between the spatial positions of the two
blocks. The residual signal is calculated as the pixel difference between the predicted
reference block and the original. It is then DCT-transformed and quantized. The
quantized DCT coefficients are coded with the variable length code (VLC) and
transmitted with the motion vectors. On top of the basic motion compensated prediction
(MCP) described above, the advanced intra coding (AC/DC prediction) and
advanced prediction modes are utilized. For very low bit rates, less than 64 kbps,
error resilience tools are added in the standard. They include the video packet
(VP) with the Resynchronization Marker (RM), packet-level data partitioning (DP),
and reversible variable length coding (RVLC).
The decoding process is the reverse of the encoding process except for motion
estimation. The input bitstream of the decoder is parsed into codes according to
the syntax at the input stage. Codes for the quantized DCT coefficients are decoded
in the inverse quantization module, and then inverse DCT processing is applied to
the de-quantized coefficients. Motion vectors are decoded from their codes,
and then motion compensation is performed according to the decoded motion vectors.
The outputs of the inverse DCT and the motion compensated block are added to
reconstruct the output block.
6.2.2 Computational Components
In the previous section, we presented the MPEG-4 video encoding and decoding
processes briefly with block diagrams. In this section, each block is investigated
to estimate its computational complexity and to classify the operation types.
Any operation needs operands. The complexity of a system can be revealed by
the types of data it processes, since the data type affects the complexity
of operations as well as the demand for storage space. The MPEG-4 video standard
specifies the digital video data format under its processing. The most commonly used
4:2:0 digital video data format is depicted in Figure 6.5. The video data consist of a
sequence of pictures called Video Object Planes (VOPs). A VOP is divided into
equally sized sets of picture elements. Each set is called a macroblock and consists of
the luminance and chrominance components. The luminance component has
16x16 values, while each of the two chrominance components has 8x8 values. Their
values are integers, and the precision of the integer values can differ across systems.
In this research, we limit the precision to 8-bit integers. Most operations are
fixed-point operations rather than floating point operations. Depending on
the processing stage, the precision can be extended up to 32 bits. Fixed point
operations reduce the computational complexity relative to floating point operations.
To appreciate the amount of digital video data to be processed, let us consider an
example: the number of input bytes per second is 570,240 for QCIF
(176x144), 15 frames/second digital video data. For the rest of this section, we will
analyze the complexity of each processing block for this video format.
Figure 6.5: The video encoding elements. (A VOP is partitioned into macroblocks; each macroblock carries one 16x16 Y block, shown as four 8x8 blocks 0-3, plus one 8x8 Cb and one 8x8 Cr block.)
The Discrete Cosine Transform (DCT) is used in many image and video compression standards due to its energy compaction property. The formula is given as

F(u,v) = \frac{2}{N} C(u) C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N},   (6.1)

with u, v, x, y = 0, 1, 2, ..., N-1, where x, y are spatial coordinates in the sample domain, u, v are coordinates in the transform domain, and

C(u), C(v) = \frac{1}{\sqrt{2}} for u, v = 0, and 1 otherwise.

The inverse DCT (IDCT) can be written as

f(x,y) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u) C(v) F(u,v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}.   (6.2)
For DCT and IDCT processing, a matrix operation is required. However, thanks to fast fixed-point algorithms, the processing can be done with a small number of multiplications and additions [1].
The quantization process requires a division operator, which is expensive in terms of circuit size; thus, many CPUs do not provide one. A divider can be implemented with adders and multipliers, but it then needs many more cycles than a multiplication. The inverse quantization process requires only a multiplication operator, which is included in the instruction set of most CPU designs.
After quantization, only non-zero quantized DCT coefficients are encoded. For each 8x8 block, the 64 coefficients are serialized in the zig-zag scan order. Then, each run of consecutive zero values together with the following non-zero coefficient forms a run-level pair, and the run-level pairs are encoded with a variable length code (VLC). The zig-zag scan can be done by memory accesses indexed by a zig-zag scan table. To generate the run-level pairs of a block, all 64 coefficients must be examined, which requires as many memory load and comparison operations as the input data rate. The VLC for run-level pairs is a two-dimensional table whose entries are concentrated in the low-index area. This process requires many memory addressing operations.
In the encoding process, motion estimation requires the most computation since it involves a large number of block comparisons. For a block comparison, the sum of absolute differences (SAD) between a macroblock in the input frame and a reference macroblock in the reference frame is calculated. The reference macroblock is the macroblock at the position displaced by the vector (v_x, v_y), as shown in Figure 6.6. The SAD value can be written as

SAD_i(v_x, v_y) = \sum_{(x,y) \in MB_i} |f(x,y) - r(x + v_x, y + v_y)|,   (6.3)
where i is the macroblock index and MB_i is the area of the i-th macroblock. The motion estimation process selects the motion vector that gives the smallest SAD value. The number of operations depends on the search range, i.e., the range of (v_x, v_y), and on the search algorithm. Many fast search algorithms have been introduced to reduce the number of SAD calculations [4], [35]. Even though the number of SAD calculations per macroblock is significantly reduced, each SAD calculation still involves many operations. In some hardware and DSP implementations, the SAD calculation is accelerated by a parallel operation in which the SAD of one row of a macroblock is computed in parallel.
In the decoder, the reference macroblock is read from the reference frame according to the decoded motion vector. Then, a macroblock is reconstructed by adding the reference macroblock to the decoded difference macroblock. This process is called motion compensation. Motion estimation and compensation require a large amount of memory access as well as arithmetic operations.
Figure 6.6: The motion estimation process. (A macroblock of the input frame f(x,y) is compared with a macroblock of the reference frame r(x,y) displaced by (v_x, v_y).)
Table 6.1 lists the major processes and their operation types. Fortunately, most of them demand only fixed-point operations whose values can be represented in 1 or 2 bytes, so performance can be improved with Single Instruction Multiple Data (SIMD) operations.
Table 6.1: List of Computation Elements

Function        Equation                Operation
DCT/IDCT        Σ a_i · x(i)            MAC
Quantization    x_i / Qp                Divider
Dequantization  x_i · Qp                Multiplier
ME              Σ |x_i − y_{i+d}|       SAD
VLC             mem[i]                  Table
VLD             mem[i]                  Table and Tree Search
MC              x_i + e_i               Addition
6.3 Two-Step Optimization
To meet the performance as well as the hardware resource requirements, optimization in both hardware and software implementations has to be performed for MPEG-4 video codec IP development. In this section, we discuss platform independent software optimization first. Then, joint hardware and software optimization for a dedicated hardware platform is examined. This two-step approach has advantages in addressing the wide range of MPEG-4 video applications: the platform independent optimized software can be used in different environments such as PCs, pocket PCs, and embedded systems for security camera modules, video phones, and cellular phones with a video conferencing feature.
6.4 Platform Independent Software Optimization

Platform independent software optimization mainly comes from algorithmic optimization, which reduces the number of operations. In addition, the software can be restructured for general CPU architectures to improve cache utilization and memory access. To achieve portability, the software is written in pure C using the standard library; CPU dependent instructions or assembly such as MMX [11] should be avoided at this stage.
The software optimization tasks include structural modification and functional optimization. Functional optimization is performed on motion estimation, forward and inverse DCT, and VLC/VLD. For motion estimation, we use the fast motion search algorithms described in [4], [35], which dramatically reduce the number of SAD calculations while keeping the quality degradation insignificant compared to the full search algorithm. Fixed-point operations with a fast algorithm greatly improve DCT performance compared to the straightforward floating-point algorithm. The VLC encoding and decoding algorithms are modified to minimize the table size and the number of comparisons.
Structural changes are also made to reduce the number of data copies and memory accesses and to improve cache utilization. The main purpose is to show how much the MPEG-4 codec can be improved by algorithmic optimization alone. This serves as a reference for comparison with the platform dependent optimization described in Section 6.5.
Table 6.2: The performance of the optimized platform independent software codec

Sequences   MS (fps)   Opt. (fps)   Ratio
Encoder     5.09       68.23        13.4
Decoder     107.75     245.02       2.36
PSNR        31.16      31.03        -0.13
Table 6.2 shows the performance improvement achieved by platform independent software optimization. We used the Microsoft MPEG-4 Version 1 software implementation as the reference for comparison. It was tested on a PC with a 900 MHz Pentium 4 CPU for QCIF video at 15 fps and a bit rate of 64 kbps. The encoder performance is dramatically improved by the optimization; the major improvement is due to the fast motion estimation algorithm, with very little quality degradation. Platform independent software optimization has several clear advantages. First, the development speed is fast. Second, it can reduce the hardware cost when adopted in a hardware implementation.
Having described the platform independent software optimization, we consider instruction-level hardware optimization with respect to a configurable embedded processor in the next section.
6.5 Optimization for Configurable Processors
For optimization with respect to a configurable processor, we choose the Xtensa
processor development environment as the target. This environment includes the C
compiler, the instruction simulator, and the performance profiling tool. First, we
compile the software for the Xtensa processor with the base configuration. Then, the
performance hot spots are identified by running the instruction set simulator and
the profiling software tool. The hot spots are optimized by designing TIE (Tensilica
Instruction Extension) instructions and choosing the optimal processor configuration
parameters.
6.5.1 Hot Spot Analysis
To estimate the performance of the platform independent MPEG-4 video codec software, the Instruction Set Simulator (ISS) for the Xtensa processor is used. From the simulation result, the number of cycles needed to perform a specific task can be measured. Furthermore, performance bottlenecks can be identified with the profiling tool, whose output provides important information for optimization. We begin by profiling the improved MPEG-4 code on the Xtensa base configuration processor, the minimum hardware configuration, with 16 KB of cache memory for instructions and data to minimize the cache utilization effect.
Figure 6.7: The encoder performance profile.
Figure 6.7 shows the profile result of the encoder simulation. Encoding a 1-second video clip requires a total of 355.16 Mcycles. The hot spot analysis for the major functions is given below.
• Motion Estimation (ME)
Even though the ME speed is dramatically improved with the fast motion search algorithm, it still occupies the major portion of the computational load. The SAD calculation for block (16x16 or 8x8) comparison is called frequently by the ME function. Even though the fast algorithm reduces the number of SAD calls by a factor of 64, the number of calls is still large, and each SAD calculation still needs many operations. ME also demands interpolated reference frames that are pre-calculated for the horizontal, vertical, and diagonal directions; making these interpolated frames requires a lot of computation.
• Quantization (Q)
Quantization requires a division operation. However, most embedded processors do not implement division in hardware since it demands a large gate count. A software implementation of the divider requires 60 CPU cycles per division on the Xtensa processor, which is expensive.
• Variable Length Coding (VLC)
VLC is used in the coding of motion vectors and quantized DCT coefficients. The coefficient encoding module includes zig-zag scanning, run-level calculation, and code mapping. Zig-zag scanning and run-level encoding take about 60% of the coefficient encoding time; code mapping takes a smaller portion since the bit rate is low.
• DCT and IDCT
DCT and IDCT are well optimized with a fast algorithm and fixed-point operations. In the optimized encoder, DCT and IDCT are not applied to all blocks: whether each block is coded or not is pre-determined from the block activity and the quantization level. If a block is determined not to be coded, then DCT, quantization, inverse quantization, and IDCT are skipped for that block; only 50% of all blocks are coded. After quantization, if all quantized coefficients are zero, the decoding operations (inverse quantization and IDCT) are also omitted, so the number of IDCT blocks is only 25% of the total. Hence, the computation of DCT and IDCT is not significant compared to that of the other processing units.
• Motion Compensation (MC)
MC includes the block copy from one memory location to another. For luminance components, MC is performed together with ME so that it can take advantage of the pre-calculated half-pel frames. It is, however, necessary to perform the interpolation for chrominance components.
• Color Space Conversion (CSC)
The decoder output needs to be converted from the YUV format to the RGB format for display devices. This requires a matrix multiplication that uses three multipliers, two adders, and one saturator for each component output.
Based on the above hot spot analysis, we can identify the functional blocks that cause performance degradation. That is, quantization, VLC, and ME are the functions to be optimized further, while DCT, IDCT, IQ (Inverse Quantization), and MC are already well optimized; thus DCT, IDCT, and IQ have a lower priority in the TIE optimization. Unlike other performance optimized systems, the optimization priority is determined by the profiling result of a specific application and a specific algorithmic software implementation. Therefore, the priority list better reflects the desired hardware resource utilization.
6.5.2 Processor Configuration
Before working on optimization with the TIE design, processor configuration parameters should be selected by considering the trade-off between cost and performance. Different processor configurations and configuration parameters are evaluated. In our design, we choose the 16-bit multiplier option to improve DCT, IDCT, inverse quantization, and CSC. Since the overall performance relies heavily on the memory interface, we choose the 64-bit Processor Interface (PIF). We test the encoder and decoder programs with instruction and data cache sizes ranging from 1 KB to 16 KB. When the cache size is below 4 KB, performance degrades considerably, and the result is sensitive to the instruction cache size; above 4 KB, performance does not improve in proportion to the cache size. For both the instruction and the data caches, we therefore select 4 KB with a 16-byte cache line.
The reason why even a 16 KB data cache provides little help to the overall performance is that a frame memory is 38 KB, already larger than the cache, so cache misses are unavoidable when frame-based processing is performed. However, a cache larger than the frame memory is not realistic for an embedded processor due to its hardware burden.
6.5.3 Optimization with TIE

Based on the results of the hot spot analysis, we perform TIE optimization for the functions one by one, in decreasing order of significance in terms of CPU cycle reduction. One new TIE instruction can replace several operations. We create a set of new TIE instructions to reduce the cycles spent in these functions. The TIE instructions can be categorized as follows.
• Single Instruction Multiple Data (SIMD)
The instruction performs the same operation on multiple data in parallel. For example, TIE can be used effectively for SAD and MC optimization.

• Combined Instruction
When a series of operations is applied to data sequentially, the instructions for these operations can be combined into one new instruction. We apply this concept to saturation, quantization, and CSC.
When creating a TIE instruction, we must consider the gate count and the critical path timing increment caused by adding the instruction. We therefore restrict the new TIE instructions to simple SIMD operations and combinations of simple arithmetic or logical operations, to minimize the effect on the critical path timing and keep the gate count from increasing significantly.
In the platform independent software optimization, motion estimation was optimized using the fast search algorithm, which reduces the number of SAD calculations significantly with very little quality degradation. Even though the improvement is large, motion estimation still requires 238 Mcycles/second, which is too high for an embedded processor implementation.
For MPEG-4 video encoding, there are several options to improve the performance of motion compensated prediction: four motion vectors (4MV), unrestricted motion compensation (UMC), and half-pel or quarter-pel motion vector precision. In this study, we exclude quarter-pel motion vector precision because of its high complexity and small quality improvement. Several TIE instructions are designed to reduce the number of instructions for the SAD calculation and the half-pel motion vector search.
According to the initial profiling result of the MPEG-4 encoder software, SAD calculations for the 16x16 block and the 8x8 block occupy 65% of the total cycles of the motion estimation process. To improve the processing speed, the half-pel motion search follows the integer-pel motion search instead of performing the half-pel search over the entire search range. Even though this strategy reduces the number of half-pel SAD calculations, an interpolated value at a half-pel position is used multiple times in SAD calculations for different half-pel motion vectors; for example, the SAD calculations for horizontal motion vectors -0.5 and +0.5 share 56 of their 64 values. To avoid re-calculation, three half-pel frames are pre-calculated before the half-pel motion search. This half-pel frame calculation occupies 17% of the total cycle count of the motion estimation process.
Figure 6.8: The SAD calculation. (The current block cur is stored at an 8-byte aligned boundary while the reference block ref is not; the address offset between rows is 176 bytes.)
First, the SAD calculation requires the sum of absolute differences of multiple data, as shown in Figure 6.8. A frame is allocated in 8-byte aligned memory space so that 8-byte parallel loads and stores need no alignment step, which would cost additional cycles. However, the reference macroblock is unaligned in most cases since the horizontal motion vector can be any integer; thus, loading the reference data needs an alignment operation. The parallel load and parallel alignment are accelerated by TIE instructions, and the SAD calculation for 8 pixels is then done by one TIE instruction.
The original SAD calculation code and the code with TIE instructions are listed in Tables 6.3 and 6.4, respectively. The TIE instruction LOAD64cur is used to load 8-byte aligned data into a 64-bit user register. LOAD64ref_1 and LOAD64ref_2 load unaligned 16-byte data into two 64-bit user registers. The loaded
Table 6.3: C code for SAD calculations.

for( i = 16 ; i > 0 && temp >= 0; i-- )
{
    temp -= abs(*(cur)-*(ref)) + abs(*(cur+1)-*(ref+1)) +
            abs(*(cur+2)-*(ref+2)) + abs(*(cur+3)-*(ref+3)) +
            abs(*(cur+4)-*(ref+4)) + abs(*(cur+5)-*(ref+5)) +
            abs(*(cur+6)-*(ref+6)) + abs(*(cur+7)-*(ref+7));
    temp -= abs(*(cur+8)-*(ref+8)) + abs(*(cur+9)-*(ref+9)) +
            abs(*(cur+10)-*(ref+10)) + abs(*(cur+11)-*(ref+11)) +
            abs(*(cur+12)-*(ref+12)) + abs(*(cur+13)-*(ref+13)) +
            abs(*(cur+14)-*(ref+14)) + abs(*(cur+15)-*(ref+15));
    ref += ref_width;
    cur += cur_width;
}
reference data are aligned by the REF_ALIGN instruction according to the LSBs of the reference address. Then, the OCTA_ABS_ADD instruction returns the SAD result of 8 pixels. In the C program, the SAD of 8 pixels requires 16 loads, 8 subtractions, 8 absolute-value operations, and 8 additions, while the TIE program needs only 5 instructions. Therefore, the SAD calculation is improved by a factor of 8.
For the pre-calculation of half-pel frames, TIE instructions are designed as shown in Figure 6.9. The horizontal, vertical, and diagonal half-pel frames are generated simultaneously. Tables 6.5 and 6.6 list the programs for generating one line of each half-pel frame.
Four 64-bit user registers are used to store two groups of 8 pixels from the upper line and two groups of 8 pixels from the lower line. LOADpRefU and LOADpRefL load aligned 8-pixel groups into user registers. ADD_8_BYTE_SHIFT1 is used to perform interpolation along the
Table 6.4: C code for SAD calculations with TIE.

for( i = 16 ; i > 0 && temp >= 0; i-- )
{
    LOAD64cur( cur );
    LOAD64ref_1( ref );
    LOAD64ref_2( ref );
    REF_ALIGN( (unsigned int)ref );
    temp -= OCTA_ABS_ADD();
    LOAD64cur( cur+8 );
    LOAD64ref_1( ref+8 );
    LOAD64ref_2( ref+8 );
    REF_ALIGN( (unsigned int)(ref+8) );
    temp -= OCTA_ABS_ADD();
    ref += ref_width;
    cur += cur_width;
}
horizontal and the vertical directions according to its argument. The vertical interpolation is performed with three instructions, two ADD_8_BYTE instructions and an ADD_SHIFT2 instruction, to balance the size of the logic circuits among instructions. For the rounding control specified in MPEG-4, the RC_1 and RC_2 values are applied to each interpolation. The interpolation instruction combines an 8-bit addition, an addition for rounding control, and a shift operation, so it performs three basic arithmetic operations in parallel; with little sacrifice of generality, it improves performance further. In Table 6.6, one loop iteration generates 8 pixels of each of the horizontal, vertical, and diagonal half-pel frames, whereas the C code routine generates one pixel of each half-pel frame per iteration. A
Figure 6.9: Half-pel frame calculation with TIE instructions. (Registers vec1, vec4, and vec5 hold 8-byte pixel groups at the 8-byte boundary; the horizontal, vertical, and diagonal half-pel values are formed by parallel additions and shifts.)
Table 6.5: C code for the half-pel frame calculation.

for( j = 0 ; j < width; j++)
{
    *(pH + j)  = ( (s0 = *(pRef0 + j)) +
                   (s1 = *(pRef1 + j)) + RC_1 ) >> 1;
    *(pV + j)  = ( s0 + (s2 = *(pRef2 + j)) + RC_1 ) >> 1;
    *(pHV + j) = ( s0 + s1 + s2 + *(pRef3 + j) + RC_2 ) >> 2;
}
total of 144 arithmetic instructions (18 instructions × 8 loop iterations) is replaced by 12 TIE instructions. Therefore, the TIE instructions speed up the half-pel frame calculation by a factor of 12. With the TIE instructions for SAD and half-pel frame calculation, the number of instructions required for motion estimation is reduced dramatically. Most of the gain comes from parallel processing with the SIMD-style TIE instructions.
Table 6.6: C code for the half-pel frame calculation with TIE.

for ( j = 0 ; j < width; j += 8 )
{
    MOVEvec( 0 );
    MOVEvec( 1 );
    LOADpRefU( pRef0 + j + 8 );
    LOADpRefL( pRef2 + j + 8 );
    ADD_8_BYTE_SHIFT1( 0, RC_1 );
    STORE64( pH + j );
    ADD_8_BYTE_SHIFT1( 1, RC_1 );
    STORE64( pV + j );
    ADD_8_BYTE( 0, 1 );
    ADD_8_BYTE( 1, RC_1 );
    ADD_SHIFT2( 0 );
    STORE64( pHV + j );
}
Because encoding and decoding are processed in units of macroblocks, there are many block operations; SAD is one example. A macroblock has one 16x16 luminance block and two 8x8 chrominance blocks, and one row of each is stored in consecutive memory locations. Therefore, SIMD-style instructions can be applied to these data to improve the processing speed. Figure 6.10 shows where the block operations occur: motion compensation and prediction error calculation are marked as circle 1, and reconstruction as circle 2. We designed SIMD-style TIE instructions for this block processing. Parallel load, store, and arithmetic operations are performed on 8-pixel data with the TIE instructions, improving the processing speed by up to a factor of 8. Unlike the MMX instructions, the parallel operation and the precision change are combined
Figure 6.10: The motion compensation and reconstruction process. (Short[64] coefficient buffers pass through DCT, Quant, IQuant, and IDCT; the motion vector links the original, reference, and reconstructed frames, with aligned and unaligned accesses distinguished.)
into one instruction, so no cycles are wasted on pack and unpack operations to change the precision.
Based on the profiling result, we observe that the quantization process takes a major portion of the encoding time, which is a consequence of the division. The integer divide instruction is not provided by the Xtensa instruction set due to its large gate count; thus, the compiler links the division operation to a divider function, which takes 60 cycles per integer division. The quantization process is given by

QF = saturate( sign(F) · |F| / (2 · Qp) ),   (6.4)

where F is the DCT coefficient, Qp is the quantization scale, and QF is the quantized result. The saturate operation limits the output to the range -127 to 127. Therefore, there are three more operations after the division. To optimize the quantization process, the division is first implemented by multiplication with the reciprocal; then the three additional operations are combined into two TIE instructions.
Figure 6.11: Quantization with multipliers and TIE instructions. (A 12-bit input in [-2048, 2047] is split into sign and magnitude, multiplied by the table entries mq0[QP] and mq1[QP], and saturated to the 8-bit range [-127, 127] to form QF.)
We use the existing 16-bit multiplier to build a bit-identical divider for quantization. For integer Qp values ranging from 1 to 31, the equivalent division with two 16-bit multiplications is

i / (2 · Qp) = (mq0[Qp] · i + ((mq1[Qp] · i) >> 8) + 8) >> 10,   (6.5)

where i is a positive integer and mq0 and mq1 are pre-calculated tables with 31 entries indexed by Qp. Figure 6.11 shows the quantizer implementation with multipliers and TIE instructions. The 60 cycles of the integer divider are replaced by 6 cycles of multiplications and TIE instructions. For each multiplication, two instructions are issued: a table load and a multiplication. Therefore, there is an improvement of 10 times.
Besides the encoding and decoding processes, Color Space Conversion (CSC) is required at the video input and output interfaces. Figure 6.12 shows the conversion from the YUV 4:2:0 format to the RGB 4:4:4 format for the decoder output. The YUV 4:2:0 format has subsampled U and V pixels, as shown in the figure, so the transform of the U and V data is shared by four luminance (Y) samples. The transform of the U and V data requires constant multiplications and accumulations. The operations in the small boxes indicate TIE instructions that can be performed in one cycle. As a result, 20 instructions (instead of 58) are used to generate four RGB pixels.
Figure 6.12: The TIE instruction partition for the YUV-to-RGB conversion. (Four luminance samples Y0-Y3 share one U,V pair; the boxed constant multiply-accumulate groups produce R0-R3, G0-G3, and B0-B3.)
The RGB-to-YUV conversion for the input interface can be implemented with the same procedure as the YUV-to-RGB conversion.
The optimization with TIE instructions is performed in the order of performance contribution according to the hot spot analysis. Hence, DCT and IDCT optimization with TIE instructions was not conducted, since the improvement would not justify the additional hardware required for parallel multiplication. This is very different from the case of a general multimedia processor, in which DCT and IDCT optimizations are always included.
6.6 Optimization Results

With the platform independent software optimization and the TIE instructions, MPEG-4 video encoding and decoding of the 15-fps QCIF format can run at clock rates of 61.5 MHz and 21 MHz, respectively. Table 6.7 shows the improvement of the major hot spot blocks and the total cycles of the encoder and the decoder, measured with the Xtensa Instruction Set Simulator (ISS) with zero memory wait states. Note that the motion estimation and quantization modules are improved most significantly. Motion estimation was already improved by algorithmic optimization, which reduced the number of SAD calls; the number of CPU cycles for the SAD function itself is then reduced by the special TIE design. The performance gain in quantization is due to the use of multipliers instead of the software divider, which requires far more cycles. The specification of the MPEG-4 video codec IP is shown in Table 6.8. The core size is very small and the power consumption is low, so the IP is cost effective and suitable for portable devices.
Table 6.7: The TIE optimized result (Mcycles/sec).
Function Before After Ratio
ME 238.0 27.7 8.59
Quantization 46.0 4.0 11.50
MC 11.6 4.0 2.90
CSC 18.0 5.0 3.60
Encoder 350.0 65.7 5.33
Decoder 58.0 23.1 2.51
Table 6.8: The optimized IP specification.

Speed                  188 MHz
Gate Count*            59,970
Power Consumption*     98 mW
Area (core only)       1.37 mm^2
Area including cache   3.17 mm^2

* Cache memories are not included
6.7 Conclusion

We presented an MPEG-4 video codec IP design based on a configurable embedded processor for wireless multimedia communication in this chapter. We performed a two-step optimization: first platform independent software optimization, and then instruction-level hardware optimization through TIE instruction design. The first step achieved 13.4 and 2.36 times faster execution for the encoder and the decoder, respectively, by implementing fast algorithms. Further improvement was then obtained with a special design of TIE instructions and a proper processor configuration. A combined TIE instruction may lose generality; however, we showed that performance can be improved substantially at the expense of a small number of instructions and gate counts. The overall improvement enables a real-time MPEG-4 video encoding and decoding IP for 15-fps QCIF sequences with low power, low cost, and great software flexibility. The proposed design methodology has several advantages for MPEG-4 video codec IP development: the optimized software can be used on different platforms, and the design time is reduced because less hardware interfacing is needed. Furthermore, the developed TIE instructions can be reused for similar video coding techniques such as MPEG-1, MPEG-2, and H.263 with minimal software modifications.
Chapter 7
Advanced Techniques for MPEG-4 Video Encoder Performance Enhancement
7.1 Introduction
In the last chapter, we presented the MPEG-4 video codec implementation on a
configurable embedded processor. In that implementation, the use of TIE (Tensilica
Instruction Extension) accelerated the processing speed. Before the TIE optimization,
the performance of the MPEG-4 video codec was mainly restricted by the
number of instructions. After the TIE optimization, the number of instructions is
reduced significantly, but the number of cache misses remains the same. These
cache misses cause performance degradation in the form of long external memory
latency. As a result, the overall performance is limited by the external memory
access. In this chapter, we present ways to further improve the system performance
by restructuring the software for more efficient cache utilization, which reduces
the long external memory latency penalty.
In the design of embedded systems, an embedded processor is used to control
hardwired sub-blocks attached to the internal system bus. The bandwidth assigned
to the embedded processor is very limited. Even when the embedded processor is
dedicated to data processing, there are still many hardwired blocks sharing the internal
system bus as well as the external memory bus. As a result, the embedded processor
has difficulty accessing the external memory with sufficiently high bandwidth.
Note that MPEG-4 software implemented on a PC usually assumes a large
primary cache (L1) and a large secondary cache (L2); the same software implementation
might not fit an embedded processor that has a much more limited
bandwidth to access the main memory or the external secondary memory.
In the MPEG-4 encoding process, motion estimation demands a lot of computation
due to the SAD (Sum of Absolute Differences) calculation. Moreover, the
process requires many frame memory accesses, which makes the performance even
worse in a system with long memory access times. To reduce cache misses in the
SAD calculation, the frame memory map is changed from line-based to
macroblock-based. Then, the reference macroblocks in the search range are
pre-loaded into the internal memory, or DataRAM. Some TIE instructions are
designed to accelerate the address calculation for the more complex memory map. The
overall encoder software structure is optimized to avoid the cache miss penalty based on
the memory system investigation. In addition, new TIE instructions are designed
for further improvement. The improved MPEG-4 encoder software and processor
design can be used practically in an embedded system with an external memory as
its main memory, which lowers the cost in comparison with a system with a large
internal memory.
In this chapter, we will investigate the memory sub-system of the configurable
processor in Section 7.2, and then present a new software structure in Section 7.3.
The improved motion estimation scheme is shown in Section 7.4. Other optimization
techniques will be presented in Section 7.5. Then, optimized results will be shown
in Section 7.6. Concluding remarks are given in Section 7.7.
7.2 Memory System of the Embedded Processor
The main objective of this chapter is to design a video encoding software architecture
so that its performance does not degrade due to the latency of external memory
access. In this section, we will investigate the memory system of a microprocessor
and find possible methods to solve the memory latency problem.
One challenging problem associated with fast microprocessor design is that memory
access is inherently slow. A processor often executes instructions faster than the
system can supply data. One way to alleviate this problem is to
provide access to instructions and data through fast local memory. The cache is one
type of fast local memory. However, the cache memory size is too limited to solve
the problem completely. Actually, most instructions and data are still stored in
the external memory, while a small amount of instructions and data is temporarily
copied into the cache memory from the external memory when needed.
[Figure: block diagram — Xtensa core, cache controller, instruction cache/RAM/ROM, data cache/RAM/ROM, bus interface, peripherals, and memory.]
Figure 7.1: The memory sub-system of the Xtensa processor.
Figure 7.1 shows the memory sub-system of the Xtensa processor. Instructions
and data are stored in separate cache memories. Data in the cache are accessed
through the cache controller. Note that the cache controller can also access RAM and
ROM for both instructions and data. The difference between cache and RAM/ROM
access is that the cache is automatically managed by a caching algorithm, while
RAM/ROM is explicitly accessed by application programs. However, the access
speed of the cache and RAM/ROM is the same. The external memory can be accessed
via the processor interface and the bus interface. The Xtensa processor interface has
a proprietary protocol and is called the Processor Interface (PIF). Accessing external
memory data takes longer than accessing cache memory data, due to the interface
logic and the access time of the external memory itself. Other types of local memories
or peripherals can be accessed through the Xtensa Local Memory Interface (XLMI).
The access speed and the allowable size of memory attached to the XLMI are
between those of the cache and the external memory.
The cache interface is separated from the processor interface, with a channel
connecting the two. The first access to a given instruction or data word goes to the
main memory via the PIF, which incurs stall cycles. After this initial access,
subsequent accesses to the same data are served from the cache.
The cache memory size is relatively small compared to the main memory size.
To share the cache memory among data in the main memory, the cache memory
is partitioned into smaller blocks of bytes called cache lines. The cache memory
is updated in units of one cache line. One cache line can be shared by different
addresses of the main memory. The manner in which cache lines map to main memory
addresses depends on whether a direct-mapped or a set-associative cache is used.
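The direct-mapped case can be sketched in a few lines of C. This is a minimal sketch, assuming a 32-byte cache line and a 3KB data cache (the sizes used in the experiment later in this section); the exact Xtensa parameters may differ.

```c
#include <stdint.h>

/* Assumed illustrative geometry: 32-byte line, 3KB direct-mapped cache. */
#define LINE_SIZE  32u
#define CACHE_SIZE (3u * 1024u)
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 96 cache lines */

/* Every main-memory address maps to exactly one cache line. */
unsigned cache_line_index(uint32_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}
```

Two addresses exactly one cache size apart land on the same line, so alternating between them evicts each other — the conflict behavior illustrated in Figure 7.2.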
[Figure: a direct-mapped cache — each tagged cache line (0, 1, 2, ...) is shared by all main-memory blocks of one line size that map to it.]
Figure 7.2: The direct mapped cache.

Figure 7.2 shows how cache lines are shared by data of the main memory in the
direct-mapped mode. In the best scenario, the data used by an application program
lie at consecutive memory addresses and their total size is smaller than the cache size.
In most cases, however, the data size is bigger than the cache size. Then, a cache
line is replaced with main memory data from other addresses. When a replacement is
required, a PIF transaction is issued while the processor stalls. After the cache
line is filled with new data, the processor resumes accessing data from the cache
line. The stall time is the cache line allocation penalty, typically measured
as the number of CPU clock cycles spent stalled. The PIF transaction
time depends on the external memory access time. The memory access time is
the elapsed time between when the memory access request is issued and when the
requested data arrive; expressed in CPU clock cycles, this time is called the memory
latency. The cache miss penalty is a function of the memory latency and the
system cycles, where the system cycles are the number of cycles used to allocate
arriving external memory data to the cache memory. The cache miss penalty can be
written as

    C_miss = L_m + C_sys,                                    (7.1)
where L_m is the external memory latency and C_sys is the system cycles. If there
are N_miss cache misses during the application running time, the increased cycles
for cache allocation become

    C_inc = N_miss * C_miss = N_miss * L_m + N_miss * C_sys.    (7.2)

Consequently, the cycle increment is proportional to the memory latency, with a
slope equal to the number of cache misses.
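Equations (7.1) and (7.2) can be written directly as code; the sample numbers in the usage note below are illustrative, not measurements from this chapter.

```c
/* Eq. (7.1): penalty of one cache miss, in CPU cycles. */
unsigned long cache_miss_penalty(unsigned long l_m, unsigned long c_sys)
{
    return l_m + c_sys;
}

/* Eq. (7.2): total extra cycles for N_miss cache-line allocations. */
unsigned long cycle_increment(unsigned long n_miss,
                              unsigned long l_m, unsigned long c_sys)
{
    return n_miss * cache_miss_penalty(l_m, c_sys);
}
```

For example, 450,000 misses per second at a 24-cycle memory latency with 4 system cycles would cost 450,000 x 28 = 12.6 million extra cycles per second.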
To minimize the cycle increment, the memory interface should be designed to
have a small memory latency, and the application program should be structured to
incur fewer cache misses. However, the memory latency is determined by the
system integration. For an embedded processor, its memory interface is limited
because other processing blocks attached to the system bus share the bus and the
external memory.
[Plot: cache miss count (up to ~450,000) vs. number of frames accessed, comparing progressive and interleaved access.]
Figure 7.3: Cache miss count vs. multiple frame access.
Video processing deals with 2-D image data that are larger than the cache size
of any microprocessor, even for the smallest video format. Consequently, cache
misses in loading or storing a whole frame of data cannot be avoided. Moreover,
in video compression standards such as MPEG-1, 2, 4, and H.263, several frames
must be accessed at the same time for processing. Figure 7.3 shows the number
of cache misses when the application accesses multiple frames simultaneously and
progressively. The frame size is the QCIF (176x144) YUV 4:2:0 format
(or 38,016 bytes per frame), and a data cache of 3KB is used. In the progressive
access, one frame is read after the other. In the interleaved access, pixels of the
same field from different frames are accessed consecutively; then, pixels of the
alternate field are read. As shown in the figure, accessing more than three frames
at the same time pays a big penalty, while the progressive frame access incurs a
constant number of cache misses per frame. In addition to the cache memory,
DataRAM can be used as fast local memory. DataRAM serves as part of the main
memory, although its size is limited. Thus, by storing very frequently used data
in DataRAM, the number of cache misses can be reduced.
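The effect in Figure 7.3 can be reproduced qualitatively with a toy direct-mapped cache model. The geometry below is deliberately illustrative (1KB cache, 32-byte lines, 4KB "frames" whose size is a multiple of the cache size, so all frames alias to the same cache sets); the actual QCIF miss counts depend on the real cache configuration.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative (assumed) geometry, NOT the QCIF/3KB setup of the text. */
enum { LINE = 32, NLINES = 1024 / LINE, FRAME = 4096 };

/* Access one byte; return 1 on a miss (and allocate the line), 0 on a hit. */
static long touch(uint32_t tag[], uint32_t addr)
{
    uint32_t line = addr / LINE;
    if (tag[line % NLINES] == line)
        return 0;
    tag[line % NLINES] = line;
    return 1;
}

long count_misses(int nframes, int interleaved)
{
    uint32_t tag[NLINES];
    long misses = 0;
    memset(tag, 0xff, sizeof tag);           /* start with an empty cache */
    if (interleaved) {
        /* Touch the same offset of every frame before advancing. */
        for (uint32_t off = 0; off < FRAME; off++)
            for (int f = 0; f < nframes; f++)
                misses += touch(tag, (uint32_t)f * FRAME + off);
    } else {
        /* Progressive: read each frame in full, one after another. */
        for (int f = 0; f < nframes; f++)
            for (uint32_t off = 0; off < FRAME; off++)
                misses += touch(tag, (uint32_t)f * FRAME + off);
    }
    return misses;
}
```

With these numbers, progressive access of two frames costs only the cold misses (256), while interleaving the same two aliasing frames makes every access miss (8192) — the same qualitative blow-up that Figure 7.3 shows past three frames.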
Based on the memory system investigation, we conclude that the software
architecture should be modified to reduce the number of cache misses and thereby
improve the overall performance. We set the following guidelines in the design of
the software architecture:
• Reduce the number of accesses to the external memory.
• Reduce the number of frame accesses in one processing block.
• Utilize DataRAM for frequently used data.
• Reduce the total number of frame accesses in the application software.
• Design an efficient memory map for frame memory data.
In the next section, we will present the new MPEG-4 video encoder software
structure that meets these guidelines.
7.3 New Software Architecture for the Embedded MPEG-4 Encoding System
As discussed in the previous section, an entire frame memory access results in
many data cache misses, since the data cache size is much smaller than the frame
memory size. The more frame memory accesses there are, the more cache misses occur.
In the original software architecture shown in Figure 7.4, there are six entire-frame
memory accesses: one for motion estimation, one for motion compensation, one
for forward DCT and quantization, one for AC/DC prediction and VLC, one for
inverse quantization and inverse DCT, and one for half-pel frame generation. In
addition, the frame memory is larger than the active frame data due to the
extended boundary area for UMC (Unrestricted Motion Compensation). The
half-pel frame generation causes even more cache misses, since its processing
requires accessing four different memory locations at the same time.
[Figure: original encoder data flow — ME, MC, prediction, RLC, VLC, PutBits into the bitstream buffer; half-pel generation producing horizontal, vertical, and diagonal half-pel frames; boundary extension; macroblock arrays sharing the same memory.]
Figure 7.4: The original encoder architecture.
[Figure: new encoder data flow — ME, MC, prediction, DCT, zig-zag scan, AC/DC prediction, RLC, (R)VLC, PutBits into the bitstream buffer; IQ and IDCT for reconstruction; ref_frame[3] and cur_frame[3] in external memory; scan, PMV, and quantization tables held locally.]
Figure 7.5: The new encoder architecture.
The encoder software architecture is restructured to improve the memory access
performance, as shown in Figure 7.5. The encoder stores three frames in the external
memory, which has a long memory latency. The architecture is designed to have a
small number of access points to the frame memory, and it uses local memories more
frequently than the external frame memory. For the local memory, a 4KB DataRAM
is used for pre-loading data to guarantee that no cache miss occurs.
A frame consists of a number of macroblocks, which depends on the video format.
Many different processing operations are applied to each macroblock. The
input macroblock is read from one memory location and the processed results are
stored to another. We combine several encoding units that access the same
memory location to minimize the number of frame memory accesses. Across the
different processing steps of the same macroblock, intermediate results are stored in
DataRAM to avoid cache misses. Since the DataRAM size is limited, we design the
DataRAM architecture to be shared among different processes. For example, the
contents of DataRAM differ between motion estimation and DCT coefficient coding.
Moreover, loaded data are reused as many times as possible. The reference
macroblock data loaded in DataRAM are used first for motion estimation. Then, they
are used for motion compensation of the luminance component. This scheme reduces
the number of frame memory accesses and removes the boundary extension for
UMC. In order for the data cache to contain the data to be used in the nearest future,
the memory map for a frame is changed from line-based to macroblock-based.
This provides benefits in cache utilization when a macroblock is loaded and stored.
As shown in Figure 7.6, the encoding steps are re-ordered to reduce the
memory accesses within each processing unit.
[Figure: encoding flow — Motion Estimation, MC for Y, Prediction (org - mc), FDCT, Quantization, RLC and VLC, Inverse Quantization, IDCT, MC for U, MC for V.]
Figure 7.6: The encoding flow chart.
7.4 Motion Estimation with Low Memory Bandwidth

7.4.1 Memory Map
To store 2-D image data in a memory with a 1-D address space, the line-based
frame memory map is used in most video coding software. With this memory map,
pixel data from one line are stored at consecutive memory addresses, and the next
line is stored in the next block of addresses. As a result, vertically adjacent pixels
are stored a distance of the frame width apart in the 1-D memory. However, in the
encoding process, frame data are loaded and stored in macroblock units. Whenever
there is a cache miss on a data load, at least one cache line of data is loaded into
the data cache from the external memory with a high penalty in cycles.
For a system with a 32-byte cache line and the line-based frame memory map,
when one row of macroblock data is loaded, one cache line includes 16 pixels from
the current macroblock and 16 pixels from the next macroblock. Thus, one half
of the cache line is not used for the current macroblock processing, and there is a
chance that the cache line will be updated with other data by other processing tasks
before the second half is needed. To utilize the cache line maximally, it is better if
one cache line stores pixels from only one macroblock. Figure 7.7 shows the
macroblock-based frame memory map, in which the pixels of one macroblock are
stored at consecutive memory addresses. There are two ways to place macroblocks
in the memory: one is to place a row of macroblocks at consecutive addresses, the
other is to place a column of macroblocks at consecutive addresses. We choose the
second method to load reference macroblocks more efficiently in the motion
estimation process.
[Figure: macroblock-based frame memory map — each 256-byte macroblock is stored contiguously, with columns of macroblocks placed at consecutive addresses.]
Figure 7.7: Illustration of the macroblock-based frame memory map.
Figure 7.7 shows the macroblock-based frame memory map for the luminance
frame of the QCIF (176x144) video format. The chrominance components are stored
separately in their own frame memories. The macroblock-based frame memory map
for a chrominance frame has 8x8 pixels per macroblock (instead of 16x16 pixels),
with the same address ordering method as the luminance component. The start
address calculation is more complex than in the line-based map, so it is accelerated
with a TIE instruction; the subsequent accesses can then be performed with simple
increments.
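The start-address computation just described can be sketched as follows. This is a sketch consistent with the column-major placement chosen above (11x9 macroblocks of 16x16 pixels, 256 bytes each, for the QCIF luminance frame); the function name and exact formula are illustrative, not the thesis' TIE implementation.

```c
enum { MB = 16, MB_BYTES = 256, MB_ROWS = 9 };  /* QCIF luma: 11x9 MBs */

/* Byte address of luminance pixel (x, y) in the macroblock-based map. */
unsigned mb_map_addr(unsigned x, unsigned y)
{
    unsigned mbx = x / MB, mby = y / MB;
    unsigned base = (mbx * MB_ROWS + mby) * MB_BYTES; /* columns contiguous */
    return base + (y % MB) * MB + (x % MB);           /* offset inside MB */
}
```

All 256 pixels of one macroblock occupy a single contiguous 256-byte range (eight 32-byte cache lines), and vertically adjacent macroblocks follow each other in memory.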
7.4.2 Memory Map for DataRAM
A 4KB DataRAM is available in the configured Xtensa processor. There is little
delay in accessing DataRAM, similar to the data cache. Thus, it can be used for
storing frequently used data without worrying about cache misses. We use the
DataRAM to store the reference macroblocks and the original macroblock in the
motion estimation and compensation processes, as shown in Figure 7.8. Since the
motion search range is between plus and minus 15 pixels in the target application,
nine reference macroblocks and one original macroblock are loaded into DataRAM.
Pre-loading is performed before the motion estimation process starts for every
macroblock. The addressing is the same as that for the macroblock-based frame
memory. However, the three column start offsets are rotated after the motion
estimation task of each macroblock is finished, to avoid reloading the macroblocks
that two adjacent macroblocks share.
[Figure: DataRAM memory map — nine 256-byte reference macroblocks in three columns starting at offsets 0, 768, and 1536, followed by the original macroblock.]
Figure 7.8: The DataRAM memory map for motion estimation.

When motion estimation is performed, the reference macroblocks corresponding
to the current macroblock are stored in DataRAM, as shown in Figure 7.8. When the
current macroblock is located at the edge of the frame, the macroblock positions
outside the frame are filled with edge pixels. This reduces the external memory
bandwidth and size. After completing motion estimation for one macroblock,
motion estimation proceeds to the next macroblock if the current one is not located
in the last column of the frame. In this case, only three new macroblocks need to
be loaded, into the addresses of the left three macroblocks. Then, the column
address pointers are rotated to point to the macroblocks in geometrical order.
Figure 7.10 shows the macroblock update scheme for the current macroblock.
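The column-pointer rotation can be sketched as a three-entry shift of the DataRAM column base offsets (0, 768, and 1536 as in Figure 7.8). This is a sketch of the pointer bookkeeping only; the function names are illustrative.

```c
/* Rotate the three column base offsets so the three newly loaded
 * macroblocks overwrite the leftmost column in place. */
void rotate_columns(unsigned col_off[3])
{
    unsigned left = col_off[0];   /* column that just left the window */
    col_off[0] = col_off[1];      /* middle column becomes left       */
    col_off[1] = col_off[2];      /* right column becomes middle      */
    col_off[2] = left;            /* reuse old left as the new right  */
}

/* Helper for testing: offset of geometric column i after n rotations. */
unsigned column_after(int n, int i)
{
    unsigned col[3] = { 0, 768, 1536 };
    while (n-- > 0)
        rotate_columns(col);
    return col[i];
}
```

After three rotations the pointers return to their starting assignment, so no data are ever copied — only the interpretation of the three column slots changes.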
Motion estimation for one QCIF frame demands 22 vertically filled macroblocks,
18 horizontally filled macroblocks, 4 diagonally filled macroblocks, and 231
macroblocks loaded into DataRAM. The average number of loaded macroblocks per
macroblock motion estimation is thus about 2.3. The newly loaded macroblocks
are stored in a consecutive memory area in the macroblock-based frame memory,
as shown in Figure 7.7, so about 2.3 x 8 cache line updates are required per
macroblock.
[Figure: reference frame memory with the 3x3 macroblock window; edge extension is performed in the ME block automatically, and 3 new macroblocks are updated for the next macroblock's motion estimation.]
Figure 7.9: Loading of reference macroblocks.
[Figure: rotation of the three DataRAM column pointers as the current macroblock advances.]
Figure 7.10: Update of reference macroblocks: (a) update for edge macroblocks and
(b) update for non-edge macroblocks.
7.4.3 DataRAM Addressing
The reference macroblocks are loaded into DataRAM, and motion estimation is
performed on the data in DataRAM only; the motion search range covers -16.0 to
+16.0 in both the horizontal and vertical directions. Figure 7.11 shows the reference
macroblock data loading scheme from DataRAM according to the motion vector.
The memory interface bus has a bandwidth of 64 bits, so one memory access instruction
can load/store 8 pixels of data from/to the memory. The load or store addresses
should be 8-byte aligned in order to map the pixels into the register or the memory;
otherwise, the instruction rotates the data to place the first byte at the LSB of the
register [33]. In the SAD calculation for a block of size 16x16, when the three LSBs of
mv_x (i.e., the horizontal motion vector) are not zero, three 8-byte data words are
loaded into three 64-bit user registers for alignment. The address pointers of the
three unaligned data words are p0, p1, and p2.

The loading operation is conducted by a TIE instruction with an auto-increment
of the address pointer by 16 for the next row. Sometimes, mv_x indicates an 8-byte
aligned position, that is, the three LSBs of mv_x are all zeros; then only two 8-byte
data words are loaded into user registers, and only p0 and p1 are used.
The address calculation is more complex than that for line-based frame memory
access. However, the hardware implementation of the address calculation demands
only a small number of gates, since all operations can be done with bit-wise operators
and simple adders; no complex arithmetic logic such as a multiplier is needed. In
order to access the column offset table (col_tbl) and the address pointers (p0, p1,
and p2) in parallel, these registers are grouped together as user registers.

[Figure: addressing of a reference macroblock in DataRAM — col_tbl selects the macroblock column, pointers advance by an offset of 16 per row, and loads respect the 8-byte alignment boundaries, for motion vector (mv_x, mv_y).]
Figure 7.11: Addressing of reference macroblock in DataRAM.

Address generation for the original macroblock is simple, since the data do not
depend on the motion vector. Once the original macroblock is loaded into DataRAM,
the start address for reading is always the same for every SAD calculation. Two
address registers are used for addressing the original macroblock, since it is stored
at an 8-byte aligned address. For one row of SAD calculation, two aligned 64-bit
original macroblock data words and three unaligned 64-bit reference macroblock
data words are loaded into user registers without waiting for memory access.
7.4.4 SAD Calculation
The SAD calculation is accelerated with SIMD-type TIE instructions. As described
in the DataRAM addressing, 8 pixels of a block are loaded into a user register
with a TIE instruction. Several TIE instructions are designed to process the loaded
data for the SAD calculation in parallel. First, one TIE instruction performs
unaligned parallel data loading for the original macroblock and the reference
macroblock. Another TIE instruction performs parallel data alignment for the
loaded reference macroblock data. A third TIE instruction calculates the SAD of
8 aligned pixels of the reference and the original macroblocks. In addition to the
user registers storing one row of the reference and the original macroblock data,
two 64-bit user registers are used to store the aligned reference macroblock data.
In order to reduce the number of instructions for the SAD calculation, some
additional operations, such as the auto address increment and the accumulation
of the SAD result, are combined into the TIE instructions.
Figure 7.12 shows the SAD calculation with TIE instructions. In the 64-bit load,
a destination register, a source address pointer, and an auto increment of the
address pointer are specified in the TIE arguments. Alignment is performed by the
ALIGN instruction with an argument determined by the horizontal motion vector.
The 8-pixel SAD operation (denoted by SAD8) and the accumulation operation
are combined into one TIE instruction. Thus, a total of eight TIE instructions is
used to calculate the SAD of one row of 16 pixels (denoted by SAD16). The number
of instructions is reduced to one tenth of the original implementation without TIE
instructions.
[Figure: LOAD64(org0, q0, 16), LOAD64(org1, q1, 16), LOAD64(ref0, p0, 16), LOAD64(ref1, p1, 16), LOAD64(ref2, p2, 16); ALIGN16(align) produces align0 and align1; two SAD8 operations accumulate into sad_acc to form SAD16.]
Figure 7.12: SAD16 Calculation with TIE instructions.
SAD8 can be implemented with the same TIE instructions used in the SAD16
calculation. In this case, one load of the original 8-pixel data, two loads of reference
8-pixel data, one alignment, and one SAD8 instruction are used. Compared to the
original C implementation, its speed is improved by 8 times.
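For reference, the plain C baseline that the SAD8/SAD16 TIE instructions replace is the scalar sum of absolute differences over a 16x16 block; a minimal sketch (with `stride` equal to 16 for the macroblock-based map, where one macroblock is contiguous):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar SAD of a 16x16 block: the non-TIE baseline. */
unsigned sad16x16(const uint8_t *org, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs((int)org[y * stride + x] -
                                 (int)ref[y * stride + x]);
    return sad;
}

/* Helper for testing: SAD of two flat blocks differing by d. */
unsigned sad_flat(int d)
{
    uint8_t a[256], b[256];
    for (int i = 0; i < 256; i++) {
        a[i] = 100;
        b[i] = (uint8_t)(100 + d);
    }
    return sad16x16(a, b, 16);
}
```

The TIE version replaces the inner loop's 16 load-subtract-abs-add sequences with a handful of wide loads, aligns, and SAD8 reductions, which is where the order-of-magnitude instruction-count reduction comes from.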
Motion estimation with UMC can be performed without special handling on top
of the pre-loaded reference macroblock memory architecture. Reference macroblock
data outside the image area are filled with the boundary pixel values of the image
when the reference macroblocks are loaded into DataRAM. This data filling is also
accelerated with a TIE instruction that copies an 8-bit value repeatedly into a
64-bit register. As a result, UMC is implemented without additional load on
the external memory bandwidth.
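The edge-fill behavior can be sketched as a byte "splat" that replicates one boundary pixel across a 64-bit register, producing eight fill pixels per operation (the name `splat8` is illustrative, not the TIE instruction's name):

```c
#include <stdint.h>

/* Replicate one 8-bit boundary pixel across a 64-bit register. */
uint64_t splat8(uint8_t pixel)
{
    uint64_t v = pixel;
    v |= v << 8;    /* 2 copies */
    v |= v << 16;   /* 4 copies */
    v |= v << 32;   /* 8 copies */
    return v;
}
```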
The reason to build three half-pel frames for half-pel resolution motion estimation
is to reduce the number of operations for half-pel pixel calculation. In the half-pel
motion estimation, eight half-pel positions are evaluated for every macroblock.
If the half-pel frames are pre-calculated, only three interpolation passes are
required; if, instead, half-pel pixels are calculated during motion estimation, eight
interpolations must be performed for each macroblock. Figure 7.13 shows the
number of cycles required per second for half-pel motion estimation in the old
architecture as a function of the memory latency.
The number of required cycles increases by 92% on average as the memory
latency grows from 1 to 24 cycles. This is due to data accesses to three different
memory locations when half-pel interpolation and motion estimation are performed,
which cause more cache misses than a single frame memory access. Thus, the
pre-calculated half-pel frame memory scheme no longer has an advantage under
long memory latency. On the other hand, the on-the-fly half-pel pixel interpolation
method does not depend on the memory latency, since it accesses only one frame
memory for motion estimation.
[Plot: cycles vs. memory latency (cycles) for half-pel frame generation, 16x16 half-pel ME, and 8x8 half-pel ME.]
Figure 7.13: Cycles for half-pel motion estimation with the old architecture.
However, on-the-fly interpolation still requires many operations for the half-pel
interpolation itself. This is overcome by specially designed TIE instructions that
perform the on-the-fly half-pel interpolation in parallel, so that the number of
instructions is reduced. In particular, TIE instructions have been designed for the
three half-pel interpolation directions, i.e., horizontal, vertical, and diagonal. Two
136-bit user registers are used to store 17 aligned pixels for the horizontal and
vertical interpolations. The SAD calculation for half-pel pixels is done in a similar
way to the SAD calculation for integer pels. Before the SAD calculation, the
unaligned reference macroblock data are aligned and interpolated with TIE
instructions. All data for the SAD calculation are loaded from DataRAM, so no
cache miss occurs.
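The scalar form of the half-pel arithmetic that the TIE instructions parallelize is the MPEG-4 half-sample interpolation (shown here with rounding control 0): horizontal and vertical positions average two pixels, the diagonal position averages four.

```c
#include <stdint.h>

/* Horizontal or vertical half-pel value (MPEG-4, rounding control 0). */
uint8_t halfpel2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Diagonal half-pel value from the four surrounding integer pixels. */
uint8_t halfpel4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)((a + b + c + d + 2) >> 2);
}
```

A TIE instruction applies this arithmetic to eight pixel pairs (or quadruples) at once from the aligned registers, which is why the on-the-fly scheme stays cheap despite the extra interpolations.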
7.4.5 Motion Compensation for Luminance Component
As described in the architecture, motion compensation for the luminance component
is performed within the motion estimation process. Since the reference macroblocks
are already in DataRAM, motion compensation can be performed without cache
misses. No new TIE instruction is needed, since the TIE instructions for motion
estimation can be used for motion compensation as well: instead of calculating the
SAD, the aligned data are stored into the motion compensated frame, which will
later hold the reconstructed frame. There are two disadvantages to this method.
First, it might cause additional cache misses due to accessing the motion
compensated frame in addition to the original frame and the reference frame.
Second, the mode decision for the motion vector type (1 motion vector or 4 motion
vectors) cannot be deferred, since the mode must be determined before motion
compensation is performed. However, a separate motion compensation operation
would cause more cache misses and thus longer memory latency, while the quality
degradation due to the pre-determination of the motion vector mode is small.
7.4.6 Motion Compensation for Chrominance Component
Motion compensation for the chrominance components is performed separately from
the motion estimation process since the chrominance data are not loaded into Data
RAM. The TIE instructions for chrominance motion compensation are the same
as those used for luminance motion compensation. The difference is that data are
loaded directly from the reference frame memory instead of DataRAM. Thus, the
performance is worse than that of luminance motion compensation. However, it is
better than preloading the chrominance data into DataRAM, since the data are used
only once. For UMC (unrestricted motion compensation), pixel values outside of the
image are filled into user registers directly to avoid external memory accesses. We
separated the cases of motion compensation with and without UMC. In the case of
motion compensation using pixels outside of the image, more complicated operations
are used to determine whether user registers should be filled with padded values or
loaded from memory. For the case without UMC, motion compensation is performed
without detecting the positions of outside pixels. Even when only a few macroblocks
demand outside pixels, the average operation cycles of our scheme are still smaller
than those of the extended-image scheme, which requires a bigger memory size and
higher bandwidth. Finally, half-pel motion compensation is performed with TIE
instructions for half-pel pixel interpolation, without pre-calculated half-pel frames.
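The fill-or-load decision for UMC can be sketched as below. MPEG-4 pads out-of-bounds reference samples by replicating the nearest edge pixel, which is what the coordinate clamp implements; the function and parameter names here are illustrative, not from the encoder itself:

```c
#include <stdint.h>

/* Clamp a coordinate into the valid picture range. */
static int clamp(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Fetch one reference pixel; out-of-bounds positions are padded by
   edge replication instead of reading from an extended frame. */
uint8_t umc_ref_pixel(const uint8_t *ref, int stride, int W, int H,
                      int x, int y) {
    return ref[clamp(y, 0, H - 1) * stride + clamp(x, 0, W - 1)];
}

/* A block needs the slow padded path only when its bounding box
   leaves the picture; otherwise the fast no-check loads are used. */
int block_needs_umc(int x, int y, int bw, int bh, int W, int H) {
    return x < 0 || y < 0 || x + bw > W || y + bh > H;
}
```

The per-block test is what separates the two code paths described above, so the common in-bounds case pays no per-pixel checking cost.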
7.5 Optimization of DCT Coding
After motion estimation and compensation for one frame, prediction errors are
calculated from the original frame and the reference frame. Prediction errors are
compressed into a bitstream after several computationally intensive operations such as
FDCT, quantization, zig-zag scanning and variable length coding (VLC). Then,
reconstruction processes are performed to make a decoded frame that will be used
as a reference frame in the next frame encoding. Reconstruction processes include
Inverse Quantization (IQ), Inverse DCT, and addition of the prediction error and
the motion compensated image. The input and the output of each process are either
block data (8x8) or macroblock data (16x16). These data should be stored in some
location. If they are stored in the external memory, there will be a lot of cache
misses. To avoid cache misses, we designed a software structure so that the DCT
coding and reconstruction processes use a small amount of internal memory (called
DataRAM). Figure 7.14 shows DCT coding and reconstruction with DataRAM. Input
and output data are stored in DataRAM. The motion compensated macroblock
data are loaded when prediction errors are calculated.
In addition to the prediction error and the motion compensated macroblock, some
tables used for encoding are stored in the remaining space of DataRAM. This reduces
the number of external memory accesses. These tables include the VLC tables, the
zig-zag scan table and the quantization table. Reconstructed prediction errors are
added to the motion compensated macroblock to reconstruct the macroblock, and
then the macroblock is stored to the reference frame.

Figure 7.14: DCT coefficient coding and reconstruction. (Diagram not reproduced: the six 8x8 blocks of a macroblock, the bitstream buffer, IDCT, and reconstruction (add) within DataRAM.)

If the block or the macroblock
is not coded, then reconstruction is not needed since the motion compensated
macroblock is already in the reference frame memory. This scheme reduces the
amount of block writing to the external frame memory by about 26%.
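The reconstruction-skip test can be sketched as follows. A plain uniform quantizer stands in for the real MPEG-4 inter quantizer here, and the function name is illustrative; the shape of the logic is the same either way:

```c
#include <stdint.h>

/* A block is "not coded" when every quantized coefficient is zero.
   The motion compensated data then already equals the reconstruction,
   so IQ, IDCT, the add, and the write-back can all be skipped.
   (A simple uniform quantizer stands in for the MPEG-4 one.) */
int block_is_coded(const int16_t dct[64], int qstep) {
    for (int i = 0; i < 64; ++i)
        if (dct[i] / (2 * qstep) != 0)  /* nonzero quantized level? */
            return 1;
    return 0;  /* all-zero block: reconstruction unnecessary */
}
```

The roughly 26% saving in external block writes reported above comes from skipping exactly these all-zero blocks.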
In each processing step, some TIE instructions have been designed to accelerate the
processing speed. However, the major performance gain is obtained from the low data
cache miss rate achieved with DataRAM. For the FDCT speed-up, we designed special
TIE instructions as described in the following section.
7.5.1 Forward DCT
In the old architecture, the Forward DCT (FDCT) operation was accelerated by a
fixed-point fast algorithm. It was not a hot spot since other processing units took
many more cycles than FDCT processing. After the optimization of the encoding
structure, FDCT processing became a significant portion of the total cycles for
encoding. We tried two different acceleration methods for the TIE instruction
design: the MMX style and the DSP style. The MMX style implementation utilizes
SIMD type instructions to perform 1-D FDCT while the DSP style implementation
utilizes MISD (Multiple Instruction Single Data) type instructions.
For the design of SIMD instructions, the calculation of FDCT is decomposed into
basic operators such as addition, multiplication, and shift. Then, TIE instructions
are designed and each operation is applied to multiple data. These TIE instructions
can be used for other processing tasks since their operations are very general. However,
this implementation has some overhead in data arrangement when data reordering
is needed. This method is not efficient for the FDCT implementation in particular,
since there are many irregular butterfly operations. In addition to the performance
drawback, the logic sizes of the TIE instructions cannot be balanced, since the logic
size varies greatly across operations. For instance, a multiplication operation
requires much more logic than an addition operation. Therefore, it may demand a
large logic unit and/or a time-critical path when TIE instructions are synthesized
into logic.
The DSP style implementation of the FDCT is less flexible since several
instructions are combined only for FDCT. However, the performance can be better since
there is less overhead to arrange data.

Figure 7.15: TIE instructions for 1-D FDCT processing. (Diagram not reproduced: a dataflow of multipliers, adders, and shifters mapping inputs x[0]..x[7] through 16-bit and 32-bit registers to outputs X[0]..X[7].)

FDCT is widely used in many image and video compression standards. Thus, it is worthwhile to design special TIE
instructions for FDCT. We grouped several instructions into a single TIE instruction for 1-D
FDCT as shown in Figure 7.15. The instructions are designed to have a similar
complexity in terms of the number of additions and multiplications. The number
of multiplications in a TIE instruction is limited to two. A total of eight TIE
instructions is designed for 1-D FDCT. The data loading from memory and the
data storing to memory can be parallelized with 64-bit loads and stores,
respectively.
The original C code needs 1570 cycles/block. The accelerated FDCT with MMX-style
TIE instructions requires 928 cycles/block. Furthermore, the new TIE instructions
need only 520 cycles/block with 9 well-balanced TIE instructions.
7.6 Summary of Optimized Results
The Xtensa ISS tool is used for simulation of the user program with the designed TIE
instructions on a configured Xtensa processor. The simulator enables the user to
analyze, debug, and fine-tune the performance of application software before
committing to fabrication. Simulation results provide the number of CPU clock cycles
used for each processor component, such as the instruction cache, the data cache,
and the external memory interface, as well as the total number of instructions. The
Xtensa profiling tool allows users to analyze the number of cycles used for each
function of the user program after the ISS simulation. We performed the ISS
simulation to obtain the optimization performance.
The Xtensa processor is configured for the MPEG-4 video codec IP. For faster data
loading, the external bus interface is configured with a 64-bit bus width, which
enables 8-byte parallel loading into 64-bit user registers. The instruction cache is
configured to be 4KB with a line size of 32 bytes and direct-mapped associativity.
DataRAM is configured in the Xtensa processor to be utilized for frequently used
data. The data cache must be configured with a size of 3x2^N bytes in order to use
DataRAM. Thus, a memory of 3KB is configured for the data cache. The line size of
the data cache is set to 32 bytes after the evaluation of different line sizes. A total
of 30 TIE instructions are designed to accelerate video processing.
The performance of the optimized MPEG-4 video encoding software was measured
on the configured Xtensa processor. For the simulation, the QCIF (176x144) Foreman
sequence at a frame rate of 15 frames/second was used as the encoder input. A total
of 15 frames was encoded in the simulation. The encoding parameters were fixed for
all simulations. The MPEG-4 Simple Profile and Level 1 video encoder was used.
The VOP structure was set to one Intra-coded VOP followed by 14 Inter-coded
VOPs. The quantization step sizes of Intra-coded and Inter-coded VOPs were set to
11 and 12, respectively. The output bitrate was 66Kbps. With these parameters, the
PSNR values of the reconstructed video sequence were 31.699 dB, 38.490 dB, and
38.891 dB for the Y, U, and V components, respectively.
The performance of the encoder software is measured by the number of CPU
clock cycles required to process one second of input video frames. The number is
reported by the Xtensa profiling tool. The report includes the number of cycles used
for each function of the encoder software. We compared the performance between
the old structure and the new structure of the encoder software. The performance
difference in terms of the external memory latency was observed and compared.
The performance comparison of the old and new software structures is given in
Table 7.1. The memory latency is set to 22 CPU clock cycles in the simulations.
The new software structure outperforms the old one under this long memory latency
for all processing components. Motion estimation has the biggest
gain, thanks to the macroblock-based memory map and the DataRAM usage. The
next biggest gain is achieved by FDCT, since TIE instructions are designed for FDCT
and its input and output data are stored only in DataRAM, avoiding cache misses.
The overall gain of the new structure is 4.1 times.
Table 7.1: Performance Comparison at Memory Latency = 22 CPU Clock Cycles

Functions              Old Structure (MIPS)   New Structure (MIPS)   Improvement Ratio
Motion Estimation             74.64                  11.46                 6.51
Motion Compensation           13.37                   5.22                 2.56
FDCT                          13.75                   2.36                 5.83
Quantization                   5.95                   3.04                 1.96
VLC                           14.79                   4.29                 3.44
Inverse Quantization           1.94                   1.47                 1.32
IDCT                           6.70                   3.18                 2.11
Reconstruction                 0.77                   0.26                 2.92
Others                         9.21                   2.76                 3.33
Total                        141.11                  34.05                 4.14
The performance comparison for different memory latencies is shown in Figures
7.16-7.20. In general, the number of required cycles increases as the memory
latency increases. However, the slopes of the old and new structures are different.
This indicates that the new structure has fewer cache misses, since the slope
depends on the number of cache misses. Finally, the total data cache allocation
cycles are compared in Figure 7.21. The number of data load cache misses is fixed
for each structure, and the slopes of the two structures coincide with their numbers
of cache misses.
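The linear dependence described here, cycles(L) ≈ base + misses × L, is what makes the slope of each curve a direct read-out of the cache miss count. A sketch of recovering both quantities from two profiled points (names and numbers are illustrative, not taken from the simulations):

```c
/* Cycles grow linearly with external memory latency:
       cycles(L) = base + misses * L
   so the slope of each curve in Figures 7.16-7.21 equals the number
   of cache misses.  Two profiled (latency, cycles) points suffice. */
typedef struct {
    long base;    /* latency-independent compute cycles */
    long misses;  /* cache miss count (the slope)        */
} cycle_model;

cycle_model fit_cycle_model(int lat0, long cyc0, int lat1, long cyc1) {
    cycle_model m;
    m.misses = (cyc1 - cyc0) / (lat1 - lat0);  /* slope */
    m.base = cyc0 - m.misses * (long)lat0;     /* intercept */
    return m;
}
```

Under this model, a flatter curve (smaller fitted `misses`) is exactly the property the new software structure exhibits.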
Figure 7.16: Cycles for motion estimation vs. memory latency. (Plot not reproduced.)
Figure 7.17: Cycles for motion compensation vs. memory latency. (Plot not reproduced.)
The configured processor with TIE is synthesized with a 0.18 um process. The
new TIE has a gate count of 106K gates while the old TIE has 38K gates. The
performance gain comes from the new TIE instructions as well as from the improved
software structure, which is less affected by external memory latency.
Figure 7.18: Cycles for FDCT vs. memory latency. (Plot not reproduced.)
Figure 7.19: Cycles for VLC encoding vs. memory latency. (Plot not reproduced.)
7.7 Conclusion
In this chapter, we discussed how to optimize the MPEG-4 video encoding software
architecture for the given embedded processor. The processing units are reorganized
to reduce the total frame memory access. The frame memory map is changed to a
macroblock-based memory map that works more efficiently in a small cache system.
TIE instructions are designed to accelerate address calculation and memory accesses.
Figure 7.20: Cycles for encoding process vs. memory latency. (Plot not reproduced.)
Figure 7.21: Cycles for missed data cache allocation vs. memory latency. (Plot not reproduced.)
In addition to the structural optimization, FDCT performance is improved with the
DSP style TIE instructions. The overall improvement after the optimization is a
4.1-fold reduction in required cycles at an external memory latency of 22 cycles.
Optimization results were shown for different external memory latencies. From
these results, we observe that the performance of the new software structure is less
affected by the external memory latency, so it fits an SoC implementation with
an embedded processor. The same optimization techniques can be applied to other
video processing tasks such as MPEG-4 video decoding and other video compression
standards.
Chapter 8
Conclusion
We investigated the error propagation behavior in the motion compensated prediction
(MCP) video coding process. Based on our investigation, a corruption model was
proposed that links the packet loss effect to the end-to-end quality degradation.
The accuracy of the corruption model was illustrated with experimental results. In
addition, various robust coding techniques were explored, in which the corruption
model serves as an important tool.
UEP techniques over the DiffServ network were tested, and an optimal mapping
method was proposed to minimize end-to-end quality degradation. The optimal
mode selection technique was also applied to AIR. Furthermore, it was shown that
the error tracking process can exploit the corruption model for improved
performance when feedback information is available.
A robust video delivery system, in which the UEP, AIR, and ET techniques are
coordinated, was proposed and studied. The performance of the proposed system
was demonstrated with extensive experimental results. Finally, the computational
complexity of the corruption model was analyzed, and it was shown that the overhead
of the computation is small.
In the MPEG-4 video codec design and its optimization with a configurable
processor, it was shown that the optimization method enables video communication
SoC design with a small gate count and a fast development time. In the
optimization, TIE instructions are specially designed for video compression and
decompression without redundant instructions, whereas many of the multimedia
instructions in general purpose processors with MMX speed-up and in DSP processors
go unused. The performance improvement can be achieved with a minimal gate count
while the flexibility of the codec is maximized, since the optimization is at the
instruction level and most codec algorithms are implemented in C/C++ code.
In order to overcome the constraints of the embedded processor on an SoC, the
software structure was re-designed. The new structure has a great advantage in
utilizing the cache memory, so that its performance is not much dependent on the
external memory latency, which is one of the common constraints in SoC design.
To summarize, two main research results were presented for robust video delivery
in this thesis. The first topic contributes to robust video coding and decoding while
the second contributes to the SoC implementation of the video codec in the video
delivery system.