ALGORITHMS FOR SCALABLE AND NETWORK-ADAPTIVE VIDEO CODING AND TRANSMISSION

by

Huisheng Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2007

Copyright 2007 Huisheng Wang

Dedication

To my family and the unforgettable memory of my beloved father.

Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Antonio Ortega, for his guidance, inspiration, support and patience throughout the years I have been pursuing my Ph.D. degree at the University of Southern California. Thanks to Prof. Zhen Zhang and Prof. Cyrus Shahabi for serving on my dissertation committee, and Prof. C.-C. Jay Kuo and Prof. Shrikanth S. Narayanan for serving on my qualifying exam committee. I would also like to thank Prof. Christos Kyriakakis for his support and guidance during my first year at USC. The internship experience at both La Jolla Lab, STMicroelectronics Inc. and HP Labs has been enjoyable and productive. I am grateful to my mentors Dr. George Chen and Dr. Debargha Mukherjee, who gave me the opportunity to work with great minds at those labs. I would also like to thank all my group members and friends for their friendship and assistance. Last, but not least, I thank my family for their consistent support through these years. I thank my mother for her unconditional love and support; I thank my daughter for bringing me a completely new life as a mother with so many joyful moments; I especially thank my husband, Yinqing Zhao, for his love, encouragement and many technical discussions on my research projects.

This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Overview of a Video Communication System
    1.2.1 Delay-Constrained Video Transmission
    1.2.2 Bandwidth Variation and Transmission Impairments
    1.2.3 Motion-Compensated Temporal Prediction
  1.3 Scalable Coding
    1.3.1 MPEG-2 SNR Scalability
    1.3.2 Fine Granularity Scalability (FGS) and its Variants
    1.3.3 Multiple Description Coding
  1.4 Distributed Source Coding
  1.5 Rate-distortion Optimized Packet Scheduling
  1.6 Contributions of This Research

Chapter 2: Multiple Description Layered Coding (MDLC)
  2.1 Introduction
  2.2 Proposed MDLC Codecs
    2.2.1 General Structure
    2.2.2 Codec 1: DCT Duplication and Alternation
    2.2.3 Codec 2: based on MPEG-4 FGST with Temporal Subsampling
  2.3 Experimental Results
  2.4 Conclusions

Chapter 3: Rate-Distortion Based Scheduling of Video with Multiple Decoding Paths
  3.1 Introduction
  3.2 Review of Basic RaDiO Framework
  3.3 Source Modelling for Redundant Representations
    3.3.1 Directed Acyclic Hypergraph (DAHG)
    3.3.2 Parameters Associated with DAHG
    3.3.3 Expected End-to-End Distortion
  3.4 Scheduling Algorithms with DAHG
    3.4.1 System Architecture
    3.4.2 Optimization Problem Formulation
    3.4.3 Lagrangian Optimization Algorithm
    3.4.4 Greedy Algorithm
    3.4.5 Transport Redundancy
    3.4.6 Complexity Analysis
  3.5 Experimental Results
    3.5.1 Comparison between Scheduling Algorithms
    3.5.2 Redundancy's Role in Adaptive Streaming
      3.5.2.1 Transport Redundancy
      3.5.2.2 Source Redundancy without Transport Redundancy
      3.5.2.3 Source Redundancy with Transport Redundancy
  3.6 Conclusions

Chapter 4: A Framework for Adaptive Scalable Video Coding Using Wyner-Ziv Techniques
  4.1 Introduction
  4.2 Successive Refinement for the Wyner-Ziv Problem
  4.3 Proposed Prediction Framework
    4.3.1 Brief Review of ET Approach
    4.3.2 Formulation as a Distributed Source Coding Problem
  4.4 Proposed Correlation Estimation
    4.4.1 Problem Formulation
    4.4.2 Mode-Switching Prediction Algorithm
    4.4.3 Direct Estimation
    4.4.4 Model-based Estimation
  4.5 Codec Architecture and Implementation Details
    4.5.1 Encoding Algorithm
    4.5.2 Decoding Algorithm
    4.5.3 Complexity Analysis
  4.6 Experimental Results
    4.6.1 Prediction Mode Analysis
    4.6.2 Rate-distortion Performance
      4.6.2.1 Coding efficiency of WZS
      4.6.2.2 Rate-distortion performance vs. base layer quality
      4.6.2.3 Comparisons with Progressive Fine Granularity Scalable (PFGS) coding
  4.7 Conclusions

Chapter 5: Conclusions and Future Work

Bibliography

List of Tables

4.1 Channel parameters and the a priori probabilities for the 3rd bit-plane of frame 3 of the Akiyo CIF sequence when the BL quantization parameter is 20 (with the same symbol notation as Fig. 4.4).
4.2 Coding overhead for the News sequence.
4.3 Rate savings due to WZS for WZS blocks only (percentage reduction in rate with respect to using FGS instead for those blocks).
4.4 Overall rate savings due to WZS (percentage reduction in overall rate as compared to FGS).
4.5 Comparisons between MCLP and WZS.
4.6 The base layer PSNR (dB) for different QP.
List of Figures

1.1 Delay components of a communication system. Adapted from [54].
1.2 Block diagram for encoder and decoder in a typical block-based hybrid video coding system. ME: motion estimation, MC: motion compensation, FM: frame memory, DCT: discrete cosine transform, IDCT: inverse DCT, VLC: variable-length entropy coding, VLD: variable-length entropy decoding, Q: quantization.
1.3 MPEG-2 SNR decoder.
1.4 Simplified diagram of the FGS encoder structure used in the MPEG-4 Microsoft reference software.
1.5 Source coding with side information, adapted from Figure 1 in [58]. (a) SI Y is available at both the encoder and decoder; (b) SI Y is available at the decoder only.
2.1 Structure of the proposed MDLC codec.
2.2 MDLC scheme based on MPEG-4 FGST. I: I-frame, P: P-frame, F: the enhancement layer generated by coding the residual between the original frame and its base layer reconstruction, FT: the enhancement layer generated by FGST using forward prediction from the base layer of its previous frame. The subscript of each label indicates the frame number. EL_0 in Figure 2.1 is not shown here as it is simply composed of F_i, identical to either EL_1 or EL_2 based on the frame index.
2.3 Rate-distortion curves of MDC codecs with different quantization parameters, MD1 and MD2 of MDLC for Foreman and Mobile, measured at the encoder without transmission impacts.
2.4 Performance comparison between MDLC and a set of MDC schemes with different quantization parameters for Foreman and Mobile.
3.1 A DAG example for a LC system.
3.2 The DAHG model of the MDLC scheme shown in Figure 2.2. One of the two base layer nodes (filled with gray color), which has zero data size, is decoded as a copy or motion interpolation from the other description. We label each node sequentially as l_1, l_2, ..., starting from frame 1. Specifically, in frame 2, l_3, l_4, l_5 and l_6 correspond to BL_1, BL_2, EL_1 and EL_2, respectively.
3.3 Another example of DAHG to represent multiple independent encodings of a video sequence. This model can also be used to represent error concealment.
3.4 Description of multiple clique states and multiple decoding paths using cliques C_11, C_21 and C_22 in Figure 3.2 as an example. (a) Multiple clique states and multiple decoding paths. In frame 2, l_3 is decoded as a direct copy of the reconstructed frame 1, and l_4 produces a reconstructed frame with better quality than l_3. Each circle in the figure is labelled by a combination of decoding path and clique state in the form "decoding path : clique state". A decoding path is represented by a concatenation of each ancestor clique state. Nothing before the colon in C_11 indicates that it has no parents and there is only a virtual decoding path leading to C_11. (b) Distortion-related parameters assigned to frame 2.
3.5 Streaming system architecture.
3.6 Comparison between scheduling algorithms at PLR = 0.15 for various playback delays. The base layer quantization parameters for Mobile, Akiyo, and Foreman are set to 12, 20, and 20, respectively.
3.7 Rate-distortion curves of LC, MD1 and MD2 of MDLC for Foreman and Mobile, measured at the encoder without transmission impacts.
3.8 The impact of transport redundancy on streaming performance when using the Lagrangian optimization algorithm.
3.9 Comparing LC and MDLC with limited retransmissions. The performance at w = 160 ms and PLR = 0.3 for both sequences is not included in the figure as the low PSNR achieved is out of the acceptable range.
3.10 Comparing LC and MDLC with unlimited retransmissions.
4.1 Two-stage successive refinement with different side information Y_1 and Y_2 at the decoders, where Y_2 has better quality than Y_1, i.e. X → Y_2 → Y_1.
4.2 Proposed multi-layer prediction problem. BL_i: the base layer of the ith frame. EL_ij: the jth EL of the ith frame, where the most significant EL bit-plane is denoted by j = 1.
4.3 Basic difference at the encoder between the CLP techniques such as ET and our proposed problem: (a) CLP techniques, (b) our problem setting.
4.4 Discrete memoryless channel model for coding u_k: (a) binary channel for bit-planes corresponding to absolute values of frequency coefficients (i.e., u_{k,l} at bit-plane l), (b) discrete memoryless channel with binary inputs ("-1" if u_k^l < 0 and "1" if u_k^l > 0) and three outputs ("-1" if v_k^l < 0, "1" if v_k^l > 0 and "0" if v_k^l = 0) for sign bits.
4.5 Measurement of approximation accuracy for the Akiyo and Foreman sequences. The crossover probability is defined as the probability that the values of the source u_k and the side information do not fall into the same quantization bin. The average and maximum absolute differences over all frames between the two crossover probabilities are also shown.
4.6 Crossover probability estimation. The shaded square regions A_i correspond to the event that crossover does not occur at bit-plane l.
4.7 Model parameters of u_k estimated by EM using the video frames from Akiyo.
4.8 Diagram of WZS encoder and decoder. (a) WZS encoder, (b) WZS decoder.
FM: frame memory, ME: motion estimation, MC: motion compensation, SI: side information, BL: base layer, EL: enhancement layer, VLC: variable-length encoding, VLD: variable-length decoding.
4.9 The block diagram of the mode selection algorithm.
4.10 WZS-MB percentage for sequences in CIF and QCIF formats (BL quantization parameter = 20, frame rate = 30 Hz).
4.11 Percentages of different block modes for the Akiyo and Coastguard sequences (BL quantization parameter = 20, frame rate = 30 Hz).
4.12 Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for the Akiyo and Container Ship sequences.
4.13 Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for the Coastguard and Foreman sequences.
4.14 Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for the News sequence.
4.15 Comparison between WZS and “WZS-SKIP only” for the Akiyo and Coastguard sequences.
4.16 The PSNR gain obtained by WZS over MPEG-4 FGS for different base layer qualities.
4.17 Comparison of WZS with MPEG-4 PFGS for the Foreman CIF sequence (base layer QP = 19, frame rate = 10 Hz). The PFGS results are provided by Wu et al. from [31].

Abstract

Real-time multimedia services over the Internet face some fundamental challenges due to the time constraints of those applications and network variations in bandwidth, delay and packet loss rate. Our research addresses the problem of network-adaptive video coding and streaming based on source codecs that provide scalability to match the network environment.
The first part of the thesis focuses on scheduling algorithm design for network-adaptive video streaming. We extend previous work on rate-distortion optimized video streaming to address more general coding techniques that support multiple decoding paths to enhance adaptation flexibility; prior work had considered only a single decoding path. Examples of multiple decoding paths include cases where there are multiple redundant representations of the media data or where error concealment is used. One example of such codecs is our proposed multiple description layered coding (MDLC), which combines the advantages of layered coding and multiple description coding. This work is composed of several main components: (1) a new source model, called the directed acyclic hypergraph (DAHG), to estimate the expected end-to-end distortion; (2) rate-distortion based scheduling algorithms that dynamically adjust the system's real-time redundancy to match the channel behavior; and (3) a performance analysis of both source redundancy and transport redundancy. Experimental results show that the proposed streaming framework can provide very robust and efficient video communication for real-time applications over lossy packet networks.

The second part proposes a framework for adaptive scalable video coding using Wyner-Ziv techniques. The current scalable video coding standards suffer to some degree from a combination of lower coding performance and higher coding complexity, as compared to non-scalable coding. A key issue is how to exploit temporal correlation efficiently in scalable coding. We propose a novel scalable coding approach that introduces distributed source coding into enhancement layer prediction in order to achieve better coding performance with reasonable encoding complexity. Experimental results show significant improvements in coding efficiency over MPEG-4 FGS, especially for video sequences with high temporal correlation.
Chapter 1: Introduction

1.1 Motivation

Recent technological developments in computing, compression, storage devices, and high-speed networks have made it feasible to provide real-time multimedia services over the Internet. Unlike data communications, which are usually not under strict delay constraints, real-time multimedia communication is delay sensitive, i.e., data become useless when they arrive late. Real-time delivery of live or pre-encoded (stored) video plays an important role in real-time multimedia. For distribution of live video, such as video conferencing or the live broadcast of an event, encoding and decoding must be accomplished in real time. In many other applications, such as video on demand, video content is pre-encoded and stored for later viewing. Pre-encoded video has the advantage that it is not subject to the real-time encoding constraint, and thus allows more complicated and efficient encoding techniques; for example, multiple redundant representations can be created in advance for the same video content.

The most popular and widely deployed media communication applications are likely to be those that are accessible over the Internet, through both the currently predominant wired channels and emerging wireless channels. Video transmission on these networks is characterized by variations in channel bandwidth, delay and packet loss rate, which can severely affect the reproduction quality of the video delivered through the network. In addition, the increasing heterogeneity of networks and network access devices makes video streaming more difficult. Thus, real-time multimedia communication over the Internet and wireless networks has posed a number of challenges [23,27,76,88].

To address these challenges, we propose network-adaptive video coding and transmission solutions. Specifically, we aim at providing efficient, robust, scalable and delay-constrained media coding and streaming.
The traditional rate control techniques [54,75] aim to optimize the media quality for a fixed bit rate. This poses a problem when multiple users try to access the same media source through different network links and with different computing powers. Even in the case of a single user accessing one media source over a link, when this link suffers from varying channel conditions, relying on a complex rate-control algorithm to make rate adjustments in real time may not be practical (for example, if the changes in rate have to occur in a very short time frame). Thus, it is highly desirable to provide scalability through different video coding methods and transport mechanisms. Scalability refers to the ability to recover physically meaningful image or video information by decoding only part of the compressed bitstream. It is very useful in two respects: (1) providing error resilience to combat potential transmission errors, and (2) enabling dynamic content adaptation to different network and terminal characteristics and user requirements.

Layered coding (LC) [44] and multiple description coding (MDC) [29,89] have been proposed as two different kinds of “quality adaptation” schemes for video delivery over the current Internet or wireless networks. LC addresses the problem of network bandwidth variation through a sequence of dependent layers, while MDC provides error resilience by introducing redundancy explicitly into its independent descriptions. Most research comparing LC and MDC [13,43,53,63,65,72,101] leads to the conclusion that LC and MDC can each be preferable in different scenarios (e.g., low packet loss rate and long end-to-end delay for LC vs. high loss rate and short delay for MDC), though a few studies [13,53] also show that LC with good rate allocation and scheduling techniques may outperform MDC over a broad range of scenarios.
This motivates us to create a new multiple description layered coding (MDLC) approach that combines the advantages of LC and MDC to provide graceful adaptation over a wider range of application and network scenarios (see Chapter 2). The MDLC codec produces multiple redundant representations, increasing the flexibility with which a video server can adapt to varying network conditions without re-encoding the video stream or completely switching between different encoding modes on the fly. But in order to fully exploit the adaptation flexibility of an MDLC codec with redundancy, the system requires an intelligent transport mechanism.

Scalable coding techniques make it easier for media servers to adapt to varying network conditions in real time. To do this, an intelligent transport mechanism is required to select the right packets (layers or descriptions) to send at a given transmission time so as to maximize the playback quality at the decoder. Some recent work has focused on rate-distortion optimized scheduling algorithms for scalable video streaming [11,19–21,50,51]. In this setting, packets are not equally important, as they differ in their distortion contributions, playback deadlines, and the packet dependencies caused by temporal prediction and layering. Runtime feedback information is exploited to make transport decisions based on the current network condition and the status of transmitted packets (i.e., received or not). However, those works mainly address encoding techniques, such as layered coding, that generate packets that can only be decoded following a single decoding path (SDP): a packet can be decoded with distortion reduction d only when all of the data units it depends on are received and decodable; otherwise, it contributes 0. In fact, a source codec that supports multiple decoding paths (MDP) can greatly enhance the adaptation flexibility of a streaming system.
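The SDP decodability rule just described, and the extra flexibility an MDP codec adds, can be made concrete with a small sketch. The data units, dependency sets and distortion values below are invented for illustration and are not the thesis's actual scheduler or model:

```python
# Sketch: distortion reduction under single vs. multiple decoding paths.
# Unit names, dependencies and distortion numbers are illustrative only.

def sdp_reduction(received, unit, deps, d):
    """SDP rule: a unit contributes d[unit] only if it and every data unit
    it depends on have been received; otherwise it contributes nothing."""
    if unit in received and all(a in received for a in deps[unit]):
        return d[unit]
    return 0.0

def mdp_reduction(received, paths):
    """MDP rule: several decoding choices exist; take the best path whose
    required units all arrived. `paths` maps a frozenset of required
    units to the distortion reduction that decoding path achieves."""
    feasible = [dist for req, dist in paths.items() if req <= received]
    return max(feasible, default=0.0)

deps = {"BL1": set(), "EL1": {"BL1"}}
d = {"BL1": 10.0, "EL1": 4.0}

# Layered coding (SDP): losing the base layer makes the enhancement useless.
print(sdp_reduction({"EL1"}, "EL1", deps, d))   # 0.0

# MDLC-style MDP: the frame is decodable via description 1 or description 2,
# with slightly different quality for each path.
paths = {frozenset({"BL1"}): 10.0, frozenset({"BL2"}): 9.0}
print(mdp_reduction({"BL2"}, paths))            # 9.0 (falls back to description 2)
```

The point of the sketch is the scheduling consequence: under MDP the expected distortion reduction of a transmission depends on which subset of units arrives, not on a single all-or-nothing chain, which is exactly what the DAHG model of Chapter 3 is built to capture.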
Multiple decoding paths can arise when multiple redundant representations of the same video content are created or when error concealment techniques are used. Examples of such codecs include MDLC and multiple independently encoded video streams. Compared to an SDP codec, an MDP codec introduces two additional features that the existing scheduling techniques fail to address: (1) redundancy across data units, and (2) more than one decoding choice for a given data unit, each with a different distortion reduction, depending on which subset of the data units it depends on is received. In our research, we extend the streaming framework in [19,20] to address the more general problem in which multiple decoding paths exist, taking into account both the dependency and the redundancy relations among data units. The rate-distortion based scheduling algorithms proposed in Chapter 3 can dynamically adjust the system's real-time redundancy to match the channel behavior, thus achieving better overall expected end-to-end performance.

Unfortunately, all current scalable video coding standards suffer to some degree from a combination of lower coding performance and higher coding complexity, as compared to non-scalable coding. A key issue is how to exploit temporal correlation efficiently in scalable coding. It is well known that motion prediction increases the difficulty of achieving efficient scalable coding because scalability leads to multiple possible reconstructions of each frame [66]. In this situation either (i) the same predictor is used for all layers, which leads to either drift or coding inefficiency, or (ii) a different predictor is obtained for each reconstructed version and used for the corresponding layer of the current frame, which leads to added complexity. MPEG-2 SNR scalability with a single motion-compensated prediction (MCP) loop and MPEG-4 FGS exemplify the first approach. Some advanced approaches with multiple MCP loops are described in [7,33,66,79,93].
Distributed source coding (DSC) techniques based on network information theory provide a different and interesting viewpoint from which to tackle these problems. DSC first arose in the context of information-theoretic problems, with compression bounds established in the 1970s by Slepian and Wolf [73] for distributed lossless coding and by Wyner and Ziv [94] for lossy coding with decoder side information (SI). These initial results, however, did not address practical design. Recently, DSC has become an area of increasing research interest in the signal processing community for its potential applications [3,58,60,69] in both video coding (e.g., error robustness to channel losses or reduced encoding complexity) and multiterminal communication systems (e.g., sensor networks). In this research, we are particularly interested in the problem of source coding with SI that is known only at the decoder. In closed-loop prediction (CLP), in order to prevent drift at the decoder, the encoder needs to generate the same predictor that will be available at the decoder. A DSC encoder, in contrast, only needs access to the correlation structure between the current signal and the predictor. Thus there is no need to reproduce the decoded signal at the encoder as long as the correlation structure is known, or can be found.

[Figure 1.1: Delay components of a communication system. Adapted from [54].]

Based on DSC principles, we propose a Wyner-Ziv scalable (WZS) coder in Chapter 4 that can achieve higher coding efficiency (up to 3 to 4.5 dB gain over MPEG-4 FGS for video sequences with high temporal correlation) by selectively exploiting the high-quality reconstruction of the previous frame in the enhancement layer coding of the current frame.
This creates a multi-layer Wyner-Ziv prediction “link”, connecting the same bit-plane level between successive frames, thus providing improved temporal prediction as compared to MPEG-4 FGS, while keeping encoder complexity reasonable.

In the rest of this chapter, we first provide an overview of a video communication system, and then briefly review the related areas of scalable coding, distributed source coding, and rate-distortion optimized packet scheduling. The contributions of the thesis are summarized at the end of the chapter.

1.2 Overview of a Video Communication System

1.2.1 Delay-Constrained Video Transmission

Figure 1.1 provides an abstraction of a real-time video communication system in terms of delay components. In contrast to data communication or simple media downloading, real-time video streaming is often subject to strict delay constraints. The main difference with respect to downloading applications is that media playback starts while data is still being received, so playback could be interrupted if the decoder ran out of data to decode. As data starts to reach the client, decoding does not start immediately; instead, the client waits for a predetermined startup delay in order to accumulate enough data for decoding. Both encoder and decoder buffers can be used to smooth out the bit rate variation produced by the encoder as well as the channel delay variation, so that the decoded video can be played out at a constant frame rate. Note that when considering pre-encoded media, the encoding delay does not exist, as the video has already been encoded and is ready for transmission. In general, video streaming applications impose some restriction on the initial startup delay. Furthermore, the end-to-end delay requirement imposes a constraint on the encoding rate for each frame, or on the number of layers to be transmitted for a scalable bit stream.
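The startup-delay bookkeeping described above can be sketched in a few lines. All numbers (delay components, frame rate, startup delay, arrival times) are invented for illustration; Figure 1.1 defines the delay components but gives no values:

```python
# Sketch of delay-constrained playback: a frame is useful only if it is fully
# received before its playout instant. All timing values are illustrative.

def playback_deadline(frame_idx, startup_delay, frame_period):
    """A frame must arrive by the startup delay plus its playout instant (ms)."""
    return startup_delay + frame_idx * frame_period

def usable(arrival_time, frame_idx, startup_delay, frame_period):
    """For real-time playback, late data is as bad as lost data."""
    return arrival_time <= playback_deadline(frame_idx, startup_delay, frame_period)

# End-to-end delay as the sum of its components (all in ms):
# encoding, encoder buffer, channel, decoder buffer, decoding.
delay = dict(enc=5, enc_buf=40, channel=80, dec_buf=40, dec=5)
end_to_end = sum(delay.values())       # 170 ms

startup_delay = 200                    # client waits 200 ms before playout starts
frame_period = 1000 / 30               # 30 frames per second

# Frame 0 arriving after the full end-to-end delay still meets its deadline.
print(usable(end_to_end, 0, startup_delay, frame_period))   # True
# A retransmitted copy of frame 3 arriving at 350 ms misses its deadline
# (deadline: 200 + 3 * 33.3 = 300 ms) and is discarded.
print(usable(350, 3, startup_delay, frame_period))          # False
```

This is the constraint the text refers to: the startup delay fixes a per-frame deadline, which in turn bounds how many bits (or scalable layers) of each frame can usefully be transmitted.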
1.2.2 Bandwidth Variation and Transmission Impairments

Most currently deployed networks provide no quality-of-service (QoS) guarantees. Thus it is expected that both the network conditions and the available bandwidth of the underlying network may change during a real-time video communication session. Different types of transmission errors may occur, e.g., packet erasure errors or bit errors in IP-based or wireless networks. Random bit errors may ultimately lead to erasures in a variable-length coded bit stream, since a single bit error can cause the remaining bits in a data packet to be undecodable. For real-time video, data packets are also treated as lost if they arrive
A block is first predicted from a few previously reconstructed reference frames using block-based motion estimation. The mo- tion vector (MV) indicates the displacement between the current block and the selected reference block. The predicted block is then formed from the reference frame by using motion compensation with the estimated MV. The prediction error block is coded using the DCT transform, quantization and finally variable-length entropy coding (VLC). It 8 DCT Q VLC Q -1 IDCT Motion vectors input video FM MC ME V L D Q -1 IDCT MC FM Motion vectors Coded bit stream Coded bit stream Reconstructed video + + - (a) Encoder + (b) Decoder Figure 1.2: Block diagram for encoder and decoder in a typical block-based hybrid video coding system. ME: motion estimation, MC: motion compensation, FM: frame mem- ory, DCT: discrete cosine transform, IDCT: inverse DCT, VLC: variable-length entropy coding, VLD: variable-length entropy decoding, Q: quantization. is also noted from Figure 1.2 that the same reconstruction process is required at both the encoder and decoder. The introduction of CLP-based MCP also makes the coded bit stream very sensitive to transmission errors, which may cause error propagation over time. Distributed source coding techniques to be discussed in Section 1.4 lead to an alternative to closed-loop prediction by providing an open-loop prediction framework. 1.3 Scalable Coding A scalable compressed bitstream typically contains multiple embedded subsets, each of whichrepresentstheoriginalvideocontentin a particular amplituderesolution (so called SNR scalability), spatial resolution (spatial scalability), temporal resolution (temporal 9 scalability) or frequency resolution (frequency scalability, also known in some cases as data partitioning). Scalable coders can have either coarse granularity or fine granularity, and they usually lead to a decrease in compression performance as compared to non- scalable coding. 
Thus, the design goal in scalable coding is to minimize the reduction in coding efficiency while enabling sufficient scalability to match the network requirements. In this section, we will briefly discuss several SNR scalability techniques in current video coding standards, particularly MPEG-2 SNR scalability and MPEG-4 Fine Granularity Scalability (FGS). A brief review of multiple description coding is also included at the end of this chapter. More detailed descriptions of these techniques can be found in [34,35,44,76,88].

1.3.1 MPEG-2 SNR Scalability

International video coding standards typically standardize decoders rather than encoders. Figure 1.3 shows the two-layer SNR scalable decoder defined in the MPEG-2 video standard [34]. It can be extended to the multi-layer coding scenario in a straightforward way. The reconstructed DCT coefficients from all the layers are added together before passing through the single inverse DCT (IDCT) and the MCP loop to produce the decoded frame. The enhancement-layer information of previous reference frames is used in the MCP loop for constructing both base and enhancement layers of the current frame. Several encoder configurations have been proposed in the literature [7,92]. One approach [92] is to apply a single MCP loop at the encoder that also uses the enhancement-layer information. This leads to drift when the enhancement layer is not received by the decoder. A more complicated pyramid encoder [7] uses multiple MCP loops, with one MCP loop per layer, to control drift in case the higher layers are lost.

Figure 1.3: MPEG-2 SNR decoder.

Figure 1.4: Simplified diagram of the FGS encoder structure used in the MPEG-4 Microsoft reference software.
The frame memory at each layer of the encoder corresponds to the state of the decoder frame memory assuming all its higher layers are not decoded.

1.3.2 Fine Granularity Scalability (FGS) and its Variants

MPEG-2 SNR scalability usually supports only a small number of enhancement layers due to the coding efficiency degradation and the extra encoder complexity introduced for each enhancement layer in the pyramid encoder. Thus, the quality can only improve as the rate increases in larger discrete steps. A common characteristic of those coarse granularity scalable coding techniques is that the enhancement layer can contribute to end-to-end distortion reduction only if it is entirely decoded. In other words, it does not provide any partial enhancement.

MPEG-4 fine granularity scalability (FGS) [44] provides a way to improve the quality with much smaller incremental steps. Figure 1.4 shows a simplified diagram of the FGS encoder structure used in the MPEG-4 Microsoft reference software. There are three major differences between FGS and MPEG-2 SNR scalability. First, the enhancement-layer bit stream of FGS can be truncated at arbitrary bit positions within each frame to provide partial enhancement proportional to the number of bits decoded. Second, FGS does not use the enhancement information of the previous frames to predict the current frame in the motion-compensation loop. Instead, the enhancement layer is represented by coding the residual error based on the current base-layer reconstruction. Compared to MPEG-2 SNR scalability, this approach avoids drift but results in lower coding efficiency. Finally, the FGS enhancement layer is coded using bit-plane coding, i.e., the quantization step sizes are in descending powers of two.

FGS provides fine granularity bit-rate scalability, channel adaptation, and elegant error recovery from occasional data losses or errors in enhancement layers [44]. However, it suffers from the disadvantage of low coding efficiency.
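The bit-plane mechanism behind such embedded, truncatable streams can be illustrated with a toy sketch (not MPEG-4's actual bit-plane syntax, which also run-length entropy-codes each plane and signals signs per coefficient):

```python
def to_bitplanes(coeffs, num_planes):
    """Split absolute coefficient values into bit-planes, most
    significant plane first; signs are handled separately."""
    planes = []
    for p in range(num_planes - 1, -1, -1):
        planes.append([(abs(c) >> p) & 1 for c in coeffs])
    return planes

def reconstruct(planes, num_planes, signs):
    """Decode however many planes were received: each extra plane
    halves the effective quantization step (powers of two)."""
    mag = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        p = num_planes - 1 - i
        for j, bit in enumerate(plane):
            mag[j] |= bit << p
    return [s * m for s, m in zip(signs, mag)]
```

For example, coefficients [13, -6, 3, 0] coded with 4 planes are recovered exactly from all 4 planes, while truncating to the first 2 planes yields the coarser [12, -4, 0, 0], mirroring how an FGS stream cut at an arbitrary point still improves quality in proportion to the bits decoded.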
A number of FGS variants have been proposed to further exploit the temporal correlation in the enhancement layer. Examples include progressive FGS (PFGS) [93], motion-compensation based FGS (MC-FGS) [79], and leaky prediction based FGS (LP-FGS) [33]. These techniques share a common feature in that they employ one or more additional MCP loops for the enhancement layers of P and B frames (or B frames only), for which a certain number of FGS bitplanes, M, are included in the enhancement-layer MCP loop to improve the coding efficiency. In this case prediction drift will occur within the FGS layers when fewer than M bitplanes are received. M is chosen by considering the trade-off between coding efficiency and prediction drift. LP-FGS also introduces a second parameter α to adjust the amount of predictive leak, which controls the construction of the reference frame at the enhancement layer.

Rose and Regunathan [66] also proposed an estimation-theoretic (ET) approach with multiple motion-compensation loops for general SNR scalability. In this approach, the enhancement-layer predictor is optimally estimated by considering all the available information from both base and enhancement layers. It can be easily extended into the FGS framework. However, the underlying CLP prediction requires the encoder to generate all possible decoded versions of each frame, so that each of them can be used to generate a prediction residue. Thus complexity is high at the encoder, especially for multi-layer coding scenarios.

1.3.3 Multiple Description Coding

Layered coding (LC), such as MPEG-2 SNR scalability and MPEG-4 FGS, can enhance the adaptation capability of a video delivery system, but it requires strong protection of the base layer via either FEC or ARQ schemes. Multiple description coding (MDC) has emerged as another promising approach to combat transmission errors.
A multiple description (MD) coder generates several bit streams for the original video source (referred to as descriptions), so that each description alone provides acceptable quality and incremental improvement can be achieved with additional descriptions received. Each description is individually packetized, and transmitted through separate channels or through one physical channel that is divided into several virtual channels by using appropriate time interleaving techniques. Each description can be decoded independently to provide an acceptable level of quality. For this to be true, all the descriptions must have some basic information about the source, and therefore redundancy is introduced between different descriptions. A number of MD coders have been proposed, including overlapping quantization [71,78], correlating linear transforms [87], polyphase transform and subband coding [39], and interleaved spatial-temporal sampling [90]. A comprehensive review of various MD algorithms, particularly for image communications, is presented in Goyal's paper [29].

Motion-compensated prediction is a fundamental component of current video coding standards. Since MD coders are designed to "tolerate" channel losses, MD video coders that incorporate motion-compensated prediction must take into account two basic problems: mismatch control and redundancy allocation. Depending on the trade-off between these two problems, Wang et al. have categorized the possible predictors for an MD coder into three classes [89]: (1) predictors that introduce no mismatch, using either two individual predictors or a single predictor based on information common to both descriptions; (2) a single predictor identical to that used by a single-prediction predictive encoder, which minimizes the prediction error while causing mismatch unless both descriptions have been received; and (3) predictors that have parameters to control the trade-off between prediction efficiency and the amount of mismatch.
When combined with multiple path transport (MPT) [5,6,48,55], MDC can exploit path diversity to improve error resilience, traffic dispersion and load balancing in the network.

The major difference between LC and MDC is that LC uses a hierarchical decomposition while MDC decomposes the source signal into non-hierarchical, correlated descriptions. Thus, MDC does not require special treatment of packets, via retransmissions or FEC, to guarantee adequate quality, as LC does for the base layer. This makes MDC particularly useful for networks that do not support feedback or retransmission, or for applications that allow only very short delay, making retransmission unacceptable. However, MDC introduces significant redundancy between the descriptions, which reduces the coding efficiency. This observation motivates us to design an adaptive approach that combines the advantages of MDC and LC so as to provide robustness over a wide range of network scenarios and application requirements. Our proposed approach, multiple description layered coding, will be discussed in Chapter 2.

1.4 Distributed Source Coding

This problem has been studied in the information theory literature since the 1970s, starting with the work of Slepian and Wolf on distributed lossless coding and that of Wyner and Ziv on "rate-distortion with side information" for lossy coding. However, practical applications have emerged only recently. In this section, we first review the fundamental principles of Slepian-Wolf and Wyner-Ziv coding, and then discuss some practical applications, focusing on distributed video compression. A detailed review of this topic was also presented in Girod et al.'s recent paper [26]. The theoretical problem of successive refinement of information in the Wyner-Ziv setting will be described in Chapter 4, where we design a Wyner-Ziv scalable codec based on this setting.

Figure 1.5 shows the problem of source coding with side information.
If the side information (SI) Y were known at both the encoder and decoder, then the problem of compressing the source X would be well understood: the theoretical rate for X is given by H(X|Y) for lossless compression and R_{X|Y}(D) for lossy compression, where H(X|Y) is the conditional entropy of X given Y, and R_{X|Y}(D) is the rate-distortion function when Y is available at both the encoder and decoder. A practical approach in this case is to use differential pulse-code modulation (DPCM).

Figure 1.5: Source coding with side information, adapted from Figure 1 in [58]. (a) SI Y is available at both the encoder and decoder; (b) SI Y is available at the decoder only.

We are now interested in the case where Y is available at the decoder, but not at the encoder. Consider first the problem where X and Y are correlated discrete-alphabet memoryless sources and X is to be compressed losslessly. The Slepian-Wolf theorem [73] establishes the achievable rate region R_X ≥ H(X|Y), which is the same as if Y were also available at the encoder. Later, Wyner and Ziv [94] extended this work to the lossy compression case, in which X and Y can be continuous with infinite alphabets. The Wyner-Ziv rate-distortion function R*_{X|Y}(D) is then defined as the lower bound of the achievable rate for a given distortion D = E[d(X, X̂)]. Wyner and Ziv proved that R*_{X|Y}(D) ≥ R_{X|Y}(D), i.e., a rate penalty may exist when the encoder cannot access Y. Furthermore, R*_{X|Y}(D) = R_{X|Y}(D) in the case of Gaussian memoryless sources with a quadratic distortion measure. Using a duality argument, Pradhan and Ramchandran recently generalized R*_{X|Y}(D) = R_{X|Y}(D) to the more general case where the source X is the sum of an arbitrarily distributed SI Y and independent Gaussian noise [57].
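As a concrete instance of the no-rate-loss case above (a standard result for the jointly Gaussian setting, stated here purely for illustration): if X = Y + N, where N is zero-mean Gaussian with variance σ_N² and independent of Y, and distortion is mean-squared error, then

```latex
R^{*}_{X|Y}(D) \;=\; R_{X|Y}(D) \;=\; \max\!\left(0,\ \tfrac{1}{2}\log_2\frac{\sigma_N^2}{D}\right) \quad \text{bits per sample},
```

so the distortion decays as D = σ_N² · 2^{−2R}: each additional bit of rate buys roughly 6 dB, exactly as if the encoder had observed Y.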
Distributed source coding (DSC) is related to channel coding in that Y can be regarded as a noisy version of X produced by a virtual channel. Instead of performing FEC to protect against the transmission errors introduced by the real channel, we send parity or syndrome bits to the decoder to correct the "errors" introduced by the virtual dependence channel. The decoder can then perform channel decoding using both the information from the parity bits and the side information Y. The first constructive framework for creating practical codes for DSC was distributed source coding using syndromes (DISCUS), proposed by Pradhan and Ramchandran [58], in which a scalar and trellis-based coset construction was presented. Since then, more sophisticated channel codes based on turbo or low-density parity check (LDPC) codes have been applied to DSC by a number of researchers [1,8,25,42,46]. For Wyner-Ziv coding, Zamir et al. [97,98] proposed a constructive framework based on nested linear/lattice codes. This was further studied by Servetto [70], who considered the design of lattice quantizers and presented a performance analysis at high rates. Xiong et al. [95] developed a Wyner-Ziv coding framework consisting of a nested quantizer followed by a Slepian-Wolf coder.

Recently proposed Wyner-Ziv coding techniques for video applications fall into several categories. First, the emergence of mobile devices, such as wireless video sensors or mobile camera phones, inspires a new class of video compression applications that require low-complexity encoding. Motion estimation is the most time-consuming component in traditional MCP-based encoders. The main difference between Wyner-Ziv coding and CLP-based approaches (for example, MCP) is that Wyner-Ziv coding only needs to know the correlation structure between the current signal and the predictor, rather than the exact predictor value. Therefore, it is possible in Wyner-Ziv video codecs to move the motion estimation unit to the decoder in order to reduce the encoder complexity.
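Returning to the DISCUS coset idea above, it can be made concrete with a toy scalar sketch (the modulus and alphabet size are invented for illustration; DISCUS itself uses scalar and trellis-based coset constructions). The encoder sends only the coset index (syndrome) of its quantization index, and the decoder picks the member of that coset closest to the side information:

```python
NUM_COSETS = 4  # coset spacing; must exceed twice the largest |X - Y| error

def encode(x_index):
    """Send only the coset index: log2(NUM_COSETS) bits instead of
    the bits needed for the full quantization index."""
    return x_index % NUM_COSETS

def decode(syndrome, y_index, alphabet_size=256):
    """Pick the codeword in the signaled coset closest to the side
    information Y; correct whenever |x - y| < NUM_COSETS / 2."""
    candidates = range(syndrome, alphabet_size, NUM_COSETS)
    return min(candidates, key=lambda c: abs(c - y_index))
```

For instance, if X quantizes to index 141 and the decoder's side information is 140, the encoder sends only 141 mod 4 = 1, yet the decoder still recovers 141, because 141 is the member of coset 1 nearest to 140.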
Puri and Ramchandran [59–61] and Girod et al. [2,3,26] have proposed several different Wyner-Ziv video coding schemes based on this principle. The second class of applications uses Wyner-Ziv coding for error resilience. Sehgal et al. [67,69] proposed a Wyner-Ziv coding scheme to prevent error propagation in predictively encoded video. In addition, Wyner-Ziv coding can also be applied in multiple description coding [38] or layered coding [68,96]. In [96], Xu and Xiong proposed an MPEG-4 FGS-like scheme by treating a standard coded video as a base layer, and building the bit-plane enhancement layers using Wyner-Ziv coding with the current base and lower layers as SI. Our Wyner-Ziv scalable approach, to be presented in Chapter 4, can achieve higher coding efficiency by selectively exploiting the high quality reconstruction of the previous frame in the enhancement layer bitplane coding of the current frame.

1.5 Rate-distortion Optimized Packet Scheduling

Achieving real-time video delivery can be very challenging given the limited channel bandwidth as well as packet loss rate and transmission delay variations. It is well recognized that an ideal transport mechanism should adapt to the actual channel conditions, such as channel bandwidth, delay and packet loss statistics. Chou and Miao [19,20] provided a rate-distortion optimized framework for packet scheduling over a lossy packet network. They considered streaming as a stochastic control problem, with the goal of determining which packets to send (and when and how), in order to maximize the expected end-to-end quality under an average rate constraint. A directed acyclic graph (DAG) model was proposed to capture the dependencies between packets, such that it is possible to attach more importance to packets on which multiple other packets depend. The details of this algorithm will be reviewed in Section 3.2.
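The flavor of such dependency-aware scheduling can be conveyed with a small greedy sketch (a deliberate simplification, not the actual algorithm of [19,20], which also models loss probabilities, deadlines and retransmissions; the packet fields and numbers here are invented). Packets are ranked by expected distortion reduction per transmitted byte, subject to the DAG constraint that a packet is only useful once its parents have been sent:

```python
def greedy_schedule(packets, budget):
    """packets: dict id -> (size_bytes, dist_reduction, parent_ids).
    Repeatedly send the sendable packet (all parents already sent)
    with the best distortion reduction per byte, within the budget."""
    sent, remaining = [], budget
    while True:
        sendable = [(pid, s) for pid, s in packets.items()
                    if pid not in sent
                    and s[0] <= remaining
                    and all(p in sent for p in s[2])]
        if not sendable:
            return sent
        best = max(sendable, key=lambda kv: kv[1][1] / kv[1][0])
        sent.append(best[0])
        remaining -= best[1][0]
```

Note how the dependency check encodes the importance of ancestors: an enhancement packet with a high distortion/byte ratio still cannot be scheduled before its base-layer parent.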
To reduce the scheduling complexity, Miao and Ortega [50,51] proposed an expected run-time distortion based scheduling (ERDBS) algorithm, which uses a greedy solution that explicitly combines the effects of data dependencies and delay constraints into a single importance metric. The rate-distortion framework in [19,20] can be applied in a series of scenarios, including packet scheduling at the sender [19,20], at the receiver [21], or at an intermediate proxy [11].

End-to-end distortion estimation is a challenging problem in rate-distortion based scheduling algorithms. When the dependency among packets is modeled by a DAG, Chou and Miao [19,20] computed the distortion contribution of a packet based on the assumption of a single decoding path: a packet can be decoded only when all of its parents are received and decodable. One example of this is layered coding. Cheung and Tan introduced a more general formulation based on the DAG model [15] to include the case where a packet can assume different distortion contributions when different subsets of its dependent packets are available. They considered all possibilities of decoding and delivery scenarios, which leads to significant increases in complexity. In our approach, we propose a new directed acyclic hypergraph (DAHG) on top of a DAG, introducing additional objects to explicitly represent the source redundancy among packets that arises from a given source coding approach (e.g., MDLC), in order to enhance the adaptation flexibility. Compared to [15], our DAHG model provides a more systematic way to represent source codecs that support multiple decoding paths with reasonable complexity. All of the above methods assign distortion on a per-packet basis, and calculate the expected end-to-end distortion of a GOP based on the packet relationships represented in the source model. Zhang et al.
proposed an alternative approach [100] that estimates the expected distortion of a GOP at run time based on stored distortion information for this GOP at several selected reference vectors of packet loss probability. Compared to the above methods, this approach can only provide an approximation of the expected distortion, via a first-order Taylor expansion. The approximation accuracy depends on the number of reference vectors and the smoothness of the expected distortion over the range of packet loss probabilities.

1.6 Contributions of This Research

The main contributions of our research include the following:

1. Network-adaptive video streaming [83,85,86].

• We develop a novel coding approach, namely, multiple description layered coding (MDLC), which combines the hierarchical scalability of LC with the reliability introduced by MDC.

• We extend the rate-distortion optimized streaming framework proposed in [19,20] to operate on a general class of coding formats that explicitly support redundancy in their coding structures. Such codecs (e.g., MDLC) produce multiple redundant representations, which facilitate server adaptation to varying network conditions, without re-encoding the video stream or completely switching between different encoding modes on the fly. We first introduce a new directed acyclic hypergraph (DAHG) to represent the data dependencies and correlation between different video packets, from which the expected end-to-end distortion for a group of packets can be estimated accurately. Based on the DAHG model, we then develop two rate-distortion based packet scheduling algorithms: one extended from the Lagrangian optimization proposed in [19,20], and another based on a greedy solution derived from a Taylor analysis of the expected distortion.

• We observe two types of redundancy, namely, source redundancy and transport redundancy, in the proposed streaming framework. We investigate the impact of both redundancies on error control for a lossy packet network.

2. Wyner-Ziv scalable predictive coding [81,82,84].
• We propose a practical video coding framework based on distributed source coding principles, with the goal of achieving efficient and low-complexity scalable coding. Starting from a standard predictive coder as the base layer (such as the MPEG-4 baseline video coder in our implementation), the proposed Wyner-Ziv scalable coder can achieve higher coding efficiency by selectively exploiting the high quality reconstruction of the previous frame in the enhancement layer coding of the current frame.

• Correlation estimation at the encoder is a challenging problem in Wyner-Ziv coding. We propose two simple and efficient algorithms, namely, direct estimation and model-based estimation, to explicitly estimate at the encoder the parameters of a model representing the correlation between the current frame and an optimized side information available only at the decoder. Our estimates closely match the actual correlation between the source and the decoder side information.

• Since the temporal correlation varies in time and space, we propose a block-based adaptive mode selection algorithm for each bit-plane, so that it is possible to switch between different coding modes.

The rest of the thesis is organized as follows. We discuss the MDLC codec in Chapter 2. Then we describe the rate-distortion based scheduling framework for a general class of sources with redundancy in Chapter 3. The Wyner-Ziv scalable coding is presented in Chapter 4. Finally, conclusions and future work are provided in Chapter 5.

Chapter 2

Multiple Description Layered Coding (MDLC)

2.1 Introduction

In this chapter we present an efficient multiple description layered coding (MDLC) system for robust video communication over unreliable channels. Recent technological developments and the rapid growth of the Internet and wireless networks make it feasible and attractive to provide real-time video services over them. However, the current best-effort Internet does not offer any QoS guarantees.
Congestion, routing delay and fluctuations of network conditions can all result in packet loss or large delay during transmission, and thus greatly degrade the received video quality.

A traditional method to provide bandwidth adaptation and error resilience in lossy transmission environments is layered coding (LC) [34,35,44], in which a video sequence is coded into a base layer and one or more enhancement layers. The base layer provides a minimum acceptable level of quality, and each additional enhancement layer incrementally improves the quality. Thus, graceful degradation in the face of bandwidth drops or transmission errors can be achieved by decoding only the base layer, while discarding one or more of the enhancement layers. The enhancement layers are dependent on the base layer, and cannot be decoded if the base layer is not received. Thus LC requires the base layer to be highly protected, which can be achieved via either strong forward error correction (FEC) or automatic repeat request (ARQ) schemes. FEC has the drawback of requiring increased bandwidth, even in cases when errors do not occur, while ARQ may not be a practical alternative if the round-trip time (RTT) is long relative to the end-to-end delay of the application. Some recent work [11,19–21,50,51] has proposed rate-distortion optimized scheduling algorithms for layered video streaming, which attach different importance to each packet and thus determine the optimal transmission policy of each packet based on that importance.

Another alternative for reliable communication is multiple description coding (MDC) [29,39,71,78,87,89,90]. With this coding scheme, a video sequence is coded into a number of separate bit streams (referred to as descriptions), so that each description alone provides acceptable quality and incremental improvement can be achieved with additional descriptions.
Each description is individually packetized, and transmitted through separate channels or through one physical channel that is divided into several virtual channels by using appropriate time interleaving techniques. Each description can be decoded independently to provide an acceptable level of quality. For this to be true, all the descriptions must have some basic information about the source, and therefore they are likely to be correlated.

There have been several studies comparing the performance of LC and MDC. In [63], Reibman et al. first evaluated the performance of LC and MDC for transmission over binary symmetric channels and random erasure channels using FEC codes. Singh et al. compared MDC without retransmission to LC with retransmission through network simulations [72]. In [65], Reibman et al. further studied the performance of LC and MDC for transmission over an EGPRS wireless network through either one or two correlated wireless channels. In [43], Lee et al. performed a comprehensive comparison between LC and MDC in multi-path environments under different error control scenarios, including no error control, ARQ-based and FEC-based error control, for both LC and MDC. Zhou et al. examined the performance of LC and MDC with or without retransmissions based on rate-distortion lower bounds combined with the effect of excess rate and delay incurred by retransmissions [101]. Since these performance comparisons depend on the actual coder implementations and underlying network environments, the observations from these studies are not completely consistent. But most of these studies lead to a common belief that (1) MDC outperforms LC in network scenarios with high error rate, long RTT or stringent real-time requirements [43,63,72,101]; and (2) error control techniques such as ARQ or FEC are very useful for both LC and MDC [43,63,101]. Some recent work has also studied the performance of LC and MDC when an advanced transport mechanism is used. In [53], Nguyen et al.
showed that with good packet allocations LC outperforms MDC under various network conditions. In [13], Chakareski et al. described a large variation in the relative performance of LC and MDC depending on the employed transmission scheme. Both works further demonstrate the importance of transport mechanisms optimized for a given source coding technique, complementing traditional transport techniques such as ARQ or FEC.

The above observations motivate us to look for an adaptive approach that combines LC and MDC in order to exploit their individual benefits, so as to provide reliable video communication over a wider range of network scenarios and application requirements. The main novelty of our work is to demonstrate that it is possible to combine LC with MDC by adding a standard-compatible enhancement to MPEG-4 version 2 [36]. The new multiple description layered coding (MDLC) approach presented here introduces redundancy in each layer so that the chance of receiving at least one base layer description is greatly enhanced. Furthermore, though LC and MDC each achieve good performance in different limit cases (e.g., long end-to-end delay for LC vs. short delay for MDC), the proposed MDLC system can address intermediate cases as well. As in an LC system with retransmission, the MDLC system can take advantage of a feedback channel that indicates which descriptions have been correctly received. Thus we will show that a low redundancy MDLC system can be implemented with the runtime packet scheduling system proposed in Chapter 3, based on a priori channel knowledge and runtime feedback information. The goal of our scheduling algorithm is to find a proper on-line packet scheduling policy that maximizes the playback quality at the decoder. This chapter focuses on the proposed MDLC technique; the scheduling algorithm for MDLC, and its extension to a general class of coding problems with source redundancy, will be described in detail in Chapter 3.
Closely related research on combining LC and MDC includes work by Chou et al. [22] and by Kondi [41]. In [22], the authors start from an MDC coder that is realized by applying unequal cross-packet FEC to different parts of a scalable bit stream [4], and then convert the descriptions into a base layer and an enhancement layer. In [41], the proposed codec relaxes the hierarchy of LC by producing a base layer and two multiple description enhancement layers, where the base layer is required for decoding and the two enhancement layers can be decoded independently of each other. Both works optimize source redundancy at the encoding stage. In contrast, our proposed MDLC codec produces multiple redundant representations, increasing the flexibility with which a video server can adapt to varying network conditions. An MDLC system can dynamically adjust the run-time redundancy of the compressed bit streams during transmission, for either live or pre-encoded video, by applying a rate-distortion optimized scheduling algorithm.

The chapter is organized as follows. In Section 2.2 we describe the proposed MDLC approach. Simulation results are presented in Section 2.3. Finally we conclude our work in Section 2.4.

2.2 Proposed MDLC Codecs

2.2.1 General Structure

Figure 2.1: Structure of the proposed MDLC codec.

Our general approach to MDLC video coding uses an MDC encoder to generate two base layer descriptions BL1 and BL2, as shown in Figure 2.1. Then the base layer MDC decoder in the MDLC encoder module mimics the three possible decoding scenarios at the receiver: both descriptions received or either one received.
For the case when both descriptions are received, it reproduces the base layer as Ŝ, and the difference between the original video input S and Ŝ is coded with a standard encoder, such as MPEG-4 FGS, into an enhancement layer stream EL0. For the case when only one description is received, the base layer decoder generates a low quality reproduction Ŝ1 or Ŝ2, and feeds the difference into two enhancement layer encoders separately to create EL1 and EL2.

The key advantage of our MDLC scheme is that it combines the hierarchical scalability of LC with the reliability introduced by adding redundancy to the base layer through multiple descriptions. With a well-designed scheduling algorithm, the sender can choose only one base layer description and its corresponding enhancement layer to send to the receiver, as in a standard LC system, in situations where channel losses are low. Or it can send both base layer descriptions and their enhancement layer streams to get the maximum protection when channel conditions worsen. EL0 is sent instead of either EL1 or EL2 to reduce the redundancy when both BL1 and BL2 are received, or are expected to be received with high probability. The sender can select the packets to be transmitted at any given time during the transmission session based on the feedback information, in such a way as to maximize the playback quality at the decoder.

The proposed decoder system, depicted in Figure 2.1, is composed of two parts: the base layer MDC decoder, and the enhancement layer switch and decoder. The base layer MDC decoder generates a reproduction S̃, which will be Ŝ, Ŝ1 or Ŝ2 depending on what was received. The enhancement layer switch then selects which EL stream to decode given what base layer was received. Finally, the decoded base layer and enhancement layer are combined to generate the final video output.
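The decoder-side selection logic just described can be sketched as follows (a toy illustration: the function names are invented, and the decode callbacks stand in for real MDC and FGS decoders):

```python
def mdlc_decode(bl1, bl2, el0, el1, el2, decode_bl, decode_el):
    """Pick the base-layer reconstruction and the matching enhancement
    stream based on which descriptions actually arrived (None = lost)."""
    if bl1 is not None and bl2 is not None:
        base, el = decode_bl(bl1, bl2), el0   # full base layer: use EL0
    elif bl1 is not None:
        base, el = decode_bl(bl1, None), el1  # only description 1: use EL1
    elif bl2 is not None:
        base, el = decode_bl(None, bl2), el2  # only description 2: use EL2
    else:
        return None                           # no base layer: nothing decodable
    enhancement = decode_el(el) if el is not None else 0
    return base + enhancement
```

With scalar stubs for the two decoders, one can verify that the switch picks EL0 when both base layer descriptions arrive and falls back to the matching single-description enhancement otherwise.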
Given the general structure in Figure 2.1, a very important part of an MDLC codec design is determining how to construct the base layer descriptions using a specific MDC algorithm. In the following, we propose two MDLC codecs using different base layer MDC approaches: one based on DCT duplication and alternation, and the other based on MPEG-4 FGS temporal scalability (FGST) with temporal subsampling.

2.2.2 Codec 1: DCT Duplication and Alternation

We first propose a simple MDLC codec, where the base layer descriptions are formed by simple duplication and alternation of the DCT coefficients. The base layer is obtained by applying a coarse quantizer to the original video in the DCT domain. We create our multiple base layer descriptions by repeating important information, such as the motion vectors in inter mode and the DC coefficients in intra mode. For the rest of the DCT coefficients, we simply alternate them between the two descriptions. For example, if a macroblock is assigned to BL1, then its neighboring macroblocks are assigned to BL2. Information from both descriptions is combined before making predictions for future frames. We first used this codec in [83] to demonstrate the advantage of MDLC over either LC or MDC. While this design is simple, we expect that more complex schemes, such as combining with optimal rate-distortion splitting [62,64], would lead to improvements in overall performance.

Figure 2.2: MDLC scheme based on MPEG-4 FGST. I: I-frame, P: P-frame, F: the enhancement layer generated by coding the residual between the original frame and its base layer reconstruction, FT: the enhancement layer generated by FGST using forward prediction from the base layer of the previous frame. The subscript of each label indicates the frame number. EL0 in Figure 2.1 is not shown here, as it is simply composed of the F_i, identical to either EL1 or EL2 depending on the frame index.
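Before turning to Codec 2, the duplicate-and-alternate idea of Codec 1 can be sketched as follows (a hypothetical data layout for illustration: a real implementation operates on quantized DCT blocks and uses a 2-D checkerboard pattern over macroblocks rather than the 1-D index parity used here):

```python
def split_descriptions(macroblocks):
    """Alternation with duplication: copy the critical fields
    (MV, DC) into both descriptions, alternate the AC coefficients."""
    bl1, bl2 = [], []
    for idx, mb in enumerate(macroblocks):
        shared = {"mv": mb["mv"], "dc": mb["dc"]}   # duplicated info
        if idx % 2 == 0:
            bl1.append({**shared, "ac": mb["ac"]})
            bl2.append(dict(shared))                # AC carried by BL1 only
        else:
            bl1.append(dict(shared))
            bl2.append({**shared, "ac": mb["ac"]})  # AC carried by BL2 only
    return bl1, bl2
```

Losing one description thus costs only half of the AC coefficients, while the duplicated motion vectors and DC values keep both descriptions independently decodable at an acceptable quality.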
2.2.3 Codec 2: based on MPEG-4 FGST with Temporal Subsampling

In this approach, we use video redundancy coding (VRC) [91] to create an MDC base layer, by partitioning a video sequence into two subsequences, each of which mainly contains either odd or even frames. At the base layer, each subsequence is coded independently as an IPP sequence, where only the first frame (I-frame) of each group of pictures (GOP) is shared between both subsequences, as shown in Figure 2.2. Coding efficiency is reduced because motion-compensated prediction from a past frame farther apart is usually less efficient than prediction from the immediately preceding frame. If both descriptions are received correctly, each bitstream is decoded independently to produce the even and odd frames, which are interleaved for the final base layer reconstruction. However, if only one description is received, the missed description can be estimated by simply copying the closest adjacent frame in the correctly received description, or by using more complicated motion interpolation techniques that exploit both past and future frames [5].

In order to construct an MDLC codec, we introduce additional fine granularity bit-rate scalability by generating enhancement layers from the base layer descriptions. For each subsequence, as shown in Figure 2.2, we create enhancement layer descriptions with the MPEG-4 FGS temporal scalability (FGST) approach [44]. Each enhancement layer description codes the difference between the original picture and a reference picture reconstructed from its corresponding base layer description, using bit-plane coding of the residual DCT coefficients. The residual DCT coefficients are obtained from different references depending on whether a frame is coded in the base layer description or not. Without loss of generality, consider the P-frame with an odd index, i, in Figure 2.2.
Its enhancement layer EL1 represents the residue between frame i and its BL1 reconstruction, denoted by Fi, while its EL2 is generated using forward prediction from the BL2 reconstruction of the previous frame i−1, denoted FTi, which contains the enhancement-layer motion vectors as well. At the decoder, depending on which base layer descriptions are received, the enhancement layer can choose to decode either all of the descriptions (e.g., when both base layer descriptions are received) or a subset of them (when only one base layer description is received). The final enhancement layer quality is the best quality achieved among all decodable descriptions.

This MDLC approach is used in Chapter 3 as an example to demonstrate the efficiency of our scheduling algorithm for source codecs with redundancy. It is noted that this particular approach has a number of practical advantages in addition to the general features common to MDLC techniques. First, in addition to SNR quality, it provides temporal scalability that leads to a good reconstruction at half the original frame rate even when only one description is received. Second, it can be easily combined with multiple path transport to improve error resilience. Third, it is straightforward to extend the current approach to more than two descriptions by splitting the frames evenly into multiple independent subsequences and coding each enhancement layer description using the same FGST approach. Last, it has the flexibility to provide unbalanced base layer descriptions by using different quantization steps for each description, which is useful to cover a wide range of bit rates for bandwidth adaptation.

2.3 Experimental Results

We investigate the end-to-end distortion performance of the proposed MDLC technique, by comparing it with both LC and MDC. Here we use the MDLC approach based on MPEG-4 FGST to code video sequences.
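The temporal splitting used by this FGST-based codec (Section 2.2.3), including its generalization to more than two descriptions, can be sketched as follows. This is a toy sketch under the assumption that frames are numbered from 1 and frame 1 is the GOP's shared I-frame; the function name is illustrative only.

```python
def split_temporal(num_frames: int, num_descriptions: int = 2):
    """Partition frames 1..num_frames into VRC-style subsequences.

    Frame 1 (the GOP's I-frame) is shared by every description; the
    remaining frames are distributed round-robin, so with two
    descriptions MD1 holds the odd frames and MD2 the even frames,
    as in Figure 2.2.
    """
    subsequences = [[1] for _ in range(num_descriptions)]
    for frame in range(2, num_frames + 1):
        subsequences[(frame + 1) % num_descriptions].append(frame)
    return subsequences
```

With only one description received, the decoder still recovers half the frame rate, which is the temporal scalability advantage noted above.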
MPEG-4 FGS is used as the LC implementation, and the MDC system uses the same multiple description generating method as our MDLC for the base layer. The comparisons between MDLC and LC are discussed in detail in Section 3.5, which shows that MDLC outperforms LC even when a rate-distortion (R-D) optimized packet scheduling algorithm is used. This is because having a source representation with built-in error resilience, through source redundancy, enables MDLC to provide an end-to-end performance that is less affected by packet losses, particularly in the case of poor channel conditions, such as a high packet loss rate and a short playback delay. In this section, we compare the MDLC approach with a set of MDC schemes using different quantization parameters (QP). All of these schemes use an R-D optimized packet scheduling algorithm, based on Lagrangian optimization, described in Chapter 3.

Figure 2.3: Rate-distortion curves of MDC codecs with different quantization parameters, and of MD1 and MD2 of MDLC, measured at the encoder without transmission impacts. (a) Foreman QCIF; (b) Mobile CIF.

The video sequences used in the experiments are Foreman (QCIF) and Mobile (CIF). The encoding parameters of MDLC and the streaming system setup are described in Section 3.5. In addition to the MDC codec that uses the same QP as that for the MDLC base layer, we employ two other MDC codecs with finer quantization steps to increase the achievable R-D range. Figure 2.3 shows the compression efficiencies of these MDC codecs, and of MD1 and MD2 of MDLC, for Foreman and Mobile measured at the encoder. The three R-D operating points of each MDC codec are obtained by assuming that either one or both descriptions are received.
Figure 2.4 shows a performance comparison between these schemes when the packet loss rate (PLR) is 15% and the playback delay (denoted by w) is 320 ms. The MDLC approach provides a similar or better performance than its corresponding MDC approach using the same base layer QP over the bandwidth range tested in the simulation. However, the relative performance between MDLC and the other MDC approaches (i.e., QP=8 or 4) depends on the available transmission rate. This can be explained as follows, by considering both the compression efficiency and the rate scalability of a source codec. Since MDLC provides a more graceful degradation of reconstruction quality through layered coding, the performance gain of MDLC over MDC increases at lower transmission rates, when a client may not receive any descriptions for some frames in the MDC case. At medium or high transmission rates, as more and more packets are received at the client, MDLC may perform worse than an MDC approach because of their respective compression efficiencies. For example, in Figure 2.3 (a), the MDC approach with QP=8 can achieve 34.3 dB of luminance PSNR at 240 Kbps in a perfect channel, while MDLC can only achieve less than 32 dB. Correspondingly, in Figure 2.4 (a), we can see that this MDC approach outperforms MDLC between 300 Kbps and 700 Kbps.

Figure 2.4: Performance comparison between MDLC and a set of MDC schemes with different quantization parameters. (a) Foreman QCIF; (b) Mobile CIF. Both with PLR = 0.15 and w = 320 ms.
As the transmission rate increases further, MDLC may outperform MDC again, as all descriptions of MDC or MDLC can be delivered to the client on time and the reproduction quality at the decoder is bounded by the achievable reconstruction quality at the encoder. In summary, MDLC provides a more graceful bandwidth adaptation in the event of varying transmission bandwidth, compared to the sharp change in end-to-end quality typical of an MDC system. Meanwhile, the source redundancy introduced in an MDLC codec improves its error resilience in a lossy packet environment.

2.4 Conclusions

In this chapter we proposed a new approach, multiple description layered coding (MDLC), which combines the hierarchical scalability of LC with the reliability introduced by MDC. The MDLC codec produces multiple redundant representations, enhancing the flexibility of adaptation to varying network conditions. Experimental results show that the proposed MDLC system provides more robust and efficient video communication over a wider range of network scenarios and application requirements.

Chapter 3

Rate-Distortion Based Scheduling of Video with Multiple Decoding Path

3.1 Introduction

Internet multimedia applications, such as live video streaming, distance learning and video on demand services, are becoming increasingly popular. Given the best-effort service offered by the current Internet, video transmission is inevitably affected by network variations in bandwidth, delay and packet loss rate, and thus it is imperative to provide some means to deal with the transmission impairments. Instead of traditional error-resilient encoding techniques that introduce redundancy at the bit stream level, this chapter extends the rate-distortion optimized streaming framework proposed in [19,20] to operate on a general class of coding formats that explicitly support redundancy in their coding structure by, for example, producing multiple redundant representations of the video content.
Note that in the absence of adaptation the redundancy levels may not match those required by the actual network conditions. Here we propose an on-line rate-distortion based scheduling algorithm that can dynamically adjust the system's real-time redundancy to match the channel behavior so as to achieve better overall quality.

A variety of techniques have been proposed to address error control in the literature, including forward error correction, delay-constrained retransmission [32], intra/inter mode switching [99], reference picture selection [28,37], dynamic packet dependency control [45], layered coding with unequal error protection [4], soft ARQ for layered streaming media [56], and multiple description coding [29,89]. An important recent advance in video streaming is the rate-distortion optimized packet scheduling (RaDiO) framework initially proposed by Chou and Miao [19,20]. This work formalized packet dependencies as a directed acyclic graph (DAG), prioritized packets based on their importance, and scheduled them so as to minimize a Lagrangian cost function combining expected distortion and rate. Some techniques have been proposed to reduce the complexity of the original algorithm. Miao and Ortega [49,50] simplified the approach by running a greedy algorithm that explicitly combines the effects of data dependencies and delay constraints into a single importance metric. Chou and Sehgal [21] presented simplified methods to approximate the optimized policies. Chakareski et al. [10] proposed a family of simplified distortion models to approximate the end-to-end distortion produced by arbitrary packet loss patterns. Recent work by De Vleeschouwer et al. [80] improved the performance of the greedy scheduling algorithm by delaying some packet scheduling decisions to avoid premature retransmissions.
The sender-driven rate-distortion framework in [19] has also been extended to other transmission scenarios, including packet scheduling at the receiver [21], at an intermediate proxy [11], or taking into consideration path diversity [12].

Previous work on rate-distortion optimized video scheduling has mainly focused on encoding techniques, such as layered coding [44], which generate packets that can only be decoded following a single decoding path (SDP): a packet can be decoded only when all the packets it depends on are received and decodable. However, source codecs that explicitly support redundancy to combat transmission errors usually produce multiple decoding paths (MDP): there are multiple ways to decode a packet, each with a different distortion reduction depending on which packets, among those it depends on, are received. One example of these codecs is multiple description layered coding (MDLC) proposed in [83], which combines the hierarchical scalability of layered coding (LC) with the reliability of multiple description coding (MDC) so as to provide graceful adaptation over a wider range of application and network scenarios. Other coding examples with MDP include multiple independent encodings and decoding with error concealment.

The scheduling problem becomes more challenging when considering multiple decoding paths. In addition to the challenges arising in the framework of [19,20], such as delay-constrained delivery, channel conditions and data dependency, the scheduling algorithm has to take into account the correlation or redundancy between data units, which is needed for end-to-end distortion estimation. The basic RaDiO framework [19,20] addressed this problem using a simple DAG model to represent data dependency only, and is thus limited to coding scenarios that have a single decoding path. For example, it implicitly excludes the possibility of having multiple descriptions, in which several decoding choices are possible based on which descriptions are received.
Cheung and Tan [15] introduced a more general formulation based on the DAG model to include the case where a packet can be decoded in different ways. They considered all possibilities of decoding and delivery scenarios, which leads to significant increases in complexity. In our approach, we introduce new additional components on top of a DAG, in order to explicitly represent source redundancy among packets. Thus, compared to [15], our approach provides a more systematic way to represent source codecs that support multiple decoding paths with reasonable complexity.

This chapter, extending our prior work [83,85], focuses on developing a general streaming framework for a class of scenarios that employ redundant source coding structures. In [83], we proposed a heuristic scheduling algorithm for a simplified MDLC codec with only I-frames, and then presented a preliminary version of our general streaming framework in [85]. The present chapter extends our prior work by generalizing the source model to include a general class of source coding approaches such as MDLC with motion prediction, multiple independent encodings and so on, by refining the optimization scheduling algorithms, by introducing an improved MDLC predictive coder, and by evaluating performance under various redundancies for a number of video sequences. Specifically, in this chapter, we first propose a new Directed Acyclic HyperGraph (DAHG) source model to represent both data dependencies and correlation between different video data units. The DAHG model introduces the concepts of multiple decodable states and multiple decoding paths, from which the expected end-to-end distortion D for a group of packets can be estimated accurately, given a vector of packet loss probabilities, ε, for each packet in the group. In addition, a Taylor series expansion of D in terms of ε reveals important properties for different coding scenarios, depending on whether source redundancy exists or not.
We then propose two rate-distortion adaptive packet scheduling algorithms, one based on Lagrangian optimization with the iterative descent approach proposed in [19], and another based on a greedy solution derived from the Taylor expansion.

It is noted that, in addition to the source redundancy explicitly produced at the encoding stage, the proposed streaming framework implicitly introduces a transport redundancy by allowing retransmission of a packet without waiting for either a negative acknowledgement (NAK) from the receiver or a timeout. In this case it is possible for the sender to transmit a packet multiple times, so that more than one copy of a given packet may be correctly received at the decoder. We term the resulting rate penalty the transport redundancy introduced by the scheduling algorithm. This is different from most traditional ARQ approaches applied in video applications, which only retransmit a packet upon the detection of a packet error or loss. This type of redundancy has not been explicitly studied in previous research. In this work, we investigate the impacts of both transport and source redundancy on error control for a lossy packet network. From our experiments we make the following observations. First, regardless of whether source redundancy exists or not, a well-controlled transport redundancy through the Lagrangian optimized scheduling algorithm can improve the end-to-end performance in a delay-sensitive application. Second, in the absence of transport redundancy, source redundancy helps combat channel errors, especially at high packet loss rates or under stringent delay constraints. Finally, these two types of redundancy can complement each other and achieve efficient video streaming even with very poor channel conditions, for example, at very high packet loss rates or with a relatively long RTT compared to the end-to-end delay.
In this chapter, we use the MDLC codec proposed in Section 2.2.3 as an example to evaluate the scheduling algorithm performance under different types of redundancies. Results demonstrate the benefits in error robustness provided by both source and transport redundancies, and show that our proposed system with both redundancies achieves the best end-to-end performance for real-time video communication over a wide range of network scenarios.

Rate-distortion optimized streaming with source redundancy has also been applied to multiple independent encodings [40] and decoding with error concealment [12]. Unlike these techniques, which address a particular source coding approach, we formalize the general coding relation between data units in a more structured source model that can represent various source coding approaches, including these two examples. Compared to stream switching often used in commercial streaming systems [9,23], our approach enables finer switching at the packet level rather than the stream level, and further allows more flexible adaptation options than simple switching.

The chapter is organized as follows. In Section 3.2, we briefly review the basic RaDiO framework in [19]. Section 3.3 describes a general DAHG source model using the MDLC as an example, the expected end-to-end distortion, and the analysis of its Taylor expansion. Section 3.4 proposes the rate-distortion based scheduling algorithms based on the DAHG model, and describes the concept of transport redundancy. Simulation results are presented in Section 3.5. Conclusions and future work are discussed in Section 3.6.

3.2 Review of Basic RaDiO Framework

In this section we briefly review the rate-distortion optimized streaming framework of [19]. A compressed media stream is packetized into packets or data units.

Figure 3.1: A DAG example for an LC system.

Here, we
simply assume each data unit is put into one packet, and in the following discussion we do not differentiate between a data unit and a packet. The source dependencies between a group of data units are modelled as a directed acyclic graph (DAG), in which each vertex represents a data unit, and each directed edge from data unit i to data unit j indicates the decoding dependence of j on i, i.e., data unit j can only be decoded if i is received and decoded. Figure 3.1 shows a DAG representing an LC system containing a group of I-P frames, with each frame having a base layer and an enhancement layer. Note that this dependency structure corresponds to a typical fine granularity scalability (FGS) codec. Associated with each data unit l in the graph are three constant quantities: its size r_l in bytes; its time deadline t_l, i.e., the time by which it must arrive at the receiver to be useful for decoding; and its distortion value d_l, i.e., the amount by which the distortion of the decoded video will decrease if l is decoded on time at the receiver. The model implicitly assumes that when each data unit becomes decodable the total distortion is reduced by its distortion value.

The streaming system decides whether, when and how to transmit each data unit in a way that maximizes the playback quality at the decoder under the given network conditions and application requirements. This framework assumes that data units are transmitted at discrete intervals of time. At each transmission time, a data unit is
At each transmission time, the algorithm determines which data units to send by optimizing its transmission policy for the current transmission opportunity together with a complete plan for future transmission opportunities that will likely happen. The optimal policy ¼ ¤ is the one that minimizes the expected Lagrangian cost function J(¼)=D(¼)+¸R(¼); (3.1) where D(¼) is the expected end-to-end distortion and R(¼) is the expected transmission rate for a given¼. Based on the DAG model, D(¼) is given by D(¼)=D 0 ¡ X l d l Y l 0 ¹l (1¡²(¼ l 0)) (3.2) where D 0 is the distortion of the media stream if no packets are decoded, ²(¼ l ) is the packet loss probability of data unit l under policy ¼ l (strictly speaking, the probability that l is lost or does not arrive at the receiver on time), and Q l 0 ¹l (1¡²(¼ l 0)) is the probability that l is decodable. l 0 ¹l refers to the set of data units that must arrive at the receiver for l to be decoded. The given policy ¼ also induces an expected number of transmission times, ¯(¼ l ), for each data unit l, and R(¼)= P l r l ¯(¼ l ). 44 An iterative descent algorithm was proposed in [19] to find ¼ ¤ . The algorithm starts withaninitialpolicy,andthenproceedstominimize(3.1)iterativelyuntilJ(¼)converges. At each iteration step, (3.1) is minimized with respect to ¼ l while fixing the transmission policies of other data units, ¼ l 0, l 0 6= l. The optimization is done for different data units inaround-robinorder. Tooptimize ¼ l , (3.1)canberewrittenas J(¼ l )=²(¼ l )+ ¸r l a l ¯(¼ l ), where a l is the partial derivative of (3.2) with respect to ²(¼ l ), indicating the sensitivity (or importance) of receiving data unit l to the overall distortion. ¼ is re-optimized at eachtransmissionopportunitytotakeintoaccountthefeedbackinformationandpossible changes of the group of data units since the previous transmission opportunity. 
3.3 Source Modelling for Redundant Representations

3.3.1 Directed Acyclic Hypergraph (DAHG)

When a video sequence is encoded into multiple redundant representations, source redundancy is introduced between two data units whenever each of them can be decoded independently to create a different representation of the same source unit. Examples include BL1 and BL2 corresponding to the same frame in MDLC, or data units that contain individual independent encodings of a frame with different quantization parameters. The key problem here is how to represent the redundancy between data units, and furthermore the possible availability of multiple decoding paths due to the redundancy.

To address this class of source coding formats, we introduce a new source model called Directed Acyclic HyperGraph (DAHG) to represent both dependency and redundancy relationships between different video data units. A DAHG is a generalization of a DAG G = (V, E) where:

1. Each vertex C ∈ V, rather than being a simple node, is composed of a set of nodes, each pair of which is connected by an undirected edge. We name this type of vertex a "clique", representing a collection of data units that produce multiple redundant representations of the same source coding unit, such as a frame or an SNR layer of a frame in scalable coding. Each node (or data unit) in a clique represents one encoded version of this source unit, and an undirected edge connecting two nodes in the same clique indicates the redundancy between different encoded versions. A pair of such nodes i and j are called siblings, and we write i ∼ j.

2. An edge (C1, C2) ∈ E, directed from clique C1 to clique C2, is used to represent that decoding of C2 is directly dependent on C1. C1 is said to be a parent of C2, and C2 is said to be a child of C1. A path is a sequence of vertices such that from each of its vertices there is a directed edge to the next vertex in the sequence.
If a path leads from C1 to C2, then C1 is said to be an ancestor of C2, and C2 is said to be a descendant of C1, written as C1 ≺ C2 or C2 ≻ C1. Each parent of C2 is certainly an ancestor of C2. On the other hand, C1 being an ancestor but not a parent of C2 indicates an indirect decoding dependence between C1 and C2. For example, this would be the case with the last P-frame in a GOP (as C2) depending on the first I-frame (as C1) through the other intermediate P-frames.

Figure 3.2 shows an example DAHG for the proposed MDLC scheme shown in Figure 2.2. Each frame i contains a base layer clique Ci1 and an enhancement layer clique Ci2.

Figure 3.2: The DAHG model of the MDLC scheme shown in Figure 2.2. Directed edges denote dependence between cliques; undirected edges denote redundancy between siblings.
Thus, C 31 and C 21 are parents of C 32 . C 11 is an ancestor of C 32 as a path (C 11 ;C 21 ;C 31 ;C 32 ) leads from C 11 to C 32 . Figure 3.3 models multiple independent encodings of a video sequence, where the sequence is independently coded twice with different quantization steps using a typical non-scalable codec. Each clique contains two nodes to represent each encoded version. The same graph model can also 47 I P … … 1 C 2 C i C N C P P Figure 3.3: Another example of DAHG to represent multiple independent encodings of a video sequence. This model can also be used to represent error concealment. be applied to a simple “copy previous frame” error concealment method, where one of the two nodes in each clique represents a duplicate copy of the previous frame as an approximation of the current frame. Assume that a clique contains N data units. Since each unit can be either received correctly or not received (due to loss or because it is not transmitted in the first place), there are a total of 2 N possible states for the clique. A clique state is represented by a length-N binary string s, with each bit indicating the status of a data unit in the clique. Let b l denote the corresponding bit location of data unit l in s; the b l th bit of s is 1 (mathematically, s[b l ] = 1) if l arrives at the receiver on time and is 0 otherwise. b l is set to 1 for those nodes that have a size of zero bits, since they are regarded as being always received 1 . Zero state of a clique is then defined as the state such that no data units are received, and all the other states that have at least one data unit received are called non-zero states. Note that a non-zero clique state does not necessarily mean that this clique is decodable. Decoding of a clique also depends on the states of its ancestor cliques, which will be discussed later. 
In addition, it is convenient to define B_C^(s) = {l : l ∈ C, s[b_l] = 1} and B̄_C^(s) = {l : l ∈ C, s[b_l] = 0} to represent the two complementary sets of data units in C determined by its state s.

¹ Examples of this type of node are given in Figures 3.2 and 3.3. Essentially these nodes are created to separate the direct contribution of a packet to reducing distortion, which requires transmission, from its indirect contribution via interpolation or error concealment, which requires no additional transmission rate once the original packet has been received.

In a directed acyclic graph, a decoding path leading to a vertex can be constructed as an ordered list of its ancestors in the decoding order. In past works that code a video sequence into a single encoded version, such as single description coding, a vertex has only 1/0 states, i.e., it is either received or not. Thus each vertex along a decoding path must be received in order for the current node to be decoded, and this forms a single decoding path. In contrast, in the case of source coding with redundancy, a clique can be decoded once all its ancestor cliques are received in a non-zero state. Moreover, each clique in the ordered ancestor list can take multiple non-zero states, with different state combinations resulting in possibly different decoded versions of the current clique. A decoding path leading to clique C is then defined as a particular combination of all of C's ancestor clique states. Multiple decoding paths become possible as each ancestor may have multiple clique states. In order to use the same mathematical notation, we simply assume there is one virtual decoding path leading to those cliques that do not have parents. Figure 3.4(a) shows the concept of multiple clique states and multiple decoding paths, using the MDLC in Figure 3.2 as an example.

In general, a complete set of decoding paths leading to C contains all the combinations
Thus, theoretically, the number of decoding paths may increase exponentially in the number of ancestor cliques preceding $C$. However, in practical pre-encoded applications, given a decoder implementation and possible simplifications of the source model, there are only a small number of effective decoding paths. For a given decoder implementation, a clique representing a source unit can only be decoded into a limited number of versions. Many of the decoding paths producing the same reconstruction can thus be merged into one decoding path, while some other paths that lead to poor quality solutions are ignored by the decoder. For example, cross-description decoding of EL 2 based on BL 1, or of EL 1 based on BL 2, is ignored since the information added by the cross enhancement layer is very small. In addition, even when a decoder supports certain decoding paths, a source model used for scheduling can also choose to discard some of these paths in order to reduce the computational complexity, at the penalty of some performance loss.

[Figure 3.4 appears here.]
Figure 3.4: Description of multiple clique states and multiple decoding paths using cliques $C_{11}$, $C_{21}$ and $C_{22}$ in Figure 3.2 as an example. (a) Multiple clique states and multiple decoding paths. In frame 2, $l_3$ is decoded to be a direct copy of the reconstructed frame 1, and $l_4$ produces a reconstructed frame with better quality than $l_3$. Each circle in the figure is labelled by a combination of decoding path and clique state in the form "decoding path : clique state". A decoding path is represented by a concatenation of each ancestor clique state. Nothing before the colon in $C_{11}$ indicates that it has no parents and there is only a virtual decoding path leading to $C_{11}$. (b) Distortion-related parameters assigned to frame 2. Here $d_i$ denotes the distortion reduction of data unit $l_i$ when there is only one decoding path; $d_i^{(j)}$ denotes the distortion reduction of $l_i$ in the $j$th decoding path when there are multiple decoding paths; $I_C^{[11]}$ denotes the redundancy of clique $C$ at state $s = [11]$ (the other components of the redundancy matrices are zero); and $\min(a,b)$ is the minimum of $a$ and $b$. In particular, $C_{21}$ has clique state set $\{[10],[11]\}$, decoding path set $\{[1]\}$, and $I_{C_{21}}^{[11]} = \min(d_3, d_4) = d_3$; $C_{22}$ has clique state set $\{[00],[01],[10],[11]\}$, decoding path set $\{[110],[111]\}$, $I_{C_{22}}^{[11]} = \min(d_5^{(2)}, d_6^{(2)})$ along path [111], and zero redundancy along path [110].

In summary, a DAHG differs from a DAG mainly in two aspects: (1) multiple decoding paths in a DAHG vs. a single decoding path in a DAG, and (2) multiple decodable clique states in a DAHG vs. the 0/1 state of a data unit (i.e., it is either decodable or not) in a DAG. Estimating the expected end-to-end distortion under a DAHG model will be discussed in detail in Section 3.3.3.

3.3.2 Parameters Associated with DAHG

As in [19], each data unit $l$ has a size $r_l$ in bytes and a time deadline $t_l$ by which it must arrive at the receiver to be useful for decoding. However, the distortion reduction of a data unit in a DAHG model can take different values depending on the decoding path in which it is decoded. Let $Q_C$ be the set of decoding paths leading to $C$. Then we can represent the distortion reduction of data unit $l$ in the clique by a distortion vector $\mathbf{d}_l = [d_l^{(1)}, d_l^{(2)}, \ldots, d_l^{(q)}, \ldots, d_l^{(|Q_C|)}]$, where $d_l^{(q)}$ is the distortion reduction if $l$ is decoded in the $q$th decoding path, and $|\cdot|$ denotes the cardinality of the set. Setting $d_l^{(q)}$ to 0 will force the scheduler not to transmit data unit $l$ given the $q$th decoding path. This can be used to eliminate certain undesirable clique state combinations, and thus reduce the number of effective decoding paths in a DAHG.
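The combinatorial structure just described is easy to make concrete. The short Python sketch below (illustrative only; the dictionary contents mirror Figure 3.4, but the code and names are ours, not part of the thesis) enumerates the decoding paths leading to a clique as the Cartesian product of its ancestors' clique-state sets:

```python
from itertools import product

# Ancestor clique-state sets for C22 in the Fig. 3.2/3.4 example: C11 can only
# be received in state [1]; C21 (frame 2 base layer) has two decodable states.
ancestor_states = {
    "C11": ["1"],
    "C21": ["10", "11"],
}

# A decoding path is one particular combination of ancestor clique states,
# written as the concatenation of those states.
paths = ["".join(combo) for combo in product(*ancestor_states.values())]
print(paths)  # ['110', '111'], the two paths leading to C22 in Figure 3.4
```

With more ancestors, `len(paths)` grows as the product of the state-set sizes, which is exactly the exponential blow-up noted above; merging paths that yield the same reconstruction keeps the effective count small.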
Though each of the data units in clique $C$ can produce a certain distortion reduction, the total distortion reduction when more than one data unit is received correctly is usually less than the sum of their respective distortion reductions. Let $S_C$ be the set of all clique states in $C$. We introduce a redundancy matrix $I_C = [I_C^{(s,q)}]$ of dimension $|S_C| \times |Q_C|$ to represent the redundancy between different data units inside the same clique $C$. The redundancy of $C$, when it is in state $s$ and decoded in the $q$th decoding path, is stored as an entry in row $s$ and column $q$ of the redundancy matrix. $I_C^{(s,q)}$ is defined as

    I_C^{(s,q)} = \sum_{l \in B_C^{(s)}} d_l^{(q)} - d_C^{(s,q)},    (3.3)

where $d_C^{(s,q)}$ is the total distortion reduction of $C$ if it is decoded in state $s$ and the $q$th decoding path. An important property of this model is that, as in the DAG model, the distortion reduction is still additive at the clique level; however, the amount by which the distortion decreases when a node is decoded depends not only on the state of its ancestor cliques but also on whether its siblings in the same clique are decodable. Figure 3.4(b) lists the distortion vectors and redundancy matrices for frame 2 of the MDLC example shown in Figure 3.2.

3.3.3 Expected End-to-End Distortion

Suppose we already have a DAHG model to represent a group of $L$ data units, with each data unit being packetized into one packet. We can now estimate the expected end-to-end distortion of this group of packets (GOPkt) when given a vector of packet loss probability (PLP) providing a loss probability for each packet in the group. Recall that a packet is considered lost if it is either lost or arrives at the decoder too late to be played. We now define the "transmission state" as the PLP vector, which accounts for the transmission schedules and the channel conditions. Let $\epsilon_l$ be the PLP of packet $l \in \{1, \ldots, L\}$ and let $\epsilon = [\epsilon_1, \ldots, \epsilon_L]$ be the real-time transmission state.
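As a concrete illustration, the distortion vectors and the redundancy matrix of a clique can be stored and combined per (3.3) as follows (a minimal sketch with hypothetical numbers; the class and field names are ours):

```python
from dataclasses import dataclass, field

@dataclass
class Clique:
    """One DAHG clique: a set of mutually redundant data units."""
    units: list                  # data-unit names, e.g. ["l5", "l6"]
    d: dict                      # d[(unit, path)] -> per-unit distortion reduction
    I: dict = field(default_factory=dict)   # I[(state, path)] -> redundancy

    def total_reduction(self, state, path):
        """d_C^{(s,q)}: rearranging (3.3), the sum of the received units'
        reductions along path q minus the redundancy I_C^{(s,q)}."""
        received = [u for bit, u in zip(state, self.units) if bit == "1"]
        return (sum(self.d[(u, path)] for u in received)
                - self.I.get((state, path), 0.0))

# Clique C22 of Fig. 3.4 along decoding path [111], hypothetical numbers:
c22 = Clique(units=["l5", "l6"],
             d={("l5", "111"): 4.0, ("l6", "111"): 3.0},
             I={("11", "111"): 2.5})
print(c22.total_reduction("11", "111"))  # 4.0 + 3.0 - 2.5 = 4.5
print(c22.total_reduction("10", "111"))  # only l5 received: 4.0
```

The `I.get(..., 0.0)` default reflects the convention above that unspecified entries of the redundancy matrices are zero.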
Computation of the expected distortion in a DAHG for a given $\epsilon$ differs from that in [19] through the two new concepts introduced above, multiple decoding paths and multiple decodable clique states. To help us write an expression for the expected distortion, we first derive some related probabilities. The probability of occurrence of clique state $s$ is given by

    p_C^{(s)} = \prod_{l \in B_C^{(s)}} (1 - \epsilon_l) \prod_{l' \in \bar{B}_C^{(s)}} \epsilon_{l'}    (3.4)

Recall that a decoding path leading to clique $C$ is defined by a particular combination of the clique states of all its ancestors. Thus the probability of occurrence of decoding path $q$ can be written in terms of the probabilities of those clique states as

    p_C^{(q)} = \prod_{C' \prec C,\, s_{C'} \in q} p_{C'}^{(s_{C'})} = \prod_{l \in A_C^{(q)}} (1 - \epsilon_l) \prod_{l' \in \bar{A}_C^{(q)}} \epsilon_{l'}    (3.5)

where $A_C^{(q)} = \bigcup_{C' \prec C,\, s_{C'} \in q} B_{C'}^{(s_{C'})}$ and $\bar{A}_C^{(q)} = \bigcup_{C' \prec C,\, s_{C'} \in q} \bar{B}_{C'}^{(s_{C'})}$. We can now write the expected distortion as a function of the transmission state:

    D(\epsilon) = D_0 - \sum_C \sum_{q \in Q_C} p_C^{(q)} \Big[ \sum_{s \in S_C} p_C^{(s)} d_C^{(s,q)} \Big]    (3.6)

where $D_0$ is the distortion of the GOPkt if no packets are decoded, $d_C^{(s,q)} = \sum_{l \in B_C^{(s)}} d_l^{(q)} - I_C^{(s,q)}$ is derived directly from (3.3), and $p_C^{(s)}$ and $p_C^{(q)}$ are defined in (3.4) and (3.5), respectively.

Both transmitting and receiving a packet cause a state transition from a state $\epsilon_1$ to another state $\epsilon_2$. The Taylor expansion of $D$ in terms of $\epsilon$ reveals different characteristics of state transitions for different coding scenarios. The distortion reduction when receiving a packet in a multiple-decoding-path scenario depends on more factors than in a single-decoding-path scenario, because the redundancy between packets plays an important role. Thus, in this case, an optimal scheduling algorithm should be designed to take into account both dependency and redundancy such that the end-to-end distortion is minimized at the decoder. The Taylor expansion of (3.6) at the current state $\tilde{\epsilon}$ is given by

    D(\epsilon) = \sum_{k=0}^{\infty} \Big[ \frac{1}{k!} (\Delta\epsilon \cdot \nabla_{\epsilon'})^k D(\epsilon') \Big]_{\epsilon' = \tilde{\epsilon}}
                = D(\tilde{\epsilon}) + \sum_i a_i (\epsilon_i - \tilde{\epsilon}_i) + \sum_{i,j} a_{ij} (\epsilon_i - \tilde{\epsilon}_i)(\epsilon_j - \tilde{\epsilon}_j) + \cdots    (3.7)

where we denote by $a_i = \frac{\partial D}{\partial \epsilon_i}$ the first-order partial derivative of $D$ with respect to $\epsilon_i$, by $a_{ij} = \frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j}$ the second-order partial derivative, and so on. Note that (3.7) contains only terms that are linear in each individual $\epsilon_i$, since for any multi-index with some exponent $m_j \ge 2$ ($1 \le j \le k$),

    \frac{\partial^n D}{\partial \epsilon_{i_1}^{m_1} \cdots \partial \epsilon_{i_k}^{m_k}} = 0, \qquad n = m_1 + \cdots + m_k,

as follows directly from (3.6). $a_i$ indicates the importance of packet $i$ in terms of its contribution to the overall distortion reduction given the current transmission state. As receiving a packet will not increase the overall distortion for any coding application, $a_i \ge 0$ for any $i$. The second- or higher-order terms take effect when there is more than one packet whose PLP has changed from a reference state. For example, $\frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j}$ shows that a future change of $\epsilon_j$, as packet $j$ is transmitted or its ACK/NAK is received, may affect the importance of transmitting packet $i$ at the current time. To see this, we approximate $a_i$ by its first-order Taylor expansion at $\tilde{\epsilon}$: $a_i(\epsilon) \approx a_i(\tilde{\epsilon}) + \sum_j a_{ij} (\epsilon_j - \tilde{\epsilon}_j)$. Assume that packet $j$ will be transmitted or will arrive at the receiver when the state transits from $\tilde{\epsilon}$ to $\epsilon$; then $\epsilon_j < \tilde{\epsilon}_j$. In this case, $a_{ij}$ will lead to a change in $a_i$ as follows: when $a_{ij} < 0$, $a_i$ increases, and vice versa. In other words, the transmission or arrival of packet $j$ may increase or reduce the current importance of packet $i$ depending on the sign of $a_{ij}$.

We now compare single and multiple decoding paths in terms of the properties of the first-order and second-order partial derivatives of $D$. First consider the single-decoding-path case, whose expected end-to-end distortion is given in (3.2).
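The probabilities (3.4) and (3.5) translate directly into code. In this sketch (helper names ours; the loss probabilities are hypothetical), a clique state is a pair of received/lost unit lists, and a decoding path is a list of ancestor clique states:

```python
def clique_state_prob(eps, received, lost):
    """Eq. (3.4): prod(1 - eps_l) over received units * prod(eps_l') over lost."""
    p = 1.0
    for l in received:
        p *= 1.0 - eps[l]
    for l in lost:
        p *= eps[l]
    return p

def path_prob(eps, ancestor_states):
    """Eq. (3.5): the product of the ancestor clique-state probabilities."""
    p = 1.0
    for received, lost in ancestor_states:
        p *= clique_state_prob(eps, received, lost)
    return p

# Fig. 3.4 example: path [111] to C22 needs C11 in state [1] and C21 in [11].
# l3 costs no bits (it is a copy of frame 1), so eps["l3"] = 0.
eps = {"l1": 0.1, "l3": 0.0, "l4": 0.2}
p111 = path_prob(eps, [(["l1"], []), (["l3", "l4"], [])])
print(round(p111, 4))  # (1-0.1) * (1-0.0) * (1-0.2) = 0.72
```

Summing $p_C^{(q)} \sum_s p_C^{(s)} d_C^{(s,q)}$ over all cliques with these helpers reproduces (3.6).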
We derive its partial derivatives as

    \frac{\partial D}{\partial \epsilon_i} = \sum_{l \succeq i} d_l \prod_{l' \preceq l,\, l' \neq i} (1 - \epsilon_{l'})    (3.8)

    \frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j} = - \sum_{l \succeq i,j} d_l \prod_{l' \preceq l,\, l' \neq i,j} (1 - \epsilon_{l'})    (3.9)

The right-hand side of (3.8) can be written as the sum of two terms $f_1$ and $f_2$, where $f_1 = d_i \prod_{l' \prec i} (1 - \epsilon_{l'})$ corresponds to the original distortion of packet $i$ weighted by the probability of receiving all its ancestors, and $f_2 = \sum_{l \succ i} d_l \prod_{l' \preceq l,\, l' \neq i} (1 - \epsilon_{l'})$ indicates the importance of packet $i$ to its descendant packets. From (3.9), we can conclude that $\frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j} \le 0$ for any $i$ and $j$, since $\epsilon_{l'} \le 1$ for any $l'$.

When multiple decoding paths are possible, we derive the first-order derivative from (3.6) as

    \frac{\partial D}{\partial \epsilon_i} = f_1 + f_2 + f_3 + f_4    (3.10)

with

    f_1 = \sum_{q \in Q_{C_i}} p_{C_i}^{(q)} \Big[ \sum_{s \in S_{C_i}:\, i \in B_{C_i}^{(s)}} d_{C_i}^{(s,q)} \prod_{l \in B_{C_i}^{(s)},\, l \neq i} (1 - \epsilon_l) \prod_{l \in \bar{B}_{C_i}^{(s)}} \epsilon_l \Big]

    f_2 = - \sum_{q \in Q_{C_i}} p_{C_i}^{(q)} \Big[ \sum_{s \in S_{C_i}:\, i \in \bar{B}_{C_i}^{(s)}} d_{C_i}^{(s,q)} \prod_{l \in B_{C_i}^{(s)}} (1 - \epsilon_l) \prod_{l \in \bar{B}_{C_i}^{(s)},\, l \neq i} \epsilon_l \Big]

    f_3 = \sum_{C \succ C_i} \sum_{q \in Q_C:\, i \in A_C^{(q)}} \Big[ \prod_{l \in A_C^{(q)},\, l \neq i} (1 - \epsilon_l) \prod_{l \in \bar{A}_C^{(q)}} \epsilon_l \Big] \cdot \Big[ \sum_{s \in S_C} p_C^{(s)} d_C^{(s,q)} \Big]

    f_4 = - \sum_{C \succ C_i} \sum_{q \in Q_C:\, i \in \bar{A}_C^{(q)}} \Big[ \prod_{l \in A_C^{(q)}} (1 - \epsilon_l) \prod_{l \in \bar{A}_C^{(q)},\, l \neq i} \epsilon_l \Big] \cdot \Big[ \sum_{s \in S_C} p_C^{(s)} d_C^{(s,q)} \Big]

where $C_i$ represents the clique that contains packet $i$. $f_1$ indicates the packet importance due to its own distortion reduction; $f_2$ represents the redundancy when both $i$ and its sibling packets are received; $f_3$ shows the distortion reduction achieved by the descendant cliques of $C_i$ in the decoding paths that require $i$ to be received; and $f_4$ represents the impact of receiving $i$ on the descendant cliques of $C_i$ in the remaining decoding paths, which do not require $i$ to be received. The signs of these terms indicate whether it is desirable to transmit $i$ or not when different packets have been received at the decoder in the past, as a positive (negative) term will increase (decrease) the value of $\frac{\partial D}{\partial \epsilon_i}$.
We now give a concrete example to illustrate how to calculate the expected distortion for the MDLC scheme in Figure 3.2, and the properties of its partial derivatives, in the case of multiple decoding paths. Here we only consider cliques $C_{11}$, $C_{21}$ and $C_{22}$. Let $\epsilon_i$ denote the packet loss probability of data unit $l_i$ in Figure 3.2. Since $l_3$ is a direct copy of $l_1$ without the need to send any bits, $\epsilon_3 = 0$. We use the notation of Figure 3.4 for the distortion-related parameters.

(1) Calculation of expected distortion: The expected distortion $D$ is given by

    D = D_0 - \Delta D_{C_{11}} - \Delta D_{C_{21}} - \Delta D_{C_{22}}    (3.11)

where

    \Delta D_{C_{11}} = (1 - \epsilon_1)\, d_1

    \Delta D_{C_{21}} = (1 - \epsilon_1) \begin{bmatrix} \epsilon_4 & 1 - \epsilon_4 \end{bmatrix} \begin{bmatrix} d_3 \\ d_4 \end{bmatrix}

    \Delta D_{C_{22}} = (1 - \epsilon_1) \begin{bmatrix} \epsilon_4 & 1 - \epsilon_4 \end{bmatrix} \mathbf{d}_{C_{22}}\, \mathbf{p}_{C_{22}}

with

    \mathbf{d}_{C_{22}} = \begin{bmatrix} 0 & d_5^{(1)} & d_5^{(1)} \\ d_6^{(2)} & d_5^{(2)} & \max(d_5^{(2)}, d_6^{(2)}) \end{bmatrix}, \qquad \mathbf{p}_{C_{22}} = \begin{bmatrix} \epsilon_5 (1 - \epsilon_6) \\ (1 - \epsilon_5)\, \epsilon_6 \\ (1 - \epsilon_5)(1 - \epsilon_6) \end{bmatrix}

corresponding to the non-zero clique states of $C_{22}$ in the order [01], [10] and [11]. The entry of $\mathbf{d}_{C_{22}}$ at row $q$ and column $s$ gives the distortion reduction of $C_{22}$ along the $q$th decoding path in state $s$. The $s$th element of $\mathbf{p}_{C_{22}}$ is the probability of occurrence of $C_{22}$ at state $s$.

(2) First-order partial derivatives: Take $l_4$ as an example to show the importance of $\partial D / \partial \epsilon$ to the system's behavior. Written in the same way as (3.10), $\partial D / \partial \epsilon_4$ is derived from (3.11) with $f_1 = (1 - \epsilon_1)\, d_4$, $f_2 = -(1 - \epsilon_1)\, d_3$, $f_3 = (1 - \epsilon_1)(\mathbf{d}_{C_{22}}(2)\, \mathbf{p}_{C_{22}})$, and $f_4 = -(1 - \epsilon_1)(\mathbf{d}_{C_{22}}(1)\, \mathbf{p}_{C_{22}})$, where $\mathbf{d}_{C_{22}}(q)$ ($q = 1, 2$) is the $q$th row of $\mathbf{d}_{C_{22}}$. From the signs of the above terms, we can see that transmission of $l_4$ is more favorable when $f_1$ and $f_3$ are significant, and less favorable when $f_2$ and $f_4$ become dominant.

(3) Second-order partial derivatives: Assuming now that $l_1$ and $l_4$ have been transmitted but no acknowledgements have been received yet, the current state is $\tilde{\epsilon} = [\epsilon_1, \epsilon_3, \epsilon_4, \epsilon_5, \epsilon_6] = [\epsilon_1, 0, \epsilon_4, 1, 1]$, with $0 \le \epsilon_1, \epsilon_4 \le 1$.
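A direct numeric evaluation of (3.11) is a useful sanity check. The sketch below plugs hypothetical distortion reductions (all numbers ours, for illustration only) into the three ΔD terms:

```python
# Hypothetical reductions: d3 < d4 (l4 yields the better frame 2), and l5/l6
# take path-dependent values as in Fig. 3.4(b).
d1, d3, d4 = 10.0, 2.0, 5.0
d5_1 = 3.0                 # l5 decoded along path [110]
d5_2, d6_2 = 4.0, 3.0      # l5, l6 decoded along path [111]

def expected_distortion(D0, e1, e4, e5, e6):
    """Eq. (3.11), with eps3 = 0 because l3 is a free copy of frame 1."""
    dD11 = (1 - e1) * d1
    dD21 = (1 - e1) * (e4 * d3 + (1 - e4) * d4)
    # Rows of d_C22 = paths [110], [111]; columns = states [01], [10], [11].
    row110 = [0.0, d5_1, d5_1]
    row111 = [d6_2, d5_2, max(d5_2, d6_2)]
    p_states = [e5 * (1 - e6), (1 - e5) * e6, (1 - e5) * (1 - e6)]
    dot = lambda row: sum(a * b for a, b in zip(row, p_states))
    dD22 = (1 - e1) * (e4 * dot(row110) + (1 - e4) * dot(row111))
    return D0 - dD11 - dD21 - dD22

# All packets arrive: D = 100 - d1 - d4 - max(d5_2, d6_2) = 81.
print(expected_distortion(100.0, 0.0, 0.0, 0.0, 0.0))  # 81.0
# Losing l4 falls back to d3 and to path [110]: D = 100 - d1 - d3 - d5_1 = 85.
print(expected_distortion(100.0, 0.0, 1.0, 0.0, 0.0))  # 85.0
```

The second call shows the redundancy at work: losing $l_4$ costs only $(d_4 - d_3) + (\max(d_5^{(2)}, d_6^{(2)}) - d_5^{(1)})$, not the full reductions of $l_4$ and $l_6$.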
Consider the following second-order derivatives at $\tilde{\epsilon}$:

- $\frac{\partial^2 D}{\partial \epsilon_4 \partial \epsilon_6} = -(1 - \epsilon_1)\, d_6^{(2)} \le 0$, since $l_6$ is dependent on $l_4$ for decoding;

- $\frac{\partial^2 D}{\partial \epsilon_5 \partial \epsilon_6} = (1 - \epsilon_1)(1 - \epsilon_4)\, I_{C_{22}}^{[11]} \ge 0$, since $l_5$ and $l_6$ are mutually redundant.

Though it is complicated to derive a general expression for $\frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j}$ from (3.6), we can see from the above example that, in the case of multiple decoding paths, $\frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j}$ can be either non-negative or non-positive. In contrast, $\frac{\partial^2 D}{\partial \epsilon_i \partial \epsilon_j} \le 0$ for a single decoding path. This shows that, in the case of a single decoding path, the arrival of one packet at the receiver can increase, or at least not reduce, the importance of the other packets in terms of distortion reduction. However, when there are multiple decoding paths, due to the redundancy between packets, which affects the higher-order terms, the future transmission of packets may decrease the current importance of a packet that contains redundant information.

3.4 Scheduling Algorithms with DAHG

In this section we study two rate-distortion adaptive packet scheduling algorithms using our proposed DAHG source model: one based on Lagrangian optimization using an iterative descent algorithm [19,20], and another based on a greedy solution derived from the Taylor analysis in Section 3.3.3. Finally, we introduce the concept of transport redundancy in terms of the retransmission penalty observed at the client, and discuss its role in both scheduling algorithms.

3.4.1 System Architecture

Figure 3.5 shows an end-to-end video transmission system, in which each video frame is encoded, transmitted and decoded in real time within some acceptable delay period. The input video is compressed into multiple redundant representations, e.g., using MDLC. For a packet-switched network, these streams are packetized and then fed into the transmission buffer. At each transmission time $t$, we make a selection decision only among packets whose playback deadlines fall within a time-varying transmission window $[lag(t), lead(t)]$.
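Because $D$ in (3.11) is linear in each $\epsilon_i$, central finite differences recover the mixed partials exactly, which lets us check the two signs above numerically. This is a self-contained sketch with hypothetical numbers (ours); only the $C_{22}$ term matters, since the $C_{11}$ and $C_{21}$ terms do not involve $\epsilon_5$ or $\epsilon_6$:

```python
d5_1, d5_2, d6_2 = 3.0, 4.0, 3.0
I11 = min(d5_2, d6_2)    # redundancy of C22 in state [11] along path [111]

def D(e1, e4, e5, e6, D0=100.0):
    """D0 minus the C22 term of (3.11); C11/C21 terms are constant in e5, e6."""
    p = [e5 * (1 - e6), (1 - e5) * e6, (1 - e5) * (1 - e6)]
    r110 = 0.0 * p[0] + d5_1 * p[1] + d5_1 * p[2]
    r111 = d6_2 * p[0] + d5_2 * p[1] + max(d5_2, d6_2) * p[2]
    return D0 - (1 - e1) * (e4 * r110 + (1 - e4) * r111)

base = dict(e1=0.1, e4=0.2, e5=1.0, e6=1.0)  # l1, l4 sent, no ACKs yet
h = 1e-4

def mixed(a, b):
    """Central-difference estimate of d^2 D / (d eps_a d eps_b) at `base`."""
    def at(**shift):
        x = {k: v + shift.get(k, 0.0) for k, v in base.items()}
        return D(**x)
    return (at(**{a: h, b: h}) - at(**{a: h, b: -h})
            - at(**{a: -h, b: h}) + at(**{a: -h, b: -h})) / (4 * h * h)

print(round(mixed("e4", "e6"), 3))  # -(1-e1)*d6_2      = -2.7  (dependency)
print(round(mixed("e5", "e6"), 3))  #  (1-e1)*(1-e4)*I11 = 2.16 (redundancy)
```

The opposite signs of the two outputs are exactly the dependency-versus-redundancy contrast argued above.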
The time window advances with $t$, and thus each packet has a limited number of transmission opportunities. $lag(t)$ is defined such that any packet whose playback deadline is earlier than $lag(t)$ could not arrive at the receiver on time if it were transmitted at $t$. $lead(t)$ gives the earliest time at which a packet is eligible for transmission. [19] has proposed a number of ways to set $lead(t)$ by considering the receiver buffer implementation and the application playback delay. Here we assume the end-to-end delay for each frame is constant and equal to the initial playback delay $w$. Thus, $lead(t) = lag(t) + w$.

[Figure 3.5 appears here: a block diagram with the video input passing through the MDP encoder, transmission buffer and transmission scheduler at the sender, across a lossy channel to the receiver buffer and MDP decoder, with ACK feedback driving an a priori channel model and the time window control.]
Figure 3.5: Streaming system architecture.

Each transmission at time $t$ is subject to a constraint on the admissible channel rate during this time interval. The receiver sends an acknowledgement (ACK) back to the sender as soon as it receives a packet. With this feedback information, the sender can estimate channel conditions such as the packet loss rate and round-trip time (RTT). In our research, we simply model the network as an i.i.d. packet erasure channel with a fixed RTT. That means that a packet sent at $t$ is lost with probability $\epsilon$, independent of $t$. By time $t + RTT$, the sender will receive an ACK if the packet was received at the decoder; otherwise the packet is considered lost or corrupted. We also assume that the back channel is error-free. Thus, given a transmission policy under which there are $n$ transmissions of packet $l$ in the last RTT, the expected PLP of packet $l$ at time $t$ is given by

    \epsilon_l = \begin{cases} 0 & \text{if the sender has received an ACK of packet } l \text{ by } t, \\ \epsilon^n & \text{otherwise.} \end{cases}    (3.12)

More complicated network models, such as random network delay and a lossy back channel, can easily be combined into our streaming architecture and scheduling algorithms.
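Equation (3.12) is a one-liner; the sketch below (function name ours) also makes its boundary cases explicit:

```python
def expected_plp(eps, n, acked):
    """Eq. (3.12): expected loss probability of a packet that was sent n times
    in the last RTT over an i.i.d. erasure channel with loss rate eps,
    assuming an error-free back channel."""
    return 0.0 if acked else eps ** n

print(round(expected_plp(0.15, 2, acked=False), 6))  # two unacked sends: 0.0225
print(expected_plp(0.15, 0, acked=False))            # not sent in last RTT: 1.0
print(expected_plp(0.15, 2, acked=True))             # ACK received: 0.0
```

Note that $n = 0$ gives $\epsilon^0 = 1$, i.e., a packet with no outstanding transmissions is treated as not delivered until it is (re)sent.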
The major difference between the various network models is how to estimate the expected packet loss probability in (3.12) given a transmission policy; this has been carefully studied in [19]. In this chapter, we focus on the design of an efficient scheduling algorithm that takes into account both dependency and redundancy between packets.

3.4.2 Optimization Problem Formulation

The goal of scheduling is to minimize the playback distortion for a streaming session by adapting to the network conditions and application requirements. Though we work with a more general streaming framework that allows multiple decoding paths, we can follow the same problem formulation as originally proposed in [19] for streaming applications with a single decoding path. Suppose we wish to transmit a group of $L$ packets whose playback deadlines fall in a limited time window, and the packets are transmitted at discrete time intervals evenly distributed in the time window, with a maximum of $N$ transmission opportunities. Let $\pi_l = [v_0, \cdots, v_{N-1}]$ be the transmission policy for packet $l$ along the $N$ transmission opportunities, where $v_i = 1$ indicates "send packet $l$" and $v_i = 0$ "do not send packet $l$" at the $i$th time interval. We are interested in finding an optimal transmission policy $\pi = [\pi_1, \cdots, \pi_L]$ for this group of packets such that the expected end-to-end distortion is minimized subject to the data rate constraint, i.e.,

    \pi = \arg\min_{\pi:\, R(\pi) \le R_b} D(\pi).    (3.13)

Since the expected PLP $\epsilon_l$ for packet $l$ is a function of its transmission policy $\pi_l$, the expected end-to-end distortion $D$ also depends on $\pi$. Note that we consider expected distortion because there is uncertainty about the actual decoded video quality; changes in channel bandwidth, packet loss rate, and so forth will affect the quality of the received video.
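For intuition, (3.13) can be solved by brute force when $L$ and $N$ are tiny. The sketch below uses a deliberately simplified model of our own (independent packets, every scheduled send charged at its full size, PLP equal to $\epsilon^{\text{number of sends}}$ as in (3.12) without feedback) and enumerates all 0/1 policies:

```python
from itertools import product

eps, Rb = 0.2, 3.0          # channel loss rate, rate budget
r = [1.0, 2.0]              # packet sizes
d = [5.0, 3.0]              # per-packet distortion reductions (additive here)

def plp(policy):
    return eps ** sum(policy)

def rate(policy_vec):
    return sum(r[l] * sum(p) for l, p in enumerate(policy_vec))

def distortion(policy_vec, D0=10.0):
    return D0 - sum(d[l] * (1 - plp(p)) for l, p in enumerate(policy_vec))

# Enumerate all (v0, v1) policies for L = 2 packets, N = 2 opportunities.
feasible = [pv for pv in product(product((0, 1), repeat=2), repeat=2)
            if rate(pv) <= Rb]
best = min(feasible, key=distortion)
print([sum(p) for p in best])  # one transmission each is optimal here: [1, 1]
```

The exhaustive search costs $2^{NL}$ evaluations, which is why the iterative and greedy methods of the following subsections are needed in practice.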
3.4.3 Lagrangian Optimization Algorithm

As proposed in [19], the constrained optimization problem in (3.13) can be cast as an unconstrained optimization problem using a Lagrange multiplier $\lambda$:

    \pi = \arg\min_{\pi} D(\pi) + \lambda R(\pi).    (3.14)

Let $\beta(\pi_l)$ and $\epsilon(\pi_l)$ be the expected number of transmissions and the expected PLP for packet $l$ under $\pi_l$, respectively. Then the expected rate of the group of $L$ packets is $R(\pi) = \sum_l r_l \beta(\pi_l)$, and the expected distortion $D(\pi)$ is given by (3.6) using the DAHG model.

Our proposed scheduling algorithm is composed of two components: (1) at each transmission time $t$, the iterative descent optimization algorithm proposed in [19] is used to update $\pi$ for a given $\lambda$, taking into account the source rate-distortion information, current channel condition, transmission history and receiver feedback; (2) a window-based rate-control algorithm is applied regularly (e.g., at each transmission time) to adjust $\lambda$ such that the average output rate of the scheduler matches the channel bandwidth.

First, the iterative descent algorithm in [19] is used to optimize $\pi$ for coding applications with multiple decoding paths. For completeness, the Lagrangian optimization algorithm tailored to our DAHG model is summarized in Algorithm 1.
The major difference from [19] is that, at each optimization step, we derive the expected distortion from a DAHG model instead of a DAG, as the DAHG can represent both dependency and redundancy between packets.

Algorithm 1 Lagrangian(t, \lambda, \pi_{t-1})
1: n = 0: initialize \pi_l = \{\pi_{l,t-1}, 0, \ldots, 0\} for each packet l, and calculate \epsilon_l = \epsilon(\pi_l), \beta_l = \beta(\pi_l), D = D_0 - \sum_C \sum_{q \in Q_C} p_C^{(q)} [\sum_{s \in S_C} p_C^{(s)} d_C^{(s,q)}], R = \sum_l r_l \beta_l, J = D + \lambda R
2: repeat
3:   n = n + 1
4:   select packet l to optimize at step n in round-robin order
5:   a_l = \partial D / \partial \epsilon_l, obtained from (3.10)
6:   \pi_l^* = \arg\min_{\pi_l} \epsilon(\pi_l) + \frac{\lambda r_l}{a_l} \beta(\pi_l)
7:   \epsilon_l = \epsilon(\pi_l^*), \beta_l = \beta(\pi_l^*), D = D_0 - \sum_C \sum_{q \in Q_C} p_C^{(q)} [\sum_{s \in S_C} p_C^{(s)} d_C^{(s,q)}], R = \sum_l r_l \beta_l, J = D + \lambda R
8: until |J^{(n)} - J^{(n-1)}| < Threshold
9: return \pi = [\pi_1^*, \cdots, \pi_L^*]

The input parameter $\pi_{t-1}$ represents the optimal transmission policy determined at the previous time $t-1$. Since all the future transmission plans following the current time $t$ will be re-optimized, the function Lagrangian is only interested in the segment of $\pi_{t-1}$ that stores the transmission history up to $t-1$. Let $\pi_{l,t-1}$ denote the $l$th component of this past segment in $\pi_{t-1}$. We first initialize the transmission policy $\pi_l$ of each packet to be the one with no further transmissions, i.e., setting all the future transmission actions to 0 in Algorithm 1. At each iteration step, the Lagrangian cost $J$ is minimized with respect to the $\pi_l$ of a selected packet $l$ while keeping the policies of all other packets fixed. Upon convergence, the $\pi_l$ of each packet is optimized over its complete window of transmission opportunities. The transmitter then takes its transmission actions at $t$, and the optimization procedure is repeated at $t+1$.

Second, in order to approach the channel bandwidth limit, we propose a window-based rate-control scheme. That is, at each time, $\lambda$ is fixed for all packets in the transmission window.
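The iterative descent of Algorithm 1 is, at heart, coordinate descent on $J = D + \lambda R$, re-optimizing one packet's policy at a time with the others held fixed. A heavily simplified, self-contained sketch (model ours: no feedback, additive distortion; the closed form of step 6 is replaced by a brute-force search over the $2^N$ candidate policies of the selected packet):

```python
from itertools import product

def iterative_descent(sizes, lam, eps, n_opps, distortion_fn, tol=1e-9):
    """Coordinate descent on J = D + lam * R. `distortion_fn` maps the vector
    of per-packet loss probabilities to an expected distortion D."""
    L = len(sizes)
    pols = [(0,) * n_opps for _ in range(L)]       # start: no further sends

    def cost(pv):
        D = distortion_fn([eps ** sum(p) for p in pv])   # Eq. (3.12), no ACKs
        R = sum(s * sum(p) for s, p in zip(sizes, pv))
        return D + lam * R

    prev = cost(pols)
    while True:
        for l in range(L):                          # round-robin over packets
            pols[l] = min(product((0, 1), repeat=n_opps),
                          key=lambda p: cost(pols[:l] + [p] + pols[l + 1:]))
        cur = cost(pols)
        if abs(prev - cur) < tol:                   # convergence test (step 8)
            return pols
        prev = cur

# Toy use: two independent packets with additive distortion reductions.
D0, d = 10.0, [5.0, 3.0]
dist = lambda plps: D0 - sum(di * (1 - e) for di, e in zip(d, plps))
pols = iterative_descent([1.0, 1.0], lam=1.0, eps=0.2, n_opps=2,
                         distortion_fn=dist)
print([sum(p) for p in pols])  # one send per packet minimizes J here: [1, 1]
```

In the full algorithm, `distortion_fn` would be the DAHG expression (3.6), which is what couples the packets and makes the round-robin iteration necessary.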
The rate budget $R_b$ is increased when a new frame enters the transmission window, and decreased when packets are sent out. At each transmission time, we apply the bisection algorithm to find an appropriate $\lambda$ for $R_b$. This approach differs from those that fix $\lambda$ for each group of frames or for the whole session, in that it can quickly respond to channel bandwidth changes and use the bandwidth in a more efficient way. The rate-control algorithm is summarized in Algorithm 2.

Algorithm 2 Window Based Rate Control(t, R_b)
1: if t = 0 then
2:   R_b = 0
3: if M new frames come in then
4:   R_b = R_b + M \times channel\ bandwidth \times frame\ interval
5: use the bisection algorithm to find an appropriate \lambda with rate constraint R_b
6: call Lagrangian(t, \lambda, \pi_{t-1})
7: R_b = R_b - \sum_{l:\, \pi_l(t) = 1} r_l
8: return R_b

3.4.4 Greedy Algorithm

Since only the current transmission action in $\pi$ is used at any given time, instead of determining the complete transmission policy for each packet over all possible transmission opportunities (e.g., as in the above Lagrangian optimization), we could choose a greedy approach that selects the currently most important packet from the group of $L$ candidate packets. Previous research [50] has proposed similar solutions for single-decoding-path applications. Here, we derive the greedy algorithm for multiple-decoding-path codecs from the Taylor expansion of the expected distortion. Given the past transmission history of packet $i$, let $\pi_{i,0}$ be a transmission schedule such that packet $i$ is not transmitted at the current time $t$ or at any future time step, and let $\pi_{i,1}$ be the same transmission schedule as $\pi_{i,0}$ except that packet $i$ is transmitted at $t$.
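The $\lambda$ search in step 5 of Algorithm 2 relies on the fact that the scheduler's expected rate is non-increasing in $\lambda$. A minimal bisection sketch (function and toy model ours):

```python
def bisect_lambda(rate_of, R_b, lo=1e-6, hi=1e6, iters=60):
    """Find lam such that rate_of(lam) ~= R_b, assuming rate_of is
    non-increasing in lam (a larger rate penalty means fewer transmissions)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate_of(mid) > R_b:
            lo = mid        # spending too much: penalize rate more
        else:
            hi = mid
    return hi

# Toy stand-in for the scheduler's rate response: R(lam) = 100 / (1 + lam).
lam = bisect_lambda(lambda l: 100.0 / (1.0 + l), R_b=20.0)
print(round(lam, 3))  # 100 / (1 + lam) = 20  =>  lam = 4.0
```

In the actual system, `rate_of` would run the Lagrangian scheduler for the candidate $\lambda$ and report its expected output rate, which is why sharing one $\lambda$ adjustment across several transmission times reduces complexity.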
Sending packet $i$ at $t$ induces a state transition from $\epsilon(\pi_{i,0})$ to $\epsilon(\pi_{i,1})$, and thus leads to a distortion reduction of

    \Delta D_i^{(t)} = D(\epsilon(\pi_{i,0})) - D(\epsilon(\pi_{i,1})) = a_i (\epsilon_{i,0} - \epsilon_{i,1}) = a_i (1 - \epsilon)\, \epsilon_{i,0}    (3.15)

derived from (3.7), where $\epsilon_{i,0}$ and $\epsilon_{i,1}$ are the PLP of packet $i$ given the schedule $\pi_{i,0}$ or $\pi_{i,1}$, respectively. In fact, $\epsilon_{i,0}$ is the expected PLP of packet $i$ at $t$ given its transmission history, as calculated in (3.12), and we simplify the notation to $\epsilon_i$. $\Delta D_i^{(t)}$ indicates the importance of sending packet $i$ at the current time $t$ when no further transmissions are considered. To favor packets with early playback deadlines, we introduce a multiplier $\epsilon^{m_i}$ in (3.15), where $m_i$ is designed to approximate the number of possible retransmissions by (Footnote 2)

    m_i = (t_i - t) / RTT.    (3.16)

(Footnote 2: Strictly speaking, the number of possible retransmissions can be much larger than the given $m_i$ for a system that allows retransmissions without waiting.)

This is because the possible future transmissions of packet $i$ decrease the importance of sending it at $t$. Ignoring the constant term $(1 - \epsilon)$, to compare the importance of sending each packet at $t$ while taking the packet size $r_i$ into account, we use the metric

    c_i = \epsilon^{m_i}\, \epsilon_i\, \frac{a_i}{r_i}    (3.17)

for each packet, and select the one with the largest $c_i$ to send. Note that $a_i$ is calculated at the current state under the assumption that there are no future transmissions of other packets.

Algorithm 3 Greedy
1: for all 1 \le i \le L do
2:   c_i = \epsilon^{m_i}\, \epsilon_i\, a_i / r_i
3: find the largest c_i, say c_j (i.e., c_j \ge c_i for any i \ne j)
4: return j

Algorithm 3 summarizes the proposed greedy technique. At each transmission time, we choose the most important packets to send by running this algorithm iteratively until the channel rate allocated to this time interval is used up.

A main problem of the greedy algorithm is that it ignores the possibility of future transmissions of other packets.
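Algorithm 3's scoring and selection fit in a few lines (sketch ours; the $a_i$ values would come from (3.10), but here they are hypothetical inputs):

```python
eps = 0.15   # channel packet loss rate

def greedy_pick(packets, t, rtt):
    """Return the index maximizing c_i = eps**m_i * eps_i * a_i / r_i, with
    m_i = (t_i - t) / RTT from (3.16). Packets whose deadlines have passed
    are assumed to have been dropped from the window by lag(t) already."""
    def score(p):
        m_i = (p["t_i"] - t) / rtt
        return eps ** m_i * p["eps_i"] * p["a_i"] / p["r_i"]
    return max(range(len(packets)), key=lambda i: score(packets[i]))

pkts = [
    {"t_i": 1.0, "eps_i": 0.15, "a_i": 5.0, "r_i": 1.0},  # urgent
    {"t_i": 3.0, "eps_i": 0.15, "a_i": 9.0, "r_i": 1.0},  # important, not urgent
]
print(greedy_pick(pkts, t=0.8, rtt=0.2))  # deadline pressure wins: 0
```

With equal deadlines, the metric instead favors the packet with the larger sensitivity-to-size ratio $a_i / r_i$, which is the rate-distortion part of the rule.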
As Section 3.3.3 points out, for applications with multiple decoding paths, the future transmission of a packet may either increase or decrease the importance of another packet, depending on their coding relation. Thus, in an optimal algorithm, the future transmission probabilities of packets would have an increased impact, through the higher-order derivatives of the Taylor expansion, on the decision at the current transmission opportunity. Another problem may arise from possible future retransmissions of the packet itself, for which this algorithm introduces a multiplier on the importance metric to approximate the impact on the current decision. In Section 3.5, we will see that the greedy algorithm experiences a certain performance loss compared to Lagrangian optimization. Other work has studied improved greedy scheduling algorithms to address these problems. Our previous MDLC work in [83] proposed a double-time-window control that intentionally introduces an extra waiting period for MD2, such that it can only be transmitted, relatively safely, at a future time, to avoid unnecessary redundancy with MD1. This helps when the acknowledgements generated by early transmissions of MD1 are likely to arrive at the sender soon. In order to avoid the penalty introduced by premature retransmissions, [80] proposed to delay some packet scheduling decisions. However, for a general coding scenario that provides multiple decoding paths, we have not yet achieved a systematic solution that accounts for the possible future (re)transmissions of a packet and of its related packets (related through either dependency or redundancy). We are currently working on a possible solution that takes into account the higher-order partial derivatives described in Section 3.3.3.

3.4.5 Transport Redundancy

Traditional ARQ approaches request retransmission only upon detection of lost or overly delayed packets. Thus the number of retransmissions is very limited for delay-constrained real-time video communication.
In comparison, our extended streaming framework, like the one originally proposed in [19] for a single decoding path, allows unlimited retransmission of a packet before its playback deadline, in the sense that it can retransmit a packet without waiting for a timeout or a NAK from the receiver. This approach essentially relieves the delay problem caused by retransmissions. However, it may introduce a rate penalty when both the retransmitted and the original packets are correctly received at the decoder. We call this "transport redundancy," since the client receives redundant information. (Footnote 3: If the original packet is not correctly received at the decoder, the duplicated packet contributes to reducing the end-to-end distortion, and thus we do not count it as a transport-redundant packet.)

One possible variation of the proposed scheduling algorithms is to mimic traditional ARQ systems by not allowing the retransmission of a packet until its last transmission has gone unacknowledged for a predefined timeout. Under our system assumption of a fixed RTT, the timeout is simply defined to be equal to one RTT. In other systems, where the network produces a random delay, the timeout can be set as the mean RTT plus some tolerance (e.g., three times the standard deviation of the RTT, as is frequently used in ARQ systems). This variant differs from the original scheduling algorithms in that it completely, or almost completely, avoids the cost penalty due to transport redundancy. However, if retransmission is controlled appropriately in the no-waiting case, the end-to-end performance can be improved without introducing longer delay. We will compare the performance of the scheduling algorithms with and without transport redundancy in Section 3.5.
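The difference between the two retransmission regimes reduces to an eligibility test per packet (helper ours; the random-delay timeout follows the mean-plus-three-standard-deviations rule mentioned above):

```python
def eligible(t, last_tx, acked, rtt, rtt_std=0.0, no_waiting=False):
    """May this packet be (re)transmitted at time t? `last_tx` is the time of
    its most recent transmission (None if it was never sent)."""
    if acked:
        return False                  # delivered; resending would be pure
                                      # transport redundancy
    if no_waiting or last_tx is None:
        return True                   # unlimited-retransmission framework
    timeout = rtt + 3.0 * rtt_std     # fixed-RTT case reduces to one RTT
    return t - last_tx >= timeout

# Packet sent at t = 1.0 s over a fixed-RTT (0.2 s) channel, not yet ACKed:
print(eligible(1.1, 1.0, acked=False, rtt=0.2))                   # False
print(eligible(1.1, 1.0, acked=False, rtt=0.2, no_waiting=True))  # True
print(eligible(1.25, 1.0, acked=False, rtt=0.2))                  # True
```

Filtering the candidate list with this test is also what shrinks the search space in the limited-retransmission variants discussed in Section 3.4.6.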
3.4.6 Complexity Analysis

The complexity of the Lagrangian optimization approach is on the order of $N_\lambda N_i L 2^N$ at each transmission time, where $N_i$ is the number of iterations performed until the algorithm converges for a given $\lambda$, and $N_\lambda$ is the number of iterations the rate-control algorithm needs to adjust $\lambda$ to meet the rate limit. The period over which $\lambda$ is adjusted can span multiple transmission times in order to reduce the complexity. $L$ is the number of packets available for transmission in the time window, and $N$ is the number of transmission opportunities of a packet. The complexity of the greedy approach is $O(L)$, since each packet only needs to be traversed once to choose the most important packet to send at a given time. Note that the limited-retransmission variants of the proposed algorithms lead to decreases in complexity, as the number of packets to be considered at each transmission opportunity decreases: instead of considering all $L$ packets, we consider only those that have not been transmitted, or that were transmitted in the distant past (e.g., one RTT ago) without acknowledgement. In addition, for the Lagrangian optimization algorithm, the search space of optimal transmission policies for each packet is greatly reduced by the retransmission limitation.

3.5 Experimental Results

In this section, we examine the performance of the proposed streaming framework for video codecs with multiple decoding paths. The video sequences are coded using the proposed MDLC approach based on MPEG-4 FGST. Three standard test sequences are used: Akiyo (QCIF), Foreman (QCIF) and Mobile (CIF). The first 200 frames of each sequence are coded at 30 f/s with a constant quantization step size. Each group of pictures (GOP) has 10 frames coded in IPP format. Specifically, at the base layer, 5 frames correspond to MD1 and 6 frames to MD2 in each GOP. Base-layer reconstruction of missing frames is done by simply copying the past frame from the other description. Each base-layer packet includes a complete frame.
The enhancement layer of a frame in each description is coded bit-plane by bit-plane, with each bit-plane put into one packet. The performance is measured in terms of the average luminance peak signal-to-noise ratio (PSNR), in decibels, of the decoded video frames at the receiver, as a function of various system parameters, such as the available channel bandwidth, packet loss rate (PLR), RTT and application playback delay (denoted by $w$). In all experiments, the channel RTT is set to 200 ms, and each packet has transmission opportunities every 80 ms. We performed 100 rounds for each experimental scenario, and the results shown are the average over these rounds.

[Figure 3.6 appears here: four PSNR-vs-transmission-rate plots comparing the Lagrangian, Greedy and Heuristic schedulers: (a) Mobile CIF, w = 160 ms; (b) Akiyo QCIF, w = 320 ms; (c) Foreman QCIF, w = 640 ms; (d) Foreman QCIF under the Greedy scheduler, comparing unlimited vs. limited retransmission at w = 160 ms and w = 640 ms.]
Figure 3.6: Comparison between scheduling algorithms at PLR = 0.15 for various playback delays. The base-layer quantization parameters for Mobile, Akiyo, and Foreman are set to 12, 20, and 20, respectively.

3.5.1 Comparison between Scheduling Algorithms

In addition to the Lagrangian optimization and greedy algorithms, we also include in the comparison a heuristic scheduling algorithm based on ARQ with prioritized transmission. In this algorithm, when the sender has not received the ACK of a packet after one RTT, it puts the packet back into the transmission queue for retransmission. The scheduler differentiates descriptions and layers by a predefined priority order.
We choose the order that is observed, in general, to achieve better performance than the others: BL1, EL1, BL2, EL2, in decreasing order of priority, which sends MD1 first and then MD2 for increased redundancy if additional rate is available. Among packets within the same priority class, priority is given to those with earlier playback deadlines. Both Lagrangian optimization and the greedy algorithm allow retransmissions without waiting for a timeout.

Figures 3.6(a)-(c) show the performance comparison between these systems when PLR = 0.15. First, the Lagrangian method provides substantial gains over the heuristic approach across the whole range of bandwidths under consideration, at various playback delays. The performance gain is in the range of 1-7 dB and decreases as the playback delay increases. The heuristic approach prioritizes different descriptions and layers in a predefined order without exploiting the rate-distortion information of the source packets. Thus, the predefined order may lead to a mismatch between the added redundancy and that required by the system conditions. For example, it does not introduce enough redundancy in the short-delay case shown in Figure 3.6(a), in that MD2 is not transmitted until all base and enhancement layers of MD1 have been sent. Furthermore, the number of retransmissions is restricted to be low due to the delay requirement. Therefore, the transmission of less significant enhancement layers of MD1 is likely a waste of bandwidth when its more significant layers are lost. Second, the Lagrangian method outperforms the greedy algorithm by up to 3 dB, although both algorithms achieve similar performance in the short-playback-delay case of Figure 3.6(a). The greedy algorithm tends to introduce more redundancy into the system, since it makes scheduling decisions without considering possible future packet transmissions. In some sense, the greedy algorithm is not able to exploit a longer playback delay in a cost-efficient way.
Finally, the greedy algorithm performs better than the heuristic approach in most cases, since it exploits the knowledge of the distortion impact of a packet loss on the reconstructed video quality. However, in the case of long playback delay, the greedy algorithm performs poorly at some transmission rates for the reasons we just explained.

3.5.2 Redundancy's Role in Adaptive Streaming

We now describe a detailed performance analysis when the two types of redundancy, namely source and transport redundancy, are used in the streaming system. We use the Lagrangian optimization algorithm as the default scheduling algorithm unless otherwise explicitly mentioned. For all the experiments, we use LC as a representative single-decoding-path (SDP) codec, and MDLC as an example of multiple-decoding-path (MDP) codecs. In order to emphasize the differences in the end-to-end reconstructed quality due to the transmission impacts and the adaptation flexibility introduced by source redundancy, we adjust the base layer quantization parameters (QP) so that MDLC and LC perform similarly in terms of coding efficiency. Figure 3.7 shows the base layer QP and rate-distortion curves of LC, MD1 and MD2 of MDLC for Mobile and Foreman measured at the encoder without any transmission impact. Three scenarios are considered in the experiments. In the first one, we compare the performance with or without transport redundancy for each type of codec. In the second scenario, when transport redundancy is not available, we compare the performance of SDP and MDP codecs, i.e., LC and MDLC. Finally, the third scenario under consideration represents a combination of the above two kinds of redundancy. Specifically, in this scenario, we examine streaming performance when both source redundancy and transport redundancy are introduced in the streaming system.
[Figure 3.7: Rate-distortion curves of LC, MD1 and MD2 of MDLC for Foreman and Mobile measured at the encoder without transmission impacts. Panels: (a) Foreman QCIF with MD1 (QP = 20), MD2 (QP = 20) and LC (QP = 24); (b) Mobile CIF with MD1, MD2 and LC, all at QP = 12.]

3.5.2.1 Transport Redundancy

We first examine in Figure 3.8 the performance of streaming Foreman, as a function of the available transmission rate and playback delay when LC and MDLC are used, respectively. Here, as discussed in Section 3.4.5, unlimited retransmission corresponds to the scheduling algorithms that allow retransmissions without waiting, while limited retransmission indicates the case where retransmissions have to wait for a timeout period. First, it can be seen that, for both source codecs, unlimited retransmission outperforms limited retransmission by a significant margin over the entire range of transmission rates. This is due to the fact that for the former approach the chance of multiple retransmissions is greatly increased without incurring an unacceptable delay, and therefore the additional bandwidth can be efficiently used to retransmit the most important packets so as to improve the end quality at the receiver.
Since the set of possible choices for π in (3.14) under limited retransmission is a subset of the corresponding set under unlimited retransmission, the Lagrangian optimization should ideally always achieve better performance when the retransmission restriction is removed.

[Figure 3.8: The impact of transport redundancy on streaming performance when using the Lagrangian optimization algorithm. Panels: (a) Foreman QCIF, LC, PLR = 0.3; (b) Foreman QCIF, LC, PLR = 0.15; (c) Foreman QCIF, MDLC, PLR = 0.3; (d) Foreman QCIF, MDLC, PLR = 0.15; each comparing limited and unlimited retransmission at playback delays of 160, 320 and 640 ms.]

Transport redundancy is well adjusted by Lagrangian optimization such that retransmissions are not wasted, by exploiting the statistical knowledge of the channel and the past transmission history. For example, it is observed that, if the playback delay is long enough, the scheduler chooses to wait for one RTT before initiating a new retransmission, so that the retransmission only occurs if the sender does not receive the ACK.
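The Lagrangian selection principle can be illustrated with a toy sketch. This is our simplification, not the actual optimization of (3.14): real policies π are per-packet transmission schedules with expected rate and distortion, whereas here each candidate is reduced to a single (expected rate, expected distortion) pair and the multiplier is tuned by bisection.

```python
def lagrangian_select(policies_per_unit, rate_budget, iters=50):
    """Toy Lagrangian policy selection: for each data unit, independently
    pick the policy minimizing D + lam * R, and tune the multiplier lam
    by bisection so the total expected rate meets the budget.
    Each candidate policy is an (expected_rate, expected_distortion) pair."""
    def choose(lam):
        picks = [min(ps, key=lambda p: p[1] + lam * p[0])
                 for ps in policies_per_unit]
        return picks, sum(r for r, _ in picks)

    lo, hi = 0.0, 1e9          # assume the min-rate choice is feasible
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        _, rate = choose(lam)
        if rate > rate_budget:
            lo = lam           # over budget: penalize rate more
        else:
            hi = lam
    return choose(hi)[0]
```

With a budget of 5 and two units offering either "send nothing" or a 5-unit-rate transmission, the selection spends the budget on the unit whose transmission reduces distortion the most per unit of rate.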
However, for an algorithm that does not take into account future packet transmissions, transport redundancy introduced by unlimited retransmissions may deteriorate the algorithm's performance, as shown in Figure 3.6(d) for bandwidths below 200 Kbps when the greedy algorithm is used. Second, the performance gain of LC through transport redundancy tends to be more significant than that of MDLC in the same system setting. As seen in Figures 3.8 (a) and (c) at w = 320 ms, the gain reaches up to 9 dB for LC and 4 dB for MDLC at high transmission rates. This is because source redundancy in MDLC provides a benefit similar to that of transport redundancy in terms of improving error robustness in a lossy packet network. Finally, we observe that the performance difference between unlimited and limited retransmission is larger under poor channel conditions, such as high PLR and short playback delay. Thus limited retransmission may be appropriate as a lower complexity scheduling technique in the case of low PLR and long playback delay.

3.5.2.2 Source Redundancy without Transport Redundancy

We then compare in Figure 3.9 the performance of LC and MDLC in the absence of transport redundancy, i.e., when the Lagrangian algorithm is used with limited retransmissions.
MDLC provides a significant gain over LC in the case of short playback delay and high PLR, where the source redundancy introduced by multiple descriptions greatly improves error robustness.

[Figure 3.9: Comparing LC and MDLC with limited retransmissions for Mobile CIF and Foreman QCIF at PLR = 0.3, 0.15 and 0.05, with playback delays of 160, 320 and 640 ms. The performance at w = 160 ms and PLR = 0.3 for both sequences is not included in the figure as the low PSNR achieved is out of the acceptable range.]

The performance gain reaches up to 8 dB for both Mobile and Foreman when w = 160 ms at PLR = 0.15. As w increases, the performance of LC improves as the number of possible retransmissions increases, and finally approaches that of MDLC at w = 640 ms.
However, at very high PLR (e.g., PLR = 0.3), MDLC outperforms LC again with a gain of 1-3 dB when w = 640 ms. In very few cases, there is a penalty of up to 0.6 dB for MDLC over LC. This may be due to the possible local minima involved in the Lagrangian optimization of MDLC. To summarize the results, source redundancy provides significant benefits for robust video communication, especially in the case of high PLR and short playback delay, which is known to be a very difficult environment for video communication. When system conditions become favorable, source redundancy may not be necessary considering the additional complexity it introduces.

3.5.2.3 Source Redundancy with Transport Redundancy

The final comparison is between LC and MDLC when transport redundancy is applied, i.e., when using the Lagrangian optimization algorithm with unlimited retransmissions. Figure 3.10 shows the performance of streaming Mobile and Foreman under the same system settings as Figure 3.9. Note that in this case the performance difference between MDLC and LC is not as large as in the previous case. This is expected as it is no longer necessary to wait for a timeout, and thus packet retransmission becomes possible even in a delay-sensitive application, where the end-to-end delay is of the order of the RTT.
The scheduling algorithm essentially provides unequal levels of transport redundancy to different packets based on their estimated importance, and thus overcomes the sensitivity of an LC system to transmission losses.

[Figure 3.10: Comparing LC and MDLC with unlimited retransmissions for Mobile CIF and Foreman QCIF at PLR = 0.3, 0.15 and 0.05, with playback delays of 160, 320 and 640 ms.]

Still, we can observe a performance gain of up to 6 dB at w = 160 ms and PLR = 0.3. Similar to the second experiment, the performance gain varies as a function of channel conditions, and is larger for high PLR and low playback delay. Thus we can conclude that transport redundancy and source redundancy can both improve the end-to-end performance by enhancing error robustness.
Furthermore, the best performance is achieved when both types of redundancy are applied in the streaming system. Such a system can achieve efficient video streaming even under very poor channel conditions, such as very high PLR or a relatively long RTT compared to the playback delay.

3.6 Conclusions

In this chapter we have extended recent work on rate-distortion based video scheduling to the general case where multiple decoding paths are possible. We proposed a new source model called the Directed Acyclic Hypergraph (DAHG) to describe the decoding dependence and redundancy between different data units. Based on this model, we have proposed two rate-distortion based scheduling algorithms, i.e., the Lagrangian optimization and greedy algorithms. Experimental results demonstrate the performance improvement obtained by exploiting the coding relations and rate-distortion information of data units in the scheduling algorithms. The results show that our proposed system with both source and transport redundancy can provide very robust and efficient real-time video communication over lossy packet networks.

Chapter 4

A Framework for Adaptive Scalable Video Coding Using Wyner-Ziv Techniques

4.1 Introduction

Scalable coding is well-suited for video streaming and broadcast applications as it facilitates adapting to variations in network behavior, channel error characteristics and computation power availability at the receiving terminal. Predictive coding, in which motion-compensated predictors are generated based on previously reconstructed frames, is an important technique to remove temporal redundancy among successive frames. It is well known that predictive techniques increase the difficulty of achieving efficient scalable coding because scalability leads to multiple possible reconstructions of each frame [66].
In this situation either (i) the same predictor is used for all layers, which leads to either drift or coding inefficiency, or (ii) a different predictor is obtained for each reconstructed version and used for the corresponding layer of the current frame, which leads to added complexity. MPEG-2 SNR scalability with a single motion-compensated prediction loop and MPEG-4 FGS exemplify the first approach. MPEG-2 SNR scalability uses the enhancement-layer (EL) information in the prediction loop for both base and enhancement layers, which leads to drift if the EL is not received. MPEG-4 FGS provides flexibility in bandwidth adaptation and error recovery because the enhancement layers are coded in "intra" mode, which results in low coding efficiency, especially for sequences that exhibit high temporal correlation. Rose and Regunathan [66] proposed a multiple motion-compensated prediction loop approach for general SNR scalability, in which each EL predictor is optimally estimated by considering all the available information from both base and enhancement layers. Several alternative multi-layer techniques have also been proposed to exploit the temporal correlation in the EL inside the FGS framework [33,79,93]. They employ one or more additional motion-compensated prediction loops to code the EL, for which a certain number of FGS bit-planes are included in the EL prediction loop to improve the coding efficiency. Traditional closed-loop prediction (CLP) techniques have the disadvantage of requiring the encoder to generate all possible decoded versions of each frame, so that each of them can be used to generate a prediction residue. Thus, the complexity is high at the encoder, especially for multi-layer coding scenarios. In addition, in order to avoid drift, the exact same predictor has to be used at both the encoder and decoder. Distributed source coding techniques based on network information theory provide a different and interesting viewpoint to tackle these problems.
Several video codecs using side information (SI) at the decoder [3,26,60,61,67,69] have recently been proposed within the Wyner-Ziv framework [94]. These can be thought of as an intermediate step between "closing the prediction loop" and coding each frame independently. In closed-loop prediction, in order for the encoder to generate a residue it needs to generate the same predictor that will be available at the decoder. Instead, a Wyner-Ziv encoder only requires the correlation structure between the current signal and the predictor. Thus there is no need to generate the decoded signal at the encoder as long as the correlation structure is known, or can be found. Some recent work [68,74,77,96] has addressed the problem of scalable coding in the distributed source coding setting. Steinberg and Merhav [74] formulated the theoretical problem of successive refinement of information in the Wyner-Ziv setting, which serves as the theoretical background of our work. In our work, we target the application of these principles to actual video coding systems. The two most closely related recent algorithms are in the works by Xu and Xiong [96] and Sehgal et al. [68]. There are a number of important differences between our approach and those techniques. In [96], the authors presented a scheme similar to MPEG-4 FGS by building the bit-plane ELs using Wyner-Ziv coding (WZC) with the current base and more significant ELs as SI, ignoring the EL information of the previous frames. In contrast, our approach exploits the remaining temporal correlation between successive frames in the EL using WZC to achieve improved performance over MPEG-4 FGS. In [68], multiple redundant Wyner-Ziv encodings are generated for each frame at different fidelities. An appropriate encoded version is selected for streaming, based on the encoder's knowledge of the predictor available at the decoder.
This scheme requires a feedback channel and additional delay, and thus is not well-suited for broadcast or low-delay applications. In short, one method [96] ignores temporal redundancy in the design, while the other [68] creates separate and redundant enhancement layers, rather than a single embedded enhancement layer. In addition to these approaches for SNR scalability, Tagliasacchi et al. [77] have proposed a spatially and temporally scalable codec using distributed source coding. They use the standards-conformant H.264/AVC codec to encode the base layer, and a syndrome-based approach similar to [61] to encode the spatial and temporal enhancement layers. Motion vectors from the base layer are used as coarse motion information so that the enhancement layers can obtain a better estimate of the temporal correlation. In contrast, our work focuses on SNR scalability. We propose, extending our previous work [81,84], an efficient solution to the problem of scalable predictive coding by recasting it as a Wyner-Ziv problem. Our proposed technique achieves scalability without feedback and exploits both spatial and temporal redundancy in the video signal. In [84] we introduced the basic concept with a first-order DPCM source model, and then presented a preliminary version of our approach in video applications in [81]. Our approach, Wyner-Ziv scalable coding (WZS), aims at applying, in the context of Wyner-Ziv coding, the CLP-based estimation-theoretic (ET) technique of [66]. Thus, in order to reduce complexity, we do not explicitly construct multiple motion-compensation loops at the encoder, while, at the decoder, SI is constructed to combine spatial and temporal information in a manner that seeks to approximate the principles proposed in [66]. In particular, starting from a standard CLP base-layer (BL) video coder (such as MPEG-4 in our implementation), we create a multi-layer Wyner-Ziv prediction "link", connecting the same bit-plane level between successive frames.
The decoder generates the enhancement-layer SI with either the estimation-theoretic approach proposed in [66] or our proposed simplified switching algorithm, so as to take into account all the information available to the EL. In order to design channel codes with appropriate rates, the encoder estimates the correlation between the current frame and its enhancement-layer SI available at the decoder. By exploiting the EL information of the previous frames, our approach can achieve significant gains in EL compression, as compared to MPEG-4 FGS, while keeping complexity reasonably low at the encoder. A significant contribution of our work is to develop a framework for integrating WZC into a standard video codec to achieve efficient and low-complexity scalable coding. Our proposed framework is backward compatible with a standard base-layer video codec. Another main contribution of this work is to propose two simple and efficient algorithms to explicitly estimate, at the encoder, the parameters of a model describing the correlation between the current frame and an optimized SI available only at the decoder. Our estimates closely match the actual correlation between the source and the decoder SI. The first algorithm is based on constructing an estimate of the reconstructed frame and directly measuring the required correlations from it. The second algorithm is based on an analytical model of the correlation structure, whose parameters the encoder can estimate. The chapter is organized as follows. In Section 4.2, we briefly review the theoretical background of successive refinement for the Wyner-Ziv problem. We then describe our proposed practical WZS framework and the correlation estimation algorithms in Sections 4.3 and 4.4, respectively. Section 4.5 describes the codec structure and implementation details. Simulation results are presented in Section 4.6, showing substantial improvement in video quality for sequences with high temporal correlation. Finally, conclusions and future work are provided in Section 4.7.
[Figure 4.1: Two-stage successive refinement with different side information $Y_1$ and $Y_2$ at the decoders, where $Y_2$ has better quality than $Y_1$, i.e., $X \to Y_2 \to Y_1$.]

4.2 Successive Refinement for the Wyner-Ziv Problem

Steinberg and Merhav [74] formulated the theoretical problem of successive refinement of information, originally proposed by Equitz and Cover [24], in a Wyner-Ziv setting (see Fig. 4.1). A source $X$ is to be encoded in two stages: at the coarse stage, using rate $R_1$, the decoder produces an approximation $\hat{X}_1$ with distortion $D_1$ based on SI $Y_1$. At the refinement stage, the encoder sends an additional $\Delta R = R_2 - R_1$ refinement bits so that the decoder can produce a more accurate reconstruction, $\hat{X}_2$, with a lower distortion $D_2$ based on SI $Y_2$. $Y_2$ is assumed to provide a better approximation to $X$ than $Y_1$ and to form a Markov chain $X \to Y_2 \to Y_1$. Let $R^*_{X|Y}(D)$ be the Wyner-Ziv rate-distortion function for coding $X$ with SI $Y$. A source $X$ is successively refinable if [74]:

$$R_1 = R^*_{X|Y_1}(D_1), \quad \text{and} \quad R_1 + \Delta R = R^*_{X|Y_2}(D_2). \tag{4.1}$$

Successive refinement is possible under a certain set of conditions. One of the conditions, as proved in [74], requires that the two SIs, $Y_1$ and $Y_2$, be equivalent at the distortion level $D_1$ of the coarse stage. To illustrate the concept of "equivalence", we first consider the classical Wyner-Ziv problem (i.e., without successive refinement) as follows. Let $Y$
For the successive refinement problem, Y 2 is said to be equivalent to Y 1 at D 1 , if there exists a random variable U achieving (4.2) at D 1 and satisfying I(U;Y 2 jY 1 ) = 0 as well. In words, when Y 1 is given, Y 2 does not provide any more information about U. Itisimportanttonotethatthisequivalenceisunlikelytoariseinscalablevideocoding. As an example, assume that Y 1 and Y 2 correspond to the BL and EL reconstruction of the previous frame, respectively. Then, the residual energy when the current frame is predicted based on Y 2 will in general be lower than if Y 1 is used. Thus, in general, this equivalence condition will not be met in the problem we consider and we should expect to observe a performance penalty with respect to a non-scalable system. Note that one special case where equivalence holds is that where identical SIs are used at all layers, i.e., Y 1 = Y 2 . For this case and for a Gaussian source with quadratic distortion measure the successive refinement property holds [74]. Some practical coding techniques have been developed based on this equal SI property, e.g., in the work of Xu and Xiong [96], where the BL of the current frame is regarded as the only SI at the decoder at both the coarse and refinement stages. However, as will be shown, constraining the decoder to use the 86 same SI at all layers leads to suboptimal performance. In our work, the decoder will use the EL reconstruction of the previous frame as SI, outperforming an approach similar to that proposed in [96]. 4.3 Proposed Prediction Framework In this section, we propose a practical framework to achieve Wyner-Ziv scalability for video coding. Let video be encoded so that each frame i is represented by a base layer BL i , and multiple enhancement layers EL i1 , EL i2 , ..., EL iL , as shown in Fig. 4.2. 
We assume that in order to decode $EL_{ij}$ and achieve the quality provided by the $j$-th EL, the decoder will need to have access to: (1) the previous frame decoded up to the $j$-th EL, $EL_{i-1,k},\ k \le j$, and (2) all information for the higher significance layers of the current frame, $EL_{ik},\ k < j$, including the reconstruction, prediction mode, BL motion vector for each Inter-mode macroblock, and the compressed residual. For simplicity, the BL motion vectors are reused by all EL bit-planes. With the structure shown in Fig. 4.2, a scalable coder based on WZC techniques would need to combine multiple SIs at the decoder. More specifically, when decoding the information corresponding to $EL_{i,k}$, the decoder can use as SI the decoded data corresponding to $EL_{i-1,k}$ and $EL_{i,k-1}$. In order to understand how several different SIs can be used together, we first review a well-known technique for combining multiple predictors in the context of closed-loop coding (Section 4.3.1 below). We then introduce an approach to formulate our problem as one of source coding with side information at the decoder (Section 4.3.2).

[Figure 4.2: Proposed multi-layer prediction problem. $BL_i$: the base layer of the $i$th frame. $EL_{ij}$: the $j$th EL of the $i$th frame, where the most significant EL bit-plane is denoted by $j = 1$. Arrows indicate BL CLP temporal prediction, EL SNR prediction, and EL temporal prediction.]

4.3.1 Brief Review of the ET Approach

In this section we briefly review the ET approach proposed in [66]. The temporal evolution of DCT coefficients can usually be modelled by a first-order Markov process

$$x_k = \rho x_{k-1} + z_k, \quad x_{k-1} \perp z_k \tag{4.3}$$

where $x_k$ is a DCT coefficient in the current frame and $x_{k-1}$ is the corresponding DCT coefficient in the previous frame after motion compensation. Let $\hat{x}^b_k$ and $\hat{x}^e_k$ be the base and enhancement layer reconstructions of $x_k$, respectively.
After the BL is generated, we know that $x_k \in (a,b)$, where $(a,b)$ is the quantization interval generated by the BL. In addition, assume that the EL encoder and decoder have access to the EL reconstructed DCT coefficient $\hat{x}^e_{k-1}$ of the previous frame. Then the optimal EL predictor is given by

$$\tilde{x}^e_k = E[x_k \mid \hat{x}^e_{k-1},\ x_k \in (a,b)] \approx \rho \hat{x}^e_{k-1} + E[z_k \mid z_k \in (a - \rho\hat{x}^e_{k-1},\ b - \rho\hat{x}^e_{k-1})]. \tag{4.4}$$

The EL encoder then quantizes the residual

$$r^e_k = x_k - \tilde{x}^e_k. \tag{4.5}$$

Let $(c,d)$ be the quantization interval associated with $r^e_k$, i.e., $r^e_k \in (c,d)$, and let $e = \max(a,\ c + \tilde{x}^e_k)$ and $f = \min(b,\ d + \tilde{x}^e_k)$. The optimal EL reconstruction is given by

$$\hat{x}^e_k = E[x_k \mid \hat{x}^e_{k-1},\ x_k \in (e,f)]. \tag{4.6}$$

The EL predictor in (4.4) can be simplified in the following two cases: (1) $\tilde{x}^e_k \approx \hat{x}^b_k$ if the correlation is low, $\rho \approx 0$, or the total rate is approximately the same as the BL rate, i.e., $\hat{x}^e_{k-1} \approx \hat{x}^b_{k-1}$; (2) $\tilde{x}^e_k \approx \hat{x}^e_{k-1}$ for cases where the temporal correlation is high or the quality of the BL is much lower than that of the EL. Note that in addition to optimal prediction and reconstruction, the ET method can lead to further performance gains if efficient context-based entropy coding strategies are used. For example, the two cases $\tilde{x}^e_k \approx \hat{x}^b_k$ and $\tilde{x}^e_k \approx \hat{x}^e_{k-1}$ could have different statistical properties. In general, with the predictor of (4.4), since the statistics of $z_k$ tend to be different depending on the interval $(a - \rho\hat{x}^e_{k-1},\ b - \rho\hat{x}^e_{k-1})$, the encoder could use different entropy coding on different intervals [66]. Thus, a major goal in this chapter is to design a system that can achieve some of the potential coding gains of conditional coding in the context of a WZC technique. To do so we will design a switching rule at the encoder that will lead to different coding for different types of source blocks.
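Equations (4.4)-(4.6) can be evaluated numerically once a distribution is fixed for the innovation $z_k$. The sketch below assumes a zero-mean Gaussian $z_k$ for concreteness (the chapter above does not commit to a distribution, and Laplacian models are also common for DCT residuals); the function names and the closed-form truncated-Gaussian mean are our choices.

```python
from math import erf, exp, pi, sqrt

def _phi(x):   # standard normal pdf
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def _Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def trunc_mean(lo, hi, sigma):
    """E[z | z in (lo, hi)] for z ~ N(0, sigma^2)."""
    a, b = lo / sigma, hi / sigma
    denom = _Phi(b) - _Phi(a)
    if denom < 1e-12:                 # degenerate interval
        return 0.5 * (lo + hi)
    return sigma * (_phi(a) - _phi(b)) / denom

def et_predictor(x_prev_el, a, b, rho, sigma):
    """EL predictor of (4.4): rho*x̂^e_{k-1} plus the conditional mean
    of z_k restricted to the BL quantization interval (a, b)."""
    m = rho * x_prev_el
    return m + trunc_mean(a - m, b - m, sigma)

def et_reconstruction(x_prev_el, a, b, c, d, rho, sigma):
    """EL reconstruction of (4.6): intersect the BL interval (a, b)
    with the EL residual interval (c, d) shifted by the predictor."""
    t = et_predictor(x_prev_el, a, b, rho, sigma)
    e, f = max(a, c + t), min(b, d + t)
    m = rho * x_prev_el
    return m + trunc_mean(e - m, f - m, sigma)
```

Note that the predictor and the reconstruction always fall inside the BL interval $(a,b)$, and that for $\rho = 0$ the predictor reduces to the conditional mean of $z_k$ on $(a,b)$, i.e., the BL centroid reconstruction.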
4.3.2 Formulation as a Distributed Source Coding Problem

The main disadvantage of the ET approach for multi-layer coding resides in its complexity, since multiple motion-compensated prediction loops are necessary for EL predictive coding. For example, in order to encode $EL_{21}$ in Fig. 4.2, the exact reproduction of $EL_{11}$ must be available at the encoder. If the encoder complexity is limited, it may not be practical to generate all possible reconstructions of the reference frame at the encoder. In particular, in our work we assume that the encoder can generate only the reconstructed BL, and does not generate any EL reconstruction, i.e., none of the $EL_{ij}$ in Fig. 4.2 are available at the encoder. Under this constraint we seek efficient ways to exploit the temporal correlation between the ELs of consecutive frames. In this chapter, we propose to cast the EL prediction as a Wyner-Ziv problem, using Wyner-Ziv coding to replace the closed loop between the respective ELs of neighboring frames. We first focus on the case of two-layer coders, which can be easily extended to multi-layer coding scenarios. The basic difference at the encoder between CLP techniques, such as ET, and our problem formulation is illustrated in Fig. 4.3. A CLP technique would compute an EL predictor

$$\tilde{x}^e_k = f(\hat{x}^e_{k-1}, \hat{x}^b_k) \tag{4.7}$$

where $f(\cdot)$ is a general prediction function (in the ET case $f(\cdot)$ would be defined as in (4.4)). Then, the EL encoder would quantize the residual $r^e_k$ in (4.5) and send it to the decoder. Instead, in our formulation, we assume that the encoder can only access $\hat{x}^b_k$, while the decoder has access to both $\hat{x}^b_k$ and $\hat{x}^e_{k-1}$. Therefore, the encoder cannot generate the same predictor $\tilde{x}^e_k$ as in (4.7) and cannot explicitly generate $r^e_k$. Note, however, that $\hat{x}^b_k$, one of the components in (4.7), is in fact available at the encoder, and exhibits some correlation with $x_k$. This suggests making use of $\hat{x}^b_k$ at the encoder.
First, we can rewrite $r^e_k$ as

$$r^e_k = x_k - \tilde{x}^e_k = (x_k - \hat{x}^b_k) - (\tilde{x}^e_k - \hat{x}^b_k) \tag{4.8}$$

and then, to make explicit how this can be cast as a Wyner-Ziv coding problem, let $u_k = x_k - \hat{x}^b_k$ and $v_k = \tilde{x}^e_k - \hat{x}^b_k$. With this notation, $u_k$ plays the role of the input signal and $v_k$ plays the role of SI available at the decoder only. We can view $v_k$ as the output of a hypothetical communication channel with input $u_k$ corrupted by correlation noise. Therefore, once the correlation between $u_k$ and $v_k$ has been estimated, the encoder can select an appropriate channel code and send the relevant coset information such that the decoder can obtain the correct $u_k$ with SI $v_k$. Section 4.4 will present techniques to efficiently estimate the correlation parameters at the encoder. In order to provide a representation with multiple layers, we generate the residue $u_k$ for a frame and represent this information as a series of bit-planes. Each bit-plane contains the bits at a given significance level obtained from the absolute values of all DCT coefficients in the residue frame (the difference between the base layer reconstruction and the original frame).

[Figure 4.3: Basic difference at the encoder between CLP techniques such as ET and our proposed problem: (a) CLP techniques, (b) our problem setting.]

The sign bit of each DCT coefficient is coded once, in the bit-plane where that coefficient becomes significant (similar to what is done in standard bit-plane based wavelet image coders). Note that this would be the same information transmitted by an MPEG-4 FGS technique. However, differently from the intra bit-plane coding in MPEG-4 FGS, we create a multi-layer Wyner-Ziv prediction link, connecting a given bit-plane level in successive frames. In this way we can exploit the temporal correlation between corresponding bit-planes of $u_k$ and $v_k$, without reconstructing $v_k$ explicitly at the encoder.
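The bit-plane representation of the residue $u_k$ described above can be sketched as follows. This is our minimal illustration of the generic scheme (the function name and list representation are ours): planes are emitted MSB-first, and each coefficient's sign is emitted exactly once, in the plane where the coefficient first becomes significant.

```python
def to_bitplanes(residue, num_planes):
    """Split integer residual coefficients into MSB-first bit-planes of
    |u|, emitting each coefficient's sign bit (0 = non-negative, 1 =
    negative) in the plane where it first becomes significant."""
    planes = []
    significant = [False] * len(residue)
    for l in range(num_planes - 1, -1, -1):       # MSB first
        bits, signs = [], []
        for i, u in enumerate(residue):
            bit = (abs(u) >> l) & 1
            bits.append(bit)
            if bit and not significant[i]:
                significant[i] = True
                signs.append(0 if u >= 0 else 1)  # sign coded once
        planes.append((bits, signs))
    return planes
```

For example, the residues [5, -3, 0] over three planes yield the magnitude planes [1,0,0], [0,1,0], [1,1,0], with the sign of 5 emitted in the first plane and the sign of -3 in the second.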
4.4 Proposed Correlation Estimation

Wyner-Ziv techniques are often advocated because of their reduced encoding complexity. It is important to note, however, that their compression performance depends greatly on the accuracy of the correlation parameters estimated at the encoder. This correlation estimation can come at the expense of increased encoder complexity, thus potentially eliminating the complexity advantages of WZC techniques. In this section, we propose estimation techniques to achieve a good tradeoff between complexity and coding performance.

Figure 4.4: Discrete memoryless channel model for coding u_k: (a) binary channel for bit-planes corresponding to absolute values of frequency coefficients (i.e., u_{k,l} at bit-plane l), (b) discrete memoryless channel with binary inputs ("-1" if u^l_k < 0 and "1" if u^l_k > 0) and three outputs ("-1" if v^l_k < 0, "1" if v^l_k > 0 and "0" if v^l_k = 0) for sign bits.

4.4.1 Problem Formulation

Our goal is to estimate the correlation statistics (e.g., the matrix of transition probabilities in a discrete memoryless channel) between bit-planes of the same significance in u_k and v_k. To do so, we face two main difficulties. First, and most obvious, \hat{x}^e_{k-1}, and therefore v_k, are not generated at the encoder, as shown in Fig. 4.3. Second, v_k is generated at the decoder by using the predictor \tilde{x}^e_k from (4.7), which combines \hat{x}^e_{k-1} and \hat{x}^b_k. In Section 4.4.2 we will discuss the effect of these combined predictors on the estimation problem, with a focus on our proposed mode-switching algorithm.

In what follows, the most significant bit-plane is given the index "1", the next most significant bit-plane the index "2", and so on. u_{k,l} denotes the l-th bit-plane of absolute values of u_k, while u^l_k indicates the reconstruction of u_k (including the sign information) truncated to its l most significant bit-planes.
The same notation will be used for other signals represented in terms of their bit-planes, such as v_k.

In this work, we assume the channel between the source u_k and the decoder SI v_k to be modeled as shown in Fig. 4.4. With a binary source u_{k,l}, the corresponding bit-plane of v_k, v_{k,l}, is assumed to be generated by passing this binary source through a binary channel. In addition to the positive (symbol "1") and negative (symbol "-1") sign outputs, an additional output symbol "0" is introduced in the sign bit channel to represent the case when SI v_k = 0.

We propose two different methods to estimate crossover probabilities, namely, (1) a direct estimation (Section 4.4.3), which generates estimates of the bit-planes first and then directly measures crossover probabilities for these estimated bit-planes, and (2) a model-based estimation (Section 4.4.4), where a suitable model for the residue signal (u_k - v_k) is obtained and used to estimate the crossover probabilities in the bit-planes. These two methods will be evaluated in terms of their computational requirements, as well as their estimation accuracy.

4.4.2 Mode-Switching Prediction Algorithm

As discussed in Section 4.3, the decoder has access to two SIs, \hat{x}^e_{k-1} and \hat{x}^b_k. Consider first the prediction function in (4.7) when both SIs are known. In the ET case, f(\cdot) is defined as an optimal prediction as in (4.4) based on a given statistical model of z_k. Alternatively, the optimal predictor \tilde{x}^e_k can be simplified to either \hat{x}^e_{k-1} or \hat{x}^b_k for a two-layer coder, depending on whether the temporal correlation is strong (choose \hat{x}^e_{k-1}) or not (choose \hat{x}^b_k).

Here we choose the switching approach due to its lower complexity, as compared to the optimal prediction, and also because it is amenable to an efficient use of "conditional" entropy coding. Thus, a different channel code could be used to code u_k when \tilde{x}^e_k ≈ \hat{x}^b_k and when \tilde{x}^e_k ≈ \hat{x}^e_{k-1}.
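The transition probabilities of the binary channel in Fig. 4.4(a) can be measured by simple counting when both bit-planes are available. A minimal sketch (our own helper; the normalization guard for empty symbol classes is an implementation choice):

```python
import numpy as np

def crossover_probs(u_plane, v_plane):
    """Estimate the asymmetric binary channel parameters
    p01 = Pr(v=1 | u=0) and p10 = Pr(v=0 | u=1) by counting
    disagreements between co-located bits of the source bit-plane
    and the side-information bit-plane."""
    u = np.asarray(u_plane).ravel()
    v = np.asarray(v_plane).ravel()
    n0 = np.count_nonzero(u == 0)
    n1 = u.size - n0
    p01 = np.count_nonzero((u == 0) & (v == 1)) / max(n0, 1)
    p10 = np.count_nonzero((u == 1) & (v == 0)) / max(n1, 1)
    return p01, p10
```

Because the two crossover rates are estimated separately per input symbol, the asymmetry noted later in Table 4.1 (p_01 much smaller than p_10) is captured directly.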
In fact, if \tilde{x}^e_k = \hat{x}^b_k, then v_k = 0, and we can code u_k directly via entropy coding, rather than using channel coding. If \tilde{x}^e_k = \hat{x}^e_{k-1}, we apply WZC to u_k with the estimated correlation between u_k and v_k.

For a multi-layer coder, the temporal correlation usually varies from bit-plane to bit-plane, and thus the correlation should be estimated at each bit-plane level. Therefore, the switching rules we just described should be applied before each bit-plane is transmitted. We allow a different prediction mode to be selected on a macroblock (MB) by macroblock basis (allowing adaptation of the prediction mode for smaller units, such as blocks or DCT coefficients, may be impractical). At bit-plane l, the source u_k has two SIs available at the decoder: u^{l-1}_k (the reconstruction from its more significant bit-planes), and \hat{x}^e_{k-1} (the EL reconstruction from the previous frame). The correlation between u_k and each SI is estimated as the absolute sum of their difference. When both SIs are known, the following parameters are defined for each MB:

E_intra = \sum_{MB_i} |u_k - u^{l-1}_k|
E_inter = \sum_{MB_i} |u_k - (\hat{x}^e_{k-1} - \hat{x}^b_k)| = \sum_{MB_i} |x_k - \hat{x}^e_{k-1}|,   (4.9)

where only the luminance component is used in the computation. Thus, we can make the mode decision as follows: WZS-MB (coding of the MB via WZS) mode is chosen if

E_inter < E_intra.   (4.10)
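The decision rule (4.9)-(4.10) can be sketched per macroblock as follows. This is a sketch for the idealized case where both SIs are known (at the encoder, \hat{x}^e_{k-1} is unavailable and is later replaced by an approximation); all array and function names are illustrative:

```python
import numpy as np

def select_mb_mode(x_mb, xb_hat_mb, u_recon_mb, xe_hat_prev_mb):
    """Per-macroblock mode switch following (4.9)-(4.10): compare the
    residual against the spatial SI (reconstruction from more significant
    bit-planes, u^{l-1}_k) and against the temporal SI (previous-frame EL
    reconstruction). Luminance samples only."""
    u = x_mb.astype(np.int64) - xb_hat_mb                 # u_k for this MB
    e_intra = np.abs(u - u_recon_mb).sum()                # vs u^{l-1}_k
    e_inter = np.abs(x_mb.astype(np.int64) - xe_hat_prev_mb).sum()  # vs EL recon
    return "WZS-MB" if e_inter < e_intra else "FGS-MB"
```

The second line of (4.9) is used directly: since u_k = x_k - \hat{x}^b_k, the temporal term reduces to |x_k - \hat{x}^e_{k-1}|.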
Thus, because it does not know which MBs will be decoded using each type of SI, the encoder has to encode all information under the assumption of a single "aggregate" correlation model for all blocks. This prevents the full use of the conditional coding techniques discussed earlier.

Alternatively, making mode decisions at the encoder provides more flexibility, as different coding techniques can be applied to each block. The main drawback of this approach is that the SI \hat{x}^e_{k-1} is not available at the encoder, which makes the mode decision difficult and possibly suboptimal. In this chapter, we choose to make mode decisions at the encoder, with mode-switching decisions based on the estimated levels of temporal correlation. Thus E_inter cannot be computed exactly at the encoder as defined in (4.9), since \hat{x}^e_{k-1} is unknown; this will be further discussed once specific methods to approximate E_inter at the encoder have been introduced.

4.4.3 Direct Estimation

For the l-th bit-plane, 1 ≤ l ≤ L, where L is the least significant bit-plane level to be encoded, we need to estimate the correlation between u_{k,l} and v_k given all u_{k,j} (1 ≤ j < l) which have been sent to the decoder. While, in general, all the information received by the decoder can be used for decoding u_k, here we estimate the correlation under the assumption that, to decode bit-plane l, we use only the l most significant bit-planes of the previous frame. The SI for bit-plane l in this particular case is denoted by \check{v}_k(l), which is unknown at the encoder. We compute \bar{v}_k(l) at the encoder to approximate \check{v}_k(l), 1 ≤ l ≤ L. Ideally, we would like the following requirements to be satisfied: (1) the statistical correlation between each bit-plane u_{k,l} and \check{v}_k(l), given all u_{k,j} (1 ≤ j < l), can be well approximated by the corresponding correlation between u_{k,l} and \bar{v}_k(l); and (2) \bar{v}_k(l) can be obtained at the encoder in a simple way, without much added computational complexity. This can be achieved by processing the original reference frame x_{k-1} at the encoder. We first calculate the residual

s_k = x_{k-1} - \hat{x}^b_k   (4.11)

at the encoder, and then generate bit-planes s^l_k in the same way as the u^l_k are generated. Let \bar{v}_k(l) = s^l_k for 1 ≤ l ≤ L. While \bar{v}_k(l) and \check{v}_k(l) are not equal, the correlation between \bar{v}_k(l) and u_{k,l} provides a good approximation to the correlation between \check{v}_k(l) and u_{k,l}, as is seen in Fig. 4.5, which shows the probability that u^l_k ≠ s^l_k (i.e., that the values of u_k and s_k do not fall into the same quantization bin), as well as the corresponding crossover probability between u_k and the decoder SI \check{v}_k(l). The crossover probability here is an indication of the correlation level.

Figure 4.5: Measurement of approximation accuracy for Akiyo and Foreman sequences. The crossover probability is defined as the probability that the values of the source u_k and the side information do not fall into the same quantization bin. The average and maximum absolute differences over all frames between the two crossover probabilities are also shown.

The SI s^l_k can be used by the encoder to estimate the level of temporal correlation, which is in turn used to perform mode switching and to determine the encoding rate of the channel codes applied to MBs in WZS-MB mode. Replacing the term (\hat{x}^e_{k-1} - \hat{x}^b_k) in (4.9) by s^l_k, E_inter is redefined as

E_inter = \sum_{MB_i} |u_k - s^l_k|.   (4.12)

Clearly, the larger E_intra, the more bits will be required to refine the bit-plane in FGS-
Similarly E inter gives an indication of the correlation present in the i-th MB between u l k and s l k , which are approximations of u k and v k at the l-th bit-plane, respectively. TocodeMBsinWZS-MBmode,wecanfurtherapproximatetheEToptimal predictor in (4.4) by taking into account both SIs, u l¡1 k and s l k , as follows: If s k is within the quantization bin specified by u l¡1 k , the EL predictor is set to s l k ; however, if s k is outside that quantization bin, the EL predictor is constructed by first clipping s k to the closest value within the bin and then truncating this new value to its l most significant bit-planes. For simplicity, we still denote the improved EL predictor of the lth bit-plane as s l k in the following discussion. 98 Table 4.1: Channel parameters and the a priori probabilities for the 3rd bit-plane of frame 3 of Akiyo CIF sequence when BL quantization parameter is 20 (with the same symbol notation as Fig. 4.4). Pr(u k;l =1) p 01 p 10 Pr(sign(u l k )=1) ® ¯ 0.13 0.019 0.14 0.49 0.13 0.001 At bit-plane l, the rate of the channel code used to code u k;l (or the sign bits that correspond to that bit-plane) for MBs in WZS-MB mode is determined by the encoder based on the estimated conditional entropy H(u k;l js k;l ) (or H(sign(u l k )jsign(s l k )) ). For discrete random variables X and Y, H(XjY) can be written as H(XjY)= X y i Pr(Y =y i )H(XjY =y i ); (4.13) wherebothPr(Y =y i )andH(XjY =y i )canbeeasilycalculatedoncethe a priori prob- ability of X and the transition probability matrix are known. The crossover probability, for example p 01 in Fig. 4.4 (a), is derived by counting the number of coefficients such that u k;l = 0 and u k;l 6= s k;l . Table 4.1 shows an example of those parameters for both u k;l and the sign bits. Note that the crossover probabilities between u k;l and s k;l are very different for source symbols 0 and 1, and therefore an asymmetric binary channel model will be needed to code u k;l as shown in Fig. 4.4 (a). 
However, the sign bit has almost the same transition probabilities whether the input is -1 or 1, and is thus modeled as a symmetric discrete memoryless channel in Fig. 4.4 (b).

In terms of complexity, note that there are two major steps in this estimation method: i) bit-plane extraction from s_k, and ii) conditional entropy calculation (including the counting to estimate the crossover probabilities). Bit-planes need to be extracted only once per frame, and this is done with a simple shifting operation on the original frame. The conditional entropy is calculated for each bit-plane based on the crossover probabilities estimated by simple counting. In Section 4.5 we will compare the complexity of the proposed WZS approach and the ET approach.

4.4.4 Model-based Estimation

In this section we introduce a model-based method for correlation estimation that has lower computational complexity, at the expense of a small penalty in coding efficiency. The basic idea is to first estimate the probability density functions (pdf) of the DCT residuals (u_k, v_k, z_k = v_k - u_k), and then use the estimated pdf to derive the crossover probabilities for each bit-plane.

Assume that u_k, v_k, z_k are independent realizations of the random variables U, V, and Z, respectively. Furthermore, assume that V = U + Z, with U and Z independent. We start by estimating the pdf's f_U(u) and f_Z(z). This can be done by choosing appropriate models for the data samples, and estimating the model parameters using one of the standard parameter estimation techniques, e.g., maximum likelihood estimation, expectation-maximization (EM), etc. Note that since the v_k are not available at our encoder, we use s_k to approximate v_k in the model parameter estimation.

Once we have estimated f_U(u) and f_Z(z), we can derive the crossover probabilities at each bit-plane as follows. Recall that we consider there is no crossover when u_k, v_k fall into the same quantization bin.
This corresponds to the event denoted by the shaded square regions in Fig. 4.6. Hence we can find the estimate of the crossover probability at bit-plane l (denoted as \hat{p}(l)) by

\hat{p}(l) = 1 - I(l),   (4.14)

where I(l) is given by

I(l) = \sum_i \iint_{A_i} f_{UV}(u,v) du dv = \sum_i \iint_{A_i} f_U(u) f_{V|U}(v|u) du dv.   (4.15)

I(l) is simply the probability that U, V fall into the same quantization bin. The conditional pdf f_{V|U}(v|u) can be obtained as

f_{V|U}(v|u) = f_Z(v - u)   (4.16)

and the integral in (4.15) can be readily evaluated for a variety of densities. In practice we only need to sum over the few regions A_i where the integrals are non-zero.

Figure 4.6: Crossover probability estimation. The shaded square regions A_i correspond to the event that crossover does not occur at bit-plane l.

We found that U and Z can be well-modeled by mixtures of two zero-mean Laplacians with different variances. We use the EM algorithm to obtain the maximum-likelihood estimates of the model parameters, and use (4.15) and (4.16) to compute the estimates of the crossover probabilities.

The main advantage of this model-based estimation approach as compared with the direct estimation is that it incurs less complexity and requires less frame data to be measured. In our experiments, the EM was operating on only 25% of the frame samples. Moreover, since the model parameters do not vary very much between consecutive frames (Fig. 4.7), it is viable to use the previous estimates to initialize the current estimation, and this usually leads to convergence within a few iterations. Once we have found the model parameters, computing the crossover probability of each bit-plane from them requires only negligible complexity, since this can be done using closed-form expressions obtained from the integrals in (4.15). However, the approach suffers some loss in compression efficiency due to the inaccuracy in the estimation.
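The evaluation of (4.14)-(4.16) can be sketched numerically as below. This is our own illustration, not the closed-form evaluation used in the chapter: it assumes a two-component zero-mean Laplacian mixture for both U and Z, and a uniform bin layout of width 2^l (an illustrative simplification of the dead-zone structure in Fig. 4.6), integrating f_U(u) f_Z(v - u) over the diagonal squares A_i by the midpoint rule:

```python
import numpy as np

def laplace_mix_pdf(x, w, b1, b2):
    """Two-component zero-mean Laplacian mixture density; w is the
    mixing probability of the first component, b1/b2 are scales."""
    return (w * np.exp(-np.abs(x) / b1) / (2 * b1)
            + (1 - w) * np.exp(-np.abs(x) / b2) / (2 * b2))

def crossover_estimate(l, fu, fz, num_bins=16, grid=64):
    """Numerically evaluate p_hat(l) = 1 - I(l) from (4.14)-(4.16):
    integrate f_U(u) * f_Z(v - u) over the squares where u and v share
    a quantization bin of width 2^l."""
    step = float(2 ** l)
    I = 0.0
    for i in range(-num_bins, num_bins):
        lo = i * step
        d = step / grid
        centers = lo + (np.arange(grid) + 0.5) * d   # midpoint rule
        uu, vv = np.meshgrid(centers, centers)
        I += float(np.sum(fu(uu) * fz(vv - uu)) * d * d)
    return 1.0 - I
```

With wider bins (larger l) the same-bin probability I(l) grows, so the crossover estimate decreases, matching the intuition that coarser bit-plane levels are easier to predict.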
We can assess the compression efficiency by evaluating the entropy function on the estimates of the crossover probabilities (which gives the theoretical limit in compressing the bit-planes given the estimates [46]), and comparing it to that of the direct estimation. Experiments using video frames from the Akiyo sequence show that with the base layer quantization parameter (QP) set to 31 and 20, the percentage differences in entropy are about 2.5% and 4.7%, respectively. However, the percentage difference is 21.3% when the base layer QP is set to 8. This large deviation is due to the fact that with QP equal to 8, the base layer is of very high quality, so that the distribution of U has a higher probability of zero, which is not well captured by our model. Note, however, that such high quality base layer scenarios are in general of limited practical interest.

Figure 4.7: Model parameters of u_k estimated by EM using the video frames from Akiyo (mixing probability and standard deviations of the two Laplacian components, plotted against frame number).

4.5 Codec Architecture and Implementation Details

Fig. 4.8 depicts the WZS encoding and decoding diagrams, implemented based on the MPEG-4 FGS codec. Let X_k, \hat{X}^b_k and \hat{X}^e_k be the current frame, its BL and its EL reconstructed frames, respectively.

Figure 4.8: Diagram of the WZS encoder and decoder. (a) WZS encoder, (b) WZS decoder.
FM: frame memory, ME: motion estimation, MC: motion compensation, SI: side information, BL: base layer, EL: enhancement layer, VLC: variable-length encoding, VLD: variable-length decoding.

4.5.1 Encoding Algorithm

At the base layer, the prediction residual e_k in the DCT domain, as shown in Fig. 4.8 (a), is given by

e_k = T(X_k - MC_k[\hat{X}^b_{k-1}]),   (4.17)

where T(\cdot) is the DCT transform, and MC_k[\cdot] is the motion-compensated prediction of the k-th frame given \hat{X}^b_{k-1}. The reconstruction of e_k after base layer quantization and dequantization is denoted by \hat{e}^b_k. Then, at the enhancement layer, as in Section 4.3.2, we define

u_k = e_k - \hat{e}^b_k = T(X_k - MC_k[\hat{X}^b_{k-1}]) - \hat{e}^b_k.   (4.18)

The encoder SI s_k is constructed in a similar way as (4.11), while taking into account the motion compensation and the DCT transform:

s_k = T(MC_k[X_{k-1}] - \hat{X}^b_k).   (4.19)

Both u_k and s_k are converted into bit-planes. Based on the switching rule given in Section 4.4.2, we define our mode selection algorithm as shown in Fig. 4.9. At each bit-plane, we first decide the coding mode on the MB basis as in Fig. 4.9 (a), and then in each MB we decide the corresponding modes at the DCT block level to include the two special cases ALL-ZERO and WZS-SKIP (see Fig. 4.9 (b)). In either ALL-ZERO or WZS-SKIP mode, no additional information is sent to refine the block. The ALL-ZERO mode already exists in the current MPEG-4 FGS syntax. For a block coded in WZS-SKIP, the decoder just copies the corresponding block of the reference frame.¹ All the blocks in FGS mode are coded directly using MPEG-4 FGS bit-plane coding.

Figure 4.9: Block diagram of the mode selection algorithm: (a) MB-based, (b) block-based.
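The signal formation in (4.17)-(4.19) can be sketched per 8x8 block as follows (a minimal sketch: the DCT helper and all argument names are ours, and motion compensation is assumed to have been applied upstream to the two MC inputs):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (rows = frequencies)."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def encoder_signals(x, mc_xb_prev, eb_hat, mc_x_prev, xb_hat):
    """Form the EL source u_k (4.18) and the encoder SI s_k (4.19)
    for one 8x8 block. eb_hat is the dequantized BL residual in the
    DCT domain; the other inputs are pixel-domain blocks."""
    C = dct_matrix(8)
    T = lambda b: C @ b @ C.T        # 2-D orthonormal DCT
    u = T(x - mc_xb_prev) - eb_hat   # (4.18)
    s = T(mc_x_prev - xb_hat)        # (4.19)
    return u, s
```

Note that u_k vanishes when the base layer reproduces the residual exactly, and s_k vanishes when the previous original frame (after MC) matches the current BL reconstruction, which is the sanity check used below.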
For blocks in WZS mode, we apply channel codes to exploit the temporal correlation between neighboring frames. Here, we choose low-density parity check (LDPC) codes [46,47] for their low probability of undetectable decoding errors and near-capacity coding performance. An (n,k) LDPC code is defined by its parity-check matrix H of size n x (n-k). Given H, to encode an arbitrary binary input sequence c of length n, we multiply c with H and output the corresponding syndrome z of length (n-k) [46]. In a practical implementation, this involves only a few binary additions due to the low-density property of LDPC codes. At bit-plane l, we first code the binary numbers u_{k,l} for all coefficients in the WZS blocks, using LDPC codes to generate syndrome bits at a rate determined by the conditional entropy in (4.13). We leave a margin of about 0.1 bits above the Slepian-Wolf limit (i.e., the conditional entropy) to ensure that the decoding error is negligible. Then, for those coefficients that become significant in the current bit-plane (i.e., coefficients that were 0 in all the more significant bit-planes and become 1 in the current bit-plane), their sign bits are coded in a similar way, using the sign bits of the corresponding s_k as SI.

¹The WZS-SKIP mode may introduce some small errors due to the difference between the SI at the encoder and decoder.

The adaptivity of our scalable coder comes at the cost of extra coding overhead. It includes: (1) the prediction modes for MBs and DCT blocks, (2) the a priori probability for u_{k,l} (based on our experiments we assume a uniform distribution for sign bits) and the channel parameters, and (3) the encoding rate (1 - k/n). A 1-bit syntax element is used to indicate the prediction mode for each MB at each bit-plane.
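The syndrome-forming step described above is just a GF(2) matrix product. A toy sketch (a small dense matrix stands in for a real sparse LDPC parity-check matrix):

```python
import numpy as np

def syndrome(H, c):
    """Syndrome former for Slepian-Wolf coding: multiply a length-n bit
    sequence c by the n x (n-k) parity-check matrix H over GF(2); the
    resulting n-k bits index the coset of c."""
    return (np.asarray(c) @ np.asarray(H)) % 2
```

Two sequences that differ by a codeword (a sequence with all-zero syndrome) fall in the same coset and produce the same syndrome, which is exactly the property the decoder exploits: it searches the signaled coset for the sequence most consistent with the side information.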
MPEG-4 FGS defines a most significant bit-plane level for each frame, which is found by first computing the residue with respect to the corresponding base layer for the frame and then determining the minimum number of bits needed to represent the largest DCT coefficient in the residue. Clearly, this most significant bit-plane level varies from frame to frame. Note that the representation of many DCT blocks in a given frame is likely to require fewer bit-planes than the maximum number of bit-planes for the frame. Thus, for these blocks, the first few most significant bit-planes to be coded are likely to be all zero (for these blocks the residual energy after interpolation using the base layer is low, so that most DCT coefficients will be relatively small). To take advantage of this, the MB prediction mode for a given bit-plane is not sent if all its six DCT blocks are ALL-ZERO. Note also that the number of bits needed to represent the MB mode is negligible for the least significant bit-planes, as compared to the number of bits needed to code the bit-planes. It is also worth pointing out that this mode selection overhead would be required as well for a closed-loop coder that attempts to exploit temporal correlation through the mode-switching algorithm.

Table 4.2: Coding overhead for the News sequence.

Bit-plane               | 1    | 2   | 3   | 4
Overhead percentage (%) | 19.8 | 9.6 | 7.5 | 4.6

For an MB in WZS-MB mode, the block mode (either WZS or WZS-SKIP) is signaled by an additional 1-bit syntax element. This overhead depends on the number of MBs in WZS-MB mode, and good entropy coding can be applied to reduce it, since we have observed in our experiments that the two modes have biased probabilities (see Fig. 4.11). The encoding rate of the syndrome codes varies from 1/64 to 63/64 in incremental steps of size 1/64, and thus 6 bits are used to code the selected encoding rate. We use a fixed-point 10-bit representation for the different kinds of probabilities to be sent to the decoder.
An example of the total overhead percentage at each bit-plane, calculated as the ratio between the number of overhead bits and the total number of bits used to code the bit-plane, is given in Table 4.2 for the News sequence.

4.5.2 Decoding Algorithm

Decoding of the EL bit-planes of X_k proceeds by using the EL reconstruction of the previous frame, \hat{X}^e_{k-1}, to form the SI for each bit-plane. The syndrome bits received are used to decode the blocks in WZS mode. The procedure is the same as at the encoder, except that the original frame X_{k-1} is now replaced by the high quality reconstruction \hat{X}^e_{k-1} to generate the SI

v_k = T(MC_k[\hat{X}^e_{k-1}] - \hat{X}^b_k).   (4.20)

The corresponding SI at each bit-plane is formed by converting v_k into bit-planes. The decoder performs sequential decoding, since decoding a particular bit-plane can only be done after the more significant bit-planes have been decoded.

We modified the conventional LDPC software [47,52] for the Slepian-Wolf approach by taking the syndrome information into account during the decoding process based on probability propagation. We follow a method similar to that described in [46,98] to force the search of the most probable codeword in the specified coset determined by the syndrome bits. One main difference is that the a priori probability of the source bits u_{k,l} (p_0 = Pr(u_{k,l} = 0) and p_1 = 1 - p_0) is also considered in the decoding process. The log-likelihood ratio for each variable node at bit-plane l is given by

LLR = log [Pr(u_{k,l} = 1 | v_{k,l}) / Pr(u_{k,l} = 0 | v_{k,l})]
    = log [p_10 / (1 - p_01)] + log (p_1 / p_0),   if v_{k,l} = 0,
    = log [(1 - p_10) / p_01] + log (p_1 / p_0),   if v_{k,l} = 1,   (4.21)

where p_ij is the crossover probability defined in Fig. 4.4 (a). The syndrome information is considered in the same way as in [46] when calculating the likelihood ratio at the check nodes.

4.5.3 Complexity Analysis

In our approach, the base layer structure is the same as in an MPEG-4 FGS system.
An additional set of frame memory, motion-compensation (MC) and DCT modules is introduced for the EL coding at both the encoder and decoder. The MC and DCT operations are done only once per frame, even for multi-layer coding. In comparison, the ET approach requires multiple motion-compensated prediction loops, each of which needs a separate set of frame memory, MC and DCT modules, as well as additional dequantization and IDCT modules to obtain each EL reconstruction. More importantly, for each EL, the ET approach needs to repeat all the operations such as reconstruction and prediction. Though our proposed approach requires correlation estimation at the encoder, as discussed in Section 4.4, the additional complexity involved is very limited, consisting of simple shifting, comparison and +/- operations. Therefore, the proposed approach can be implemented with lower complexity, even for multiple layers.

It should be noted that the complexity associated with reconstructing the enhancement layers can be a significant portion of the overall encoding complexity in a closed-loop scalable encoder. While it is true that full-search motion estimation (ME) (in the base layer) may require a large amount of computational power, practical encoders will employ some form of fast ME, and the complexity of the ME module can be substantially reduced. For example, [14] reports that ME (full-pel and sub-pel) takes only around 50% of the overall complexity in a practical non-scalable video encoder employing fast ME. As a result, the complexity of closing the loop (motion compensation, forward and inverse transforms, quantization and inverse quantization) becomes a significant fraction of overall codec complexity.

Figure 4.10: WZS-MB percentage for sequences in CIF and QCIF formats (BL quantization parameter = 20, frame rate = 30 Hz).
Moreover, these operations need to be performed in every enhancement layer of a closed-loop scalable system (while ME is usually performed only in the base layer). In addition to the computational complexity reduction, our system does not need to allocate frame buffers to store the reconstructions in each enhancement layer. This can lead to considerable savings in memory usage, which may be important for embedded applications.

4.6 Experimental Results

Several experiments have been conducted to test the performance of the proposed WZS approach. We implemented a WZS video codec based on the MPEG-4 FGS reference software. In the experiments, we used the direct correlation estimation method, as it can lead to better compression efficiency as compared to the model-based approach.

Figure 4.11: Percentages of different block modes for Akiyo and Coastguard sequences (BL quantization parameter = 20, frame rate = 30 Hz).

4.6.1 Prediction Mode Analysis

In this section we analyze the block prediction modes at each bit-plane for various video sequences. Fig. 4.10 shows that the percentage of MBs in WZS-MB mode exceeds 50% for most video sequences (in some cases surpassing 90%, as in bit-plane 3 for Akiyo and Container Ship). Therefore there is potentially a large coding gain over MPEG-4 FGS with our proposed approach. The percentage of MBs in WZS-MB mode is on average higher for low-motion sequences (such as Akiyo) than for high-motion sequences (such as Coastguard), especially for lower significance bit-planes. Moreover, this percentage varies from bit-plane to bit-plane.
For the most significant bit-planes, the FGS-MB mode tends to be dominant for some sequences (such as Akiyo and News), due to the low quality of the EL reconstruction of the previous frame. As the reconstruction quality improves with more decoded bit-planes, the temporal correlation is higher and the WZS-MB mode becomes dominant, for example for bit-planes 2 and 3 in Fig. 4.10. However, the WZS-MB percentage starts to drop again for even lower significance bit-planes. This is because the temporal correlation decreases for these bit-planes, which tend to be increasingly "noise-like".

The DCT block mode distribution in Fig. 4.11 illustrates how the motion characteristics of the source sequence affect the relative frequency of occurrence of each MB type. The Akiyo sequence has a much larger WZS-SKIP percentage, and a larger percentage of WZ coded blocks, than Coastguard; thus Akiyo sees more significant reductions in coding rate when WZS is introduced. In contrast, for Coastguard the percentage of blocks in WZS mode is less than that in FGS mode starting at bit-plane 4, thus showing that as motion in the video sequence increases, the potential benefits of exploiting temporal correlation in the manner proposed in this chapter decrease. Note that neither Fig. 4.10 nor Fig. 4.11 includes the two least significant bit-planes, since the PSNR range for these bit-planes is not of practical interest.

4.6.2 Rate-distortion Performance

4.6.2.1 Coding efficiency of WZS

In this section we evaluate the coding efficiency of the proposed WZS approach. Simulation results are given for a series of test sequences in CIF (352x288) and QCIF (176x144) resolutions with a frame rate of 30 Hz. The Akiyo and Container Ship sequences have limited motion and low spatial detail, while the Coastguard and Foreman sequences have higher motion and more spatial detail. The News sequence is similar to Akiyo, but with more background motion.
In addition to MPEG-4 FGS and nonscalable (single layer) coding, we also compare our proposed approach with a multi-layer closed-loop (MCLP) system that exploits EL temporal correlation through multiple motion-compensation loops at the encoder. The same MPEG-4 baseline video coder is used for all the experimental systems (note that the proposed WZS framework does not inherently require the use of a specific BL video coder). The first video frame is intra-coded and all subsequent frames are coded as P-frames (i.e., IPPP...). The BL quantization parameter (QP) is set to 20. Prior to reporting the simulation results, we give a brief description of our proposed system together with the MCLP system.

Proposed WZS system. The DCT blocks are coded in four different modes as described in Section 4.6.1. An LDPC code is used to code those blocks in WZS mode at each bit-plane to exploit the EL correlation between adjacent frames. The encoding rate is determined by the correlation estimated at the encoder, without constructing multiple motion-compensation loops. To limit the error introduced by the WZS-SKIP mode due to the small difference between the encoder and decoder SI, we disable WZS-SKIP mode once every 10 frames in our implementation.

Multiple closed-loop (MCLP) system. This system is an approximation to the ET approach discussed in Section 4.3.1 through the mode-switching algorithm. We describe the coding procedure for each enhancement layer as follows. To code an EL which corresponds to the same quality achieved by bit-plane l in MPEG-4 FGS, the encoder goes through the following steps. (i) Generate the EL reconstruction of the previous frame up to this bit-plane level, which we denote \hat{x}^l_{k-1}. (ii) Follow a switching rule similar to that proposed for the WZS system to determine the prediction mode of each MB, i.e., inter mode is chosen if E_inter < E_intra, and FGS mode is chosen otherwise.
Since the EL reconstruction is known at the encoder, it can calculate E_inter directly using the expression in (4.9). (iii) Calculate the EL residual r^e_k following (4.5), using x̂^l_{k-1} as the predictor for inter mode, and the reconstruction of the current frame with more significant ELs, x̂^{l-1}_k, as the predictor for FGS mode. (iv) Convert r^e_k to bit-planes, and code those bit-planes that are at least as significant as bit-plane l (i.e., quantize to the lth bit-plane) to generate the compressed bitstream.

Figs. 4.12-4.14 provide a comparison between the proposed WZS, the nonscalable coder, MPEG-4 FGS and the MCLP coder.

Figure 4.12: Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for Akiyo and Container Ship sequences.

Figure 4.13: Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for Coastguard and Foreman sequences.

The PSNR gain obtained by the proposed WZS approach over MPEG-4 FGS depends greatly on the degree of temporal correlation of the video sequence. For sequences with higher temporal correlation, such as Akiyo and Container Ship, the PSNR gain of WZS is greater than for sequences with lower temporal correlation, such as Foreman: 3-4.5 dB PSNR gain for the former, as compared to 0.5-1 dB for the latter.
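The switching rule of step (ii) above can be sketched as follows. This is a hedged illustration with hypothetical block data: the energies are computed here as sums of squared differences, standing in for the exact expression (4.9), and `choose_mode` is not a function of the actual codec.

```python
def energy(block, predictor):
    """Sum of squared differences between a block and its predictor."""
    return sum((a - b) ** 2 for a, b in zip(block, predictor))

def choose_mode(x_k, el_prev, fgs_pred):
    """Per-MB switching: pick inter (temporal) prediction when its residual
    energy is below that of the FGS (spatial refinement) predictor."""
    e_inter = energy(x_k, el_prev)   # predictor: EL reconstruction of frame k-1
    e_intra = energy(x_k, fgs_pred)  # predictor: frame k, more significant ELs
    return "inter" if e_inter < e_intra else "FGS"

# A block that barely changes between frames favors inter mode:
print(choose_mode([10, 12], [10, 11], [6, 7]))    # -> inter
# A block that changes a lot falls back to FGS refinement:
print(choose_mode([10, 12], [30, 40], [9, 12]))   # -> FGS
```

In the MCLP coder this decision can use the true EL reconstruction of the previous frame; in WZS the encoder must instead rely on the estimated correlation, which is the source of the encoder/decoder SI mismatch mentioned above.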
Figure 4.14: Comparison between WZS, nonscalable coding, MPEG-4 FGS and MCLP for the News sequence.

To demonstrate the efficiency of Wyner-Ziv coding for WZS blocks, we compare the proposed coder to a simplified version that uses only the ALL-ZERO, FGS, and WZS-SKIP modes (which we call the “WZS-SKIP only” coder). The “WZS-SKIP only” coder codes the WZS blocks in FGS mode instead. Fig. 4.15 shows that, for both the Akiyo and Coastguard sequences, there is a significant improvement from adding the WZS mode. Note that the PSNR values for a given bit-plane level are exactly the same for the two coders; the only difference is the number of bits used to code those blocks that are coded in WZS mode. Thus the coding gain of Wyner-Ziv coding (exploiting temporal correlation) over the original bit-plane coding (which does not exploit temporal correlation) can be quantified as a percentage reduction in rate. We present this in two different ways, as shown in Tables 4.3 and 4.4.² Table 4.3 provides the rate savings for only those blocks in WZS mode. It can be seen that Akiyo achieves a larger coding gain than Coastguard due to its higher temporal correlation. Table 4.4 provides the overall rate savings (i.e., based on

² An LDPC coder usually requires a large code length to achieve good performance. If the number of WZS blocks is not sufficient for the required code length, we force all blocks in the bit-plane to be coded in FGS mode instead. This happens, for example, for the most significant bit-plane of most sequences. Thus, only the results for bit-planes 2-4 are shown in these tables.
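The source of these rate savings, transmitting a syndrome of the bit-plane rather than the bit-plane itself, can be illustrated with a toy code. The sketch below uses a (7,4) Hamming code in place of the long LDPC codes of our implementation, recovering a 7-bit plane fragment from a 3-bit syndrome whenever the side information differs from the source in at most one position; all bit values are hypothetical.

```python
# Parity-check matrix of the (7,4) Hamming code; column j holds the binary
# representation of j+1, so a single-bit mismatch position can be read
# directly off the syndrome.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

def syndrome(bits):
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def wz_encode(x):
    """Encoder sends only the 3-bit syndrome of a 7-bit plane fragment."""
    return syndrome(x)

def wz_decode(s, y):
    """Decoder recovers x from syndrome s and side information y, provided
    x and y differ in at most one position."""
    d = [a ^ b for a, b in zip(s, syndrome(y))]
    pos = d[0] + 2 * d[1] + 4 * d[2]   # 0 means y already lies in the coset
    x_hat = list(y)
    if pos:
        x_hat[pos - 1] ^= 1            # flip the single mismatched bit
    return x_hat

x = [1, 0, 1, 1, 0, 0, 1]   # source bit-plane fragment (hypothetical)
y = [1, 0, 1, 0, 0, 0, 1]   # decoder side information, differs at position 4
assert wz_decode(wz_encode(x), y) == x
```

Three bits replace seven here; in the real system the achievable rate depends on the estimated source-SI correlation, which is why an extra encoding rate margin is added in practice.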
Figure 4.15: Comparison between WZS and “WZS-SKIP only” for Akiyo and Coastguard sequences.

Table 4.3: Rate savings due to WZS for WZS blocks only (percentage reduction in rate with respect to using FGS instead for those blocks).

Bit-plane level     2      3      4
Akiyo             24.66  31.20  26.71
Coastguard        19.98  22.91  19.19

Table 4.4: Overall rate savings due to WZS (percentage reduction in overall rate as compared to FGS).

Bit-plane level     2      3      4
Akiyo             10.98  16.61  16.11
Coastguard         8.38   8.68   6.08

the total rate needed to code the sequence). These rate savings reflect not only the efficiency of coding each WZS block by Wyner-Ziv coding but also the percentage of blocks that are coded in WZS mode.

As seen from Figures 4.12-4.14, there is still a performance gap between WZS and MCLP. We compare the main features of these two approaches that affect the coding performance in Table 4.5. It should be clarified that occasionally, at very high rates for low-motion sequences, the MCLP approach can achieve coding performance similar to (or even better than) that of the nonscalable coder. That is because bit-plane coding is more efficient than nonscalable entropy coding when compressing the high-frequency DCT coefficients.

Table 4.5: Comparisons between MCLP and WZS.

Multiple closed-loop approach:
1. Exploits temporal correlation through closed-loop predictive coding.
2. Efficient bit-plane coding (run, end-of-plane) of the EL residual in (4.5) to exploit the correlation between consecutive zeros in the same bit-plane.
3. The residual error between the source and the EL reference from the previous frame may increase the dynamic range of the difference and thus cause fluctuations in the magnitude of residue coefficients (as the number of refinement iterations grows, the magnitude of residues in some coefficients can actually increase, even if the overall residual energy decreases).
4. Each EL has to code its own sign map, and therefore for some coefficients the sign bits are coded more than once.

Wyner-Ziv scalability approach:
1. Exploits temporal correlation through Wyner-Ziv coding.
2. Channel coding techniques designed for memoryless channels cannot exploit correlation between source symbols.
3. The source information to be coded is exactly the same as the EL bit-planes in MPEG-4 FGS, and therefore there are no fluctuations in magnitude and no additional sign bits are needed.
4. An extra encoding rate margin is added to compensate for the small mismatch between encoder and decoder SI, as well as for practical channel coders, which cannot achieve the Slepian-Wolf limit exactly.

We believe that the performance gap between WZS and MCLP is mainly due to the relative inefficiency of current channel coding techniques as compared to bit-plane coding. We expect that large rate savings with respect to our present WZS implementation can be achieved if better channel codes are used that perform closer to the Slepian-Wolf limit, or if more advanced channel coding techniques are designed for more complex channels, which can take advantage of the existence of correlation among channel errors.

Table 4.6: The base layer PSNR (dB) for different QP.

Base layer QP       31     20     8
Akiyo             32.03  33.61  38.05
Container Ship    26.97  28.93  34.11
News              29.17  31.23  36.09

4.6.2.2 Rate-distortion performance vs. base layer quality

It is interesting to consider the effects of the base-layer quality on the EL performance of the WZS approach.
We use the Akiyo, Container Ship and News sequences in the experiment. Table 4.6 shows the base layer PSNR (for the luminance component only) for these sequences under different base layer quantization parameter (QP) values. The PSNR gains obtained by the proposed WZS approach over MPEG-4 FGS are plotted in Fig. 4.16. The coding gain achieved by WZS decreases if a higher quality base layer is used, as seen from Fig. 4.16 when the base layer QP decreases to 8. That is because the temporal correlation between successive frames is already well exploited by a high-quality base layer. This observation is in agreement with the analysis in Section 4.3.1.

4.6.2.3 Comparisons with Progressive Fine Granularity Scalable (PFGS) coding

The PFGS scheme proposed by Wu et al. [93] improves the coding efficiency of FGS by employing an additional motion-compensation loop to code the EL, in which several FGS bit-planes are included to exploit EL temporal correlation. Fig. 4.17 compares the coding performance of WZS and PFGS for the Foreman sequence.

Figure 4.16: The PSNR gain obtained by WZS over MPEG-4 FGS for different base layer qualities.

Figure 4.17: Comparison of WZS with MPEG-4 PFGS for the Foreman CIF sequence (base layer QP = 19, frame rate = 10 Hz). The PFGS results are provided by Wu et al. from [31].

WZS performs worse than PFGS. In addition to the limitations of current techniques for Wyner-Ziv coding, the performance gap may come from the difference in the prediction link structure between these two approaches.
WZS creates a multi-layer Wyner-Ziv prediction link to connect the same bit-plane level in successive frames. In PFGS, however, usually at least two or three FGS bit-planes are used in the EL prediction for all the bit-planes. This structure is beneficial to the most significant bit-planes (for example, the 1st or 2nd bit-plane), as they have a higher-quality reference than they would in WZS. On the other hand, our proposed WZS techniques can easily be combined with a PFGS coder, such that the more significant bit-planes are encoded in a closed-loop manner by PFGS techniques, while the least significant bit-planes are predicted through Wyner-Ziv links to exploit the remaining temporal correlation. Fig. 4.10 shows that for some sequences (especially those with low motion) the temporal correlation for some lower significance bit-planes (e.g., bit-plane 4) is still high, so that the WZS-MB mode is chosen for a considerable percentage of MBs. Thus, we expect that further gain could be achieved with our techniques over what is achievable with PFGS.

4.7 Conclusions

We have presented a new practical Wyner-Ziv scalable coding structure to achieve high coding efficiency. By using principles from distributed source coding, the proposed coder is able to exploit the enhancement-layer correlation between adjacent frames without explicitly constructing multiple motion-compensation loops, and thus reduces the encoder complexity. In addition, it has the advantage of backward compatibility with standard video codecs by using a standard CLP video coder as the base layer. Two efficient methods are proposed for correlation estimation, based on different trade-offs between complexity and accuracy at the encoder, even when the exact reconstruction value of the previous frame is unknown. Simulation results show much better performance than MPEG-4 FGS for sequences with high temporal correlation, and limited improvement for high-motion sequences.
Though we implemented the proposed Wyner-Ziv scalable framework on bit-planes in the MPEG-4 FGS software, it can be integrated with other SNR scalable coding techniques.

Further work is needed within the proposed framework to improve coding efficiency and provide flexible bandwidth adaptation and robustness. In particular, the selection of efficient channel coding techniques well suited for distributed source coding deserves additional investigation. Another possible reason for the gap between our proposed coder and a nonscalable coder is the less accurate motion-compensated prediction in the enhancement layer when motion vectors are shared with the base layer. This can be improved by exploiting the flexibility at the decoder, an important benefit of Wyner-Ziv coding, to refine the enhancement-layer motion vectors by taking into account the received enhancement-layer information from the previous frame. In terms of bandwidth adaptation, the current coder cannot fully achieve fine granularity scalability, given that the LDPC coder can only decode the whole block at a bit-plane boundary. There is recent interest in punctured LDPC codes [30], and the possibility of using such codes for bandwidth adaptation is under investigation. In addition, it would also be interesting to evaluate the error resilience performance of the proposed coder; in principle, Wyner-Ziv coding is more tolerant of noise introduced to the side information.

Chapter 5

Conclusions and Future Work

In this thesis we have addressed the problem of network-adaptive video coding and streaming, particularly for those source coding algorithms that provide scalability to match changes in the network environment.

First, we extend the rate-distortion optimized streaming (RaDiO) framework proposed in [19,20] to address a more general coding scenario, where multiple decoding paths are possible.
The source encoder produces redundant encoded data to provide flexibility for the scheduling algorithm to choose the right subset of data units to send, based on the current network conditions and the previous transmission history. This approach allows the system redundancy to be optimized dynamically at the streaming stage, rather than being pre-optimized at the encoding stage. Therefore, the proposed streaming framework is more appropriate for adaptation to a wide range of network environments. In terms of source coding, we propose a new coding algorithm that supports multiple decoding paths, namely multiple description layered coding (MDLC), which combines the advantages of layered coding (LC) and multiple description coding (MDC). Experimental results show that, when applied together with the extended RaDiO framework, the proposed MDLC system provides more robust and efficient video communication over a wider range of network scenarios and application requirements.

Second, we propose a novel scalable coding approach that introduces distributed source coding (DSC) in the enhancement-layer prediction, to achieve better coding performance with reasonable encoding complexity. In this approach the decoder complexity increases as compared to existing approaches. Specifically, in order to reduce the encoding complexity, we do not explicitly construct multiple motion-compensation loops at the encoder, while, at the decoder, side information (SI) is constructed to combine spatial and temporal information in a manner that seeks to approximate the ET approach in [66]. Experimental results show improvements in coding efficiency of 3-4.5 dB over MPEG-4 FGS for video sequences with high temporal correlation.

It is worth noting that some of the novel concepts proposed in this research are not limited to scalable coding. For example, the rate-distortion optimized streaming framework can support the case of multiple independently encoded non-scalable bit streams.
The DSC principle has also shown great potential to provide important functionalities that are difficult to achieve using traditional techniques. One example is flexible playback at the decoder, for instance supporting both backward and forward frame-by-frame playback [17]. Although this research has achieved interesting results and proposed a number of novel algorithms, there are still relevant topics and ideas that can be addressed in future research. We describe several possible research directions as follows:

• Source Optimization for the Extended RaDiO Framework. Our current work has proposed rate-distortion optimized scheduling algorithms for a given source codec with redundancy. Although optimization of source redundancy (e.g., in an MDC codec) has been studied by many researchers when simple scheduling algorithms are used, the appropriate redundancy allocation for such a source codec (e.g., MDC or MDLC) is still an open issue when the coder is to be used along with an intelligent scheduling algorithm.

• Packet Scheduling Algorithms for Peer-to-Peer (P2P) Systems. P2P video streaming is highly difficult due to the increased uncertainty about the number of available peers and the bandwidth throughput from each sender. The scheduling algorithm should decide not only which packets to send and when to send them, but also from which peer to send them. The optimization should consider system-level issues, such as peer and path stability and the congestion level of the network. Furthermore, it should also provide efficient coordination among peers to tolerate a given peer’s failure to provide service.

• Study of the Trade-Off between Entropy Coding and DSC. DSC replaces traditional entropy coding by a representation (e.g., syndrome codes) based on a channel code. Thus, for a particular source it is hard to exploit the biased symbol probabilities and the correlation inside the symbol sequence.
Future work should start from a detailed study of the trade-offs between entropy coding and DSC-based coding under different scenarios, and consider techniques to combine entropy coding and DSC-based codes.

• Source-Adaptive Correlation Estimation for DSC. Correlation estimation is a key issue affecting the coding performance. Current approaches [16,18,26] tend to oversimplify the correlation structure and then apply an advanced channel code (e.g., an LDPC code or turbo code) in the hope of achieving good coding performance. This may be adequate for simple source-SI characteristics, for example an additive white Gaussian noise (AWGN) channel. But video signals are so complex that a unified correlation structure may fail to capture correlation information that is potentially useful for coding efficiency. Thus, it may be interesting to consider an alternative approach: developing a more complicated correlation structure to increase the adaptivity to the source, following an idea similar to that applied in H.264 with the introduction of more prediction modes. In this case a simple channel code (of short length, due to a possible small-unit based correlation estimation) with a good correlation structure may achieve better coding performance. Here, decoding errors are possible due to the short-length weak channel code, and the algorithm should provide a way to limit the error propagation.

Bibliography

[1] A. Aaron and B. Girod. Compression with side information using turbo codes. In Proc. Data Compression Conference, pages 252–261, Apr. 2002.

[2] A. Aaron, E. Setton, and B. Girod. Towards practical Wyner-Ziv coding of video. In Proc. Int’l Conf. Image Processing, Sept. 2003.

[3] A. Aaron, R. Zhang, and B. Girod. Wyner-Ziv coding of motion video. In Proc. Asilomar Conf. Signals, Systems, and Computers, Nov. 2002.

[4] A. Albanese, J. Blomer, J. Edmonds, M. Luby, and M. Sudan. Priority encoding transmission. IEEE Trans. Information Theory, 42(6):1737–1744, Nov. 1996.

[5] J.
Apostolopoulos. Reliable video communication over lossy packet networks using multiple state encoding and path diversity. In Proc. Visual Communications and Image Processing, pages 392–409, Jan. 2001.

[6] J. Apostolopoulos, T. Wong, W. Tan, and S. Wee. On multiple description streaming with content delivery networks. In Proc. Conf. Computer Communications (INFOCOM), pages 1736–1745, June 2002.

[7] J. Arnold, M. Frater, and Y. Wang. Efficient drift-free signal-to-noise ratio scalability. IEEE Trans. Circuits and Systems for Video Technology, 10(1):70–82, Feb. 2000.

[8] J. Bajcsy and P. Mitran. Coding for the Slepian-Wolf problem with turbo codes. In Proc. Global Telecommunications Conf. (GLOBECOM), pages 1400–1404, Nov. 2001.

[9] B. Birney. Intelligent streaming. http://www.microsoft.com/windows/windowsmedia/howto/articles/intstreaming.aspx, May 2003.

[10] J. Chakareski, J. Apostolopoulos, S. Wee, W. Tan, and B. Girod. Rate-distortion hint tracks for adaptive video streaming. IEEE Trans. Circuits and Systems for Video Technology, 15(10):1257–1269, Oct. 2005.

[11] J. Chakareski, P. Chou, and B. Girod. Rate-distortion optimized streaming from the edge of the network. In Proc. Workshop on Multimedia Signal Processing, Dec. 2002.

[12] J. Chakareski and B. Girod. Rate-distortion optimized packet scheduling and routing for media streaming with path diversity. In Proc. Data Compression Conference, Mar. 2003.

[13] J. Chakareski, S. Han, and B. Girod. Layered coding vs. multiple descriptions for video streaming over multiple paths. ACM/Springer Multimedia Systems Journal, 10(4):275–285, Apr. 2005.

[14] H.-Y. Cheong, A. M. Tourapis, and P. Topiwala. Fast motion estimation within the JVT codec. Technical Report JVT-E023, JVT meeting, Geneva, Oct. 2002. http://ftp3.itu.ch/av-arch/jvt-site/2002 10 Geneva/JVT-E023.doc.

[15] G. Cheung and W. Tan. Directed acyclic graph based source modeling for data unit selection of streaming media over QoS networks. In Proc. Int’l Conf.
Multimedia and Exhibition, volume 1, Aug. 2002.

[16] N.-M. Cheung, H. Wang, and A. Ortega. Correlation estimation for distributed source coding under information exchange constraints. In Proc. Int’l Conf. Image Processing, Sept. 2005.

[17] N.-M. Cheung, H. Wang, and A. Ortega. Video compression with flexible playback order based on distributed source coding. In Proc. Visual Communications and Image Processing, Jan. 2006.

[18] N.-M. Cheung, H. Wang, and A. Ortega. Sampling-based correlation estimation for distributed image and video coding under rate and complexity constraints. Submitted to IEEE Trans. Image Processing, Aug. 2007.

[19] P. Chou and Z. Miao. Rate-distortion optimized sender-driven streaming over best-effort networks. In Proc. Workshop on Multimedia Signal Processing, volume 1, pages 587–592, Oct. 2001.

[20] P. Chou and Z. Miao. Rate-distortion optimized streaming of packetized media. IEEE Trans. Multimedia, 8(2):390–404, Apr. 2006.

[21] P. Chou and A. Sehgal. Rate-distortion optimized receiver-driven streaming over best-effort networks. In Proc. Int’l Packet Video Workshop, volume 1, Apr. 2002.

[22] P. Chou, H. Wang, and V. Padmanabhan. Layered multiple description coding. In Proc. Int’l Packet Video Workshop, volume 1, Apr. 2003.

[23] G. J. Conklin, G. S. Greenbaum, K. O. Lillevold, A. F. Lippman, and Y. A. Reznik. Video coding for streaming media delivery on the internet. IEEE Trans. Circuits and Systems for Video Technology, 11(3):269–281, Mar. 2001.

[24] W. Equitz and T. Cover. Successive refinement of information. IEEE Trans. Information Theory, 37(2):269–275, Mar. 1991.

[25] J. Garcia-Frias. Compression of correlated binary sources using turbo codes. IEEE Communications Letters, 5(10):417–419, Oct. 2001.

[26] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero. Distributed video coding. Proceedings of the IEEE, Special Issue on Advances in Video Coding and Delivery, 93(1):71–83, Jan. 2005.

[27] B. Girod, J. Chakareski, M. Kalman, Y. Liang, E. Setton, and R. Zhang.
Advances in network-adaptive video streaming. In Proc. 2002 Tyrrhenian International Workshop on Digital Communications (IWDC 2002), Sept. 2002.

[28] B. Girod and N. Farber. Feedback-based error control for mobile video transmission. Proceedings of the IEEE, 87(10):1707–1723, Oct. 1999.

[29] V. Goyal. Multiple description coding: compression meets the network. IEEE Signal Processing Magazine, 18(5):74–93, Sept. 2001.

[30] J. Ha, J. Kim, and S. McLaughlin. Rate-compatible puncturing of low-density parity-check codes. IEEE Trans. Information Theory, 50(11):2824–2836, Nov. 2004.

[31] Y. He, F. Wu, S. Li, Y. Zhong, and S. Yang. H.26L-based fine granularity scalable video coding. In Proc. Int’l Symp. Circuits and Systems, May 2002.

[32] C.-Y. Hsu, A. Ortega, and M. Khansari. Rate control for robust video transmission over burst-error wireless channels. IEEE J. Selected Areas in Communications, 17(5):1–18, May 1999.

[33] H.-C. Huang, C.-N. Wang, and T. Chiang. A robust fine granularity scalability using trellis-based predictive leak. IEEE Trans. Circuits and Systems for Video Technology, 12(6):372–385, June 2002.

[34] ISO/IEC 13818-2. Generic coding of moving pictures and associated audio, part-2 video. Nov. 1994.

[35] ISO/IEC 14496-2/FPDAM4. Coding of audio-visual objects, part-2 visual, amendment 4: streaming video profile. July 2000.

[36] ISO/IEC JTC1/SC29/WG11/N2202. MPEG-4 version 2 visual working draft rev. 3.0. Mar. 1998.

[37] ITU-T Recommendation H.263 version 2 (H.263+). Video coding for low bitrate communication. Feb. 1998.

[38] A. Jagmohan and N. Ahuja. Wyner-Ziv encoded predictive multiple descriptions. In Proc. Data Compression Conference, pages 213–222, Mar. 2003.

[39] W. Jiang and A. Ortega. Multiple description coding via polyphase transform and selective quantization. In Proc. Visual Communications and Image Processing, Jan. 1999.

[40] M. Kalman and B. Girod. Rate-distortion optimized streaming of video with multiple independent encodings. In Proc. Int’l Conf.
Image Processing, Oct. 2004.

[41] L. Kondi. A rate-distortion optimal hybrid scalable/multiple-description video codec. IEEE Trans. Circuits and Systems for Video Technology, 15(7):921–927, July 2005.

[42] C.-F. Lan, A. Liveris, K. Narayanan, Z. Xiong, and C. Georghiades. Slepian-Wolf coding of multiple m-ary sources using LDPC codes. In Proc. Data Compression Conference, page 549, Mar. 2004.

[43] Y.-C. Lee, J. Kim, Y. Altunbasak, and R. Mersereau. Performance comparisons of layered and multiple description coded video streaming over error-prone networks. In Proc. Int’l Conf. Communications, pages 35–39, May 2003.

[44] W. Li. Overview of fine granularity scalability in MPEG-4 video standard. IEEE Trans. Circuits and Systems for Video Technology, 11(3):301–317, Mar. 2001.

[45] Y. J. Liang and B. Girod. Prescient R-D optimized packet dependency management for low-latency video streaming. In Proc. Int’l Conf. Image Processing, Sept. 2003.

[46] A. Liveris, Z. Xiong, and C. Georghiades. Compression of binary sources with side information at the decoder using LDPC codes. In IEEE Communications Letters, Oct. 2002.

[47] D. MacKay and R. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32:1645–1646, Aug. 1996. Reprinted with printing errors corrected in vol. 33, pp. 457–458.

[48] S. Mao, S. Lin, S. Panwar, Y. Wang, and E. Celebi. Video transport over ad hoc networks: multistream coding with multipath transport. IEEE J. Selected Areas in Communications, 21(10):1721–1737, Dec. 2003.

[49] Z. Miao and A. Ortega. Optimal scheduling for streaming of scalable media. In Proc. Asilomar Conf. Signals, Systems, and Computers, Nov. 2000.

[50] Z. Miao and A. Ortega. Expected run-time distortion based scheduling for delivery of scalable media. In Proc. Int’l Packet Video Workshop, volume 1, Apr. 2002.

[51] Z. Miao and A. Ortega. Fast adaptive media scheduling based on expected run-time distortion. In Proc. Asilomar Conf.
Signals, Systems, and Computers, volume 1, Nov. 2002.

[52] R. Neal. LDPC software [Online]. Available: http://www.cs.toronto.edu/~radford/ldpc.software.html.

[53] V. Nguyen, E. Chang, and W. Ooi. Layered coding with good allocation outperforms multiple description coding over multiple paths. In Proc. Int’l Conf. Multimedia and Exhibition, pages 1067–1070, June 2004.

[54] A. Ortega and K. Ramchandran. Rate-distortion methods for image and video compression. IEEE Signal Processing Magazine, 15(6):23–50, Nov. 1998.

[55] V. Padmanabhan, H. Wang, and P. Chou. Resilient peer-to-peer streaming. In Proc. Int’l Conf. Network Protocols, Nov. 2003.

[56] M. Podolsky, S. McCanne, and M. Vetterli. Soft ARQ for layered streaming media. The Journal of VLSI Signal Processing, 27(1-2):81–97, Feb. 2001.

[57] S. Pradhan, J. Chou, and K. Ramchandran. Duality between source coding and channel coding and its extension to the side information case. IEEE Trans. Information Theory, 49(5):1181–1203, May 2003.

[58] S. Pradhan and K. Ramchandran. Distributed source coding using syndromes (DISCUS): design and construction. IEEE Trans. Information Theory, 49(3):626–643, Mar. 2003.

[59] R. Puri, A. Majumdar, P. Ishwar, and K. Ramchandran. Distributed video coding in wireless sensor networks. IEEE Signal Processing Magazine, 23(4):94–106, July 2006.

[60] R. Puri and K. Ramchandran. PRISM: a new robust video coding architecture based on distributed compression principles. In Proc. Allerton Conf. Communications, Control, and Computing, Oct. 2002.

[61] R. Puri and K. Ramchandran. PRISM: a video coding architecture based on distributed compression principles. Technical report, UC Berkeley/ERL Technical Report, Mar. 2003.

[62] A. Reibman. Optimizing multiple description video coders in a packet loss environment. In Proc. Int’l Packet Video Workshop, Apr. 2002.

[63] A. Reibman, H. Jafarkhani, M. Orchard, and Y. Wang. Performance of multiple description coders on a real channel. In Proc.
Int’l Conf. Acoustics, Speech, and Signal Processing, pages 2415–2418, Mar. 1999.

[64] A. Reibman, H. Jafarkhani, Y. Wang, and M. Orchard. Multiple description video using rate-distortion splitting. In Proc. Int’l Conf. Image Processing, pages 978–981, Oct. 2001.

[65] A. Reibman, Y. Wang, X. Qiu, Z. Jiang, and K. Chawla. Transmission of multiple description and layered video over an EGPRS wireless network. In Proc. Int’l Conf. Image Processing, pages 136–139, Sept. 2000.

[66] K. Rose and S. Regunathan. Toward optimality in scalable predictive coding. IEEE Transactions on Image Processing, 10:965–976, July 2001.

[67] A. Sehgal, A. Jagmohan, and N. Ahuja. A causal state-free video encoding paradigm. In Proc. Int’l Conf. Image Processing, Sept. 2003.

[68] A. Sehgal, A. Jagmohan, and N. Ahuja. Scalable video coding using Wyner-Ziv codes. In Proc. Picture Coding Symposium, Dec. 2004.

[69] A. Sehgal, A. Jagmohan, and N. Ahuja. Wyner-Ziv coding of video: an error-resilient compression framework. IEEE Trans. Multimedia, 6(2):249–258, Apr. 2004.

[70] S. Servetto. Lattice quantization with side information. In Proc. Data Compression Conference, pages 510–519, Mar. 2000.

[71] S. Servetto, V. Vaishampayan, and N. Sloane. Multiple description lattice vector quantization. In Proc. Data Compression Conference, pages 13–22, Mar. 1999.

[72] R. Singh, A. Ortega, L. Perret, and W. Jiang. Comparison of multiple description coding and layered coding based on network simulations. In Proc. Visual Communications and Image Processing, pages 929–939, Jan. 2000.

[73] D. Slepian and J. Wolf. Noiseless coding of correlated information sources. IEEE Trans. Information Theory, 19(4):471–480, July 1973.

[74] Y. Steinberg and N. Merhav. On successive refinement for the Wyner-Ziv problem. IEEE Trans. Information Theory, 50(8):1636–1654, Aug. 2004.

[75] G. J. Sullivan and T. Wiegand. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, Nov. 1998.

[76] M.-T. Sun and A. Reibman.
Compressed Video over Networks. Marcel Dekker, New York, 2001.

[77] M. Tagliasacchi, A. Majumdar, and K. Ramchandran. A distributed-source-coding based robust spatio-temporal scalable video codec. In Proc. Picture Coding Symposium, Dec. 2004.

[78] V. Vaishampayan. Design of multiple description scalar quantizers. IEEE Trans. Information Theory, 39(3):821–834, May 1993.

[79] M. van der Schaar and H. Radha. Adaptive motion-compensation fine-granular-scalability (AMC-FGS) for wireless video. IEEE Trans. Circuits and Systems for Video Technology, 12(6):360–371, June 2002.

[80] C. D. Vleeschouwer, J. Chakareski, and P. Frossard. The virtue of patience in low-complexity scheduling of packetized media with feedback. IEEE Trans. Multimedia, 9(2):348–365, Feb. 2007.

[81] H. Wang, N.-M. Cheung, and A. Ortega. WZS: Wyner-Ziv scalable predictive video coding. In Proc. Picture Coding Symposium, Dec. 2004.

[82] H. Wang, N.-M. Cheung, and A. Ortega. A framework for adaptive scalable video coding using Wyner-Ziv techniques. EURASIP Journal on Applied Signal Processing, 2006.

[83] H. Wang and A. Ortega. Robust video communication by combining scalability and multiple description coding techniques. In Proc. Symp. Electronic Imaging, volume 1, Jan. 2003.

[84] H. Wang and A. Ortega. Scalable predictive coding by nested quantization with layered side information. In Proc. Int’l Conf. Image Processing, Oct. 2004.

[85] H. Wang and A. Ortega. Rate-distortion based scheduling of video with multiple decoding paths. In Proc. Symp. Electronic Imaging, Jan. 2005.

[86] H. Wang and A. Ortega. Rate-distortion adaptive streaming of video with multiple decoding paths. Submitted to IEEE Trans. Image Processing, Sept. 2007.

[87] Y. Wang, M. Orchard, V. Vaishampayan, and A. Reibman. Multiple description coding using pairwise correlating transforms. IEEE Trans. Image Processing, 10(3):351–366, Mar. 2001.

[88] Y. Wang, J. Ostermann, and Y. Zhang. Video Processing and Communication.
Prentice Hall, New Jersy, 2002. [89] Y. Wang, A. Reibman, and S. Lin. Multiple description coding for video delivery. Proceedings of the IEEE, 93(1):57–70, Jan. 2005. [90] S. Wenger. Video redundancy coding in H.263+. In Workshop on Audio-Visual Services for Packet Networks (AVSPN 1997), Sept. 1997. [91] S.Wenger, G.Knorr, J.Ott, andF.Kossentini. Errorresiliencesupportinh.263+. IEEE Trans. Circuits and Systems for Video Technology, 8(7):867–877, Nov. 1998. [92] D.WilsonandM.Ghanbari. Optimisationoftwo-layerSNRscalabilityforMPEG-2 video. In Proc. Int’l Conf. Acoustics, Speech, and Signal Processing, Apr. 1997. [93] F.Wu,S.Li,andY.-Q.Zhang. Aframeworkforefficientprogressivefinegranularity scalable video coding. IEEE Trans. Circuits and Systems for Video Technology, 11(3):332–344, Mar. 2001. [94] A. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Information Theory, 22:1–10, Jan. 1976. [95] Z. Xiong, A. Liveris, S. Cheng, and Z. Liu. Nested quantization and Slepian-Wolf coding: A Wyner-Ziv coding paradigm for i.i.d. sources. In Proc. IEEE Workshop on Statistical Signal Processing (SSP), Sept. 2003. [96] Q. Xu and Z. Xiong. Layered Wyner-Ziv video coding. In Proc. Visual Communi- cations and Image Processing, Jan. 2004. [97] R. Zamir and S. Shamai. Nested linear/lattice codes for Wyner-Ziv encoding. In Proc. Information Theory Workshop, pages 92–93, June 1998. [98] R. Zamir, S. Shamai, and U. Erez. Nested linear/lattice codes for structured mul- titerminal binning. IEEE Trans. Information Theory, 48:1250–1276, June 2002. [99] R. Zhang, S. Regunathan, and K. Rose. Video coding with optimal inter/intra- mode switching for packet loss resilience. IEEE J. Selected Areas in Communica- tions, 18(6):966–976, June 2000. 135 [100] R. Zhang, S. Regunathan, and K. Rose. Optimized video streaming over lossy networks with real-time estimation of end-to-end distortion. In Proc. Int’l Conf. 
Multimedia and Exhibition, volume 1, Aug. 2002. [101] Y. Zhou and W.-Y. Chan. Performance comparison of layered coding and mul- tiple description coding in packet networks. In Proc. Global Telecommunications Conference, Nov. 2005. 136
Abstract
Real-time multimedia services over the Internet face fundamental challenges due to the time constraints of these applications and network variations in bandwidth, delay, and packet loss rate. Our research addresses the problem of network-adaptive video coding and streaming based on source codecs that provide scalability to match the network environment.
Asset Metadata
Creator: Wang, Huisheng (author)
Core Title: Algorithms for scalable and network-adaptive video coding and transmission
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 10/01/2007
Defense Date: 04/30/2007
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: distributed source coding, multimedia communication, multiple decoding paths, multiple description coding, multiple description layered coding, network-adaptive video streaming, OAI-PMH Harvest, rate-distortion optimized streaming, scalable coding, source redundancy, transport redundancy, video coding, Wyner-Ziv coding
Language: English
Advisor: Ortega, Antonio (committee chair), Shahabi, Cyrus (committee member), Zhang, Zhen (committee member)
Creator Email: huisheng.wang@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m835
Unique identifier: UC1490006
Identifier: etd-Wang-20071001 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-557419 (legacy record id), usctheses-m835 (legacy record id)
Legacy Identifier: etd-Wang-20071001.pdf
Dmrecord: 557419
Document Type: Dissertation
Rights: Wang, Huisheng
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu