EFFICIENT CODING TECHNIQUES FOR HIGH DEFINITION VIDEO

by

Je-Won Kang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2012

Copyright 2012 Je-Won Kang

Dedication

This dissertation is dedicated to my family for their endless love: Chahn-Mi Park, my beloved wife; Dai-Chun Kang, my dear father; Sang-Sun Kim, my dear mother; Bok-Nam Oh, my dear grandmother; Yeun-Kyung Kang and Yeun-Ju Kang, my dear sisters; and Min-Jung Kang, my daughter.

Acknowledgments

I would like to thank my advisor, Professor C.-C. Jay Kuo, for his guidance and vision throughout these years. Thank you to all my colleagues and mentors during my PhD study. This dissertation would not have been possible without their support.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
    1.2.1 HD Video Coding
    1.2.2 High Efficiency Video Coding (HEVC)
  1.3 Contribution of the Research
    1.3.1 Multi-Order-Residual (MOR) Coding
    1.3.2 Advanced Processing Technique for Residual Coding
    1.3.3 Advanced Entropy Coding for the HEVC
  1.4 Organization of the Thesis

Chapter 2: Background Review
  2.1 Design Structure of HEVC
    2.1.1 Coding Unit (CU)
    2.1.2 Prediction Unit (PU)
    2.1.3 Transform Unit (TU)
  2.2 Coding Tools in HEVC
    2.2.1 Intra Prediction
    2.2.2 Inter Prediction
    2.2.3 Entropy Coding
    2.2.4 Loop Filtering
    2.2.5 R-D Optimization
  2.3 Coding Performance of HEVC

Chapter 3: Multi-Order-Residual Video Coding Technique for High Definition (HD) Videos
  3.1 Introduction
  3.2 Overview of FOR/SOR Coding Technique
    3.2.1 Analysis of H.264/AVC in HD video coding
    3.2.2 Motivation for FOR/SOR coding framework
  3.3 The proposed algorithm
    3.3.1 Proposed FOR coding technique
    3.3.2 Proposed SOR coding technique
    3.3.3 A fast approach to two QP selection between the FOR coder and SOR coder
  3.4 Experimental Results
  3.5 Conclusion

Chapter 4: Two-Layered Transform with Sparse Representation (TTSR) for Video Coding
  4.1 Introduction
  4.2 Review of Residual Coding with Sparse Representation
  4.3 Proposed TTSR Scheme
    4.3.1 System Overview of TTSR Scheme
    4.3.2 Dictionary Training
    4.3.3 Two-Layered Transform and Its Residual Analysis
    4.3.4 Entropy Coder Design
    4.3.5 Rate-Distortion Analysis of TTSR Scheme
  4.4 Proposed SRS Scheme
    4.4.1 Residual Approximation and Clustering
    4.4.2 Side Information Reduction via Context Rule
    4.4.3 Coding Procedure of SRS
  4.5 Experimental Results for TTSR
  4.6 Experimental Results for SRS
  4.7 Conclusion

Chapter 5: Advanced Tools for Entropy Coding in HEVC
  5.1 Introduction
  5.2 Transform Coefficient Coding in HEVC
  5.3 Proposed Tools for CAVLC
    5.3.1 Tableless Run-length Coding
    5.3.2 Codeword Adaptation with High Order Statistics
  5.4 Proposed Tools for CABAC
  5.5 Experimental Results
  5.6 Conclusion

Chapter 6: Conclusion and Future Work

Bibliography

List of Tables

2.1 Number of intra prediction modes according to PU size.
2.2 Values of W_k.
2.3 Calculation of
3.1 The distribution of selected SMB modes such as 32×32, 32×16, and 16×32 and smaller blocks supported by H.264/AVC.
3.2 The BD-rate and BD-PSNR [46] of the proposed FOR/SOR coding algorithm as compared with the JM/KTA and the FOR coder only, where ∆Bits is in units of % and ∆PSNR is in units of dB, and a negative value indicates a saving as compared with the benchmark.
3.3 Comparison of encoding time and BD-rate saving of several QP selection schemes as compared to the JM/KTA software.
4.1 Energy compaction ratio Ξ of the different schemes for signal representation.
4.2 Model parameters.
4.3 Transform modes with the TTSR and their codewords to signal.
4.4 Context definition of a target block using the edge orientation information of its neighbor blocks.
4.5 Transform modes with the SRS and their codewords to signal.
4.6 The BD-rate [46] (in units of %) of the proposed algorithm with HM 5.0 as the benchmark, where a negative value represents a decrement compared with the benchmark. The test sequences are classes B, C, and D, which are natural videos.
4.7 The BD-rate of the proposed algorithm with HM 5.0 as the benchmark. The test sequences belong to class F, called screen content.
4.8 The BD-rate of the proposed algorithm, where the sparse representation is applied only to the 4×4 TU.
4.9 The BD-rate [46] (in units of %) of the proposed algorithm with HM 4.0 as the benchmark, where a negative value represents a decrement compared with the benchmark.
5.1 A VLC codeword table.
5.2 Performance summary of the proposed tableless CAVLC coding scheme (Y BD-rate) against the original CAVLC codec in HM 3.0.
5.3 Performance comparisons of the proposed scheme and two benchmarks, the counter-based adaptive scheme (CA) and the direct adaptive scheme (DA).
5.4 Performance summary of the proposed context modification (Y BD-rate) against the original CABAC codec in HM 5.0.

List of Figures

2.1 Illustration of an exemplary CU structure.
2.2 Illustration of four symmetric PUs.
2.3 The intra prediction directions supported by HEVC.
2.4 Asymmetric motion partition supported in HEVC.
2.5 R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM 17.1 main profile) for the test sequence “Kimono” of resolution 1920×1080.
2.6 R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM 17.1 main profile) for the test sequence “Parkscene” of resolution 832×480.
2.7 R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM 17.1 main profile) for the test sequence “Blowing Bubble” of resolution 416×240.
2.8 Subjective quality comparison for the test sequence “Park Scene” between two frames coded by (a) HEVC (HM 3.0) and (b) H.264/AVC (JM 17.1 main profile).
3.1 The H.264/AVC coding performance and the ratio of texture bits to side information with respect to QPs such as 12, 18, 24, and 30. The x-axis is the compression ratio, and the y-axis is the root mean square error (RMSE). The texture information represents the bits to encode residuals, and the side information includes motion vectors and other overheads.
3.2 (a) A cropped frame from the original “Traffic” sequence and (b) the quantization distortion of H.264/AVC coding, formed as the subtraction of the reconstructed frame from the original frame. The error signal is scaled by a factor of 8.
3.3 An intuitive explanation of the idea behind the proposed FOR/SOR algorithm using a 3D R-D curve.
3.4 The block diagram of the proposed FOR/SOR coding system. The input signal to the SOR coder is the difference between the original signal and the reconstructed signal from the FOR coder.
3.5 Super-macroblock modes supporting larger block-size motion-compensated prediction.
3.6 The directional scanning modes used in the ISP method: (a) illustration of 12 scanning modes from Mode 0 to Mode 11, and (b) Mode 0, (c) Mode 1, (d) Mode 2, and (e) Mode 3.
3.7 The directional scanning of pixels in a 2D block and their mapping to a 1D connected stripe in the ISP method; for instance, mode 2 and mode 1 are applied to a target signal and a reference signal, respectively. The motion vector (m_x, m_y) indicates the position of the reference block in previous frames.
3.8 (a) The original signal, (b) the residual signal after MCP, (c) the residual signal after FOR coding, and (d) the residual signal after SOR coding.
3.9 Correlation between the gradient magnitude and the MSE after FOR coding in 16×16 macroblocks.
3.10 The R-D performance comparison between the FOR codec and the FOR/SOR codec with ISP disabled.
3.11 Fast search of (Q*_F, Q*_S) via two cascaded 1-D search processes.
3.12 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Traffic”.
3.13 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Ash tray smoke”.
3.14 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Sun flower”.
3.15 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Blue sky”.
3.16 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Vintage car”.
3.17 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Tractor”.
3.18 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Blowing trees”.
3.19 The R-D performance comparison of the proposed algorithm and the reference algorithms in “Harbour”.
3.20 Bit allocation of the FOR and SOR coding and the performance comparison between the proposed algorithm and H.264/AVC in the same PSNR range. SOR bits consist of the texture bits of SOR coding and the overheads.
4.1 The signal decomposition scheme in the proposed algorithm using multiple representations.
4.2 A block diagram of (a) the conventional video coding scheme and (b) the proposed TTSR technique.
4.3 Samples of representative residual signals trained by using the K-SVD algorithm.
4.4 Correlation between the gradient magnitude and the variance, i.e., energy, in residual signals. The observation is from two different HD videos: (a) “Traffic” and (b) “Tractor”.
4.5 A regularization scheme in training residual samples to create multiple dictionaries that are adaptively selected by an encoder.
4.6 Energy compaction ratio of the regularized and non-regularized dictionaries, and the DCT.
4.7 The proposed coding scheme in (a) an encoder and (b) a decoder.
4.8 Probability density function of symbols in (a) indices of a dictionary and (b) level values of coefficients.
4.9 A domain rate model of the DCT and the sparse representation in the 4×4 TU from different , i.e., (a) < 4, (b) 4 ≤ < 8, (c) 8 ≤ < 12, and (d) 12 ≤ .
4.10 The signal decomposition using the dictionary and the DCT.
4.11 The Lagrangian cost of the sparse representation with different .
4.12 The proposed syntax on the TTSR transform coefficients.
4.13 Residual quad-tree (RQT) structure of transform units, applying the flags to leaf nodes.
4.14 The relationship between a target block denoted by X and its four neighbor blocks denoted by A, B, C, and D, respectively, with 12 contexts.
4.15 Two examples of pixel shift in the SRS.
4.16 (a) The selected region of the sparse representation modes, presented in orange, and (b) prediction residuals after motion estimation. The residual signals are scaled by 4 for ease of visual perception. The captured frame is from the “BlowBubble” sequence.
4.17 The trained dictionary sets from the natural videos (a) with the regularization scheme in Fig. 4.5 and (b) without the scheme, and (c) from the screen content. The contrast of the atoms is adjusted for ease of visual perception.
4.18 (a) Motion prediction residual signals in the “SliceShow” sequence, and visual comparison of reconstructed images with (b) TTSR and (c) DCT, where the PSNR values are 37.42 dB and 37.35 dB, respectively. The contrast of the prediction residue is adjusted for ease of visual perception.
4.19 Subjective quality comparison in “ChinaSpeed” with (a) the TTSR and (b) the DCT. The PSNR values are 38.48 dB and 38.53 dB, respectively.
4.20 The R-D performance comparison of the proposed SRS algorithm and HM 4.0 for the test sequence “ChinaSpeed”.
5.1 A zig-zag scan pattern after the 2-D transform, where the scanned transform coefficients are ordered in a 1-D array and the dotted lines represent consecutive zeros.
5.2 A part of the 28×29 2-D VLC table containing the code number used for encoding the pair. The table is for an inter-coded luma block when the level value is equal to 1. The y-axis and the x-axis correspond to Run(j,k) and the position j, respectively.
5.3 The probability distribution of runs with respect to the position of the previously coded coefficient j and a piece-wise linear model, where the x-axis represents the runs and the y-axis represents the normalized histogram of the pairs.
5.4 The VLC table generated by the proposed method when L(k)=1, where the code numbers are assigned in increasing order and the code number for the longest run is decided by Eq. 5.1.
5.5 A sorting table used to assign a syntax element to a code number with swapping entries.
5.6 Context selection of ‘X’ in significance map coding used for a large transform block, based on the summation of the significance at A, B, C, D, and E.
5.7 The proposed context selection of ‘X’ in significance map coding used for large transform blocks: applied to (a) the horizontal scanning pattern and (b) the vertical scanning pattern.

Abstract

High definition (HD) video content has become popular, and displays of even higher resolution, such as ultra definition, are emerging. The conventional video coding standards offer excellent coding performance at lower bit-rates; however, their coding performance for HD video content is not as efficient. The objective of this research is to develop a set of efficient coding tools and techniques that offer a better coding gain for HD video. The following three techniques are studied in this work.

First, we present a joint first-order-residual/second-order-residual (FOR/SOR) coding technique. The FOR/SOR algorithm, which incorporates several advanced coding techniques, is proposed for HD video coding. In the FOR coder, block-based prediction is used to exploit both the temporal and spatial correlation of the original frame for coding efficiency. However, structural noise still remains in the prediction residuals. We design an efficient SOR coder to encode this residual image.
Block-adaptive bit allocation between the FOR and SOR coders is developed to enhance the coding performance, which corresponds to selecting two different quantization parameters for the FOR and SOR coders in different spatial regions. Experimental results show that the proposed FOR/SOR coding algorithm significantly outperforms H.264/AVC in HD video coding, with an average bit-rate saving of 15.6%.

Second, we develop two advanced processing techniques for prediction residuals, referred to as the two-layered transform with sparse representation (TTSR) and the slant residual shift (SRS), to improve coding efficiency. Prediction residues often show a non-stationary property, for which the DCT becomes sub-optimal and yields undesired artifacts. The proposed TTSR algorithm makes use of sparse representation and is targeted toward the state-of-the-art video coding standard, High Efficiency Video Coding (HEVC), in this work. A dictionary is adaptively trained to contain featured patterns of residual signals so that a high portion of the energy in a structured residual can be efficiently coded with sparse coding. Then, a cascaded DCT is applied to the signal remaining after sparse coding. The use of multiple representations is justified with an R-D analysis, and the two transforms successfully complement each other. The SRS technique aligns dominant prediction residuals of inter-predicted frames with the horizontal or the vertical direction via a row-wise or column-wise circular shift before the 2-D DCT. To determine the proper shift of pixels, we classify blocks into several types, each of which is assigned an index number. These indices are sent to the decoder as signaling flags, which can be viewed as the mode information of the SRS technique. Experimental results demonstrate that the proposed algorithm outperforms the HEVC.
Third, we contribute to the HEVC with several efficient coding tools incorporated into its two context-adaptive entropy coders, i.e., Context Adaptive Variable Length Coding (CAVLC) and Context Adaptive Binary Arithmetic Coding (CABAC). The proposed tableless VLC coding scheme removes all of the tables used for residual coding, yet yields a negligible change in coding performance. The statistical properties of a symbol are employed to replace the conventional tables with a mathematical model and to improve the coding gain with a high-order Markov model. In addition, a new context for significance map coding in a large transform block is designed. The proposed context model removes the dependency on neighboring significant coefficients along the same scanning line and thus enhances the throughput of the CABAC. The proposed algorithm extends to the mode-dependent coefficient scanning method for a large transform block. It has a negligible effect on the coding performance while significantly improving parallelization.

Chapter 1

Introduction

1.1 Significance of the Research

Digital video applications have grown rapidly in the last several decades. Owing to portable and ubiquitous video devices, people create their own video content more than ever before. However, it is demanding to transmit or store such a huge amount of raw data. With modern video coding technology, the size of video files can be significantly reduced while the decoded video is degraded only to a level that is still acceptable to the human visual system. Coding technology becomes increasingly important with the large volume of video data shared through the Internet, broadcast by digital TV systems, and stored in portable devices. Consumers expect ever-better coding performance for these diversified applications.
Continual efforts in video coding standardization have been made to improve coding efficiency subject to applications, hardware resources, and network infrastructure. In recent years, video applications have moved toward higher resolution and fidelity. High definition (HD) TV has become popular, and displays of even higher resolution, such as ultra definition (UD) of size 7680×4320, are emerging. As to network traffic, it was stated in [15] that HD video content would account for more than 60% of Internet traffic by 2015, while video-related services would occupy about 90% of it. Thus, more efficient video coding techniques are needed to address the increasing demand.

To meet this need, the Joint Collaborative Team on Video Coding (JCT-VC) was set up to launch the next video coding standard, called High Efficiency Video Coding (HEVC) [17]. Due to the wide acceptance of H.264/AVC in low bit-rate video applications, the HEVC is expected to have more impact on standard, high, and ultra definition video. The primary goal of the HEVC is to achieve significantly higher coding efficiency than H.264/AVC, say, a 50% bit-rate saving at the same visual quality. The current design of the HEVC contains several advanced coding tools and key design elements, while following the conventional video coding framework with block-based motion-compensated prediction, a 2D separable integer transform, and context-adaptive entropy coding.

The objective of our research is to shed light on the design of an efficient HD video codec that yields a significant coding gain over the existing video coding standards. In this research, we first examine the limitations of the conventional video coding techniques and then propose several novel ideas for their improvement.
It is demonstrated by experimental results that the proposed techniques and tools significantly outperform existing video coding standards in perceptual quality and in objective quality, such as Peak Signal-to-Noise Ratio (PSNR), at about the same bit-rates [46].

1.2 Review of Previous Work

1.2.1 HD Video Coding

There have been several research works on improving the coding efficiency of HD video by exploiting its special characteristics. On top of the conventional block modes in H.264/AVC, the idea of super macroblocks (SMB) was proposed in [56] for inter-frame prediction by adopting blocks of larger sizes (e.g., 32×32). Similarly, Naito et al. [57] extended the variable block modes with a scale factor of 2 or 4. Dong et al. [43] reported strong spatial correlation in HD video that justifies the use of a DCT of large size, and accordingly proposed a fast integer transform associated with the SMB. Coding methods associated with blocks of a larger size are advantageous in HD video coding. This observation has also been adopted by the HEVC as a key element.

The characteristics of motion-compensated prediction residual signals have been extensively studied, and it is widely accepted that residual pixels may show less correlation with their neighbors than image pixels do, but sometimes have strong correlation along one direction [61]. For a more efficient representation of these signals, one can develop an adaptive transform set that includes directionally oriented basis functions to capture the directional components well. In [38], [41], [66], different transforms are designed to reflect particular directions of residual signals. In [50], [53], [61], several gradient models of residual signals are studied to develop alternative transforms and improve coding efficiency. In [65], a set of KLTs is obtained from sample residual signals for intra coding. Because the transforms are coupled with the directional intra predictions, an encoder does not need any explicit signaling for the selection.
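To make the trained-transform idea concrete, the sketch below derives a Karhunen-Loève transform (KLT) from a set of sample residual blocks, as a transform of this family would be derived from training data. This is an illustrative sketch, not the method of [65]: the function names and the random toy data are ours, and a real codec would train one basis per intra prediction mode so that the mode index itself selects the transform, with no extra signaling.

```python
import numpy as np

def train_klt(residual_blocks):
    """Derive a KLT basis from sample residual blocks (each N x N).

    Each block is flattened to a vector; the KLT basis consists of the
    eigenvectors of the sample covariance, ordered by decreasing
    eigenvalue so that the first coefficients carry the most energy.
    """
    X = np.stack([b.ravel() for b in residual_blocks])  # samples x N^2
    C = np.cov(X, rowvar=False)                          # N^2 x N^2 covariance
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                    # largest variance first
    return eigvecs[:, order]                             # columns = basis vectors

def klt_forward(block, basis):
    return basis.T @ block.ravel()

def klt_inverse(coeffs, basis, n):
    return (basis @ coeffs).reshape(n, n)

# Toy usage with random 4x4 "residuals"; in practice one basis would be
# trained per intra mode from real residual samples.
rng = np.random.default_rng(0)
samples = [rng.standard_normal((4, 4)) for _ in range(500)]
basis = train_klt(samples)
block = samples[0]
coeffs = klt_forward(block, basis)
rec = klt_inverse(coeffs, basis, 4)   # exact without quantization
```

Because the basis is orthonormal, the transform is invertible and all coding loss comes from the subsequent quantization, exactly as with the DCT.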
Sparse representation, which expresses a signal as a weighted linear combination of very few basis vectors from a dictionary, has been widely studied in recent years and effectively applied to various image/video processing tasks [45], [60]. Sparse representation is utilized for video compression by incorporating an over-complete dictionary into a codec instead of using conventional transforms such as the DCT and wavelets. In [51], a sparse representation is employed to form the signal with fewer basis functions via over-complete dictionary sets trained with an advanced learning technique such as K-SVD [36].

The entropy coder has also been improved with new coding methods for HD video. On one hand, Context Adaptive Variable Length Coding (CAVLC) has been widely used in portable video devices due to its low complexity. The recent CAVLC offers significantly better coding performance than that in H.264/AVC. Its improved coding performance is attributed to its adaptation capability, yet the sizes of its look-up tables are increased. On the other hand, Context Adaptive Binary Arithmetic Coding (CABAC) can provide better coding performance than the CAVLC at the cost of more computational complexity. An approach to combine the best features of both entropy coding designs was proposed in [24] to enhance the throughput of the CABAC.

1.2.2 High Efficiency Video Coding (HEVC)

After H.264/AVC, ITU-T VCEG and ISO/IEC MPEG explored opportunities for the next major video coding standard. VCEG organized the Key Technology Area (KTA) [18] to develop advanced coding tools within the H.264/AVC framework and to perform several core experiments on them. MPEG coordinated a Call for Evidence to assess the maturity of video coding technology for a new standard. Later on, both groups decided to establish the JCT-VC and issued a Call for Proposals to evaluate submissions to the new standardization effort [63]. Among these proposals, JCTVC-A124 [33] and JCTVC-A125 [34] showed the best performance.
Thus, they were selected as the basis of the HEVC reference software, called the HM software [25].

1.3 Contribution of the Research

We have identified several research problems in residual coding within the conventional video coding standards. The main contributions of this research are summarized below.

1.3.1 Multi-Order-Residual (MOR) Coding

A Multi-Order-Residual (MOR) coding scheme is presented in Chapter 3 with the following contributions.

1. The proposed MOR coding technique is highly adaptive to the characteristics of the source signal and its prediction residuals. The original source is coded with the First-Order-Residual (FOR) codec while the prediction residual is coded with the Second-Order-Residual (SOR) codec.

2. The FOR codec deals with the high inter-block correlation via the conventional block-based prediction technique. The SOR codec uses a novel prediction technique, called inter-stripe prediction (ISP), to exploit the directional components of the residuals.

3. Two quantization parameters are used for the FOR and the SOR coding, respectively. The FOR/SOR coder can allocate the optimal bits between the FOR layer and the SOR layer and adapt to locally varying image features. A rate-distortion optimization process is used to decide the operating point at which the SOR coder is turned on.

4. A rate-distortion (R-D) analysis is conducted to explain the coding gain of the proposed FOR/SOR codec.

1.3.2 Advanced Processing Technique for Residual Coding

Novel residual coding techniques are presented in Chapter 4 with the following contributions.

1. The characteristics of a motion-compensated residual signal are analyzed via a representative form provided by vector quantization (VQ) and an advanced study of sparse representation. It is observed that the residual signal contains directional components that are transformed into high-frequency components by the 2-D DCT.

2.
The two-layered transform with sparse representation (TTSR) is proposed to efficiently process a prediction residual signal using multiple representations and to improve the objective and subjective coding performance. Two cascaded transforms are applied to adapt to the signal characteristics. For the sparse representation, a dictionary is adaptively trained to contain featured patterns of residual signals, so that a high portion of the energy can be efficiently reduced. Then the following DCT is applied to transform the remaining signal. The use of multiple representations is justified with an R-D analysis, and the two transforms successfully complement each other.

3. A slant residual shift (SRS) technique, which aligns dominant prediction residuals of inter-predicted frames with the horizontal or the vertical direction via a row-wise or column-wise circular shift, respectively, is proposed to achieve better energy compaction with the 2-D DCT and, hence, an overall video coding gain. To determine the proper shift amount of a given block, we classify blocks into several types, each of which is assigned an index number. These indices are sent to the decoder as signaling flags, which can be viewed as the mode information of the SRS technique. The proper shift amount of a block type can be learned through an off-line process.

1.3.3 Advanced Entropy Coding for the HEVC

Context-Adaptive Variable Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC) are entropy coding methods widely used in image/video coding standards such as H.264/AVC. CAVLC is a low-complexity entropy coder supported in the baseline profile of H.264/AVC and applied in portable devices owing to its low computational complexity. CABAC provides higher coding efficiency than CAVLC at the cost of more computational resources. In the recent activity on the HEVC, a unified entropy coding scheme that takes advantage of both entropy coders is actively discussed.
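As background on table-free variable-length codes of the kind discussed in this context, the zeroth-order Exp-Golomb code used throughout H.264/AVC and HEVC syntax generates every codeword by formula alone, with no stored table. The sketch below is a minimal illustration of that principle; it is not the thesis's proposed code-number equation for run-length coding, only a well-known example of the same idea.

```python
def exp_golomb_encode(n: int) -> str:
    """Zeroth-order Exp-Golomb codeword for a nonnegative integer n.

    codeword = (leading zeros) + binary(n + 1), where the number of
    leading zeros is len(binary(n + 1)) - 1; no lookup table is needed.
    """
    b = bin(n + 1)[2:]                  # binary of n+1, without '0b'
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str) -> int:
    """Invert exp_golomb_encode: count leading zeros, read that many
    more bits plus one, and subtract 1."""
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:2 * zeros + 1], 2) - 1

# First few codewords: 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100'
for n in range(16):
    assert exp_golomb_decode(exp_golomb_encode(n)) == n
```

Shorter codewords go to smaller values, so the code is efficient whenever small symbol values are the most probable, which is typically the case for run lengths of transform coefficients.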
Several advanced coding tools for the two entropy coders are proposed in Chapter 5. The specific contributions include the following. 1. The CAVLC in HEVC provides improved coding performance over that of H.264/AVC at the cost of storing large VLC tables for extensive adaptation to different signal types. We propose a new method to remove all run-length VLC tables used for residual coding by using a simple equation to generate a code number. 2. We design a parallelizable context model for the significance map coding of CABAC to enhance the throughput of the CABAC. The proposed method excludes the latest coded coefficient from the context model, so that the dependency among neighboring transform coefficients is removed. 1.4 Organization of the Thesis The rest of this thesis is organized as follows. The background of this research and the current HEVC standard are reviewed in Chapter 2. The Multi-Order-Residual (MOR) coding is presented in Chapter 3. The residual processing technique with sparse representation is described in Chapter 4. Several advanced tools for entropy coding are proposed in Chapter 5. Finally, concluding remarks and future research topics are given in Chapter 6. Chapter 2 Background Review The Joint Collaborative Team on Video Coding (JCT-VC) has initiated an effort to develop a new video coding standard referred to as High Efficiency Video Coding (HEVC). In the beginning stage of the standardization, a software package called the Test Model under Consideration (TMuC) [14] was developed with several key elements from submitted proposals. TMuC was used as a test bed to evaluate the coding performance and computational complexity of proposed coding tools. After examination, only a few of the coding tools in TMuC were adopted into the HEVC reference software, i.e., the HM software. We review HEVC in this chapter along with its objective and subjective coding performance in comparison with H.264/AVC.
2.1 Design Structure of HEVC 2.1.1 Coding Unit (CU) A coding unit (CU) is a square block that serves as the basic unit of coding. It corresponds to the macroblock unit used in previous coding standards. The size of a CU can vary with the local content, but it has user-defined maximum and minimum sizes, respectively called the largest CU (LCU) and the smallest CU (SCU). For a more flexible content-adaptive coding framework, a CU has a quad-tree structure that allows recursive partitioning into four smaller blocks of equal size, called sub-CUs. An example of a CU structure with the LCU, SCU, and sub-CUs is shown in Fig. 2.1. The numbers labeled on the sub-CUs represent the processing order within an LCU. Figure 2.1: Illustration of an exemplary CU structure. 2.1.2 Prediction Unit (PU) A prediction unit (PU) is a subset of a CU that serves as the basic unit for intra or inter prediction. The PU carries side information associated with a prediction, e.g., prediction modes and motion vectors. Intra prediction is performed from pixel samples of adjacent PUs to remove redundancy in the spatial domain. Inter prediction exploits the temporal correlation between video frames using motion estimation and compensation techniques. A PU can be of rectangular form for more efficient motion estimation, or even non-symmetric in shape, while only symmetric block shapes were used for prediction in previous standards. A single CU can contain multiple PUs. Fig. 2.2 shows an example of PU structures of sizes 2N×2N, 2N×N, N×2N and N×N. 2.1.3 Transform Unit (TU) A transform unit (TU) is the basic unit for the transform and quantization processes. The transform size ranges from 4×4 up to 32×32 for the luma component. Each CU can contain one or multiple TUs, and the size of a TU cannot exceed the size of the CU. Figure 2.2: Illustration of four symmetric PUs.
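The recursive CU partitioning described in Sec. 2.1.1 can be sketched in a few lines. This is a minimal illustration, not the HM implementation: the `should_split` criterion is a hypothetical stand-in for the encoder's R-D based mode decision.

```python
def split_cu(x, y, size, scu, should_split):
    """Recursively partition a CU quad-tree and return the leaf CUs
    as (x, y, size) tuples in Z-scan (processing) order.
    `should_split` is a hypothetical criterion; a real encoder makes
    this decision by comparing R-D costs."""
    if size > scu and should_split(x, y, size):
        half = size // 2
        leaves = []
        for dy in (0, half):          # top row of sub-CUs first,
            for dx in (0, half):      # left before right (Z-scan)
                leaves += split_cu(x + dx, y + dy, half, scu, should_split)
        return leaves
    return [(x, y, size)]

# LCU = 64, SCU = 8: split the top-left quadrant down to 16x16,
# keep the other three quadrants as 32x32 CUs
rule = lambda x, y, size: size == 64 or (x < 32 and y < 32 and size > 16)
cus = split_cu(0, 0, 64, 8, rule)
# yields 4 CUs of 16x16 plus 3 CUs of 32x32
```

The same recursion, rooted at a CU instead of an LCU, also describes the residual quad-tree (RQT) that splits a CU into TUs.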
Residual quad-tree transform (RQT) [31] is the process of splitting the region of a CU into one or multiple TUs. The RQT allows recursive partitioning described by a full quad-tree. The root node corresponds to a TU whose size equals the CU size, and each child node in the tree represents a partition of the CU. The maximum depth of the quad-tree can be specified by users. While the Discrete Cosine Transform (DCT) is the core transform of the HEVC standard, a different transform can be applied based upon the characteristics of the prediction residuals. In intra-coding, the Discrete Sine Transform (DST) is employed with different directional prediction modes [19] instead of the DCT. 2.2 Coding Tools in HEVC 2.2.1 Intra Prediction Intra prediction is conducted by angular directional prediction, DC prediction, or planar prediction. The number of prediction modes depends on the size of a PU, as shown in Table 2.1, and the 33 intra prediction directions are illustrated in Fig. 2.3. Planar prediction is designed to reduce blocking artifacts and improve subjective visual quality. On top of the intra prediction modes, Mode Dependent Intra Smoothing (MDIS) [20] was proposed as a supplementary tool to enhance the prediction by applying a smoothing filter to the reference samples. The filtering process is decided by the PU size and the prediction mode.

Table 2.1: Number of intra prediction modes according to PU size.
    PU size    Number of intra prediction modes
    4          17
    8          34
    16         34
    32         34
    64         3

Figure 2.3: The intra prediction directions supported by HEVC. 2.2.2 Inter Prediction Conventional motion estimation and compensation techniques are employed in HEVC. A PU carries side information such as motion vectors, reference picture indices and prediction directions associated with motion prediction. The merge mode [26] is a new prediction method developed for HEVC.
In this mode, neighboring PUs can be merged so as to share common motion parameters. Asymmetric motion partition (AMP) [2] is shown in Fig. 2.4. A non-symmetric prediction shape can offer better representations of edges or object boundaries. The width and the height of an AMP are controlled by the TU and PU sizes. Adaptive motion vector prediction (AMVP) [12] is developed to exploit the spatio-temporal correlation of motion vectors in adjacent PUs. The encoder can choose the best predictor from a motion vector prediction list. Figure 2.4: Asymmetric motion partition supported in HEVC. For the interpolation filter, a separable DCT-based interpolation filter is used for the luma and chroma components. Since the block size of a chroma component is smaller than that of a luma component, the filter tap length for chroma is also typically smaller. 2.2.3 Entropy Coding Different context adaptive entropy coding techniques were examined, including binary arithmetic coding, variable-to-variable length codes, variable-to-fixed length codes, and variable length codes. Recently, one unified entropy coding scheme based on CABAC was adopted as the entropy coder of HEVC, whereas two entropy coders are used in H.264/AVC. Because parallel processing becomes more important as frame resolutions grow and entropy coding is a bottleneck in decoding, the throughput of CABAC has been carefully addressed. In [4], an efficient binarization of transform coefficients was proposed to address the bottleneck caused by the increasing portion of transform coefficients at high bit rates. The method employs the binarization used in CAVLC, so that multiple bins can be coded with one codeword at a time. In [21], the throughput of CABAC was improved by reducing the number of context-coded bins in transform coefficient coding. Most of the bins for the flags that signal whether a coefficient level is greater than one or two are coded in the bypass mode of CABAC. The number of context-coded bins is thus significantly reduced with a negligible change in coding gain.
2.2.4 Loop Filtering Several loop filtering techniques are newly developed for post-processing in addition to the deblocking filter. Sample Adaptive Offset (SAO) [23] is applied after the deblocking filter. Offset values are delivered in slice headers so that they can be added to reconstructed samples to compensate for the distortion error. The SAO classifies the signal into pre-defined categories and adds an offset to each region. The SAO types are designed for different signal characteristics, namely the band offset (BO) and the edge offset (EO). All pixels in a region are adjusted in the BO, while 1-D patterns are used to represent the edge directional information in the EO. Adaptive Loop Filter (ALF) [1] is applied to the reconstructed signal after the SAO process. The filtering process employs several 2-D shape filters. Filter coefficients are transmitted in the slice header, as in the SAO. Pixel variances are used to select the filter types and coefficients. 2.2.5 R-D Optimization In a modern video coding standard such as H.264/AVC, rate-distortion (R-D) optimization is used to decide the best coding parameters, such as block modes and transform sizes, which are needed to encode the corresponding residuals. In HEVC, a Lagrangian cost is computed from the produced bits and distortion to find the best mode during the R-D optimization. The Lagrangian cost J for the mode decision is defined as

    J = SSE + λ · R_MODE,    (2.1)

where SSE is the sum of squared errors between the source and the reconstructed block, R_MODE is the bit cost associated with the mode, including side information such as motion vectors and texture information such as quantized DCT coefficients of residual signals, and λ is the Lagrangian multiplier. The parameter λ indicates the trade-off between bit-rate and distortion, corresponding to the slope of a point on the R-D curve.

Table 2.2: Values of W_k.
    k    Hierarchical level    Slice type    Number of B frames    W_k
    0    0                     I             -                     0.57
    1    0                     I or GPB      > 0                   0.68
    2    0                     I or GPB      0                     0.85
    3    > 0                   GPB           -                     0.68

Table 2.3: Calculation of α.
    For referenced pictures        1 − 0.05 × (number of B frames)
    For non-referenced pictures    1

In HEVC, λ is defined as a function of the quantization parameter (QP) in the form of

    λ = α × W_k × 2^((QP−12)/3),    (2.2)

where W_k is a weighting factor that depends on the encoding configuration, such as the slice type and the hierarchy level of the current picture. The values of W_k are listed in Table 2.2, while the parameter α is chosen according to Table 2.3. Since λ increases exponentially with QP, the selected block mode may depend on the QP value for the same source. At lower bit rates (or higher QP), a small change in the bit-cost term can significantly affect the total cost J. Thus, the encoder may prefer to choose a block mode that uses fewer bits at the expense of greater distortion. For example, a skip mode that transmits only one bit can be effective. On the other hand, the encoder can allow more bits to reduce the same amount of distortion in high bit-rate coding. 2.3 Coding Performance of HEVC The Peak Signal-to-Noise Ratio (PSNR) is defined as

    PSNR = 20 · log10(255 / √MSE),    (2.3)

where MSE is the mean square error between the source and the lossy signal. Coding bits and PSNR values are used to measure the objective coding performance. The Bjontegaard method [46] is used to measure the difference in coding performance between two R-D curves. Several R-D points are plotted with different QP values, and the area between the two curves is integrated and averaged over the interval. As a result, the differences in bit-rate or PSNR are measured; they are called the BD-rate and the BD-PSNR, respectively. HEVC targets a 50% coding bit saving with the same visual quality. As shown in Figs. 2.5-2.7, the most recent HM software provides improved coding performance, about 43% BD-rate savings on average as compared to the H.264/AVC main profile, over broad bit-rate ranges and test sequences of different resolutions.
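Eqs. (2.1)-(2.3) above can be collected into a small sketch of the encoder's mode-decision arithmetic. This is an illustration, not the HM code; the defaults assume an I-slice referenced picture with no B frames (W_k = 0.57, α = 1, per Tables 2.2 and 2.3).

```python
import math

def lagrange_multiplier(qp, w_k=0.57, alpha=1.0):
    """Eq. (2.2): lambda = alpha * W_k * 2^((QP - 12) / 3)."""
    return alpha * w_k * 2.0 ** ((qp - 12) / 3.0)

def rd_cost(sse, rate_bits, lam):
    """Eq. (2.1): J = SSE + lambda * R_MODE."""
    return sse + lam * rate_bits

def psnr(mse):
    """Eq. (2.3): PSNR = 20 * log10(255 / sqrt(MSE)), in dB."""
    return 20.0 * math.log10(255.0 / math.sqrt(mse))

# lambda doubles for every +3 in QP, so high-QP (low bit-rate)
# coding penalizes rate much more heavily than low-QP coding
lam_low, lam_high = lagrange_multiplier(22), lagrange_multiplier(37)
```

This makes the text's skip-mode observation concrete: with a large λ, a mode with a tiny rate term can win the J comparison even at noticeably higher SSE.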
The test sequences shown in these figures are “Kimono (1920×1080),” “Party Scene (832×480),” and “Blowing Bubbles (416×240),” which are used in the HEVC standardization. A comparison of perceptual quality is also provided in Fig. 2.8 for “Party Scene.” The QP values are 34 and 30, and the bit-rates are 3,035 kbps and 3,060 kbps, respectively, for H.264/AVC and HM 3.0. Figure 2.5: R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM17.1 main profile) for the test sequence “Kimono” of resolution 1920×1080. Figure 2.6: R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM17.1 main profile) for the test sequence “Party Scene” of resolution 832×480. Figure 2.7: R-D performance comparison between HEVC (HM 3.0) and H.264/AVC (JM17.1 main profile) for the test sequence “Blowing Bubbles” of resolution 416×240. Figure 2.8: Subjective quality comparison for the test sequence “Party Scene” between two frames coded by (a) HEVC (HM 3.0) and (b) H.264/AVC (JM17.1 main profile). Chapter 3 Multi-Order-Residual Video Coding Technique for High Definition (HD) Videos 3.1 Introduction Recently, high definition (HD) TV has become popular, and displays of even higher resolution, such as ultra definition, are emerging. To encode HD video contents, the Joint Collaborative Team on Video Coding (JCT-VC) [17] was set up to launch the next video coding standard, currently referred to as High Efficiency Video Coding (HEVC). The primary goal of this new standard is to achieve a significant coding efficiency gain over H.264/AVC, i.e., a 50% bit-rate saving while preserving the same perceptual visual quality (or objective quality measure). It is expected that HEVC will have a great impact on industrial products and systems in the future. Besides the HEVC standardization effort, there have been other research activities on improving the coding efficiency of HD video by exploiting its special characteristics.
On top of the conventional block modes in H.264/AVC, the idea of super macroblocks (SMB) was proposed in [56] for inter-frame prediction by adopting blocks of larger sizes (e.g., 32×32). Similarly, Naito et al. [57] extended the variable block modes with a scale factor of 2 or 4. Dong et al. [43] reported the strong spatial correlation of HD video, which justifies the use of a DCT of large size, and proposed a fast integer transform associated with the SMB accordingly. To improve the efficiency of intra-frame prediction, the block-based intra prediction (BIP) was proposed to supplement the conventional line-based intra prediction (LIP) in [42]. It is apparent that coding methods associated with blocks of a larger size are advantageous in HD video coding. This observation has also been adopted by HEVC as a key element. However, as discussed in [52], a simple extension of block sizes is not sufficient to guarantee good coding performance for high-bit-rate video with fine texture, due to the considerable amount of residual signal at object boundaries as well as the increased overhead. For HD video coding, these residual signals are actually the main bottleneck in achieving high coding efficiency. To address this issue, Tao et al. [61] analyzed the statistical characteristics of residual signals and applied a probability model for residual coding and fast motion estimation. Li et al. [55] proposed the second-order-prediction (SOP) technique to remove the correlation in the original as well as the prediction residual images. The SOP is conducted with two cascaded intra predictions to reduce the correlation in motion compensated residuals. Directional transforms, such as directionally oriented basis functions, were proposed in [66] to match the specific orientation of an image block, and different zigzag scanning orders are adopted to enable more efficient entropy coding.
The directional intra prediction was proposed in [65] to yield regular patterns of residual signals, from which a set of Karhunen-Loeve basis functions can be derived accordingly. A novel HD video coding technique, called the first-order-residual (FOR)/second-order-residual (SOR) coding scheme, was first proposed in [52]. Simply speaking, the FOR/SOR coding method decomposes prediction residuals into two layers, called the FOR layer and the SOR layer, and applies different coding methods based on the characteristics of the residual signals in each layer. The conventional intra-prediction and inter-prediction schemes used in H.264/AVC are treated as the first-order prediction, and the prediction residuals are encoded by the FOR coder. To deal with the higher spatial resolution of HD video, a motion compensated prediction using SMB is adopted in the FOR coding. After the FOR coding, there still exist residuals with directional patterns, called the second-order residuals. They are not random but structured residuals, and an SOR coder is developed to encode them. It was shown in [52] that the proposed FOR/SOR scheme outperforms H.264/AVC by a significant margin in near-lossless video coding. In this work, we present an improved FOR/SOR coding technique with a more detailed treatment and extensive experimental results. The optimal bit allocation between the FOR and the SOR coders is formulated to adapt to different residual characteristics. Experimental results show that the proposed algorithm significantly outperforms H.264/AVC for HD video coding (i.e., a bit-rate saving of about 15.6%) in a more practical bit-rate range. The rest of this chapter is organized as follows. An overview of the FOR/SOR coding system is given in Sec. 3.2 with an analysis of previous coding methods at high bit rates. Then, the proposed FOR/SOR coding is presented in Sec. 3.3. Experimental results are shown in Sec. 3.4 to demonstrate the effectiveness of the proposed algorithm.
Concluding remarks are given in Sec. 3.5. 3.2 Overview of FOR/SOR Coding Technique 3.2.1 Analysis of H.264/AVC in HD video coding The superior coding performance of H.264/AVC at lower bit rates is attributed to its ability to yield highly skewed residual signals centered around zero, based on effective intra and inter prediction techniques, and to encode them with context adaptive entropy coders such as CABAC and CAVLC. However, the characteristics of the prediction residuals change dramatically at higher bit rates, so that the R-D performance degrades. One can observe these different characteristics by studying the probability distribution of non-zero DCT coefficients with a finer quantization parameter (QP). When the QP takes a smaller value, the number of non-zero quantized DCT coefficients increases while the residual signals have a wider dynamic range. In Fig. 3.1, we show the coding performance of H.264/AVC and the proportions of the coded bit streams for different QPs. The compression ratio given in the figure is the ratio of the number of coded bits to that of the original video, and the root mean square error (RMSE) is used to measure the distortion. The coded bitstreams consist of texture bits, which are used to encode residual coefficients, and side information such as motion vector bits and overhead bits. As shown in Fig. 3.1, the coding performance of H.264/AVC degrades for smaller QP values. For instance, the RMSE is decreased by 1.35 from QP=30 to QP=24, yet it is decreased by only 0.7 from QP=18 to QP=12. The number of coding bits increases rapidly for smaller QP values, which is primarily attributed to the number of texture bits. We see that the ratio of the texture bits to the side information bits dramatically increases as the QP value decreases. Consequently, for HD video coding, the coding performance of H.264/AVC encounters a bottleneck.
Figure 3.1: The H.264/AVC coding performance and the ratio of texture bits to side information with respect to QP values of 12, 18, 24, and 30. The x-axis is the compression ratio, and the y-axis is the root mean square error (RMSE). The texture information represents the bits used to encode residuals, and the side information includes motion vectors and other overheads. To explain this phenomenon, we plot the error signal after motion compensation and prediction (MCP). For instance, we sample one frame from the “Traffic” sequence of full HD resolution (i.e., 1920×1080) and encode it with QP = 24. The original image frame and the quantized error signal (scaled by a factor of 8 for ease of visualization) are shown in Fig. 3.2 (a) and (b), respectively. We see that the quantized distortion still has a certain structure along the edge regions and in the fine texture regions. The structured residuals are attributed to the limitation of the conventional block-based prediction technique. Figure 3.2: (a) A cropped frame from the original “Traffic” sequence and (b) the quantized distortion of H.264/AVC coding, obtained as the subtraction of the reconstructed frame from the original frame. The error signal is scaled by a factor of 8. 3.2.2 Motivation for FOR/SOR coding framework As discussed in Sec. 3.2.1, the residual image consists of structured patterns after the FOR coding, especially in the fine texture regions and in the neighborhood of edges. It is expensive to directly encode the structured signals with entropy coders. For this reason, we propose the SOR coding method, which can reduce the correlated residuals. We use Fig. 3.3 to offer an intuitive explanation of the basic idea behind the FOR/SOR coding scheme. Let us first examine the R-D curve in the D′-R′_F plane,
The portion between points A and B has larger QP values, where the curve has a steeper slope, and a great majority of coding bits is used for side information, e.g. motion vectors to offer a coarse approximation to the input video. For the portion between points B and C, when the QP values are smaller, the percentage of residual coding bits increases, and its coding efficiency drops with the flatter curve. The above discussion was confirmed by Fig. 3.1. In order to enhance the coding efficiency of the FOR codec in the interval between B and C, we switch from the FOR codec to the SOR codec. Three switch positions are shown in Fig. 3.3 (i.e., points A, B and C) for comparison. The curve in the D ′ −R ′ S plane represents the R-D curve of the SOR coder, which can can start from an arbitrary intermediate point of the FOR R-D curve given by a QP of the FOR codec. In the proposed FOR/SOR coding method, there exist multiple paths using different QP values for the FOR codec. Three switch positions are shown in Fig. 3.3 (i.e., points A, B and C) for comparison. And, there should be one optimal switching position as a result of the R-D optimization. Generally speaking, more coding modes can provide better coding performance than one single mode at the expense of computational complexity. For example, if B is the optimal switch position and B’ is the corresponding value after the switch, the R-D slopes at B and B’ should be the same. Then, one can utilize the SOR R-D curve to enjoy more rapid distortion drop, if the slope of the SOR R-D curve becomes steeper than that of the FOR R-D curve after B’. For the FOR/SOR codec with B as the switch position, the total bit rate is R ∗ F +R ∗ S and the distortion is D MIN . If the slope of the FOR curve is always steeper than that of the SOR curve, then we will need the FOR codec only. Insum,thesignificanceoftheSORR-Dcurveintheproposedalgorithmistoprovide thebestpathminimizingthetotaldistortionwithinabitbudget. 
The R-D points, given by the optimal QP values of the FOR and the SOR coders, connect the best path in the R′-D′ plane. The decision on the proper switch position, which is equivalent to the selection of the optimal QP values, will be discussed in Sec. 3.3. Clearly, the FOR/SOR codec demands higher computational complexity than the traditional H.264/AVC. The trade-off between coding efficiency and computational complexity will be investigated in Sec. 3.4. Figure 3.3: An intuitive explanation of the idea behind the proposed FOR/SOR algorithm using a 3D R-D curve. 3.3 The proposed algorithm The block diagram of the proposed FOR/SOR coding system is shown in Fig. 3.4. In the FOR coding, we aim to remove the spatial and temporal correlation of the original frames. To implement the FOR coder, we introduce minor modifications to the standard H.264/AVC coder, where coding modules such as the block-based MCP, DCT, and context-based entropy coder remain the same. The FOR coder produces a coarse approximation of the input video, and the corresponding bit stream is called the FOR bit stream in Fig. 3.4. Figure 3.4: The block diagram of the proposed FOR/SOR coding system. The input signal to the SOR coder is the difference between the original signal and the reconstructed signal from the FOR coder. The input signal to the SOR coder is the quantized error between the original signal and the reconstructed signal obtained from the FOR coder. The signal reconstructed from the combined outputs of the FOR coder and the SOR coder is represented by the SOR bit stream in Fig. 3.4. The SOR coder is developed to exploit the specific characteristics of the residual signals shown in Fig. 3.2 (b). The SOR coder consists of advanced prediction techniques such as the ISP for inter-coding, block-based DCT, quantization, and entropy coding, which will be explained later. There are two independent prediction modules for the FOR and SOR coders, and their reconstructed frames (denoted by REC_F and REC_S, respectively) are stored in the frame memory as reference frames. As shown in Fig. 3.4, the FOR coder uses a
As shown in Fig. 3.4, the FOR coder uses a 26 reference frame of higher fidelity, which is REC F +REC S , for prediction rather than REC F alone. The coded signal is updated in each macroblock. The input signal to the SORcodecisobtainedasthedifferencebetweentheoriginalsignalandthereconstructed signal after the FOR coding. ItishighlightedthattheproposedcodingsystemisdifferentwithScalableVideoCod- ing (SVC). Scalable video coding is designed to provide functional features in scalability with video decomposition. While coding efficiency may degrade, the main objectiveness of a scalable video codec is to offer the scalable functionality. For instance, H.264/SVC adopts a notion of a single loop decoding that forbids an inter-layered prediction for an inter-coded macroblock since a decoder would demand too much latency due to predic- tion dependency. However, the proposed coding system improves a coding gain with an R-D optimization in the second stage. Since the SOR signal is obtained as a substraction form of the original signal and the reconstructed signal after FOR coding, the SOR signal depends on how to encode the original signal in the FOR coder. As described before, we employ multiple QPs in each layer. Without loss of generality, we formulate the problem with two QPs, denoted by Q F in the FOR coding and Q S in the SOR coding, and the two encoders that produce the FOR and the SOR bit streams are denoted by R F and R S , respectively. Owing to this layered coding structure, the encoder can provide more flexibility in minimizing distortion by allocating different bits to two coders. As noted, we would like to find the proper point to switch two codecs. That is, for a given total bit budget R B , these bits to the FOR and SOR coders are allocated by choosing their optimal qantization parameters so as to minimize the coding distortion. 
By assuming that the operating R-D curves are convex, the bit allocation task can be formulated as the following constrained optimization problem:

    min_{Q_F, Q_S} D(Q_F, Q_S)   s.t.   R_F(Q_F) + R_S(Q_F, Q_S) ≤ R_B,    (3.1)

where Q_F and Q_S are the quantization parameters for FOR and SOR, D(Q_F, Q_S) is the distortion function, R_F(Q_F) and R_S(Q_F, Q_S) are the numbers of coding bits of FOR and SOR, and the minimization is taken over the feasible sets of Q_F and Q_S, respectively. The constrained optimization problem can be converted to a dual unconstrained problem using the Lagrangian approach [37]. That is, we can define the following Lagrangian cost function for the joint optimization:

    J(Q_F, Q_S) = D(Q_F, Q_S) + λ · (R_F(Q_F) + R_S(Q_F, Q_S)),    (3.2)

where λ is the Lagrangian multiplier. The solution can be obtained by minimizing the Lagrangian cost if the two R-D curves are convex. For each macroblock, we improve the R-D performance by selecting the optimal quantization parameter pair (Q*_F, Q*_S). Mathematically, we have

    (Q*_F, Q*_S) = argmin_{Q_F, Q_S} J(Q_F, Q_S).    (3.3)

As illustrated in Fig. 3.3, with the selected Q*_F and Q*_S, we assign the bit rates R*_F and R*_S to FOR and SOR, respectively, to achieve the optimal coding performance. Since it is computationally expensive to perform a full search over all possible candidates in Eq. (3.3), a fast algorithm will be presented at the end of this section. 3.3.1 Proposed FOR coding technique The SMB, also known as extended block sizes, provides a richer set of coding options, which is particularly attractive in HD video coding, since there exist larger smooth regions in high resolution video sequences. The SMB modes use additional larger block sizes up to 64×64, as shown in Fig. 3.5. Furthermore, transforms of larger sizes, such as 16×16, 8×16, and 16×8, can be adopted in association with the SMB.
In smooth regions, a block transform of a larger size can provide better energy compaction with a smaller amount of side information such as motion vectors. Figure 3.5: Super macroblock modes supporting larger block size motion compensated prediction. The probabilities of the SMB modes being selected as the optimal modes for a few representative QPs are shown in Table 3.1. The experiments were conducted on several HD video sequences, such as “traffic”, “vintage car”, “tractor” and “rush hour”, and the average performance is reported.

Table 3.1: The distribution of selected SMB modes (32×32, 32×16, and 16×32) and the smaller block modes supported by H.264/AVC.
    Coding modes     QP 36     QP 28     QP 20     QP 12
    H.264/AVC        18.66%    38.93%    83.74%    92.65%
    Skip (32×32)     40.88%    10.34%    0.28%     0.15%
    32×32            31.99%    32.41%    8.56%     4.32%
    32×16            4.12%     9.34%     3.37%     1.33%
    16×32            4.35%     8.98%     4.05%     1.55%

We see clearly from this table that the SMB modes (either skip or coded) are often selected when QP=36 and 28, e.g., 81%. This justifies the use of SMB in the FOR coding. However, when the QP value is smaller, the chance of selecting these modes decreases quickly. This is caused by the higher prediction accuracy required at high bit rates. Due to the rich spatial variation of the content inside a large block, the SMB mode can no longer provide small residual signals through inter-frame prediction. In high-bit-rate coding, an encoder has a higher bit budget, and it tends to divide an SMB into smaller blocks so as to minimize the distortion at the cost of a slightly higher overhead, since this choice offers a better R-D trade-off. In spite of the good performance at lower bit rates, the effectiveness of the SMB is limited in HD video coding. A simple extension of the prediction technique used in H.264/AVC, e.g., the extension of macroblock sizes, does not work well. However, by adopting the FOR/SOR coding, we can still apply a larger QP and incorporate the SMB in the FOR coder. The residual signal from the FOR coder will then be coded by the SOR coder as discussed below. 3.3.2 Proposed SOR coding technique As shown in Fig. 3.4, the input signal to the SOR coder is the quantized difference between the original and the decoded signals obtained by the FOR decoder. The 2-D residual blocks have salient directional components near object boundaries or edges due to the limited performance of the SMB. Thus, we employ a line-based prediction technique in the SOR coder so as to exploit the line-oriented correlation. For intra-coding, we slightly modify the LIP used in H.264/AVC by setting the default DC prediction value to 0. For inter-coding, we employ an advanced inter-frame prediction technique, called the inter-frame stripe prediction (ISP), to remove the temporal redundancy of the residual frames. A block is divided into multiple stripes with the same orientation, which is determined by pre-defined modes. Four different modes are shown in Figs. 3.6(b)-(e), corresponding to mode 0, mode 1, mode 2 and mode 3, respectively. Each mode has a certain scanning order whose orientation captures the gradient components in the residual signals. For example, mode 0 has a vertical scanning order from bottom to top and then from left to right. With a specific scanning method, we map the pixels in a 2-D block into a 1-D pixel sequence, called a stripe, as shown in Fig. 3.7. The other scanning modes (i.e., modes 4-11) can be generated via rotations in a straightforward way. Additional bits are required to indicate the selected rotation.
Figure 3.6: The directional scanning modes used in the ISP method: (a) illustration of 12 scanning modes from Mode 0 to Mode 11, and (b) Mode 0, (c) Mode 1, (d) Mode 2, and (e) Mode 3.

To select the scanning mode, we find the optimal predictive stripe by comparing all scanning modes in the reference and target blocks via R-D optimization. To give an example, if there exist strong directional components, as indicated by the darker arrows in adjacent frames in Fig. 3.7, the encoder will select mode 1 and mode 2 for the reference and the target blocks, respectively, if they provide the best R-D performance. This process is detailed below.

Fig. 3.7 shows how the ISP is applied to an 8×8 block. Let motion vector (m_x, m_y) in Fig. 3.7 denote the displacement of the reference block in the previous frame. In most cases, the motion vector can be predicted from the motion vector used in the FOR coding or set to 0, which represents a co-located block in the previous frame.

Figure 3.7: The directional scanning of pixels in a 2D block and their mapping to a 1D connected stripe in the ISP method; for instance, mode 2 and mode 1 are applied to the target signal and the reference signal, respectively. Motion vector (m_x, m_y) indicates the position of the reference block in the previous frame.

For the target stripe, we partition the stripe of 64 pixels into four equal-length segments, each of which consists of 16 pixels. Then, each of the four segments is individually predicted from the connected stripe in the reference signal. To increase matching accuracy, we introduce a disparity vector to adjust the relative positions of the matching windows between the predicted segment and the reference stripe.
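The 2-D-to-1-D mapping behind the stripes of Figs. 3.6 and 3.7 can be sketched in Python. Only mode 0 (vertical, bottom-to-top, then left-to-right) is spelled out here, since the other modes follow by rotation; the function names are illustrative, not from the original implementation.

```python
def mode0_scan(n):
    """Scan order of ISP mode 0 (Sec. 3.3.2): vertical, bottom-to-top
    within each column, columns visited from left to right."""
    return [(n - 1 - i, c) for c in range(n) for i in range(n)]

def block_to_stripe(block, scan):
    """Map a 2-D residual block to a 1-D connected stripe by visiting
    the pixels in the given scan order."""
    return [block[r][c] for (r, c) in scan]

# Example: a 4x4 block; the stripe starts at the bottom-left pixel.
block = [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]
stripe = block_to_stripe(block, mode0_scan(4))
# First column bottom-to-top gives 12, 8, 4, 0, then the next column.
```

For an 8×8 block this yields the 64-pixel stripe that is subsequently partitioned into four 16-pixel segments for prediction.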
For example, the target signal from pixel 16 to pixel 31 is predicted using the reference signal windowed from pixel 17 to pixel 32, as shown in Fig. 3.7. The disparity vector is 1, which is the shift of the windows. Although the length of the line segments can vary in general, we stick to the case where all line segments are of equal length to reduce the complexity and the overhead bits.

The motion prediction can be concatenated for more efficient prediction. That is, we find an appropriate reference block with the motion vector and then perform the ISP. To give an example, motion vector (m_x, m_y) in Fig. 3.7 is used to indicate the displacement of the reference block in the previous frame. The side information, such as motion vectors, disparity vectors, and scanning modes, is jointly determined by minimizing the following Lagrangian cost function:

    min_{dv,mv,M} J_PRED(dv, mv, M) = min_{dv,mv,M} D(dv, mv, M) + λ · R(dv, mv, M),   (3.4)

where dv, mv and M denote the disparity vector, the motion vector and the scanning mode, respectively, D is the distortion between the target and the reference signals, R is the number of bits required to encode the side information, and λ is the Lagrangian multiplier, which is chosen empirically.

After ISP, the second-order residual signal is inverse-mapped to a 2D block and fed into the 2D DCT, quantization, and context-based entropy coding processes. For the entropy coding, we apply the same contexts as used in H.264/AVC, since the context behavior of SOR coefficients is similar to that of FOR coefficients.

To understand the contributions of MCP, FOR coding, and ISP to video coding, we show residual images of the "Foreman" sequence after MCP, FOR coding, and ISP in Fig. 3.8. After MCP, the residual image contains errors in the form of surface errors (see the mouth region) and line errors (see the edge region). After the FOR coding, the FOR residual coder helps remove surface errors, as shown in Fig. 3.8 (b) and (c), yet line errors still remain.
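The per-segment disparity search of Eq. (3.4) can be sketched as follows. The rate term here is a hypothetical toy model (disparity magnitude in bits); the actual encoder measures the true side-information bits.

```python
def best_disparity(target_seg, ref_stripe, start, lam, search=2):
    """For one target segment beginning at 'start' in scan order, search
    a small disparity range and return the disparity minimizing the
    Lagrangian cost D + lambda * R of Eq. (3.4). R is approximated by
    a toy rate model: 1 + |disparity| bits (illustrative only)."""
    n = len(target_seg)
    best = None
    for d in range(-search, search + 1):
        lo = start + d
        if lo < 0 or lo + n > len(ref_stripe):
            continue
        window = ref_stripe[lo:lo + n]
        dist = sum((t - r) ** 2 for t, r in zip(target_seg, window))
        cost = dist + lam * (1 + abs(d))        # D + lambda * R
        if best is None or cost < best[0]:
            best = (cost, d)
    return best[1]

# Windowing example of Fig. 3.7: the segment covering stripe positions
# 16..31 is best matched by reference positions 17..32, i.e. disparity 1.
ref = list(range(64))
tgt = ref[17:33]            # target segment equals the reference shifted by 1
d = best_disparity(tgt, ref, 16, lam=0.5)
```

In the full scheme this search runs jointly with the motion vector and scanning mode, all folded into the same Lagrangian cost.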
Finally, the strong line errors are suppressed by ISP, as shown in Fig. 3.8 (d).

Figure 3.8: (a) The original signal, (b) the residual signal after MCP, (c) the residual signal after FOR coding, and (d) the residual signal after SOR coding.

The justification of ISP is given below. The residuals after the FOR coding often contain stripe patterns along a certain direction. We show the correlation between the gradient magnitude in a 16×16 macroblock and the MSE per pixel in Fig. 3.9, where the gradient is computed with the first-order edge detection filter. We see from the figure that, if the residuals have a more salient directional property, there exists more room for further improvement by the SOR coder. ISP re-organizes the directional patterns so that the correlation along the stripe can be exploited more conveniently. Clearly, the traditional MCP cannot achieve such a goal, and this is confirmed by experiments. To demonstrate the impact of ISP, we show the result of the SOR coding with ISP disabled in Fig. 3.10. The coding performance is even slightly inferior to the FOR coding alone due to the coding overhead. As a result, the SOR coder without ISP is rarely adopted in the R-D optimization.

Figure 3.9: Correlation between the gradient magnitude and MSE after the FOR coding in 16×16 macroblocks.

3.3.3 A fast approach to two-QP selection between the FOR coder and SOR coder

The optimal QP values of the FOR and SOR coders can be obtained by minimizing the cost function in Eq. (3.2). The solution lies in

    Ω = Q_F × Q_S = {(Q_F, Q_S) | Q_F ∈ Q_F, Q_S ∈ Q_S},   (3.5)

where Q_F = Q_S = {1, 2, ..., 51} is the admissible quantization parameter set. The exhaustive search over Ω is computationally expensive, and it is desirable to develop a fast algorithm to find a sub-optimal solution. We use Q_T to denote a target QP which offers an operating point in the encoder. That is, the Lagrangian multiplier λ in Eq.
(3.2) is a function of Q_T, given as

    λ = 0.85 · 2^{(Q_T − 12)/3},   (3.6)

and the Lagrangian cost can be minimized for a given Q_T.

Figure 3.10: The R-D performance comparison between the FOR codec and the FOR/SOR codec with ISP disabled.

For a given Q_T, a feasible range of Q_F and Q_S can be represented as Q_F = Q_T + δ_F and Q_S = Q_T − δ_S, where δ_F and δ_S are parameters that take positive integer values. It is straightforward to relate Q_F and Q_S via

    Q_S = Q_F − (δ_F + δ_S).   (3.7)

It is verified by extensive experiments that the Lagrangian cost is a convex function in the neighborhood of the optimal quantization parameters, denoted by (Q*_F, Q*_S). Following the above discussion, we can search for the optimal pair (δ*_F, δ*_S) instead. Furthermore, we decompose the 2-D search problem into two cascaded 1-D search problems, as shown in Fig. 3.11. That is, we first search along the direction of δ_F + δ_S = k. In Sec. 3.4, we set k to 2 or 3, since they are empirically the two best choices, and we select the best candidate, denoted by k*. Next, for the given k*, we determine δ*_F via

    δ*_F = argmin_{0 < δ_F < k*} J(δ_F, k* − δ_F),   (3.8)

where the cost function J is given in Eq. (3.2). Finally, we have δ*_S = k* − δ*_F.

Figure 3.11: Fast search of (Q*_F, Q*_S) via two cascaded 1-D search processes.

3.4 Experimental Results

In this section, we perform experiments on a set of HD test video sequences in the YUV 4:2:0 format to demonstrate the superior R-D performance of the proposed algorithm. They include "Vintage Car," "Traffic," "Tractor," "Ashtray Smoke," "Sun Flower," "Blue Sky," "Blowing Trees," "Soccer," and "Harbour". The proposed FOR/SOR algorithm is implemented based on the KTA (Key Technology Area) software [18], version 2.6. The main coding parameters are listed below.
• FEXT High Profile
• CABAC
• RDO On ("High Complexity Mode")
• QP 26, 28, 30, 32, 34, 36, 38, and 40
• IPPP...
• Transform 8×8 On
• Quarter-pel motion search with EPZS

Table 3.2: The BD-rate and BD-PSNR [46] of the proposed FOR/SOR coding algorithm as compared with the JM/KTA and the FOR coder only, where ΔBits is in units of % and ΔPSNR is in units of dB; a negative value indicates a saving as compared with the benchmark.

Sequence                   vs. JM/KTA            vs. FOR only
Vintage Car (1080p)        −26.1% (0.69 dB)      −23.3% (0.58 dB)
Traffic (1080p)            −15.8% (0.75 dB)       −3.0% (0.13 dB)
Blue Sky (1080p)           −14.2% (0.73 dB)       −6.6% (0.31 dB)
Tractor (1080p)             −8.6% (0.31 dB)       −4.9% (0.20 dB)
Sun Flower (1080p)          −9.9% (0.36 dB)       −1.2% (0.05 dB)
Ashtray Smoke (1080p)      −23.2% (1.07 dB)       −9.2% (0.38 dB)
Blowing Trees (1080p)      −17.3% (0.76 dB)      −13.9% (0.66 dB)
Soccer (4CIF)               −9.2% (0.38 dB)       −5.3% (0.22 dB)
Harbour (4CIF)             −16.2% (0.62 dB)       −9.6% (0.44 dB)
Average                    −15.6% (0.60 dB)       −8.1% (0.33 dB)

We compare the performance of the proposed FOR/SOR codec with two benchmarks: 1) H.264/AVC, and 2) the FOR coder only, which includes SMB on top of H.264/AVC. In Table 3.2, we employ the Bjontegaard method [46] to measure the average bit rate reduction (i.e., BD-rate) and the equivalent PSNR improvement (i.e., BD-PSNR). As shown in the table, the proposed FOR/SOR codec offers significantly better R-D performance with respect to the two benchmarks over a wide range of bit rates. The BD-rate saving and the BD-PSNR gain are, respectively, about −15.6% and +0.60 dB with respect to H.264/AVC, and −8.05% and +0.33 dB with respect to the FOR coder.

Figure 3.12: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Traffic".

Figure 3.13: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Ashtray Smoke".

Figure 3.14: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Sun Flower".
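Before examining the R-D curves, the cascaded 1-D QP search of Sec. 3.3.3 can be made concrete with a short sketch. The cost function below is a toy convex surrogate; the actual J of Eq. (3.2) requires encoding the macroblock with each (Q_F, Q_S) pair.

```python
def lagrange_multiplier(qt):
    """Lagrangian multiplier of Eq. (3.6): 0.85 * 2^((QT - 12) / 3)."""
    return 0.85 * 2.0 ** ((qt - 12) / 3.0)

def fast_qp_search(qt, cost, ks=(2, 3)):
    """Cascaded 1-D search of Sec. 3.3.3: first pick the best offset sum
    k = delta_F + delta_S, then the best split delta_F within it.
    'cost' is a callable J(QF, QS); here it is a stand-in for the true
    R-D cost of coding the macroblock."""
    def best_split(k):
        # Search along delta_F + delta_S = k; Eq. (3.7) gives QS = QF - k.
        return min(((cost(qt + df, qt + df - k), df)
                    for df in range(1, k)), key=lambda t: t[0])
    k_star = min(ks, key=lambda k: best_split(k)[0])
    _, df_star = best_split(k_star)
    return qt + df_star, qt + df_star - k_star   # (QF, QS)

# Toy convex cost with its minimum at (QT + 1, QT - 2), i.e. k = 3.
qt = 26
toy = lambda qf, qs: (qf - (qt + 1)) ** 2 + (qs - (qt - 2)) ** 2
qf, qs = fast_qp_search(qt, toy)
```

With the toy cost the search returns (Q_F, Q_S) = (27, 24), consistent with the candidate pairs of Ω_0.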
Figure 3.15: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Blue Sky".

Figure 3.16: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Vintage Car".

Figure 3.17: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Tractor".

Figure 3.18: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Blowing Trees".

The R-D performance of the proposed FOR/SOR codec and that of the two benchmark codecs is compared. Fig. 3.12 ∼ Fig. 3.15 clearly show the effectiveness of the proposed algorithm at high bit rates. The FOR coder provides remarkable performance at lower bit rates with efficient FOR coding techniques tailored to HD video sequences, such as the SMB. In particular, in very low bit-rate ranges, the FOR coding shows the best coding performance, and the SOR coder only slightly improves the coding performance because of the increased side information, e.g., a flag distinguishing the FOR layer and the SOR layer. However, as studied in Table 3.1, the FOR coder runs into a bottleneck with the structured residual signals and fails to yield consistent coding performance as higher fidelity is required. In the high-bit-rate coding scenario, a more accurate prediction is required; otherwise, even a small prediction error with high-frequency components can degrade the overall coding performance. In the proposed algorithm, the SOR coder can be efficiently switched on so as to exploit the structured noise components and encode them with a different coding method. As a result, the proposed algorithm significantly outperforms H.264/AVC over a wide bit-rate region.

Figure 3.19: The R-D performance comparison of the proposed algorithm and the reference algorithms in "Harbour".
We have seen that the block-based prediction techniques can be efficient at lower bit rates but have some limitations at high bit rates, and that the proposed algorithm can efficiently overcome these limits by exploiting the inter-layer dependencies. On top of that, Fig. 3.16 ∼ Fig. 3.19 show that the proposed algorithm can provide significantly improved R-D performance as compared to H.264/AVC even when the FOR coder is not efficient at lower bit rates. Commonly, test sequences such as "Vintage Car," "Tractor," and "Blowing Trees" have many fine textures and fast motions instead of large homogeneous areas. For example, "Blowing Trees" contains a large number of irregular movements of leaves, and "Vintage Car" has many details not only in objects but also in backgrounds such as the forest and the ground. Therefore, the residual signals still contain rich image features. In this scenario, the block-based motion prediction does not perform well, and the simple extension of the block size is even worse. As a result, the R-D performance of the FOR coder is almost the same as that of H.264/AVC. Nevertheless, the proposed algorithm can efficiently deal with the limitations caused by the motion prediction and provide better coding performance. The proposed algorithm shows a significant bit saving of more than 25% in the experiment shown in Fig. 3.16.

Figure 3.20: Bit allocation of the FOR and the SOR coding and the performance comparison between the proposed algorithm and H.264/AVC in the same PSNR range (FOR bits: 984; SOR bits: 498). SOR bits consist of the texture bits of the SOR coding and the overheads.

The numbers of bits used in the FOR coding and the SOR coding for the "Vintage Car" sequence are shown in Fig. 3.20.
For comparison, the distribution of coding bits in H.264/AVC for a similar PSNR range is also given. H.264/AVC and the FOR/SOR codec consume 1,974 bits and 1,482 bits in total for one SMB, respectively, while the PSNR values are, respectively, 44.23 dB and 44.15 dB. This corresponds to a bit rate saving of 24.9%. In Fig. 3.20, the "texture bits in FOR (or SOR)" represent the number of bits required to encode residuals by the FOR (or SOR) coder, and the "side information in FOR (or SOR)" includes motion vectors and other overhead information. We see that a very significant portion of the bits in H.264/AVC was spent on residual coding. In contrast, in the FOR/SOR codec we trade bits required by residual coding for bits of non-residual data, while the total number of bits is reduced.

The FOR/SOR method demands higher complexity caused by the R-D optimization for QP selection in Eq. (3.8). The R-D optimization process is performed per macroblock, and the complexity increases as the number of QP candidates becomes larger. It is important to find a good balance between computational complexity and coding performance. It is observed from extensive experiments that k* = 3 in Eq. (3.8) often provides the best result. As a result, we can focus on the following QP candidate set:

    Ω_0 = {(Q_T + 1, Q_T − 2), (Q_T + 2, Q_T − 1), (Q_T, N/A)},   (3.9)

where the last item indicates that the SOR coder is turned off and Q_F = Q_T. Furthermore, we consider three other QP candidate sets:

    Ω_1 = {(Q_T + 1, Q_T − 2), (Q_T, N/A)},   (3.10)
    Ω_2 = {(Q_T + 2, Q_T − 1), (Q_T, N/A)},   (3.11)
    Ω_3 = {(Q_T + 3, Q_T − 1), (Q_T + 1, Q_T − 3), (Q_T + 2, Q_T − 2),
           (Q_T + 1, Q_T − 2), (Q_T + 2, Q_T − 1), (Q_T, N/A)}.   (3.12)

We compare coding efficiency and encoding time for the above four cases in Table 3.3, with JM/KTA as the reference. The Δ BD-rate represents the BD-rate saving with respect to the reference, and the encoding time indicates the average time required to encode one 1080p frame. For comparison, we select "Blowing Trees" as the test sequence, and the encoding time is measured at Q_T = 26 on an Intel 2.2 GHz CPU. As shown in Table 3.3, the FOR encoder requires more computational complexity than the JM/KTA software. The increase is mainly attributed to the large block sizes in motion search. As to Ω_1 and Ω_2, the FOR coding is conducted twice, and the SOR coding is conducted once for the full R-D search. Based on the measured time, the SOR

Table 3.3: Comparison of encoding time and BD-rate saving of several QP selection schemes as compared to the JM/KTA software.
For comparison, we select “Blowing Tree” as the test sequence, and the encoding time is measured in Q T = 26 with Intel CPU 2.2 GHz. As shown in Table 3.3, the FOR encoder requires more computational complexity over the JM/KTA software. The increase is mainly attributed to the large block size in motion search. As to 1 and 2 , the FOR coding is conducted twice, and the SOR coding is conducted once for full R-D search. Based on the measured time, the SOR 45 Table 3.3: Comparison of encoding time and BD-rate saving of several QP selection schemes as compared to the JM/KTA software. Tested QP set ∆ BD-rate Encoding time JM/KTA {(Q T ;N=A)} - 25.6 (sec) FOR only {(Q T ;N=A)} -3.4% 43.2 (sec) FOR/SOR 1 -14.4% 111.2 (sec) 2 -13.7% 107.8 (sec) 0 -17.3% 184.1 (sec) 3 -16.9% 372.7 (sec) encoding time is about equal to 22 sec. 0 offers the best coding performance among the four QP candidate sets with its complexity about four times of the FOR coding. For 3 , its coding efficiency is slightly worse than that of 0 (caused by the overhead required to encode more QP choices) while its complexity is about twice of that of 0 . Thus, we can get a fast QP selection algorithm by focusing on candidates in 0 . 3.5 Conclusion An efficient HD video coding technique based on the FOR/SOR coding idea was pre- sented. For the FOR codec, the block based coding technique such as the block-based intra prediction (BIP) technique and super macroblock (SMB) was employed. For the SOR codec, a novel line-oriented prediction technique, called the inter-frame stripe pre- diction (ISP), was developed to exploit the directional feature of the structured resid- ual. The bit allocation problem between the FOR codec and the SOR codec was addressed, which corresponds to using two quantization parameters in adaptive. It was demonstrated by simulation results that the proposed FOR/SOR codec outperforms the H.264/AVC by an averaged bit rate saving of about 15.6%. 
The HEVC standard provides an improved coding gain over H.264/AVC for HD video [11]. It is interesting to generalize the FOR/SOR coding idea to the HEVC framework. The reference software of HEVC, called the HM software [25], can be used for the FOR coding, and the SOR coding can be developed and integrated. It is expected that the resulting FOR/SOR coder will provide an improved coding gain at high bit rates for HEVC as well. This idea will be explored further in the near future.

Chapter 4

Two-layered Transform with Sparse Representation (TTSR) for Video Coding

4.1 Introduction

Sparse representation, which allows a transform of a signal as a weighted linear combination of very few atoms in a dictionary, has been intensively studied in recent years and effectively applied to various image/video processing applications [45], [60]. The concept of sparse representation is utilized for video compression by incorporating an overcomplete dictionary into a codec instead of using the conventional transforms. The dictionary can be designed to include more useful prototype approximations of a signal and, thus, can form a signal with a small number of atoms. Meanwhile, the two-dimensional (2-D) block discrete cosine transform (DCT) has been proven to be effective in practical image/video coding standards. The new video coding standard, i.e., High Efficiency Video Coding (HEVC), also adopts the 2-D DCT as the core transform [28]. Our approach leverages the benefits of both worlds. The aim is to integrate sparse representation into the state-of-the-art coding standard through a novel two-layered transform scheme.

The DCT consists of sinusoidal basis functions, and a signal can be efficiently represented by its DCT coefficients if it can be expressed as the linear combination of a small number of DCT basis functions.
In the JPEG image coding standard, the DCT is applied to image sources which can be well approximated by a first-order stationary Markov model with a strong inter-pixel correlation. However, in the context of video coding, the DCT is applied to prediction residuals after inter or intra prediction. The residual signals often contain strong gradient components caused by motion-compensated prediction errors along edges or object boundaries, and exhibit a more dynamic correlation between neighboring pixels [48], [59], [61]. The 1-D horizontal and vertical sinusoidal bases cannot efficiently represent the slant features and, thus, lead to undesired degradation such as the ringing artifact. Accordingly, the traditional DCT may not provide the optimal energy compaction property for prediction residuals in video coding.

Several modified DCT schemes have been proposed to address the non-stationary property of prediction residuals. They have been developed to account for the slant features in residual signals. One idea is to apply an adaptive transform [41], [66], which includes directionally oriented basis functions, to directional components. For comparison, the 2-D separable DCT contains only vertically or horizontally oriented bases. In [67], the position and size of a transform block can be varied to localize prediction errors. During the HEVC standardization, several contributions were made to the development of the transform core experiment (CE) with supplementary unitary transforms. The proposals are categorized into mode-dependent transforms and secondary transforms. In the current HEVC design, the 1-D DCT and 1-D DST are adaptively applied to the intra 4×4 transform unit (TU) based on the intra prediction mode [7]. Besides, a mode-dependent KLT and a boundary adaptive transform were also proposed in this CE [5], [8]. In [29], the 1-D vertical/horizontal transforms are selectively skipped to yield a better coding gain.
For the secondary transform, a rotational transform was proposed to move transform coefficients to low-frequency positions to facilitate entropy coding [6].

In this work, two efficient prediction residual coding techniques are proposed for advanced residual signal processing. One is the so-called two-layered transform with sparse representation (TTSR). The basic idea of TTSR can be intuitively explained as follows. The sparse representation is first exploited to encode the structured pattern of the residual signal. In other words, the sparse representation provides an effective yet coarse approximation to structured residuals, which is implemented with an adaptively trained dictionary and a sparse coding scheme. After the sparse coding, the remaining residual signal is coded by the DCT in cascade. The proposed TTSR algorithm adopts the advanced sparse coding idea and integrates it with the state-of-the-art HEVC video codec to yield an impressive coding gain.

The other proposed method is a slant residual shift (SRS) technique that aligns dominant inter prediction residuals with the horizontal or vertical direction via row-wise or column-wise circular shifts, respectively, before the DCT. The objective is to improve the coding gain of the conventional 2-D block DCT. To determine the proper shift amount, we classify blocks into several types, each of which is assigned an index number. Then, these indices are sent to the decoder as a signaling flag, which can be viewed as the mode information of the SRS technique. The proper shift amount of a block type can be learned through an off-line process.
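The row-wise circular shift underlying SRS can be sketched as follows. The per-row shift amounts below are chosen by hand for a toy diagonal pattern; in the actual scheme the shifts come from the learned per-type tables.

```python
def row_circular_shift(block, shifts):
    """SRS sketch: circularly shift each row of a residual block so that
    a slant feature lines up vertically before the 2-D DCT. shifts[r]
    is the left-rotation applied to row r (illustrative values here;
    the real shift amounts are learned off-line per block type)."""
    out = []
    for row, s in zip(block, shifts):
        s %= len(row)
        out.append(row[s:] + row[:s])
    return out

# A 4x4 residual with a diagonal line of large values.
blk = [[9, 0, 0, 0],
       [0, 9, 0, 0],
       [0, 0, 9, 0],
       [0, 0, 0, 9]]
# Rotating row r left by r pixels moves the diagonal into column 0,
# a pattern the separable DCT represents with far fewer coefficients.
aligned = row_circular_shift(blk, [0, 1, 2, 3])
```

A column-wise variant handles near-horizontal slants; the chosen variant and shift index form the SRS mode signaled to the decoder.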
It is demonstrated by experimental results that the proposed SRS algorithm outperforms the HEVC reference codec by a substantial margin.

The rest of this paper is organized as follows. Previous video coding techniques with sparse representation are reviewed in Sec. 4.2. The two-layered transform with sparse representation technique is described in Sec. 4.3. The slant residual shift (SRS) technique is presented in Sec. 4.4. Experimental results are given in Sec. 4.5 and Sec. 4.6 to demonstrate the effectiveness of the proposed algorithms. Finally, concluding remarks and future work are given in Sec. 4.7.

4.2 Review of Residual Coding with Sparse Representation

Sparse representation accounts for an approximation of a signal as a weighted linear combination of a few atoms in an overcomplete dictionary D [39]. Typically, a dictionary contains general prototypes of signals called atoms, and the number of atoms is much larger than the signal dimension N. Since sparse representation is an important ingredient in the proposed algorithm, we conduct a quick review of this subject in this section.

Consider an image signal y ∈ R^N (e.g., a block having a size of N = n × n pixels). The signal y can be represented using superpositions of orthonormal bases denoted as T = [t_1, ..., t_N]. Mathematically, the signal representation is given as

    y = Σ_{i=1}^{N} α_i t_i,   (4.1)

where α_i is a transform coefficient that is the inner product of y and t_i, denoted as <y, t_i>. Because T contains linearly independent bases spanning the signal space, it is called a complete dictionary. Instead of (4.1), using an overcomplete dictionary D ∈ R^{N×K} (K > N), y can be approximated with a sparse representation. The initial research using overcomplete dictionaries was conducted in [58], which built dictionaries with modulated Gabor atoms or other mathematical models created by a union of orthonormal bases such as wavelets, the DFT, and curvelets.
Mathematically, the sparse representation is described by solving an error-constrained optimization problem, given as

    min_b ‖b‖_0   subject to   ‖y − Db‖_2 ≤ ε,   (4.2)

where D contains K columns, b ∈ R^K is the coefficient vector for the columns of D, ‖b‖_0 is the l_0 norm of vector b, counting its non-zero entries, and ε is an error tolerance for the approximation. Typically, both D and b are unknowns.

A natural question arising from (4.2) is how to select the dictionary D to facilitate the sparse representation. It is appealing to adopt a pre-specified dictionary as a collection of parameterized waveforms because it is simple. However, to fit the given samples better, the dictionary D can be designed by learning from real-world signals to leverage the characteristics of residuals. The K-SVD algorithm [36] is such an effective technique to adapt a dictionary from samples and provide a sparse representation. The error-constrained optimization in (4.2) can be converted into a sparse-constrained problem by exchanging the term to be minimized and the constraint, so as to obtain a dictionary allowing a sparse representation. That is, we replace the error tolerance in (4.2) by a sparsity constraint C on b, requiring that b have at most C non-zero elements. We call the sparsity constraint C the sparsity. Then, the formulation is changed to the sparse-constrained problem:

    min_{b,D} ‖y − Db‖_2   subject to   ‖b‖_0 ≤ C.   (4.3)

The K-SVD algorithm offers an iterative procedure to update the coefficient vectors and the dictionary matrix. As a result, the dictionary is trained to provide a representation of a signal with a number of non-zero entries less than or equal to C, and high energy compaction can be achieved in the few coefficients. The K-SVD is applied in the proposed algorithm for the dictionary training.
The sparse representation of motion-compensated residual signals is deployed in dictionary-based video coding [51], [62], where a dictionary trained on residual samples is used for the transform instead of the DCT. The dictionary can be designed to contain useful patterns of a residual signal, e.g., directional components, and provided a coding gain comparable to H.264/AVC at very low bit rates together with a better subjective quality [51]. The atoms in the dictionary are not orthogonal, and their mutual incoherence [44] is sufficiently large to enhance the sparse representation. However, the entropy coding becomes inefficient because of the significantly increased side information. The atom indices in D are directly coded and transmitted to the decoder as side information, and the bits for the side information are too demanding to retain the advantage of the energy compaction provided by the sparse representation. Accordingly, the coding efficiency drops at high bit rates, when a greater number of atoms is required.

Figure 4.1: The signal decomposition scheme in the proposed algorithm using multiple representations.

Motivated by this observation, we consider that multiple representations of a signal allow a more flexible decomposition using different transforms in cascade, and thus can provide a better coding efficiency. Fig. 4.1 shows how a signal can be represented using multiple representations. The given signal y is reconstructed from y_F, a reflection of y onto the signal space spanned by the atoms in D, and the remaining signal y_S. There exist different paths to achieve the reconstruction, illustrated with y_F^1 + y_S^1 and y_F^2 + y_S^2 as an example. In this paper, we employ the sparse representation and the DCT as the two-stage transforms and find the best combination with an R-D analysis. The sparse representation removes a large portion of the energy in structured components of residual
Thesparse representation removes a large portion of energy in structured components of residual 53 signals with fewer coefficients given with a learned dictionary, while the DCT is used to efficiently encode the stationary part left from the sparse coding. 4.3 Proposed TTSR Scheme 4.3.1 System Overview of TTSR Scheme A block diagram of the proposed coding system is shown in Fig. 4.2. On top of the con- ventional video coding scheme, the proposed algorithm includes multiple transforms in cascadeappliedtoresidualsignals. Thedecomposedsignalsbythesparserepresentation and the DCT are respectively denoted as y F and y S in Fig. 4.2 (b), and the quantized signals are y ′ F and y ′ S . As stated above, the residual signals undergo the two step approximation. In the sparse representation denoted as T 1 , we perform a sparse representation of y incorpo- rated with an off-line trained dictionary and the Matching Pursuit (MP) as a forward transform. The dictionary is well adapted to encode a residual signal containing high frequencycomponentsconsideredintheDCT,e.g., directionalcomponents. Itisempha- sized that the sparse representation produces a coarse approximation of y using fewer coefficients, which isy ′ F . Thus, the sparse coding can be considered as a lossy transform providing the remainder from the approximation. The input signal to the 2-D DCT, i.e., y S is the quantized error between the source and the reconstructed one obtained from the sparse representation, and successively coded with the DCT (T 2 ). In the decoder side, y is reconstructed as the sum of y ′ F and y ′ S . A CABAC-based entropy coder is modified to encode the transform coefficients and related side information given by the sparse representation, while the same transform coefficient coding scheme with the current HEVC design is applied for the DCT coeffi- cients. The main change for the sparse representation is associated with the information provided by the MP. 
The MP iteratively selects the atom having the highest inner product in the dictionary. Then, the indices of the selected atoms and the coefficient values from the inner products are obtained. The indices and quantized coefficients are transmitted to the decoder and used for synthesizing y'_F. More details are provided in the sequel.

Figure 4.2: A block diagram of (a) the conventional video coding scheme and (b) the proposed TTSR technique.

4.3.2 Dictionary Training

Prediction residual samples are used in training to create a dictionary that contains a set of representative patterns of residuals. As introduced in (4.3), the K-SVD [36] is employed to generate the dictionary, because it is known to provide a better approximation to the probability distribution of the underlying source than other clustering algorithms with the same number of atoms. The dictionary is learned off-line ahead of the actual encoding, and the same dictionary is stored on both the encoder and decoder sides. Fig. 4.3 shows the dictionaries learned to transform an 8×8 TU and a 4×4 TU. Each image pattern represents an atom in a dictionary. Because the number of atoms is two times larger than the signal dimension, they are called 2× overcomplete dictionaries. For example, the dictionary for a 4×4 TU contains 32 atoms for representing a signal of dimension 16.

Figure 4.3: Samples of representative residual signals trained using the K-SVD algorithm.

Because the DCT is expected to efficiently transform stationary components, the dictionary is guided to include more faithful features of non-stationary signals to complement the DCT. For example, the DCT yields considerable high-frequency coefficients in representing anisotropic patterns that are often observed near boundary edges or in finely textured regions after prediction. In contrast, the dictionary can be designed to include more useful atoms tailored to such representations.
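The MP step described above can be sketched as follows. The tiny 2× overcomplete dictionary here is hand-picked for illustration, not a trained K-SVD dictionary, and quantization of the coefficients is omitted.

```python
import math

def matching_pursuit(y, D, sparsity):
    """Matching Pursuit sketch (Sec. 4.3.1): greedily pick the atom with
    the largest inner product with the current residual, record its
    index and coefficient, and subtract its contribution. D is a list
    of unit-norm atoms."""
    r = list(y)
    picks = []                       # (atom index, coefficient) pairs
    for _ in range(sparsity):
        coeffs = [sum(a * x for a, x in zip(atom, r)) for atom in D]
        k = max(range(len(D)), key=lambda i: abs(coeffs[i]))
        c = coeffs[k]
        picks.append((k, c))
        r = [x - c * a for x, a in zip(r, D[k])]
    return picks, r                  # selected atoms and remainder y_S

# Toy 2x overcomplete dictionary for a 2-D signal (4 atoms, N = 2).
s = 1 / math.sqrt(2)
D = [[1.0, 0.0], [0.0, 1.0], [s, s], [s, -s]]
y = [3.0, 3.0]
picks, rem = matching_pursuit(y, D, sparsity=1)
# The diagonal atom (index 2) matches exactly; the remainder is ~0.
```

In TTSR the picked indices and quantized coefficients form the side information, and the remainder (after quantization) becomes the input y_S to the DCT stage.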
Following this idea, a regularization step with a classifier is added to prevent a trained dictionary from losing salient features to the massive low frequency components in residuals, and to provide a better capability for representing the non-stationary characteristics. We first examine the statistical characteristics of residual signals and exploit their energy, i.e., variance, to characterize them. It is clearly observed from Fig. 4.4 that the energy presented on the y-axis is highly correlated with the gradient strength on the x-axis, computed with a first-order edge detection filter. The assumption drawn from this observation is also well verified in previous works [48], [59].

Figure 4.4: Correlation between the gradient magnitude and the variance, i.e., energy, of residual signals. The observations are from two different HD videos, (a) "Traffic" and (b) "Tractor".

Fig. 4.5 shows the system that supports the regularized training, which provides an adaptive dictionary. The classifier operates on pre-defined conditions of the residuals: i) the variance of the residuals, ii) the intra or inter type or mode, and iii) the size of the TU during the RDO process. First, the classifier accepts input residual samples only if their energy is higher than a threshold, based on the assumption made in Fig. 4.4. Second, different block types, e.g., inter/intra or luma/chroma block samples, are trained into different dictionaries. Lastly, the size of the TU is considered. An encoder may generate numerous residual samples for different TU sizes during the RDO. For the regularization, once a particular TU size is determined as the best mode, the corresponding residual sample is used for training the dictionary of that TU size. The multiple dictionaries trained under these pre-defined conditions are adaptively used in the actual encoding.

We demonstrate the superior energy compaction of a trained dictionary compared to the DCT.
For the tests, we set the sparsity to 4, so a signal needs only four coefficients from a given dictionary for its approximation. In Fig. 4.6, the x-axis represents the order of the non-zero coefficients by magnitude. That is, a diagonal scan is applied to collect the non-zero coefficients of the DCT, while the iteration order of the MP is used for the trained dictionary. The y-axis represents the expected normalized energy of the transform coefficient in that position. It is clearly seen from Fig. 4.6 that a significant portion of the energy concentrates in a smaller number of transform coefficients with a trained dictionary. Moreover, the energy level decreases more quickly with the regularized training.

Figure 4.5: A regularization scheme for training residual samples to create multiple dictionaries that are adaptively selected by the encoder.

For a quantitative measurement, we define the energy compaction measure Ξ [49] as the ratio of the arithmetic mean to the geometric mean of the squared transform coefficients:

\Xi = \frac{\frac{1}{N}\sum_{i=1}^{N}\theta_i^{2}}{\left(\prod_{i=1}^{N}\theta_i^{2}\right)^{1/N}},  (4.4)

where N is set to 4 to assess the reconstruction capability, and θ_i denotes the transform coefficient in the ith scanning order. Physically, a larger Ξ implies stronger energy compaction of the transform. Table 4.1 lists the ratios for the signal representation schemes tested in Fig. 4.6; the adaptive dictionary with the regularization scheme provides the highest energy compaction ratio.

Table 4.1: Energy compaction ratio Ξ of the different schemes for signal representation.

Transform                  | Ξ
DCT                        | 1.1993
SR without regularization  | 1.2623
SR with regularization     | 1.4962

Figure 4.6: Energy compaction of the regularized and non-regularized dictionaries, and the DCT.

4.3.3 Two Layered Transform and Its Residual Analysis

As shown in Fig. 4.2, the residual signals after prediction are represented with multiple transforms in cascade.
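The energy compaction measure Ξ of (4.4), the ratio of the arithmetic to the geometric mean of the squared coefficients, can be computed directly; this is a minimal sketch with an illustrative function name.

```python
import math

def energy_compaction(coeffs):
    """Ratio of arithmetic mean to geometric mean of the squared
    transform coefficients (Eq. 4.4): larger => better compaction."""
    n = len(coeffs)
    sq = [c * c for c in coeffs]
    arithmetic = sum(sq) / n
    geometric = math.prod(sq) ** (1.0 / n)
    return arithmetic / geometric
```

For equal-magnitude coefficients the two means coincide and Ξ = 1; any concentration of energy into fewer coefficients drives Ξ above 1.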
The TTSR decomposes the input residual signals to improve the coding gain using a sparse representation followed by the DCT. We now give a more thorough explanation of the proposed sparse representation scheme with an analysis.

Consider an overcomplete dictionary D with K atoms, i.e., D = [d_1, ..., d_K]. A signal y ∈ R^N is decomposed over the column atoms with a coefficient vector b = [\alpha_1, ..., \alpha_K]^T using the MP. The MP is an iterative decomposition process, where the signal remaining after each step is used as the input to the next iteration. Beginning from the initial residual r^{(0)} = y, the MP builds the sparse approximation stepwise. In stage k, the MP finds the atom d ∈ R^N that is best correlated with r^{(k-1)}, and its scalar coefficient \alpha = <r^{(k-1)}, d>, where <·,·> is the inner product. The current signal is then successively approximated with d and \alpha, mathematically given as

r^{(k-1)} = \alpha d + r^{(k)},  (4.5)

where r^{(k)} is the residual after iteration k, and d is the column vector whose absolute inner product is maximum. After all stages, given the sparsity constraint, one can approximate y using only C non-zero atoms. We denote by I_C = (i_1, ..., i_C) the ordered sequence of indices of the non-zero elements as they appear in the process, i.e., |\alpha_{i_k}| > |\alpha_{i_j}| if k < j; d_{i_k} is the column vector corresponding to index i_k. Then, from (4.3), y is approximated as

y = \sum_{k=1}^{C} \alpha_{i_k} d_{i_k} + r^{(C)}.  (4.6)

From (4.6), we adopt a sparse representation technique using multiple atoms. The main idea is to provide more flexibility in the signal representation so that the residual signal can be further coded, which is described by

y = \sum_{k=1}^{C'} \alpha_{i_k} d_{i_k} + \sum_{j=1}^{N} \beta_j t_j,  (4.7)

where t_j is one of the N orthogonal bases in a complete set T, such as the DCT or a wavelet basis. It is noted that C' need not be equal to C, i.e., C' ≤ C, so y is not completely reconstructed with D, and the remaining terms are successively represented using T.
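A minimal sketch of the two-layer decomposition in (4.5)-(4.7): a few MP steps over a dictionary produce the first layer, and a complete orthonormal DCT of the remainder produces the second. With no quantization, the reconstruction of (4.7) is exact. The function names and the NumPy-based DCT construction are illustrative assumptions, not the HM implementation.

```python
import numpy as np

def dct_matrix(n):
    """Rows are the orthonormal DCT-II basis vectors t_j of Eq. (4.7)."""
    i = np.arange(n)
    T = np.cos(np.pi * np.outer(np.arange(n), 2 * i + 1) / (2 * n))
    T[0] *= np.sqrt(1.0 / n)
    T[1:] *= np.sqrt(2.0 / n)
    return T

def two_layer_decompose(y, D, c_prime):
    """First layer: c_prime MP steps over dictionary D (Eqs. 4.5/4.6).
    Second layer: DCT coefficients beta_j of the remainder (Eq. 4.7)."""
    r = y.astype(float).copy()
    alphas = []                         # list of (atom index, coefficient)
    for _ in range(c_prime):
        corr = D.T @ r
        k = int(np.argmax(np.abs(corr)))
        alphas.append((k, corr[k]))
        r -= corr[k] * D[:, k]
    T = dct_matrix(len(y))
    beta = T @ r                        # transform of the stationary part
    return alphas, beta, T

def two_layer_reconstruct(alphas, beta, D, T):
    """y = sum_k alpha_{i_k} d_{i_k} + sum_j beta_j t_j (Eq. 4.7)."""
    y = sum(a * D[:, k] for k, a in alphas)
    return y + T.T @ beta
```

Because T is complete, the DCT layer absorbs whatever the C' atoms leave behind; quantizing the two coefficient sets is what makes the scheme lossy in practice.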
In (4.7), we have defined two respective sums of scalar multiples of d and t for the two-layered transform. The block diagram of the proposed coding system using multiple representations is shown in Fig. 4.7. Input residuals are decomposed into y_F and y_S by the encoder shown in Fig. 4.7 (a). For T1, the dictionary-based transform is applied to construct y_F. As a result of this transform, side information about the selected atoms, i.e., I_{C'}, and the quantized transform coefficients, denoted as Q(\alpha_{I_C}), need to be transmitted. In Fig. 4.7 (a), y is coarsely reconstructed as y_F using this information and quantized to y'_F. The residual signal subtracted from the source is successively coded with the DCT (T2) and quantized, providing the coefficients Q(\beta). All the information related to the proposed algorithm, i.e., I_C, Q(\alpha), Q(\beta), and an overhead flag, is fed into the entropy coder, whose design will be explained later. On the decoder side, the reconstruction process is similar and can be processed in parallel to reduce latency, as shown in Fig. 4.7 (b). The transmitted bit stream is parsed for I_C, Q(\alpha), and Q(\beta) to reconstruct y'_F and y'_S. Finally, y' is formed as the sum of the two reconstructed signals.

Figure 4.7: The proposed coding scheme in (a) the encoder and (b) the decoder.

The key idea of the proposed scheme is that the sparse representation behaves as a lossy transform, so that a signal is partially reconstructed and the remainder is fed into the DCT; one transform does not have to complete the whole reconstruction process. A control parameter shown in Fig. 4.7 (a) decides how coarsely the encoder approximates y_F, which is accomplished through a controlled sparsity. That is, the encoder chooses the number of atoms used in the sparse representation so as to improve the coding gain. On one hand, a smaller number of atoms yields a coarser reconstruction, while a sparser representation can cost fewer bits.
On the other hand, if a TU retains a significant structured pattern, more atoms are used to remove the higher distortion. The optimization process will be explained in the next section.

4.3.4 Entropy Coder Design

Generally speaking, the HEVC entropy coding is designed on the assumption that the DCT coefficients follow a Laplacian distribution. The transform coefficients are collected into a 1-D array using a diagonal scan and coded with CABAC. The procedure is briefly reviewed next.

The CABAC encodes i) the last position of the non-zero coefficients, ii) a significance map indicating the positions of the non-zero coefficients, and iii) the quantized level values. In the last position coding, the x-y coordinate of the position within a TU is coded directly. The significance map coding is performed in multiple stages: a 4×4 sub-block is scanned first, and if the sub-block includes at least one non-zero entry, the non-zeros are further scanned within the sub-block. For the level coding, greater-than-one and greater-than-two flags are used to efficiently encode the level values together with their signs.

The same CABAC as in HEVC is used to code the transform coefficients produced by the 2-D DCT. In this paper, we propose a modified entropy coding for the sparse representation to encode the indices of the selected atoms I_C and the quantized MP coefficients Q(\alpha_{I_C}). The modified design is explained in the sequel.

The statistical properties used for the design are investigated in Fig. 4.8. For the index coding, we arrive at a fixed length code (FLC) as the result of Huffman coding. In a dictionary, the atoms are not orthogonal, and their mutual incoherence is sufficiently large for sparse coding capability, so the probability distribution of the atoms is presumed uniform in the signal space. In Fig. 4.8 (a), we gathered statistics of the selected atoms to justify this assumption.
As shown, the distribution approximates a uniform distribution, and the FLC scheme fits it well. The binarization of the indices depends on the TU size and the dictionary training. For a 2× overcomplete dictionary of size 2N, the binarization length is log2(N) + 1 bits. For the level coding, we performed a chi-square test to approximate the curve in Fig. 4.8 (b), and it turned out that the Laplacian distribution is a good fit. Therefore, a progressive binarization, i.e., a truncated unary code combined with an Exponential-Golomb code, is employed to code the levels, which are then fed to the CABAC.

Figure 4.8: Probability density functions of symbols for (a) dictionary indices and (b) coefficient level values.

4.3.5 Rate-Distortion Analysis of TTSR Scheme

We perform a Rate-Distortion (R-D) analysis of the sparse representation and the DCT. The goal is to find the combination of the two transforms that provides the best R-D performance in the proposed algorithm. For the analysis, we apply a ρ-domain analysis [47], which is well established for estimating bit-rates from the fraction of non-zero quantized coefficients [54], because the rate model of the proposed sparse representation can be easily formed with this modeling technique.

A Laplacian distribution is widely accepted for the probability distribution of transform coefficients, i.e.,

p(x) = \frac{1}{2\Lambda} e^{-\frac{|x|}{\Lambda}},  (4.8)

where Λ > 0 is a scale parameter. The variance of the Laplacian distribution is 2Λ², and the shape of the curve becomes flatter as Λ grows larger, which in turn increases the number of non-zero quantized values. We assume that the transform coefficients in HEVC also follow the Laplacian distribution, as in the previous video coding standards.
Considering the proposed entropy coding scheme for the sparse representation, the bit-rate increases linearly with the number of coefficients, because the FLC-coded index generates the major portion of bits per non-zero coefficient. We define ρ as the ratio of the number of non-zero coefficients to the sample size; the bit-rate estimator R_F(ρ) of the sparse representation can then be approximated by the linear model below, which is justified by extensive experiments:

R_F(\rho) = a \rho + b,  (4.9)

where a and b are model parameters, and ρ ∈ (0,1).

For the rate model of the DCT, the probability of a non-zero DCT coefficient is given by

\rho = P(x \neq 0 \mid q, \Lambda) = 1 - \int_{-0.5q}^{0.5q} \frac{1}{2\Lambda} e^{-\frac{|x|}{\Lambda}} dx = e^{-\frac{q}{2\Lambda}},  (4.10)

where q is the quantization step size and a uniform quantizer is assumed. Thus, from (4.10), there exists a one-to-one mapping between q and ρ. Considering q as a function of ρ, we have

q(\rho) = -2\Lambda \ln \rho.  (4.11)

The rate-quantization model has been intensively studied, and a quadratic model [40] is widely used for estimation with a uniform quantizer. The model is formulated as

R_S(q) = c_0 + c_1 q^{-1} + c_2 q^{-2},  (4.12)

where c_0, c_1, and c_2 are model parameters. Combining (4.11) and (4.12), we have

R_S(\rho) = c_0 + \frac{c_1}{-2\Lambda \ln \rho} + \frac{c_2}{(-2\Lambda \ln \rho)^2}.  (4.13)

Considering a source x with scale Λ and a uniform quantizer with step size q, the distortion is given by definition as

D(q) = \sum_{i=-\infty}^{\infty} \int_{(i-0.5)q}^{(i+0.5)q} |x - iq| \, p(x) \, dx
     = 2\int_{0}^{0.5q} x \, p(x) \, dx + 2\sum_{i=1}^{\infty} \int_{(i-0.5)q}^{(i+0.5)q} |x - iq| \, p(x) \, dx
     = \Lambda\left(1 - e^{-\frac{q}{2\Lambda}}\right) + \frac{\Lambda e^{-\frac{q}{\Lambda}}}{1 - e^{-\frac{q}{\Lambda}}}\left(2 - e^{\frac{q}{2\Lambda}} - e^{-\frac{q}{2\Lambda}}\right),  (4.14)

which allows the distortion to be estimated with respect to ρ by using (4.11). Fig.
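The ρ-domain relations (4.10) and (4.11) and the closed-form distortion (4.14) translate directly into code. A minimal sketch, where `lam` stands for the Laplacian scale parameter Λ and the function names are illustrative:

```python
import math

def rho_from_q(q, lam):
    """Fraction of non-zero coefficients for a Laplacian source with
    scale lam under a uniform quantizer of step q (Eq. 4.10)."""
    return math.exp(-q / (2.0 * lam))

def q_from_rho(rho, lam):
    """Inverse mapping of Eq. (4.11): q = -2*lam*ln(rho)."""
    return -2.0 * lam * math.log(rho)

def distortion(q, lam):
    """Closed-form quantization distortion of Eq. (4.14)."""
    a = math.exp(-q / (2.0 * lam))      # e^{-q/(2*lam)}
    b = math.exp(-q / lam)              # e^{-q/lam}
    return lam * (1.0 - a) + lam * b / (1.0 - b) * (2.0 - 1.0 / a - a)
```

The round trip `q_from_rho(rho_from_q(q, lam), lam)` recovers q, which is the one-to-one mapping the analysis relies on.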
4.9 (a)-(d) shows the estimated rate curves, fitted to the derived rate models in (4.9) and (4.13) for the sparse representation and the DCT, respectively, when a sample TU has a different variance.

Figure 4.9: A ρ-domain rate model of the DCT and the sparse representation in a 4×4 TU for different σ, i.e., (a) σ < 4, (b) 4 ≤ σ < 8, (c) 8 ≤ σ < 12, and (d) 12 ≤ σ.

The model parameters are summarized in Table 4.2 together with their goodness-of-fit tests. As shown in Fig. 4.9, the rate model for the sparse representation increases faster than that for the DCT. This phenomenon results from the coding bits for I_C, which increase linearly with the number of non-zero coefficients. In comparison, the DCT coefficients can be gathered more efficiently using a scanning pattern, so the increments are smaller.

Table 4.2: Model parameters of the rate models (4.9) and (4.12), grouped by the standard deviation σ of the TU samples.

σ range     | a      | b      | Goodness Test | c_0  | c_1   | c_2   | Goodness Test
σ < 4       | 80.03  | 1.04   | 0.9942        | 0.78 | 42.03 | 27.41 | 0.9985
4 ≤ σ < 8   | 91.29  | 0.8464 | 0.9939        | 0.34 | 49.85 | 25.07 | 0.9984
8 ≤ σ < 12  | 107.30 | 2.08   | 0.9997        | 3.59 | 54.46 | 77.07 | 0.9991
12 ≤ σ      | 103.80 | 2.975  | 0.9996        | 4.21 | 60.78 | 63.48 | 0.9996

In spite of this expense in coding bits, however, the sparse representation provides a significantly better energy compaction ratio, as shown in Fig. 4.6, and the trained atoms are particularly efficient at representing the structured residues that frequently occur after prediction. The DCT is known to yield a poor representation for such impulse-like or directional signals because of its sinusoidal basis functions. On top of that, the bits for the sparse representation are comparable to those for the DCT in the first few coefficients when a sample TU has a higher variance, as shown in Fig. 4.9 (d).

Motivated by the analysis, we aim to find the best combination of the two representations and to justify the use of multiple representations.
That is, an optimization problem is formulated from (4.7), where the sparse representation builds a coarser version of a signal using at most C atoms, and the remaining signal is coded with the DCT so as to minimize the distortion D subject to R ≤ R_B. A classical solution to this constrained problem is the Lagrangian optimization technique. The objective function augments the distortion with a weighted sum of the constraints through the Lagrange multiplier λ, converting the problem into the unconstrained form

J(i, \lambda) = \lambda \times D(i) + R(i), \quad i = 0, ..., C,  (4.15)

where i denotes the number of atoms used in the sparse representation, and D(i) and R(i) are the estimated distortion and bit-rate.

As presented in (4.7), the signal can be represented in multiple ways as different combinations of the two representation sets. Fig. 4.10 shows the decomposition steps of a signal using atoms in a trained dictionary and the DCT as an example, where d_i denotes the atom having the ith highest inner product, and r_i is the remainder in the ith step. Note that the DCT is performed on the last remainder.

Figure 4.10: The signal decomposition using the dictionary and the DCT.

Our goal is to minimize the cost with the optimal combination of the sparse representation and the DCT. As we will see, a different i can be chosen for each signal. The cost of each choice is plotted in Fig. 4.11 as a function of λ in (4.15). It is seen that a large value of λ yields a signal representation coded with more atoms. Conversely, a small value of λ gives more weight to R(i), so the DCT processes the dominant portion of the signal.

However, in a practical coder, the problem in (4.15) is intractable because we do not have an accurate model for λ to apply on the fly; the computational complexity of searching for a proper Lagrangian constant is also demanding. Thus, after extensive experiments, we made the following two observations about the optimal sparsity found by exhaustive search. They are:

1.
the selected number of atoms is rarely greater than \frac{\sqrt{N}}{4}, where N is the signal dimension of a TU;

2. sometimes the DCT alone gives the best R-D performance, and the sparse representation can be bypassed, particularly for a signal with a lower variance.

Considering these two observations, the proposed algorithm incorporates flags representing modes that can be selected during the R-D optimization process in the encoder. The flags are presented in Table 4.3 and explained as follows.

For mode 0, the sparse coding is skipped and the residuals go through only the 2-D DCT, as in HEVC. For the modes greater than 0, the number of atoms is implicitly determined by the flags. For example, for mode 3 in an 8×8 TU, the decoder knows there will be information on two atoms; otherwise, only one atom is considered on the decoder side. It is noted that the flags play the role of a coded block flag (CBF) for the DCT coefficients, while the original CBF marks non-zero entries for the two transforms combined. That is, for mode 2, the decoder knows there is information only for the sparse representation. All the flags are effective only if the original CBF in HEVC is turned on. Fig. 4.12 shows the syntax modified for the proposed algorithm. The overhead decides whether or not to enable the sparse representation and the DCT; if one of them is not used, the related syntax is skipped.

Figure 4.11: The Lagrangian cost of the sparse representation with different λ.

In HEVC, a so-called residual quad tree (RQT) [64], which recursively divides a CU into TUs, is employed for the transform. In the original RQT, the 2-D DCT is performed in each leaf node, as shown in Fig. 4.13. In the proposed algorithm, the R-D costs of the

Figure 4.12: The proposed syntax of the TTSR transform coefficients.

flags representing modes are evaluated in the leaf nodes of the RQT.
We choose the best mode in Table 4.3 to provide the minimum cost, given as

J_{HM} = D + \lambda_{HM} \times (R_O + R_F + R_S),  (4.16)

where R_O is the bit-rate for the overhead, R_F is the bit-rate for the sparse representation, R_S is the bit-rate for the DCT coding, if any, and λ_{HM} is the constant defined as a function of the QP in the HM software. D is the distortion. The minimization in (4.16) is more computationally tractable than the problem in (4.15), though the solution can be sub-optimal.

Table 4.3: Transform modes with the TTSR and their codewords to signal.

TU size | Mode | SR      | DCT     | Codeword | No. of atoms
4×4     | 0    | By-pass | Use     | 0        | 0
        | 1    | Use     | Use     | 10       | 1
        | 2    | Use     | By-pass | 11       | 1
8×8     | 0    | By-pass | Use     | 0        | 0
        | 1    | Use     | Use     | 10       | 1
        | 2    | Use     | By-pass | 110      | 1
        | 3    | Use     | By-pass | 111      | 2

4.4 Proposed SRS Scheme

In this section, we examine the problem of effective residual image coding after inter prediction. Such residual images contain oriented edge components along object boundaries caused by motion compensated prediction errors. Since the orientation of

Figure 4.13: Residual quad-tree (RQT) structure of transform units, applying the flags to the leaf nodes.

residual images is often neither horizontal nor vertical, the efficiency of the traditional 2-D DCT can degrade significantly. From this point of view, we develop a slant residual shift (SRS) technique that aligns dominant inter prediction residuals with the horizontal or vertical direction via row-wise or column-wise circular shifts, respectively, before the DCT. The objective is to improve the coding gain of the conventional 2-D block DCT. To determine the proper shift amount, we classify blocks into several types, each of which is assigned an index number. These indices are sent to the decoder as a signaling flag, which can be viewed as the mode information of the SRS technique. The proper shift amount of a block type can be learned through an off-line process.
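The mode decision of (4.16) reduces to taking the minimum Lagrangian cost over the candidate modes of Table 4.3. A minimal sketch, where the per-mode distortion and rate statistics are hypothetical inputs supplied by the caller:

```python
def best_mode(modes, lam_hm):
    """Pick the TTSR mode minimizing J = D + lam_hm * (R_O + R_F + R_S)
    (Eq. 4.16). `modes` maps a mode id to a tuple (D, R_O, R_F, R_S)."""
    def cost(stats):
        d, r_o, r_f, r_s = stats
        return d + lam_hm * (r_o + r_f + r_s)
    return min(modes, key=lambda m: cost(modes[m]))
```

A larger λ_HM (low-QP operating point aside, λ_HM grows with QP in HM) penalizes rate more heavily, steering the decision toward cheaper modes.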
4.4.1 Residual Approximation and Clustering

In the proposed SRS scheme, we use a clustering technique to determine a set of representative residual patterns after inter prediction and call them "code-patterns," which correspond to codewords in the context of vector quantization (VQ). Here, we again adopt the K-SVD [36] algorithm to classify blocks into code-patterns, together with the same classifier developed in Fig. 4.5. We show samples of trained prediction residual signals for blocks of size 8×8 in Fig. 4.3, obtained by the K-SVD algorithm. The SRS technique is motivated by the anisotropic patterns observed in the residual signals trained via this unsupervised learning technique. That is, since the 2-D DCT consists of basis functions that are only horizontally or vertically oriented, it is desirable to realign the residual signals so as to achieve a better coding gain.

4.4.2 Side Information Reduction via Context Rule

It is expensive to encode the indices of these code-patterns directly. In this subsection, we propose a new method that exploits the context information of a target block and its neighbor blocks to reduce the number of bits required for the side information. To characterize the orientation of the dominant residuals of a code-pattern, we employ a simple first-order edge detection filter to estimate the gradient components. That is, we first convolve the residual signal with the "gradient of Gaussian" filter horizontally and vertically to produce the horizontal and vertical gradients, denoted by S_x and S_y, respectively. They are then combined to yield the edge magnitude \eta = \sqrt{S_x^2 + S_y^2} and the edge orientation \theta = \arctan(S_y / S_x) of the target image block. We observe the following two properties in most residual blocks of inter predicted frames. First, there is at most one single edge in each block. Second, the gradient components of neighboring blocks are connected in stripe form, and the gradient patterns of adjacent blocks are similar.
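A minimal sketch of the orientation estimate described above: plain first-order differences stand in for the gradient-of-Gaussian filter, and summing the gradients over the block is a simplifying assumption for illustration.

```python
import numpy as np

def block_orientation(block):
    """Estimate edge magnitude and orientation of a residual block with
    first-order differences (a stand-in for the gradient-of-Gaussian)."""
    b = block.astype(float)
    sx = np.zeros_like(b)
    sy = np.zeros_like(b)
    sx[:, 1:] = b[:, 1:] - b[:, :-1]    # horizontal gradient S_x
    sy[1:, :] = b[1:, :] - b[:-1, :]    # vertical gradient S_y
    Sx, Sy = sx.sum(), sy.sum()         # aggregate over the block
    magnitude = np.hypot(Sx, Sy)        # eta = sqrt(Sx^2 + Sy^2)
    theta = np.arctan2(Sy, Sx)          # edge orientation
    return magnitude, theta
```

As the text notes, the arctan can be skipped in practice: the sign and relative magnitude of S_x and S_y already place θ in the quadrant ranges used by the context rules.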
Based on these two properties, we define 12 contexts of a target block X with its four neighbor blocks, denoted by A, B, C, and D, respectively, as shown in Fig. 4.14. We index each case with a context number N = 1, ..., 12 in Table 4.4, where θ_A, θ_B, θ_C, and θ_D denote the edge orientations of the associated neighbors. Context 0 is reserved for the case where the neighbors satisfy none of the conditions or η is less than a threshold. In practice, the arctan operation can be skipped for complexity reduction, since we are only concerned with the directional information of the stripe. It is worth mentioning that the contexts are processed on 4×4 and 8×8 blocks, regardless of the depths of the coding units and transform units.

Figure 4.14: The relationship between a target block denoted by X and its four neighbor blocks denoted by A, B, C, and D, respectively, with 12 contexts.

4.4.3 Coding Procedure of SRS

A prediction residual signal is categorized into one of the classes in Fig. 4.14 based on its gradient components. In each class, the block is mapped to a code-pattern in a code book for the SRS, whose index is transmitted to the decoder; the Euclidean distance is used as the matching measure. The SRS technique modifies the directional pattern of the residual signal using the code-pattern. The processed residual data is then transformed and entropy coded. When mode 0 is selected, the SRS is skipped and the 2-D DCT is performed directly.

Table 4.4: Context definition of a target block using the edge orientation information of its neighbor blocks.

N  | Neighbor condition
1  | θ_B ≥ π/4, θ_D ≥ π/4
2  | θ_B ≥ π/4
3  | θ_C ≥ π/4
4  | 0 ≤ θ_D < π/4
5  | 0 ≤ θ_C < π/4, 0 ≤ θ_D < π/4
6  | 0 ≤ θ_B < π/4, 0 ≤ θ_D < π/4
7  | −π/4 ≤ θ_D < 0
8  | −π/4 ≤ θ_A < 0, −π/4 ≤ θ_D < 0
9  | −π/4 ≤ θ_A < 0, −π/4 ≤ θ_B < 0
10 | θ_A < −π/4, θ_D < −π/4
11 | θ_A < −π/4, θ_B < −π/4
12 | θ_B < −π/4

How the pattern-modification process works and how side information is reduced via the SRS is explained next. Pixels are displaced with a slope obtained from the selected code-pattern. For example, Fig.
4.15 (a) and (b) show code-patterns with context 8 and context 2, respectively, and their different pixel shifts for the SRS. It would be too expensive to code all the displacement information directly. Instead, the encoder transmits an index of the code-pattern to save bits, while the code-pattern itself embeds the pixel shift information needed for decoding. As shown in Fig. 4.15 (a) and (b), the decoder can reconstruct the original pattern from the received index by moving pixels along the arrows in the figure. Each code book size depends on the transform block size TB. We set the code book size to TB/2, so log2(TB) − 1 bits are transmitted to indicate the index of the code-pattern.

Figure 4.15: Two examples of pixel shift in the SRS.

Table 4.5: Transform modes with the SRS and their codewords to signal.

Mode | SRS | Trans. type | Description
0    | 0   | -           | 2-D DCT without SRS
1    | 1   | 0           | 2-D DCT with SRS
2    | 1   | 1           | A 1-D DCT with SRS

After the SRS, the conventional 2-D DCT is performed along the horizontal and vertical directions. However, even though a block is processed with the SRS, pixels aligned to a line can still appear as high frequency components along one direction, since a line pattern over a 2-D block can be an impulsive event for the other 1-D direction. In the video coding standards, two 1-D DCTs are performed separately on a block. Motivated by this observation, the two 1-D DCTs can be applied selectively by skipping one of them [29]. However, the resulting coefficient distribution differs significantly from that of the 2-D DCT. For example, a row-wise 1-D DCT places the transform coefficients at the left side of the block. As a result, a column-wise scanning order is employed instead of a zig-zag scan. The encoder can choose the best performing mode via R-D optimization between the two modes, i.e., the 2-D transform or the 1-D transform after the SRS. The proposed SRS method is integrated into the RQT of HEVC. In each leaf node of an RQT, the overhead signal decides which mode will be used.
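The row-wise circular shift used by the SRS realignment can be sketched as follows. The per-row shift amounts, which in the actual scheme are embedded in the selected code-pattern, are passed in explicitly here as an illustrative assumption; negating them undoes the operation at the decoder.

```python
import numpy as np

def slant_residual_shift(block, shifts):
    """Row-wise circular shift: row r is rotated by shifts[r] samples so a
    slanted residual stripe lines up vertically before the 2-D DCT."""
    out = np.empty_like(block)
    for r, s in enumerate(shifts):
        out[r] = np.roll(block[r], s)
    return out
```

After such a shift, a 45-degree stripe becomes a single vertical column, which the horizontally/vertically oriented DCT basis functions represent far more compactly.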
The modes are summarized with their codewords in Table 4.5. Note that the direction of the 1-D DCT does not need to be explicitly signaled; it can be inferred from the context information. In the R-D optimization, the best mode is chosen by minimizing the Lagrangian cost

J_i(k \mid C = c) = D_i(k \mid C = c) + \lambda \times R_i(k \mid C = c),  (4.17)

where k is the index of a code-pattern in the given context c ∈ C, D_i(·) is the distortion estimated for a given mode i and the applied modification of the residual pattern, and R_i(·) is the estimated number of bits to encode the transform coefficients and related side information. Finally, the best mode i and parameter k are determined.

4.5 Experimental Results for TTSR

Table 4.6: The BD-rate [46] (in the unit of %) of the proposed algorithm with HM 5.0 as the benchmark, where a negative value represents a decrease compared with the benchmark. The test sequences are classes B, C, and D, which are natural videos.

Types           | Sequences       | RA HE   | RA LC   | LDB HE  | LDB LC
Class B (1080p) | Kimono          | −0.0%   | −0.1%   | −0.1%   | −0.2%
                | ParkScene       | −0.1%   | −0.2%   | −0.2%   | −0.4%
                | Cactus          | −0.2%   | −0.4%   | −0.3%   | −0.5%
                | BasketballDrive | −0.4%   | −0.7%   | −0.6%   | −1.3%
                | BQTerrace       | −0.5%   | −0.7%   | −0.7%   | −1.3%
                | Average (B)     | −0.24%  | −0.42%  | −0.38%  | −0.74%
Class C (WVGA)  | BasketballDrill | −0.5%   | −1.0%   | −0.9%   | −1.6%
                | BQMall          | −0.7%   | −1.1%   | −1.3%   | −2.4%
                | PartyScene      | −0.5%   | −0.9%   | −0.9%   | −1.9%
                | RaceHorses      | −0.2%   | −0.3%   | −0.3%   | −0.8%
                | Average (C)     | −0.48%  | −0.83%  | −0.85%  | −1.70%
Class D (WQVGA) | BasketballPass  | −0.7%   | −1.5%   | −0.9%   | −1.7%
                | BQSquare        | −1.1%   | −1.9%   | −1.4%   | −2.7%
                | BlowingBubbles  | −0.9%   | −1.3%   | −1.1%   | −2.2%
                | RaceHorses      | −0.2%   | −0.4%   | −0.3%   | −0.7%
                | Average (D)     | −0.73%  | −1.28%  | −0.93%  | −1.83%
                | Total Average   | −0.46%  | −0.81%  | −0.69%  | −1.37%
                | Encoding Time   | 203%    | 191%    | 236%    | 221%

Table 4.7: The BD-rate of the proposed algorithm with HM 5.0 as the benchmark. The test sequences belong to class F, called screen content.
Types   | Sequences           | RA HE   | RA LC   | LDB HE  | LDB LC
Class F | BasketballDrillText | −0.2%   | −0.5%   | −0.7%   | −1.3%
        | ChinaSpeed          | −2.1%   | −3.0%   | −5.1%   | −8.1%
        | SlideEditing        | −0.3%   | −0.7%   | −1.2%   | −1.9%
        | SlideShow           | −1.0%   | −1.9%   | −3.9%   | −6.5%
        | Average             | −0.90%  | −1.53%  | −2.73%  | −4.45%

In this section, we conduct experiments on a set of YUV videos to demonstrate the improved R-D performance of the proposed algorithm. The proposed technique is implemented on the HM 5.0 reference software, and we compare the simulation results against the same software and configuration. The coding parameters and tools are determined by the "Low Complexity (LC)" and "High Efficiency (HE)" configurations and the "Low Delay B (LDB)" and "Random Access (RA)" tests, as defined in the common test conditions and HEVC software reference configurations [10]. The QP values are set to 22, 27, 32, and 37 for the evaluations.

For the screen content sequences categorized as class F in Table 4.7, since their characteristics differ from those of natural videos, we applied a different dictionary set trained on similar screen content; "Web browsing," "Doc. editing," and "Game playing" were used as the training sequences. The maximum RQT depth is set to 3, and the proposed algorithm is applied to inter coded 4×4 and 8×8 TUs. The dictionaries are 2× overcomplete, which turns out to be a better choice than larger dictionaries of 4× overcompleteness or more.

The proposed algorithm improves the R-D performance over HM 5.0, as shown in Table 4.6 for natural video sequences. We tested the sequences of classes B, C, and D, which are commonly used for the "Random Access" and "Low Delay B" tests in the common conditions. The BD-rate [46] is employed to measure the average bit-rate reduction. The BD-rate saving is about −0.46% and −0.81% on average for the "Random Access" HE and LC cases, and about −0.69% and −1.37% for the "Low Delay B" HE and LC cases.
With respect to sequence types, the proposed algorithm yields better coding performance on lower resolution videos such as WVGA and WQVGA, where the smaller TU sizes such as 4×4 and 8×8 are selected more often. For example, for the "BQSquare" sequence, the BD-rate gain of −2.7% is significantly greater than those of the higher resolution videos. Fig. 4.16 (a) shows the regions selected for the proposed method, represented in orange, while the gray regions represent the DCT-only coding mode. The motion compensated prediction residues are also shown in Fig. 4.16 (b). The regions coded with the proposed algorithm contain more oblique patterns, and the sparse representation is adaptively used for these regions.

Figure 4.16: (a) The regions where sparse representation modes are selected, shown in orange, and (b) prediction residuals after motion estimation. The residual signals are scaled by 4 for ease of visual perception. The captured frame is from the "BlowingBubbles" sequence.

We also apply the proposed algorithm to screen content using the different dictionaries. The proposed algorithm provides coding performance superior to the anchor, as shown in Table 4.7 for class F. For the "Random Access" case, the BD-rate saving is −0.90% and −1.53% for HE and LC, respectively; for the "Low Delay B" case, it is −2.73% and −4.45% for the HE and LC configurations, respectively. The remarkable coding gain for class F results from the fact that the adaptive dictionary is tailored to the screen content. For comparison, the dictionaries trained for 8×8 TUs from

Figure 4.17: The trained dictionary sets from the natural videos (a) with the regularization scheme in Fig. 4.5 and (b) without the scheme, and (c) from the screen contents. The contrast of the atoms is adjusted for ease of visual perception.

natural videos and from screen content are shown in Fig. 4.17.
As to the characteristics of the residues, the class F sequences contain more sharp edges, such as those in characters or graphical elements. As shown in the dictionaries, Fig. 4.17 (c) even includes an impulse-like signal. In contrast, a dictionary trained on natural videos contains several diagonal components, but their variations are relatively gradual in Fig. 4.17 (a) and (b). Moreover, as clearly observed, the result in Fig. 4.17 (a) contains more salient features than that in Fig. 4.17 (b), owing to the proposed regularization scheme in Fig. 4.5.

Since the computational complexity of the matching pursuit is demanding, i.e., O(N^3.5), as reflected in the encoding times of Table 4.6, we provide a trade-off between complexity and coding performance by disabling the sparse coding in the 8×8 TU. As shown in Table 4.8, the computational complexity decreases significantly, i.e., from 212% to 140% on average.

Table 4.8: The BD-rate of the proposed algorithm, where the sparse representation is applied to only the 4×4 TU.

Types          RA HE     RA LC     LDB HE    LDB LC
Class B        −0.11%    −0.25%    −0.21%    −0.38%
Class C        −0.26%    −0.36%    −0.35%    −0.73%
Class D        −0.36%    −0.55%    −0.42%    −0.90%
Class F        −0.42%    −0.67%    −0.96%    −1.49%
Total Average  −0.29%    −0.46%    −0.64%    −0.88%
Encoding Time  132%      127%      156%      145%

Because the dictionary contains patterns better adapted to directional components and textured patches, the representation produces fewer undesired artifacts than the DCT. The subjective quality is compared in Fig. 4.18 and Fig. 4.19, using the "SlideShow" and "ChinaSpeed" test sequences, respectively.

Figure 4.18: (a) Motion prediction residual signals in the "SlideShow" sequence, and visual comparison of reconstructed images with (b) TTSR and (c) DCT, where the PSNR values are 37.42 dB and 37.35 dB, respectively. The contrast of the prediction residue is adjusted for ease of visual perception.
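For reference, the greedy atom selection whose cost dominates the encoding time can be sketched as plain matching pursuit over a learned dictionary. This is a simplified stand-alone sketch, not the HM-integrated implementation; the atoms (dictionary columns) are assumed to have unit norm:

```python
import numpy as np

def matching_pursuit(residual, dictionary, n_atoms, tol=1e-6):
    """Greedily approximate `residual` with up to `n_atoms` columns (atoms)
    of `dictionary`.  Returns the chosen atom indices, their coefficients,
    and the remainder that the cascaded 2-D DCT would then transform."""
    r = np.asarray(residual, dtype=float).copy()
    indices, coeffs = [], []
    for _ in range(n_atoms):
        corr = dictionary.T @ r              # correlation with every atom
        k = int(np.argmax(np.abs(corr)))     # best-matching atom
        if abs(corr[k]) < tol:               # nothing significant left
            break
        indices.append(k)
        coeffs.append(float(corr[k]))
        r = r - corr[k] * dictionary[:, k]   # peel the chosen atom off
    return indices, coeffs, r
```

Each iteration correlates the current remainder with every atom, which is what makes the search expensive and motivates restricting it to the 4×4 TU.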
As shown, the proposed algorithm provides significantly enhanced subjective quality, especially in the boundary regions of characters and at edges, while the objective measurement, i.e., the PSNR, is almost the same. The ringing artifact is clearly reduced.

4.6 Experimental Results for SRS

In this section, we conduct experiments to show that the proposed SRS technique improves the coding performance. The technique is implemented on HM 4.0. "RaceHorses" of WQVGA (416×240), "BasketballPass" of WVGA (832×480), "ParkScene" of HD 1080p (1920×1080), and "SlideShow" of HD 720p (1280×720) are used for training the residual signals, which produces the code books. The coding parameters and tools are determined by the "Low delay, high efficiency, P slices only" configuration as defined in the common test conditions and HEVC software reference configurations. The GOP structure is IPPP, where only the first frame is coded as an I frame, and a GOP size of 4 is used for QP assignment. The QP values for an I frame (QP_I) are set to 22, 27, 32, and 37 during the coding performance evaluation, and the QP values of the four P frames following the I frame are set to QP_I+3, QP_I+2, QP_I+3, and QP_I+1, respectively. In implementation, the SRS method is applied only to the P frames having QP_I+1, because we found that the other P frames choose fewer SRS modes, and it is better to turn off the mode considering the trade-off between coding performance and computational complexity.

Table 4.9: The BD-rate [46] (in the unit of %) of the proposed algorithm with HM 4.0 as the benchmark, where a negative value represents a decrement compared with the benchmark.

Sequence (Resolution)        BD-rate saving
BlowingBubbles (WQVGA)       −1.72%
BQSquare                     −3.73%
BQMall (WVGA)                −0.73%
PartyScene                   −1.31%
ChinaSpeed (HD 720p)         −8.26%
SlideEditing                 −7.72%
BasketballDrive (HD 1080p)   −0.77%
BQTerrace                    −1.08%
Average                      −3.16%
We create an additional flag in a slice header to indicate the use of the SRS, so all overheads are included in the bit-stream. The method is not applied to large TU sizes such as 16×16 and 32×32.

The proposed SRS algorithm improves the R-D performance over HM 4.0, as shown in Table 4.9. The BD-rate [46] is employed to measure the average bit rate reduction. The BD-rate saving is about −3.16% on average as compared to the anchor. The effectiveness of the proposed SRS algorithm is also shown in Fig. 4.20. As shown in the figure, the proposed method provides a better coding gain at higher bit-rates. This is because small high-frequency components still remain due to the small quantization step size, so the SRS is selected more frequently. According to the reported results, the coding performance of the proposed method varies with the video content. In particular, the SRS method brings a significant coding gain in the test sequences having fast motion or fine textures, in which more oblique prediction residual signals are left.

4.7 Conclusion

We proposed an efficient video coding scheme that processes prediction residual signals so as to improve coding efficiency. The two-layered transform with sparse representation (TTSR) and slant residual shift (SRS) techniques were developed. For the TTSR, the prediction residual signals go through two transforms in cascade: with the sparse representation, high energy compaction can be achieved with few coefficients, while the bit-rate increases rapidly with the number of transmitted atoms. A dictionary can be adaptively learned and applied to the structured patterns of residual signals. For the SRS, a residual signal is aligned to a horizontally or a vertically oriented line, so that the DCT can facilitate entropy coding.
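The row alignment behind the SRS can be illustrated with a toy sketch. Here `slope` (columns per row) stands in for the side information that the VQ-based classification would provide, and cyclic shifts are used for simplicity; neither detail is HM syntax:

```python
import numpy as np

def slant_shift_rows(block, slope):
    """Cyclically shift each row of a residual block so that a slanted line
    with `slope` columns per row becomes a vertical line, which a separable
    DCT can then compact into few coefficients."""
    out = np.empty_like(block)
    for r in range(block.shape[0]):
        out[r] = np.roll(block[r], -int(round(r * slope)))
    return out
```

Applied to a 45-degree line of residual energy, the shift collapses the line into a single column.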
It was demonstrated by experimental results that the proposed algorithms outperform the High Efficiency Video Coding (HEVC) reference codec by a substantial margin.

Figure 4.19: Subjective quality comparison in "ChinaSpeed" with (a) the TTSR and (b) the DCT. The PSNR values are 38.48 dB and 38.53 dB, respectively.

Figure 4.20: The R-D performance comparison of the proposed SRS algorithm and HM 4.0 for the test sequence "ChinaSpeed".

Chapter 5

Advanced Tools for Entropy Coding in HEVC

5.1 Introduction

Context-Adaptive Variable Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC) are entropy coding methods that are widely used in image/video coding standards such as H.264/AVC. CAVLC is a low-complexity entropy coder supported in the baseline profile of H.264/AVC and applied to portable devices owing to its low computational complexity. CABAC provides a higher coding efficiency than CAVLC at the expense of demanding computational resources in hardware. In the recent activity on HEVC, one unified entropy coding scheme, which takes advantage of both entropy coders, has been actively discussed [3]. Because HEVC is expected to have more impact on HD video coding, the CABAC-based coder has drawn more attention than the CAVLC-based one.

The CAVLC in HEVC incorporates several advanced tools to improve the coding gain. It is performed in three steps: i) a VLC table is selected for a given syntax element based on its context, ii) a table index corresponding to the element is extracted, and iii) a codeword is generated based on the table index. Typically, the shortest codeword is assigned to the beginning of the VLC table. In [32], a sorting table yields an adaptive codeword; the basic idea is to move a frequent index toward the top position of the sorting table, so that a shorter codeword can be assigned. For residual coding, a two-step coding scheme with a so-called "run mode" and "level mode" is proposed in [16].
In the "run mode," a level and a run of zeros are jointly coded using run-length coding. The "run mode" coding is conducted for the high-frequency components. For the low-frequency components, the remaining coefficients are coded in the "level mode," which encodes each coefficient individually until the end of the coefficient coding. The two modes can be switched adaptively. The CAVLC is extended to larger transform sizes, e.g., 16×16 and 32×32 blocks, in [9].

The CABAC is extended to be the single entropy coder in HEVC. However, because it has a drawback that limits parallel processing, there have been several attempts to enhance its throughput. In [4], an efficient binarization of transform coefficients was proposed to address a bottleneck caused by the increasing portion of transform coefficients at high bit-rates. The method employs the binarization used in CAVLC, so that multiple bins can be coded with one codeword at a time. In [21], the throughput of the CABAC is improved by reducing the number of context-coded bins in transform coefficient coding. Most bins of the flags that signal whether the level of a coefficient is greater than one or two are coded with the bypass mode of CABAC. The context-coded bins can be significantly reduced with a negligible change in coding gain.

In this chapter, we present several advanced tools that contribute to the two entropy coders and were submitted as technical proposals to HEVC [22], [27]. They are included in the HM 4.0 software and the corresponding working draft WD-4 [35] of the standard. The rest of this chapter is organized as follows. In Sec. 5.2, we review transform coefficient coding in HEVC. A novel tableless run-length coding method for transform coefficients and a codeword adaptation based on a first-order Markov model are presented in Sec. 5.3. A parallelizable context designed to improve the throughput of the CABAC is described in Sec. 5.4. Experimental results are shown in Sec. 5.5, and concluding remarks are given in Sec. 5.6.
5.2 Transform Coefficient Coding in HEVC

After a 2-D transform, transform coefficients are ordered into a 1-D array using a scanning pattern such as the zig-zag scan. The transform coefficient coding is performed backward in the 1-D array from the highest frequency coefficient position to the DC coefficient position. For instance, Fig. 5.1 shows the process that converts a 2-D transform block into a 1-D array. The last non-zero coefficient is coded first, and the remaining coefficients are coded until there is no coefficient left in the array.

Figure 5.1: A zig-zag scan pattern after the 2-D transform, where the scanned transform coefficients are ordered in a 1-D array and the dotted lines represent consecutive zeros.

Because the statistical characteristics of transform coefficients differ with their frequency positions, the CAVLC coding in HEVC is divided into two stages, i.e., run mode coding and level mode coding. In run mode coding, run-length coding is performed because there are a number of consecutive zero coefficients and few non-zero coefficients in the high frequency positions. In level mode coding, each coefficient is coded individually until there is no more coefficient to be coded.

A run is defined as the number of consecutive zero coefficients between the previous non-zero coefficient and the current non-zero coefficient in the run-length coding, and a level is the absolute value of the coefficient. The run and level values are jointly coded as a pair denoted by <Level, Run>. The pair is encoded as follows. The indices of the current coefficient and the previously coded coefficient are denoted by k and j, respectively, and the run is j−k−1, denoted by Run(j,k), as shown in Fig. 5.1. The pair can accordingly be represented as <Level(k), Run(j,k)>, where Level(k) is the level value of the coefficient at k.
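The backward pairing described above can be sketched as follows. This is a simplified illustration of the run mode only; the signaling of the last position, coefficient signs, and the switch to level mode are omitted:

```python
def level_run_pairs(coeffs):
    """Emit <Level, Run> pairs from a 1-D array of quantized coefficients,
    coding backward from the last non-zero coefficient (whose position is
    assumed to be signaled separately).  Run(j, k) = j - k - 1 counts the
    zeros between the previously coded position j and the current non-zero
    position k."""
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    if not nonzero:
        return None, []
    last = nonzero[-1]                       # coded first
    pairs = []
    j = last
    for k in reversed(nonzero[:-1]):
        pairs.append((abs(coeffs[k]), j - k - 1))
        j = k
    return last, pairs
```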
There are multiple context-dependent VLC tables for different Level(k) values and block types. After a VLC table is determined by the context, two parameters are used to choose a code number: the previous coefficient position j, with which a decoder can determine the maximum run value, and the run value Run(j,k). Fig. 5.2 shows a VLC table including code numbers. For example, to encode the coefficient at k = 3 as shown in Fig. 5.1, the maximum run value for the previous coefficient at j = 6 is 5 and Run(6,3) is 2. The parameters 5 and 2 correspond to the row and the column of the table in Fig. 5.2, respectively. Thus, the code number 1 is chosen to represent the run-length value. The code number is coded with a codeword provided by a codeword table. Table 5.1 depicts a part of the codeword table for code numbers.

Figure 5.2: A part of the 28×29 2-D VLC table containing the code numbers used for encoding the pair. The table is for an inter-coded luma block when the level value is equal to 1. The y-axis and the x-axis correspond to Run(j,k) and the position j, respectively.

Table 5.1: A VLC codeword table.

Code number   VLC codeword
0             00
1             01
2             100
3             101
4             11000
5             11001
6             11011
7             11100
8             11101
9             11110
10            11111000
11            11111001
...           ...

5.3 Proposed Tools for CAVLC

5.3.1 Tableless Run-length Coding

The CAVLC in HEVC provides a better coding gain than that in H.264/AVC. The improved performance is attributed to its extensive adaptation in HEVC. For example, one can make statistical adaptations with the depth of a CU or with coded block types (e.g., chroma or luma blocks and intra- or inter-coded blocks). However, the CAVLC in HEVC requires much larger context tables than those in H.264/AVC as the price of this extensive adaptation. In this work, we entirely remove the context-adaptive tables used for transform coefficient coding, with a negligible change in coding gain.
The basic idea is to model the probability distribution of a symbol and to replace the VLC tables with an equation and a few adaptation parameters derived from that probability distribution model. A pair <Level(k), Run(j,k)> is assigned a code number using an equation whose parameters are decided by the context. Fig. 5.3 plots the probability distribution of runs with respect to the position of the previously coded coefficient, denoted by j. The x-axis and the y-axis represent the runs and the normalized histograms of <Level(k), Run(j,k)>, respectively. This statistical relationship was observed in extensive experiments. We see from the figure that the probability tends to decrease with the index but has a bump at the longest run. Therefore, a piecewise linear model can be adopted to represent these characteristics. Fig. 5.3 also shows the approximations when the previous coefficient position j is 6 and 9.

Figure 5.3: The probability distribution of runs with respect to the position of the previously coded coefficient j and a piecewise linear model, where the x-axis represents the runs and the y-axis represents the normalized histogram of the pairs.

As noted, a smaller code number is assigned to a symbol of higher probability. Thus, a run is sequentially mapped to a code number in increasing order. On top of that, the longest run for a given j should be treated more carefully because of its frequent occurrence. That is, the code number corresponding to the longest run is placed earlier for more efficient coding. Following this idea, we decide the code number of the longest run as

CN_l = (j + A)/K,     (5.1)

where j is the previous coefficient position, and K and A are adaptation parameters for different contexts. For example, K is 4 and A is 1 when the coded block is an inter-coded luma block and L(k) is equal to 1. Fig. 5.4 shows a VLC table created by the proposed method.
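Under the assumption that the longest run for previous position j is j − 1 (since Run(j, k) = j − k − 1 with k ≥ 0), the tableless assignment can be sketched as:

```python
def run_to_code_number(run, j, A=1, K=4):
    """Tableless code-number assignment built around Eq. (5.1) - a sketch.

    Runs map to code numbers in increasing order, except that the longest
    possible run (assumed here to be j - 1) is pulled forward to
    CN_l = (j + A)/K because of its frequent occurrence.  With K a power of
    two, the division is realizable as one add and one bit shift."""
    max_run = j - 1                  # assumed longest run for position j
    cn_longest = (j + A) // K        # Eq. (5.1)
    if run == max_run:
        return cn_longest
    # Other runs keep their increasing order, skipping the reserved slot.
    return run if run < cn_longest else run + 1
```

For j = 6 with the example parameters, the longest run (5) receives code number (6 + 1)//4 = 1, while runs 0, 1, 2, ... fill the remaining slots in order.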
As can be seen, the code numbers are assigned in increasing order, while the code number for the longest run is determined by Eq. (5.1). The parameters K and A can be used for other context adaptations. For example, a simple monotonically decreasing linear model is better than the piecewise linear model for a chroma component; then, the linear model with A = 0 and K = 1 becomes the better approximation. Furthermore, Eq. (5.1) can be implemented with a single add and bit shift operation, so its computational complexity is negligible.

Figure 5.4: The VLC table generated by the proposed method when L(k) = 1, where the code numbers are assigned in increasing order and the code number for the longest run is decided by Eq. (5.1).

5.3.2 Codeword Adaptation with High Order Statistics

A look-up table is used for the adaptation of a codeword. The table sorts coded syntax elements based on their recent occurrences. Once an element is decoded, it is moved to a higher position by swapping, for more efficient coding. The process allows local codeword adaptation since smaller code numbers offer shorter codewords. Fig. 5.5 shows the sorting process in the table used for codeword adaptation.

Figure 5.5: A sorting table used to assign a syntax element to a code number by swapping entries.

A counter-based adaptation technique has been proposed for sorting in [13]. A counter is employed to accumulate syntax element samples, increased by one per input sample. If the counted number is greater than that of another element assigned a smaller code number, the entries are swapped. The counter-based adaptation avoids immediate changes in the table. However, the probability given by the sorted table may reflect the global statistics rather than the local ones. In addition, the sample-by-sample adaptation does not fully exploit the statistical characteristics of correlated source samples.
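A sketch of such a counter-based sorting table, here already carrying the first-order context and the count-halving of the scheme described next (the class and parameter names are ours, not HM syntax):

```python
from collections import defaultdict

class ContextSortingTable:
    """Sorting table conditioned on a first-order context.

    One ranking (element -> code number) is kept per previous-symbol context.
    Per-context counters approximate P(X_i = x | C = c); an element swaps
    upward when its count exceeds that of a higher-ranked element, and counts
    are halved periodically so recent samples carry more weight."""

    def __init__(self, elements, halve_period=64):
        self.tables = defaultdict(lambda: list(elements))   # context -> ranking
        self.counts = defaultdict(lambda: defaultdict(int)) # context -> counts
        self.seen = defaultdict(int)
        self.halve_period = halve_period

    def code_number(self, element, context):
        return self.tables[context].index(element)

    def update(self, element, context):
        c = self.counts[context]
        c[element] += 1
        table = self.tables[context]
        i = table.index(element)
        # Swap upward while this element has become more frequent.
        while i > 0 and c[element] > c[table[i - 1]]:
            table[i], table[i - 1] = table[i - 1], table[i]
            i -= 1
        self.seen[context] += 1
        if self.seen[context] >= self.halve_period:
            for k in c:
                c[k] >>= 1          # age the statistics
            self.seen[context] = 0
```

Note that each context keeps its own ranking, so a symbol can be cheap to code after one previous symbol and expensive after another.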
When a sample X_i is correlated with the previous samples S_i = (X_{i−1}, X_{i−2}, ...), the adoption of higher order statistics can improve the coding gain. Mathematically, the conditional entropy H(X_i | S_i) is always less than or equal to H(X_i). Thus, a tool based on higher order statistics is proposed for codeword adaptation in CAVLC.

In the proposed scheme, a first-order conditional probability is adopted due to its low memory requirement. The syntax elements include the prediction mode and the partition mode, the coded block flag (CBF), the coding unit split flag, the inter-prediction direction, and the reference frame index. The counter-based adaptation scheme is used to collect the samples locally. However, each counter also carries a context C about the previous symbol and computes the local probability distribution. Syntax elements j and k will swap their code numbers if the probability of j is greater than that of k, as given by

P(X_i = j | C = c) > P(X_i = k | C = c).     (5.2)

The local probability is computed from the accumulated numbers of syntax elements in a counter. After several samples, the counted numbers are halved so that more recent samples carry more weight in the distribution calculation.

5.4 Proposed Tools for CABAC

The throughput in parsing a bitstream is severely worsened if there exists a dependency between the current symbol and the symbol coded just before it, because of the resulting latency. In the HM 4.0 design, a reverse diagonal scanning order is adopted to enhance parallel processing [30]. HM 4.0 utilizes 15 contexts for a 4×4 block and 16 contexts for an 8×8 block. The context derivation is solely dependent on the position of the current coefficient, and thus there is no dependency problem. On the contrary, in large block sizes such as 16×16 and 32×32, context numbers are assigned using neighbor coefficients as shown in Fig. 5.6. The context of the current significance flag is determined by the existence of the non-zero neighbor coefficients 'A', 'B', 'C', 'D', and 'E'.
Therefore, there is room for improving the throughput by removing the dependency on neighbors.

The mode-dependent coefficient scanning (MDCS) method in intra coding was extended to the large block sizes, e.g., 16×16 and 32×32. The MDCS chooses among diagonal, horizontal, vertical, and adaptive scanning patterns, depending on the prediction direction. However, the new scanning methods in transform blocks larger than 8×8 can obstruct parallel processing in decoding without a proper modification of the context model in significance map coding. This contribution proposes a parallelizable context modeling on top of the MDCS in large transform blocks.

Figure 5.6: Context selection of 'X' in the significance map coding used for a large transform block, based on the summation of the significance at A, B, C, D, and E.

The proposed algorithm excludes the most recently coded coefficient from the context model if it lies on the same scanning line. Fig. 5.7 shows the modified context selection when the horizontal or vertical scanning order is used for the significance map coding. In Fig. 5.7 (a), the position 'A' is moved from the right of the current coefficient position 'X' to the bottom left; otherwise, the significance coding of 'X' cannot start until the significance coding/decoding at the right position is completed. This problem can stall the parallel processing, particularly on the decoder side. The same principle is applied to the vertical scanning. The context of 'X', defined as CTX(X), is derived as the sum of the significance at the positions A, B, C, D, and E defined in Fig. 5.7, given as

CTX(X) = min{4, SIG(A) + SIG(B) + SIG(C) + SIG(D) + SIG(E)},     (5.3)

where SIG(·) is 0 when the corresponding coefficient is zero or the position is not available, and is set to 1 otherwise.

Figure 5.7: The proposed context selection of 'X' in significance map coding used for large transform blocks: applied to (a) the horizontal scanning pattern and (b) the vertical scanning pattern.
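Eq. (5.3) can be sketched as follows. The exact A-E offsets are an illustrative guess at the template of Fig. 5.7, not the normative HM positions; the point is only that no neighbor on the current scanning line is consulted:

```python
def significance_context(sig, x, y, horizontal_scan=True):
    """Context index for the significance flag at column x, row y of a large
    transform block, per Eq. (5.3).  `sig` is a 2-D 0/1 map of already-coded
    significance flags.  Neighbors on the current scanning line are excluded,
    so decoding need not wait for the previous bin on the same line."""
    h, w = len(sig), len(sig[0])

    def s(dx, dy):  # SIG(.), returning 0 outside the block
        xx, yy = x + dx, y + dy
        return sig[yy][xx] if 0 <= xx < w and 0 <= yy < h else 0

    if horizontal_scan:
        # All neighbors taken from rows below the current row (hypothetical).
        neighbors = [(-1, 1), (0, 1), (1, 1), (0, 2), (1, 2)]
    else:
        # Mirrored template for the vertical scan: columns to the right only.
        neighbors = [(1, -1), (1, 0), (1, 1), (2, 0), (2, 1)]
    return min(4, sum(s(dx, dy) for dx, dy in neighbors))
```

Because every consulted position lies off the current scanning line, several significance bins on one line can be processed without waiting on each other.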
5.5 Experimental Results

In this section, we conduct experiments to evaluate the proposed algorithms. According to the common CE test conditions, the all intra (AI), random access (RA), and low delay (LD) coding configurations are used for the evaluation. We use the JCT-VC test sequences of different resolutions, which are categorized into classes A, B, C, D, and E. The QP values are chosen as 22, 27, 32, and 37 so as to cover a broad range of bit-rates.

The tableless CAVLC coding scheme significantly reduces the memory size, while changing the coding gain only slightly. One VLC table requires 812 bytes to store its entries. Table 5.2 shows the coding performance of the tableless CAVLC coding scheme. The coding performance in "AI," "RA," and "LD" is -0.1%, -0.1%, and +0.1% BD-rate on average, respectively. The experimental result indicates that there is little change in coding performance with this simplification.

Table 5.2: Performance summary of the proposed tableless CAVLC coding scheme (Y BD-rate) against the original CAVLC codec in HM 3.0.

Sequence  AI     RA     LD
Class A   -0.1%  0.0%   N/A
Class B   -0.1%  -0.1%  +0.0%
Class C   -0.2%  -0.1%  +0.1%
Class D   -0.2%  -0.1%  +0.1%
Class E   -0.1%  N/A    +0.4%
Average   -0.1%  -0.1%  +0.1%

The performance of the proposed codeword adaptation scheme is given in Table 5.3. Two benchmarks are chosen: the counter-based adaptive (CA) scheme adopted in HM 3.0, and a direct adaptive (DA) scheme performing instant changes, adopted in HM 2.0. As shown in the table, the proposed codeword adaptation scheme has a gain of up to 1.2% BD-rate reduction over the two benchmarks.

Table 5.3: Performance comparison of the proposed scheme against the two benchmarks, the counter-based adaptive scheme (CA) and the direct adaptive scheme (DA).
Sequence  Benchmark  AI     RA     LD
Class A   DA         -0.5%  -0.8%  N/A
          CA         -0.2%  -0.3%  N/A
Class B   DA         -0.5%  -1.2%  -0.9%
          CA         -0.1%  -0.2%  -0.2%
Class C   DA         -0.3%  -0.7%  -0.7%
          CA         -0.1%  -0.3%  -0.3%
Class D   DA         -0.3%  -0.5%  -0.7%
          CA         -0.2%  -0.1%  -0.2%
Class E   DA         -0.4%  N/A    -0.7%
          CA         -0.2%  N/A    -0.5%
Average   DA         -0.4%  -0.8%  -0.8%
          CA         -0.2%  -0.2%  -0.3%

The proposed algorithm for significance map coding enhances the throughput in parsing a bitstream, while the context modification affects the coding efficiency minimally. Table 5.4 summarizes the result.

Table 5.4: Performance summary of the proposed context modification (Y BD-rate) against the original CABAC codec in HM 5.0.

Sequence  Y      U      V
Class A   0.1%   -0.2%  -0.2%
Class B   -0.1%  -0.6%  -0.6%
Class C   0.0%   -0.2%  -0.2%
Class D   0.0%   -0.3%  -0.2%
Class E   -0.2%  -0.4%  -0.3%
Average   -0.1%  -0.3%  -0.3%

5.6 Conclusion

Three coding tools were proposed to improve the CAVLC and the CABAC. First, in the tableless coding scheme, the VLC tables are completely removed to reduce memory; instead, a code number is generated by an equation with adaptive parameters. The proposed method was adopted into the standard. Second, in the improved codeword adaptation scheme, syntax elements are adaptively sorted based on their first-order statistics. Lastly, the parallelizable context modification for the significance coding of large transform blocks enhances the throughput of CABAC with a minimal change in coding gain.

Chapter 6

Conclusion and Future Work

In this work, we proposed a set of novel and efficient HD video coding techniques that address significant limitations of conventional video coding techniques. Our main objective was to improve the coding efficiency by exploiting the carefully examined characteristics of HD videos. It was demonstrated with experimental results that the proposed algorithms outperform the state-of-the-art video coding standard by a substantial margin.
In Chapter 3, we proposed a novel video coding technique called the joint FOR/SOR coding scheme and showed its superior coding performance over H.264/AVC. We observed that the main bottleneck of H.264/AVC in HD video coding lies in the structural noise in residual signals after motion compensation. To address this problem, we developed an efficient prediction technique called ISP for the SOR coding and employed an SMB for the FOR coding. Rate-distortion optimized bit allocation was considered for the FOR and SOR coders.

In Chapter 4, we presented two advanced processing techniques for prediction residuals, referred to as the two-layered transform with sparse representation (TTSR) and the slant residual shift (SRS) techniques. In the TTSR technique, a residual signal is represented with two transforms: a sparse representation, which represents a signal using a few atoms, is applied to coarsely reconstruct the residual, and a 2-D DCT in cascade transforms the remaining signal. In the SRS method, a slant residual signal is processed so that its pattern is aligned with a horizontal or a vertical orientation. A vector quantizer (VQ) scheme is used to train residual signals and classify them into pre-defined contexts so as to reduce the side information. It was shown by experimental results that the two techniques provide improved coding performance over the HEVC.
In Chapter 5, we proposed several coding tools for CAVLC and CABAC in the HEVC. The proposed methods were submitted to the HEVC standardization and adopted into the recent standard draft. A novel tableless run-length coding in CAVLC removes the entire set of VLC tables with only a small change in coding gain. On top of that, higher order statistics were investigated for the adaptation of codewords in CAVLC. To enhance the throughput of the CABAC, we modified the context used for significance map coding by removing the dependency among neighbor transform coefficients.

This research can be extended to various applications, such as medical image/video coding and synthesized screen content, and lead to significantly improved coding efficiency. It will be worthwhile to observe the signal characteristics and apply the proposed methods adaptively. We will also consider fast algorithms that reduce the computational complexity with only a slight change in coding gain.

Bibliography

[1] Adaptive loop filter with low encoder complexity. Doc. JCTVC-C113, Oct. 2010.
[2] Asymmetric motion partition (AMP). Doc. JCTVC-E316, Mar. 2011.
[3] CE1: Summary report of core experiment on entropy coding. Doc. JCTVC-H0031, Feb. 2012.
[4] CE1.D1: Nokia report on high throughput binarization. Doc. JCTVC-H0232, Feb. 2012.
[5] CE7: Boundary-Dependent Transform for Inter-Predicted Residue. Doc. JCTVC-H0309, Feb. 2012.
[6] CE7: Experimental Results for the ROT. Doc. JCTVC-G304, Nov. 2011.
[7] CE7: Mode-dependent DCT/DST without 4x4 full matrix multiplication for intra prediction. Doc. JCTVC-E125, Mar. 2011.
[8] CE7: Mode dependent intra residual coding. Doc. JCTVC-E098, Mar. 2011.
[9] Coefficient coding with LCEC for large block. Doc. JCTVC-E383, Mar. 2011.
[10] Common HM test conditions and software reference configurations. Doc. JCTVC-G1200, Nov. 2011.
[11] Comparison of Compression Performance of HEVC Working Draft 4 with AVC High Profile. Doc. JCTVC-G399, Nov. 2011.
[12] Competition-Based Scheme for Vector Selection and Coding. Doc. VCEG-AC06, July 2006.
[13] Counter based adaption for LCEC. Doc. JCTVC-E143, Mar. 2011.
[14] Draft Test Model under Consideration. Doc. JCTVC-B205, July 2010.
[15] Global mobile data traffic forecast update, 2009-2014. Cisco Systems, 2010.
[16] Improvements on VLC. Doc. JCTVC-C263, Oct. 2010.
[17] Joint Call for Proposals on Video Compression Technology. ITU-T Q.6/SG16, Doc. VCEG-AM91, Jan. 2010.
[18] Key technology area (KTA) software of the ITU-T.
[19] Mode-Dependent 8x8 DCT/DST for Intra Prediction. Doc.
JCTVC-F282, July 2011.
[20] Mode-Dependent Intra Smoothing Modifications. Doc. JCTVC-F126, July 2011.
[21] Non-CE1: throughput improvement on CABAC coefficients level coding. Doc. JCTVC-H0554, Feb. 2012.
[22] Parallelizable context for significance coding of large transform blocks. Doc. JCTVC-G586, Nov. 2011.
[23] Quadtree-based adaptive offset. Doc. JCTVC-C147, Oct. 2010.
[24] Single entropy coder for HEVC with a high throughput binarization mode. Doc. JCTVC-G569, Nov. 2011.
[25] Summary of HEVC working draft 1 and HEVC test model (HM). Doc. JCTVC-C405, Oct. 2010.
[26] Summary report of core experiment 9 on MV Coding and Skip Merge operations. Doc. JCTVC-E029, Mar. 2011.
[27] Tableless run-length coding for transform coefficients in CAVLC. Doc. JCTVC-F543, July 2011.
[28] Transform design for HEVC with 16-bit intermediate data representation. Doc. JCTVC-E243, Mar. 2011.
[29] Transform Skip Mode. Doc. JCTVC-F077, July 2011.
[30] Unified scans for the significance map and coefficient level coding in high coding efficiency. Doc. JCTVC-F288, July 2011.
[31] Video Coding Technology Proposal by Fraunhofer HHI. Doc. JCTVC-A116, Apr. 2010.
[32] Video Coding Technology Proposal by Nokia. Doc. JCTVC-A119, Apr. 2010.
[33] Video Coding Technology Proposal by Samsung (and BBC). Doc. JCTVC-A124, Apr. 2010.
[34] Video Coding Technology Proposal by Samsung (and BBC). Doc. JCTVC-A125, Apr. 2010.
[35] Working Draft 4 (WD 4) of High-Efficiency Video Coding. Doc. JCTVC-F803, July 2011.
[36] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Trans. Signal Process., 54:4311–4322, November 2006.
[37] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[38] Chuo-Ling Chang, M. Makar, S. S. Tsai, and B. Girod. Direction-Adaptive Partitioned Block Transform for Color Image Coding. IEEE Trans. Image Process., 19:1740–1755, July 2010.
[39] S. S. Chen, D. L. Donoho, and M. A.
Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comp., 20:33–61, 1999.
[40] T. Chiang and Y.-Q. Zhang. A new rate control scheme using quadratic rate distortion model. IEEE Trans. Circuits Syst. Video Tech., 7(1):246–250, February 1997.
[41] R. Cohen, S. Klomp, A. Vetro, and H. Sun. Direction-Adaptive Transforms for Coding Prediction Residuals. In Proc. ICIP, September 2010.
[42] Yunyang Dai, Qi Zhang, A. Tourapis, and C.-C. Jay Kuo. Efficient block-based intra prediction for image coding with 2D geometrical manipulations. In Proc. ICIP, October 2008.
[43] J. Dong, K. N. Ngan, C.-K. Fong, and W.-K. Cham. 2-D Order-16 Integer Transforms for HD Video Coding. IEEE Trans. Circuits Syst. Video Tech., 19:1462–1474, October 2009.
[44] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization. In Proc. Nat. Acad. Sci. USA, volume 100.
[45] Michael Elad and Michal Aharon. Image Denoising via Sparse and Redundant Representations Over Learned Dictionaries. IEEE Trans. Image Process., 15:3736–3745, December 2006.
[46] G. Bjontegaard. Calculation of average PSNR differences between RD-curves. ITU-T Q.6/16, Doc. VCEG-M33, Mar. 2001.
[47] Z. He and S. K. Mitra. A linear source model and a unified rate control algorithm for DCT video coding. IEEE Trans. Circuits Syst. Video Tech., 12(11):970–982, November 2002.
[48] K. C. Hui and W. C. Siu. Extended analysis of motion compensated frame difference for block-based motion prediction error. IEEE Trans. Image Process., 16:1232–1245, May 2007.
[49] A. K. Jain. Image data compression: A review. Proc. IEEE, 69:349–389, March 1981.
[50] F. Kamisli and J. S. Lim. One-D Transforms for the Motion Compensation Residual. IEEE Trans. Image Process., 20:1036–1046, April 2010.
[51] Je-Won Kang, Robert Cohen, Anthony Vetro, and C.-C. Jay Kuo. Efficient dictionary based video coding with reduced side information. In Proc. ISCAS, May 2011.
[52] Je-Won Kang, Seung-Hwan Kim, and C.-C. Jay Kuo.
FOR/SOR Video Coding with Super Macroblock and Inter-frame Stripe Prediction. In Proc. ICASSP, May 2011. [53] Ho June Leu, Seong-Dae Kim, and Wook-Joong Kim. Statstical Modeling of Inter- Frame Prediction Error and Its Adaptive Transform. IEEE Trans. Circuits Syst. Video Tech., 21:519–523, April 2011. [54] Jin Li, Moncef Gabbouj, and Jarmo Takala. Zero-Quantized Inter DCT Coefficient Prediction for Real-Time Video Coding. IEEE Trans. Circuits Syst. Video Tech., 22:249–259, February 2012. [55] Shangwen Li, Sijia Chen, Jianpeng Wang, and Lu Yu. Second Order Prediction On H.264/AVC. In Proc. Picture Coding Symposium (PCS), May 2009. [56] SiweiMaandC.-C.JayKuo.High-definitionVideoCodingwithSuper-macroblocks. In Proc. SPIE Visual Communications and Image Processing, January 2007. [57] S. Naito and A. Koike. Efficient coding scheme for super high definition video based on extending H.264 high profile. In Proc. SPIE Conference on Visual Communications and Image Processing, pages 607727– 1–607727–8, jan 2006. [58] Ralph Neff, Avideh Zakhor, and Martin Vetterli. Very low bit rate video coding using matching pursuit. In Proc. SPIE Conference on Visual Communications and Image Processing, pages 47– 60, Sep 1994. [59] W. Niehsen and M. Brunig. Covariance analysis of motion compensated frame difference. IEEE Trans. Circuits Syst. Video Tech., 9:536–539, 1999. [60] J.Starck,M.Elad,andD.L.Donoho. ImageDecompositionviatheCombinationof Sparse Representations and a Variational Approach. IEEE Trans. Image Process., 14:1570–1582, October 2005. [61] Bo Tao and Michael T. Orchard. Gradient-based residual variance modeling and its applications to motion-compensated video coding. IEEE Trans. Image Process., 10(1):24–35, January 2001. [62] Mehmet Turkan and Christine Guillemot. Sparse Approximation with Adaptive Dictionary for Image Prediction. In Proc. ICIP, October 2009. 104 [63] T. Wiegand, J. R. Ohms, G. J. Sullivan, W. J. Han, R. Joshi, T. K. Tan, and K. Ugur. 
Special Section on the Joint Call for Proposals on High Efficiency Video Coding Standardization. IEEE Trans. Circuits Syst. Video Tech., 20:1661–1666, December 2010. [64] M. Winken, P. Helle, D. Marpe, H. Schwarz, and T. Wiegand. Transform Coding in the HEVC Test Model. In Proc. ICIP, September 2011. [65] Yan Ye and Marta Karczewicz. Improved H.264 Intra Coding Based on Bi- directional Intra Prediction, Directional transform, and Adaptive Coefficient Scan- ning. In Proc. ICIP, October 2008. [66] B. Zeng and J. Fu. Directional Discrete Cosine Transforms - A New Framework for Image Coding. IEEE Trans. Circuits Syst. Video Tech., 18:305–313, March 2008. [67] C.Zhang,K.Ugur,J.Lainema;A.Hallapuro,andM.Gabbouj. VideoCodingUsing Spatially Varying Transform. IEEE Trans. Circuits Syst. Video Tech., 21:127–139, February 2011. 105
Abstract
High definition (HD) video content has become popular, and displays of even higher resolution, such as ultra definition, have emerged in recent years. Conventional video coding standards offer excellent coding performance at lower bit rates, but they are not as efficient for HD video content. The objective of this research is to develop a set of efficient coding tools and techniques that offer a better coding gain for HD video. The following three techniques are studied in this work.

First, we present a joint first-order-residual/second-order-residual (FOR/SOR) coding technique that incorporates several advanced coding tools for HD video coding. In the FOR coder, block-based prediction is used to exploit both temporal and spatial correlation in the original frame for coding efficiency. However, structural noise still remains in the prediction residuals, so we design an efficient SOR coder to encode the residual image. Block-adaptive bit allocation between the FOR and SOR coders, which corresponds to selecting two different quantization parameters for the two coders in different spatial regions, further enhances the coding performance. Experimental results show that the proposed FOR/SOR coding algorithm outperforms H.264/AVC significantly in HD video coding, with an average bit-rate saving of 15.6%.

Second, we develop two advanced processing techniques for prediction residuals, referred to as the two-layered transform with sparse representation (TTSR) and the slant residual shift (SRS), to improve coding efficiency. Prediction residuals often exhibit a non-stationary property, for which the DCT becomes sub-optimal and yields undesired artifacts. The proposed TTSR algorithm makes use of sparse representation and is targeted toward the state-of-the-art video coding standard, High Efficiency Video Coding (HEVC).
A dictionary is adaptively trained to contain the featured patterns of residual signals so that a large portion of the energy in a structured residual can be coded efficiently with sparse coding, and the DCT is then applied in cascade to the signal remaining after sparse coding. The use of multiple representations is justified with an R-D analysis, and the two transforms successfully complement each other. The SRS technique aligns dominant prediction residuals of inter-predicted frames with the horizontal or vertical direction via row-wise or column-wise circular shifts before the 2-D DCT. To determine the proper pixel shift, we classify blocks into several types, each of which is assigned an index number; these indices are sent to the decoder as signaling flags, which can be viewed as the mode information of the SRS technique. Experimental results demonstrate that the proposed algorithm outperforms the HEVC.

Third, we contribute several efficient coding tools to the two context-adaptive entropy coders of the HEVC, i.e., Context Adaptive Variable Length Coding (CAVLC) and Context Adaptive Binary Arithmetic Coding (CABAC). The proposed tableless VLC scheme removes all the tables used for residual coding while yielding a negligible change in coding performance: the statistical properties of a symbol are exploited to replace the conventional tables with a mathematical model, and a high-order Markov model improves the coding gain. On top of that, a new context for significance map coding in large transform blocks is designed. The proposed context model removes the dependency on neighboring significant coefficients along the same scanning line and thus improves the throughput of the CABAC. The proposed algorithm is further extended to the mode-dependent coefficient scanning method for large transform blocks.
The proposed algorithm has a negligible effect on the coding performance while significantly improving parallelization.
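The row-wise circular shift behind the SRS idea can be illustrated with a toy example. The sketch below is not the dissertation's implementation; the block shapes, shift offsets, and the energy-compaction measure are assumptions chosen only to show why aligning a slanted residual pattern before the 2-D DCT concentrates the coefficient energy into the lowest vertical frequencies.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II computed via the separable matrix form."""
    n = block.shape[0]
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)  # DC row scaling for orthonormality
    return C @ block @ C.T

def slant_residual_shift(block, shifts):
    """Circularly shift each row by its own offset so that a slanted
    residual pattern lines up in a single column before the DCT."""
    out = np.empty_like(block)
    for r, s in enumerate(shifts):
        out[r] = np.roll(block[r], -s)
    return out

def energy_fraction_top_row(block):
    """Fraction of total DCT coefficient energy in the lowest vertical
    frequency (row 0) -- a simple proxy for energy compaction."""
    coeffs = dct2(block)
    return float(np.sum(coeffs[0] ** 2) / np.sum(coeffs ** 2))

# A 4x4 residual with one strong pixel per row, drifting one column to
# the right each row: a "slanted" structure that the 2-D DCT handles poorly.
res = np.zeros((4, 4))
for r in range(4):
    res[r, r] = 10.0

# Shifting row r left by r pixels stacks the pattern into column 0, so
# every column becomes constant and the vertical DCT collapses to DC.
aligned = slant_residual_shift(res, shifts=[0, 1, 2, 3])

print(round(energy_fraction_top_row(res), 2),
      round(energy_fraction_top_row(aligned), 2))  # -> 0.25 1.0
```

After alignment, all of the coefficient energy sits in the first row of the transform block, which is exactly the kind of compaction a subsequent coefficient scan and entropy coder can exploit; the decoder undoes the shift using the signaled block-type index.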
Asset Metadata
Creator: Kang, Je-Won (author)
Core Title: Efficient coding techniques for high definition video
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 06/07/2013
Defense Date: 05/03/2012
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: data compression, entropy coding, H.264/AVC, HEVC, high definition video coding, multimedia signal processing, OAI-PMH Harvest, sparse representation, transform
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, C.-C. Jay (committee chair), Gabbouj, Moncef (committee member), Neumann, Ulrich (committee member), Ortega, Antonio K. (committee member)
Creator Email: jewonkan@usc.edu, sagittak@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-45608
Unique Identifier: UC11289487
Identifier: usctheses-c3-45608 (legacy record id)
Legacy Identifier: etd-KangJeWon-881.pdf
Dmrecord: 45608
Document Type: Dissertation
Rights: Kang, Je-Won
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA