ADVANCED TECHNIQUES FOR HIGH FIDELITY VIDEO CODING

by

Qi Zhang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2010

Copyright 2010 Qi Zhang

Table of Contents

List Of Tables
List Of Figures
Abstract
Chapter 1 Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
    1.2.1 Lossless Coding
    1.2.2 Fast ME in Transform Domain
    1.2.3 Fast Subpel ME
  1.3 Contribution of the Research
  1.4 Organization of the Dissertation
Chapter 2 Research Background
  2.1 Motion Compensated Prediction
    2.1.1 Block-based Motion Compensated Prediction
  2.2 Frequency-domain MCP
    2.2.1 H.264/AVC Intra Prediction
  2.3 Film Grain Noise Compression
  2.4 Sub-pel Motion Estimation
  2.5 Conclusion
Chapter 3 Direct Subpel Motion Estimation Techniques
  3.1 Introduction
  3.2 Characterization of Local Error Surface
    3.2.1 Problem with Traditional Surface Modeling
    3.2.2 Condition Number Estimation
    3.2.3 Deviation from Flatness
  3.3 Optimal Subpel MV Resolution Estimation
  3.4 Direct Subpel MV Position Prediction
    3.4.1 Ill-Conditioned Blocks
    3.4.2 Well-Conditioned Blocks
    3.4.3 Performance Evaluation
  3.5 Experimental Results
  3.6 Conclusion
Chapter 4 Granular Noise Prediction and Coding Techniques
  4.1 Introduction
  4.2 Impact of Granular Noise on High Fidelity Coding
    4.2.1 Observations
    4.2.2 Analysis
  4.3 Overview of GNPC Coding Framework
  4.4 Granular Noise Prediction and Coding in Frequency Domain
    4.4.1 Review of Frequency Domain Prediction Techniques
    4.4.2 Granular Noise Prediction in Frequency Domain
    4.4.3 Rate-Distortion Optimization
    4.4.4 Translational Index Mapping
  4.5 Experimental Results
  4.6 Conclusion
Chapter 5 Multi-Order Residual (MOR) Coding
  5.1 Introduction
  5.2 Signal Analysis for High-bit-rate Video Coding
    5.2.1 Distribution of DCT Coefficients
    5.2.2 Correlation Analysis
  5.3 Multi-Order-Residual (MOR) Prediction and Coding
    5.3.1 Overview of MOR Coding System
    5.3.2 Goals of MOR Prediction
    5.3.3 MOR Prediction in Frequency Domain
    5.3.4 Rate-Distortion Optimization
    5.3.5 Pre-search Coefficient Optimization for TOR
  5.4 Experimental Results
  5.5 Conclusion and Future Work
Chapter 6 Conclusion and Future Work
  6.1 Summary of the Research
  6.2 Future Research Directions
Bibliography

List Of Tables

3.1 The complexity saving S(%) of the proposed ZDK-I, ZDK-II and the SCJ method with respect to H.264 full search.
3.2 Coding efficiency comparison of the proposed ZDK-I scheme and the SCJ method with respect to H.264 with quarter-pel resolution.
3.3 Coding efficiency comparison of the proposed ZDK-I scheme and the SCJ method with respect to H.264 with eighth-pel resolution.
4.1 Mode distribution of blocks at QP=28 for several test CIF sequences.
4.2 Mode distribution of macroblocks for HD sequences with QP=8.
4.3 Experimental setup for H.264/AVC and the GNPC scheme.
4.4 Coding efficiency comparison between H.264 and GNPC in the frequency domain.
5.1 Coding efficiency comparison of the proposed MOR scheme vs. H.264/AVC for high-bit-rate coding.

List Of Figures

1.1 The video quality as a function of the compression ratio for the state-of-the-art video coding algorithms, where the compression ratio is in (a) the regular scale and (b) the logarithmic scale.
1.2 The complexity profiling of H.264/AVC (a) encoder and (b) decoder [40].
2.1 Block matching process.
2.2 The block diagram of (a) the encoder and (b) the decoder with motion estimation and compensation in the DCT domain.
2.3 Neighboring pixel samples used in (a) Intra 16x16, (b) Intra 8x8 and (c) Intra 4x4 modes.
2.4 Illustration of different inter prediction block sizes in H.264/AVC.
2.5 Interpolation filter for sub-pel accuracy motion compensation.
3.1 Illustration of a square window of dimension -1 < \Delta x, \Delta y < 1 centered around the optimal integer-pel MV position indicated by the central empty circle.
3.2 Illustration of error surfaces for a well-conditioned block: (a) the 3D plot of the actual error surface; and the 2D contour plots of (b) the actual error surface, (c) error surface model E_9, and (d) error surface model E_6.
3.3 Illustration of error surfaces for an ill-conditioned block: (a) the 3D plot of the actual error surface; and the 2D contour plots of (b) the actual error surface, (c) error surface model E_9, and (d) error surface model E_6.
3.4 The error curves passing through the origin along the 0-, 45-, 90- and 135-degree directions for (a) a well-conditioned block, and (b) an ill-conditioned block.
3.5 (a) Block examples that are likely to have well-conditioned error surfaces; (b) block examples that are likely to have ill-conditioned error surfaces. Blocks are taken from sample sequences of Foreman CIF, Vintage Car HD and Harbor HD.
3.6 (a) The histogram of D_f at QP=20 and (b)-(e) the probability distributions for the optimal MV resolution at integer-pel, 1/2-pel, 1/4-pel and 1/8-pel for a set of test video sequences.
3.7 (a) The histogram of condition numbers and (b) the prediction error distance \varepsilon_s as a function of the condition number using the SJ E_9 model.
3.8 Illustration of the optimal subpel MV position prediction for ill-conditioned blocks: (a) Step 1: finding the minima in three vertical planes using quadratic curve fitting and (b) Step 2: connecting the three minima found in Step 1 and finding the optimal subpel MV position with another quadratic curve fitting.
3.9 Illustration of the optimal subpel MV position prediction for a well-conditioned block.
3.10 (a) The histogram of the condition number, and (b) the prediction error distance \varepsilon_s as a function of the condition number using the proposed prediction method as described in Secs. 3.4.1 and 3.4.2.
3.11 The complexity saving as a function of the coding bit rate with (a) ZDK-I and (b) ZDK-II for four sample sequences.
3.12 The R-D performance of ZDK-I and two benchmark methods for four CIF sequences: (a) Foreman, (b) Mobile, (c) Stefan, and (d) Flower Garden.
3.13 The R-D performance of ZDK-I and two benchmark methods for four HD sequences: (a) City Corridor, (b) Night, (c) Blue Sky, and (d) Vintage Car.
3.14 The R-D performance of ZDK-II and two benchmark methods for four CIF sequences: (a) Foreman, (b) Mobile, (c) Stefan, and (d) Flower Garden.
3.15 The R-D performance of ZDK-II and two benchmark methods for four HD sequences: (a) City Corridor, (b) Night, (c) Blue Sky, and (d) Vintage Car.
4.1 (a) The macroblock partition modes and (b) B-frame prediction.
4.2 The mode distribution of H.264/AVC for the Rush Hour HD sequence at various QP values.
4.3 The block diagram of the proposed granular noise extraction process.
4.4 Granular noise block partition in the frequency domain: (a) full mode, (b) par mode and (c) prediction alignment for par mode.
4.5 The DCT-domain based granular noise prediction for (a) intra noise frame and (b) inter noise frame with search range S.
4.6 The block diagram of (a) the encoder and (b) the decoder of the proposed GNPC scheme for high fidelity video coding.
4.7 The translational vector maps for (a) the content layer and (b) the granular noise layer for a Rush Hour frame at a resolution of 352x288.
4.8 Illustration of the translational indexing scheme for (a) the intra GNPC frame and (b) the inter frame in frequency domain based GNPC.
4.9 Rate-distortion curves for HD video sequences (a) Rush Hour, (b) Blue Sky, (c) Sunflower and (d) Vintage Car.
4.10 The mode distribution chart for the Rush Hour sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without GNPC in the frequency domain.
4.11 The mode distribution chart for the Blue Sky sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without GNPC in the frequency domain.
4.12 The mode distribution graph for the Sunflower sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without GNPC in the frequency domain.
4.13 The mode distribution chart for the Vintage Car sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without GNPC in the frequency domain.
5.1 The probability distribution of non-zero DCT coefficients at each scanning position for (a) Jet, (b) City Corridor and (c) Preakness frames.
5.2 (a) A sample frame from the Jet sequence and its prediction residual difference, (b) a sample frame from the City Corridor sequence and its prediction residual difference, and (c) a sample frame from the Preakness sequence and its prediction residual difference, at 1280x720 resolution with QP_1 = 10 and QP_2 = 30.
5.3 (a) The correlation analysis for scenes with different complexities and (b) the relationship between the bit rate and quantization.
5.4 Overview of the Multi-Order-Residual (MOR) coding scheme.
5.5 The block diagram of the proposed MOR coding scheme.
5.6 A typical histogram of prediction residuals in the DCT domain.
5.7 Histograms of (a) MOR data in the form of pixel differences and (b) MOR data in the form of DCT coefficients.
5.8 Re-grouping of the same frequency coefficients to obtain planes of DCT coefficients, denoted by P_i, where i = 0, 1, ..., M^2 - 1.
5.9 MCP in the frequency domain for each frequency plane.
5.10 The DCT coefficient histograms of MOR data after MOR prediction in the frequency domain.
5.11 The block diagram of the pre-search DCT coefficient optimization process for TOR.
5.12 The block diagram of the proposed MOR coding scheme with pre-search DCT coefficient optimization for TOR.
5.13 Rate-distortion curves for the Pedestrian sequence.
5.14 Rate-distortion curves for the Rush Hour sequence.
5.15 Rate-distortion curves for the Riverbed sequence.
5.16 Rate-distortion curves for the Vintage Car sequence.
5.17 Decoded Rush Hour frames with (a) MOR and (b) H.264 at 60 Mbps.
5.18 Decoded Vintage Car frames with (a) MOR and (b) H.264 at 80 Mbps.

Abstract

This research focuses on two advanced techniques for high-bit-rate video coding: 1) subpel motion estimation and 2) residual processing.

First, we study sub-pixel motion estimation for video coding. We analyze the characteristics of the sub-pel motion estimation error surface and propose an optimal subpel motion vector resolution estimation scheme that allows each block, with its own characteristics, to maximize its R-D gain through a flexible motion vector resolution. Furthermore, a direct subpel MV prediction scheme is proposed to estimate the optimal subpel position. The rate-distortion performance of the proposed motion prediction scheme is close to that of full search while it demands only about 10% of the computational complexity of the full search.

Secondly, we investigate high-bit-rate video coding techniques for high definition video contents. We observed that, under the requirements of high-bit-rate coding, a large portion of uncompensated information remains in the prediction residual that exhibits signal characteristics similar to film grain noise. Due to the small quantization step size used in high-bit-rate coding, these untreated small features render all existing coding schemes ineffective. To address this issue, a novel granular noise prediction and coding scheme is proposed to provide a separate treatment for these residuals, and a frequency domain-based prediction and coding scheme is proposed to enhance the coding performance. The proposed granular noise prediction and coding scheme outperforms H.264/AVC by an average of 10% bit rate saving.

Thirdly, we further investigate the impact of high-bit-rate coding from a more fundamental signal-characteristics point of view. A probability distribution analysis of DCT coefficients from the H.264/AVC codec under different bit rates reveals that the prediction residual, in the form of DCT coefficients, has a near-uniform distribution at all scanning positions. To further understand this phenomenon, a correlation-based analysis was conducted, showing that different types of correlations exist in a video frame and that the distribution of these correlations highly impacts the coding efficiency. A significant amount of short- and medium-range correlation due to the use of a fine quantization parameter cannot be easily removed by existing compensation techniques. Consequently, the video coding performance degrades rapidly as quality increases. A novel Multi-Order-Residual (MOR) coding scheme is therefore proposed. The concept is based on numerical analysis that extracts different correlations through different phases. A DCT-based compensation and coding scheme, combined with an improved rate-distortion optimization process, is proposed to target the higher-order signal characteristics, and an additional pre-search coefficient optimization phase further enhances compression performance. Experimental results show that the proposed MOR scheme outperforms H.264/AVC by an average of 16% bit rate savings.
Chapter 1
Introduction

1.1 Significance of the Research

Video coding has been extensively studied in the last three decades and is widely used in video storage and communication. Earlier video compression research placed heavy emphasis on low-bit-rate coding due to the limited availability of storage and bandwidth. However, with the increased popularity of high definition (HD) video content and the increased transmission bandwidth in recent years, the research focus has shifted from low-bit-rate video coding to high-bit-rate (or high fidelity) video coding.

We show the video quality as a function of the compression ratio in Fig. 1.1 for today's state-of-the-art video coding algorithms. For applications such as video conferencing and DVD, the current H.264/AVC video coding algorithm can achieve a compression ratio of 100 or higher. However, when the quality requirement is above 35 dB (e.g., for high-fidelity video), the compression ratio drops rapidly. This observation motivates us to investigate ways to further improve the coding efficiency of high-fidelity video. Studies in Chapter 4 reveal that some of the uncompensated fine structural information present in HD video contents prevents current compression algorithms from achieving good coding efficiency in the high-bit-rate range. Additional analysis in Chapter 5 shows that the different types of correlation existing in the video highly impact the coding efficiency under high-bit-rate coding environments. With the market demand shifting towards high-fidelity and high-definition video, a more effective video coding algorithm for such applications is highly desirable.

Figure 1.1: The video quality as a function of the compression ratio for the state-of-the-art video coding algorithms, where the compression ratio is in (a) the regular scale and (b) the logarithmic scale.

Subpel motion estimation (ME) provides another mechanism to achieve high fidelity video coding. However, the computational complexity of subpel ME is high. The complexity of a coding algorithm is typically measured in terms of the number of arithmetic operations (millions of instructions per second, or MIPS), memory requirement, power consumption, chip area and hardware cost. As compared to previous standards, H.264/AVC delivers the best coding performance at the cost of the highest complexity. As shown in Fig. 1.2, which was reported in [40], integer and sub-pel motion estimation (ME) and interpolation are the two most time-consuming coding modules in the H.264/AVC encoder, while luma interpolation is the most demanding module in the H.264/AVC decoder.

Figure 1.2: The complexity profiling of H.264/AVC (a) encoder and (b) decoder [40].

In addition to computational complexity, HD video coding also imposes a lot of stress on frame memory allocation and access. For example, to encode an HD frame of size 1920x1080 with 5 reference frames, the interpolation module alone would require 1920 x 1080 x 5 x 16 = 158 MB of frame memory just to store the pixel values at each integer position, another 158 x 4 = 632 MB of frame memory for the 1/2-pel values, and another 632 x 4 = 2528 MB of frame memory for the 1/4-pel values. That is, a total of more than 3 GB of memory is needed to store the interpolated values at the subpel positions.
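To make the frame-memory figures above concrete, the following minimal sketch reproduces the arithmetic; it is a back-of-the-envelope calculation assuming, as the figures in the text imply, 16 bytes per stored integer-pel sample across the 5 reference frames, a 4x buffer growth per halving of the pel resolution, and 1 MB = 2^20 bytes.

```python
# Back-of-the-envelope frame-memory estimate for subpel interpolation buffers.
# Assumptions implied by the figures in the text: 1920x1080 luma samples,
# 5 reference frames, 16 bytes per stored sample, 1 MB = 2**20 bytes.
WIDTH, HEIGHT, REF_FRAMES, BYTES_PER_SAMPLE = 1920, 1080, 5, 16
MB = 2 ** 20

integer_mb = WIDTH * HEIGHT * REF_FRAMES * BYTES_PER_SAMPLE / MB
half_mb = integer_mb * 4   # the text budgets 4x the integer buffer for 1/2-pel
quarter_mb = half_mb * 4   # and another 4x for 1/4-pel

print(f"integer-pel: {integer_mb:.0f} MB")   # ~158 MB
print(f"half-pel:    {half_mb:.0f} MB")      # ~632 MB
print(f"quarter-pel: {quarter_mb:.0f} MB")   # ~2528 MB
print(f"total:       {(integer_mb + half_mb + quarter_mb) / 1024:.1f} GB")
```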
Hence, there has been a large amount of effort on speeding up subpel ME and interpolation, including instruction-level optimization, fast search and fast interpolation algorithms. However, in spite of the previous work reviewed in Sec. 1.2, we see good opportunities for further performance improvement.

1.2 Review of Previous Work

Some previous work that is closely related to our research is reviewed in this section.

1.2.1 Lossless Coding

JPEG 2000 and M-JPEG2000 have been chosen by the Digital Cinema Initiative (DCI) [3] as the lossless video coding standard. In H.264/AVC, the 4x4 integer DCT and quantization processes contain shift operations, thus causing rounding errors. To meet the lossless coding requirement for intra coding in H.264, a Differential Pulse Code Modulation (DPCM)-based prediction scheme is first applied and the prediction errors are then fed into the entropy coder in [48], where the transform and quantization processes are skipped. Although this lossless coding scheme is more efficient than the M-JPEG2000 lossless standard, its coding efficiency is still slightly worse than that of the JPEG-LS standard [45].

1.2.2 Fast ME in Transform Domain

One ME approach is to estimate the cross-correlation of two macroblocks in the frequency domain [37]. The frequency spectra of the inputs are normalized to give a phase correlation. However, the correlation performed by DFT-based methods is circular (rather than linear) in nature, so the correlation function can be inaccurate due to the edge effect. To reduce the problem of edge artifacts, Kuglin and Hines [9] proposed a zero-padding method at the expense of higher computational complexity. Another ME approach, studied in [41], is to use a transform whose size is much larger than the maximum displacement under consideration. This approach limits the amount of introduced error, yet it is more suitable for global motion estimation than for block-based local motion estimation. A third approach is to use the complex lapped transform (CLT) in the cross-correlation calculation [49]. This technique is based on the lapped orthogonal transform (LOT), whose basis functions are overlapped and windowed by a smooth function shaped like a half cosine. It introduces fewer block edge artifacts as compared to the LOT in the spatial domain. The latest effort in frequency-domain prediction was proposed for intra prediction in VC-1 [3], where the DC and AC components are separated and predicted from their left and top neighboring frequency components.

1.2.3 Fast Subpel ME

There has been extensive research on complexity reduction for subpel ME. In general, fast sub-pel ME schemes fall into two categories: 1) search complexity reduction and 2) interpolation complexity reduction. Fast search schemes lower the subpel search complexity by reducing the number of sub-pixel search points under the assumption that the subpel error surface is monotonic (or parabolic) [36]. Lee et al. [26] proposed a subpel ME scheme that tests the four most promising half-pixel locations out of the eight, so the complexity is halved; the surrounding eight integer positions are used to decide which half-pixel locations are selected. Yin et al. [36] proposed a similar method that uses fewer integer positions, determined by a thresholding method. The resulting fast search can reduce the complexity by 85% with a PSNR degradation of around 0.1 dB. A center-biased fractional-pel search (CBFPS) algorithm for fractional-pel ME in H.264/AVC was proposed in [50] based on the characteristics of multi-prediction modes and multi-reference frames. However, CBFPS is only applied to smaller blocks, while full fractional ME search is still adopted for larger blocks such as 16x16, 16x8, and 8x16.
As a result, the speedup of CBFPS is somewhat limited. To overcome this issue, an improved sub-pel search method was proposed in [47], which includes a simple and efficient sub-pel skipping method based on statistical analysis and an immediate-stop technique based on the minimum cost. However, the total computation, as compared to the full subpel search method, is decreased by only 30%, because all candidate blocks still need to be interpolated to obtain the fractional-pixel resolution. Thus, the interpolation module remains the major bottleneck in terms of computation and memory access latency.

Fast interpolation schemes reduce the computational complexity by reducing the interpolation complexity instead of the search complexity. By establishing a subpel error surface based on a mathematical model, the subpel ME error at each sub-pixel position can be calculated from the model, thus eliminating the need for interpolation computation [25, 35]. Usually, the model parameters can be found from the errors at the nine integer MV locations, and the optimal sub-pel MV location can be solved for directly or iteratively. However, there is one major drawback with this approach: the accuracy of subpel prediction can be heavily impacted if the model cannot characterize the error surface accurately. There may exist some discrepancy between the actual and modeled error surfaces as a result of the local image texture pattern.

1.3 Contribution of the Research

In this research, we propose a fast subpel MV prediction scheme in Chapter 3, and residual prediction and coding schemes in Chapters 4 and 5, respectively. Major contributions are summarized below.

Contributions in fast subpel ME (Chapter 3)

- We conduct an in-depth analysis of the problem with previous optimal subpel MV resolution estimation algorithms. Basically, they are based on input block texture characteristics; as the input block is subject to motion compensation, quantization and noise factors, they are not accurate in some cases. We then propose an optimal MV resolution estimation scheme that provides a more accurate estimation.

- The proposed optimal MV resolution estimation allows blocks with different characteristics to maximize their R-D gain through a flexible MV resolution while significantly reducing complexity. Based on how well the error surface is conditioned, two different optimal MV prediction schemes are proposed. The rate-distortion performance of the proposed optimal MV prediction is comparable to that of full search, with an average of 90% complexity reduction.

Contributions in granular noise prediction and coding (Chapter 4)

- We conduct a mode distribution analysis on the residual image from the current H.264 codec under the requirements of high-bit-rate coding. The study shows that a large amount of uncompensated fine features, in the form of granular noise, is left in the residual, causing the coding efficiency to degrade significantly. The design of the proposed GNPC system allows it to be easily integrated with any traditional video codec; no modification to the existing codec is required. The decoder can discard a transmitted noise frame if the decoding time budget is insufficient, without affecting the decoding of subsequent content frames, as they are coded independently.

- We propose a granular noise prediction and coding scheme that resembles the film grain extraction process to extract these fine features in the residual.
A frequency-domain based prediction and compensation scheme is further proposed for the granular noise data. By correlating the same frequency bands between different blocks, we maximize the chance that a target GN block and candidate blocks containing similar low-frequency components but different high-frequency components are considered as candidate reference blocks, and vice versa. Prediction between the same frequency bands avoids the complication of sparse matrix multiplication for reconstruction, as required in earlier frequency-domain ME.

- By quantizing the input DCT block before the prediction module, no additional computation is required to perform quantization and inverse quantization during the rate-distortion optimization phase. Hence, complexity can be significantly reduced. Furthermore, the proposed coding scheme is friendlier for rate control purposes: as quantization is done prior to GNPC, the distortion/PSNR of each block/frame can be known ahead of time without going through the entire RDO process.

- Experimental results demonstrate the effectiveness of the proposed frequency-domain based GNPC scheme, with an average bit rate reduction of 10%.

Contributions in multi-order residual coding (Chapter 5)

- To understand the impact of high-bit-rate coding, we study the DCT coefficient distribution and show that, as the video quality requirement increases, the distribution of DCT coefficients approaches a uniform one. This explains the poor performance of traditional image/video codecs in the high-bit-rate region.

- We conduct a correlation analysis on input image frames, which reveals that different types of correlations exist in the image and have a significant impact on coding efficiency. To address this problem, we adopt different coding schemes to remove the different types of correlations in image frames; the result is called the multi-order residual (MOR) prediction and coding system.

- We study the characteristics of the extracted medium- and long-range correlations in the higher-order residuals. Since these MOR data have a small dynamic range with a flat distribution at every scanning position in the block, traditional MCP with RDO in the pixel domain may not be effective. This observation motivates us to adopt a frequency-domain based prediction for MOR data.

- By quantizing the input DCT block before the prediction module, the RDO phase can directly evaluate prediction results without going through the DCT, quantization, inverse DCT and inverse quantization for each search position. Hence, the complexity can be reduced significantly.

- The effectiveness of the proposed MOR coding scheme is demonstrated by experiments, in which it outperforms the state-of-the-art H.264/AVC codec by 30-50% in bit rate saving.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. Related research background on lossless compression, intra coding, film grain noise synthesis, and subpel motion prediction is reviewed in Chapter 2. A direct subpel motion vector prediction scheme is proposed in Chapter 3. The granular noise prediction and coding scheme (GNPC) is presented in Chapter 4. The multi-order residual (MOR) prediction and coding scheme is investigated in Chapter 5. Finally, concluding remarks and future work are given in Chapter 6.

Chapter 2
Research Background

Earlier video coding algorithms focused on low-bit-rate coding due to limited storage space and transmission bandwidth.
However, due to the increased popularity of high definition video and the availability of broadband networking infrastructure, the research focus has gradually shifted from low-bit-rate coding to high-bit-rate (or high fidelity) coding. The latter includes lossless and near-lossless video coding.

High definition (HD) video programs have several unique characteristics worth special attention. First, as compared with standard definition TV (SDTV), more detailed textures are recorded at a much higher fidelity range to create a more involving experience for the audience. Second, HD video has a higher spatial resolution. The well-known resolution is 1920x1080 progressive (or 1080p); the recently released HD-DVD and Blu-ray discs both support 1080p. Even higher resolutions have been considered: for example, Digital Cinema Initiatives (DCI) [1] recommends the 4K by 2K camera.

As compared to previous video coding standards, the latest H.264/AVC standard [46] can provide about 50% bit rate saving for the same video quality. However, H.264/AVC was initially developed for low-bit-rate applications, and most of its experiments were conducted on low-resolution QCIF and CIF sequences. As the spatial resolution increases, H.264/AVC reaches a performance bottleneck. Thus, one of the long-term objectives set by the Joint Video Team (JVT) is to develop a new-generation video coding standard that keeps abreast of the significantly increased demand for high definition and high fidelity video coding.

In this chapter, we briefly review some background in the areas of motion compensated prediction, frequency-domain motion estimation (ME), noise synthesis/coding and fast subpel ME.

2.1 Motion Compensated Prediction

The main module in a compression system is prediction, which exploits the spatial and temporal redundancy between pixels to achieve bit rate saving. A classic prediction scheme can be divided into two distinct phases: 1) modeling and 2) coding. In the modeling phase, the encoder gathers statistics about the input data and builds a probabilistic model. A prediction model is formed to make an inference on the coming sample by assigning a conditional probability distribution to it. The prediction error is then sent to the coding phase with some level of quantization, where either arithmetic coding or Huffman coding is used as the lossless entropy coder. Since the entropy coder is well developed, the most critical design choice is the algorithm in the modeling phase.

The structure of a typical prediction scheme can be stated as follows (a minimal sketch follows the list).

1. A prediction phase is carried out to determine the prediction, \hat{x}_{i+1}, of the input data sample, x_{i+1}, based on a finite set of previously coded data x_1, x_2, ..., x_i.
2. The prediction error, e_{i+1} = x_{i+1} - \hat{x}_{i+1}, is computed.
3. A context decision rule is applied to determine a context c_{i+1} in which x_{i+1} occurs. This context is usually another function of elements in the previously coded data set.
4. A probabilistic model is derived for the prediction error e_{i+1} based on context c_{i+1}. The prediction error is then entropy coded based on the incoming symbol probability distribution.

Since the decoder follows the same set of rules while decoding, the same prediction, context and probability distribution can be reproduced at the decoder, and hence the original input data sample can be reconstructed completely without any error. Thus, the key to an efficient coding scheme lies in the capability of the prediction scheme to minimize the prediction error.
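As a concrete (if simplified) illustration of steps 1-4, the sketch below implements a first-order DPCM predictor with a toy context rule; the predictor, the context function and the emitted symbol stream are hypothetical placeholders for illustration, not the schemes developed later in this thesis.

```python
def encode(samples):
    """Steps 1-4 above with a first-order predictor and a toy context rule."""
    symbols, history = [], []
    for x in samples:
        x_hat = history[-1] if history else 0   # step 1: predict from coded past
        err = x - x_hat                         # step 2: prediction error
        # step 3: context = local activity of the two most recent samples
        busy = len(history) >= 2 and abs(history[-1] - history[-2]) >= 4
        ctx = "busy" if busy else "flat"
        # step 4: in a real coder, (ctx, err) would drive an arithmetic or
        # Huffman coder conditioned on ctx; here we simply emit the pair.
        symbols.append((ctx, err))
        history.append(x)
    return symbols

def decode(symbols):
    history = []
    for ctx, err in symbols:                    # the decoder repeats the same
        x_hat = history[-1] if history else 0   # rules, so reconstruction is
        history.append(x_hat + err)             # exact (lossless)
    return history

data = [10, 12, 13, 40, 41, 39]
assert decode(encode(data)) == data             # lossless round trip
```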
In the following, we present several prediction schemes.

2.1.1 Block-based Motion Compensated Prediction

Block-based motion compensated prediction is mainly used to exploit temporal similarities and hence is widely adopted for inter-frame coding [21]. It was initially designed based on the concept of block matching, as shown in Fig. 2.1. It assumes that there is a very small displacement (d_x, d_y) between consecutive frames. Thus, the frame-to-frame difference FD(x, y) can be approximated mathematically as

FD(x, y) \approx \frac{\partial S(x,y)}{\partial x} d_x + \frac{\partial S(x,y)}{\partial y} d_y. (2.1)

Figure 2.1: Block matching process.

For practical implementation, the block matching process proceeds as follows. It first subdivides an input image into square blocks and finds a displacement vector for each block. Within the given search range, the best "match" is found by minimizing a given error measure [28]. Popular error measures include the sum of squared errors (SSE) in Eq. (2.2), the sum of absolute differences (SAD) in Eq. (2.3) and the sum of absolute transformed differences (SATD) in Eq. (2.4):

SSE(d_x, d_y) = \sum \left[ S_k(x, y) - S_{k-1}(x + d_x, y + d_y) \right]^2, (2.2)

SAD(d_x, d_y) = \sum \left| S_k(x, y) - S_{k-1}(x + d_x, y + d_y) \right|, (2.3)

SATD(d_x, d_y) = \sum \left| T\left( S_k(x, y) - S_{k-1}(x + d_x, y + d_y) \right) \right|. (2.4)

The transform T used in Eq. (2.4) is usually the Hadamard transform, for simplicity. The general understanding is that SATD usually offers the best performance, as it is closer to the true prediction error that is encoded by the entropy coder, while SSE and SAD offer very similar performance, with SAD having the lowest computational requirements. As motion compensated prediction is the most computationally complex module in the encoder, extensive research has been done to speed up the search while maintaining good search quality [31, 11, 33, 14].
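The sketch below illustrates the block matching process and the three error measures on a small search window; it is a plain full-search reference in NumPy (the 4x4 Hadamard matrix stands in for the transform T), not an optimized encoder loop.

```python
import numpy as np

H4 = np.array([[1, 1, 1, 1],
               [1, 1, -1, -1],
               [1, -1, -1, 1],
               [1, -1, 1, -1]])  # 4x4 Hadamard matrix used as the transform T

def sse(d): return np.sum(d.astype(np.int64) ** 2)
def sad(d): return np.sum(np.abs(d.astype(np.int64)))
def satd(d): return np.sum(np.abs(H4 @ d.astype(np.int64) @ H4.T))

def full_search(cur_blk, ref, x0, y0, rng, cost=sad):
    """Full-search block matching of a 4x4 block within +/-rng integer pels."""
    best = (None, np.inf)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            cand = ref[y0 + dy:y0 + dy + 4, x0 + dx:x0 + dx + 4]
            c = cost(cur_blk - cand)
            if c < best[1]:
                best = ((dx, dy), c)
    return best  # best integer MV and its matching cost

# Toy example: the current frame is the reference shifted by MV (1, 2).
ref = np.random.randint(0, 256, (32, 32))
cur = np.roll(ref, shift=(-2, -1), axis=(0, 1))
print(full_search(cur[8:12, 8:12], ref, 8, 8, rng=4, cost=satd))  # ((1, 2), 0)
```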
2.2 Frequency-domain MCP

Another type of MCP is conducted in the frequency domain rather than in the pixel domain. An early scheme estimates the cross-correlation function in the frequency domain [37]. The frequency spectrum of the input can be normalized to give a phase correlation. However, the correlation performed by a DFT-based method corresponds to a circular convolution rather than a linear one, and the correlation function can be affected by the edge effect. To reduce the edge artifact, Kuglin and Hines [9] proposed applying zero padding to the input data sequence, at the cost of higher complexity. Another technique is to use a transform size much larger than the maximum displacement considered [41]. This approach can limit the error size, but it is better suited to global ME than to block-based ME. A third technique is to use the complex lapped transform (CLT) to perform the cross-correlation in the frequency domain [49]. Since the basis functions are overlapped and windowed by a smooth function shaped like a half cosine, it introduces fewer block edge artifacts as compared to the LOT in the spatial domain.

A frequency-domain ME technique was proposed by Chang and Messerschmitt in [12]. As shown in Fig. 2.2, motion search of 8x8 DCT blocks with respect to the previous frame is conducted in the DCT domain. The prediction error is quantized and entropy coded. This allows the inverse DCT (IDCT) to be skipped, since ME is performed in the frequency domain. Since the coding loop of the spatial-domain ME is modified, the memory requirement is reduced as well [27]. However, these schemes are not widely used, for a few reasons. First, most previous video coding algorithms focus on low-bit-rate coding; with the coarse quantization used in low-bit-rate coding, most DCT coefficients in a block are quantized to zero, and there is little room for rate-distortion improvement. Second, the proposed frequency-domain ME treats all frequency components equally: all frequency components are compensated simultaneously with the same spatial offset. It is similar to motion compensation in the spatial domain, except that frequency components are compensated directly rather than pixel values. Thus, if the spatial domain cannot provide enough correlation, it is unlikely that a better prediction can be obtained in the frequency domain.

Figure 2.2: The block diagram of (a) the encoder and (b) the decoder with motion estimation and compensation in the DCT domain.

The latest effort on prediction in the frequency domain was proposed for intra prediction in VC-1, where the DC and AC components are predicted from their left and top neighboring frequency components.

2.2.1 H.264/AVC Intra Prediction

As intra-frame coding does not have the luxury of exploiting temporal correlation, intra prediction is designed to exploit only spatial correlation. H.264 employs a unique line-based intra prediction scheme. The prediction is carried out on a macroblock basis, but a macroblock can be subdivided into smaller partitions such as 8x8 and 4x4 subblocks. For intra 16x16 prediction, one of four prediction modes can be chosen: horizontal, vertical, DC and plane. For intra 8x8 or 4x4 prediction, nine directions can be applied; see Fig. 2.3.

Figure 2.3: Neighboring pixel samples used in (a) Intra 16x16, (b) Intra 8x8 and (c) Intra 4x4 modes.

Take the horizontal prediction mode as an example, where the prediction can be expressed as

r_0 = p_0 - q_0, (2.5)
r_1 = p_1 - q_0, (2.6)
r_2 = p_2 - q_0, (2.7)
r_3 = p_3 - q_0. (2.8)

The residual differences r_0, ..., r_3, predicted from the block boundary samples, are sent to the decoder together with the mode information for correct reconstruction of the block. The only difference between lossless and lossy intra prediction is that the residual difference goes through the DCT and quantization in lossy coding, while these steps are skipped in lossless coding. An improved lossless intra prediction was proposed by Lee et al. [48] that changes the block-based prediction to a sample-based prediction. For example, for the horizontal prediction (mode 1), the residual differences r_0, ..., r_3 are predicted using a sample-by-sample DPCM method. Mathematically, this can be written as

r_0 = p_0 - q_0, (2.9)
r_1 = p_1 - p_0, (2.10)
r_2 = p_2 - p_1, (2.11)
r_3 = p_3 - p_2. (2.12)

The vertical mode (mode 0), mode 3 and mode 4 can be handled in a similar fashion.
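The difference between the two horizontal-mode formulations, Eqs. (2.5)-(2.8) versus Eqs. (2.9)-(2.12), is easy to see in code; the sketch below computes both residual sets for one 4-sample row (the variable names p and q follow the equations, and the sample values are made up for illustration).

```python
def horizontal_residuals_block(p, q0):
    """Block-based horizontal intra prediction, Eqs. (2.5)-(2.8):
    every sample in the row is predicted from the left boundary sample q0."""
    return [pi - q0 for pi in p]

def horizontal_residuals_dpcm(p, q0):
    """Sample-based DPCM prediction of Lee et al., Eqs. (2.9)-(2.12):
    each sample is predicted from its left neighbor."""
    preds = [q0] + list(p[:-1])
    return [pi - pred for pi, pred in zip(p, preds)]

row = [104, 105, 107, 110]   # p0..p3, one row of a 4x4 block
q0 = 100                     # left boundary sample of that row
print(horizontal_residuals_block(row, q0))  # [4, 5, 7, 10]
print(horizontal_residuals_dpcm(row, q0))   # [4, 1, 2, 3] -> smaller residuals
```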
2.3 Film Grain Noise Compression

Film grain noise is related to the physical characteristics of the film and can be perceived as a random pattern following a general distribution statistic [8]. Film grain noise is not prominent in low-resolution video formats such as CIF and SD, but the fine structure becomes more visible once the video resolution reaches HD. Film grain noise is one of the key elements used by artists to relay emotion or cues so as to enhance the visual perception of the audience; sometimes, the film grain size varies from frame to frame to provide different cues in time reference, etc. Here, we consider film grain noise as one type of granular noise.

For lossless video coding, it is desirable to preserve the quality of granular noise without modifying the original intent of the filmmakers. In addition, it is a requirement in the movie industry to preserve granular noise throughout the entire imaging and delivery chain. Due to the random nature of granular noise, it is difficult to find an efficient energy compaction solution. Since film grain noise has relatively large energy in the high frequency band, the block-based encoder in the current video coding standards is not efficient even in the DCT domain. Besides, granular noise also degrades the performance of motion estimation. Thus, research has focused on granular noise removal and synthesis. That is, granular noise is first removed in a pre-processing stage at the encoder, then re-synthesized using a model and added back to the filtered frame at the decoder. Because only the noise model parameters are sent to the decoder instead of the actual noise, the overall bit rate can be reduced significantly. Film grain coding has been considered in H.264/AVC [32].

Several algorithms on texture synthesis have been proposed and can be used for granular noise synthesis [16, 42]. Research on granular noise synthesis can be classified into three areas: 1) sample noise extraction, 2) granular noise databases, and 3) model-based noise synthesis. Gomila and Kobilansky [10] proposed a sample-based approach that extracts a noise sample from a source signal and applies a transformation to it; only one noise block is sent to the decoder in the SEI message. However, it can suffer from visible discontinuity and repetition. The granular noise database method employs a comprehensive granular noise database [22] that contains a pool of pre-defined granular noise values for the film type, exposure, aperture, etc. The film grain selection process follows a random fashion corresponding to the average luminance of the block, and a deblocking filter is used to blend in the granular noise. This method allows the generation of realistic granular noise but requires both the encoder and the decoder to have access to the same granular noise database. The model-based noise synthesis approach extracts granular noise in a pre-processing stage; the extracted noise is analyzed, and a parametric model containing a small set of parameters is estimated and sent to the decoder. It provides an efficient coding method. However, the noise removal operation could potentially remove actual content (e.g., explosion dust) as well.
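The remove-model-resynthesize pipeline described above can be sketched in a few lines; the cross-shaped mean filter and the single-parameter Gaussian grain model below are deliberately crude stand-ins for the cited methods, chosen only to show the data flow, not to reproduce any of them.

```python
import numpy as np

def encoder_side(frame):
    """Separate content from grain; code the filtered frame plus model params."""
    f = frame.astype(float)
    smooth = (f + np.roll(f, 1, 0) + np.roll(f, -1, 0)
                + np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 5.0  # crude denoiser
    grain = f - smooth                                         # extracted grain
    return smooth, {"sigma": float(grain.std())}               # 1-param model

def decoder_side(smooth, params, seed=0):
    """Re-synthesize grain from the model and add it back to the content."""
    rng = np.random.default_rng(seed)
    grain = rng.normal(0.0, params["sigma"], smooth.shape)
    return np.clip(smooth + grain, 0, 255)

frame = np.random.randint(0, 256, (64, 64))
content, model = encoder_side(frame)   # only `model` (a few bytes) is coded
recon = decoder_side(content, model)   # statistically similar grain, not identical
```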
2.4 Sub-pel Motion Estimation

A well-performed motion vector (MV) search is critical to the efficiency of video coding because of its capability to reduce temporal redundancy between frames of a sequence [21, 33]. The full search block matching algorithm (FS-BMA) is often used for performance benchmarking. The best integer MV is obtained under the assumption that all pixels within the same block have the same horizontal and vertical displacements in integer units. However, the best frame-to-frame block displacement of video content may not coincide with the sampling grid. As a result, the integer MV cannot represent the desired displacement well, and a sub-pixel motion compensation scheme is more suitable. The importance of sub-pel accuracy in ME has been widely recognized: an increased subpel MV resolution provides significant improvement in rate-distortion performance for some blocks, as analyzed by Girod in [7].

Figure 2.4: Illustration of different inter prediction block sizes in H.264/AVC.

To implement sub-pel motion search, either the reference frame has to be completely interpolated and stored in memory, or some blocks need to be repeatedly interpolated as subpel refinement is performed. The former requires a large storage space, while the latter significantly increases computational complexity. This problem becomes even more severe if a higher sub-pel resolution (such as 1/8-pel) is used. Therefore, although 1/8-pel motion estimation was proposed, only up to quarter-pel ME is standardized in H.264/AVC.

In addition, subpel ME is performed together with the mode decision algorithm in H.264/AVC. When inter prediction is used, one 16x16 MB can be partitioned into one 16x16 block, two 16x8 or 8x16 blocks, or four 8x8 blocks, while each 8x8 block can be further partitioned into two 8x4 or 4x8 blocks or four 4x4 blocks, as shown in Fig. 2.4. As a result, there are in total 19 modes to encode one 16x16 MB. Moreover, each block whose size is larger than 8x8 can be predicted using different reference frames.

Figure 2.5: Interpolation filter for sub-pel accuracy motion compensation.

For each mode, the MV can be of integer-, half- or quarter-pel resolution. The half-pel value is obtained by applying a one-dimensional 6-tap FIR interpolation filter horizontally (the x-direction) or vertically (the y-direction). The quarter-pel value is obtained by averaging the two nearest half-pel values. For example, the half-pel value of the fractional sample b in Fig. 2.5 is obtained by applying the 6-tap FIR interpolation filter to the pixels E, F, G, H, I and J as

b = (E - 5F + 20G + 20H - 5I + J) / 32. (2.13)

Then, the quarter-pel value of the fractional sample a in Fig. 2.5 is given by

a = (b + G) / 2. (2.14)
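A minimal sketch of Eqs. (2.13) and (2.14) on a 1-D row of integer samples follows; it keeps the [1, -5, 20, 20, -5, 1]/32 kernel but omits the integer rounding and clipping that the H.264 specification applies at each stage.

```python
def half_pel(row, i):
    """Half-pel sample between row[i] and row[i+1], Eq. (2.13).
    row[i-2..i+3] play the roles of E, F, G, H, I, J."""
    E, F, G, H, I, J = row[i - 2:i + 4]
    return (E - 5 * F + 20 * G + 20 * H - 5 * I + J) / 32

def quarter_pel(row, i):
    """Quarter-pel sample between row[i] and the half-pel point, Eq. (2.14)."""
    return (half_pel(row, i) + row[i]) / 2

row = [90, 94, 100, 108, 112, 110, 105, 101]
print(half_pel(row, 3))     # interpolated value at position 3.5
print(quarter_pel(row, 3))  # interpolated value at position 3.25
```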
2.5 Conclusion

In this chapter, we provided a review of the background relevant to the research presented in the following chapters, along with the challenges and requirements of high fidelity video coding. In Chapter 3, we design an accurate subpel MV estimation scheme that can predict the optimal subpel MV position without resorting to overly complex subpel interpolation. In Chapters 4 and 5, our goal is to design a high fidelity video coding system that can encode high definition video in the high-bit-rate range more efficiently.

Chapter 3
Direct Subpel Motion Estimation Techniques

3.1 Introduction

A well-performed motion vector (MV) search is critical to the efficiency of video coding because of its capability to reduce temporal redundancy between frames of a sequence. The importance of subpel accuracy in motion estimation (ME) has been widely recognized [21]. An increased subpel MV resolution provides significant improvement in the rate-distortion (R-D) performance for some blocks, as analyzed by Girod [7]. Traditionally, to implement subpel MV search, either the reference frame is completely interpolated and stored in memory, or some blocks are repeatedly interpolated as the subpel refinement process is performed. The former requires a large storage space, while the latter incurs higher computational complexity. This problem becomes more severe if a higher subpel resolution is adopted. For example, with 1/8-pel MV resolution, the computational complexity and memory requirements involved in motion estimation and interpolation are very high [46, 40]. For this reason, although 1/8-pel motion estimation was proposed for H.264/AVC, only up to quarter-pel ME is standardized.

There has been extensive research on reducing the complexity of subpel motion estimation. In general, fast subpel ME schemes fall into two categories: 1) reducing the search complexity and 2) reducing the interpolation complexity. Fast search schemes lower the subpel search complexity by reducing the number of search points at each subpel position, based on the assumption that the subpel error surface is often concave [36, 50, 47, 26]. However, as these schemes are search-based, each subpel position still needs to be interpolated ahead of time, which can be a major bottleneck in performance speedup. Fast interpolation schemes address this issue by reducing the interpolation complexity. By establishing a subpel error surface with a mathematical model, the subpel ME error at each subpel position can be extrapolated from the model, thus eliminating the need for heavy interpolation computation [25, 24, 13, 35]. However, the performance of these schemes is highly dependent on model accuracy. In addition, fast interpolation is conducted for one resolution at a time: one has to search for the optimal subpel position among the extrapolated subpel ME errors at all subpels of a given resolution before moving to the next subpel resolution.

Although it is common to find the optimal MV position by fitting a local error surface using integer-pel MVs, the characteristics of the error surface have not been thoroughly studied in the past. In this work, we use the condition number of the Hessian matrix of the error surface to characterize its shape in a local region. Specifically, we characterize an error surface by its condition number, defined as the ratio of the largest and smallest eigenvalues of the 2x2 Hessian matrix (or the ratio of the long and short axes of its 2D elliptic contour). To reduce the complexity, we propose an approximate condition number in the implementation. After the error shape analysis, we study direct techniques for the optimal resolution estimation and position prediction of subpel MVs. It is known in the literature [7], [39] that the optimal MV resolution should be adaptive to the characteristics of the underlying video. However, there has been no practical algorithm that estimates the optimal subpel MV resolution on a block-by-block basis. Ribas-Corbera and Neuhoff [39] proposed a texture-based estimation scheme to determine the optimal MV resolution for different blocks. Their method only considers the characteristics of the input block, without leveraging the integer search results. By exploiting the result of the error surface analysis, we propose a block-based subpel MV resolution estimation scheme that allows blocks of different characteristics to maximize their rate-distortion (R-D) gain by choosing the optimal subpel MV resolution adaptively.

Fast subpel MV prediction has been studied before, e.g., by Suh and Jeong [25, 24], Cho et al. [13] and Hill et al. [35]. However, there has been no rigorous study on the accuracy of predicted subpel MVs. We propose two MV prediction schemes, for well-conditioned and ill-conditioned blocks, respectively. All proposed techniques are called direct methods, since no iteration is involved in the optimal subpel MV resolution estimation and position prediction. Experimental results are given to show the excellent R-D performance of the proposed sub-pel MV prediction schemes.
In this work, we first conduct an analysis of the existing subpel MV estimation models to reveal their weaknesses in Sec. 3.2. Then, we propose a block-based optimal subpel MV resolution estimation scheme in Sec. 3.3. Based on how well the error surface is conditioned, two optimal MV prediction schemes are presented in Sec. 3.4. A subpel MV prediction scheme is proposed for ill-conditioned blocks to estimate the optimal subpel position in one step (namely, without refining the resolution by half at a time as done in [25, 24, 13, 35]) in Sec. 3.4.1. This direct prediction scheme is further extended to provide accurate prediction for well-conditioned blocks in Sec. 3.4.2 [51]. Experimental results are provided to demonstrate the effectiveness of the proposed schemes in Sec. 3.5. Finally, concluding remarks and future research directions are given in Sec. 3.6.

Figure 3.1: Illustration of a square window of dimension -1 < \Delta x, \Delta y < 1 centered around the optimal integer-pel MV position indicated by the central empty circle.

3.2 Characterization of Local Error Surface

Subpel ME is usually conducted after the optimal integer MV is obtained through integer motion estimation. It is typically assumed that the optimal subpel MV resides within a square window of dimension -1 \le \Delta x, \Delta y \le 1 centered around the optimal integer position, as shown in Fig. 3.1. Then, we can define a subpel motion estimation error surface over this window using a common error measure known as the sum of squared differences (SSD):

E(\Delta x, \Delta y) = \sum \left[ s(x, y) - c(x_0 + \Delta x, y_0 + \Delta y) \right]^2, (3.1)

where (x_0, y_0) is the location of the optimal integer pel, s(x, y) is the target block and c(x_0 + \Delta x, y_0 + \Delta y) is a reference block with -1 < \Delta x, \Delta y < 1.

3.2.1 Problem with Traditional Surface Modeling

Several surface models have been used to approximate the SSD error surface. Examples include the 9-term, 6-term and 5-term error models, denoted by E_9, E_6 and E_5, respectively. Mathematically, they can be written as [25], [34]:

E_9(\Delta x, \Delta y) = a \Delta x^2 \Delta y^2 + b \Delta x^2 \Delta y + c \Delta x \Delta y^2 + d \Delta x \Delta y + e \Delta x^2 + f \Delta x + g \Delta y^2 + h \Delta y + i, (3.2)

E_6(\Delta x, \Delta y) = a \Delta x^2 + b \Delta x \Delta y + c \Delta y^2 + d \Delta x + e \Delta y + f, (3.3)

E_5(\Delta x, \Delta y) = a \Delta x^2 + b \Delta y^2 + c \Delta x + d \Delta y + e. (3.4)

The coefficients a, b, ... above are model parameters, calculated from the measured prediction errors at the specified nine integer positions [25], [34]. Note that a contour of surface model E_6 corresponds to a rotated 2D ellipse, while a contour of surface model E_5 corresponds to a simple ellipse whose axes align with the x- and y-axes. Simply speaking, the ratio of the long and short axes of these ellipses defines the condition number of an error surface; the ratio is small (or large) for a well-conditioned (or ill-conditioned) surface.

Usually, one estimates the model parameters (i.e., the coefficients in these models) from the errors at the nine integer MV locations given in Fig. 3.1 and solves for the optimal subpel MV location directly (for models E_5 and E_6) or iteratively (for model E_9). However, depending on the local image texture pattern, there may sometimes exist a great discrepancy between the actual error surface and the approximations provided by models E_9 and E_6. We show the 3D plot and the 2D contour plot of the actual error surface, and the 2D contour plots of models E_9 and E_6, for well-conditioned and ill-conditioned error surfaces in Figs. 3.2 and 3.3, respectively.
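To make the fitting step concrete, the sketch below solves for the six E_6 coefficients from the nine integer-pel errors by least squares and then locates the model's minimum in closed form; this is a generic illustration of Eq. (3.3), not the exact solver of [25] or [34], and the sample error grid is synthetic.

```python
import numpy as np

def fit_e6(errors_3x3):
    """Least-squares fit of E6 (Eq. (3.3)) to the nine integer-pel errors.
    errors_3x3[dy+1][dx+1] holds E at offset (dx, dy), dx, dy in {-1, 0, 1}."""
    rows, rhs = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            rows.append([dx * dx, dx * dy, dy * dy, dx, dy, 1.0])
            rhs.append(errors_3x3[dy + 1][dx + 1])
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return coeffs  # a, b, c, d, e, f

def e6_minimum(coeffs):
    """Stationary point of E6: grad = 0 gives [2a b; b 2c] [x y]' = [-d -e]'."""
    a, b, c, d, e, _ = coeffs
    hessian = np.array([[2 * a, b], [b, 2 * c]])
    return np.linalg.solve(hessian, [-d, -e])  # (dx, dy) of the model minimum

errs = [[9.0, 5.0, 6.0],
        [4.0, 1.0, 3.0],
        [7.0, 4.0, 8.0]]         # synthetic SSD values on the 3x3 integer grid
print(e6_minimum(fit_e6(errs)))  # subpel offset; can leave [-1,1]^2 when the
                                 # surface is ill-conditioned, as Fig. 3.3(c) shows
```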
One problem with previous model-based fast interpolation schemes [25, 35] is that the minimum of the subpel error surface predicted by the models may fall outside the defined square window, as shown in Fig. 3.3(c). More recently, methods were proposed to reduce the 2-D model to 1-D models [24, 13], where the minimum search along the X and Y axes is done independently. To overcome the problem that the minimum of the subpel error surface may fall outside the defined square window, Cho et al. [13] added a step of selective interpolation in the hope that the error surface would be well behaved in a smaller window of size -0.5 \le \Delta x, \Delta y \le 0.5. Their scheme demands additional MV error computation at eight new half-pel locations. Although they can find the optimal half-pel MV among these evaluated locations, the resulting subpel MV may still deviate from the true one significantly, as shown in Fig. 3.3(c). Besides, there is no easy way to determine the behavior of the error surface at a finer resolution. Hill et al. [35] derived a surface model from E_6 for the quarter-pel MV resolution and adopted a fallback scheme that performs the actual interpolation if the quality of the MV estimation is poor. However, no statistical analysis of the relationship between the model quality and the accuracy of subpel MV resolution prediction has been conducted before.

Figure 3.2: Illustration of error surfaces for a well-conditioned block: (a) the 3D plot of the actual error surface; and the 2D contour plots of (b) the actual error surface, (c) error surface model E_9, and (d) error surface model E_6.

Figure 3.3: Illustration of error surfaces for an ill-conditioned block: (a) the 3D plot of the actual error surface; and the 2D contour plots of (b) the actual error surface, (c) error surface model E_9, and (d) error surface model E_6.

Following previous work, we assume that the local error surface is a convex function (i.e., a uni-modal error surface). We claim that prediction accuracy actually depends on whether the error surface has a narrow valley with a certain orientation at its bottom. This is visually apparent by comparing Figs. 3.2(b) and 3.3(b). Mathematically, the local error surface can be characterized by the second-order derivatives at the center pixel, known as the Hessian matrix [6] at that point. The eigenvalues of the Hessian matrix are called the principal curvatures. The condition number is the ratio of the largest and smallest eigenvalues of the Hessian matrix (or, geometrically, the ratio of the long and short axes of its 2D elliptic contour). For a well-conditioned block, the error surface is circularly symmetric; as the condition number increases, it gradually becomes ill-conditioned. To solve for the coefficients of error surface models E_5, E_6 and E_9, we need to solve a linear system of equations via matrix inversion. If the matrix is ill-conditioned, one cannot obtain the model coefficients robustly. This explains why the model-based approach fails to predict the optimal subpel motion location accurately.

3.2.2 Condition Number Estimation

In this subsection, we focus on estimating the condition number of the error surface in a local window consisting of 3x3 pixels, based on the nine sampled points. Here, we consider four slices of the error function, namely the intersections of the error surface with four planes:

- the horizontal (or 0-degree) slice, with \Delta y = 0 as the intersecting plane;
Here, we consider four slices of the error function, namely, the intersections of the error surface with four planes: the horizontal (or 0-degree) slice with $\Delta y = 0$ as the intersecting plane; the vertical (or 90-degree) slice with $\Delta x = 0$ as the intersecting plane; the 45-degree slice with $\Delta x = \Delta y$ as the intersecting plane; and the 135-degree slice with $\Delta x = -\Delta y$ as the intersecting plane. The four intersection curves are shown in Figs. 3.4(a) and (b). Fig. 3.4(a) corresponds to a well-conditioned block, where the four curves have similar curvatures. Fig. 3.4(b) corresponds to an ill-conditioned block, where the curvatures spread over a wider range. Based on the above observation, we can derive a simple test to check how well a block is conditioned. That is, we define the following four parameters:

$$\Delta_0 = |e(1, 0) + e(-1, 0) - 2e(0, 0)|, \quad (3.5)$$

$$\Delta_{45} = |e(1, 1) + e(-1, -1) - 2e(0, 0)|, \quad (3.6)$$

$$\Delta_{90} = |e(0, -1) + e(0, 1) - 2e(0, 0)|, \quad (3.7)$$

$$\Delta_{135} = |e(-1, 1) + e(1, -1) - 2e(0, 0)|, \quad (3.8)$$

where $e(\Delta x, \Delta y)$ is the measured integer-pel error at $(\Delta x, \Delta y)$. They correspond to the 1D discrete Laplacians along the 0-, 45-, 90- and 135-degree directions, respectively. Generally speaking, a larger (or smaller) value of $\Delta$ implies a more rapidly-changing (or slowly-changing) error surface along the corresponding direction. The maximum and minimum of these four parameters are denoted by $\Delta_{max}$ and $\Delta_{min}$, respectively. Then, we can compute an approximate condition number of the Hessian matrix in this region via

$$C = \frac{\Delta_{max}}{\Delta_{min}}. \quad (3.9)$$

Figure 3.4: The error curves passing through the origin along the 0-, 45-, 90- and 135-degree directions for (a) a well-conditioned block, and (b) an ill-conditioned block.

Figure 3.5: (a) Block examples that are likely to have well-conditioned error surfaces; (b) block examples that are likely to have ill-conditioned error surfaces. Blocks are taken from sample sequences of Foreman CIF, Vintage Car HD and Harbor HD.

In Fig. 3.5, we show some representative regions from three test sequences that yield well-conditioned or ill-conditioned error surfaces after the sub-pixel motion estimation process. As shown in the Foreman, Vintage Car and Harbor examples in Fig. 3.5, regions with certain symmetry are likely to yield well-conditioned cases, while regions with angled textures are likely to yield ill-conditioned cases. However, the characteristics of the local error surface are ultimately determined by the temporal relationship of two adjacent frames. In other words, we cannot make a robust decision based on the texture pattern of a single frame alone. This also explains why previous work [39], which determines the subpel MV accuracy based on the input block texture only, does not yield satisfactory results.

3.2.3 Deviation from Flatness

The optimal subpel MV resolution is related to the curvature of the error surface. For a flat error surface, the cost of the increased MV resolution tends to impact the overall R-D performance negatively. On the other hand, for a steep error surface, a finer subpel resolution is advantageous as it results in an additional R-D gain. To capture the curvature information of the error surface, we may consider the following simple measure:

$$D_f = \sqrt{\Delta_{max}^2 + \Delta_{min}^2}. \quad (3.10)$$

For simpler computation, we can approximate this parameter as

$$D_f \approx \Delta_{max} + \Delta_{min}. \quad (3.11)$$

In this work, we adopt the measure defined in Eq. (3.11) and call it the deviation from flatness for its greater simplicity.
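Assuming the same 3x3 sample convention as in the sketch above, the two proposed surface descriptors reduce to a few arithmetic operations:

```python
def surface_statistics(e):
    """Directional 1D Laplacians of Eqs. (3.5)-(3.8), the approximate
    condition number C of Eq. (3.9) and the deviation from flatness D_f
    of Eq. (3.11), computed from the 3x3 integer-pel error samples with
    e[dy + 1][dx + 1] = e(dx, dy)."""
    center = e[1][1]
    d0   = abs(e[1][2] + e[1][0] - 2 * center)  # 0 deg:   e(1,0)  + e(-1,0)
    d45  = abs(e[2][2] + e[0][0] - 2 * center)  # 45 deg:  e(1,1)  + e(-1,-1)
    d90  = abs(e[0][1] + e[2][1] - 2 * center)  # 90 deg:  e(0,-1) + e(0,1)
    d135 = abs(e[0][2] + e[2][0] - 2 * center)  # 135 deg: e(1,-1) + e(-1,1)
    d_max = max(d0, d45, d90, d135)
    d_min = min(d0, d45, d90, d135)
    cond = d_max / d_min if d_min > 0 else float("inf")  # Eq. (3.9)
    flatness = d_max + d_min                             # Eq. (3.11)
    return cond, flatness
```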
The optimal subpel MV position prediction is related to the bottom shape of the error function. We will elaborate on this in the following section.

3.3 Optimal Subpel MV Resolution Estimation

Girod [7] pointed out that the optimal subpel MV resolution has a critical impact on coding efficiency and that not all blocks need the same MV resolution to obtain the best coding performance. Some blocks can benefit from a higher MV resolution while others cannot. Thus, it is desirable to have an adaptive MV resolution for optimal coding efficiency rather than a fixed MV resolution. To study the problem of optimal subpel MV resolution, Girod [7] provided an analytical framework that estimates the difference-frame energy with an optimal subpel MV resolution, expressed as a function of the probability distribution of MV accuracy, the Fourier transform of the frame and the power spectral density of inter-frame noise. This framework is, however, not easy to implement in practice. Ribas-Corbera and Neuhoff [39] extended Girod's framework and developed a scheme to estimate the optimal MV resolution for a block using its texture. However, their method is still too complex for actual implementation and not accurate enough for prediction on a block-by-block basis.

In this section, we propose an optimal subpel MV resolution estimation scheme. It is related to the characterization of the subpel error surface with parameter $D_f$ as given in Eq. (3.11). This method is not only easy to compute on a block-by-block basis but also effective in enhancing the R-D performance. In Fig. 3.6(a), we show the histogram of $D_f$ for a collection of video sequences, while Figs. 3.6(b)-(e) depict the probability for the optimal subpel MV resolution to be of integer-, half-, quarter- or eighth-pel accuracy as a function of $D_f$ with quantization parameter QP = 20. In computing these probabilities, the optimal MV resolution selection process is similar to the H.264/AVC Rate-Distortion Optimization (RDO) procedure [44] using the following Lagrangian cost function:

$$J(s, c \mid \lambda_{motion}) = SSD(s, c) + \lambda_{motion} \cdot R(s, c), \quad (3.12)$$

where $R(s, c)$ is the number of bits associated with the coding of the prediction error and the MV, $s$ is the source block texture, $c$ is the reference block texture, and $\lambda_{motion}$ is the Lagrangian multiplier, which is set to $\sqrt{0.85 \cdot 2^{QP/3}}$. Under this RDO framework, the distortion model does not consider the quantization effect on the prediction error since the optimal MV resolution selection is performed with a given QP value.

Figure 3.6: (a) The histogram of $D_f$ at QP=20 and (b)-(e) the probability distributions for the optimal MV resolution at integer-pel, 1/2-pel, 1/4-pel and 1/8-pel for a set of test video sequences.

For a very flat error surface whose $D_f$ value is extremely small, additional subpel MV accuracy does not bring a sufficient performance gain to justify the rate overhead. Thus, the probability of selecting the integer pel as the optimal MV resolution is nearly 100%. However, as $D_f$ increases, the probability of selecting the 1/2-pel resolution as the optimal MV resolution becomes dominant. The quarter-pel MV resolution is important when $D_f$ exceeds 25,000. The switch between quarter- and eighth-pel resolution is more gradual, as shown in Figs. 3.6(d) and (e). Actually, the probabilities of selecting quarter-pel or eighth-pel are similar over a range of $D_f$ values. When the error surface is very steep (corresponding to a large $D_f$ value), the chance of selecting the eighth-pel MV resolution becomes very high.
Since there are few blocks with a $D_f$ value over 150,000, as shown in Fig. 3.6(a), there is no advantage in going to a finer MV resolution such as 1/16-pel. Although QP = 20 is chosen in Fig. 3.6, the same observation holds for different QP values. In other words, only the $D_f$ value is critical to the subpel MV resolution estimation. This conclusion is much simpler than those given in [7] and [39]. Based on the above discussion, we have a simple estimation scheme for the optimal subpel MV resolution of a local block $b$, denoted by $\rho_b(MV)$, as

$$\rho_{i,j}(MV) = \begin{cases} 1 & \text{if } D_f(i,j) \le \tau_1, \\ 1/2 & \text{if } \tau_1 < D_f(i,j) \le \tau_{1/2}, \\ 1/4 & \text{if } \tau_{1/2} < D_f(i,j) \le \tau_{1/4}, \\ 1/8 & \text{if } \tau_{1/4} < D_f(i,j), \end{cases} \quad (3.13)$$

where $D_f(i,j)$ is the deviation-from-flatness measure at the pixel $(i,j)$ that has the smallest integer MV error (i.e., the central pixel in Fig. 3.1), and $\tau_i$, $i = 1, 1/2, 1/4$, are proper threshold values. Since we do not observe any advantage in going to a subpel resolution finer than 1/8, we choose 1/8 as the finest resolution above. If $D_f \le \tau_1$, only the best integer position is coded. No further subpel MV prediction is needed since the error surface in this block is too flat for a subpel MV to improve the coding gain. Generally speaking, thresholds $\tau_1$, $\tau_{1/2}$ and $\tau_{1/4}$ can be selected via statistical analysis. Sometimes, we may set them to higher (or lower) values to trade quality for lower (or higher) computational complexity.

3.4 Direct Subpel MV Position Prediction

In this section, we propose two direct methods for subpel MV position prediction depending on the condition number of a local block. The ill-conditioned and well-conditioned blocks are considered in Secs. 3.4.1 and 3.4.2, respectively. First, we show the distribution of the condition number of all blocks from a set of test video sequences in Fig. 3.7(a). They include four CIF sequences (i.e., Container, Football, Coastguard and Tempete), one HD sequence at 1280x720 resolution (i.e., Sheriff) and one HD sequence at 1920x1080 resolution (i.e., Station2). We did not use the same sequences as in Sec. 3.5 in order to show the robustness of the training process in determining the well- and ill-conditioned cases. The performance of the $E_9$ model by Suh and Jeong [25], called the SJ $E_9$ model in short, is evaluated in Fig. 3.7(b), where the average Euclidean distance between the predicted and the actual subpel positions is plotted as a function of the condition number. This prediction error distance can be written mathematically as

$$\varepsilon_s = \sqrt{(x_a - x_s)^2 + (y_a - y_s)^2}, \quad (3.14)$$

where $(x_s, y_s)$ and $(x_a, y_a)$ are the predicted and the actual subpel MV positions, respectively.

Figure 3.7: (a) The histogram of condition numbers and (b) the prediction error distance $\varepsilon_s$ as a function of the condition number using the SJ $E_9$ model.

We see that, for a well-conditioned block with $C \le 4$, the prediction error distance $\varepsilon_s$ generated by the SJ $E_9$ model is small enough for accurate quarter-pel MV resolution. However, as the condition number increases, the average prediction error becomes larger. The average prediction error goes beyond the quarter-pel resolution for blocks with $C > 4$, and exceeds the half-pel resolution for blocks with $C > 10$. As the condition number continues to increase, the prediction error grows to the maximum allowed by any subpel search method (i.e., the integer-pel resolution), and the SJ method fails completely.
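The per-block resolution rule of Eq. (3.13) is simple to state in code. In the sketch below, the threshold defaults are illustrative placeholders only; the actual thresholds are obtained by the offline statistical analysis mentioned above and are not reproduced here:

```python
def optimal_mv_resolution(flatness, tau_1=2.0e4, tau_half=2.5e4,
                          tau_quarter=1.5e5):
    """Resolution decision of Eq. (3.13) from the deviation-from-flatness
    measure of Eq. (3.11). The tau_* defaults are placeholders, not the
    trained values used in the experiments."""
    if flatness <= tau_1:
        return 1          # surface too flat: integer-pel MV suffices
    if flatness <= tau_half:
        return 1 / 2
    if flatness <= tau_quarter:
        return 1 / 4
    return 1 / 8          # very steep surface: finest supported resolution
```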
For a typical video stream, a large percentage of blocks falls into the well-conditioned group, as shown in Fig. 3.7(a). Generally speaking, about 60% of blocks are in the well-conditioned group. For the remaining blocks, if only the half-pel resolution is needed, the SJ $E_9$ model can cover an additional 20% of blocks. For a higher subpel resolution such as 1/8-pel, the SJ $E_9$ model only applies to a very small percentage of blocks, i.e., only blocks with $C = 1$.

3.4.1 Ill-Conditioned Blocks

In this subsection, we focus on blocks whose error surface is ill-conditioned and propose a direct subpel MV position prediction scheme. The basic idea is to decompose a 2D optimization problem into two 1D optimization problems. Without loss of generality, we assume $\Delta_{90} > \Delta_0$ in the following discussion. Under this condition, the error surface changes more rapidly along the y-axis than along the x-axis. Our algorithm consists of the following two steps, illustrated in Figs. 3.8(a) and (b), respectively.

Step 1: For a fixed value of $\Delta x$ ($= -1, 0, 1$), we use the three values $e(\Delta x, 1)$, $e(\Delta x, 0)$ and $e(\Delta x, -1)$ to fit a quadratic function. Mathematically, the three fitted functions are of the form

$$e(-1, \Delta y) = A_{-1}\Delta y^2 + B_{-1}\Delta y + C_{-1}, \quad (3.15)$$

$$e(0, \Delta y) = A_0\Delta y^2 + B_0\Delta y + C_0, \quad (3.16)$$

$$e(1, \Delta y) = A_1\Delta y^2 + B_1\Delta y + C_1. \quad (3.17)$$

The minima of Eqs. (3.15)-(3.17), denoted by $(-1, y_{-1})$, $(0, y_0)$ and $(1, y_1)$, can be determined analytically as

$$y_{-1} = \frac{1}{2} \cdot \frac{e(-1,-1) - e(-1,1)}{e(-1,1) + e(-1,-1) - 2e(-1,0)}, \quad (3.18)$$

$$y_0 = \frac{1}{2} \cdot \frac{e(0,-1) - e(0,1)}{e(0,1) + e(0,-1) - 2e(0,0)}, \quad (3.19)$$

$$y_1 = \frac{1}{2} \cdot \frac{e(1,-1) - e(1,1)}{e(1,1) + e(1,-1) - 2e(1,0)}. \quad (3.20)$$

Based on Eqs. (3.18)-(3.20) and (3.15)-(3.17), we can compute the minimal error value at each of the three minima. This process is shown in Fig. 3.8(a).

Step 2: When the approximate condition number $\Delta_{max}/\Delta_{min}$ is large, it is observed that the three minima, $(-1, y_{-1})$, $(0, y_0)$ and $(1, y_1)$, tend to have a co-linear relationship, as illustrated in Fig. 3.8(b). Then, we can examine the plane that passes through these three points and fit another quadratic function

$$e_f(t) = A_f t^2 + B_f t + C_f, \quad (3.21)$$

which goes through their corresponding error values. Coefficients $A_f$, $B_f$ and $C_f$ can be solved for, and the minimum of Eq. (3.21) gives the optimal MV position at $(x_{opt}, y_{opt})$. (A code sketch of these two steps is given at the end of Sec. 3.4.3.)

Although the predicted optimal MV position can take any real value in $x_{opt}$ and $y_{opt}$, their values should be quantized to the optimal MV resolution, which is estimated using the technique presented in Sec. 3.3. There exist four possible candidates around $(x_{opt}, y_{opt})$ at a supported subpel resolution. A simple quantization scheme is to select the nearest neighbor among these four.

3.4.2 Well-Conditioned Blocks

We extend the direct subpel MV position prediction to blocks with a well-conditioned error surface in this subsection. One distinct error surface characteristic associated with ill-conditioned blocks is that the surface has a narrow valley with a certain orientation. Hence, for direct subpel MV prediction, there exists only one axis that can produce two or three minima to form $e_f(t)$, as shown in Fig. 3.8. On the other hand, for well-conditioned blocks, Step 1 should be repeated for both the x- and y-axes. Thus, we modify the process as follows.

Step 1-x: It is the same as Step 1 in Sec. 3.4.1.
Figure 3.8: Illustration of the optimal subpel MV position prediction for ill-conditioned blocks: (a) Step 1: finding the minima in three vertical planes using quadratic curve fitting and (b) Step 2: connecting the three minima found in Step 1 and finding the optimal subpel MV position with another quadratic curve fitting.

Figure 3.9: Illustration of the optimal subpel MV position prediction for a well-conditioned block.

Step 1-y: For a given $\Delta y$ ($= -1, 0, 1$), we use $e(-1, \Delta y)$, $e(0, \Delta y)$ and $e(1, \Delta y)$ to fit a quadratic function. Mathematically, the three fitted functions are of the form

$$e(\Delta x, -1) = D_{-1}\Delta x^2 + E_{-1}\Delta x + F_{-1}, \quad (3.22)$$

$$e(\Delta x, 0) = D_0\Delta x^2 + E_0\Delta x + F_0, \quad (3.23)$$

$$e(\Delta x, 1) = D_1\Delta x^2 + E_1\Delta x + F_1. \quad (3.24)$$

The minima of Eqs. (3.22)-(3.24), denoted by $(x_{-1}, -1)$, $(x_0, 0)$ and $(x_1, 1)$, can be determined analytically as

$$x_{-1} = \frac{1}{2} \cdot \frac{e(-1,-1) - e(1,-1)}{e(1,-1) + e(-1,-1) - 2e(0,-1)}, \quad (3.25)$$

$$x_0 = \frac{1}{2} \cdot \frac{e(-1,0) - e(1,0)}{e(1,0) + e(-1,0) - 2e(0,0)}, \quad (3.26)$$

$$x_1 = \frac{1}{2} \cdot \frac{e(-1,1) - e(1,1)}{e(1,1) + e(-1,1) - 2e(0,1)}. \quad (3.27)$$

Based on Eqs. (3.25)-(3.27), we can compute the corresponding minimal error values using Eqs. (3.22)-(3.24).

Step 2-x: It follows the same process as Step 2 in Sec. 3.4.1, which produces a vertically-oriented optimal MV position at $(x_v, y_v)$.

Step 2-y: We examine the plane that passes through the three points obtained in Step 1-y and fit them with another quadratic function

$$e_h(t) = D_h t^2 + E_h t + F_h, \quad (3.28)$$

which goes through their corresponding error values. Coefficients $D_h$, $E_h$ and $F_h$ can be solved for, and the minimum of Eq. (3.28) gives the horizontally-oriented optimal MV position at $(x_h, y_h)$. We then divide the window spanning $(-1,-1)$ to $(1,1)$ into four sectors: north-east (NE), north-west (NW), south-east (SE) and south-west (SW). Then, $(x_v, y_v)$ obtained in Step 2-x identifies the east or west sector of the actual optimal MV horizontally, and $(x_h, y_h)$ identifies the south or north sector of the actual optimal MV vertically. This process is shown in Figs. 3.9(c) and (d).

Step 3: Based on the coordinates of $(x_v, y_v)$, the two closest integer positions along the vertical direction can be selected to form one line. The same process can be done in the horizontal direction based on the coordinates of $(x_h, y_h)$ to form another line. These two lines are denoted by

$$\ell_v = m x + n, \quad (3.29)$$

$$\ell_h = p y + q. \quad (3.30)$$

Finally, the optimal subpel MV position is the intersection point of these two lines, as illustrated in Fig. 3.9(e).

3.4.3 Performance Evaluation

The performance of the proposed subpel MV position prediction method is shown in Fig. 3.10(b). As compared to the SJ $E_9$ method in Fig. 3.7(b), we see that more than 90% of the blocks achieve an average prediction error smaller than the 1/4-pel resolution using the proposed method.

Figure 3.10: (a) The histogram of the condition number, and (b) the prediction error distance $\varepsilon_s$ as a function of the condition number using the proposed prediction method as described in Secs. 3.4.1 and 3.4.2.

In addition, we see from the error histogram that, as the condition number exceeds a certain value, the prediction error becomes larger than the required subpel MV resolution. Thus, we modify the optimal subpel MV estimation scheme given in Eq. (3.13) as follows:

$$\rho(MV) = \begin{cases} 1 & \text{if } D_f \le \tau_1, \\ 1/2 & \text{if } \tau_1 < D_f \le \tau_{1/2}, \\ 1/4 & \text{if } \tau_{1/2} < D_f \le \tau_{1/4}, \\ 1/8 & \text{if } \tau_{1/4} < D_f, \\ \text{disable prediction} & \text{if } C > \tau_C, \end{cases} \quad (3.31)$$

where the threshold value $\tau_C$ can be obtained statistically.
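For reference, the two steps of Sec. 3.4.1 amount to four closed-form parabola fits. The sketch below follows the $\Delta_{90} > \Delta_0$ orientation; the final interpolation along the valley line is one plausible reading of the co-linearity argument, not the exact geometric construction of Fig. 3.8:

```python
def parabola_vertex(e_m1, e_0, e_p1):
    """Vertex of the parabola through (-1, e_m1), (0, e_0), (1, e_p1):
    the closed form behind Eqs. (3.18)-(3.20). Assumes a strictly convex
    fit (nonzero denominator). Returns (t*, fitted error at t*)."""
    denom = e_p1 + e_m1 - 2.0 * e_0
    t = 0.5 * (e_m1 - e_p1) / denom
    return t, e_0 - 0.125 * (e_p1 - e_m1) ** 2 / denom

def predict_ill_conditioned(e):
    """Two-step direct prediction of Sec. 3.4.1 for the case
    Delta_90 > Delta_0 (valley steeper along y), again with
    e[dy + 1][dx + 1] = e(dx, dy)."""
    # Step 1: vertex of the quadratic fit in each vertical plane dx = -1, 0, 1.
    ys, fs = [], []
    for dx in (-1, 0, 1):
        y_min, f_min = parabola_vertex(e[0][dx + 1], e[1][dx + 1], e[2][dx + 1])
        ys.append(y_min)
        fs.append(f_min)
    # Step 2: the three minima are roughly co-linear along the valley; fit one
    # more parabola e_f(t) of Eq. (3.21) over t = dx and take its vertex.
    x_opt, _ = parabola_vertex(fs[0], fs[1], fs[2])
    # Recover y_opt on the least-squares line through the three Step-1 minima.
    slope = 0.5 * (ys[2] - ys[0])
    intercept = (ys[0] + ys[1] + ys[2]) / 3.0
    y_opt = slope * x_opt + intercept
    return x_opt, y_opt  # quantize to the resolution chosen in Sec. 3.3
```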
3.5 Experimental Results

As the existing H.264/AVC reference codec does not support subpel MV resolutions higher than quarter-pel, we modified reference codec JM12.1 [2] slightly to accommodate the 1/8-pel MV resolution for the optimal subpel MV search in this section. A total of eight video sequences were tested: four of CIF resolution (i.e., Foreman, Mobile, Stefan and Flower garden @352x288) and four of HD resolution (i.e., City corridor @1280x720, Night @1280x720, Blue sky @1920x1080 and Vintage car @1920x1080). We adopted a window of size 32x32 for the full integer MV search with one reference frame. Rate-distortion (R-D) optimization was employed in the MV search process. Each GOP consisted of 15 frames. CAVLC was chosen as the entropy coder. The thresholds in Eq. (3.13) for the optimal MV resolution selection were set experimentally. The experiments were run on a MacBook with an Intel Core 2 Duo at 2.2 GHz.

We implemented a state-of-the-art fast subpel MV estimation algorithm proposed in [24], which is an extension of the $E_9$ model in [25], for performance benchmarking. It is called the SCJ method, and we use SCJ 1/4pel and SCJ 1/8pel to denote the results of its application to the quarter-pel and eighth-pel cases, respectively. We conducted experiments with the following two test settings.

Test Setting 1: We compare the R-D performance of the H.264 full quarter-pel MV search (denoted by H.264 1/4pel), the proposed subpel MV position prediction scheme using the same quarter-pel MV resolution without optimal MV resolution estimation (denoted by ZDK-I), and the SCJ 1/4pel method.

Test Setting 2: We compare the performance of the H.264 full subpel MV search with the eighth-pel MV resolution (denoted by H.264 1/8pel), the proposed MV prediction scheme with the optimal MV resolution estimation method enabled (denoted by ZDK-II), and the SCJ 1/8pel method.

First, we examine the complexity saving of the direct subpel MV position prediction. Here, the complexity saving factor is defined as

$$S = \left( 1 - \frac{C_{proposed}}{C_{full}} \right) \times 100\ (\%), \quad (3.32)$$

where $C_{proposed}$ and $C_{full}$ denote the computational time required for the proposed subpel MV prediction and the full subpel search processes, respectively. For the latter, this includes the time required to interpolate the reference frame.

Table 3.1: The complexity saving S(%) of the proposed ZDK-I, ZDK-II and the SCJ methods with respect to H.264 full search.

Resolution   Sequence        ZDK-I   ZDK-II   SCJ 1/4pel   SCJ 1/8pel
352x288      Foreman         82.28   99.35    83.34        84.54
             Mobile          87.85   99.78    84.82        81.46
             Stefan          86.54   99.61    82.48        85.76
             Flower garden   87.91   99.50    86.35        87.25
             Average         86.15   99.56    84.25        84.75
1280x720     City corridor   84.26   99.46    82.11        84.38
             Night           83.24   99.58    83.49        85.59
1920x1080    Blue sky        85.16   99.81    80.32        82.96
             Vintage car     89.24   99.67    82.77        82.05
             Average         85.48   99.63    82.17        83.75

Figure 3.11: The complexity saving as a function of the coding bit rate with (a) ZDK-I and (b) ZDK-II for four sample sequences.

The complexity saving factors for ZDK-I, ZDK-II and the SCJ method are shown in Table 3.1. We see that ZDK-I, ZDK-II and the SCJ method all offer a significant amount of complexity saving. With the help of the optimal subpel MV resolution estimation, ZDK-II provides an additional 13-15% complexity saving. Furthermore, the complexity savings for ZDK-I and ZDK-II are shown as a function of the bit rate in Fig. 3.11.
We depict the results of the Foreman and Mobile CIF sequences in a lower bit rate range and those of Blue sky and Vintage car in a higher bit rate range. Generally speaking, the complexity saving is stable over a range of bit rates.

The R-D performance of the various subpel MV search schemes is compared in Figs. 3.12-3.15. We use two subpel MV search schemes for performance benchmarking: 1) the integer-pel MV and 2) the subpel MV with a fixed resolution (of quarter-pel or eighth-pel). The results of the eight test sequences with ZDK-I are shown in Figs. 3.12 and 3.13. For performance evaluation, we provide the rate reduction comparison in Tables 3.2 and 3.3 based on the method described in [18]. We see that the proposed ZDK-I has a very small rate increase (around 5%) as compared with the full H.264 1/4pel search. In contrast, the SCJ 1/4pel method has a larger rate increase (around 15 to 20%). The results of the eight test sequences with ZDK-II are shown in Figs. 3.14 and 3.15. We see that, with the optimal MV resolution estimation enabled, the proposed ZDK-II scheme achieves almost the same R-D performance as the H.264 1/8pel scheme with a complexity saving factor of 99.6%.

Table 3.2: Coding efficiency comparison of the proposed ZDK-I scheme and the SCJ method with respect to H.264 with quarter-pel resolution.

Sequence        Resolution   ZDK-I vs. H.264 1/4pel   SCJ 1/4pel vs. H.264 1/4pel
                             Delta Bit Rate (%)       Delta Bit Rate (%)
Foreman         352x288      4.59                     14.93
Mobile          352x288      5.89                     16.37
Stefan          352x288      4.73                     16.17
Flower garden   352x288      2.37                     15.84
Average                      4.40                     15.83
City Corridor   1280x720     4.40                     16.54
Night           1280x720     5.01                     14.39
Vintage Car     1920x1080    5.78                     16.70
Blue Sky        1920x1080    5.60                     15.02
Average                      5.20                     15.66

Table 3.3: Coding efficiency comparison of the proposed ZDK-II scheme and the SCJ method with respect to H.264 with eighth-pel resolution.

Sequence        Resolution   ZDK-II vs. H.264 1/8pel   SCJ 1/8pel vs. H.264 1/8pel
                             Delta Bit Rate (%)        Delta Bit Rate (%)
Foreman         352x288      3.14                      17.22
Mobile          352x288      4.28                      20.43
Stefan          352x288      3.21                      19.06
Flower garden   352x288      3.56                      18.69
Average                      3.54                      18.85
City Corridor   1280x720     3.46                      17.79
Night           1280x720     4.31                      19.05
Vintage Car     1920x1080    4.09                      19.46
Blue Sky        1920x1080    4.40                      20.83
Average                      4.07                      19.28

Figure 3.12: The R-D performance of ZDK-I and two benchmark methods for four CIF sequences: (a) Foreman, (b) Mobile, (c) Stefan, and (d) Flower garden.

Figure 3.13: The R-D performance of ZDK-I and two benchmark methods for four HD sequences: (a) City Corridor, (b) Night, (c) Blue sky, and (d) Vintage car.

Figure 3.14: The R-D performance of ZDK-II and two benchmark methods for four CIF sequences: (a) Foreman, (b) Mobile, (c) Stefan, and (d) Flower garden.

Figure 3.15: The R-D performance of ZDK-II and two benchmark methods for four HD sequences: (a) City Corridor, (b) Night, (c) Blue sky, and (d) Vintage car.

3.6 Conclusion

The behavior of the subpel MV error surface was studied, and two parameters were proposed to characterize the error surface, namely, the condition number and the deviation from flatness. These two parameters can be easily computed from the prediction residuals at the nine integer MV positions centered at the minimum integer MV location. Then, an optimal MV resolution estimation scheme was derived, which allows each block to select an optimal MV resolution adaptively based on the deviation-from-flatness parameter. Furthermore, two direct subpel MV position prediction schemes were described for ill- and well-conditioned blocks, respectively.
It was shown by experimental results that the R-D performance of the proposed ZDK-II scheme is comparable with that of the full subpel MV search at a much lower computational complexity. Several extensions of the current work can be explored in the future. For example, we adopted fixed thresholds on the condition number and the deviation-from-flatness parameters for all test sequences based on an off-line training process in this work. It may be worthwhile to investigate adaptive thresholding based on the properties of the underlying video sequences to achieve better R-D performance. Furthermore, we may consider a rate-distortion-complexity (RDC) optimization framework and adjust the threshold values accordingly to find a good balance between complexity and R-D performance.

Chapter 4
Granular Noise Prediction and Coding Techniques

4.1 Introduction

Video compression has been studied extensively for the last two decades. Earlier research primarily focused on low-bit-rate coding due to limited storage space and bandwidth. Recently, the research focus has shifted to high-bit-rate video due to the increased popularity of high definition (HD) video and the availability of broadband network infrastructure. HD video offers higher spatial resolution as well as enhanced quality (which means a higher PSNR range). To meet these requirements, the H.264/AVC standard has included the Fidelity Range Extension (FR-Ext) in its high profile to support 4k/2k contents [43].

High definition (HD) video sequences have several unique characteristics as compared to video sequences of lower resolution. First, their content has higher fidelity, with more detailed texture recorded to create a more immersive experience for the audience. Promoted by advances in storage and transmission technologies, the market is gearing towards high-bit-rate, high-quality content coding. To satisfy these unique requirements, H.264/AVC incorporated the Fidelity Range Extension (FR-Ext) in its high profile to support 4K/2K contents [43].

Second, HD video sequences are typically of very high resolution. The current well-known resolution is about 1920x1080 progressive, and even higher resolutions are being introduced into the market. For example, Digital Cinema Initiatives (DCI) specified the use of 2k/4k cameras [1], and the latest released HD-DVD and Blu-ray discs both support a resolution of 1080p. Because of the above-mentioned unique characteristics, the main challenges associated with HD content are the storage and bandwidth requirements for compression and streaming over IP networks. Compared to earlier standards, the latest compression standard, H.264/AVC, is able to provide a nearly 50 percent rate saving at the same PSNR [46]. However, H.264/AVC was initially proposed to target low-bit-rate coding environments, and most of its experiments were conducted on low-resolution QCIF and CIF sequences as well. Hence, as the spatial resolution increases, H.264/AVC reaches a performance bottleneck. Therefore, one of the long-term objectives in the JVT meetings is to develop a new-generation video coding standard that keeps abreast of this significantly increased demand for storage and transmission bandwidth.

In summary, the coding efficiency of existing coding schemes is limited once high fidelity is required. This phenomenon indicates that there exist some unique features causing inefficiency in high-bit-rate coding environments.
In this chapter, we first provide a systematic analysis of the unique characteristics of this feature, which we identify as granular noise, and its impact on high-bit-rate, high-fidelity video coding in Sec. 4.2. The analysis emphasizes the importance of treating granular noise differently and separately.

The rest of this chapter is organized as follows. In Sec. 4.3, a new granular noise prediction and coding scheme is proposed. This is an extension of our earlier proposed residual image prediction and coding (RIPC) for lossless coding [52]. This GNPC is further extended to incorporate a frequency-domain prediction scheme to enhance the coding performance in Sec. 4.4. Experimental results are given to demonstrate the effectiveness of the proposed GNPC scheme in Sec. 4.5. Finally, concluding remarks are given in Sec. 4.6.

4.2 Impact of Granular Noise on High Fidelity Coding

Due to the increased resolution of HD video, the texture complexity of a block is often simpler than that of SD video. Generally speaking, if an image has a higher correlation among pixels, its entropy will be lower and it is possible to achieve a higher compression ratio. However, we do not observe such a coding gain in existing video coding standards. One of the main reasons is the existence of granular noise in HD video. We elaborate on this topic in this section.

4.2.1 Observations

H.264/AVC was initially proposed for low-bit-rate coding. It has several unique features, such as a sophisticated multiple-reference-frame motion search, quarter-pel motion compensation, multiple mode selection, and rate-distortion optimization (RDO) techniques. Its coding performance outperforms previous MPEG standards by a significant margin.

Figure 4.1: (a) The macroblock partition modes and (b) B-frame prediction.

For P frame prediction, a macroblock can be divided into smaller partitions, as shown in Fig. 4.1. H.264/AVC supports luma block partitions of 16x16, 16x8, 8x16 or 8x8. One additional syntax element can be assigned to each 8x8 partition to indicate whether it is further divided into smaller sub-partitions of 8x4, 4x8 or 4x4. The motion prediction for each block is performed by searching for a displacement in the reference frame. H.264/AVC also supports multi-frame motion compensated prediction, where more than one previously coded frame can be used as the reference frame for inter prediction. For B frame prediction, more reference frames are incorporated so that a macroblock in a B frame can use a weighted average of two distinct motion compensated prediction values to construct the prediction. In B frame prediction, four different types of inter prediction can be used: list 0 (first list of reference pictures), list 1 (second list of reference pictures), bi-predictive and direct prediction. Similar to P frame prediction, the same macroblock partitions as indicated in Fig. 4.1 are used.

Besides the inter prediction modes, a SKIP mode is also introduced for extremely efficient coding. In the SKIP mode, all residual DCT coefficients of the block are quantized to zero so that neither quantized residuals nor the motion vector (or the reference index) is encoded. Only one bit is used to signal this SKIP mode. For a large area with no change or motion, a large number of blocks can be coded efficiently by the SKIP mode using a small number of bits.
The statistics given in Table 4.1 [23] show that the SKIP mode in H.264/AVC is very effective for a large QP value, where a large majority of blocks are quantized to zero and encoded in the SKIP mode.

Table 4.1: Mode distribution of blocks at QP=28 for several test CIF sequences.

Sequence    SKIP mode   Other inter modes   Intra modes
Container   98.133%     1.847%              0.020%
Foreman     53.949%     45.648%             0.403%
Mobile      54.534%     45.447%             0.019%
News        86.139%     13.861%             0.000%
Tempete     62.877%     36.541%             0.609%
Average     71.126%     28.669%             0.205%

Equipped with powerful inter-prediction tools, H.264/AVC provides efficient coding performance for video sequences of lower resolution (e.g., QCIF, CIF and SD video) with a medium to coarse quantization stepsize. However, this scenario changes dramatically in high-bit-rate video coding. The mode distribution for the coding of HD sequences of resolution 1920x1080 is very different, as shown in Table 4.2. Due to the small quantization stepsize, the SKIP mode is rarely selected. Furthermore, most macroblocks choose intra modes over inter modes as their optimal prediction mode; the percentage may exceed 90% for the Riverbed and Rush hour sequences. The inter frames are then coded nearly in the same fashion as an I frame.

Table 4.2: Mode distribution of macroblocks for HD sequences with QP=8.

Sequence     SKIP mode   Other inter modes   Intra modes
Riverbed     0%          0.086%              99.914%
Blue sky     0.064%      35.794%             64.142%
Rush hour    0%          6.642%              93.358%
Station2     0%          31.366%             68.634%
Pedestrian   8.012%      11.517%             80.472%
Average      1.615%      17.081%             81.304%

Figure 4.2: The mode distribution of H.264/AVC for the Rush Hour HD sequence at various QP values.

4.2.2 Analysis

To understand the shift from inter modes to intra modes, we perform a detailed analysis of the mode distributions over a wide range of QP values for the Rush Hour sequence. We show the mode distribution for QP equal to 8, 14, 20, 26, 32 and 38 in Fig. 4.2. For high-bit-rate coding with QP=8 or 14, intra modes are clearly the dominant choice, with an occurrence frequency higher than 80%. However, for low and medium QP values, the occurrence of intra modes drops significantly to 20% or lower. As the quantization stepsize becomes larger, more macroblocks can take advantage of the SKIP mode, which becomes dominant when QP reaches 32 and 38. To conclude, for the coding of HD video, the efficiency of MCP is not evident until the QP value is close to 20 or higher, which corresponds to a medium-bit-rate coding setting. Otherwise, intra modes are the dominant choice for a smaller QP.

As the QP value becomes smaller, more details and fine textures appear in a macroblock, and the number of quantized-to-zero DCT coefficients decreases. Generally speaking, smaller partitions have a higher probability of finding a match, yet they demand more header bits as overhead. The reduction in residual bits can be offset by the increased overhead bits. To obtain the best trade-off mathematically, the Lagrangian cost function is commonly used, which is also known as the Rate-Distortion Optimization (RDO) process [44] in H.264/AVC mode evaluation.
The Lagrangian cost function is expressed as

$$\min \{ J(blk_i \mid QP, m) \}, \quad J(blk_i \mid QP, m) = D(blk_i \mid QP, m) + \lambda_m R(blk_i \mid QP, m), \quad (4.1)$$

where $D(blk_i \mid QP, m)$, $R(blk_i \mid QP, m)$ and $\lambda_m$ are the distortion, the bit rate and the Lagrangian multiplier of block $blk_i$ for a given coding mode $m$ and quantization parameter (QP), respectively. In H.264/AVC, $\lambda_m$ can be expressed as a function of the quantization parameter QP:

$$\lambda_m = c_m \cdot 2^{QP/3}, \quad (4.2)$$

where $c_m$ is a constant. Thus, for a given QP, the total number of bits associated with a macroblock coded with intra mode $m_I$ or inter mode $m_P$ can be calculated as

$$R_{total}(m_I \mid QP) = R_{hdr}(m_I \mid QP) + R_{coef}(m_I \mid QP), \quad (4.3)$$

and

$$R_{total}(m_P \mid QP) = R_{hdr}(m_P \mid QP) + R_{coef}(m_P \mid QP), \quad (4.4)$$

respectively. Hence, the cost difference between intra and inter modes can be expressed as

$$J(m_P) - J(m_I) = [D(m_P) - D(m_I)] + \lambda_P [R_{hdr}(m_P \mid QP) + R_{coef}(m_P \mid QP)] - \lambda_I [R_{hdr}(m_I \mid QP) + R_{coef}(m_I \mid QP)]. \quad (4.5)$$

Since inter prediction can find a better match than intra prediction for CIF video, the distortion of inter modes, $D(m_P)$, is usually less than that of intra modes, $D(m_I)$. Therefore, in spite of the advantage of the lower header bits associated with intra modes, the Lagrangian optimization process is still in favor of inter modes. For HD video, the quantization stepsize is restricted to a small value by the high fidelity requirement. Based on the high-bit-rate coding theory in [19], the distortion can be expressed as

$$D(m_I \mid QP) \approx D(m_P \mid QP) = \frac{\Delta^2}{12}, \quad (4.6)$$

where $\Delta$ is a small quantization stepsize. By substituting (4.6) into (4.5), we get the cost difference as

$$J(m_P) - J(m_I) = \lambda_P R_{hdr}(m_P \mid QP) - \lambda_I R_{hdr}(m_I \mid QP) + \lambda_P R_{coef}(m_P \mid QP) - \lambda_I R_{coef}(m_I \mid QP), \quad (4.7)$$

which mainly depends on the total number of bits required to encode the macroblock. Generally speaking, the same $\lambda$ is used for both intra and inter modes, which results in $\lambda_I = \lambda_P$. Thus, if MC cannot find a good match due to external noise, texture, motion blur, etc., the inter prediction can be worse than the intra prediction. As a result, the overall cost is decided by the difference in header bits used by the intra and inter modes. This gives an advantage to intra prediction, since it demands no bits for the reference frame and the motion vector.

The above discussion explains why a majority of blocks choose intra modes over inter modes at a fine QP value for HD video. To summarize, due to the increased resolution, there exists fine information in a frame that cannot be well compensated, owing to the existence of film grain noise [32] and tiny surface variations in HD video. These features do not show up in lower-resolution video since they are averaged out in a lowpass filtering process. In the following, we propose a novel coding scheme for HD video with granular noise in Sec. 4.3.

4.3 Overview of GNPC Coding Framework

Film grain noise is a type of random optical texture from processed film. It is linked to the physical characteristics of the film, is perceived as a random pattern, and normally follows general distribution statistics [8]. Film grain noise is not prominent in the standard definition television format and is even less perceivable in smaller formats such as CIF or QCIF. However, these fine surface variations become much more visible once the resolution is increased to HD. In addition, film grain noise is usually one of the key elements used by artists to convey emotion or provide cues that enhance the visual perception of the scene to the audience.
Sometimes, the film grain size varies from frame to frame to provide different cues, such as a time reference. Therefore, for lossless or high fidelity video coding, it is desirable to preserve the quality of the film grain noise without modifying the original intent of the filmmakers. In addition, preserving film grain throughout the entire imaging and delivery chain has become a requirement in the motion picture industry.

Due to the random nature of film grain noise, it is very difficult to find an efficient energy compaction solution. Therefore, conventional research has focused on film grain synthesis. Film grain noise is first removed during a pre-processing stage at the encoder using a filter, then re-synthesized at the decoder and added back to the filtered frame. As film grain is known to follow a near-Gaussian distribution, instead of coding the noise block by block, a good approximation model is composed from the extracted film grain noise features and sent to the decoder. With the received model parameters, the decoder is able to re-synthesize the noise and add it back to the decoded frames in the post-processing stage. Because only the noise model parameters are sent to the decoder instead of the real noise, the overall bit rate can be reduced significantly. Many successful algorithms for texture synthesis have been proposed and can be used for film grain noise synthesis [16, 42]. To reduce the computational complexity, Gomila and Kobilansky [10] proposed a sample-based approach that uses a noise sample extracted from the source signal and applies different transformations to it. Only one noise block is sent to the decoder in the SEI message. However, these approaches generally involve duplicating part of the original grain source and can suffer from visible discontinuity and repetition. Another type of film grain noise synthesis method employs a comprehensive film grain database [22]. This database contains a pool of pre-defined film grain values. The film grain selection process follows a random fashion corresponding to the average luminance of the block, and a deblocking filter is applied to blend in the film grain. This method allows the generation of realistic film grain but requires both ends to have access to the same film grain database.

In this work, we consider this type of film grain noise and other surface variations as granular noise. We attempt to exploit the spatial and temporal correlation of granular noise with another level of prediction so as to further reduce redundancy. In this section, we propose a new lossless granular noise prediction and coding (GNPC) scheme targeting HD video/image contents.

Figure 4.3: The block diagram of the proposed granular noise extraction process.

An overview of the coding system architecture is shown in the block diagram in Fig. 4.3. The input frame is decomposed into two parts via a de-noising technique. Then, these two parts are coded independently. They are integrated again at the decoder. Thus, there are two key questions in this design, namely: 1) the development of a good decomposition scheme; and 2) the design of an effective residual image prediction and coding scheme. They will be addressed below. Two different prediction schemes are performed for content and granular noise, and their residuals are entropy coded differently in our proposed coding system. There are many ways to extract granular noise. As shown in Fig. 4.3, we use an H.264/AVC-based video coding process as the noise filtering process here.
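A minimal sketch of this decomposition is given below; `encode_decode` is a stand-in for any H.264/AVC encode-plus-reconstruct round trip (the fixed content-layer QP of 25 echoes the setup later used in Table 4.3), not an actual codec API:

```python
import numpy as np

def split_content_and_noise(frame, encode_decode, qp=25):
    """Decompose a frame into a content layer and a granular-noise layer,
    following Fig. 4.3: the reconstruction of a lossy encode acts as the
    de-noising filter. `encode_decode` is a placeholder callable that
    encodes `frame` at the given QP and returns the reconstruction."""
    content = encode_decode(frame, qp)  # F': filtered/content layer
    noise = frame.astype(np.int16) - content.astype(np.int16)  # N = F - F'
    return content, noise

def merge_at_decoder(content, noise):
    """Lossless reintegration at the decoder: F = F' + N."""
    return np.clip(content.astype(np.int16) + noise, 0, 255).astype(np.uint8)
```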
There are several advantages associated with the proposed scheme. First, it can be easily integrated with any traditional video codec; no modification to the existing codec is required. The decoder can discard the transmitted noise frame if the decoding time budget is insufficient; this does not affect the decoding of the consecutive incoming content frames, as they are coded independently. Second, current lossless and lossy systems are designed with completely different prediction schemes, which are inherently mutually exclusive. Therefore, to build a system that can produce both lossless and lossy results, hardware designers need two independent modules. In contrast, with the proposed scheme, one system can achieve both lossless and lossy goals by simply turning off the residual image prediction module. Third, the proposed system can achieve scalability with minimal modification (e.g., by quantizing the prediction errors of the residual image before entropy coding). Fourth, the encoder no longer needs to encode different versions of the bitstream when the decoders range from mobile devices to HDTV sets. The encoder only needs to encode once, and each decoder decides which granular noise layers are not needed. This can potentially save storage and streaming bandwidth significantly.

For example, consider encoding a video program with lossy H.264/AVC. For an input frame $F$, we first encode it with the H.264 encoder using a medium-coarse QP. Then, the difference between the reconstructed frame $F'$ and the original frame $F$ is the extracted noise frame, denoted by $N$. There is a trade-off between the bits assigned to the coding of $F'$ and the coding of $N$, depending on the selection of QP.

4.4 Granular Noise Prediction and Coding in Frequency Domain

We exploit the frequency correlation of granular noise with another prediction so as to further remove redundancy. In this section, we first review frequency-domain prediction techniques and then discuss the proposed frequency-domain GNPC coding method.

4.4.1 Review of Frequency Domain Prediction Techniques

The most prevalent motion estimation (ME) and motion compensation (MC) algorithms used in image and video compression are based on block matching techniques in the spatial domain. An alternative ME scheme is to estimate the cross-correlation function in the frequency domain [37]. The frequency spectrum of the input can be normalized to give a phase correlation. However, the correlation performed by a DFT-based method corresponds to a circular convolution rather than a linear one, and the correlation function can be affected by the edge effect. To reduce the edge artifact, Kuglin and Hines [9] proposed zero-padding the input data sequence at the cost of higher complexity. Another technique is to use a transform size that is much larger than the maximum displacement considered [41]. This approach can limit the error size, but it is better suited to global ME than to block-based ME. A third technique is to use the complex lapped transform (CLT) to perform the cross-correlation in the frequency domain [49]. Since the basis functions are overlapped and windowed by a smooth function shaped like a half cosine, it introduces fewer block edge artifacts as compared to the LOT in the spatial domain. The latest effort on prediction in the frequency domain was proposed for intra prediction in VC-1, where the DC and AC components are predicted from their left and top neighboring frequency components.
However, the above-mentioned schemes are not widely used for several reasons. First, most previous video compression algorithms focus on low-bit-rate coding. With efficient spatial motion estimation and coarse quantization, most block DCT coefficients are quantized to zero. Hence, there is little room left for rate-distortion improvement with frequency-domain prediction. Second, all frequency components are compensated simultaneously with the same spatial offsets, which is similar to motion compensation in the spatial domain except that frequency components are compensated rather than pixel values. Then, if there is little correlation in the spatial domain, it is difficult to obtain a better prediction in the frequency domain.

4.4.2 Granular Noise Prediction in Frequency Domain

As studied in the previous section, granular noise consists of high-frequency components. Hence, if the quantization step size is too fine to reduce them to zero, the coding performance suffers, since the entropy coder is optimized with respect to long runs of zeros. In the new scheme, a target block first goes through the DCT and quantization, and the quantized DCT block is then subject to the following two prediction modes.

The full mode: The target DCT block is predicted by its candidate DCT blocks with their AC and DC components completely aligned.

The par mode: The target DCT block is further partitioned into four frequency partitions of size $m \times m$, where $M = 2m$. From low to high frequencies, the partitions are named $np_0$, $np_1$, $np_2$ and $np_3$, respectively, as shown in Fig. 4.4(a).

An additional mode, called the zero mode, is proposed to further improve coding efficiency. If the prediction residual in the full mode results in all-zero coefficients within a DCT block, the block is coded in the zero mode, where no prediction residual is coded.

Figure 4.4: Granular noise block partitioning in the frequency domain: (a) full mode, (b) par mode and (c) prediction alignment for the par mode.

Each partition $np_i$ is independently predicted from its corresponding partition in the candidate blocks. Hence, the predicted DCT block can be composed of partitions from different reference blocks. As the prediction errors are differences in the frequency domain, they do not demand any additional DCT or quantization operations and can be sent directly to the entropy encoder.

Figure 4.5: The DCT-domain based granular noise prediction for (a) an intra noise frame and (b) an inter noise frame with search range S.

The granular noise prediction for intra and inter noise frames is illustrated in Fig. 4.5. When the par mode is chosen for a transformed and quantized granular noise block, there are four displacement vectors pointing to the four predicted partitions of candidate blocks in the reference frame, so the number of overhead bits can be higher. On the other hand, since the frequency bands should be well aligned between the target and reference blocks, the unit of the search stride should be the same as the block width (or height).

4.4.3 Rate-Distortion Optimization

To select the best mode for the coding of a GN block, we can employ the Lagrangian Rate-Distortion Optimization (RDO) technique as

$$\min \{ J(blk_i \mid m) \}, \quad J(blk_i \mid m) = R_{hdr}(blk_i \mid m) + R_{coef}(blk_i \mid m). \quad (4.8)$$

As the prediction is performed on the already transformed block, the residual is in fact in the form of quantized, transformed DCT coefficients.
As a result, the distortion can be taken out of the Lagrangian optimization formula, and the entire process simplifies to a rate optimization process. Based on the rate estimation given in [29], the total number of bits needed to code the prediction error, $R_{coef}$, is modeled as

$$R_{coef} = \alpha \cdot \frac{SATD(QP)}{Q^p}, \quad (4.9)$$

where $\alpha$ is a model parameter and $p$ is a frame-type dependent value. In our case, we use $p = 1.0$. $SATD(QP)$ represents the sum of absolute transformed differences of the prediction error within a block. The total number of bits needed to encode the GN block is the sum of the prediction error bits $R_{coef}$ and the header bits $R_{hdr}$. A simple block code of $\log_2 S$ bits is assigned to the pair of displacement vectors. We explain more details in the next subsection. This simplified RDO process helps to reduce the computational complexity of rate control for video streaming applications. The block diagram of the high fidelity GNPC scheme is shown in Fig. 4.6.

Figure 4.6: The block diagram of (a) the encoder and (b) the decoder of the proposed GNPC scheme for high fidelity video coding.

There are a few advantages to the proposed frequency-domain GNPC scheme. First, by correlating a sub-band of the target DCT block with those of different DCT blocks, we can enhance the matching probability and reduce the energy of the prediction errors. Second, by quantizing the input DCT block before the prediction process, no additional computation is needed in the rate-distortion optimization phase, and the complexity can be significantly reduced. Third, the rate of the proposed scheme can be adjusted more easily in video streaming applications. As quantization is done prior to prediction, the distortion/PSNR of each block/frame can be known ahead of time without going through the entire RDO process [29].

4.4.4 Translational Index Mapping

In the proposed frequency-domain GNPC scheme, when the par mode is chosen for a transformed and quantized GN block, there are four displacement vectors pointing to the four predicted partitions of candidate blocks in the reference frame, so the number of overhead bits can be higher. The proposed frequency-domain GNPC scheme has to encode translational vector pairs $(\delta_x, \delta_y)$ within the DCT domain to indicate the best match location. For the full mode, one prediction error block plus one pair $(\delta_x, \delta_y)$ are needed. For the par mode, one prediction error block together with four pairs $(\delta_x, \delta_y)$ are needed, since each partition has its own unique $(\delta_x, \delta_y)$. Thus, the translational vector cost is higher if the par mode is chosen as the best mode. In addition, due to the random nature of the frequency-domain prediction, classic DPCM-based translational vector prediction does not bring efficiency to the coding of the translational vectors; see Fig. 4.7.

Figure 4.7: The translational vector maps for (a) the content layer and (b) the granular noise layer for a Rush Hour frame at a resolution of 352x288.

Hence, to limit the overhead cost of using four pairs of translational vectors, one translational index $T$ is used to replace each pair of translational vectors. Each unit of $T$ is equivalent to a certain distance in both the horizontal and vertical directions. The translational indexing method is developed by modifying the circular zonal techniques [4]. Each zone is color-differentiated as shown in Fig. 4.8.
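Pulling together Secs. 4.4.2 and 4.4.3, the sketch below illustrates the par-mode frequency partitioning and a rate-only mode decision in the spirit of Eq. (4.8). The header-bit costs and the sum-of-absolute-residuals rate proxy are illustrative placeholders; the actual scheme uses the entropy-coder-based estimate of Eq. (4.9):

```python
import numpy as np

def frequency_partitions(block):
    """Split a quantized MxM DCT block into the four m x m frequency
    partitions np0..np3 of the par mode (m = M/2), ordered from low to
    high frequency as in Fig. 4.4(a)."""
    m = block.shape[0] // 2
    return [block[:m, :m], block[:m, m:], block[m:, :m], block[m:, m:]]

def best_mode(target, candidates, header_bits_full=8, header_bits_par=32):
    """Rate-only mode decision among the zero, full and par modes for a
    quantized target DCT block, given a non-empty list of candidate
    quantized DCT blocks from the reference frame."""
    # Full mode: best whole-block match among the candidate DCT blocks.
    full_res = min(np.abs(target - c).sum() for c in candidates)
    if full_res == 0:
        return "zero", 0  # all-zero residual: signal the zero mode only
    # Par mode: each frequency partition picks its own best candidate.
    par_res = sum(
        min(np.abs(t_p - c_p).sum() for c_p in parts)
        for t_p, parts in zip(
            frequency_partitions(target),
            zip(*(frequency_partitions(c) for c in candidates)),
        )
    )
    full_cost = header_bits_full + full_res
    par_cost = header_bits_par + par_res
    return ("full", full_cost) if full_cost <= par_cost else ("par", par_cost)
```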
Figure 4.8: Illustration of the translational indexing scheme for (a) the intra GNPC frame and (b) the inter frame in frequency-domain GNPC.

For intra GN prediction, the zone is not completely circular because of the uncoded blocks in the zone; the indexing scheme represents a half zone, and the indexing always starts from the left side of the target block with the equivalent of $\delta_x = 0$ to maximize the spatial correlation between the target block and the candidate blocks [15].

Table 4.3: Experimental setup for H.264/AVC and the GNPC scheme.

Parameter               H.264/AVC            GNPC content     GNPC intra noise     GNPC inter noise
Profile                 High                 Baseline         n/a                  n/a
QP                      4,8,10,12,14,16,18   25 (fixed)       4,8,10,12,14,16,18   4,8,10,12,14,16,18
GOP                     15                   15               n/a                  15
# of reference frames   5                    1                n/a                  1
RDO                     Full complexity      Low complexity   Fast                 Fast
Subpel ME               1/4 pel              n/a              n/a                  n/a
Search range            64                   32               32                   32
Deblocking filter       Enabled              Disabled         n/a                  n/a
Entropy coder           CAVLC                CAVLC            CAVLC                CAVLC

4.5 Experimental Results

We conducted experiments to compare the performance of H.264/AVC [2] and the proposed GNPC scheme for high fidelity video coding. Only the luminance channel is compared in this experiment. Four HD YUV sequences were used, namely Rush hour (@1920x1080), Blue sky (@1920x1080), Sunflower (@1920x1080) and Vintage car (@1920x1080). The results were averaged over 10 frames for each sequence.

For H.264/AVC, we chose the high-complexity RDO process, multiple reference frames, quarter-pel motion estimation, and the deblocking filter option. In contrast, for the high fidelity GNPC scheme, we set the encoder to the lowest complexity, with low-complexity RDO, one reference frame, no B frames, no subpel search and no deblocking filter. More details of the experimental setup are given in Table 4.3.

Figure 4.9: Rate-distortion curves for HD video sequences (a) Rush Hour, (b) Blue Sky, (c) Sunflower and (d) Vintage Car.

The rate and distortion curves are shown in Fig. 4.9. The rate and distortion differences are presented in Table 4.4; they are calculated based on the formula given in [18]. It was observed in [10] that a coarser quantization (with QP > 18) can reduce granular noise to a minimum. Here, we used a quantization stepsize range from 4 to 18 in the context of near-lossless coding. The results in Figs. 4.9-4.13 confirm a similar trend; namely, granular noise is gradually suppressed by the quantization effect. The performance of the proposed GNPC scheme and H.264/AVC converges for video sequences of simpler content at lower bit rates. The proposed GNPC provides a higher coding gain for highly complex sequences such as Vintage Car.

Table 4.4: Coding efficiency comparison between H.264 and GNPC in the frequency domain.

Resolution   Sequence      Delta Bit Rate (%)
1920x1080    Rush Hour     -11.85
             Blue Sky      -7.20
             Sunflower     -11.54
             Vintage Car   -9.73
             Average       -10.08

Figure 4.10: The mode distribution chart for the Rush Hour sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without frequency-domain GNPC.

Figure 4.11: The mode distribution chart for the Blue Sky sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without frequency-domain GNPC.
Figure 4.12: The mode distribution chart for the Sunflower sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without frequency-domain GNPC.

Figure 4.13: The mode distribution chart for the Vintage Car sequence with (a) full mode vs. par mode and (b) zero mode distribution with and without frequency-domain GNPC.

The mode distribution charts in subfigure (a) of Figs. 4.10-4.13 show that more blocks adopt the par mode in the GNPC scheme as the quantization stepsize becomes smaller. In the zero mode distribution charts in subfigure (b) of Figs. 4.10-4.13, we count the number of blocks that meet the criterion of the zero mode. Note that this zero mode has no counterpart in H.264/AVC; thus, we simply label the H.264/AVC result as "no GNPC". These charts show that a larger percentage of blocks exploits the efficiency of the zero mode enabled by the near-lossless GNPC. The above experimental results clearly demonstrate the effectiveness of the proposed GNPC scheme, with an average bit rate gain of 10%.

4.6 Conclusion

In this chapter, we first conducted an analysis of existing video codecs to show that they are not effective for high-bit-rate coding. We pointed out that the existence of granular noise could be the main reason for the coding inefficiency of existing coding techniques, and proposed a granular noise prediction and coding scheme. Furthermore, we proposed a novel prediction scheme for GN coding based on frequency-domain prediction. The resulting scheme allows efficient prediction and coding at a low computational complexity. The proposed GNPC scheme can outperform H.264/AVC by an average of 10% bit rate reduction in the high-fidelity coding case.

Chapter 5
Multi-Order Residual (MOR) Coding

5.1 Introduction

Due to the constraints of communication and computational resources, traditional video compression algorithms have focused on low-bit-rate video coding. The state-of-the-art video coding standard, H.264/AVC, was initially developed to target low-to-medium bit rate applications. However, with the popularity of high definition video, such as high definition TV (HDTV) and the Blu-ray disc (BD), in recent years, an effective high-bit-rate coding scheme becomes more and more important.

The need to store and transmit high resolution video with high fidelity imposes a great challenge on video coding technology. To address this requirement, H.264/AVC has the Fidelity Range Extension (FR-Ext) in its high profile to support the coding of 2k/4k contents [43]. However, as explained in the last chapter, due to the existence of uncompensated fine structural features in the prediction residual, most prediction/compensation techniques used in H.264/AVC are ineffective in the high bit rate region. This results in coding efficiency degradation as the quality requirement increases. It was claimed in previous work [10, 32] that these fine features exhibit a behavior similar to film grain noise. Some coding schemes were developed based on the idea of film grain noise synthesis [16, 10, 42, 8, 32]. Although these methods can improve the compression ratio, they are not widely adopted by industry or the coding community due to the significant loss in the objective quality measure. To address this problem, a coding scheme called granular noise prediction and coding (GNPC) was proposed in Chapter 4.

As the analysis conducted in the previous chapter was mainly based on existing codec behavior, in this chapter we provide a more thorough investigation of the target signal characteristics. We hence further investigate the impact of the high-bit-rate requirement on coding efficiency from two angles.
First, we study the distribution of prediction residuals in the form of DCT coefficients. Second, we conduct a correlation analysis on different video scenes to understand the long-, medium- and short-range correlations in the input video frame. Based on this analysis, we propose a new coding approach called multi-order-residual (MOR) coding in Sec. 5.3 [54]. MOR is a generalized and improved scheme built on our earlier SOR scheme [53]. As compared with the previously proposed GNPC method, the MOR approach extracts different types of correlation from the first-order prediction residuals in multiple stages, based on concepts from numerical analysis. It is worthwhile to point out that the MOR approach is different from quality-scalable video coding, since it allows different coding methods to be used in different stages (including different prediction, transform and entropy coding techniques) to achieve better overall coding efficiency. Experimental results are provided in Sec. 5.4 to demonstrate the effectiveness of the proposed MOR coding approach. Concluding remarks and future directions are given in Sec. 5.5.

5.2 Signal Analysis for High-bit-rate Video Coding

In this section, we examine how the characteristics of the original signal impact the codec design under high-bit-rate coding requirements.

5.2.1 Distribution of DCT Coefficients

Generally speaking, a coding scheme can be divided into two distinct phases: modeling and coding. In the modeling phase, the spatial and temporal redundancy of the input video data is removed via transform and/or prediction, and statistics about the prediction residual are gathered to form a probabilistic model [38]. This is one of the most fundamental pieces of data compression. In earlier research, the Gaussian distribution was often used to describe the distribution of AC coefficients [38]. However, it was soon found that the Laplacian distribution is more suitable for describing the signal statistics when the Kolmogorov-Smirnov goodness-of-fit test is used [17, 19]. Recent studies on the coding of standard definition video with H.264/AVC also reveal that the AC coefficient distribution is Laplacian-like, and that the probability distribution is skewed with a large zero peak after a large or medium quantization step-size is applied [30]. This property is utilized in the design of the zigzag scanning order and the entropy coding modules.

In this section, we take another look at the DCT coefficient distribution under a higher fidelity requirement, since this requirement is met through finer quantization step sizes in today's video codecs. We first analyze the effect of the quantization parameter (QP) on the DCT coefficient distributions of 4x4 blocks in H.264/AVC.

Figure 5.1: The probability distribution of non-zero DCT coefficients at each scanning position for (a) Jet, (b) City Corridor and (c) Preakness frames.

In Fig. 5.1, we show the probability distribution of nonzero DCT coefficients of prediction residuals after the motion compensated prediction (MCP) process, as a function of the scanning position, for different QPs and three different types of video frames. Position 1 is the DC coefficient, and positions 2 through 16 are zigzag-scanned AC coefficients. When a coarser quantization stepsize is used (e.g., QP = 32), the higher frequencies (e.g., scanning positions above 10) are all quantized to zero, with most nonzero coefficients concentrated in the lower frequency region (scanning positions below 5) for all three sequences. This distribution is consistent with the Laplacian assumption with a large zero peak, which indicates that the MCP process is effective under coarse QPs. The prediction residual has been properly processed so that the subsequent coding modules can encode the remaining nonzero coefficients efficiently. To be specific, the zigzag scanning process compacts most zeros behind an "end-of-block" symbol, and the entropy coding module codes the non-zero coefficients effectively. However, we further observe that as the QP becomes finer to meet the higher fidelity requirement, the previously skewed distribution becomes increasingly uniform across scanning positions. At QP = 12, all three sequences have a close to uniform distribution of nonzero coefficients. In this case, the Laplacian distribution with a large zero peak is no longer a suitable model for high-bit-rate coding: most coefficients are non-zero and their probabilities become similar. This is a clear indication that the existing MCP process can no longer offer effective prediction for high-bit-rate video coding.
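The per-position statistics of Fig. 5.1 can be reproduced with a short measurement loop. The sketch below is our own illustration, using a plain orthonormal 4x4 DCT and a uniform quantizer as stand-ins for the H.264/AVC integer transform and quantizer, so the absolute numbers will differ slightly from a real encoder.

```python
import numpy as np
from scipy.fftpack import dct

# Zigzag order of a 4x4 block as (row, col) pairs; entry 0 is the DC term.
ZIGZAG = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
          (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

def scan_position_histogram(residual_frame, qstep):
    """Share of nonzero quantized DCT coefficients at each zigzag scan
    position, over all 4x4 blocks of a residual frame (cf. Fig. 5.1)."""
    h, w = residual_frame.shape
    counts = np.zeros(16)
    for y in range(0, h - h % 4, 4):
        for x in range(0, w - w % 4, 4):
            blk = residual_frame[y:y+4, x:x+4].astype(float)
            coef = dct(dct(blk, norm='ortho', axis=0), norm='ortho', axis=1)
            q = np.round(coef / qstep)
            for pos, (r, c) in enumerate(ZIGZAG):
                counts[pos] += (q[r, c] != 0)
    # Normalize to the probability that a nonzero lands at each position.
    return counts / counts.sum()
```

Sweeping qstep from coarse to fine reproduces the shift from a profile concentrated at the first few scan positions toward the near-uniform profile observed at QP = 12.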
Figure 5.2: (a) A sample frame from the Jet sequence and its prediction residual difference, (b) a sample frame from the City Corridor sequence and its prediction residual difference, and (c) a sample frame from the Preakness sequence and its prediction residual difference, at 1280x720 resolution with QP1 = 10 and QP2 = 30.

5.2.2 Correlation Analysis

To further analyze this change in the nonzero coefficient distribution, we examine the prediction residuals generated by H.264/AVC MCP. We take one sample frame from each of the three sequences used in Fig. 5.1 and encode it with two different quantization parameters, as shown in Fig. 5.2. We see that the residual difference images not only contain untreated features of very small size, but also that the amount of these untreated small structural features is directly related to the complexity of the input video frame.

Figure 5.3: (a) The correlation analysis for scenes with different complexities and (b) the relationship between the bit rate and quantization.

We perform a correlation analysis on the input video signal in the context of high-bit-rate coding, based on the idea in [5]. Again, we use the Jet, City Corridor and Preakness sequences as examples. The Jet sequence contains a scene of an airfield, which is mainly still background with little detail. As shown in Fig. 5.3(a), the correlation analysis for such a low complexity scene reveals that the correlation remains very strong (> 0.9) even when the pixel distance offset is increased steadily to 40 pixels. In other words, the frame mainly exhibits long-range correlation. For a typical frame of City Corridor, which has medium complexity, the correlation drops below 0.4 when the offset distance exceeds 8 pixels, which shows that the City Corridor frame contains a larger percentage of medium- to short-range correlations. For a Preakness frame, which is a highly complex scene, the overall correlation diminishes very quickly, indicating that the Preakness frame is dominated by short-range correlation. Recalling the residual differences observed in Fig. 5.2, we see that the different correlations inside a frame determine the amount of fine structural features that cannot be well compensated by the current codec.
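The offset-correlation curves of Fig. 5.3(a) come from a straightforward statistic. A minimal sketch of one way to compute it, assuming a 2D grayscale frame and measuring along the horizontal direction only, is:

```python
import numpy as np

def correlation_vs_offset(frame, max_offset=40):
    """Normalized autocorrelation of a grayscale frame versus horizontal
    pixel offset, in the spirit of Fig. 5.3(a) and [5]."""
    f = frame.astype(float)
    f -= f.mean()                 # remove the mean before correlating
    var = (f * f).mean()
    corr = []
    for d in range(1, max_offset + 1):
        # Correlate each pixel with the one d columns to its right.
        corr.append((f[:, :-d] * f[:, d:]).mean() / var)
    return np.array(corr)
```

A slowly decaying curve indicates long-range correlation (the Jet case), while a rapidly decaying one indicates short-range correlation (the Preakness case).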
We further plot the rate-QP curves in Fig. 5.3(b) for the three exemplary sequences to understand the impact of the correlation analysis on the video codec. When the QP is coarse (say, QP > 35), all three frames can be coded effectively. For a medium QP (say, 20 < QP < 35), the Jet sequence can still be encoded effectively, while the coding bit rate of Preakness increases very quickly. In the high-bit-rate range with QP < 20, we see a huge rate increase for all three sequences. This observation can be explained as follows. The traditional MCP process is designed to remove the long-range correlation via block-based prediction within a search window. The neglected medium- and short-range correlations do not play an important role when a coarser QP is used, so the overall coding efficiency is high in low-bit-rate applications. As the coding bit rate increases and the QP becomes finer, quantization can no longer absorb the medium- and short-range correlations effectively. Thus, the overall coding gain drops significantly, even for the Jet sequence of low complexity. The above analysis indicates the need for a new codec design that efficiently removes the long-range correlation as well as the medium- and short-range correlations, since the latter have a great impact on high-bit-rate video coding.

5.3 Multi-Order-Residual (MOR) Prediction and Coding

Based on the analysis in Sec. 5.2, we propose a coding system that removes the different correlations in the input sequence with multiple residual layers. Hence, it is called the multi-order residual (MOR) prediction and coding scheme.

Figure 5.4: Overview of the Multi-Order-Residual (MOR) coding scheme.

5.3.1 Overview of MOR Coding System

The MOR coding system is motivated by the multi-order differencing operation in numerical analysis. In our context, the long-range correlation of an input image sequence is treated in the first stage, and the resulting prediction residuals are called the first-order residuals (FOR). The medium- and short-range correlations remain in the FOR image and can be removed in the second and third stages using different coding schemes. The prediction residuals of the second stage correspond to uncompensated medium-range correlation and are termed the second-order residuals (SOR). Similarly, the prediction residuals of the third stage correspond to short-range correlation and are called the third-order residuals (TOR). An overview of the MOR coding system is shown in Fig. 5.4.

Figure 5.5: The block diagram of the proposed MOR coding scheme.

In the following subsections, we discuss the specific design choices of the three coding stages shown in Fig. 5.5. For the FOR coding, since H.264/AVC is highly effective in removing the long-range correlation, it is employed to encode the FOR image with a coarser quantization value denoted by Q1. The second- and third-order residuals mainly consist of medium- and short-range correlation, for which the traditional H.264/AVC MCP process no longer provides an efficient solution. Thus, the higher-order residual images are transformed and quantized with finer quantization values, denoted by Q2 and Q3, in the second and third stages, respectively. In other words, even higher-order residuals can be coded within this framework. A new MCP process in the frequency domain is proposed in Sec. 5.3.3 to remove the medium- and short-range correlations in the higher-order residuals.
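The staging just described can be summarized in a few lines of code. The sketch below is purely conceptual: the (encode, decode) pairs are caller-supplied stand-ins for an H.264/AVC codec at Q1 and for the frequency-domain MCP coder of Sec. 5.3.3 at Q2 and Q3; none of these names come from an actual implementation.

```python
def mor_encode(frame, stage_codecs):
    """Conceptual flow of the MOR stages in Figs. 5.4 and 5.5.

    stage_codecs: three (encode, decode) callables supplied by the
    caller, standing in for an H.264/AVC codec at coarse Q1 and the
    frequency-domain MCP coder of Sec. 5.3.3 at finer Q2 and Q3."""
    bitstreams, residual = [], frame
    for encode, decode in stage_codecs:
        bits = encode(residual)          # code this stage's layer
        bitstreams.append(bits)
        # What the stage fails to represent is handed to the next one:
        # first the FOR image, then the SOR, then the TOR.
        residual = residual - decode(bits)
    return bitstreams
```

The key design freedom, as noted above, is that each stage may use a completely different prediction, transform and entropy coding method.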
5.3.2 Goals of MOR Prediction

In Chapter 4, we proposed a frequency domain-based prediction and compensation scheme for granular noise data. It is based on the concept that GN data have characteristics similar to film grain noise and therefore manifest mainly in the high frequency bands of a transformed DCT block. To compensate these high frequency residuals directly, we proposed performing the prediction and compensation phase in the frequency domain.

Figure 5.6: A typical histogram of prediction residuals in the DCT domain.

Here, we propose a different frequency domain-based prediction technique to predict signals with medium-range correlation (MC) and short-range correlation (SC). The idea is not to replace pixel domain motion compensation, but to introduce a more effective prediction scheme that achieves good prediction results without incurring a large increase in computational complexity.

To illustrate the purpose of the MOR predictor design, we first show in Fig. 5.6 a histogram of residual signals in the form of DCT coefficients after traditional prediction is applied. After the MCP of H.264/AVC, the dynamic range of the prediction residual in the form of DCT coefficients is much smaller (from -60 to 60) than the original pixel value range (0 to 255). The evaluation process used in H.264/AVC minimizes the following cost function:

J(s, c | λ_mode) = D(s, c) + λ_mode · R(s, c),   (5.1)

where s and c are the original and reconstructed blocks, D(s, c) is the distortion and R(s, c) is the number of bits required to encode the residual and the overhead. The distortion D(s, c) is obtained by calculating the sum of absolute transformed differences (SATD),

SATD(s, c) = Σ_{x,y} | T{ s[x, y] - c[x, y] } |,   (5.2)

where T is an orthonormal transform. In H.264/AVC, T is chosen as the separable Hadamard transform due to its simplicity. This RDO process allows the encoder to select the best prediction without exact knowledge of the prediction residuals in the form of DCT coefficients.

Figure 5.7: Histograms of (a) MOR data in the form of pixel differences and (b) MOR data in the form of DCT coefficients.

However, if we apply this RDO procedure to the MOR signal, the prediction results will be poor, for the following reasons. Fig. 5.7(a) shows a histogram of the SOR signal in the form of pixel values before the SOR prediction takes place; the MOR signal in the pixel domain already has a much reduced dynamic range. Fig. 5.7(b) shows the histogram of the same SOR signal after it is DCT transformed. Even without further prediction and compensation, the SOR data in the form of transformed DCT coefficients already have an extremely small dynamic range of (-8, 8). Recall from Fig. 5.1 that the residual data under a fine quantization stepsize (QP = 12) also have a nearly uniform distribution across scanning positions. We therefore conclude that the MC and SC signals in the MOR layers exhibit some unique features, namely a very limited dynamic range and a uniform distribution of nonzero coefficients. Hence, the best prediction copy obtained through H.264/AVC's RDO process in the pixel domain motion search may not translate into the true best match after the residual is transformed and quantized. We thus need a different prediction scheme that targets these signal characteristics and provides an accurate prediction. One way to achieve this goal is to introduce additional DCT and quantization processes during the motion search phase to evaluate every prediction candidate, giving the encoder exact knowledge of the prediction results. However, this would add an extremely large amount of computation, for the extra DCT and quantization operations, to the already most complex motion search module. Therefore, to maintain an effective motion compensated prediction process for the MOR data while keeping the evaluation as simple as possible, we introduce the MOR prediction in the frequency domain in Sec. 5.3.3.
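For reference, the SATD distortion of Eq. (5.2) used by the conventional search is easy to state in code. The sketch below is our own, with the 4x4 Hadamard matrix built by a Kronecker product; normalization conventions vary between encoders, so the final scaling is only indicative.

```python
import numpy as np

# 4x4 Hadamard matrix, the separable transform T of Eq. (5.2) up to a
# scale factor: H4 = H2 (x) H2 with H2 = [[1, 1], [1, -1]].
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)

def satd_4x4(s, c):
    """Sum of absolute transformed differences for one 4x4 block pair."""
    diff = s.astype(float) - c.astype(float)
    t = H4 @ diff @ H4.T        # separable Hadamard transform of the difference
    return np.abs(t).sum() / 2  # scaling convention varies between encoders
```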
5.3.3 MOR Prediction in Frequency Domain

The block flow diagram of the system is illustrated in Fig. 5.5. The higher-order residual input block first goes through the standard DCT module to achieve frequency separation, followed by quantization. The quantized DCT coefficients are then used in the SOR or TOR prediction in the frequency domain. This predictor design allows the RDO process to evaluate the prediction results directly in the form of DCT coefficients on the fly, ensuring that the final prediction copy improves the coding performance. We further describe the RDO design in Sec. 5.3.4.

Figure 5.8: Re-grouping of the same frequency coefficients to obtain planes of DCT coefficients, denoted by P_i, where i = 0, 1, ..., M^2 - 1.

To perform the prediction in the frequency domain, the target block first goes through an MxM DCT and quantization outside of the coding loop. Then, each coefficient of {k_0(j), k_1(j), ..., k_{M^2-1}(j)} in the MxM DCT block of block B(j) is extracted into an individual corresponding coefficient plane of {P_0, P_1, ..., P_{M^2-1}}. This coefficient extraction process can be expressed mathematically as

P_i = ∪_{j=0}^{N} k_i(j).   (5.3)

As illustrated in Fig. 5.8, the extraction aggregates the same frequency component from different blocks into a common plane.

Figure 5.9: MCP in the frequency domain for each frequency plane.

The coefficients on each coefficient plane P_i are further grouped into s x s partitions np_l. Each re-arranged s x s partition then goes through a compensated prediction phase, which is similar to MCP in the spatial domain except that the compensation is performed on each individual coefficient plane, as follows (see Fig. 5.9). Given a SOR/TOR frame F_t, the frequency extraction process generates a series of coefficient planes {P_0, P_1, ..., P_{M·M-1}}, one per frequency component. For a DCT transformed and quantized SOR/TOR block Q[n_x^T] of size MxM at location (i, j) on P_i, the prediction of the block in the frequency domain is assembled from l = (M·M)/(s·s) partitions np_l as

n̂_x^T = Σ_l np_l(i + Δ_{i,l}, j + Δ_{j,l}),   (5.4)

where each translational vector (Δ_{i,l}, Δ_{j,l}) points to one predicted partition within the reference planes.
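The coefficient extraction of Eq. (5.3) and Fig. 5.8 amounts to a simple re-indexing. A minimal sketch, assuming the quantized 4x4 DCT blocks of a frame are stored in raster order, is given below; the array layout and names are our own choices.

```python
import numpy as np

def extract_coefficient_planes(coeff_blocks, rows, cols, m=4):
    """Coefficient extraction of Eq. (5.3) / Fig. 5.8: regroup the
    quantized MxM DCT blocks of a frame into M*M frequency planes.

    coeff_blocks: array of shape (rows*cols, m, m), one quantized DCT
    block per block position, stored in raster order."""
    planes = np.empty((m * m, rows, cols))
    for u in range(m):
        for v in range(m):
            # Plane P_i collects coefficient (u, v) from every block;
            # raster frequency order is used here for simplicity.
            planes[u * m + v] = coeff_blocks[:, u, v].reshape(rows, cols)
    return planes
```

Motion search then proceeds plane by plane on the s x s partitions, analogous to spatial MCP but operating on same-frequency coefficients.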
5.3.4 Rate-Distortion Optimization

The RDO process for MOR prediction is based on the same principle of minimizing the Lagrangian function in Eq. (5.1). Note, however, that in our MOR scheme the MCP with RDO is performed after the DCT and quantization, on the transformed and quantized DCT coefficients. Hence, the distortion is fixed and does not change across search candidates, and the Lagrangian cost function reduces to minimizing the rate only:

J(s, c) = R_res(s, c) + R_mv,   (5.5)

where R_res is the number of bits required to encode the prediction residual and R_mv is the number of bits to encode the motion vectors. As we are already working in the frequency domain, we can estimate the rate of the prediction residual with the ρ-domain approach [20] as

R_res = θ(1 - ρ),   (5.6)

where ρ is the fraction of zero coefficients among the quantized transform coefficients and θ is a model constant. To estimate the bits used to encode the motion vectors, we note that the search operates over a much reduced search range, so the MV data have a distribution similar to that of the residual coefficients. Hence, we can consider each MV to use the same number of bits as a nonzero coefficient, and Eq. (5.5) can be further simplified to

J(s, c) = N_nzTC + N_nzMV,   (5.7)

where N_nzTC is the number of nonzero transformed coefficients and N_nzMV is the number of nonzero motion vectors. The effectiveness of the proposed RDO method can be observed by comparing the histograms of DCT coefficients before and after prediction in Fig. 5.10(a) and (b).

Figure 5.10: Histograms of DCT coefficients of MOR data after MOR prediction in the frequency domain.

5.3.5 Pre-search Coefficient Optimization for TOR

To further improve the coding efficiency and to facilitate a fast search, we propose adding a pre-search coefficient optimization phase for third-order residuals. It is based on the observation that after the coefficients are extracted into individual frequency planes, the last few high frequency planes P_i (i = 13, 14, 15) are only sparsely populated with nonzero coefficients. This is mainly because the short-range correlations in the TOR have a very random distribution. Such a sparse distribution of nonzero coefficients can be detrimental to the entropy coder, since CABAC performs most efficiently when there are long consecutive runs of zeros. Hence, to assist the entropy coder, we introduce a pre-search coefficient optimization process for TOR. This optimization happens before the coefficient extraction; the detailed flow diagram is shown in Fig. 5.11.

Figure 5.11: The block diagram of the pre-search DCT coefficient optimization process for TOR.

In this pre-search optimization phase, we first examine the DCT coefficient block K. If the block has fewer than C_z nonzero coefficients in the last three high frequency scanning positions (SP = 13, 14, 15), we zero out these positions and perform an IDCT. The partially zeroed-out block is compared to the original incoming block I, which is the pixel residual from the FOR stage. If the optimization error is lower than a predefined empirical threshold, we use the optimized block K' and proceed to the frequency extraction phase; otherwise, we keep the original block K. This pre-search optimization boosts the consecutive runs of zeros in the higher frequency planes and thus allows the entropy coder to achieve better compression. The complete system diagram is shown in Fig. 5.12.

Figure 5.12: The block diagram of the proposed MOR coding scheme with pre-search DCT coefficient optimization for TOR.
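A sketch of the decision in Fig. 5.11 follows. The zigzag cells taken as positions 13-15, the count threshold c_z, the error threshold tau and the use of a plain orthonormal IDCT with uniform dequantization are all our own illustrative assumptions; the dissertation leaves these values to empirical tuning.

```python
import numpy as np
from scipy.fftpack import idct

HIGH_POS = [(2, 3), (3, 2), (3, 3)]  # assumed (row, col) cells for the
                                     # last three zigzag scan positions

def presearch_optimize(K, I, qstep, c_z=2, tau=4.0):
    """Pre-search coefficient optimization for one 4x4 TOR block
    (Fig. 5.11). K: quantized DCT block; I: the corresponding FOR pixel
    residual; c_z and tau stand for the count and error thresholds of
    the text (the values here are illustrative)."""
    nz_high = sum(int(K[r, c] != 0) for r, c in HIGH_POS)
    if nz_high == 0 or nz_high >= c_z:
        return K                     # nothing to gain, keep the block
    K2 = K.copy()
    for r, c in HIGH_POS:
        K2[r, c] = 0                 # zero out the high-frequency cells
    # Dequantize and invert the DCT to measure the optimization error.
    rec = idct(idct(K2 * qstep, norm='ortho', axis=0), norm='ortho', axis=1)
    err = np.abs(rec - I).mean()
    return K2 if err < tau else K
```

Zeroing these cells lengthens the zero runs on the highest frequency planes, which is precisely what the entropy coder rewards.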
Table 5.1: Coding efficiency comparison of the proposed MOR scheme vs. H.264/AVC for high-bit-rate coding.

Sequence       Resolution   Bit Rate (%)
Pedestrian     1920x1080    -18.41
Rush Hour      1920x1080    -18.46
Riverbed       1920x1080    -11.40
Vintage Car    1920x1080    -16.71
Average                     -16.42

5.4 Experimental Results

In this section, experimental results for the proposed MOR-based coding scheme are presented and compared with H.264/AVC [2] to demonstrate the performance of the proposed coding framework. Only the luminance channel is compared. Four high definition (HD) test YUV sequences at 1920x1080 resolution were used: Pedestrian, Rush Hour, Riverbed and Vintage Car. The results were averaged over 5 P frames for each test sequence. The benchmark H.264/AVC codec used the high profile, with full (high complexity) RDO enabled, 1/4-pel MCP, and CAVLC as its entropy coder. For the proposed MOR, QP1 = 30, QP2 = 22 and QP3 = 16. If MOR's desired quantization step size is larger than 16 (for example, QP = 18), TOR is not performed and QP2 is changed to the target quantization stepsize, i.e., QP = 18. The DCT size is set to 4x4, and the partition sizes for SOR and TOR data with MCP in the DCT domain are set to s_SOR = 4x4 and 2x2, and s_TOR = 2x2. A zero-order binary arithmetic coding engine is used as the entropy coder. The coding efficiency comparison is listed in Table 5.1, computed as in [18].

Figure 5.13: Rate-distortion curves for the Pedestrian sequence.
Figure 5.14: Rate-distortion curves for the Rush Hour sequence.
Figure 5.15: Rate-distortion curves for the Riverbed sequence.
Figure 5.16: Rate-distortion curves for the Vintage Car sequence.

Secondly, we compare the rate-distortion performance for the high-bit-rate coding scenario in Figs. 5.13 to 5.16. The proposed MOR reduces the coding bit rate of H.264/AVC by up to 23%, depending on the quality requirements.

Figure 5.17: Decoded Rush Hour frames with (a) MOR and (b) H.264/AVC at 60 Mbps.
Figure 5.18: Decoded Vintage Car frames with (a) MOR and (b) H.264/AVC at 80 Mbps.

Thirdly, we visually examine the decoded frames from both MOR and H.264/AVC. Fig. 5.17 shows a side-by-side comparison of Rush Hour frames decoded by the two schemes at the same bit rate of 60 Mbps, and Fig. 5.18 shows a side-by-side comparison of Vintage Car frames at the same bit rate of 80 Mbps. We can observe that the MOR scheme consistently provides a sharper decoded frame with far fewer coding artifacts.
Two dierent optimal MV prediction schemes were developed based on the dierent shape of the error surface. The rate-distortion performance of the proposed 115 optimal MV prediction is about the same as that of full search with an average of 90% complexity reduction. In Chapter 4, we conducted an analysis on the prediction residual and showed that it contains ne structured features when the coding bit rate becomes higher. These ne features were considered as lm grain noise in earlier. To treat these granular noise, we introduce an extra granular noise prediction and coding scheme based on the lm grain noise extraction process to extract these ne features in the residual. A frequency-domain based prediction and compensation scheme was further proposed for granular noise data. By correlating the same frequency bands between dierent blocks, we could maximize the possibility between target GN block and candidate blocks that might contain similar low frequency components but dierent high frequency components to be considered as candidate reference blocks and vice versa. The prediction between the same frequency bands avoids the complication of sparse matrix multiplication for recon- struction as required in earlier ME in frequency domain. It was shown by experimental results that, as compared to H.264/AVC, the proposed GNPC scheme can achieve an average of more than 10% bit rate reduction in high-bit-rate coding. In Chapter 5, we further investigate the impact of high bit rate coding from the fun- damental signal characteristics. We rst study the DCT coecient distribution and show that, as the video quality requirement increases, the distribution of DCT coecients is close to an uniform one. This explains the poor performance of traditional image/video codecs in the high bit rate region. We then performed a signal correlation analysis and showed dierent types of correltions in video frames. Due to the use of a ne quan- tization step, the quantization process can no longer be used to remove the short and 116 medium range correaltion eectively. Since the block-based motion-compensated predic- tive (MCP) coding technique is only eective in removing the long range correlation, the coding performance of the traditional video codecs degrades rapidly as the quality requirement becomes higher. Based on the study, we propose a multi-order residual (MOR) coding scheme. A coecient optimization technique was proposed to enhance the compression performance furthermore. It was shown by experimental results that the proposed MOR scheme outperforms the state-of-the-art H.264/AVC codec by an average of 16% in bit rate saving. 6.2 Future Research Directions To make the current research more complete, we would like to extend the current work along several directions as detailed below. Advanced sub-pel interpolation scheme H.264/AVC employs the quarter-pel ME, and there are two dierent sub-pel in- terpolation schemes. A 6-tap lter approach is used for half-pel interpolation, and a bilinear lter is used for quarter-pel interpolation. Although there exist other more sophisticated interpolation schemes that oers better performance than the 6-tap and bilinear lters, the existing codec does not adopt them due to the con- sideration of computational complexity. Since the proposed optimal subpel MV prediction scheme reduces the complexity of subpel interpolation by an average of 90%, it opens a new opportunity for more advanced interpolation schemes to further improve the overall RD performance. 
Improved QP and layer number selection in the MOR scheme

In the proposed MOR scheme, the QPs for the FOR and SOR images were chosen empirically at fixed values. As different video streams have different rate-distortion characteristics, an adaptive QP selection scheme should improve the coding performance. Furthermore, the number of layers in the MOR scheme may be adjusted according to the video characteristics. For example, a highly complex video stream containing long-, medium- and short-range correlations may benefit from more layers, while a simple video stream may demand only the FOR and SOR layers. In the latter case, using fewer layers reduces the layer overhead. Thus, the ability to dynamically adjust the QP and the layer number should improve the coding performance.

Advanced prediction techniques and preprocessing of DCT coefficients

In the proposed MOR scheme, we used a frequency-domain compensation technique. However, more sophisticated prediction techniques could be applied to the SOR image. In addition, we employed a simple preprocessing technique for DCT coefficients in the MOR scheme. It would be interesting to develop a more sophisticated DCT coefficient preprocessing technique to further enhance the coding performance.

Bibliography

[1] "Digital Cinema System Specification V1.0," [Online] www.dcimovies.com/DCI Digital Cinema System Spec v1.pdf, July 2005.
[2] "H.264/AVC Reference Software," [Online] Available: http://iphome.hhi.de/suehring/tml/, July 2005.
[3] "VC-1 Technical Overview," [Online] Available: www.microsoft.com/windows/windowsmedia/howto/articles/vc1techoverview.aspx, October 2007.
[4] A. M. Tourapis, O. C. Au, and M. L. Liu, "Fast motion estimation using modified circular zonal search," IEEE Intl. Symposium on Circuits and Systems, vol. 4, pp. 231-234, July 1999.
[5] J. Bennett and A. Khotanzad, "Modeling textured images using generalized long-correlation models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, pp. 800-809, June 1998.
[6] D. P. Bertsekas, "Nonlinear Programming," 2nd Edition, Athena Scientific, 1999.
[7] B. Girod, "Motion-compensating prediction with fractional-pel accuracy," IEEE Trans. on Communications, vol. 41, pp. 604-612, April 1993.
[8] B. T. Oh, C.-C. J. Kuo, S. Sun, and S. Lei, "Film grain noise modeling in advanced video technology," Visual Communications and Image Processing, Proc. of SPIE Electronic Imaging, vol. 6508, May 2005.
[9] C. D. Kuglin and D. C. Hines, "The phase correlation image alignment method," Proc. IEEE Int. Conf. on Cybernetics and Society, pp. 163-165, September 1975.
[10] C. Gomila and A. Kobilansky, "SEI message for film grain encoding," Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), JVT-H022.doc, September 2003.
[11] J. Chalidabhongse and C.-C. J. Kuo, "Fast motion vector estimation using multiresolution-spatio-temporal correlations," IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, pp. 477-488, August 1997.
[12] S. F. Chang and D. G. Messerschmitt, "A new approach to decoding and compositing motion-compensated DCT-based images," IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 421-424, April 1993.
[13] J. Cho, G. Jeon, J. Suh, and J. Jeong, "Subpixel motion estimation scheme using selective interpolation," IEICE Trans. on Communications, vol. E91-B, December 2008.
[14] C. Zhu, X. Lin, and L. Chau, "Hexagon-based search pattern for fast block motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, January 2007.
[15] Y. Dai, Q. Zhang, A. Tourapis, and C.-C. J. Kuo, "Efficient block-based intra prediction for image coding with 2D geometrical manipulations," IEEE Intl. Conf. on Image Processing, October 2008.
[16] A. A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling," Proc. of the IEEE Intl. Conference on Computer Vision, vol. 2, pp. 1033-1038, September 1999.
[17] T. Eude, R. Grisel, H. Cherifi, and R. Debrie, "On the distribution of the DCT coefficients," Proc. 1994 IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 5, pp. 365-368, April 1994.
[18] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," ITU-T VCEG, VCEG-M33.doc, April 2001.
[19] H. Hang and J. Chen, "Source model for transform video coder and its application, Part I: fundamental theory," IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, pp. 287-298, April 1997.
[20] Z. He and S. K. Mitra, "A unified rate-distortion analysis framework for transform coding," IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, December 2001.
[21] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. on Communications, vol. COM-29, pp. 1799-1808, December 1981.
[22] J. Cooper, J. Boyce, J. Llach, A. Tourapis, P. Yin, and C. Gomila, "Techniques for film grain simulation using a database of film grain patterns," Patent EP 1 809 043 A1, July 2007.
[23] J. Lee, I. Choi, W. Choi, and B. Jeon, "Fast mode decision for B slice," Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), JVT-K021.doc, March 2004.
[24] J. W. Suh, J. Cho, and J. Jeong, "Model-based quarter-pixel motion estimation with low computational complexity," Electronics Letters, vol. 45, June 2009.
[25] J. W. Suh and J. Jeong, "Fast sub-pixel motion estimation techniques having lower computational complexity," IEEE Trans. on Consumer Electronics, vol. 50, pp. 968-973, August 2004.
[26] K. H. Lee, J. H. Choi, B. K. Lee, and D. G. Kim, "Fast two-step half-pixel accuracy motion vector prediction," Electronics Letters, vol. 36, pp. 625-627, 2000.
[27] R. Kleihorst and F. Cabrera, "Implementation of DCT-domain motion estimation and compensation," IEEE Workshop on Signal Processing Systems, pp. 53-62, October 1998.
[28] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for moving picture conferencing," Proc. NTC 81, pp. C9.6.1-9.6.5, November 1981.
[29] D. K. Kwon, M. Y. Shen, and C.-C. J. Kuo, "Rate control for H.264 video with enhanced rate and distortion models," IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, pp. 517-529, May 2007.
[30] E. Y. Lam and J. W. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. on Image Processing, vol. 9, pp. 1661-1666, October 2000.
[31] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, pp. 438-442, August 1994.
[32] M. Schlockermann, "Film grain coding in H.264/AVC," Joint Video Team (JVT), JVT-I034d2.doc, September 2003.
[33] Y. Nie and K.-K. Ma, "Adaptive rood pattern search for fast block-matching motion estimation," IEEE Trans. on Image Processing, vol. 11, pp. 1442-1450, December 2002.
[34] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," International Journal of Computer Vision, vol. 2, pp. 283-310, 1989.
[35] P. R. Hill, T. K. Chiew, D. R. Bull, and C. N. Canagarajah, "Interpolation free subpixel accuracy motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, pp. 1519-1526, December 2006.
[36] P. Yin, H. Cheong, A. Tourapis, and J. Boyce, "Fast mode decision and motion estimation for JVT/H.264," IEEE Intl. Conference on Image Processing, pp. 853-856, September 2003.
[37] L. R. Rabiner and B. Gold, "Theory and Application of Digital Signal Processing," Englewood Cliffs, NJ: Prentice Hall, 1975.
[38] R. Reininger and J. Gibson, "Distributions of the two-dimensional DCT coefficients for images," IEEE Trans. on Communications, vol. COM-31, pp. 835-839, June 1983.
[39] J. Ribas-Corbera and D. L. Neuhoff, "Optimizing motion-vector accuracy in block-based video coding," IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, April 2001.
[40] R. Rao and G. Bjontegaard, "Complexity analysis of multiple block sizes for motion estimation," ITU-T VCEG, VCEG-M47.doc, April 2001.
[41] M. Song, A. Cai, and J. Sun, "Motion estimation in DCT domain," IEEE Int. Conf. on Communication Technology Proceedings, vol. 2, pp. 670-674, May 1996.
[42] A. D. Stefano, B. Collis, and P. White, "Synthesising and reducing film grain," Journal of Visual Communication and Image Representation, vol. 17, pp. 163-182, June 2005.
[43] G. J. Sullivan, P. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," SPIE Conf. Applications of Digital Image Processing XXVII, vol. 5558, pp. 454-474, August 2004.
[44] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, pp. 74-90, November 1998.
[45] M. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: principles and standardization into JPEG-LS," IEEE Trans. on Image Processing, May 2000.
[46] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC coding standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[47] X. Yi, J. Zhang, N. Ling, and W. Shang, "Improved and simplified fast motion estimation for JM," Joint Video Team (JVT), JVT-P021.doc, July 2005.
[48] Y. Lee, K. Han, and G. J. Sullivan, "Improved lossless intra coding for H.264/MPEG-4 AVC," IEEE Trans. on Image Processing, vol. 15, September 2006.
[49] R. W. Young and N. G. Kingsbury, "Frequency-domain motion estimation using a complex lapped transform," IEEE Trans. on Image Processing, vol. 2, January 1993.
[50] Z. Chen, P. Zhou, and Y. He, "Fast integer pel and fractional pel motion estimation for JVT," Joint Video Team (JVT), JVT-F017r1.doc, December 2002.
[51] Q. Zhang, Y. Dai, and C.-C. J. Kuo, "Direct techniques for optimal subpel motion resolution estimation and position prediction," IEEE Trans. on Circuits and Systems for Video Technology, 2010.
[52] Q. Zhang, Y. Dai, and C.-C. J. Kuo, "Lossless video compression with residual image prediction and coding (RIPC)," IEEE International Symposium on Circuits and Systems, May 2009.
[53] Q. Zhang, S. H. Kim, Y. Dai, and C.-C. J. Kuo, "A second-order-residual (SOR) coding approach to high-bit-rate video compression," SPIE Electronic Imaging, January 2010.
[54] Q. Zhang, S. H. Kim, Y. Dai, and C.-C. J. Kuo, "Multi-order-residual (MOR) video coding: framework, analysis and performance," SPIE Visual Communications and Image Processing, July 2010.
Abstract
This research focuses on two advanced techniques for high-bit-rate video coding: 1) subpel motion estimation and 2) residual processing.