DESIGN AND PERFORMANCE ANALYSIS OF LOW COMPLEXITY ENCODING ALGORITHM FOR H.264/AVC

by Changsung Kim

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2006

Copyright 2006 Changsung Kim

Dedication

To my family, who offered me unconditional love and support throughout the course of my Ph.D. study.

Acknowledgements

I am thankful to Prof. C.-C. Jay Kuo with all my heart, since he gave me the chance to pursue my Ph.D. degree at USC. From the idea-formation stage to the final draft of this thesis, I owe an immense debt of gratitude to him. I am also greatly influenced by his enthusiasm and curiosity toward new research areas, his passionate yet rigorous attitude toward the pursuit of innovative solutions to challenging problems, as well as his educational philosophy and sincere support of students. Through countless discussions and cross-validations with him, I could begin with some raw ideas and gradually polish them into concrete and presentable results. This interaction made my graduate study an enjoyable and fruitful experience. His sound advice and careful guidance have been invaluable to me. I believe that his words will continue to help me tremendously in my future research career.

I would also like to express my great appreciation to Prof. Antonio Ortega. I started my graduate research with him, and he led me to explore many interesting problems in video coding and showed me how to get initial ideas, validate them, and develop test and realization plans. His time and efforts have made me a much more mature researcher, and they are highly appreciated.

I would also like to thank Dr. Hsuan-Huei Shih of MAVs Lab. Inc. for his valuable suggestions on the fast intra mode decision work presented in Chapter 3 of the thesis. The work on risk-minimizing intra/inter mode decision in Chapter 4 was initiated by my collaboration with Dr. Feng-Tsun Chien of the National Chiao Tung University in Taiwan. Finally, the work on fast motion estimation with block-size adaptive referencing (BAR) in Chapter 5 was done in collaboration with Dr. Ma of the University of Southern California. Thanks also go to Dr. Jun Xin, Dr. Anthony Vetro, and Dr. Huifang Sun of the Mitsubishi Electric Research Laboratory (MERL), Cambridge, Massachusetts, whose generosity in offering me the summer internship in 2005 will always be remembered.

Finally, it is never enough for me to express my appreciation to my family. I especially thank my parents for everything they have done for me. It is their spirit and their continuing support that helped me achieve one of my most important milestones. Thank you all very much.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Significance of the Research
1.2 Feature Based Intra/Inter Coding Mode Selection
1.2.1 Related Work
1.2.2 Outline of the Proposed Solution
1.3 Fast Motion Estimation via Block-size Adaptive Referencing
1.3.1 Related Work
1.3.2 Outline of the Proposed Solution
1.4 Contributions of the Research
1.5 Outline of the Dissertation
2 Research Background
2.1 H.264 Video Coding Standard
2.2 Intra/Inter Mode Decision
2.3 Long-term Memory Motion Compensated Prediction
3 Fast Intra Mode Decision for H.264
3.1 Introduction
3.2 Feature Selection
3.2.1 Spatial Domain Feature Selection
3.2.2 Transform Domain Feature Selection
3.2.3 Sequential Mode Filtering
3.3 Mode Filtering with Joint Features
3.3.1 Rank-Ordered Joint Features
3.3.2 Screen Window Size Selection
3.4 Final Mode Selection
3.4.1 Feature-based Method
3.4.2 RDO-based Method
3.5 Experimental Results
4 Feature-based Intra/Inter Coding Mode Selection for H.264/AVC
4.1 Introduction
4.2 Simple & Effective Feature Selection
4.2.1 Intra Mode Feature
4.2.2 Inter Mode Feature
4.2.3 Motion Activity Classification
4.3 3D Feature Space Partitioning
4.4 Coding Mode Prediction in Risk-Free Region
4.5 Coding Mode Prediction in Risk Region
4.6 Probability Density Estimation
4.6.1 Quantized Cells in 3D Feature Space
4.6.2 Non-parametric Likelihood Estimation
4.6.3 Parametric Likelihood Estimation
4.7 Experimental Results
5 Fast H.264 Motion Estimation with Block-size Adaptive Referencing (BAR)
5.1 Introduction
5.2 Analysis of Rate-Distortion-Complexity Tradeoff
5.2.1 Effect of Sequence Characteristics on RD Coding Gain
5.2.2 Effect of Block Sizes on RD Coding Gain
5.3 Block-size Adaptive Referencing
5.3.1 Set Based Block-size Adaptive Referencing
5.4 Proposed Frame-level BAR Set Selection Algorithm
5.4.1 RD Gradient Estimation
5.4.2 Quality-to-Bit Rate Ratio (QBR) Prediction
5.4.3 Modeling of Increased Rate
5.5 Experimental Results
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
Reference List

List of Tables

3.1 Rate and distortion performance comparison [QP=10∼40]
3.2 Computational complexity comparison [QP=10∼40]
4.1 Expectation Maximization for Gaussian Mixture Model
4.2 Bayes risk minimized coding mode prediction algorithm
4.3 Rate comparison [R: RDO, NP: Non-parametric, P: Parametric]
4.4 Distortion comparison [R: RDO, NP: Non-parametric, P: Parametric]
4.5 Complexity comparison [R: RDO, NP: Non-parametric, P: Parametric]
5.1 The seven sets in the BAR scheme
5.2 Rate comparison [R: RDO, P: Proposed]
5.3 Distortion comparison [R: RDO, P: Proposed]
5.4 Computational speedup [N: n̄, T: T_Prop/T_RDO]

List of Figures

1.1 Coding efficiency evolution of video coding standards in terms of bit rates
2.1 The block diagram of the H.264 encoder structure based on the motion compensated hybrid coding technique
2.2 Motion compensated hybrid coding
2.3 H.264 intra/inter mode decision scheme
2.4 The intra/inter prediction modes for an MB: (a) 16×16 intra modes, (b) 4×4 intra modes, (c) inter modes
2.5 Long-term memory motion compensated prediction based on variable block-size and sub-pel accuracy motion estimation
3.1 The posterior error probability calculated based on (a) the definition in (3.2), (b) the definition in (3.4), (c) the cumulative histogram with respect to the SAD value, and (d) the cumulative histogram with respect to the SATD value
3.2 An example to illustrate the relationship between the 9 candidate modes (m_0 ... m_8) and their ranked SAD and SATD values
3.3 The histogram of RDO modes in the rank-ordered joint feature space for test sequences "Akiyo" (left) and "Foreman" (right)
3.4 The cumulative histogram of RDO modes as a function of K, where the screening window is of size K×K, for test sequences "Akiyo" (left) and "Foreman" (right)
3.5 The average rate (the x-axis) and the average distortion increase (the y-axis) for the QCIF Mobile sequence of 300 frames
3.6 The two line segments labeled by solid thick lines give the distortion model as a function of the quantization error: (a) Akiyo QCIF, (b) Stefan QCIF
3.7 The predicted rate model (circled line) and actual bit rate (solid curves): (a) Akiyo and (b) Stefan
3.8 Comparison of (a) the R-D performance and (b) the computational complexity for the Akiyo QCIF sequence
3.9 Comparison of (a) the R-D performance and (b) the computational complexity for the Foreman QCIF sequence
3.10 Comparison of (a) the R-D performance and (b) the computational complexity for the Table Tennis QCIF sequence
3.11 Comparison of (a) the R-D performance and (b) the computational complexity for the Susie D1 sequence
3.12 Visual quality comparison of reconstructed QCIF frames of (a) Foreman and (b) Mobile using the RDO method (the upper row) and the proposed method (the lower row), where QP=16, 28 and 34 for the left, middle and right columns, respectively
4.1 The feature based intra/inter mode decision scheme using the risk minimization criterion
4.2 (a) Large Diamond Search Pattern (LDSP) and (b) Small Diamond Search Pattern (SDSP)
4.3 The decision error probability versus the frame-skip number for the sequences Akiyo, Foreman and Stefan
4.4 Cumulative histogram of macroblock distribution using (a) macroblocks of low motion activities, (b) macroblocks of medium motion activities, (c) macroblocks of high motion activities, and (d) all macroblocks
4.5 Illustration of the partition of the 3D feature vector space
4.6 The joint PDF of intra/inter features for correct decisions, including both intra and inter modes; black dots represent the scattered distribution of erroneous decision results along the diagonal boundary with the feature difference Δf = 0
4.7 The risk and risk-free region partitioning as a function of the feature difference Δf
4.8 Illustration of a re-mapping of the risk region
4.9 Illustration of the partition of the 3D feature vector space, where green dots, orange dots and red circles represent the intra mode, the inter mode and wrong decisions, respectively
4.10 Estimated parameters N = 9 and M = 8 for QP = 28
4.11 Surface plot of conditional probability and ellipse plot for [mean, covariance] of component Gaussian pdfs overlapped with data risk at motion vector class 7 and QP=22: (a,c) f(F|m_0), (b,d) f(F|m_1)
4.12 Variation of (a) the R-D performance, (b) the computational complexity (square and triangular lines represent the number of single mode decisions) and (c) the encoding time saving as L*_p increases for the QCIF Table Tennis sequence
4.13 Variation of (a) the R-D performance and (b) the computational complexity as L*_p increases for the QCIF Foreman sequence
4.14 Performance comparison in (a,c) rate-distortion and (b,d) computational complexity as L*_p increases for the QCIF Foreman sequence when the frame-skip is set to 0 and 1, respectively
4.15 Comparison of (a) the R-D performance and (b) the computational complexity for the QCIF Carphone sequence
4.16 Comparison of (a) the R-D performance and (b) the computational complexity for the QCIF Stefan sequence
5.1 The block diagram of the proposed block-size adaptive referencing (BAR) scheme
5.2 The normalized rate-distortion coding gain as a function of the reference frame number for two test sequences, where Q_p = 28 and B_s = 16×16
5.3 The plot of P(ΔRD = 0) as a function of QP parameterized by the block size, where the frame skip is 2 and the maximum number of references is 5
5.4 Comparison of the RD curves of several different BAR sets applied to the Foreman CIF sequence
5.5 Comparison of the computational complexity saving curves of 7 BAR sets applied to the Foreman CIF sequence
5.6 The relationship between the RD gradient (α) and the increased rate (ΔR*_n)
5.7 The verification of the RD gradient model at the frame level for three different test sequences over a wide range of quantization parameters
5.8 The relationship between the RD gradient (α) and the QBR value (θ)
5.9 The structure of the normalized LMS adaptive filter
5.10 Comparison of predicted and actual frame-by-frame results using the normalized LMS adaptive filter for the Stefan CIF sequence with frame skip equal to two: (a) the rate, (b) the quality measure and (c) the corresponding errors
5.11 The probabilistic distribution of the regression error and the approximating Gaussian distribution
5.12 Modeling of increased bit rates for different BAR schemes
5.13 The flowchart of the proposed block-size adaptive referencing (BAR) algorithm for multiple-reference motion estimation
5.14 The performance degradation of the proposed BAR scheme as the threshold (T) increases for the Foreman CIF sequence
5.15 Performance comparison between the JVT reference software JM10.1 with and without the proposed BAR scheme (T = 0.02) in the rate-distortion tradeoff for the Stefan and Foreman CIF sequences
5.16 The performance degradation of the proposed BAR scheme (with T = 0.02) in terms of the normalized encoding time ratio for the Stefan and Foreman CIF sequences
5.17 Performance comparison between H.264 and the proposed BAR scheme (T = 0.02) in terms of the average BAR set index for the Stefan and Foreman CIF sequences

Abstract

The emerging H.264 video coding standard provides state-of-the-art video coding techniques and achieves a significant coding performance improvement over existing standards. The general objective of this research is to reduce the H.264 encoder complexity without significant RD performance degradation. In particular, we have focused on complexity reduction for the coding mode decision and motion estimation, which are the modules that demand the highest computational cost in the encoder.

For the improved coding mode decision, we develop an efficient two-stage framework. First, a coarse-level intra/inter coding mode prediction is performed to decide the mode class. The proposed algorithm calculates three features and maps them into one of three regions, namely the risk-free, risk-tolerable, and risk-intolerable regions. Depending on the mapped region, we can apply algorithms of different complexities for the final mode decision. Based on the coding mode decision result, either the intra predictive or the inter predictive coding process proceeds in a macroblock-adaptive manner.
In the case of intra predictive coding, we propose a fast intra mode decision scheme for the fine-level mode decision. For inter predictive coding, a fast multiple reference assignment scheme for efficient motion estimation is proposed. First, the dynamic block-size effect on time-varying sequence characteristics is quantified by its impact on the RD performance degradation. The block-size effect on the RD coding gain is exploited to develop a new algorithm, called the block-size adaptive referencing (BAR) scheme, that assigns a different number of references in a block-size adaptive manner. The proper BAR scheme can be chosen to minimize the number of reference frames while keeping the expected RD loss under a target level. Finally, fast motion search is performed within the selected references in a block-size adaptive manner, and the model parameters are adjusted using the normalized LMS adaptive filter to accommodate time-varying sequence characteristics. The proposed algorithms save a considerable amount of the computational complexity of the H.264 reference code with negligible degradation in the RD performance.

Chapter 1 Introduction

1.1 Significance of the Research

Due to the emergence of high quality multimedia contents and services such as high definition (HD) TV or HD-DVD, the demand for high performance video coding standards has increased in recent years. Enhanced coding efficiency is therefore essential to enable the transmission of more video channels or higher quality multimedia contents through various types of digital transmission channels, such as the cable, xDSL, UMTS or 3G cellular delivery infrastructures that offer much lower data rates than broadcast channels.

H.264 is the latest evolutionary video coding standard, developed by the Joint Video Team (JVT) of ITU-T and MPEG [48, 33] to meet the need for enhanced coding efficiency. It has evolved from the development of the ITU-T H.261, H.262 (MPEG-2) and H.263 video coding standards and the later enhancements of H.263 known as H.263+ and H.263++. Throughout this evolution, continued efforts have been made to maximize coding efficiency, as shown in Fig. 1.1, while dealing with the diversification of network types and their characteristic formatting and loss/error robustness requirements [44, 12]. H.264 offers a significantly improved coding gain compared with prior coding standards such as MPEG-2, MPEG-4 ASP and H.263+ over a wide range of bit rates for a broad variety of applications. However, there is a price to pay; namely, the computational complexity of the encoder increases greatly.

Figure 1.1: Coding efficiency evolution of video coding standards in terms of bit rates.

On one hand, H.264 has improved the coding gain significantly by employing a rich set of coding modes [44] with advanced encoding techniques. Among those techniques, it is well known that the coding mode decision and the long-term memory motion compensated prediction (LTMCP) are the most beneficial techniques in the rate-distortion (RD) sense [58, 49]. The coding mode decision algorithm is recognized to be one of the main factors that contribute to the success of H.264. It is reported in [16] that H.264 intra-frame coding outperforms the JPEG-2000 still image compression standard due to this feature. Also, LTMCP is very effective due to reasons such as the repetitive nature of motion, camera motion, uncovered background, and changes in pixel values caused by occlusion, shadowing and lighting variation.
On the other hand, the major computational bottlenecks of H.264 lie in the coding mode decision and motion estimation [49] for each coding unit, called the macroblock. In practice, the H.264/AVC encoder complexity has increased because the encoder is required to test all possible coding modes by evaluating the cost associated with each of the intra/inter modes and then selecting the best mode with the minimum cost. The cost is usually defined to be the Lagrangian RD cost, which in general demands considerable effort to compute. Also, the encoder complexity has been greatly boosted by the motion estimation process, since its search space is allowed to extend to multiple reference frames on top of the variable block sizes supported.

In this research, we focus on low complexity encoder design for H.264/AVC and propose several new techniques for fast coding mode decision and a low complexity motion search method to achieve a good tradeoff between coding performance and complexity. The encoder complexity is critical in delivering video over wired/wireless networks as well as in video capturing with hand-held devices such as digital still cameras and digital video camcorders. Thus, it is highly desirable to reduce the encoder complexity as much as possible.

1.2 Feature Based Intra/Inter Coding Mode Selection

As an effort to find the proper intra/inter predictive coding modes, the current reference code of the H.264 codec employs exhaustive search to determine the optimal modes using the RD optimization (RDO) technique [48]. However, the complexity is too high for many applications, since the encoder has to encode the target block by searching all possible modes exhaustively for the best mode in the RD sense. In this work we focus on the problem of encoder complexity reduction by considering binary coding mode prediction.

1.2.1 Related Work

Some effort has been made in the literature to address the computational complexity problem in fast intra-mode decision [35, 27, 30, 31, 10, 24]. Pan et al. [36] proposed a fast intra mode decision scheme with a pre-processing technique, which measures the edge direction of a given block so as to reduce the number of probable modes for complexity reduction. This method is about 20∼30% (or 55∼65%) faster than the RDO method at the cost of 2% (or 5%) extra bits. Jeon and Lee [21] proposed another fast intra mode decision scheme, where the encoding speed is approximately 30% faster than that of the RDO method, which is not quite sufficient in practice.

Recently, a large body of research on fast inter mode decision [59, 60, 58, 55, 39, 26] has been reported. Relatively few results have been reported in the area of fast intra/inter mode decision, i.e., deciding early whether the intra or the inter predictive mode should be used for a given macroblock of P or B frames.

Chen et al. [7] examined a model-based intra/inter mode selection scheme based on simple features, where the costs of intra- and inter-coding were modeled by the variance and the sum of absolute differences (SAD) of a macroblock, respectively. The information of motion vectors and quantization parameters was sometimes taken into account in [7]. Turaga and Chen [46] developed a classification-based mode decision scheme using the maximum likelihood (ML) criterion to facilitate video transmission over networks. The above two schemes are, however, not suitable for intra/inter mode decision in the H.264 reference code, since their selected features are not accurate enough to provide efficient mode prediction.
Jagmohan and Ratakonda [20] proposed a supervised binary mode classification scheme using a decision tree. They used the down-sampled SATD (sum of absolute transform differences) values to form a feature space and optimally partitioned it by minimizing the mis-classification rate. However, the RD performance loss incurred by a wrong decision was not considered. More recently, a simple feature based mode decision algorithm was proposed in [27]. Basically, it always performs the inter mode decision first, and then checks whether to continue searching for the intra mode by comparing the temporal correlation (indicated by the average residual rate) and the spatial correlation (indicated by the sum of boundary pixel errors). In general, this algorithm works well when the inter mode is most likely the dominant one. However, there could be a noticeable RD performance loss caused by an erroneous decision when the feature difference is not accurate enough. It is worthwhile to emphasize that a wrong decision may or may not be critical depending on the resultant RD performance loss. This issue was not discussed in [27].

1.2.2 Outline of the Proposed Solution

In this work, we develop an efficient coarse-to-fine framework to address the excessive computational cost of coding mode decision. For the coarse-level decision, we present an efficient binary inter/intra mode decision scheme using the Bayes risk-minimization criterion in a multidimensional feature space. The objective of fast intra/inter coding mode prediction is to reduce the computational complexity by deciding the most probable intra/inter mode at an early stage. Obviously, there is a risk of making the wrong prediction in this computation-saving effort. Thus, the management of the prediction risk, which is quantified in this work by the averaged RD performance degradation (rather than by the mis-classification rate alone), is critical. The proposed algorithm adopts three simple features to estimate the temporal correlation, the spatial correlation and the motion activity, respectively, and then performs the binary mode decision in the 3D feature space.

Depending on the expected RD loss, the 3D feature vector space is further partitioned into three regions: the risk-free, risk-tolerable, and risk-intolerable regions. Depending on the region where the feature vector of a macroblock is located, we can apply mechanisms of different complexity for the final coding mode decision. Once the coding mode class is determined, only the intra or the inter mode decision proceeds at the fine level of decision. As a cost-effective way for the fine-level decision, we present a simple yet effective fast mode decision algorithm for H.264 intra prediction using spatial and transform domain features of the target block jointly. This method is able to filter out the majority of candidate modes so that we only have to focus on 2∼3 modes for the final mode selection. To justify the use of joint features, two quantities are measured and analyzed: the posterior error probability and the average RD loss. Furthermore, to speed up the RDO process in the ultimate mode selection, an RD model is developed.

1.3 Fast Motion Estimation via Block-size Adaptive Referencing

1.3.1 Related Work

Recently, several flexible fast motion search algorithms have been introduced to reduce the complexity of H.264/AVC. First, there is a study on motion complexity reduction based on efficient initial search centers and early termination of motion estimation [51].
From a similar perspective, a new merge and split algorithm was proposed in [61] to exploit the hierarchical motion vector dependency between different block sizes and thereby reduce the search space. Also, there have been efforts to reduce the complexity of LTMCP based on the correlation between motion vectors among different reference frames [41, 6].

As to search-pattern based methods, there are many well-known algorithms such as the enhanced predictive zonal search (EPZS) [45], the unsymmetrical-cross multi-hexagon-grid search [8], the prediction-based directional sub-pel motion estimation [53] and the recent-biased three dimensional extension of diamond search [42]. Most of them follow a structure consisting of three components: (1) initial predictor selection, (2) early termination, and (3) final prediction refinement. It is possible to employ multiple search patterns around a local search center, which is determined initially by evaluating a number of spatial and temporal prediction vectors or the up-layer motion vector prediction [61]. Then, several prediction refinement steps are conducted with criteria to terminate the search process. To remove redundant motion vector candidates, a successive elimination algorithm [54] was proposed in [2, 25] to reduce the computational burden in the distortion measure (SAD) and the transform, respectively. Generally speaking, these algorithms work well for the case of a single reference frame. If the number of reference frames increases, the complexity is still high.

Our research is motivated by the observation that it is possible to assign multiple references adaptively, depending on the block size and the time-varying sequence characteristics, to reduce the motion estimation complexity. It was shown in [22] that motion coherence can be used to predict the temporal search range by modeling the relationship between the RD coding gain and the required computational complexity. Also, the histogram similarity was exploited in [34] to select the best subset of multiple reference frames.

1.3.2 Outline of the Proposed Solution

For efficient motion estimation in terms of the coding performance and complexity tradeoff, we propose a new way of using multiple references in a block-size adaptive manner. We found that the dynamic block-size effect is related to the time-varying sequence characteristics, and its impact on the RD performance degradation can be quantified. Based on this observation, we propose a new method that assigns a different number of references in a block-size adaptive manner to reduce the complexity of motion search.

First, a new block-size adaptive referencing (BAR) set is defined to exploit the block-size effect on the RD coding gain. The RD gradient is employed to build a relationship between the sequence characteristics and the RD coding gain. To quantify the expected RD performance degradation of each set with respect to different sequence characteristics, a model of the RD gradient function can be built via non-linear regression. Then, a proper BAR scheme can be determined to minimize the number of reference frames while keeping the expected RD loss under a target level. Finally, fast motion search is performed based on the selected references in a block-size adaptive manner.

1.4 Contributions of the Research

We have clearly identified the complexity bottlenecks in the H.264 encoding process.
Focusing on encoder complexity reduction, significant results have been accomplished for coding mode decision at the cost of tolerable RD performance degradation, and for fast motion estimation over multiple references. The major contributions of the research conducted in this thesis are summarized below.

• Development of a new fast intra mode decision algorithm.

– We discuss simple yet effective feature selection schemes, i.e. the SAD (sum of absolute differences) and SATD (sum of absolute transform differences), and show how these features correlate with the true RDO mode for fast intra mode decision. The discussion leads to an early termination scheme as well as the utilization of a joint rank-order filtering method to filter out less likely candidates.

– To evaluate the performance of the proposed fast intra prediction algorithm, the new algorithm has been integrated with the H.264 JM7.3a codec and compared with the RDO mode decision of H.264 in terms of rate, distortion and complexity performance metrics. Experimental results show that the speedup factor of the proposed fast intra mode decision algorithm is about ten times as compared with the RDO-based method, with negligible R-D performance degradation.

• Development of a new fast intra/inter mode decision algorithm.

– While there is some research on complexity reduction for either fast intra or fast inter mode decision for H.264, there is little work on fast inter/intra mode decision making. In Chapter 4, we propose a new fast (binary) inter/intra mode decision scheme by minimizing the expected risk in the sense of R-D performance degradation.

– For the proposed intra/inter decision scheme based on the Bayes-risk minimized criterion, a low complexity partitioning algorithm is developed for the risk region based on product quantization and on an in-depth statistical analysis of the macroblock distribution in each of the partitioned risk regions. It is worthwhile to note that the proposed partitioning scheme can produce different partitioned regions adaptive to the given tolerance threshold for RD performance degradation. This contributes to an efficient decision of quantized cells in the 3D feature vector space for the non-parametric pdf estimation method.

– The RD performance as well as the computational complexity saving of two different probability density estimation methods (parametric and non-parametric) has been analyzed for a wide range of test sequences with different resolutions and frame-skips.

– The proposed Bayes risk minimized intra/inter decision scheme has been integrated with the H.264 JM7.3a codec and compared with the RDO mode decision of H.264 in terms of coding bit rates, the average PSNR, and the modular/total encoding time for the test sequences recommended in [38].

• Development of a new fast motion estimation based on block-size adaptive referencing (BAR).

– The major contribution is to identify the block-size effect on multiple references in terms of the RD coding gain and to quantify the RD performance degradation caused by the reduced search space, which depends on the time-varying sequence characteristics. Also, we develop a simple yet effective BAR algorithm as a flexible way of referencing. The BAR scheme saves a significant amount of computational complexity while achieving nearly identical RD performance to fast motion search (say, UMHexagonS [8]) using all available references.
– The expected RD performance degradation due to the use of fewer references in the proposed BAR scheme is quantified by an RD gradient regression model, which is obtained using a non-linear least squares criterion. A simple yet effective way to choose the proper BAR scheme is proposed to minimize the number of reference frames while keeping the expected RD loss under a target level. It is also observed that the frame-level RD gradient is correlated with the quality-to-bit rate ratio (QBR). Based on this observation, the RD gradient is formulated as a function of QBR and the quantization parameter.

– To estimate QBR before the actual encoding process, it is necessary to predict the quality (in terms of PSNR) and the bit rate of the current frame on the fly. A method called the normalized LMS adaptive filter is employed in this research to predict the frame-level rate and PSNR accurately, which are modeled as wide sense stationary auto-regressive processes.

1.5 Outline of the Dissertation

The rest of this thesis is organized as follows. The background and an overview of the proposed binary coding mode prediction algorithm and the set based BAR algorithm, covering H.264 intra/inter mode decision and long-term memory motion compensated prediction, are given in Chapter 2.

A new fast intra mode decision algorithm is presented in Chapter 3, where we develop a novel scheme to achieve suboptimal coding performance using low cost features, i.e. the SAD (sum of absolute differences) and SATD (sum of absolute transform differences), for intra-predictive mode selection. In particular, we propose a feature-based unlikely mode filtering algorithm based on a joint rank-order table. This filtering process is accurate, so that we can focus on a small number of candidate modes for the final mode decision. Finally, experimental results are given to show the significant complexity reduction at the cost of negligible RD performance degradation.

A fast intra/inter (binary) mode decision algorithm is presented in Chapter 4, where we propose a way to decide the intra/inter predictive coding mode by minimizing the expected R-D performance degradation, which is treated as the Bayes risk. We perform experiments to compare the Bayes risk minimized decision algorithm with the H.264 RDO method.

A fast motion search algorithm based on the BAR scheme is presented in Chapter 5. First, we investigate three key effects related to the rate-distortion-complexity tradeoff in temporal search range extension. The BAR scheme is then introduced and its impact on the RD coding gain is analyzed. The proposed algorithm is designed to assign a different BAR set to each frame while keeping the RD performance degradation low. Experimental results in terms of the RD performance and the speedup factor are provided to illustrate the efficiency of the proposed BAR scheme.

Concluding remarks and future research work are given in Chapter 6.

Chapter 2 Research Background

2.1 H.264 Video Coding Standard

The emerging H.264 video coding standard is developed to provide more effective coding solutions for interactive storage (e.g. DVD), broadcasting, conversational services, video-on-demand or multimedia streaming services, and multimedia messaging services (MMS) over networks (e.g. the Ethernet, LAN, DSL, wireless and mobile networks, etc.). Handling such a large variety of applications and networks imposes a major challenge on the design of the H.264 video codec.
To address the need for flexibility, the H.264/AVC design covers the video coding layer (VCL), which is designed to efficiently represent the video content, and the network adaptation layer (NAL), which formats the VCL representation of the video and the associated header information in a manner appropriate for conveyance by a variety of transport layers or storage media, as shown in Fig. 2.1.

Figure 2.1: The block diagram of the H.264 encoder structure based on the motion compensated hybrid coding technique.

As shown in Fig. 2.2, the fundamental codec structure of H.264 is the motion compensated hybrid coding module, which is similar to most previous video coding standards such as H.261, H.263+ and MPEG-1/2/4. The new features of the H.264 design that enable higher coding efficiency are summarized below from nine different perspectives.

1. Transform perspective: The H.264/AVC design is based primarily on a 4×4 integer transform whose inverse transform has an exact-match property. This allows the encoder to represent signals in a more locally-adaptive fashion, which reduces artifacts known as "ringing". It can also be used to reduce the "drift" effect due to the mismatch problem.

2. Motion compensation perspective: To boost the motion compensated prediction performance, a hierarchical variable block-size motion compensation scheme is performed based on quarter-pel accuracy motion vectors (MVs), which are obtained by searching multiple previously coded reference pictures. Weighted prediction is also employed to improve coding efficiency for scenes containing fades.

3. Coding mode decision perspective: More flexibility in the selection of motion compensation block sizes and shapes is allowed for inter prediction. Directional spatial prediction is employed for intra coding. This improves the quality of intra coding and also allows prediction from neighboring areas that were not coded using intra coding. H.264/AVC also includes an enhanced motion inference method known as "skipped" and "direct" motion compensation, which improves further on the "direct" prediction in MPEG-4 Visual.

4. Entropy coding perspective: Two entropy coding methods are adopted by H.264/AVC, called CAVLC (context-adaptive variable-length coding) and CABAC (context-adaptive binary arithmetic coding). They both use context-based adaptivity to improve performance relative to prior standards.

Figure 2.2: Motion compensated hybrid coding.

5. Flexible referencing perspective: H.264/AVC allows the encoder to choose the order of pictures for referencing and display with a higher degree of flexibility, constrained only by the total memory capacity. Removal of the restriction also enables the removal of the extra delay associated with bi-predictive coding.

6. Subjective quality perspective: The in-loop deblocking filter is adopted to reduce blocking artifacts as well as to improve the resulting video quality.

7. NAL perspective: Each syntax structure in H.264/AVC is placed in a logical data packet called the NAL unit.
The NAL unit syntax structure allows greater customization of the way the video content is carried, in a manner appropriate for each specific network. H.264 employs a parameter set structure which enables separation of the key header information for a more flexible and specialized representation. Also, to improve the end-to-end delay in real-time applications, H.264 provides arbitrary slice ordering (ASO) that can be used for networks with an out-of-order delivery behavior (e.g. internet protocol networks).

8. Error resilience perspective: Flexible macroblock ordering (FMO) is a new feature that partitions a picture into regions called slice groups, with each slice being an independently-decodable subset of a slice group. When used effectively, FMO can significantly enhance robustness to data losses. As another means of enhancing robustness to data loss, the H.264/AVC design allows an encoder to send redundant representations of regions of pictures, enabling a representation of regions whose primary representation has been lost during data transmission.

9. Data partitioning & synchronization perspective: Data partitioning is used for priority transmission, since some information is more important than other information in representing the video content. The H.264/AVC design also includes a new feature consisting of picture types that allow exact synchronization of the decoding process of some decoders with an ongoing video stream produced by other decoders, without penalizing all decoders with the loss of efficiency resulting from sending an I picture. This enables switching of a decoder between representations of the video content using different data rates, recovery from data losses or errors, as well as trick modes such as fast-forward, fast-reverse, etc.

2.2 Intra/Inter Mode Decision

In this section, we review the H.264 intra/inter mode prediction scheme. The H.264 coding mode decision is composed of two major decisions, the inter and the intra mode decisions, as shown in Fig. 2.3. H.264 exploits the spatial and temporal correlations of the underlying video through intra/inter mode prediction [44]. For intra mode prediction, the current macroblock (MB) is predicted from adjacent pixels in the upper and left macroblocks that were decoded earlier. To get a richer set of prediction patterns, H.264 offers 4 prediction modes for 16×16 luma blocks and 9 prediction modes for 4×4 luma blocks, as shown in Figs. 2.4(a) and (b). For the chrominance components (i.e. U and V), there are 4 prediction modes for the U and V chroma blocks. As defined in H.264/AVC, there are three categories of predictive coding modes: the skip mode, the direct mode (B-slice) and the inter mode. For inter mode prediction, seven modes of different sizes and shapes are supported by H.264, as shown in Fig. 2.4(c).

Figure 2.3: H.264 intra/inter mode decision scheme.

Specifically, the H.264 encoder can test all possible modes, the so-called full mode decision, by evaluating the cost associated with each of the intra/inter modes and then selecting the best mode with the smallest cost. The cost is usually defined to be the Lagrangian rate-distortion (RD) cost. In order to estimate the cost, H.264 employs a rate-distortion optimization (RDO) procedure that is given below.
• Initialization: Given the last decoded frame and the MB quantization parameter QP, the Lagrangian multiplier is selected in a mode-dependent way.

• Step 1: Calculate the residuals of the various intra/inter prediction modes. For inter-predictive modes, perform motion estimation within a search range over the multiple reference frames. For intra-predictive modes, directional prediction is applied to calculate the residual.

Figure 2.4: The intra/inter prediction modes for an MB: (a) 16×16 intra modes, (b) 4×4 intra modes, (c) inter modes.

• Step 2: Select the best prediction mode among all possible intra/inter predictive modes by minimizing the following Lagrangian functional:

J(s, c, mode | QP, λ_mode) = SSD(s, c, mode | QP) + λ_mode · R(s, c, mode | QP),  (2.1)

where QP is the quantization parameter, λ_mode is the Lagrange multiplier for mode decision, SSD is the sum of the squared differences between the original block luminance (denoted by s) and its reconstruction c, and R(s, c, mode | QP) represents the number of bits associated with the chosen mode. It includes the bits needed for coding the selected prediction mode and the DCT coefficients of the given block.

By following the above RDO procedure for intra/inter mode decision in H.264, the computational cost is very high. Even without RDO, the complexity is still high since the encoder checks all inter and intra modes. In particular, the RDO procedure for inter modes is more complex than that for intra modes, since the former involves the complexity of a full motion estimation search. When the true mode is one of the intra modes, a large amount of computation for motion search is wasted.

There exist fast algorithms to select the optimal inter-prediction mode [26] and the optimal intra-prediction mode [23] individually. However, little work has been done on developing a fast algorithm that performs fast (binary) intra/inter mode decision. It is desirable to make the coarse-level mode decision about which class of modes (the class of intra-predicted modes or the class of inter-predicted modes) to try at the first stage. The objective of fast intra/inter coding mode prediction is to reduce the complexity by deciding the most probable coding mode at an early stage. Obviously, there is risk in predicting the coding mode prior to the calculation of the RD cost for all possible coding modes. Thus, the management of the prediction risk (i.e. the RD performance degradation) is a critical issue in our research. Then, we can apply the fast algorithms in [26] or [23] to choose the specific mode within each class at the second stage for the fine-level mode decision.

To predict the proper coding mode, it is desirable to employ low cost features to reduce the complexity. Based on our experience, the inter mode is superior to the intra mode when the temporal correlation is stronger than the spatial correlation, and vice versa. Also, we observe that the intra/inter mode decision often depends on the motion activity. Based on these observations, we employ three features that reflect the spatial correlation, the temporal correlation, and the motion activity, respectively, in Chapter 4, and the expected risk is calculated over the 3D feature space. Depending on the probability of erroneous mode selection and the average RD performance loss at the position indicated by the feature vector, we can partition the feature space into three regions, namely the risk-free, risk-tolerable, and risk-intolerable regions.
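Before turning to how the three regions are used, it is worth making the baseline explicit: the full RDO mode decision of Eq. (2.1), which also serves as the fallback when a macroblock cannot be classified cheaply, is simply a loop over every candidate mode. The sketch below is a minimal Python illustration of that loop, not the reference implementation; `encode_with_mode` is a hypothetical stand-in for the encoder path (prediction, transform, quantization and entropy coding), and blocks are assumed to be NumPy-like arrays.

```python
def rdo_mode_decision(block, candidate_modes, qp, lambda_mode, encode_with_mode):
    """Exhaustive mode decision via Eq. (2.1): J = SSD + lambda_mode * R.

    encode_with_mode(block, mode, qp) is a hypothetical stand-in for the real
    encoder path; it must return (reconstructed_block, rate_in_bits).
    """
    best_mode, best_cost = None, float("inf")
    for mode in candidate_modes:
        recon, rate = encode_with_mode(block, mode, qp)
        # SSD between original and reconstructed luminance samples
        ssd = sum((int(s) - int(c)) ** 2 for s, c in zip(block.ravel(), recon.ravel()))
        cost = ssd + lambda_mode * rate  # Lagrangian RD cost
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost
```

It is precisely this per-mode encode-and-measure loop that the coarse-to-fine framework of this thesis tries to avoid whenever the feature evidence is strong enough.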
If the input feature vector lies in the risk-free region, the decision is made based on a simple feature comparison. If the feature vector lies in the risk-tolerable region, the risk-minimizing mode is selected. Finally, if the feature vector is in the risk-intolerable region, a full mode decision process is conducted to prevent significant RD loss. More details are given in Chapter 4.

2.3 Long-term Memory Motion Compensated Prediction

The H.264 long-term memory motion compensated prediction (LTMCP) [49] scheme is reviewed in this section. To obtain a lower motion-compensated residual, the best match is searched in all available reference frames that were decoded earlier and stored in the frame memory. It has been reported that LTMCP can achieve a bit-rate saving of 20-30% (for the Mother & Daughter sequence), which corresponds to an improvement of the reconstruction PSNR of up to approximately 1.5 dB [49].

To get a richer set of partition-block patterns for coding efficiency improvement, H.264 offers seven different block sizes, namely 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4. The current MB in frame t can thus be partitioned into smaller blocks of size down to 4×4, as defined in the H.264 coding standard [38]. Also, sub-pel accuracy MVs such as half- and quarter-pel MVs are supported, as shown in Fig. 2.5.

Figure 2.5: Long-term memory motion compensated prediction based on variable block-size and sub-pel accuracy motion estimation.

On top of this framework, the H.264/AVC reference code is equipped with an RD optimized rate control scheme which brings the encoder performance close to the optimal one. Simply speaking, the rate control scheme is used to select the encoder parameter set that minimizes the distortion subject to a given rate constraint. However, proper selection of the optimal set of parameters is non-trivial, and it usually demands a large amount of computation. The objective of this research is to reduce the encoding complexity while keeping the RD performance as close to that of the full search as possible.

To determine the best MV in the RD sense, an RD model is usually utilized in the motion estimation process. In H.264, the best matching motion vector is found by minimizing the following cost function [38, 60]:

J(m | q, λ_motion) = SA(T)D(s, c(m) | q) + λ_motion · R(m − p | q),  (2.2)

where SA(T)D is the sum of absolute (transformed) differences, s and c are the original and reconstructed luminance values, m and p are the MV and the predicted MV at a chosen reference frame, and q and λ_motion are the quantization parameter and the Lagrange multiplier, respectively.

For the motion search algorithm, there are multiple fast motion search methods used in the H.264 reference code, including the fast full search (FFS), the enhanced predictive zonal search (EPZS) [45], and the unsymmetrical-cross multi-hexagon-grid search (UMHexagonS) [8]. FFS is advantageous in speeding up the search process since the SA(T)D of 4×4 blocks is calculated first and can be re-used for blocks of a larger size. The UMHexagonS algorithm was developed to tackle the local minimum problem in high motion sequences using a large search range together with a sequential search based on multiple search grids such as unsymmetrical-cross, rectangular, and multiple hexagon grids. Basically, this search scheme contains multiple passes, and early termination is used to further speed up the sparse-grid based search process. It is robust with respect to visual quality degradation for high motion/complex texture sequences. Thus, UMHexagonS is employed for performance evaluation in this research.
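To make the cost function of Eq. (2.2) concrete, the following sketch performs a plain integer-pel, single-reference full search over a square window and keeps the RD-cost minimizer. It is a simplified illustration under those assumptions, not the reference search: `distortion` and `mv_bits` are hypothetical helpers standing in for the SA(T)D term and the MV-difference rate. The fast searches mentioned above (EPZS, UMHexagonS) replace the exhaustive double loop with predictor-centered patterns and early termination, but they minimize the same cost.

```python
def best_motion_vector(search_range, pred_mv, lambda_motion, distortion, mv_bits):
    """Full-search sketch of Eq. (2.2): J(m) = SA(T)D(m) + lambda_motion * R(m - p).

    distortion(mv) returns the SA(T)D of the block displaced by mv in the
    reference frame; mv_bits(dx, dy) returns the bits needed to code the MV
    difference. Both are hypothetical stand-ins for the encoder's internals.
    """
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            mv = (dx, dy)
            cost = distortion(mv) + lambda_motion * mv_bits(dx - pred_mv[0], dy - pred_mv[1])
            if cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost
```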
Chapter 3 Fast Intra Mode Decision for H.264

3.1 Introduction

In this chapter, we present a simple yet effective fast mode decision algorithm for H.264 intra prediction using spatial and transform domain features of the target block jointly. This method is able to filter out the majority of candidate modes so that we only have to focus on 2∼3 modes for the final mode selection. To justify the use of joint features, two quantities are measured and analyzed: the posterior error probability and the average rate-distortion loss. Furthermore, to speed up the RDO process in the ultimate mode selection, an RD model is developed.

The proposed mode decision scheme has been integrated with the H.264 JM7.3a codec for performance evaluation. It is compared with the RDO mode decision scheme of H.264 in terms of the computational complexity, the average PSNR and the coding bit rates for several test sequences [15]. Simulation results demonstrate that the RD performance of the proposed algorithms is almost identical to that of the RDO mode decision scheme over a wide range of bit rates, while the proposed algorithm demands only about 7-10% of the complexity of the H.264 RDO method.

3.2 Feature Selection

In this section, we consider both spatial and frequency domain features to filter out unlikely mode candidates.

3.2.1 Spatial Domain Feature Selection

For each prediction mode, we can compute the sum of absolute differences (SAD) between the true and the predicted pixel values as a spatial domain feature. It can be written as

SAD = Σ_{(x,y)∈b_k} |D(x,y)|,  D(x,y) = I(x,y) − P_i(x,y),  (3.1)

where b_k represents the k-th 4×4 block in the current macroblock, and I and P_i represent the true pixel values and the pixel values predicted by the i-th 4×4 intra mode, respectively.

Intuitively speaking, a good mode should lead to a small SAD value. This is called the amplitude test. It is important to study the accuracy of mode selection based on this simple test. To achieve this goal, one way is to examine the probability that a wrong mode is chosen for the current block, given the observation of f(m_SAD). It is referred to as the a posteriori error probability since it contains the information after f(m_SAD) is observed:

Prob(e | f(m_SAD)) = Prob(m_SAD ≠ m_RDO | f(m_SAD)),  (3.2)

where m_SAD and m_RDO are the modes selected based on the optimal SAD and RDO criteria, respectively, and f(m_SAD) represents the SAD value of mode m_SAD. Intuitively speaking, the smaller the value of f(m_SAD), the smaller the probability of an erroneous prediction. This conjecture is studied experimentally in Fig. 3.1(a), where we plot Prob(e | f(m_SAD)) as a function of f(m_SAD). The statistical data are obtained from a collection of three QCIF sequences (Akiyo, Foreman and Table Tennis), each 300 frames long. We see that if the smallest SAD value, i.e. f(m_SAD), is lower than 50, we get an excellent prediction with an error probability lower than 10%. For most SAD values larger than 50, the prediction error is approximately in the range between 30% and 60%. Thus, the prediction based on the smallest SAD value alone is not very accurate.

The posterior error probability figure shows the possibility of early termination at the sacrifice of decision accuracy. Based on experimental results, the RD performance loss is insignificant in most test sequences when the decision error is less than 10%. For this reason, the proposed algorithm employs an early termination rule using the SAD feature given in (3.2).
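The SAD feature of Eq. (3.1) and the selection of m_SAD can be computed as in the short sketch below (NumPy assumed; the nine directional 4×4 predictors themselves are taken as given, since generating them is part of the standard intra prediction process). The early-termination test, whose threshold the next paragraph states explicitly, is included here as a parameter.

```python
import numpy as np

def sad_per_mode(block, predictions):
    """Eq. (3.1): SAD between the original 4x4 block and each prediction P_i.

    `predictions` maps a mode index i to its predicted 4x4 block."""
    return {i: int(np.abs(block.astype(int) - p.astype(int)).sum())
            for i, p in predictions.items()}

def sad_early_termination(sad_by_mode, threshold=50):
    """Pick the minimum-SAD mode m_SAD and signal early termination when its
    SAD falls below the threshold (Delta_SAD in the text)."""
    m_sad = min(sad_by_mode, key=sad_by_mode.get)
    return m_sad, sad_by_mode[m_sad] < threshold
```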
Specifically, if the f(m_SAD) value is less than the threshold Δ_SAD = 50, then we can choose m_SAD as the intra-prediction mode. The usefulness of this early termination rule can be seen by examining the cumulative histogram of blocks with respect to the computed feature value f(m_SAD). For example, under the threshold Δ_SAD = 50, about 38% of blocks in Fig. 3.1(c) have f(m_SAD) ≤ Δ_SAD and thus terminate early, at the risk of less than 10% of making the wrong mode selection.

It is worthwhile to point out that the spatial edge filtering technique proposed in [36] passes through the examined block only once, while the SAD and SATD computations have to be repeated for every possible intra mode in our scheme. However, the edge filtering method always selects the four most probable modes for RDO since it is not accurate enough to pick one or two final modes. The complexity of H.264 intra-mode decision lies primarily in RDO or RD estimation rather than in simple feature calculations like SAD. Our algorithm focuses on reducing the number of RDO evaluations and, consequently, achieves a modular speedup of around 10.

Figure 3.1: The posterior error probability calculated based on (a) the definition in (3.2), (b) the definition in (3.4), (c) the cumulative histogram with respect to the SAD value, and (d) the cumulative histogram with respect to the SATD value.

3.2.2 Transform Domain Feature Selection

Based on the Parseval theorem, the total energy (or the 2-norm) of the difference in the space domain and in the transform domain should be the same. It is reasonable to say that a good prediction should also produce a small value of the sum of the absolute transform coefficient differences (SATD), which is defined by

SATD = Σ_{(x,y)∈b_k} |T{D(x,y)}|,  (3.3)

where D(x,y) is defined in (3.1) and T is a certain 2D orthonormal transform. To compute the SATD feature, T is chosen to be the separable 4-point Hadamard transform along each dimension due to its simplicity and good performance in assisting the mode selection. Note that the Hadamard transform can be implemented with only addition and shift operations, which are computationally efficient. In fact, it is observed that the computation of SATD does not increase the overall computational cost much. Also, it is important to point out that SAD and SATD are generally different, since they are the 1-norms (rather than the 2-norms) of the spatial- and transform-domain signals, respectively.

Following the discussion for SAD, we can analyze the usefulness of the SATD feature by examining the posterior probability of an erroneous decision. Consider the posterior error probability based on the SATD observation:

Prob(e | f(m_SATD)) = Prob(m_SATD ≠ m_RDO | f(m_SATD)),  (3.4)

where m_SATD and m_RDO denote the best SATD mode and the best RDO mode, respectively, and f(m_SATD) denotes the SATD value of mode m_SATD. The posterior error probability is plotted in Fig. 3.1(b) and the cumulative histogram of blocks with respect to the SATD is shown in Fig. 3.1(d). The statistical data are sampled from the same sequences as for the SAD feature, i.e. the Akiyo, Foreman and Table Tennis QCIF sequences of 300 frames. Similarly, we can derive an early termination rule based on the SATD feature.
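A minimal sketch of the SATD computation of Eq. (3.3) follows, using the separable 4-point Hadamard transform described above (unnormalized, since the absolute scale does not matter when ranking modes by SATD). NumPy is assumed.

```python
import numpy as np

# 4-point Hadamard matrix (unnormalized); its rows are mutually orthogonal and
# contain only +1/-1, so the transform needs only additions and subtractions.
H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])

def satd_4x4(block, prediction):
    """Eq. (3.3): SATD of the 4x4 residual D = I - P_i, computed as the 1-norm
    of the separable 2D Hadamard transform H4 * D * H4^T."""
    d = block.astype(int) - prediction.astype(int)
    return int(np.abs(H4 @ d @ H4.T).sum())
```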
Thestatisticaldataaresampled from the same sequences with SAD feature like Akiyo, Foreman and Table Tennis QCIF sequences of 300 frames. Similarly, we can derive the early termination rule 29 basedontheSATDfeature. Inthenextsection,wewillshowthatitisadvantageous to use both SAD and SATD features jointly. 3.2.3 Sequential Mode Filtering Wecanapplytheselectedfeaturesonebyoneincascadetofilteroutunlikelymodes (or to reach an early termination criterion). This is called the sequential filtering process. The effectiveness of the sequential filtering process generally depends on the order of features. According to the above discussion, we have two features, i.e. SAD and SATD. We do not have a conclusion about which feature should go first to guarantee a better filtering performance. However, for the complexity concern, itisworthwhile togowith theSAD featurefirst. The amplitude tests fortheSAD feature as stated in Section 3.2.1 can be applied. Then, we can move to the SATD feature. It is worthwhile to point out that there exist strong correlations between SAD and SATD features. Thus, it is difficult to get an effective filtering result by cascading them in two stages. It is often that we can filter out about 30 to 45% (3 or 4 out of 9) reliably at the first stage. Then, at the second stage, we may filter out, most likely, less than one more unlikely candidate. Thus, we still have about 4 or 5 modes left for final selection. If we apply the RDO technique to the remaining modes, then the complexity reduction is about 50%. In the next section, we will develop a better mode filtering scheme that uses the SAD and SATD features jointly. By using joint mode filtering scheme, we can exploit the mode-priority correlation between two spatial and frequency domain features. 30 3.3 Mode Filtering with Joint Features In this section, we will present an approach to filter out unlikely modes with the SAD and SATD features jointly. 3.3.1 Rank-Ordered Joint Features LetusrankthecomputedSADandSATDvaluesfromthesmallesttothelargestas shown in Fig. 3.2, where the horizontal and the vertical directions show the rank- ordered SAD and SATD features, respectively. A smaller index number denotes a smaller value. For the case of 9 candidate modes, we can obtain a matrix of size 9×9. However, only 9 locations of the 81 empty slots will be filled by a certain candidate mode. As shown in the figure, modem 2 has the smallest SAD value and the second smallest SATD value. SAD S A T D 1 2 3 4 5 6 7 8 9 4 5 6 7 8 9 m 1 m 3 m 5 m 6 m 7 m 8 Screen Window 1 2 3 m 2 m 4 m 0 Figure3.2: Anexampletoillustratetherelationshipbetweenthe9candidatemodes (m 0 ···m 8 ) and their ranked SAD and SATD values. 31 The next step is to choose a small screen window to concentrate. In Fig. 3.2, a screen window of size 3×3 is chosen to cover the three smallest SAD and SATD locations. Modes m 0 , m 2 and m 4 are located inside the window in this example. This means these modes have both small SAD and SATD values. For the modes outside the window, they have at least one larger feature value (with its rank equal to 4 or higher). Since each column or each row can only have one mode, it is easy to see that a screen window of size K ×K can at most contain K modes. A smaller value of K implies a more aggressive filtering scheme. In the next subsection, we will argue thatK = 3isusually agoodchoice. Then, we will have at most 3candidate modes left for further selection. 
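A minimal sketch of the joint screening just described: the nine candidate modes are ranked separately by SAD and by SATD, and only the modes whose two ranks both fall inside a K x K window survive (K = 3 in the proposed scheme). The function and the example values below are illustrative.

def screen_modes(sads, satds, K=3):
    """Return the modes whose SAD rank and SATD rank are both among the K smallest."""
    n = len(sads)
    sad_rank = {m: r for r, m in enumerate(sorted(range(n), key=lambda m: sads[m]))}
    satd_rank = {m: r for r, m in enumerate(sorted(range(n), key=lambda m: satds[m]))}
    return [m for m in range(n) if sad_rank[m] < K and satd_rank[m] < K]

# Since each row and each column of the rank-ordered table holds exactly one mode,
# at most K modes (here 3) can survive the screening.
survivors = screen_modes([41, 55, 12, 90, 33, 70, 64, 88, 77],
                         [230, 260, 150, 400, 210, 350, 300, 390, 370])
# survivors == [0, 2, 4]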
3.3.2 Screen Window Size Selection To justify the screen window size selection, let us examine the distribution of the RDO mode in the rank-ordered joint feature space. Based on the data obtained from two QCIF sequences of 300 frames long, we plot the cumulative histograms of RDO modes in Fig. 3.3. It is clear that most RDO modes fall in the window of 3×3 lowest ranks. Furthermore, we plot the cumulative histogram of RDO modes as a function of K, where K = 1,2,···9 is the screen window size in Fig. 3.4. As shown in the figure, 93-95% of the best mode among the candidates when the 3×3 screen window is selected. On the other hand, increasing the window size more than 3×3 does not improve the overall RD performance much but demands one more additional RDO process. Thus, we can efficiently search the optimal mode by focusing on modes that fall in this region. 32 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0 0.2 0.4 0.6 0.8 SATD SAD 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 0 0.2 0.4 0.6 0.8 SATD SAD Figure 3.3: The histogram of RDO modes in the rank-ordered joint feature space for test sequences “Akiyo” (left) and “Foreman” (right). The other way to justify the screen window size selection is to perform the rate-distortion analysis. Given a specific rank pair (s 1 ,s 2 ) in the joint SAD-SATD feature domain. We can compute the average rate and distortion increase with respect to the RDO mode as ΔR(s 1 ,s 2 ) = N −1 s 1 ,s 2 X i∈Ss 1 ,s 2 [R s 1 ,s 2 (i)−R RDO (i)], ΔD(s 1 ,s 2 ) = N −1 s 1 ,s 2 X i∈Ss 1 ,s 2 [D s 1 ,s 2 (i)−D RDO (i)], where i is the block index, S s 1 ,s 2 is the set of all events in which there exists a Intra-prediction mode in (s 1 ,s 2 ), N s 1 ,s 2 is the cardinality of S s 1 ,s 2 . Specifically, the subscript (s 1 ,s 2 ) represents the feature rank pair such as (SAD rank, SATD rank), forinstance, mode0(DCmode)could beranked asthe second intermsoftheSAD cost and the third in terms of SATD. Then, mode 0 falls in the position of (2,3). Furthermore,S s 1 ,s 2 represents theevent thatthereisamodefallingintherankpair (s 1 ,s 2 ). Ifthemodeisthetrueone, itwillnotresultinanyrateincreaseand/orthe 33 1 2 3 4 5 6 7 8 9 0.75 0.8 0.85 0.9 0.95 K T h e C u m u l a t i v e H i s t o g r a m o f R D O M o d e s 95.65 % 1 2 3 4 5 6 7 8 9 0.7 0.75 0.8 0.85 0.9 0.95 1 K T h e C u m u l a t i v e H i s t o g r a m o f R D O M o d e s 93.46 % Figure3.4: The cumulative histogramofRDOmodesasafunctionofK, where the screening windowisofsizeK×K, fortest sequences “Akiyo” (left)and“Foreman” (right). distortion increase. Otherwise, it will. We use R s 1 ,s 2 (i) and D s 1 ,s 2 (i) to denote the bitrateandthedistortionforblocki, respectively. Please notethatΔR(s 1 ,s 2 )and ΔD(s 1 ,s 2 ) can be viewed as the posterior bit rate increase and distortion increase by choosing the mode associated with (s 1 ,s 2 ) as the final mode. WeplotΔR(s 1 ,s 2 )andΔD(s 1 ,s 2 )asthex-andy-axisinFig. 3.5fortheQCIF Mobile sequence, where the origin denotes the RDO mode and the modes within the lowest 3×3 rankwindow are labelled by the rankorder. We see fromthe figure that (s 1 ,s 2 )= (1,1) is the closest to the RDO mode in the rate-distortion tradeoff. The next closest one is (2,1), and the third one is (3,1), and so on. Similar R-D performance in the rank-ordered joint feature domain has been observed in other test sequences such as Akiyo, Foreman and Stefan. 
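The rate-distortion justification above can be reproduced offline from per-block logs. A rough sketch of accumulating the average rate and distortion increase for every rank pair (s1, s2) is given below; the record layout is hypothetical and only stands in for the statistics the thesis gathers from the training sequences.

from collections import defaultdict

def average_rd_increase(records):
    """records: iterable of (s1, s2, rate, dist, rate_rdo, dist_rdo) tuples, one per
    block in which some candidate mode landed at rank pair (s1, s2).
    Returns {(s1, s2): (mean rate increase, mean distortion increase)}."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for s1, s2, r, d, r_rdo, d_rdo in records:
        cell = acc[(s1, s2)]
        cell[0] += r - r_rdo        # rate penalty relative to the RDO mode
        cell[1] += d - d_rdo        # distortion penalty relative to the RDO mode
        cell[2] += 1
    return {k: (v[0] / v[2], v[1] / v[2]) for k, v in acc.items()}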
3.4 Final Mode Selection Based on the arguments given in Section 3.3.2, we conclude that the candidate modes to be considered are those modes that fall in the lowest 3×3 window. The 34 0 1 2 3 4 5 6 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 Average Rate Increase per Block A v e r a g e D i s t o r t i o n I n c r e a s e p e r P i x e l (1,1) (2,1) (3,1) (3,3) (2,2) (2,3) (3,2) (1,2) (1,3) Figure 3.5: The average rate (the x-axis) and the average distortion increase (the y-axis) for the QCIF Mobile sequence of 300 frames. number of Intra-prediction modes in this region is typically 2 to 3, but no more than 3. In this section, we will focus on the final mode selection. 3.4.1 Feature-based Method If these 2 to 3 candidate modes have quite distinct values in SAD and/or SATD, we can apply early termination rules stated in Section 3.2 to select the best mode. That is, we have two termination criteria: amplitude test for the smallest SAD and SATD values. However, if the smallest SAD (or SATD) is sufficiently large, we cannot make a robust decision based on the features alone. Then, we will turn to the RDO based mode search as given in the next subsection. 3.4.2 RDO-based Method We consider the application of the RDO method to these candidate modes if the feature-based method does not apply. 35 Since the RDO complexity is very high, we propose a new RD model to predict the rate and distortion of 4x4 blocks for further complexity reduction. This RD model is obtained by fitting the curve of observed data and demonstrated to give good performance in simulation. The proposed R-D model only works well when the quantization parameter is greater than 16 (i.e. excluding very high rate video) due to the limitation of the accuracy of the model. If the quantization parameter is less or equal to 16, the conventional RD optimization process as described in Section 2.2 is adopted. Thedistortionmodelisbasedonthequantizationerrorbetween coefficientsand transformedcoefficientsofthetargetblock, ratherthanthequantizationparameter ortheblock-variancebasedquadraticmodel, sincethelatterisnotaccurateenough for the RD estimation of 4×4 blocks in the experiments. Fig. 3.6 shows that the relationship between the actual distortion and the esti- mated distortion. The actual distortion means the distortion between the original and reconstructed blocks. The relationship is approximately linear in the natural logarithmic domain. This leads to the following logarithmic distortion model: ln(D) = μ 1 ·ln(E Q )+η 1 , if E Q ≤e χ 0 μ 2 ·ln(E Q )+η 2 , otherwise where E Q is the quantization error, χ 0 is the coordinate at intersection of two lines and μ k ,η k are parameters to characterize the piecewise linear model as shown in Fig. 3.6. Therefore, we have D =e μ k ·ln(E Q )+η k ,k =1,2 36 With this model, the distortion can be accurately estimated without actual recon- struction. (a) (b) Figure 3.6: The two line segments labeled by solid thick lines give the distortion model as a function of the quantization error: (a)Akiyo.QCIF ,(b)Stefan.QCIF The actual coding bit rate depends on the entropy coding scheme. For the JVT baseline, the entropy codec is CAVLC (Context adaptive variable length code). CAVLC encodes 5 different types of symbols (or called tokens): 1. Thecoefficienttoken(thenumberofcoefficients, thenumberoftrailingones); 2. The sign of trailing ones; 3. The level of nonzero coefficients; 4. The total number of zeros before the last coefficients; 5. The run of zeros. 
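Before turning to the rate model, the piecewise log-linear distortion model above can be stated compactly. The slopes, intercepts and breakpoint in the sketch below are illustrative placeholders; in the thesis they are fitted to training data as in Fig. 3.6.

import math

def estimate_distortion(e_q, mu=(0.9, 1.1), eta=(0.3, -0.5), chi0=4.0):
    """ln(D) = mu_k * ln(E_Q) + eta_k, with the segment k chosen by comparing
    ln(E_Q) against the breakpoint chi0 (i.e. E_Q against e**chi0)."""
    x = math.log(e_q)
    k = 0 if x <= chi0 else 1
    return math.exp(mu[k] * x + eta[k])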
Here, we propose a rate model that predicts the rate of a 4x4 block using the above five tokens. Actually, the rate estimation of 4x4 blocks is difficult since H.264 entropy coding is context adaptive. To get an accurate rate model, we tune 37 the parameters carefully so that the estimated rate is close to the actual rate and the performance is consistent from low motion, smooth texture sequence to high motion, complex texture sequence as shown in Fig. 3.7. First, the rate of the given 4×4 block is estimated by the sum of the bits spent for each token such as b R 4×4 = P x C(x) = P x ω x ·x+α x where x is one of five tokens for the given 4×4 block, C(x) is the weighted bit cost function of x(encoding tokens), and ω x ,α x are corresponding weight and constant for each encoding tokens. Second, the estimated rate b R 4×4 is refined furthermore using the relationship between the true and the estimated rates. The model parameters ω x and α x are obtained empirically based on the observations for the various test sequences. The true bit rate as a function of estimated rate is shown in Fig. 3.7. The curve fitting method is used to approximate the mapping function which maps estimated rate into true rate. For the simple and accurate curve fitting, proposed rate model is obtained in three different regions: R = e k 1 · b R 4×4 , Region A e k 2 ·ln( b R 4×4 −k 3 )+k 4 , Region B e k 5 · b R 4×4 +k 6 , Region C where the parameter vector K = [k 1 ,k 2 ,··· ,k 6 ] in regions A and C are obtained by linear regression and those in region B are obtained by minimizing MSE. Fig. 3.7comparesrealbitrateandthe predicted bitrateusing theproposed ratemodel. They are close to each other. 38 (a) (b) Figure3.7: The predicted rate(circled lin)model andactualbitrate(solid curves): (a)Akiyo and (b)Stefan. In summary, the proposed algorithm can be described using six major steps as given below. Feature based fast Intra mode decision algorithm Step 1. SAD Feature: Calculate SAD cost of nine modes for the given 4×4 block based on (3.1). Step 2. SAD Early Termination: If the minimum SAD value is less than thresholdΔ SAD whichispre-determinedbyposteriorerrorprobability,thendecision process is terminated with the SAD best mode as final mode. Step 3. SATD Feature: Calculate SATD cost of nine modes for the given 4×4 block based on (3.3). Step 4. SATD Early Termination: If the minimum SATD value is less than threshold Δ SATD then decision process is terminated with the SATD best mode as final mode. Step 5. Unlikely Mode Filtering: Sort the modes in ascending order of joint features, SAD and SATD values, by using Quick-sort algorithm. Screen the less 39 likely modes placed outside of 3×3 screen window in joint rank-order table. Step6. RDOFinalDecision: FineleveldecisionisswitchedtoRDO/RDmodel based on quantization parameter QP. If QP is less than or equal to 16, then RDO method is used to search the final mode, otherwise RD model is used to search it among the most probable modes from Step 5. 3.5 Experimental Results The mode decision scheme as described in Sections 3.3 and 3.4 has been integrated with the H.264 JM7.3a codec for performance evaluation. It is compared with the RDO mode decision of H.264 in terms of the computational cost (the average CPU time per call for the mode decision routine) and the average PSNR as a function of the coding bit-rate for test sequences recommended in [15]. 
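For reference when interpreting the complexity figures reported below, the six-step procedure of Section 3.4 can be condensed into the following sketch. The SATD threshold shown is an illustrative value, and the cost callable stands in for either the full RDO evaluation (QP <= 16) or the RD model of Section 3.4.2.

def fast_intra_4x4_decision(sads, satds, rd_cost, delta_sad=50, delta_satd=200, K=3):
    """sads, satds: per-mode feature values from Steps 1 and 3.
    rd_cost: callable mode -> Lagrangian cost, standing in for full RDO or the RD model."""
    n = len(sads)
    m_sad = min(range(n), key=lambda m: sads[m])
    if sads[m_sad] < delta_sad:                 # Step 2: SAD early termination
        return m_sad
    m_satd = min(range(n), key=lambda m: satds[m])
    if satds[m_satd] < delta_satd:              # Step 4: SATD early termination
        return m_satd
    # Step 5: unlikely-mode filtering with a K x K joint rank window
    sad_rank = {m: r for r, m in enumerate(sorted(range(n), key=lambda m: sads[m]))}
    satd_rank = {m: r for r, m in enumerate(sorted(range(n), key=lambda m: satds[m]))}
    candidates = [m for m in range(n) if sad_rank[m] < K and satd_rank[m] < K]
    # Step 6: fine-level decision among the 2-3 survivors
    return min(candidates, key=rd_cost)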
The frame rate of each test sequences is 30 frames per second and the frame skip was selected as five so that total encoded frames are fifty frames per sequence because total number of frames are 300 frames. Two different frame formats, QCIF (176×144) and D1 (720×480), have been examined to verify the performance deviation due to different frame sizes. All macroblocks are intra-coded and the quantization parameter set is chosen to be [10,16,22,28,34,40], which means the step size is doubled from 2 to 64 because H.264 uses exponentially increasing quantization step size scaling scheme.For this set of quantization parameters, the average number of bits per frame, the aver- age PSNR, and the average encoding time of the core mode decision routine are measured for comparison, respectively. The rate-distortion and complexity performance comparison at various coding rates and frame sizes are shown in Figs. 3.8 to 3.11 and summarized in Tables 40 Table 3.1: Rate and distortion performance comparison [QP=10∼40] R:RDO Rate (Kbit/frame) PSNR (dB) M:Proposed 10 16 22 28 34 40 10 16 22 28 34 40 Akio R 74.6 48.8 31.2 19.1 11.8 7.4 52.0 47.9 43.4 38.9 34.5 30.3 [QCIF] M 75.1 49.1 31.5 19.4 12.1 7.6 51.7 47.7 43.3 38.9 34.5 30.4 Foreman R 109 72.5 43.8 24.8 13.8 8.0 51.8 46.4 41.3 36.9 32.7 28.7 [QCIF] M 109 73.0 44.2 25.0 14.1 8.2 51.3 46.2 41.2 36.8 32.7 28.8 T. T. R 110 73.3 44.8 26.1 15.7 9.2 51.9 46.5 41.2 36.9 32.9 28.8 [QCIF] M 111 74.0 45.3 26.6 15.9 9.4 51.3 46.2 41.1 36.8 32.8 28.8 Mobile R 208 159 118 81.5 51.6 26.8 52.0 46.4 40.6 34.8 29.2 24.1 [QCIF] M 210 161 119 82.8 52.4 27.4 51.3 45.9 40.1 34.4 28.9 24.0 Susie R 1147 595 275 141 81.5 56.5 51.9 46.3 42.1 38.7 35.5 32.1 [D1] M 1153 606 280 146 84.3 58.4 51.4 46.2 42.0 38.7 35.6 32.3 Mobile R 2334 1703 1153 740 455 247 51.9 46.3 40.7 35.6 30.6 25.9 [D1] M 2348 1717 1165 750 462 253 51.4 45.9 40.4 35.4 30.4 25.8 3.1 and 3.2. As shown in these tables, the percentage of Intra coding highly varies depending on the input video characteristics such as motion activity and texture complexity. It also fluctuates with the values of coding parameters such as the frame rate. For this reason, we reported both the modular encoding speedup for the Intra mode decision and the total Intra encoding speedup factor to show the computational savings due to the proposed Intra mode decision algorithm. The speedup factor is defined to be the ratio of the encoding time of the mode decision routine using the RDO technique and that of the proposed scheme. The modular time is measured as the average encoding time of the 4x4 Intra modedecisionroutinewhilethetotalIntraencodingtimeismeasuredastheaverage encoding time of all Intra frames. The encoding time of each method is measured by the function timing profile in Visual Studio. Test sequences are chosen from low motion and smooth texture sequence to high motion and complex texture sequence among the MPEG Class A, B, and C sequences in [15]. 41 In terms of the rate-distortion performance, the proposed scheme achieves al- most identical RD performance while providing a modular speed-up factor ranging from 8.8 to 14.6 as shown in Figs. 3.8 to 3.11. Note that the computational complexity of the H.264 RDO process increases much faster than the proposed al- gorithm as the average bit rate increases. The reason is that the RDO procedure consists of both rate and distortion measurements. When a smaller quantization parameter (QP) is used, the bit rate increases. 
These measurements require entropy coding and reconstruction of the current macroblock, so the additional residual information produced at a smaller QP demands correspondingly more computation time from the RDO procedure. In contrast, the proposed algorithm requires neither entropy coding nor reconstruction, so its speedup factor grows as the bit rate increases. For subjective quality comparison, reconstructed frames based on the RDO and the proposed methods are captured at three different QPs as shown in Fig. 3.12. Again, it is difficult to tell the visual quality difference between the pictures obtained by the two mode decision methods.

Table 3.2: Computational complexity comparison [QP=10~40]

Test Sequences      Modular Speedup (scale) at QP = 10 / 16 / 22 / 28 / 34 / 40
Akiyo    QCIF       13.6   12.6   10.9   10.5   10.9   10.8
Foreman  QCIF       13.5   13.4   10.4   11.3   10.1    9.2
T. T.    QCIF       14.5   14.3   10.6   11.3    9.3    8.8
Mobile   QCIF       14.3   14.8   12.1   15.4   14.3   11.9
Susie    D1         15.7   13.3   12.1   11.3   10.8    9.4
Mobile   D1         12.9   12.9   12.3   12.5   12.1   11.0

The problem of Intra-mode decision for H.264 based on joint features was studied in this research. Two simple features, SAD and SATD, are used to filter out the majority of candidate modes, leaving only 2-3 modes for the final decision, which is made by feature-based or RDO-based methods. Experimental results demonstrate that the intra prediction module of the JVT reference software JM7.3a can be sped up by a factor of 10 or more without noticeable RD performance degradation. Future research topics include low-complexity Intra/Inter mode decision for H.264, which is the natural extension of this research.

Figure 3.8: Comparison of (a) the R-D performance and (b) the computational complexity for the Akiyo QCIF sequence.

Figure 3.9: Comparison of (a) the R-D performance and (b) the computational complexity for the Foreman QCIF sequence.

Figure 3.10: Comparison of (a) the R-D performance and (b) the computational complexity for the Table Tennis QCIF sequence.

Figure 3.11: Comparison of (a) the R-D performance and (b) the computational complexity for the Susie D1 sequence.
47 (a) (b) Figure 3.12: Visual quality comparison of reconstructed QCIF frames of (a) Fore- man and (b) Mobile using the RDO method (the upper row) and the proposed method (the lower row) where QP=16, 28 and 34 for the left, middle and right columns, respectively. 48 Chapter 4 Feature-based Intra/Inter Coding Mode Selection for H.264/AVC 4.1 Introduction A fast inter/intra mode prediction scheme with carefully selected features is de- veloped in this work. Under this framework, we divide the mode decision into “coarse-scale decision” and “fine-scale decision” two stages. At the first stage, we perform a binary decision to choose either the inter or the intra prediction type to beusedforatargetblock. Theobjective istoreduce thecomputationalcomplexity by deciding the most probable intra/inter type earlier. Obviously, there is a risk in making the wrong prediction in the computation- saving effort. Thus, the management of the prediction risk, which is quantified by the averaged RD performance degradation rather than by the mis-classification rate alone in this work, is critical. To manage the prediction risk efficiently, the proposed algorithm employed the Bayes risk minimization criterion. An overview of the proposed algorithm is depicted in Fig. 4.1. First, three cost-effective features are extracted from the current macroblock to form a 3D feature vector. It is observed that the inter prediction provides better 49 performance than the intra prediction when the temporal correlation is stronger than the spatial correlation, and vice versa. We also observe that the intra/inter mode decision is highly correlated with the degree of motion activity. Based on these observations, we employ three features that reflect the spatial correlation, the temporal correlation, and the motion activity, respectively. Second, the feature space is partitioned into three mutually exclusive regions off-line according to the risk; namely, risk-free, risk-tolerable, and risk-intolerable regions. We apply algo- rithms of different complexity for final coding mode decision to blocks located in these three regions. Features 1 f Intra FINAL M 0 f Inter MV Motion k MB Risk-Free Tolerable Risk Intra Mode Inter Mode Figure 4.1: The feature based Intra/Inter mode decision scheme using risk mini- mization criterion. Intuitively, the risk is calculated based on the probability of erroneous mode selection and the average RDperformance loss. As shown in Fig. 4.1, if the feature vector lies in the risk-free region, the decision is made based on simple feature comparison which is the differential cost between intra feature and inter feature. If it is in the risk-tolerable region, the risk-minimizing mode is selected, where the risk is chosen to be the Bayes-risk, which will be described in detail in Sec. 4.5. Finally, if it is in the risk-intolerable region, i.e., it is included neither in risk-free region nor in risk-tolerable one as well, a full mode decision process is conducted to avoid significant RD loss. Once the intra/inter prediction is determined, we proceed to the mode decision at the second stage; namely, which specific intra or inter mode to be used. 50 4.2 Simple & Effective Feature Selection The proposed mode decision algorithm is a feature-based approach. Good features should be easily computed at a low computational cost while capturing the spatial and/or temporal correlation well so as to offer important clues about which mode to select. 
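Before the individual features are defined, the decision flow of Fig. 4.1 can be summarized by the following sketch. The thresholds shown (0.5% and 10%) are the example values used later in this chapter, and every argument is a quantity produced by a stage described in the following sections; the code is an outline, not the JM implementation.

def coarse_intra_inter_decision(f_intra, f_inter, expected_loss, risk_min_mode,
                                full_decision, l_free=0.005, l_star=0.10):
    """expected_loss: expected RD loss L_p of the feature-space cell containing
    (f_inter, f_intra, |MV|), read from an offline-trained lookup table (Sec. 4.3).
    risk_min_mode: Bayes-risk-minimizing mode stored for that cell (Sec. 4.5).
    full_decision: callable that runs the full intra and inter mode searches."""
    if expected_loss <= l_free:
        # Risk-free region: pick the prediction type with the smaller SATD residual.
        return 'intra' if f_intra < f_inter else 'inter'
    if expected_loss <= l_star:
        # Risk-tolerable region: accept the precomputed risk-minimizing mode.
        return risk_min_mode
    # Risk-intolerable region: fall back to the full mode decision.
    return full_decision()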
4.2.1 Intra Mode Feature To characterize the spatial domain correlation, which can be exploited by an intra- prediction mode, we use the sum of absolute transform differences (SATD) of the prediction residual due to its simplicity and good mode-discriminating capability [23]. For the transform function, we adopt the simple Hadamard transform since only addition and shift operations are needed in the computation. For each 4×4 block, we calculate SATD values for only 5 modes, which are the DC, vertical, hor- izontal, diagonal down-left, and diagonal down-right modes, and pick the smallest one as the representative value. Finally, the feature to reflect the spatial domain correlation, denoted by f Intra or f 1 , is chosen to be the sum of the SATD values of sixteen 4×4 blocks contained by the current 16×16 macroblock. Mathematically, we have f Intra =f 1 = 16 X k=1 SATD(m k ), (4.1) m k = argmin m={m 0 ,...,m 4 } SATD(m), SATD(m) = X (x,y)∈B k |T(I(x,y)−P m (x,y))|), where k is an index of 4×4 blocks B k within current macroblock, m denotes the one of five candidate prediction modes, T(·) represents the Hadamard transform, 51 I(x,y) and P(x,y) are pixels of the k th block of the current macroblock and the corresponding predictive intra mode, respectively. If the value of f 1 is small, it is likely that the intra-predictive mode will be selected. In calculating intra featuref Intra , the original pixels of its neighbor 4×4 blocks are used to generate the prediction pixels P(x,y). Basically, the intra and inter features are chosen to be simple yet effective for the binary mode decision. At the same time, the reconstruction cost of each 4x4 block is saved in our scheme. We see that the proposed feature difference is very accurate as shown in Figs. 4.3 and 4.4 when the MB is in the risk-free region for various frame-skips as well as a wide rangeofmotionactivity. Bytheframe-skip, wemeanthenumberofframesskipped between two encoded frames. For a sequence of 30fps, we will have 30, 15 and 10 fps if the frame-skip number is equal to 0, 1 and 2, respectively. It is important to emphasize that the main purpose of simple intra and inter features are not for the final intra-mode decision but for simple binary decision of the risk-free region in the observation space. 4.2.2 Inter Mode Feature To compute the feature to reflect the temporal domain correlation, which can be exploited by an inter-prediction mode, we search the best matched macroblock with respect to one reference frame. In our implementation, the motion vector is obtained using modified MVFAST [62] with the quarter pixel accuracy. Three modifications are made to fine-tune the performance. First, we add two more can- didate motion vectors in the spatial (the left upper macroblock) and temporal (the co-located macroblock in the previous frame) neighborhood. Second, the residuals of previously visited search points are kept in the memory to avoid re-calculation. 52 Third, the total number of search points for a block is restricted to be M = 512 in the worst case. Modified MVFAST motion search algorithm can be summarized as: 1. Detection of Stationary Blocks: The search for the best matched block will be terminated immediately, if its SAD value obtained at zero motion position (0,0) is less than threshold T = 512, and the motion vector is assigned as (0,0). 2. 
Determination of Local Motion Activity: ThelocalmotionactivityListhemaximumcityblocklengthofmotionvector in region of support (ROS) and classified to three levels (where the ROS of the current MB includes five spatial-temporal neighborhood MBs which are upper, upper left, upper right, left MB in current frame, and collocated MB in previous frames): Motion Activity = Low,L<L 1 = Medium,L 1 <L<L 2 (4.2) = High,L 2 <L where L1 =1 and L2 =2 are integer constants. 3. Search Center: The choice of the search center depends on the local motion activity L at the current MB position. If the motion activity is low or medium, the search center is the origin. Otherwise, the vector belonging to set V that yields the minimum sum of absolute difference (SAD) is chosen as the search center. 53 4. Motion Search: A local search is performed around the search center to obtain the motion vector for the current MB. The search patterns employed for the local search are shown in Fig. 4.2. Two strategies are proposed for the local search and their choice depends on the motion activity identified. If the motion activityisloworhigh,weemploythesmalldiamondsearch(SDS).Otherwise, we choose large diamond search (LDS). To reduce the searching complexity, earlier visited search points are kept in the memory to avoid re-calculation and the total number of search points for a block is restricted to M = 32 in the worst case. (a) (b) Figure 4.2: (a) Large Diamond Search Pattern (LDSP) and (b) Small Diamond Search Pattern (SDSP). Mathematically, the inter-prediction feature, denoted by f Inter or f 0 , can be expressed as f Inter =f 0 = X (x,y)∈MB l |T(I(x,y)−Q(x ′ ,y ′ ))|, (4.3) (x ′ ,y ′ ) = (x,y)+(v x ,v y ) 54 where (v x ,v y ) is the motion vector obtained by the fast search algorithm, and I(x,y) and Q(x ′ ,y ′ ) are pixels of the current macroblock MB l and its predictive macroblock. Simply speaking, the intra and inter features are the SATD values of the spatial and the temporal prediction residuals. 4.2.3 Motion Activity Classification The third feature used is the magnitude of the motion vector, which can be com- puted as |MV|= (v 2 x +v 2 y ) 1/2 . (4.4) Thereasontoincludethemotionvectormagnitude(orstrength)isthatitisrelated to the reliability of inter-prediction feature f 0 . In Fig. 4.3, we show the decision errorprobabilityasweadjusttheframe-skipnumberforthreetypicaltestsequences that have different motion activities. That is, the motion activities of Akiyo, Foreman and Stefan are low, medium and high, respectively. The decision metric is chosen to be feature difference as given by d f =f Intra −f Inter . (4.5) In (4.5), the intra feature is the sum of SATD values for sixteen 4×4 blocks within a target macroblock of size 16×16 as shown in Eq. (4.1). Also, the inter featureistheSATDvalueofthemacroblockasgiveninEq. (4.3). Simplyspeaking, they are the SATD values of the spatial and the temporal prediction residuals of the same macroblock. 55 Figure 4.3: The decision error probability versus the frame-skip number for se- quences Akiyo, Foreman and Stefan. Please note that our feature difference is very accurate (with a decision error less than 6% for three test sequences) when the frame skip is zero (30fps) or one (15fps) as shown in Fig. 4.3. According to the feature difference measure, the decision error probability can be defined as, P(e) =P(d f < 0|inter)·P(inter)+P(d f > 0|intra)·P(intra), (4.6) where, P(intra) (or P(inter)) is the probability that best mode is intra (or inter) coding mode. As shown in Fig. 
4.3, we see that the probability of erroneous deci- sion increases as the motion activity grows. Furthermore, we collect the statistics of macroblocks from seven test sequences (QCIF) with the same quantization pa- rameter, and draw the probability distribution of the RD-cost difference d c and the feature difference d f in Fig. 4.4. The RD-costdifference d c is defined to be the cost 56 difference of the RD function between the best inter mode and the best intra mode written, and can be written as d c = (D Intra +λ Intra ·R Intra )−(D Inter +λ Inter ·R Inter ) (4.7) where,λ Intra andλ Inter aretheLagrangianmultipliersusedinH.264referencecode. Itiseasytoverifythatifthebestmodeisaninter-predictive(oranintra-predictive) mode, then d c is positive (or negative). We see that the feature difference has excellent correlation with the RD-cost difference as shown in Fig. 4.4(a). Thus, we may use the feature difference to do the mode prediction, and the overall decision accuracy is 84.7%. However, it is observed that the prediction accuracy degrades as motion activity increases as shown in Fig. 4.4(b∼d). It is also worthwhile to point out that the number of intra modes used for pre- diction increases as the motion becomes faster. From these data, we conclude that the intra/inter features as defined in (4.1) and (4.3) are good ones to characterize the spatial and the temporal correlations forinter/intra mode prediction. Also, the motion vector length plays an important role in the decision making process. 4.3 3D Feature Space Partitioning As described in Section 4.1, the 3D feature space is partitioned into three regions (i.e. risk-free,risk-tolerableandrisk-intolerableregions)dependingontheexpected risk a.k.a. the expected RD loss as F = [f 0 ,f 1 ,|MV|]∈ R free L p ≤L free R tolerable L free ≤L p ≤L ∗ p R intolerable L ∗ p ≤L p 57 (a) (b) (c) (d) Figure4.4: Cumulativehistogramofmacroblockdistributionusing(a)macroblocks of low motion activities, (b) macroblocks of medium motion activities, (c) mac- roblocks of high motion and (d) all macroblocks activities. where F is the input feature vector, L p is the expected RD loss at position F in the 3D feature space, L free is the threshold for the risk-free region and L ∗ p ∈ [0,1] is the given RD loss threshold for the risk-tolerable region. For example, L ∗ p = 0.1 means that we accept the risk of 10% of the RD cost increase for false decision. In other words, we decide the current codingmode to the riskminimized coding mode if the expected RD loss is less than or equal to 10% of the RD loss with respect to that of the true mode. 58 The expected RD loss, denoted by L p , is defined as the normalized Lagrangian RD cost difference between the true mode and the wrongly selected mode: L P = P (R T − ˆ R)+λ·(D T − ˆ D) P R T +λ·D T . (4.8) To facilitate the classification of an input feature vector to one of the three classes, it is convenient to partition the 3D feature space based on a off-line train- ing process. That is, we collect the three features as described in Sec. 4.2 from all macroblocks of seven training sequences of different motion and texture character- istics. They are: Akiyo, Hall Monitor, Foreman, Coastguard, Stefan, Table Tennis, and Mobile. The three features of a macroblock correspond to a point in the 3D feature space. Therearetwoimportantfactorstoconsiderwhenwepartitionthefeaturespace: representation accuracy and search complexity. 
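The expected RD loss of (4.8) is estimated offline for every quantized cell of the feature space. A small sketch follows; the record layout is hypothetical, and the loss is expressed here as the cost increase of the cell's predicted mode over the RDO mode, which is the nonnegative quantity the thresholds above are compared against.

def expected_rd_loss(cells, lam):
    """cells: dict mapping a cell index to a list of training records
    (r_true, d_true, r_pred, d_pred), where the 'pred' values come from the mode the
    cell would predict and the 'true' values from the RDO-selected mode.
    Returns the normalized Lagrangian loss L_p per cell, cf. Eq. (4.8)."""
    loss = {}
    for cell, recs in cells.items():
        extra = sum((r_p - r_t) + lam * (d_p - d_t) for r_t, d_t, r_p, d_p in recs)
        base = sum(r_t + lam * d_t for r_t, d_t, _, _ in recs)
        loss[cell] = extra / base if base else 0.0
    return loss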
For an efficient partition of the feature vector space, it is desirable to prevent empty cells which have no training data since it is difficult to calculate the expected RD loss in empty cells. The cell that does not have enough training data is also unfavorable since the expected RD loss may not be reliable. To avoid cells with little or no training data, we can employ vector quantization (VQ) for space partitioning. However, the complexity of VQ clustering is still high [13]. Here, we propose a non-uniform quantization scheme where each cell has about the same number of training data so that a reliable estimate for every cell with reasonable complexity can be obtained. This is illustrated in Fig. 4.5. To address the above problem, a non-uniform quantization scheme is employed so that each cell has about the same number of training data so that a reliable 59 estimate for every cell with reasonable searching complexity can be obtained. This is illustrated in Fig. 4.5. For given quantization parameter QP, the motion vector length is first non- uniformly quantized into N classes such that each motion class has about an equal number of training data with respect to the marginal probability. For each motion class i, the remaining two features (i.e. intra and inter features) are jointly quan- tized into m i 0 ×m i 1 non-uniform cells, where m i τ =⌊ q |class i | Mτ ⌋ as shown in Fig. 4.5 with the product VQ technique. That is, cells are obtained by the tensor product of two independent 1-D partitions. Note that the number of cells can be different in different motion classes. The larger the cardinality of the i th class |class i |, it is quantized into smaller cells. The training data per cell (M τ ) and the number of motion class (N) are chosen to minimize the RD performance degradation caused by wrong decision. The details are given in Sec. 4.5. |MV| f 0 f 1 Low Motion Activity High Motion Activity Figure 4.5: Illustration of the partition of the 3D feature vector space. 60 4.4 CodingModePredictioninRisk-FreeRegion As stated in Sec. 4.3, the 3D feature space is partitioned into multiple classes according to the motion vector length. For a given motion class, we plot the distribution of the remaining two features, i.e. f 0 and f 1 . A typical example is given in Fig. 4.6. The correct and erroneous decisions based on simple feature difference, are labeled by shaded surface plot and solid dots, respectively. Errorneous Decision Figure 4.6: The joint PDF of intra/inter features for correct decisions, including both both intra or inter modes, and black dots represent scattered distribution of erroneous decision results along the diagonal boundary with the feature difference Δf =0. Weseethaterroneousdecisionsprimarilyoccuralongthediagonalregion,which is the boundary of these two features. It is also apparent that a large amount of macroblocks can be predicted without any RD loss using the feature difference as given in (4.5). We call the region that has a low probability of erroneous decision 61 the risk-free region. Forthe risk-free region, the decision can be made simply based on the feature difference Δf =f Intra −f Inter Inter ≷ Intra 0. (4.9) This is a reasonable choice when the feature difference value correlates with the desired mode well. That is, we select the intra mode, if Δf ≥ 0. Otherwise, the inter mode is selected. To be more specific, the risk-free region is chosen under the criterion that the expected RD loss L p is less then L free = 0.5% in this work. 
This criterion is actually a conservative choice that guarantees no significant RD performance loss in the risk-free region. Figure 4.7: The risk and risk-free region partitioning as a function of feature dif- ference Δf. Fig. 4.7 gives an example of risk and risk-free region partitioned based on the expected RD loss. This subfigure shows two conditional PDF of feature difference Δf when the inter or the intra mode is the best mode. The bold (or thin) curve 62 at the left (or right) hand side of Δf = 0 is the erroneous region of choosing the inter (or the intra) mode when the intra (or the inter) mode is the correct one. With Δf = 0 as the decision threshold, the erroneous decision regions are shaded by dark and light gray colors, respectively. The risk region corresponds to the interval inside the two dotted lines in Fig. 4.7 where the risk-free region consists of areas outside of the two dotted lines. We see that the risk region heavily overlaps with the shaded region of light gray. This is because that the expected RD loss from the erroneously chosen intra mode is more severe than the erroneously chosen inter mode. This demonstrates that it is not sufficient to consider the conditional probabilities alone. We need to consider the cost of an erroneous decision, too. 4.5 Coding Mode Prediction in Risk Region Basically, proposed coding mode prediction scheme is to early decide the most probable coding mode based on observation of three features which are intra and inter features and motion activity. This can be done by optimally partitioning of observation space. In the previous section 4.6.1, whole feature space is partitioned into risk-free region and diagonal risk region by using RD loss threshold L free . To reduce the complexity associated with inter/intra mode decision, the risk regioninFig. 4.7isfurtherdecomposedintotworegions; namely, risk-tolerableand risk-intolerable regions, depending on the expected RD loss value. If the expected RD loss is less than a threshold, it is the risk-tolerable region. Otherwise, it is the risk-intolerable region. For the risk-tolerable region, we may develop an algorithm of medium com- plexity for mode decision. For the risk-intolerable region, the full mode search is 63 performed to avoid significant RD loss. We will focus on the risk-tolerable case in this subsection. In this work, we define the risk as R = M−1 X i=0 M−1 X j=0 ˜ C ij P( ˆ m i |m j ), (4.10) where m j denotes the ground truth (or correct decision), ˆ m i the actual decision made, ˜ C ij the cost of making decision ˆ m i while the ground truth is m j . We can rewrite (4.10) as R = M−1 X i=0 M−1 X j=0 C ij P( ˆ m i ,m j ), (4.11) where C ij = ˜ C ij /P(m j ). Furthermore, since decision ˆ m i is made based on the feature space partition, we have P( ˆ m i ,m j ) = Z χ i P(m j |F)f(F)dF, (4.12) where F denotes a vector in the feature space, χ i the subspace where decision ˆ m i is chosen and f(F) the probability density function of feature F. Substituting (4.12) in (4.11), we obtain R = M−1 X i=0 Z χ i θ i (F)f(F)dF, (4.13) where θ i (F)= M−1 X j=0 C ij P(m j |F), (4.14) 64 whichiscalledtheBayes-risksinceitisthesumofcostsC ij weightedbyconditional probabilities P(m j |F) given an observed feature vector. In the current context, there are onlytwo choices andwe usem 0 andm 1 todenote the decision ofchoosing the inter and the intra modes, respectively. Itisusuallydifficulttocharacterizetheprobabilitydistributionf(F)asgivenin (4.13). To simplify decision making, we simply focus on θ i (F). 
By risk-minimizing mode selection, we mean the following mode selection rule: θ 0 (F) Inter ≷ Inter θ 1 (F). (4.15) We have θ 0 (F)=C 01 P(m 1 |F)+C 00 P(m 0 |F). By setting C 00 =0, we obtain θ 0 (F)=C 01 P(m 1 |F)=C 01 f(F|m 1 )P(m 1 ) f(F) , where the last equality is based on the Bayesian rule. Similarly, we have θ 1 (F)=C 10 P(m 0 |F)=C 10 f(F|m 0 )P(m 0 ) f(F) . Then, we can rewrite the decision rule in (4.15) as f(F|m 1 ) f(F|m 0 ) Inter ≷ Inter C 10 ·P(m 0 ) C 01 ·P(m 1 ) . (4.16) where m 0 and m 1 denote selections of the inter and the intra modes, respectively. The cost C 10 represents the RD cost of false decision like choosing the intra mode when the inter mode actually gives the smaller RD value, and vice versa, while the 65 cost of correct decision, i.e. C 00 and C 11 , are typically set to zero. The left term in (4.16) is the likelihood ratio and the threshold is the ratio of the cost weighted by the probability of each mode. For each quantized cell in the risk-tolerable region, we determine the Bayes- risk minimizing mode using the test in (4.8) with training data. After the risk- minimizing mode is determined, its expected RD loss is obtained using Eq. (4.8). In other words, we can compute the optimal mode and the associated expected RD loss pair (m opt ,L P ) for each cell using test sequences off-line and store these values in a lookup table. It is observed in the experiment that the a priori probabilities of the inter and intra modes in the risk-region is in general not equal so that there exists MAP rule betterthanML.Also,itsassociatedRDcostisobservedasnon-uniform. Therefore, risk minimization criterion can be justified in terms of statistical fitness. 4.6 Probability Density Estimation To determine the coding mode based on (4.16), we need to estimate the likelihood ratio function which is the ratio of the posterior probabilistic density function. In general, there are three ways to do the likelihood estimation: parametric, semi- parametric, and non-parametric methods [4]. The parametric approach assumes a functional form of f(F|m i ), which can be customized by a set of parameters like the Gaussian mixture model [52, 9, 37, 32]. In this case, estimating f(F|m i ) is essentially to find a set of parameters that fit the functional form to the training data. On the other hand, the non-parametric approach describes the observation data distribution directly. The semi-parametric approach, most notably used in neural networks, adopts a general functional form 66 thatcanhaveavariablenumberofadjustableparameterssuchastheself-organizing map network [57, 56, 47]. In this work, we consider the parametric and the non-parametric likelihood estimation methods, and compare them in terms of the RD performance and the computational complexity. 4.6.1 Quantized Cells in 3D Feature Space Asdescribed inSec. ,thetwo conditionaldistributionsf(F|m i ),i∈0,1, overlap in the risk region as shown in Fig. 4.6. Thus, for a given motion class, it is desirable to re-map the 2D feature space (f Intra ,f Inter ) for compact quantization using the following coordinates transform: F 0 F 1 = 1 −1 1 1 f 0 f 1 (4.17) Fig. 4.8 illustrate the re-mapping from the original 2D feature space of the risk region using (4.17). f 0 f 1 F 0 F 1 Figure 4.8: Illustration of a re-mapping of the risk region. 67 4.6.2 Non-parametric Likelihood Estimation As shown in Fig. 4.5, the re-mapped risk region is quantized into m i ×m i cells each motion class i. 
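The re-mapping of (4.17) is a simple 45-degree change of coordinates: F0 is, up to sign, the feature difference, so the diagonal risk band around Delta-f = 0 becomes an axis-aligned strip that rectangular cells can cover compactly. A one-function sketch:

import numpy as np

def remap(f0, f1):
    """Eq. (4.17): F0 = f0 - f1 (i.e. -(f_Intra - f_Inter)), F1 = f0 + f1."""
    A = np.array([[1, -1],
                  [1,  1]])
    return A @ np.array([f0, f1], dtype=float)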
On one hand, if the risk region is coarsely partitioned, there are sufficient training data per cell and the density estimate will be more reliable. On the other hand, if the risk region is finely partitioned, the prediction error in each cell will be smaller and it will be easier to make correct mode decision in a smaller region. Figure4.9: Illustration ofthe partitionofthe 3Dfeature vector space, where green, orange dot and red circle represents intra, inter mode and wrong decision, respec- tively. On the other hand, we can assign a different mode to a smaller region if more partitioned cells are allowed in the same space. However, for a partition which is too fine, we may get inaccurate mode assignment due to unreliable estimation of the conditional probability density function, and eventually, the RD performance degrades. Consequently, there exists a tradeoff in choosing a proper quantization scheme. Thus, the number of motion class (N) and the number of data per cell (M) should be chosen carefully to minimize the expected RD loss. 68 Mathematically, we have [N,M] = min x,y X ΔD+λ·ΔR = min x,y x X i=1 y X j=1 ΔD ij +λ·ΔR ij (4.18) whereΔD ij =D True (i,j)−D classifiedmode (i,j),ΔR ij =R True (i,j)−R classifiedmode (i,j) and (i,j) is the coordinates of the corresponding cell in the risk region. To estimate parameters efficiently, we first quantize the observation space using the[N,M]valuesandassigntheriskminimized modepercellbasedonthetraining sequence. Next, we test various non-training video sequences from MPEG classes A to C and search the optimal parameters that produce the minimum overall RD performance degradation as shown in Fig. 4.10. The proposed algorithm works the best in terms of rate distortion performance when N = 9 and M = 8 for QP = 28. Following this procedure, we can efficiently quantize the 3D space for the non-parametric method for all quantization step sizes. Figure 4.10: Estimated parameters N =9 and M =8 for QP = 28. 69 For each quantized cell in the risk region, the conditional probability density functions f(F|m 0 ) and f(F|m 1 ) are estimated to calculate the likelihood ratio as follows [5]. • Step 1: For given QP = q and motion class i, a training data set T i in the risk region from test sequences that contain a large number of labeled intra and inter modes is first prepared. • Step 2: The feature vector x i = [F 0 ,F 1 ] in T i is quantized according to estimated parameters [N,M] so that we have x i q = [F q 0 ,F q 1 ] in T i q . • Step 3: Initialize two 2D matrices of size m i ×m i for intra and inter modes as H Inter (F q 0 ,F q 1 ) = 0, H Intra (F q 0 ,F q 1 ) =0, (4.19) where m i is the range of each quantized feature. • Step 4: To construct the histogram, we update mode counts for cell x i q ∈T i q as H Inter (F q 0 ,F q 1 )=H Inter (F q 0 ,F q 1 )+1 ifx∈m 0 , H Intra (F q 0 ,F q 1 ) =H Intra (F q 0 ,F q 1 )+1 ififx∈ m 1 . (4.20) 70 • Step5: Normalizethehistogramtoobtainthepdfestimatesformotionclass i: P(x i |m 0 ) ∼ =h 0 (x i q ) = H Inter (x i q ) N i 0 , P(x i |m 1 ) ∼ =h 1 (x i q ) = H Intra (x i q ) N i 1 , (4.21) where N i 0 and N i 1 are the total number of inter and intra modes in motion class i, respectively. These non-parametric density estimates are stored in a lookup table, which are then used by the decision rule in (4.16). The obtained density estimates may have irregularities when the training set does not provide sufficient data to cover the feature space. 
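Steps 1-5 amount to building two normalized 2D histograms over the quantized risk region and, at encoding time, comparing them through the likelihood-ratio test of (4.16). A compact sketch, assuming the feature vectors have already been quantized to integer cell indices in [0, m):

import numpy as np

def train_histograms(samples, m):
    """samples: iterable of (F0_q, F1_q, mode) with mode 0 = inter, 1 = intra.
    Returns h[mode, F0_q, F1_q], the conditional pdf estimates of Eq. (4.21)."""
    h = np.zeros((2, m, m))
    for f0q, f1q, mode in samples:
        h[mode, f0q, f1q] += 1.0
    totals = h.sum(axis=(1, 2), keepdims=True)
    return h / np.maximum(totals, 1.0)

def bayes_decision(h, f0q, f1q, c10, c01, p0, p1):
    """Likelihood-ratio test of Eq. (4.16): choose intra when
    f(F|m1) / f(F|m0) exceeds (C10 * P(m0)) / (C01 * P(m1))."""
    lhs = h[1, f0q, f1q] * c01 * p1
    rhs = h[0, f0q, f1q] * c10 * p0
    return 'intra' if lhs > rhs else 'inter'

When few training blocks fall in a cell, the histogram estimate above becomes unreliable.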
This problem can be alleviated by applying a 3D smoothing filter (kernel) to the density estimates. The kernel function also reduces the estimation error variance. The kernel selection has been extensively studied in [14, 50]. In practice, a k×k×k averaging filterwasusedintheliterature[5]. Here, weadoptthefollowingweightedaveraging filter to obtain regularized estimate: ˆ P(F i |m i ) = P k l=−k P k m=−k h i (F i q ) d l,m P k l=−k P k m=−k 1 d l,m , where d l,m =|F i q (0,0)−F i q (l,m)| 2 is the Euclidean distance between the codeword of the current cell and the codeword in (2k+1) 2 neighboring cells in a quantized feature space. 71 4.6.3 Parametric Likelihood Estimation Inpractical, feature space rarelyberepresented byasingle parametric form. When there is little or no prior knowledge about data properties except for the data sam- ples themselves, the sample’s joint distribution can often be modeled as a mixture of simple parametric forms such as Gaussian, Cauchy, or Laplace [43, 40]. This provides a tradeoff between the simple and limited parametric approaches and the computationally intensive nonparametric approaches. In general, Gaussian Mixture Model (GMM) approach is a popular method in density estimation since its a powerful tool for representing virtually any distribu- tion. Although, the expression of GMM is simple, the training of a GMM, i.e., finding a model given the feature vectors, is rather complex and time consuming due to its computational complexity and iterative nature. Training of a GMM is generally accomplished by the EM algorithm [32] , which guarantees convergence to a local maximum. In literature, Xu and Jordan [52] applied the expectation- maximization (EM) method [9] to the Gaussian mixture problem and showed its advantages over other algorithms. The parameters for each component density in the mixture, however, have to be derived or learned solely from the data samples. The maximum likelihood (ML) approach to this problem, when all parameters are unknown, results in a set of implicit equations, which require numerical methods to solve [11]. Basically, the EM algorithm maximize the sample joint-likelihood, i.e. the pseudo-likelihood, in batch operation. Intheory,thesamplelikelihoodwillapproachthetruelikelihoodwhenthenum- ber of the sample points tends to infinity. However, due to its use of deterministic 72 gradient descent and batch operation nature, the EM algorithm has a high pos- sibility of being trapped in local optima and is also slow to converge [37, 57]. In order to overcome these technical difficulties, we employed initial model, K-means algorithm that has been applied for finding a robust model approximation to the GMM. In each of motion classes, The 2D feature vector space in risk region is modeled as mixture of multiple Gaussian distribution. Generally, the probability given a GMM can be formulated as P(x) = M X i=1 ω i ·N i (x;μ,Σ) N i (x;μ,Σ) = 1 (2π) |d| 2 |Σ| 1 2 exp( 1 2 (x−μ) T Σ −1 (x−μ)) (4.22) X i ω i = 1,∀i :ω i ≥0 where M is the number of Gaussian, (μ,Σ) are mean and covariance matrix, d represents dimension of input vector space, and ω i is the prior probability of i th component Gaussian pdf, N i (x;μ,Σ). In cased of 2D GMM, commonly used mixture models are circular or diagonal GMMs, i.e., all components in a mixture model are has single (circular) variance over 2 dimension or two unique variances for each dimension. 
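In contrast with circular or diagonal components, a full-covariance component keeps the cross term between the two features. The component density N_i(x; mu, Sigma) of (4.22) and the resulting mixture can be evaluated as, for instance:

import numpy as np

def gaussian2d(x, mu, cov):
    """Bivariate Gaussian density with a full covariance matrix."""
    d = np.asarray(x, dtype=float) - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / \
           (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def gmm_pdf(x, weights, means, covs):
    """Mixture density P(x) = sum_i w_i * N_i(x; mu_i, Sigma_i), Eq. (4.22)."""
    return sum(w * gaussian2d(x, m, c) for w, m, c in zip(weights, means, covs))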
In our approach, we used full rank GMM whose covariance matrix full rank that means the number of parameters is the square of the dimension. In order to estimate the two conditional pdfsin(4.16),themodelparameters[μ k ,Σ k ,ω k ]have tobeestimated ifthenumber of component pdfs is k. This process is called model-training and GMM can be trainedbyefficient EM algorithmwhich isaniterative MLprocedureforparameter estimation under incomplete data or missing data situation[9]. 73 By using the EM procedure, the likelihood is obtained by the average or ex- pectation of the complete-data likelihood with respect to the missing data using the current parameter estimates (E-step), then the new parameter estimates are obtained by maximizing the marginal likelihood (M-step). Basically, the objective of EM is to maximize likelihood P(x;Θ) of the data X drawn from an unknown distribution, given the model parameter Θ under the assumption that sample ob- servation X is independent events: b Θ = argmaxθP(X|Θ) = argmaxθ n Y j=1 P(x j |θ) (4.23) To be more specific in concept, EM algorithm introduce the hidden variable Q whose role is to tell each point about which Gaussian generated the point and forms a auxiliary function A(θ,θ s ) that helps the maximization of likelihood such as: A(θ,θ s ) = E[logP(X,Q|θ)|X,θ s ] = n X i=1 N X j=1 E[q i,j |X,θ s ](logP(j|x i ,θ)+logP(x i |θ)) (4.24) The auxiliary function means expected log-likelihood of training data when we will set the each data belongs to the component pdf indicated by Q and the each component pdf is formed with the parameter set θ, for the given training data X and parameter set θ s for Gaussian mixture. It can be shown that maximizing auxiliary function, θ s+1 =argmaxθA(θ,θ s ) (4.25) 74 always increase the probability of the data P(X,Q|θ), and a maximum of A(θ,θ s ) correspondstoamaximum likelihood[3]. Table4.1isthecompactdescription ofthe EM algorithm for GMM. (a) (b) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 F 0 F 1 (c) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 F 0 F 1 (d) Figure 4.11: Surface plot of conditional probability and ellipse plot for [mean, covariance] ofcomponent Gaussianpdfsoverlapped with datariskatmotionvector class 7, and QP=22 (a,c) f(F|m 0 ) (b,d) f(F|m 1 ) 75 Table 4.1: Expectation Maximization for Gaussian Mixture Model Initialize GMM: K-means algorithm is used to initialize GMM not to be trapped in local optima. Using K-means algorithm, 2D feature vectors in risk region is clustered into the given number of clusters. E-Step: For each point in risk region, estimate the conditional mean E[q i,j |X,θ s ] which is the same with the probability that each component Gaussian distribution will generated it such as: E[q i,j |X,θ s ] = 1·P(q i,j = 1|X,θ s )+0·P(q i,j = 0|X,θ s ) = P(j|X,θ s ) (4.26) = P(x i |j,θ s )·P(j|θ s ) P(x i |θ s ) M-Step: Modify the parameters of each component Gaussian pdfs to new parameters θ s+1 to maximize the likelihood of the data as well as the one of hidden variable like (4.24). ∂A(θ,θ s ) ∂θ = 0, (4.27) for each parameter of component Gaussian pdfs, such as mean, covariance and prior weights [μ j ,Σ j ,ω j ]. 
Summarizing, the estimates of the new parameters in terms of old parameters are as follows: ˆ ω j = P(j|θ)= 1 n n X i=1 P(j|x i ,θ s ) ˆ μ j = P n i=1 x i ·P(j|x i ,θ s ) P n i=1 P(j|x i ,θ s ) (4.28) ˆ Σ j = P n i=1 (x i − ˆ μ j )·(x i − ˆ μ j ) T ·P(j|x i ,θ s ) P n i=1 P(j|x i ,θ s ) Note that the above equations perform both the expectation step and the maxi- mization step simultaneously. The algorithm proceeds by using the newly derived 76 parameters as the guess for the next iteration. Iteration is performed until the parameter set is converged. Following the above procedures defined in Table4.1, we perform the experiment to train the GMM using EM algorithm for the risk region in each motion classes at every QPsets, QP =[10,16,22,28,34]. Firstofall, we quantized 3Dfeaturevector space by using the same way as non-parametric method in each QP by using the minimum RD loss criterion in section4.6.2. Initially, each risk region is clustered using K-means algorithm with 20 iterations. Each conditional probability f(F|m i ) is modeled as Gaussian Mixture Model like Fig.4.11 and the number of component Gaussian pdfs is chosen as minimum number which achieves minimum error within 50 iterations when the error Δe is defined as the difference between old and new log-likelihood such as. Δe = e s+1 −e s = X i log 1 P s+1 (x i |X,θ s ) − X i log 1 P s (x i |X,θ s ) (4.29) Each 2D Gaussian pdf needs 3 parameters which are 2×1 mean vector, 2×2 covariance matrix and prior weight, so that maximum number of parameters will be seven per component pdf which is affordable in terms of memory complexity. Those parameters are optimized by EM algorithm and stored in lookup table in thecodec. Actually, followingissimpleexample oftrainedGaussianmixture model for conditional probability f(F|m 0 ) in risk region where the QP = 28 and motion class is 7 out of 9 total. For given RD loss threshold L ∗ p ∈ [0.0,1.0], the proposed intra/inter coding mode prediction algorithm can be summarized as Table4.2. 77 Table 4.2: Bayes risk minimized coding mode prediction algorithm Step 1: For the current macroblock, calculate three features [f Intra ,f Inter ,|MV|] using (4.1) and (4.3) and the Euclidean norm of the motion vector. Step 2: Quantize the motion activity as described in section4.3 and check whether the feature vector is within the risk-free region. If yes, we will calculate the feature difference Δf and decide the mode based on (4.9). If no, we proceed to Step 3. Step 3: Quantize the 2D feature vector in the risk region into small cells. If the non-parametric method is used, then determine the risk-minimized mode m opt as desired mode, if associated RD loss ¯ L P is less than L ∗ p which means it belongs to risk-tolerable region. Otherwise, we proceed to Step 4. In case of parametric method, calculate likelihood ratio based on GMM trained by EM algorithm and determine risk minimized mode on the fly. Step 4: Find the optimal intra mode and the optimal inter mode, respec- tively, using RDO or RD estimation algorithms. Then, the Lagrangian RD cost values for these two modes are compared and the one gives the smaller one is chosen to be the desired one. 4.7 Experimental Results In the experiment, the proposed algorithm was integrated with the JVT reference software JM7.3a. For the experimental setup, the general main profile encoding configuration is used and the motion vector search range is set as 32×32 centered at best predictive motion vector where the fast full search algorithm is used for five reference frames. 
The frame skip was set to zero (30 fps), one (15 fps) and five (5 fps), respectively, as shown in Fig. 4.14, and B frames were not used in the experiment. The algorithm can be applied to any profile since it switches between the intra and inter mode classes prior to the coding of macroblocks. CABAC [28] was chosen as the entropy coder.

Figure 4.12: Variation of (a) the R-D performance, (b) the computational complexity (square and triangular lines represent the number of single mode decisions) and (c) the encoding time saving as L^*_p increases for the QCIF Table Tennis sequence.

The quantization parameter set QP = {10, 16, 22, 28, 34} was chosen to cover a representative portion of the entire QP range, which is from 0 to 51. We show the PSNR values under 30 dB in Fig. 4.15. The performance of the proposed algorithm is closer to the desired rate-distortion optimized performance for a larger quantization parameter (say, 34 in our test); in practice, the RD performance drops more at a higher bit rate when the fast mode decision algorithm is applied, as shown in Fig. 4.15. The test sequences were chosen to be MPEG sequences of classes A (News, Container), B (Foreman, Carphone), C (Stefan) and the Football sequence at various resolutions (from QCIF to D1). The simulation was conducted on a PC with an Intel Pentium 4 processor running at 1.8 GHz and 512 MB of DDR RAM. To compare the rate-distortion performance and the computational complexity of the proposed scheme with those of the RDO scheme in the H.264 reference code, the PSNR and the bit rate (per frame) were measured with the frame skip set to five.

When the frame skip is small, the best mode is primarily the inter mode, since the temporal correlation is more dominant than the spatial correlation. This is well reflected in the simple feature difference in Eq. (4.5), and the expected RD performance degradation due to decision risk is negligible. In particular, when the frame skip is 0 or 1, as shown in Fig. 4.14, the experiment confirms that most macroblocks fall in the risk-free region, our algorithm quickly chooses the inter mode, and the performance is excellent. When the frame skip is large, the percentage of intra prediction is comparable with that of inter prediction, since temporal and spatial correlation are quite competitive in most macroblocks; the risk of a feature-difference-based decision is then higher, and the risk-minimizing decision scheme is developed to treat these difficult circumstances. Hence, the frame skip was set to five in order to evaluate the proposed method in the case where the intra and inter modes have more comparable performance. For computational complexity profiling, the encoding time was measured.

The first experiment shows the relationship between the rate-distortion-complexity (RDC) performance and the RD loss threshold L^*_p, which determines the boundary between the risk-tolerable and the risk-intolerable regions. As shown in Fig. 4.12, the RD performance loss increases and the computational complexity decreases as the RD loss threshold becomes larger, which allows a higher decision risk in the rate-distortion sense.
In other words, the risk-intolerable region shrinks. It is worthwhile to mention that the contribution of the risk-free region (L^*_p of about 0.01) to the encoding time saving grows to almost 60% of the total encoding time saving at L^*_p of about 0.5, as shown in Fig. 4.12(c). This RDC performance trend is similarly observed in other sequences over a wide range of quantization parameters, as shown in Fig. 4.13. We see that the rate-distortion performance loss increases slightly while the complexity decreases as we increase the tolerable RD loss threshold.

Table 4.3: Rate comparison [R: RDO, NP: Non-parametric, P: Parametric], rate in Kbit/frame at QP = 10, 16, 22, 28, 34.

News (QCIF)       R  35.89   20.45   11.35   5.83   2.94
                  NP 36.66   21.15   11.83   6.09   3.21
                  P  37.09   21.45   12.06   6.25   3.14
Foreman (QCIF)    R  72.70   39.82   19.83   9.33   4.51
                  NP 75.65   41.83   20.80   10.12  5.12
                  P  76.81   43.07   21.50   10.34  5.18
Carphone (QCIF)   R  62.49   33.27   17.53   8.53   3.93
                  NP 63.65   34.09   17.99   8.79   4.28
                  P  65.18   35.17   18.29   9.14   4.38
Stefan (QCIF)     R  133.0   92.78   61.63   36.29  18.15
                  NP 135.9   94.52   61.08   36.39  18.16
                  P  141.8   100.9   67.2    38.37  18.66
Container (CIF)   R  238.45  122.22  52.67   16.83  5.64
                  NP 243.59  126.40  54.90   17.59  6.164
                  P  249.1   130.5   58.76   18.04  6.044
Football (D1)     R  1638    1006    563.0   315.7  167.5
                  NP 1698    1042    568.7   316.6  168.2
                  P  1686    1045    574.4   336.3  175.5

Table 4.4: Distortion comparison [R: RDO, NP: Non-parametric, P: Parametric], PSNR in dB at QP = 10, 16, 22, 28, 34.

News (QCIF)       R  50.17  45.86  41.34  36.78  32.38
                  NP 50.14  45.83  41.31  36.77  32.31
                  P  50.18  45.89  41.34  36.79  32.36
Foreman (QCIF)    R  49.98  44.94  40.19  35.93  32.17
                  NP 49.92  44.86  40.10  35.89  32.13
                  P  50.04  44.95  40.16  35.83  32.12
Carphone (QCIF)   R  50.19  45.71  41.36  36.90  32.66
                  NP 50.11  45.62  41.25  36.84  32.61
                  P  50.25  45.73  41.32  36.84  32.61
Stefan (QCIF)     R  50.05  44.94  39.72  34.33  28.87
                  NP 49.80  44.61  39.34  34.23  28.66
                  P  50.26  45.11  40.29  34.43  28.81
Container (CIF)   R  50.32  45.17  40.34  36.03  32.34
                  NP 50.27  45.13  40.32  36.01  32.28
                  P  50.32  45.18  40.34  36.02  32.31
Football (D1)     R  50.35  44.79  40.03  35.79  31.78
                  NP 50.01  44.38  39.52  35.59  31.47
                  P  50.46  44.88  40.52  35.81  31.63

In Figs. 4.15 and 4.16, we compare the RDC performance of three algorithms: the RDO-based method and the proposed Bayes risk minimized decision using the parametric and the non-parametric density estimation methods. As shown in these figures, the non-parametric method is more accurate and faster than the parametric method, since the parametric method calculates the likelihood using the GMM on the fly. On the other hand, in terms of the memory requirement, the non-parametric method needs more memory space, since it must retrieve the risk-minimizing mode and the expected RD loss per cell from a lookup table. For comparison, the parametric method only requires 7xN GMM parameters per cell, where N is the number of Gaussian mixture components and one Gaussian pdf needs 7 parameters, as described earlier.
Table 4.5: Complexity comparison [R: RDO, NP: Non-parametric, P: Parametric], speedup factor (%) relative to RDO at QP = 10, 16, 22, 28, 34.

News (QCIF)       NP 32.26  29.63  26.06  23.77  19.59
                  P  29.44  28.14  24.81  22.97  18.91
Foreman (QCIF)    NP 31.66  28.78  23.15  24.29  19.94
                  P  25.71  26.95  21.18  23.47  19.32
Carphone (QCIF)   NP 30.13  25.95  21.03  20.14  16.86
                  P  27.23  23.86  19.42  18.70  15.24
Stefan (QCIF)     NP 29.97  27.83  24.91  22.74  20.83
                  P  27.57  26.88  23.58  21.74  20.50
Container (CIF)   NP 25.43  24.55  20.44  17.48  16.09
                  P  23.90  22.96  19.04  16.86  15.13
Football (D1)     NP 25.08  22.45  20.46  17.85  17.27
                  P  23.88  21.27  19.04  17.26  16.97

The encoding bit rates, the PSNR values and the speedup factors of the proposed algorithm using the non-parametric and the parametric density estimation methods are compared with those of the RDO method for the six test sequences in Tables 4.3-4.5. These tables show the results when the RD loss threshold L^*_p is set to 1.0; different RDC tradeoffs can be achieved by adjusting this single parameter, which partitions the decision regions. For example, by lowering the L^*_p value, we can achieve an RD performance closer to the RD optimized performance at the cost of a smaller saving in computational complexity, and vice versa. As shown in Tables 4.3-4.5, there are outliers that produce some loss, especially when QP is very large. However, considering the overall RD performance, we achieve an average rate loss of 3.81% and an average distortion loss of -0.34% over all test sequences. Note also that the proposed algorithm thus allows the encoder to trade computational complexity for video quality in a flexible way.

Figure 4.13: Variation of (a) the R-D performance and (b) the computational complexity as L^*_p increases for the QCIF Foreman sequence.

Figure 4.14: Performance comparison in (a,c) rate-distortion and (b,d) computational complexity as L^*_p increases for the QCIF Foreman sequence when the frame skip is set to 0 and 1, respectively.

Figure 4.15: Comparison of (a) the R-D performance and (b) the computational complexity for the QCIF Carphone sequence.

Figure 4.16: Comparison of (a) the R-D performance and (b) the computational complexity for the QCIF Stefan sequence.

Chapter 5
Fast H.264 Motion Estimation with Block-size Adaptive Referencing (BAR)

5.1 Introduction

The long-term memory motion compensated prediction (LTMCP) [49] scheme is included in H.264 to enhance the coding performance.
That is, to obtain a lower motion-compensated residual, the best match is searched in all available reference frames that were decoded earlier and stored in the frame memory of LTMCP. It has been reported in [49] that LTMCP can achieve a bit-rate saving of 20-30% (for the Mother & Daughter sequence), which corresponds to an improvement of the reconstruction PSNR of up to approximately 1.5 dB. In this work, we examine techniques to speed up the motion search in the context of LTMCP while keeping the coding performance at a desired level. To achieve this goal, we first study the relationship between the block sizes and the underlying video characteristics. Based on this study, we develop a quantitative model that relates the block-size effect to the rate-distortion (RD) performance of LTMCP. Then, we propose a method that assigns a different number of references in a block-size adaptive manner to reduce the complexity of the motion search.

An overview of the proposed algorithm is depicted in Fig. 5.1. First, we explore the empirical relationship between the temporal search range and the block size. This is meaningful since a smaller block size often implies weaker temporal correlation between adjacent frames under the multiple-frame prediction framework. This block-size effect on the RD coding gain is exploited to develop a new algorithm, called the block-size adaptive referencing (BAR) scheme, that assigns a different number of references in a block-size adaptive manner. A proper BAR scheme can be determined to minimize the number of reference frames while keeping the expected RD loss under a target level.

The selection of a proper BAR scheme is elaborated below. First, we build several models and relationships from several training sequences. The expected RD performance degradation of each BAR set is obtained using a regression model; with this model, the RD coding loss from BAR can be quantified by the RD gradient. Furthermore, it is observed that the frame-level RD gradient is correlated with the quality-to-bit rate ratio (QBR), which can also be described by a model. With all the above models available, the resultant on-line fast algorithm consists of the following four steps.

• Step 1: We estimate the QBR using the normalized LMS adaptive filter and then the frame-level RD gradient of the current frame.
• Step 2: Based on these results, the expected RD coding loss of each BAR scheme is obtained by a model.
• Step 3: A proper BAR scheme is chosen to minimize the number of reference frames while keeping the expected RD loss under a target level.
• Step 4: A fast motion search is performed within the selected references in a block-size adaptive manner.

Figure 5.1: The block diagram of the proposed block-size adaptive referencing (BAR) scheme (QBR prediction with the NLMS adaptive filter, RD gradient model, estimation of RD performance with the increased-rate model, decision of the BAR set, and fast motion estimation with UMHexagonS over LTMCP).

5.2 Analysis of Rate-Distortion-Complexity Tradeoff

In this section, we analyze the tradeoff between rate, distortion and complexity in the context of temporal search range extension, and study their dependence on sequence characteristics (e.g., motion and/or texture variation) and coding parameters (e.g., block sizes and quantization parameters). Generally speaking, we observe the following factors that have an impact on the RD coding gain.

• The test sequence effect.
For the same block size and quantization parameter, the RD coding gain from multiple references is larger for a test sequence of higher motion activity with a more complex/textured background.
• The block-size effect. The RD coding gain from multiple references is larger for blocks of a smaller size.
• The quantization parameter effect. The RD coding gain from multiple references is larger when the QP is smaller (or the bit rate is higher).

They are detailed below.

5.2.1 Effect of Sequence Characteristics on RD Coding Gain

To investigate the effect of sequence characteristics on the RD coding gain, we compare the rate and distortion gains obtained by increasing the number of references at each macroblock for two sequences, i.e., the Foreman and Stefan sequences, which have different local motion and/or texture characteristics. In Fig. 5.2, we show the normalized RD coding gain as a function of the temporal search range for these two sequences, where the normalized rate and distortion gains are defined, respectively, as

\Delta R = \frac{|R_1 - R_{n_{ref}}|}{R_1}, \qquad \Delta D = \frac{|D_1 - D_{n_{ref}}|}{D_1}.    (5.1)

We see from Fig. 5.2 that the RD gain is highly dependent on the input sequence characteristics, such as motion activity variation and texture complexity. Specifically, the higher the variation of motion activity and/or texture complexity, the larger the RD coding gain from the extended temporal search range, since the prediction residuals in such local regions are larger when the number of reference frames is small. Also, for a fixed quantization parameter, the distortion gain is much smaller than the rate gain. Generally speaking, we expect a different RD coding gain in local regions of different characteristics.

Figure 5.2: The normalized rate-distortion coding gain as a function of the reference frame number for two test sequences, where Q_p = 28 and B_s = 16x16.

5.2.2 Effect of Block Sizes on RD Coding Gain

Next, we explore the block-size effect on the RD coding gain by assigning multiple references to different block sizes with various quantization parameters (QPs). To quantify the block-size adaptive RD coding gain with respect to the target frame (at time t), the normalized differential distance between the RD Lagrange cost with one reference frame only (at time t-1) and with the last i reference frames is employed. Mathematically, we have

\Delta RD_i = \frac{R_{t-1} D_{t-1}^2 - R_{t-i} D_{t-i}^2}{R_{t-i} D_{t-i}^2},    (5.2)

where R_{t-i} and D_{t-i}, i = 1, 2, 3, ..., represent the rate and distortion obtained with the best motion search over the previous i references, respectively.

To verify the dependence of the RD coding gain on the block size, the probability of finding the best match in the previously coded frame (i.e., i = 1) is measured and plotted in Fig. 5.3 for the Stefan CIF test sequence. As shown in Fig. 5.3, it is more likely for larger blocks to find their best match in the previous frame than for smaller blocks. Since a block of smaller size often implies weaker temporal correlation between adjacent frames, we expect a higher RD coding gain from multiple references for smaller blocks.

Figure 5.3: The plot of P(\Delta RD = 0) as a function of QP parameterized by the block size, where the frame skip is 2 and the maximum number of references is 5.
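For concreteness, the normalized gains of Eq. (5.1) that underlie Fig. 5.2 are simple ratios; the short sketch below evaluates them for made-up rate and distortion values (the numbers are placeholders, not measurements from the thesis).

```python
def normalized_gains(r_one_ref, d_one_ref, r_multi_ref, d_multi_ref):
    """Normalized rate and distortion gains of Eq. (5.1) for n_ref reference frames."""
    delta_r = abs(r_one_ref - r_multi_ref) / r_one_ref
    delta_d = abs(d_one_ref - d_multi_ref) / d_one_ref
    return delta_r, delta_d

# Hypothetical example: single-reference vs. 5-reference coding at a fixed QP.
delta_r, delta_d = normalized_gains(r_one_ref=0.53, d_one_ref=8.2,
                                    r_multi_ref=0.46, d_multi_ref=8.0)
print(f"rate gain: {delta_r:.1%}, distortion gain: {delta_d:.1%}")
```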
5.3 Block-size Adaptive Referencing

Here, we consider a scheme that assigns different numbers of references to blocks of different sizes and study its impact on the RD coding gain. Generally speaking, since a smaller block size often implies weaker temporal correlation between adjacent frames under the multiple-frame prediction framework, we expect a higher RD coding gain from multiple references for smaller blocks.

Table 5.1: The seven sets in the BAR scheme (number of reference frames per block-size class).

BAR Set Index   16x16   16x8/8x16   8x8   8x4/4x8   4x4
1               1       1           1     1         1
2               1       1           2     2         3
3               1       2           2     3         3
4               2       2           3     3         4
5               2       3           3     4         4
6               3       3           4     4         5
7               3       4           4     5         5

5.3.1 Set-based Block-size Adaptive Referencing

As described in Sec. 5.2.2, the block-size adaptive referencing (BAR) scheme is developed to exploit the block-size effect on the RD coding gain. There are several ways to exploit this property. One straightforward way is to model the RD coding gain as a function of the number of references at each block size. This method is advantageous because it allows the set of references to be selected in a block-size adaptive manner to meet the required RD performance level with fine granularity. However, the computational overhead of this approach is very high, since a separate model is needed for each block size, and finding the optimal set of references for each macroblock may not be a trivial task.

In this research, instead of considering all possible combinations of block sizes and reference numbers, we choose several more common choices, as shown in Table 5.1, each of which is called a BAR set. For each set, block sizes are categorized into five classes depending on their areas: (i) 16x16, (ii) 8x16 or 16x8, (iii) 8x8, (iv) 4x8 or 8x4 and (v) 4x4. All these sets meet the criterion, based on the observation in Sec. 5.2.2, that a block of a smaller size has no fewer references than a block of a larger size.

To explore the block-size effect on RD performance degradation, the selected seven BAR sets are examined in Fig. 5.4. In practice, the quality variation is much smaller than the rate variation for a fixed quantization parameter. For simplicity, we employ the distortion-difference reflected rate [1] here. With this measure, a small distortion difference is included as part of the rate term as

R^*_n = R_n + \frac{\Delta D_n}{\alpha},    (5.3)

where

\alpha = \Delta D / \Delta R \simeq (D_{q+1} - D_{q-1}) / (R_{q+1} - R_{q-1}),    (5.4)

and where \Delta D_n = D_{Full} - D_n is the distortion difference between the full 5-reference scheme and BAR set n. In other words, R^*_n is the rate needed to achieve the distortion level of the full reference scheme. It is worthwhile to point out that the increased rate \Delta R^*_n = R^*_n - R_{Full} is reciprocally related to \alpha, the gradient of the RD curve. On the other hand, it is easy to see that the normalized encoding time ratio T_{prop}/T_{RDO} grows roughly linearly with the BAR set index regardless of coding parameters, such as the quantization parameter QP = q in {10, 16, 22, 28, 34, 40}, and sequence characteristics, as shown in Fig. 5.5.

Figure 5.4: Comparison of the RD curves of several different BAR sets (BAR1-BAR7 of Table 5.1 and the full-reference set BAR8 = [5,5,5,5,5]) applied to the Foreman CIF sequence.

5.4 Proposed Frame-level BAR Set Selection Algorithm

Our main idea in this work is to lower the complexity of the motion vector search by assigning a different BAR set to each frame while keeping the increased rate \Delta R^*_n low. To estimate the increased rate, we develop a model based on the RD gradient \alpha, which can in turn be estimated from the quality-to-bit rate ratio (QBR).

5.4.1 RD Gradient Estimation

The relationship between the RD gradient \alpha and the increased rate \Delta R^*_n is depicted in Fig. 5.6.
It is worthwhile to point out that the increased rate is reciprocally related to the gradient of the RD curve. Thus, we have

\Delta R^*_L > \Delta R^*_H \quad \text{if} \quad \alpha_L < \alpha_H,    (5.5)

where \alpha_L and \alpha_H are the RD gradient values and

\Delta R^*_L = \Delta R_L + \frac{\Delta D_L}{\alpha_L}, \qquad \Delta R^*_H = \Delta R_H + \frac{\Delta D_H}{\alpha_H}    (5.6)

are the increased distortion-difference reflected rates. In Eq. (5.6), \Delta R_L = R^L_n - R^L_{Full} and \Delta R_H = R^H_n - R^H_{Full} are the increased rates, and \Delta D_L = D^L_{Full} - D^L_n and \Delta D_H = D^H_{Full} - D^H_n denote the quality degradations, respectively. Based on the above observation, we conclude that the RD coding loss due to the use of the BAR scheme can be quantified by the RD gradient.

Figure 5.5: Comparison of the computational complexity saving curves of the 7 BAR sets applied to the Foreman CIF sequence.

Figure 5.6: The relationship between the RD gradient (\alpha) and the increased rate (\Delta R^*_n).

Estimating the RD gradient before the encoding process is a challenging task in practice. For various types of test sequences, we observe that the RD gradient \alpha of the current frame is strongly correlated with the quantization parameter (QP) and the quality-to-bit rate ratio (QBR), defined as

\theta(i) = \frac{Q(i)}{R(i)},    (5.7)

where Q(i), R(i) and i are the quality measure (PSNR), the rate (bpp) and the frame index, respectively.

To understand the relationship between the RD gradient \alpha in Eq. (5.4) and the QBR in Eq. (5.7), we collect their values for four CIF sequences (Container, Foreman, Stefan and Mobile) of 100 frames (30 fps) under the general main profile encoding configuration with the frame skip set to 2. In this experiment, intra mode prediction is disabled and no data are collected from the first five frames to prevent the initial effect [49]. Using least-squares curve fitting, the RD gradient can be modeled by a logarithmic function of the QBR, as shown in Fig. 5.8. Mathematically, we have

\alpha = \omega_q \cdot \log(\theta_q) + c_q,    (5.8)

where q is a given quantization parameter, and \omega_q and c_q are the model weight and a constant associated with q, respectively.

Figure 5.7: Verification of the RD gradient model at the frame level for three different test sequences over a wide range of quantization parameters.

Figure 5.8: The relationship between the RD gradient (\alpha) and the QBR value (\theta).

To justify the estimation performance of the RD gradient model in Eq. (5.8), we measure the RD gradients of the 50th encoded frame for three different test sequences, i.e., the Container, Foreman and Stefan CIF sequences, and show them in Fig. 5.7. Based on Eq. (5.4), the RD gradients at 6 quantization parameters q in {10, 16, 22, 28, 34, 40} are tested using values from the 12 adjacent quantization parameters q in {11, 12, 15, 17, 21, 23, 27, 29, 33, 35, 39, 41}. As shown in Fig. 5.7, the RD gradients estimated with the logarithmic model are very close to their true values over a wide range of bit rates. Similar results have been observed throughout the entire set of frames in these test sequences.
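To make this off-line modeling step concrete, the following is a small sketch of fitting the logarithmic RD-gradient model of Eq. (5.8) by least squares for a single QP; the (theta, alpha) training pairs here are synthetic placeholders rather than values measured from the training sequences.

```python
import numpy as np

def fit_rd_gradient_model(theta, alpha):
    """Least-squares fit of alpha = w_q * log(theta) + c_q (Eq. 5.8) for one QP."""
    w_q, c_q = np.polyfit(np.log(theta), alpha, deg=1)
    return w_q, c_q

def estimate_rd_gradient(theta, w_q, c_q):
    """Evaluate the fitted model for a (predicted) QBR value."""
    return w_q * np.log(theta) + c_q

# Synthetic illustration: QBR and RD-gradient samples collected at one QP.
theta_train = np.array([50.0, 120.0, 300.0, 700.0, 1500.0])
alpha_train = np.array([12.0, 25.0, 41.0, 58.0, 72.0])
w_q, c_q = fit_rd_gradient_model(theta_train, alpha_train)
print(estimate_rd_gradient(400.0, w_q, c_q))
```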
5.4.2 Quality-to-Bit Rate Ratio (QBR) Prediction

To estimate the RD gradient, we must predict the QBR before the actual encoding process, since a BAR set has to be assigned to the target frame before it is fully encoded. In general, the quality variation is much smaller than the rate variation for a fixed QP, as shown in Fig. 5.10. The rate is likely to increase for higher motion and/or more heavily textured areas as a result of larger prediction residuals; in other words, a stationary frame with a flat background is more likely to have a larger QBR value. Thus, the QBR measure \theta reflects the variation of the input sequence characteristics and can be used to estimate the RD gradient. To estimate the QBR, it is necessary to predict the quality (PSNR) and the rate of the current frame before its actual encoding.

To predict the rate and/or PSNR, a number of different approaches have been proposed in the literature. For instance, the bit rate and/or distortion associated with each encoded video frame is often modeled as a self-similar process [19] or a discrete autoregressive model [18]. In this research, based on the statistical characteristics of the frame-level rate and PSNR, they are modeled as wide-sense stationary auto-regressive processes and predicted using an optimum linear discrete-time adaptive filter. Many kinds of adaptive filters have been designed for different applications, such as the LMS, RLS, IIR and lattice-structured adaptive filters [17]. Among them, the normalized LMS filter is employed in this research because of its robustness: it is a natural extension of the simplest LMS algorithm with a much faster convergence rate, its time complexity is O(N), where N is the filter tap order, and it overcomes the gradient noise amplification problem of the LMS algorithm.

Figure 5.9: The structure of the normalized LMS adaptive filter.

The structure of the normalized LMS adaptive filter is shown in Fig. 5.9, where M is the order of the filter. The purpose of this filter is to produce an estimate of the desired response, which corresponds to the rate or the PSNR value of the current frame, i.e., d(i) = u(i), where the tap input vector is U(i) = [u(i-1), ..., u(i-M)]^T and u(i) is the mean-removed rate R(i) or PSNR D(i). The mean is removed since the filter input and the desired response should be jointly zero-mean wide-sense stationary stochastic processes.

There is usually a prediction error associated with the predicted d(i). Mathematically, we have

e(i) = d(i) - y(i), \qquad y(i) = \hat{\omega}^T(i) \cdot U(i),    (5.9)

where y(i) is the filtered output and \hat{\omega}(i) the tap weight vector. The tap weight vector is determined by the normalized LMS adaptive filter by minimizing the squared Euclidean norm of the change in the tap weight vector \hat{\omega}(i+1) with respect to its old value \hat{\omega}(i), namely

\Delta \hat{\omega}(i+1) = \hat{\omega}(i+1) - \hat{\omega}(i),    (5.10)

subject to the constraint e(i) = 0 or, equivalently, \hat{\omega}^T(i+1) \cdot U(i) = d(i). To solve this constrained optimization problem, the cost function can be formulated with a Lagrangian multiplier as

J(i) = \|\Delta \hat{\omega}(i+1)\|^2 + \lambda \cdot (d(i) - \hat{\omega}^T(i+1) \cdot U(i)).

The optimum value of the filter tap-weight vector is obtained by differentiating the cost function J(i) with respect to each weight parameter and setting each term to zero. As a result, the tap-weight vector can be found via

\hat{\omega}(i+1) = \hat{\omega}(i) + 0.5\,\mu\,\lambda \cdot U(i), \qquad \lambda = \frac{2\,e(i)}{a + \|U(i)\|^2},    (5.11)

where \mu is the adaptation step size. Its range is restricted to 0 < \mu < 2 \cdot (M \cdot r(0))^{-1}, where M \cdot r(0) represents the tap input power, to assure convergence of the adaptive filter in the mean-squared sense, and a is a positive constant that prevents the gradient noise amplification problem when u(i) is very small.
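The recursion of Eqs. (5.9)-(5.11) amounts to the following per-frame predictor. The sketch below is a generic normalized-LMS one-step predictor written for illustration only; the filter order, step size, regularization constant and sample values are placeholder choices, not the settings used in the thesis.

```python
import numpy as np

class NlmsPredictor:
    """One-step NLMS predictor for a mean-removed frame-level rate or PSNR series."""
    def __init__(self, order=4, mu=0.5, a=1e-3):
        self.M, self.mu, self.a = order, mu, a
        self.w = np.zeros(order)   # tap weights  w(i)
        self.u = np.zeros(order)   # tap inputs   U(i) = [u(i-1), ..., u(i-M)]

    def predict(self):
        """y(i) = w(i)^T U(i): prediction of the current (mean-removed) value, Eq. (5.9)."""
        return self.w @ self.u

    def update(self, d):
        """Update the weights with the actual value d(i) once the frame has been encoded."""
        e = d - self.predict()                                          # prediction error
        self.w += (self.mu / (self.a + self.u @ self.u)) * e * self.u   # NLMS step, Eq. (5.11)
        self.u = np.concatenate(([d], self.u[:-1]))                     # shift in the new sample
        return e

# Usage sketch: predict the next (mean-removed) rate sample, then update after encoding.
pred = NlmsPredictor(order=4)
for d in [0.01, -0.02, 0.04, 0.00, 0.03]:   # hypothetical mean-removed rate samples
    r_hat = pred.predict()
    pred.update(d)
```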
Figure 5.10: Comparison of predicted and actual frame-by-frame results using the normalized LMS adaptive filter for the Stefan CIF sequence with the frame skip equal to two: (a) the rate, (b) the quality measure and (c) the corresponding errors.

As shown in Fig. 5.10, the mean squared errors between the predicted and actual values after encoding are -17.3 dB and -35.6 dB for the rate and the quality, respectively. Similar prediction accuracy has been observed for other test sequences. Thus, the rate and quality values can be accurately predicted using the normalized LMS adaptive filter.

5.4.3 Modeling of Increased Rate

To model the expected increased rate \Delta R^*_n caused by BAR as a function of the RD gradient \alpha, their statistics are collected for the same test sequences as described in Sec. 5.4.1 with the same configuration. In general, we can find a nonlinear regression model of the following form that provides the best fit:

\Delta R^*_n = f(\beta, \alpha) + \varepsilon,    (5.12)

where \beta, \alpha and \varepsilon are the regression model parameter vector to be fitted, the RD gradient and the regression error, respectively. The modeling error \varepsilon is observed to follow the zero-mean normal distribution N(0, \sigma^2), as shown in Fig. 5.11.

Figure 5.11: The probability distribution of the regression error and the approximating Gaussian distribution.

In other words, the likelihood of the nonlinear regression error can be written as

P(\theta(\beta)) = \frac{1}{(2\pi\sigma^2)^{m/2}} \cdot e^{-\frac{\theta(\beta)}{2\sigma^2}}.    (5.13)

This likelihood is maximized when the sum of squared residual errors \theta(\beta) = \sum_{i=1}^{m} (\Delta R^*_n - f(\beta,\alpha))^2 is minimized. The solution can be obtained by differentiating \theta(\beta) with respect to \beta, i.e.,

\frac{\partial \theta(\beta)}{\partial \beta} = -2 \sum \left( \Delta R^*_n - f(\beta,\alpha) \right) \frac{\partial f(\beta,\alpha)}{\partial \beta},    (5.14)

and setting the partial derivatives to zero. Since the equations in (5.14) are nonlinear, their solution is obtained via numerical optimization. In this work, the Levenberg-Marquardt algorithm [29] is adopted for curve fitting under the nonlinear least-squares criterion. It is more robust than the Gauss-Newton method in the sense that it can find a best fit even if its initial point is far from the final minimum, and it is more efficient, in terms of the number of iterations, than an unconstrained gradient method.

As a result of the above procedure, we can find a statistical model that relates the expected increased rate \Delta R^*_n to the RD gradient \alpha, as shown in Fig. 5.12. By exploiting the smoothness of these fitted curves, we can use the following simplifying approximation:

\Delta R^*_n \simeq \frac{\omega_n}{\alpha - c_n} + c'_n,    (5.15)

where \omega_n is the model weight of BAR set index n, and c_n and c'_n are regression model constants obtained by numerical optimization.

Figure 5.12: Modeling of the increased bit rates for different BAR schemes.

Through this nonlinear regression modeling process, by varying the BAR set index n, we obtain the expected increased rate as a function of the RD gradient \alpha, as shown in Fig. 5.12. We can therefore estimate the expected increased rate of each BAR set as a function of the RD gradient and, consequently, develop a simple yet effective algorithm that determines the BAR set based on the RD gradient measure. Intuitively, the algorithm saves motion estimation complexity by choosing a lower BAR set index whenever the resulting rate increase is expected to be less than a given threshold. Different strategies for selecting the BAR set will therefore yield different RD performance and demand different computational complexity.
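As a sketch of how the fitted curves of Fig. 5.12 could be produced and then used by the selection rule given in Eq. (5.16) below, the fragment fits the simplified model of Eq. (5.15) with SciPy's curve_fit (which defaults to Levenberg-Marquardt when no bounds are given) and picks the smallest BAR set whose predicted rate increase stays under a budget T. All numeric values and the data layout are illustrative placeholders, not values from the thesis.

```python
import numpy as np
from scipy.optimize import curve_fit

def increased_rate_model(alpha, w, c, c2):
    """Simplified regression form of Eq. (5.15): dR*_n ~ w / (alpha - c) + c2."""
    return w / (alpha - c) + c2

def fit_bar_models(training):
    """Fit one (w, c, c2) triple per BAR set from (alpha, dR*_n) training pairs."""
    models = {}
    for n, (alphas, d_rates) in training.items():
        p0 = (1.0, 0.0, 0.0)   # initial guess keeps alpha - c positive for this data
        models[n], _ = curve_fit(increased_rate_model, alphas, d_rates, p0=p0)
    return models

def select_bar_set(alpha_est, models, T):
    """Eq. (5.16): smallest BAR set whose expected rate increase is below the budget T."""
    for n in sorted(models):
        if increased_rate_model(alpha_est, *models[n]) <= T:
            return n
    return max(models)   # fall back to the fullest modeled set

# Synthetic training data (two of the seven sets shown for brevity):
# {BAR set index: (RD gradients, measured rate increases in bpp)}.
training = {1: (np.array([10.0, 20.0, 40.0, 80.0]), np.array([0.22, 0.12, 0.07, 0.045])),
            4: (np.array([10.0, 20.0, 40.0, 80.0]), np.array([0.09, 0.05, 0.03, 0.02]))}
models = fit_bar_models(training)
print(select_bar_set(alpha_est=25.0, models=models, T=0.06))
```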
In the proposed algorithm, the BAR set is simply chosen to maximize the complexity saving under the constraint that its estimated increased rate is less than a desired level. Mathematically, we have

\arg\min_{n} s_n \quad \text{subject to} \quad \Delta R^*_n \le T,    (5.16)

where s_n represents the n-th BAR set. For a given quantization parameter q, the proposed algorithm is stated below, and the corresponding flowchart is given in Fig. 5.13.

Block-size Adaptive Referencing Algorithm

Step 1. QBR prediction. For the current frame, estimate the rate and quality using the normalized LMS adaptive filters: \hat{R}(i) = \hat{\omega}_R^T(i) \cdot U_R(i) and \hat{Q}(i) = \hat{\omega}_Q^T(i) \cdot U_Q(i).
Step 2. RD gradient estimation. Based on the predicted rate and quality values, the RD gradient \alpha is obtained from the logarithmic model in (5.8), where the QBR value of (5.7) is calculated as \hat{\theta}(i) = \hat{Q}(i) / \hat{R}(i).
Step 3. Determination of the proper BAR set. The BAR set s_n is selected to maximize the computational cost reduction subject to the constraint that its expected rate increase is less than the desired level T, as given in Eq. (5.16).
Step 4. Fast motion search. A fast motion search, such as UMHexagonS [8], is performed within the temporal search range defined by the BAR set.
Step 5. Normalized LMS filter update. Update the tap weight vectors of the adaptive filters for QBR prediction using Eq. (5.11).

5.5 Experimental Results

The proposed algorithm was integrated with the JVT reference software JM10.1 in the experiment. The motion vector search range was set to 32x32 centered at the best predictive motion vector. The fast motion estimation algorithm UMHexagonS was employed for performance analysis. The UMHexagonS algorithm is designed to address the local minimum problem, especially in high motion sequences, by using a large search range together with a sequential search based on multiple search grids, such as unsymmetrical-cross, rectangular and multiple hexagon grids. The motion search scheme is multi-pass, and early termination is adopted to further speed up the sparse-grid based search process. As a result, it is robust against visual quality degradation for high motion/complex texture sequences.

To see the RD performance variation as the frame skip increases, three values were tested, as shown in Fig. 5.14: zero (30 fps), one (15 fps) and five (5 fps). B frames were not included for rate-distortion-complexity performance profiling. CABAC [28] was used as the entropy coding method, and the test sequences were chosen to include Class A (Container), Class B (Foreman) and Class C (Mobile & Calendar and Stefan) MPEG test sequences. We also show results for the Tempete and Football sequences at a different resolution (D1). In the experiment, intra mode prediction was disabled and no data were collected from the first five frames to avoid the initial effect [49].

The first experiment shows the rate-distortion performance degradation as the threshold T in {0 ~ 0.2} on the desired increased rate \Delta R^*_n (bpp) grows. As shown in Fig. 5.14, the rate-distortion performance degradation increases when the threshold T becomes larger and when the frame skip number increases. The reason is that a larger T allows a higher rate increase, potentially caused by the complexity reduction achieved through the reduced temporal search range in a block-size and sequence-characteristic adaptive manner. Intuitively, the temporal correlation between adjacent frames decreases as the frame skip increases.
This rate-distortion performance trend is similarly observed in other sequences over a wide range of quantization parameters, as shown in Tables 5.2 and 5.3.

The RD performance and the computational complexity for both the Foreman and Stefan sequences are shown in Figs. 5.15 and 5.16. In Fig. 5.15, the circled line indicates the JM reference code with full references for all block sizes, while the dotted line with crosses represents the performance of the proposed BAR algorithm. We see from Fig. 5.15 that the BAR scheme can achieve nearly the same RD performance as H.264 RDO (with full references) over a wide range of bit rates. Also, the variation of the rate-distortion performance degradation across test sequences with different motion and/or texture characteristics is small.

The computational saving shown in Fig. 5.16 varies from 25-65% for the selected range of QPs. The proposed algorithm assigns a different BAR set (s_n, n in [1, 2, ..., 7]) to each frame adaptively, based on the estimated RD gradient, to meet the allowed rate increase level for the given QP. Note also that the proposed algorithm selects the BAR set according to the sequence characteristics frame by frame, since the RD gradient is adaptively estimated from the QBR value, which reflects those characteristics. In our experiment, the proposed algorithm assigned BAR sets with lower indices to the Foreman sequence in order to achieve a level of increased rate equivalent to that of the Stefan sequence; in other words, the complexity of encoding the Foreman sequence is lower than that of the Stefan sequence.

To see the relationship between the normalized encoding time ratio and the average number of references used in the BAR scheme, we measured the average BAR index for the two test sequences (300 frames with the frame skip set to 2), as shown in Fig. 5.17. Intuitively, the encoding time saving is strongly correlated with the average number of references used in the BAR scheme.

Table 5.2: Rate comparison [R: RDO, P: Proposed], rate in bpp at QP = 10, 16, 22, 28, 34, 40.

Container (CIF)   R  1.844  0.887  0.350  0.105  0.032  0.013
                  P  1.846  0.888  0.352  0.107  0.033  0.012
Foreman (CIF)     R  2.401  1.227  0.533  0.225  0.101  0.051
                  P  2.420  1.250  0.550  0.237  0.106  0.052
Mobile (CIF)      R  3.674  2.286  1.259  0.584  0.228  0.093
                  P  3.684  2.296  1.271  0.592  0.229  0.085
Stefan (CIF)      R  3.569  2.156  1.220  0.627  0.290  0.140
                  P  3.568  2.186  1.243  0.631  0.300  0.148
Tempete (D1)      R  4.180  2.510  1.134  0.478  0.195  0.077
                  P  4.184  2.520  1.140  0.491  0.195  0.073
Football (D1)     R  4.167  2.469  1.244  0.631  0.308  0.135
                  P  4.175  2.476  1.251  0.680  0.313  0.138

To compare the R-D performance and the complexity saving of the proposed scheme with those of the RDO scheme in the H.264 reference code, we list the encoding bit rate, the PSNR value, the encoding time ratio T_Prop/T_RDO and the average BAR index (\bar{n}) for the six test sequences in Tables 5.2-5.4. We see that the R-D performance of the proposed algorithm degrades only slightly while its complexity saving is substantial. Generally, we saved 25~70% of the encoding complexity of motion estimation at the expense of negligible rate (+0.547%) and distortion (-0.109%) performance degradation for these test sequences. Similar results have been observed for D1 video at different frame rates.
Table 5.3: Distortion comparison [R: RDO, P: Proposed], PSNR in dB at QP = 10, 16, 22, 28, 34, 40.

Container (CIF)   R  49.84  45.07  40.27  35.87  31.98  28.16
                  P  49.84  45.07  40.27  35.85  31.92  28.06
Foreman (CIF)     R  49.77  45.02  40.46  36.33  32.53  29.02
                  P  49.77  45.02  40.43  36.25  32.35  28.75
Mobile (CIF)      R  49.56  44.44  39.23  34.10  29.28  25.09
                  P  49.57  44.44  39.21  34.07  29.15  24.61
Stefan (CIF)      R  49.76  44.92  40.17  35.38  30.67  26.23
                  P  49.76  44.92  40.16  35.36  30.61  26.07
Tempete (D1)      R  49.65  44.30  39.36  35.19  31.11  27.34
                  P  49.65  44.31  39.36  35.15  31.06  27.13
Football (D1)     R  49.66  44.44  39.94  35.85  31.98  28.18
                  P  49.66  44.45  39.94  35.80  31.94  28.12

Table 5.4: Computational speedup [T: encoding time ratio T_Prop/T_RDO in %, N: average BAR index \bar{n}] at QP = 10, 16, 22, 28, 34, 40.

Container (CIF)   T  75.20  67.28  55.96  48.69  37.33  29.63
                  N  6.050  5.100  4.010  3.090  2.190  1.421
Foreman (CIF)     T  75.31  67.47  63.15  52.33  39.45  36.22
                  N  6.073  5.017  4.663  3.684  2.305  1.989
Mobile (CIF)      T  73.83  72.80  65.93  62.31  49.98  36.69
                  N  6.091  5.705  4.952  4.631  3.368  2.030
Stefan (CIF)      T  74.33  70.08  66.88  60.22  50.38  36.88
                  N  6.009  5.400  5.082  4.463  3.421  2.147
Tempete (D1)      T  76.23  75.69  74.22  70.86  55.58  43.59
                  N  6.000  6.000  6.000  5.533  4.000  2.488
Football (D1)     T  76.51  76.69  82.69  71.81  56.47  47.23
                  N  6.422  6.170  6.022  5.822  3.955  2.755

Figure 5.13: The flowchart of the proposed block-size adaptive referencing (BAR) algorithm for multiple-reference motion estimation.

Figure 5.14: The performance degradation of the proposed BAR scheme as the threshold (T) increases for the Foreman CIF sequence.

Figure 5.15: Performance comparison between the JVT reference software JM10.1 with and without the proposed BAR scheme (T = 0.02) in the rate-distortion tradeoff for the Stefan and Foreman CIF sequences.

Figure 5.16: The performance degradation of the proposed BAR scheme (with T = 0.02) in terms of the normalized encoding time ratio for the Stefan and Foreman CIF sequences.

Figure 5.17: Comparison between H.264 and the proposed BAR scheme (T = 0.02) in terms of the average BAR set index for the Stefan and Foreman CIF sequences.

Chapter 6
Conclusion and Future Work

6.1 Conclusion

In this thesis, we developed an efficient framework to reduce the computational cost of coding mode decision and motion estimation. For fast mode decision, a coarse-level mode decision is performed first to determine which class of modes (the class of intra-predicted modes or the class of inter-predicted modes) should be adopted for a block. Then, at the second stage, fast algorithms can be applied to choose the specific mode within each class for the fine-level mode decision.

The main idea of the coarse-level intra/inter coding mode prediction is to decide the mode class using the expected risk of choosing the wrong mode in a simple multidimensional feature space. The proposed algorithm calculates three features and maps them into one of three regions, namely the risk-free, risk-tolerable and risk-intolerable regions. Depending on the mapped region, algorithms of different complexities can be applied for the final mode decision.
The results were presented in Chapter 4 of this thesis.

For the fine-level mode decision, we proposed a fast intra mode decision scheme. The proposed scheme first uses the spatial and transform domain features of the target block jointly to filter out the majority of candidate modes. For the final mode selection, either the feature-based or the RDO (rate-distortion optimization)-based method is applied to the remaining 2~3 candidate modes. It is demonstrated by experimental results that the proposed algorithm saves considerable computational complexity with little degradation in the rate-distortion performance. These results were described in Chapter 3 of this thesis.

As to fast motion search, a simple yet effective fast multiple reference assignment algorithm was proposed in Chapter 5 to speed up H.264/AVC encoding. The main idea is built upon the block-size effect on the RD coding gain of LTMCP with respect to the time-varying characteristics of test sequences. This effect is exploited to develop a new algorithm, called the block-size adaptive referencing (BAR) scheme, that assigns a different number of references in a block-size adaptive manner. The expected RD performance degradation due to the use of fewer references is analyzed, and it is found that the RD performance degradation of an individual BAR set can be modeled using the RD gradient function. A proper BAR set can then be chosen systematically to minimize the number of reference frames while keeping the expected RD loss under a target level. Finally, a fast motion search is performed within the selected references in a block-size adaptive manner, and the model parameters are adjusted using a normalized LMS adaptive filter to accommodate time-varying sequence characteristics. It was demonstrated that the proposed algorithm achieves significant complexity reduction without noticeable quality degradation. Furthermore, the proposed algorithm allows the encoder to trade computational complexity for video quality in a flexible way, based on the characteristics of the input video.

6.2 Future Work

To enhance the coding gain, H.264 allows the use of multiple reference frames for each of the inter-predictive coding modes of different block sizes. The temporal search region consists of multiple previously coded frames, while the spatial search region is fixed inside a window for all frames in the temporal search region. A way to get a better prediction result is to perform an exhaustive search over all possible reference locations of all possible block sizes with MVs of sub-pel accuracy; the exhaustive search provides the optimal coding gain at the expense of a high computational complexity.

Most efforts to reduce the bit rate have focused on finding the best match in previously coded frames rather than on the motion compensation technique, which generates the prediction residual information. In contrast to the complex motion vector search, simple motion compensation based on the difference between the current and the reference blocks is adopted in H.264. Thus, for balanced motion estimation and compensation, it is highly desirable to distribute the computational load between the encoder and the decoder while improving the coding gain at the same time. To tackle this problem, it may be feasible to develop a simple yet efficient motion compensation scheme. One idea is to perform low complexity motion estimation with fewer reference frames, fewer block types and low resolution MVs (such as integer-pel MVs) to identify a best-matched domain in the 3D (space-time) domain first.
Then, pixel-based 3D prediction is performed for the target block using an adaptive filtering technique derived from the best-matched area. Finally, the adaptive filter information is stored in the target block region and passed to the next frame if necessary. The idea is motivated by the observation that a natural scene can be modeled as a Markov process and the motion compensated residual can be well modeled as an auto-regressive (AR) process. Preliminary experiments have been performed to test the proposed method. It is observed that the new scheme has the potential to offer a better tradeoff between the coding gain and the computational complexity under different application scenarios and platforms. Further algorithmic development is an interesting task to be conducted in the near future.

Reference List

[1] J. Andersen, S. Forchhammer, and S. M. Aghito, "Rate-distortion-complexity optimization of fast motion estimation in H.264/MPEG-4 AVC," IEEE International Conference on Image Processing, vol. 1, pp. 111-114, Oct. 2004.
[2] H. F. Ates and Y. Altunbasak, "SAD reuse in hierarchical motion estimation for the H.264 encoder," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2005.
[3] Jeff A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," 1998.
[4] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford: Clarendon Press, 1996.
[5] D. Chai, S. L. Phung, and A. Bouzerdoum, "A Bayesian skin/non-skin color classifier using non-parametric density estimation," IEEE Proceedings of International Symposium on Circuits and Systems, vol. 2, May 2003.
[6] Mei-Juan Chen, Yi-Yen Chiang, Hung-Ju Li, and Ming-Chieh Chi, "Efficient multi-frame motion estimation algorithms for MPEG-4 AVC/JVT/H.264," Proceedings of the 2004 International Symposium on Circuits and Systems, vol. 3, May 2004.
[7] Y.-K. Chen, A. Vetro, H. Sun, and S. Y. Kung, "Optimizing intra/inter coding mode decisions," Proc. 1997 International Symposium on Multimedia Information Processing, 1997.
[8] Zhibo Chen, JianFeng Xu, Yun He, and GuoZhong Wang, "Simplifications on fast motion estimation," Document JVT-H026, 8th JVT Meeting, Geneva, Switzerland, May 2003.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1-38, 1977.
[10] Yong dong Zhang, Feng Dai, and Shou xun Lin, "Fast 4x4 intra-prediction mode selection for H.264," IEEE International Conference on Multimedia and Expo, vol. 2, pp. 27-30, 2004.
[11] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.
[12] B. Erol, M. Gallant, G. Cote, and F. Kossentini, "The H.263+ video coding standard: complexity and performance," Proc. IEEE Data Compression Conference, pp. 259-268, 1998.
[13] Allen Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Kluwer International Series in Engineering and Computer Science, 159 edition, 1992.
[14] R. M. Gray and R. A. Olshen, "Vector quantization and density estimation," Proceedings of Compression and Complexity of Sequences, pp. 172-193, 1997.
[15] JVT Test Model Ad Hoc Group, "Evaluation sheet for motion estimation," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Draft version 4, 2003.
[16] T. Halbach, "Performance comparison: H.26L intra coding vs. JPEG2000," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, JVT 4th Meeting, Klagenfurt, Austria, 2002.
[17] Simon Haykin, Adaptive filter theory, Prentice Hall International, INC., third edition, 1996. [18] D.P. Heyman and T. V. Lakshman, “Source modelsfor VBR broadcast-video traffic,” IEEE/ACM Transactions on Networking, vol. 4, pp. 40–48, 1994. [19] C. Huang, M. Devetsikiotis, I. Lambadaris, and A. R. Kaye, “Modeling and simulation of self-similar variable bit rate compressed video: a unified ap- proach,” in Procedding of ACM SIGCOMM, vol. 8, pp. 114–125, 1995. [20] Ashish Jagmohan and Krishna Ratakonda, “Time-efficient learning theoretic algorithmfor H.264mode selection,” Proc. IEEE International Conference on Image Processing, vol. 2, pp. 749–752, Oct. 2004. [21] B. Jeon and J. Lee, “Fast mode decision for H.264,” ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, JVT 10 th Meeting, Waikoloa, Hawaii, 2003. [22] Changsung Kim and C.-C. Jay Kuo, “Efficient temporal search range predic- tion for fast motion estimation in H.264,” IEEE International Workshop on Multimedia Signal Processing, Oct. 2005. 121 [23] Changsung Kim, Hsuan-Huei Shih, and C.-C. Jay Kuo, “Feature-based intra- prediction mode decision for H.264,” Proc. IEEE International Conference on Image Processing, 2004. [24] Changsung Kim, Hsuan-Huei Shih, and C.-C. Jay Kuo, “Fast H.264 intra- prediction mode selection using joint spatial and transform domain features,” Elsevier Journal of Visual Communication and Image Representation, ac- cepted, 2005. [25] Gyu Yeong Kim, Yong Ho Moon, and Jae Ho Kim, “An early detection of all-zero DCT blocks in H.264,” Proceedings of IEEE International Conference on Image Processing, vol. 1, Oct. 2004. [26] Chih-HungKuo,MeiyinShen,andC.-C.JayKuo, “Fastinter-predictionmode decision and motion search for h.264,” Proc. IEEE International Conference on Multimedia and Expo, 2004. [27] Jeyun Lee and Byeungwoo Jeon, “Fast mode decision for H.264,” IEEE International Conference on Multimedia and Expo, vol.2, pp.1131–1134,June 2004. [28] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 620– 636, July 2003. [29] D.Marquardt, “Analgorithmforleast-squaresestimationofnonlinearparam- eters,” SIAM J. Appl. Math., vol. 11, pp. 431–441, 1963. [30] B. Meng and O. C. Au, “Fast intra-prediction mode selection for 4a blocks in H.264,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 389–92, 2003. [31] B. Meng, O. C. Au, Chi-Wah Wong, and Hong-Kwai Lam, “Efficient intra- prediction mode selection for 4x4 blocks in H.264,” IEEE International Con- ference on Multimedia and Expo, vol. 3, pp. 521–4, 2003. [32] D. Ormoneit and V. Tresp, “Averaging, maximum penalised likelihood and bayesian estimation for improving gaussian mixture probability density esti- mates,” IEEE Trans. Neural Networks, vol. 9, pp. 639–650, 1998. [33] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, “Video coding with H.264/AVC: tools, per- formance, and complexity,” IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, First Quarter 2004. 122 [34] N. Ozbek and A. M. Tekalp, “Fast H.264/AVC video encoding with multiple frame references,” IEEE International Conference on Image Processing, vol. 1, pp. 597–600, Sept. 2005. [35] F. Pan, X. Lin, S. Rahardja, K. P. Lim, and Z. G. 
Li, “A directional field based fast intra mode decision algorithm for H.264 video coding,” IEEE In- ternational Conference on Multimedia and Expo, vol. 2, pp. 1147–1150, June 2004. [36] F.Pan, X. Lin, and et. al., “Fastmode decision forIntra prediction,” ISOIEC JTC1SC29WG11 and ITU-T SG16 Q.6, JVT7 th Meeting Pattaya II, Pattaya, Thailand, 2003. [37] R.A.Redener andH.F.Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Rev., vol. 26, no. 2, pp. 195–239, 1984. [38] ITU-TStandardizationSector, “Advancedvideocodingforgenericaudiovisual services,” ITU-T Rec. H.264/ISO/IEC 14496-10:2005(E), 2005. [39] Yanfei Shen, Dongming Zhang, Chao Huang, and Jintao Li, “Fast mode se- lection based on texture analysis and local motion activity in H.264/JVT,” IEEE International Conference on Communications, Circuits and Systems, vol. 1, pp. 27–29, 2004. [40] B. W. Silverman, Density Estimation: For Statistics and Data Analysis, Lon- don: Chapman and Hall, 1986. [41] Yeping Su and M.-T. Sun, “Fast multiple reference frame motion estimation for H.264,” IEEE International Conference on Multimedia and Expo, vol. 1, pp. 695–698, June 2004. [42] Chi-Wang Ting, Hong Lam, and Lai-Man Po, “Fast block-matching motion estimation viarecent-biased search formultiple reference frames,” Proceedings of IEEE International Conference on Image Processing, vol. 4, Oct. 2004. [43] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions, New York: Wiley, 1985. [44] P.Topiwala, G.Sullivan, A.Joch, andF.Kossentini, “Performance evaluation of H.26L TML 8 vs. H.263++ and MPEG4,” ITU-T Q.6/SG16 VCEG-042, Video Coding Experts Group 15th Meeting, Pattaya, Thailand, 2001. [45] Alexis M. Tourapis, Oscar C. Au, and Ming L. Liou, “Highly efficient pre- dictive zonal algorithms for fast block-matching motion estimation,” IEEE Transactions on Circuits and Systmes for Video Technology, vol. 12, no. 10, pp. 1069–79, Oct. 2002. 123 [46] D.S. Turagaand T. Chen, “Classification based mode decisions forvideo over networks,” IEEE Transactions on Multimedia, vol. 3, no. 1, pp. 41–52, 2001. [47] A.Utsugi, “Hyperparameter selection forself-organizing maps,” Neural Com- put., vol. 9, pp. 637–648, 1997. [48] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits, System and Video Technology, vol. 7, pp. 1–19, 2003. [49] T. Wiegand, X. Zhang, and B. Girod, “Long-term memory motion compen- satedprediction,” IEEETransactionsonCircuitsandSystemsforVideoTech- nology, vol. 9, no. 1, pp. 70–84, Feb. 1999. [50] Qiaobing Xie, C. A. Laszlo, and R. K. Ward, “Vector quantization technique for nonparametric classifier design,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 1326–1330, Dec. 1993. [51] Jianfeng Xu, Zhibo Chen, and Yun He, “Efficient fast ME predictions and early-termination strategy based on H.264 statistical characters,” Proceedings of the Joint Conference of the Fourth International Conference on Informa- tion, Communications and Signal Processing 2003 and the Fourth Pacific Rim Conference on Multimedia, vol. 1, Dec. 2003. [52] L. Xu and M. I. Jordan, “Unsupervised learning by EM algorithm based on finite mixture of gaussians,” Proc. World Congr. Neural Networks(II), pp. 431–434, 1993. 
[53] LiboYang,KemanYu,JiangLi,andShipengLi, “Prediction-baseddirectional fractionalpixelmotionestimationforH.264videocoding,” IEEEInternational Conference on Acoustics, Speech, and Signal Processing, vol. 2, Mar. 2005. [54] M. Yang, H. Cui, and K. Tang, “Efficient tree structured motion estimation using successive elimination,” Proceeding of IEE Vision, Image and Signal Processing, vol. 151. [55] Ming Yang and Wensheng Wang, “Fast macroblock mode selection based on motion content classification in H.264/AVC,” IEEE International Conference on Image Processing, vol. 2, pp. 24–27, Oct. 2004. [56] H. Yin and N. M. Allinson, “Bayesian learning for self-organizing maps,” Electron. Lett., vol. 33, pp. 304–305, 1997. [57] H. Yin and N. M. Allinson, “Comparison of a bayesian SOM with the EM algorithm for gaussian mixtures,” Proc. Workshop Self-Organizing Maps, pp. 304–305, 1997. 124 [58] Peng Yin, H.-Y.C. Tourapis, A.M. Tourapis, and Jill Boyce, “Fast mode deci- sion and motion estimation for JVT/H.264,” IEEE International Conference on Image Processing, vol. 3, pp. 853–856, Sept. 2003. [59] A. C. Yu and G. R. Martin, “Advanced block size selection algorithm forinter frame coding in H.264/MPEG-4 AVC,” IEEE International Conference on Image Processing, vol. 1, pp. 95–98, Oct. 2004. [60] Zhi Zhou and Ming-Ting Sun, “Fast macroblock inter mode decision and motion estimation for H.264/MPEG-4 AVC,” IEEE International Conference on Image Processing, vol. 2, pp. 789–792, Oct. 2004. [61] ZhiZhou,Ming-TingSun,andYuh-FengHsu, “Fastvariableblock-sizemotion estimation algorithm based on merge and slit procedures for H.264/MPEG-4 AVC,” Proceedings of International Symposium on Circuits and Systems, vol. 3, May 2004. [62] S. Zhu and K.-K. Ma, “A new diamond search algorithm for fast block- matching motion estimation,” IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 287–290, Feb. 2000. 125