POWER EFFICIENT MULTIMEDIA APPLICATIONS ON EMBEDDED SYSTEMS

by

Yu Hu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2006

Copyright 2006 Yu Hu

Dedication

This dissertation is dedicated to my beloved family.

Acknowledgements

In my research, I have received assistance from many people. First and foremost, I would like to thank my advisor, Dr. C.-C. (Jay) Kuo, for his support, precious advice, consistent guidance and good wishes. Dr. Kuo has had a profound influence on me, not only as my academic advisor at USC but also in my life. His constant availability, even on weekends, his dedication to his work and research group, his professional integrity, and his pursuit of perfection helped me become a better individual. Dr. Kuo has made it his responsibility to ensure that I, as well as many of his other students, have the financial support we need to complete our education.

My gratitude goes to the committee members (in alphabetical order), Dr. Kai Hwang, Dr. Shri Narayanan, Dr. Antonio Ortega and Dr. Roger Zimmermann, for their invaluable comments and suggestions and for reading my thesis draft.

Dr. Lei Huang and Dr. Siwei Ma have contributed distinctive insights to my research, providing valuable comments and useful feedback on my paper drafts.

I would like to thank the students in Dr. Kuo's group: Dr. Dahua Xie, Bei Wang, Maychen Kuo, Dr. Yifeng Chen, Dr. Mingsui Lee, Yu Shiu, Szu-Wei Lee and others. They contributed to my research by providing valuable comments and useful feedback. I am lucky to have had these excellent people around me during my Ph.D. program.

During the Ph.D. program, I submitted several papers to peer-reviewed conferences. The anonymous reviewers provided valuable insights and criticisms that I have used to polish the research results.

Thanks to Diane Demetras, Tim Boston, Gloria Halfacre and the other administrative assistants who have worked in Electrical Engineering over the past years.

I would like to thank my parents, who have had a tremendous influence on my life. My sister Ye is my best friend, always there when I need her help. I am grateful for my husband Qing Li's constant trust, inspiration and support. This is not something I could have accomplished alone.

YU HU
University of Southern California
October 2006

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Introduction
1.1 Significance of the Research
1.2 Background of the Research
1.3 Contributions of the Research
1.4 Organization of the Dissertation
2 Background Review on Embedded Multimedia Systems
2.1 New Technologies in Micro-architecture
2.1.1 Multiple-issue Micro-architecture
2.1.2 TLP Micro-architectures
2.2 Multimedia Processing on Embedded Systems
2.2.1 SIMD Extension
2.2.2 VLIW/EPIC Micro-architecture
2.3 Power Analysis
2.3.1 Power Components
2.3.2 Power Performance Efficiency
2.3.3 Approach to Low Power Design
2.4 Conclusion
3 Run-time Power Consumption Modeling for Embedded Multimedia Applications
3.1 Introduction
3.2 Review of Related Work
3.2.1 Cycle-Accurate Gate-Level Power Analysis
3.2.2 Instruction-Level Power Analysis
3.2.3 VLIW Micro-Architecture Power Analysis
3.2.4 Macro-Modeling Energy Estimation
3.2.5 Run-Time System-Level Power Analysis
3.3 Methodology for Power Analysis
3.3.1 Simulation Environment
3.3.2 Benchmarks
3.4 Observations on Routine-Level Power/Energy Consumption
3.4.1 Power/Energy Profiling and Analysis
3.4.2 Power-IPC Relationship
3.4.3 Impact of Applied Voltage and Operating Frequency
3.5 Routine-Level Power/Energy Consumption Model
3.6 Model Validation
3.6.1 Validation via Trimaran Simulation
3.6.2 Validation via Commercial Media Processor
3.7 Applicability of the Proposed Power Model
3.7.1 Power Saving Oriented Design in Multimedia Processors
3.7.2 Execution Parameter Adjustment
3.7.3 Decoder Friendly Encoding Algorithm
3.8 Conclusion
4 Complexity-Adaptive Motion Search for H.264 Encoding
4.1 Introduction
4.2 Background and Related Work
4.2.1 Motion Estimation in H.264
4.2.2 Motion Estimation Modes of Variable Block Sizes
4.2.3 Rate-Distortion (R-D) Optimization
4.2.4 Related Work
4.3 Rate-Distortion-Complexity (RDC) Optimization Framework
4.4 Lagrangian Multiplier Selection
4.4.1 Experimental Verification
4.5 Experimental Results
4.6 Conclusion
5 Fast Inter-Mode Decision with RDC Optimization
5.1 Introduction
5.2 Fast Inter-Mode Decision with RDC Optimization
5.2.1 Algorithm Overview
5.2.2 Explanation of Test Conditions
5.2.3 SAD Threshold Selection
5.3 Complexity-Adaptive Motion Search
5.4 Experimental Results
5.4.1 RDC Performance
5.4.2 Compatibility with Other Fast Motion Search Schemes
5.4.3 Complexity Control
5.4.4 Energy Saving
5.5 Conclusion
6 Decoder-Friendly Adaptive Deblocking Filter (DF-ADF) Mode Decision in H.264/AVC
6.1 Introduction
6.2 Review of Adaptive Deblocking Filter
6.3 Rate-Distortion-Complexity Optimization Framework
6.4 Complexity Models
6.4.1 Complexity Model Based on Software Execution
6.4.2 Complexity Model Based on Boundary Strength
6.5 Energy Costs Associated with Selected Modes
6.6 Experimental Results
6.7 Conclusion and Future Work
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
Bibliography

List Of Tables

1.1 Evolution of DSP's capabilities over the years.
2.1 Four major categories of ILP architectures.
2.2 GPP SIMD extension.
2.3 List of major processors.
3.1 Target processor configuration.
3.2 Benchmarks.
3.3 Power consumption models for the most energy-consumptive routines in the H.264/AVC encoder.
3.4 Power consumption models for several multimedia routines.
3.5 Parameter γ as a function of f and v with small-scale variation.
3.6 Parameter γ as a function of f and v with large-scale variation.
3.7 Multimedia application IPC and power.
4.1 Block modes and their SAD complexity.
4.2 Description of experimental data and parameters.
4.3 R-D-C performance of the reference H.264 encoder.
4.4 Experimental results of five video sequences with algorithm I.
5.1 Experimental results of five video sequences with algorithm II (κ=2, λ=0).
5.2 Experimental results of five video sequences with the hybrid algorithm (κ=2, λ=(4.14)).
5.3 Experimental data and simulation parameters.
5.4 Performance of complexity control.
5.5 Block modes and their SAD energy.
6.1 Boundary strength parameter.
6.2 Deblocking filters' complexity and energy cost.
6.3 Edge numbers associated with a selected mode.
6.4 Description of experimental environment and parameters.
6.5 Experimental results of four video sequences.

List Of Figures

1.1 The power dissipation trend.
1.2 The cost of removing heat from a microprocessor.
1.3 The diverging gap between the actual battery capacities and demanded energy.
2.1 Illustration of the instruction set architecture (ISA) and the micro-architecture and their relationship to other software and hardware components in a computer.
2.2 Relationship between three main tasks and four ILP architectures.
2.3 Thread level parallelisms of superscalar, SMT and CMP.
2.4 Examples of subword execution of common multimedia operations.
2.5 Simplified block diagram of a VLIW core.
3.1 Global data flow for the routine-level power estimation.
3.2 The instruction number and energy consumption of functional units in several typical multimedia application routines.
3.3 Energy consumption break-down for the MPEG-2 decoder.
3.4 The average power consumption and its standard deviation.
3.5 Energy distribution for several test routines.
3.6 Power correlation for performance events.
3.7 The α and β values for the H.264 encoder.
3.8 The average power-IPC relationship for several test benchmarks.
3.9 The plot of standard deviations of power estimation errors with validation data for various codec benchmarks.
3.10 The plot of standard deviations of energy estimation errors with validation data for various codec benchmarks.
3.11 The power consumption plot for the C64x.
3.12 Comparison of parameter γ from the simulation and the manual of the TI C6416 DSP.
3.13 Comparison of parameter γ with respect to the fine-scale frequency/voltage variation.
3.14 Comparison of parameter γ with respect to the coarse-scale frequency/voltage variation.
3.15 The normalized energy for encoders and decoders of multimedia applications.
3.16 The normalized power consumption for encoders and decoders of multimedia applications.
3.17 The IPC for encoders and decoders of multimedia applications.
3.18 The energy consumption as a function of QP for three test sequences.
3.19 Comparison of energy consumption in various algorithmic modules.
3.20 The energy distribution in different algorithmic modules of the H.264/AVC decoder.
4.1 A hybrid video codec using motion compensated predictive coding.
4.2 MB partition modes.
4.3 Coarse and fine search patterns for local refinement: (a) the diamond search and (b) the hexagon search.
4.4 J^{R,D}_{Motion} application in local refinement and fast mode decision.
4.5 Proposed algorithm: RDC optimization in local refinement and fast mode decision.
4.6 The rate-complexity and the rate-quality performance of the Stefan sequence parameterized by the value of λ.
4.7 The R-D cost J^{R,D}_{Mode} as a function of the complexity per MB for the Stefan sequence with the H.264 reference coder and the RDC-optimized algorithm.
4.8 The complexity efficiency (CE) plotted as a function of the Lagrange parameter λ, parameterized by QP values.
4.9 Lagrange multiplier λ as a quadratic function of QP.
4.10 The plot of -dJ^{R,D}_{Motion}/dC_{Mode} as a function of QP.
4.11 The rate-distortion (R-D) and rate-complexity (R-C) curves for test sequences.
5.1 Fast inter-mode decision with RDC optimization.
5.2 The performance of the proposed algorithm for Mobile & Calendar.
5.3 Relationship between the Lagrange multiplier and complexity reduction.
5.4 Performance of rate-distortion and rate-complexity using the proposed scheme with diamond search.
5.5 Flower: frame-to-frame computational complexity and video quality comparison.
5.6 Performance of rate-distortion and rate-complexity using the proposed scheme with hexagon search.
5.7 Comparison of frame-to-frame computational complexity.
5.8 Energy consumed by motion estimation for the original H.264/AVC encoder and the proposed algorithms.
6.1 The order of deblocking filters applied to an MB.
6.2 Horizontal block edge.
6.3 Performance of rate-distortion and rate-energy-saving using the proposed DF-ADF algorithm for the Foreman sequence.

Abstract

This thesis proposes a complete solution for power efficient multimedia applications on embedded systems. We concentrate our research on the emerging video standard, H.264/AVC, at both the encoder and decoder ends. First, a run-time power/energy estimation model is presented to provide a fast yet accurate tool for the energy analysis of multimedia applications on embedded systems. Then, a rate-distortion-complexity (RDC) optimization algorithm is proposed to simplify H.264/AVC motion estimation, which is the most energy-consuming component in this emerging video standard. Based on this RDC framework, a fast inter-mode decision algorithm is used to enhance the energy saving. Finally, a decoder-friendly adaptive deblocking filter (DF-ADF) mode decision algorithm is proposed to reduce the decoder's energy consumption.

The power dissipation of a wide spectrum of multimedia applications on a VLIW processor is studied first to characterize their power performance. As revealed by the collected statistics, the instruction decode unit in a VLIW processor consumes nearly 50% of the total energy in most multimedia applications. This implies that the instruction set architecture (ISA) design is the key to the power minimization of an embedded multimedia system. The power profiling results suggest a strong correlation between IPC and power dissipation. By exploiting this relationship, power and energy models for multimedia applications on VLIW processors were proposed. The proposed model was validated against the TI C6416 chip-set, yielding a low error rate. This simple model leads to efficient run-time power estimation for various multimedia systems and provides insights into the energy saving of the VLIW processor.

It is observed that motion estimation is the most complexity- and energy-expensive component in H.264/AVC. Therefore, a novel complexity-adaptive motion estimation scheme was proposed and applied to the H.264 video standard so that the encoder motion estimation complexity is reduced with little degradation in video quality at the price of a small bit rate increase. Experiments were conducted using test sequences with low to high motion activities to demonstrate the advantages of the proposed system. Up to 35% of the motion estimation complexity can be saved at the encoder with less than 0.2 dB PSNR loss and a maximum bit rate increase of 3%.

To further improve the complexity saving, a complexity-constrained inter-mode decision algorithm for H.264/AVC video coding was developed. The proposed algorithm, i.e., the "mode skip test", follows the rate-distortion-complexity (RDC) optimization framework. To be more specific, the "mode skip test" is executed between the MV search for mode 16x16 and that for mode 8x8.
It contains two tests that check the number of DCT coefficients of prediction errors quantized to zero and use it to predict whether the underlying block modes can be skipped. As a result, more redundant block modes can be filtered out to save complexity while maintaining the excellent coding performance of H.264/AVC. The corresponding energy saving shows a trend similar to the complexity saving.

A novel decoder-friendly adaptive deblocking filter (DF-ADF) mode decision algorithm was examined. Both the complexity and the energy saving of the decoder rely on the usage of ADFs. We first construct the complexity model for the deblocking filter, and the energy consumption model is built on the IPC-based dynamic power model described in Chapter 3. Then, the encoder performs the rate-distortion-decoder complexity (RDC) optimization to save the energy needed for deblocking filter operations in decoding. We observed a significant amount of energy saving (up to 30%) in the deblocking filter with negligible quality degradation (less than 0.2 dB) and bit rate increase (within 1%).

Chapter 1
Introduction

1.1 Significance of the Research

Two branches have developed along the microprocessor's evolution: the digital signal processor (DSP) and the general purpose processor (GPP). Basically, the DSP is a type of specialized microprocessor designed particularly for digital signal processing applications used in controllers and embedded multimedia systems (EMS). A DSP has powerful processing capability for certain algorithms, using a specialized instruction set or a dedicated co-processor. In contrast, the GPP aims at fast execution of generic applications, from multimedia to scientific computation, using a much higher frequency and a larger memory hierarchy than DSPs.

Microprocessor design is usually performance-driven. That is, the primary goal of both DSP and GPP design is to pursue faster execution. Table 1.1 shows the evolution of DSP's capabilities over the years. Although only DSP or GPP data are summarized in Table 1.1 and Fig. 1.1, they reveal the general trend of microprocessor systems.

Table 1.1: Evolution of DSP's capabilities over the years.

                     1980    1990    2000    2010
Die size (mm^2)      50      50      50      5
Technology (μm)      3       0.8     0.1     0.02
MHz                  20      80      1K      10K
MIPS                 5       40      5K      50K
Transistors          50K     500K    5M      50M
RAM (bytes)          256     2K      32K     1M
Price (dollars)      150     15      5       0.15

In 1980, a 50,000-transistor DSP offered 5 MIPS for $150. Two decades later, a 5,000,000-transistor DSP capable of 5,000 MIPS cost only $5. These numbers show that microprocessor design has been dominated by the pursuit of performance since the 1990s. This trend continues today, as the market demands high-performance microprocessors for wireless channels, faster Internet delivery and higher-quality multimedia.

The pursuit of higher clock rates and better CPU performance has driven power and energy consumption so high that they have become a key issue in microprocessor design. Let us take a glance at the evolution of power consumption as the processor's capability improves. Fig. 1.1 [7] shows that power dissipation increases along the processor's evolution. The faster a processor, the higher its power dissipation. Consequently, more aggressive cooling is required. As shown in Fig. 1.2 [15], since the cooling cost rises exponentially as thermal dissipation increases, the processor packaging becomes a major expense item.
Therefore, there is an urgent need for techniques to help measure and control power dissipation, especially run-time responses that keep the processor within safe thermal limits by changing its behavior rather than relying on costly thermal packaging. Evaluating such techniques, however, requires a power dissipation model that is practical for architectural studies.

Figure 1.1: The power dissipation trend (power density in W/cm^2 of Intel processors from the 386 to the Pentium 4, 1980-2010, with hot-plate and nuclear-reactor densities shown for reference).

Multimedia applications have become one of the most dominant tasks for modern microprocessor systems. The EMS is a major platform in the digital media era due to the significant growth of the mobile Internet and portable media devices. Most EMSs are battery-powered. There is a growing gap between actual battery capacity and the energy needs of EMSs because of the heavy computation requirements of today's multimedia applications, as shown in Fig. 1.3 [7]. Thus, it is critical to study the power behavior of multimedia applications on the embedded microprocessor system.

Although power analysis has long been explored for the general purpose processor (GPP) and the superscalar-structured processor, little attention has been paid to the EMS, even though it possesses a specific energy dissipation pattern that is quite different from that of GPPs. Furthermore, due to its sophisticated instruction set architecture (ISA), "run-time" power and energy estimation for a media processor is difficult to perform with existing approaches. It is critical to understand the energy dissipation phenomena of embedded multimedia applications and build a run-time system-level power model for battery-powered portable devices so as to achieve power-efficient implementations.

Figure 1.2: The cost of removing heat from a microprocessor (cooling solution cost in dollars versus thermal dissipation in watts).

Figure 1.3: The diverging gap between the actual battery capacities and demanded energy (battery capacity versus energy requirement in mAh, 2000-2007; source: Anand Raghunathan, NEC Labs).

More and more complex multimedia coding algorithms have been developed to pursue a better coding gain [63], [48]. H.264, an emerging video coding standard, provides a representative example. H.264 uses the same hybrid block-based motion compensation and transform coding model as previous standards such as H.263 and MPEG-4. However, a number of new features and capabilities [62] have been introduced in H.264 to effectively improve its compression efficiency. As the standard becomes more complex, the encoding and decoding processes require more computation. Consequently, more power dissipation and energy consumption are incurred as compared with most existing standards. For example, the H.264 reference encoder on state-of-the-art processors runs orders of magnitude slower than real time, even for CIF video sequences.
Since multimedia applications normally have a large amount of data access, a common belief is that they have a poor memory behavior as compared to traditional programs and that current cache architectures cannot handle them well. It is therefore important to quantitatively characterize the memory behavior of these applications in order to provide insights for future design and research of EMS. Nevertheless, very few results on this topic have been published. The cache behavior analysis for the H.264 standard will be investigated in this research. 1.2 Background of the Research The goal of power analysis is to understand the power consumption phenomenon and minimize the power consumption under the hardware constraint. Research on power analysis have been conducted at different levels, including the transistor level, the gate 5 level, the register transfer level (RTL), the instruction level and the system level. They are detailed below. 1. The Transistor (Circuit) Level The transistor level is of the lowest level that captures the power dissipation statis- tics by collecting data such as the transistor number, size, cycle-accurate active status. Consequently, the simulation is extremely time-consuming. Usually it may take days for a small program. 2. The Gate Level The gate level exploits the probabilistic simulation by sampling and compaction the ASIC library models. Power analysis tools for the transistor and gate levels have been long commercially available such as PowerMill [52] and SPICE [36]. 3. The Register Transfer Level (RTL) The RTL approach specifies the data flow between registers, e.g., what and where this information is stored and how it is passed through the circuit during its op- eration. The RTL simulation is usually conducted by the hardware description language such as Verilog or VHDL. 4. The Instruction Level Research on the instruction level simulation was first initiated for the simple RISC system by Tiwari [55]. An instruction level model is derived empirically, where de- tails of lower-level models are hidden behind a simple interface, i.e. the instruction set architecture (ISA). The main goal is to provide an energy consumption analysis for a given piece of software codes over an available processor. 6 5. The System Level The system level power analysis, which is the most abstract level, aims at flexible treatment of different low-level components and programs. Most research in this level cannot be applied in the design stage but to available processors only. In general, the higher level the power analysis is, the less physical details and the shorter simulation time. However, most of these studies have only focused on GPPs. Embedded media pro- cessors possess a specific energy dissipation pattern, which is quite different from that of GPPs due to a different micro-architecture known as the VLIW and the multimedia instruction extension (SIMD). Some researchers attempted to extend the instruction-level power model to VLIW processors, e.g. [41]. The active power analysis in [41] includes: the inter-instruction-word effect, the intra-instruction-word effect and the instruction per- mutation inside an instruction word. However, it is difficult to perform the “run-time” power and energy estimation for a media processor using this approach since it is too complicated to apply. On the other hand, due to extensive multimedia applications, it is critical to provide a fast yet accurate energy and power estimation method for the power analysis purpose. 
As mentioned before, H.264 trades the computational complexity for better coding gain. However, many new features in H.264 may have to be sacrificed to meet the real-time requirement over EMS. Our individual experiment reveals that, to achieve an encoding speed of 15 frames per second over TI C6416 DSP at 600 MHz frequency for CIF size video, the encoded bitstream can only possess a rate-distortion characteristic similar to 7 the MPEG-2 standard. Among all the components in an H.264 encoder, motion estima- tion (ME) is no doubt the most time-consuming one. From the computer architecture perspective, H.264 ME is difficult to handle because of the large computation and highly memory reference frequency. Even though the overall cache behavior for some embedded multimedia benchmark suite has been studied before, e.g. [66], [47], to the best of our knowledge, there is no quantitative analysis available on H.264 ME yet. Kossentini et al. explored the computation-constrained MPEG-2 motion estimation in [32] and [14] without much con- cern on the rate. Moreover, the complexity weight and control parameters were not carefully tuned. Chen et al. [11] proposed a VLSI architecture to support fast motion estimation for H.264, where a lot of efforts were made to cut down the ME cost. Tourapis [58] extended his enhanced predictive zonal search (EPZS) algorithm to reduce the ME cost in H.264. Moreover, He et al. [16] optimized the power-rate-distortion performance by constraining the sum of absolute difference (SAD) operations during the ME process at the encoder. There are also efforts to reduce motion compensation cost in the decoder. Ray and Radha [40] proposed a method to reduce the decoding complexity by selectively replacing the I-B-P Group of Picture (GOP) structure with I-P only. Due to the popularity of mobile/portable video devices, most of which are video decoders nowadays, research has been done to design the decoder-friendly encoder algorithm that generates low-decoding- complexity and high-quality bit stream for decoders, e.g. [61]. 8 1.3 Contributions of The Research Several contributions have been made in this research. They are highlighted below. • Bottleneck identification of EMS As revealed by the collected statistics, the instruction decode unit in a VLIW/EPIC processor consumes nearly 50% of the total energy in most multimedia applications, which is associated with the the specialized VLIW instruction structure and is differ- ent from that of superscalar-structured processor. This implies that the instruction set architecture (ISA) design is the key to the power minimization of an embedded multimedia system. • IPC-based run-time power model Based on our study, a run-time power model is proposed to achieve fast and accurate power estimation at the routine-level. This modeling technique can be extended to the program-level when there exists stable occupation of each routine in the entire application. It is shown that the model can predict the cumulative energy consumption of validating sequences with little error. • Model validation The validation step is crucial since an estimation model is acceptable only when it is reasonably accurate. Since we have no access to the details of any industry hardware design and its detailed capacitance values, our run-time power model is validated by two ways. The first approach is to compare the predicted cumulative energy consumption value with experiment statistics. 
The other is to validate the analysis 9 with a commercial VLIW processor by comparing the power behavior characteristics provided by the manufacturer. These two approaches show that our simple model leads to an efficient run-time power estimation for various multimedia systems, and provides insights into the energy saving of the VLIW processor. • Complexity-adaptive H.264 motion estimation scheme A novel complexity adaptive motion estimation is proposed and applied to the H.264 coding standard so that the encoder motion estimation complexity is reduced with little degradation in video quality and small bit rate increase. Our experiments consider a wide range of test sequences consisting of low to high motion. It is demonstrated that up to 35% motion estimation complexity can be saved at the encoder with less than 0.2dB PSNR loss and a maximum increase of 3% bit rate. • Fast H.264/AVC inter-mode decision with RDC optimization The rate-distortion-complexity (RDC) optimization framework is extended to de- velop a complexity-constrained inter-mode decision algorithm for H.264/AVC video coding. The number of DCT coefficient to be quantized to zero is considered jointly with the RDC optimization framework to predict redundant block modes. Conse- quently, more complexity saving is achieved by incorporating the DCT-quantization procedure. The good R-D-C performance tradeoff was supported by experimental results. • Decoder-friendly adaptive Deblocking filter mode decision in H.264/AVC Video encoding to yield a decoder-friendly H.264 bit stream that consumes less decoding energy yet with little coding efficiency degradation is investigated. The 10 energy saving of the decoder relies on the use of adaptive deblocking filters (ADF). An energy consumption model for the deblocking filter is first constructed via the IPC-based run-time power model. Then, the encoder performs the rate-distortion- decoder complexity optimization (RDC) to save the decoder energy needed for deblocking filter operations, which is called the decoder-friendly adaptive deblock- ing filter (DF-ADF) mode decision. The RDC optimization framework presents a way to balance coding efficiency and the ADF decoding cost in the mode decision process. The effectiveness of the proposed DF-ADF algorithm is demonstrated by experiments with diverse video contents and bit rates. 1.4 Organization of the Dissertation The rest of the dissertation is organized as follows. The background on embedded mul- timedia systems (EMS), including its computer architecture, the major challenge it faces in performance and power consumption and typical low-power techniques are reviewed in Chapter 2. The run-time power consumption model for multimedia application rou- tines in an embedded system is developed Chapter 3. Specifically because of the high- complexity of emerging H.264, its execution over EMS is extremely time-consuming and power-expensive. A complexity-constrained motion estimation is proposed to alleviate the motion estimation cost for EMS in Chapter 4. The rate-distortion-complexity optimiza- tion framework is extended to develop a fast inter-mode decision with RDC optimization in Chapter 5. The number of DCT coefficients of prediction errors quantized to zeros is exploited to predict whether the underlying block modes is redundant or not. The 11 corresponding energy saving is also examined. 
The IPC-based run time power model helps build a decoder-friendly adaptive deblocking filter mode decision in Chapter 6, which yields a decoder-friendly H.264 bit stream consuming less decoding energy. Fi- nally, the thesis dissertation is summarized and future research directions are pointed out in Chapter 7. 12 Chapter 2 Background Review on Embedded Multimedia Systems With the significant growth in the Internet and portable media devices, the embedded multimedia system (EMS) has become a major platform in the digital media era. The pursuit of higher clock rates and better processor performance has driven power and energy drastically higher so that they become a key issue in the EMS system design and the software implementation. In this chapter, we will briefly review the computer architecture and the emerging problems of EMS. The computer architecture in general is composed by two layers: the instruction set architecture (ISA) and the micro-architecture as shown in Fig. 2.1, which is from [17]. Basically, ISA describes a programmer’s vision of the computer architecture, including the native data-types, instructions, registers, addressing modes, etc. Thus, a good ISA design should provide an friendly interface for the OS, the compiler and programmers to manipulate the processor. The micro-architecture is built based on of a set of micropro- cessor design techniques such as the pipeline and the cache subsystems. They are used to implement the instruction set in hardware. Hence, a good ISA design has to be easily implementable for computer architects or designers. Moreover, it has to be considered for 13 Instruction Set Architecture Micro-architecture Software Hardware Operation System Application: (AI, DB, Graphics) Program Language, compiler VLSI Hardware implementation Computer Architecture Figure 2.1: Illustration of the instruction set architecture (ISA) and the micro- architecture and their relationship to other software and hardware components in a com- puter. the future compatibility since a good ISA may last for 30 years. In this chapter, we will briefly review the new micro-architecture techniques and the approach for multimedia processing. 2.1 New Technologies in Micro-architecture The data level parallelism (DLP), the instruction level parallelism (ILP) and the thread level parallelism (TLP) are the three main parallelisms used by advanced processors to improve system performance,i.e., to reduce program execution time (ET) and to enhance system throughput. The multiple-issue processor is motivated by ILP while simultaneous multithread (SMT) and chip multi-processing (CMP) are motivated by TLP. 14 2.1.1 Multiple-issue Micro-architecture To enhance the computer performance, multiple-issue processors are designed to take advantage of ILP so that processor’s throughput is improved by a larger number of instructions per clock (IPC) as compared with the single-issue processor. To process instructions in parallel, there are three major tasks [42]. 1. Grouping Checking dependencies between instructions to determine which instructions can be grouped together for parallel execution. 2. Function unit assignment Assigning instructions to functional units in the hardware. 3. Initiation Determine when the instructions can be executed. Multiple-issue processors originally have two basic flavors: the superscalar processor and the very long instruction word (VLIW) processor. 
A superscalar processor issues a varying number of instructions per clock and can be either dynamically or statically scheduled. By dynamic scheduling, we mean that independent instructions are scheduled and issued at run time according to computational need and previous outcomes. It provides more flexibility and better performance than static scheduling at the price of additional expensive hardware support.

For the VLIW-structured processor, the ILP implementation is accomplished by compiler software. Rather than issuing multiple independent instructions to function units, a VLIW processor groups instructions into a very long instruction package during the compiling stage. Thus, the compiler has to detect potential data hazard stalls and organize instructions into an instruction issue package. Since there is no need for hardware to check dependencies explicitly, the VLIW architecture offers the advantage of simpler hardware while exhibiting relatively good performance through extensive compiler optimization. Nevertheless, static scheduling brings several problems to the VLIW processor, such as object-code incompatibility, a large object code size and poor scheduling for unpredictable branches. Among these drawbacks, incompatibility is the worst one: all code has to be recompiled for every machine. In contrast, the object code of a superscalar processor has better compatibility, since all three tasks are done by hardware.

Since the 1990s, academia and industry have begun to explore new architectures, dynamic VLIW and explicit parallel instruction computing (EPIC), which conduct the three tasks using the compiler and hardware together. Table 2.1 compares the four classes of ILP architectures, which accomplish the three tasks either in hardware or in the compiler. Dynamic VLIW can respond to events at run time that cannot be handled by the compiler. For example, a cache miss may go beyond the control of an earlier VLIW processor, since it would disrupt the sequence of long instruction words by invalidating the compiler's assumption on the latency of load instructions. A load-miss interlock is added to a dynamic VLIW processor to stall the entire machine on a cache miss. Today's VLIW media processors are mostly not of the original VLIW form but dynamic VLIW. The EPIC processor retains compatibility across different implementations, as superscalars do, but does not require the dependency-checking hardware of superscalars. In this manner, EPIC can be said to combine the best of the superscalar and the VLIW processor. The Intel Itanium and Itanium 2 [25] are the first EPIC-structured commercial processors, which cost Intel and HP more than ten years to develop.

Table 2.1: Four major categories of ILP architectures.

                Grouping    Function unit assignment    Initiation
Superscalar     Hardware    Hardware                    Hardware
EPIC            Compiler    Hardware                    Hardware
Dynamic VLIW    Compiler    Compiler                    Hardware
VLIW            Compiler    Compiler                    Compiler

Figure 2.2: Relationship between three main tasks and four ILP architectures.

Fig. 2.2 describes how the three main tasks are performed by the four types of ILP processors. The four horizontal lines show four different levels where the program information can be transferred from the software (compiler) to the hardware.
At the top level, where a traditional instruction set is used, the superscalar hardware must perform all three tasks, since no instructions in the instruction set convey information about independent instruction groups, function unit assignment or initiation.

2.1.2 TLP Micro-architectures

The multiple-issue processor has the potential to achieve high performance, but that potential is ultimately limited by instruction dependencies and long-latency operations within a single executing thread. In other words, the ILP technique is limited by the amount of parallelism that can be found in a single thread in a single cycle. At the same time, as the issue width increases, the ability of traditional ILP to utilize the processor resources decreases.

Thread level parallelism (TLP) is the parallelism inherent in one or multiple applications that allows the processor to run multiple threads at once. Both simultaneous multi-threading (SMT) and chip multiprocessing (CMP) have been proposed to speed up processors by employing TLP. SMT allows multiple threads to compete for and share the available processor resources every cycle. A commercial prototype is Intel's hyper-threading technology [24], which is adopted in the Pentium 4 processor. CMP incorporates multiple processor cores on one chip, which usually share the same level-2 data cache (L2D). As a result of this architecture, CMP exploits TLP by executing different threads on different processors. A successful example of CMP is Intel's Pentium 4 server processor Xeon [26], featuring dual-core processing.

Fig. 2.3 illustrates how the superscalar, SMT and CMP architectures take advantage of TLP. The superscalar processes threads in serial order, as shown in Fig. 2.3(a). One thread can enter the pipeline only after the previous one finishes its execution, even if there are idle resources in the processor. On the contrary, SMT can have more than one thread processed in the CPU as long as there is no functional-unit usage collision, as shown in Fig. 2.3(b). The CMP operation is illustrated in Fig. 2.3(c). The CMP can handle multiple independent threads, one on each processor, since it contains multiple sets of function units on one chip.

Figure 2.3: Thread level parallelisms of superscalar, SMT and CMP: (a) superscalar, (b) simultaneous multi-threading, (c) chip multi-processing.

2.2 Multimedia Processing on Embedded Systems

Contemporary computer applications are multimedia-rich, involving a significant amount of compression and decompression work for audio and video signals, 2-D and 3-D graphics and speech. Two common approaches to handling multimedia workloads are reviewed in this section: the single instruction multiple data (SIMD) instruction extension and the very long instruction word (VLIW) design.

2.2.1 SIMD Extension

Usually, 8-bit and 16-bit data are used to represent video and audio signals, respectively. However, for most of today's computer systems, the calculation and storage unit is 32-bit or 64-bit. Thus, computation and storage resources are wasted for multimedia applications. The SIMD design can exploit the inherent DLP in multimedia applications by packing multiple video or audio data elements into one register. For example, two 16-bit data or four 8-bit data can be packed into one 32-bit word for calculation. By doing so, the throughput can be enhanced by 2 to 4 times if the overhead of packing and unpacking is negligible.
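As a concrete illustration of this packing idea, the C sketch below (illustrative only; real SIMD extensions such as MMX/SSE, AltiVec, or the ADD2 instruction on the TI C6x perform the lane-wise add in a single hardware instruction) packs two 16-bit samples into a 32-bit word and performs a partitioned add in which a carry out of the low lane is prevented from corrupting the high lane.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack two 16-bit samples into one 32-bit word (low sample in bits 0..15). */
static uint32_t pack2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Partitioned 16-bit add on a 32-bit register: both lanes are added in one
 * pass, and the mask/XOR step keeps a carry from the low lane out of the
 * high lane (the classic "SIMD within a register" trick). */
static uint32_t add16x2(uint32_t a, uint32_t b)
{
    return ((a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu)) ^ ((a ^ b) & 0x80008000u);
}

int main(void)
{
    uint32_t a = pack2(1000, 65535);  /* lanes: 1000, 65535              */
    uint32_t b = pack2(  24,     1);  /* lanes:   24, 1                  */
    uint32_t c = add16x2(a, b);       /* lanes: 1024, 0 (wraps mod 2^16) */
    printf("hi lane = %u, lo lane = %u\n",
           (unsigned)(c >> 16), (unsigned)(c & 0xFFFFu));
    return 0;
}
```

The software masking shown here is exactly the overhead that a hardware SIMD extension removes, which is why the 2x to 4x speed-up quoted above is only achievable when packing, unpacking and lane isolation are supported by dedicated instructions.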
Basically, the main advantage of the SIMD technique is to hide memory access latency by loading multiple elements at once and then overlapping execution with multiple data transfers. SIMD extensions can be found on most CPUs today, from GPPs to DSPs, including PowerPC's AltiVec and Intel's MMX, SSE, SSE2 and SSE3. Meanwhile, 64-bit and 128-bit SIMD has recently become popular on general-purpose CPUs. Table 2.2 lists processor vendors that have shipped SIMD extensions with their cores.

Table 2.2: GPP SIMD extension.

Vendor                      Processor             SIMD extension            Description
Hewlett Packard             PA-RISC               MAX-1, MAX-2              Media acceleration extensions
Sun Microsystems            UltraSparc            VIS                       Visual instruction set
Intel                       x86, Pentium          MMX, SSE, SSE2, SSE3      Multimedia eXtensions; Streaming SIMD extensions
AMD                         x86                   MMX, SSE, 3DNow!          MultiMedia eXtensions; Streaming SIMD Extensions; 3DNow! extension
Cyrix                       x86                   MMX                       Multimedia eXtensions
MIPS                        MIPS V                MDMX                      MIPS Digital Media eXtensions
Compaq                      Alpha                 MVI                       Motion Video Instructions
Motorola, IBM, Apple        PowerPC               AltiVec                   AltiVec extensions
Philips Semiconductors      PNX 1300 (Trimedia)   unknown                   N/A
Texas Instruments           TMS320C6000           unknown                   N/A

Due to different applications and manufacturers, SIMD extensions vary from one to another. Nevertheless, to support multiple-element operations such as addition and multiplication, packing and unpacking operations are essential to get the data well aligned in registers for calculation and storage. Thus, a SIMD extension includes two fundamental components: computation operations and packing/unpacking operations. Moreover, to speed up digital signal processing, permutation and clip operations are also included in some extensions. Below are several examples of SIMD instructions.

Fig. 2.4 shows examples of subword execution of common multimedia operations. In Fig. 2.4(a), a purely data-parallel add operation with two subwords in each register is performed. In Fig. 2.4(b), the multiplication of two subword pairs and the addition of the two partial results produce one result word. Fig. 2.4(c) illustrates how packing merges the less-significant bytes of two source registers into the destination register [13]: the mergelsb operation interleaves two pairs of least-significant bytes from arguments rsrc1 and rsrc2 into rdest.

Figure 2.4: Examples of subword execution of common multimedia operations: (a) add operation, (b) multiplication-add operation, (c) packing operation.

2.2.2 VLIW/EPIC Micro-architecture

Due to the processing regularity of multimedia applications and cost concerns, statically scheduled processors such as the VLIW processor are a viable option compared with dynamically scheduled processors, e.g., state-of-the-art superscalar GPPs. As discussed in Sec. 2.1.1, VLIW processors rely on the compiler to identify the ILP during compilation, and then assemble wide instruction packets to issue multiple instructions per cycle. Fig. 2.5 shows a block diagram of a generic VLIW processor core that can issue an instruction word containing as many as six operations. Deficiencies such as incompatibility and low efficiency in handling cache misses prevent VLIW (at least in its original form) from being mainstream. Table 2.3 lists major vendors of media processors and their processors [54], [49], [43], [22].

Figure 2.5: Simplified block diagram of a VLIW core (a 196-bit instruction package issuing to two integer ALUs, a floating-point unit, two load/store units and a branch unit).
Several 22 FADD Floating point function unit Integer ALU #1 Load/Store unit #2 Branch unit 196-bit instruction package Integer ALU #2 ADD SHIFT LD ST BRCC Load/Store unit #1 Figure 2.5: Simplified block diagram of a VLIW core. features have been added to EPIC, the ”2nd generation VLIW” architecture, to overcome these shortcomings. They include the following. Dependency information in bundles Each of the multiple operation instructions is called a bundle. Each of the bundles has the information to indicate if this set of operations will affect the subsequent bundle. This allows future implementations to issue multiple bundles in parallel. The dependency information is calculated by the compiler, thus relieving the hard- ware implementation of doing operand dependency checking. Data prefetch A speculative load instruction is used as a type of data prefetch. This prefetch increases chances for a primary cache hit under normal loads. Furthermore, a check load instruction that further aids speculative loads by checking that a load did not depend on a previous stored value. Speculative execution Predicated execution is used to decrease the occurrence of branches and increase the speculative execution of instructions. For this feature, branch conditions are 23 Table 2.3: List of major processors. Manufacturer Media processor Application BSP-16 Digital media Equator 4-way VLIW processor Digital imaging Technologies with SIMD extension Video conference 350∼ 500 MHz Video security and surveillance PNX 1300 (TriMedia) PVR and home media Philips 5-instruction VLIW architecture DVD recorder Semiconductors with special multimedia SIMD Media adaptor 200 MHz, 7.7 BOPS Sony PlayStation2’s Emotion Engine 3-D graphics 2-way superscalar TMS320C6000 wireless communication Texas (C6200, C6400, C6700) base station Instruments 2-duplicate 4-way VLIW with video and image processing SIMD capability Audio and medical image converted to predicate registers which are used to kill results of executed instructions from the side of the branch which is not taken. Large register file A large architectural register file is kept to avoid the need for register renaming. Besides, the Itanium [25] architecture adds register rotations, which shows how a digital signal processing techniques can be used for loop un-rolling and software pipelining. 2.3 Power Analysis Several new techniques to improve system performance were reviewed in Sec. 2.1. Mean- while, their high power consumption brings up problems in cooling, reliability, clock synchronization as well as the packaging cost. According to the Environment Protection Agency (EPA), computers consume 10% of commercial electricity consumption [2]. It 24 was cited in EPA’s report that the growth of Internet data centers contributed to the 2000/2001 California energy crisis. As mentioned before, the diverging gap between the actual battery capacity and energy requirement becomes the primary driver to research on low-power embedded systems and portable devices. Thus, the power-aware design is essentially across all computing platforms, from desktop/set-top devices, mobile/portable systems, and even to servers. 2.3.1 Power Components Power dissipation consists of two parts: dynamic power and static power. For most processors, dynamic power is dominant while static power has a larger weight due to more transistors are integrated into one unit area nowadays. Dynamic power mainly comes from two sources: capacitive power and short-circuit power. 
The former is caused by charging and discharging at transitions from 0 to 1 and 1 to 0 while the latter is due to the brief short-circuit current during transitions. Fortunately, the short circuit power can be minimized and it is relatively less important. Since capacitive power is the major contributor to dynamic power, dynamic power is also called switch power. SMT and CMP technologies get dynamic power increased by having more components on processors involved into program execution. Static power, which is also named the leakage power, is independent of the system busy status. As a by-product of advanced VLSI techniques, which aims at integrating more and more transistors into one unit die, static power is getting bigger and bigger. One straightforward method to lower static power is to use fewer or smaller transistor. 25 2.3.2 Power Performance Efficiency Since both power consumption and performance are now designer’s concern, there should be some metric to characterize power-performance efficiency of a processor. To measure power and performance for a given program execution, we may use a fused metric such as the power-delay product (PDP) or the energy-delay product (EDP) [6]. In general, the PDP-based formulation is more appropriate for a low-power, portable systems, in which the battery life is the primary concern of energy efficiency. For higher end systems, e.g., PC, workstation and servers, the EDP based formulation is more appropriate since an extra delay factor ensures greater emphasis on performance. 2.3.3 Approach to Low Power Design Research on power reduction has been explored by designers at almost all levels; namely, from the bottom transistor level to the top system level. Here, we give a brief review on major techniques that are popular in low power design nowadays [70]. • Voltage scaling is a method to reduce the energy consumption by reducing the supply voltage at the price of more delay due to lower frequency. It is the most straightforward and efficient way to improve energy-delay. • Clock gating refers to activating clocks in a logic block only when there is work to be done. Every unit on the chip has a power reduction plan, and almost every Functional Unit Block (FUB) contains clock gating logic. Although it is a rather old technique, it has not yet been employed until recently when the power concern becomes a key issue. 26 • Low power multiported memories attempt to achieve high performance yet low power via adding additional ports to the memory whose storage size and number of ports grows with increasing ILP. • The primary compilation technique to generate an energy-efficient code is to reduce the cycle number needed to execute a given program. Standard compiler optimiza- tions, such as loop unrolling and software pipelining, etc., are also helpful to energy reduction since they reduce the running time of the program. 2.4 Conclusion For the background reviews presented above, it is obvious that the rapid power dissipa- tion growth is resulted from the power density increase due to more transistors in the chip dies. Those advanced technologies also improve system performance by taking the advantage of DLP, ILP and TLP. Hence, the power constraint is not only for high-end systems but also for portable computers, mobile devices and EMS due to their battery- powered features: it is a war across all computing platforms. 
Meanwhile, due to the great enhancement of the GPP performance’s, new multimedia standards are designed to aim at the high quality such as better rate-distortion (R-D) characteristics, using an exhausted-computation style. A good paradigm is the new video standard H.264. For important platforms of multimedia applications: DSP or EMS, the speed of neither the encoder nor the decoder is satisfactory. Consequently, it is important to explore how to keep H.264 good R-D quality while speedup the calculation over the limited-resource EMS. 27 Chapter 3 Run-time Power Consumption Modeling for Embedded Multimedia Applications 3.1 Introduction Historically, performance has been the primary concern in modern embedded processor design. The pursuit of higher clock rates has driven power (or energy) consumption so high that it becomes a key issue across all platforms nowadays. The situation is even more severe for battery-powered embedded multimedia systems since the increasing gap between the battery life and the energy consumption requirement in multimedia applications has imposed a great challenge on today’s embedded processor design. Power analysis has long been studied at the gate level for the purpose of VLSI de- sign. Research on instruction-level [55] and system-level [46] power models were later developed to provide a guide for application power optimization. However, most of these studies have only focused on general purpose processors (GPPs). Embedded media pro- cessors possess a specific energy dissipation pattern, which is quite different from that of GPPs due to a different micro-architecture known as the VLIW and the multimedia 28 instruction extension (SIMD). Some researchers attempted to extend the instruction-level power model to VLIW processors, e.g. [41]. In [41], the active power includes the inter- instruction-word effect, the intra-instruction-word effect and the instruction permutation inside an instruction word. However, it is difficult to perform “run-time” power and en- ergy estimation for a media processor using this approach since it is too complicated to apply. It is critical to understand the energy dissipation phenomena of embedded multimedia applications and build a run-time system-level power model for battery-powered portable devices so as to achieve power-efficient implementations. This research is conducted based on extensive experimental results to obtain the energy consumption profiling for a wide spectrum multimedia applications. The power behavior is characterized in terms of components in the VLIW architecture and the routines in multimedia standards. The contributions of this work are four folds. First, as revealed by the collected statistics, the instruction decoder unit in a VLIW processor consumes nearly 50% of the total energy in most multimedia applications. This implies that the instruction set ar- chitecture (ISA) design is the key to the power minimization of an embedded multimedia system. Second, based on this study, a run-time power model is proposed to achieve fast and accurate power estimation at the routine-level. This modeling technique can be extended to the program-level when there exists stable occupation of each routine in the entire application. It is shown that the model can predict the cumulative energy consumption of validating sequences with little error. 
Third, through an extensive study of the power performance of different issue widths, it has been realized that a solution with more function units may not be efficient in power consumption. Finally, the model is validated against a commercial VLIW processor by comparing its power behavior with the characteristics provided by the manufacturer.
The rest of this chapter is organized as follows. Previous work on power estimation and modeling is briefly reviewed in Sec. 3.2. The experimental framework, including the methodology, benchmarks and test sequences, is described in Sec. 3.3. The routine-level multimedia application model based on the profiling results is presented in Sec. 3.4. In Sec. 3.5, we demonstrate the model's capability to predict the cumulative energy consumption of validating sequences accurately by comparing theoretical and experimental data. The model is further validated using a commercial VLIW processor by examining its power dissipation characteristics in Sec. 3.6. The power-performance efficiency of the multiple-issue width is examined in Sec. 3.7. Finally, some concluding remarks are given in Sec. 3.8.
3.2 Review of Related Work
The goal of power analysis is to understand the power consumption phenomenon and minimize power consumption under the hardware constraint. Thus, this research includes two tightly coupled topics, namely, power modeling and power optimization. The former builds power dissipation models for circuits, architectures, instructions and even programs, while the latter develops algorithms to reduce power (or total energy) consumption across different layers. A brief review of previous work on power analysis is given in this section.
Power analysis techniques can be categorized into two types: the model-based and the measurement-based approaches. The measurement-based approach examines power consumption at a certain level with actual power measurements, while the model-based approach analyzes the system power performance with simulation tools. Most commercial simulators use low-level tools built upon available lower-layer information (e.g., the physical layer). Researchers have continued to explore higher level simulation tools and systems, and some success has been achieved in the past decade.
3.2.1 Cycle-Accurate Gate-Level Power Analysis
Commercial tools have been developed for power dissipation analysis at the gate/circuit level, e.g., PowerMill [52] and SPICE [36]. There are also quite a few academic architecture-level simulators available today. Most of them are based on the renowned performance simulator SimpleScalar [5]. Examples include Wattch [8], SimplePower [39] and Sim-Panalyzer [59]. With full details of the physical layer, it is possible to conduct fine-grained analysis for a specific application unit, e.g. the branch decision in multimedia standards. However, since SimpleScalar targets the superscalar architecture, in which instructions are dynamically scheduled, all power analysis results apply to the superscalar system only. They are not useful to our understanding of the power dissipation phenomena of the VLIW microarchitecture. Furthermore, the simulation process is fairly slow, which motivates the development of higher level simulation tools and systems that can hide unnecessary details yet produce accurate on-line power consumption results, as discussed below.
3.2.2 Instruction-Level Power Analysis
An instruction-level model is derived from measurement-based experiments [55], where details of lower-level models are concealed. The main goal is to provide an energy consumption analysis for a given piece of software code. The basic idea is to associate the consumed energy with each individual executed instruction. An empirical instruction-level energy model can be given as

E = Σ_i (B_i × N_i) + Σ_i Σ_j (W_{i,j} × N_{i,j}) + Σ_k S_k,   (3.1)

where E is the total energy consumed by the underlying software program, and B_i is the base energy consumed by instruction i, which is executed N_i times in the whole program. W_{i,j} reflects the energy dissipated by circuit switching between each pair of consecutive instructions (i, j), which occurs N_{i,j} times, and S_k accounts for the extra energy consumption introduced by the CPU stall caused by a cache miss and/or bank conflict at the kth instruction. The variation in the total energy comes from two aspects: the inter-instruction effect of switching circuit states and resource constraints that may lead to CPU stalls.
Although the modern microprocessor is a complex system, its complexity is hidden behind a simple interface, i.e. the ISA. The parameter B_i can be obtained by an exhaustive measurement of the entire ISA in advance. Similarly, W_{i,j} can be determined through the measurement of every possible instruction pair. Thus, both B_i and W_{i,j} can be pre-determined for a given micro-architecture and ISA. It is also worthwhile to point out that the workload of exhaustive measurement of W_{i,j} may vary significantly depending on the size of the ISA. For example, the TI TMS320C64X has an ISA of 167 instructions so that the number of possible pairs is 167 × 166 = 27,722. Intel's Itanium 2 has 331 instructions, and the pair number increases to 109,230 (= 331 × 330). Parameter S_k depends largely on the processor resources available during program execution, such as the cache and the register file. Thus, it varies from one case to another and should be determined by on-line measurements.
3.2.3 VLIW Micro-Architecture Power Analysis
The development of microprocessors leads to research on power analysis at the micro-architecture level. The target platform evolves from the general purpose processor (GPP) RISC CPU to special purpose processors such as media processors and communication processors. Since media processors adopt the VLIW microarchitecture and the single instruction multiple data (SIMD) ISA almost without exception, researchers have made a lot of effort in analyzing power consumption for VLIW processors.
VLIW processors issue a fixed number of instructions as one large instruction package containing multiple regular instructions. The compiler statically schedules the issued instructions. The instruction-level power model has been extended to the VLIW micro-architecture level. Intuitively, the measurement workload of obtaining the inter-instruction-package energy can be even heavier. Let m be the total number of available instructions in an n-issue VLIW processor. Then, the number of all possible instruction packages is m^n, and the total number of instruction-package pairs is on the order of O(m^{2n}). Due to hardware constraints (i.e., only a subset of all instructions can be issued to certain slots), the number of actual instruction packages is smaller than the estimate given above, but it is still extremely high.
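To get a feel for how quickly this characterization workload grows, the short Python sketch below evaluates the pair counts quoted above. The ISA sizes are the ones cited for the two processors; the 8-wide issue width is only an illustrative choice (it matches the issue width used later in Table 3.1), not a figure from the cited work.

# Pairwise characterization cost for a scalar ISA versus a VLIW instruction package.
def scalar_pairs(m):
    # Ordered pairs of distinct consecutive instructions, as in 167 x 166 = 27,722.
    return m * (m - 1)

def vliw_package_pairs(m, n):
    # Upper bound: m**n possible packages, hence on the order of m**(2*n) package pairs.
    return (m ** n) ** 2

print(scalar_pairs(167))            # TI TMS320C64X: 27,722 instruction pairs
print(scalar_pairs(331))            # Intel Itanium 2: 109,230 instruction pairs
print(vliw_package_pairs(167, 8))   # 8-issue VLIW upper bound: ~167^16 package pairs, clearly not measurable

Even with hardware slot restrictions pruning most of these combinations, the remaining package-pair space dwarfs the scalar case, which is why a higher-level model is needed.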
A comprehensive instruction-level energy model, extended from Eq. (3.1), for an N-instruction-word program executed on a VLIW-based embedded processor was presented in [41] and is given below:

E(W) = Σ_{n∈N} E(w_n | w_{n-1}) + E_c + c_0,   (3.2)

where E(w_n | w_{n-1}) is the energy dissipated by the data path when executing instruction word w_n with w_{n-1} preceding it, E_c is the total energy consumed by the control unit, and c_0 is a constant associated with the energy to initialize the processor. E(w_n | w_{n-1}) in Eq. (3.2) can be further decomposed as

E(w_n | w_{n-1}) ≈ Σ_s A_s(w_n | w_{n-1}) + I(w_n | w_{n-1}),

where A_s(w_n | w_{n-1}) is the average energy consumed in pipeline stage s when executing instruction w_n after instruction w_{n-1}, and I(w_n | w_{n-1}) is the energy consumed by the connections between pipeline stages (i.e. the inter-stage connections). Note that A_s(w_n | w_{n-1}) and I(w_n | w_{n-1}) consist of several lower level power components [41]. To validate this power/energy model, a VLIW processor with a clock rate of 100 MHz described in VHDL was constructed, and PowerMill was used to verify the derived theoretical values.
A VLIW-based simulator was developed by Ascia et al. [4], [12] for power/energy performance simulation. Their work is similar to that in [8] in dividing the complex CPU into several structured components, including function units, register files, the memory hierarchy, buses, etc. The main drawback of this work is that architecture-level simulation is inherently slow.
3.2.4 Macro-Modeling Energy Estimation
A high level energy estimation method was proposed using the characterization-based macro-modeling approach [53]. The characteristics of interest are first retrieved from the software function via data analysis or profiling results. Then, the power and energy data are collected by means of a low-level energy simulation framework under different circumstances. Under this philosophy, the overall energy cost can be represented by a linear regression model with respect to the n coefficients [c_1, c_2, ..., c_n]:

Ê = Σ_{j=1}^{n} c_j P_j,

where P_j is the jth macro-modeling parameter used to indicate the software complexity or profiling characteristics, and c_j is the corresponding coefficient that depends on the processor configuration. Two methods were used to extract macro-modeling parameters in [53]: algorithm complexity/data analysis and trace-based block correlation. The former requires a delicate investigation and a deep understanding of the algorithm while the latter demands off-line trace analysis with a complete control flow graph (CFG). Thus, similar to instruction-level power modeling, the macro-modeling method is too complicated to provide on-line power estimation due to its intensive computation and its difficulty with large program suites.
3.2.5 Run-Time System-Level Power Analysis
As compared with the instruction-level power model, system-level simulation tools and models aim at a flexible treatment of different low-level components and programs. Most research at this level cannot be applied at the design stage but only to available processors. A built-in source meter with a variable power supply was used to measure the current with the help of a subroutine [46], where the average current value was probed from the inputs of a test-bed of a StrongARM processor and a Hitachi SH4 processor under different addressing modes. Consequently, first-order and second-order models were constructed based on these measurements.
3.3 Methodology for Power Analysis
To minimize the power/energy consumption of multimedia application routines in embedded systems, it is important to explore a system-level power model. In this section, we first describe the experimental environment used to compile and execute the benchmarks, such as the simulator and its configuration. Then, several multimedia benchmarks and their test sequences will be presented.
3.3.1 Simulation Environment
Fig. 3.1 illustrates the global data flow for the power estimation model at the routine level for a given microprocessor core. The experiments were carried out over a cycle-accurate, instruction-level EPIC (explicitly parallel instruction computing)/VLIW simulator called Trimaran [18]. It is a parameterized processor architecture, providing a vehicle to explore the instruction level parallelism (ILP) in compilation.
Figure 3.1: Global data flow for the routine-level power estimation. The application program and test data are compiled and executed on the Trimaran profiling/tracing simulator; the resulting execution information, together with the target processor configuration, is fed to the EPIC-explorer power/energy simulator, whose power data are used to build the routine-level power/energy estimation model.
The EPIC-explorer [4] calculates the energy consumption value using the profiling result produced by Trimaran, and behaves similarly to other low-level power simulators such as Wattch [8]. By dividing the processor architecture into a set of function units (FUs), the EPIC-explorer captures the power and energy characteristics using an adapted Cai-Lim model [9]. Once the run-time power model is constructed, the time-consuming low-layer power simulator can be skipped to acquire fast yet accurate power and energy estimates from the performance profiling data alone.
Table 3.1 gives the target configuration of Trimaran used in our experiments. This basic configuration is adopted based on the observation that most media applications require a relatively small cache size (especially for the L1D) to obtain a high hit rate. The level 1 data cache (L1D) miss rate is below 10% for most benchmarks; for one of the speech codecs it is even of the order of 10^-5 ∼ 10^-4. On the other hand, under this configuration, the level 1 instruction cache (L1I) hit rate is below 90%.
Table 3.1: Target processor configuration
Processor core parameters:
  Frequency: 200 MHz; Technology: 0.25 μm; Voltage: 1.3 V
  Fetch/Issue/Retire width: 8/8/8; Instruction window size: 128
  Integer FUs: 4; Integer FU latencies (add/cmpp/multi/div): 1/1/3/8
  Float FUs: 2; Float FU latencies (add/cmpp/multi/div): 3/1/3/8
  Memory FU: 1; Branch unit/penalty: 1/7
  General registers: 64; Floating-point registers: 64
Memory hierarchy:
  L1 cache size: L1D 16 KB, L1I 16 KB
  L1D hit/miss latency: 1/2; L1I hit/miss latency: 1/2
  L2U cache size: 32 KB; L2U hit/miss latency: 2/7
  Memory: 7/35
Table 3.2: Benchmarks (codec, profiling data, validation data)
  Image compression: EPWIC - grayscale Lenna and House (512 × 512) for profiling, grayscale Baboon and Elaine (512 × 512) for validation; JPEG - color Lenna and House (512 × 512) for profiling, color Baboon and Elaine (512 × 512) for validation.
  Video compression: MPEG-2, MPEG-4 and H.264/AVC - QCIF Akiyo, Foreman and Mobile & calendar (10, 25, 50, 100 frames) for profiling, CIF Akiyo, Foreman and Mobile & calendar for validation.
  Speech & audio compression: AMR - DT1, DT2, DT3, DT4 for profiling, DTX1, DTX2, DTX3, DTX4 for validation.
3.3.2 Benchmarks
Our choice of benchmarks is to model the next generation multimedia applications (such as speech, audio, digital still image and video) in embedded processors.
Although Mediabench [33] provides a suite of media benchmarks, some of them have become obsolete due to the fast development of multimedia algorithms. For example, the most advanced video compression standard has evolved from MPEG-2, which is included in Mediabench's suite, to today's MPEG-4 and H.264. The latter two represent the future video compression trend since they offer a higher coding gain than MPEG-2. We consider a total of six sets of multimedia benchmark codecs, covering both high and low bit rate applications. A brief review of these applications is given below.
The adaptive multi-rate (AMR) speech codec is widely used in GPRS and 3G cellular communication systems. The embedded predictive wavelet image coder (EPWIC) is a grayscale image compression algorithm adopting the wavelet pyramid transform in its first stage. Its predecessor, known as embedded pyramid image coding, is included in Mediabench's suite. JPEG is a technology widely used in digital cameras for color image compression. MPEG-2 is the most popular video format used in DVD systems nowadays. The MPEG-4 standard was finalized in 1999 with important features to support higher-level interactions between users and contents, controlled by content developers. It was further extended to low bandwidth networks to support mobile applications. As an advanced multimedia coding standard jointly developed by MPEG and ITU, H.264/AVC provides better video quality and a higher compression ratio as compared to other MPEG standards. Several new coding techniques, such as the deblocking filter, fractional pixel interpolation and context-based adaptive binary arithmetic coding (CABAC), have been incorporated to enhance coding efficiency and video quality.
The execution of a multimedia application routine depends largely on its algorithmic complexity. The input data parameters, such as the frame size (width and height), the number of frames and the correlation of subsequent frames, also have an impact on the program's execution. To conduct the experiments and verify our conclusions, the test data are divided into two groups, i.e., profiling and validation. The profiling group is used to generate the data to construct the power and energy model. The validation group is employed to check the accuracy of the proposed models. For example, the 512 × 512 Lenna and House images are the profiling data, and the Baboon and Elaine images of the same size are used for validation; grayscale images are used for EPWIC and color images for JPEG. For the video compression algorithms, sequences of 4 different lengths, consisting of 10, 25, 50 and 100 frames, are tested. The profiling and validation sequences for MPEG-2, MPEG-4 and H.264/AVC are the same, namely Akiyo, Foreman and Mobile & calendar; the QCIF versions are used for training while the CIF versions are used for validation. The test data for AMR consist of 8 INP files from [1], where 4 are used for training and 4 for validation. Please refer to Table 3.2 for more details.
3.4 Observations on Routine-Level Power/Energy Consumption
The behavior of modern multimedia programs is rather unpredictable due to heterogeneous source data. Generally speaking, compression algorithms try to eliminate all possible redundancies inside the original data. For example, an image codec consists of three main blocks, the DCT/IDCT, quantization/dequantization and the entropy coder/decoder, to remove the spatial redundancy and the probability redundancy.
Besides these fundamental modules, the block-based motion search and compensation technique is used in video codecs to remove the temporal redundancy. Each module has different implementation details. Furthermore, video coding standards evolve over time, which makes the same function module differ across standards. For instance, the floating-point 8x8 DCT in MPEG-2 is much more complex than the newly adopted 4x4 integer DCT in H.264/AVC. Thus, the program performance depends largely on the coding algorithm employed, the platform and the implementation details. For this reason, we choose the routine, i.e. the software function, as the basic observation unit. In this section, we study the execution characteristics and the power/energy consumption behavior of various multimedia applications.
Figure 3.2: The instruction number and energy consumption of functional units in several typical multimedia application routines: (a) JPEG decoder, (b) JPEG encoder, (c) MPEG-2 decoder, (d) MPEG-2 encoder, (e) AMR decoder and (f) AMR encoder. Each panel shows the per-unit instruction counts (branch, CMPP, PBR, load, store, IALU and, for MPEG-2, FALU) as bars and the total energy consumption (J) as a line.
3.4.1 Power/Energy Profiling and Analysis
To understand the energy consumption characteristics of multimedia application routines, we provide the profiling data of the instruction number and the energy consumption of each benchmark on a VLIW architecture in Fig. 3.2, where results for the JPEG, MPEG-2 and AMR codecs are given as examples. In each subfigure, the bars represent the total number of executed instructions while the line denotes the total consumed energy. The parameter along the x-axis corresponds to a separate program execution: the x-axis parameters denote the quality of compressed images for the JPEG codec, the frame number of test sequences for the MPEG-2 codec and the four test sequences DT1∼DT4 for the AMR codec, respectively. As shown in Fig. 3.2, the total instruction number and the consumed energy show the same increasing trend as the complexity and/or the requirement of the test sequences increases.
Moreover, it is observed that the instructions executed by each functional unit possesses a similar percentage throughout all test sequences. Thus, we may conclude that the energy consumption of every functional unit keeps a stable proportion of the total dissipated energy. By breaking down the total energy consumption of the MPEG-2 decoder, we show the contribution of each VLIW architecture component to the total energy consumption in Fig. 3.3(a). The datapath and pipeline module, which is the core of a processor, take charge of instruction decoding, issue and execution. It consumes around 65% of the total energy. Due to the frequently wire switch in each tick [21], the clock system is responsible for 23% of the total energy cost. The memory hierarchy, including the contribution of L1, L2 and main memory, only consumes about 12% of the total energy. A more detailed breakdown for the datapath and pipeline module is shown in Fig. 3.3(b). Based on these experimental data, we have the following observations and design guidelines. 43 • The active energy consumption of the instruction decode unit plays an important role in the total energy consumption due to the continuous operation of this unit. The percentage of this unit over the entire program is close to 50%. It points out a direction for energy optimization, i.e. instruction decoding is the key area to save energy consumption in the multimedia applications. Usually, a better ISA design that saves the actually executed instruction counts can reduce the energy consumption of instruction decoding to a large extent. • Unlike superscalar processors, the static energy consumption is relatively higher in VLIW processors. Thus, we need to work on static power as well to achieve power optimization, which is almost neglected for the superscalar architecture. • For expensive and rarely used units such as the floating-point and branch units, the static power is even higher than the active one so that the clock-gating technique [56] is needed. Fig. 3.2 alone is sufficient to give a complete picture of the power/energy consumption characteristics. Now, let us focus on the average power as defined by Average Power = energy execution time The average power is shown by bars in Fig. 3.4 while the lines in Fig. 3.4 give the standard deviation of the power of each routine normalized by the average power of the whole program. The x-axis parameters have the same meanings as those in Fig. 3.2. More interesting information is revealed in Fig. 3.4. We see that the average power fluctuates at 44 10 20 30 40 50 60 70 80 90 100 2 2.05 2.1 2.15 2.2 2.25 2.3 Quality Ave Power (w) Ave. Power 18.5 19 19.5 20 20.5 Std dev (%) Std dev (%) (a) JPEG decoder 10 20 30 40 50 60 70 80 90 100 2.2 2.25 2.3 2.35 2.4 2.45 2.5 2.55 2.6 Quality Ave Power (w) Ave. Power 19 19.5 20 20.5 21 Std dev (%) Std dev (%) (b) JPEG encoder 10 25 50 100 2.59 2.595 2.6 2.605 Frames Ave Power (w) Ave. Power 12.8 13 13.2 13.4 13.6 Std dev (%) Std dev (%) (c) MPEG-2 decoder 10 25 50 100 2.1 2.11 2.12 2.13 2.14 2.15 Frames Ave Power (w) Ave. Power 25.91 25.92 25.93 25.94 25.95 25.96 25.97 25.98 25.99 Std dev (%) Std dev (%) (d) MPEG-2 encoder TD1 TD2 TD3 TD4 1.79 1.7902 1.7904 1.7906 1.7908 1.791 1.7912 1.7914 1.7916 1.7918 1.792 Ave Power (w) Ave. Power 18.25 18.275 18.3 18.318 Std dev (%) Std dev (%) (e) AMR decoder TD1 TD2 TD3 TD4 1.7725 1.773 1.7735 1.774 1.7745 1.775 1.7755 1.776 Ave Power (w) Ave. 
Power 27.7 27.75 27.8 27.85 27.9 Std dev (%) Std dev (%) (f) AMR encoder Figure 3.4: The average power consumption and its standard deviation. a different degree for a different application routine. For MPEG-2, the encoder’s average power variation is confined to a very small range (i.e. ±0.005), which is independent of the frame length of the input sequence. However, the standard deviation of the power consumption varies significantly in almost all application routines. Take the MPEG-2 encoder as an example, the value of the standard deviation is always above 25%. Fig. 3.5 shows the routine-level energy distribution for JPEG, MPEG-2 and AMR codecs. Due to the space limit, we show a couple of typical cases, including JPEG of image quality 10 and 100, MPEG-2 of frame length 10 and 100 and AMR for test sequences DT1 and DT4. Each routine is indexed by the actual memory location. Fig. 3.5 can explain the difference of the average power and its standard deviation between image and video codecs. Take the MPEG-2 encoder as example, the energy consumption varies a lot from one routine to the other, which explains the large standard deviation in power. However, 45 10 20 30 40 50 60 70 80 90 0 2 4 6 8 10 12 14 16 18 Unique routine serial No. % in total energy consumption (a) JPEG encoder quality = 10 10 20 30 40 50 60 70 80 90 0 5 10 15 20 25 30 35 Unique routine serial No. % in total energy consumption (b) JPEG encoder quality = 100 10 20 30 40 50 60 70 80 90 100 0 5 10 15 20 25 30 35 40 Unique routine serial No. % in total energy consumption (c) JPEG decoder quality = 10 10 20 30 40 50 60 70 80 90 100 0 5 10 15 20 25 30 35 40 Unique routine serial No. % in total energy consumption (d) JPEG decoder quality = 100 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 Unique routine serial No. % in total energy consumption (e) MPEG-2 encoder 10 frames 10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 Unique routine serial No. % in total energy consumption (f) MPEG-2 encoder 100 frames 10 20 30 40 50 60 0 10 20 30 40 50 60 Unique routine serial No. % in total energy consumption (g) MPEG-2 decoder 10 frames 10 20 30 40 50 60 0 10 20 30 40 50 60 Unique routine serial No. % in total energy consumption (h) MPEG-2 decoder 100 frames 50 100 150 200 0 2 4 6 8 10 12 14 16 18 Unique routine serial No. % in total energy consumption (i) AMR encoder DT1 50 100 150 200 0 2 4 6 8 10 12 14 16 18 Unique routine serial No. % in total energy consumption (j) AMR encoder DT4 20 40 60 80 100 120 140 0 2 4 6 8 10 12 Unique routine serial No. % in total energy consumption (k) AMR decoder DT1 20 40 60 80 100 120 140 0 2 4 6 8 10 12 Unique routine serial No. % in total energy consumption (l) AMR decoder DT4 Figure 3.5: Energy distribution for several test routines. 46 the power percentage for each routine keeps rather steady at the two cases of encoding 10 frames and 100 frames. So are the cases of 25 and 50 frames. It explains why the average power is nearly consistent for the MPEG-2 encoder. However, for the JPEG codec, the power percentage of each routine varies more greatly as the testing requirement changes so that the average power of JPEG codec changes to a larger extent as compared with MPEG-2. Note that the JPEG codec program is much smaller than the MPEG codec program. Thus, the change brought by the quality requirement can have a larger impact on the overall power consumption pattern. The observations given above show that the overall power and energy behavior vary from one multimedia application routine to another. 
An average power consumption model across various routines will lead to significant errors in power consumption prediction. Thus, an application-routine specific power model is needed.
3.4.2 Power-IPC Relationship
We may understand the processor performance by monitoring the busy status of individual microprocessor function units. Naturally, we should observe the performance events that are highly correlated with power. Fig. 3.6 shows the correlation coefficients of power dissipation with performance events including the busy status of the instruction decode unit, the memory unit, the integer arithmetic logic unit (IALU), the floating-point arithmetic logic unit (FALU) and the branch unit.
Figure 3.6: Power correlation for performance events (decode, memory, integer, float and branch units) for the encoders and decoders of the H.264, MPEG-4, MPEG-2, JPEG, EPWIC and AMR benchmarks.
We see from Fig. 3.6 that multimedia applications possess a strong correlation between the power and the instruction decode unit. This can be explained by the heavy workload of the instruction decode unit due to the multiple issue width of a VLIW-structured processor. Furthermore, the instruction decode unit consumes a large percentage of the energy. Accordingly, the instructions per cycle (IPC), which indicates the busy status of the instruction decode unit, reflects the dynamic power dissipation. Meanwhile, since most multimedia applications are integer-computation intensive, frequent usage of the IALU explains its second highest correlation with dynamic power. Recurrent data retrieval from the cache and writing back to the cache accounts for an average 50% correlation for the memory unit. The highly regular loop pattern of multimedia applications and the rare FALU occupancy explain the low dependency of dynamic power on the branch unit and the FALU.
Most of today's high-performance processors attempt to take full advantage of program parallelism. Multiple-instruction issue is a popular design to exploit instruction level parallelism (ILP). The instructions per cycle (IPC) then provides a good metric of processor activity: the higher the IPC value, the busier the processor, and consequently the higher the dynamic power dissipation. This strong dependency is confirmed in Fig. 3.6. Moreover, the execution of most routines is so short that the dynamic power within such a short duration can be treated as a constant with a small error. In other words, we can relate the average dynamic power of the ith routine to the IPC directly by
P_i = α_i + β_i × IPC_i,   (3.3)
where i is the routine index, and α_i and β_i are constants depending on the specific routine.
Table 3.3: Power consumption models (P_i = α_i + β_i × IPC_i) for the most energy-consuming routines in the H.264/AVC encoder.
  Routine name: α_i, β_i, energy consumption percentage (%), notes
  sad wxh: 1.0233, 0.7205, 26.5, motion search
  sub 4x4 pixel: 1.0271, 0.7178, 16.9, motion search
  sad 8x8: 1.0609, 0.6949, 6.53, motion search of mode 8x8
  sad 16x16: 1.0624, 0.6939, 6.39, motion search of mode 16x16
  pixel avg mc: 1.0214, 0.7218, 5.92, fractional pixel interpolation
  sub4x4 dct: 1.0063, 0.7326, 2.76, integer 4x4 DCT
  add4x4 idct: 1.0180, 0.7242, 2.37, integer 4x4 IDCT
  sad 8x16: 1.0620, 0.6942, 2.25, motion search of mode 8x16
  sad 16x8: 1.0618, 0.6943, 2.18, motion search of mode 16x8
  quant 4x4: 1.0615, 0.6945, 2.08, quantization
In (3.3), α_i is a parameter used to describe the static power expense while β_i is a parameter related to the fraction of the processor involved in the execution of a routine. Parameters α and β depend on the sophisticated interaction of the function units, which explains why their values differ from one routine to another. Table 3.3 shows the top 10 energy-consuming routines of the H.264/AVC encoder and their associated α and β values.
As revealed by Fig. 3.7, α and β for each routine are not identical because of the instruction type distribution and the comprehensive interaction among function units. However, they are distributed in a small range of [0.9, 1.05]. The small variation can be explained by the static-scheduling property of the VLIW processor and the high regularity of multimedia applications, which make the execution pattern on the VLIW processor less affected by the input data than on a dynamically scheduled processor.
Figure 3.7: The α and β values for the H.264 encoder, plotted against the routine index.
For selected benchmarks, i.e. the JPEG, H.264/AVC and AMR codecs, the average power and the average IPC of each routine are measured and plotted as small rectangles in Fig. 3.8. Figs. 3.8(a) and 3.8(b) show the power-IPC plot for JPEG with different quality factors. Figs. 3.8(c) and 3.8(d) show the power-IPC plot for H.264/AVC with training sequences of different frame lengths. Finally, Figs. 3.8(e) and 3.8(f) give the power-IPC plot for AMR. Fig. 3.8 verifies that the variations of α and β are small so that they can be approximated by the same values. Thus, for a rougher assessment of the power and energy, α and β can be treated as application-specific. In other words, we can relate the average dynamic power to the IPC directly by
P_i = k_0 + k_1 × IPC_i,   (3.4)
where k_0 and k_1 are constants depending on a specific application. Similarly to (3.3), k_0 is a parameter used to describe the static power consumption while k_1 is a parameter related to the fraction of the processor involved in the execution of an instruction. Since k_0 and k_1 in (3.4) are application-specific while the IPC is routine-dependent, this approach is called the simplified routine-level estimation (SRLS).
Figure 3.8: The average power-IPC relationship for several test benchmarks: (a) JPEG encoder, (b) JPEG decoder, (c) H.264/AVC encoder, (d) H.264/AVC decoder, (e) AMR encoder and (f) AMR decoder. Each panel plots the measured (IPC, power) points together with a linear fitting curve.
As discussed in Sec. 3.4.3, the average IPC of each routine can deviate a lot from the average IPC of the whole program. Furthermore, the execution time of each routine also varies. As a result, the energy consumption percentage of each routine with respect to the entire program may differ. If a constant average power value were used throughout the execution of the whole program, the modeling error could be large. Thus, for accuracy, it is better to consider the routine-level average power.
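To illustrate how (3.3) is applied, take the sad wxh routine from Table 3.3. Only the coefficients below come from our profiling; the IPC value is purely hypothetical, since the per-routine IPC readings are not listed in the table:
P_sad wxh = α + β × IPC = 1.0233 + 0.7205 × 3.0 ≈ 3.18 W for an assumed IPC of 3.0.
Multiplying such a per-routine power estimate by the routine's execution time gives its energy contribution; the routine-level model of Sec. 3.5 accumulates the total energy exactly this way.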
Nevertheless, considering the memory requirement of holding α_i and β_i for a large number of routines, the simplified routine-level estimation expressed by (3.4) is more applicable. Based on the profiling results for every benchmark, the values of k_0 and k_1 are computed and given in Table 3.4.
Table 3.4: Power consumption models (P = k_0 + k_1 × IPC) for several multimedia benchmarks.
  EPWIC encoder: k_0 = .8399, k_1 = .8285;  EPWIC decoder: k_0 = .9472, k_1 = .7176
  JPEG encoder: k_0 = .8418, k_1 = .817;  JPEG decoder: k_0 = .8996, k_1 = .7536
  MPEG-2 encoder: k_0 = .8909, k_1 = .6476;  MPEG-2 decoder: k_0 = .8643, k_1 = .7588
  MPEG-4 encoder: k_0 = .9538, k_1 = .7528;  MPEG-4 decoder: k_0 = .9709, k_1 = .7336
  H.264 encoder: k_0 = .9924, k_1 = .7429;  H.264 decoder: k_0 = .9599, k_1 = .6428
  AMR encoder: k_0 = .9949, k_1 = .7529;  AMR decoder: k_0 = .9149, k_1 = .7457
3.4.3 Impact of Applied Voltage and Operating Frequency
For commercial DSP processors, the empirical power consumption model is usually expressed as an affine function of IPC, similar to (3.4). However, the affine model parameters vary when the applied voltage and the operating frequency change. (One empirical model will be examined in detail in Sec. 3.6.2.) In this section, we generalize (3.4) further so that it takes not only the application routine but also the applied voltage and the operating frequency into account. For simplicity, the application-specific parameters k_0 and k_1 (instead of the routine-based parameters α_i and β_i) are employed in the later discussion.
To study the impact of the applied voltage and the operating frequency, let us rewrite Eq. (3.4) as
P = k_0 (1 + γ × IPC),   γ = k_1 / k_0.   (3.5)
Parameter k_0 depends on implementation details such as the physical layout and the manufacturing technology, e.g., the number of transistors per die. Thus, we focus more on the accuracy of the model parameter γ. That is, if the γ obtained from the power model is similar to the one specified for the DSP processor, the proposed power model is useful. In the previous subsection, we examined the dependence of γ on a specific application routine. Here, without loss of generality, we examine the characteristics of γ(f, v) as a function of the applied voltage v and the operating frequency f for one specific application, namely the MPEG-2 decoder. The same methodology can be applied to other applications, too.
First, we consider the case where v is kept constant while f varies. The memory hierarchy affects the total energy consumption to a large extent, as shown in Fig. 3.3(a). In modern DSPs, as a result of the progress of manufacturing technology, both the level 1 and level 2 caches can be integrated with the CPU on one die. Thus, the CPU frequency variation primarily affects the off-chip memory access time, i.e., the L2U miss penalty. Let us take the adopted simulator as an example. The original frequency is 200 MHz and the L2U miss penalty is 35 clock cycles (or 1.75 × 10^-7 sec). As the frequency gets higher, say 250 MHz, the processor clock cycle changes from the original 5 × 10^-9 to 4 × 10^-9 sec. Then, the L2U miss penalty becomes 44 cycles since the databus width between the on-chip components and the off-chip memory is unaltered. As the CPU becomes faster, more clock cycles are required to access the off-chip memory, and vice versa.
Table 3.5: Parameter γ as a function of f and v with small-scale variation.
  v = 1.3 V:  167 MHz, L2 miss penalty 29: k_0 = .7310, k_1 = .6336, γ = .8667;  200 MHz, penalty 35: k_0 = .8643, k_1 = .7588, γ = .8779;  240 MHz, penalty 42: k_0 = 1.0238, k_1 = .9105, γ = .8893
  v = 1.35 V:  167 MHz, penalty 29: k_0 = .8685, k_1 = .6520, γ = .7207;  200 MHz, penalty 35: k_0 = 1.0728, k_1 = .7810, γ = .7278;  240 MHz, penalty 42: k_0 = 1.2835, k_1 = .9371, γ = .7301
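The penalty column of Table 3.5 can be checked with the same arithmetic: the absolute off-chip access time is fixed at 35 × 5 ns = 175 ns, so at 167 MHz (a 6.0 ns cycle) it corresponds to 175/6.0 ≈ 29 cycles and at 240 MHz (a 4.17 ns cycle) to 175/4.17 ≈ 42 cycles, matching the table entries.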
The miss penalty thus changes accordingly as the frequency varies. On the other hand, the processor core (the on-chip components except the L1 and L2 caches, as shown in Table 3.1) consumes the same total amount of energy, while the average power is higher since the frequency increases and the program's execution time becomes shorter. To summarize, a change of the operating frequency affects the energy consumed by the memory hierarchy and the system average power while it leaves the processor core energy intact.
Next, we examine the effect of the applied voltage v. Generally speaking, the power is proportional to the square of the voltage. Thus, reducing the voltage would lead to a quadratic reduction in power. However, lowering the voltage also prolongs the execution time as a consequence of the slower processing speed. Thus, the compound effect is much more complex.
In Tables 3.5 and 3.6, we show the L2 miss penalty and the values of parameters k_0, k_1 and γ as a function of the applied voltage v and the operating frequency f. We examine a smaller variation (fine-scale) and a larger variation (coarse-scale) in Tables 3.5 and 3.6, respectively.
Table 3.6: Parameter γ as a function of f and v with large-scale variation.
  v = 1.3 V, 200 MHz, L2 miss penalty 35: k_0 = .8643, k_1 = .7588, γ = .8779
  v = 1.22 V, 175 MHz, penalty 31: k_0 = .8280, k_1 = .6100, γ = .7367
  v = 1.1 V, 150 MHz, penalty 27: k_0 = .6952, k_1 = .3824, γ = .5501
  v = 1.07 V, 125 MHz, penalty 22: k_0 = .6500, k_1 = .3272, γ = .5188
  v = 1.0 V, 100 MHz, penalty 18: k_0 = .4717, k_1 = .2119, γ = .4492
  v = 0.93 V, 75 MHz, penalty 13: k_0 = .3312, k_1 = .1255, γ = .3850
  v = 0.85 V, 50 MHz, penalty 9: k_0 = .2413, k_1 = .0775, γ = .3212
For the fine-scale variation, f is kept constant while v takes slightly different values, say 1.3 V and 1.35 V, as shown in Table 3.5 (e.g., the TI C64x series can work at the same frequency with different supply voltages). We see that the L2 miss penalty does not change, while both the static power (which is proportional to k_0) and the active power (which is proportional to k_1) become larger as v increases. However, the ratio γ decreases, which means that the static power increases faster than the active power. For the coarse-scale variation, f is proportional to v. Then, the power change should roughly reflect the quadratic influence of v plus some other considerations. We will examine this issue in depth in our future work.
3.5 Routine-Level Power/Energy Consumption Model
The routine-level power/energy consumption models for multimedia applications are given in this section. Our objective is to provide a simple yet accurate method to produce a run-time estimate of power consumption. The total energy consumed by a program can be calculated by
Energy = ∫_0^{ET} Power(t) dt,   (3.6)
where Power(t) represents the dynamic power value and ET is the duration of the program execution. However, the dynamic power value is practically unavailable due to the limited sensitivity of the measuring instrument; an ammeter, for example, cannot follow the fast fluctuation of the current. Thus, it is difficult to provide accurate data for the integration given in Eq. (3.6). To overcome this difficulty, the execution of the whole program can be decomposed into the execution of routines.
Consequently, Eq. (3.6) can be written as
Energy = Σ_i ∫_{T_i}^{T_i + ET_i} P_i(t) dt,   T_0 = 0,  T_{i+1} = T_i + ET_i,   (3.7)
where P_i(t) is the dynamic power of routine i during its execution time ET_i. Since today's high performance processors can execute most routines very fast (usually on the order of milliseconds), it is reasonable to replace the dynamic power P_i(t) by the average power of each routine as
P_i(t) ≈ AP_i ≈ α_i + β_i × IPC_i,   (3.8)
where AP_i denotes the average power during the execution time from T_i to T_i + ET_i, and the second approximation is due to Eq. (3.3). As a result, Eq. (3.7) is converted to a discrete-time form and the energy consumption of the whole program can be calculated by summing up each routine's energy consumption. Mathematically, Eq. (3.7) can be re-written as
Energy = Σ_{i=0}^{N} Energy_i = Σ_{i=0}^{N} (AP_i × ET_i),   (3.9)
where Energy_i is the energy consumed by the ith routine. Similarly to (3.8), the simplified routine-level power estimate for the ith routine can be written as
P_i(t) ≈ AP_i ≈ k_0 + k_1 × IPC_i.   (3.10)
Then, the overall system energy consumption can be calculated via (3.9).
3.6 Model Validation
The validation step is crucial since a simple estimation model is acceptable only when it is reasonably accurate. Comparing the low-level capacitance values would be the most direct and accurate means to validate a power model. However, we have no access to the details of any industrial hardware design and its capacitance values. In this work, we consider two approaches to validate the proposed power consumption model, which are discussed in detail below.
3.6.1 Validation via Trimaran Simulation
The first validation is conducted using the Trimaran simulator. We use the validation data listed in Table 3.2, which are different from the profiling data. The estimated energy is computed with the proposed energy model as given by Eqs. (3.8), (3.9) and (3.10), where α_i, β_i, k_0 and k_1 are calculated by the least-squares solution; parameters k_0 and k_1 are given in Table 3.4. By comparing the estimates against the power and energy obtained from the low-level power simulator, we can compute the estimation error accordingly. The standard deviations over the various validation cases for each multimedia benchmark are shown in Figs. 3.9 and 3.10.
Figure 3.9: The standard deviations of the power estimation errors with validation data for various codec benchmarks: (a) routine-level algorithm, (b) hybrid system-routine level algorithm.
Figs. 3.9 and 3.10 indicate that the simplified method always yields a larger estimation error in both power and energy as compared with the routine-level assessment scheme. The reason is obvious, since the routine-level method can capture the characteristics of the behavior more accurately than the simplified method. Nevertheless, the error of the simplified scheme is not larger than 5%, which is accurate enough for most power estimation applications. Parameters k_0 and k_1 of the simplified scheme are used for comparison in the following analysis.
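To make the estimation procedure of Eqs. (3.9) and (3.10) concrete, the minimal Python sketch below shows how the least-squares fit and the routine-level energy summation could be organized. The function names and all numeric samples are illustrative placeholders; in our experiments the (IPC, power, execution time) values come from the Trimaran/EPIC-explorer profiling runs, and the fitted coefficients would correspond to the entries of Table 3.4.

# Each profiling sample: (IPC_i, measured average power AP_i in watts) -- illustrative values
profiling_samples = [(1.2, 1.85), (2.4, 2.71), (3.1, 3.22), (0.8, 1.52)]

def fit_k0_k1(samples):
    # Ordinary least squares for P = k0 + k1 * IPC (the SRLS model of Eq. 3.10).
    n = len(samples)
    sx = sum(ipc for ipc, _ in samples)
    sy = sum(p for _, p in samples)
    sxx = sum(ipc * ipc for ipc, _ in samples)
    sxy = sum(ipc * p for ipc, p in samples)
    k1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    k0 = (sy - k1 * sx) / n
    return k0, k1

def estimate_energy(routine_trace, k0, k1):
    # Eq. (3.9): total energy = sum over routines of (average power) x (execution time).
    # routine_trace: list of (IPC_i, ET_i in seconds) pairs from the performance profiler.
    return sum((k0 + k1 * ipc) * et for ipc, et in routine_trace)

k0, k1 = fit_k0_k1(profiling_samples)
validation_trace = [(2.0, 0.004), (1.1, 0.010), (2.8, 0.002)]  # (IPC, execution time) per routine, illustrative
print("k0=%.3f  k1=%.3f  estimated energy=%.4f J" % (k0, k1, estimate_energy(validation_trace, k0, k1)))

The same fit applied separately to each routine's samples (one (α_i, β_i) pair per routine) yields the finer-grained model of Eq. (3.8).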
Figure 3.10: The standard deviations of the energy estimation errors with validation data for various codec benchmarks: (a) routine-level algorithm, (b) hybrid system-routine level algorithm.
3.6.2 Validation via Commercial Media Processor
The second approach is to validate the power consumption model against a commercial media processor. Here, the TI C6416 is chosen as the platform to validate our proposed model. As one of the newest members of the TI DSP family, the C64x series [22] can achieve a higher clock rate and increase the CPU throughput with two duplicate data paths and register files (containing 32 registers each). Besides the VLIW structure, it also supports SIMD to enhance the media processing data throughput by the use of data level parallelism (DLP). A power dissipation model provided by TI [23] is given by
Power = Baseline Power + Activity Power = a(f, v, t) + b(f, v) × α,   (3.11)
where f is the clock frequency, v is the supply voltage, t is the environment temperature, and the utilization rate α reflects the activity level of the system. In Eq. (3.11), the power dissipation is decomposed into two parts: the baseline power and the activity power. The baseline power, which is a function of v, f and t, includes the PLL power and the clock tree power [23]. The baseline power corresponds to the static power dissipation of a processor. The activity power is related to the active energy consumption of the processor, which is also a function of v, f and α. The voltage and frequency can be measured easily, while the utilization rate α is often obtained by empirical measurements or from statistics of architectural performance simulators. The relationship between the power and the utilization rate is depicted in Fig. 3.11. The utilization rate can be estimated from the total instruction count, the CPU cycles (execution time) and the full issue width by
Utilization Rate = (total no. of instructions) / (CPU cycles × Full Issue) = IPC / Full Issue.   (3.12)
Thus, we have
Power = a(f, v, t) + b(f, v) × IPC / Full Issue,   (3.13)
which is the power model provided by TI for the C64x series DSP. By comparing Eqs. (3.13) and (3.4), we can obtain a one-to-one correspondence between the power model provided by TI and our power model through the following association:
k_0 = a(f, v, t),   k_1 = b(f, v) / Full Issue.   (3.14)
The full issue width of the TI C6416 is 8, which means that it can issue and execute up to 8 instructions per cycle. By substituting the corresponding data into Eq. (3.13), the power formulas for the TI C6416 working at 500 ∼ 720 MHz with a 1.2 V or 1.4 V voltage supply become
Power (W) = .048 + (.035 × IPC) at 500 MHz, 1.2 V
          = .069 + (.04875 × IPC) at 500 MHz, 1.4 V
          = .057 + (.0425 × IPC) at 600 MHz, 1.2 V
          = .083 + (.0585 × IPC) at 600 MHz, 1.4 V
          = .068 + (.051125 × IPC) at 720 MHz, 1.2 V
          = .099 + (.07025 × IPC) at 720 MHz, 1.4 V   (3.15)
Note that the parameters of the TI C64x series DSP given in Eq. (3.15) are much smaller than the parameters k_0 and k_1 in Table 3.4. This is because power consumption varies with the physical layout and manufacturing technology, e.g., the number of transistors per die. Thus, we focus more on their ratio, i.e. parameter γ.
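Since the comparison hinges only on the ratio γ = k_1 / k_0 of Eq. (3.5), it can be reproduced directly from the numbers quoted above. The short Python sketch below does exactly that; the dictionary entries simply restate Eq. (3.15) and a few rows of Table 3.4, and the variable names are illustrative.

# gamma = k1 / k0, computed from the TI C6416 figures in Eq. (3.15)
ti_models = {  # (frequency in MHz, supply voltage in V): (k0, k1)
    (500, 1.2): (.048, .035),    (500, 1.4): (.069, .04875),
    (600, 1.2): (.057, .0425),   (600, 1.4): (.083, .0585),
    (720, 1.2): (.068, .051125), (720, 1.4): (.099, .07025),
}
for (freq, volt), (k0, k1) in sorted(ti_models.items()):
    print("TI C6416 %d MHz %.1f V: gamma = %.3f" % (freq, volt, k1 / k0))

# The same ratio for a few simulated benchmarks from Table 3.4
sim_models = {"MPEG-2 decoder": (.8643, .7588), "H.264 encoder": (.9924, .7429), "JPEG encoder": (.8418, .817)}
for name, (k0, k1) in sim_models.items():
    print("%s: gamma = %.3f" % (name, k1 / k0))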
In Fig. 3.12, we plot and compare the γ values based on k_1 and k_0 in Table 3.4 with those derived from Eq. (3.15). The statistics of the C6416 are in the range of 0.73 ∼ 0.75 at a power supply of 1.2 V and around 0.70 at 1.4 V. Most results in our experiment fall into the range of 0.67 ∼ 0.84. There are two exceptions (i.e. the EPWIC encoder and the JPEG encoder), which are close to 1. Thus, by comparing its power behavior with that of a commercial DSP processor, our methodology is shown to provide an accurate power model for the VLIW processor.
Figure 3.11: The power consumption of the TI C64x as a function of the utilization rate, for 500, 600 and 720 MHz at 1.2 V and 1.4 V.
Figure 3.12: Comparison of parameter γ from the simulation and from the manual of the TI C6416 DSP.
We examine the characteristics of γ with respect to different values of the applied voltage v and the operating frequency f in Figs. 3.13 and 3.14. In Fig. 3.13, the x-axis corresponds to the operating frequencies 500 MHz, 600 MHz and 720 MHz of the TI DSP chip. They correspond to the frequency values of 167 MHz, 200 MHz and 240 MHz in our simulator (i.e. the ratio of the commercial DSP's frequency to the simulator frequency is around 3:1). We see that there is a close match between the actual and the simulated data. Furthermore, we have two interesting observations. First, it confirms our discussion in Sec. 3.4.3 that, as the voltage becomes larger, the increase of the static power outweighs that of the active power and, as a result, γ becomes smaller as v gets larger. Second, with a fixed supply voltage, γ is slightly larger when the processor speed is faster, which can be explained by the fact that more active power is dissipated by a faster processor. These two rules are verified by the data obtained from the TI C6416 DSP. Thus, the proposed power consumption model can predict the general trend of γ with respect to the variation of the applied voltage and operating frequency.
Figure 3.13: Comparison of parameter γ with respect to the fine-scale frequency/voltage variation (TI C6416 at 1.2 V and 1.4 V versus Trimaran at 1.3 V and 1.35 V).
We plot the values of k_0, k_1 and γ with respect to the coarse-scale frequency/voltage variation in Fig. 3.14, where the x-axis represents the supply voltage whose corresponding frequency is given in Table 3.6.
Figure 3.14: Comparison of parameter γ with respect to the coarse-scale frequency/voltage variation (the figure plots k_0, k_1 and γ against the supply voltage).
The curve of k_1 as shown in Fig. 3.14 has a good match with a cubic function of the applied voltage v. This contradicts the intuition that the power is proportional to the square of the applied voltage. Recall that the capacitive power is the major contributor to the dynamic power, which can be described empirically by
Dynamic power ∼ (1/2) × C v^2 α f,   (3.16)
where C is the capacitance, which is a function of the wire length and the transistor size, v is the supply voltage, f represents the clock frequency, and α is the parameter that describes the loading of the processor. For the large-scale variation, f is proportional to v so that the dynamic power dissipation is proportional to O(fv^2) = O(v^3). Thus, parameter k_1, which reflects the active power, can be predicted accurately as a cubic function of v.
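This scaling can be checked roughly against Table 3.6. Writing k_1 ≈ c × f × v^2 in line with Eq. (3.16), where the constant c ≈ 2.2 × 10^-3 is simply fitted to the table entries rather than derived from circuit parameters, gives
k_1 ≈ 2.2 × 10^-3 × 200 × 1.3^2 ≈ 0.74 (table: .7588) at 200 MHz, 1.3 V,
k_1 ≈ 2.2 × 10^-3 × 100 × 1.0^2 ≈ 0.22 (table: .2119) at 100 MHz, 1.0 V,
k_1 ≈ 2.2 × 10^-3 × 50 × 0.85^2 ≈ 0.08 (table: .0775) at 50 MHz, 0.85 V.
The active-power coefficient thus tracks fv^2 closely, and with f roughly proportional to v this reduces to the cubic dependence noted above.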
On the other hand, it is difficult to express k_0 as a simple function of v, since the static power is in general an exponential function of the system temperature t and the threshold voltage V_T. Without knowledge of the cooling system and the package, it is difficult to relate the supply voltage v and the power dissipation to the system temperature t. Usually, when a variable is small, its exponential function can be approximated by a low-order polynomial. Thus, given the approximation that the system temperature t is a linear function of the supply voltage v, the static power can be approximated by a 4th-order polynomial of v. It is confirmed in our experimental data that the processor's static power can roughly be represented as a 4th-order polynomial of v on the coarse voltage variation scale. Consequently, a reduction in supply voltage helps keep the static power low. Our power consumption model explains the power consumption behavior of the embedded processor as v and f vary over a large scale.
3.7 Applicability of the Proposed Power Model
In this section, we explain the application of the developed power model to several possible scenarios. Besides high level power management and optimization, the proposed online power estimation methodology can be applied to processor design space exploration and used for algorithmic optimization guidance. In the early chip design stage, the energy estimation schemes can be used to understand the performance-energy efficiency of different processor configurations so as to select the most effective structure. This will be discussed in detail in Sec. 3.7.1. Furthermore, the energy consumption characteristics of algorithm modules are investigated based on the proposed routine-level power model in Sec. 3.7.2. Finally, we study how the adjustment of parameters such as the quantization parameter (QP) and reference frames affects the overall energy expense in Sec. 3.7.2.
3.7.1 Power Saving Oriented Design in Multimedia Processors
The adaptation of processor resources can be used to reduce multimedia application power on today's high-performance VLIW/EPIC processors. By tuning the processor resources appropriately to meet the actual needs of the program, a significant amount of power can be saved with a minimal impact on performance. The multiple-issue technique is a widely used methodology to enhance performance by exploiting ILP. For the sake of performance alone, it may be attractive to issue as many operations as possible at one time instant. As a result, in high performance processors, we see a trend toward a wider issue data path that consumes more power and becomes the most dominant part of the power consumption.
Table 3.7: Multimedia application IPC and power (MPEG-2 decoding).
  IPC: 1.0775 (1-issue), 1.3592 (2-issue), 1.4068 (4-issue), 1.4097 (6-issue)
  Power (W): 1.69 (1-issue), 2.42 (2-issue), 2.95 (4-issue), 3.11 (6-issue)
To check the issue-width efficiency, the simulator of Sec. 3.3 was adapted by adding integer function units, since most multimedia applications rely on integer operations. The basic configuration of the simulator consists of 1 branch unit, 2 memory load/store units and 1 floating-point unit, which are essential to program execution. In addition, we add n integer ALU units to this basic configuration, which yields what we call an n-issue machine. Table 3.7 shows the IPC and power consumption for the MPEG-2 decoding task on 1-, 2-, 4- and 6-issue machines. We see that the 2-issue machine saves about 20% of the power with a performance loss of only 4% as compared with the 6-issue machine. Thus, the IPC does not scale well with an increasing number of function units. In general, the power-delay product (PDP) is an ideal metric to characterize the power-performance efficiency of low-end embedded systems since the battery life is the main concern in these battery-powered portable devices.
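The numbers in Table 3.7 already hint at the trade-off. For a fixed instruction count at the same clock frequency, the execution time scales as 1/IPC, so the energy per program scales roughly as Power/IPC (a rough proxy that ignores changes in memory stall behavior): this ratio evaluates to about 1.57, 1.78, 2.10 and 2.21 for the 1-, 2-, 4- and 6-issue machines, so the 6-issue machine spends roughly 40% more energy than the 1-issue machine on the same workload, consistent with the normalized-energy comparison in Fig. 3.15.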
Thus, the IPC does not scale well with the increasing number of function units. In general, the power delay product (PDP) is an ideal metric to characterize the power- performance efficiency in low-end embedded systems since the battery life is the main 66 EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR 0 0.2 0.4 0.6 0.8 1 Normalized Energy 1−issue 2−issue 4−issue 6−issue (a) Encoder EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR 0 0.2 0.4 0.6 0.8 1 Normalized Energy 1−issue 2−issue 4−issue 6−issue (b) Decoder Figure 3.15: The normalized energy for encoders and decoders of multimedia applications. EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR 0 0.2 0.4 0.6 0.8 1 Normalized Power 1−issue 2−issue 4−issue 6−issue (a) Encoder EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR 0 0.2 0.4 0.6 0.8 1 Normalized Power 1−issue 2−issue 4−issue 6−issue (b) Decoder Figure 3.16: The normalized power consumption for encoders and decoders of multimedia applications. 67 EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR 0.8 0.95 1.1 1.25 1.4 1.55 IPC 1−issue 2−issue 4−issue 6−issue (a) Encoder EPWIC JPEG MPEG−2 MPEG−4 H.264 AMR .75 0.9 1.05 1.2 1.35 1.5 IPC 1−issue 2−issue 4−issue 6−issue (b) Decoder Figure 3.17: The IPC for encoders and decoders of multimedia applications. concern in these battery-powered portable devices. Due to the difficulty of measuring dynamic power for PDP computation, the total energy consumption of each program is used to demonstrate the power-performance tradeoff. Fig. 3.15 shows the energy of different multimedia applications running on different issue-width machines. In general, the 6-issue mode is not energy-efficient since it has the highest energy in all applications. The application of the 1-, 2- and 4-issue modes yields a better trade-off between the power and performance. More interestingly, the optimal configuration changes according to the target applications. The individual multimedia application has some specific functionality, which may exhibit a large variation in the computational requirement. Thus, a configuration that is optimal for one application may not be the optimal for the other. Similarly, the power dissipation and performance across a wide range of multimedia benchmarks considered with different issue widths are shown in Fig. 3.16 and Fig. 3.17, respectively. As shown in these figures, the complex 68 circuit of a multiple-issue machine width more than 4 issues is not worthwhile from the power/energy point of view in most cases. 3.7.2 Execution Parameter Adjustment Besides the dynamic energy optimization and thermal management by changing the pro- cessor configuration, the dynamic execution parameter adjustment can yield better trade- off between energy consumption and coding performance. However, this demands an online energy model to manage the energy behavior adaptively according to the avail- able battery resource. The routine-based power model provides a tool to analyze the the energy consumption of multimedia applications corresponding to different execution parameters at the algorithm level. Two execution parameters and their impact on the energy expense are briefly discussed below. The quantization parameter (QP) is a widely used parameter to adjust the coded media quality and its bit rate. It also has a great impact on the system resource us- age such as the execution time and the energy cost. Thus, we would like to develop a model to determine the appropriate QP under the energy budget constraint. We take the H.264/AVC encoder as an example to understand the quantization effect on the energy consumption. 
Three sequences (Akiyo, Foreman and Mobile & calendar) are chosen to analyze the energy consumption. We see from Fig. 3.18 that the energy consumption decreases as QP increases and that the energy consumption is sequence-dependent. As a result, we may conduct the dynamic QP scaling (DQS) scheme according to the available battery resource so that the support time can be maximized. Depending on available 69 15 20 25 30 35 40 45 0 50 100 150 200 250 QP Engergy(J) Foreman Mobile & calendar Akiyo Figure 3.18: The energy consumption as a function of QP for three test sequences. resources, there exists an optimal balance between the energy, rate and distortion, which will be addressed in detail in Chapter 4. The routine-level energy analysis leads to an finer granularity at the algorithm level. The routines in the H.264 encoder are grouped into 8 modules to understand the impact of the QP variation on the energy cost for each algorithmic module: transform and quantization, inverse transform and dequantization, intra-prediction, motion estimation, interpolation (also used in motion search but relatively independent), loop-filter, CABAC entropy coding, and others. The consumed energy values of routines in each group are summed up to obtain the overall energy cost of each group, and the results are shown in Fig. 3.19. We have the following observations. • Energy bottleneck identification. Although fast motion search algorithms are used for motion estimation, this is still the most energy-expensive module for the entire QP range. 70 0 30 60 90 120 150 Energy(J) Tran/Quan ITran/IQuan Intra pred. Motion search Interpolation Deblocking CABAC Others QP = 16 QP = 22 QP = 28 QP = 34 QP = 40 Figure 3.19: Comparison of energy consumption in various algorithmic modules. • Energy-QP analysis at the algorithm level. Some algorithm modules are QP-sensitive while others are not. By energy profil- ing, we find that the following three modules belong to the QP-insensitive group: transform and quantization, loop-filter and others. This is reasonable since the QP variation barely change the execution number of routines in these modules. On the other hand, when QP becomes larger, the number of zeros in the transform will increase, which will lower the routine running time in the inverse-transform and entropy coding modules. A larger QP value will result in worse image quality so that motion search and intra-prediction could be done faster, which means fewer routines and lower energy consumption. The above observations imply that the motion estimation module should be the first one to be optimized to reduce energy consumption. Based on the above discussion, the 71 proposed routine-level power estimation method can help build up the dynamic quan- tization scaling (DQS) scheme to balance the energy need and energy consumption in run-time while maintaining the coding efficiency of H.264/AVC. 3.7.3 Decoder Friendly Encoding Algorithm Many media application devices such as mobile handheld devices become smaller and lighter. Their energy resource is relatively scarce, given the increasing algorithm com- plexity of applications running on the devices. Therefore, it is highly desirable to develop a decoder-complexity-aware encoder system to alleviate the decoder computational bur- den. In this subsection, we focus on an energy-constrained problem for the H.264 decoder, i.e., how to develop an encoding algorithm that meets requirements of high video quality and low decoding complexity at the same time. 
This development needs an online de- coder energy consumption estimation. As shown in Fig. 3.20, we see that the fractional pixel interpolation and the adaptive de-blocking filter (ADF) are the top two energy- consuming modules in decoding devices. In Chapter 6, we will address the development of a decoder friendly ADF mode decision encoding algorithm using the IPC-based run time power estimation model as discussed in this chapter. 3.8 Conclusion Modern embedded multimedia applications are characterized by the emergence of resource- limited VLIW processors with SIMD instruction extension. The power profiling result suggests a strong correlation between IPC and power dissipation. By exploiting this 72 0 20 40 60 Energy(J) Motion comp. Error resil ITran/IQuan Intraprediction Entropy decoding Loop filter Others Output Mobile & calendar Foreman Akiyo Figure 3.20: The energy distribution in different algorithmic modules of the H.264/AVC decoder. relationship, the power and the energy models for multimedia applications in VLIW pro- cessors were proposed. The proposed model was validated by the TI C6416 chip-set, yielding a low error rate. This simple model leads to efficient run-time power estimation for various multimedia systems, and provides insights into the energy saving of the VLIW processor. 73 Chapter 4 Complexity-Adaptive Motion Search for H.264 Encoding 4.1 Introduction Motion estimation is one of the most time-consuming units in the H.264 encoder due to the use of long term memory motion-compensated prediction (LTMMCP) [63], variable block sizes and fractional pixel interpolation [37] [62]. Several fast motion search algorithms such as the diamond search [69] and the hexagon search [68] were developed to accomplish a significantly faster speed while maintaining similar R-D performance as compared to the full search. They have achieved a great success in the general purpose processor (GPP) system. However, the speed-up of these fast motion search algorithms when implemented on embedded systems is not so impressive due to the limited resource of the embedded environment. Thus, it is critical to study an effective tradeoff between resource utilization in an embedded system and coding efficiency. Our proposed scheme is motivated by two important observations. First, for the local refinement of fast motion search algorithms, not every round of local search can achieve 74 equally good reduction in the joint rate-distortion cost. By eliminating the complexity- inefficient search efforts, motion estimation can be accelerated at the cost of little coding performance aggravation. Second, due to the video signal characteristics, motion esti- mation of block modes other than 16x16 is often redundant. For instance, average 73% MBs are encoded by 16x16 macroblock mode [44]. Thus, by skipping motion search of unnecessary modes, we can speed up the motion estimation process without sacrificing the coding efficiency much. Our overall objective in this work is to reduce the complexity of the motion search yet maintaining high video quality. Generally speaking, we propose a joint rate-distortion-complexity (RDC) optimization framework to balance the coding efficiency and the complexity cost of the H.264 encoder in this work. The method can cut off the complexity-inefficient local refinement search efforts, skip redundant motion search of redundant block modes and terminate motion search at the optimal RDC points. 
Our scheme saves the complexity for the motion search up to 35% with a small bit rate increase and negligible video quality degradation. The rest of this chapter is organized as follows. The background and related work is reviewed in Sec. 4.2. The RDC optimization framework based on the Lagrange optimiza- tion method is proposed in Sec. 4.3. Then it is explored to select appropriate Lagrangian multipliers according to the quantization parameter in Sec. 4.4. Experimental results are presented in Sec. 4.5. Finally, some concluding remarks are provided in Sec. 4.6. 75 Transform Quantization Entropy Coding Inverse quantization Inverse Transform Motion estimation & Mode decision Motion compensated prediction Frame buffer (Delay) Intra-frame prediction Deblocking filter Input MB video signal Intra Inter Prediction residue signal Quantized coefficients Motion vector data Dotted box shows decoder Decoded video Bit stream out Figure 4.1: A hybrid video codec using motion compensated predictive coding. 4.2 Background and Related Work In this section, we review the H.264 coding standard by focusing on the parts that are relevant to our discussion in this chapter. A thorough introduction to H.264 can be found in [62]. 4.2.1 Motion Estimation in H.264 One way of compressing the video content is to compress each picture using an image codec such as JPEG. The coding efficiency is smaller since only spatial redundancy is employed in the compression process, which is referred to as intra-frame coding. Temporal redundancy can be exploited to obtain better compression performance, which is referred to as inter-frame coding. Most today’s video coding methods are called hybrid codecs since they adopt a hybrid of intra-frame and inter-frame coding techniques [51]. 76 Fig. 4.1 gives the block diagram of a generic hybrid video encoder based on motion predictive coding. The input image sequence is divided into groups of pictures (GOP) that consist of one intra frame (I frame) and multiple inter frames (P frames and B frames). The basic encoding and decoding unit is a macroblock (MB), which is a block of 16×16 pixels. The MBs are coded in either the intra or the inter mode. In an inter mode, apredictionMB P is formed based on one or multiple reconstructed frame(s), depending on which reference frame provides the best rate-distortion feature. Thus, a motion vector that indicates that the corresponding position of the reference MB is encoded. The reference frames for P frames can only be backwards (past in the time order). The reference frames for B frames can be either backwards and forwards (future in the time order). The residue D n , which is the difference between the original MB and the predicted MB, is transformed and quantized to give a set of quantized transform coefficients. Then, the coefficients are entropy encoded. In video decoding, quantized residual MB coefficients are de-quantized and inverse transformed to added back to predicted MB to reconstruct the target MB. In this chapter, we will focus on the motion estimation and mode decision module in Fig. 4.1, which is marked by the gray color. 4.2.2 Motion Estimation Modes of Variable Block Sizes To further reduce temporal redundancy and improve the R-D performance of motion estimation, H.264 introduced a tree-structured variable-size block modes. Unlike earlier coding standards using a fixed block size (usually 16× 16) for motion estimation, H.264 allows to partition an MB into blocks of variable sizes. 
For motion estimation of each luma MB in P and B frames, block shapes can be 16× 16, 16× 8, 8× 16 or 8× 8. For a 77 0 0 1 01 01 23 0 0 1 01 3 01 2 16x16 16x8 8x16 8x8 8x8 8x4 4x8 4x4 M Types 8x8 Types Figure 4.2: MB partition modes. 8× 8 block in the P frame, we can further partition it into blocks of size 8× 4, 4×8or 4× 4 as shown in Fig. 4.2. Motion compensation with variable block sizes allows more flexible and accurate pre- diction to offer better R-D performance at the cost of higher search complexity. Full search is usually too expensive to be implemented in a GPP system. Consequently, many fast motion search schemes have been proposed to speedup the motion search process. Most fast motion estimation is processed in two steps. First, a motion vector is predicted based on motion vectors of neighboring blocks [10], where one can compare the neighbor motion vectors and get their medium value [58]. Next, a local refinement is conducted around the predicted motion vector by calculating the SAD (sum of absolute differences) value, which takes the largest computational cost. In general, the local refinement scheme consists of 2 steps: (i) coarse search with a larger stepsize and (ii) fine search with a smaller stepsize. The large pattern is kept using until the minimum distortion point occurs at the center of pattern, then it switches to the small pattern. For example, one can conduct coarse search using a large diamond search 78 (a) Diamond search (b) Hexagon search Figure 4.3: Coarse and fine search patterns for local refinement: (a) the diamond search and (b) the hexagon search. pattern in the beginning stage and fine search using a small diamond search pattern in the later stage. The diamond and hexagon search patterns are shown in Fig. 4.3(a) and Fig. 4.3(b), where circles and squares represent the coarse and fine search patterns, respectively. 4.2.3 Rate-Distortion (R-D) Optimization A typical video encoder encodes an input image sequence by selecting some optimal control parameters P, subject to the constraint of the target bit rate R T by solving minD(P), subject to R(P)≤ R T , (4.1) where D is a distortion measure which is a function of control parameters. The main control variables P involved in this process include quantization parameter QP,block mode M and motion vector m. The interaction between these variables are quite complex so that their optimization is complicated. In practice, the Lagrangian method [51] is a 79 widely accepted approach for bit allocation. The Lagrangian method is applied into two stages: motion estimation and residue coding. In the stage of motion estimation, specifically, for each macroblock B with fixed block mode M, the optimal motion vector associated with the block is selected via minimizing a joint rate-distortion (RD) cost function [51]: J R,D Motion = D DFD + λ D R(m|p m ), (4.2) where m is the motion vector acquired by the motion estimation process, p m stands for the predicted motion vector, R represents the bits associated with motion information , D DFD is the prediction error between the current block and the reference block and λ D is the Lagrange multiplier for motion estimation. Usually, the sum of absolute differences (SAD) is adopted to measure the distortion D DFD since its computational cost is lower than that of the sum of squared difference (SSD). Fig. 
4.4(a) shows how J R,D Motion is applied to the fast motion search scheme to determine the optimal motion vector and terminate the local refinement process: a smaller J R,D Motion assures the next search round. Similarly, to do the rate-constrained mode selection, the joint cost of distortion and block mode selection in the residual coding stage can be written as: J R,D Mode = D Rec (M|Q)+ λ M R(M|Q), (4.3) where M is the evaluated macroblock mode out of a set of possible modes, Q stands for the value of quantizer control parameter for transform coefficients. R is the number of bits associated with header, motion and transform coefficients, D Rec is the difference between the reconstructed macroblock and the reference one measured by SSD and λ M is 80 the Lagrange multiplier for mode decision. The search space of mode M in (6.8) can be picked from the set of all possible block modes{INTRA 16× 16, INTRA 4× 4, INTER 16× 16, INTER 16× 8, INTER 8× 16, INTER 8× 8, INTER 8× 4, INTER 4× 8, INTER 4× 4, SKIP, DIRECT}. Due to the low-efficiency of 7-block-mode search for H.264, many simplified algorithms have been developed to eliminate unnecessary modes to save computation. In [67], instead of testing the 7 block modes from the largest to the smallest modes, this fast algorithm tests mode 8×8aftermode16× 16 as shown in Fig. 4.4(b). If the combined R-D cost of mode 8×8 islessthanthat ofmode 16× 16, the search jumps directly to mode 4× 4 while ignoring modes 16×8and 8× 16. By doing so, unnecessary modes are skipped to speed up the search process. A similar idea can be used to skip modes 8×4and4× 8. The Lagrange multipliers in (4.2) and (6.8) determine the relative weights between the signal quality and the bit rate. To simplify the search procedure, the following empirically derived relationship λ D = λ M (4.4) is used in practice, if SAD is used in modeling D DFD while SSD is used for D Rec . In H.264 the quantization step size Q is associated with QP by the equation: Q=2 QP−4 6 (4.5) Given a QP for the coding unit, the following relationship λ M =0.85× 2 QP−4 3 (4.6) 81 Start local refinement Compute best ) ( , i J D R Motion Compute best ) 1 ( , i J D R Motion End Yes ) ( ) 1 ( , , i J i J D R Motion D R Motion No (a) Local refine- ment Mode 16x16 ME Mode 8x8 ME J 8x8 <J 16x16 Mode 4x4 ME J 4x4 <J 8x8 Mode 8x4 ME Mode 4x8 ME End Mode 16x8 ME Mode 8x16 ME No Yes No Yes Not available for B frames (b) Fast mode decision Figure 4.4: J R,D Motion application in local refinement and fast mode decision. is used in H.264 to determine the Lagrange multiplier in the R-D optimization [65]. The validity of such a model is justified by empirical simulations. 4.2.4 Related Work The complexity-constrained motion estimation has been studied for a while. For exam- ple, Kossentini et al. explored the complexity-constrained MPEG-2 motion search in [32] [14], where little attention was paid to the bit rate increase at the expense of compu- tation saving. In [31], [30] and [29], Lagrange cost function is applied to predict SKIP macroblock mode prior to motion estimation. Intuitively, this scheme works effective in the video sequences without intensive motion and frequent scene change. In [50], the three-dimension of rate-distortion-complexity problem is reduced to two-dimension by associating the slope of rate-distortion with complexity. There were also efforts to re- duce the motion compensation cost in the decoder. 
Wang studied the decoder-friendly 82 Motion prediction Compute best ) ( , , i J C D R Motion Compute best End Yes ) ( ) 1 ( , , , , i J i J C D R Motion C D R Motion No ) 1 ( , , i J C D R Motion (a) Local refinement Mode 16x16 ME Mode 8x8 ME Mode 4x4 ME Mode 8x4 ME Mode 4x8 ME End Mode 16x8 ME Mode 8x16 ME No Yes No Yes Not available for B frames C D R C D R J J , , 16 16 , , 8 8 u u C D R C D R J J , , 8 8 , , 4 4 u u (b)Fastinter-modedecision Figure 4.5: Proposed algorithm: RDC optimization in local refinement and fast mode decision. encoder algorithm that generates low-decoding-complexity and high-quality bit stream for decoders [61]. The complexity reduction in the H.264 encoder in an embedded envi- ronment is quite different from previous work on complexity reduction in GPP systems, and most previous research was conducted from a pure algorithmic viewpoint. To our best knowledge, there has been little research conducted to balance the rate-distortion (R-D) performance and the complexity cost of H.264 motion estimation in the encoder end under an embedded environment. We propose a new complexity-constrained motion search scheme that is suitable for embedded system implementation in Sec. 4.3, as Fig. 4.5 illustrates. To be more specific, given metrics for signal distortion and computational complexity, the proposed scheme exploits the R-D tradeoff and resource consumption to determine the optimal motion vec- tors and the block mode in the motion search process in the encoder. The proposed RDC 83 optimization framework extends the Lagrange optimization framework by including the complexity term which provides quantitative measurements of the required computation for each motion vector type. 4.3 Rate-Distortion-Complexity (RDC) Optimization Framework We propose a complexity adaptive H.264 motion search scheme that explores the opti- mal tradeoff between coded video quality and resource consumption in the encoder im- plemented in an embedded environment. We present a joint rate-distortion-complexity (RDC) optimization framework in this section. Similarly to (4.1), our task is to select the optimal control parameter set P subject to the constraint of the target complexity: min D,R(P ), subject to C(P )≤ C T . The RDC framework is an extension of the Lagrange optimization framework discussed in Sec. 4.2.3 to incorporate the complexity consideration, where the complexity cost function provides quantitative measurement of the complexity for each macroblock. Furthermore, the choice of the Lagrangian multipliers corresponding to the quantization parameter QP is developed to trade-off between the coding performance loss and the complexity saving: maximizing the computation saving with a reasonable small rate-distortion deterioration. Two new Lagrange parameters λ Motion and λ Mode are adopted in our RDC framework to control the tradeoff between the R-D feature and complexity consumption. Parameter λ Motion is used to determine the motion search process in a block mode while parameter 84 λ Mode is employed to decide whether the motion search of subsequent block modes (of smaller block sizes) is worthy to be conducted or not. 
We can include the complexity cost in the R-D optimization cost in (4.2) via

J^{R,D,C}_{Motion} = J^{R,D}_{Motion} + λ_{Motion} C_{Motion}, (4.7)

where J^{R,D}_{Motion} is the R-D cost function defined in (4.2), C_{Motion} is the complexity cost function for a given macroblock B and mode M, λ_{Motion} is the Lagrangian multiplier for the complexity term, and J^{R,D,C}_{Motion} is the newly defined joint RDC cost function in our algorithm, which replaces J^{R,D}_{Motion} in Fig. 4.4(a). By using J^{R,D,C}_{Motion}, the motion search process stops at the optimal RDC point instead of an optimal R-D point.

Because of the large number of capacity misses that occur when switching modes, the complexity cost should also enter the decision of whether it is worthwhile to continue the search in subsequent modes. Thus, instead of using J^{R,D}_{16×16} and J^{R,D}_{8×8} in the comparison modules of the simplified algorithm [67], shown as grey rhombuses in Fig. 4.4(b), we define a new cost function as

J^{R,D,C}_{Mode} = J^{R,D}_{Motion} + λ_{Mode} C_{Mode}, (4.8)

where J^{R,D}_{Motion} is the R-D cost function given in (4.2) and J^{R,D,C}_{Mode} is the newly defined RDC cost function, which is used in the simplified algorithm in Fig. 4.5(b) to decide whether motion search should be continued in subsequent modes or not. That is, once the reduction of the joint R-D cost function J^{R,D}_{Motion} is not worth the computational expense paid by the motion search for mode 8×8 or 4×4, its subsequent block modes are skipped to accelerate the motion estimation process. C_{Mode} represents the accumulated complexity cost associated with the block modes, which is related to the search order among the block modes. For example, C_{Mode} for the fast algorithm [67] shown in Fig. 4.5(b) is

C_{Mode,16×16} = C_{16×16},
C_{Mode,8×8} = C_{16×16} + C_{8×8}, (4.9)
C_{Mode,4×4} = C_{16×16} + C_{8×8} + C_{4×4},

where C_{Mode,16×16} is the mode complexity used in (4.8), C_{16×16} is the complexity cost associated with mode 16×16, and so on.

To locate the appropriate Lagrangian multipliers corresponding to a given QP, one approach is to search the whole space exhaustively, which requires an enormous amount of computation and is thus infeasible. Recall that the relationship between λ_D and λ_M strongly depends on the choice of distortion measures in (4.2) and (6.8). A similar restriction is employed to limit the search space: because the same criterion J^{R,D}_{Motion} is used in (4.7) and (4.8) to measure the joint R-D cost, we have

λ_{Mode} = λ_{Motion}. (4.10)

Because of the equality relationship in (4.10), λ' alone is used to denote the two Lagrangian multipliers, λ_{Motion} and λ_{Mode}, in the later part of this work.

For the joint RDC function discussed earlier, we need a quantitative model of the complexity associated with each candidate motion vector and block mode. Since the SAD operation
This method works for the situation where cache miss does not play an important role in program execution, e.g., a GPP system or an embedded system with a reasonably large cache. An even simpler approximation can be given by considering the block size alone. For example, if the SAD cost of one 4 block is one unit, then the SAD cost of one 16× 16 block is 16 units. Meanwhile, the ET-based method is more accurate for a specific platform since it uses the profiling data to decide the weight assigned to each mode. The instruction-based and ET-based results are listed in Table 4.1 for comparison. Basically both C Mode and C Motion are decided by the searching points and block modes together. Particularly the complexity cost of the ith block mode motion search can be expressed by C i = N i × W i , (4.11) 87 where N i is the searching points associated with block mode i (1 ≤ i ≤ 7), and W i is the assigned complexity weight. The ET-based weight given in Table 4.1 is used in our experiment. The complexity cost to encode a video sequence is the accumulated complexity sum of all the SAD calculation consumed to encode the video sequence. 4.4 Lagrangian Multiplier Selection When the two Lagrangian multipliers vary from 0 to 32 (λ = 0 in the reference H.264 encoder), the RDC performance variation is given in Fig. 4.6. As shown in Fig. 4.6(a), the overall complexity reduces throughout the entire bit rate range as λ increases. When λ = 32, the PSNR loss in Fig. 4.6(b) is around 1 dB with an increased bit rate, which has a negative impact on the coding coding of H.264. Thus, the choice of the Lagrange multiplier λ is critical to the rate and the distortion performance of reconstructed video. A larger λ value aggravates the rate-distortion tradeoff while a smaller λ value cannot remove all redundant matching operations. Since the choice of quantization parameters (QP) for DCT coefficients is critical to video quality and the bit rate, it is imperative to select appropriate Lagrangian multipliers as QP varies. A method to solve this problem is discussed in this section. First, the experiment that yields that optimal λ value is presented. Then, the optimal λ choice is verified. For simplicity, the restriction of (4.10) is adopted to limit the search space. In Fig. 4.7, we show J R,D Mode , which is the joint R-D cost of each macroblock calculated by (6.8), as a function of the averaged complexity per macroblock for the reference H.264 encoder (the solid line) and the RDC optimized algorithm with λ = 32 (the dashed 88 0 500 1000 1500 2000 2500 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 x 10 9 Bit rate(kbps) Weighted complexity λ’=0 λ’=1 λ’=2 λ’=4 λ’=8 λ’=16 λ’=32 (a) Rate-complexity performance 500 1000 1500 2000 30 31 32 33 34 35 36 37 38 39 40 Bit rate(kbps) PSNR λ’=0 λ’=1 λ’=2 λ’=4 λ=’8 λ’=16 λ’=32 (b) Rate-quality performance Figure 4.6: The rate-complexity and the rate-quality performance of the Stefan sequence parameterized by the value of λ . line). We use J R,D Mode and C to represent the joint RD cost and the complexity of each macroblock in the reference H.264 encoder and J R,D Mode and C for the RDC optimization framework. Then, we define the RD cost decrease due to the complexity increase as ΔJC =− ΔJ ΔC , (4.12) where ΔJ = J R,D Mode − J R,D Mode , ΔC = C − C. Since we want to have more complexity saving with little influence on J R,D Mode , the smaller ΔJC, the better. 
Furthermore, we define another parameter, called the complexity efficiency (CE), as

CE = (ΔJC_max − ΔJC) / (ΔJC_max − ΔJC_min) × 100%, (4.13)

where ΔJC_max and ΔJC_min are the maximum and minimum values of ΔJC over the entire range of λ'.

Figure 4.7: The R-D cost J^{R,D}_{Mode} as a function of the complexity per MB for the Stefan sequence with the H.264 reference coder and the RDC-optimized algorithm.

Fig. 4.8(a) and Fig. 4.8(b) show the CE values at several QPs, parameterized by λ', for the "Stefan" and "Mobile & calendar" sequences. The λ' value varies from 1 to 64 with a step size of 1, which yields 6 normalized curves. The data are acquired by encoding 300 frames. We see that the optimal λ' value, where the maximum CE is located, is a function of the QP. However, its value is relatively stable with respect to these two test sequences. By plotting the optimal λ' of each sequence and its mean value at different QPs, we find that the optimal λ' can be well approximated as a quadratic function of QP, as shown in Fig. 4.9. The dashed lines in Fig. 4.9 are the optimal λ' values obtained from various sequences, the square symbols are their mean optimal λ' values, and the solid line is the quadratic fit to the square symbols. The curve fitting function is

λ_{Mode}(QP) = a_2 QP^2 + a_1 QP + a_0, (4.14)

where a_2 = 0.015, a_1 = −0.524 and a_0 = 5.365.

Figure 4.8: The complexity efficiency (CE) plotted as a function of the Lagrange parameter λ', parameterized by QP values: (a) Stefan, (b) Mobile & calendar.

Figure 4.9: Lagrange multiplier λ' as a quadratic function of QP.

4.4.1 Experimental Verification

Similarly to [64], λ' can be derived as the negative slope of the C_{Mode}-J^{R,D}_{Motion} curves. By assuming that the joint RDC cost function in (4.8) is differentiable everywhere, we can obtain λ' by setting the derivative of the cost function to zero so that J^{R,D,C}_{Mode} is minimized. Thus, λ_{Mode} is given by

dJ^{R,D,C}_{Mode}/dC_{Mode} = dJ^{R,D}_{Motion}/dC_{Mode} + λ_{Mode} = 0  ⇒  λ_{Mode} = − dJ^{R,D}_{Motion}/dC_{Mode}. (4.15)

To the best of our knowledge, there is little theoretical analysis of the motion estimation complexity C_{Motion} and the mode complexity C_{Mode}. Hence, we turn to computer simulation. To confirm the relationship in (4.15), an experiment is designed to measure the value of −dJ^{R,D}_{Motion}/dC_{Mode} at a given QP. First, the reference H.264 encoder is run without the complexity optimization with QP ∈ {16, 22, 28, 34, 40, 46}. Here, only the "INTER 16×16" block mode is enabled in motion estimation. Thus, the reference per-macroblock averages of J^{R,D}_{Motion} and C_{Mode} are obtained. Then, the complexity-constrained encoder is applied to the same video sequence with λ' set according to (4.14) to acquire the corresponding J^{R,D}_{Motion} and C_{Mode} values. Finally, for each sequence, the slope between these two operating points is computed at each QP. The negative slopes of the video sequences are shown in Fig. 4.10 as dotted lines. The solid line represents the quadratic fitting curve described by (4.14). The good matching effect in Fig.
4.10 implies the validity of the proposed methodology to determine proper Lagrange multipliers. 92 15 20 25 30 35 40 45 50 0 3 6 9 12 15 QP −dJ RD Motion /dC Mode Akiyo Foreman Mobile & calendar Stefan Quadratic curve fitting Figure 4.10: The plot of−dJ R,D Motion /dC Mode as a function of QP. 4.5 Experimental Results Experimental results using the proposed algorithms are reported in this section. The experiment environment used in our simulation, including the test data and chosen pa- rameters, is listed in Table 4.2. The algorithms are applied to nine typical test sequences, representatives of the Class A, Class B and Class C sequence respectively. These 3 classes provide a variety of spatial details and motions, including slow, medium and fast move- ment. Particularly, Class C contains 2 sub-class sequences: fast motion (Stefan) and camera panning with rich spatial details (Mobile & calendar and Flower), because the experimental results of these two type video sequences are quite different from each other, the experimental results of one sequence from class A (Akiyo) and class B (Foreman) and 3 class C sequences will be provided in this section. Due to the speed and code size con- cern [57], x.264 [60] instead of the JM [45] encoder is used as our H.264 reference codec. Please note that the former is almost 100 faster than the latter [57]. We review the encoder performance in terms of rate, distortion and complexity con- sumption yielded by our proposed algorithms discussed in Sec. 4.3. 93 Table 4.2: Description of Experimental Data and Parameters. Sequence information Sequence name Class A: Akiyo, Container ship and Mother & daughter Class B: Coastguard, Foreman and News Class C: Flower Mobile & calendar and Stefan Frame size CIF (352× 288 pixels) Video format 30fps, GOP 300, IPBPB sequence Simulation Parameters PSNR 30∼40 Reference frame 5 λ Mode (4.14) κ 2 H.264 Encoder Block mode All on Fast motion estimation Diamond S-P Frame No Fig. 4.11 shows the rate-distortion (R-D) curves and the rate-complexity (R-C) curves of the test sequences parameterized by λ values. The complexity value is measured in terms of the weighted SAD complexity required in the local refinement process. In Fig. 4.11, the dashed lines are the R-D curves while the solid lines are R-C ones. For comparison, we show in Table 4.3 the results obtained from the reference H.264 encoder which already incorporates the fast mode decision algorithm illustrated by Fig. 4.4(b). To demonstrate the performance of our algorithm, we present various results with the proposed algorithms in Tables 4.4. The numbers in the tables are the averaged values by using data obtained from the predicted frames. The meaning of each column is explained below. • PSNR deg. (dB) represents the PSNR loss caused by the proposed algorithms. • Rate inc. (%) stands for the increased bit rate of the encoded bit stream by com- parison to the data listed in Table 4.3, which is expressed in terms of percentage. 94 10 24 38 52 66 80 30 32.4 34.8 37.2 39.6 42 Bit rate(kbps) PSNR Orig. encoder RDC framework 10 24 38 52 66 80 0 1.2 2.4 3.6 4.8 6 x 10 8 Complexity Orig. encoder RDC framework (a) Akiyo 0 60 120 180 240 300 30 32 34 36 38 40 Bit rate(kbps) PSNR Orig. encoder RDC framework 0 60 120 180 240 300 0 3.2 6.4 9.6 12.8 16 x 10 8 Complexity Orig. encoder RDC framework (b) Container 0 140 280 420 560 700 28 30.4 32.8 35.2 37.6 40 Bit rate(kbps) PSNR Orig. encoder RDC framework 0 140 280 420 560 700 0 0.9 1.8 2.7 3.6 4.5 x 10 9 Complexity Orig. 
encoder RDC framework (c) Foreman 0 70 140 210 280 350 30 32.8 35.6 38.4 41.2 44 Bit rate(kbps) PSNR Orig. encoder RDC framework 0 70 140 210 280 350 0 3.2 6.4 9.6 12.8 16 x 10 8 Complexity Orig. encoder RDC framework (d) News 0 600 1200 1800 2400 3000 28 30.8 33.6 36.4 39.2 42 Bit rate(kbps) PSNR Orig. encoder RDC framework 0 600 1200 1800 2400 3000 0 0.6 1.2 1.8 2.4 3 x 10 9 Complexity Orig. encoder RDC framework (e) Flower 0 500 1000 1500 2000 2500 28 30.8 33.6 36.4 39.2 42 Bit rate(kbps) PSNR Orig. encoder RDC framework 0 500 1000 1500 2000 2500 0 1 2 3 4 5 x 10 9 Complexity Orig. encoder RDC framework (f) Stefan Figure 4.11: The rate-distortion (R-D) and rate-complexity (R-C) curves for test se- quences. 95 Table 4.3: R-D-C performance of the reference H.264 encoder Video sequence PSNR (dB) Rate (kbps) J R,D Mode Complexity Akiyo 30.03 12.40 21150 96655544 36.35 21.30 4646 228756272 40.72 51.70 1677 277704718 Foreman 30.63 179.60 26748 1940554440 36.95 936.40 4244 3733576778 39.91 1683.80 2601 4192657516 Stefan 30.13 344.30 23533 3058520637 37.01 1299.30 6156 4470937984 40.21 2232.30 3364 4803385828 Flower 30.00 414.90 23362 1801389306 37.07 1485.60 6115 2663921936 40.96 2557.30 2934 2946321946 Mobile 29.30 313.50 25336 2019278306 & 36.93 1724.10 5895 3492658894 calendar 40.73 3156.60 2957 3764556148 • J R,D Mode inc. (%) shows the average increased joint RD cost per macro block, which is calculated by (6.8). The result is displayed in terms of percentage. • Comp. saving (%) indicates the overall complexity saving achieved in the motion estimation by the proposed algorithms, which is also expressed in terms of percent- age. Both Fig. 4.11 and Table 4.4 demonstrates by choosing appropriate Lagrangian mul- tiplier, great complexity saving can be achieved: up to 34% local refinement cost can be eliminated without great sacrifice in coding efficiency: the PSNR degradation is less than 0.3 dB and the bit rate increase is around 2%. This result meets our original goal: computation complexity can be greatly saved at the cost of little coding efficiency loss. The complexity saving strongly depends on the content character such as motion activ- ity. Because class A sequences are of low movement amount, like ”Akiyo”, they provide little complexity saving room because its motion search cost is already very low. For the 96 Table 4.4: Experimental results of five video sequences with algorithm I Video PSNR Rate J R,D Mode Comp. sequence deg. (dB) inc. (%) inc. (%) saving (%) Akiyo 0.030 0.806 1.025 9.321 0.000 0.000 -0.387 11.790 0.020 0.000 0.517 8.068 Foreman 0.020 0.223 0.397 22.746 0.070 0.481 1.541 18.999 0.080 0.095 2.029 16.952 Stefan 0.255 1.859 13.211 33.970 0.185 2.005 8.713 28.631 0.135 1.800 7.821 28.699 Flower 0.090 -0.096 2.034 16.188 0.200 -0.155 4.590 15.793 0.210 0.051 5.059 16.480 Mobile 0.070 0.223 1.731 11.948 & 0.110 0.157 2.588 7.387 calendar 0.120 0.111 2.676 7.856 panning sequences, i.e., ”Flower” and ”Mobile &calendar”, the challenge comes from the steady camera movement: although the overall movement of a whole frame is large be- cause every MB is moving, the motion vector for each MB is small due to the steady slow motion. Even for these challenging case, the proposed algorithm I can still achieve about 10% ∼ 15% complexity saving while keeping the R-D features of H.264. The excellent performance can be explained below. • During the local refinement process in the motion search, not every round of search can achieve equal joint R-D cost reduction. 
To save complexity, removing those complexity inefficient search effort only enlarges the prediction error slightly. • Due to the low frequency signal dominance characteristics of video signal, motion estimation of small block modes is frequently redundant. Hence skipping those unnecessary modes has little impact on the coding efficiency. 97 • Though the RDC framework removes some redundant motion search effort, be- cause of the complement effect of multiple reference frames and INTRA mode, the compression performance aggravation is negligible. 4.6 Conclusion In this chapter, a novel complexity-adaptive motion estimation was proposed and ap- plied to H.264 coding standard so that the encoder motion estimation complexity can be reduced with little degradation in the R-D performance. A wide spectrum of test se- quences with low to high motion was chosen to demonstrate the strength of our proposed complexity-adaptive motion search algorithm. Up to 35% of motion search complexity can be saved at the encoder with less than 0.3 dB PSNR loss and a maximum increase of 3% bit rate. The joint R-D-C framework provides an effective solution to the tradeoff optimization between video quality and computational complexity. 98 Chapter 5 Fast Inter-Mode Decision with RDC Optimization 5.1 Introduction A nature of embedded multimedia system (EMS) is that most applications are running at low bit rate. Two important features characterize this scenario: (i) the low bit rate suggests the motion estimation does not have to be accurate since the DCT coefficients will be truncated roughly by the large quantization step anyway. (ii) because of the low frequency signal dominance in video signal, the major encoding block mode is still 16x16 (average 73% [44]) which implies that most motion estimation of other block modes is redundant. RDC optimization framework proposed in Chap. 4 is efficient in removing the redundant motion search of block modes smaller than 8x8 according to its algorithm structure whereas not so well for modes larger than 8x8. A fast inter-mode decision with RDC optimization (FIMD) motivated by the above two facts is proposed in this chapter. Based on statistical features of DCT coefficients, we show that their variance can be represented as a function of displacement errors after motion estimation of mode 16× 16. Then, we investigate the condition under which the 99 block mode of 16× 16 is likely to be the optimal mode. In short, we develop an adap- tive method with multiple thresholds derived from DCT statistics to reduce the motion search complexity based on the RDC optimization framework developed in Chapter 4. Furthermore, a λ -complexity model is developed to control the complexity adaptively to meet the target complexity goal. The rest of this chapter is organized as follows. A fast inter-mode decision algorithm with RDC optimization to exploit the DCT coefficient statistical feature is presented in Sec. 5.2. The method to control the complexities in different coding units to meet the target complexity is described in Sec. 5.3. Experimental results are reported in Sec. 5.4. Finally, concluding remarks are given in Sec. 5.5. 5.2 Fast Inter-Mode Decision with RDC Optimization 5.2.1 Algorithm Overview The fast inter-mode decision algorithm for H.264/AVC with a joint rate-distortion-complexity (RDC) optimization framework is illustrated in Fig. 5.1. It consists of two parts. Part I is the fast inter-mode decision (FIMD), which will be described in detail in this chapter. 
Part II was carefully discussed in Chapter 4. The FIMD algorithm includes two test conditions to check the probability whether DCT coefficients of prediction errors would be quantized to zeros under the RDC framework. This probability is used to determine whether motion estimation of the block mode other than 16× 16 should be skipped. We see that more complexity saving can be achieved by comparing Fig. 5.1 and Fig. 4.4(b). 100 Mode 16x16 ME Mode 8x8 ME Mode 4x4 ME Mode 8x4 ME Mode 4x8 ME End Mode 16x8 ME Mode 8x16 ME No No No Yes Not available for B frames Yes No Yes Yes C D R C D R J J , , 8 8 , , 4 4 u u 2 16 16 SAD SAD T u 1 16 16 SAD SAD T u Part 1: Prediction Part 2: Complexity Constrained ME C D R C D R J J , , 16 16 , , 8 8 u u Figure 5.1: Fast inter-mode decision with RDC optimization 5.2.2 Explanation of Test Conditions We explain the two test conditions shown in Part I in this subsection. The difference between the joint RDC cost functions of mode 16x16 and mode 8x8, as defined by (4.8), can be expressed as J R,D,C Mode16×16 − J R,D,C Mode8×8 (5.1) =(SAD 16×16 + λ D R 16×16 (m|p m )+ λ Mode C Mode16×16 ) −(SAD 8×8 + λ D R 8×8 (m|p m )+ λ Mode C Mode8×8 ), We know C Mode16 − C Mode8 = −C 8×8 by (4.9). Besides, we have R 16×16 (m|p m )− R 8×8 (m|p m ) < 0 because the bit cost to encode one motion vector for mode 16x16 is 101 usually less than that to encode four vectors for mode 8x8. Thus, we can simplify (5.1) further as J R,D,C Mode16×16 − J R,D,C Mode8×8 < (SAD 16×16 − λ Mode C 8×8 )− SAD 8×8 . (5.2) The above inequality suggests J R,D,C Mode16×16 is less than J R,D,C Mode8×8 if the following criterion SAD 16×16 − λ Mode C 8×8 <SAD 8×8 is met. We consider the following two cases that lead to J R,D,C Mode16×16 <J R,D,C Mode8×8 . • Test I: SAD 16×16 <θ SAD1 . If all DCT coefficients of the residual signal for mode 16× 16 are quantized to zeros, it is easy to see that J R,D 8×8 , J R,D 16×8 and J R,D 16×8 will be all greater than J R,D 16×16 . Thus, mode 16× 16 is the optimal mode, and we can terminate further search of other inter-modes. In Sec. 5.2.3, we show how to select an appropriately threshold θ SAD1 so as to get a high probability of all zero quantized DCT coefficients for mode 16× 16. • Test II: SAD 16×16 <θ SAD2 ,where θ SAD2 = θ SAD1 + λ Mode C 8×8 . 102 Test I is stricter than Test II because of the raised threshold by taking the complexity into account. If Test II is met, there is a high probability (SAD 16×16 −λ Mode C 8×8 )− SAD 8×8 ≤ 0, which implies that J R,D,C Mode16×16 is less than J R,D,C Mode8×8 . Please also note that, under Case II, motion search for modes 16×8and 8× 16 should still be conducted while mode 8× 8 is skipped, since we do not know J R,D 16×8 and J R,D 8×16 . 5.2.3 SAD Threshold Selection In this subsection, we show how to determine the threshold value, θ SAD1 , which is needed in the two test cases given in the last subsection. It is observed that the displacement error x after motion estimation can be modeled by the Laplacian distribution, which has a significant peak at zero with exponentially decayed probability at both side [28]. The expected absolute value of a zero-mean Laplacian distributed random variable x is E(|x|)= ∞ −∞ |x| 1 √ 2b e − √ 2|x| b dx = b √ 2 , (5.3) where b is the scale parameter of the Laplacian distribution. E(|x|) can be approximated by the averaged SAD of a macroblock, i.e., SAD/256. Then, the standard deviation of displacement errors can be found as σ x = √ 2b 2 = SAD 128 . 
(5.4)

H.264 employs a 4×4 integer DCT, which can be viewed as a scaled version of the 4×4 floating-point DCT. We can use the original floating-point 4×4 transform matrix to derive the SAD threshold, and the orthonormal property of the floating-point DCT can be exploited to simplify our analysis. The application of the N×N DCT to a one-dimensional residual signal of length N can be written as

F = Hf,

where F is the DCT coefficient vector, f is the residual vector of the motion-compensated prediction error, and H is the N×N transform coefficient matrix. Note that the element in the kth row and nth column of H is

H(k,n) = c_k √(2/N) cos[(n + 1/2) kπ/N],  k, n = 0, 1, ..., N−1,

with c_0 = 1/√2 and c_k = 1 otherwise. The covariance matrix of the transformed DCT coefficients can be written as

σ²_F = σ²_f H R Hᵀ, (5.5)

where

R = [ 1    ρ    ρ²   ρ³
      ρ    1    ρ    ρ²
      ρ²   ρ    1    ρ
      ρ³   ρ²   ρ    1 ]

is the correlation matrix and |ρ| ≤ 1 is the correlation coefficient used to represent the dependency between pixels. Typically, ρ ranges from 0.4 to 0.75 [38]. For example, if ρ = 0.6, we have

H R Hᵀ = [ 2.3680  1.4592  1.0300  0.8212
           1.4592  0.8992  0.6347  0.5061
           1.0300  0.6347  0.4480  0.3572
           0.8212  0.5061  0.3572  0.2848 ]. (5.6)

The above derivation shows that the standard deviation of the DCT coefficients σ_F can be estimated from σ_f. The variance of the kth DCT coefficient can be written as σ²_F(k) = σ²_f [H R Hᵀ]_{k,k}. We see that the standard deviation of the DC coefficient, denoted by σ_F(0), is the largest among all DCT coefficients σ_F(k), k = 0, 1, 2, 3. Even though different quantization steps are used for different DCT coefficients, they are chosen carefully so that the DC coefficient is less likely to be quantized to zero than the AC coefficients.

Now, we proceed to the 2-D analysis, i.e., applying the 2-D separable DCT to the 2-D residual signal. Following the above 1-D analysis with a straightforward generalization, the variance σ²_F(u,v) of the (u,v)th DCT coefficient can be written as [27]

σ²_F(u,v) = σ²_f [H R_X Hᵀ]_{u,u} [H R_Y Hᵀ]_{v,v}, (5.7)

where [·]_{u,v} is the (u,v)th component of the matrix and R_X and R_Y are the correlation matrices along the horizontal and vertical directions with correlation parameters ρ_X and ρ_Y. In our implementation, we choose ρ_X = ρ_Y = ρ = 0.6. According to (5.6), if the DC coefficient of a macroblock is quantized to zero, it is likely that the other coefficients will be quantized to zero, too. Coefficient F_{u,v} is quantized to zero if |F_{u,v}| < α Q_{u,v}, where Q_{u,v} is a quantization factor and α is a scaling factor that depends on the quantization method. For example, the quantization operation in H.264 is performed as

L_{u,v} = sign(F_{u,v}) × [ (|F_{u,v}| − Q/2) / (2Q) ],

where L_{u,v} is the quantized DCT coefficient and [·] denotes integer truncation. If |L_{u,v}| < 1, F_{u,v} is quantized to 0. Hence, |F_{u,v}| < 2.5Q ensures that the quantized DCT coefficient equals 0. Thus, α is chosen to be 2.5 for H.264. For a residual signal with the zero-mean Laplacian distribution, the DC coefficient is quantized to 0 if

σ_F(0,0) < 2.5 Q / κ, (5.8)

where κ controls the probability that the DC coefficient is quantized to 0. For instance, the chance of the DC coefficient being zero after quantization is 99% and 94% if κ = 3 and 2, respectively. Thus, given the SAD value of a macroblock and the quantization factor, we are able to compute the probability for the DC coefficient to be quantized to zero.
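The zero-coefficient test derived above is cheap enough to be evaluated per macroblock. The C sketch below is illustrative only: it combines σ_f = SAD/128 from (5.4), the ρ = 0.6 diagonal entry [HRHᵀ]_{0,0} = 2.368 from (5.6) and condition (5.8); the quantization step q is kept generic here and is related to QP in the next paragraph.

#include <stdbool.h>

/* Test whether the DC coefficient of a 16x16 macroblock is likely to be
 * quantized to zero, following (5.4)-(5.8).
 *   sad   : SAD of the motion-compensated residual of the macroblock
 *   q     : quantization step size (related to QP in the next paragraph)
 *   kappa : confidence control, e.g. 2 (about 94%) or 3 (about 99%)
 * Assumes rho_X = rho_Y = 0.6, for which [HRH^T]_{0,0} = 2.368 by (5.6). */
bool dc_likely_zero(double sad, double q, double kappa)
{
    const double hrh_00  = 2.368;          /* diagonal DC entry from (5.6) */
    double sigma_f  = sad / 128.0;         /* residual std. dev., (5.4)    */
    double sigma_dc = sigma_f * hrh_00;    /* sigma_F(0,0) via (5.7)       */
    return sigma_dc < 2.5 * q / kappa;     /* condition (5.8)              */
}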
In H.264, we have 106 Q =2 QP−4 6 ,where QP is the user control quantization parameter. To meet condition (5.2), we can substitute (5.6) and (5.4) into (5.8) to derive the following threshold value θ SAD1 = 2.5× 2 QP−4 6 × 128 2.2368× κ ≈ 96 κ × 2 QP 6 . (5.9) Similarly, the second threshold value, θ SAD2 , can be found as θ SAD2 = θ SAD1 + λ Mode C min8×8 = 96 κ × 2 QP 6 + λ Mode C min8×8 . (5.10) where C min8×8 is the lower complexity boundary corresponding to mode 8×8, which can be obtained from [20]. Experimental results for the proposed fast inter-mode decision algorithm are reported in this section to select appropriate κ value. Four sequences are tested. They are ”Akiyo”, ”Foreman”, ”Stefan” and ”Mobile & calendar”, where each sequence has 300 frames of the CIF size. They provide a variety of motion patterns including slow, medium, fast motions and the camera panning movement. Fig. 5.2 shows the coding efficiency and complexity saving with different κ values, where the performance of the algorithm in [67] is used as a benchmark. We see from Fig. 5.2(a) that κ = 1 lowers the PSNR by 0.2dB for ”Mobile” sequence. As shown in Fig. 5.2(b), the computational saving associated with κ = 3 is too small to be attractive. Thus, κ = 2 gives a good tradeoff between coding performance and the complexity saving. Furthermore, it is observed that more complexity saving is achieved with negligible coding performance degradation when the bit rate is low. This can be explained as follows. 107 500 1000 1500 2000 2500 26 28 30 32 34 36 38 40 Bit rate (kbps) PSNR Algorithm in [6] κ = 3 κ = 2 κ = 1 (a) R-D performance 500 1000 1500 2000 2500 0 10 20 30 40 50 Bit rate (kbps) Normalized comp. saving (%) κ = 3 κ = 2 κ = 1 (b) Rate vs.complexity saving (%) Figure 5.2: The performance of the proposed algorithm for Mobile & calendar. 0 2 4 6 8 10 0.5 1 1.5 2 2.5 3 x 10 5 γ Complexity QP=4 QP=16 QP=28 QP=40 (a) P frame 0 2 4 6 8 10 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 x 10 5 γ Complexity QP=4 QP=16 QP=28 QP=40 (b) B frame Figure 5.3: Relationship between the Lagrange multiplier and complexity reduction. When QP becomes larger for the low bit rate coding, the threshold value θ SAD1 as given in (5.9) increases in the power of two. Hence, more computation is saved. Besides, a large quantization step size leads to a rough reconstructed frame. Since the residual signal is quantized grossly, the accurate displacement is actually not needed and motion search can be greatly simplified to save the complexity cost. 108 5.3 Complexity-Adaptive Motion Search The proposed fast inter-mode decision algorithm with RDC optimization, illustrated by Fig. 5.1, provides us a tool to control complexity consumption associated with motion search. Our discussion in this section includes complexity control scheme in the frame level which is used to control the motion search complexities in order to meet the overall target complexity. Two complexity control components are considered: (i) complexity modeling and Lagrangian multiplier selection and (ii) complexity buffer management. The former is used to characterize the relationship between the complexity cost and the λ . The latter is used in monitoring the complexity usage and updating the available computational resource for un-encoded frame. For complexity control, an important task is to determine the relation between the complexity and the control parameter, e.g., the Lagrangian multiplier. 
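One concrete way to maintain this complexity-versus-multiplier relation online is sketched below. The C fragment anticipates the reciprocal model and LMS update given in (5.11)-(5.13); the structure name, the step size mu and any initial coefficient values are hypothetical placeholders.

/* Online identification of the frame-level complexity model
 *     C(gamma) = a0 + a1 / (gamma + 1)
 * with an LMS-style update, anticipating (5.11)-(5.13). The structure
 * name, the step size mu and the initial coefficients are placeholders. */
typedef struct { double a0, a1; } cplx_model;

/* One LMS update with the complexity measured for the last coded frame. */
void cplx_model_update(cplx_model *m, double gamma,
                       double measured_c, double mu)
{
    double basis1 = 1.0 / (gamma + 1.0);          /* (gamma + 1)^-1        */
    double err    = measured_c - (m->a0 + m->a1 * basis1);
    m->a0 += mu * err;                            /* i = 0 term of (5.13)  */
    m->a1 += mu * basis1 * err;                   /* i = 1 term of (5.13)  */
}

/* Predicted complexity of the next frame for a candidate gamma. */
double cplx_model_predict(const cplx_model *m, double gamma)
{
    return m->a0 + m->a1 / (gamma + 1.0);
}

At run time, the encoder would update the model with the measured complexity after each coded frame and use the prediction to choose gamma, and hence the multiplier, for the next coding unit.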
The simulation results show the relationship between the average complexity per frame and the parameter γ, which is related to the Lagrange multiplier λ via (5.12); this relationship is plotted in Fig. 5.3 for P and B frames. Due to their different coding mechanisms, P and B frames have different model parameters and have to be handled separately. Experimental data points are indicated by various symbols while the fitting curves are reciprocal functions in these figures. Thus, we can adopt the complexity model

C = a_0(t) + a_1(t) / (γ + 1), (5.11)

where C is the accumulated weighted complexity of each frame, a_0(t) and a_1(t) are model parameters to be learned during the coding procedure, and γ is defined by

γ = log_2(λ/2) + 1 for λ ≥ 1, and γ = 0 for λ = 0. (5.12)

To be adaptive to the encoding process, the parameters of the model are updated after the coding of every n frames, where n is the frame complexity update window. The least mean square (LMS) algorithm is used to update these coefficients by

a_i(n+1) = a_i(n) + μ × (γ+1)^{−i} × e(n),  i = 0, 1, (5.13)

where μ is the weighting parameter and e(n) = C(n) − Σ_{i=0}^{1} a_i(n) × (γ+1)^{−i} is the prediction error. The LMS algorithm is a well-known technique for adaptive filtering, which is simple and robust even when the value of μ is set as small as 0.1.

The complexity buffer is a virtual buffer that reflects the complexity usage at the encoder end. It is analogous to the rate buffer used in rate control to update the estimate of available resources and avoid buffer overflow or underflow. Let C_GOP denote the remaining complexity budget in one GOP, and let N_P and N_B be the remaining numbers of P and B frames, respectively. The complexity ratio η of B to P frames is updated along the video coding process. The target complexity levels for P and B frames, denoted by C_P and C_B, are calculated by

N_P × C_P + N_B × C_B = C_GOP,  η = C_B / C_P. (5.14)

Once C_P and C_B are available, the Lagrange parameter can be determined accordingly using the model given in (5.11).

5.4 Experimental Results

Experimental results with the proposed algorithm are reported in this section. The experimental setup in our simulation is the same as that given in Table 4.2 except that κ = 2.

5.4.1 RDC Performance

The RDC optimization framework presented in Chapter 4 is called Algorithm I. The fast inter-mode decision with RDC optimization proposed in this chapter is called Algorithm II. The overall algorithm, consisting of Algorithm I and Algorithm II, is illustrated in Fig. 5.1 and labeled as the Hybrid Algorithm. We compare the encoder performance of Algorithm II and the Hybrid Algorithm in terms of bit rate, distortion and complexity. We show in Table 4.3 the results obtained from the H.264 encoder that includes the fast mode decision algorithm in Fig. 4.4(b). Furthermore, we show results with the proposed algorithm and the hybrid algorithm in Table 5.1 and Table 5.2, respectively.

Table 5.1: Experimental results of five video sequences with algorithm II (κ = 2, λ = 0). Video PSNR Rate J R,D Mode Comp. sequence deg. (dB) inc. (%) inc.
(%) saving (%) Akiyo 0.07 1.740 5.716 48.957 0.06 1.407 3.577 27.747 0.02 0.000 0.506 5.264 Foreman 0.03 1.560 2.409 66.947 0.02 0.240 1.269 46.797 0.01 0.128 0.582 22.204 Stefan 0.10 1.890 8.615 60.645 0.03 1.195 1.488 25.808 0.03 0.143 0.265 17.815 Flower 0.09 1.001 3.484 51.259 0.03 0.782 1.483 17.712 0.02 0.015 0.611 14.091 Mobile 0.08 2.112 4.992 42.630 & 0.02 0.369 0.986 8.223 calendar 0.01 0.161 0.395 2.825 The numbers in the tables are averaged values with data from the predicted frames. The meaning of each column is explained below. • PSNR deg. (dB) represents the PSNR loss. • Rate inc. (%) stands for the increased bit rate of the encoded bit stream with respect to the data in Table 4.3 . • J R,D Mode inc. (%) shows the average increased joint RD cost per macro block, which is calculated by (6.8). • Comp. saving (%) indicates the overall complexity saving achieved in the motion estimation. Table 5.1 verifies our previous observation in Sec. 5.2: Algorithm II obtains bigger complexity saving in the lower bit rate end where large quantization step leads to rough 112 Table 5.2: Experimental results of five video sequences with hybrid algorithm. (κ =2 λ =(4.14)) Video PSNR Rate J R,D Mode Comp. sequence deg. (dB) inc. (%) inc. (%) saving (%) Akiyo 0.100 2.546 6.456 53.075 0.07 1.469 4.091 36.266 0.03 0.01 0.071 14.095 Foreman 0.080 1.580 2.935 74.465 0.06 0.640 2.322 56.905 0.04 0.148 2.190 35.392 Stefan 0.315 2.260 19.009 74.014 0.203 1.770 8.927 47.050 0.152 1.763 7.879 41.401 Flower 0.220 2.121 4.824 59.149 0.200 0.188 5.107 30.708 0.200 0.167 5.295 28.249 Mobile 0.140 2.236 6.111 49.773 & 0.120 0.521 2.767 13.531 calendar 0.120 0.308 2.675 12.935 truncation effect. This suggests that in the low bit rate range where QP is large, which is usual in the embedded video application, it is possible to reduce more complexity consumption since the accurate match is anyway unnecessary due to gross quantization. Algorithm Hybrid can gain larger complexity saving in the faster movement sequences and acquire more saving in the low bit rate end. The PSNR degradation is less than 0.3 dB and the bit rate increase is less than 3%. The main reason for the good performance comes from that by the elimination of the motion search effort, consumed by redundant search round and small block modes, does not have noteworthy impact on the overall coding efficiency. The exception is Stefan, which posses fast motion and medium spatial details. The complexity saving of Stefan can reach as high as 74% at the cost of 0.315 dB PSNR degradation and around 3% bit rate raise. 113 200 400 600 800 1000 1200 1400 1600 31 32 33 34 35 36 37 38 39 Bit rate(kbps) PSNR Orig. encoder Algorithm I Algorithm II Algorithm hybrid (a) Foreman R-D curve 0 500 1000 1500 2000 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 x 10 9 Bit rate(kbps) Weighted complexity Orig. encoder Algorithm I Algorithm II Algorithm hybrid (b) Foreman R-C curve 500 1000 1500 2000 2500 30 31 32 33 34 35 36 37 38 39 40 Bit rate(kbps) PSNR Orig. encoder Algorithm I Algorithm II Algorithm hybrid (c) Flower R-D curve 0 500 1000 1500 2000 2500 3000 0.5 1 1.5 2 2.5 3 x 10 9 Bit rate(kbps) Weighted complexity Orig. encoder Algorithm I Algorithm II Algorithm hybrid (d) Flower R-C curve Figure 5.4: Performance of rate-distortion and rate-complexity using the proposed scheme of diamond search. Fig. 5.4 gives the rate-distortion and rate-complexity variation for Foreman and Flower sequences throughout the interested quality range, i.e., PSNR belonging to 30∼ 40 dB. 
Great complexity saving is achieved with negligible coding efficiency loss. To gain more insights, the frame-to-frame complexity cost and PSNR value are plotted in Fig. 5.5. We observe that the proposed algorithm has little impact on the coding performance. 114 0 30 60 90 120 0 20 40 60 80 100 Frame index Complexity cost(%) Orig. encoder Algorithm I Algorithm II Algorithm hybrid (a) Complexity 0 30 60 90 120 27 27.5 28 28.5 29 29.5 30 30.5 Frame index PSNR Orig. encoder Algorithm I Algorithm II Algorithm hybrid (b) PSNR Figure 5.5: Flower Frame-to-frame computational complexity and video quality compar- ison. 115 5.4.2 Compatibility with Other Fast Motion Search Schemes Besides the diamond search, there are several other fast motion search schemes proposed for video encoding applications. The basic feature of these fast search schemes are of hierarchy structure: small fine refinement following large coarse pattern search. Thus by incorporating our proposed complexity-constrained framework, it can always stop the searching process at the optimal RDC performance point. Fig. 5.6 shows the result of our RDC framework applied to hexagon search. Fig. 5.6(a) and Fig. 5.6(c) describes the R-D performance of hexagon search on ”Foreman” and ”Flower” sequences and Fig. 5.6(b) and Fig. 5.6(d) illustrate the complexity consumption. Similar to the result of diamond search, our proposed scheme can also achieve significant complexity reduction without the sacrifice of rate-distortion characteristics of H.264 in other hierarchy search methods. 5.4.3 Complexity Control For complexity control, the basic tools are to adjust the Lagrange parameters and manage the complexity buffer. The main parameters used in the complexity control experiment are given in Table 5.3. By setting up the complexity to be 70% and 85% of the original H.264 encoder, the λ value is adjusted according to Eq. (5.11) and Eq. (5.14) during the encoding process. The frame-by-frame complexity control results for the “Stefan” sequence at bit rate targeting 1200 kbps with a different target complexity level are shown in Fig. 5.7. The complexity of baseline H.264 result is also shown for comparison. The result looks promis- ing. Although the initial λ is set to the same value according to (4.14) in all cases, it 116 200 400 600 800 1000 1200 1400 1600 31 32 33 34 35 36 37 38 39 Bit rate(kbps) PSNR Orig. encoder Algorithm I Algorithm II Algorithm hybrid (a) Foreman R-D curve 0 500 1000 1500 2000 0 0.5 1 1.5 2 2.5 3 3.5 4 x 10 9 Bit rate(kbps) Complexity Orig. encoder Algorithm I Algorithm II Algorithm hybrid (b) Foreman R-C curve 500 1000 1500 2000 2500 30 31 32 33 34 35 36 37 38 39 40 Bit rate(kbps) PSNR Orig. encoder Algorithm I Algorithm II Algorithm hybrid (c) Flower R-D curve 0 500 1000 1500 2000 2500 3000 0.5 1 1.5 2 2.5 3 x 10 9 Bit rate(kbps) Complexity Orig. encoder Algorithm I Algorithm II Algorithm hybrid (d) Flower R-C curve Figure 5.6: Performance of rate-distortion and rate-complexity using the proposed scheme of hexagon search. 117 Table 5.3: Experimental Data and Simulation Parameters. 
Sequence information Sequence name Mother & daughter, Foreman and Stefan Frame size CIF (352× 288 pixels) Video format 30fps, GOP 300, IPBPB sequence H.264 encoder and simulation parameters H.264 encoder x.264 [60] Reference frame 5 Basic control unit one frame Frame complexity window 5 λ adjustment window 5 λ prediction model (5.11) λ variance range [0 64] Max λ step ±4 Initial λ 2 κ 2 can be adaptively adjusted and the target complexity level is consistently accomplished. The proposed complexity control method cannot completely smooth this huge complexity increase since we bound the maximum magnitude on the change of λ during each each step to avoid excessive bit rate fluctuation. In other words, we try to keep consistency on video quality and bit rates within a short window throughout the entire video sequence. The complexity control performance of three different bit rate for ”Mother & daugh- ter”, “Foreman” and “Stefan” sequences is shown in Table 5.4. The target complexity is set by the term of n% of the original H.264 encoder. For instance, assume that the original complexity is C orig , then the target complexity is n%× C orig .Thecomplex- ity control error is calculated as the difference between the actual complexity and the target complexity, normalized by the original H.264 encoder complexity. These results confirm that the correctness of our complexity model and its effectiveness in the dynamic complexity adjustment. 118 0 30 60 90 120 150 50 60 70 80 90 100 Frame index Complexity (%) Orig. encoder 85% 70% (a) Normalized complexity 0 30 60 90 120 150 34.5 35 35.5 36 36.5 37 37.5 38 38.5 39 Frame index PSNR Orig. encoder 85% 70% (b) PSNR Figure 5.7: Comparison of frame-to-frame computational complexity. Table 5.4: Performance of complexity control. Class A: Mother & daughter Bit rate 10 kbps 55 kbps 100 kbps Complexity 85% 5.90 -2.39 -2.16 control error 70% 5.75 -1.16 0.97 Class B: Foreman Bit rate 60 kbps 330 kbps 600 kbps Complexity 85% -7.61 -3.81 -1.37 control error 70% -2.70 0.60 2.73 Class C: Stefan Bit rate 300 kbps 1200 kbps 2100 kbps Complexity 85% -5.01 -1.15 -1.37 control error 70% -0.44 2.92 2.72 119 Table 5.5: Block modes and their SAD energy. Index Block IPC ET Energy Mode (cycle) (10 −7 J) 1 16× 16 1.96 631 7.72 2 16× 8 1.94 337 4.10 3 8× 16 1.92 363 4.39 4 8× 8 1.88 183 2.19 5 8× 4 1.86 88 1.04 6 4× 8 1.84 95 1.12 7 4× 4 1.72 49 0.55 0 500 1000 1500 2000 0 500 1000 1500 2000 2500 3000 Bit rate(kbps) Energy (J) Orig. encoder Algorithm I Algorithm II Algorithm hybrid (a) Foreman 0 500 1000 1500 2000 2500 3000 0 200 400 600 800 1000 1200 1400 1600 1800 Bit rate(kbps) Energy (J) Orig. encoder Algorithm I Algorithm II Algorithm hybrid (b) Flower Figure 5.8: Energy consumed by motion estimation for original H.264/AVC encoder and the proposed algorithms. 5.4.4 Energy Saving The complexity of an algorithm is usually independent of the platform whereas the en- ergy consumption may vary from one testbed to another. The energy consumed by the proposed algorithms can be calculated using (3.10) and (3.9). The SAD energy cost of each block mode is listed in Table 5.5. Correspondingly, the overall energy consumption can be determined in a similar way. 120 5.5 Conclusion A fast inter-mode decision algorithm under our previous RDC optimization framework for the H.264/AVC encoder was proposed in this chapter. In the first step, the SAD value of a macroblock is analyzed so that we may skip the remaining modes at an earlier stage without sacrificing coding efficiency much. 
Then, a joint RDC cost function is used to decide modes consisting of smaller blocks should be executed or not. Finally, the relationship between the complexity and the Lagrange parameters is explored to allocate the complexity costs in different coding units. The good R-D-C performance tradeoff was supported by experimental results. 121 Chapter 6 Decoder-Friendly Adaptive Deblocking Filter (DF-ADF) Mode Decision in H.264/AVC 6.1 Introduction With the rapid growth of the mobile Internet and portable media devices, the embedded multimedia system (EMS) has become a major platform for digital media. It is important to develop low-complexity algorithms for EMS decoder due to its battery-supplied feature yet seek the best performance-complexity tradeoff. Since the decoding operation is highly dependent on modes selected by the encoder, we can control the decoder complexity by requiring the encoder to select the decoder-friendly mode (i.e. taking the decoder complexity into the optimization consideration). Generally, the problem can be stated technically as follows: how to choose the non-normative part of H.264/AVC encoding so that the coded video streams can be decoded with lower complexity by any standard- compliant decoder while maintaining good coding efficiency. To the best of our knowledge, there has been little research conducted along this direction in the past. 122 The adaptive deblocking filter (ADF) module is one of the key components that contribute to the excellent visual performance of the emerging H.264/AVC video coding standard. However, it is computationally intensive. Experimental results indicate that ADF alone occupies 20%∼ 75% of decoding time depending on the video frame size [3]. Thus, many fast algorithms have been developed to reduce the decoder complexity at the expense of poorer rate-distortion (RD) performance. Our solution, called the decoder-friendly adaptive deblocking filter (DF-ADF) mode decision, is motivated by the fact that different block mode combinations have a great impact on the edge number that ADF needs to work on. For example, more energy will be consumed by ADF if a MB is decomposed into many smaller blocks since ADF is used to smooth the block boundary. In particular, INTRA 16x16 and INTRA 4x4 are the two most expensive modes since ADF are demanded for all 48 edges of a MB (i.e. 32 edges for Y and 8 edges for U and V). This fact is exploited by DF-ADF to provide a systematic way to trade the complexity cost for perceptual visual quality effectively. Our study has several novel components: a rigorous rate-distortion-complexity (RDC) optimization framework and an energy cost model. The excellent performance of the proposed DF- ADF algorithm is demonstrated by experimental results with a wide spectrum of video contents. The rest of this paper is organized as follows. The adaptive deblocking filter used in H.264/AVC standard is briefly reviewed in Sec. 6.2. The RDC optimization framework is described in Sec. 6.3. Two complexity models are presented in Sec. 6.4. The energy cost associated with a selected mode is discussed in Sec. 6.5. Experimental results are shown in Sec. 6.6. Finally, concluding remarks are given in Sec. 6.7. 123 6.2 Review of Adaptive Deblocking Filter The blocking artifacts caused by coarse quantization parameters require filter operations in boundary regions to reduce image discontinuities. This process consists of 2 steps [34]: boundary analysis and filtering operation. 
The former is to check the boundary discontinuity strength and the necessity to apply the filter while the latter is to select the appropriate filter according to the boundary strength parameter to enhance visual perception. The filters are applied to horizontal edges of 4x4 blocks in a macroblock (MB), followed by the vertical edge filter. That is, the filtering order for the luma component in Fig. 6.1(a) is a, b, c, d and then e, f, g, h. The filtering order for chroma components is i, j, k, l as illustrated in Fig. 6.1(b). ab c d e f g h (a) Luma ij k l (b) Chroma Figure 6.1: The order of deblocking filters applied to an MB. An integer, called the boundary strength (BS) parameter, is assigned to every edge between two 4x4 luminance sample blocks depending on the block mode and coding con- ditions. Table 6.1 show how the BS value is set according to block modes and conditions [34]. It is also important for the deblocking filter to distinguish true edges and those due 124 to blocking artifacts. The luminance signal is identified as the blocking artifact by three threshold checking criteria [34]. Table 6.1: Boundary strength parameter BS Block modes and conditions 4 One of the blocks is intra and the edge is a macroblock edge 3 One of the blocks is intra 2 One of the blocks has coded residues 1 Difference of block motion≥ 1 1 Motion compensation from different frames 0 Else Fig. 6.2 illustrates several pixels near a block boundary. For BS=4, three strong filters are applied to p 2 , p 1 and p 0 , respectively, if the following two conditions across the block boundary hold: |p 0 − q 0 | < (α 2) + 2, (6.1) |p 2 − q 0 | <β(Index B ), (6.2) where α and β are derived thresholds depending on quantization parameters. The corre- sponding strong filters are: p 0 =(p 2 +2p 1 +2p 0 +2q 0 + q 1 +4) 3, p 1 =(p 2 + p 1 + p 0 + q 0 +2) 2, (6.3) p 2 =(2p 3 +3p 2 + p 1 + p 0 + q 0 +4) 3. 125 p 3 p 2 p 1 p 0 q 0 q 1 q 2 q 3 Figure 6.2: Horizontal block edge. If either condition (6.1) or condition (6.2) is false, a weaker 3-tap filter is used to modify the value of p 0 only, i.e., p 0 =(2p 1 + p 0 + q 1 +2) 2. (6.4) The values of q 2 , q 1 and q 0 are modified in a similar manner with p substituted by q,and vice versa, in (6.1), (6.2), (6.3) and (6.4). While BS belongs to the set {1,2,3}, their filter implementations are the same. A 4-tap filter is employed to yield filtered value p 0 and q 0 by p 0 = p 0 +Δ 0 ,q 0 = q 0 − Δ 0 , (6.5) where Δ 0 value is achieved by a 4-tap filter and clip operation: Δ 0i =(p 1 − 4p 0 +4q 0 − q 1 +4) 3, Δ 0 = Min(Max(−c 0 ,Δ 0i ),c 0 ). (6.6) In addition, if the β threshold condition is true, i.e. |p 2 −p 0 |<β(Index B ), p 1 is modified as p 1 = p 1 +Δ p1 , Δ p1 =min(max(−c 1 ,Δ p1i ),c 1 ), Δ p1i =(p 2 − 2p 1 +(p 0 + q 0 +1) 1) 1. (6.7) 126 6.3 Rate-Distortion-Complexity Optimization Framework We propose a new decoder-friendly mode decision algorithm for H.264/AVC that consid- ers the tradeoff between coding efficiency and the complexity of deblocking filters applied in the decoder in this section. The proposed DF-ADF mode decision algorithm consists of two major components: the rate-distortion-complexity (RDC) joint optimization frame- work and the decoder complexity model. A new Lagrange parameter λ C is adopted in the RDC framework to balance the RD cost function and the complexity cost function. The complexity cost model provides a quantitative measure for the complexity demanded by the filter implementation and the mode decision process. 
The Lagrange method is widely used for rate-distortion (RD) optimization in video coding. It is often applied to motion estimation and residual coding. Since tree-structured block modes are used in H.264/AVC to achieve the best RD tradeoff [62], an joint RD cost function J R,D Mode is used to compare the difference between individual modes in the residual coding stage for the optimal block mode selection, i.e. J R,D Mode = D Rec (M|Q)+ λ M R(M|Q), (6.8) where M is the macroblock mode to be evaluated out of a set of possible modes, Q denotes the quantization parameter for transform coefficients, R is the number of bits associated with the header, motion and transform coefficients, D Rec is the difference between the reconstructed and the reference MBs measured by the sum of square differences (SSD) and λ M is the Lagrange multiplier for mode decision. The search space of mode M in (6.8) can be chosen from the set of all possible block modes {INTRA 16× 16, INTRA 127 4×4, INTER 16×16, INTER 16×8, INTER 8×16, INTER 8×8, INTER 8×4, INTER 4× 8, INTER 4× 4, SKIP, DIRECT}. As indicated in Table 6.1, the mode decision process has a great impact on the appli- cation of deblocking filters. We have the following observations. 1. The edge distortion incurred by the INTRA mode is the strongest, whose BSs are either 3 or 4. Consequently, the corresponding filters are strongest and the complexity cost is the highest. 2. The SKIP and DIRECT modes present no distortion which yields the lowest com- plexity. So is the INTER 16x16 mode with zero quantized residue block (ZQRB). 3. The conditions for boundary strength 1 infers that, if a MB is decomposed into more blocks, filters need to work on more block edges, as well as non-zero quantized residue block (NZQRB). Thus, we conclude that refined block decomposition demands more deblocking filter op- erations than modes consisting of larger blocks since there are fewer block edges to be filtered. The mode decision has influence on both the deblocking filter strength and the edge number for the deblocking filter to work on. To take this into consideration, we can include the complexity cost in the RD cost function in (6.8) to result in a joint RDC cost function as J R,D,C Mode = J R,D Mode + λ C C Mode , (6.9) 128 where J R,D Mode is the RD cost function in (6.8), C Mode is the complexity cost function for given macroblock B and mode M, λ C is the Lagrangian multiplier for the complexity term, and J R,D,C Mode is the newly defined joint RDC cost function in our algorithm, which replaces J R,D Mode in the mode decision stage. 6.4 Complexity Models For the joint RDC cost function given in (6.9), a quantitative model to estimate the complexity associated with different mode decisions is needed. Several complexity models will be elaborated in this section. 6.4.1 Complexity Model Based on Software Execution As reviewed in Sec. 6.2, the complexity heavily relies on the filter used in the deblock- ing filter process. If we only focus on the computational cost of the deblocking filter, a quantitative estimate of the complexity can be approximated by the number of filtering operations needed in deblocking. Specifically, the complexity associated with different boundary strength filters are listed in Table 6.2. The corresponding energy consumption highly depends on the platform and its configuration. The very long instruction word (VLIW) architecture is widely used for multimedia applications due to their high instruc- tion level parallelism. 
Unlike the dynamic scheduling scheme in superscalar processors, the static scheduling scheme of VLIW makes the program execution fairly stable. This implies that the dynamic execution characteristics, such as the instruction number per cycle (IPC), do not fluctuate with the input data much. Based on the result in [19], the 129 Table 6.2: Deblocking filters’ complexity and energy cost. BS Luma Filter Complexity Energy Energy chroma strength (tap) (10 −7 J) weight 4 Luma strong 2×(3+4+5) 8.855 2.98 weak 2×3 4.118 1.38 chroma N/A 2×3 4.118 1.38 1 luma strong 2×4 5.654 1.87 2 weak 1×4 2.976 1 3 chroma N/A 1×4 2.976 1 0 0 0 0 0 0 dynamic power dissipated by the multimedia program on the VLIW processor can be estimated by a linear regression model of the IPC of a software routine as P i = k 0 + k 1 × IPC i , (6.10) where IPC i is the IPC of the ith software routine, P i is the corresponding dynamic power, k 0 and k 1 are two model parameters representing the static power consumption and dynamic power consumption, respectively. Correspondingly, the overall energy expense of a specific program can be assessed by E = i P i × ET i = i (k 0 + k 1 × IPC i )× ET i , (6.11) where IPC i and ET i are the IPC and the execution time of the ith software routine. Please note that parameters k 0 and k 1 depend on processors and programs, which will be discussed with examples in Sec. 6.6. 130 6.4.2 Complexity Model Based on Boundary Strength The overall complexity for BS=4 is derived below. Note that a strong or a weak filter would be applied to boundary pixels depending on whether (6.1) or (6.2) holds, respec- tively. Assume that the probabilities for (6.1) and (6.2) to hold are ρ 1 and ρ 2 , respectively. Considering that both p and q need to be filtered, the overall complexity C BS=4 can be calculated as C BS=4 =2× (ρ 1 × ρ 2 × C 4strong +(1− ρ 1 × ρ 2 )× C 4weak ), where C 4strong and C 4weak denote the complexity for the strong and the weak filters with BS=4. If ρ 1 = ρ 2 =0.5, C BS=4 is computed by C BS=4 =2× (0.25× C 4strong +0.75× C 4weak ) =0.5C 4strong +1.5C 4weak . (6.12) Similarly, if the probability for the threshold condition to hold is 0.5, the energy weight for BS∈{1,2,3} is C BS=123 =0.5× C 123strong +0.5× C 123weak , (6.13) where C 123strong is the energy weight for the strong filter as BS∈{1,2,3}, C 123weak is that for the weak filter. The corresponding energy consumption can be calculated by substitute energy weight into (6.12) and (6.13). 131 6.5 Energy Costs Associated with Selected Modes In this section, we consider the DF-ADF mode decision algorithm. Fig. 6.1 illustrates that a MB that contains at most 48 edges, i.e., 32 edges from the luma block, 8 from two chroma blocks. Since SKIP, DIRECT and INTER 16x16 of ZQRB provide accurate displacement MB for the current MB, no quantization error is introduced. This means that ADF is not needed for these three modes. Both INTRA modes (i.e. INTRA 16x16 and INTRA 4x4) demand all 48 deblocking filters: 16 filters with BS=4, 32 filters with BS=3. Therefore, the overall energy weight for these two INTRA modes can be computed as E Intra =16× E BS=4 +32× E BS=3 = 103 (6.14) where E BS=4 and E BS=3 are mean values from (6.12) and (6.13), respectively. The energy weights for other INTER modes are more complex, since they depend on the block modes as well as the number n i , which is the number of non-zero quantized 4x4 blocks. 
In Table 6.3, we show the edge numbers that deblocking filters need to work on for each selected mode, where n i is the non-zero quantized 4x4 block number in each mode. The 2nd column denotes whether there is encoded residue for the block, 0 mean ZQRB, 1 stands for non-zero quantized residue block (NZQRB). Thus, we are able to get an quantitative assessment for the energy cost for deblocking filters associated with different selected modes. The overall energy consumption can be obtained by summing up the edge number times the corresponding energy cost for each mode. It is obvious that both 16x16 132 Table 6.3: Edge numbers associated with a selected mode. Block Encoded Edges BS mode residue luma chroma SKIP 0 0 0 0 DIRECT 0 0 0 0 INTRA N/A 32 16 3,4 INTER 0 0 0 0 16x16 1 min(2n i , 32) min(n i , 16) 2 INTER 0 4 4 1 16x8 1 min(2n i + 4, 16) min(n i +4,8) 2 INTER 0 4 4 1 8x16 1 min(2n i + 4, 16) min(n i +4, 8) 2 INTER 0 4 4 1 8x8 1 min(2n i +4, 8) 4 2 INTER 0 2 1 1 8x4 1 min(2n i +2, 4) 2 2 INTER 0 2 1 1 4x8 1 min(2n i +2, 4) 2 2 INTER 0 2 1 1 4x4 1 2 1 2 and 4x4 INTRA modes are the most expensive modes in terms of energy consumption, which are almost twice of the INTER 16x16 mode with residual coding. 6.6 Experimental Results Experimental results using the proposed algorithm are reported in this section. The ex- perimental environment, including the test data and chosen parameters, is given in Table 6.4. Experiments were conducted on a cycle-accurate VLIW simulator called Trimaran [18], which provides the performance profiling data such as IPC and execution time for the energy estimation of (6.11). The algorithm is applied to four test sequences. Three of them are representatives of low- (Akiyo), medium- (Foreman) and fast- (Stefan) activity 133 Table 6.4: Description of Experimental Environment and Parameters. Energy Simulator model parameter k 0 =0.9599, k 1 =0.6428 Sequence information Sequence name Akiyo, Foreman, Stefan and Mobile & calendar Frame size CIF (352× 288 pixels) Video format 30fps, GOP 150, IPBPB sequence Simulation Parameters PSNR 30∼40 Reference frame 5 λ C 0∼ 4 H.264 Encoder Block mode All on Fast motion estimation Diamond S-P Frame No video, respectively. Another video, Mobile & calendar, with camera panning motion with rich spatial detail is also included. Fig. 6.3 shows the coding efficiency and energy saving with different λ C values, where the performance of the standard H.264 decoder is used as a benchmark. Fig. 6.3(b) shows the energy saving for the deblocking filter component, where 0% energy saving is the energy consumption of original ADF in the H.264 decoder. Energy saving associated with λ C = 1 is too small to be attractive. Meanwhile, λ C = 4 degrades the PSNR for more than 0.2 dB in the low bit rate range. Thus, λ C = 2 gives a good tradeoff between coding performance and energy saving. Furthermore, it is observed that less energy saving is achieved when the bit rate is high. This can be explained as follows. When QP becomes smaller for the high bit rate coding, less deblocking filters are required since the small quantization step does not result in much blocking edge distortion in the reconstructed video so that there is little room for the proposed algorithm to work. 134 0 200 400 600 800 1000 32 34 36 38 40 42 44 Bit rate(kbps) PSNR λ C =0 λ C =1 λ C =2 λ C =4 (a) Rate vs. distortion performance 0 200 400 600 800 1000 0 5 10 15 20 25 30 35 Bit rate(kbps) Energy saving (%) λ C =1 λ C =2 λ C =4 (b) Rate vs. 
energy saving Figure 6.3: Performance of rate-distortion and rate-energy-saving using the proposed DF-ADF algorithm for the Foreman sequence. The rate, distortion and ADF energy saving in the decoder of the proposed DF-ADF algorithm with λ C = 2 are summarized in Table 6.5. The numbers in the table are the averaged values using data obtained from 149 predicted frames. The second column of data gives the target bit rate of the H.264 reference encoder. The last three columns give the PSNR loss, the bit rate increase and the ADF energy saving of the proposed DF- ADF algorithm against that of the H.264/AVC reference coders. As shown in Table 6.5, we see a great amount of energy saving (30%) with little PSNR degradation (within 0.2 dB) and a small bit rate increase (less than 1%) for ”Akiyo” and ”Foreman” sequences. In contrast, the energy saving is not as impressive for ”Stefan” and ”Mobile” sequences since the fast movement and rich spatial details make almost no perfect match in motion estimation. Hence, there are residuals in each 4x4 block for the deblocking filter to work on. 135 Table 6.5: Experimental results of four video sequences Video Target PSNR Bit rate Engery bit rate degradation increase saving (kb/s) (dB) (%) (%) Akiyo 20 0.15 0.866 25.32 100 0.12 0.657 19.72 200 0.09 0.278 15.68 Foreman 100 .19 0.481 29.79 350 0.15 0.233 24.88 600 0.13 0.195 22.21 Stefan 250 0.05 0.355 11.79 1000 0.03 0.196 9.32 2000 0.01 0.158 8.07 Mobile 500 0.07 0.232 11.95 & 1500 0.06 0.187 7.87 calendar 3000 0.02 0.181 7.38 6.7 Conclusion and Future Work A novel DF-ADF mode decision algorithm was proposed for video encoding. Its efficiency was demonstrated by experimental results. We observed a significant amount of of energy saving (up to 30%) in the deblocking filter with negligible quality degradation (less than 0.2 dB) and bit rate raise (within 1%). More extensive performance evaluation will be conducted in the future. 136 Chapter 7 Conclusion and Future Work 7.1 Conclusion In this thesis, several main results were presented in Chapters 3 to 6, and they are summarized below. In Chapter 3, as revealed by collected statistics, the instruction decode unit in a VLIW processor consumes nearly 50% of the total energy in most multimedia applications. This implies that the instruction set architecture (ISA) design is the key to the power minimization of an embedded multimedia system. The power profiling result suggests a strong correlation between IPC and power dissipation. By exploiting this relationship, the power and the energy models for multimedia applications in VLIW processors were proposed. The proposed model was validated by the TI C6416 chip-set, yielding a low error rate. This simple model leads to efficient run-time power estimation for various multimedia systems, and provides insights into the energy saving of the VLIW processor. In Chapter 4, it was observed that H.264 motion estimation actually has good cache behavior with a reasonable size cache. Two factors contribute to the high cache hit 137 rates of multimedia applications. First, the block-mode algorithms make data fit in the cache easily. Second, within these blocks, there is significant data reuse as well as spatial locality. A novel complexity adaptive motion estimation was proposed and applied to the H.264 video standard so that the encoder motion estimation complexity is reduced with little degradation in video quality at the price of small bit rate increase. 
Experiments were conducted using test sequences with low to high motion activities to demonstrate the advantages of our proposed system. Up to 35% motion estimation complexity can be saved at the encoder with less than 0.2 dB PSNR loss and a maximum increase of 3% bit rate. In Chapter 5, a complexity-constrained inter-mode decision algorithm for H.264/AVC video coding was developed. The proposed algorithm, i.e., the ”mode skip test”, is fol- lowed by the rate-distortion-complexity (RDC) optimization framework in Chapter 4. To be more specific, the “mode skip test” is executed between the MV search for mode 16x16 and that for mode 8x8. It contains two tests to check the number of DCT coeffi- cients of prediction errors quantized to zeros and use it to predict whether the underlying block modes can be skipped or not. As a result, more redundant block modes can be filtered out to save the complexity while maintaining the excellent coding performance of H.264/AVC. A novel decoder friendly adaptive deblocking filter (DF-ADF) mode decision algo- rithm was examined in Chapter 6. Both the complexity and the energy saving of the decoder relies on the usage of ADFs. We first construct the complexity model for the deblocking filter, and the energy consumption model is built based on the IPC-based 138 dynamic power model described in Chapter 3. Then, the encoder performs the rate- distortion-decoder complexity (RDC) optimization to save the energy needed for de- blocking filter operations in decoding. We observed a significant amount of of energy saving (up to 30%) in the deblocking filter with negligible quality degradation (less than 0.2 dB) and bit rate raise (within 1%). 7.2 Future Work Several assumptions were made in obtaining analytic results in Chapter 4∼ 6. To make our current work more extensive and complete, it is worthwhile to relax some of those conditions. Here, we state several problems that deserve further investigation. • Complexity adaptive H.264 encoding system It was assumed in Chapter 4 ∼ 6 that the complexity-adaptive motion estima- tion scheme proposed would have no impact on the subsequent quantization and run-length coding modules, and there would be no correlation between the motion estimation decision in different macroblocks, either. However, the bit rate is in- creased due to a larger residual signal. This happens since the sub-optimal R-D position is taken and the complexity is constrained as well. Although the increase is not significant in the final bit rate, no analysis of the whole encoding system has been made yet. We will develop an analytic framework to model, control and optimize the H.264 encoder, which can be accomplished by two steps. First, we will extend our analysis in Chapter 4 to quantization and run-length coding modules to construct an overall model for the H.264 encoding system. Second, based on 139 the complete encoder model, an overall optimization scheme will be proposed to minimize the complexity of the encoder. • New power-rate-distortion analytic framework Video encoding and data transmission are the two dominant power-consuming pro- cesses in wireless communication. It is shown by experiments that [16], for the QCIF video, video encoding consumes about 2/3 of the total energy in encoding and transmission. For higher resolution video, it is expected that an even larger fraction of energy is consumed in video encoding. 
On one hand, efficient video com- pression significantly reduces the amount of video data to be transmitted, which in turn saves more energy in data transmission. On the other hand, more efficient video compression comes at the price of a higher computational cost and, thus, more energy consumption. The two conflicting factors imply that there is a tradeoff be- tween bit rate, energy consumption and video quality in practical systems. We will seek the optimal solution for the whole system from encoding to transmission. • Further exploration to global motion compensation Long term global motion compensation (LTGMC) [35] is an emerging technique in- corporated in the H.264 codec to enhance the video coding efficiency further. The basic idea is that a part of visible 2-D motion within video is caused by camera oper- ations such as translation, rotation and zoom. LTGMC applies the super-resolution mosaicking technique to macroblocks that are only affected by global motion. Up to 50 backwards frames are employed as the reference frame in the experiment [35] 140 to achieve up to 28.5% bit rate saving as compared to the H.264 encoder. As ex- pected, the great achievement comes at the expense of high complexity. Meanwhile, the large reference frame pool imposes a big challenge on the memory hierarchy of both GPP and EMS. Due to its great performance of the bit rate reduction, it is highly desirable to explore the LTGMC implementation over EMS. 141 Bibliography [1] 3GPP. Amr test sequence. [Online]. Available: http://www.3gpp.org/ftp/Specs/html-info/26074.htm [2] E. P. Agency. Epa report. [Online]. Available: http://enduse.lbl.gov/projects/infotech.html [3] M. Alvarez, E. Salami, A. Ramirez, and M. Valero, “A performance characterization of high definition digital video decoding using H.264/AVC,” in Workload Charac- terization Symposium, 2005. Proceedings of the IEEE International, vol. 1, pp. 24– 33. [4] G. Ascia, V. Catania, M. Palesi, and D. Patti, “Epic-explorer: A parameterized vliw-based platform framework for design space exploration,” in First Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), Newport Beach, Cali- fornia, USA, October 2003. [5] T. Austin. SIMPLESCALAR. [Online]. Available: http://www.simplescalar.com/ [6] D. Brooks, P. Bose, S. Schuster, H. Jacobson, P. Kudva, A. Buyuktosunoglu, J. Well- man, V. Zyuban, M. Gupta, and P. Cook., “Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors,” IEEE Micro, Decem- ber 2000. [7] D. Brooks. Cs146 computer architecture lecture notes. [Online]. Available: http://www.eecs.harvard.edu/ dbrooks/cs146-spring2004/cs146-lecture1.pdf [8] ——. Wattch. [Online]. Available: http://www.eecs.harvard.edu/ dbrooks/wattch- form.html [9] G. Cai and C. Lim, “Architectural level power/performance optimization and dy- namic power estimation,” inCoolChipsTutorialcolocated withMICRO32,November 1999, pp. 90–113. [10] J. Chalidabhongse and C.-C. J. Kuo, “Fast motion vector estimation using multiresolution-spatio-temporal correlations,” Circuits and Systems for Video Tech- nology, IEEE Transactions on, vol. 7, no. 3, pp. 477 – 488, 1997. [11] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, “Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC,” in Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP ’04). IEEE International Conference on, May 2004, pp. 9–12. 142 [12] I. CS Univ. Catania. (2004) Epic explorer. [Online]. 
Available: http://epic- explorer.sourceforge.net/ [13] P. Electronics, “Trimedia -1300 Data Book,” Philips, 1999. [14] M. Gallant, G. Cote, and F. Kossentini, “An efficient computation-constrained block- based motion estimation algorithm for low bit rate video coding,” Image Processing, IEEE Transactions on, vol. 8, no. 12, pp. 1816–1823, Dec. 1999. [15] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall, “Managing the impact of increasing microprocessor power consumption,” Intel Technology Journal,vol. 5, no. 1, 2001. [16] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, “Power-rate-distortion analysis for wireless video communication under energy constraints,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 15, no. 5, pp. 645– 658, May 2005. [17] J. L. Hennessy and D. A. Patterson, Computer Organization and Design: The Hard- ware/Software Interface, 2nd ed. Morgan Kaufmann, 1997. [18] G. I. T. HP Lab, IMPACT UIUC. Trimaran. [Online]. Available: http://www.trimanran.org [19] Y. Hu, Q. Li, and C. C. J. Kuo, “Run-time power consumption modeling for em- bedded multimedia systems.” in IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 2005, pp. 353–356. [20] Y. Hu, Q. Li, S. Ma, and C.-C. J. Kuo, “Joint Rate-Distortion-Complexity Op- timization for H.264 Motion Search,” in Internation Conference on Multimedia & Expo.(ICME 2006), Toronto, July 2006. [21] M. Huang, J. Renau, and J. Torrellas, “Profile-based energy reduction in high- performance processors,” in 4th ACM Workshop on Feedback-Directed and Dynamic Optimization, December 2001. [22] T. Instruments. C6000 architecture. [Online]. Avail- able: http://dspvillage.ti.com/docs/catalog/generation /details.jhtml?templateId =5154&path=templatedata/cm/dspdetail/data /c6000 architecture [23] ——. TMS320DM64x Power Consumption Summary. [Online]. Available: http://focus.ti.com/lit/an/spra962e/spra962e.pdf [24] Intel. Hyper-threading technology. [Online]. Available: http://www.intel.com/technology/hyperthread/index.htm [25] ——. Itanium 2 processor. [Online]. Available: http://www.intel.com/products/processor/itanium2/ [26] ——. Xeon. [Online]. Available: http://www.intel.com/business/bss/products/ server/xeon/ 143 [27] A. K. Jain, Fundamentals of Digital Image Processing, 1st ed. Prentice Hall, Sep. [28] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice Hall, March 1984. [29] C. Kannangara and I. Richardson. Computational control of an h.264 encoder through lagrangian cost function estimation. [Online]. Available: http://www.rgu.ac.uk/files/H264 comp ctrl kannangara cam ready final.pdf [30] C. Kannangara, I. Richardson, M. Bystrom, J. Solera, Y. Zhao, A. MacLennan, and R. Cooney. Complexity reduction of h.264 using lagrange optimization methods. [Online]. Available: http://www.rgu.ac.uk/files/h264 complexity kannangara.pdf [31] ——, “Low-complexity skip prediction for h.264 through lagrangian cost estimation,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16, pp. 202 – 208, Feb 2006. [32] F. Kossentini and Y.-W. Lee, “Computation-constrained fast mpeg-2 encoding,” IEEE Signal Processing Letters, vol. 4, no. 8, pp. 224 – 226, Aug. 1997. [33] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “Mediabench: A tool for eval- uating and synthesizing multimedia and communicatons systems,” in International Symposium on Microarchitecture, 1997, pp. 330–335. [34] P. List, A. Joch, J. Lainema, G. 
Bjntegaard, and M. Karczewicz, “Adaptive de- blocking filter,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 614–619, 2003. [35] ljoscha Smolic, Y. Vatis, H. Schwarz, and T. Wiegand, “Improved H.264/AVC Cod- ing Using Long-Term Global Motion Compensation,” in Proc. SPIE/ST&T Visual Communications & Image Processing (VCIP 2004), SPIE Visual Communications & Image Processing, January 2004. [36] E. of UC Berkley. Spice. [Online]. Available: http://bwrc.eecs.berkeley.edu/Classes /IcBook/SPICE/ [37] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, S. T., and T. Wedi, “Video coding with h.264/avc: tools, performance, and complexity,” Circuits and System Magazine, IEEE, vol. 4, no. 1, pp. 7 – 28, 2004. [38] I.-M. Pao and M.-T. Sun, “Modeling dct coefficients for fast video encoding,”Circuits and Systems for Video Technology, IEEE Transactions on, vol. 9, no. 4, pp. 608–616, 1999. [39] C. U. Penn. Simple power. [Online]. Available: http://www.cse.psu.edu/mdl/software.htm [40] A. Ray and H. Radha, “Complexity-Distortion Analysis of H.264/JVT Decoder on Mobile Devices,” inPictureCodingSymposium(PCS), December 2004. 144 [41] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaroa, “An instruction-level energy model for embedded vliw architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 9, pp. 998–1010, September 2002. [42] Schlansker and e. a. Michael S., “Achieving high levels of instruction-level parallelism with reduced hardware complexity,” HP Labs Tech. Report. HPL-96-120, 1996. [43] P. Semiconductors. Video centric socs. [Online]. Available: http://www.semiconductors.philips.com/pip /PNX1300EH.html [44] Y. Shen, D. Zhang, C. Huang, and J. Li, “Fast mode selection based on texture analysis and local motion activity in H.264/JVT,” in Communications, Circuits and Systems, 2004. ICCCAS 2004. 2004 International Conference on, vol. 1, pp. 539– 542. [45] K. Shring. H.264/avc reference software. [Online]. Available: http://iphome.hhi.de/suehring/tml/download/ [46] A. Sinha and A. P. Chandrakasan, “Jouletrack-a web based tool for software energy profiling,” in Proceedings of Design Automation Conference, June 2001, pp. 220–225. [47] N. T. Slingerl and A. J. Smith, “Cache performance for multimedia application,” in Proceedings of the 15th international conference on Supercomputing, June 2001, pp. 204 – 217. [48] A. Smolic, Y. Vatis, H. Schwarz, and T. Wiegand, “Improved H.264/AVC Cod- ing Using Long-Term Global Motion Compensation,” in Visual Communications & Image Processing (VCIP 2004), Jan 2004. [49] Sony. Emotion engine. [Online]. Available: http://www.playstation.com/ [50] J. Stottrup-Andersen, S. Forchhammer, and S. Aghito, “Rate-distortion-complexity optimization of fast motion estimation in H.264/MPEG-4 AVC,” in Image Process- ing, 2004. ICIP ’04. 2004 International Conference on, vol. 1, pp. 111 – 114. [51] G. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” Signal Processing Magazine, IEEE, vol. 15, no. 6, pp. 74 – 90, Nov 1998. [52] SYNOPSYS. Powermill. [Online]. Available: http://www.synopsys.com/products/etg /powermill ds.html [53] T. K. Tan, A. Raghunathan, G. Lakshminarayana, and N. Jha, “High-level energy macromodeling of embedded software,” Computer-Aided Design of Integrated Cir- cuits and Systems, IEEE Transactions on, vol. 21, no. 9, pp. 1037– 1050, 2002. [54] E. Technology. Video centric socs. [Online]. 
Available: http://www.equator.com/productsservices /videocentricsocs.html [55] V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: a first step towards software power minimization,” IEEE Transactions on VLSI Systems, vol. 2, no. 4, pp. 437–445, December 1994. 145 [56] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing power in high-performance microprocessors,” in Design Automation Conference, 1998, pp. 732–737. [57] T. Toivonen. x264 vs jm codec. [Online]. Available: http://www.ee.oulu.fi/∼tuukkat/mplayer/tests/x264test3/readme.html [58] A. Tourapis, O. Au, and M. Liou, “Highly efficient predictive zonal algorithms for fast block-matching motion estimation,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 12, no. 10, pp. 934 – 947, Oct. 2002. [59] E. UMICH. Sim Panalyzer. [Online]. Available: http://www.eecs.umich.edu/panalyzer/ [60] Videolan. x264. [Online]. Available: http://downloads.videolan.org/ pub/videolan/vlc/0.8.4/contrib/ [61] Y. Wang and S.-F. Chang. Complexity Adaptive Mo- tion Estimation and Mode Decision (CAMED) in Low Power H.264. [Online]. Available: http://www.ee.columbia.edu/ dvmm/researchProjects/PervasiveMedia/CAMED/camed summary.html [62] T. Wiegand, G. Sullivan, G. Bjntegaard, and A. Luthra, “Overview of the h.264/avc video coding standard,” Circuits and Systems for Video Technology, IEEE Transac- tions on, vol. 13, no. 7, pp. 560 – 576, July 2003. [63] T. Wiegand, X. Zhang, and B. Girod, “Long-term memory motion-compensated pre- diction,” Circuits and Systems for Video Technology, IEEE Transactions on,vol.9, no. 7, pp. 70 – 84, Feb. 1999. [64] T. Wiegand and B. Girod, “Lagrange Multiplier Selection in Hybrid Video Coder Control,” inProc. IEEE International Conference on Image Processing (ICIP 2001), Thessaloniki,, September 2001. [65] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate- constrained coder control and comparison of video coding standards,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 688–703, 2003. [66] Z. Xu, S. Sohoni, R. Min, and Y. Hu, “An analysis of cache performance of mul- timedia applications,” IEEE Transactions on Computers, vol. 53, no. 1, pp. 20–38, 2004. [67] Z. Zhou and M.-T. Sun, “Fast macroblock inter mode decision and motion estimation for H.264/MPEG-4 AVC,” in Image Processing, International Conference on,vol.2, pp. 789 –792. [68] C. Zhu, X. Lin, and L.-P. Chau, “Hexagon-based search pattern for fast block motion estimation,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 12, no. 5, pp. 349–355, May 2002. 146 [69] S. Zhu and K.-K. Ma, “A new diamond search algorithm for fast block-matching motion estimation,” Image Processing, IEEE Transactions on, vol. 9, no. 2, pp. 287 – 290, Feb. 2000. [70] V. Zyuban, Inherently Lower Power High Performance Superscalar Architectures. Ph. D. Thesis, CSE Dept., Univ. of Notre Dame, 2000. 147
Abstract (if available)
Abstract
This dissertation proposes a complete solution for power efficient multimedia applications on embedded system. We concentrate the research on the emerging video standard: H.264/AVC, both in the encoder and decoder ends. First a run-time power/energy estimation model is presented to provide a fast yet accurate tool for the energy analysis of multimedia application on embedded system. Then a rate-distortion-complexity (RDC) optimization algorithm is proposedto simplify the H.264/AVC motion estimation, which is the most energy-consuming component in this emerging video standard. Based on this RDC framework, a fast inter mode decision algorithm is used to enhance the energy saving. Finally a decoder-friendly adaptive de-blocking filter (DF-ADF) mode decision algorithm is proposed to reduce the decoder energy consumption requirement.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
H.264/AVC decoder complexity modeling and its applications
PDF
Modeling and optimization of energy-efficient and delay-constrained video sharing servers
PDF
Distributed source coding for image and video applications
PDF
Precoding techniques for efficient ultra-wideband (UWB) communication systems
PDF
Energy-efficient shutdown of circuit components and computing systems
PDF
Energy efficient design and provisioning of hardware resources in modern computing systems
PDF
Error tolerant multimedia compression system
PDF
Complexity scalable and robust motion estimation for video compression
PDF
Resource allocation in dynamic real-time systems
PDF
Reliable and power efficient protocols for space communication and wireless ad-hoc networks
PDF
Supporting multimedia streaming among mobile ad-hoc peers with link availability prediction
PDF
Design and analysis of collusion-resistant fingerprinting systems
PDF
Stochastic dynamic power and thermal management techniques for multicore systems
PDF
SLA-based, energy-efficient resource management in cloud computing systems
PDF
Novel and efficient schemes for security and privacy issues in smart grids
PDF
Efficient coding techniques for high definition video
PDF
A joint framework of design, control, and applications of energy generation and energy storage systems
PDF
Silicon-based RF/mm-wave power amplifiers and transmitters for future energy efficient communication systems
PDF
Towards green communications: energy efficient solutions for the next generation cellular mobile communication systems
PDF
Advanced knowledge graph embedding techniques: theory and applications
Asset Metadata
Creator
Hu, Yu
(author)
Core Title
Power efficient multimedia applications on embedded systems
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/26/2006
Defense Date
10/18/2006
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
adaptive deblocking filter,decoder friendly,H.264/AVC,motion estimation,OAI-PMH Harvest,power analysis
Language
English
Advisor
Kuo, C.-C. Jay (
committee chair
), Hwang, Kai (
committee member
), Zimmermann, Roger (
committee member
)
Creator Email
yhu0@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m111
Unique identifier
UC1211253
Identifier
etd-Hu-20061026 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-28568 (legacy record id),usctheses-m111 (legacy record id)
Legacy Identifier
etd-Hu-20061026.pdf
Dmrecord
28568
Document Type
Dissertation
Rights
Hu, Yu
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
adaptive deblocking filter
decoder friendly
H.264/AVC
motion estimation
power analysis