COMPLEXITY-DISTORTION TRADEOFFS IN IMAGE AND VIDEO COMPRESSION

by

Krisda Lengwehasatit

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
in Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2000

Copyright 2000 Krisda Lengwehasatit

Contents

List of Figures
List of Tables
Abstract

1 Introduction
1.1 Overview of compression
1.2 Rate-distortion and complexity
1.3 Variable Complexity Algorithm (VCA)
1.4 Discrete Cosine Transform Algorithms
1.4.1 Definition of DCT
1.4.2 Exact vs. Approximate DCT
1.4.3 VCA DCT
1.4.4 Computationally Scalable DCT
1.4.5 Thesis Contributions
1.5 Motion Estimation
1.5.1 Example: Conventional Exhaustive Search
1.5.2 Fast Search vs. Fast Matching
1.5.3 Fixed vs. Variable Complexity
1.5.4 Computationally Scalable Algorithms
1.5.5 Thesis Contributions
1.6 Laplacian Model

2 Inverse Discrete Cosine Transform
2.1 Formalization of IDCT VCA
2.1.1 Input Classification
2.2 Proposed VCA for IDCT
2.2.1 Sequential Classification
2.2.2 Greedy Optimization
2.2.3 Tree-structured Classification (TSC)
2.2.4 Optimal Tree-structured Classification (OTSC)
2.2.5 Computation of 2-D IDCT
2.2.6 2-D Dyadic Classification
2.3 Results
2.3.1 Results Based on Image Model
2.3.2 Real Image Data Results
2.4 Distortion/decoding time tradeoffs
2.5 Rate-Complexity-Distortion Quadtree Optimization
2.6 Summary and Conclusions

3 Forward Discrete Cosine Transform
3.1 Exact VCA DCT
3.1.1 Input Classification: Pre-transform Deadzone Test
3.1.2 Optimal Classification
3.2 Approximate DCT
3.2.1 SSAVT Review
3.2.2 Error Analysis of SSAVT
3.2.3 APPROX-Q DCT
3.2.4 APPROX-Q Error Analysis
3.3 Results and Hybrid Algorithms
3.3.1 Approx-SSAVT (ASSAVT)
3.3.2 Approx-VCA DCT
3.3.3 ASSAVT-VCA DCT
3.4 Summary and Conclusions

4 Motion Estimation: Probabilistic Fast Matching
4.1 Partial Distance Fast Matching
4.2 Hypothesis Testing Framework
4.3 Parameter Estimation
4.3.1 Model Estimation for ROW
4.3.2 Model Estimation for UNI
4.4 Experimental Results
4.4.1 VCA-FM versus VCA-FS
4.4.2 UNI versus ROW
4.4.3 Scalability
4.4.4 Temporal Variation
4.4.5 Overall Performance
4.5 HTFM for VQ
4.6 Summary and Conclusions

5 Motion Estimation: PDS-based Candidate Elimination
5.1 PDS-based Candidate Elimination
5.1.1 Ideal Candidate Elimination
5.1.2 Reduced Steps Candidate Elimination
5.2 Suboptimal CE-PDS
5.2.1 Computationally Nonscalable Fast Search
5.2.2 Computationally Scalable Fast Search
5.3 Multiresolution Algorithm
5.4 Summary and Conclusions

Bibliography

List of Figures

1.1 (a) Video encoding and (b) decoding system.
1.2 Variable complexity algorithm scheme.
1.3 A fast (a) DCT and (b) IDCT algorithm introduced by Vetterli-Ligtenberg, where 'Rot' represents the rotation operation which takes inputs [x, y] and produces outputs [X, Y] such that X = x cos Q + y sin Q and Y = -x sin Q + y cos Q; C4 = 1/√2.
1.4 Frequency of nonzero occurrence of 8x8 quantized DCT coefficients of the "lenna" image at (a) 30.33 dB and (b) 36.43 dB.
1.5 Frequency of nonzero occurrence of 8x8 quantized DCT coefficients of 10 frames of "Foreman" at QP=10: (a) I-frame, (b) P-frame.
1.6 Motion estimation of the i-th block of frame t, predicted from the best block in the search region V in frame t-1.
1.7 Summary of the ST1 algorithm, where (a) and (b) depict spatial and spatio-temporal candidate selection, respectively. Highlighted blocks correspond to the same spatial location in different frames. (c) illustrates the local refinement starting from the initial motion vector found to be the best among the candidates.
1.8 Rate-distortion using the standard-defined DCT ('dashed') and R-D for SSAVT ('solid') for σ² = 100 with varying QP at 3 block sizes, i.e., 4x4, 8x8 and 16x16. The rate and distortion are normalized to a per-pixel basis.
2.1 TSC pseudo code for testing the upper 4 branches of Fig. 1.3(b) in stage 2.
2.2 TSC diagram showing zero information of classes after the all-zero test and 4-zero tests.
2.3 TSC diagram showing zero information of classes after the 2-zero test and 1-zero tests. The right-most class after the 2-zero test is further classified into 9 classes which are tensor products of the descendant classes shown, i.e., A ⊗ B = {(A_i ∩ B_j)} for all i, j, where A = {A_i}, B = {B_j}. Therefore, the number of leaves is 3 + 3 + (3x3) = 15.
2.4 Diagram showing possible choices for classification at level 2, where A, B represent testing with M[1010] and M[0101], respectively.
2.5 Another diagram when the result from level 1 is different.
2.6 Content of the bitmap after the first (row-wise) 1-D IDCT, where 'x' represents a nonzero coefficient.
2.7 Dyadic classification scheme. A DCT input of size NxN is classified into all-zero, DC-only (K_1), low-2x2 (K_2), ..., low-(N/2)x(N/2) (K_{N/2}), and full-NxN (K_N) classes.
2.8 Number of additions (a) and number of multiplications (b) needed for σ² = 100 and 700, using OTSC ('solid'), Greedy ('dashed'), ideal fast IDCT ('dashed-dotted'), and ideal matrix multiplication ('light-solid'). The OTSC and Greedy are optimized for each QP.
2.9 Total complexity (a) and number of tests (b) needed for σ² = 100 and 700 using the OTSC ('solid') and greedy ('dashed') algorithms.
2.10 (a) Complexity-distortion and (b) rate-complexity curves for different algorithms, i.e., greedy ('dotted') and 2-D dyadic ('dashed'), for DCT sizes 16x16 ('x'), 8x8 ('o') and 4x4 ('*') at σ² = 100. The complexity unit is weighted operations per 64 pixels.
2.11 Normalized estimated complexity for "lenna" using CW with the all-zero test algorithm ('+'), CW with the all-zero test for the first 1-D IDCT and the ac-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), and the OTSC algorithm ('o').
2.12 Normalized actual time complexity (IDCT algorithm part only) for "lenna" using CW with the all-zero test algorithm ('+'), CW with the all-zero test for the first 1-D IDCT and the ac-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), and the OTSC algorithm ('o').
2.13 Normalized actual time complexity (total decoding time) for "lenna" using CW with the all-zero test algorithm ('+'), CW with the all-zero test for the first 1-D IDCT and the ac-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), the OTSC algorithm ('o'), OTSC for lenna at MSE 14.79 ('--'), and OTSC for lenna at MSE 60.21 ('-.').
2.14 Normalized actual time complexity (total decoding time) for the baboon image using CW with the all-zero test algorithm ('+'), CW with the all-zero test for the first 1-D IDCT and the ac-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), the OTSC algorithm ('o'), and OTSC for lenna with MSE 14.79 ('-').
2.15 The complexity (CPU clock cycles) of the IDCT + inverse quantization, normalized by the original algorithm in TMN, at various PSNRs (different target bit rates) using OTSC ('A') and combined dyadic-OTSC ('*').
2.16 Distortion versus (a) estimated IDCT complexity and (b) experimental decoding time of "lenna" using a fixed-quantizer encoder and OTSC decoder ('o'), Lagrange multiplier results ('x'), and an encoder following the Lagrange multiplier results while the decoder uses a single OTSC for MSE=60.66 ('*') and MSE=14.80 ('+'), respectively.
2.17 (a) Complexity-distortion curve obtained from C-D ('x') and R-D ('*') based optimization. (b) Rate-distortion curve achieved with C-D ('x') and R-D ('*') based optimization. The complexity is normalized by the complexity of the baseline Vetterli-Ligtenberg algorithm.
2.18 Quadtree structures of four 16x16 regions and the corresponding representative bits.
2.19 Constant-complexity rate-distortion curves. When the complexity constraint is loosened, the rate-distortion performance can be better; 'dashed' curves show the unconstrained-complexity result.
2.20 Constant-rate complexity-distortion curves at 200 Kbps and 300 Kbps. As the rate is more constrained, the C-D performance gets worse.
2.21 Constant-distortion rate-complexity curves at MSE = 30 and 50. As the distortion requirement becomes more rigid, the R-C performance becomes worse.
3.1 Geometric representation of the dead-zone after rotation.
3.2 Proposed VCA algorithm.
3.3 Comparison of the original and pruned algorithms for different distortions: (a) number of additions, (b) number of multiplications. The DCT lower bound corresponds to computing only the subset of coefficients that will be non-zero after quantization.
The VCA lower bound corresponds to pruning subject to the classification mechanisms of Section 3.1.2, i.e., we can only prune subsets of coefficients which are tested jointly.
3.4 Complexity (clock cycle)-distortion comparison with "lenna" JPEG encoding.
3.5 Additional distortion (normalized by the original distortion) when using SSAVT at various levels of the pixel variance σ².
3.6 Rate-complexity-distortion functions of SSAVT. The R-C-D results of SSAVT and the ideal dyadic test are very close and hence cannot be visually distinguished in the figure.
3.7 Total complexity and number of tests needed for the SSAVT algorithms and 2-D dyadic at σ² = 100 and 700. 2-D dyadic is considered as an ideal case for the SSAVT where reduced classes are 100% detected.
3.8 (a) Complexity-distortion and (b) rate-complexity curves for different algorithms, i.e., SSAVT ('solid') and 2-D dyadic ('dashed'), for DCT sizes 16x16 ('x'), 8x8 ('o') and 4x4 ('*') at σ² = 100. 2-D dyadic is considered as an ideal case for the SSAVT where reduced classes are 100% detected.
3.9 The approximate DCT algorithm.
3.10 Rate-distortion curves of 512x512 lenna image JPEG coding using various DCT algorithms. Note that at high bit rates the coarser approximate algorithms deviate dramatically from the exact DCT performance. The quantization-dependent approximation can maintain the degradation level over a wider range of bit rates.
3.11 Additional distortion (normalized by the original distortion) using approximate DCT algorithms #1 through #5 (different markers) at various pixel variances σ².
3.12 The complexity (CPU clock cycles) of the DCT + quantization, normalized by the original algorithm in TMN, at various PSNRs (different target bit rates) for the original DCT ('+'), SSAVT ('o'), Approx-Q ('A'), ASSAVT ('□'), Approx-VCA ('y'), and ASSAVT-VCA ('*').
3.13 Additional distortion (normalized by the original distortion) using the ASSAVT with 2x10^-4 target deviation from the original distortion ('--') and SSAVT ('-'), at QP = 10 ('o') and 22 ('*'), respectively.
4.1 Subset partitionings for 128 pixels subsampled using a quincunx grid into 16 subsets for partial SAD computation. Only highlighted pixels are used to compute the SAD. Two types of subsets are used: (a) uniform subsampling (UNI) and (b) row-by-row subsampling (ROW). Partial distance tests at the i-th stage are performed after the metric has been computed on the pixels labeled with i (corresponding to y_i).
4.2 Complexity-distortion of reduced-set SAD computation with ROW DTFM ('dotted') and without DTFM ('solid') using (a) ES and (b) ST1 search, averaged over 5 test sequences. Points in each curve from right to left correspond to |β| = 256, 128, 64 and 32, respectively. Note that there is minimal difference between computing the SAD based on 256 and 128 pixels. For this reason, in all the remaining experiments in this work we use at most 128 pixels for the SAD computation.
4.3 (a) Scatter plot of MAD (y-axis) versus PMAD_i (x-axis) and (b) corresponding histograms of MAD - PMAD_i. These are plotted for 16 stages of UNI subsampling, with the number of pixels ranging from 8 (top left) to 128 (bottom right). We use UNI subsampling and ES on the "mobile & calendar" sequence. Similar results can be obtained with other sequences, search methods and subsampling grids.
4.4 Empirical pdf of MAD - PMAD_i (estimation error) obtained from the histogram of training data (solid line) and the corresponding parametric model (dashed line). HTFM terminates the matching metric computation at stage i if PMAD_i - MAD_bsf > Th_i.
4.5 (a) and (b): variance statistics of the ROW estimation error computed from the definition (mean square) ('solid') and from the first-order moment ('dashed'). The left-most points in (a) are E{(PMAD_1 - PMAD_2)^2} and 2E{|PMAD_1 - PMAD_2|}^2.
4.6 (a) The two estimates ('solid' and 'dashed') of the MAD estimation error statistics at 15 stages using ES and UNI, respectively. The left-most points show E{(PMAD_1 - PMAD_2)^2} and 2E{|PMAD_1 - PMAD_2|}^2 for each sequence. (b) Ratio of the two estimates for each sequence. Note that this ratio is nearly the same for all sequences considered.
4.7 Example of tracking of the statistics under UNI subsampling. Note that the approximated values track the actual ones well, even though the parameters change over time. We use several different sequences to provide the comparison. This serves as motivation for using online training, rather than relying on precomputed statistics.
4.8 Complexity-distortion curves for the first 100 frames of the "mobile" sequence, (a) with MSE of the reconstructed sequence and (b) with residue energy as the distortion measure, for search without test ('o'), partial distance (SAD test) ('*') and combined hypothesis-SAD test ('x'), each curve at the fixed Pf labeled on the curve and varying A from 0.01-4.
4.9 Complexity-distortion of HTFM with ES and variance estimation on-the-fly, ROW ('solid') and UNI ('dashed'): (a) PSNR degradation vs. clock cycles and (b) residue energy per pixel vs. number of pixel-difference operations. Both clock cycles and the number of pixel-difference operations are normalized by the result of ES with ROW DTFM. It can be seen that UNI HTFM performs better than ROW HTFM. The transform coding mitigates the effect of the increase in residue energy in the reconstructed frames. The testing overhead reduces the complexity reduction by about 5%.
The complexity reduction is up to 65% at 0.05 dB degradation.
4.10 Complexity-distortion of UNI HTFM with variance estimation on-the-fly for (a) 2-D Log search and (b) ST1 search. The axes are clock cycles and PSNR degradation normalized/compared to the 2-D Log search (a) or ST1 search (b) with ROW DTFM. The complexity reduction is up to 45% and 25% at 0.05 dB degradation for the 2-D Log and ST1 search, respectively.
4.11 Frame-by-frame speedup factor for ES using ROW and DTFM ('A'), and ROW HTFM ('o') with Pf = 0.1 and 0.01 dB degradation.
4.12 Frame-by-frame speedup factor for 2-D Log search using ROW and no FM ('*'), DTFM ('A'), and ROW HTFM ('o') with Pf = 0.2 and 0.04 dB degradation.
4.13 Frame-by-frame speedup factor for the ST1 algorithm using ROW and no FM ('*'), DTFM ('A'), and ROW HTFM ('o') with Pf = 0.3 and 0.12 dB degradation.
4.14 Complexity-distortion of HTFM VQ with vector size (a) 8 for an i.i.d. source and (b) 16 (4x4) for the high-high band of the "lenna" image.
5.1 Uniform macroblock partition into 16 subsets, showing only the upper-left 8x8 region. Partial distance tests at the i-th stage are performed after the metric has been computed on the pixels labeled with i.
5.2 Cumulative probability of termination using 16-stage PDS and exhaustive search with ROW ('solid') and UNI ('dashed'), and using TMN's fast motion search with UNI ('dash-dotted'), for 150 frames of five H.263 sequences coded at 56 Kbps. The efficiency of the PDS drops relatively when a FS is used.
5.3 Complexity (number of stages) versus m for (a) "Miss America" and (b) "Suzie". The top and bottom lines in each figure are the original PDS with UNI and the ideal PDS, respectively.
5.4 Complexity-distortion using various algorithms, averaged over 5 test sequences. The complexity unit is clock cycles normalized by the original ROW PDS.
5.5 Complexity-distortion using various algorithms, averaged over 5 test sequences. The complexity unit is the number of pixel comparisons normalized by the original ROW PDS.

List of Tables

1.1 Profile of components in the MPEG2 encoder for the sequence "mobile & calendar". ES: exhaustive search, ST1: a spatio-temporal fast search [4], PDS: partial distance search.
1.2 Number of operations required to compute a 1-D size-N DCT using various fast algorithms (the number in brackets is obtained when N = 8).
1.3 Notation table.
2.1 Weights for different logical operations.
3.1 Number of operations required for the proposed approximate algorithms, where Alg. No. i corresponds to using the transform matrix P0,i.
4.1 Total time for encoding 150 frames, and PSNR.
4.2 (cont.) Total time for encoding 150 frames, and PSNR.
5.1 Number of PSAD metric computation stages for different PDS variations, for 150 frames of H.263 sequences coded at 56 Kbps.
5.2 Results of 2-FCE complexity reduction with respect to the original PDS.
5.3 Results of 8-FCE (2 stages of PSAD per testing step).
5.4 Result of using UBC's Fastsearch option.
5.5 Result of 2-Step with Threshold when m = 1 and t = 1.
5.6 Results of MR1-FCE and MR2-FCE complexity reduction at t = 0.8 with respect to the original multiresolution algorithm.

Abstract

With the emergence of the Internet, the transmission of a broader range of information, such as text, images, video and audio, is now ubiquitous. However, the growth in data transfer is not always matched by the growth in available channel bandwidth. This has raised the importance of compression, especially for images and video. As a consequence, compression standards for image and video have been developed since the early 90's and have become widely used. These standards include JPEG [1] for still image coding, MPEG-1/2 [2] for video coding and H.261/263 [3] for video conferencing.

We are motivated by the observation that general purpose workstations and PCs have increased their speed to a level where compression and decompression of images, and even video, can be performed in software efficiently at, or near, real-time speed. Examples of this trend include software-only decoders for the H.263 video conferencing standard, as well as the wide use of software implementations of the JPEG standard to exchange images over the Internet. This trend is likely to continue as faster processors become available and innovative uses of software, for example Java applets, become widespread. Optimizing the performance of the algorithms for the specific case of software operation is therefore becoming more important.

In this thesis we investigate variable complexity algorithms. The complexities of these algorithms are input-dependent, i.e., the type of input determines the complexity required to complete the operation. The key idea is to enable the algorithm to classify the inputs so that unnecessary operations can be pruned. The goal of the design of a variable complexity algorithm is to minimize the average complexity over all possible input types, including the cost of classifying the inputs. We study two of the fundamental operations in standard image/video compression, namely, the discrete cosine transform (DCT) and motion estimation (ME).

We first explore variable complexity in the inverse DCT by testing for zero inputs. The test structure can also be optimized for minimal total complexity for given input statistics.
In this case, the larger the number of zero coefficients, i.e., the coarser the quantization stepsize, the greater the complexity reduction. As a consequence, tradeoffs between complexity and distortion can be achieved.

For the direct DCT we propose a variable complexity fast approximation algorithm. The variable complexity part computes only the DCT coefficients that will not be quantized to zero according to the classification results (in addition, the quantizer can benefit from this information by bypassing its operations for zero coefficients). The classification structure can also be optimized for given input statistics. On the other hand, the fast approximation part approximates the DCT coefficients with much less complexity. The complexity can be scaled, i.e., it allows more complexity reduction at lower quality coding, and can be made quantization-dependent to keep the distortion degradation at a certain level.

In video coding, ME is the part of the encoder that requires the most complexity, and therefore achieving significant complexity reduction in ME has always been a goal in video coding research. There have been several variable complexity algorithms for ME. However, most of the research concentrates on reducing the number of tested vector points. We propose two fast algorithms based on fast distance metric computation, or fast matching, approaches. Both of our algorithms allow computational scalability in the distance computation with graceful degradation in the overall image quality. The first algorithm exploits hypothesis testing in the fast metric computation, whereas the second algorithm uses thresholds obtained from partial distances in a hierarchical candidate elimination.

Chapter 1
Introduction

1.1 Overview of compression

All current image and video compression standards are based on the same concept of transform based coding, illustrated by Figure 1.1. Basic building blocks of a transform based image/video encoder include (i) blocking, where data is partitioned into smaller units known as blocks (of size 8x8 pixels) or macroblocks (16x16 pixels), (ii) motion estimation (ME) in the predictive mode of video coding, to exploit the temporal redundancy, (iii) the Discrete Cosine Transform (DCT), to decompose the signal into its different frequency components, (iv) quantization, to reduce the amount of information down to a level suitable for the channel (while introducing distortion), and (v) entropy coding, to compress the quantized data losslessly. The decoding operation performs entropy decoding, inverse quantization, inverse DCT (IDCT) and motion compensation, sequentially, to obtain the reconstructed sequence.

Figure 1.1: (a) Video encoding and (b) decoding system.

Typically, standards define only the syntax of the bit-stream and the decoding operation. They normally leave some room for performance improvement via better bit allocation at the encoders. Usually the more recent standards, which provide better compression performance, also have more functionalities and require more complex operation. With the existing standards, the coding gain comes at the price of significant complexity in major components of the basic structure, such as the DCT, the inverse DCT and motion estimation.
With the advent of real-time applications such as video conferencing, complexity issues of the codec have become more important. In many applications, including decoding on general purpose computers or on portable devices, significant complexity reduction is needed before video can be supported, especially if high resolution is required. Table 1.1 shows an example of a typical profile of computational complexity, listing the percentage of time spent in each of the major components of MPEG2 encoding. We can see that in a video encoding system, motion estimation is the most time consuming task. With a fast motion search, one can achieve speedups of a factor of 4-5; however, the complexity still remains significant. Besides ME, video encoders have to perform the DCT/IDCT, which is also a major complexity component. In Chapters 2-5, we will study the complexity-distortion tradeoffs of these two components (DCT and ME) based on a variable complexity algorithm framework in which the complexity is input-dependent. We will propose algorithms whose structure can be optimized for a given type of input sequences or images, so that the total complexity is minimal in an average sense.

Table 1.1: Profile of components in the MPEG2 encoder for the sequence "mobile & calendar". ES: exhaustive search, ST1: a spatio-temporal fast search [4], PDS: partial distance search.

Component              ES      ES-PDS   ST1      ST1-PDS
Motion estimation      86.6%   69.3%    22.9%    20.2%
Quant. + Inv. Quant.   3.7%    8.3%     20.0%    21.7%
DCT + IDCT             1.9%    5.9%     13.0%    12.6%
Others                 7.8%    16.5%    44.1%    45.5%
Relative total time    1       0.44     0.2142   0.2106

1.2 Rate-distortion and complexity

In information theory, entropy is defined as a measure of the amount of information contained in random data. The amount of information can be expressed in terms of the number of bits needed to represent the data, so that more bits are needed to represent data containing more information. Entropy represents the average amount of information of a random source and also serves as a lower bound on the average code length needed to represent that source. The entropy of a random source X is defined as

H(X) = - Σ_x p(x) log p(x) bits,

where p(x) is the probability mass function. An entropy encoder maps source symbols to codewords, E: x → c(x). From information theory, one cannot compress data beyond the entropy (Shannon's source coding theorem). If we want to further compress the data, distortion must be introduced as a tradeoff, and the goal is to find the minimal rate for a given distortion (Shannon's Rate-Distortion theorem [5]), i.e., one wants to find a coder with minimal distortion for a given rate constraint. Similar to the entropy, the rate-distortion function is defined as the minimum achievable distortion over all encodings whose rate satisfies the bit budget constraint.
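To make the definition concrete, the following minimal Python sketch evaluates H(X) for a hypothetical symbol distribution; the probabilities are illustrative only, loosely mimicking quantized coefficients dominated by zeros.

```python
# Entropy H(X) = -sum_x p(x) log2 p(x) of a discrete source, in bits/symbol.
# The pmf below is hypothetical, chosen so that the zero symbol dominates,
# as is typical for quantized DCT coefficients.
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

pmf = {0: 0.80, 1: 0.08, -1: 0.08, 2: 0.02, -2: 0.02}
print(f"H(X) = {entropy_bits(pmf):.2f} bits/symbol")  # about 1.07 bits/symbol
```

A fixed-length code for these five symbols would need 3 bits each, so the entropy bound already suggests a roughly 3x saving for such a skewed source.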
Image and video compression algorithms aim at achieving the best possible rate-distortion (R-D) performance. Three main approaches are typically considered to improve R-D performance, namely, better transformations ([6, 7, 8]), quantization bit allocation ([9, 10, 11]), and efficient entropy coding ([12]). There are a number of algorithms that provide high efficiency in each of these areas. Examples include using the wavelet transform with progressive quantization and efficient entropy coding ([13, 14, 15, 16, 17, 18]), DCT with optimal thresholding and quantization [19], DCT with progressive coding [20], variable block size DCT [21], etc. However, in all the above methods complexity is not taken into account explicitly.

Shannon's Rate-Distortion theorem provides only an asymptotic result which is valid only for infinite memory coders. Typically, encoders with longer memory, more computational power and larger context statistics can perform better than encoders using fewer resources. In complexity-constrained environments, such as software implementations of real-time video encoding/decoding systems and battery-limited pervasive devices, complexity becomes a major concern in addition to rate-distortion performance. Normally, one would prefer a system that encodes or decodes at a higher frame rate with a small degradation in picture quality rather than one with slightly better rate-distortion performance but much more complexity or delay. Thus, in order to achieve the best rate-distortion-complexity performance, all three factors must be explicitly considered together.

With the fast increase in the clock speed and performance of general purpose processors, software-only solutions for image and video encoding/decoding are of great interest. Software solutions result not only in cheaper systems, since no specialized hardware needs to be bought, but also provide extra flexibility compared to customized hardware. In a software implementation, there are many factors that impact the computational complexity in addition to the number of arithmetic operations. For example, conditional logic, memory access and caching all have an impact on overall speed. However, most of the work that has studied complexity issues has focused on arithmetic operations (additions and multiplications), and generally has not considered other factors. The works by Gormish [22] and Goyal [23] are examples of research that addresses complexity-rate-distortion tradeoffs. In this research the transform block size is used to control the complexity, i.e., when the block size increases, the required complexity also increases while the rate-distortion performance improves. Note, however, that complexity is determined only by the block size, and therefore it is constant regardless of the input. Thus, as in most earlier work on this topic, complexity is analyzed only in the worst case. Instead, in this thesis we will concentrate on algorithms designed to perform well on average.

1.3 Variable Complexity Algorithm (VCA)

In this work, we are interested in reducing the computational complexity of an algorithm, where complexity could be measured as actual elapsed time, CPU clock cycles or the number of operations consumed by the process. We develop algorithms with input-dependent complexity (see Fig. 1.2), such that for some types of input the complexity is significantly reduced, whereas for other types the complexity may be larger than the original complexity (i.e., that achieved with an input-independent algorithm).
For example, the IDCT does not have to be performed if the input block is all zero. The overhead cost of testing for the all-zero block is then added to the complexity of the not-all-zero blocks but, if the percentage of all-zero blocks is large enough, the complexity savings can outweigh the testing overhead. The same can be applied to more sophisticated VCAs in which the final goal is to achieve less complexity on the average. We will show that by using VCAs, it is possible to achieve a better complexity-distortion performance for a given rate than reported in [22].

Figure 1.2: Variable complexity algorithm scheme.

In order to define an efficient VCA, as depicted in Fig. 1.2, it will be necessary to study 3 issues (a short code sketch of the resulting classify-and-dispatch structure follows the list):

1. Input Classification
In order for the complexity of an operation to be variable, the input must be classified into different classes, where each class requires a different amount of computation to complete the operation. Conceptually, we can say that a different algorithm (a "reduced algorithm") is associated with each class. The complexity of each reduced algorithm may not be the same and depends on the input class. The classification scheme thus plays an important role in the VCA framework. This idea is similar to entropy coding, where the code length of the codeword corresponding to each input symbol can be different.

2. Optimization
The goal of the input classification is to achieve average-case complexity that is below the complexity achieved when no classification is used and a single algorithm is used for all inputs. However, the classification process itself comes at the cost of additional complexity. Therefore, we use statistical information to design a classification algorithm such that the total complexity, including the classification cost, is minimal on the average. This is achieved by eliminating those classification tests that are not worthwhile, i.e., those that increase the overall complexity. This is analogous to the entropy code design problem, where the codebook is designed in such a way that the average code length is smaller than that of a fixed length code. Normally, symbols with higher probability are given shorter codewords and rare symbols are assigned longer codewords. However, in the complexity case, there are two factors determining the total complexity, namely, the cost of classification and the cost of the reduced algorithms, which vary for different types of input. Unlike entropy coding, here it is not possible to guarantee that the most likely inputs are assigned the least complex algorithms, since the complexity is a function of the input itself.
Normally, computa tional scalability can be introduced in an algorithm at the cost of increased distortion, or higher rate for the same distortion. The complexity-distortion tradeoffs are analogous to rate-distortion tradeoffs, i.e., for a given complex ity constraint, we want to find an algorithm which yields the smallest dis tortion while satisfying the complexity budget. This complexity-distortion problem is more open than its rate-distortion counterpart because the com plexity depends on the platform the system is running on. In this thesis, we present computationally scalable algorithms which perform reasonably well on Unix and PC platforms, but our m ethods can be easily extended to small embedded machine. In this dissertation, we present variable complexity algorithms (VCA) for 3 main components of standard video coders, namely, IDCT, DCT and motion estimation. For each of these algorithms we studied the three main issues discussed above. In Sections 1.4 and 1.5, we give a comprehensive literature survey of D C T/ID C T and motion estimation, and provide summaries of our contributions. 1.4 D iscrete C osine Transform A lgorithm s 1.4.1 D efinition o f D C T The Discrete Cosine Transform (DCT) [24], first applied to image compression in [25], is by far the most popular transform used for image compression applications. Reasons for its popularity include not only its good performance in terms of energy compaction for typical images but also the availability of several fast algorithms. Aside from the theoretical justifications of the DCT (as approximation to the Karhunen-Loeve Transform, KLT, for certain images [24]) our interest stems from 7 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the wide utilization in different kinds of image and video coding applications. The well-known JPEG and MPEG standards use DCT as their transformation to decorrelate input signal (see [1]). Even with the emergence of wavelet transforms, DCT has still retained its position in image compression. While we concentrate on the DCT, most of our developments are directly applicable to other orthogonal transforms. The N point DCT X of vector input x = [x(0),:r(l), — 1)]T is defined as X = D n • x where D n is the transformation m atrix of size NxN with elements D n (i,j) Conversely, the inverse transform can be written as x = D nT • X given the orthogonality property. The separable 2-D transforms are defined as X - D n * X ■ D n T respectively, where X and x are now 2-D matrices. This means that we can apply DCT or IDCT along rows first then across columns of the resulting set of coefficients, or vice versa, to obtain the 2-D transform. Each basis in the DCT domain represents an equivalent frequency component of the spatial domain real data sequence. After applying the DCT to a typical image, DCT coefficients in the low frequency region contain most of the energy. Therefore, DCT has a good energy compaction performance. Computing the D CT/ID CT directly following the definition requires N 2 mul tiplications for a 1-D transform. There have been many fast algorithms proposed to reduce the number of operations. There are two ways to categorize these fast algorithms. First, they can be categorized as either exact or approximate algo rithms depending on whether the transform result follows the DCT definition or wher ( 1.1) and x D nT • X • D n , 8 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
(slightly) differs from the definition, respectively. Second, they can be categorized based on their complexities. If the complexities are fixed regardless of the input, they are fixed complexity algorithm. On the other hand, if the complexity is input dependent, they are variable complexity algorithms (VCAs), i.e., there are different algorithms for different types of input (see Figure 1.2). Another related issue is computational scalability where the complexity can be adjusted with the penalty of distortion tradeoffs. We now review previously proposed algorithms for fast implementation of D CT/ID CT, classifying them according to the approaches that are used. 1.4.2 E x a ct vs. A p p roxim ate D C T Exact A lgorithm s Since elements in the DCT transformation matrix are based on sinusoidal func tions, significant reductions in complexity can be achieved. For example, with the availability of the well known Fast Fourier Transform (FFT), one can obtain the DCT using the FFT and some pre-post processings as shown in [26] and [27]. A direct DCT computation fast algorithm was first proposed by Chen et al. in [28]. Since then, there are several other fast DCT algorithms proposed such as [29], [30] and [31], which aim at achieving the smallest number of multiplications and additions. The minimal number of multiplications required for a 1-D DCT transform was derived by Duhamel et. al. in [32]. Loeffier et. al. in [33] achieves this theoretical bound for size-8 DCT. An example of fast algorithm based on Vetterli-Ligtenberg’s algorithm [29] for 1-D size-8 DCT and IDCT is depicted in Figures 1.3(a) and (b), respectively. This algorithm requires 13 multiplications and 29 additions and the structure is recursive, i.e., the top 4 branches starting from the 2nd stage correspond to a size-4 DCT. Table 1.2 shows the complexity of various other algorithms for DCT computation in terms of number of additions and multiplications. It has been shown that a fast algorithm for 2-D requires less arithmetic opera tions than using 2 fast 1-D algorithms separately ([34], [35]). Similar to 1-D DCT, the theoretical bound for higher dimension was derived by Feig and Winograd in 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2 stage I stage is subtraction from and (a) 3r d stage - is addition to. X O C4 X4 C4 iz2 . Rot X6 X2 lz4 X7 - Rot XI C4 C4 X3 + Rot 3rt X5 [ 3rd stage 1st stage 2nd stage xO xl x2 x3 x7 where is subtraction from and (b) is addition to. Figure 1.3: A fast (a) DCT (b) IDCT algorithm introduced by Vetterli-Ligtenberg where ’R ot’ represents the rotation operation which takes inputs [x, y\ and pro duces outputs [A T , Y] such that X = x cos Q + y sin Q and Y = — x sin Q + y cos Q, C4 = l/y/2. 10 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 1.2: Number of operations required to compute a 1-D size N DCT using various fast algorithms (the num ber in the brackets is obtained when N =8). A lgorith m m u ltiplications additions m atrix multiplication N 2 [64] N ( N - 1) [56] Chen et.aCll [28] N log2 N — 3N/2 -F 4 [16] 3JV(log2 N — l)/2 -F 2 [26] Wang’84 [30] A (flo g 2i V - l ) - 3 [13] /V(ilog2 iV —2)+3 [29] Lee’84 [31] T log2 N [12] 3 f log2 i V - JV + 1 [29] Duhamel’87 [32] (Theoretical bound) N /2 - log2 N - 2 [11] n /a [36] and the bound can be achieved as pointed out by Wu and Man in [37] by incorporating 1-D algorithm of [33] to a 2-D algorithm by Cho and Lee in [38]. 
Moreover, there are several other algorithms aiming for different criteria. For example, an algorithm for fused MULTIPLY/ADD architecture introduced in [39], a time-recursive algorithm [40] and a scaled DCT algorithm which computes a scaled version of the DCT, i.e., in order to get the exact DCT the scaling factor must be introduced in the quantization process. A scaled DCT algorithm proposed in [41] gives a significant reduction of the number of multiplications. Research is still ongoing on the topic of fast algorithms, but current goals in this research may involve criteria other than adds and multiplies, e.g., providing favorable recursive properties [42], Some of them avoid multiplications by using a look-up table [43]. It is also worth mentioning the work by Merhav and Bhaskaran [44] in which image manipulations such as scaling and inverse motion compensation are done in the transform domain. This saves some computations as compared to doing the inverse transform, processing, and then the forward transform separately. The gain can be achieved given the sparseness of the combined scaling and transform matrix. A pproxim ate A lgorithm s All of the above algorithms have aimed at computing the exact DCT with minimal number of operations. However, if the lower bound in the number of operations 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. has been reached, obviously it is no longer possible to reduce the complexity while maintaining an exact algorithm. One way to further reduce the number of operations is to allow the resulting computed coefficients to be different from their defined values, i.e., allow some errors (distortions) in the DCT computation. The following approaches are representative of those proposed in the literature. One approach is to compute some of the DCT coefficients which represent low frequencies. Empirically, for typical images most of the energy is in the low frequencies as can be seen in Figs. 1.4 and 1.5. Thus, ignoring high frequency coefficients still results in acceptable reconstructed images in general. Earlier, a related work on computing the DFT with a subset of inputs/outputs is proposed by Burrus et.al [45]. This work analyses the number of operations required for the Discrete Fourier Transform (DFT) and provides an algorithm with minimal number of operations (so-called “pruned D FT” ) for the case where only a subset of input or output points are needed. This is achieved by pruning all the operations that become unnecessary when only a subset of input/output points is required. However, the work was restricted to the case where the required subset is known or fixed and where the required input/output points are always in order. For the DCT, one could compute only the first 4 coefficients of a size-8 DCT as in [46]. An alternative is to preprocess the d ata to get rid of empirically unnecessary information before performing a smaller size DCT. In [47] and [48], the simple Haar subband decomposition is used as the preprocessing. Then a size-4 DCT is performed on low band coefficients and the result is scaled and used as the first 4 coefficients of a size-8 DCT. For IDCT, a reduced size input in which only DC, 2x2 or 4x4 DCT coefficients are used to reconstruct the block has been implemented in [46]. Similarly, the reduced size output IDCT can also be used ([49, 50]) for applications such as reduced resolution decoder and thumbnail viewing. 
However, for video applications the motion drift problem can seriously deteriorate the video quality when an approximate IDCT is used instead of the exact one. Another approach is to use the distributed arithm etic concept. DCT coeffi cients can be represented as a sum of the DCT of each input bit-plane, which can be easily computed by a look-up table [51]. Since the contribution of the least significant bit-plane of input is small, and most likely the LSB represents noise 12 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. in image capturing process, the operation for those bit-planes can be removed. This idea originally came from DFT computation in [52] where it is called SNR update. In [53], a class of approximate algorithms which is multiplication-free are proposed, and the error is analyzed. This method is similar to the distributed arithmetic approach in the sense that all DCT coefficients (not just a subset of coefficients) are computed, but with less accuracy. 1.4.3 V C A D C T One common feature in all the fast algorithms discussed above (both exact DCT and approximate DCT) is that they aim at reducing the complexity of a generic direct or inverse DCT, regardless of the input to be transformed. This is obvious in the case of exact algorithms, since these have a fixed number of computations. Even for the approximate algorithms we just discussed, any input is computed in the same way, even though obviously some blocks suffer from more error than oth ers. Therefore complexity is estimated by the number of operations which is the same for every input. In other words, all of the above algorithms lack adaptivity to the type of input. In this thesis we consider as possible operations not only typical additions/multiplications but also other types of computer instructions (for example i f , th e n , e lse ) such that additional reductions in complexity are possible, in an average sense, if the statistical characteristics of the inputs are considered. To explain the improved performance achieved with input dependent algo rithms, consider the analogy with vector quantization (VQ). In VQ, the input can be a combination of many sources with different characteristics, e.g., background area, edge area and detail area for image. It was shown in [54] that by first clas sifying the source into different classes and then applying a VQ designed for each class, the rate-distortion performance is better than having a single VQ codebook. In the same context, [55] shows that by classifying a block of image into different classes with different VQ codebook reduces the overall computational complexity on the average. In the case of DCT, we motivate the benefits of input-dependent operation by considering the sparseness of quantized coefficients in Figs. 1.4 and 1.5. Note 13 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (a) (b) Figure 1.4: Frequency of nonzero occurrence of 8x8 quantized DCT coefficients of “lenna” image at (a) 30.33 dB and (b) 36.43 dB. Figure 1.5: Frequency of nonzero occurrence of 8x8 quantized DCT coefficients of 10 frames for “Foreman” at QP=10 (a) I-frame (b) P-frame. that, as expected, the high frequency coefficients are very likely to be zero. This will also be the case for the difference images encountered in typical video coding scenarios (e.g. P and B frames in MPEG), where the percentage of zero coefficients is likely to be even higher (see Fig. 1.5). 
Therefore, for the IDCT, it is straightforward to check the content of an input block that has already been transformed and quantized. Taking advantage of the sparseness of the quantized DCT coefficients is easy to do and is a widely used technique in numerous software implementations of the IDCT available in public-domain software, such as the JPEG implementation by the Independent JPEG Group [46], the MPEG implementation by U.C. Berkeley [56], vic, the UCB/LBL Internet video conferencing tool [57], or TMN's H.263 codec [58]. All these implementations exploit the sparseness to speed up image decoding by checking for all-zero rows and columns in the block to be decoded, since these sets do not require a transform to be performed. Checking for DC-only rows and columns is also useful, since the transform result is then simply the constant scaled DC vector. In all these approaches there is a trade-off: additional logic is required to detect the all-zero rows and columns, so the worst-case decoding is slower than if no tests were performed. However, the speed-up for the numerous blocks having many zeros more than makes up for the difference and, on average, these simple schemes achieve faster decoding for "typical" images. This simple all-zero test makes the IDCT algorithm a VCA, since the complexity of the IDCT for each input block depends on the class of that block, and the class is determined by the number of zeros in the block.

Another example of a VCA is that of Froitzheim and Wolf [59] (FW), which formally addresses the problem of minimizing the IDCT complexity in an input-dependent manner by deciding, for a given block, whether to perform the IDCT along the rows or the columns first. Different blocks have different characteristics, and for some blocks one of the two approaches, row first or column first, may be significantly faster. A more sophisticated classification of inputs for the 2-D IDCT is proposed in [60] for software MPEG-2 decoders with multimedia instructions. The input blocks are classified into 4 classes: (1) blocks with only a DC coefficient, (2) blocks with only one nonzero AC coefficient, (3) blocks with only 4x4 low-frequency components, and (4) none of the above. The first 3 classes are associated with reduced algorithms that require fewer operations than the algorithm for class (4) (which uses the baseline DCT algorithm).

For the forward DCT, an example of block-wise classification can be found in [61], where each block is tested prior to the DCT to determine whether its DCT coefficients will be quantized to zero or not. Given the values of the input pixels x(i,j), the proposed test determines whether the sum of absolute values exceeds a threshold that depends on the quantization and on the confidence region. However, this work has the limitation of assuming that a single quantizer is used for all the DCT coefficients (thus it is better suited for interframe coding scenarios), and it can classify only all-zero DCT blocks.

1.4.4 Computationally Scalable DCT

Algorithms in which the complexity can be scaled depending on the level of accuracy, or the distortion, can be called scalable complexity algorithms.
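The all-zero / DC-only row test can be sketched as follows. This is a minimal illustration, not the code of any of the cited decoders: idct_1d() stands for any exact 8-point 1-D IDCT, and the DC scaling factor of the chosen transform is deliberately left out of the sketch.

    /* Row pass of a row-column 2-D IDCT with the all-zero / DC-only shortcuts.
       coef holds the dequantized coefficients of one 8x8 block, row-major. */
    static void idct_rows(int coef[8][8], void (*idct_1d)(int *row))
    {
        for (int r = 0; r < 8; r++) {
            int nonzero_ac = 0;
            for (int c = 1; c < 8; c++)
                if (coef[r][c] != 0) { nonzero_ac = 1; break; }

            if (!nonzero_ac) {
                if (coef[r][0] == 0)
                    continue;                 /* all-zero row: nothing to do   */
                int dc = coef[r][0];          /* DC-only row: constant output, */
                for (int c = 0; c < 8; c++)   /* up to the omitted DC scaling  */
                    coef[r][c] = dc;
                continue;
            }
            idct_1d(coef[r]);                 /* general row: full 1-D IDCT    */
        }
    }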
Therefore, scalable complexity algorithms are also approximate algorithms. In the context of the IDCT, scalability can be achieved by introducing distortion at either the decoding or the encoding side. The decoder can perform an approximate IDCT operation and obtain a lower quality reconstruction, or the encoder can assign coarser quantization parameters so that more DCT coefficients are quantized to zero, thereby enabling faster decoding. Thus, in general, low quality images lead to low complexity decoding. In the case of predictive video coding, encoder-based scalability is preferred in order to maintain synchronization between the encoder and decoder. In [62], optimal quantization assignment at the encoder is studied to obtain minimal distortion for a given decoding time budget.

For the DCT, the mechanism to control the complexity is to adjust the accuracy of the transform, which in turn is reflected in a degradation of the coding performance. One example of this approach is Girod's [63], in which a DCT block is approximated using only the DC and the first two AC components. The approximation error is then obtained from the sum of absolute differences between the original block and the block reconstructed using only these 3 DCT coefficients. This error is compared with a threshold that is a function of the quantization parameter and a desired level of accuracy. Pao and Sun [64] proposed a statistical sum of absolute values testing (SSAVT), which classifies the DCT block into several reduced output classes with a controllable degree of confidence.

1.4.5 Thesis Contributions

In Chapter 2, we study the benefits of input-dependent algorithms for the IDCT, where the average computation time is minimized by taking advantage of the sparseness of the input data. We show how to construct several IDCT algorithms and how, for a given input and a correct model of the complexity of the various operations, we can achieve the fastest average performance. Since the decoding speed depends on the number of zeros in the input, we then present a formulation that enables the encoder to optimize its quantizer selection so as to meet a prescribed "decoding time budget". This leads to a complexity-distortion optimization technique that is analogous to well-known techniques for rate-distortion optimization. In our experiments we demonstrate significant reductions in decoding time. As an extension of this work, we address the generalized quadtree optimization framework proposed in [21] by taking the complexity budget into account and using a VCA IDCT to assess the complexity cost. We thus obtain a complete rate-complexity-distortion tradeoff, in the sense that not only the quantization parameter but also the block size is optimally selected for the best rate-complexity-distortion performance. The work in this chapter was published in part in [65] and [62].

In Chapter 3, we propose two classes of algorithms to compute the forward DCT. The first is a variable complexity algorithm whose basic goal is to avoid computing those DCT coefficients that will be quantized to zero. The second is an algorithm that approximates the DCT coefficients without using floating point multiplications; the accuracy of the approximation depends on the quantization level.
These algorithms exploit the fact that for compression applications (i) most of the energy is concentrated in a few DCT coefficients and (ii) as the quantization step-size increases, an increasing number of coefficients is set to zero, so that reduced-precision computation of the DCT may be tolerable. We provide an error analysis for the approximate DCT compared with the SSAVT DCT [64]. We also propose 3 hybrid algorithms in which the SSAVT, approximate DCT and VCA approaches are combined. This work was published in part in [66].

1.5 Motion Estimation

Motion estimation (ME) is an essential part of well-known video compression standards such as MPEG-1/2 [2] and H.261/263 [3]. It is an efficient tool for video compression that exploits the temporal correlation between adjacent frames in a video sequence. However, the coding gain comes at the price of increased encoder complexity for the motion vector (MV) search. Typically ME is performed on macroblocks (i.e., blocks of 16 x 16 pixels) and its goal is to find a vector pointing to a region in the previously reconstructed frame (the reference frame) that best matches the current macroblock (refer to Fig. 1.6). The most frequently used criterion to determine the best match between blocks is the sum of absolute differences (SAD).

Figure 1.6: Motion estimation of the i-th block of frame t, predicted from the best block in the search region Γ in frame t-1.

1.5.1 Example: Conventional Exhaustive Search

We start by introducing the notation that will be used throughout the thesis (refer to Table 1.3).

Table 1.3: Notation Table
I_t(n_x, n_y): intensity level of pixel (n_x, n_y), relative to the upper-left-corner pixel of the macroblock
B: the set of pixels constituting a macroblock
β: a subset of B
mv = (mv_x, mv_y): a candidate motion vector
Γ = {mv}: the set of allowable mv in a search region, e.g., mv_x, mv_y ∈ {-16, -15.5, ..., 15, 15.5}
γ ⊂ Γ: the set of mv actually tested for a given search scheme

Let us consider the i-th macroblock in frame t. For a given macroblock and candidate motion vector mv, let the sum of absolute differences matching metric be denoted SAD(mv, β), where

SAD(mv, \beta) = \sum_{(n_x,n_y) \in \beta} | I_t(n_x, n_y) - I_{t-1}(n_x + mv_x, n_y + mv_y) |,   (1.2)

and where β is a subset of the pixels in the macroblock. This notation allows us to represent the standard SAD metric based on the set B of all pixels in a macroblock, as well as partial SAD metrics based on pixel subsets β. A ME algorithm returns as output the best vector for the given search region and metric, MV*(γ, β), i.e., the vector among those in the search region γ ⊂ Γ that minimizes the SAD computed with the pixels in β ⊂ B,

MV^*(\gamma, \beta) = \arg\min_{mv \in \gamma} SAD(mv, \beta).

In the literature, a search scheme is said to provide an optimal solution if it produces MV*(Γ, B), i.e., the same result as searching over all possible mv in the search region (Γ) and using a metric based on all pixels in the macroblock (B). MV*(Γ, B) can typically be found using an exhaustive full search (ES). In this thesis, we will call "exhaustive" any search such that γ = Γ, regardless of the particular β chosen.
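As a concrete reference point, the SAD of (1.2) over the full pixel set B can be sketched as below. This is an illustrative routine, assuming integer-pel vectors, a 16x16 macroblock and candidates that stay inside the frame; the names and the interface are ours.

    #include <stdlib.h>

    #define MB 16   /* macroblock size assumed here */

    /* SAD of one candidate vector (mvx, mvy) over the full set B of a 16x16
       macroblock.  cur and ref are the luminance planes of frames t and t-1,
       width is their line stride, (bx, by) the macroblock's upper-left corner. */
    int sad_full(const unsigned char *cur, const unsigned char *ref,
                 int width, int bx, int by, int mvx, int mvy)
    {
        int sad = 0;
        for (int ny = 0; ny < MB; ny++)
            for (int nx = 0; nx < MB; nx++)
                sad += abs((int)cur[(by + ny) * width + (bx + nx)] -
                           (int)ref[(by + mvy + ny) * width + (bx + mvx + nx)]);
        return sad;
    }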
In general, the motion search is performed by computing the SAD of all the vectors in the search region sequentially (following a certain order, such as a raster scan or an outward spiral), one vector at a time. (Note that, since our approach is the same for all macroblocks in all motion-compensated frames, we do not write the macroblock and frame indices i and t explicitly unless necessary.) For each vector, its SAD is compared with the SAD of the "best found-so-far" vector. Without loss of generality, let us assume that we are considering the i-th candidate vector in the sequence, mv_i for i = 1, ..., |Γ|, and that we use B for the SAD computation. We define the "best found-so-far" SAD as

SAD_{bsf}(\gamma_i, B) = \min_{mv \in \gamma_i} SAD(mv, B),

where γ_i = ∪_{j=1}^{i} {mv_j} ⊂ Γ is the set of vectors that have already been tested up to mv_i, and the associated "best found-so-far" vector is denoted MV_bsf(γ_i, B). Note that when all vectors in the search region have been tested, MV*(Γ, B) is equal to MV_bsf(Γ, B). To complete the encoding process, MV* is transmitted. The residue block, which is the difference between the motion-estimated block and the current block, is transformed, quantized, entropy coded and then sent to the decoder, where the process is reversed to obtain the reconstructed images.

1.5.2 Fast Search vs. Fast Matching

The computational complexity of the motion search is a major concern for block-based video encoding systems with limited computational resources. We now provide a quick overview of fast ME techniques. Our goal is to give a rough classification of the various strategies that have been used to reduce complexity, and also to classify and clarify the novelty of the algorithms that will be introduced in Chapters 4 and 5.

The total complexity of the ME process depends on (i) the number of candidate vectors in the search region, Γ, and (ii) the cost of the metric computation performed for each candidate (e.g., computing a SAD based on the full set B is more costly than using a subset β ⊂ B). Thus, fast ME techniques are based on reducing the number of candidates to be searched (fast search) and/or the cost of the matching metric computation (fast matching).

Fast search (FS)

In order to improve the efficiency of the search, fast ME algorithms can restrict the search to a subset of vectors γ ⊂ Γ. This subset of vectors can be predetermined and fixed, as in [67], or it can vary as dictated by the specific search strategy and the characteristics of the macroblock. Examples of the latter case are 2-D log search [68], conjugate directions and one-at-a-time search [69], new three step search [70], gradient descent search [71], and center-biased diamond search [72], which all exploit in various ways the assumption that the matching difference increases monotonically as a candidate vector moves farther away from the desired global minimum. For example, 2-D log search starts from a small set of vectors uniformly distributed across the search region and moves on to the next set, more densely clustered around the best vector from the previous step (if there is a change in direction; otherwise, the next set keeps the same spacing). A good initial point can also be used to reduce the risk of being trapped in local minima.
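The sequential search with best-found-so-far bookkeeping described above can be sketched as follows, reusing the sad_full() routine from the previous sketch; again the search region handling (integer-pel, in-frame candidates) is an illustrative simplification.

    #include <limits.h>

    int sad_full(const unsigned char *cur, const unsigned char *ref,
                 int width, int bx, int by, int mvx, int mvy);  /* previous sketch */

    /* Exhaustive search over a [-R, R-1]^2 region: every candidate is visited
       sequentially and the best-found-so-far SAD and vector are updated. */
    static void full_search(const unsigned char *cur, const unsigned char *ref,
                            int width, int bx, int by, int R,
                            int *best_mvx, int *best_mvy)
    {
        int sad_bsf = INT_MAX;
        for (int mvy = -R; mvy < R; mvy++) {
            for (int mvx = -R; mvx < R; mvx++) {
                int sad = sad_full(cur, ref, width, bx, by, mvx, mvy);
                if (sad < sad_bsf) {      /* update the best-found-so-far vector */
                    sad_bsf = sad;
                    *best_mvx = mvx;
                    *best_mvy = mvy;
                }
            }
        }
    }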
Approaches to finding a good initial point include hierarchical and multiresolution techniques [73, 74, 75, 76, 77]. Another successful class of techniques seeks to exploit the correlations in the motion field: e.g., the MVs of spatially and temporally neighboring blocks can be used to initialize the search, as in [4] (referred to as the ST1 algorithm) and [78]. The ST1 algorithm employs the spatial and temporal correlation of the motion vectors of adjacent macroblocks. It starts with the best of the candidate motion vectors taken from a set of spatially and temporally neighboring macroblocks, when available (Fig. 1.7 (a)). It then performs a local refinement on a small 3x3 search window until it reaches the minimum point (Fig. 1.7 (b)). In general, the ST1 algorithm achieves a higher speed-up than 2-D log search, with also lower overall residue energy.

Figure 1.7: Summary of the ST1 algorithm, where (a) and (b) depict spatial and spatio-temporal candidate selection, respectively. Highlighted blocks correspond to the same spatial location in different frames. (c) illustrates the local refinement starting from the initial motion vector found to be the best among the candidates.

Fast matching (FM)

Another approach for fast ME, which can also be combined with an FS technique, consists of devising matching criteria that require less computation than the conventional sum of absolute differences (SAD) or mean square error (MSE). One example of this approach consists of computing a partial metric, e.g., the SAD based on β ⊂ B [67]. Of particular relevance to our work are the partial distance search techniques, which have also been proposed in the context of VQ [79, 54]. In a partial distance approach the matching metric is computed on successively larger subsets of B, but the computation is stopped if the partial metric thus computed is found to be greater than the total metric of the "best found-so-far" vector. For example, if SAD(mv_i, β ⊂ B) > SAD_bsf(γ_{i-1}, B), there is no need to complete the metric computation and calculate SAD(mv_i, B). Many implementations of FS algorithms include this partial distance technique to speed up their metric computation. Other early termination criteria have been proposed in [80].

Alternatively, matching metrics other than SAD or MSE can be used. For example, in [81], adaptive pixel truncation is used to reduce the power consumed. In [82], the original (8-bit) pixels are bandpass filtered and edges are extracted, the final result being a binary bit-map that is used for matching. Other approaches include hierarchical feature matching [83], normalized minimum correlation techniques [84], and a minimax matching criterion [85].

1.5.3 Fixed vs. Variable Complexity

We can also classify ME techniques into fixed complexity algorithms (FCA) and variable complexity algorithms (VCA). The complexity of an FCA is input-independent and remains constant (e.g., a ME technique with fixed β and γ), while in this work we consider VCAs, where the complexity is input-dependent (e.g., β and γ are different for each macroblock and/or frame). The goal when designing a VCA is then to achieve low complexity in the average case. Thus, we expect the worst-case complexity of a VCA to be higher than that of a comparable FCA, but hope that, on average, the VCA will have lower complexity.
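A minimal sketch of the partial distance idea, under the same illustrative assumptions as the SAD routines above: the SAD is accumulated one 16-pixel row at a time and the computation is abandoned as soon as the partial sum already exceeds the best-found-so-far SAD.

    #include <stdlib.h>

    /* Partial distance SAD: returns early when the running sum reaches sad_bsf,
       since this candidate can then no longer beat the best-found-so-far vector. */
    static int sad_partial_distance(const unsigned char *cur, const unsigned char *ref,
                                    int width, int bx, int by, int mvx, int mvy,
                                    int sad_bsf)
    {
        int sad = 0;
        for (int ny = 0; ny < 16; ny++) {
            for (int nx = 0; nx < 16; nx++)
                sad += abs((int)cur[(by + ny) * width + (bx + nx)] -
                           (int)ref[(by + mvy + ny) * width + (bx + mvx + nx)]);
            if (sad >= sad_bsf)
                return sad;   /* early termination after this row's partial sum */
        }
        return sad;
    }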
In practice, this is done by making reasonable, though typically qualitative, assumptions about the characteristics of typical sequences. For example, consider the algorithm of [4], which, as indicated earlier, exploits the correlations in the motion field. For this algorithm, performing ME in a scene with smooth motion (e.g., a panning scene) tends to require less complexity (and to come closer to the optimal ES result) than finding the motion field for a scene with less correlated motion (e.g., a scene with several independently moving objects). Thus, such an algorithm provides good average-case performance under the assumption that typical video sequences have predominantly smooth motion. For similar reasons, the algorithms in [68, 69, 70, 72] perform well for sequences with small motion.

A second example of a VCA can be found in the partial distance approach discussed earlier. The underlying assumption here is that the distribution of SADs for typical blocks has large variance, with few vectors having a SAD close to the minimum (i.e., the SAD of the optimal vector). Thus, on average one can expect to eliminate many bad candidate vectors early (those having a large metric) and therefore to achieve a reduction in overall complexity. Once again this makes an implicit assumption about the statistical characteristics of these matching metrics for typical blocks. In this thesis we argue that substantial gains can be achieved by making these assumptions explicit, and therefore our probabilistic stopping criterion for the metric computation will be based on explicit statistical models of the distribution of SAD and partial SAD values (see Chapter 4).

1.5.4 Computationally Scalable Algorithms

Finally, we consider the computational scalability property, which is desirable in many applications (e.g., to operate the same algorithm on different platforms, or to run at various speeds on the same platform). Computational scalability allows speed to be traded off against performance (e.g., the energy of the prediction residue in the case of ME). There has been some recent interest in computational scalability in the context of video coding in general and ME in particular. For example, [86] addresses computationally constrained motion estimation, where the number of vectors to be searched (the size of γ) is determined by a complexity constraint based on a Lagrangian approach. This technique adopts an idea similar to that in [87], but using complexity rather than rate as the constraint.

1.5.5 Thesis Contributions

In this thesis, we focus on FM approaches based on the partial distance technique. In Chapter 4 we propose a novel fast matching algorithm to speed up the computation of the matching metric, e.g., the sum of absolute differences (SAD), used in the search. Our algorithm is based on a partial distance technique in which the reduction in complexity is obtained by terminating the SAD calculation once it becomes clear that the SAD is likely to exceed that of the best candidate so far. This is achieved by using a hypothesis testing framework, such that we can terminate the SAD calculation early at the risk of missing the best match vector.
Furthermore, we discuss how the test structure can be optimized for a given set of statistics, so that unnecessary tests can be pruned. This work was first introduced in [88] and further refined in [89].

It should be emphasized that the FM techniques we propose can be applied along with any FS strategy and with any other additive metric such as MSE. We also note that, while our experimental results are provided for a software implementation, focusing on FM approaches may also be attractive in a hardware environment. For example, from a hardware architecture point of view, some FS designs have the drawback of possessing a non-regular data structure, given that the blocks that have to be searched in the previous frame depend on the selection of the initial point; the set of candidates considered thus varies from macroblock to macroblock. Conversely, ES algorithms have the advantage of operating on a fixed search pattern (which could also facilitate parallelizing the algorithm). In general, FS algorithms such as that in [4] will have to be modified for hardware implementation, with one of the main goals being to minimize the overhead, even though it is relatively small, due to the non-regularity of the algorithm. As an alternative, if the goal is an efficient hardware design, one may choose to design an efficient FM approach (e.g., [81, 90, 91, 92, 93]) and combine it with a simple search technique such as ES.

In Chapter 5, we present a new class of fast motion estimation techniques that combine both fast search and fast matching based on partial distances. In these algorithms, the computation of the matching metric is done in parallel for all candidates. Unlike other fast search techniques that eliminate candidates based on their spatial location, this technique uses only the partial distance to eliminate candidates, thus increasing the chance of discovering an isolated global minimum. We also propose two multiresolution algorithms based on this technique to help speed up the computation and achieve performance comparable to other fast search techniques.

1.6 Laplacian Model

As mentioned in Section 1.3, the optimization of VCAs has to take into account the statistics of the input. There are basically two approaches to acquiring those statistics, namely parametric and non-parametric. In the non-parametric case, the probability of an event of interest is obtained directly from the data. In the parametric case, a model of the image is assumed and characterized by a set of parameters, and the statistics are then obtained from the parameterized model. In Chapters 2 and 3, we will use both approaches to evaluate the performance of the proposed algorithms.

In this section, we describe the image modelling assumption used throughout this thesis. Similar to [22], we model each DCT coefficient in a 2-D block as an independent Laplacian random variable, i.e., the p.d.f. of X(u,v) can be written as

f_{X(u,v)}(x) = \frac{\lambda_{(u,v)}}{2} e^{-\lambda_{(u,v)} |x|},   (1.3)

where λ_{(u,v)} is the Laplacian parameter of X(u,v), the DCT coefficient at position (u,v).

In a VCA, the complexity savings mostly come from DCT coefficients quantized to zero. Therefore, given that uniform quantization with stepsize 2QP (QP is a quantization parameter) and a deadzone in the region (-2QP, 2QP) is used for all DCT coefficients, the probability of X(u,v) being quantized to zero can be written as
p_z(u,v) = \Pr\{ \lfloor |X(u,v)|/2QP \rfloor = 0 \} = \Pr\{ |X(u,v)| < 2QP \} = 1 - e^{-2\lambda_{(u,v)} QP}.   (1.4)

Furthermore, in the case of residue frames, the model parameter λ_{(u,v)} can be obtained directly in the spatial domain. From [64], it has been observed that the correlation between pixels in residue frames can be approximated as separable horizontally and vertically. Thus the correlation can be expressed as r(m,n) = σ² ρ^{|m|} ρ^{|n|}, where m and n are the horizontal and vertical displacements, ρ is the correlation coefficient, and σ² is the pixel variance. From our observations on five H.263 test sequences ("Miss America", "Suzie", "Mother&Daughter", "Foreman" and "Salesman"), the average correlation coefficient ranges from 0.92 to 0.97; in our simulations we use the value 0.9. Let the correlation matrix be denoted by R and written as

R = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\ \vdots & & & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1 \end{bmatrix}.

Therefore, from [94], the variance of the DCT coefficients can be derived as

\sigma^2_{X(u,v)} = \sigma^2 [D_N R D_N^T]_{(u,u)} [D_N R D_N^T]_{(v,v)} = \sigma^2 \Gamma_N^2(u,v),   (1.5)

where D_N is again the DCT matrix of size N, and the scaling factor Γ_N^2(u,v) is a function of N and (u,v). For example, for a DCT of size 8 and ρ = 0.9,

[\Gamma_8^2(u,v)] = \begin{bmatrix} 38.2606 & 6.2219 & 2.1408 & \cdots & 0.3383 \\ 6.2219 & 1.0118 & 0.3481 & \cdots & 0.0550 \\ 2.1408 & 0.3481 & 0.1198 & \cdots & 0.0189 \\ \vdots & & & \ddots & \vdots \\ 0.3383 & 0.0550 & 0.0189 & \cdots & 0.0030 \end{bmatrix},   (1.6)

where [Γ_8^2(u,v)] represents the square matrix with elements Γ_8^2(u,v). Therefore, we can find the probability of a zero quantized DCT coefficient from the variance in the pixel domain as

p_z(u,v) = 1 - e^{-2\sqrt{2}\,QP / (\Gamma_N(u,v)\,\sigma)}.   (1.7)

Rate and Distortion

For the sake of completeness, we finally give the rate and distortion of this Laplacian DCT source under the uniform quantization presented earlier in this section. The probability of a DCT coefficient falling in the bin (2QPi, 2QP(i+1)) can be expressed as

P_i = \int_{2QPi}^{2QP(i+1)} f_{X(u,v)}(x)\,dx.

The rate can be approximated by the entropy

R(u,v) = -\sum_i P_i \log_2 P_i = e^{-2\lambda QP}\Big(1 + \frac{2\tilde{\lambda} QP}{1 - e^{-2\lambda QP}}\Big) - \log_2\big(1 - e^{-2\lambda QP}\big),   (1.8)

and therefore the total block rate is

R_{blk} = \sum_{(u,v)} R(u,v),   (1.9)

where \tilde{\lambda} = \lambda \log_2 e and \lambda = \sqrt{2} / (\sigma \Gamma_N(u,v)).

For midpoint reconstruction in each quantization bin, the distortion can also be derived as

D(u,v) = 2\sum_{i \ge 1} \int_{2QPi}^{2QP(i+1)} \big(x - QP(2i+1)\big)^2 f_{X(u,v)}(x)\,dx + \int_{-2QP}^{2QP} x^2 f_{X(u,v)}(x)\,dx
       = \sigma^2 \Gamma_N^2(u,v) - 3 e^{-2\lambda QP} QP^2 - \frac{2\,QP\, e^{-2\lambda QP} (3 - e^{-2\lambda QP})}{\lambda (1 - e^{-2\lambda QP})},   (1.10)

and the total block MSE is

\Delta_{T} = \sum_{(u,v)} D(u,v).   (1.11)

Figure 1.8: Rate-distortion using the standard-defined DCT ('dashed') and R-D for SSAVT ('solid') for σ² = 100 with varying QP, for 3 block sizes: 4x4, 8x8 and 16x16. The rate and distortion are normalized to a per-pixel basis.

The R-D curves obtained from (1.9) and (1.11) are shown in Fig. 1.8 for different block sizes. It can be seen that larger block sizes give better coding efficiency.

Chapter 2

Inverse Discrete Cosine Transform

In this chapter, we propose fast VCA algorithms for the IDCT. We treat the IDCT before the DCT because it is conceptually easier to explain the problem in that case and the VCA framework can be more easily introduced.
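As a small numerical sketch of (1.5) and of the zero-probability expression above, the routine below builds the separable correlation model with ρ = 0.9, computes Γ_N²(u,v) for the orthonormal 8-point DCT matrix, and evaluates p_z(u,v) for a given QP and pixel standard deviation. The function names are ours, and the tables are recomputed on every call purely for brevity.

    #include <math.h>

    #define N 8
    static const double PI  = 3.14159265358979323846;
    static const double RHO = 0.9;

    static void dct_matrix(double D[N][N])
    {
        for (int u = 0; u < N; u++)
            for (int n = 0; n < N; n++)
                D[u][n] = (u == 0 ? sqrt(1.0 / N) : sqrt(2.0 / N)) *
                          cos((2 * n + 1) * u * PI / (2.0 * N));
    }

    /* Gamma_N^2(u,v) = [D R D^T]_(u,u) [D R D^T]_(v,v), cf. (1.5) */
    static double gamma2(int u, int v)
    {
        double D[N][N], R[N][N], T[N][N], DRD[N][N];
        dct_matrix(D);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                R[i][j] = pow(RHO, (i > j) ? (i - j) : (j - i));
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                T[i][j] = 0.0;
                for (int k = 0; k < N; k++) T[i][j] += D[i][k] * R[k][j];
            }
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                DRD[i][j] = 0.0;
                for (int k = 0; k < N; k++) DRD[i][j] += T[i][k] * D[j][k];
            }
        return DRD[u][u] * DRD[v][v];
    }

    /* Probability that coefficient (u,v) is quantized to zero, cf. (1.4), (1.7) */
    static double p_zero(int u, int v, double qp, double sigma)
    {
        double lambda = sqrt(2.0) / (sigma * sqrt(gamma2(u, v)));
        return 1.0 - exp(-2.0 * lambda * qp);
    }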
For example, if information about the positions of the zero DCT coefficients is available, some of the operations involving those zero coefficients can be pruned. The key idea of the VCA IDCT is based on this concept: knowledge of the positions of the zero coefficients is used to prune the IDCT algorithm. For typical images, this classification should intuitively give a significant complexity reduction, since the high-frequency DCT components tend to be quantized to zero (see Figs. 1.4 and 1.5). The speedup can be even more significant for lower quality images or in the P-frames of video sequences. We propose 2 novel classification schemes, namely sequential and tree-structured classification (TSC), and we discuss the design goal for optimal classification for both schemes: the best test structure is selected to achieve minimal average complexity for a given input statistics. For these two classifications, we consider only a 1-D IDCT for simplicity; however, the algorithms can be easily extended to the separable 2-D transform. Next we consider a heuristic 2-D dyadic classification, similar to [60], as an alternative for faster but coarser classification. This method can be combined with the other 1-D VCA IDCTs mentioned above. Some experimental results on different image model parameters and on real image and video data are also shown.

The distortion/decoding time tradeoff problem is then posed. The goal is to minimize the distortion subject to a given decoding time budget, under the assumption that the IDCT at the decoder is a VCA. The problem is solved using Lagrange multiplier techniques that provide the optimal quantization assignment for each block of the image, i.e., the assignment that minimizes distortion under the given decoding time budget. In addition, a generalized rate-complexity-distortion (R-C-D) optimization framework using a quadtree structure is proposed. Besides the quantization parameter, we allow the encoder to select the quadtree structure as an additional degree of freedom in the R-C-D tradeoffs.

2.1 Formalization of IDCT VCA

Our goal is to define a VCA version of the IDCT that can achieve minimal average complexity. We first select a baseline algorithm (which gives an exact IDCT for any input). We have chosen the algorithm used in [58] (similar to Fig. 1.3(b)) as our baseline algorithm because it requires the minimum number of operations. This algorithm uses fixed point arithmetic while still keeping enough accuracy to be compliant with the JPEG standard. The number of operations when applying this 1-D IDCT to an 8x8 block of a 2-D image along the rows and then the columns is as follows: 176 multiplications (11 multiplications per 1-D IDCT after factoring out the division by 2√2 to the last stage), 536 additions and 240 bitwise shift operations (which replace the divisions by 8 and are used to maintain accuracy).

This baseline algorithm is then pruned according to the classification information available about the inputs. We refer to the pruned algorithms as "reduced IDCT" (RIDCT) algorithms. Each RIDCT is designed for a given class so that it uses the minimal number of operations for that class. The combination of classification and RIDCTs is thus a VCA.

The accuracy of the information obtained from classification is also a factor determining the minimal operation counts: a coarse classification will not provide exact information about whether each coefficient is zero, and thus will limit the number of operations that can be pruned. In order to obtain precise information, finer classifications are generally required.
The accuracy of information obtained from classification is also a factor to determine the minimal operation numbers, e.g., a coarse classification will not 30 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. provide exact information about whether each coefficient is zero, and thus will information, finer classifications are generally required. Therefore, the goal of our formalization is to explore the tradeoffs between classification accuracy and classification cost overhead. This is achieved in the average sense by taking the cost of classification and the statistics of frequency of occurrence of each class into account for a given image or set of images. 2.1.1 Input C lassification We define a class of inputs to be a set of input vectors having zero coefficients at certain positions. Let X j be the j -th DCT coefficient in the 1-D input DCT vector and z* be a vector representing the i-th class of input, elements in the z* vector represent zero positions in the input vector, i.e., It is clear that a larger number of zeros results in potentially faster IDCT implementations, since operations such as multiplication by zero and sum with a zero value need not be performed. For example, with a size-8 input vector, an all zero input (i.e., class Z[00000000]) would require no operations while a class Z [ l l l l l l l l ] input would require the maximum number of operations. However, for each input class one can find many RIDCT algorithms which give the exact IDCT output, e.g., one can apply RIDCT of Z [ l l l l l l l l ] to Z[00000000] or any other classes and still get the correct IDCT results. Let Ai = {Aj}j=\ be the set of RIDCTs that provide exact IDCT computation for class z*. There are J(i) RIDCT algorithms for class z*. The one with minimal complexity is defined as where c(Aj) is complexity of Aj. Note here th at the complexity is measured in a unit of interest, which is not restricted to be an arithm etic complexity measure. The most appropriate unit of complexity which can be used directly to judge limit the number of operations that can be pruned. In order to obtain precise where i 0 if Xj = 0 1 if Xj is unknown (2 .1) A* = arg min cfAA 1 AjeAi v J 31 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the performance of a system is the elapsed time or the CPU clock cycle spent in the process. For example, the classification, even though it may not involve any arithmetic operations, does require other types of instructions, and thus has a certain complexity cost. (In this work, we will use “cost” or “complexity” instead of “complexity cost” .) In order to further motivate the concept of classification, let us denote a prim itive class in which all components are known as si.e., Si is the input bitm ap of the z'-th primitive class. Note that the main difference between classes st - and z* is that the former assumes that some coefficients are non-zero, while in the latter those coefficients were not known to be zero or non zero. Note also that the class zz - consists of primitive classes that share the same information about the zero of certain DCT coefficients (j, when ij = 0). The cardinality of the set depends on the number of untested coefficients (j, when ij = 1). Note here th at beside zero-nonzero based classification, one can also classify the input differently, e.g., an input with coefficients value 1 or -1 which requires no operations for multiplication, etc. 
However, for simplicity we consider only zero-nonzero based classification.

The input bitmap can be obtained as a by-product of entropy decoding. Specifically, when zero-run length coding is employed (as in JPEG and MPEG), this bitmap information is computationally cheap to obtain. However, the mapping between an input bitmap and a corresponding reduced algorithm (the classification) is not so straightforward. For better visualization, we assume that the classification and the reduced algorithms are two separate operations. For example, for a DCT vector of size 8, there are 256 possible primitive classes. Brute-force mapping of the bitmap to a corresponding reduced algorithm would require a large memory storage for 256 possible RIDCTs. In the case where we have a smaller number of RIDCTs due to memory space limitations, the problem of RIDCT selection needs to be studied. In our implementation we remove the storage dependency by considering a single algorithm consisting of operations that are conditionally pruned by testing their corresponding inputs. Therefore, the memory requirement is no longer an issue, but the designs of the classification and of the RIDCTs become coupled.

At this point, we can generalize the definition of "class" in (2.1) to be any set of primitive classes. Therefore, starting with the space of all possible primitive classes, we can partition this space into groups of classes. Let us denote one such partition by g ∈ G, where G is the set of all possible partitions of the set of primitive classes. Let N_g be the number of resulting classes in g, and let g(i), i = 0, ..., N_g - 1, represent the classes of input which are members of g. Each class g(i) has an associated minimal complexity RIDCT A*_{g(i)}. In order to formulate the minimal complexity VCA problem, the classification cost has to be taken into account. Let S_g be a classification structure for g, S_g ∈ S_g, where S_g is the set of all possible testing structures resulting in partition g, and S_g = {S_g(0), ..., S_g(N_g - 1)}, where S_g(j) denotes the tests used to classify class g(j). Therefore, our goal can be formalized as

Formulation 1 (Minimal Complexity VCA) The goal is to find a partition g* and a corresponding classification structure S* that give the best tradeoff between the complexity reduction obtained by the RIDCTs and the complexity of the classification itself. In other words, we want to find g, S_g such that the average complexity T is minimized,

T^* = \min_{g \in \mathcal{G}} \min_{S_g \in \mathcal{S}_g} \sum_{i=0}^{N_g - 1} \big( c(S_g(i)) + c(A^*_{g(i)}) \big) \cdot P_{g(i)},   (2.3)

where P_{g(i)} is the probability that an input in class g(i) occurs (this probability can be measured from typical training image data or obtained from a model of typical images), and c(S_g(i)) is the cost of the operations necessary to perform the test S_g(i).

This involves two minimizations, over (i) the possible partitions G and (ii) the test structure S_g. It can also be viewed as a tradeoff between the fineness of the classification and the classification cost. The more precise the classes are (smaller groups), the more compact the corresponding RIDCTs, i.e., Σ_{j∈g(i)} P_j · c(A_j*) ≤ P_{g(i)} · c(A*_{g(i)}), since P_{g(i)} = Σ_{j∈g(i)} P_j and under the assumption that c(A_j*) ≤ c(A*_{g(i)}) for all j ∈ g(i). However, the classification cost is then higher, i.e., Σ_{j∈g(i)} P_j · c(S_j) ≥ P_{g(i)} · c(S_g(i)). Without any further restrictions, the solution of this optimization problem is very difficult to find.
In the next section, we impose constraints on both the partition and the test structure so that the search space is limited to one of manageable size. We restrict ourselves to tests that seem intuitively reasonable given the baseline algorithm.

2.2 Proposed VCA for IDCT

In this section, we discuss three classification methods, namely (i) sequential, (ii) tree-structured and (iii) 2-D dyadic classification. For the first two methods we also address the optimization of the classification structure to obtain the minimal average complexity. For dyadic classification, we use a heuristic test structure based on experimental observations of typical image and video data.

All of these classification techniques employ zero masks to test for a certain class. In zero mask testing (ZMT), the bitmap of the input vector is tested via a bitwise "AND" operation with a mask, denoted m_i for the i-th mask,

m_i = M[i_j]_{j=0}^{7}, where i_j ∈ {0, 1} represents the class being tested.

Each bit of the mask represents a DCT coefficient of interest: if the bit is set to one we test for a zero coefficient, and if the bit is set to zero we do not care. The mask can thus be related to the class z_i presented earlier: if the result of the bitwise "AND" is zero, the input belongs to class z_i; otherwise, the input is not a member of class z_i.

2.2.1 Sequential Classification

Sequential classification is a scheme in which the input is zero-mask tested for one class at a time. There is no refinement of classes already tested. Therefore, a sequential classification can be expressed by a sequence of zero masks, corresponding to the order in which each class is tested. Let M = {m_0, ..., m_{N_M - 1}} be a sequence of zero masks representing a sequential classification, where N_M is the number of zero masks; there are N_M + 1 resulting classes. From the previous subsection, we know that z_i is the class tested by m_i. However, it is possible that some primitive classes in z_i are also members of a previously tested class z_j, j < i. Therefore, the actual set of primitive classes tested at step i, denoted d_i, is z_i - ∪_{j=0}^{i-1} z_j. The last class (the default class), after the last test, is d_{N_M+1} = C_N - ∪_{j=0}^{N_M-1} z_j, where C_N = ∪_i s_i is the set of all primitive classes. Without loss of generality, let us reuse the notation for the minimal complexity reduced algorithm, A_i*, for class d_i. Therefore, we can write the average complexity of the sequential classification IDCT as

\overline{T}_{seq} = \sum_{i=0}^{N_M - 1} \big( c(ZMT)\cdot(i+1) + c(A_i^*) \big) \cdot P_{d(i)} + \big( c(ZMT)\cdot N_M + c(A^*_{N_M+1}) \big) \cdot P_{d(N_M+1)},   (2.4)

where c(ZMT) is the cost of a zero mask test and P_{d(i)} is the probability of class d_i. Since the cost of ZMT is the same for every mask, the testing cost for the i-th mask is (i+1) times the ZMT cost, since i+1 ZMTs must be performed before class d_i is identified. The term outside the summation in (2.4) is the cost of the default class. Therefore, we can formalize the complexity minimization based on the sequential test as

Formulation 2 (Optimal Sequential Classification VCA) The goal is to find the optimal zero mask sequence M* such that the total average complexity is minimal. In other words, we want to minimize

T^*_{seq} = \min_{M \in \mathcal{M}} \overline{T}_{seq}(M),   (2.5)

where \mathcal{M} is the set of all possible zero mask sequences M.
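A zero mask test reduces to a single bitwise AND on the input bitmap. The sketch below uses an illustrative bit ordering (bit j set when X_j is nonzero) and is not taken from any particular implementation.

    /* Zero mask test (ZMT): returns 1 when every coefficient selected by the
       mask is zero, i.e., the input belongs to the class associated with m_i. */
    typedef unsigned char bitmap_t;

    static int zmt(bitmap_t input_bitmap, bitmap_t mask)
    {
        return (input_bitmap & mask) == 0;
    }

    /* Example: with this bit ordering, mask 0xAA = 10101010b tests whether
       X_1, X_3, X_5 and X_7 are all zero. */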
2.2.2 Greedy Optimization

The problem formulation is fairly straightforward, but obtaining the optimal solution requires an exhaustive search. One can view this exhaustive search as a tree search in which the tree grows in all possible directions. The time required can be reduced by using knowledge of the redundancy between zero masks, i.e., the fact that, as mentioned earlier, the sets z_i and z_j for given i and j may have a non-empty intersection; thus, the path to a node that corresponds to a subset of previous nodes can be omitted. Still, finding the optimal solution is time consuming, and it may not be practical when the size of the input vector is larger than 8. Finding suboptimal solutions is thus a more attractive alternative. Algorithms such as the M-algorithm [95], in which only the M best paths are kept at each iteration, can be used. In this work, we use an even simpler and faster approach than the M-algorithm: a greedy approach, in which the class that gives the maximal complexity saving is tested first, and classes with smaller computational savings are tested later, until no further savings can be achieved from the classification, i.e., until the saving from using an RIDCT is outweighed by the cost of testing. When that point is reached, the default algorithm is applied to the remaining (untested) classes.

Prior to performing the greedy optimization, the statistics of each class s_i, for all i, are assumed to be known. Let P_{s(i)} denote the probability of primitive class s_i. Let us define the complexity saving of the reduced algorithm A_i* of class z_i as c̄_i = c(A_F) - c(A_i*), where A_F represents the baseline algorithm. The greedy approach can be described as follows.

Algorithm 1 (Greedy Optimization)
Step 1: Assume known probabilities of all primitive classes, P_{s(i)}, for all i. Initialize the list of sequential tests M as an empty set. Set N_M = 0.
Step 2: Find the class i whose incremental complexity saving is maximal, i.e., i* = arg max_{m_i ∉ M} Δc_i, where Δc_i = c̄_i · P_{d(i)} - c(ZMT) · P_untested and P_untested = Pr{ C_N - ∪_{j=0}^{N_M-1} z_j }.
Step 3: If Δc_{i*} > 0, append m_{i*} as the last element of M and increment N_M by 1. Otherwise, stop.
Step 4: If N_M = 256, stop. Otherwise, go to Step 2.

The calculation of P_{d(i)} in Step 2 has to take into account the redundancy with the previous zero mask tests, by excluding the probability of the primitive classes that have already been tested.

2.2.3 Tree-Structured Classification (TSC)

In this subsection, we propose an alternative to sequential classification. Unlike sequential classification, where the tested classes are not refined or tested again, tree-structured classification (TSC) allows hierarchical testing in which tested classes can be further classified to obtain more accurate positions of the zeros, which also implies better RIDCTs. The key idea of TSC is to exploit the structure of the baseline algorithm in the classification. Consider the baseline full version of the IDCT algorithm we have chosen, shown in Fig. 1.3(b). In this IDCT algorithm there are three stages of operations, involving 2, 4 and 8 input coefficients, respectively.
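A sketch of the greedy selection loop of Algorithm 1 is given below. The array names are ours, and the probability bookkeeping after each pick (updating p_d[] and p_untested to exclude the primitive classes already covered) is only indicated by a comment.

    #define NUM_CLASSES 256

    /* save[i]  : complexity saving c_bar_i of class z_i
       p_d[i]   : probability of the as-yet-untested part of z_i
       c_zmt    : cost of one zero mask test
       order[]  : greedy mask sequence (class indices), returned value is N_M */
    static int greedy_order(const double save[NUM_CLASSES], double p_d[NUM_CLASSES],
                            double p_untested, double c_zmt, int order[NUM_CLASSES])
    {
        int n_m = 0;
        int used[NUM_CLASSES] = {0};

        for (;;) {
            int best = -1;
            double best_gain = 0.0;
            for (int i = 0; i < NUM_CLASSES; i++) {
                if (used[i]) continue;
                double gain = save[i] * p_d[i] - c_zmt * p_untested; /* Delta c_i */
                if (gain > best_gain) { best_gain = gain; best = i; }
            }
            if (best < 0)        /* no class yields a positive saving: stop */
                break;
            used[best] = 1;
            order[n_m++] = best;
            /* p_d[] and p_untested would be updated here to exclude the
               primitive classes already covered by the chosen masks */
            if (n_m == NUM_CLASSES)
                break;
        }
        return n_m;
    }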
This tree-like structure provides a natural way of classifying the data hierarchically, going from coarse to fine classification, i.e., ZMTs that test a larger group of coefficients followed by tests of smaller and smaller groups. In Fig. 1.3(b), consider the grouping first. Needless to say, when all 8 coefficients are zero, all branches up to stage 3 can be pruned. Given that 4 out of 8 coefficients are zero, the two scenarios that give the largest reductions in complexity correspond to the cases when (X_7, X_1, X_3, X_5) or (X_0, X_4, X_6, X_2) are all zero, since the branches of zero coefficients in stage 1 and stage 2 can then be pruned. For the case when 2 coefficients are zero, (X_2, X_6), (X_1, X_7), (X_3, X_5) and (X_0, X_4) give the four minimal complexity RIDCTs, since the corresponding operations of these 4 pairs in stage 1 can be pruned. When only one coefficient of one of those pairs is zero, the stage 1 operations can be partially pruned. From these, we can find, for any number of zero coefficients, the grouping with maximal complexity reduction, as a combination of the above 4-, 2- and 1-coefficient groups.

As far as the test structure is concerned, in order to achieve the above grouping with maximal efficiency regardless of the statistics of the input (i.e., assuming a uniform distribution of the input classes), groups with lower complexity RIDCTs should be tested before groups with higher complexity RIDCTs. In this case, the all-zero input class must be tested first; then the two 4-zero classes, the four 2-zero classes and finally the eight 1-zero classes are tested in succession. This hierarchical testing results in at most 15 tests before a certain class can be identified. An example of pseudo code for testing 2 coefficients at the second stage of the upper 4 branches of Figure 1.3(b) is shown in Figure 2.1.

    if (X2 == 0 and X6 == 0)
        z0 = z3 = y0
        z1 = z2 = y1
        /********* save computation for y2 and y3 *********/
    else if (X0 == 0 and X4 == 0)
        z0 = y3,  z1 = y2
        z2 = -y2, z3 = -y3
        /********* save computation for y0 and y1 *********/
    else
        z0 = y0 + y3
        z1 = y1 + y2
        z2 = y1 - y2
        z3 = y0 - y3
        /********* perform full version for this part *********/

Figure 2.1: TSC pseudo code for testing the upper 4 branches of Fig. 1.3(b) in stage 2.

This TSC is illustrated in Figs. 2.2 and 2.3. For simplicity of explanation, we denote by C_{i_0 i_1 i_2 i_3 i_4 i_5 i_6 i_7} a class of input obtained from the TSC. The classification information i_j indicates the following:
E_k, k ∈ {1, 2, ...}: at least one of the coefficients labeled with a given E_k is nonzero;
0, 1: X_j is known to be zero or nonzero, respectively;
U: X_j can be either zero or nonzero (unknown).
As an example, C_{E1 U E2 0 E1 0 E2 U} means that at least one of {X_0, X_4} and at least one of {X_2, X_6} must be nonzero, X_1 and X_7 are unknown, and X_3 and X_5 are both known to be zero. Thus, in order to perform the IDCT correctly, an RIDCT for this class must assume that all coefficients except X_3 and X_5 may be nonzero. Note that the class in this example can also be represented as Z^c[01110111] ∩ Z^c[11011101] ∩ Z[11101011], where Z^c[.] is the complement of the set Z[.]; the newly introduced notation is obviously more concise.

Fig. 2.2 illustrates the tree-structured classification of the size-8 DCT coefficients. Each node represents a class with some knowledge about the positions of the zero coefficients. The descendants of each node are the results of finer classifications, i.e., they contain more information about the zero coefficients than their parent node.
Descendents of each node are the results of finer classifications, i.e., they contain more information about zero coefficients than their parent nodes. For 38 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Cuuuuuuuu Q d o o o o o o o CElElElElElElElEl ^E iO E iO E iO E iO ^--OEiOEiOEiOE! ^-'EiE2EiE2EiE2EiE2 Figure 2.2: TSC diagram showing zero information of classes after all-zero test and 4-zero tests. simplicity, in the figure the tree is grown to the level of 4-zero classes only. Finer classifications are shown in Fig. 2.3 where 2-zero and 1-zero tests are performed. It can be verified that a total of 256 primitive classes can be obtained with this TSC if the tree is constructed from all-zero level down to 1-zero test level, i.e., combining Fig. 2.2 and 2.3. Every nodes of the tree in Fig. 2.2 and 2.3 has an associated RIDCTs. The RIDCTs for those classes are based on pruning only operations that involve co efficients that are known to be zero. Thus, we cannot prune operations involving coefficient labeled with U or E k unless we are willing to have a non-exact IDCT. 2.2.4 O ptim al T ree-stru ctured C lassification (O T S C ) Note that in Section 2.2.3, we outlined the classes that are induced by testing in a hierarchical manner, starting with tests on 4 coefficients and continuing with two and then one coefficient. At each node, in order to classify to 3 classes, two logical operations must be performed. For example, class C exe xe x e x can be classified to Cqexqex and C exuexu using AND logic between the input bitmap and the mask M [1010]: if the result is zero, the input belongs to Couou, otherwise it belongs to CExu exu- Then C Exuexu is further classified to C ex qexq and C exe2exe2 by performing an AND of the input bitmap with M[0101]: if the result is zero, the 39 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C EiEiEiEi ;0E 31E 3 / \ c C 1E30E3 L 0001 0010 EUEiO 0100 1000 Figure 2.3: TSC diagram showing zero information of classes after 2-zero test and 1-zero tests. The most right-handed class after 2-zero test is further classified into 9 classes which are tensor products of descendent classes shown, i.e., A < g > B = {(-4* n B j)}v ij where A = {A2}, B = {Bi}. Therefore, the number of leaves is 3+3+(3x3) = Id. input belongs to C e ^ e ^ , otherwise it belongs to C e 1e?e1e2- However, the order in which M[0101] and M[1010] are applied can be switched, and the resulting 3 classes can still be obtained. However, the complexities of both orderings are different when the frequency of occurence of the resulting classes is taken into account. For example, in the above example, if class C exO ExO occurs more often than CoeiOEx , then regardless of RIDCTs complexity2 it is better to apply the mask iV/[0101] before M[1010]. Therefore, for optimal TSC tests, ordering has to be carefully chosen. Fur thermore, we allow more freedom to perform only one logic operation at a node to classify only 2 classes, e.g., in the above example, we can choose to have only the second test performed on CexExEiEx to get CexO Ex O and CuexUEx- Thus, the goal of our optimization procedure is to find the best classification, that is, to determine (i) the best order in which to apply the tests and (ii) which tests it is worth performing (in the sense of reducing the average complexity). In Figs. 2.4, we show the possible choices for testing 4 coefficients. 
We denote the two logical comparisons by A and B. In the figure there are five possible choices: A-B, B-A, A only, B only and no-test. Each choice results in a different classification, except A-B and B-A, which have the same final result. Given the statistics of the input, we can then compare the complexities, including the test costs, to find the best choice in the average sense.

Figure 2.4: Diagram showing the possible choices for classification at level 2, where A and B represent testing with M[1010] and M[0101], respectively.

Figure 2.5: The corresponding diagram when the result from level 1 is different.

The complexities of the five choices shown in Fig. 2.4 are

T_{A-B}     = c(S_A) P_{E1E1E1E1} + c(A*_{0101}) P_{0E1 0E1} + c(S_B) P_{E1UE1U} + c(A*_{1010}) P_{E1 0E1 0} + c(A*_{1111}) P_{E1E2E1E2}
T_{B-A}     = c(S_B) P_{E1E1E1E1} + c(A*_{1010}) P_{E1 0E1 0} + c(S_A) P_{UE1UE1} + c(A*_{0101}) P_{0E1 0E1} + c(A*_{1111}) P_{E1E2E1E2}
T_{A-only}  = c(S_A) P_{E1E1E1E1} + c(A*_{0101}) P_{0E1 0E1} + c(A*_{1111}) P_{E1UE1U}
T_{B-only}  = c(S_B) P_{E1E1E1E1} + c(A*_{1010}) P_{E1 0E1 0} + c(A*_{1111}) P_{UE1UE1}
T_{no-test} = c(A*_{1111}) P_{E1E1E1E1}

We can perform this comparison of the 5 choices at every level of testing, i.e., the all-zero, 4-zero, 2-zero and 1-zero tests. In the case where A-only, B-only or no-test was chosen at the previous level of zero testing, the current level starts from a class about which no information has been obtained; see Fig. 2.5. From the TSC tree of Figs. 2.2 and 2.3 we compare all possible combinations of the 5 choices at every node to find the test structure with minimal complexity. For consistency with the previous formalization, let H be redefined as the set of choices at the nodes of Figs. 2.2 and 2.3, H = {h_i^l}, where h_i^l ∈ {A-B, B-A, A-only, B-only, no-test} is the choice selected at level l and node i. There are 4 levels, corresponding to the all-zero, 4-zero, 2-zero and 1-zero tests. If the previous level performs a classification with fewer than 3 classes, not all of the next level's choices are available. For example, if the 4-zero level uses A-only, the resulting classification is {C_{E1 0E1 0E1 0E1 0}, C_{UE2UE2UE2UE2}}; the choice at the node corresponding to C_{0E1 0E1 0E1 0E1} has no value, because that class is never produced and is instead included in C_{UE2UE2UE2UE2}, and the same holds for the choices at the nodes below it. Thus our optimization goal can be formalized as

Formulation 3 (Optimal Tree-Structured Classification VCA) The goal is to find H* from all possible combinations H such that the overall complexity is minimal, i.e., we minimize the average complexity

T^*_{TSC(binary)} = \min_{H \in \mathcal{H}} \overline{T}_{TSC(binary)}(H, P)   (2.6)

for a given statistics of the input classes. Let H* represent an optimized TSC (OTSC) algorithm, which can be found by searching over all possible test structures H satisfying the above binary TSC framework.
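Picking among the five options for a given set of statistics is a direct expected-cost comparison. The sketch below mirrors the expressions above; the class probabilities and the test and RIDCT costs are assumed to have been gathered from training data, and all names are ours.

    enum choice { A_B, B_A, A_ONLY, B_ONLY, NO_TEST };

    /* Expected cost of each of the five level-2 choices of Fig. 2.4;
       returns the cheapest one. */
    static enum choice best_choice(double p_0E10E1, double p_E10E10, double p_E1E2E1E2,
                                   double p_E1UE1U, double p_UE1UE1, double p_E1E1E1E1,
                                   double c_SA, double c_SB,
                                   double c_A0101, double c_A1010, double c_A1111)
    {
        double t[5];
        t[A_B]     = c_SA * p_E1E1E1E1 + c_A0101 * p_0E10E1 +
                     c_SB * p_E1UE1U   + c_A1010 * p_E10E10 + c_A1111 * p_E1E2E1E2;
        t[B_A]     = c_SB * p_E1E1E1E1 + c_A1010 * p_E10E10 +
                     c_SA * p_UE1UE1   + c_A0101 * p_0E10E1 + c_A1111 * p_E1E2E1E2;
        t[A_ONLY]  = c_SA * p_E1E1E1E1 + c_A0101 * p_0E10E1 + c_A1111 * p_E1UE1U;
        t[B_ONLY]  = c_SB * p_E1E1E1E1 + c_A1010 * p_E10E10 + c_A1111 * p_UE1UE1;
        t[NO_TEST] = c_A1111 * p_E1E1E1E1;

        enum choice best = A_B;
        for (int k = 1; k < 5; k++)
            if (t[k] < t[(int)best]) best = (enum choice)k;
        return best;
    }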
For example, we can directly test the coefficients in a 2-D block as in [60]. We can also apply OTSC to each row of input in the row-wise 1-D IDCT and then apply the OTSC for each column of the row-wise IDCT output in the column-wise 1-D IDCT. For simplicity, we choose to perform the latter (separable 2-D IDCT). Note that, we can have different OTSCs for different row and column positions but the overhead of caching different functions in the software implementation outweighs the complexity reduction of having more OTSCs for different row/column statistics. Thus we use the same algorithm for all rows/columns. X X X X X DCT X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X rowwise X X X X X X X X X Figure 2.6: Content of bitm ap after first (row-wise) 1-D IDCT where ’x ’ represents nonzero coefficient. Except for the 2-D dyadic classification in Section 2.2.6, the TSC and se quential classification schemes require re-classification for the intermediate result between the 1-D separable IDCTs. After the rowwise 1-D IDCT, the input to the columnwise 1-D IDCT has to be classified again. The bitm ap for the second IDCT can be approximated very accurately from the first bitm ap as shown in Figure 2.6. The approximation relies on the fact that if at least one frequency component is nonzero it is more likely that all the outputs from the IDCT (i.e. the result of applying the IDCT to the row or column) will be all different from zero. There are very unlikely cases where magnitude cancellation occurs at some output points and the output is zero even though there were non-zero inputs but we ignore these cases. The complexity savings from this bitm ap implementation is about 5% over direct coefficient testing. 43 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.2.6 2-D D y a d ic C lassification In. the above proposed classification schemes, i.e., Sequential Classification and TSC, 1-D IDCT is the target the classifications are applied on. They both can be extended to 2-D classification but the complexity is also increased and the optimization process may become impractical. Therefore, we propose an alterna tive classification in which a block of 2-D DCT coefficients is classified instead of classifying each row/column separately. We propose 2-D dyadic classification in which an N x N DCT block is classified into all-zero, DC-only, low-2x2, low-4x4, low-8x8, .... , y x y , and full-iVxiV classes. For each of these classes, a corre sponding 1-D RIDCT is applied along the nonzero rows and then all columns. The testing order follows the above list. The reason behind this classification is from the fact that in typical image high frequency components are more likely to be quantized to zero. Figure 2.7 shows how the classification works. INPUT all-zero ■N /2 Figure 2.7: Dyadic classification scheme. A DCT input of size N x N is classified into all-zero, DC-only (K l), low-2x2 (K2),..., low -yrr^ (iCv/2), and full-iVxiV (iv;v) classes. When the baseline algorithm has a recursive structure in the sense that larger size IDCT computation consists of smaller sizes IDCT computations, we can ex press the complexity of the RIDCTs, accordingly. For example, the Chen,Smith &Fralick [28] or the Vetterli-Ligtenberg [29] (Fig. 1.3) algorithms can be decom posed to a normal half-size IDCT (type II) and a half-size IDCT of type IV [24] and the normal half-size IDCT can be further decomposed and so on. 
2.2.6 2-D Dyadic Classification

In the classification schemes proposed above, i.e., sequential classification and TSC, the classification is applied to the 1-D IDCT. Both can be extended to 2-D classification, but the complexity also increases and the optimization process may become impractical. Therefore, we propose an alternative classification in which a block of 2-D DCT coefficients is classified as a whole, instead of classifying each row/column separately. In this 2-D dyadic classification, an N×N DCT block is classified into all-zero, DC-only, low-2×2, low-4×4, low-8×8, ..., low-N/2×N/2, and full-N×N classes. For each of these classes, a corresponding 1-D RIDCT is applied along the nonzero rows and then along all columns. The testing order follows the above list. The rationale behind this classification is that in typical images the high frequency components are more likely to be quantized to zero. Figure 2.7 shows how the classification works.

Figure 2.7: Dyadic classification scheme. A DCT input of size N×N is classified into all-zero, DC-only (K_1), low-2×2 (K_2), ..., low-N/2×N/2 (K_{N/2}), and full-N×N (K_N) classes.

When the baseline algorithm has a recursive structure, in the sense that the larger size IDCT computation consists of smaller size IDCT computations, we can express the complexity of the RIDCTs accordingly. For example, the Chen-Smith-Fralick [28] or the Vetterli-Ligtenberg [29] (Fig. 1.3) algorithms can be decomposed into a normal half-size IDCT (type II) and a half-size IDCT of type IV [24], and the normal half-size IDCT can be further decomposed, and so on. Therefore, it is equivalent to having smaller size IDCTs as the RIDCTs. The complexity for each class tested by the dyadic classification on an N×N DCT block can then be written as

    c(A_i) = (N + N/2^{log₂N − i + 1}) · K_{N/2^{log₂N − i + 1}} = (N + 2^{i−1}) · K_{2^{i−1}}   (2.7)

for class i = 1, 2, ..., log₂N + 1, where K_n is the complexity of a DCT/IDCT of size n; for the all-zero block (i = 0), c(A_0) = 0. The term that multiplies K represents the number of rows and columns on which the 1-D reduced IDCT is performed. Here, we use a separable reduced-size 1-D IDCT to perform the 2-D IDCT: the reduced row-wise transform is performed only for the rows with nonzero coefficients, and then the reduced column-wise transform is applied to all columns. In terms of classification structure it can be seen that, unlike the greedy approach, the order of classification (or threshold testing) is fixed and closed-form complexity expressions for the reduced algorithms are known. Let B_i denote a class i input. The complexity of B_i can be written as

    T_dyadic(B_i) = c(A_i) + i · T_dtest   (2.8)

where T_dtest represents the testing cost (a 2-D zero mask test or a series of 1-D zero mask tests). For simplicity, we assume that T_dtest is constant for all classes. From (2.8), the average complexity can be expressed as

    T̄_dyadic = Σ_{i=1}^{log₂N + 1} T_dyadic(B_i) · Pr(B_i)   (2.9)

where Pr(B_i) is the probability of class B_i, which is a function of QP, σ² and N.
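A small numerical sketch of (2.7)-(2.9) follows. All concrete numbers (the per-size costs K, the test cost, and the class probabilities) are invented placeholders for illustration, not values measured in this work.

    import math

    def dyadic_class_cost(i, N, K, T_dtest):
        """Eqs. (2.7)-(2.8) for class i >= 1 (the all-zero class i = 0 has c(A_0) = 0)."""
        m = 2 ** (i - 1)                 # class i is the low-m x m block
        c_Ai = (N + m) * K[m]            # m nonzero rows + N columns, each a size-m 1-D RIDCT
        return c_Ai + i * T_dtest        # i zero-mask tests are paid to reach class i

    def average_dyadic_cost(N, K, T_dtest, prob):
        """Eq. (2.9): average over classes i = 1 .. log2(N) + 1, weighted by Pr(B_i)."""
        return sum(prob[i] * dyadic_class_cost(i, N, K, T_dtest)
                   for i in range(1, int(math.log2(N)) + 2))

    # Purely illustrative numbers (K[m] = assumed weighted cost of a size-m 1-D IDCT):
    K = {1: 1, 2: 4, 4: 12, 8: 42}
    prob = {1: 0.35, 2: 0.25, 3: 0.25, 4: 0.15}
    print(average_dyadic_cost(8, K, T_dtest=3, prob=prob))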
2.3 Results

In our experiments we implement a real software image decoder using the OTSC. For sequential testing, we only implement a greedy optimization based on an image model.

The main problem now is obtaining accurate models for the costs of the operations used in the ZMT and the RIDCT, such as addition, multiplication, shift, OR, NOT, IF, etc. While this may be relatively straightforward for hardware implementations or for assembly-language-level programming, it is more complicated if we use a high-level programming language, since the effects of memory accesses, code size and so forth are more difficult to characterize. This problem is very difficult to solve since it depends heavily on the compiler and the machine architecture. To obtain more accurate estimates one could resort to an assembly language implementation, or measure the average time complexity of each of the possible classification trees directly. This latter approach would clearly be too complex and still would not guarantee an exact cost measure, since many different factors affect the runtime performance. For the sake of practicality, we simply use a set of approximate weights for the operations involved. By feeding these approximate parameters into our search algorithms we can get a fairly accurate approximation to the optimal solution. We have used a Sun Sparcstation-5 running Solaris 2.5 with the gcc compiler, and a 450 MHz Pentium running Windows NT 4.0 with the Visual C++ compiler. We use the weights in Table 2.1 for each operation [96]. Note that one IF statement consists of a comparison to zero and a jump operation.

Table 2.1: Weights for different logical operations.

    Operation               Weight
    Addition                1
    Binary Shift            1
    Binary OR               1
    Multiplication          3
    Conditional Branching   5

One more thing to be considered is memory access: whenever a variable is involved in an operation, there is either a read-in or a write-out time. However, since memory access can be highly optimized through the compiler's optimization options, we do not take it into account. We emphasize this issue because our costs are only approximate and very machine dependent; hence, we do not guarantee that the result of the OTSC will be the same in all environments.

2.3.1 Results Based on Image Model

In this section we assume that the image characteristics to be tested follow the model presented in Section 1.6 (assuming INTER-frame coding). Therefore, given that the model parameter is known, the probability of X(u,v) being zero can also be computed, and thus we can obtain the average complexity of any of the algorithms. We perform our experiments for different values of the model parameter and the quantization parameter.

First we show the results of both the greedy approach and the OTSC compared to those of an ideal algorithm in which the classification comes for free. That is, we assume we know the exact positions of the zeros in the input, so that all corresponding operations can be pruned without any testing overhead. Fig. 2.8 shows the number of additions and multiplications as functions of the quantization parameter (QP) for the greedy approach and the OTSC compared to the ideal case. In addition to using the Vetterli-Ligtenberg algorithm (Fig. 1.3 (b)) as the baseline algorithm, we show the ideal case when the direct matrix multiplication method is used instead. It can be seen that in all cases the ideal direct matrix multiplication is inferior to the ideal fast algorithm.

Figure 2.8: Number of additions (a) and number of multiplications (b) needed for σ² = 100 and 700, using OTSC ('solid'), Greedy ('dashed'), Ideal fast IDCT ('dashed-dotted'), and Ideal matrix multiplication ('light-solid'). The OTSC and Greedy are optimized for each QP.

Figure 2.9: Total complexity (a) and number of tests (b) needed for σ² = 100 and 700 using the OTSC ('solid') and greedy ('dashed') algorithms.

One can also observe that, in terms of the number of additions and multiplications, the OTSC requires fewer operations than the greedy approach. This is because the ability to refine the tests within the OTSC allows better classification than the greedy approach. The OTSC result in terms of the number of additions and multiplications is very close to the ideal case at high quantization and moves away from the ideal as finer quantization is used; see Fig. 2.8. This is easily explained: in the low rate region most of the quantized DCT coefficients are zero and the zero tests become very efficient, whereas at high rates there are fewer zero coefficients and some of the zero mask tests are wasted. Also, as shown in Fig. 2.9, the complexity reduction of the OTSC comes at the price of numerous tests (see Fig. 2.9 (b)), which makes the algorithm very machine dependent, i.e., the total cost could vary considerably from machine to machine. On the other hand, the number of tests (bitmap tests) required by the greedy approach is much smaller than for the OTSC, which compensates for its larger number of multiplications and additions and results in similar overall complexity.
We emphasize that the greedy approach performs very close to the OTSC in terms of total complexity while using far fewer tests. Therefore, it can be used as an accurate indicator of the complexity of the OTSC at various block sizes with much less optimization effort.

We show the distortion-complexity and rate-complexity (R-C) performance of different IDCT algorithms at different QP and block sizes in Figures 2.10 (a) and (b), respectively.

Figure 2.10: (a) Complexity-distortion and (b) rate-complexity curves for different algorithms, i.e., greedy ('dotted') and 2-D dyadic ('dashed'), for DCT sizes 16x16 ('x'), 8x8 ('o') and 4x4 ('*') at σ² = 100. The complexity unit is weighted operations per 64 pixels.

From Figure 2.10 (a), we can see that, for a given distortion, which block size gives the least complexity depends on σ² and QP. At low distortion, a smaller block size is cheaper than a large block size, whereas the reverse is true at high distortion. From Figure 2.10 (b), it can be seen that as the rate becomes small, the complexity difference between the block sizes becomes smaller as well. Therefore, the conclusion is that using large block sizes at low rates results in complexity and distortion performance comparable to using smaller block sizes. The C-D and R-C results of the 2-D dyadic classification are also shown in Figs. 2.10 (a) and (b), respectively.

At this point, we can also compare the 2-D dyadic testing to the greedy approach. At high rates the dyadic test requires higher complexity than the greedy algorithm, because the classification is coarser for the higher frequency components. However, at lower rates the complexity of the dyadic test is lower, since there are fewer tests and each test is performed on a whole 2-D block instead of each row or column, thus resulting in a more efficient classification cost. In fact, the 2-D test can be performed efficiently based on 1-D bitmaps. For example, to test for the low-4x4 class of an 8x8 block, the relevant bits of the column bitmap and of the first 4 row bitmaps are bitwise ORed and compared with zero. An alternative is to check the EOB symbol, which marks the last nonzero coefficient in the zigzag scan order; from the EOB, one can determine the all-zero DCT area and apply a corresponding reduced IDCT algorithm. Therefore, we can combine the OTSC or greedy approach with the 2-D dyadic test, i.e., the dyadic test is applied first, and if the resulting class is a full-NxN block, the greedy or OTSC approach is applied to each row/column. At low rates, this method has the advantage of the low testing cost of the dyadic test, while still being able to refine the test at high rates using the greedy approach.
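As a concrete reading of the bitmap-based dyadic test and of the combined dispatch just described, the sketch below checks the dyadic classes of an 8x8 block from its 1-D bitmaps and falls back to a row/column VCA IDCT for the full class. The function names and the reduced_idct / otsc_idct callables are illustrative assumptions.

    def dyadic_class_8x8(row_bitmaps, col_bitmap):
        """Classify an 8x8 block from its 1-D bitmaps.

        row_bitmaps[r] has bit c set if coefficient (r, c) is nonzero;
        col_bitmap has bit r set if row r contains any nonzero coefficient.
        Returns 'all_zero', 'dc_only', 'low2x2', 'low4x4' or 'full'.
        """
        if col_bitmap == 0:
            return "all_zero"
        for size, label in ((1, "dc_only"), (2, "low2x2"), (4, "low4x4")):
            low_mask = (1 << size) - 1            # lowest `size` bit positions
            high_rows = col_bitmap & ~low_mask    # any nonzero row above the low band?
            high_cols = 0
            for r in range(size):                 # any nonzero column above the low band?
                high_cols |= row_bitmaps[r] & ~low_mask
            if high_rows == 0 and high_cols == 0:
                return label
        return "full"

    def combined_dyadic_otsc_idct(block, row_bitmaps, col_bitmap, reduced_idct, otsc_idct):
        """Dyadic test first; only the full class pays for the finer row/column OTSC."""
        cls = dyadic_class_8x8(row_bitmaps, col_bitmap)
        if cls == "full":
            return otsc_idct(block)               # per-row/column VCA IDCT
        return reduced_idct[cls](block)           # reduced IDCT matched to the class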
2.3.2 Real Image Data Results

We now show the results obtained with real image and video data using a real image and video decoder. First we show the results of the OTSC on image data (see Fig. 2.11). The results show the estimated complexity comparison between (i) the IDCT based on our proposed method (OTSC), (ii) the baseline algorithm (CW) with all-zero test, (iii) the baseline (CW) algorithm with all-zero test for the first 1-D IDCT and AC-zero test for the second 1-D IDCT (since after the first 1-D IDCT it is more likely, for typical images, that only the DC coefficient in the 1-D vector is nonzero), and (iv) the FW method, for various mean squared error values obtained by changing the quantization parameter. We use the example quantization table from the JPEG standard [1]. All values are normalized by the complexity of the baseline algorithm without the all-zero test. Next, the actual implementation times are shown for the IDCT algorithm part (Fig. 2.12) and for the total decoding time (Fig. 2.13), which includes the time for read-in and write-out of the input data. Also shown in Fig. 2.14 is the mismatch case, i.e., we use the IDCT algorithm optimized for a specific image at a certain quality factor on a different image and/or quality factor.

Figure 2.11: Normalized estimated complexity for "lenna" using CW with all-zero test ('+'), CW with all-zero test for the first 1-D IDCT and AC-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), and the OTSC algorithm ('o').

Figure 2.12: Normalized actual time complexity (IDCT algorithm part only) for "lenna" using CW with all-zero test ('+'), CW with all-zero test for the first 1-D IDCT and AC-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), and the OTSC algorithm ('o').

Figure 2.13: Normalized actual time complexity (total decoding time) for "lenna" using CW with all-zero test ('+'), CW with all-zero test for the first 1-D IDCT and AC-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), and the OTSC algorithm ('o').

Our proposed method (the OTSC algorithm) achieves up to 50% complexity reduction for the IDCT part. When there is a mismatch between the statistics of the training data and those of the actual data being processed, the performance can be degraded. Another source of mismatch comes from the assessment of the cost of each operation, which must be precise enough to maintain consistency between the estimated complexity and the actual complexity. As seen in both experimental results, even with mismatches the optimized algorithm is only slightly inferior, due to the lack of an exact model for the complexity. If our estimates of the operation costs are precise enough, a conclusion that can be drawn from applying the IDCT algorithm optimized for "lenna" with MSE 60.21 is that the degradation from using an algorithm optimized for a higher MSE (which overestimates the number of zeros) is less than the degradation from an algorithm optimized for a lower MSE (which underestimates the number of zeros). This may not be true for other types of images.
It also has to be pointed out that our current implementation has concentrated mostly on the issue of structuring the tests, and thus uses a very simple programming style. Our goal at this stage was to implement all the algorithms in a similar style so that a fair comparison is possible.

Figure 2.14: Normalized actual time complexity (total decoding time) for the baboon image using CW with all-zero test ('+'), CW with all-zero test for the first 1-D IDCT and AC-zero test for the second 1-D IDCT ('x'), the FW algorithm ('*'), the OTSC algorithm ('o'), and the OTSC optimized for lenna with MSE 14.79 ('-').

The results of the combined 2-D dyadic and OTSC IDCT are shown in Fig. 2.15, using the TMN codec [58] and coding 249 frames of the "Foreman" sequence at different target bit rates. The inverse quantization complexity is also considered, since there is an overhead cost of bitmap generation for both the OTSC and the dyadic classification. It can be seen from Fig. 2.15 that the combined dyadic-OTSC obtains an extra 6-9% complexity reduction over the OTSC. An explanation is that the 2-D classification is cheaper than separate 1-D classification if it matches the statistics of the data well. In typical image and video data, nonzero coefficients tend to cluster around the DC component, which is why we observe extra speedup using the dyadic classification.

Figure 2.15: The complexity (CPU clock cycles) of the IDCT + inverse quantization for "Foreman", normalized by the original algorithm in TMN, at various PSNRs (different target bit rates) using OTSC and combined dyadic-OTSC.

We now discuss the issue of statistical estimation. One scenario for obtaining the statistics for the proposed VCA IDCTs in a practical encoding system would be to have the encoder send side information about the optimized tree-structured zero-test IDCT algorithm to the decoder along with the data, such that the decoder can perform the IDCT operation with minimal complexity. However, this comes at the price of a bit rate overhead. As an alternative, a tree-structured or greedy optimization can be performed at the decoding end using past statistics of the input data. This seems more compatible with existing image and video standards. However, there is the issue of run-time algorithm change: it is impossible to compile the optimal IDCT at run-time. Therefore, instead of optimizing the VCA on the fly, a more practical way is to have a set of pre-optimized VCAs that covers a wide range of image statistics and, at run-time, choose the one with the smallest complexity for the image being processed.

2.4 Distortion/decoding time tradeoffs

As we mentioned earlier regarding the scalability issue, one way to achieve IDCT computational scalability is by changing QP at the encoder side. This can be applied to the case where a limited-complexity decoder is the major concern. The C-D tradeoff will be addressed, i.e., within a given computational complexity at the decoder, the encoder adjusts the quantization parameter such that the distortion is minimized while the complexity constraint is satisfied.
Note that this requires the encoder to have enough information about the operation at the decoder end in order to change the QP accordingly.

As described in the previous section, a VCA implementation of the IDCT can be obtained given a "typical" set of input images. As is clear from the results, the coarser the quantization, the faster the decoder can run. Thus, a decoder based on a VCA IDCT is inherently computationally scalable. Unlike the JPEG case, in which the quantization parameter is fixed for all blocks in a coded image, for video encoding standards such as MPEG-1/2 or H.261/3 the encoder can adjust its quantization parameter for every macroblock in order to find the best picture quality for a given bit budget. The same idea can be applied in a complexity-distortion (C-D) framework, where the encoder can control the decoding speed by assigning to each block one out of several available quantizers (coarser quantizers result in faster operation). The question then arises of how to optimally select those quantization steps for a given decoding time budget. This leads to a formulation where the encoder operates based on a complexity-distortion (C-D) tradeoff, rather than on the traditional rate-distortion (R-D) tradeoff. As an application, one could for instance store images to be downloaded by different types of hosts, so that slower hosts can access lower quality images, which can be decoded more quickly. Similarly, the quality of the decoded images can be selected based on the load of a shared host.

The problem can be formalized as finding the quantizer assignment {j, i}_opt such that Σ_j D_j(i) is minimized subject to Σ_j T_j(i) ≤ T_budget, where T_budget is the total time (or complexity) budget, and D_j(i) and T_j(i) are, respectively, the distortion and the decoding time when quantizer i is used for block j. The problem can be solved using the well-known Lagrangian multiplier method [9]:

    {j, i}_opt = arg min_{j,i} ( Σ_j D_j(i) + λ · Σ_j T_j(i) )   (2.10)

where λ > 0 is the Lagrange multiplier, which has to be adjusted so that the budget is met.
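The λ-sweep implied by (2.10) can be sketched as follows. This is a simplified illustration, not the dissertation's implementation; the D and T tables and the λ grid are assumptions supplied by the caller.

    def allocate_quantizers(D, T, t_budget, lambdas=None):
        """Per-block quantizer selection for eq. (2.10).

        D[j][i], T[j][i]: distortion and decoding-time estimates for block j
        under quantizer i (assumed to be measured beforehand).
        Returns the per-block choice for the smallest lambda meeting the budget.
        """
        def allocate(lmbda):
            choice = [min(range(len(Dj)), key=lambda i: Dj[i] + lmbda * Tj[i])
                      for Dj, Tj in zip(D, T)]
            total_time = sum(Tj[i] for Tj, i in zip(T, choice))
            return choice, total_time

        # Sweep lambda from small (quality-oriented) to large (speed-oriented)
        # and stop at the first allocation that satisfies the time budget.
        lambdas = lambdas or [0.0] + [2.0 ** k for k in range(-10, 11)]
        for lmbda in sorted(lambdas):
            choice, total_time = allocate(lmbda)
            if total_time <= t_budget:
                return choice, lmbda, total_time
        return choice, lmbda, total_time        # budget not attainable on this grid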
Figure 2.16: Distortion versus (a) estimated IDCT complexity and (b) experimental decoding time of "lenna" using a fixed-quantizer encoder and OTSC decoder ('o'), Lagrange multiplier results ('x'), and the encoder following the Lagrange multiplier results while the decoder uses a single OTSC for MSE = 60.66 ('*') and MSE = 14.80 ('+'), respectively.

Figs. 2.16 and 2.17 summarize our results. In this experiment, we design six OTSC algorithms for six different quantization parameters based on the statistics of the test image. The OTSC design is based on the approach discussed in the previous section. We then have the complexity and distortion for each block and each quantizer, as required by the Lagrange optimization of (2.10). For a given λ, we find the quantizer that minimizes (2.10) for each block. Finally, we repeat the minimization for different values of λ until the complexity constraint is met. Therefore, the encoder applies different quantizers to different blocks according to the optimization result. The quantizer information must also be sent to the decoder for proper inverse quantization. At the decoder side, the OTSC algorithm that corresponds to the quantizer selected for the block is used. However, in most cases, having many OTSCs at the decoder results in excessive overhead. A more practical approach is to have a single OTSC optimized for a single quantizer and to use this OTSC for all blocks.

Figure 2.17: (a) Complexity-distortion curves obtained from C-D ('x') and R-D ('*') based optimization. (b) Rate-distortion curves achieved with C-D ('x') and R-D ('*') based optimization. The complexity is normalized by the complexity of the baseline Vetterli-Ligtenberg algorithm.

Fig. 2.16 indicates that a per-block quantizer allocation optimized for complexity results in very significant reductions in complexity. As expected, the C-D allocation outperforms the other methods. Note that in Fig. 2.16 (a) we compare the C-D allocation when multiple quantization-dependent algorithms are used and when a single algorithm is used (an OTSC optimized for a fine quantizer or for a coarse quantizer). When the coarse-quantizer OTSC is used, the performance is very close to that of the multiple-algorithm approach. This result is consistent with the previous result in Fig. 2.13, where an OTSC optimized for a coarse quantizer can be used for blocks coded with finer quantization with only a small degradation in C-D performance. For the actual decoding tests we thus use a single algorithm. Each point on the C-D optimized curves is obtained with a given parameter λ.

One may argue that the complexity can be controlled via rate control since, in general, the quantizer is determined by the bit budget: if the bit budget is small, a coarse quantizer will be used, which in turn results in more DCT coefficients being set to zero, so that a decoder with a VCA IDCT will be faster. However, Figs. 2.17 (a) and (b) demonstrate how a complexity-driven quantizer allocation results in better C-D performance than its rate-driven counterpart. Even though the C-D and R-D curves from the C-D based and R-D based optimizations are close together, as expected better performance is observed when the correct budget constraint parameter is used. However, we may also use a rate budget constraint instead of a complexity constraint to obtain sub-optimal C-D performance.

2.5 Rate-Complexity-Distortion Quadtree Optimization

Only a few works address rate-complexity-distortion tradeoffs. Among them are [22] and [23], in which complexity is considered as a factor determining the R-D performance and can be varied by changing the size of the block transform. In [23], the convexity of the R-D-C curve for Gauss-Markov sources is proved. Furthermore, [23] shows a C-D comparison, for a given rate, between the KLT and the DCT; an interesting conclusion is that for this particular source the C-D curve of the DCT is better. However, in these works the complexity is not input-dependent, i.e., they consider the worst-case scenario, which implies that the complexity is a function of the block size only. If we use the variable complexity approaches proposed in the previous sections, it can be seen that the C-D relation follows a tradeoff similar to R-D when QP changes (see Figure 2.10(a)), whereas the R-C tradeoff appears when the block size, not QP, varies (see Figure 2.10(b)).
This is because when the block size is larger, the complexity required for the block transform is also larger, while the rate is smaller as a result of the transform coding gain, and vice versa. However, since the number of available block sizes is limited, in order to obtain a convex hull of the R-C curve the Lagrange multiplier method can be used, as it is in R-D applications [9]. In [97] and [21], quadtree-based coding is used for better R-D performance by using different block sizes for different regions and types of input. In [21], the R-D optimization criterion is used with the optimal tree pruning algorithm [98]. The optimization already incorporates the fact that the encoder has to send the block size information to the decoder.

In this section, we present a more general R-C-D tradeoff based on the Lagrange multiplier technique, which incorporates the selection of QP and block size. We first formalize a problem similar to the optimal quadtree-based coding of [21], with an additional constraint on the complexity. Then the QP as well as the block size are optimally selected for each block of the image or video frame. The complexity constraint is the target total decoding time at the decoder.

Figure 2.18: Quadtree structures of four 16x16 regions and the corresponding representative bits.

We use the same notation as [21]. Based on Figure 2.18, let X_{n,i} denote the i-th DCT block at level n. As in the figure, the leaves (smallest blocks) correspond to a block size of 4x4, the middle nodes represent 8x8 blocks, and the root of each tree corresponds to a block size of 16x16. Let n = 0, 1, 2 represent blocks of size 4x4, 8x8 and 16x16, respectively. The scheme can be generalized to smaller block sizes (i.e., 1x1 blocks) or larger block sizes such that level n corresponds to block size 2^n x 2^n. However, given the limitation on computation, we use only 3 block sizes. The side information that needs to be sent to the decoder about the structure of the quadtree is '0' if the block is not split and '1' otherwise. If the parent node at level n+1 is split, then there are 4 children blocks, i.e., X_{n,i} for i = 0, 1, 2, 3. Let r_k(X_{n,i}), d_k(X_{n,i}) and t_k(X_{n,i}) denote the bits, distortion and complexity of X_{n,i} in block k.

The quadtree optimization is based on the bottom-up approach of [98], in which the decision is made on whether smaller blocks with suboptimal R-C-D operating points are better off merged for better R-C-D performance. The goal of the optimization is to minimize the total distortion D such that the total rate R and complexity T are under their respective budget constraints R_b and T_b. Using the Lagrange multiplier method, we can convert this constrained problem into an unconstrained problem by minimizing the following objective function

    min Σ_k d_k + λ_r Σ_k r_k + λ_t Σ_k t_k   (2.11)

where λ_r and λ_t are Lagrange multipliers controlling the tradeoffs between R-D and C-D, respectively. These parameters need to be searched such that the constraints are met.

The optimization of the quadtree can be described as follows. Assume that the optimal rate, distortion and complexity of block k at level n are known to be r*_k(X_{n,i}), d*_k(X_{n,i}) and t*_k(X_{n,i}) for i = 0, 1, 2, 3.
These four subtrees will be merged if

    d_k(X_{n+1}) + λ_r r_k(X_{n+1}) + λ_t t_k(X_{n+1}) < Σ_{i=0}^{3} [ d*_k(X_{n,i}) + λ_r r*_k(X_{n,i}) + λ_t t*_k(X_{n,i}) ]   (2.12)

where the left side is the minimal cost (over all possible quantizer choices) for the parent node. Then the optimal rate, distortion and complexity of level n+1 are updated and the optimization proceeds to the next level. When the merge decision is made, one bit of side information has to be sent to the decoder as well. Therefore, the new suboptimal solution at level n+1 becomes

    r*_k(X_{n+1}) = 1 + r_k(X_{n+1})              if merge
                    1 + Σ_{i=0}^{3} r*_k(X_{n,i})  if split   (2.13)

    d*_k(X_{n+1}) = d_k(X_{n+1})                  if merge
                    Σ_{i=0}^{3} d*_k(X_{n,i})      if split   (2.14)

    t*_k(X_{n+1}) = t_k(X_{n+1})                  if merge
                    Σ_{i=0}^{3} t*_k(X_{n,i})      if split   (2.15)

The process continues until the root node is reached and the optimal quadtree of every region in the image is computed. Then the Lagrange parameters λ_r and λ_t are adjusted, and the whole process repeats until the bit budget and the complexity budget constraints are met. There are several methods for adjusting the Lagrange multipliers. In this work, we use the linear approximation algorithm proposed in [99], in which a Lagrange multiplier is shrunk (by a factor γ < 1) or expanded (by a factor 1/γ) when the total constrained quantity is below or above the budget, respectively. Using this algorithm, we can generate constant-complexity R-D curves and constant-rate C-D curves. In order to find constant-distortion R-C curves, we rewrite (2.11) as

    min Σ_k r_k + (1/λ_r) Σ_k d_k + (λ_t/λ_r) Σ_k t_k   (2.16)

which can be viewed as a rate minimization problem with distortion and complexity constraints. Then, the linear approximation can be applied to find the new Lagrange multipliers, i.e., 1/λ_r and λ_t/λ_r.
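The bottom-up pass of (2.12)-(2.15) can be sketched as follows. The node interface (children, best_parent_rdc) and the leaf handling are assumptions made for illustration; adjusting λ_r and λ_t (e.g., by the shrink/expand rule of [99]) is left to an outer loop.

    def optimize_quadtree(node, lambda_r, lambda_t, best_leaf_rdc):
        """Bottom-up pruning following eqs. (2.12)-(2.15) (sketch).

        node.children is empty for leaves; node.best_parent_rdc(lambda_r, lambda_t)
        returns (r, d, t) for the best quantizer choice when the node is coded as
        a single block (assumed API). Returns (r*, d*, t*), with the 1-bit
        split/merge flag included in the rate of non-leaf nodes.
        """
        if not node.children:                       # leaf: only the block itself
            return best_leaf_rdc(node, lambda_r, lambda_t)

        # Optimal R-C-D of the four subtrees (recursive; their flags are included).
        child = [optimize_quadtree(c, lambda_r, lambda_t, best_leaf_rdc)
                 for c in node.children]
        r_split = 1 + sum(r for r, d, t in child)   # '1' = this node's split bit
        d_split = sum(d for r, d, t in child)
        t_split = sum(t for r, d, t in child)

        r_m, d_m, t_m = node.best_parent_rdc(lambda_r, lambda_t)
        cost_merge = d_m + lambda_r * (1 + r_m) + lambda_t * t_m
        cost_split = d_split + lambda_r * r_split + lambda_t * t_split

        if cost_merge < cost_split:                 # eq. (2.12): merge the four subtrees
            node.split = False
            return (1 + r_m, d_m, t_m)
        node.split = True
        return (r_split, d_split, t_split)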
Figure 2.19: Constant-complexity rate-distortion curves. When the complexity constraint is loosened, the rate-distortion performance improves; the 'dashed' curve shows the unconstrained-complexity result.

Figures 2.19-2.21 show experimental results of the quadtree optimization for the first 30 INTER frames of the "Miss America" sequence. The QP can be selected out of 5 possible choices, namely 4, 10, 16, 22 and 28, and 3 block sizes are allowed for the DCT, i.e., 4x4, 8x8 and 16x16. As in [21], we do not take the bits necessary for motion vector coding into consideration, since they can be considered a fixed overhead that depends on the motion estimation. The distortion is in units of MSE and the complexity is the estimated number of operations.³

³Since the complexity depends on the machine and load conditions, we use only the estimated number of operations, as it provides a framework for future development.

Figure 2.20: Constant-rate complexity-distortion curves at 200 Kbps and 300 Kbps. As the rate is more constrained, the C-D performance gets worse.

In Figure 2.19, we show R-D curves for constant complexities. It can be seen that when the complexity constraint is more limited, the R-D performance is degraded. As the complexity budget becomes tighter, the coder is forced to use smaller block sizes or coarser quantization. We also show the result of the optimal R-D quadtree optimization without a complexity constraint, which gives the best R-D result.

In Figures 2.20 and 2.21, constant-rate C-D curves and constant-distortion R-C curves are shown. Again, it can be seen that when one parameter of the R-C-D triplet is more strictly limited, the tradeoff functions of the other two parameters get worse. For example, as the rate budget becomes tighter, a larger block size and a larger QP tend to be chosen; as a result, complexity increases, while the larger QP means more distortion. On the other hand, as the distortion requirement becomes more demanding, a smaller QP and a larger block size are likely to be selected, and consequently higher rate and higher complexity are unavoidable. In addition, note that all the results follow the convexity theorem in [23].

Figure 2.21: Constant-distortion rate-complexity curves at MSE = 30 and 50. As the distortion requirement becomes more rigid, the R-C performance becomes worse.

In the case of real-time encoding, in order to avoid the DCT computational load for all 3 block sizes that are necessary for computing the rate and distortion information, a fast approximate DCT (as will be presented in Chapter 3) can be employed. In addition, the encoder may assume that the decoder uses only the dyadic testing for the IDCT, instead of the combined dyadic and OTSC. Therefore, the complexity evaluation can be done faster without actually testing the DCT coefficients.

2.6 Summary and Conclusions

We formalized a variable complexity algorithm for the IDCT by discussing the classification problem and the reduced complexity algorithms that can exploit the knowledge obtained from the classification. We then presented a sequential classification technique using the zero mask test as a testing tool. We proposed a greedy optimization to select the order in which the classes should be tested by sequential classification, which achieves an acceptable suboptimal solution with reduced complexity. We provided R-C-D results for a given input model and different block sizes. Next, we proposed an alternative tree-structured classification that exploits the structure of the baseline algorithm in the classification process. Given the statistics of the input, we showed how to construct an optimal classifier based on the tree-structured concept. The experimental results show significant complexity savings for typical images at low rates. We also proposed a combined algorithm in which the heuristic 2-D dyadic classification is performed first, based on observation of empirical data, and the VCA IDCT is applied later for finer classification. This gives an extra 5% complexity reduction compared to the OTSC. This implies that, since image and video content is in general similar, one can apply a heuristic VCA IDCT without much degradation compared to the optimal solution. Also, the 2-D classification gives better classification gain at low rates.

Finally, we addressed the problem of the distortion/decoding-time tradeoff using the OTSC. The problem is solved using a Lagrange multiplier technique for optimal quantizer selection at the encoder side. We extended this work to a quadtree optimization for the rate-complexity-distortion tradeoff. We showed experimental results on a video sequence where the goal is to minimize the distortion in a rate- and complexity-constrained environment.
Using VCA approaches, it may be possible to use larger block sizes efficiently, thus improving the coding gain. The applicability of these techniques is not limited to the DCT. There is also work on fast inverse wavelet transforms, e.g., the work by Fernandez and Ortega in [100], where a similar tree-structured classification for a VCA implementation of the inverse wavelet transform is proposed.

Chapter 3

Forward Discrete Cosine Transform

In this chapter, we propose two classes of fast DCT algorithms, namely a VCA approach and an approximate approach. In the VCA approach, we try to determine, from the set of inputs or intermediate outputs of the operations, which set of outputs will be quantized to zero. The classification is performed in a hierarchical manner. We also show how to optimize the test structure for a given set of training data, with techniques similar to those presented for the IDCT. However, the complexity savings of this approach for high-bit-rate coding (for both MPEG-2 and JPEG) are very marginal. Therefore, we resort to lossy approximate DCT algorithms in which, instead of selectively computing a subset of the DCT, all DCT coefficients are approximated. These algorithms are multiplication-free, thus yielding a large reduction in complexity, and the degree of approximation can be selected based on the quantization level, so that the complexity is quantization dependent. An error analysis is also presented. Unlike the VCA, at high bit rates the approximate DCT still provides significant gains (e.g., around 30% reduction in computation time) with only a slight degradation in R-D performance. As a benchmark for the performance, Pao and Sun's statistical sum of absolute value test (SSAVT) is reviewed, and its error analysis is derived to explain its good performance. Finally, we introduce several hybrid algorithms, e.g., combined approximate and VCA, combined SSAVT and approximate DCT, and combined SSAVT, approximate and VCA DCT. The experimental results for low-bit-rate coding (H.263) show a significant gain of more than 50% complexity savings using the combined algorithm, as compared with a standard fast DCT algorithm.

3.1 Exact VCA DCT

Our goal is to avoid the computation of those coefficients that will later be quantized to zero. The main difficulty is that the DCT coefficients which are quantized to zero are not known unless they have already been computed. If the zero (to-be-quantized-to-zero) coefficients are known, only a subset of the output needs to be computed and only the necessary operations are performed. We now explore an algorithm which can deal with nonuniform quantization and allows a finer classification of the inputs.

3.1.1 Input Classification: Pre-transform Deadzone Test

The problem of determining whether the output DCT coefficients will be quantized to zero can be viewed geometrically as that of determining whether the input vector lies in the dead-zone region of the corresponding quantizers (see Figure 3.1). Since the DCT is an orthonormal transform, if the input is in the dead-zone so is the output. In Figure 3.1, the dead-zone is the solid rectangle in the (y_1, y_2) output coordinates and corresponds to the region where the DCT coefficients are quantized to zero. The input coordinates are (x_1, x_2).
The test region equivalent to [61] (see Chapter 1) is shown as a dashed square, which corresponds to thresholding the absolute sum of the input. Let x(i) be an element of the spatial-domain input vector x, and let X be the corresponding DCT output vector. Let Q be the vector of quantization steps and q a vector of thresholds for the coefficients of the input; Q can also be viewed as the vector of deadzone boundaries. We present a test region based on the following test.

Formulation 4 (Pre-transform Deadzone Test) We want to find the q with maximal hypercube volume, Π_i q(i), satisfying the triangle inequality |D_N| · q ≤ |Q/2|, such that if the input satisfies |x(i)| < q(i)/2 for all i, then |X(i)| < Q(i) for all i, i.e., X will be quantized to zero and can thus be ignored. Here |D_N| is the element-wise absolute value of the DCT matrix of size N×N.

The solution to the above can be obtained numerically by searching over all values of q. Note that the ideal test would be one such that the dead-zone corresponding to the test fits as closely as possible within the dead-zone corresponding to the actual quantization. This test region is equivalent to the tilted solid square inside the dead-zone in Figure 3.1. For simplicity, we use a square test region, i.e., q(i) = q for all i, since its resulting volume is almost as large as that of the maximal non-square region. In order to perform this test we need to compute at most N absolute values, N comparisons and N-1 logical operations.

This test classifies the input into 2 classes, to which we assign either the full-operation DCT or the no-operation DCT. Consider now the baseline fast algorithm in Figure 1.3. Its computation is divided into three stages. From Figure 1.3, let

    D_8 = diag(H_1^3, H_2^3, H_3^3, H_4^3) · diag(H_1^2, H_2^2) · H_1^1

where H_1^1 has dimension 8x8, H_1^2 and H_2^2 have dimension 4x4, and H_1^3, H_2^3, H_3^3 and H_4^3 are 2x2 matrices. Therefore, we can write the output of each stage as

    H_1^1 x = [x_1^2 x_2^2]^t,   (3.1)

    diag(H_1^2, H_2^2) H_1^1 x = [x_1^3 x_2^3 x_3^3 x_4^3]^t.   (3.2)

Let us denote x_1^1 = x, so that x_j^i represents the j-th subvector at the input of stage i (i.e., the output of stage i-1). From the above, we can apply the proposed test method at the beginning of each stage, i.e., testing x_1^1 before the H_1^1 operation, x_1^2 and x_2^2 before H_1^2 and H_2^2, and (x_1^3, x_2^3, x_3^3, x_4^3) before (H_1^3, H_2^3, H_3^3, H_4^3). Let the associated threshold vectors be denoted by q_1^1, (q_1^2, q_2^2), and (q_1^3, q_2^3, q_3^3, q_4^3), respectively. The test structure is shown in Fig. 3.2. Starting from the root of the diagram, at each test, if the condition is satisfied no further operations are performed and the coefficients at the end of the paths starting from that test are set to zero; otherwise, we continue toward the leaves. In computing the thresholds of each test, we must incorporate the operations that remain to be performed after each stage, e.g., q_1^1 is computed from the full DCT matrix D_8, while q_1^2 is computed based on the remaining transformation diag(H_1^3, H_2^3) H_1^2, and q_1^3 is computed based on H_1^3 alone. Note that this classification is not restricted to detecting all-zero blocks as in [61] and can thus be used to determine whether subsets of the output coefficients will be zero.
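Returning to Formulation 4 and the square threshold q(i) = q, a compact sketch of the test is given below (Python/NumPy). The quantizer step and the example input are illustrative assumptions; only the threshold construction and the comparison follow the formulation above.

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II matrix D_N."""
        k = np.arange(n)
        D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        D[0, :] /= np.sqrt(2.0)
        return D

    def square_deadzone_threshold(Q, n=8):
        """Largest scalar q such that |D_N| . (q * 1) <= Q / 2 (square test region)."""
        row_abs_sums = np.abs(dct_matrix(n)).sum(axis=1)
        return np.min((Q / 2.0) / row_abs_sums)

    def all_outputs_quantized_to_zero(x, q):
        """Pre-transform deadzone test: the DCT of x can be skipped if every |x(i)| < q/2."""
        return bool(np.all(np.abs(x) < q / 2.0))

    # Example with an assumed uniform quantizer step of 16 for all coefficients:
    Q = np.full(8, 16.0)
    q = square_deadzone_threshold(Q)
    x = 0.3 * np.random.randn(8)
    if all_outputs_quantized_to_zero(x, q):
        X_quantized = np.zeros(8)      # no DCT needed for this vector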
This method can also be extended to a separable 2-D DCT (row-column 1-D DCT) with the use of the Kronecker (or tensor) product, i.e., D_N · x · D_N^t = (D_N ⊗ D_N) x̄, where x̄ is the row-column scanned 1-D vector formed from the matrix x, and the thresholds are obtained in the same manner. Furthermore, hierarchical classification for a non-separable 2-D DCT is possible by post-multiplying (3.1) and (3.2) with the corresponding stage matrices of the second (column-wise) transform, respectively. However, when the proposed VCA, which is based on the triangle inequality, is used for larger vector sizes, the efficiency of detecting inputs in the dead-zone greatly decreases. We now show how the vector dimension affects the efficiency of the test.

Figure 3.2: Proposed VCA algorithm.

Our threshold design problem is equivalent to the problem of fitting the biggest circle into the output dead-zone in Fig. 3.1. Assume the dead-zone is shrunk to a square with width equal to the narrowest side of the dead-zone; then the ratio of the area of the biggest inscribed circle to the dead-zone area (in 2-D) is π/4. For a size-n input vector, let V_s be the volume of the inscribed sphere, and let V_d be the volume of the pre-transform dead-zone. It can be derived that

    V_s / V_d = ( Π_{i=1}^{n} ∫_{-π/2}^{π/2} cos^i(θ) dθ ) / 2^n   (3.3)

with ∫_{-π/2}^{π/2} cos^n(θ) dθ = √π Γ((n+1)/2) / Γ(n/2 + 1).

Figure 3.3: Comparison of original and pruned algorithms for different distortions: (a) number of additions, (b) number of multiplications. The DCT lower bound corresponds to computing only the subset of coefficients that will be non-zero after quantization. The VCA lower bound corresponds to pruning subject to the classification mechanisms of Section 3.1.2, i.e., we can only prune subsets of coefficients which are tested jointly.

It can be verified that as the dimension grows larger, this ratio becomes smaller, which in turn means that the ratio of our threshold hypercube to the actual dead-zone volume is even smaller. This means that the region that is part of the dead-zone but not covered by the threshold hypercube gets larger and is thus left unexploited. As mentioned earlier, the 2-D DCT is equivalent to a 1-D transform using a Kronecker product, and the resulting dimension of the transform is the square of the original dimension. Therefore, as can be seen in Fig. 3.3, the efficiency of the classification becomes less significant when we try to classify the input before the first-direction (row-wise) 1-D DCT, since the equivalent size of the input vector becomes 64 instead of 8. On the other hand, if we apply the proposed VCA to the second-direction (column-wise) 1-D DCT only, the size of each input vector is 8, because the DCT operation of each column can be done independently, thus giving more classification gain.
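A quick numerical check of (3.3), using the closed form of the cosine integral, illustrates how sharply the detectable fraction collapses as the dimension grows; the script below is only a verification aid, not part of the algorithm.

    from math import gamma, pi, sqrt

    def sphere_to_cube_ratio(n):
        """Eq. (3.3): volume of the inscribed n-sphere over the enclosing hypercube."""
        ratio = 1.0
        for i in range(1, n + 1):
            ratio *= sqrt(pi) * gamma((i + 1) / 2) / gamma(i / 2 + 1)  # integral of cos^i
        return ratio / 2 ** n

    for n in (2, 8, 64):
        print(n, sphere_to_cube_ratio(n))   # ~0.785 (= pi/4) for n = 2, vanishing for n = 64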
We encode the "lenna" image of size 512x512 using different quantization parameters to obtain the complexity at different qualities for the decoded images. The DCT computation is exact and the only distortion is that introduced by quantization. The results of pruning for each quantization level are shown in Figures 3.3 (a) and (b), normalized by the number of operations of the fast algorithm in Figure 1.3, which is used for the separable row-column 1-D DCT. In these figures, the classification costs are not included, in order to show the efficiency of the pre-transform deadzone test compared to the ideal case where all input points inside the deadzone are detected; thus the results indicate the minimal achievable complexities, or lower bounds. In Fig. 3.3, we see that the lower bound on complexity reduction using our proposed VCA is close to the ideal case when it is applied to the column 1-D DCT only.

3.1.2 Optimal Classification

When the complexity of the tests is taken into account, the total complexity can be higher than that of the original fixed-complexity algorithm, as seen in Figure 3.4. As in Chapter 2, we optimize the classification such that only tests that provide reductions in average complexity are kept (i.e., the savings achieved when operations are skipped outweigh the overhead of testing all inputs). This optimization is based on training on a set of images and is performed through exhaustive search (since the number of tests involved is small). We use "baboon", "boat", "creek" and "lake" as training data to design the VCA at each quantization parameter for the "lenna" image. The results, in terms of both estimated complexity and CPU clock¹ savings, are shown in Figure 3.4. We use the same set of operation weights as in Table 2.1. It can be seen that when the quantization parameter is small, i.e., in the low MSE region, the complexity is the same as that of the original fixed-complexity algorithm. This means that there is no test in the algorithm, because the optimization process determines that the tests would not result in any savings. On the other hand, in the high MSE region, given that there will be more zero outputs, there is something to be gained from the VCA approach, as will be seen later in this chapter.

¹The implementation is run on a Sun Ultra-1 running Solaris 2.5.

Figure 3.4: Complexity (clock cycles) versus distortion comparison for "lenna" JPEG encoding.

From our results we can see that the gains are modest, due in part to the overhead involved in testing. However, a major reason for the lack of more significant gains is the fact that we are still computing an exact DCT, when in fact at high MSEs an approximate DCT would be acceptable, given that the data is going to be coarsely quantized.

3.2 Approximate DCT

The DCT can be approximated using a subband decomposition as in [48] and [47]. This approach exploits the fact that, with appropriate post-processing, the DCT coefficients can be obtained after the subband decomposition, and that in typical natural images one can disregard high frequency subband components without greatly affecting the accuracy of the calculated DCT. Therefore, the DCT coefficients can be approximated by post-processing only the low frequency subband coefficients. Because a simple subband decomposition can be used (Haar filters, for example), the overhead for pre-processing is small. A subband decomposition
can hence be viewed as a pre-processing step that reduces the interdependencies among the inputs and gives some clues about what information can be disregarded.

Instead of approximating the whole DCT block with a subset of DCT coefficients, another approach is to estimate all the coefficients but with less accuracy. In this section, we first give a review of the statistical sum of absolute value test (SSAVT) [64] and analyze the distortion contributed by the approximation. Then our proposed approximate algorithm is described and its error analysis is discussed.

3.2.1 SSAVT Review

We first review the SSAVT algorithm [64]. From the Laplacian model for residue pixels presented in Section 1.6, we have that the variance of each DCT coefficient is (see (1.5))

    σ²_X(u,v) = σ² Γ²_N(u,v)

where Γ_N(u,v) is a function of the correlation coefficient of the spatial-domain pixels; that is, the variance of each DCT coefficient is a scaled version of the spatial-domain variance σ². In [64], the residue pixels are modeled in the spatial domain as Laplacian distributed with zero mean, as in (1.3). This assumption is also applicable, to some extent, to pixels in INTRA coded frames; thus, in this work we also apply the Laplacian model to INTRA coding. Since the fundamental idea of the SSAVT is to allow looser thresholding by introducing a probabilistic criterion (as opposed to the deterministic criterion in [61]), a model mismatch for I-frames would only result in a small deviation of the probabilistic criterion from the actual one. From this Laplacian assumption, we can estimate σ from the sum of absolute values (SAV) as

    σ ≈ √2 · SAV / N²   (3.4)

where N² is the number of pixels in an N×N block. In the case of a P-frame, the SAV can be obtained as a by-product of the motion estimation, since the sum of absolute differences (SAD) is computed and compared in order to find the best motion vector. However, it is worth mentioning that the SAD corresponding to a macroblock can only be used to obtain a value of σ that will be the same for all four 8x8 blocks. In the advanced prediction mode (OBMC), the SAD of each block is computed, and therefore we can have a different σ for each block. Alternatively, it would be possible to explicitly compute the SAD for each of the subblocks so as to have a different σ for each.

The key to reducing the complexity comes from the fact that, given the model, we can always find the range of values that a DCT coefficient will take with some probability. For example, the probability that a value falls within (-3σ, 3σ) is about 99%, and as a result we can skip the computation of a certain coefficient if its value is 99% sure to be quantized to zero. The test then consists of checking whether

    3σ_X(u,v) < 2QP + DZ   (3.5)

where QP and DZ are the quantization parameter and the additional deadzone factor, respectively. That is, the 99% probability interval is completely embedded in the deadzone. Equivalently,

    3Γ_N(u,v)σ < 2QP + DZ  ⟺  SAV < (2QP + DZ) · N² / (3√2 Γ_N(u,v))   (3.6)

where (3.6) follows from (3.4). Therefore, the test is equivalent to threshold testing of the SAV. The confidence level can also be changed depending on the complexity reduction versus accuracy tradeoff.
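In code, the test (3.6) reduces to a single comparison of the SAV against a precomputed threshold. The sketch below uses placeholder values for Γ_N(u,v), QP and DZ; the actual Γ values come from (1.6), which is not reproduced here.

    from math import sqrt

    def sav_threshold(qp, dz, gamma_uv, n=8):
        """Right-hand side of (3.6): the SAV level below which coefficient (u,v)
        is, with roughly 99% confidence, quantized to zero."""
        return (2 * qp + dz) * n * n / (3 * sqrt(2) * gamma_uv)

    def skip_coefficient(sav, qp, dz, gamma_uv, n=8):
        """SSAVT-style skip decision for one DCT coefficient, eqs. (3.4)-(3.6)."""
        return sav < sav_threshold(qp, dz, gamma_uv, n)

    # Example with assumed numbers only (not values from this work):
    print(skip_coefficient(sav=900, qp=10, dz=0, gamma_uv=2.5))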
Furthermore, from (1.6) it can be seen that the variances decrease from the DC coefficient toward the higher frequency AC coefficients. This implies that if a lower frequency DCT coefficient is determined to be zero, so are all higher frequency coefficients. As a result, the classification can be performed by testing the SAV against a set of thresholds corresponding to classifying the output 8x8 DCT block as i) all-zero, ii) DC-only, iii) low-4x4, or iv) all-64. For these tests, Γ_8(0,0), Γ_8(1,0) and Γ_8(4,0) are used in (3.6), respectively.

The result of this classification is very promising: only a slight degradation in PSNR is observed, because the distortion caused by not computing the high frequency components is compensated by the corresponding bit savings. We now provide an analysis of the distortion introduced by the SSAVT that was not provided by the original authors [64], and which will help us evaluate the degradation introduced by the SSAVT and compare it with that of our proposed approximate algorithms.

3.2.2 Error Analysis of SSAVT

For each outcome of the SAV test, a corresponding reduced output DCT is applied. We can then consider the reduced output DCT as an approximation of the exact DCT. For example, the equivalent transform matrix of the low-4x4 DCT is [I_4 0; 0 0] · D_8, where I_4 is the 4x4 identity matrix; in words, it is D_8 with its lower 4 rows (and, in the column-wise pass, its rightmost 4 columns) set to zero. Therefore, different approximate DCTs are used for different values of σ². Let {Th_0, Th_1, ..., Th_G} be a set of thresholds, defined as functions of QP, for classifying an input into G classes (e.g., for the SSAVT, G is 4), where Th_0 = 0 and Th_G = ∞. The i-th reduced output DCT is applied if Th_{i-1} ≤ σ̂ < Th_i, where σ̂ is the approximation of σ computed from the SAV and

    Th_i = (2QP + DZ) / ( 3 max_{(u,v)∉B_i} Γ_N(u,v) )   (3.7)

where B_i is the set of output coefficients that are computed. Therefore, the block distortion of a class i input can be expressed as

    D_SSAVT(B_i) = Σ_{(u,v)∈B_i} D(u,v) + Σ_{(u,v)∉B_i} σ² Γ²_N(u,v)   (3.8)

where D(u,v) is obtained from (1.10). The first term is the sum of the distortions of the coefficients that are computed, and the second term corresponds to the coefficients that are not computed and thus are not coded.

Figure 3.5 shows the normalized increase in distortion when using the SSAVT at various σ² and QP (the ratio between the second and first terms of (3.8)). This result is based on the assumption that the variance of the signal can be perfectly estimated from the SAV. It can be seen that the increase in distortion has a seesaw shape as a function of the pixel variance. This can be explained as follows. For a given QP, there is a set of thresholds that specifies which reduced DCT will be used within a certain interval. At the beginning of each new interval, an algorithm which computes more coefficients is selected, thus pushing the distortion down, and as σ² grows the extra distortion starts to become larger again. For the SSAVT algorithm there are 3 thresholds; after the third threshold has been exceeded (at the right end of each curve in Fig. 3.5), there is no additional distortion since all 64 DCT coefficients are computed.

Figure 3.5: Additional distortion (normalized by the original distortion) when using the SSAVT at various levels of the pixel variance σ².
As an extension to [64], we perform the SSAVT for all possible reduced output DCTs, i.e., we also consider the low-2x2 output class. In other words, c(A_i) has the form shown in (2.7). The reduction in computation and the increase in distortion from computing only a subset of DCT coefficients have the benefit that the overall rate required to code the DCT coefficients is reduced. Therefore, the rate and complexity of a class i input can be written as

    R_SSAVT(B_i) = Σ_{(u,v)∈B_i} R(u,v)   (3.9)

    T_SSAVT(B_i) = c(A_i) + i · c_SAV   (3.10)

when Th_{i-1} ≤ σ̂ < Th_i, where c_SAV is the cost of one SAV threshold test. In order to find the average values of distortion, rate and complexity, D̄_SSAVT, R̄_SSAVT and T̄_SSAVT, the probabilities of the outcomes of the SAV test have to be taken into account. For a given model of the source, the variance σ² cannot be perfectly estimated by the SAV. The mean absolute sum m = SAV/N² can be viewed as a random variable representing the time average of the source. It can be derived from the model of the source (a first-order Markov source) that the mean of m is E{m} = σ/√2, and that its variance σ²_m depends on σ, the block size N² and the correlation coefficient ρ of the source. Let σ̂² = 2m² denote the approximation of σ² based on m. This σ̂² is tested against a set of thresholds {Th_0, Th_1, ..., Th_G}, where G = log₂N + 1, covering all possible reduced output DCTs. The probability of each outcome of the threshold test can be written as

    P_i = ∫_{Th_i}^{Th_{i+1}} f_{2m²}(x) dx = ∫_{√(Th_i/2)}^{√(Th_{i+1}/2)} f_m(x) dx   (3.11)

where f_m(x) is the p.d.f. of m. From [88] (see also Chapter 2), the difference between the partial mean of absolute pixel differences (PMAD) and the actual mean of absolute pixel differences (MAD) is modeled as a Laplacian distribution. Therefore, in this thesis we assume that m has a truncated Laplacian distribution with mean and variance as above, i.e.,

    f_m(x) = (λ_m / L) e^{-|x - E{m}| / λ_m} for x ≥ 0, and 0 otherwise,

where L = ∫_0^∞ λ_m e^{-|x - E{m}| / λ_m} dx is a normalization factor that ensures that a valid p.d.f. is obtained, and λ_m = σ_m/√2. Thus, the complexity, rate and distortion can be expressed in a fashion similar to (2.9) as

    T̄_SSAVT = Σ_{i=1}^{log₂N + 1} P_i · T_SSAVT(B_i)   (3.12)

    R̄_SSAVT = Σ_{i=1}^{log₂N + 1} P_i · R_SSAVT(B_i)   (3.13)

    D̄_SSAVT = Σ_{i=1}^{log₂N + 1} P_i · D_SSAVT(B_i)   (3.14)

Note that all of the above quantities are functions of QP, σ² and N. The R-C-D characteristic of the SSAVT is shown in Figure 3.6.

Figure 3.6: Rate-complexity-distortion functions of the SSAVT. The R-C-D results of the SSAVT and the ideal dyadic test are very close and hence cannot be visually distinguished in the figure.

Figure 3.7 shows the complexity comparison between the SSAVT and the 2-D dyadic classification. Since the DCT and IDCT operations are transposes of each other, the 2-D dyadic classification for the IDCT in Chapter 2 can be considered as the ideal case for the SSAVT, where all instances of reduced output DCTs are 100% correctly detected. The addition and multiplication results follow a similar trend to the total complexity curve, and are thus omitted here. The R-D function of the SSAVT was already shown in Figure 1.8.
Further reproduction prohibited without permission. Total Normal z«d Comptaoty Numbar at Taata S > ~ v- SSAVT DyadtoalMJ V o 2 * 700 • O 2 * tOO - - SSAVT OyadeldM l (a ) (b) Figure 3.7: Total complexity and the number of tests needed for SSAVT algo rithms and 2-D dyadic at cr2 = 100 and 700. 2-D dyadic is considered as an ideal case for the SSAVT where reduced classes are 100% detected. does not degrade much when complexity is reduced. While distortion increases, fewer bits are also needed for the reduced output DCT. The C-D and R-D func tions of the SSAVT and the ideal dyadic test are shown in Figures 3.8 (a) and (b), respectively. In Figure 3.7, the complexity performance as a function of QP of the SS AVT is significantly inferior to that of the ideal dyadic test. It implies that the ability to capture input classes with zero outputs is not very good. However, when considering the resulting rate and distortion of the SSAVT, the overall R-C-D (Fig. 3.6) is only slightly degraded from the ideal dyadic test. The reason to support this claim can be found by considering Fig. 3.8. One can see that in Fig. 3.8(a) the C-D performance of the SSAVT is worse than the ideal at the same complexity, the SSAVT results in higher distortion. This is because it incorrectly classifies nonzero coefficients to be zero coefficients. On the other hand, when it classifies zero coefficients to be nonzero, unnecessary computations are spent on those zero coefficients. However, in Fig. 3.8(b) the SSAVT R-C performance is better than the ideal dyadic test at low rate, because the bits saved from incorrectly classifying nonzero coefficients to zero. As a result, the overall R-C-D performance degradation from the ideal case is m itigated by this R-D tradeoff. 79 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. g ‘ . - - ' . I — . ' » o - '■ ■ ■ ■ ■ ■ ■ ■ - " ■ o 5 10 t s 0 5 TO ts C am fM uiy p a r 64 pcr«i* x 10* C cnptoxtty p*r tA pu*Jx * to* (a) (b) Figure 3.8: (a) Complexity-distortion and (b) Rate-complexity curves for different algorithms, i.e., SSAVT (’ solid1 ) and 2-D dyadic (’dashed1 ), for DCT size 16x16 (’x1 ), 8x8 (’o’) and 4x4 (’*’) at a 2 = 100. 2-D dyadic is considered as an ideal case for the SSAVT where reduced classes are 100% detected, b SSAVT is an example of the superior performance of approximation approaches as compared to exact VCA approaches. However, in the next section, as an al ternative to the SSAVT in which only a subset of the outputs is computed, we propose another class of approximate algorithms in which all the outputs are computed at an accuracy lower than the minimum requirement of the standard DCT. 3.2.3 A P P R O X -Q D C T Here we introduce an approximate algorithm (see also [66]) where in the DCT coefficient computation we replace the multiplications with additions and binary shifts in a manner that is quantization dependent. For large QP, a coarse approxi mate DCT is used because even if the error in coefficient computation is large, the large quantization step means that their quantized representation will introduce even more error. Similarly, finer approximate DCT is used when QP is small. Our proposed approximate DCT is shown in figure 3.9. This figure is derived from the fast DCT in 1.3 by simplifying the operations. The m atrix p Q ranges from the 80 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
xO xl x2 x3 x4 - w .X l x5 x6 x7 Figure 3.9: The approximate DCT algorithm. , to more complex m atrix Po2!Po3)Po4J and p 0 s shown below. qi o4 O 4 O 4 O4 ' 0 1 0 1/8 - 1 /8 Po2 — 0 - 1 /8 1 1/8 0 0 0 - 1 /8 1 1/8 0 1/8 1/8 0 1 q2 0 4 O 4 o 4 O 4 0 1 1/8 1/8 - 1 / 8 Po3 = 0 - 1 /8 1 1/8 - 1 / 8 0 - 1 / 8 - 1 /8 1 1/8 0 1/8 1/8 - 1 /8 1 < 1 3 04 O 4 O 4 O4 0 1 1/8 1/8 - 1 / 4 Po4 = 0 - 1 / 8 1 1/4 - 1 / 8 0 - 1 /8 - 1 /4 1 1/8 0 1/4 1/8 - 1 /8 1 identity m atrix, p 0i = Qi 0 0 I 81 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 3.1: Number of operations required for proposed approximate algorithms where Alg. No. i corresponds to using the transform m atrix PO T -. Alg. No. Additions Multiplication Binary Shifts 1 24 0 2 2 33 0 7 3 38 0 8 4 42 0 12 Vetterli’s [29] 29 13 0 Po5 q 3 o £ 0£ 0 £ 0 £ 0 1 1/8 1/8 - 3 /1 6 0 - 1 / 8 1 3/16 - 1 / 8 0 - 1 / 8 -3 /1 6 1 1/8 0 3/16 1/8 - 1 / 8 1 where qi q2 - qa 0 = 1 - 1/2 1/2 1 1 - 3 /8 1/2 1 1 - 3 /8 3/8 1 [ 0 0 ] The number of operations required for these approxim ate DCTs are shown in Table 3.1. Let an approximate DCT m atrix using one of the above approximations be de noted by (D approx)- After these approximations, if one wants to obtain the exact DCT coefficient values, the post-transform m atrix would be D 8 • D, -l a p p r o x . This post-transform matrix gets closer to the identity m atrix (up to scaling factors) as the approximation gets more accurate. The scaling factors needed at the output can be found to be diag(D 8 • D a p p r o x - 1 ) , and can be absorbed in quantization, 82 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. i.e., in our approximate DCT, the post-transform is approximated by setting off- diagonal elements to zero leaving only the diagonal elements as scaling factors. Since this algorithm is lossy, the approximation introduces more distortion to the reconstructed images. The rate-distortion curves of these approximate DCTs using JPE G coding are shown in Figure 3.10. Rate-Oistortion ot different OCT alg. for fenna* Exact DCT Approx ~Q DCT Approx#5 Approx#4 Approx#3 Approx#2 Reduced 4x4 Approx# 1 SB-DCTLL 36 z 32 28 26 0.5 1.5 Bit per pixel Figure 3.10: Rate-Distortion curve of 512x512 lenna image JPEG coding using various DCT algorithms. Note that at high bit rate coarser approximate algo rithm performances deviate from the exact DCT performance dramatically. The quantization dependent approximation can maintain the degradation level over wider range of bit rate. In this experiment, we encode the “lenna” image with JPEG compression using the example quantization m atrix [1], It can be seen that, as expected, in the high distortion region the increase in distortion introduced by the approximate DCTs does not affect performance because it is less than the distortion due to quantization. Therefore, we develop an algorithm which is quantization parameter dependent (shown as a solid line, Approx-Q, in Figure 3.10) in which the selection of the approximate algorithm is made at the beginning of the encoding process depending on the quantization parameter. It uses a coarser DCT approximation algorithm for low quality coding and finer DCT approximation for high quality coding for small degradation. The degradation of the decoded image introduced by the approximate DCT is 0.12 dB at 0.18 bpp, 0.15 dB at 0.91 bpp and 0.64 dB 83 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. at 3.17 bpp. From Figure 3.10, it is obvious that even w ith the Approx#2, the R- D curve is better than other reduced complexity approximate algorithms such as pruned DCT (computing only low frequency 4x4 DCT) and subband DCT (using low-low subband [48]). This is because we do not lose high frequency information which is present at high rates. 3.2.4 A P P R O X -Q Error A nalysis V V e now analyze the additional distortion introduced by the approximation of the DCT and, based on that, we show how to autom atically select which (coarser or finer) approximate DCT should be used for a given QP and cr2 . Our goal is selecting the algorithm to ensure that the resulting overall distortion does not exceed a certain level. Let us denote the transform matrix of the z-th approximate DCT by D*, the input spatial domain block x, and the DCT computed by this reduced m atrix by X,. Therefore, the difference between the exact and approximate computation can be written as e = X - X i = D xD k - DixDj = ( D - D i) x ( D - D i)t = EixE[ (3.15) where E, = D — Di is the approximation error m atrix. W ith a similar analysis to that in Section 1.6, the variance of the error in DCT domain can be derived as cr 2e(u,v) = o-2[EiRE?](tif1 t)pEiREf](B tl,) = cr20-(u, v), (3.16) where R is the correlation m atrix of the input vector and 4>^(u,v) is a scaling constant that is a function of the approximation m atrix. In general , v) should be much smaller than T%(u, v), since their ratio is the relative additional distortion introduced by the approximation. The total block approximate error can be 84 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. written as De{q), D e(q) =cr 2 ^ 2 v ) (3-17) (* .» ) We can model this approximation error as an additive white Gaussian noise (AWGN) that is to be added to another AWGN error due to quantization, i.e., from the approximate quantized DCT, X can be w ritten as X = X + nq + na where nq is a Gaussian r.v. modeling the error due to quantization and na repre sents another Gaussian r.v. representing the approximation error. It is reasonable to assume that nq and na are independent. Therefore, the distortion can be ex pressed in terms of the summation of the original distortion and the distortion from approximation. D appro x oQ P , cr2, N ) = J 2 D ( u , v) + D e(q) (3.18) (u,v) NormalLzod A dditional Distortion lor A pprax-O C T 10‘* Figure 3.11: Additional distortion (normalized by original distortion) using ap proximate DCT algorithms # 1 (’*’)> # 2 (’V ’)> # 3 (’° ’)> an<i (’©’) at various pixel variance a2. Figure 3.11 shows the approximation error results of the 5 approximate DCT algorithms normalized by the distortion due to quantization. It can be seen that 85 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. not only the approximation error depends on the QP, it also increases as the pixel variance cr2 grows, i.e., for a fixed pixel variance the additional error ratio increases as quantization stepsize decreases. Furthermore, for a fixed QP, the additional error ratio increases with pixel variance. However, for a given approximate al gorithm, Q P still plays a bigger role in the resulting error. 
Therefore, we can systematically select which approxim ate algorithm to be used for a given QP and cr2, so as to ensure that the approximation error is below a certain limit, i.e., the #-th algorithm is selected if De(q) < rj ^ D (Q P , a2, N) (u, v) where 7 7 is the level of desired additional error . For example, when coding a frame with fixed QP for all blocks, low variance blocks (associated with low activ ity) require less accurate DCT approximation whereas high variance blocks must use finer approximation in order to maintain the same level of additional error throughout the entire frame. 3.3 R esu lts and H ybrid A lgorith m s In this section we discuss a series of hybrid algorithms which combine the strengths of each the previously proposed algorithms. Experimental results are shown based on the baseline H.263 TM N’s codec [58] used to encode 249 frames of the “Fore man” sequence at different target bit rates. The Visual C + + compiler with op timization option is used. The CPU clock cycles spent in both the DCT and quantization are measured because for the VCA exact DCT, not only the DCT can be prunned but also the quantization for those zero coefficients can be omit ted. For Approx-Q, since the approximated coefficients have to be scaled, the extra scaling operation has to be done in the quantization process thus affecting the overall complexity. However, it will be seen later th at the speedup from the DCT part is much more significant than the slowdown in quantization for the Approx-Q case. As can be seen in Fig. 3.12, the speedup of the Approx-Q is much higher than the VCA exact algorithm with only 0.01-0.06 dB degradation. 86 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. OCT ♦ quantization complexity results of foreman* ASSAVT-VCA 5 0 Kbps SSAVT .VCA . 40 Kbps a: 5 a. Appiox-V C A 3 0 Kbps 2 7 5 Figure 3.12: The complexity (CPU clock cycle) of the DCT + Quant, normalized by the original algorithm in TMN at various PSNRs (different target bit rates) of original DCT (’+ ’), SSAVT (’o’), Approx-Q (’A ’), ASSAVT (’□ ’), Approx-VCA (V)> and ASSAVT-VCA (’*’). 3.3.1 A p p rox-S S A V T (A SSA V T ) We first combine the SSAVT and the Approx-Q DCT in such a way that the SAV is first tested and the input is thus classified in the same manner as the SSAVT. Then the reduced DCT is approximated using the approximate algorithm. T hat is, we use approximate algorithms also for the reduced dimension DCTs used in SSAVT. The total block distortion for this combined algorithm, which we call ASSAVT, can be expressed as D a s s a v t { Q P , c t 2, N ) = ^ 2 [D(u,v) + a 2(jyf(u, v)} +- a 2F%(u,v) (3.19) (u,v)€Bi (u,v)£Bi The first summation on the right side is the distortion plus approximation er ror of the computed coefficients. The second term is for uncomputed coefficients. Figure 3.13 shows the result of ASSAVT with the target approximation error at 2xl0~4 of the original distortion. For simplicity, this experiment assumes that the variance of the input is perfectly estimated by the SAV. Therefore, the class that the input belongs to is deterministic. Therefore, the estim ated SSAVT distortion 87 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. is known with certainty. In practice, the estimation of the pixel variance is ran dom, thus resulting in a probabilistic estimation of the SSAVT distortion. 
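As a concrete note on the selection rule of Section 3.2.4: once the error factors are tabulated, the per-block decision reduces to one comparison per candidate approximation. The following C sketch shows one way such a decision could be organized; it is a sketch under stated assumptions, not the implementation used in our codec. In particular, sum_phi[a] is assumed to hold ΣΦ_a(u,v) of (3.17) for approximation a (ordered from coarsest to finest), and quant_dist stands for the quantization distortion ΣD(u,v) at the current QP; both names are introduced here only for illustration.

    /* Sketch of the QP/variance dependent selection of Section 3.2.4:
     * choose the coarsest approximate DCT whose additional error
     * D_e = sigma^2 * sum_phi[a] (cf. (3.17)) stays below eta times the
     * quantization distortion.  sum_phi[] and quant_dist are assumed to
     * be precomputed; they are not part of the thesis code. */

    #define NUM_APPROX 5

    int select_approx_dct(double sigma2, double quant_dist,
                          const double sum_phi[NUM_APPROX], double eta)
    {
        int a;
        for (a = 0; a < NUM_APPROX; a++) {        /* coarsest first */
            double extra = sigma2 * sum_phi[a];   /* approximation error */
            if (extra <= eta * quant_dist)
                return a;                         /* cheapest acceptable */
        }
        return NUM_APPROX - 1;  /* fall back to the finest approximation */
    }

With a fixed QP for the whole frame, this rule assigns the coarser approximations to low variance blocks and the finer ones to high variance blocks, as discussed above.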
The actual deviation from the SSAVT distortion must be averaged over the support of the estim ated pixel variance. In Figure 3.13, it can be seen that the additional distortion follows the SSAVT seesaw-shaped results except that it can now be kept under the specified error bound. Thus, when the error from SSAVT becomes too small, i.e., sufficient coefficients are computed for the given cr, then we can observe the error due to the approximation. Except in those cases, the additional error due to approximation is negligible. Moreover, these situations occur when error is small, anyway, and therefore adding the approximate algorithm to SSAVT is always beneficial. ASSA V T - - SSA V T JO'* J 0 ’ a 1 0 '* pixel v a ria re c a 2 Figure 3.13: Additional distortion (normalized by original distortion) using the ASSAVT with 2xl0~4 target deviation from the original distortion (’ - -’), SSAVT at QP = 10 (’o’) and 22 (’*’), respectively. We apply the ASSAVT to the DCT subroutine of the TMN [58] H.263 codec with some modifications. First, the reduced DCT, i.e., 2x2-DCT and 4x4-DCT, use Approx#4 and Approx#5, respectively. This follows the result in Fig 3.13 that as the variance a 2 gets larger, the error becomes larger too. This also results in larger size DCT after SSAVT classification. Therefore, in order to maintain a small error, finer approximation algorithm has to be used. For DC-only class, the DC coefficient can be computed precisely using only additions and a binary shift. For 88 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the full DCT class, vve use the APPROX-Q algorithm presented earlier in which the selection of the approximate algorithm is quantization parameter dependent. The quantization can also be done faster using the result of SSAVT classification, i.e., the high frequency DCT components do not need to be quantized if they are not computed. Note that the approximate algorithm assignment can be changed according to the performance degradation requirement. In our experiments, we simply select a scheme that gives reasonable complexity reduction and degradation tradeoffs. The results of the proposed ASSAVT as compared to SSAVT are shown in Figure 3.12. The CPU clock cycle of both DCT and quantization are measured as mentioned above. It can be seen that the ASSAVT can obtain 47-55 % complexity reduction, as compared to 17-33% reduction by the SSAVT, while the coding performance is degraded only by 0.04-0.14 dB as compared to 0.02-0.08 dB by SSAVT with respect to the exact algorithm. 3.3.2 A pp rox-V C A D C T It is worthwhile pointing out here th at since the structure of the approximate DCT algorithm is similar to the fast algorithm in Figure 1.3 we can apply the classifi cation structure and the optim ization technique to obtain the optimal VCA as in Section 3.1 for the Approx-Q DCT algorithm presented in Section 3.2. The algo rithm is now data-dependent and will be called Approx-VCA DCT. Experimental results are also shown in Fig. 3.12. In this experiment, we also take advantage of the zero information from the VCA in the quantization process as mentioned earlier, i.e., if the resulting coefficients from VCA are zero, no quantization oper ation is needed. In practice, one could combine VCA DCT and quantization into one subroutine, but for simplicity we still separate them and simply introduce a zero test before quantizing each coefficient. 
3.3.3 A S S A V T -V C A D C T Finally, we can combine SSAVT, Approx-Q and VCA exact DCT into one al gorithm. Based on the ASSAVT in Section 3.3.1, we modify the approximate 89 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. algorithm in the case of all-8x8 class to have pre-transform deadzone test capa bility as in Approx-VCA. The result is also shown in Fig. 3.12. It can be seen that the C-D result is improved from the ASSAVT by about 8-10%. However, when compared to the Approx-VCA algorithm, the ASSAVT-VCA improves just a very small fraction at low rates and shows almost no improvement at high rate. This is because the SSAVT classification efficiency drops at high rates. 3.4 Sum m ary and C onclusions We have proposed two techniques for fast VCA DCT computation. The first one computes exact DCT coefficients in a VCA manner by checking on the pre transform deadzone region. This algorithm is hierarchical in the sense that the classification can be broken down to smaller group of coefficients. However, the drawback of this algorithm is the efficiency of the pre-transform deadzone drops rapidly as the dimension of the transform grow. Thus we apply it only to the second 1-D DCT. At high rate, this algorithm shows almost no gain. The sec ond technique was then considered. Unlike the first approach, the resulting DCT output is not exact. We gave a review of the reduced DCT using SSAVT classifica tion and analyzed the performance. Then we proposed an approximate algorithm that computes all the DCT coefficients in a block with less accuracy than in the exact case. The complexity reduction comes from using cheaper arithmetic opera tion (no multiplications) with the sacrifice in slightly poorer coding performance. Finally, we showed some experimental results and suggested several hybrid algo rithms th at combine all the above mentioned algorithms. It can be seen that the result of approximate techniques are much more promis ing than the exact algorithms in terms of complexity at the cost of only a slight degradation in picture quality. Since the DCT cannot be computed exactly using any fixed precision computer, the error is already introduced. Also, the quantiza tion effect masks out the computational error in the case of low rate. This results in a slight decrease in the overall performance. Another conclusion is that using the statistical model, such as SSAVT, allows more classification gain with a small coding performance degradation. 90 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 4 Motion Estimation: Probabilistic Fast Matching In this chapter, we propose a matching algorithm for motion estimation that uses probabilistic stopping criterion based on the partial distance search [79]. It can be categorized as a computationally scalable Fast Matching (FM) algorithm. The basic idea is to stop the computation of the matching metric when the partial metric indicates th at the total metric is likely to exceed that of the best candidate so far. This allows us to save in computation when “bad” candidate vectors are being tested, since these can be discarded before all pixels in the macroblock have been used to compute the metric. To achieve this goal we formalize a hypothesis testing framework where the decision to stop the metric computation is done based on probabilistic models of the distribution of the actual metric based on the calculated partial metric. 
We select a given probability of error (i.e., missing a “good” vector) based on these models and by varying this probability we can achieve a computationally scalable calculation. From the nature of the test, we will refer to the original partial distance search as determinictic testing fast matching (DTFM) as opposed to the hypothesis testing fast matching (HTFM) ([88],[101]) for the proposed algorithm. This chapter is organized as follows. In Section 4.1, the original partial dis tance algorithm is reformalized and a new macroblock partitioning is introduced. In Section 4.2, we provide a formal description of the hypothesis testing framework 91 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. that forms the basis of our algorithm. We model our variables of interest and as sume that their second order statistics are known. W ith a given error probability or probability of false alarm, we design the hypothesis testing such that the error probability meets the requirements. We also show simulation results on some test sequences where we assume all necessary statistics of a particular sequence are known to the encoder. In Section 4.3, since each sequence has different character istics which are unknown to the encoder before the encoding process takes place, we discuss how to obtain the statistics adaptively during the encoding. As an extension to [86], we propose an alternative computationally scalable Fast Search (FS) in Section 4.4.1 which can be combined with our HTFM. Finally, in Section 4.4, we show the result of our proposed HTFM with adaptive param eter estima tion as compared to DTFM using ES, 2-D Log search [68] and ST 1 search [4]. Finally, concluding remarks are given in Section 4.6. 4.1 P artial D istance Fast M atching In this work, we use SAD as the matching criterion1 (refer to Table 1.3 for nec essary notations). In all our experiments SAD calculations are based on at most 128 pixels out of the 256 pixels of a macroblock. As in [67], this subset /?9 C B is obtained by subsampling using a quincunx grid (see Fig. 4.1.) As shovm by Fig. 4.2, this particular subsampling tends to provide sufficient accuracy for the SAD calculation (see [67, 4]), i.e., the overall MSE does not increase significantly if we use [3 q instead of B. Our work is based on splitting each set fiq into b subsets of pixels, yt, for i 6 {1, 2,..., 6}, such that (J^-i Vi = /3q and yi Hyj = 0 for i ^ j (see for example Fig. 4.1). Let us define = (Jy=i 2 /y- During the SAD calculation of mvj, we compare the partial calculation of the SAD, SA D (m vj} X{) with the best SAD obtained out of all previously tested candidates, SADbsj{p(j^i, j3q). If the partial SAD is greater than SADbsf, we can term inate the computation and continue to L Note, however, that our approach could be easy generalized to other additive distance metrics such as MSE (see for example our work in applying this approach to searching of a Vector Quantizer codebook in [101]). 92 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the next vector. Otherwise, we compute SA D (m vj,X i+i) for the next stage and perform the test again. If no early term ination occurs, the process repeats until the final stage, b, is reached. There are many possible ways to partition j3q into subsets y,-. 
In this work, we propose two methods, which both have |yj| = 8, Vi, and thus result in 16 stages of testing: these are the uniform partition (UNI)2 and the row-by-row partition (ROW) shown in Fig.4.1(a) and 4.1(b), respectively. In Fig.4.1(a), the partition is designed such th a t the correlation between SAD (m v, 0q) and SAD{rnv,Xi) is maximum, given th at the pixels are uniformly spaced. This results in fewer pixel comparisons on the average, since early term ination is more likely. However, for a hardware implementation, UNI may not be desirable as it results in more irregular memory access patterns. Conversely, ROW (see Fig.4.1 (b)) provides a more regular memory access that could simplify the implementation, although we can expect ROW to be worse than UNI in terms of producing a reliable estimate of the total SAD. To simplify the notation, when there is no ambiguity we omit to write the terms mvj. 7 and B. Also we use P SA D i to represent the partial SAD at stage i, i.e., SAD{rnv, Xi) for i = 0, ...,& — 1. Note that the partial SAD can be computed recursively as P S A D i+l = P SA D i + SAD (m v, Vi) (4.1) where P SA D 0 = 0 and P S A D b = SAD. It is clear from (4.1) that PSAD i < S A D , for Vi. Therefore, if P S A D i is greater than the SADbsf, there is no need to complete the SAD com putation and we can move on to evaluate the next vector. Otherwise, we compute P S A D i+i and perform the test again. As a result the M V* obtained by the partial distance method is obviously the same as that obtained by computing directly the full metric. Thus we call this technique a deterministic testing fast matching (DTFM), 2 During the reviewing process of our paper [89], we became aware of the work by Cheng and Sun [102], which independently proposed using a uniform partition (dithering pattern pixel decimation). It is important to note, however, that neither this or other FM works use a variable number of pixels for matching. 93 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1 fS l] Mi £ 5 ? | ? 9 1 3 . ’1 3 ; ~ 9 J W: a s 7 •3: 7 : i . V 3 | ' m 1 5 1 1 ; JSf 1 1 m auf 2 6 ;2i • £ 6 . . m j * 6 £ i 1 0 1 4 ; 1 0 14 r t o : i.4- = 1 0 1 7 1 4 : 1 8 4 ■ f - \? 4 Z ; ~ 4f. 1 6 1 2 1 6 1 2 1 6 1 2 . 1 6 - 1 2 ' 5 1 5 -u- :S - 9 1 3 1 3 9 1 1 3 1 3 7 3 7 . - : 3 ; ; i& i .- 3 : . 1 5 1 1 ' 1 5 1 1 1 5 1 * “ 1 5 ir 6 * ' - ■ 6 : t ; 6 - * 2 1 6 1 0 14 1 0 1 4 1 0 14 iO ’ : * 1 4 ; 8 ■ 4 - 8 - 14- 8 -4: : J 1 4 : 1 6 1 2 1 6 1 2 16 1 2 1 6 1 2 (a) •Js St? 8H ggi s® 5 ® U J ; r ? S S l p a m a ft: : 2- a® m ? 3 i ' f M 5 s ^ 3 a s m 3 5 1 m iSS' : 4 : i u r 4 .-:-v $£ r'5? ; ' 5 |; ;5U- ■ 6 1 6& 6 -4 3 ' IS: : 7 - - ; 7." 11; ; r \ ; S i u . 8 / 8 9^ > .;- * 7 ;?:p 9 10: io . - 10 1 0 ’ 1 0 10 10 10 5U.I 0 1 , m 1 1;: 1 1 ' 11 11 i 12: 1 2 : ft* : :12 1 2 12 12 1 3 7 1 3 - ‘ 1 3 - 1 3 13£ 1 3 ! 13 1 3 1 4 ; J 4 : 1 4 1 4 1 4 1 4 14 ' &? ■ 1 5 1 5 - 1 5 15; 1 5 is ? 1 5 : 1 6 ; • jk8 It • 1 6 * 16 1 6 ; 1 6 1 6 (b) Figure 4.1: Subset partitionings for 128 pixels subsampled using a quincunx grid into 16 subsets for partial SAD computation. Only highlighted pixels are used to compute SAD. Two types of subsets are used (a) uniform subsampling (UNI) and (b) row-by-row subsampling (ROW). Partial distance tests at the z'-th stage are performed after the metric has been computed on the pixels labeled with i (corresponding to yi). as it deterministically provides the optimal solution. 
Note th at in this work we call “optimal” the motion vector that is obtained with SAD computed from [3 q (i.e., 128 pixels). Therefore a solution based on x* C fiq will tend to be “sub- optimal” since we cannot guarantee that it will produce the same motion vector selection obtained using j3q. The DTFM approach can be summarized as follows A lgorithm 2 (D T F M ) S tep 1: At the beginning of motion search for a particular block, compute the SAD of the first candidate MV, assign it to S A D bsf and set M Vbsf accordingly. S tep 2: Every time a new mv is considered, as dictated by the FS strategy, set SA D to zero. Set i = 0. If there is no next unevaluated vector, M V* = MVbsf. S tep 3: Compute PSADi S tep 4: I f i < b, go to step 5, else let S A D = P S A D b and go to step 6. S tep 5: If P S A D i > SADbsf, we eliminate the current candidate and go to step 2 . Otherwise, let i = i + 1 and repeat step 3. S tep 6: If S A D < SADbsf, SADbsf = S A D and M Vbsf = mv. Go to step 2. 94 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. PSNR Jogradaiion Complexity-distortion lor lufl search Complexity-distortion lor ST I search I i I tz z to a. 02 norm alized d o c k cycle (a) (b) Figure 4.2: Complexity-Distortion of reduced set SAD computation with ROW DTFM (’dotted’) and without DTFM (’solid’) using (a) ES and (b) ST 1 search, averaged over 5 test sequences. Points in each curve from right to left correspond to \,3\ — 256, 128, 64 and 32, respectively. Note that there is a minimal difference between computing the SAD based on 256 and 128 pixels. For this reason in all the remaining experiments in this work we use at most 128 pixels for the SAD computation. The partial distance technique we have just described is well-known and is implemented in many actual software implementations, where ROW subsampling is typically used (e.g. [58],[103]). The complexity savings of this technique come from the possibility of early termination in Step 5. The amount of complexity reduction varies depending on the nature of the sequence. Also, since we can use DTFM with any FS algorithm, the efficiency of the FS algorithm will affect the savings stemming from DTFM. For example, for efficient FS algorithms the tested MVs are likely to have small SAD and their SAD values will tend to be fairly similar. Therefore there is less chance to term inate the matching computation at an early stage, and the benefits of DTFM will be reduced. In general, the complexity reduction contributed by DTFM can be significant, e.g., about 3-5 times speedup in the ES case. In Fig.4.2, complexity-distortion (C-D) results with and without DTFM are compared. The C-D curves are obtained by changing the set /3. One can observe that the relative gain in using DTFM is lower when a fast search algorithm is used (see also Table 1.1). 95 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.2 H y p o th esis T esting Fram ew ork In this section, we propose an algorithm, hypothesis testing fast matching (HTFM), that enables additional complexity savings as compared to DTFM by allowing an early termination of the SAD calculation based on the likelihood th at the SAD will be greater than SADbsf, given the current PSAD i. This complexity reduc tion over DTFM comes with the cost of potentially not finding the best motion vector for some blocks, which leads to an increase in the energy of the motion compensated predicted frame. 
In our formulation, we will use the mean absolute difference (MAD) defined as M AD = S A D /\B \, where |S | is the number of pixels in set B. Similarly, we write the “best-found-so-far” MAD as MADbsf = S A D bsf/\B \ and the partial MAD as P M A D i = PSADi/\xi\. It is worth noting th at SA D is always greater than or equal to P SA D i but M AD can be either greater or smaller than PM AD i. Our proposed approach comes from the observation that the PMAD becomes increasingly correlated with the MAD as the partial metric com putation pro ceeds to more stages, i.e., more and more pixels are used. For example, consider Fig.4.3(a) where scatter plots of MAD vs. P M A D i are shown. It can be seen that there is a high correlation and PM AD i is an increasingly good estim ate of MAD as i grows. The histograms of the difference between MAD and the PMADs are also shown in figure 4.3(b). From these figures we can conclude that the following are good approximations: (i) the partial MAD is a good estimate of the final MAD, i.e., E {M A D \P M A D i} % P M A D i, and (ii) there exists a reliable model for the error, and this model is about the same for any values of PMAD, i.e., VMAD\PMADiix ) ~ P M A D -P M A D i(x ~ PM AD i). In DTFM, we stopped the metric computation as soon as P SA D i was greater than SADbsf ■ In HTFM, given the P M A D i at the i-th stage we want to decide whether the M A D is likely to be larger than MADbsf - If that is the case, we can terminate the m atching distance computation early, with the risk that the actual M A D may turn out to be smaller than MADbsf- We refer to this risk as the probability of false alarm, Pp. More formally, our goal is F o rm u latio n 5 (H T F M ) Given P M A D i at the i-th stage (i = {1, 2,..., b}) and 96 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. -------- tr-| 1 0 0 . ------------ 1 0 0 , --------- - z r . 50 100 0 50 100 0 50 100 0 50 100 100 100 100 100 50 50 50 50 50 100 100 100 100 100 50 50 50 50 50 50 100 0 - a 1 0 0 r 50 100 1 0 0 1 0 0 100 50 50 50 50 50 50 100 0 100 50 (a) 0.1 0.05 0 -4 0 -2 0 0 20 0 .2 1 ------------------------- 0.1 0 0.4 0.2 0 0.5 0.1 0.05 0 -40-20 0 20 021 ----------- • — u . i ................................... AhLA -10 0 10 -10 0 10 -1 0 0 10 0.41 --------------------- 0.1 0.05 0 -4 0 -2 0 0 20 02 ( ------------------------- 0.1 0.05 0 -4 0 -2 0 0 20 0 . 2 r -5 0 -5 .A . 0 2 0 -5 0 5 -5 0.5 K J « • 0 5 -5 iV " / • l\ 0.1 0 0.4 0.2 0 0.5 A -1 0 0 10 A -5 (b) Figure 4.3: (a) Scatter plot between M A D (y-axis) and PM AD i (x-axis) and (b) corresponding histograms of M AD — P M A D i. These are plotted for 16 stages of UNI subsampling, with number of pixels ranging from 8 (top left) to 128 (bottom right). We use UNI subsampling and ES on the “mobile &calendar” sequence. Similar results can be obtained with other sequences, search methods and sub sampling grids. 97 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. MADbsf, we want to decide between the following two hypotheses: H 0 : M A D < M A D bsf Hi : M A D > M A D bsf such that the constraint Pf = Pr(H0\decide Hi) < Pf is satisfied. where Pf is the targeted Pp. If Hi is chosen, we stop the SAD computation. Otherwise, we compute PMADi+i and perform another hypothesis testing at the i + 1-th stage. 
Note that we could formulate the problem using the Neyman-Pearson criterion in which P r (decide Hi\H 0)Pr(H0) (probability of false alarm) is constrained and Pr(decide H 0 \Hi)Pr(Hi) (probability of missed detection) is minimized. How ever, the resulting decision rule turns out to be the same. Also, we treat MADbsf and P M AD i as constants rather than assuming some prior distribution for them. Thus, we only need to consider and model the conditional probability of MAD given P M A D i- P f can then be rewritten as (see also Fig. 4.4), r M ADb3f rMADbsf —PMADi P f = P M A D \ P M A D i { x ) d x = / P M A D - P M A D i i x ) d x J — OG J — C O Given this probability, we can find a threshold parameter, T hi} such that Pp is less than some threshold Pf, r-Thi / P M A D - P M A D i ( x ) d x = Pf (4.2) J — o o so that the decision rule becomes (see Fig.4.4) Hi P M A D i- MADbsf ^ Thi (4.3) Ho Now we can replace the PSAD testing at step 5 of Algorithm 2 (DTFM) by (4.3). As illustrated in figure 4.4, Pr(Po|decide Hi) is the area under the P(MAD\PMADi)(x ) function to the left of MADbsf- In general, we could select 98 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure 4.4: Empirical pdf of M A D — P M A D i (estimation error) obtained from histogram of training data (solid line) and the corresponding param etric model (dashed line). HTFM terminates the matching metric computation at stage i if P M A D i ~ M A D bsf > Thi. different Pf thresholds for each stage in the hypothesis testing. However, for simplicity, we fix Pf to a constant at all stages in our experiments. From the histogram in Fig.4.3(b) we can model the difference between MAD and PM A D i as a random variable with Laplacian distribution, i.e., PiMAD_PMAD)(x) Ai.e-Ai|x| where Az - is the Laplacian param eter for stage z. We found th at this model is accurate for many test sequences and FS methods. With a Laplacian model, the threshold (4.2) can be written as a function of A * and Pf as follows Note that the Thi of each stage is different because the model (and therefore A t) is different even if the same Pf is used for each stage. So far, we have formalized a hypothesis testing framework based on likelihood testing of PMAD. However, there is a situation where it is useful to combine HTFM and DTFM . In some cases PSA D i is already larger than SADbsf but our probabilistic estim ate indicates that P M A D i is not large enough to be eliminated (for example if Pf is very small), i.e., PSAD i\X\/\xi\ — SADbsf < Thi\X\ but PSAD i > SADbsf - This would result in computing successive refinements of ~-ln(2Pf) P, < 0.5 Zn2(l — Pf) Pf > 0.5 (4.4) 99 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. PMAD. This situation happens more in the last few stages and when S A D bsj has a small value, i.e., when we are close to finding the best point in the search region . Therefore, in order to take full advantage of all information available at a certain stage, our HTFM also incorporates the partial SAD test in conjunction with the likelihood test at each stage. The HTFM, thus, can be summarized as A lg o rith m 3 (H T F M ) Same as Algorithm 2 except that S te p 5 is replaced by: S tep 5 If P S A D i > S A D bsf or P M A D i > M A D bsf + Thi, we eliminate the current candidate and go to step 2. Otherwise, let i = i + 1 and repeat step 3. 
The proposed HTFM technique reduces matching metric computation cost, but introduces an overhead due to the hypothesis test at each stage (one more comparison and two additions). While this additional complexity is outweighed in general by the gain from early termination, it is also possible to optimize the testing. Thus some tests could be pruned with the goal of minimal overall complexity for a specific data with known statistics, i.e., the probability mass distribution of being terminated at certain stage as in [ 88]. However, we have found that the optim ization does not give significant speedup, therefore, we do not discuss nor elaborate this issue in this thesis. 4.3 P aram eter E stim ation In Section 4.2, we assumed that the conditional p.d.f. of MAD given P M A D i at stage i is known, so that the appropriate thresholds can be derived from this distribution. In practice, however, these statistics are not known a priori for a particular sequence. A possible solution would be to select in advance these probability distributions, through training over a set of video sequences. However, we have observed th at the statistics of different sequences can differ significantly, depending on the frequency content of each frame, motion of objects in a frame and the FS techniques used. For example, a frame consisting of only low spatial frequency components tends to have less MAD estimation error than a frame with high frequency components. A frame with many moving objects causing 100 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. uncorrelated motion vector also gives higher MAD estimation error. Moreover, with initialization-refinement approaches to FS (e.g. [4]), the MAD estim ation error is smaller than for ES because the statistics based on a set of candidate vectors that are already expected to be good (i.e., their SAD will be close to the optimal one). For these reasons, we focus on approaches that estim ate the probability models on line, with updates taking place every few frames in a video sequence, under the assumption that the statistics do not change rapidly over a small group of frames. We model the conditional density to be a Laplacian distribution with pa rameter AThus we will only need to estim ate the Xl,s in order to update the HTFM thresholds. Furthermore, A is related to the variances, of = E { (M A D — PM A D i)2}, and the first order absolute moments, /^- = E {\M A D — PM AD{\} Therefore, obtaining any one of these three param eters is equivalent. Obviously, obtaining exact statistics would require that we compute the full MAD for each block, so that no fast matching speed up would be possible for the training frame. We now propose two training techniques for fast approximation of the threshold parameters for both ROW and UNI subsampling. In both techniques, our goal is to maintain the speed up due to DTFM while estimating the statistics reliably. Assume that when using DTFM for one block, the SAD calculation is term inated at stage t. In this case we have no information about P M A D i for % > t , but, given that we terminated early, the corresponding P M A D t can be thought to be a good approximation of the overall true M A D . Thus, our estimate of for a 2 can be computed by assuming that for each block M A D ~ P M A D t, so th at our estimated error variance is a? 
= Et\i{{PMADt - P M A D i ) 2} for t > i ~ 2Et\i{\PM ADt — PM ADi\}2 with the Laplacian assum ption(4.6) by (4.5) 4.3.1 M od el E stim ation for R O W 101 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The update algorithm then becomes A lgorithm 4 (V ariance estim ation for R O W ) Step 1 '.For a selected training frame, for every tested MV, perform DTFM in SAD calculation but also save \PM A D t — PM AD i\ for i < t where t is the stage of DTFM termination. Step 2:Compute o f from 2Et\i{\PMADt — P M A D i\}2. Step 3-.Compute A ; = \j2 fd f and update thresholds for HTFM using this new estimate set of variances and (4 -4 )- Thus for the training frames, we collect PMAD data only before the DTFM termination. The result of variances obtained this way is shown in Fig.4.5 averaged over 150 frames for each of the test sequences (ES is used in the experiment). Two main observations can be made from Fig.4.5. First, it can be seen that of is obviously lower than the of, thus resulting in smaller value of Thi for a given targeted Pj which means that P r(SA D < S'AD&S /|decide Hi) is larger. Therefore, in order to obtain the same probability of error, Pf must be set to a smaller value. Second, we can further reduce the complexity of the data collec tion/manipulation and using linear interpolation to find of for i ^ (1, 2,4, 8} by only computing o f, o f, o f, and of. 4.3.2 M odel E stim a tio n for U N I In general, we can still use Algorithm 4 to approximate the model param eters for UNI subsampling. However, a better estimate of of the true variance erf can be obtained in this case. Let us consider the pixel at position k = (kx, kX J ) and denote d(k) the pixel difference at that position, d(k) = |It(k) - I t~i(k -b mv)\ We can write P M AD i as P M A D i = £ d(k)/\xi\ (4.7) kSxi 102 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Variance of M AO estimate error, 700 6 C 0 mobile football Rower cheer bicycle 500 1 200 stage#) (a) New Variance of MAO estimate error, a2 200 160 160 a. 140 £ 120 100 20 stage(i) (b) Figure 4.5: (a)of and (b) of of ROW computed from the definition (mean square) (’solid’) and computed from the first order moment (’dashed’). The left-most points in (a) are E {{P M A D V — P M A D 2)2} and 2E {\P M AD i — P M A D 2 1 }2- 103 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Consider the first two stages, PM AD i and P M A D i. Because the pixels are uniformly spaced we can assume that the pixel difference, d(k), is an i.i.d. random sequence with average MAD and variance crj. Hence, P M A D i and P M A D i can be viewed as time averages. Therefore, we have E { ( P M A D i- P M A D i) 2} = a 2d {\x2\ - \xi\)/\x 2 \\xi\ = oH \ x 7 . \ - \ x i \)/\ x 2\ (4.8) = < 4(M ~ k iD /k il where af for i = 1,2 are as defined previously. Therefore, using (4.8) for the first two test stages (|x2| = 16 and [xi| = 8), we can approximate af as 2E {(P M A D i — P M A D i)2}. Besides, P M A D i —P M A D 2 can also be modeled to have Laplacian distribution. Hence its variance can be obtained without a square operation from the first order moment (expected absolute value), i.e., we can approximate af by af ~ 4E{\PM AD i - P M A D 2 1 }2 = af. Our experiments (see Fig. 4.6(a)) show that this is a fairly accurate approximation. 
To approximate the variances at other stages, a2, we observe that (see Fig.4.6(b)) the ratios between the first stage variance af and other a 2 are almost constant regardless of the sequence and FS scheme used (see Fig. 4.6(b)). A possible explanation is based on the i.i.d. assumption of the pixel residue. From (4.7), it can be derived that the ratio of variances of P M A D i and P M A D j does not depend on af but a function of only i and j. Since we can approximate pixels under UNI as i.i.d., it makes sense to see consistent ratios among all test sequences. As a result, in our approach we only estimate af and obtain the remaining a f by applying the scaling factors shown in Fig. 4.6(b). We also note (see Fig. 4.6(a)) that the variances can be estimated fairly accurately from the first order moment, Pi- Therefore, the algorithm to estimate model parameter for UNI can be sum marized as A lgorith m 5 (Variance estim ation for U N I) 104 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Variance of MAO estimate error 300 mobile football 250 flower cheer Q - 2 0 0 bicycle 1 5 0 1 1 0 0 stage(i) (a) Normalized variance of MAO estimate error 2.5 ifootbail :cheer 0.5 stage<i) (b) Figure 4.6: (a) of (’solid’) and 2fi2 (’dashed’) of MAD estimate error at 15 stages using ES and UNI, respectively. The left-most points shows E {(P M A D i ~ P M A D 2)2} and 2 E {\P M A D X — P M A D 2 1 } 2 for each sequence, (b) Ratio of of/djf for each sequence. Note that this ratio is nearly the same for all sequences considered. 105 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. S tep 1 : For a selected training frame, for every tested MV, always compute P M AD i and P M A D 2 and save \P M A D X - P M A D 2\. (DTFM or HTFM tests can be used for the following stages.) S tep 2: Compute = AE{\PM AD X — P M A D 2 1 } 2. Compute other af by dividing 2< 5f with ratios in Fig. 4-6 (b). S tep 3: Compute A 2 - = \J'2 / of and update thresholds for HTFM using this new estimate set of variances and (4 -4 )- This variance approximation technique is fast because the only data we have to collect is P M AD i —P M A D 2 and we can apply DTFM test (when no previous statistics are available) or HTFM test (using previous HTFM test parameters) at stage 2, 3 and so on, i.e., we still gain some computation saving while perform in g the training. Once again, the limitation of Algorithm 5 is that the i.i.d. assump tion is only reasonable when using UNI. For ROW, it is obvious that we cannot apply the UNI param eter estimation technique. Finally, to dem onstrate the effectiveness of our online training scheme we show in Fig. 4.7 a comparison between the a obtained by online training and the actual error variances for the corresponding frames. Our results show that these methods provide a good approximation without a significant impact in the complexity of the training frames. In our experimental results these training techniques will be used as appropriate and the corresponding training overhead will be included in the measured com putation cost. 4.4 E xp erim en tal R esu lts We discuss the conditions of our experiments first. We use 5 test sequences, namely, “Mobile&Calendar” , “Football” , “Flower” , “Cheer” and “Bicycle” . All of them are 150 frames of size 360 by 240. We encode them with an MPEG2 coder based on the “mpeg2encode” source code of [103]. 
We use frame prediction mode only (in MPEG2 there are other modes of motion compensation such as field and dual-prime prediction). The range of motion search is -15 to 15. We set the target rate at 5 Mbps. All the results are generated on a PentiumTT 300 MHz processor. 106 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. reaJ model parameter vs. approximation 6 C alue S 3 0 (lower football bicyde cheer 200 300 500 600 too 400 frame number 700 Figure 4.7: Example of tracking of statistics c r * under UNI subsampling. Note that the approximated values track well the actual ones, even though the parameters do change over time. We use several different sequences to provide the compar ison. This serves as motivation for using online training, rather than relying on precomputed statistics. In addition to ES for the best motion vector, we also use 2-D log search [68] and ST1 algorithm [4] as fast search techniques. We use each one of these three FS algorithms at integer pel motion accuracy. In order to obtain half-pel motion accuracy, we perform a one step search over a small 3x3 window (on half-pel grid) around the best integer motion vector from the FS algorithms. We compare results between these three techniques with and without our HTFM using either UNI or ROW. This allows us to assess how our FM algorithm performs when a more efficient FS is also used. 4.4.1 V C A -F M versu s V C A -F S We compare the proposed VC A under FM approaches with that of the FS coun terpart based on the idea in [86]. Unlike [86], we use ST1 algorithm [4] which also belongs to initialize-refine approach as the baseline algorithm. Similar to [86], we trade off computational complexity and distortion during the refinement 107 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. steps. The goal is to minimize the distortion under a constraint on the compu tational complexity. In the refinement process, the complexity constraint limits how far the local search goes. The Lagrange multiplier approach is used to get an unconstrained minimization for both parts. C orrpionty-O atoroon 'or ‘moote* 100 fram es CompJAMty-Oaiorbon for 'mobto* 100 frames 3 S 3 I, V 1 cr 9 0 * 1 0 * SAD* (a) (b) Figure 4.8: Complexity-Distortion curve for first 100 frames of “mobile” sequence (a) with MSE of the reconstructed sequence and (b) with residue energy as dis tortion measures and search without test (’o’), partial distance (SAD* test) (’*’) and combined hypothesis-SAD* test (V ) each curve at fixed Pj labeled on each curve and varying A from 0.01-4. At a particular refinement iteration z, before starting the search for the minimal SAD within a 3x3 window centered around the current best result, we test if the Lagrangian cost condition Ji— i — di— i - + - A t{— i > Ji — di 4- A ti (4.9) is met, where A is the Lagrange multiplier, and di and U are the minimal distortion (SAD) and total complexity at the z-th iteration, respectively. If this condition is satisfied, the search continues, otherwise we terminate the search and return the vector with the minimal SAD so far. The variable complexity comes from early term inations of the refinement iteration before the original stop condition is met, i.e., before having a minimal SAD point in the center of a 3x3 window. A is the factor th at indicates how early the termination occurs and can be adjusted to 108 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. make the total complexity meet a certain budget. Intuitively, the iteration termi nates if the reduction in distortion is outweighed by the monotonically increasing complexity, this will prevent further refining of a motion vector if the current residue energy is already small relative to the additional complexity required in the search. Experimental results in terms of number of operations for different pairs of (Pf, A ) are shown in figure 4.8. We obtain the model parameter for HTFM “off line”, i.e., without using Algorithm 5 or 4. For each curve, the Pf is fixed while the value of A changes to obtain the C-D tradeoffs. It can be seen th a t using A allow scalability to S T l algorithm with DTFM. However, by considering two parameters A and Pj together, the best performance in terms of C-D tradeoff is to fix A to zero and changing only Pf. This implies th a t the VCA-FS yields worse complexity-distortion tradeoff than the VCA-FM. Therefore, in the next section, we show only the result of our VCA-FM HTFM. 4.4.2 U N I versus R O W In Figure 4.9 (a), we show the complexity-distortion of 5 test sequences using ES motion estimation with the HTFM. The units of complexity is the actual CPU clock cycles spent in motion estimation normalized by the same quantiy using ROW DTFM . The distortion is in terms of PSNR degradation from the DTFM (both ROW and UNI result in the same quality). If UNI is applied to DTFM, the resulting complexity is poorer, as can be seen by isolated points on 0 dB line. However, with HTFM, the UNI complexity is less than ROW by about 15% as the pixel difference saving overcomes the data access complexity and the likelihood of making wrong decision is smaller due to the reduced estimation error variance. Also shown in Figure 4.9 (b) the same result as (a) but the number of pixel- difference operations is shown instead of the CPU clock cycle, and the average energy of residue pixel instead of the reconstructed PSNR degradation. It can be seen that the savings measured in terms of number of pixel-difference are always greater by about 5% because this measure does not take the complexity of search strategy, the test (DTFM or HTFM test), and param eter estimation into account. 109 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. CfrntatfynJbiMon lor U 8 6 * 0 1 | I S e - j I | t 'i » ■ 3 6 S t 0 2 09 04 O S 07 08 t (a) (b) Figure 4.9: Complexity-distortion of HTFM with ES and variance estimation on-the-fly, ROW (’solid’) and UNI (’dashed’), (a) PSNR degradation vs. clock cycle and (b) residue energy per pixel vs. number of pixel-difference operations. Both clock cycle and number of pixel-difference operations are normalized by the result of ES with ROW DTFM . It can be seen that UNI HTFM performs better than ROW HTFM. The transform coding mitigates the effect of the increase of residue energy in the reconstructed frames. The testing overhead reduces the complexity reduction by about 5%. The complexity reduction is upto 65% at 0.05 dB degradation. Therefore, it can be seen th at the DTFM with UNI yields lower num ber of pixel- difference operations than ROW, but the actual CPU clock cycle is higher. It can also be seen that the decrease in reconsructed frame quality is less than the increase in residue energy as a result of the effect of bit allocation of remaining bits from motion vector coding to transform coding. 
4 .4 .3 Scalability In Fig.4.9 and 4.10, we show the computational scalability of the HTFM . In order to obtain a complexity-distortion curve we plot the complexity and distortion pair at different P j values (ranging from 0.05 to 0.30 for UNI, and 0.01 to 0.20 for ROW.) For ROW we have to select lower Pf due to the underestimation of of by of. Therefore, the real probability of error is larger than the targeted P f. The statistics are updated using the proposed methods every GOP of size 15. Both 110 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. i>Klily a*u>aO«tun (dll| CMCkifir-dsfirtonte2-OLoQMaftfl Caatfniiyw m ic n or STl uarcft (a) (b) Figure 4.10: Complexity-distortion of UNI HTFM with variance estimation on- the-fly of (a) ‘ 2-D Log search and (b) ST1 search. The axes are clock cycle and PSNR degradation normalized/compared to the 2-D Log search (a) or ST1 search (b) with ROW DTFM. The complexity reduction is upto 45% and 25% at 0.05 dB degradation for 2-D Log and ST1 search, respectively. complexity and distortion are normalized by the complexity and distortion values of ROW DTFM. Given that, as mentioned earlier, the UNI performs better than ROW, in Fig 4.10(a) and 4.10(b), we only show complexity-distortion using UNI HTFM for 2-D Log search and ST l search, respectively. We can see that even though the results are not as good as for the ES case, we still gain complexity reduction with these FS algorithms. For example, we achieve a 45% complexity reduction with around 0.05 dB loss for 2-D Log search and a 25% complexity reduction for ST l search. In terms of subjective quality, we observe no perceptual difference between DTFM and the HTFM at this level of degradation. In all experiments, one can see that the complexity-distortion performance of HTFM on the “Football” sequence is the worst because the high motion content results in high MAD estimation error variance. Therefore, ideally the HTFM works the best for sequence with low MAD estimation error variance. In other words, the residue has to be relatively smooth, which implies smooth moving objects with sm ooth background content. I l l Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.4.4 T em poral V ariation Speeding 14 1 gain in full search O HTFM . P f 0.1 DTFM 150 fram e n u m b er Figure 4.11: Frame-by-frame speedup factor for ES using ROW and DTFM (’A 7 ), and ROW HTFM (7 o7 ) with Pf = 0.1 and 0.01 dB degradation. Figs. 4.11, 4.12 and 4.13 show frame by frame speedup in clock cycle average for “Mobile7 7 and “Football7 7 using ES, 2-D Log search and S T l search, respectively. Speedup is computed by comparing with the original FS algorithm (2-D Log or STl search) without FM. The Pf for HTFM is set to 0.1 or 10% for ES, 20% and 30% for 2-D Log and ST l search, respectively. One can see that with fast model param eter estimation which takes place in the first P-frame of a GOP (size 15), we still perform almost as well as DTFM. By comparing Figs. 4.11 and 4.13, we can see th at with an efficient FS algorithm, the extra speedup from DTFM is smaller, and thus speedup from HTFM algorithm is more difficult to get. Otherwise, the Pf can be set to larger value for greater speedup but that comes with the price of relatively larger distortion increase. It can also be seen from Fig. 4.13 th at in some frames, the DTFM results in slower motion estim ation (speedup less than 1). 
This is because the candidates being evaluated by S T l are all almost equally good, thus resulting in almost no early term inations. This case is not observed by HTFM because of the probabilistic testing. In such case, the Pf can be set to as high as 40-50% without much PSNR degradation since any of the candidates evaluated by the FS are equally good. 112 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. S peeding 14) gain in 2 - 0 log search O HTFM . P f 0 .2 A D TFM 1 ' l 1 * 1* 1 1 O r t . 150 fra m e n u m b er Figure 4.12: Frame-by-frame speedup factor for 2-D Log search using ROW and no FM (’*’), DTFM “ (’A 7 ), and ROW HTFM (’o’) with Pf = 0.2 and 0.04 dB degradation. 4.4.5 O verall P erform an ce Table 4.1 and 4.2 show the results in terms of total encoding time (seconds) needed to encode 150 frames of 5 test sequences using i) \B\ = 128 pixels without FM, ii) \B\ = 128 pixels with ROW DTFM iii) |B| = 64 pixels w ith ROW DTFM, iv) |B[ = 32 pixels with ROW DTFM , v) UNI HTFM Pf = 20%, vi) UNI HTFM Pf = 30% , and vii) UNI HTFM Pf = 40% (all HTFM results are based on 128 pixels). Once again, we can see that the relative speedup due to FM is much more significant when using ES rather than a FS. Furthermore, we can see that in general our HTFM with Pf = 30% provides speedup as good as or better than using 64 pixels with ROW DTFM but with less distortion. 4.5 H T F M for VQ Hypothesis testing fast matching can also be applied to VQ nearest neighborhood search. In VQ, the encoder must search for a codeword or code vector with distance nearest to an input vector. The matching metric normally used is the Euclidean distance between the input and the codeword, defined as d(x, y) = llx — yil2 = — Um)2, where x is the input vector of dimension k and 113 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Speeding up gain in ST l search O K rFM .P f 0.3 OTFM fra m e n u n b e r Figure 4.13: Frame-by-frame speedup factor for STl algorithm using ROW and no FM (’*!), DTFM (’A ’), and ROW HTFM (’o’) with Pf = 0.3 and 0.12 dB degradation. y is a codeword in the codebook C = {yi, y 2, y n} of size n. Therefore, the quantization rule is Q(x) = y { if ||x — y t -||2 < ||x - yy||2, Vj ± i. The search time is linearly increasing with the dimension of the code vector and the size of the codebook. As in the ME case, we first change the unit of the distance to be a per- dimension distance. In this case we define d(x,y) = d (x ,y )/k and dk>(x:y) = dk-(x, y)/k'. Our first goal is to estimate d(x.,y) from the dk'(x ,y ). Then we find the estimation error pdf and model it such that we can design a decision rule based on hypothesis testing to meet the targeted probability of false alarm. As in the ME case we found that E{d\dk'} can be well approximated by dk> For simplicity, we can also approximate the estimated error pdf using a Laplacian distribution, as in the ME case, and design the decision rule based on this assumption. The Laplacian param eter in this VQ case is obtained from the training vectors. The complexity-distortion result using HTFM is shown in Figure 4.14, which shows the result of DTFM at different code sizes, as well as HTFM curves corresponding to one codebook size with Pf ranging from 0.05 to 0.55. Figure 4.14(a) is for an i.i.d. Gaussian source with unit variance and Figure 4.14(b) is the high-high band from subband decomposition of “lenna” image. 
In both cases, the codebook 114 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.1: Total time for encoding 150 frames and PSNR. sequence FM ST l Log ES sec. PSNR sec. PSNR sec. PSNR mobile 128 pels 37.78 32.4619 43.79 32.1536 472.75 32.3670 DTFM (128 pels) 35.42 32.4619 39.33 32.1536 152.37 32.3670 DTFM (64 pels) 33.63 32.1194 35.54 31.3711 98.25 31.9531 DTFM (32 pels) 32.58 31.9322 33.59 30.9696 71.05 31.6990 HTFM 20% 34.44 32.4518 35.14 32.1612 84.59 32.36 HTFM 30% 33.78 32.4218 34.50 32.1436 80.39 32.3385 HTFM 40% 33.52 32.3548 34.11 32.0778 77.88 32.2721 football 128 pels 39.05 40.4285 44.72 40.0528 449.64 40.5188 DTFM (128 pels) 37.75 40.4285 42.19 40.0528 231.80 40.5188 DTFM (64 pels) 34.40 40.2436 36.34 39.7743 129.63 40.3235 DTFM (32 pels) 32.69 40.1621 33.62 39.6797 88.31 40.2269 HTFM 20% 36.39 40.3264 36.70 39.8961 102.43 40.4645 HTFM 30% 35.16 40.2576 35.64 39.8524 93.38 40.4341 HTFM 40% 34.25 40.1938 34.92 39.7897 87.94 40.3937 flower 128 pels 37.75 35.3020 44.67 34.6225 466.02 35.3124 DTFM (128 pels) 35.75 35.3020 40.69 34.6225 148.64 35.3124 DTFM (64 pels) 33.19 35.0397 35.71 33.7696 94.88 35.0197 DTFM (32 pels) 32.37 34.9337 33.41 33.2857 68.80 34.9056 HTFM 20% 34.67 35.2834 35.81 34.6291 92.80 35.2976 HTFM 30% 34.00 35.2499 35.24 34.6072 87.78 35.2548 HTFM 40% 33.50 35.1884 34.66 34.5820 82.23 35.1661 is designed using the LBG [104] algorithm from training vectors which are i.i.d. Gaussian and typical images, respectively. We can see th at unlike the ME case, in which the equivalent codebook size is fixed by the search region, in this VQ case the size of the codebook can be chosen to meet a complexity-distortion requirement. Note, however, th at in order to operate in a computation scalable mode, in the DTFM case the codebook size has to be modified, while scalability is achieved with a constant codebook size for the HTFM case. In Figure 4.14, complexity-distortion performance achieved by HTFM is approximately the same as that achieved with DTFM using different codebook size within a certain range. This is due to several factors. First, the Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.2: cont. Total time for encoding 150 frames and PSNR. sequence FM S T l Log ES sec. PSNR sec. PSNR sec. PSNR cheer 128 pels 42.61 35.0416 47.30 35.27 464.34 35.01 DTFM (128 pels) 40.67 35.0416 43.34 35.00 175.66 35.01 D TFM (64 pels) 37.96 34.9403 39.26 34.8811 107.44 34.9068 D TFM (32 pels) 36.77 34.9051 37.37 34.8461 77.58 34.8624 HTFM 20% 39.21 35.01 39.30 34.9610 92.25 35.00 HTFM 30% 38.31 34.9662 38.60 34.9319 86.29 34.9879 HTFM 40% 37.66 34.9219 37.85 34.9011 83.32 34.9712 bicycle 128 pels 42.28 35.1687 47.58 34.4392 461.28 35.1983 DTFM (128 pels) 40.47 35.1687 45.04 34.4392 224.13 35.1983 DTFM (64 pels) 36.78 34.8938 38.99 34.0651 128.43 34.8431 DTFM (32 pels) 35.06 34.7540 36.16 33.9563 87.87 34.6779 HTFM 20% 37.66 34.9628 39.56 34.3932 102.76 35.1574 HTFM 30% 36.89 34.8585 38.42 34.3518 93.73 35.1083 HTFM 40% 36.18 34.7495 37.53 34.2934 87.95 35.0401 speedup from using DTFM alone is already large, i.e., about 3 to 4 times faster than the original MSE computation. More than 90% of the distance computations are terminated early by DTFM, and most of the terminations occur at early stages. Second, the HTFM introduces more overhead cost for testing while providing more early term ination at first few stages. 
However, the number of additional early term ination is relatively small compared to the overall gain achieved by DTFM. Finally, the vector dimension in VQ case is still far less than in the ME case (16x16 macroblock). Thus, a few extra early terminations at an earlier stage are outweighed by the overhead cost for extra testing. Therefore, in order to get the maximum speedup performance, in our experiment for subband data VQ, Figure 4.14(b), we apply the HTFM test to the first one quarter of dimensions and simply use the DTFM test for the rest. As a conclusion, the HTFM for VQ, even though it does not provide a significant speedup over DTFM, provides computational scalability for a given fixed codebook, while scalability can only be achieved with different codebook sizes with DTFM. 116 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Total d o c k cy d o Geuaean wifft < f * 1.0 vector au* 8 OtttoftJorv-Comptaxfly oi HH band Teona* vector s i s 4x4 * OTFM USE 0 DTFM I i s 3 V S (a) (b) Figure 4.14: Complexity-distortion of HTFM VQ with vector size (a) 8 for i.i.d. source and (b) 16 (4x4) for high-high band of “lenna” image. 4.6 Sum m ary and C onclusions To summarize the HTFM, we propose a fast motion estim ation by using fast matching in variable complexity framework where the number of pixels used in calculation varies depending on the likelihood that the MAD which is estimated by partial MAD will be larger than the “best found-so-far” SAD. The complexity is input-dependent i.e. it varies depending on the nature of sequences, and is adjustable by the degree of probability to make wrong decision. We formalize the problem with the assumption of the knowledge of the distribution of the MAD estim ation error. Then the hypothesis testing is applied to minimize the probability of error. We call this novel algorithm HTFM. We can apply HTFM to the matching criterion computation of any FS algo rithm and also the nearest neighbor search for VQ encoding. However, in the case of ME the complexity reductions as observed from our experiment are less when more efficient FS is used because the set of considered MVs becomes so highly competitive th at it is more difficult to predict which vector would win. Neverthe less, with a fast search algorithm we still achieve significant complexity reduction e.g., 45% with 0.1 dB PSNR degradation of the reconstructed sequences for 2-D Log search whereas we get 65% complexity saving for ES case. Therefore, our 117 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. algorithm allows a complexity-distortion tradeoff in fast matching th at was not achieved by previous work. For example, in [67], attem pts to reduce the complex ity are done with reduced but fixed complexity algorithm, i.e. the complexity is not input-dependent. We also show the result of using reduced subset fast match ing with DTFM. However, our HTFM still performs better. While other authors [86] have studied variable complexity approaches, these were targetting the C-D trade-off in FS, rather than FM as we propose. 118 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 5 Motion Estimation: PDS-based Candidate Elimination In this chapter we propose algorithms that perform both FM and FS approaches for motion estimation simultaneously using the technique of PDS-based candidate elimination. 
Similar to the HTFM in Chapter 4, this approach is based on the par tial distance search (PDS) [79] in which the SAD calculation is broken into several steps, and at each step the partial SAD is updated and the calculation is allowed to stop once it is clear that the SAD will be greater than the best-so-far SAD. However, in the proposed approach, instead of using the PDS algorithm to pro vide early termination of SAD computation of candidates evaluated sequentially, in this work, SAD of several candidates are computed simultaneouly step-by-step and at each step some of them are eliminated from the pool of candidates based on their potential to become the best candidate. The proposed algorithm can be viewed as a path search algorithm where the goal is to find a path with minimal SAD which represents the best candidate. Unlike the HTFM in Chapter 4, vve do not have a model of the relationship between the partial and the full SAD. We use a simpler threshold for the likelihood testing. In addition, we also propose a multiresolution approach in which the spatial dependency is used to help reduce the number of candidates considered in early stages. Only few candidates consid ered as representatives of their neighbors are evaluated. If these representatives show the tendency to be good motion vectors, their neighbors will be evaluated 119 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. next. We propose two variants of the multiresolution approach, i.e., breadth-first and depth-first algorithms, respectively. This chapter is organized as follows. In Section 5.1, we propose the ideal dynamic programming which achieves minimal complexity in terms of arithm etic operations. In Section 5.2, faster, albeit, suboptimal, variations of the multiple step PDS approach are proposed. In Section 5.3, the multiresolution algorithms are presented. The complexity-distortion results of all the proposed algorithms are also shown in this section. The conclusions are drawn in Section 5.4. 5.1 P D S -b ased C andidate E lim in ation The complexity reduction from the PDS when applying to various search strategies can vary widely (e.g. 10-80%) depending on two m ajor factors, namely, (i) how the macroblock B is partitioned, and (ii) what fast search (FS) algorithm is used. In the HTFM approach of Chapter 4, we propose a uniform partition (UNI) as a better method for macroblock partitioning as compared to row-by-row partition (ROW) in that it requires fewer pixel comparisons on the average. Therefore, in this work, we will also use the UNI scheme. In this chapter we also provide some experimental results on H.263 sequences in which we calculate the SAD using up to 256 pixel instead of 128 pixel as in the MPEG2 case in Chapter 4 (Fig. 4.1). The UNI partitioning for H.263 sequences is thus shown in Fig. 4.1. In order to better visualize the effect of macroblock partitioning and associated FS algorithm, the cumulative probability of early term ination in a 16-stage PDS is shown in Figure 5.2. The cumulative probability is defined as Ft{i) = P r{ t < i } where t is the stage in which the termination occurs, and t = 1,2,..., 16. From Figure 5.2, the average number of stages required before the PDS term ination can be easily seen from the area above each curve, P { t} = / ( I — F t (i))di. This average termination stage varies from sequence to sequence. 
Furthermore, it can be seen from the figure that the average term ination stage using UNI is less than 120 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1 1 3 3 1 6 1 1 3 3 1 6 9 5 1 2 7 9 5 1 2 7 4 1 5 2 1 4 4 1 5 2 1 4 1 1 8 1 0 6 1 1 8 1 0 6 1 1 3 3 1 6 1 1 3 3 1 6 9 5 1 2 7 9 5 1 2 7 4 1 5 2 1 4 4 1 5 2 1 4 1 1 8 1 0 6 1 1 8 1 0 6 Figure 5.1: Uniform macroblock partition into 16 subsets, showing only upper-left 8x8 region. Partial distance tests at the z-th stage are performed after the metric has been computed on the pixels labeled with i. that of ROW for exhaustive search, thus resulting in less complexity. In Chapter 4, it was shown th at the relative speedup from PDS when applied to a FS is less than when applied to exhaustive search. This is because the subset of candidate vectors considered in a FS contains only good candidates, i.e., their SADs are already small. Therefore, the SAD termination tends to occur during the last few stages. This fact is well illustrated in Figure 5.2 when a FS is used. Therefore, for a certain FS strategy, the smaller the average of term ination stage, the more efficient the PDS. This raises the issue of whether the order in which the SAD of each candidate is calculated can be optimized. It is obvious th at if a candidate with smaller SAD is computed first, it is certain that the computation of worse candidates (with larger SAD) will be terminated at earlier stage. In fact, if the metric of best candidate is computed first, the order of the remaining candidates to be tested can be disregarded because the order does not affect the stage at which the termination takes place for each of the remaining candidates. However, in practice, the best candidate is obviously not known beforehand. Only a good initial candidate can be guessed. In UBC’ s software implementation of H.263+ encoder [58], the full search is performed by taking this ordering issue into consideration. Starting from a good initial candidate (following the motion vector predictive coding), all the candidates surrounding the initial points are searched in a spirally outward direction. 121 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Cumulative prob. of termination 0 .9 0.8 0 .7 0.6 0.5 0 .3 0.2 4 6 Figure 5.2: Cumulative probability of term ination using 16 stage PDS and ex haustive search with ROW (’solid’) and UNI (’dashed’), and using TM N’s fast motion search with UNI (’dash-dotted’) of 150 frames of five H.263 sequences coded at 56 Kbps. The efficiency of the PDS relatively drops as a FS is used. 5.1.1 Ideal C an d id ate E lim in ation The dynamic programming is proposed as one way to achieve the m inimal number of pixel comparison operations, which would correspond to the case when as if the best candidate is evaluated first. The basic idea of the ideal elimination is to update only a candidate with the smallest value of the partial SAD (PSAD) at a time. Using the same notation as in Chapter 4, the ideal candidate elimination algorithm is summarized as follows. A lgorithm 6 (Ideal Candidate E lim in ation (IC E -P D S)) Step 1 •.Set S A D * to infinity. C o m p u te th e P S A D i o f ev ery c a n d id a te in the search region. S e t the sta g e index, n (i), o f e v e r y ca n d id a te to one, i.e ., n (i) = 1 f o r all i can didates. Step 2 •.Select ca n d id a te, i*, with s m a lle s t va lu e o f P S A D sa tisfyin g i) P S A D < S A D * and ii) n (i* ) < 1 6 . 
I f no such ca n d id a te exists, retu rn w ith the S A D * result. Step 3: C o m p u te P S A D n^-)+ i o f this c a n d id a te . S e t n(i*) = n (i* ) -I- 1. Step 4 : I f n ( i* ) = 16, com pare the P S A D w ith S A D * a n d select th e m in im a l — * * to be the n ex t S A D * a n d the corresponding b e st-so -fa r M V , M V . 122 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Step 4: R ep ea t S te p 2. It can be verified th at with this method, every candidate is terminated as earliest as possible since the PSAD will remain not updated as long as it is not the minimal PSAD at any given time. However, this algorithm requires multiple access to each of the candidates as well as determining the one with minimal PSAD at each step. These two overhead components may result in higher complexity than for the sequential candidate evaluation approach. 5.1.2 R ed u ced S tep s C andidate E lim in ation One way to reduce the amount of overhead mentioned above is to limit the number of times each candidate is considered. This is done by reducing the number of stages at which the PSAD calculation can be stopped. In this chapter, we propose a two-step CE-PDS. In the first step we compute the partial SAD of all candidates up to stage m. Then the candidate with minimal P SA D m is predicted to be the best vector and therefore its SAD is computed first by continuing SAD calculation from stage m . Then other candidates’ SADs are computed using the PDS. Note that we can also apply the PDS technique in the first step. The stage number where termination occurs in the first step (Step 1 of Algorithm 7), n(j), must be kept as the starting point from which the second step SAD calculation takes off. The algorithm is summarized as follows. A lgorithm 7 (T w o-S tep Candidate E lim in ation (2-C E -P D S)) S tep 1: W ith a p rese le cted m value, fin d M V*{Y,Xm) using P D S a n d keep the p a r tia l S A D , S A D ( r n v j , x n^ ) w here n(j) < m is defin ed above f o r th e j - t h ca n d id a te. S tep 2: U pdate SAD {M V*{J^,xrn), B) first, s e t it to S A D b s f a n d u s e PD S to c o m p u te SA D (m vj, B ) V j, m vj / MV'*(r, xm), s ta r tin g f r o m P S A D n(j). This method does not employ an initialization technique like other FS al gorithms. It is based solely on the matching metric. W ith this algorithm, the memory corresponding to the search area has to be accessed only twice. In Ta ble 5.1, the complexity in terms of the number of stages in the SAD calculation (or 123 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. equivalently, number of pixel comparison operation) of the 2-CE-PDS algorithm is shown and compared with the original PDS and the ideal PDS in which the best candidate is found first. Note that all algorithms result in the same rate- distortion performance. The experiment is based on UBC’s TMN version 3.2 [58]. The experimental environment is as follows, 56 Kbps target bit rate, code first 150 frames of each sequence, search distance 5x5 around the initial guess (offset) motion vector, code I and P-frame only, and use full search1. There are 16 stage in computing SAD as in Figure 5.1. The complexity in the unit described above reduces from the original PDS with UNI by the amount of 0.5-5% when m = 1 as compared to the ideal PDS complexity reduction which is about 1-7.5%. 
Table 5.1: Number of PSAD metric computation stages for different PDS variation for 150 frames of H.263 secluences coded at 56 Kbps. H.263 sequences O rig. P D S (R O W ) O rig. P D S (U N I) Ideal P D S (U N I) 2-C E m = 1 -PD S ffl-min miss_am foreman salesman mthr_dotr suzie 3214159 3528755 2355461 2975387 3930649 3015848 3407654 2166800 2725827 3740283 2944313 3151037 2144171 2680578 3585322 2962825 3222447 2154860 2700361 3618637 2955141 3181377 2150489 2690790 3596301 Figure 5.3 shows the complexity of the 2-CE-PDS algorithm with different values of m. The value of m controls the tradeoff between the complexities of the computations before and after the m -th stage. Clearly, if m is small, we have a less accurate guess of the best candidate, whereas if m is large, we get a good guess but spend more on com putation in the first step. However, with the optimal value of m - . ^m im the minimal complexity is just less than 1% over the ideal PDS result for all sequences. The value of mmin for “Miss America” , “Foreman”, “Salesman” , “Mother&Daughter” , “Suzie” are 5, 5, 8, 8 and 4, respectively. The value of m min can be related to the cumulative probability of termination in Fig. 5.2. In sequences where the initial candidate is not good enough, the termination can occur at any stage (see “Suzie”) whereas with good initial candidate the termination tends to be seen earlier (see “Salesman”). W ith the former type L The processor type is Pentium II 450 MHz with Windows NT 4.0. 124 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. oomfdaxity of sequences, the first step PDS significantly helps reducing the complexity by finding a good guess of initial candidate. For the latter type of sequences where the initial “guess” is good enough, in order to get even better initial candidate, the number of stages m needs to be larger. However, the difference between the original PDS (UNI) and the ideal PDS is already small (from 1-10%) because of the good initial candidate provided by the original algorithm, thus using any small value of m yields similar results. Therefore, for simplicity, m = 1 would be a good choice. x to* m m a Figure 5.3: Complexity (number of stages) versus m for (a) “Miss America” and (b) “Suzie” . The top and bottom lines in each figure are the original PDS with UNI and the ideal PDS, respectively. One can further generalize the two-step algorithm to an algorithm having an arbitrary number of step (P-CE-PDS), i.e., we divide the calculation into p steps, for each step we update the best partial SAD first. Let m (i), i = 1,..., p, denote the SAD calculation stage at which step i stops, and m (l) < m {2) < ... < m(p) = b, i.e., :rm(p) = B . The p-step PDS can be written as A lg o rith m 8 (P -S tep C a n d id a te E lim in atio n (P -C E -P D S )) S te p 1: Assume there are p steps to compute SAD. The i-th step stops at stage m (i), i = 1, ,.,p. Set i = 1 S te p 2: Find MV*{T, x m^)) using PDS and keep the partial SAD, SA D (rnvj, x n(j)) where n (j) < m(i) for the j-th candidate. 125 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. S te p 3: Compute S A D (M V * (r ,x m^ ) , x m^+y) first, set it to SADbS f and find M V * (r,arm(i+1)) using P D S by computing SA D (rnV j,xm^+i)) for other j-th candidate starting from P S A D n^ y S te p 4: i = i + 1. I f i < p repeat Step 3. Otherwise, return MV*{T, B ). 
W hen the number of the steps of the P-PDS is equal to number of steps of the PDS, this algorithm is equivalent to the ICE-PDS in Algorithm 6. However, from our preliminary result the additional speedup using P-CE-PDS does not significantly improve 2-CE-PDS results. Also, it can be seen that even with the ideal PDS, the additional speedup is small when the original algorithm already starts with a good initial candidate, as in our experiment. In order to achieve further complexity reduction, we have to resort to a fast search which may not reach a global minimum vector. 5.2 Suboptim al C E -P D S In this section, we further m otivate a suboptimal solution using the result of the CE-PDS algorithms. Unlike other FS algorithms, which have an assumption of monotonically increasing error surface as the distance between a candidate and the local minimum grows, we reduce the number of the candidates tested using their own partial metric information. The key idea would be analogous to a tree growing where only a few branches are kept at a given stage. We propose 2 techniques to limit the number of branches grown, one is computationally nonscalable and the other is computationally scalable. 5.2.1 C om p u tation ally N on scalab le F ast Search We sta rt with an example on 2-CE-PDS. In the Step 2 of Algorithm 7 we up date the SAD of candidates whose P S A D m has been computed, i.e., for those candidates terminated before stage m, the SAD computation is never completed. This makes sense because it is likely that candidates surviving the first step PDS will be good candidates. We can then generalize this idea to p-step which can be summarized as follows using the same notation as previously. 126 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A lgorith m 9 (P -S tep F ast C E -P D S (P -F C E )) Step 1 : Assume there are p steps to compute SAD. The i-th step stops at stage m {i), i = 1, ..,p, and xm(p) = B . Set i = 1, and 71 = T. Step 2: Find M V * (ji, Xm^)) using PDS and keep the partial SAD of the j-th candidate, S A D (m v j,x n(j)), only for those w hose n{j) = m{i). Step 3: Update SA D (M V *('yi,x T n ^))1x m^+i)) first, set it to SA D bsj. Step 4: Let 7i+1 = {m v j\n (j) = m(i)}. Find M V *(ji+ i, x m^+l^) using PDS by updating SAD{mVj, Xm^+i)) V7. Update n(j). Step 5: i = i - 1 - 1. I f i < p repeat Step 3. Otherwise, return MV*("/P, xm^ ) . Table 5.2: Results of 2-FCE complexity reduction with respect to the original PDS. Sequences SNR T Y l-m in Complexity % comp. CPU elk. % elk. difference ( # stages) reduction cycle ratio reduction miss_am 0 dB 3 1030878 67.9% 1853/3277 43.5% foreman -0.01 dB 4 1566098 55.6% 2157/3719 42.0% salesman 0 dB 3 808212 65.7% 1545/2687 42.5% m thr.dotr +0.01dB 3 1049889 64.7% 1807/3108 41.9% suzie +0.02dB 4 1695524 56.9% 2362/3607 34.5% The intrinsic complexity of this algorithm stems from having to test whether a candidate is qualified for the next step PSAD com putation or not. As there are more steps, this cost becomes larger and may outweigh the reduction of pixel comparisons. Therefore, similar to the discussion in the previous section, only 2-step FCE is considered. In Table 5.2, the result of 2-FCE with optimal value of m is shown. It can be seen th at the complexity reduction, in terms of number of stages and the actual execution time, is obviously promising, with near optimal motion search performance. Table 5.3 shows the result of 8-Step FCE with 2 stage interval between steps. 
We can see that the number of stage operation is further reduced but the total CPU clock cycle increases by 1-5% because of more testing operations overhead for valid candidates in each step. Therefore, our conclusion is that using 2 step suffices for significant speedup gain in terms of the execution time. Table 5.4 shows the result of the fast search implemented in TMN H.263 codec [58]. It can be seen that even though our algorithm shows near optimal 127 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 5.3: Results of 8-FCE (2 stage PSAD per 1 step testing). Sequences SNR difference Complexity (no. of stages) % comp, reduction CPU elk. cycle ratio % elk. reduction miss_am foreman salesman m thr.dotr suzie +0.01 dB -0.04 dB 0 dB -O.OldB -O.OldB 811838 1055540 663683 796422 1142971 74.7% 70.1% 71.8% 73.2% 70.9% 1983/3277 2145/3719 1740/2687 1889/3108 2187/3607 39.5% 57.7% 35.2% 39.2% 39.4% Table 5.4: Result of using UBC’ s Fastsearch option. Sequences SNR difference Complexity (no. of stages) %comp. reduction CPU elk. cycle ratio % elk. reduction miss .am foreman salesman m thr.dotr suzie -0.02 -0.11 0 -0.05 -0.04 1139316 1225935 816724 1004437 1267551 64.5% 65.3% 65.3% 66.2% 67.8% 1689/3277 1702/3719 1399/2687 1543/3108 1749/3607 48.5% 54.2% 47.9% 50.4% 51.5% performance and less number of stage computation, in terms of execution time our algorithm is a bit inferior. This is due to extra clock cycles spent on candidates testings before computing the next step PSAD and multiple memory access for the search area. Another disadvantage of this algorithm is th at the complexity is not scalable, i.e., the complexity-distortion operating point is fixed. Therefore, in order to be able to compare the performance with UBC’s fast search at the same execution time or same distortion, we resort to a computationally scalable solution which allows significant speeding up gain with the sacrifice in some quality degradation. 5.2.2 C om p u tation ally Scalable Fast Search To further reduce the complexity, we can limit the number of candidates passing to the next stage either by comparing the PSAD with a threshold or selecting only a limited number of candidates with the least PSAD. In this work, we choose to 128 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. follow the first approach because of simplicity in implementation. If the second approach is to be used, an extra complexity for sorting must be taken into account which will lessen the overall speedup. Consider the 2-step algorithm, after the first step we have an information of the best partial metric. Also, after updating the best partial SAD to obtain SAD(,sf, we can surmise th at the final SAD would have the value around (strictly equal or less) the SADbsf ■ We can then set up a threshold with the value proportional to the SADbsf and the number of stage left in the 2nd step. Specifically, the threshold, T, is defined as T = t - S A D {M V '{T , xm), B ) • m /16 where f is a control parameter to adjust the tradeoff between the complexity and quality. The threshold is proportional to the linearly scaled version of the best-so-far SAD. Unlike Algorithm 7, in this scheme we only test for the PSAD thresholding. We do not test the value of n(j), in order to maintain a reasonable complexity. In fact, for t less than or equal to 1, all the candidates that do not survive the first step PDS will also fail the threshold testing. 
For t greater than 1, the scheme allow more candidates to be considered even if they fail the first stage PSAD. The computationally scalable algorithm can be summarized as follows. A lgorithm 10 (2-Step w ith T hreshold Fast C E -P D S (2T -FC E )) Step 1: Find M V*(r , xm) using PD S and keep the partial SAD, SA D (m vj, x n( j )) MmVj E T where n(j) < m for the j-th candidate. Step 2: Update SA D {M V *(Y, x m), B ) first, set it to SA D ^s/ and find B) using the PDS where 7 = {m vj\SA D (rnvj, x n(j)) < T } where n (j) is the starting stage for candidate j. Note th at this 2-step algorithm can be easily extended to a p-step version. Table 5.5 shows the result of applying Algorithm 10 with m = 1 and t = 1. One can see th a t in most cases, we perform better than the UBC’s fast search algorithm in both SNR and complexity (both execution time and number of operations). Besides, our algorithm is computationally scalable, i.e., the complexity can be made smaller by adjusting t to be smaller. As a result of smaller t, the quality 129 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 5.5: Result of 2-Step with Threshold when m = 1 and t = 1. Sequences SNR difference Complexity (no. of stages) %comp. reduction CPU elk. cycle ratio % elk. reduction miss_am foreman salesman m thr.dotr suzie -0.02 -0.08 -0.03 -0.02 + 0.01 647465 839797 581914 657439 831575 79.9% 76.2% 75.3% 77.9% 78.8% 1505/3277 1655/3719 1488/2687 1524/3108 1621/3607 54.1% 55.5% 44.6% 50.9% 55.1% will be slightly degraded. Examples of the complexity-distortion behavior of the algorithm will be provided in the next section. 5.3 M u ltiresolu tion A lgorithm The proposed CE-PDS algorithm can be viewed as a variance of multiresolution algorithm (see [75], [77], etc.) where the pyramid decomposition and subsampling without anti-aliasing filter are used2. In general, the variable block size multires olution algorithm can be implemented using the PDS technique. In the simplest scheme where no filtering operation is performed prior to subsampling, the coars est resolution motion search is equivalent to searching over a subsampled grid of pixel using P S A D n such th at x n correspond to the subsampled pixels in the lowest resolution. In a finer resolution, the center of the search is obtained from the result of the coarser resolution search and the effective search area in the finer resolution can be reduced. The PSAD of candidates on the grid of finer resolution are then evaluated using a corresponding subset of pixel. The coarsest resolution grids come from the set of pixels in the first set of the UNI partitioning. Each set of pixels can be considered as one polyphase of the data. A finer resolution consists of more polyphases than a coarser resolution. However, one advantage of the original multiresolution approach is the use of correlation between matching metrics of adjacent motion vectors to speedup the 2 As discussed in [77] lack of an anti-aliasing filter is not crucial in pyramid coding. 130 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. search. In our algorithms in previous sections, this fact is not taken into con sideration, thus resulting in redundant computations for candidates in the same neighborhood with similar matching metric values. 
Therefore, in this section, we propose algorithms that in addition to allowing scalable complexity based on our previous m ethod, also take advantage of the candidate vector correlation as in the conventional multiresolution approach. There are several ways to obtain compu tational scalability. For example, the constraint of finding the best candidate in a particular resolution can be loosened in such a way that a “good” but not optimal candidate can be chosen if it can be found much faster than the optimal one. An approach is to allow more than one candidate to be the center of the search in the finer resolution using the partial metric criterion. This can be viewed as a soft-decision on what the initial candidate should be in the next resolution. We use a threshold approach similar to Algorithm 10 where the threshold is computed first from the initial motion vector at the center of the search region in the finest resolution. At the coarsest resolution, we compute the first stage PSAD on the coarsest grid. Candidates with PSAD less than the first threshold are then con sidered for the next step/resolution PSAD, as well as their neighbors in the finer resolution. The order of the search in finer resolutions can be performed by a breadth-first or a depth-first approach. In breadth-first, the update of the partial metrics is per formed for all candidates in the next level first before continuing to the following level. In depth-first, there are only two steps in partial metric computation, i.e., the update of partial metric is from the current resolution to the finest resolution. We propose two algorithms in Algorithm 11 and 12 corresponding to the breadth- first and depth-first approaches, respectively. Prior to describing the algorithm, let us introduce some new notations. Let us assume that there are R levels of resolution. Each resolution is obtained from subsampling the finer resolution by a factor of four. The macroblock is thus partitioned such that x i consists of only pixels in the coarsest resolution and there exist £n(r) for resolution r such that pixels in x n(r) correspond to pixels in resolution r (n(l) = 1). In our experiment for H.263 sequences, there are 3 resolutions and n{ 1) = l,n ( 2) = 4,and n(3) = 16 (refer to Figure 5.1). Let yr be the set of candidates on the subsampled grid at 131 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the r-th resolution. Let ( f > T{rnv) be the set of motion vectors in the neighbor hood of m v in the r-th resolution, i.e., they are m v + [±2'ff_r, ± 2^- r] where R is the level number of the finest resolution. From these notations, the proposed multiresolution (MR) algorithms can be summarized as follows. A lgorithm 11 (Breadth-First M uitiresolution (M R 1-F C E )) Step 1 '.Find, the SAD of the initial motion vector, set it to S A D b sf- Compute T (r ) for r = 1,..., R from T(r) = S A D bsf - l/4 (R_r) • t. Set r = 1. Step 2:Compute SA D {m v,X i) using PDS, V m v G 71. r < — r + 1. Step 3:J7‘ S'A£)(r7w,xn(r_1 )) < T ( r —1), i) use PDS to update to S A D (m v, xn(rj) and ii) compute the neighboring S A D (v ,x n(r)) using PDS VF G < p r(m v). Step 4:r r -l- 1. If r < R, repeat step 3. Otherwise, go to step 5. Step 5 :If SA D (m v, < T {R — 1), i) use PDS to update to SA D (m v, xn^ ) and ii) compute S A D {v,xn^n)) using PDS V v G f>n{mv). Step 6 '.Return M Vb sj as the best motion search result. 
A lgorith m 12 (D epth-First M uitiresolu tion (M R 2-F C E )) Step 1 :The same as Step 1 in MR1-FCE. Step 2:The same as Step 2 in MR1-FCE. Step 3 :If SA D (m v, xn(r_!)) < T (r— 1), i) use PDS to update to S A D {m v,xn^ ) ) and remove m v from the list of the candidate to be processed, and ii) compute the neighboring S A D {v ,x n^ rf) using PDS V v G 4 > r(mv) . Step 4:r r + 1. If r < R, repeat step 3. Otherwise, go to step 5. Step 5 :The same as Step 5 in M Rl-FCE. In both algorithms, the S A D bsf and MVb sf are updated with the PDS at the finest resolution. In MR2-FCE, the number of steps for SAD calculation for each candidate is at most two since the update is always to the finest resolution. The advantage of this approach is less number of data access (only twice) but the computation could be higher for some candidates which have small PSAD at early stages and large PSAD at later stages. In Table 5.6, the result of the proposed multiresolution algorithms are shown for t = 1.0 and compared with the conventional multiresolution algorithm th at use only one candidate as initial point in each resolution. It can be seen th at 132 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 5.6: Results of MRl-FCE and MR2-FCE complexity reduction at t = 0.8 with respect to the original multiresolution algorithm. Sequences M Rl-FCE MR2-FCE SNR difference % elk. increase SNR difference % elk. increase miss-am foreman salesman mthr_dotr suzie +0.31 dB +0.73 dB +0.04 dB +0.29 dB +0.56 dB 29.1% 44.1% 37.9% 25.7% 22.2% +0.32 dB +0.82 dB +0.04 dB +0.28dB +0.56 dB 23.7% 51.8% 27.6% 19.6% 18.9% the SNR is improved with the cost of extra computation. However, our proposed algorithm is also computationally scalable. Comp*exify-Dtstortion a v e ra g e o v e r S test seq u en ces u .u i - 0.01 - 0.02 ® -0 .0 4 * :2T -F C E HTFM -0 .0 6 0 M R l-F C E -0 .0 7 □ M R 2-FCE - 0 08 -0 .0 9 0.4 0.45 0 .5 5 0 .6 0.65 norm alized C PU d o c k c y d e 0.5 0.7 0.75 0.8 Figure 5.4: Complexity-Distortion using various algorithms average over 5 test sequences. The complexity unit is the clock cycles normalized by the original ROW PDS. As shown in Fig. 5.4 and 5.5, the complexity-distortion curves of these two MR algorithms with varying t from 0.8 to 2 is compared with the scalable 2T- FCE and the nonscalable method 2-FCE from Table 5.2. The UBC’s fast search [58], and the complexity scalable hypothesis testing fast matching (HTFM) are 133 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. also shown in the figures. Note that the results of the conventional multiresolu- tion algorithm do not fit in the range displayed in Figure 5.4. The experimental environment is the same as before. The result is averaged over 5 test sequences which are “Miss America” , “foreman” , “salesman”, “mother & daughter”, and “suzie”. One can see that 3 of our proposed algorithms (2T-FCE, MRl-FCE and MR2-FCE) outperform the fast algorithm in TMN3.0 and the HTFM at low quality-low complexity region. Furthermore, M Rl-FCE and MR2-FCE perform almost equally well. In particular, M Rl-FCE performs better at low quality (lower thresholds) while MR2-FCE performs better at high quality area (higher thresh olds). This is reasonable because at high thresholds, there are more candidates passing to the finer resolutions, thus resulting in more overhead if there are many steps in PSAD calculation as in M Rl-FCE. 
Compared to 2T-FCE, the proposed MR algorithms do not significantly achieve further speedup. This is due to the small search distance limit which has been set to ±5 from the initial candidate at the finest resolution, i.e., only ±1 at the coarsest resolution. If the search area is larger, the gain from MR approach will be clearer. The result in terms of num ber of pixel comparisons is shown in Figure 5.5. It can be seen th at since this measure does not take the other complexity such as data access, PSAD testing, candidate selection, boundary checking, etc. into account, which contribute about 20% margin in the clock cycle measure. 5.4 Sum m ary and C onclusions We have presented a novel algorithm (PT-FCE) that takes advantage of the par tial metric information to determine which candidates are worth to be fully eval uated. This algorithm can be viewed as a generalized version of multiresolution approach in the sense that the partial metric is obtained from a subset of pixel in a macroblock that corresponds to a certain resolution, and at each resolution the candidates to be considered are not necessary spatially related to the best can didate in the coarser resolution. Furthermore, the algorithm is computationally scalable, i.e., the complexity can be adjusted to meet the computational constraint with some sacrifice on poorer motion estim ation performance. The result shows 134 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Comptarity-Oistortion average over 5 test sequences 0 .0 1 - 0.02 * :TMN FS -0 .0 3 V :2 -F C E © -0 .0 4 * :2T -F C E •0.05 O :HTFM -0 .0 6 O :M R 1-FC E -0 .0 7 □ :M R 2-FC E -0 .0 9 0.25 S 0 .4 0 .4 5 C normalized no. stage operation Figure 5.5: Complexity-Distortion using various algorithms average over 5 test sequences. The complexity unit is the the number of pixel comparisons normalized by the original ROW PDS. that in terms of complexity-distortion tradeoff this novel algorithm (2T-FCE) out performs the hypothesis testing algorithm (HTFM) with also less sophisticated algorithm. Then, we propose two algorithms (MRl-FCE and MR2-FCE) that also use the spatial information in deciding the candidates to be evaluated. These algorithms possess a similar parent-child relationship as in the pyramid-based multiresolution approach without anti-aliasing filtering. They thus take advantage of spatial correlation of motion vectors to speedup the search. It also performs better than the original multiresolution because it allows soft-decision to be made based on partial metric. Thus, it is computationally scalable depending on the threshold factor that control the complexity-distortion tradeoff- . The result shows that this approach performs little bit better than 2T-FCE. However, when the search area becomes larger, the difference in performance will be more visible. 135 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. References [1] W. Pennebaker and J. Mitchell, JPEG Still Image Data Compression Stan dard. Van Nostrand Reinhold, 1994. [2] J. Mitchell, W. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard. New York: Chapman and Hall, 1997. [3] S. G. 16, “H.263 video coding for low bit rate communication,” tech. rep., ITU-T, 1997. [4] J. Chalidabhongse and C.-C. Kuo, “Fast motion vector estimation us ing multiresolution-spatio-temporal correlations,” IE EE Trans. Cir.Sys. for Video Tech., April 1997. [5] T. M. Cover and J. A. 
Thomas, Elements of Information Theory. Wiley- Interscience, 1991. [6] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding using wavelet transform ,” IE EE Trans, on Image Proc., 1992. [7] Wavelet and Subbana Coding. Prentice Hall P T R, 1995. [8] K. Ramchandran, Z. Xiong, K. Asai, and M. Vetterli, “Adaptive transforms for image coding using spatially varying wavelet packets,” IEEE Trans, on Image Proc., 1996. [9] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of quantizers,” IEEE Trans, on Signal Proc., vol. 36, pp. 1445-1453, Sept. 1988. [10] K. Ramchandran, A. Ortega, and M. Vetterli, “Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders,” IE E E Trans, on Image Proc., 1994. [11] L.-J. Lin and A. Ortega, “Bit-rate control using piecewise approximated rate-distortion characteristics,” IEEE Trans, on Circ. and Sys. for Video Tech., 1998. 136 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [12] C. Chrysafis and A. Ortega, “Efficient context-based entropy coding for lossy wavelet image compression,” in In Proc. of D C C ’ 97. [13] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coeffi cients,” IEEE. Trans, on Signal Processing, 1993. [14] D. Taubman and A. Zakhor, “Multirate 3-D subband coding of video,” IE E E Trans, on Image Proc., 1994. [15] A. Said and W. A. Pearlm an, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans, on Circ. and Sys. fo r Video Tech., 1996. [16] H.-J. Wang and C.-C. Kuo, “A multi-Threshold wavelet coder (MTWC) for high fidelity image compression,” 1997. [17] J. Li and S. Lei, “An embedded still image coder with rate-distortion opti mization,” IEEE Trans, on Image Proc., 1999. [18] B.-B. Chai, J. Vass, and X. Zhuang, “Significant-linked connected compo nent analysis for wavelet image coding,” IEEE Trans, on Image Proc., 1999. [19] M. Crouse and K. Ramchandran, “Joint thresholding and quantizer selection for transform image coding: Entropy-constrained analysis and applications to baseline JPEG ,” IE E E Trans, on Image Proc., 1997. [20] J. Li, J. Li, and C.-C. Kuo, “Layered DCT still image compression,” IEEE Trans, on Circ. and Sys. for Video Tech., 1997. [21] G. J. Sullivan and R. L. Baker, “Efficient quadtree coding of images and video,” in Proc. of IC A S S P ’ 91, 1991. [22] M. J. Gormish, Source Coding with Channel, Distortion and Complexity Constraints. PhD thesis, Stanford University, Mar. 1994. [23] V. Goyal and M. Vetterli, “Computation distortion characteristics of block transform coding,” in Proc. of IC A SSP ’ 97, (Munich, Germany), Apr. 1997. [24] K. Rao and P. Yip, Discrete Cosine Transform, Algorithms, Advantages, Applications. Academic Press, 1990. [25] T. N. N. Ahmed and K. R. Rao, “Discrete cosine transform ,” IEEE Trans, on Computer, 1974. [26] M. Narasimha and A. Peterson, “On the com putation of the discrete cosine transform ,” IEEE Trans, on Comm., 1978. 137 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [27] J. J.-I. Hong, Discrete Fourier, Hartley and Cosine Transforms in Signal Processing. PhD thesis, Columbia University, 1993. [28] C. S. Wen-Hsiung Chen and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans, on Comm., 1977. [29] A. Ligtenberg and M. Vetterli, “A discrete fourier/cosine transform chip,” IE EE J. 
on Selected Areas in Comm., vol. SAC-4, pp. 49-61, Jan 1986. [30] Z. Wang, “Fast algorithms for the discrete w transform and for the discrete fourier transform,” IE E E Trans, on Signal Proc., vol. ASSP-32, pp. 803-816, Aug. 1984. [31] B. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Trans, on Signal Proc., vol. ASSP-32, pp. 1243-1245, December 1984. [32] P. Duhamel and H. H ’Mida, “New 2n DCT algorithms suitable for VLSI implementation,” in Proc. of ICASSP’ 87, (Dallas), p. 1805, Apr 1987. [33] C. Loeffler, A. Ligtenberg, and G. Moschytz, “Practical fast 1-D DCT al gorithms with 11 multiplications,” in In Proc. of IC A SSP ’ 89. [34] E. Feig and E. Linzer, “Discrete cosine transform algorithms for image data compression,” in Proc. of Electronic Imaging’ 90, p. 84, 1990. [35] M. Vetterli, “Tradeoffs in the computation of mono and multi-dimensional DCTs,” tech. rep., Ctr. fo Telecommunications Research, Columbia Univer sity, June 1988. [36] E. Feig and S. Winograd, “On the multiplicative complexity of discrete cosine transform,” IE E E Trans, on Information Theory, 192. [37] H. R. Wu and Z. Man, “Comments on ’ ’fast algorithms and implementation of 2-D discrete cosine transform,” IEEE Trans, on Circ. and Sys. for Video Tech., 1998. [38] N. I. Cho and S. U. Lee, “A fast 4x4 DCT algorithm for the recursive 2-D DCT,” [39] E. Linzer and E. Feig, “New scaled DCT algorithms for fused MULTI PLY/ADD architectures,” in Proc. of IC A SSP ’ 91, vol. 4, pp. 2201-2204, 1991. [40] V. Srinivasan and K. J. R. Liu, “Vlsi design of high-speed time-recursive 2-D DCT/IDCT processor for video applications,” IEEE Trans, on Circ. and Sys. for Video Tech., 1996. 138 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [41] Y. Arai, T. Agui, and M. Nakajima, “A fast DCT-SQ scheme for images,” Trans, of IEICE, vol. 71, p. 1095, Nov 1988. [42] C. W. Kok, “Fast algorithm for computing discrete cosine transform,” IEEE Trans, on Signal Proc., 1997. [43] V. S. Dimitrov, G. A. Jullien, and W. C. Miller, “A new DCT algorithm based on encoding algebraic integers,” in Proc. of IC A S S P ’ 98. [44] N. Merhav and V'. Bhaskaran, “Fast algorithms for DCT-domain image down-sampling and for inverse motion compensation,” IEEE Trans.Circ.Sys. for Video Tech., vol. 7, no. 3, pp. 458-475, June 1997. [45] H. V. Sorensen and C. S. Burrus, “Efficient computation of the dft with only a subset of input or output points,” IE EE Trans, on Signal Proc., 1993. [46] “The independent JP E G ’s group software JPEG , version 6.” f t p : / / f t p . uu. n e t . [47] A. Hossen and U. Heute, “Fast approximation DCT: Basic-idea, error anal ysis, application,” in Proc. of ICASSP ’ 97. [48] S. Jung, S. Mitra, and D. Mukherjee, “Subband DCT: Definition, analysis, and applications,” IE E E Trans, on Circ. and Sys. for Video Tech., 1996. [49] J. Bao, H. Sun, and T. Poon, “Hdtv down-conversion decoder,” IEEE Trans, on Consumer Electronics, 1996. [50] J. Song and B.-L. Yeo, “Fast extraction of spatially reduced image sequences from MPEG-2 compressed video,” IEEE Trans, on Circ. and Sys. for Video Tech., 1999. [51] J. B. Lipsher, “Asynchronous DCT processor core,” M aster’s thesis, Univ. of Cal. Davis. [52] J. YVinograd and S. Nawab, “Incremental refinement of DFT and STFT approximations,” IE E E Signal Proc. Letters, 1995. [53] W. K. Cham and F. S. Wu, “Compatibility of order-8 integer cosine trans forms and the discrete cosine transform,” in Proc. of IE E E Region 10 Conf. 
on Computer and Communication Systems, 1990.
[54] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
[55] K. S. Wang, J. O. Normile, H. Wu, and A. A. Rodriguez, "Vector-quantization-based video codec for software-only playback on personal computers," Multimedia Systems, vol. 2, no. 5, pp. 191-203, 1994.
[56] "MPEG video software decoder, v2.2." http://bmrc.berkeley.edu/projects/mpeg/mpeg_play.html.
[57] "vic: UCB/LBL video conferencing tool." ftp://ftp.ee.lbl.gov/conferencing/vic/.
[58] University of British Columbia, "TMN H.263+ encoder version 3.0," http://www.ee.ubc.ca/image/h263plus/.
[59] K. Froitzheim and H. Wolf, "A knowledge-based approach to JPEG acceleration," in Proc. of IS&T/SPIE Symp. on Electronic Imaging Science and Technology, (San Jose, CA), Feb. 1995.
[60] E. Murata, M. Ikekawa, and I. Kuroda, "Fast 2D IDCT implementation with multimedia instructions for a software MPEG2 decoder," in Proc. of ICASSP'98.
[61] X. Bo and Z. Xulong, "A new algorithm for the implementation of H.263 video coding," IEEE Trans. on Circ. and Sys. for Video Tech., to appear.
[62] K. Lengwehasatit and A. Ortega, "Distortion/decoding time tradeoffs in software DCT-based image coding," in Proc. of ICASSP'97, 1997.
[63] B. Girod and K. W. Stuhlmuller, "A content-dependent fast DCT for low bit-rate video coding," in Proc. of ICIP'98.
[64] I.-M. Pao and M.-T. Sun, "Modeling DCT coefficients for fast video encoding," IEEE Trans. on Circ. and Sys. for Video Tech., 1999.
[65] K. Lengwehasatit and A. Ortega, "DCT computation with minimal average number of operations," in Proc. of VCIP'97, (San Jose, CA), Feb. 1997.
[66] K. Lengwehasatit and A. Ortega, "DCT computation based on variable complexity fast approximations," in Proc. of ICIP'98.
[67] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 3, pp. 148-157, April 1993.
[68] J. Jain and A. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. on Comm., 1981.
[69] R. Srinivasan and K. Rao, "Predictive coding based on efficient motion estimation," IEEE Trans. on Comm., vol. COMM-33, pp. 888-895, August 1985.
[70] R. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 4, pp. 438-442, August 1994.
[71] L.-K. Liu and E. Feig, "A block-based gradient descent search algorithm for block motion estimation in video coding," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 6, pp. 419-422, August 1996.
[72] J. Y. Tham, S. Ranganath, and M. Ranganath, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 8, pp. 369-377, August 1998.
[73] M. Bierling, "Displacement estimation by hierarchical block matching," in Proc. of VCIP'88.
[74] F. Dufaux and M. Kunt, "Multigrid block matching motion estimation with an adaptive local mesh refinement," in Proc. of VCIP'92.
[75] S. Zafar, Y. Q. Zhang, and B. Jabbari, "Multiscale video representation using multiresolution motion compensation and wavelet decomposition," IEEE J. on Sel. Areas in Comm., 1993.
[76] Y. Q. Zhang and S. Zafar, "Motion-compensated wavelet transform coding for color video compression," IEEE Trans. on Circ. and Sys. for Video Tech., 1992.
[77] K. M. Uz, M. Vetterli, and D. J. LeGall, "Interpolative multiresolution coding of advanced television with compatible subchannels," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 1, pp. 86-99, March 1991.
[78] J.-B. Xu, L.-M. Po, and C.-K. Cheung, "Adaptive motion tracking block matching algorithm for video coding," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 9, pp. 1025-1029, October 1999.
[79] C.-D. Bei and R. M. Gray, "An improvement of the minimum distortion encoding algorithm for vector quantization," IEEE Trans. on Comm., 1985.
[80] Y.-C. Lin and S.-C. Tai, "Fast full-search block-matching algorithm for motion-compensated video compression," IEEE Trans. on Comm., 1997.
[81] Z.-L. He, K.-K. Chan, C.-Y. Tsui, and M. L. Liou, "Low power motion estimation design using adaptive pixel truncation," in Proc. of Int'l Symp. on Low Power Electronics and Design, 1997, pp. 167-172.
[82] B. Natarajan, V. Bhaskaran, and K. Konstantinides, "Low-complexity block-based motion estimation via one-bit transforms," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 7, pp. 702-706, August 1997.
[83] X. Lee and Y.-Q. Zhang, "A fast hierarchical motion-compensation scheme for video coding using block feature matching," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 6, pp. 627-634, December 1996.
[84] M. Khosravi and R. Schafer, "Low complexity matching criteria for image/video applications," in Proc. of ICIP'94, pp. 776-780, November 1994.
[85] M.-J. Chen, L.-G. Chen, T.-D. Chiueh, and Y.-P. Lee, "A new block-matching criterion for motion estimation and its implementation," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 5, pp. 231-236, June 1995.
[86] F. Kossentini and Y.-W. Lee, "Computation-constrained fast MPEG-2 encoding," IEEE Signal Proc. Letters, vol. 4, pp. 224-226, August 1997.
[87] F. Chen, J. Villasenor, and D. Park, "A low-complexity rate-distortion model for motion estimation in H.263," in Proc. of ICIP'96, vol. 2, pp. 517-520.
[88] K. Lengwehasatit, A. Ortega, A. Basso, and A. Reibman, "A novel computationally scalable algorithm for motion estimation," in Proc. of VCIP'98.
[89] K. Lengwehasatit and A. Ortega, "Probabilistic partial distance fast matching algorithms for motion estimation," accepted for publication in IEEE Trans. on Circ. and Sys. for Video Tech., 1999.
[90] V. L. Do and K. Y. Yun, "A low-power VLSI architecture for full-search block-matching motion estimation," IEEE Trans. on Circ. and Sys. for Video Tech., 1998.
[91] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd Edition. Boston: Kluwer Academic Publishers, 1997.
[92] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI architecture for video compression - A survey," Proceedings of the IEEE, vol. 83, pp. 220-246, February 1995.
[93] H. Yeo and Y. H. Hu, "A novel modular systolic array architecture for full-search block matching motion estimation," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 5, pp. 407-416, October 1995.
[94] A. K. Jain, Fundamentals of Digital Image Processing. Prentice Hall, 1989.
[95] F. Jelinek and J. B. Anderson, "Instrumentable tree encoding of information sources," IEEE Trans. on Info. Th., 1971.
[96] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, 2nd Ed. Morgan Kaufmann Publishers, 1996.
[97] C. T. Chen, "Adaptive transform coding via quadtree-based variable block-size DCT," in Proc. of ICASSP'89, 1989.
[98] P. A. Chou, T. Lookabaugh, and R. M. Gray, "Optimal pruning with applications to tree-structured source coding and modeling," IEEE Trans. on Info. Th., 1989.
[99] G. Cheung, "Optimal bit allocation strategy for joint source/channel coding of scalable video," Master's thesis, Electrical Engineering and Computer Sciences, U.C. Berkeley, 1998.
[100] P. Fernandez and A. Ortega, "An input dependent algorithm for the inverse discrete wavelet transform," in Proc. of 32nd Asilomar Conf.
[101] K. Lengwehasatit and A. Ortega, "Complexity-distortion tradeoffs in vector matching based on probabilistic partial distance techniques," in Proc. of Data Compression Conference, DCC'99, (Snowbird, UT), Mar. 1999.
[102] F.-H. Cheng and S.-N. Sun, "New fast and efficient two-step search algorithm for block motion estimation," IEEE Trans. on Circ. and Sys. for Video Tech., vol. 9, pp. 977-983, October 1999.
[103] MPEG Software Simulation Group, "MPEG2 video codec version 1.2," http://www.creative.net/~tristan/MPEG/mssg/mpeg2vidcodec_v12.tar.gz.
[104] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. on Comm., 1980.
Asset Metadata
Creator: Lengwehasatit, Krisda (author)
Core Title: Complexity-distortion tradeoffs in image and video compression
School: Graduate School
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: engineering, electronics and electrical; OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Ortega, Antonio (committee chair), [illegible] (committee member), Neumann, Ulrich (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-63696
Unique identifier: UC11328284
Identifier: 3018015.pdf (filename), usctheses-c16-63696 (legacy record id)
Legacy Identifier: 3018015.pdf
Dmrecord: 63696
Document Type: Dissertation
Rights: Lengwehasatit, Krisda
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA