INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA, 800-521-0600

NOTE TO USERS: This reproduction is the best copy available. UMI

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CONTRIBUTION TO TRANSFORM CODING SYSTEM IMPLEMENTATION

Copyright 2000 by Wenqing Jiang

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Electrical Engineering)

May 2000

UMI Number: 3018008. UMI Microform 3018008, Copyright 2001 by Bell & Howell Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. Bell & Howell Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This dissertation, written by ............................., under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies. Date: May 4, 2000. Chairperson.

Acknowledgement

I would like first to express my gratitude to my advisor and friend Professor Antonio Ortega, whose continuous guidance and support kept lifting me up through all these years at USC, making it both a challenging and fun experience. I would also like to thank Professors Alvin M. Despain and C.-C. Jay Kuo for serving on my qualifying committee, and Professors Gerard G. Medioni and Zhen Zhang for serving on both my qualifying and defense committees.
During my years at USC, I had the pleasure to work with a number of friends including Hua, Krisda, Nazeeh, Phoom, Xiao-Miao, Younggap and my officemates Raghav, Sangyong, and Woontack. I would like to thank all of them for the enjoyable collaborations we had (including sharing pizzas every week and competing for the couch in the lab).

On this Thanksgiving holiday, my love goes to my parents and my brother for their years of encouragement which kept me going. I thank my wife for sharing with me all these years, and for her love, support and faith in me.

Contents

Acknowledgement
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation and Overview
    1.1.1 DWT System Design
    1.1.2 Robust Communication
  1.2 Contributions and Outline of the Thesis
2 DWT Architecture Design
  2.1 Introduction
  2.2 Problem Definition
    2.2.1 Sequential Architecture Design
    2.2.2 Parallel Architecture Design
  2.3 Preliminaries
    2.3.1 The Standard Algorithm
    2.3.2 The Lifting Algorithm
    2.3.3 Practical DWT System Design
  2.4 The Overlap-State Technique
    2.4.1 The Finite State Machine Model
    2.4.2 Overlap-State
    2.4.3 Performance Analysis
    2.4.4 Delayed Normalization
  2.5 The Proposed DWT Architectures
    2.5.1 1D Systems
    2.5.2 2D Systems
    2.5.3 Sequential Architectures
    2.5.4 Parallel Architectures
  2.6 Experimental Results
    2.6.1 Delayed Normalization
    2.6.2 Strip Parallel
  2.7 Conclusions
3 Constrained Transform Design For Multiple Description Coding
  3.1 Introduction
  3.2 Problem Definition
  3.3 Proposed Design Approach
    3.3.1 Geometric Interpretations
    3.3.2 Parametric Transform and Factorizations
    3.3.3 Two Stage Transform Design
  3.4 MDTC Design Examples
    3.4.1 Equal Rate Channels
    3.4.2 Sequential Protection Channels
    3.4.3 Simulation Results for Gaussian Sources
  3.5 Karhunen-Loeve Vector Transform
  3.6 Conclusions
4 Multiple Description Coding for Erasure Channels
  4.1 Introduction
  4.2 Erasure Channel Model and Problem Definition
  4.3 MDC Based On Explicit Redundancy
    4.3.1 Base System
    4.3.2 Context-Adaptive Extension
  4.4 Optimal Redundancy Bit Allocation for Gaussian Sources
    4.4.1 Implicit Channel Modeling
    4.4.2 Explicit Channel Modeling
  4.5 Experimental Results
    4.5.1 An Image MDC Example
    4.5.2 A Speech MDC Example
  4.6 Summary and Future Work
Bibliography
A Overlap-Add and Overlap-Save
B DWT FSM Examples
C MDTC Transform Design Algorithm
  C.0.1 Redundancy bit rate ρ
  C.0.2 Side Distortion D_s
D KLVT Vector Space Partition

List of Tables

2.1 The standard algorithm
2.2 Costs comparison for multilevel wavelet decompositions (J > 3)
2.3 The proposed sequential DWT algorithm
2.4 Overlap buffer size B_s in 1D DWT for J-level decompositions using a T-tap wavelet filterbank
2.5 The proposed parallel DWT algorithm
2.6 Comparison of memory requirements, where W is the width of the image scanline and α = 2^J − 1, β = 1 − 2^(−J)
2.7 Comparison of memory requirements, where α = 2^J − 1, β = 1 − 2^(−J)
2.8 DWT CPU time of different sequential algorithms (in seconds)
2.9 The proposed parallel DWT algorithm
2.10 DWT running time of different parallel algorithms (in seconds)
4.1 "Squrriel" reconstruction NMR comparison (dB)
4.2 "Draw" reconstruction NMR comparison (dB)

List of Figures

1.1 A simplified image/video coding system
1.2 A three-level tree decomposition
2.1 An example dataflow chart of a three-level wavelet decomposition. Solid lines: completely transformed data; dashed lines: boundary samples from the neighboring block. Operations 1, 3, 5: communicate boundary data samples to neighboring blocks; operations 2, 4, 6: transform for current level
2.2 A typical sequential DWT system diagram
2.3 (a) 2D mesh processor network; (b) the corresponding data partition
2.4 (a) Bus-connected processor network; (b) the corresponding strip data partition
2.5 Two-channel wavelet filterbank in polyphase representation
2.6 Wavelet transform via lifting. (a) Forward transform. (b) Inverse transform
2.7 Boundary processing for DWT. When the filter (length L) moves to the right boundary of block 1, it needs input data samples from block 2. In a sequential architecture, blocks 1 and 2 do not reside in memory at the same time. In a parallel architecture, blocks 1 and 2 are allocated to different processors
2.8 State transition diagram of DWT as an FSM
2.9 State transitions across a block boundary using e* with a* = 3 and b* = 2. (a) Partial computations near boundaries. (b) After updating, boundary samples stay in intermediate states. The "new boundary" separates fully updated output coefficients from partially computed ones
2.10 Sequential DWT using overlap-state. (a) Input in initial state i. (b) Block 1 consists of samples up to x1(2m). After the state transition, samples near the block boundary are only partially updated (anti-causal filtering results not available). (c) Partially updated samples (state information) are overlapped with the next block of input samples; they are now completely updated by adding their anti-causal filtering results. (d) Completely transformed (updated) input from state i to i + 1
2.11 Parallel DWT using overlap-state. (a) Input in initial state i. (b) Input is partitioned over two processors; blocks 1 and 2 are transformed separately and each has state information near the block boundary. (c) State information is exchanged between the two processors and the partially updated samples are fully updated. (d) Completely transformed (updated) input from state i to i + 1
2.12 An example dataflow chart of a three-level wavelet decomposition using the proposed Overlap-State technique. Solid lines: completely transformed data; dashed lines: partially transformed data. Operation 1: each block transforms its own allocated data independently and state information is buffered; operation 2: state information is communicated to neighboring blocks; operation 3: complete transform for the boundary data samples
2.13 Illustration of delayed normalization. (a) Recursive two-channel lifting. (b) Two-channel lifting with delayed normalization
2.14 A 2D DWT example of delayed normalization
2.15 Joint design of transform and quantization to reduce computation. (a) Independent transform and quantization. (b) Joint transform and quantization; the normalization operation is merged with the quantization
2.16 The proposed sequential DWT architecture
2.17 The proposed parallel DWT architecture. In the Split stage, each processor computes its allocated data independently up to the required decomposition level. In the Merge stage, a one-way communication is initiated to communicate the state information to neighboring processors. A postprocessing operation is then started to complete the transform for boundary samples
2.18 2D DWT/FSM illustration. Shaded areas represent the state information, with R0/C0 the row/column state information at level 0 and so on. The input block is first row transformed (a) and column transformed (b) at level 0, then downsampled (taking the LL0 subband) and row transformed at level 1 (c), and column transformed at level 1 (d)
2.19 Strip sequential DWT system diagram. The input is segmented into data strips which are transformed sequentially from top to bottom
2.20 Block sequential DWT system diagram. The input is segmented into blocks which are transformed from left to right and from top to bottom
2.21 Merge operations in a 2D mesh processor network. (a) Transfer row state information from P(i,j) to P(i,j+1); (b) transfer column state information from P(i,j) to P(i+1,j); (c) transfer newly generated row state information to P(i,j+1); (d) complete transform for boundary samples. Notice the total amount of data in each processor in the final state (d) is different from the original uniform allocation due to the Merge operations
2.22 Merge operations for strip parallel implementation. (a) Transfer row state information from P(i) to P(i+1); (b) complete transforms for boundary samples in each processor
3.1 A typical MDTC system
3.2 Redundancy rate distortion curve of an MDTC system for a Gaussian vector with σ1 = 1, σ2 = 0.5
3.3 Distributions of two independent Gaussian variables with variances σ1 = 1, σ2 = 0.5. (a) Original; (b) after 45° rotation, with correlation coefficient 0.6; (c) after scaling (a = 2); (d) after scaling and 45° rotation, with correlation coefficient 0.94. Clearly one can see that the data in (d) are more correlated than those in (b)
3.4 Lattice structure of an RS transform
3.5 Transform search space
3.6 Comparisons among different transforms
4.1 The proposed MDC system
4.2 Context-based MDC system
4.3 Rate-distortion performance comparison for a Gaussian source N(0, 1): (1) Ozarow: optimal bound. (2) MDSQ: optimal level-constrained results. (3) Proposed: Lloyd-Max quantizer results with fixed-length code. (4) Optimal: results using the rate-distortion function of the Gaussian source
4.4 An example configuration of two different polyphase transforms using a two-level wavelet decomposition on a 16x16 input matrix. In all the subbands, all the 1 samples constitute one polyphase component and the remaining samples constitute the other polyphase component. (a) Plain polyphase transform. (b) Vector polyphase transform
4.5 Experimental results with the Lena gray-level image. (a) Two descriptions. Polyphase (1): plain polyphase transform. Polyphase (2): vector polyphase transform. (b) Performance with independent packet losses
4.6 Reconstructed Lena images at total rate 0.5 bpp using only one channel's information. (a) Original image; (b) redundancy bit rate 0.05 bpp, PSNR 29.64 dB; (c) redundancy bit rate 0.10 bpp, PSNR 31.91 dB; (d) redundancy bit rate 0.15 bpp, PSNR 32.98 dB
4.7 The proposed and the RAT schemes for robust packet speech coding
4.8 Reconstructed NMR distribution plot for the "Draw" sentence under different packet loss probabilities. RAT: o; Proposed: +; JAY: x
A.1 (a) Overlap-save. (b) Overlap-add
B.1 State transitions of the Daubechies (9,7) filters using factorization (B.1). The state information consists of 4 samples (the overlap buffer size) in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x0(9), x0(10)}
B.2 State transitions of the (2,10) filters using factorization (B.1). The state information consists of 4 samples in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x0(10), x0(11)}
B.3 State transitions of the CDF (4,2) filters using factorization (B.4). The state information consists of 3 samples in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x0(8), x0(9)}
Abstract

With the increasing dominance of the transform coding technique in virtually every image and video coding scheme proposed to date, efficient transform coding system implementation has become an important research topic. This thesis addresses two system issues that may arise in practice: (i) efficient architecture designs for the discrete wavelet transform; and (ii) efficient transform coding for robust communication over erasure channels.

The first contribution of the thesis is to develop an overlap-state technique for efficient multilevel wavelet decompositions when memory and delay constraints have to be strictly observed. In this case, the wavelet transform can be computed in a block-by-block fashion, i.e., the input data is segmented into blocks and each block is processed separately, either sequentially or in parallel. The proposed technique enables efficient data exchange between consecutive data blocks such that the required memory buffer size and/or communication overhead can be significantly reduced compared to existing techniques.

The second contribution is to provide two efficient lost-data recovery techniques for robust communication over erasure channels, such as today's Internet. For the problem of multiple description transform coding, we propose a structured correlating transform design making use of the available channel information. The enforced structure enables a significant reduction of the number of design parameters, which results in significantly reduced design and implementation complexities compared to existing approaches. An alternative technique for loss recovery using explicit redundancy is also proposed. This is achieved by splitting the source into different components and quantizing them differently; the primary information is finely quantized at a relatively high rate and the redundant information is coarsely quantized at a relatively low rate.
When the primary information is lost, the redundancy information can be used for recovery. The performance analysis and simulation results show that the proposed technique is very competitive compared to previously published works.

Chapter 1 Introduction

1.1 Motivation and Overview

To motivate our work in this thesis, let us first consider the simplified communication system shown in Fig. 1.1. The coding and decoding procedures can be described as follows. First, the redundancy in the data is removed by applying a decorrelating transform, such as the Karhunen-Loeve transform (KLT) or its approximations (e.g., the discrete cosine transform (DCT) and the discrete wavelet transform (DWT)). Next, the transform coefficients are quantized and entropy-coded to meet the bit budget constraint. The encoded bitstream is then sent over the channel for transmission. At the receiver, an inverse procedure is applied to reconstruct the original data.

Figure 1.1: A simplified image/video coding system. [Block diagram: Forward Transform (DCT/DWT) -> Quantization -> Entropy Coding -> Channel -> Entropy Decoding -> Dequantization -> Inverse Transform (IDCT/IDWT).]

Such a coding framework constitutes the core of virtually every existing image/video coding standard, e.g., JPEGx, MPEGx (MPEG1, MPEG2 and MPEG4,
For exam ple, th e JP E G standard is m ainly used for compression of still im ages, such as pictures in digital cameras, printers and scanners. O ne im portant issue here is to reduce the algorithm memory usage in these consum er electronics products to reduce the price [1, 2, 3]. T h at is, th e transform , quantization and entropy coding blocks in Fig. 1.1 have to be im plem ented using as little memory as possible. The M PE G x and H.26x standards, on the other hand, are designed for compression of moving pictures, such as m ovie and television video signals. In this case, fast algorithm s, including fast transform , fast quantization and fast entropy coding, becom e m ore im portant since m ost applications have very strin gent tim e constraints and some even require real-tim e encoding an d / or decoding, e.g., videoconferencing and digital broadcasting of live concerts. As one can see, it is thus very im p o rtan t to design and im plem ent efficient system s under various application constraints. W ith the fast developm ent of the Internet, m ore and more m ultim edia signals (e.g., image, video and audio signals) are being com m unicated over the packet network. However, m ost Internet service providers can only deliver best-effort services using current available network technologies, i.e., packets transm itted m ay get lost before arrival at the destination. This can happen when the net work congestion occurs or link failures at interm ediate routers. To achieve robust com m unication over such unreliable channels thus becomes another im portant objective. In this thesis work, two practical issues are addressed: (i) Efficient D W T sys tem design under m em ory and delay constraints; and (ii) Robust com m unication over packet erasure channels. We now give a detailed review on each of these two topics. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
1.1.1 DWT System Design

In recent years, there have been considerable research activities centered around building efficient systems for computing the discrete wavelet transform (DWT) [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. This is certainly due in part to the fact that the wavelet transform is a powerful tool for multiscale time-frequency signal decomposition and analysis, which has found applications in many areas, such as signal processing, digital communications, numerical analysis, and computer graphics [19]. Moreover, practical system design is itself a very challenging problem because there may be stringent constraints, such as buffer size, delay, power, chip area and control complexity, imposed by specific DWT applications [7, 17, 20, 21, 22, 8, 14, 23, 12, 2, 11, 18].

In many cases, a sequential architecture is used, where the DWT is computed by splitting the input into blocks, with the processor operating on one block at a time [6, 12, 2]. One reason for such a choice is that only a limited amount of memory is available for the transform computation. Example scenarios include image compression/decompression systems using a DSP/ASIC chip in consumer electronics products (e.g., digital cameras) or space-borne instruments [12, 2, 24]. In these applications, reducing the memory buffer size helps not only to maintain low costs but also to reduce the chip design area and thus the volume of the final product.

As an alternative, a parallel architecture would split the input among several processors to speed up the transform computation [11, 25, 9, 23, 18]. This is typical for applications where a large volume of data has to be processed in a reasonably short time. For instance, seismic data processing [23] or illumination computations in computer graphics [19] are potential applications.
Obviously, fast DWT computation to meet stringent delay constraints is critical to the success of any wavelet-based technique. While the problem of system design under constraints such as memory and delay is not new, and is also encountered in the design of traditional transforms (e.g., the FFT or DCT), system design for the wavelet transform poses particular difficulties not present before.

Consider the example three-level wavelet decomposition depicted in Figure 1.2, where h and g are respectively the lowpass and highpass filters. The input is first filtered by h and g. The lowpass output is then downsampled and filtered again. For multilevel decompositions, this process of filtering and downsampling has to be performed recursively on the input data. Since the filtering operations in the DWT cannot be implemented as a block transform (with the exception of the trivial Haar transform), this recursive nature of the DWT computation poses special challenges when stringent memory and delay constraints have to be observed in practical DWT system designs.

Figure 1.2: A three-level tree decomposition.

Assume, for example, that this three-level wavelet decomposition is to be performed using two processors, with each processor being allocated half of the input data. A problem arises when the DWT is computed near the data boundaries at both processors (refer also to Fig. 2.7). Because the DWT is not a block transform, correct computation of the wavelet coefficients near the data boundaries requires each processor to access data allocated to the other processor. In this case, either the two processors exchange data before each level of the decomposition, or the two processors are given sufficient overlapped data to carry on the whole computation without communicating with each other.
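To make the recursion concrete, the sketch below computes a three-level 1D decomposition with the orthonormal Haar pair (the one filterbank, as noted above, that needs no samples beyond the block), and also tabulates a rough estimate of the overlap a longer T-tap filterbank would require across a block boundary: on the order of T − 2 extra samples per level, counted at each level's downsampled rate. All names and the estimate itself are illustrative assumptions, not the thesis's exact accounting.

```python
import numpy as np

# Orthonormal Haar analysis pair: lowpass h and highpass g.
H = np.array([1.0, 1.0]) / np.sqrt(2.0)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)

def dwt(x, levels):
    """Recursive filter-and-downsample decomposition as in Fig. 1.2:
    filter with h and g, keep every other sample, recurse on the lowpass."""
    bands = []
    for _ in range(levels):
        low = np.convolve(x, H)[1::2]   # lowpass branch, downsampled by 2
        high = np.convolve(x, G)[1::2]  # highpass branch, downsampled by 2
        bands.append(high)
        x = low                          # recurse on the lowpass output only
    bands.append(x)                      # final coarse approximation
    return bands[::-1]                   # coarsest band first

def overlap_samples(T, J):
    """Rough estimate of input samples to duplicate across a block boundary
    so J levels of a T-tap filterbank need no further communication:
    (T - 2) samples per level, at each level's input rate."""
    return (T - 2) * (2**J - 1)

x = np.arange(16, dtype=float)
bands = dwt(x, levels=3)
print([len(b) for b in bands])           # subband lengths: [2, 2, 4, 8]
# The orthonormal transform preserves energy.
assert np.isclose(sum(np.sum(b * b) for b in bands), np.sum(x * x))
# The overlap grows exponentially with the number of levels J:
print([overlap_samples(9, J) for J in (1, 2, 3, 4)])  # [7, 21, 49, 105]
```

The last line is the crux of the design problem: with a 9-tap lowpass filter, the no-communication option already duplicates over a hundred samples per boundary at four levels.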
The first approach demands frequent data exchanges between processors, which increases the communication overhead and thus adversely affects the system scalability in a parallel architecture, particularly in one with slow communication links, for example, a network of workstations (NOWs) or a local area multicomputer (LAM) [26, 10, 27, 28]. The second approach, although avoiding frequent communication, needs to overlap data at each processor. This overlap, due to the recursive nature of the DWT filtering operations, can be very large and hence very expensive in terms of memory and communication when the number of levels of decomposition increases [29].

This provides the motivation to investigate efficient DWT system design under memory and delay constraints. While there are many factors which could possibly affect the memory and delay of the DWT computation (see, e.g., the memory requirement for complete DWT compression/decompression systems [12, 2] and the interprocessor communication required by data transposition in the 2D DWT [18]), in this work we focus on the memory and interprocessor communication constraints imposed by the segmentation of the input data, either for sequential or for parallel architecture designs, as demonstrated by the above example. Consequently, the two parameters we use to measure performance are the amount of data to be transmitted between processors (or to be stored in the processor if a sequential computation is used) and the number of times data has to be communicated between processors.

1.1.2 Robust Communication

The second part of this thesis is devoted to the study of robust communication techniques for delay-constrained applications over erasure channels.
Examples of applications in which such techniques become crucial include videoconferencing and real-time audio and speech over packet networks. The best-effort service model, as currently implemented by most Internet service providers (ISPs), does not guarantee timely lossless packet delivery. Packets can be dropped or overly delayed in a number of scenarios. For example, packets can be over-delayed in congested network segments, or even dropped if packet dropping policies (e.g., random early drop (RED)) are implemented to relieve the congestion. Packets can also be corrupted, thus becoming useless at the receiver, when sent over hostile channels, e.g., a mobile radio channel suffering from severe multipath fading. As observed in a number of recent works [30, 31, 32], packet losses, if not dealt with appropriately, can cause very annoying quality changes in the received signal.

Numerous research efforts have aimed at providing quality-of-service (QoS) by redesigning the network infrastructure (e.g., RSVP [33]), thus providing bounds on packet losses or avoiding losses altogether. However, in this work, we study techniques to enable recovery from packet losses and to mitigate the losses in signal quality due to the underlying erasure channel. Our motivation is twofold: (i) the applications we study (e.g., image, video and audio communications) do not require lossless data recovery, i.e., some degradation can be tolerated as long as the degradation is below a certain threshold; and (ii) these techniques can complement QoS-based transmission (especially if packet losses still occur even if they are bounded), or at least serve as near-term solutions before the wide deployment of QoS networks. Our goal is then to design techniques that enable the signal quality to degrade gracefully in the presence of packet losses.
The problem of robust communication over erasure channels is not new and has been studied thoroughly in the context of channel coding theory [34] and data communication protocols [35]. By using error correcting codes (e.g., block and convolutional codes), one can increase the information redundancy (thus resulting in bandwidth inefficiency) to achieve more robust communication [36, 37, 38]. By using retransmission mechanisms (e.g., the ARQ schemes), one can also achieve robust communication at the expense of an increase in communication delay [39, 40]. These existing techniques, though quite successful in the past, have to be applied with caution for communications over packet networks because of the differences in the underlying channel models. For example, packet losses occur more often due to network congestion rather than because of bit errors due to channel noise, as assumed in many conventional channels. In this case, immediate retransmission attempts may even aggravate the situation and lead to more packet losses. The effectiveness of retransmission also decreases in multicast applications when end users suffer from uncorrelated losses. In this case, an appropriate choice would be to use forward-error-control (FEC) schemes. However, end users may experience different levels of packet loss due to network heterogeneity, and a scheme that provides a single level of error protection can be insufficient for some users but redundant for others. These examples show that robust communication over lossy packet networks deserves further study.

1.2 Contributions and Outline of the Thesis

In the next chapter, we present our work on DWT system architecture design. We start with an overview of previous works on DWT computations, including algorithm studies and architecture designs.
Then a finite state machine (FSM) model is proposed to characterize the DWT behavior, especially around the data boundaries. Based on this DWT/FSM model, an overlap-state technique is proposed to correctly transform the boundary samples in a multilevel wavelet decomposition with reduced interprocessor communication and memory usage. We then present various sequential and parallel DWT system designs using the proposed technique and show how transform buffer size and communication overhead can be reduced.

Our main contribution in this chapter is to provide a new technique, overlap-state, to compute multilevel wavelet decompositions in a way that can significantly reduce the memory and communication overhead compared to existing technologies. We believe that this will greatly help practical DWT system designs where memory and delay constraints have to be strictly observed.

The second part of this thesis studies techniques that can achieve robust communication over packet erasure channels, such as today's Internet. More specifically, we study techniques that can help the signal quality to degrade gracefully in case of packet losses. Two different approaches are taken in our study. In Chapter 3, a constrained correlating transform design for multiple description coding is presented. This correlating transform is used to add redundancy to the encoded data such that lost transform coefficients can be estimated (thus recovered) using the correlation existing in correctly received coefficients. Starting with a review of related works in multiple description transform coding (MDTC), we provide a detailed analysis of why non-orthogonal correlating transforms can perform better than orthogonal correlating transforms for MDTC system design. We then address the difficulties of existing MDTC system designs in which arbitrary non-orthogonal transforms are used.
A two-stage transform design approach, structure design and magnitude design, is then presented to drastically reduce both the design and implementation complexities. We also provide design examples using the proposed technique for the design of transforms for equal rate and sequential protection channels.

Chapter 4 is devoted to a different technique for packet loss recovery. Rather than using a correlating transform to implicitly add redundancy into the data for loss recovery, redundancy is explicitly added by source splitting and selective quantization. The source splitting is performed using the polyphase transform as an example, though other forms are also possible. The selective quantization is achieved by quantizing the primary information and the redundant information at different resolutions. The performance analysis of our proposed system is provided in detail for Gaussian sources, and comparisons with existing approaches are also provided. Experimental results of image and speech coding and communication over independent packet loss channels are also provided.

The main contribution in this part is that we propose simple (in design and implementation) yet efficient (competitive coding performance compared to previously reported works) techniques for adding redundancy into encoded data for loss recovery. We believe that this is important for practical system implementations when robust communication is required over unreliable channels.

Chapter 2

DWT Architecture Design

In this chapter, we study efficient DWT architecture designs under memory and delay constraints¹.

2.1 Introduction

Various investigations of efficient wavelet transform implementations have been reported in recent years.
The most popular DWT algorithm is the recursive filtering approach using the corresponding wavelet filterbank, the so-called standard algorithm [42], whose computational complexity is O(L) per output coefficient (L is the filter length). The FFT-based DWT algorithm proposed by Rioul et al. [5] can reduce the complexity from O(L) to O(log L) for large filter lengths L. For short filters, they presented a "fast running FIR filtering" technique which can achieve 30% savings in computations. Using a lattice structure, Vaidyanathan et al. [43, 44] have shown that the complexity can be reduced by a factor of 50% for orthogonal wavelet filterbanks. The ladder structure by Marshall [45] and the lifting algorithm by Daubechies and Sweldens [46] further show that, asymptotically for large filter lengths L, 50% savings in computations can be achieved for any FIR wavelet filterbank, including orthogonal and biorthogonal filterbanks.

In sequential architecture designs, most approaches adopt the standard FFT-based filtering techniques [47], i.e., overlap-add or overlap-save. These include the recursive pyramid algorithm (RPA) by Vishwanath [6], the spatially segmented wavelet transform (SSWT) by Kossentini [29], and the reduced line-based compression system by Chrysafis et al. [12, 1].

¹Parts of this chapter were published in [10, 16, 17, 41].

Figure 2.1: An example dataflow chart of a three-level wavelet decomposition. Solid lines: completely transformed data; dashed lines: boundary samples from the neighboring block. Operations 1, 3, 5: communicate boundary data samples to neighboring blocks; operations 2, 4, 6: transform for current level.
Since the SSWT overlaps data only once before the start of the transform, the overlap buffer size increases exponentially with the number of decomposition levels. An alternative is implemented in [6, 12], where data is overlapped at each level of decomposition and the buffer size is reduced. In parallel architecture designs, most proposed approaches require communication of the boundary data at each level of decomposition (see, for example, the works by Fridman et al. [11] and by Nielsen et al. [25]). One such design with three levels of decomposition is shown in Figure 2.1. To reduce the overhead caused by frequent inter-processor communication, Yang et al. [26] proposed to use boundary extensions in their DWT system configured from a cluster of SGI workstations. This, however, computes incorrect wavelet coefficients near data boundaries, which causes performance degradation in some applications, for example, low-bit-rate image coding [48].

In this chapter, we present a novel technique, overlap-state, for the DWT computation, which can help to achieve significant memory and communication savings. The idea is motivated by the standard overlap-add technique, which first performs filtering operations on neighboring data blocks independently and then completes the computation by summing the partial boundary results together [47]. We extend this idea to the case of multilevel wavelet decompositions using the lifting framework formulated by Daubechies and Sweldens [46]. The DWT is first modeled as a finite state machine (FSM) using the lifting algorithm. Multilevel partial computations (intermediate states) are performed for samples near block boundaries.
We show that, to obtain a correct transform near data boundaries, these intermediate states can be preserved in their original storage spaces (an extension of the in-place computation feature of the lifting algorithm) and exchanged only once between neighboring data blocks for any arbitrary J-level decomposition. Some recent works have also explored (independently of our work) the use of lifting factorizations for memory savings in DWT implementations [49, 50, 51, 52]. The novelty of our work is that, first, we introduce partial computations for boundary samples at multiple decomposition levels and preserve these partially computed results (intermediate states) in their original locations for later processing, and second, we propose that processors exchange data after multilevel decompositions rather than at each decomposition level. We will show how the overlap-state technique can be used to reduce the memory requirement and the interprocessor communication overhead in sequential and parallel architecture designs.

This chapter is organized as follows. In the next section, we define the problem of memory- and delay-constrained DWT system design for sequential and parallel architectures. In Section 2.3, an overview of various DWT algorithms, including the standard DWT algorithm and the lifting algorithm, is provided. The difficulties of computing the DWT near data boundaries are detailed from the memory and delay perspective. Section 2.4 then presents the overlap-state technique, based on the idea of partial computation at multiple levels in the process of DWT computation. It is shown that the proposed technique can help to significantly reduce the memory and communication needs in practical system designs. A delayed normalization technique is also introduced to speed up multilevel DWT computations. In Section 2.5, example DWT system designs for 1D and 2D data are provided in detail.
Experimental results and analysis are given in Section 2.6. Finally, Section 2.7 concludes this chapter.

2.2 Problem Definition

In this section, we define the problems of sequential and parallel architecture design using the 2D separable DWT as an example. Other types of DWT systems can be defined similarly.

2.2.1 Sequential Architecture Design

A sequential system for the 2D DWT is shown in Figure 2.2. The transform working buffer (e.g., on-chip memory or cache memory) is usually small in size compared to the data size. Therefore the original data, stored in a secondary storage space (e.g., hard disk, frame buffer), has to be segmented such that each segment can be loaded into the working buffer and the transform can be computed one segment at a time. Variations of this generic system include:

1. The block-based system presented in the SSWT by Kossentini [29], which computes the wavelet transform one image block at a time.

2. The line-based system presented by Chrysafis et al. [12, 1], which computes the wavelet transform "on the fly" and where the basic input units are image lines.

Figure 2.2: A typical sequential DWT system diagram.

Assume blocks of size B (in bytes) are given to the processor, i.e., at most B bytes of input data can be loaded into the working buffer at a time. Define the overlap buffer size Bs, in bytes, as the memory space taken by the overlapped data which has to be kept in the working memory so that the correct transform can be computed for the next block of input data. In the case of the line-based systems [12, 1], Bs is the minimum working buffer size needed for the transform.
After the transform of each block, only (B − Bs) bytes can be freed and the wavelet coefficients generated can be transferred to the next processing stage, e.g., quantization. We now define the system efficiency η as

    η = (B − Bs)/B = 1 − Bs/B    (2.1)

Intuitive explanations of η are as follows. If η = 1, all of the original data samples can be fully transformed, which corresponds to the case of pure block transforms, such as the DCT or the Haar transform. If, using the whole buffer, no complete decomposition can be performed (i.e., the data is not enough for J levels of decomposition), then η = 0. We mention, however, that in this case it is possible that some of the wavelet coefficients in high-frequency bands can still be generated. The problem is formulated as: Given a fixed working buffer size B, how do we compute the DWT to maximize the system throughput η? Obviously, to increase the system throughput, one has to reduce the overlap buffer size Bs as much as possible.

2.2.2 Parallel Architecture Design

Mesh Processor Network

We first consider a 2D mesh-connected processor network, depicted in Figure 2.3(a), where each processor is connected only with its immediately neighboring processors (a similar model is also studied in [11]). Communications between processors can be according to the single-port model, in which processors may only send or receive a message over one communication link at a time, or the multi-port model, in which a processor may send or receive multiple messages over multiple links at a time. Using such a model, the natural partition for 2D data is the block partition strategy shown in Figure 2.3(b). The processor Pm,n is allocated the input

Figure 2.3: (a) 2D mesh processor network; (b) the corresponding data partition.
samples with indices (x, y), mNr ≤ x < (m+1)Nr, nNc ≤ y < (n+1)Nc. Without loss of generality, assume W = MNr and H = NNc, where (Nr, Nc) are the block row and column lengths, and (M, N) are the numbers of processors in the row and column directions, respectively.

Bus Processor Network

We also consider another type of processor network in which each processor can communicate with every other processor through a common bus. An example is the LAM (Local Area Multicomputer) systems, see Figure 2.4(a), where locally connected machines are reconfigured into a parallel system (a similar model can also be found in [23]). One possible data partitioning approach is the strip partition, depicted in Figure 2.4(b), where processor Pn is allocated the input samples of indices (x, y), 0 ≤ x ≤ W − 1, nNc ≤ y < (n+1)Nc.

The message passing mechanisms in both processor networks are modeled as follows. The communication time Tc for a size-m message is

    Tc = ts + m·tw + tp

where ts is the time it takes to establish a connection, tp is the propagation time, and tw is the time to transmit a size-1 message. If one message unit is an integer, then tw is the time to transmit one integer. Other cases are defined similarly. Notice that for the bus processor network, tp is taken as the average propagation time and, for the mesh processor network, tp = l·th, where l is the number of links and th is the propagation time over one link.

Figure 2.4: (a) Bus-connected processor network; (b) the corresponding strip data partition.

The design problem is formulated as: Given the communication model as defined above, minimize the communication overhead in a parallel DWT system. To this end, clearly we can reduce the overhead by reducing the number of communications and/or reducing the amount of data that has to be exchanged.
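The message-passing model can be made concrete with a small calculation. The sketch below is illustrative only: the numbers chosen for ts, tw and tp are hypothetical, and it simply contrasts exchanging boundary data at every one of J decomposition levels against exchanging the same total volume once.

```python
def comm_time(m, t_s, t_w, t_p):
    # Tc = ts + m*tw + tp: connection setup, per-unit transfer, propagation.
    return t_s + m * t_w + t_p

# Hypothetical timings: large startup cost (e.g., a NOW/LAM network).
t_s, t_w, t_p = 100.0, 1.0, 5.0
J, m = 3, 20  # 3 decomposition levels, 20 boundary samples per exchange

per_level = J * comm_time(m, t_s, t_w, t_p)  # one exchange at every level
once = comm_time(J * m, t_s, t_w, t_p)       # one combined exchange
# per_level = 375.0, once = 165.0
```

When the startup term ts dominates, reducing the *number* of exchanges matters more than reducing the data volume, which is precisely the trade-off the overlap-state technique of this chapter targets.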
2.3 Preliminaries

To lay the foundation for our proposed architecture designs, we first give a review of DWT algorithms, including the standard and the lifting algorithms. The difficulties of applying traditional techniques to meet the system design constraints (memory and delay, as outlined in the previous section) are then explained in detail. We mention that, throughout this chapter, we focus on the tree-structured [42] multilevel octave-band wavelet decomposition system with critical sampling using a two-channel wavelet filterbank. The extensions of our study to systems of standard DWTs [53], multichannel wavelet filterbanks, and wavelet packet decompositions are straightforward.

2.3.1 The Standard Algorithm

Theoretically [54], the wavelet transform is a signal decomposition technique which projects an input signal onto a multiscale space constructed from the dilated and translated versions of a prototype wavelet function, i.e., the mother wavelet. Computationally, most wavelet transforms can be implemented as recursive filtering operations using the corresponding wavelet filterbank, as shown in Figure 1.2. This implementation will be referred to as the standard algorithm hereafter. We emphasize that the filtering operations are performed only every other sample for two-channel filterbanks, i.e., a subsample-filtering approach, which is already a fast algorithm [5]. The pseudo-code implementation is given in Table 2.1.

Table 2.1: The standard algorithm.
    input: x[n], n ∈ [0, N)
        N: input sequence length
        J: number of decomposition levels
        L: filter length
    output: y[k], k ∈ [0, N)
    begin
      for (j = 1; j <= J; j++)
        for (n = 0; n < N/2^j; n++) {
          x^j[n] = Σ_m x^(j−1)[m] h[2n − m]
          y^j[n] = Σ_m x^(j−1)[m] g[2n − m]
        }
    end

For practical applications with memory and delay constraints, however, the standard algorithm may not be a good choice, for three reasons: (i) it requires a buffer of the same size as the input to store the intermediate results (the lowest subband) for the recursive filtering operations; (ii) it has a large latency, since all the outputs of one subband are generated before the output of the next subband; and (iii) the computation cost is high. Define the algorithm computation cost as the number of multiplications and additions per output point. Using wavelet filters with L taps, L multiplications and (L − 1) additions are needed for one output point at each level. The cost Cs of the standard algorithm, for a J-level wavelet decomposition, can be computed as [5]

    Cs = (L + L − 1)(1 + 1/2 + 1/4 + ··· + 2^−(J−1))    (2.2)
       = 2(2L − 1)(1 − 2^−J)    (2.3)

2.3.2 The Lifting Algorithm

A size-N polyphase transform [54] of a signal x[n] is defined as a mapping that generates N subsequences, each being a shifted and downsampled version of x[n], i.e., x_i[n] = x[nN + i]. These subsequences are called the polyphase components of the signal x[n]. In the case of N = 2, this transform simply divides the input sequence into two polyphase components, which consist of samples from the original sequence having odd indices and even indices, respectively. In the z-transform domain, the polyphase representation of x[n] is X(z) = Σ_{i=0}^{N−1} z^−i X_i(z^N). Define the polyphase matrix P(z) as

    P(z) = [ H0(z)  H1(z) ]
           [ G0(z)  G1(z) ]    (2.4)

where H_i(z) is the ith polyphase component of H(z) (similarly defined for G(z)). The DWT in the polyphase domain can be written as

    [ Y0(z) ]   [ H0(z)  H1(z) ] [ X0(z) ]
    [ Y1(z) ] = [ G0(z)  G1(z) ] [ X1(z) ]    (2.5)

A schematic plot is shown in Figure 2.5 for the DWT with a two-channel wavelet filterbank operating in the polyphase domain. If the determinant of P(z) is a monomial, i.e., det(P(z)) = z^−l, then X can be losslessly reconstructed as X = P^−1(z)Y with only a phase shift. In this case, the corresponding filters are called perfect reconstruction (PR) wavelet filters.

Figure 2.5: Two-channel wavelet filterbank in polyphase representation.

The main advantage of the polyphase-domain transform computation is that the polyphase matrix P(z) can be further factored, and the factorization leads to fast DWT algorithms [43, 44, 45, 46]. Using the Euclidean algorithm, Daubechies and Sweldens [46] have shown that the polyphase matrix P(z) of any PR FIR filterbank can be factored into a product of elementary matrices as

    P(z) = [ 1/K  0 ]  ∏_{i=m}^{1}  [ 1       0 ] [ 1  s_i(z) ]
           [ 0    K ]               [ t_i(z)  1 ] [ 0  1      ]    (2.6)

where (s_i(z), t_i(z)) are the prediction and updating filters, respectively, at stage i. It has been shown that such a lifting-factorization-based DWT algorithm is, asymptotically for long filters, twice as fast as the standard algorithm (Theorem 8 in [46]). In Figure 2.6, the forward and inverse DWT using the lifting factorization are illustrated schematically.

Figure 2.6: Wavelet transform via lifting. (a) Forward transform. (b) Inverse transform.
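As a concrete instance of a factorization of the form (2.6), the sketch below implements the textbook lifting steps of the Haar filterbank in Python: split into polyphase components, one prediction step, one update step, and the final diagonal scaling. The particular filter values are the standard Haar lifting coefficients, used here only for illustration; they are not derived in this chapter.

```python
def haar_lift_forward(x):
    """Forward Haar DWT via lifting: polyphase split, predict the odd
    samples from the even ones, update the even samples, then scale."""
    even, odd = list(x[0::2]), list(x[1::2])
    d = [o - e for o, e in zip(odd, even)]      # predict: detail = odd - even
    s = [e + di / 2 for e, di in zip(even, d)]  # update: average
    K = 2 ** 0.5
    return [si * K for si in s], [di / K for di in d]

def haar_lift_inverse(low, high):
    """Invert by undoing each elementary lifting step in reverse order."""
    K = 2 ** 0.5
    s = [si / K for si in low]
    d = [di * K for di in high]
    even = [si - di / 2 for si, di in zip(s, d)]
    odd = [di + e for di, e in zip(d, even)]
    x = []
    for e, o in zip(even, odd):
        x += [e, o]
    return x

low, high = haar_lift_forward([2.0, 4.0, 6.0, 8.0])
```

Each elementary step is an addition of a quantity computed from the *other* polyphase component, so it is inverted by a simple subtraction; perfect reconstruction therefore holds regardless of the filter taps chosen, which is the structural property the factorization (2.6) guarantees.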
Notice that the elementary matrices in the lifting factorization are all triangular (upper or lower triangular), with constants in the diagonal entries. Such a choice of elementary matrices enables the implementation of the DWT to be in-place (see the next section for details), a key difference with respect to other types of factorizations (e.g., the lattice and ladder factorizations). While all these factorizations can reduce the DWT computation, the in-place feature can also reduce the transform memory. Consequently, the lifting algorithm is chosen as the baseline DWT algorithm for our proposed architecture designs.

2.3.3 Practical DWT System Design

For practical DWT system design under memory and delay constraints, choosing only a fast algorithm (e.g., the lifting algorithm) may not be sufficient. First, the complexity of the lifting algorithm is still linear in the size N of the input data, i.e., O(N). If a parallel system is used to further speed up the computation, the first problem to solve is that of establishing an efficient sharing of data across processors so that the correct transform can be computed at the boundaries. Second, though the in-place feature of the lifting algorithm eliminates the need for a buffer to store intermediate results, it does not address the problem of the extra buffer requirement when the input data has to be transformed on a block-by-block basis.

In Figure 2.7, we represent the DWT operation near block boundaries for a one-level decomposition. Obviously, extra buffering or communication is needed to ensure correct computations near data block boundaries. Such a problem also exists in conventional linear filtering of long data sequences and can be dealt with by using either overlap-add or overlap-save techniques (see Appendix A).
However, because the DWT consists of recursive filtering operations on multilevel downsampled data, direct application of these two existing techniques may significantly increase the cost in terms of memory and/or communication.

Figure 2.7: Boundary processing for the DWT. When the filter (length L) moves to the right boundary of block 1, it needs input data samples from block 2. In a sequential architecture, blocks 1 and 2 do not reside in memory at the same time. In a parallel architecture, blocks 1 and 2 are allocated to different processors.

Consider a J-level wavelet decomposition with block size N and filter length L. Both overlap-add and overlap-save require an extra buffer (for boundary filtering operations) of size L − 2 for each level of decomposition. If the overlap is done once for all decomposition levels (the SSWT approach by Kossentini [29]), the total overlap buffer size is (2^J − 1)(L − 2), which increases exponentially with J. This becomes significant if deep decompositions and long wavelet filters are used. An alternative is to overlap at each level. In this case, the overlap buffer size is J(L − 2) for a J-level decomposition. This, however, causes delay in parallel architectures, since one processor has to wait for the other to send new data after each level of decomposition (an approach described in [11, 25]). A third approach [26, 23] is to use boundary extension (e.g., symmetric extension) to approximate the data in the neighboring blocks. This completely eliminates the overlap buffer and also eliminates the communication for data exchanges between processors. Unfortunately, the DWT coefficients near block boundaries are then computed incorrectly.
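The two overlap buffer formulas are easy to tabulate. The helper below is a sketch (the function names are ours) comparing the overlap-once cost (2^J − 1)(L − 2) against the per-level cost J(L − 2):

```python
def overlap_once(J, L):
    # SSWT-style overlap: all levels' boundary data buffered up front,
    # growing exponentially with the number of levels J.
    return (2 ** J - 1) * (L - 2)

def overlap_per_level(J, L):
    # Overlap performed again at each of the J decomposition levels.
    return J * (L - 2)

# e.g., a 5-level decomposition with length-9 filters:
# overlap_once(5, 9) -> 217 samples, overlap_per_level(5, 9) -> 35 samples.
```

The exponential growth with J is the memory price of avoiding per-level data exchanges; the overlap-state technique introduced next aims at a small buffer without paying the per-level communication delay.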
The above analysis thus shows the inefficiencies, in terms of memory and/or communication overhead, of DWT system designs which adopt the existing overlapping techniques. In the next section, we will introduce a novel overlap-state technique for DWT computation across block boundaries, which can help to reduce the communication overhead in parallel architectures and the overlap buffer size in sequential architectures.

2.4 The Overlap-State Technique

In this section, we first introduce the FSM model for the DWT based on the lifting factorization. Then we present the overlap-state technique for DWT computation across consecutive data blocks, which can help to significantly reduce memory and communication overhead in DWT system designs.

2.4.1 The Finite State Machine Model

From the lifting point of view [55, 46], the elementary triangular matrices in the factorization (2.6) can be further classified as prediction/lifting and updating/dual-lifting operations, respectively. From a computational point of view, however, there is no big difference among these elementary matrices, each of which essentially updates one polyphase component at a time using linear convolutions. Without loss of generality, we introduce the notation e^i(z) to represent the elementary matrices, that is,

    e^i(z) = [ 1  s_i(z) ]    or    [ 1       0 ]
             [ 0  1      ]          [ t_i(z)  1 ]

Let the input be X(z), with polyphase representations X(z) = [X0(z) X1(z)]^t in the frequency domain and x(n) = [x0(n) x1(n)]^t in the time domain. Now define the intermediate states in the process of the transformation, {X^i(z), i = 0, 1, ..., 2m + 1}, as

    X^i(z) = e^(i−1) e^(i−2) ··· e^0 X(z)    (2.7)
           = ∏_{j=0}^{i−1} e^j(z) X(z)       (2.8)
           = e^(i−1)(z) X^(i−1)(z)           (2.9)

where X^i(z) is the resulting signal after the first i elementary matrices have been applied. Consider one lifting stage using a lower triangular elementary matrix
Consider one lifting stage using a lower triangular elementary matrix e^i(z) to update X^i(z) into X^{i+1}(z) as follows:

    [ X_0^{i+1}(z) ]   [ 1        0 ] [ X_0^i(z) ]   [ X_0^i(z)                    ]    (2.10)
    [ X_1^{i+1}(z) ] = [ t^i(z)   1 ] [ X_1^i(z) ] = [ X_1^i(z) + t^i(z) X_0^i(z)  ]    (2.11)

As one can see, in this transformation step the polyphase component X_0(z) is unchanged, while polyphase component X_1(z) is updated by adding a quantity computed from the other polyphase component. In the time domain, this means that all even samples are preserved while all odd samples are updated. For an input vector x of size N (assuming N even), the state transition can be written, for k = 0, 1, ..., N/2 - 1, as

    x^{i+1}(2k)   = x^i(2k)
    x^{i+1}(2k+1) = x^i(2k+1) + sigma(2k+1)           (2.12)

Denote t^i(z) = sum_{n=-a^i}^{b^i} t^i_n z^{-n} (a^i >= 0, b^i >= 0); then the updating quantity sigma(n) can be computed as

    sigma(n) = { 0                                    n = 2k
               { sum_l t^i_l x_0^i(k - l)             n = 2k + 1        (2.13)

If e^i(z) is upper triangular, then odd samples are unchanged and even samples are updated. In this case, denote s^i(z) = sum_{n=-a^i}^{b^i} s^i_n z^{-n} (a^i >= 0, b^i >= 0); then the updating quantity sigma(n) for the upper triangular matrix e^i(z) is

    sigma(n) = { sum_l s^i_l x_1^i(k - l)             n = 2k
               { 0                                    n = 2k + 1        (2.14)

An important observation is that only one polyphase component is updated at each state transition, and the updating quantity sigma(n) depends only on samples from the other polyphase component. When updating even samples, only odd samples are needed, and vice versa. This leads to the following three conclusions for state updating at each stage:

1.
Whenever X^i is updated into X^{i+1}, there is no need to keep the old value of X^i, since no later updating will need it. In other words, every time we generate X^i, we only need to store this set of values; we do not need to know any of the earlier X^j, for j < i, in order to compute the output (the final wavelet coefficients).

2. The updated value of each sample x^{i+1}(n) can be stored in the memory space allocated for x^i(n), since the old value x^i(n) contributes neither to the updating of its neighbors nor to any later stage of updating. For example, x^i(1) can be overwritten by x^{i+1}(1) without affecting the updating of x^i(3). This is the so-called in-place property of the lifting algorithm. Thus, to transform a block of size N we need only a buffer of size N, while the standard algorithm needs a buffer of size 2N (N for the original input and N for the transform outputs).

3. The updating of each sample x^i(n) can be implemented independently of the updating of other samples. That is, there is no required ordering of the updating between samples. For example, one can update x^i(3) before or after the updating of x^i(1) and obtain the same result.

For the polyphase matrix factorization, the necessary and sufficient condition for the above properties is that the elementary matrices e^i(z) can only be lower/upper triangular matrices with constants on the diagonal entries, as mentioned before. This key property of the lifting factorization guarantees that the DWT can be computed in place. That is, each raw input data sample x(n) (initial state) is progressively updated into a wavelet coefficient (final state) using samples in its neighborhood.
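The in-place property can be demonstrated with a small sketch. The code below is our illustration, not code from the text: it implements one level of CDF(2,2)/LeGall 5/3-style lifting steps, overwriting each sample in its own memory slot and using symmetric extension at the two signal ends; the inverse simply replays the elementary steps in reverse order with opposite signs.

```python
def lifting53_forward(x):
    """One level of 5/3-style lifting, computed in place: predict the odd
    samples from their even neighbours, then update the even samples.
    The list x is modified and returned; no second buffer is needed."""
    n = len(x)
    for i in range(1, n, 2):                      # predict (dual lifting)
        r = x[i + 1] if i + 1 < n else x[i - 1]   # symmetric extension
        x[i] -= (x[i - 1] + r) / 2
    for i in range(0, n, 2):                      # update (primal lifting)
        l = x[i - 1] if i - 1 >= 0 else x[i + 1]
        r = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += (l + r) / 4
    return x

def lifting53_inverse(x):
    """Undo the elementary steps in reverse order with flipped signs."""
    n = len(x)
    for i in range(0, n, 2):                      # undo update first
        l = x[i - 1] if i - 1 >= 0 else x[i + 1]
        r = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] -= (l + r) / 4
    for i in range(1, n, 2):                      # then undo predict
        r = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += (x[i - 1] + r) / 2
    return x
```

A constant signal maps to zero detail coefficients, and the inverse restores the input exactly, since each step only adds a quantity computed from the untouched polyphase component.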
Thus the wavelet transform based on the polyphase factorization can be modeled as an FSM in which each elementary matrix e^i updates the FSM state X^i to the next state X^{i+1}. The forward wavelet transform Y(z) can be written as

    Y(z) = P(z) X(z) = [ prod_{i=2m}^{0} e^i(z) ] X(z)        (2.15)
         = e^{2m}(z) ... e^1(z) e^0(z) X^0(z)                 (2.16)
         = X^{2m+1}(z)                                        (2.17)

with intermediate states X^1(z), X^2(z), ..., X^{2m}(z), and the inverse transform is

    X(z) = P^{-1}(z) Y(z) = [ prod_{i=0}^{2m} e^{-i}(z) ] Y(z)    (2.18)
         = e^{-0}(z) e^{-1}(z) ... e^{-2m}(z) X^{2m+1}(z)         (2.19)
         = X^0(z)                                                 (2.20)

where e^{-i}(z) is the inverse of e^i(z). The schematic plot of the DWT as an FSM is depicted in Fig. 2.8. A formal definition is given as follows [56].

Figure 2.8: State transition diagram of the DWT as an FSM.

Definition. A discrete wavelet transform can be modeled as a finite state machine, that is, a 5-tuple (Q, Sigma, delta, q_0, F) [56], where, for a factorization with M elementary matrices:

Q is a finite set of states, Q = {X^i(z)}, i = 0, 1, ..., M.

Sigma is a finite set of events, Sigma = {e^i(z)}, i = 0, 1, ..., M-1, the lifting operations at each lifting stage.

q_0 is the initial state, q_0 = X^0(z) (the raw input data).

F is the final state, F = X^M(z) = Y(z) (the wavelet transform output).

delta is the transition function mapping Q x Sigma -> Q, delta = {(X^i(z), e^i(z)) -> X^{i+1}(z)}.

2.4.2 Overlap-State

Assume there are M elementary matrices {e^i, i = 0, 1, ..., M-1} in the factorization of the polyphase matrix P(z); then there are M+1 states in the FSM defined above. The FSM model tells us that, to compute the transform, one needs to take each and every sample x(n) through its state transitions from state 0 up to state M sequentially. This means that one has to compute the updating quantities {sigma^i(n), i = 0, 1, ..., M-1} of (2.13) and (2.14) at all these stages.
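The state-transition view can be made concrete with a short sketch (ours; the two events below form an illustrative unnormalized 5/3-style factorization, not a factorization quoted in the text). Each event updates one polyphase component in place, and the inverse FSM simply applies the inverse events e^{-i} in reverse order:

```python
def apply_step(x, parity, taps, sign):
    """One elementary matrix e^i (sign = +1) or its inverse e^{-i}
    (sign = -1): samples of the given parity are updated from the other
    polyphase component. taps is a list of (coefficient, odd offset);
    taps falling outside the signal are simply dropped here."""
    n = len(x)
    for i in range(parity, n, 2):
        acc = sum(t * x[i + off] for t, off in taps if 0 <= i + off < n)
        x[i] += sign * acc
    return x

# events e^0 (predict: updates odds) and e^1 (update: updates evens)
events = [(1, [(-0.5, -1), (-0.5, +1)]),
          (0, [(0.25, -1), (0.25, +1)])]

def forward(x):                        # X^0 -> X^1 -> X^2 = Y
    for parity, taps in events:
        apply_step(x, parity, taps, +1)
    return x

def inverse(y):                        # replay the events backwards
    for parity, taps in reversed(events):
        apply_step(y, parity, taps, -1)
    return y
```

Because each event touches only one polyphase component, inversion needs no matrix algebra at all: undoing the last event first restores exactly the state X^i that the next undo expects.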
Unfortunately, this cannot be accomplished for samples near block boundaries. This happens when the input has to be transformed on a block-by-block basis, due to buffer size limitations or for the purpose of parallel processing.

Consider one operation across a data boundary using an upper triangular elementary matrix. Let the current state be i and let the input sequence x^i(n) be segmented at point 2k (refer to Figure 2.9). In this state transition, even-indexed samples are updated using odd-indexed samples. The updating quantity sigma(2k) of (2.14) is

    sigma(2k) = sum_{l=-a^i}^{b^i} s^i_l x_1^i(k - l)                                          (2.21)
              = sum_{l=-a^i}^{b^i} s^i_l x^i(2k - 2l + 1)                                      (2.22)
              = sum_{l=1}^{b^i} s^i_l x^i(2k - 2l + 1) + sum_{l=-a^i}^{0} s^i_l x^i(2k - 2l + 1)   (2.23)
                `-------------- C(2k) --------------'   `--------------- A(2k) ---------------'

where C(2k) and A(2k) are the contributions from the causal and anti-causal parts of the filter s^i(z), respectively: C(2k) collects the taps reading samples to the left of the cut (indices <= 2k - 1, available in block 1), and A(2k) those to the right (indices >= 2k + 1, in block 2).

Figure 2.9: State transitions across a block boundary using e^i with a^i = 3 and b^i = 2. (a) Partial computations near the boundaries. (b) After updating, boundary samples stay in intermediate states. The "new boundary" separates fully updated output coefficients from partially computed ones.

Obviously, for samples near the block boundary we cannot compute both C(2k) and A(2k) due to the segmentation. Therefore sigma(2k) will not be available.
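Numerically, the split sigma(2k) = C(2k) + A(2k) just partitions the filter taps by which side of the cut they read from. A small sketch (illustrative taps and data, ours):

```python
# taps: (coefficient, odd offset); offsets > 0 read samples beyond index
# 2k, i.e. from the right-hand block once the signal is cut after 2k
taps = [(0.25, -3), (0.25, -1), (0.25, 1), (0.25, 3)]
x = [float(v) for v in range(16)]      # toy signal
k = 4                                  # the even sample x(2k) = x(8)

C = sum(t * x[2 * k + o] for t, o in taps if o < 0)   # causal: in block 1
A = sum(t * x[2 * k + o] for t, o in taps if o > 0)   # anti-causal: block 2
sigma = sum(t * x[2 * k + o] for t, o in taps)        # full update quantity

assert C + A == sigma == 8.0
```

Block 1 can compute C immediately; A must wait until block 2 is available, which is precisely why x(2k) gets stuck between states i and i+1.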
As a result, these samples cannot be updated into state i+1. In Fig. 2.9(a), for example, x^i(2k) in block 1 cannot be updated into x^{i+1}(2k), since {x^i(2k+1), x^i(2k+3), x^i(2k+5)} are in the right block and thus are not available at the time block 1 is transformed. Consequently sigma(2k), the updating factor for sample x^i(2k), cannot be computed, and x^i(2k) cannot be updated into x^{i+1}(2k). Rather than leaving x^i(2k) in state i, we choose to partially update it as

    x^i(2k) <- x^i(2k) + C(2k)

since C(2k) can be computed from the causal neighborhood (a function of {x^i(2k-1), x^i(2k-3), x^i(2k-5)}). The significance of this partial updating is that one can free all the samples in the causal past for future processing and save memory. In this case, samples {x^i(2k-1), x^i(2k-3), x^i(2k-5)} do not need to be buffered for the full updating of x^i(2k), since their contribution C(2k) has already been added to the partial result x^i(2k). On the other hand, if we choose not to partially update x^i(2k), then {x^i(2k-1), x^i(2k-3), x^i(2k-5)} have to be buffered. The same partial updating also happens for samples {x^i(2k-2), x^i(2k-4)} in the left block and samples {x^i(2k+2), x^i(2k+4)} in the right block. For the complete state transition from i to i+1, we need to buffer in each block the following samples:

1. Partially updated samples, such as {x^i(2k), x^i(2k-2), x^i(2k-4)} in the left block and {x^i(2k+2), x^i(2k+4)} in the right block.

2. Contributing samples required by partially updated samples (in the other block) to complete the transform, such as {x^i(2k-1), x^i(2k-3)} in the left block and {x^i(2k+1), x^i(2k+3), x^i(2k+5)} in the right block.
Notice that these partially updated samples are not exactly the state information as defined in the FSM definition. For simplicity, however, the partially updated samples and the contributing samples together will be called the state information hereafter. Obviously, as long as the state information is preserved at each stage, the transform can be completed at any later time; that is exactly what an FSM is supposed to do. We mention that such later processing is possible because partial updating in the right block (updating of x^i(2k+2) and x^i(2k+4)) can be implemented independently of the partial updating in the left block (updating of x^i(2k), x^i(2k-2) and x^i(2k-4)), as discussed before. The partial updating does not remove any information needed by the other block, since it updates samples that are not inputs at the i-th state transition stage. The end state after application of e^i is shown in Fig. 2.9(b). As one can see, because partially updated samples cannot be used for further processing, the size of the segment over which we can compute is reduced: the effective boundary is now reached before sample x^{i+1}(2k-4) in block 1 and sample x^{i+1}(2k+6) in block 2. Effectively, the physical boundary splits into two and extends inwards in both blocks. The next state transition via e^{i+1} will operate only on samples in state i+1. All the samples between the two new boundaries become the state information, and the same procedure repeats at each state transition stage.

To complete the transform for samples near the block boundary, the state information in neighboring blocks needs to be exchanged. This can be done by overlapping the states between consecutive blocks. Thus we propose the overlap-state method for DWT computation across consecutive data blocks. The overlap-state procedure is shown in Fig.
2.10. In the case of parallel processing, the implementation is shown in Fig. 2.11. Though only one state transition is shown in these two figures, the overlap-state design generalizes readily to multiple state transitions at multiple decomposition levels, because all state transitions share the same three properties given before (see Section 2.4.1).

2.4.3 Performance Analysis

Buffer Size Analysis

Given a lifting factorization of the polyphase matrix P(z), we now show how much state information one needs to store for the DWT computation, i.e., the overlap buffer size. This is a key factor for memory-constrained sequential architecture designs. As shown before, at each stage the partially updated samples and the contributing samples need to be stored. Denote the total number of partially updated samples as B^i_1 and the total number of contributing samples as B^i_2. Writing s^i(z) and t^i(z) as

    s^i(z) = sum_{n=-a^i}^{b^i} s^i(n) z^{-n},    t^i(z) = sum_{n=-a^i}^{b^i} t^i(n) z^{-n}    (2.24)

where a^i >= 0, b^i >= 0, we have B^i_1 = a^i and B^i_2 = b^i. The number of samples that must be buffered at stage i is thus B^i = a^i + b^i. Assume there are N state transitions

Figure 2.10: Sequential DWT using overlap-state. (a) Input in the initial state i. (b) Block 1 consists of samples up to x^i(2m).
After the state transition, samples near the block boundary are only partially updated (anti-causal filtering results not yet available). (c) Partially updated samples (state information) are overlapped with the next block of input samples; they are then completely updated by adding their anti-causal filtering results. (d) Completely transformed (updated) input, taken from state i to i+1.

in the factorization of P(z); the buffer size B_s for a one-level decomposition is then

    B_s = sum_{i=0}^{N-1} B^i                (2.25)
        = sum_{i=0}^{N-1} (a^i + b^i)        (2.26)

Since the lifting factorization of a given polyphase matrix is not unique, one would obviously choose the factorization which gives the minimum B_s if the amount of memory is limited. An alternative way to find the buffer size is to plot the state transitions graphically for a given factorization; see Appendix B for details.

Figure 2.11: Parallel DWT using overlap-state. (a) Input in the initial state i. (b) The input is partitioned over two processors. Blocks 1 and 2 are transformed separately, and each has state information near the block boundary. (c) State information is exchanged between the two processors, and the partially updated samples are fully updated. (d) Completely transformed (updated) input, taken from state i to i+1.
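The split/merge mechanics can be made concrete with a sketch (ours; one illustrative update stage with a +/-3-tap odd neighborhood, not a factorization from the text). It transforms a signal whole and then in two blocks, buffering the partially updated boundary evens as state and completing them at merge time; the two results agree.

```python
TAPS = [(0.25, -3), (0.25, -1), (0.25, 1), (0.25, 3)]  # illustrative stage

def stage_whole(x):
    """Reference: one update stage applied to the interior even samples."""
    y = list(x)
    for i in range(4, len(x) - 3, 2):
        y[i] += sum(t * x[i + o] for t, o in TAPS)
    return y

def stage_blocked(x, m):
    """Same stage with the signal cut at m (m even): block [0, m) first,
    then block [m, n), then a merge step using the overlapped state.
    Odd samples are never modified, so the raw values in x remain valid
    as contributing samples throughout."""
    n, y, state = len(x), list(x), {}
    for i in range(4, m, 2):                       # --- block 1, alone ---
        avail = [(t, o) for t, o in TAPS if i + o < m]
        if len(avail) == len(TAPS):
            y[i] += sum(t * x[i + o] for t, o in TAPS)
        else:                                      # partial update: C(i) only
            state[i] = y[i] + sum(t * x[i + o] for t, o in avail)
    for i in range(m, n - 3, 2):                   # --- block 2, alone ---
        avail = [(t, o) for t, o in TAPS if i + o >= m]
        if len(avail) == len(TAPS):
            y[i] += sum(t * x[i + o] for t, o in TAPS)
        else:                                      # partial update: A(i) only
            state[i] = y[i] + sum(t * x[i + o] for t, o in avail)
    for i, v in state.items():                     # --- merge: add the rest ---
        other = [(t, o) for t, o in TAPS
                 if (i + o >= m) == (i < m)]       # taps from the other block
        y[i] = v + sum(t * x[i + o] for t, o in other)
    return y
```

Only the handful of evens whose neighborhoods straddle the cut end up in `state`; everything else is finished inside its own block, which is exactly the memory/communication saving the overlap-state method exploits.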
Communication

The communication delay is the time spent exchanging data between adjacent processors. In existing parallel algorithms [11, 25], before each level of decomposition, (L - 2) boundary samples need to be communicated to the adjacent processors (L is the filter length). Using the communication model given in (2.2), the total communication time D_old for a J-level wavelet decomposition is

    D_old = J(t_s + (L - 2) t_w + t_p)        (2.27)

In the proposed parallel algorithm, using the overlap-state technique, the data exchange can be delayed until after the independent transform of each block, so that only one communication is necessary. An example with three levels of decomposition is shown in Figure 2.12; this should be compared with Figure 2.1. Furthermore, the size of the state information at each stage, B_s, is upper bounded by (L - 2). So the communication time of the proposed algorithm is upper bounded by

    D_new <= t_s + J(L - 2) t_w + t_p        (2.28)

As one can see, the communication overhead is reduced in the proposed parallel algorithm because (i) the number of communications is reduced, and (ii) the amount of data exchanged is reduced.

Figure 2.12: An example dataflow chart of a three-level wavelet decomposition using the proposed overlap-state technique. Solid lines: completely transformed data; dashed lines: partially transformed data. Operation 1: each block transforms its own allocated data independently and state information is buffered. Operation 2: state information is communicated to neighboring blocks. Operation 3: the transform is completed for the boundary data samples.
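Under this linear communication model, the two delay expressions can be compared directly; the sketch below (ours) plugs in illustrative unit costs for the startup, per-word, and propagation times.

```python
def delay_old(J, L, ts, tw, tp):
    """Eq. (2.27): one exchange of L-2 samples before each of J levels."""
    return J * (ts + (L - 2) * tw + tp)

def delay_new_bound(J, L, ts, tw, tp):
    """Eq. (2.28): a single exchange carrying at most J*(L-2) samples."""
    return ts + J * (L - 2) * tw + tp

# J = 3 levels, 9-tap filters, unit times: the startup and propagation
# costs t_s and t_p are paid once instead of J times
assert delay_old(3, 9, 1.0, 1.0, 1.0) == 27.0
assert delay_new_bound(3, 9, 1.0, 1.0, 1.0) == 23.0
```

The per-word term is the same in both expressions; the savings come entirely from amortizing t_s and t_p over a single communication, which is why the gain grows with the relative cost of setting up a message.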
Essentially, the overlap-state technique enables us to exchange more data in one communication setup rather than exchanging small amounts of data in multiple communication setups. It is, however, important to emphasize that how much this communication overhead reduction contributes to reducing the total computation time depends strongly on the parallel system's communication link design: the slower the inter-processor communication, the larger the gain, and vice versa.

2.4.4 Delayed Normalization

Although the lifting-based DWT algorithm has been shown to be twice as fast as the standard algorithm by Daubechies and Sweldens, in general this holds only asymptotically for long filters [46]. In this section we introduce a simple technique, delayed normalization, which helps to reduce the computation in multilevel wavelet decompositions.

As one may have noticed, the last matrix factor in the polyphase factorization (2.6) is a normalization factor which scales the lowband and highband coefficients respectively. This normalization factor appears at each level of a multilevel wavelet decomposition. Since the wavelet transform is a linear operation and multiplication by a constant commutes with linear operations, this normalization (multiplication) can be delayed until the last decomposition level. By doing so, computations can be saved.

Figure 2.13: Illustration of delayed normalization. (a) Recursive two-channel lifting. (b) Two-channel lifting with delayed normalization.

One example of a three-level octave-band wavelet decomposition is shown in Fig. 2.13.
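The commutation argument can be checked on a toy cascade. In the sketch below (ours), `split` is an unnormalized Haar-like analysis step and K is an arbitrary normalization constant; scaling recursively at every level and scaling once at the end yield identical coefficients.

```python
K = 1.2          # illustrative normalization constant

def split(x):
    """Toy unnormalized analysis step: averages (low) and differences."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [x[2 * i] - x[2 * i + 1] for i in range(len(x) // 2)]
    return low, high

def octave_dwt(x, J, delayed):
    highs, low = [], list(x)
    for _ in range(J):
        low, high = split(low)
        if not delayed:                    # normalize at every level
            low = [K * v for v in low]
            high = [v / K for v in high]
        highs.append(high)
    if delayed:                            # one scaling pass at the end:
        # the level-j highband has absorbed K^(j-1) from the unscaled
        # lowband cascade, minus the 1/K it never received
        low = [K ** J * v for v in low]
        highs = [[K ** (j - 2) * v for v in h]
                 for j, h in enumerate(highs, 1)]
    return highs, low
```

By linearity, deferring the K factors only changes when the multiplications happen, not the final coefficients, so the per-level scalings can be folded into a single final pass (or, as discussed below for compression systems, into the quantizers).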
Interestingly, the normalization operations for all the y_i coefficients can be eliminated entirely, provided that the same wavelet filterbank is applied at each stage. If a different wavelet filterbank is used at different levels of the decomposition, then in general only one normalization (multiplication) operation is necessary for each wavelet transform coefficient. Obviously, such a delayed normalization technique can also be used for multidimensional wavelet decompositions and wavelet packet decompositions. In Fig. 2.14, a one-level 2D wavelet decomposition is shown, with recursive normalization in (a) and delayed normalization in (b).

Figure 2.14: A 2D DWT example of delayed normalization.

We now give the performance analysis for a 1D octave-band wavelet decomposition. Let the input data sequence length be N and the number of decomposition levels be J. The computational costs of the standard algorithm, the lifting scheme, and the lifting scheme with delayed normalization are denoted C^J_M, C^J_L, and C^J_{L'}, respectively. The cost unit we use is the average number of multiplications and additions per output point. Then

    C^J_M = C_M (1 + 1/2 + 1/4 + ... + 2^{-(J-1)})        (2.29)
          = 2 C_M (1 - 2^{-J})                            (2.30)

where C_M is the number of multiplications and additions per output point for a one-level decomposition using the standard algorithm. Accordingly, the lifting cost is

    C^J_L = 2 C_L (1 - 2^{-J})        (2.31)
Table 2.2: Cost comparison for multilevel wavelet decompositions (J > 3).

    Wavelet         | Standard | Lifting        | Lifting with Delayed Normalization
                    | cost     | cost | speedup | cost | speedup
    Haar            | 1.5      | 1.5  | 0%      | 1.5  | 0%
    D4              | 7        | 4.5  | 56%     | 3.5  | 100%
    D6              | 11       | 7    | 57%     | 6    | 83%
    (9-7)           | 11.5     | 7    | 64%     | 6    | 92%
    (4,2) B-spline  | 8.5      | 5    | 70%     | 4    | 113%

For the lifting scheme with delayed normalization, the whole wavelet transform can be decomposed into two parts. The first is the normal lifting part, which runs for J levels without normalization; for this part the one-level average cost is C_{L'} = C_L - 1, since one normalization (multiplication) is saved for each coefficient. The second part is the final normalization of all the coefficients, which costs 1 (one multiplication) per output point. So the total average cost is

    C^J_{L'} = 2(C_L - 1)(1 - 2^{-J}) + 1            (2.32)
             = 2 C_L (1 - 2^{-J}) + 2^{-J+1} - 1     (2.33)
             = C^J_L + 2^{-J+1} - 1                  (2.34)

If N is large enough that J can be large, then in the limit C^J_{L'} is on average one operation fewer than the pure lifting scheme. In Table 2.2 we show how this affects the relative speedups of the algorithms, using the same filters given by Daubechies and Sweldens [46].

The above performance analysis applies to transforms with different wavelet filters at each stage, under the assumption that J is large enough for 2^{-J+1} to be negligible. If the same filterbank is used at all decomposition stages, the assumption can be relaxed further. Recall that in this case the normalizations of the y_i coefficients can be eliminated entirely (see Fig. 2.13). The saving is 0.25, since one quarter of the input data samples no longer have to be scaled; the average cost of the normalization part is then 0.75 rather than 1 per output point. Taking this into consideration, as long as J is large enough that 2^{-J+1} < 0.25, the above cost estimate C^J_{L'} is accurate.
That is equivalent to having J > 3, which is a reasonable assumption for most practical wavelet applications.

Further reduction of the normalization cost is possible if we jointly design the DWT system and the subsequent data processing system. For example, in a wavelet data compression system, wavelet coefficients are quantized immediately after the transform. Such a system is shown in Fig. 2.15(a), where {Q_i, i = 0, 1, 2, 3} are quantizers designed for the wavelet coefficients in the different subbands. Obviously, the normalization can be performed jointly with the quantization and thus completely eliminated from the transform, as shown in Fig. 2.15(b). For other applications, such as noise reduction by thresholding, the same computation reduction is possible. Compared to independent transform and quantization, computation is saved by the joint design.

Figure 2.15: Joint design of transform and quantization to reduce computation. (a) Independent transform and quantization. (b) Joint transform and quantization: the normalization operation is merged with the quantization.

2.5 The Proposed DWT Architectures

In this section, we first present generic sequential and parallel architecture designs for the 1D DWT using the overlap-state technique. Variations are then detailed for 2D separable DWT systems.

2.5.1 1D Systems

Sequential

In Figure 2.16 the proposed sequential DWT system is shown, and pseudocode for the sequential algorithm is given in Table 2.3.
The input data sequence is first segmented into non-overlapping blocks of length N and fed into the FSM one block at a time. The state information is saved so that after one block has been computed, the next one can use it. After transformation, the wavelet coefficients are concatenated together to give the final result.

Figure 2.16: The proposed sequential DWT architecture.

As one can see, the general system structure for DWT computation is essentially the same as that of the standard overlap-add approach. The DWT/FSM acts as a state machine with memory, and the state information (partially computed boundary samples from the previous block) at multiple decomposition levels is overlapped. This helps to reduce the memory requirement for the transform computation. The overlap leads to output delay in practice; i.e., the n output samples shown in Fig. 2.16 are delayed relative to the n input samples.

Table 2.3: The proposed sequential DWT algorithm.

    begin
        initialize state S;
        for each input block k {
            J-level wavelet transform for block k;
            update state S;
        }
    end

In Table 2.4 the overlap buffer sizes required by different sequential DWT algorithms are given. For an N-point input data block, if the lifting algorithm is implemented, the total buffer size is N + B_s. The system efficiency eta is

    eta = N / (N + B_s)        (2.35)

Obviously, the proposed sequential DWT algorithm, using the overlap-state technique, requires a smaller overlap buffer B_s and thus improves the system throughput. However, if N >> O(JL), the relative improvement becomes small. On the other hand, if N = 0, i.e., all completely transformed coefficients are transferred immediately (e.g., the line-based system in [12, 1]), the savings in memory can be significant (details are given in the next section).

Table 2.4: Overlap buffer size B_s in 1D DWT for J-level decompositions using an L-tap wavelet filterbank.

    Filterbank | SSWT [29]        | RPA [6]   | Proposed
    L-tap      | (2^J - 1)(L - 2) | J(L - 2)  | <= J(L - 2)
    (9,7)      | 7(2^J - 1)       | 7J        | 4J
    (2,10)     | 8(2^J - 1)       | 8J        | 4J
    CDF(4,2)   | 5(2^J - 1)       | 5J        | 3J
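With the buffer sizes of Table 2.4, the efficiency of (2.35) can be evaluated directly; the sketch below (ours) uses the (9,7) filterbank entries.

```python
def efficiency(N, Bs):
    """Eq. (2.35): fraction of the block buffer holding fresh input."""
    return N / (N + Bs)

def bs_proposed(J, per_level):
    """Table 2.4, proposed scheme: e.g. 4 samples/level for (9,7)."""
    return J * per_level

def bs_sswt(J, L):
    """Table 2.4, SSWT: overlap once for all J levels."""
    return (2 ** J - 1) * (L - 2)

# (9,7) filterbank (L = 9), J = 3 levels, 512-sample blocks
assert bs_proposed(3, 4) == 12
assert bs_sswt(3, 9) == 49
assert round(efficiency(512, 12), 3) == 0.977
assert round(efficiency(512, 49), 3) == 0.913
```

As the text notes, the advantage shrinks for very large N (both efficiencies approach 1) and is most pronounced when the working buffer is small.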
On th e other hand, if N = 0 when all com pletely transform ed coefficients are im m ediately transfered (e.g., th e line-based system in [1 2 , 1 ]), the savings in memory can be significant (details are given in the next section). Table 2.4: O verlap buffer size Bs in ID DW T for J-level decom positions using a L-tap wavelet filterbank. SSWT[29] RPA[6] Proposed L-tap C -1 1 i - H 1 h' 1 to < J ( L - 2) (9.7) 7(2^ — 1 ) 7 J 4 J ( 2 , 1 0 ) 8 (2 J - 1 ) S J 4 J CDF(4,2) o(2J — 1 ) 5 J 3 J 37 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. : : i i ! Input Data Stream 1 | j f ! T 1 1 \ ▼ ! I i n-Point ! ! | j n-Point ! 1 i ! n-Point i ; DWT/FSM i | DWT/FSM i I ' DWT/FSM ! j i T ▼ i i T ; Intermediate result/ r t w v i s s§ n-Point DWT/FSM n-Point n-Point : t n-Point j_ n-Point DWT/FSM DWT/FSM ; H DWT/FSM DWT/FSM ; i ; i Output Data Stream! Figure 2.17: The proposed parallel D W T architecture. In Split stage, each proces sor computes its allocated data independently up to the required decomposition level. In Merge stage, a one-way com m unication is initiated to com m unicate the state information to neighboring processors. A postprocessing operation is then started to com plete th e transform for boundary samples. P a r a lle l In Figure 2.17 the proposed parallel D W T architecture is shown and a pseudo code algorithm is given in Table 2.5.1. T he input d ata is uniform ly seg m ented non-overlapping blocks and allocated onto p available processors. Each processor com putes its own allocated d a ta up to the required wavelet decompo sition level. This stage is called Split. The o u tp u t from this stage consists of (i) com pletely transform ed coefficients and (ii) the state inform ation (partially up dated boundary sam ples). In the second stage, Merge, a one-way communication is initiated and the state inform ation is transfered to the neighboring processors. 
T he state inform ation from the neighbor processor is then com bined together with its own corresponding state inform ation to com plete the whole D W T transform. As shown before, th e proposed parallel architecture only requires one commu nication between neighboring processors for /-lev el decom positions. T he amount of d ata exchanged is also less than th a t in direct overlapping approaches [11, 25]. Therefore, the com m unication delay is reduced. 38 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 2.5: The proposed parallel DW T algorithm . begin{ transform in processor p} for( j = 0; j < J ; j + + ) { transform at current level j. store state information. } send state information to processor p + 1; receive state information from processor p — 1; for( j = 0; j < J',j + +) { transform boundary d ata samples at current level j. } end 2.5.2 2D Systems In Figure 2.IS an exam ple 2D DWT w ith two level decom positions is shown. The data is row transform ed first and then column transform ed. Naturally, data samples along block boundaries cannot be fully transform ed due to the lack of neighboring sam ples. Tfyese constitute the row and column sta te information at each level. Let us introduce some notations first. Let Nr, Nc be the width and the height of the d a ta block, respectively. For decomposition level j = 0,1, ••• ,/— 1, define • {W/0, W /i}: num bers of partially transform ed samples near left and right boundaries respectively in a row. { W ^ , defined sim ilarly for a column. • {/VJ, N*}: length of a row and a column respectively before the start of the decomposition at each level. • {M /, M l}: num ber of completely transform ed samples in a row and a col umn respectively. • B{: total num ber of partially updated samples, i.e., the size of the buffer to hold the state inform ation for further processing. 39 Reproduced with permission of the copyright owner. 
Figure 2.18: 2D DWT/FSM illustration. Shaded areas represent the state information, with R0/C0 the row/column state information at level 0, and so on. The input block is first row-transformed (a) and column-transformed (b) at level 0, then downsampled (taking the LL0 subband) and row-transformed at level 1 (c) and column-transformed at level 1 (d).

The following identities between these quantities at each level j can be derived (refer to Fig. 2.18):

    M^j_r = N^j_r - W^j_{r0} - W^j_{r1}                        (2.36)
    M^j_c = N^j_c - W^j_{c0} - W^j_{c1}                        (2.37)
    N^j_r = ceil(M^{j-1}_r / 2) for j >= 1; N^0_r = N_r        (2.38)
    N^j_c = ceil(M^{j-1}_c / 2) for j >= 1; N^0_c = N_c        (2.39)
    B^j   = N^j_r N^j_c - M^j_r M^j_c                          (2.40)

Upon completion of all J decomposition levels, we have

    B_s = sum_{j=0}^{J-1} B^j                    (2.41)
    B_e = B - B_s                                (2.42)
        = N_r N_c - sum_{j=0}^{J-1} B^j          (2.43)

where B_s is the total buffer size necessary to store the state information at all decomposition levels, and B_e is the effective block size, i.e., the number of wavelet coefficients that can be transferred to the next stage for processing, thus freeing up memory.

2.5.3 Sequential Architectures

Strip Sequential

In this case, the buffer is organized to hold one strip of data at a time; equivalently, B = W N_c or B = N_r H, where W x H is the original data size. This scenario is depicted in Fig. 2.19 for B = W N_c. Because the input data is segmented only in the column direction, state information (partially transformed samples) appears only along the column direction. Certainly, some type of boundary extension technique, such as symmetric extension, has to be used for the transform near the left and right row boundaries. For the transform of the very first strip, extension is also needed at the top of each column.
Each strip takes over the state information left by the previous strip to transform its own data. Upon completion, it also generates the state information for the next strip. The strip then slides down and the DWT is calculated strip-by-strip, with state information overlapped between strips. At the bottom of Fig. 2.19, a blown-up version of the state information is shown. For a J-level decomposition, the state buffer size B_s can be calculated as

    B_s = sum_{j=0}^{J-1} B^j                                      (2.44)
        = sum_{j=0}^{J-1} W_c1^j W 2^{-j}                          (2.45)

Figure 2.19: Strip sequential DWT system diagram. The input is segmented into data strips which are transformed sequentially from top to bottom.

Obviously, B_s is proportional to the row length W for the case depicted in Fig. 2.19. To reduce the state buffer size, the segmentation should choose the dimension with the larger data size. That is, if W > H then segment along the row direction, and segment along the column direction otherwise.

Table 2.6 compares the minimum memory requirements of our proposed algorithm with those of existing ones. As one can see, the proposed system can produce significant memory savings. Consider a color image of size 4096x4096 where each color component sample is stored as a 4-byte floating point number for DWT computation. In this case, one image scanline requires 48KB. Using the Daubechies (9,7) wavelet filterbank (L = 9), for a 3-level decomposition, the total memory would be 588KB using the RPA algorithm (the approach given in [12, 1]). Using the overlap-state technique, the buffer size can be reduced to 296KB.
Table 2.6: Comparison of memory requirements, where W is the width of the image scanline and α = (2^J − 1), β = (1 − 2^{−J}).

    Filter      SSWT[29]       RPA[6]         Proposed
    L-tap       Wα(L−2)        2Wβ(L−2)       < 2Wβ(L−2)
    (9,7)       7Wα            14Wβ           8Wβ
    (2,10)      8Wα            16Wβ           8Wβ
    CDF(4,2)    5Wα            10Wβ           6Wβ

Block Sequential. In this case, the buffer is divided into two parts, one for holding the state information and one for new input samples in the sliding transform window. Equivalently, this is when B = B_s + Nr Nc. This scenario is depicted in Figure 2.20. As one can see, the data is segmented into blocks of size Nr x Nc and transformed one block at a time. Since boundary extensions can be applied to the left and upper boundaries of the very first block A, state information {Ar, Ac} will appear only on the right and lower sides of the block upon completion of the transform. The {Ar, Ac} correspond respectively to the partially transformed row and column samples. When the window slides right to the position of block B, only the row state information Ar can be overlapped. This shows that Ar can be fully transformed by overlapping, while Ac has to be buffered for later processing. As for block A, the column state information generated by B also has to be buffered. This process continues until the completion of the transforms of all the blocks in the first block row. By that time, the column state information has accumulated to the size B_s, exactly the same as that of the sequential strip DWT; refer to (2.44). It turns out that the state buffer size B_s will not increase beyond this point. This can be verified by checking the first block C in the second block row. For clarity of illustration, the second block row is drawn separately from the first block row in Figure 2.20. Actually, the two block rows are overlapped with part of the state information. That is, block C takes over the state information Ac to complete the transform.
When the transform stops, state information also appears on the right and lower boundaries of block C. However, since Ac has now been fully transformed and hence can be transferred out, Cc can be written into the locations of Ac without increasing the total state buffer size B_s.

Figure 2.20: Block sequential DWT system diagram. The input is segmented into blocks which are transformed from left to right and from top to bottom.

The most general case of the sequential block DWT is depicted for block D. Block D overlaps with previously generated state information in both the row and column directions, {Cr, Bc}. When it finishes its transform, it leaves {Dc, Dr} for later processing. The transform of block E in the last block row is the same as that of D, except that boundary extension can be used in the column direction.

To study the system efficiency, consider the problem of determining the buffer size such that a block of data of size N x N can be completely transformed. This is typical in a transform-based image coding application where images are coded on a block-by-block basis. Assume the buffer is of size N_B x N_B. In Table 2.7, N_B is given for J-level wavelet decompositions using different wavelet filterbanks and overlapping techniques. Take the Daubechies (9,7) filterbank as an example and assume the decomposition level is J = 3. If the block size is 32x32, then N = 32. Using SSWT, N_B = 32 + 49 = 81, which means a buffer of size 81x81 is needed to compute the DWT of a 32x32 data block. The efficiency η in this case is approximately 16%. Using RPA, N_B = 45 and the efficiency increases to 50%.
If we use the overlap-state technique, then N_B = 39 and the efficiency increases to 64%.

Table 2.7: Comparison of memory requirements, where α = (2^J − 1), β = (1 − 2^{−J}).

    Filter      SSWT[29]       RPA[6]          Proposed
    L-tap       N + α(L−2)     N + 2β(L−2)     < N + 2β(L−2)
    (9,7)       N + 7α         N + 14β         N + 8β
    (2,10)      N + 8α         N + 16β         N + 8β
    CDF(4,2)    N + 5α         N + 10β         N + 6β

2.5.4 Parallel Architectures

Block Parallel. As shown before, in the first phase, Split, each processor is allocated its portion of data and carries out the transform all the way to the required decomposition level J. Upon completion, the data configuration at each processor is shown in Fig. 2.21(a). The center part of each block is completely transformed while the boundaries are left with the partially transformed samples, i.e., the state information. The next stage, Merge, is to communicate the state information and complete the transform for the boundary samples. If the single-port model is used, then three communications are necessary to complete the transform: one for the row state, one for the column state, and one for the intersection of the row and column states. However, if the multi-port model is used, the row and column state information exchanges can be implemented simultaneously, thus saving one communication. This Merge process is shown in Figure 2.21 from (a) to (d) for the single-port model. If the multi-port model is used, (a) and (b) can be combined to simultaneously transmit/receive the row and column state information to/from neighboring processors. This is in contrast to the observation given in [15] that "The 2D DWT algorithms seem not able to effectively utilize more than a single communication port at a time"; our analysis shows that, using the multi-port model, the communication overhead can actually be reduced compared to that of a single-port model.
Figure 2.21: Merge operations in a 2D mesh processor network. (a) Transfer row state information from P_{i,j} to P_{i,j+1}; (b) transfer column state information from P_{i,j} to P_{i+1,j}; (c) transfer newly generated row state information from P_{i,j} to P_{i,j+1}; (d) complete the transform for boundary samples. Notice that the total amount of data in each processor in the final state (d) differs from the original uniform allocation due to the Merge operations.

Strip Parallel. In the first stage, Split, each processor is allocated its own strip and transforms up to the required level of decomposition J. Since no segmentation is done in the row direction, state information will obviously appear only along the upper and lower boundaries of each block. This is shown in Figure 2.22. In the next stage, Merge, only one communication is necessary to transfer/receive the column state information from neighboring processors.

Figure 2.22: Merge operations for the strip parallel implementation. (a) Transfer column state information from P_i to P_{i+1}; and (b) complete the transforms for boundary samples in each processor.

2.6 Experimental Results

In this section, experimental results are provided to show the computation reduction achieved by the Delayed Normalization technique in sequential lifting algorithms. Results are also given for the parallel DWT system using the Overlap-State technique. The wavelet filterbank used is the Daubechies (9,7) filterbank. The input image is of size 512x512.

2.6.1 Delayed Normalization

In this experiment, three DWT algorithms using the (9,7) filters are implemented.

1.
The recursive standard algorithm (see Table 2.1). The computation cost is 11.5 mults/adds per output point.

2. The lifting DWT algorithm. The computation cost is 7 mults/adds per output point.

3. The lifting DWT algorithm which delays the normalization until the last level of decomposition. The computation cost is approximately 6 mults/adds per output point.

In the experiment, 2D separable wavelet transforms are implemented. The algorithms are tested on a SUN ULTRA-1 workstation with a clock speed of 133MHz. The algorithm running CPU time is measured using the clock() function call in C. The average CPU times over 50 running instances of each algorithm are listed in Table 2.8. To compare the performances, the standard algorithm is used as the basis and the relative speedup is calculated as T_standard/T_new − 1.

Two observations can be drawn from the experimental results. One is that the lifting scheme coupled with delayed scaling can achieve about a 30% improvement over the standard algorithm for three or more levels of decomposition, while lifting alone gives only about a 20% improvement. Second, neither lifting algorithm achieves the performance gain predicted in Table 2.4.4. The second observation tells us that the number of multiplications/additions in an algorithm is not the only factor contributing to the total DWT running time. The algorithm speed may also be affected by how efficiently the code is written and by the memory usage. Obviously, this is a very important factor to consider when building a real DWT system, beyond the reduction of the number of multiplications and additions.

Table 2.8: DWT CPU time of different sequential algorithms (in seconds).
    Level   Standard   Lifting            Lifting with Delayed Normalization
            time       time     speedup   time     speedup
    1       0.2398     0.2088   15%       0.1980   21%
    2       0.2887     0.2466   17%       0.2294   26%
    3       0.2988     0.2527   18%       0.2303   30%
    4       0.3035     0.2598   17%       0.2368   28%
    5       0.3098     0.2601   19%       0.2375   30%

2.6.2 Strip Parallel

In this experiment, three different parallel DWT algorithms are implemented and tested against a sequential DWT algorithm.

1. The sequential lifting algorithm.

2. Each processor computes the DWT using the standard algorithm. Data exchanges between processors follow the direct overlapping technique, i.e., processors exchange data at each level of decomposition [11, 25].

3. Each processor computes the DWT using the fast lifting algorithm. Data exchanges between processors follow the direct overlapping technique, i.e., processors exchange data at each level of decomposition [11, 25].

4. Each processor computes the DWT using the fast lifting algorithm. Data exchanges between processors follow the proposed overlap-state technique.

The first issue in parallel system design is how to allocate data to different processors. In this experiment, the strip partition strategy [11] is adopted for its simplicity and its appropriateness for the parallel system used in the experiment. The 512x512 image is segmented into two strips of size 256x512, each of which is loaded onto one machine for the transform. The parallel platform is LAM 6.1 from the Ohio Supercomputer Center, which runs over Ethernet-connected SUN ULTRA-1 workstations. Two workstations are used to simulate a parallel system with two processors. The algorithm running time is measured using the MPI_Wtime() function call from the MPI libraries. The C-code algorithm is shown in Table 2.9.
The relative speedup is calculated against the sequential lifting algorithm as T_seq/T_para − 1. The algorithms are tested in 50 running instances and the average DWT running times for different decomposition levels are given in Table 2.10.

Table 2.9: The proposed parallel DWT algorithm.

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();
begin{ transform in processor p }
    for( j = 0; j < J; j++ ) {
        transform at current level j.
        store state information.
    }
    send state information to processor p + 1;
    receive state information from processor p - 1;
    for( j = 0; j < J; j++ ) {
        transform boundary data samples at current level j.
    }
end
MPI_Barrier(MPI_COMM_WORLD);
finish = MPI_Wtime();
cputime = finish - start;

It can be seen from the results that our proposed parallel algorithm can significantly reduce the DWT computation time, even compared with the fastest available parallel algorithm, the parallel lifting algorithm. Notice that the improvement is not linear with the increase of the decomposition level. The reason is that, though the communication overhead increases with the decomposition level, the total numerical computation also increases. Another interesting observation is that, even at one level of decomposition, the proposed algorithm still outperforms the parallel lifting algorithm. This is because, though both algorithms require one data exchange between processors, the amount of data exchanged is different. For the (9,7) filters, the proposed algorithm only needs to exchange approximately half the amount necessary in the parallel lifting algorithm.

2.7 Conclusions

In this chapter, an overlap-state technique is proposed for multilevel wavelet decompositions. The basic idea is to model the DWT as a finite state machine using the factorization of the polyphase matrix.
In this model, each raw input data sample (initial state) is progressively updated into a wavelet coefficient (final state) with the help of samples in its neighborhood. The benefit of such a DWT/FSM model is that the transform (or state transition) of each sample can be stopped at any intermediate stage, as long as the state information at the break point is preserved. Since the state information, rather than the raw data samples, needs to be stored or communicated, we have shown that this helps to reduce the buffer size in a sequential architecture and the communication overhead in a parallel architecture.

Table 2.10: DWT running time of different parallel algorithms (in seconds).

    Level   Sequential   Parallel Standard   Parallel Lifting   Parallel Proposed
                         time     speedup    time     speedup   time     speedup
    1       0.3638       0.3115   17%        0.2745   33%       0.2045   78%
    2       0.3649       0.3275   11%        0.2899   26%       0.2338   56%
    3       0.3952       0.3490   13%        0.2938   34%       0.2369   67%
    4       0.4028       0.3513   15%        0.3041   34%       0.2383   69%
    5       0.4041       0.3675   9%         0.3165   28%       0.2417   67%

Detailed analyses of the buffer size calculation for a given factorization and of the communication overhead reduction are also provided. To further reduce the computations, a delayed normalization technique for multilevel wavelet decompositions is also presented.

Using the overlap-state technique, new sequential and parallel DWT architectures are designed. Several system variations for the 2D separable DWT are provided and analyzed in detail, including strip sequential, block sequential, random sequential, block parallel and strip parallel DWT systems. The performance analyses and the experimental results have shown that the proposed sequential architecture requires less memory and runs faster than existing sequential algorithms.
The proposed parallel architecture reduces the interprocessor communication overhead by reducing both the number of communications and the amount of data exchanged. As a result, the DWT running time of the proposed parallel architecture is shorter than that of the best available parallel algorithm, the parallel lifting algorithm.

One important advantage of the traditional overlapping techniques, overlap-add and overlap-save, is that they are well suited for fast implementations using the FFT. Naturally, further research is needed to find fast DWT algorithms compatible with the proposed overlap-state technique. Achieving this would increase the chances of wide application of the wavelet transform.

Chapter 3

Constrained Transform Design For Multiple Description Coding

In Chapter 2 we studied efficient DWT architectures in a transform coding system. The problem there is simple in the sense that the transform is given and the only concern is the efficiency of the transform computation. In this chapter, we consider the problem of both designing and computing an unknown transform efficiently. Specifically, we study transform design and efficient implementation when the input data not only needs to be compressed but also needs to be delivered robustly to the receiver.^1

3.1 Introduction

Finding a good transform has long been a key issue for various transform coding system designs. Traditionally, a good transform is defined as one which can maximally decorrelate the input data, i.e., remove the redundancy and thus achieve maximum energy/data compaction. However, to achieve overall performance optimization for data compression and communication over an unreliable channel (e.g., a mobile wireless channel or a best-effort network), maximum decorrelation may not always be the best choice.
^1 Part of this chapter represents work published in [57, 58].

Consider the case when the communication channel is not perfect, i.e., the encoded bitstream may arrive at the receiver with errors (refer to Figure 1.1). Whenever this happens, the received bitstream will not be decoded correctly. As a result, the receiver will not be able to recover all the transform coefficients. Moreover, the receiver will not be able to estimate the lost coefficients from the received ones, since the transform has removed the correlation between output coefficients. If the lost coefficients have small variances, then the end-to-end reconstruction distortion is small. However, if components with large variances are lost, the distortion will be high. For communication over unreliable channels, such quality variation in the received signal can be very annoying.

Conventionally, channel coding is applied in such cases for error protection and recovery. In this chapter, however, we study an alternative approach to error recovery: redesigning the transform to introduce correlation between the transmitted coefficients, a technique of multiple description transform coding (MDTC) proposed recently by Wang et al. [59], Orchard et al. [60] and Goyal et al. [61, 62, 63]. A complete MDTC system is shown in Figure 3.1 [64]. The input data is first decorrelated by a transform T1, as in a conventional transform coding system (see Figure 1.1). The quantized coefficients are then correlated using another transform T2. After that, the recorrelated data is encoded into two different bitstreams, called descriptions in MDTC terminology, which are sent through different channels for transmission. If channel failures result in data loss, the receiver can now estimate lost coefficients from received ones, since there exists correlation between them.
As such, the end-to-end reconstruction distortion can be reduced compared to the case when no correlation exists.

Figure 3.1: A typical MDTC system.

Clearly, the coding efficiency of such a multiple description transform coding system will be lower than that of the system shown in Figure 1.1, due to the correlation introduced by T2; i.e., extra bits are needed to encode the correlated output of T2. The design goal of an MDTC system then focuses on the problem of searching for an optimal correlating transform T2 which can achieve error recovery at the minimum possible redundancy. For a pair of Gaussian random variables with two output channels, the optimal transform has been provided analytically by Goyal et al. [61], with one special case also reported by Orchard et al. [60]. They have also shown that non-orthogonal correlating transforms perform better than orthogonal correlating transforms in terms of the redundancy rate-distortion gain, though in this case the transform itself has to be invertible (mapping integers to integers). For MDTC systems with M inputs and M outputs, the optimal transform design and its performance analysis are still open problems, though near-optimal solutions for 3 and 4 channels have been given by Goyal et al. [61]. Orchard et al. [60] suggested a redundancy allocation strategy among pairs of input variables, but optimal pairing is not yet readily available. A numerical optimization algorithm was proposed by Goyal et al. [61] to design transforms for an arbitrary number of channels. However, exhaustive search through the whole space of non-orthogonal transforms is not only computationally intensive but also leads to implementation difficulties when using an arbitrarily structured non-orthogonal transform.
In this chapter, we first address the problem of correlating transform design, i.e., designing T2 (refer to Figure 3.1) to optimize the operational redundancy rate-distortion performance. The approach we propose is a two-stage transform design technique, separating the design into (i) structure design and (ii) magnitude design. The observation is that the error protection properties of an MDTC system can be fully characterized by the output correlation matrix, i.e., the correlation matrix of the coefficients generated by T2. Given the output correlation matrix, one can immediately see which descriptions are correlated (structure) and to what extent they are correlated (magnitude). While the magnitude information of the correlation matrix cannot, in general, be quantified for specific redundancy rate-distortion constraints, the structural information can sometimes be directly
As we will show, the adm issible transform is sim ply th e eigenm atrix of the output correlation m atrix. If th e structural inform ation is available, such as the band-diagonal structure in this case, the transform can often be pre-designed taking advantage of existing results in the area of decorrelating transform design. Thus the stru ctu ral inform ation of the output correlation m atrix can be used to find all adm issible transforms (eigenm atrices of the o u tp u t correlation m atrix). To meet the final redundancy rate-distortion constraints, an optim ization algorithm similar to th a t in [61] can then be applied to com plete the m agnitude design. The m ajor advantage of this two-stage design approach is th a t it can enforce a structure on the transform to be designed. For optim al redundancy rate-distortion performance, previously propsoed transform design techniques required search ing through all non-orthogonal transform s, i.e., startin g w ith an arbitrary non- orthogonal transform , as shown by Goyal et al. [61]. If, however, the transform structural inform ation can be derived from the available channel inform ation, the search space can be drastically reduced so that the com plexity of the optim ization algorithm is also significantly reduced. The enforced stru ctu re in the design phase also leads to structured optim al transform s, which can often be im plem ented via existing fast algorithm s as will be shown later. As the second part of this chapter, we address the problem of designing a single transform which can both decorrelate and recorrelate the input at the same time. As one can see from Figure 3.1, th e M DTC system configuration is itself redundant, 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
i.e., the input is first decorrelated by T1 and then recorrelated by T2. To reduce the system implementation complexity, we propose to replace both T1 and T2 with a single transform, the Karhunen-Loeve vector transform (KLVT). The proposed KLVT can take as input a group of vectors and generate decorrelated vectors while preserving the correlation between the vector components within each vector. Each of these vector components can be grouped to form one description and sent through different channels. The error recovery mechanism in case of channel failures is the same as that in an MDTC system (refer to Figure 3.1); however, the system can now be configured as simply as that in Figure 1.1. The idea of the KLVT is very similar to the vector transform (VT) proposed by Li et al. [65, 66, 67], which is used to preprocess the data for vector quantization. To maximize the coding efficiency, both the KLVT and the VT are required to maximally remove the inter-vector correlation. However, an optimal VT is also required to maximally preserve the intra-vector correlation for compression, while the KLVT can be designed to vary the intra-vector correlation based on channel statistics for lost data recovery.

The remainder of this chapter is organized as follows. In the next section, a brief review of MDTC is provided and the problem of optimal correlating transform design is defined. In Section 3.3, we propose a two-stage transform design approach based on parametric scaling-rotation transforms for the design of MDTC systems with M inputs and M outputs. Section 3.4 provides two correlating transform design examples, for equal rate channels and sequential protection channels, respectively. Simulation results for Gaussian vectors are also presented. In Section 3.5, the proposed Karhunen-Loeve vector transform (KLVT) is defined and we provide an analysis of its possible applications for MDTC system design.
Finally, we conclude our work in Section 3.6.

3.2 Problem Definition

Assume that for a source X, M descriptions {C_i, i = 1, 2, ..., M} are generated. Notation is introduced following that in [60].

Central Distortion D_c: The average reconstruction error when all M descriptions are used.

Side Distortion D_s: The average reconstruction error when only a subset of the M descriptions is used.

Redundancy ρ: The difference between the actual coding rate R and R* = R_X(D_c), the source rate-distortion function evaluated at D_c.

Redundancy Rate-Distortion Function ρ(D_s; D_c): The amount of extra bits of redundancy necessary to achieve a desired side distortion D_s at a given central distortion D_c.

For an input source vector X, the encoding procedure of the MDTC system shown in Figure 3.1, as described in [60, 61], is:

1. X is decorrelated by T1.

2. The transform coefficients generated by T1 are quantized with a uniform scalar quantizer.

3. The quantized vector is transformed with an invertible, discrete transform T2, which introduces correlation among its vector components.

4. The components of the transform output vector, {X_1, X_2}, are independently entropy coded.

5. The encoded bitstreams are separated from each other and sent over different channels.

Assume that the correlation information is delivered to the decoder correctly. The decoding procedure is:

1. Entropy decode all received bitstreams.

2. (a) If all descriptions are received, then the inverse transform T2^{-1} is applied. The inverse transformed data is then dequantized.

(b) If only a subset of the descriptions is received, the received data is dequantized first. The lost descriptions are then estimated from the available descriptions using the correlation information received from the encoder.
The reconstructed vector (including received and reconstructed descriptions) is inverse transformed by T2^{-1}.

3. The output from the previous stage is then inverse transformed by T1^{-1} to get the reconstruction X̂.

The final end-to-end reconstruction distortion D = E||X − X̂||^2 is the central distortion D_c for case 2(a) and the side distortion D_s for case 2(b). The multiple description coding problem can be formulated as:

Objective: Design the transform T2 to minimize the side distortion D_s as well as the central distortion D_c, subject to a redundancy constraint ρ.

Figure 3.2: Redundancy rate-distortion curves of an MDTC system, for orthogonal and non-orthogonal transforms, for a Gaussian vector with σ_1^2 = 1, σ_2^2 = 0.5.

For MDTC of pairs of independent Gaussian random variables, an important result is that non-orthogonal transforms are better than orthogonal transforms in terms of the redundancy rate-distortion gain [60, 61]. For a given central distortion, the non-orthogonal transform not only achieves a lower average side distortion using the same amount of extra bits, it can also extend the redundancy rate-distortion function into a region that the orthogonal transform cannot reach, as can be seen in Figure 3.2. (Notice that this curve does not depend on the total coding rate; details can be found in [60, 61].) However, non-orthogonal transforms, if used for recorrelation, pose three challenges for MDTC system design:
The numerical optimization algorithm by Goyal et al. [61] becomes computationally intensive as the dimensionality of the transform increases. For an M-channel MDTC system, the transform has to be of size M × M and each of its entries is a design parameter (complexity of O(M²)).

3. Implementation Complexity: The optimal transform can have arbitrary structure in practice, which makes fast implementation difficult.

In the next section, we start with an intuitive geometric explanation of why non-orthogonal transforms can perform better than orthogonal transforms. We then propose a structured non-orthogonal transform framework for optimal correlating transform design.

3.3 Proposed Design Approach

3.3.1 Geometric Interpretations

For two equal rate channels, the optimal pairing transform T_a given by Goyal and Kovacevic [61] is of the form

T_a = \begin{bmatrix} a & 1/(2a) \\ -a & 1/(2a) \end{bmatrix}.   (3.1)

It can be easily derived that

\begin{bmatrix} a & 1/(2a) \\ -a & 1/(2a) \end{bmatrix} = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} \begin{bmatrix} \sqrt{2}\,a & 0 \\ 0 & 1/(\sqrt{2}\,a) \end{bmatrix}.   (3.2)-(3.3)

As one can see, the optimal pairing transform is nothing but a concatenated scaling-rotation transform. Obviously, the rotation transform, orthogonal by itself, can introduce correlation for uncorrelated inputs. The use of the scaling transform further enhances its ability to do so. Let us write the optimal transform T_{a,θ} in parametric form as

T_{a,\theta} = \underbrace{\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}}_{\text{rotation}} \underbrace{\begin{bmatrix} a & 0 \\ 0 & 1/a \end{bmatrix}}_{\text{scaling}}   (3.4)

       = \begin{bmatrix} a\cos\theta & \sin\theta/a \\ -a\sin\theta & \cos\theta/a \end{bmatrix}.   (3.5)

Denote the original orthogonal basis vectors as {u₁, u₂} and the new basis vectors under the transform T_{a,θ} as {v₁, v₂}. One can write

\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = T_{a,\theta} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} a\cos\theta\, u_1 + (1/a)\sin\theta\, u_2 \\ -a\sin\theta\, u_1 + (1/a)\cos\theta\, u_2 \end{bmatrix}.   (3.6)

The inner product between v₁ and v₂ is

\langle v_1, v_2 \rangle = \tfrac{1}{2}\sin 2\theta\,(-a^2 + 1/a^2).   (3.7)
It is easy to see that the inner product of the new basis vectors is equal to zero (the new basis is still orthogonal) whenever the scaling factor a = 1. In this case, T_{a,θ} is equivalent to an orthogonal transform. Since an orthogonal transformation amounts to a plane or hyperplane rotation, it cannot introduce any correlation in the case when all eigenvalues of the input vector correlation matrix are equal. Geometrically, such a random vector does not have any directional preference among the eigenvectors, and the joint distribution will be circularly symmetric, or hyper-spherical in higher dimensions. However, if the scaling factor a is chosen to be either smaller or larger than 1, the inner product will become nonzero (assuming θ is nonzero) and the new basis will become correlated. In this case, even if the input vector components all have the same variances, they can still become correlated after such a scaling-rotation transform. This scaling-rotation structure partially explains why non-orthogonal transforms have greater flexibility to introduce correlation than orthogonal transforms.

In Figure 3.3, the data distribution plots are shown for a pair of independent Gaussian random variables under different transforms. It can be seen clearly that the scaling-rotation transform introduces stronger correlation than its orthogonal counterpart.

Figure 3.3: Distributions of two independent Gaussian variables with variances σ₁² = 1, σ₂² = 0.5. (a) original; (b) after 45° rotation, with correlation coefficient 0.6; (c) after scaling (a = 2); (d) after scaling and 45° rotation, with correlation coefficient 0.94. Clearly one can see that the data in (d) is more correlated than that in (b).
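As a quick numerical illustration of this effect, the following sketch (not from the thesis) evaluates the basis vectors of eq. (3.6) in pure Python and checks their inner product against the closed form of eq. (3.7). The values a = 2 and θ = 45° are chosen for illustration only.

```python
import math

def basis_vectors(a, theta):
    # Rows of T_{a,theta} applied to the standard basis, per eq. (3.6)
    c, s = math.cos(theta), math.sin(theta)
    v1 = (a * c, s / a)
    v2 = (-a * s, c / a)
    return v1, v2

def inner(u, v):
    return u[0] * v[0] + u[1] * v[1]

a, theta = 2.0, math.pi / 4
v1, v2 = basis_vectors(a, theta)
ip = inner(v1, v2)
# Closed form from eq. (3.7): (1/2) sin(2 theta) (-a^2 + 1/a^2)
closed = 0.5 * math.sin(2 * theta) * (-a * a + 1 / (a * a))
print(ip, closed)  # both -1.875: the new basis is no longer orthogonal

# With a = 1 the transform is a pure rotation and orthogonality is preserved
w1, w2 = basis_vectors(1.0, theta)
print(inner(w1, w2))  # 0.0
```

The sign of the inner product, and hence of the induced correlation, is controlled jointly by θ and by whether a is above or below 1, which matches the geometric argument above.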
3.3.2 Parametric Transform and Factorizations

The decomposition of the optimal pairing transform into a scaling-rotation framework is not accidental but reflects a general structure of non-orthogonal transforms. Indeed, any matrix can be factored into the product of an orthogonal matrix and an upper triangular matrix, the so-called QR decomposition in matrix theory [68]. Using our notation, we write this as T = RU, where R is an orthogonal transform and U is an upper triangular matrix. The upper triangular matrix U can be further decomposed as the product of a scaling matrix S and another upper triangular matrix L (with diagonal entries all 1). In other words, any transform can be written as a concatenation of three transforms: an upper triangular transform L, a scaling transform S, and a rotation transform R (orthogonal transform),

T = RSL.   (3.8)

Note that T can also be a non-square transform, which is the case when frame expansions are used for adding redundancy [63].

To reduce the design complexity, in this work we only study non-orthogonal square transforms with a scaling-rotation factorization, i.e., T_rs = RS. To use a non-orthogonal transform for MDTC, the first constraint is lossless implementation for efficient quantization [61]. To this end, we further require that the determinant of T_rs be 1, i.e., det(T_rs) = det(RS) = det(S) = 1. We now show that any such transform T_rs can be factored into lifting steps and thus can be implemented losslessly. We first present factorization results for the basic 2×2 scaling and rotation transforms (adapted from [46]):

\begin{bmatrix} a & 0 \\ 0 & 1/a \end{bmatrix} = \begin{bmatrix} 1 & a - a^2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -1/a & 1 \end{bmatrix} \begin{bmatrix} 1 & a - 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix},   (3.9)

\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} 1 & \frac{\cos\theta - 1}{\sin\theta} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \sin\theta & 1 \end{bmatrix} \begin{bmatrix} 1 & \frac{\cos\theta - 1}{\sin\theta} \\ 0 & 1 \end{bmatrix}.   (3.10)

These results show that the basic 2×2 rotation and scaling transforms can be implemented losslessly.
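The lossless (integer-to-integer) property can be seen directly in code. The sketch below — an illustration, not the thesis implementation — rounds each lifting step of the rotation factorization (3.10): the forward map then sends integers to integers, and undoing the steps in reverse order with the same rounded quantities recovers the input exactly.

```python
import math

def lift_rotate(x, y, theta):
    # Three lifting steps realizing the rotation factorization (3.10);
    # rounding after each step makes the map integer-to-integer.
    t = (math.cos(theta) - 1.0) / math.sin(theta)
    s = math.sin(theta)
    x = x + round(t * y)   # first upper shear
    y = y + round(s * x)   # lower shear
    x = x + round(t * y)   # second upper shear
    return x, y

def lift_rotate_inv(x, y, theta):
    # Reverse the steps with subtraction: each step depends only on the
    # other (unchanged) coordinate, so inversion is exact despite rounding.
    t = (math.cos(theta) - 1.0) / math.sin(theta)
    s = math.sin(theta)
    x = x - round(t * y)
    y = y - round(s * x)
    x = x - round(t * y)
    return x, y

theta = math.pi / 3
for pt in [(5, -3), (100, 42), (-7, 9)]:
    assert lift_rotate_inv(*lift_rotate(*pt, theta), theta) == pt
print("integer rotation round-trips exactly")
```

The same argument applies step-by-step to the scaling factorization (3.9), which is why the full T_rs admits a lossless implementation.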
For M×M rotation transforms, it is well known that any orthogonal M×M matrix Q_M can be factored into the product of M(M − 1)/2 orthogonal matrices, each of which is an M×M Givens rotation that only rotates two components at a time [68]. The factorization of a Givens rotation matrix is equivalent to that of the basic 2×2 rotation. Therefore, any M×M orthogonal transform can be implemented losslessly as an integer mapping.

Factorization of a scaling transform is also straightforward using the factorization result of the basic 2×2 scaling transform. One can show that any M×M scaling transform can be written as a product of M − 1 subscaling transforms, each of which only scales two components at a time. As an example, we show the decomposition of a 4×4 scaling transform:

S(A) = diag(a_0, a_1, a_2, a_3)
     = diag(a_0, 1/a_0, 1, 1) · diag(1, a_0 a_1, 1/(a_0 a_1), 1) · diag(1, 1, a_0 a_1 a_2, 1/(a_0 a_1 a_2)).

Note that the constraint det(S) = 1 is applied, i.e., \prod_{i=0}^{3} a_i = 1; therefore a_3 = 1/(a_0 a_1 a_2).

Figure 3.4: Lattice structure of an RS transform.

Thus we have shown that any T_rs can be factored into lifting steps and therefore can be implemented losslessly. The general implementation structure of T_rs is shown in Figure 3.4, which has a lattice structure similar to that of a biorthogonal filterbank. The perfect reconstruction property is guaranteed by such a transform framework. To derive the inverse transform, one only needs to change all the rotation angles to the opposite sign and multiply by the inverses of the scaling factors.
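As a sanity check on the pairwise decomposition above, this short sketch multiplies the three diagonal stages for some illustrative scaling factors (chosen here, not taken from the thesis) and recovers diag(a_0, a_1, a_2, a_3) with a_3 = 1/(a_0 a_1 a_2):

```python
# Verify the 4x4 scaling decomposition: each stage scales only two
# components, and the product recovers diag(a0, a1, a2, a3) with det = 1.
a0, a1, a2 = 2.0, 0.5, 4.0
a3 = 1.0 / (a0 * a1 * a2)                       # det(S) = 1 constraint

stage1 = [a0, 1 / a0, 1.0, 1.0]                 # scales components 0 and 1
stage2 = [1.0, a0 * a1, 1 / (a0 * a1), 1.0]     # scales components 1 and 2
stage3 = [1.0, 1.0, a0 * a1 * a2, 1 / (a0 * a1 * a2)]  # components 2 and 3

product = [s1 * s2 * s3 for s1, s2, s3 in zip(stage1, stage2, stage3)]
print(product)  # [2.0, 0.5, 4.0, 0.25] == [a0, a1, a2, a3]
```

Because each stage is itself a 2×2 scaling (padded with identity), each can be lifted via eq. (3.9), giving a lossless implementation of the whole S(A).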
Based on our analysis in the previous section, we propose to use the parametric form of T_rs for the design of MDTC. The parametric form of T_rs can be written as

T_rs(A, Θ) = R(Θ) S(A)   (3.11)

where A are the scaling factors and Θ are the Givens rotation angles. As analyzed before, the scaling transform changes the relative energy distribution of the input vector components while the rotation introduces correlation among vector components. The net effect of such an operation is that the orthogonal basis is transformed into a non-orthogonal one. As a result, more correlation can be introduced among vector components via decomposition onto a non-orthogonal basis rather than an orthogonal one.

3.3.3 Two Stage Transform Design

Although T_rs enjoys a structured lattice implementation, the number of design parameters of an M×M transform still increases quadratically (O(M²)) for an M-dimensional input vector. In this section we propose a two-stage design technique making use of the available channel information to further constrain T_rs and reduce both the design and implementation complexities.

Recall that the transform to be designed is a recorrelating transform (equivalent to T_2 as shown in Figure 3.1). Denote the uncorrelated input to T_rs as X and the correlated output as Y. Let R_X = diag{σ_i²}, i = 1, 2, ..., M, be the input correlation matrix and let R_Y = {r_ij}, i, j = 1, 2, ..., M, be the correlation matrix of the transform output Y. Then we have

R_Y = T_rs(A, Θ) R_X T_rs^t(A, Θ)   (3.12)
    = R(Θ) R_S R^t(Θ).   (3.13)

Since R_S = S(A) R_X S^t(A) = diag{a_i² σ_i², i = 1, 2, ..., M} is still a diagonal matrix, the rotation transform R(Θ) has to be the eigenmatrix of the output correlation matrix R_Y. If the output correlation matrix can be predesigned, then the
problem of correlating transform design can be formulated as the inverse of the matrix diagonalization problem. An orthogonal solution to the correlating transform is simply the inverse of the KL transform.

In general, we cannot pre-design the output correlation matrix subject to redundancy and distortion constraints. However, we can select its structural information for specific channel conditions or protection requirements. For example, equal rate channels require the output correlation matrix to have equal diagonal entries for Gaussian inputs. Thus admissible transforms have to be able to generate correlation matrices with equal diagonal entries. Based on such observations, we propose a two stage transform design approach, i.e., structure design and magnitude design.

The structure design finds admissible transforms (eigenmatrices of the output correlation matrix) for specific channels using the T_rs factorization framework. In Figure 3.5 we depict how the transform search space can be reduced gradually using available channel information. Start from the transform space which consists of all transforms, either orthogonal or non-orthogonal, with determinant one. We reduce the search space by enforcing that all the transforms must have a scaling-rotation factorization. This reduces the complexities of both the design and the implementation, as described before. From this scaling-rotation transform space, application specific constraints can be imposed to further reduce the space of admissible transforms (details in the next section).

After the structure of the transform is found, the magnitude design then searches for the optimal transform among the admissible transforms using the algorithm described in [61], where derivation details of the average side distortion D_s(A, Θ) and the redundancy bit rate ρ(A, Θ) are given. A different derivation of this optimization algorithm is given in Appendix C.
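The structural fact that the structure design relies on — that the rotation part of T_rs is an eigenmatrix of the output correlation R_Y, per eqs. (3.12)-(3.13) — can be verified numerically for M = 2. The parameter values below are arbitrary illustrative choices.

```python
import math

# With T_rs = R(theta) S(a) and uncorrelated input of variances s1^2, s2^2,
# R_Y = R(theta) R_S R(theta)^t where R_S = diag(a^2 s1^2, s2^2 / a^2),
# so the columns of R(theta) are eigenvectors of R_Y.
a, theta, s1, s2 = 2.0, math.pi / 6, 1.0, 0.5
c, s = math.cos(theta), math.sin(theta)
R = [[c, s], [-s, c]]                       # rotation part, per eq. (3.4)
d = [a * a * s1 * s1, s2 * s2 / (a * a)]    # diagonal of R_S

# R_Y = R diag(d) R^t, computed entrywise
RY = [[sum(R[i][k] * d[k] * R[j][k] for k in range(2)) for j in range(2)]
      for i in range(2)]

# Each column r_k of R satisfies R_Y r_k = d_k r_k
for k in range(2):
    col = [R[0][k], R[1][k]]
    out = [RY[0][0] * col[0] + RY[0][1] * col[1],
           RY[1][0] * col[0] + RY[1][1] * col[1]]
    assert all(abs(out[i] - d[k] * col[i]) < 1e-12 for i in range(2))
print("R(theta) diagonalizes R_Y with eigenvalues", d)
```

This is exactly the inverse-diagonalization view: fixing the eigenstructure of a desired R_Y fixes the rotation, leaving only the scaling magnitudes to be optimized.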
We perform a redundancy constrained transform design using a Lagrange multiplier λ. The cost function is J = D_s(A, Θ) + λρ(A, Θ). By varying λ, one can scan all the operational redundancy rate-distortion points (D_s, ρ).

Figure 3.5: Transform search space.

3.4 MDTC Design Examples

We give design examples for two important channels, equal rate channels and sequential protection channels, both of which can be characterized by the output correlation matrix R_Y. It turns out that for these two special channels, not only can we use fixed rotation transforms, but these fixed transforms also have fast algorithms. Using a fixed rotation transform, we can reduce the number of design parameters from (M − 1) (for A) + M(M − 1)/2 (for Θ) = (M² + M − 2)/2 to only M − 1. This makes the optimization converge faster and reduces the amount of information to be conveyed to the decoder. On the other hand, fast algorithms reduce both the encoding and decoding complexities.

3.4.1 Equal Rate Channels

The equal rate channels case requires that output descriptions have the same rates, which helps the buffer management (e.g., packetization/depacketization in a packet network) both at the encoder and the decoder. As stated before, equal rate should be interpreted in a statistical sense. For example, two Gaussian sources with the same variances will be viewed as equal rate sources if quantized with the same quantizer. That is to say, for Gaussian random variables, "equal rate" requires that the correlation matrix R_Y have equal diagonal entries. Here we provide a T_rs transform which generates equal rate descriptions for an arbitrary number of channels with M = 2^r. The equal rate transform we propose is a scaling-Hadamard transform,
T(A) = H S(A)   (3.14)

where H is the Hadamard transform. We need to show that, under this transform, for input X, R_Y has equal diagonal entries (equal variances), i.e., r_ii = r, i = 1, 2, ..., M, where r is a constant. Since the Hadamard transform H is real symmetric, the output correlation is R_Y = H R_S H. Denote H = {h_ij}, i, j = 1, 2, ..., M. We have h_ij² = 1, ∀(i, j). The output Y component variance r_ii can be written as

r_ii = \sum_{j=1}^{M} a_j^2 \sigma_j^2 h_{ij}^2 = \sum_{j=1}^{M} a_j^2 \sigma_j^2.

Thus we have shown that a scaling-Hadamard (SH) transform generates equal rate descriptions. However, the correlation coefficients between output vector components, i.e., r_ij, i ≠ j, have to be jointly designed with the scaling transform to meet the rate-distortion requirements via the optimization algorithm given in Appendix C. Nonetheless, the scaling-Hadamard structure guarantees equal rate output and it also reduces the number of design parameters from M² (an arbitrary non-orthogonal M×M transform) to M − 1 (the M − 1 scaling factors required in the scaling-Hadamard transform). We mention that this scaling-Hadamard transform reduces to the optimal equal rate transform given by Goyal and Kovacevic [61] when M = 2, which demonstrates that, at least for two-description coding, the RS transform does not compromise the optimality of the MDTC system. We also note that the cascaded structure given by Goyal et al. [61] is equivalent to a scaling-Hadamard transform for M = 4.
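The equal-variance property is easy to check numerically: with ±1 Hadamard entries, every diagonal entry of R_Y equals the same sum Σ_j a_j² σ_j², independent of the row. In the sketch below the standard deviations are those used in Section 3.4.3, while the scaling factors are arbitrary illustrative values.

```python
# Equal-rate check for the scaling-Hadamard transform T = H S(A), M = 4.
H = [[1,  1,  1,  1],
     [1, -1,  1, -1],
     [1,  1, -1, -1],
     [1, -1, -1,  1]]                      # Hadamard matrix, entries +/-1
sig = [1.0, 0.5, 0.3, 0.1]                 # input standard deviations
a = [0.9, 1.1, 1.2, 1.0 / (0.9 * 1.1 * 1.2)]  # illustrative scalings, det(S) = 1

M = 4
rs = [(a[j] * sig[j]) ** 2 for j in range(M)]  # diagonal of R_S
# r_ii = sum_j h_ij^2 a_j^2 sigma_j^2; h_ij^2 = 1 makes every row identical
diag = [sum(H[i][j] ** 2 * rs[j] for j in range(M)) for i in range(M)]
print(diag)  # four identical entries
```

Whatever scaling factors the magnitude design picks, the output variances stay equal, so the equal-rate structure never has to be re-enforced during optimization.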
In a lossy packet network, an exam ple scenario is th a t when each packet carries inform ation for the recov ery of its previous and next packets, a sim ilar technique to th a t used for audio transm ission in the Robust Audio Tool [30]. For a M DTC system , this indicates th a t the output correlation m atrix R y should be a tridiagonal m atrix in which 67 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. descriptions axe sequentially correlated. The R y assumes th e form R y = 1 p 0 0 ... 0 p 1 p 0 ... 0 0 p 1 p .. . 0 0 0 p 1 p 0 0 0 p 1 (3.15) From m atrix theory, we know th at the eigenmatrix for this type of sym m etric tridiagonal Toeplitz matrices is the Discrete Sine Transform (DST) [69]. So the transform we propose for sequential protection channels is the Scaling-DST trans form T (A ) = D S T S (A ) 3.4.3 Simulation Results for Gaussian Sources (3.16) In this experim ent, we com pare results of different configurations of the cor relating transform for a 4D Gaussian vector source with standard deviations {1,0.5,0.3,0.1} [60]. We com pare the side distortion when there is only one description lost with equal channel failure probabilities. The different transform s are (i) Rotation; (ii) Scaling-Rotation; (in) Scaling-Hadamard; and (iv) Scaling- DST. The Scaling-Hadamard configuration is equivalent to the cascaded structure given by Goyal and Kovacevic (Figure 3 in [61]) for a 4-D input vector. T he op tim ization is done via Powell’s direction set technique [70]. The initial scaling factors are all set to be 1 and the initial rotation angles are all set to be tt/4 for the Scaling-Rotation configuration. T he comparison of all four configurations is shown in Figure 3.6. Clearly, all the non-orthogonal transforms achieve better performances compared to the or thogonal transform (case (i)). 
This indicates that non-orthogonal transforms can also perform better than orthogonal transforms for MDTC systems of more than two channels.

Figure 3.6: Comparisons among different transforms.

We also observe performance degradations when we impose constraints on the rotation transform, i.e., with the scaling-Hadamard or scaling-DST transform, especially at higher redundancy bit rates. However, structured transforms simplify the design and implementation complexities. In this case, both the Hadamard transform and the DST can be implemented using their existing fast algorithms.

A drawback of using a non-orthogonal transform is that the complete MDTC system (refer to Figure 3.1) requires two transforms, T_1 for decorrelation (before quantization) and T_2 for recorrelation (after quantization), both at the encoder and the decoder. Obviously, this increases the system implementation complexity. If an orthogonal transform is used, then one can merge T_2 with T_1 into one transform to reduce the computation. In the next section, we present one such transform and show possible applications for MDTC system design.

3.5 Karhunen-Loeve Vector Transform

Let X be a vector random sequence of size N × 1 generated from a stationary and ergodic source with zero mean. Assume each vector in the sequence can be further decomposed into subvectors of size M × 1, i.e., X = {x_1, ..., x_L} where ML = N. Denote the correlation matrix of X as R and the correlation matrix between two subvectors, x_i and x_j, as R_{i,j}. Then we have the following identities.
R = E\{X X^t\} = \begin{bmatrix} R_{1,1} & \cdots & R_{1,L} \\ \vdots & \ddots & \vdots \\ R_{L,1} & \cdots & R_{L,L} \end{bmatrix}   (3.17)

where R_{i,j} = E\{x_i x_j^t\}, i, j = 1, 2, ..., L.   (3.18)

Definition: Any unitary transform T which can block diagonalize the autocorrelation matrix R is defined to be a KL vector transform (KLVT) for the vector random sequence X = {x_i, 1 ≤ i ≤ L}. Such a transform T produces an output Y = TX with correlation matrix

R_Y = T R T^H   (3.19)
    = diag(R_1, R_2, ..., R_L)   (3.20)

where R_i is the autocorrelation matrix of the transformed subvector y_i and Y = TX = {y_i, i = 1, 2, ..., L}.

The existence of such a transform for any arbitrary input is obvious, since one can easily see that the Karhunen-Loeve transform is a special case of the KLVT. Actually, the KLVT defines a transform set, and its properties follow those of the KLT.

KLVT Properties

Property 1: KLVT set. The KLVT defines a unitary transform set S = {T = BK} which includes all transforms of the form T = BK, where K is the KLT for X and B can be any arbitrary block unitary matrix such that the sizes of the subblocks in B agree with the sizes of the subblocks in R. So the cardinality of the set S is infinite. If B = diag(B_1, B_2, ..., B_L), where each B_i is unitary, and K R K^H = diag(Λ_1, Λ_2, ..., Λ_L), then

T R T^H = B K R K^H B^H = diag(B_1 Λ_1 B_1^H, B_2 Λ_2 B_2^H, ..., B_L Λ_L B_L^H)

is also a block diagonal matrix.

Property 2: Decorrelation. Assume T is any transform from the KLVT set S and Y = TX with Y = {y_i, 1 ≤ i ≤ L} and X = {x_i, 1 ≤ i ≤ L}. Then

E[y_i y_j^t] = 0 for i ≠ j, and E[y_i y_i^t] = R_i,   (3.21)

where R_i is diagonal if T is the KLT of X and arbitrary symmetric otherwise. This property is derived directly from the definition.

Property 3: Energy Compaction. Among all unitary transforms, KLVT transforms pack the maximum average energy into l < L subvectors of Y.
Here the energy of a vector y is defined to be its mean squared length E{|y|²}. If the vector size is 1, then its energy equals its variance (assuming zero mean), i.e.,

E\{|y|^2\} = E\{y^t y\} = \mathrm{Tr}(E\{y y^t\}) = \sum_{i=1}^{M} \lambda_i,   (3.22)

where M is the size of the subvector y and {λ_i, i = 1, 2, ..., M} are the eigenvalues of the autocorrelation matrix E{y y^t} (Tr(T) denotes the trace of matrix T). This property can be easily proved by the KLT's maximum energy compaction property. Since each KLVT transform is related to the KLT by a block unitary transform, which preserves the eigenvalues [68], the transformed vector preserves all the energies of the corresponding scalar components in the KLT case. Hence, if the scalar components are ordered in the KLT case, the accumulated vector energy will also be ordered in the same order after the block unitary transform.

To probe further into the KLVT, we provide one more view, from vector space partition. As one can see, the KLVT actually defines a transform set T which includes all the transforms that can block diagonalize the autocorrelation matrix R and hence will decorrelate the corresponding vector signal if applied to the original data. However, this non-uniqueness property of the KLVT should not be misinterpreted as non-uniqueness of the decorrelation vector space partition for a given signal (see Appendix D). Actually, the decorrelation vector space partition for a given vector signal is unique up to a permutation of the eigensubspaces corresponding to each of the eigenvalues. For a given autocorrelation matrix, the eigenvalues are uniquely defined. The difference among the transforms comes from the residual intra-vector correlation after the transform, which depends on the exact transform chosen. At one end is the KLT, which completely removes inter-vector as well as intra-vector correlation. There exists,
There exists, 72 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. however, an infinite num ber of transform s which can m axim ally remove the in ter vector correlation while keeping th e intra-vector correlation at some level. T his intra-correlation obviously can be used for error protection as th at discussed before in a MDTC system . Compared to th e system configuration shown in Figure 3.1 in which two transform s axe needed (Ti for decorrelating and T 2 for recorrelat- ing), a KLVT-based MDC fram ework thus can can reduce both the design and im plem entation complexity. One m ajo r concern, however, is how the redundancy rate-distortion perfor mance of a KLVT-based MDC system compares to th a t of systems using a KLT for decorrelation and an optim ized non-orthogonal transform for recorrelation. We mention th a t it has only been shown for uncorrelated Gaussian input vectors th at non-orthogonal transforms are superior over orthogonal transform s [61]. T h e overall analysis of a MDC system for correlated inputs, which is a m ore appropri ate model for m ost natural signals, rem ains to be an open problem. F urther work is necessary to b e tte r understand b o th theoretic and practical performance lim its of a KLVT-based MDC system and comparisons to its non-orthogonal transform based counterparts. 3.6 C o n c lu sio n s In this chapter, we studied the problem of constrained transform design for robust com m unication via m ultiple description transform coding. A two stage transform design approach is proposed for M D TC system design. Such an approach enables us to find stru ctu red transform solutions using available channel inform ation. This helps to reduce both the system design and im plem entation complexities. We also provided exam ple transform designs for equal rate channels and for sequential protection channels. 
To further reduce the MDTC system complexity, a Karhunen-Loeve vector transform is proposed and possibilities for its application in MDTC system designs are illustrated. One item of future work is to study the redundancy rate-distortion performance of the KLVT in an MDTC system.

Chapter 4

Multiple Description Coding for Erasure Channels

In the last chapter we introduced a correlating transform based coding system which can recover lost data using the correlation existing in the correctly received data. In this chapter, we take a different approach to lost data recovery, i.e., adding redundancy explicitly, rather than implicitly (as was done when using the correlating transform of Chapter 3), in the encoded data for loss recovery. Our main motivation is to achieve a simpler system design and implementation¹.

¹Work presented in this chapter was published in part in [71, 72].

4.1 Introduction

In recent years, a number of approaches for error control and protection for audio and video communication over packet networks have been reported in the literature [30, 73, 74, 75, 76, 77]. A majority of these works have chosen receiver-only strategies to avoid retransmission and to reduce the communication delay. These include various forward error correction (FEC) schemes and error concealment techniques that exploit the residual correlation in the encoded data. For example, in Jayant's subsample-interpolation scheme [73], odd and even samples of the input speech signal are sent in different packets. Thus, if a packet is lost, the missing samples can be interpolated using the neighboring samples
Unequal error protection schemes have been studied by, am ong others, Davis et al.[36], Danskin et al.[37] and Sherwood et al.[38], in which error correction codes are applied according to th e im portance of the data to provide different levels of protection. The robust audio tool (RAT) is proposed by H ardm an et al. for m ulticast teleconferencing [30]. In RAT, each packet carries explicitly a redundant version of the previous packet, which can be used for loss recovery. Sim ilar ideas have also been explored and extended to video coding by Bolot at al.[74, 75, 76] and Podolsky et al.[77]. To deal w ith the network het erogeneity issue in broadcast/m ulticast applications, a hierarchical/layered FEC scheme is proposed by Tan et al.[7S] and Chou et al.[79], in which end users can subscribe to different num ber of channel layers for error protection based on their own observed packet loss statistics. Recently, there has been renewed interest in using th e technique of m ultiple description coding (MDC) for error protection and control over the packet net work [80, 81, 82, 83, 73, 84, 85]. As formulated by El-G am al and Cover in the 1980s [80], the basic idea of MDC is to encode the input signed into m ultiple descriptions with the constraint th at any one of the descriptions can render an acceptable reconstruction of the signal. Furtherm ore, it requires th at the more the descriptions, the better the reconstruction quality. Assume each description is sent through one packet, then an acceptable signal reconstruction can be guar anteed as long as one packet is received correctly. This is reasonable for networks where QoS is not available and therefore there is no way to give more priority to certain packets. This assum ption is well m atched to the MDC philosophy, where all packets carry equally im portant information. Among recently proposed MDC techniques we cite the DPCM diversity system by Ingle et al. 
[86] for packet speech and the wavelet image MDC scheme by Servetto et al. [87], both of which use the multiple description scalar quantizer (MDSQ) designed by Vaishampayan [88, 89], as well as the multiple description perceptual audio coder (MDPAC) presented by Arean et al. [31], which uses the multiple description transform coding (MDTC) technique proposed by Wang et al. [59] and Orchard et al. [60], and further developed by Goyal et al. [63] (see also Chapter 3). However, these MDC techniques usually involve complex system designs and implementations, which may affect their effectiveness for real-time applications. For example, MDSQ requires a careful index assignment, which becomes difficult if more than two descriptions are necessary [88]. To build a complete transform coding system, the MDTC approach discussed in Chapter 3 necessitates another correlating transform besides the conventional decorrelating transform [60, 63].

In this chapter, we propose a new MDC scheme for packet loss recovery over packet networks. Different from previous MDC works, data loss recovery is achieved using redundancy explicitly carried by the encoded data, an idea inspired by the work of Hardman et al. [30] and Bolot et al. [75] on robust packet speech/audio over the Internet. In the proposed scheme, the input is first split into different components and each component is quantized separately at different resolutions. In this work, a polyphase transform is used for source splitting. Each polyphase component is coded independently at relatively high quality (i.e., finely quantized). Redundancy is then explicitly added to each description by coding the other polyphase components at a lower coding rate. In case of packet losses, i.e., when descriptions are missing, the lost data is recovered using the redundancy carried in the correctly received packets.
The approach of adding redundancy explicitly leads to a very simple system design and implementation. The use of the polyphase transform also enables us to incorporate context-adaptive coding techniques [90] to further improve the system coding efficiency. We will show that our proposed MDC system, although simple, can yield very competitive coding performance compared to previously published work. More recently (at the time of writing this chapter), similar work using explicit redundancy has been reported by Miguel et al., who increase the granularity of redundancy addition, i.e., vary the amount of redundancy with the importance of the data for unequal loss protection [91].

One may have noticed that MDC approaches can actually be combined with FEC schemes for better overall error protection. For example, one can send redundant packets following the data packets for loss recovery, where each packet can be protected using error-correcting codes. Indeed, such hybrid error control schemes have been reported recently, e.g., the MDC via FEC work by Puri et al. [92] and the generalized MDC scheme through unequal error loss
4.2 Erasure Channel Model and Problem Definition

The erasure channel model we study in this work is a simplified one. Each packet of transmitted data is assumed either to be completely lost in case of channel failure or received correctly when the channel is good. In terms of the seven-layer OSI network model, in essence, we do not consider bit errors introduced in the physical layer: we assume that these have been corrected, or that the whole packet has been discarded if the errors could not be corrected. Instead, we consider an alternative data recovery technique for the application layer, where the smallest data unit is a single packet of the information source. That is, the channel is a packet erasure channel. These erasures occur largely due to network congestion and link failures. To begin with, we make the following two assumptions:

• Channel failure events occur independently. In other words, if we are transmitting over two or more channels, they will each fail independently. Examples include mobile channels spaced beyond the coherence bandwidth, time slots separated by more than the coherence time, and packets routed through different paths in the network.

• The receiver knows which channel fails. That is, the receiver knows which description is lost. This can be achieved by inserting synchronization points in the bitstream or tagging packets with sequence numbers.

The independent channel failure assumption, though a simplification, can serve as a good approximation when the network traffic is sufficiently multiplexed from different sources and a random dropping policy is used to relieve congestion (for more details, see [75]).
Though the failure probability of a single channel may be large in some cases, the first assumption ensures that the probability of simultaneous multiple independent channel failures tends to be small. A multiple description coding system, taking advantage of this reasonable assumption, generates independent bitstreams (descriptions), each of which is sent over a different channel. This channel diversity thus greatly increases the probability that at least one description arrives successfully at the receiver. To recover the lost description(s), each description has to carry, in addition to the information about itself, extra information (redundancy) about the other descriptions.

Assume a total of M descriptions are generated for the source. Let us define

Central distortion D_c: the reconstruction error using all M descriptions.
Side distortion D_s: the reconstruction error using only k descriptions (k < M).

For a fixed bit budget, one would prefer descriptions independent of each other to minimize D_c. However, for loss recovery, one has to introduce redundancy among descriptions to reduce D_s. Our goal is then

Objective: For a given total bit rate R and a channel failure model, design a multiple description coding system to minimize the average central distortion D_c as well as the average side distortion D_s.

4.3 MDC Based On Explicit Redundancy

4.3.1 Base System

The proposed MDC system is shown in Figure 4.1. The two quantizers Q_1 and Q_2 are, respectively, the fine quantizer (high rate R_0) and the coarse quantizer (low rate ρ).

Figure 4.1: The proposed MDC system.

The input source X is first split into two subsources Y_1 and Y_2. In this case, a polyphase transform is applied, but other choices are also possible.
Note that for random (memoryless) sources the polyphase transform is not strictly necessary, but for correlated sources the splitting via the transform will make a difference. In a two-description coding system, Y_1 consists of all even-indexed samples and Y_2 consists of all odd-indexed samples. The two polyphase components, Y_1 and Y_2, are finely quantized using Q_1 and packed into the corresponding packets P_1 and P_2. For error protection and recovery, each packet also carries a coarsely quantized version of the other polyphase component. For example, P_1 consists of y_1, the finely quantized Y_1, and ȳ_2, the coarsely quantized Y_2. The other packet P_2 is obtained similarly, i.e., (y_2, ȳ_1). At the receiver, if only one packet is received, then one finely quantized polyphase component and one coarsely quantized polyphase component are used for reconstruction. For example, if only P_1 is received, y_1 and ȳ_2 are used for reconstruction. However, if both packets are received, then only the finely quantized data y_1 and y_2 are used in the signal reconstruction. In this sense, ȳ_1 and ȳ_2 are called redundant information, whose sole function is to provide packet loss protection.

4.3.2 Context-Adaptive Extension

Obviously, to improve bandwidth efficiency, one would like to use as few bits as possible to encode the redundant information. This motivates us to consider context-adaptive coding techniques. The basic idea of a context-based coding technique is to make use of knowledge of the neighborhood statistics of the data to be encoded [90]. Since the polyphase components Y_1 and Y_2 are generated from the same input X, strong correlation is expected to exist between Y_1 and Y_2. This is true for natural speech and image signals, and this correlation has been exploited to achieve compression gains in a number of coding standards (e.g., G.721 for speech and JPEG for images).
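The base system of Section 4.3.1 can be made concrete with a few lines of Python. This is a minimal sketch of ours, assuming simple uniform quantizers (the analysis later uses Lloyd-Max quantizers) and list inputs; the function names are illustrative, not part of any codec.

```python
def uniform_quantize(x, step):
    """Midtread uniform quantizer: returns the dequantized value."""
    return round(x / step) * step

def mdc_encode(x, fine_step=0.05, coarse_step=0.5):
    """Split x into even/odd polyphase components; each packet carries
    its own component finely quantized plus the other one coarsely quantized."""
    y1, y2 = x[0::2], x[1::2]                      # polyphase transform
    p1 = ([uniform_quantize(v, fine_step) for v in y1],
          [uniform_quantize(v, coarse_step) for v in y2])
    p2 = ([uniform_quantize(v, fine_step) for v in y2],
          [uniform_quantize(v, coarse_step) for v in y1])
    return p1, p2

def mdc_decode(p1, p2):
    """Reconstruct from whichever packets arrived (None = lost)."""
    if p1 is not None and p2 is not None:
        y1, y2 = p1[0], p2[0]                      # both received: fine data only
    elif p1 is not None:
        y1, y2 = p1[0], p1[1]                      # fine Y1, coarse Y2
    elif p2 is not None:
        y1, y2 = p2[1], p2[0]                      # coarse Y1, fine Y2
    else:
        raise ValueError("both descriptions lost")
    x = [0.0] * (len(y1) + len(y2))                # re-interleave the phases
    x[0::2], x[1::2] = y1, y2
    return x

x = [0.11, -0.42, 0.73, 0.25, -0.18, 0.91]
p1, p2 = mdc_encode(x)
central = mdc_decode(p1, p2)   # both packets arrive: fine quality
side = mdc_decode(p1, None)    # packet 2 lost: Y2 rebuilt from coarse redundancy
```

Note that the decoder prefers fine data whenever it is available, mirroring the rule above that the coarse copies ȳ_1, ȳ_2 are used only under loss.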
If the input X is subband data from a DCT or wavelet transform output, linear correlation is approximately removed in most cases. However, strong structural similarities still exist between polyphase components. One example is that large-magnitude coefficients tend to cluster together, and the same is true for smaller-magnitude coefficients. This feature has been exploited extensively, and proven very successful, in the context-adaptive codecs developed recently for image coding applications [90, 94]. To incorporate the idea of context-adaptive coding, the redundant information, ȳ_1 and ȳ_2 in our proposed MDC system, is quantized and coded conditioned on the dequantized data, y_1 and y_2. This is shown in Figure 4.2, where the coarse quantizer Q_2 also takes as an input the dequantized data from Q_1. As a result, more efficient quantization of the redundant information can be achieved. Note that, because we limit our context-based coding to operate within each packet, a packet loss does not affect the decoding of other packets.

Figure 4.2: Context-based MDC system.

4.4 Optimal Redundancy Bit Allocation for Gaussian Sources

We now give a performance analysis of our proposed MDC system for i.i.d. Gaussian random sources. We answer the following question: for a given total bit rate, what is the optimal bit allocation between the source coding (primary information) and the channel coding (redundant information) that minimizes both the central distortion and the side distortions?

4.4.1 Implicit Channel Modeling

Consider first the traditional MDC problem without explicit channel modeling. For the problem of two-description coding, the goal is to find the optimal operational regions for the 5-tuple (R_{s,1}, R_{s,2}, D_c, D_{s,1}, D_{s,2}), where D_{s,i} is the side distortion of description i at rate R_{s,i} [80].
We only study equal-rate MDC systems, in which all descriptions are coded with the same bit rate, i.e., R_{s,1} = R_{s,2}. Assume X is an i.i.d. zero-mean random source with pdf f(x) and variance σ². Under the high-resolution quantization assumption, its rate-distortion function can be approximated as D = h σ² 2^{-2R}, with h an integral constant defined as h = (1/(12σ²)) ( ∫ [f(x)]^{1/3} dx )³ [95]. For Gaussian sources, h = √3 π / 2. Obviously, applying the polyphase transform to a memoryless source does not change the distortion-rate function: any subsampling, scrambling, etc., leaves the statistics unchanged, since the data is random. Therefore, the polyphase components Y_1, Y_2 have the same distortion-rate function as X. Let the bit rate for the primary information be R_0 and the redundant bit rate be ρ; the side distortion (when one description is lost) and the effective bit rates are then

    D_{s,i} = (1/2) h σ² 2^{-2ρ} + (1/2) h σ² 2^{-2R_0}        (4.1)
    R_{s,i} = (1/2) (R_0 + ρ)                                   (4.2)
    i = 0, 1                                                    (4.3)

The corresponding central distortion achieved is

    D_c = h σ² 2^{-2R_0}                                        (4.4)

at a total bit rate R = R_0 + ρ. With this simple formulation, the optimal bit allocation between R_0 (primary information) and ρ (redundant information) becomes a constrained optimization problem. If only the side distortion counts (i.e., one description will be lost with very high probability), then the optimal redundancy rate is ρ = R/2, so that the primary information and the redundant information are coded at the same rate, each half of the total bit rate. This yields the minimum achievable side distortion D_s for our system, which, however, is obtained at the expense of the maximum possible central distortion D_c. If only the central distortion counts (i.e., both descriptions will arrive at the destination with very high probability), then the redundancy ρ should always be set to zero. Each channel then carries half of the total information; upon receiving data from both channels, one obtains the minimum possible central distortion for the given bit rate R, but the side distortion is at its maximum since there is no redundancy.
In between these two extreme cases, one has the freedom to trade side distortion against central distortion by choosing different redundancy bit allocations. Using a Lagrange multiplier λ, define the cost function J as

    J = D_c + λ D_s                                                                   (4.5)
      = h σ² 2^{-2(R-ρ)} + λ [ (1/2) h σ² 2^{-2ρ} + (1/2) h σ² 2^{-2(R-ρ)} ]          (4.6)

The optimal redundancy bit rate can be solved analytically by setting ∂J/∂ρ |_{ρ=ρ*} = 0, which leads to

    ρ* = R/2 + (1/4) log₂( λ / (2 + λ) )                                              (4.7)

This optimal redundancy ρ* has a very intuitive interpretation. Since λ/(2+λ) < 1, ρ* lies in the range [0, R/2] for a given total bit rate R (negative values are clipped to ρ* = 0). With this redundancy allocation, the achievable pair of central and side distortions (D_c, D_s) for a given total bit rate R is

    D_c = h σ² 2^{-R + (1/2) log₂(λ/(2+λ))}                                            (4.8)
    D_s = (1/2) h σ² 2^{-R - (1/2) log₂(λ/(2+λ))} + (1/2) h σ² 2^{-R + (1/2) log₂(λ/(2+λ))}   (4.9)

As a practical example, we consider MDC for a unit-variance, zero-mean, memoryless Gaussian source. In our simulation, we first generate a sequence of Gaussian i.i.d. samples. The even samples are then quantized by a Lloyd-Max quantizer at the source coding rate R_0 and the odd samples are quantized by the same Lloyd-Max quantizer at the redundancy coding rate ρ; this forms description 1. Description 2 is formed in the same way, except that the odd samples are quantized at rate R_0 and the even samples at rate ρ. The central distortion D_c is the MSE achieved at rate R_0 on the original source, and the side distortion D_s is the average of the MSEs achieved using only description 1 or only description 2.
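The closed form (4.7) can be checked numerically: a brute-force sweep of the Lagrangian cost J(ρ) over the redundancy rate lands on the same minimizer. A small sketch of ours, using h = √3π/2 for the Gaussian case and an assumed multiplier λ:

```python
import math

H = math.sqrt(3) * math.pi / 2  # high-rate constant h for a Gaussian source

def cost_J(rho, R, lam, var=1.0):
    """Lagrangian cost J = Dc + lam * Ds of (4.5)-(4.6), with R0 = R - rho."""
    R0 = R - rho
    Dc = H * var * 2 ** (-2 * R0)
    Ds = 0.5 * H * var * 2 ** (-2 * rho) + 0.5 * H * var * 2 ** (-2 * R0)
    return Dc + lam * Ds

R, lam = 6.0, 1.0
rho_star = R / 2 + 0.25 * math.log2(lam / (2 + lam))   # closed form (4.7)
grid = [i * R / 10000 for i in range(10001)]           # brute-force sweep of rho
rho_num = min(grid, key=lambda r: cost_J(r, R, lam))
```

For λ = 1 the sweep and the closed form agree to within the grid spacing, and ρ* indeed falls inside [0, R/2].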
We use fixed-length codes for the index coding, so the bit allocation is very simple, with the only constraint being a fixed total coding rate R = R_0 + ρ = constant. For example, if the total coding rate is 5 bps, the possible bit allocations are (R_0, ρ) = (5,0), (4,1), (3,2), at which we measure the central and side distortions.

The optimal lower bound of the achievable set of 5-tuples (R_{s,1}, R_{s,2}, D_{s,1}, D_{s,2}, D_c) for a unit-variance zero-mean Gaussian source has been given by Ozarow [82] as

    D_{s,1} ≥ 2^{-2R_{s,1}}                                      (4.10)
    D_{s,2} ≥ 2^{-2R_{s,2}}                                      (4.11)
    D_c ≥ 2^{-2(R_{s,1}+R_{s,2})} / ( 1 - (√Π - √Δ)² )           (4.12)

where Π = (1 - D_{s,1})(1 - D_{s,2}) and Δ = D_{s,1} D_{s,2} - 2^{-2(R_{s,1}+R_{s,2})}. The total bit rate is given by R = R_{s,1} + R_{s,2}. The asymptotic results using the optimal level-constrained MDSQ given by Vaishampayan et al. [89] are

    D_c ≈ h̄ 2^{-R(1+a)}                                          (4.13)
    D_s ≈ h̄ 2^{-R(1-a)}                                          (4.14)

where 0 < a < 1 and h̄ is a constant (its value is given in [89]).

Figure 4.3: Rate-distortion performance comparison (central vs. side distortion, in dB) for a Gaussian source N(0,1) at total rates R = 5, 6, 7, and 8 bps: (1) Ozarow: optimal bound. (2) MDSQ: optimal level-constrained results.
(3) Proposed: Lloyd-Max quantizer results with fixed-length codes. (4) Optimal: results using the rate-distortion function of the Gaussian source.

The comparisons are shown in Figure 4.3 for different bit rates. As one can see, the results of our proposed system using a Lloyd-Max quantizer are, on average, comparable to those achieved with an optimal level-constrained MDSQ, and can be better than MDSQ when the redundancy rate becomes very low, for relatively low total bit rates.

It would be interesting to determine whether we can improve the proposed system's performance by using quantizers more efficient than the Lloyd-Max quantizer. Moreover, it would also be useful to establish the best performance achievable within the proposed MDC framework. For example, for a Gaussian source, if we could design a quantizer that operates exactly on the rate-distortion function, could we approach Ozarow's MDC bounds? Using the optimal bit allocation derived before, the achievable central and side distortions would be

    D_c = σ² 2^{-R + (1/2) log₂(λ/(2+λ))}                                                  (4.15)
    D_s = (1/2) σ² 2^{-R - (1/2) log₂(λ/(2+λ))} + (1/2) σ² 2^{-R + (1/2) log₂(λ/(2+λ))}    (4.16)

These optimal results are plotted in the same figure (see Figure 4.3). As one can see, the performance gap narrows drastically and almost reaches Ozarow's lower bounds at lower redundancy rates. This indicates that, while the proposed system cannot achieve the lower bounds over the whole operational range, its performance can be greatly improved if we can design better quantizers. Moreover, at low redundancy rates MDSQ is not as efficient (even though it does not introduce explicit redundancy) as using two standard quantizers. We note that here the quantizer design is exactly the same as that for single description coding.
Therefore, we can make use of state-of-the-art results from single description coding to reach our multiple description coding goals. As a result, the system design and implementation complexity are expected to be reduced compared to specially designed MDC systems, such as the MDSQ [88] and MDTC [60, 63] systems. Our experimental results on MDC for image transmission will further illustrate this point.

4.4.2 Explicit Channel Modeling

Next, we include the channel model in the analysis. Assume that the two descriptions are sent over two different channels with independent channel failure probability p. There are four possible situations at the receiver; since the two single-loss cases are symmetric, three distinct outcomes remain: (a) both descriptions are received, which happens with probability (1-p)² and results in distortion D_c = h σ² 2^{-2R_0}; (b) one description is lost, which happens with probability 2p(1-p) and results in distortion D_s = (1/2) h σ² 2^{-2ρ} + (1/2) h σ² 2^{-2R_0}; and (c) both descriptions are lost, which happens with probability p² and results in distortion D_b = σ². The average distortion at the receiver for this channel model is

    D = (1-p)² D_c + 2p(1-p) D_s + p² D_b
      = (1-p)² h σ² 2^{-2R_0} + 2p(1-p) [ (1/2) h σ² 2^{-2ρ} + (1/2) h σ² 2^{-2R_0} ] + p² σ²

Since the reconstruction distortion in case (c) is not affected by the redundancy allocation, the optimal bit allocation only needs to minimize the first two terms of D. Note that the total rate is R = R_0 + ρ. Taking the derivative of D with respect to the redundancy rate ρ and solving ∂D/∂ρ |_{ρ=ρ*} = 0, we find that, to minimize the average distortion in the presence of channel failures, the optimal bit allocation is

    ρ* = R/2 + (1/4) log₂ p                                      (4.17)

One may notice that the optimal bit allocation ρ* can actually be computed directly using the result derived before for the case of implicit channel modeling.
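This agreement can be checked numerically: the channel probabilities imply the weighting λ = 2p(1-p)/(1-p)² = 2p/(1-p), for which λ/(2+λ) = p, so (4.7) and (4.17) coincide, and both match a brute-force sweep of the average distortion. A sketch of ours under an assumed loss probability:

```python
import math

H = math.sqrt(3) * math.pi / 2   # high-rate constant for a Gaussian source

def avg_distortion(rho, R, p):
    """D = (1-p)^2 Dc + 2p(1-p) Ds + p^2 Db, with sigma^2 = 1 and R0 = R - rho."""
    R0 = R - rho
    Dc = H * 2 ** (-2 * R0)
    Ds = 0.5 * H * 2 ** (-2 * rho) + 0.5 * H * 2 ** (-2 * R0)
    return (1 - p) ** 2 * Dc + 2 * p * (1 - p) * Ds + p ** 2 * 1.0

R, p = 8.0, 0.1
rho_explicit = R / 2 + 0.25 * math.log2(p)                # closed form (4.17)
lam = 2 * p / (1 - p)                                     # weight implied by the channel
rho_implicit = R / 2 + 0.25 * math.log2(lam / (2 + lam))  # (4.7) with that lambda
rho_sweep = min((i * R / 20000 for i in range(20001)),
                key=lambda r: avg_distortion(r, R, p))    # brute-force minimizer
```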
Notice that the average distortion D for the independent-loss channel can be written as

    D = C₁ (D_c + λ D_s) + p² D_b                                (4.18)

where C₁ = (1-p)² and λ = 2p(1-p)/(1-p)² = 2p/(1-p). So the optimization of D is the same as that of J in (4.5); that is, the optimal redundancy can be computed from (4.7) using this new λ. This indicates that the use of an explicit channel model only changes the weighting factor λ of the side distortion in the cost function J of (4.5). Thus, for an MDC system, once the encoding is completed, all possible decoding scenarios can be enumerated and the corresponding side distortions computed. The channel model then determines the probability of each decoding scenario, i.e., it provides a weighting function with which to compute the average distortion. By using an arbitrary multiplier λ, the optimization of the cost function J in (4.5) actually subsumes all cases using explicit channel models. In this sense, the proposed MDC system can also easily be extended to other, non-independent-loss channel models, e.g., a bursty erasure channel modeled with a Markov-chain loss process.

4.5 Experimental Results

4.5.1 An Image MDC Example

In this section, we present a wavelet image MDC example. As shown by our analysis, the more efficient the quantization scheme, the better the performance of our MDC system. Among the state-of-the-art wavelet coders [90, 96, 97, 98], we choose the Said-Pearlman wavelet coder due to its simplicity and to the fact that the code is available.²
Figure 4.4: An example configuration of two different polyphase transforms using a two-level wavelet decomposition on a 16x16 input matrix. In all the subbands, the samples marked 1 constitute one polyphase component and the remaining samples constitute the other polyphase component. (a) Plain polyphase transform. (b) Vector polyphase transform.

In our experiment, the input image is first wavelet transformed and its polyphase components are extracted; that is, the polyphase transform is applied to the wavelet coefficients, not to the original image data. Two different types of polyphase transforms are tested on the wavelet coefficients (see Figure 4.4). The first is the plain polyphase transform which, for two-description coding, consists simply in grouping all the even coefficients into one description and all the odd coefficients into the other. This is done for each row of each subband. The second is not a polyphase transform in the strict sense, but can be viewed as a generalized polyphase transform in vector form.

² The author would like to thank Dr. A. Said and Prof. W. A. Pearlman for providing the SPIHT image coder.
Each subband is first grouped into vectors of the same size, and these vector indices are used in the polyphase transform. For example, if every two successive wavelet coefficients are grouped into a vector of size 2, then there are 4 such vectors in each row of each subband at the first decomposition level (refer to Figure 4.4). Denote these 4 vectors as {v0, v1, v2, v3}. Polyphase component 1 will contain {v0, v2} and polyphase component 2 will contain {v1, v3}. The vector size increases across subbands as 2^j, j = 0, 1, ..., J-1. The motivation for introducing such a generalized polyphase transform is to preserve spatial structures across subbands, since these structures are exploited by the SPIHT coder to improve coding efficiency [98].

Let the two polyphase components be y1 and y2. Then (y1(R0), y2(ρ)) constitutes our first description and (y2(R0), y1(ρ)) the second. The Said-Pearlman wavelet coder [98] is used to quantize and entropy code the polyphase components; for example, y1(R0) means that y1 is coded at a bit rate R0 with the Said-Pearlman coder. If one description is lost, we reconstruct from the received data ((y1(R0), y2(ρ)) or (y2(R0), y1(ρ))), which gives the side distortion. The central distortion is obtained using (y1(R0), y2(R0)). The total coding rate is R0 + ρ. Since the Said-Pearlman coder makes use of the zerotree structure among subbands, the second type of polyphase transform generates slightly better coding results.

In Figure 4.5 we show MDC results for the Lena gray-level image (size 512x512) using the Said-Pearlman wavelet coder [98]. The results for two descriptions are plotted, and a comparison with a recent MDSQ-based MDC wavelet coder by Servetto et al. [87] is also given. For a fixed total coding rate, our MDC coder achieves better rate-distortion performance over the whole redundancy range.
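The vector polyphase grouping described above can be sketched as follows. This is our own illustration on a single subband row with vector size 2, not the SPIHT code itself:

```python
def vector_polyphase_split(row, vec_size):
    """Group a subband row into vectors of vec_size samples and alternate
    the vectors between the two polyphase components (v0, v2, ... vs v1, v3, ...)."""
    vectors = [row[i:i + vec_size] for i in range(0, len(row), vec_size)]
    comp1 = [s for v in vectors[0::2] for s in v]   # v0, v2, ...
    comp2 = [s for v in vectors[1::2] for s in v]   # v1, v3, ...
    return comp1, comp2

# One 8-coefficient row of a first-level subband, vectors of size 2:
row = [10, 11, 20, 21, 30, 31, 40, 41]
c1, c2 = vector_polyphase_split(row, 2)
# c1 -> [10, 11, 30, 31] (v0 and v2), c2 -> [20, 21, 40, 41] (v1 and v3)
```

Setting vec_size = 1 recovers the plain polyphase transform, so the same routine covers both configurations of Figure 4.4.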
In Figure 4.6 we show the reconstructed Lena images using only one channel's information, with different redundancy rates, at a total coding rate of 0.5 bps (the vector-form polyphase transform is used).

In the second experiment, we measure the average achieved PSNRs when there are independent packet losses. The input image is first wavelet transformed, and then a polyphase transform (the zerotree vector form) is applied to the wavelet coefficients.

Figure 4.5: Experimental results with the Lena gray-level image. (a) Two descriptions. Polyphase (1): plain polyphase transform. Polyphase (2): vector polyphase transform. (b) Performance with independent packet losses.

The downsampling factor is 16, so we have a total of 16 polyphase components. To protect against channel failures, the redundancy is carried in a sequential way, as in RAT [30]: packet 1 carries redundancy to protect packet 2, packet 2 carries redundancy to protect packet 3, and so on. Each polyphase component constitutes the primary part of one packet, which also carries redundancy to protect the next polyphase component in the sequence. For example, packet 0 carries two parts of information: (1) polyphase component 0 coded at rate R0 and (2) polyphase component 1 coded at rate ρ. We emphasize that R0 and ρ apply to all pixels, and therefore the total rate is indeed R0 + ρ. In this experiment, we fix the coding rates at R0 = 0.4 bps and ρ = 0.1 bps, so that the total coding rate is R = 0.5 bps. Since the total number of wavelet coefficients in each packet is the same, all 16 packets have the same size. We then measure the reconstruction error assuming independent packet losses.
For example, assume 4 packets are lost during transmission; we first generate the loss pattern independently with 4 erasures. Let one loss pattern be 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1, with 1s representing received packets and 0s representing lost packets. In this case, packet 4 can be reconstructed using the redundancy carried by packet 3; the same is true for packets 9 and 11, while packet 5 is lost without reconstruction. We tested 1000 loss patterns for each of the cases of 1/2/3/4 lost packets.

Figure 4.6: Reconstructed Lena images at total rate 0.5 bps using only one channel's information. (a) Original image; (b) redundancy bit rate 0.05 bps, PSNR 29.64 dB; (c) redundancy bit rate 0.10 bps, PSNR 31.91 dB; (d) redundancy bit rate 0.15 bps, PSNR 32.98 dB.

In Figure 4.5 we show the image MDC results at total rate 0.5 bps (0.1 bps redundancy rate). For different numbers of independent packet losses, the average PSNRs are plotted using star symbols, with the standard deviations shown as vertical bars. As one can see, the reconstructed PSNRs deviate from the mean values with an average standard deviation of about 3 dB. The quality variations are due to the different loss patterns, with consecutive losses leading to the worst reconstructions. For the case of only one packet loss, the average PSNR is 34.5 dB with a standard deviation of about 2 dB. This is due to the fact that we assume the loss of the first packet is not recovered in the experiment. If the first packet can also be recovered, using redundancy carried by the last packet, then the average PSNR becomes 34.99 dB with a standard deviation of 0.12 dB.
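The recovery rule of this sequential protection scheme (packet k carries redundancy for packet k+1, and the first packet has no protector) can be sketched as follows; the example reproduces the 16-packet loss pattern discussed above.

```python
def classify_packets(pattern):
    """For each packet, return 'ok' (received), 'recovered' (lost, but its
    coarse copy sits in the previous, received packet) or 'lost'.
    pattern[k] is 1 if the packet was received, 0 if it was erased."""
    status = []
    for k, received in enumerate(pattern):
        if received:
            status.append("ok")
        elif k > 0 and pattern[k - 1]:
            status.append("recovered")   # rebuilt from redundancy in packet k-1
        else:
            status.append("lost")        # protector missing (or first packet)
    return status

pattern = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]  # 1 = received
status = classify_packets(pattern)
# packets 4, 9 and 11 (1-indexed) are recovered; packet 5 stays lost
```

Consecutive erasures are exactly the failure case discussed in the text: the second packet of a lost pair has no surviving protector.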
The experimental results show that the system performance degrades gracefully as the packet loss rate increases. However, they also indicate that simple sequential packet protection does not perform well in cases of consecutive losses. Recently, Miguel et al. have shown that better error protection can be achieved by increasing the granularity of redundancy addition [91], incorporating the idea of unequal error protection: the amount of redundancy is varied according to the importance of the data, i.e., more redundancy is given to important data and less to unimportant data. For future improvement of our proposed MDC technique, the idea of unequal error protection should be further explored.

4.5.2 A Speech MDC Example

In this experiment, we show an example context-adaptive MDC system for robust speech coding over a lossy packet network, and compare the results with those of the RAT scheme [30, 75]. The speech material consists of two sentences recorded at 16 kHz with 16 bits per sample: a male speaker saying "A tamed squirrel makes a nice pet" and a female speaker saying "Draw every outer line first, then fill in the interior". Each 20 ms speech segment is sent in one packet, with 320 samples per packet. A 14-bit-per-sample PCM coder is used as the fine quantizer and a 2-bit ADPCM coder is used as the coarse quantizer in the simulation.³ The 14 bps PCM coder is obtained by removing the two least significant bits from the original speech. A schematic plot of the speech coding and packetization is shown in Figure 4.7. In the proposed scheme, the two polyphase components Y1 (all even samples) and Y2 (all odd samples) of the input vector X are independently quantized using a fine scalar quantizer Q1 and packed, respectively, into packets P1 and P2. To achieve loss protection, P1 also needs to carry a coarse version of polyphase component Y2.
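This coarse copy can be formed by interpolation-based residual coding: predict each odd-phase sample from the average of its two dequantized even-phase neighbours, and code only the small residual. A minimal sketch of ours with plain lists (the residual quantization, a 2-bit ADPCM coder in the actual system, is omitted here):

```python
def residual_redundancy(y1_deq, y2):
    """r2(n) = y2(n) - (y1(n) + y1(n+1)) / 2: odd-phase samples predicted
    from the average of their two dequantized even-phase neighbours."""
    return [y2[n] - (y1_deq[n] + y1_deq[n + 1]) / 2 for n in range(len(y2))]

def reconstruct_odd(y1_deq, r2_deq):
    """Receiver side: interpolate the even phase and add the decoded residual."""
    return [(y1_deq[n] + y1_deq[n + 1]) / 2 + r2_deq[n] for n in range(len(r2_deq))]

y1_deq = [0.0, 2.0, 4.0, 6.0]   # dequantized even samples x(0), x(2), x(4), x(6)
y2 = [1.1, 2.9, 5.2]            # odd samples x(1), x(3), x(5)
r2 = residual_redundancy(y1_deq, y2)   # residuals are small for smooth signals
y2_hat = reconstruct_odd(y1_deq, r2)
```

Because speech is locally smooth, the residuals have much smaller energy than the samples themselves, which is why a 2-bit coarse coder suffices for the redundant part.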
Rather than quantizing Y2 directly with a coarse quantizer Q2, we first predict Y2 from the dequantized data Ŷ1 and quantize only the prediction residue r2(n):

    r2(n) = y2(n) - ( ŷ1(n) + ŷ1(n+1) ) / 2                      (4.19)
    r1(n) = y1(n) - ( ŷ2(n-1) + ŷ2(n) ) / 2                      (4.20)

where r2(n) and r1(n) are the prediction residues of Y2 using Ŷ1 and of Y1 using Ŷ2, respectively. These prediction residues are then quantized using the coarse quantizer Q2. As one can see, a simple two-point average interpolation is used for prediction, though more sophisticated techniques could be applied as well.

Figure 4.7: The proposed and the RAT schemes for robust packet speech coding.

In the same figure, we also show the RAT scheme [30], which segments the input X into frames of N samples. Each frame is quantized, coded, and packed independently into one packet. Each packet also carries coarsely quantized data for the previous N samples for loss protection. Three different schemes are implemented and tested in the simulation, namely:

1. The subsample-interpolation method by Jayant [73]. The interpolation is the average of the two neighboring samples, as shown above. Each frame is 16 bps PCM coded.

2. The RAT scheme, in which each packet carries 2-bit ADPCM coded data for the previous packet for loss protection, and the main part is 14 bps PCM coded.

3. Every two frames are polyphase transformed and coded by the 14 bps PCM

³ In this experiment, to illustrate our ideas, a PCM coder is chosen to encode the primary information and an ADPCM coder is chosen to encode the redundancy. More advanced coders can certainly be used and more gains are expected, though more delicate context-adaptive techniques would be needed.
Further reproduction prohibited without permission. coder. The interpolation is the average of two neighboring samples as shown before. The prediction residues are th en coded using the 2-bit ADPCM. To m easure the reconstruction quality of speech signal, we use the Noise-to- Mask Ratio (NMR) which measures the relative energy of noise components above the signal’s audible m asking threshold [99]. T he NM R is defined as [99] (4.21) where M is the total num ber of frames, B is the num ber of Critical Bands (CB), Cb is the num ber of frequency components for CB b, and \D (i.k)\2 is the power spectrum of the noise at frequency bin k and fram e i. The ki, k^ are respectively the low and high frequency bin indices corresponding to CB b. We choose NM R rather than other criteria (e.g., M ean Opinion Score) as per formance m easurem ent for a number of reasons. One is th a t NMR is an objective measure based on hum an hearing system and it has been found to have a high degree of correlation with subjective tests [100]. Second, although the MOS re sult is subjectively a b etter indication of th e speech audio quality, most MOS results published are based on limited tests w ithin the authors’ research group or departm ent and it is difficult to compare. T h ird and the most im portant point is that our focus here is to see the relative perform ance variations between different algorithms and NM R is easy to compute. Table 4.1: “Squrriel” reconstruction NM R comparison (dB) Loss Prob J A Y R A T Proposed mean std mean std mean std 10% 3.36 7.20 4.86 3.37 0.36 7.02 15% 6.41 10.37 7.49 7.25 3.40 10.36 20% S.61 11.44 9.24 10.36 5.99 11.41 30% 12.34 19.96 12.68 19.65 10.67 19.98 In Table 4.1 we show the reconstruction NM R results for the Squirrel sentence under independent packet losses at different loss rates. Each loss rate is sim ulated 100 times and the results are taken as the ensemble averages. 
It can be seen that the proposed scheme achieves the lowest mean NMR at all loss rates. However, the standard deviation of the reconstruction NMR is larger compared to that of the RAT scheme. The reason is that when two consecutive packets are lost, the proposed scheme cannot recover if these two packets are both polyphase components from the same speech frame, whereas the RAT scheme can still recover one packet. The same observations can also be made from the NMR results for the Draw sentence given in Table 4.2 and Figure 4.8. One possible solution is to introduce coding with more than two descriptions, for example three or four, as long as the delay constraints are observed. Provided the number of consecutive packet losses is smaller than the number of descriptions, one can avoid the catastrophic case in which all descriptions are lost.

Table 4.2: "Draw" reconstruction NMR comparison (dB)

    Loss Prob   JAY  mean/std    RAT  mean/std    Proposed mean/std
    10%          5.07 /  0.43     5.42 /  2.71     0.76 /  7.03
    15%          8.01 /  9.79     7.95 /  4.30     2.77 /  5.01
    20%          9.11 /  9.45     8.89 /  4.43     4.28 /  5.05
    30%         12.11 / 11.28    11.10 /  6.84     7.92 /  8.59

4.6 Summary and Future Work

In this chapter, an MDC system using polyphase transform and selective quantization has been proposed. For i.i.d. Gaussian sources, we give a detailed analysis of the optimal bit allocation to achieve minimum average central distortion and side distortion for a fixed total coding rate. Our experimental results have shown that our system implementation, compared to previously proposed systems, is simple, yet the MDC results achieved for image coding are better, especially at lower redundancy rates.

One interesting question is whether MDC techniques are applicable to robust multicast applications for multimedia communications.
Our preliminary work has shown that MDC provides better end-to-end reconstruction signal quality compared to that of layered coding [101] when the network is heavily loaded and delay constraints become critical.

Figure 4.8: Reconstructed NMR distribution plot for the Draw sentence under different packet loss probabilities. RAT: o; Proposed: +; JAY: x.

We believe that the MDC system can also be designed and implemented in a way that provides a layered coding scheme to cope with the network heterogeneity issue. Each description can be sent through one multicast group, and end users can subscribe to a different number of groups based on their bandwidth availability and observed channel statistics. In other words, end users can choose to receive from one to M descriptions by subscribing to the corresponding multicast groups. By doing so, one can deal not only with the bandwidth heterogeneity issue (i.e., the goal of LC [101]) but also with the packet loss heterogeneity issue (i.e., the goal of the hierarchical and layered FEC schemes [78, 79]).

References

[1] C. Chrysafis and A. Ortega, "Line based, reduced memory, wavelet image compression," IEEE Trans. on Image Processing, vol. 9, no. 3, pp. 378-389, May 2000.
[2] P. Cosman, T. Frajka, and K. Zeger, "Image compression for memory constrained printers," in Proc. of ICIP, Oct. 1998.
[3] P. Cosman and K. Zeger, "Memory constrained wavelet-based image coding," IEEE Signal Processing Letters, vol. 5, no. 9, pp. 221-223, Sept. 1998.
[4] M. Misra and V. K. Prasanna, "Parallel computation of 2-D wavelet transforms," in Proc. of 11th Intl. Conf. on Pattern Recognition, 1992.
[5] O. Rioul and P. Duhamel, "Fast algorithms for discrete and continuous wavelet transforms," IEEE Trans. on Information Theory, vol. 38, no. 2, pp. 569-586, Mar. 1992.
[6] M. Vishwanath, "The recursive pyramid algorithm for the discrete wavelet transform," IEEE Trans. on Signal Proc., vol. 42, no. 3, pp. 673-676, Mar. 1994.
[7] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers," IEEE Trans. on Signal Proc., vol. 43, no. 3, pp. 759-771, Mar. 1995.
[8] R. E. Van Dyck, T. G. Marshall, M. Chin, and N. Moayeri, "Wavelet video coding with ladder structures and entropy-constrained quantization," IEEE Trans. on Circuits and Systems for Video Technology, vol. 6, no. 5, pp. 483-494, Oct. 1996.
[9] H. Sava, A. C. Downton, and A. F. Clark, "Parallel pipeline implementation of wavelet transform," Proc. Inst. Elect. Eng., Vision Image Process., vol. 144, no. 6, pp. 355-359, Dec. 1997.
[10] W. Jiang and A. Ortega, "Efficient DWT system architecture design using filterbank factorizations," in Proc. of ICIP, Oct. 1999, vol. 2, pp. 749-753.
[11] J. Fridman and E. S. Manolakos, "On the scalability of 2-D discrete wavelet transform algorithms," Multidimensional Systems and Signal Processing, no. 8, pp. 185-217, 1997.
[12] C. Chrysafis and A. Ortega, "Line based, reduced memory, wavelet image compression," in Proc. of Data Compress. Conf., 1998, pp. 398-407.
[13] M. A. Trenas, J. Lopez, and E. L. Zapata, "A memory system supporting the efficient SIMD computation of the two dimensional DWT," in Proc. of ICASSP, 1998, pp. 1521-1524.
[14] C. Chakrabarti and C. Mumford, "Efficient realizations of encoders and decoders based on the 2-D discrete wavelet transform," IEEE Transactions on VLSI Systems, 1998.
[15] J. Fridman and E. S. Manolakos, "Discrete wavelet transform: Data dependence analysis and synthesis of distributed memory and control array architectures," IEEE Trans. on Signal Processing, vol. 45, no. 5, pp. 1291-1308, May 1997.
[16] A. Ortega, W. Jiang, P. Fernandez, and C. Chrysafis, "Implementations of the discrete wavelet transform: complexity, memory, and parallelization issues," in Proc. of SPIE: Wavelet Applications in Signal and Image Processing VII, Oct. 1999, vol. 3813, pp. 386-400.
[17] W. Jiang and A. Ortega, "A parallel architecture for DWT based on lifting factorization," in Proc. of SPIE in Parallel and Distributed Methods for Image Processing III, Oct. 1999, vol. 3817, pp. 2-13.
[18] F. Marino, V. Piuri, and E. Swartzlander Jr., "A parallel implementation of the 2D discrete wavelet transform without interprocessor communication," IEEE Trans. Signal Proc., vol. 47, no. 11, pp. 3179-3184, Nov. 1999.
[19] Proc. of the IEEE: Special Issue on Wavelets, Number 4, 1996.
[20] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE Trans. on VLSI Systems, vol. 1, no. 2, pp. 191-202, June 1993.
[21] A. S. Lewis and G. Knowles, "VLSI architecture for 2D Daubechies wavelet transform without multipliers," Electronic Letters, vol. 27, no. 2, pp. 171-173, Jan. 1991.
[22] T. C. Denk and K. K. Parhi, "VLSI architectures for lattice structure based orthonormal wavelet transforms," IEEE Trans. on CAS, vol. 44, no. 2, pp. 129-132, Feb. 1997.
[23] L. Yang and M. Misra, "Coarse-grained parallel algorithms for multi-dimensional wavelet transforms," The Journal of Supercomputing, vol. 12, no. 1/2, pp. 99-118, 1998.
[24] J. Bowers, L. Keith, N. Aranki, and R. Tawel, "IMAS integrated controller electronics," Tech. Rep., Jet Propulsion Laboratory, Pasadena, CA, USA, Jan. 1998.
[25] O. M. Nielsen and M. Hegland, "TRCS9721: A scalable parallel 2D wavelet transform algorithm," Tech. Rep., The Australian National University, Dec. 1997.
[26] L. Yang and M. Misra, "Parallel wavelet transforms for image processing," in ASAE Workshop on Application of Parallel Processing for Image Processing, Aug. 1997.
[27] T. E. Anderson, D. E. Culler, D. A. Patterson, and the NOW team, "A case for NOW (Networks of Workstations)," IEEE Micro, vol. 15, no. 1, pp. 54-64, Feb. 1995.
[28] LAM Team UND, "LAM/MPI parallel computing," last modified 24-Mar-2000.
[29] F. Kossentini, "Spatially segmented wavelet transform," Tech. Rep., UBC, 1998, ISO/IEC JTC1/SC29/WG1 WG1N868.
[30] V. Hardman, M. A. Sasse, M. Handley, and A. Watson, "Reliable audio for use over the Internet," in Proc. INET, 1995.
[31] R. Arean, J. Kovacevic, and V. K. Goyal, "Multiple description perceptual audio coding with correlating transforms," IEEE Trans. on Speech and Audio Processing, 2000.
[32] W.-T. Tan and A. Zakhor, "Multicast transmission of scalable video using receiver-driven hierarchical FEC," in Proc. Packet Video Workshop, New York, Apr. 1999.
[33] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, "RSVP: A new resource ReSerVation Protocol," IEEE Network Mag., vol. 7, no. 5, pp. 8-18, Sept. 1993.
[34] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice Hall, 1983.
[35] A. S. Tanenbaum, Computer Networks, Prentice Hall, 1996.
[36] G. Davis and J. Danskin, "Joint source and channel coding for internet image transmission," in Proc. of SPIE Conf. on Wavelet Applications of Digital Image Processing XIX, Aug. 1996.
[37] J. M. Danskin, G. M. Davis, and X. Song, "Fast lossy internet image transmission," in Proc. of 1995 ACM Multimedia Conference, San Francisco, CA, 1995.
[38] P. G. Sherwood and K. Zeger, "Joint source and channel coding for internet image transmission," in Proc. of DCC '97, 1997.
[39] M. R. Karim, "Packetizing voice for mobile radio," IEEE Trans. on Communications, vol. 42, no. 2/3/4, pp. 377-385, Feb./Mar./Apr. 1994.
[40] C.-Y. Hsu, A. Ortega, and M. Khansari, "Rate control for robust video transmission over burst-error wireless channels," IEEE Journal on Selected Areas in Communications, Special Issue on Multimedia Network Radios, vol. 17, no. 5, pp. 756-773, May 1999.
[41] W. Jiang and A. Ortega, "Lifting-based discrete wavelet transform system architecture design," IEEE Trans. on Circuits and Systems for Video Technology, Mar. 2000, accepted for publication.
[42] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 11, no. 7, pp. 674-693, 1989.
[43] P. P. Vaidyanathan and P.-Q. Hoang, "Lattice structures for optimal design and robust implementation of two-channel perfect-reconstruction QMF banks," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, no. 1, pp. 81-94, Jan. 1988.
[44] P. P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase networks, and applications: A tutorial," Proceedings of The IEEE, vol. 78, no. 1, pp. 56-93, Jan. 1990.
[45] T. G. Marshall, "Zero-phase filter bank and wavelet coder matrices: Properties, triangular decompositions, and a fast algorithm," Multidimensional Systems and Signal Processing, vol. 8, pp. 71-88, 1997.
[46] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247-269, 1998.
[47] R. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley Publishing Company, 1985.
[48] "Report on core experiment codeff1 'Complexity reduction of SSWT'," Tech. Rep., Motorola Australia, UBC, 1998, ISO/IEC JTC1/SC29/WG1 WG1N881.
[49] C. Chrysafis, "Low memory line-based wavelet transform using lifting scheme," Tech. Rep., HP Labs, 1998, ISO/IEC JTC1/SC29/WG1 WG1N978.
[50] P. Onno, "Low memory line-based wavelet transform using lifting scheme," Tech. Rep., Canon Research Center France, 1998, ISO/IEC JTC1/SC29/WG1 WG1N1013.
[51] G. Lafruit, L. Nachtergaele, J. Bormans, M. Engels, and I. Bolsens, "Optimal memory organization for scalable texture codecs in MPEG-4," IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 2, pp. 218-243, Mar. 1999.
[52] D. Taubman, "Adaptive non-separable lifting transforms for image compression," in Proc. ICIP, Kobe, Japan, Oct. 1999.
[53] G. Beylkin, R. Coifman, and V. Rokhlin, "Fast wavelet transforms and numerical algorithms I," CPAM, vol. 44, pp. 141-183, 1991.
[54] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall PTR, Englewood Cliffs, New Jersey, 1995.
[55] W. Sweldens, "The lifting scheme: A new philosophy in biorthogonal wavelet constructions," in Wavelet Applications in Signal and Image Processing III, A. F. Laine and M. Unser, Eds., 1995, pp. 68-79, Proc. SPIE 2569.
[56] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley Publishing Company, Reading, Massachusetts, 1979.
[57] W. Jiang and A. Ortega, "Multiple description coding via scaling rotation transform," in Proc. of Intl. Conf. on Acoustics, Speech and Signal Processing, Phoenix, AZ, Mar. 1999.
[58] W. Jiang and A. Ortega, "Karhunen-Loeve vector transform KLVT for vector quantization," Tech. Rep., Signal and Image Processing Institute, University of Southern California, 1996.
[59] Y. Wang, M. Orchard, and A. R. Reibman, "Multiple description image coding for noisy channels by pairing transform coefficients," in Proc. IEEE 1997 First Workshop on Multimedia Signal Processing, 1997.
[60] M. T. Orchard, Y. Wang, V. Vaishampayan, and A. R. Reibman, "Redundancy rate-distortion analysis of multiple description coding using pairwise correlating transforms," in ICIP '97, 1997.
[61] V. K. Goyal and J. Kovacevic, "Optimal multiple description transform coding of Gaussian vectors," in Proc. of IEEE Data Compression Conference, 1998.
[62] V. K. Goyal, J. Kovacevic, and M. Vetterli, "Multiple description transform coding: Robustness to erasures using tight frame expansions," in Proc. of IEEE Int. Symp. Info. Th., Aug. 1998.
[63] V. K. Goyal, J. Kovacevic, R. Arean, and M. Vetterli, "Multiple description transform coding of images," in ICIP '98, 1998.
[64] J. Kovacevic, "Multiple descriptions as joint source channel codes," 1998.
[65] W. Li, "Vector transform and image coding," IEEE Trans. on CAS for VT, vol. 1, no. 4, pp. 297-307, Dec. 1991.
[66] X. Xia and B. W. Suter, "On vector KL transform," IEEE Trans. on CAS for VT, vol. 5, no. 4, pp. 372-374, Aug. 1995.
[67] W. Ding, "Optimal vector transform for vector quantization," IEEE Signal Processing Letters, vol. 1, no. 7, pp. 110-113, July 1994.
[68] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
[69] Anil K. Jain, Fundamentals of Digital Image Processing, Prentice Hall Information and System Sciences Series, Prentice Hall, Englewood Cliffs, NJ, 1989.
[70] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, Cambridge University Press, 2nd edition, 1992.
[71] W. Jiang and A. Ortega, "Multiple description coding via polyphase transform and selective quantization," in Proc. of Visual Communication and Image Processing, San Jose, CA, Jan. 1999.
[72] W. Jiang and A. Ortega, "Multiple description speech coding for robust communication over lossy packet networks," in Proc. of Intl. Conf. on Multimedia Engineering, New York, NY, July 2000.
[73] N. S. Jayant and S. W. Christensen, "Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure," IEEE Trans. Communications, vol. COM-29, no. 2, pp. 101-109, Feb. 1981.
[74] J.-C. Bolot and A. V. Garcia, "Control mechanisms for packet audio in the Internet," in Proc. IEEE INFOCOM '96, 1996, pp. 232-239.
[75] J.-C. Bolot and A. V. Garcia, "The case for FEC-based error control for packet audio in the internet," to appear in ACM Multimedia Systems, 1997.
[76] J.-C. Bolot, S. Fosse-Parisis, and D. Towsley, "Adaptive FEC-based error control for Internet telephony," in Proc. IEEE INFOCOM '99, 1999, vol. 3, pp. 1453-1460.
[77] M. Podolsky, C. Romer, and S. McCanne, "Simulation of FEC-based error control for packet audio on the internet," in INFOCOM '98, San Francisco, CA, Mar. 1998.
[78] W. Tan and A. Zakhor, "Real-time internet video using error resilient scalable compression and TCP-friendly transport protocol," IEEE Trans. on Multimedia, vol. 1, no. 2, pp. 172-186, June 1999.
[79] P. A. Chou, A. E. Mohr, S. Mehrotra, and A. Wang, "FEC and pseudo-ARQ for receiver-driven layered multicast of audio and video," in Proc. of The Data Compression Conf., Mar. 2000.
[80] A. A. El-Gamal and T. M. Cover, "Achievable rates for multiple descriptions," IEEE Trans. Information Theory, vol. IT-28, no. 6, pp. 851-857, Nov. 1982.
[81] J. K. Wolf, A. D. Wyner, and J. Ziv, "Source coding for multiple descriptions," The Bell System Technical Journal, vol. 59, no. 8, pp. 1417-1426, Oct. 1980.
[82] L. Ozarow, "On a source-coding problem with two channels and three receivers," The Bell System Technical Journal, vol. 59, no. 10, pp. 1909-1921, Dec. 1980.
[83] N. S. Jayant, "Subsampling of a DPCM speech channel to provide two 'self-contained' half-rate channels," The Bell System Technical Journal, vol. 60, no. 4, pp. 501-509, 1981.
[84] Z. Zhang and T. Berger, "New results in binary multiple descriptions," IEEE Trans. Inform. Theory, vol. IT-33, pp. 502-521, 1987.
[85] Z. Zhang and T. Berger, "Multiple description source coding with no excess marginal rate," IEEE Trans. Inform. Theory, vol. 41, no. 2, pp. 349-357, 1995.
[86] A. Ingle and V. A. Vaishampayan, "DPCM system design for diversity systems with applications to packetized speech," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 48-58, 1995.
[87] S. D. Servetto, K. Ramchandran, V. Vaishampayan, and K. Nahrstedt, "Multiple description wavelet based image coding," in ICIP '98, 1998.
[88] V. A. Vaishampayan, "Design of multiple description scalar quantizers," IEEE Trans. Information Theory, vol. 39, no. 3, pp. 821-834, 1993.
[89] V. A. Vaishampayan and J.-C. Batllo, "Asymptotic analysis of multiple description quantizers," IEEE Trans. on Information Theory, vol. 44, no. 1, pp. 278-284, Jan. 1998.
[90] C. Chrysafis and A. Ortega, "Efficient context-based entropy coding for lossy wavelet image compression," in Proc. of DCC '97, Snowbird, UT, Mar. 1997.
[91] A. C. Miguel, A. E. Mohr, and E. A. Riskin, "SPIHT for generalized multiple description coding," in Proc. of ICIP, 1999.
[92] R. Puri, K. Ramchandran, and I. Kozintsev, "Multiple description source coding using forward error correction (FEC) codes," 1999.
[93] A. E. Mohr, E. A. Riskin, and R. E. Ladner, "Generalized multiple description coding through unequal loss protection," in Proc. of ICIP, 1999.
[94] A. Said and W. A. Pearlman, "Low-complexity waveform coding via alphabet and sample-set partitioning," in Proc. of VCIP '97, San Jose, CA, Jan. 1997.
[95] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Prentice Hall Information and System Sciences Series, Prentice Hall, Englewood Cliffs, NJ, 1991.
[96] S. M. LoPresto, K. Ramchandran, and M. T. Orchard, "Wavelet image coding using rate-distortion optimized backward adaptive classification," in Proc. of VCIP '97, San Jose, CA, Jan. 1997.
[97] X. Wang, G. Wu, and X. Lin, "A zerotree based multirate wavelet image coding method," Acta Electronica Sinica, vol. 25, no. 4, pp. 48-51, Apr. 1997.
[98] A. Said and W. A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 4, pp. 243-250, June 1996.
[99] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, "Speech enhancement based on audio noise suppression," IEEE Trans. on Speech and Audio Processing, pp. 497-514, 1997.
[100] W. Jiang and H. Malvar, "Adaptive speech noise removal," Tech. Rep., Microsoft Research, Aug. 1999.
[101] S. R. McCanne, "Scalable compression and transmission of internet multicast video," Tech. Rep. Ph.D. Thesis, UCB/CSD-96-928, University of California, Berkeley, 1996.

Appendix A

Overlap-Add and Overlap-Save

Standard techniques for linear FIR filtering over long data sequences include overlap-save and overlap-add, both of which are block-based approaches [47]. The input sequence is segmented into blocks, each of which is filtered in the frequency domain using the DFT and IDFT. The outputs from each block are concatenated to form the final result, which is identical to the sequence obtained if the whole input sequence had been processed in the time domain.

Let x(n) be the input sequence. Denote the block length as M and the filter length as L.

Overlap-Save

In this method, each block consists of: (i) (L - 1) samples from the previous block; and (ii) M new samples from the sequence. The data blocks are

    b1(n) = {0, 0, ..., 0 [(L - 1) zeros], x(0), x(1), ..., x(M - 1)}
    b2(n) = {x(M - L + 1), ..., x(M - 1) [(L - 1) overlapped], x(M), ..., x(2M - 1)}
    b3(n) = {x(2M - L + 1), ..., x(2M - 1), x(2M), ..., x(3M - 1)}

and so on. As one can see, the actual block size needed is N = M + L - 1, since (L - 1) samples have to be overlapped from the previous block. The filtering output is y_i(n) = IDFT(DFT(b_i(n)) . DFT(h)). The first (L - 1) samples of each block output are discarded due to aliasing, and the remaining M samples constitute the desired result of the linear convolution.

Overlap-Add

In this method, each block consists of M nonoverlapping samples from the input sequence, padded with (L - 1) zeros. The data blocks are

    b1(n) = {x(0), x(1), ..., x(M - 1), 0, 0, ..., 0 [(L - 1) zeros]}
    b2(n) = {x(M), ..., x(2M - 1), 0, 0, ..., 0 [(L - 1) zeros]}
    b3(n) = {x(2M), ..., x(3M - 1), 0, 0, ..., 0}

and so on. As one can see, though the input blocks are nonoverlapping, the actual block size is still N = M + L - 1 because of the (L - 1) padded zero samples. The filtering output y_i(n) = IDFT(DFT(b_i(n)) . DFT(h)) is free of aliasing. The last (L - 1) points of each block output, however, must be overlapped and added to the first (L - 1) points of the succeeding block to form the final result. That is,

    y(n) = {y0(0), y0(1), ..., y0(M - 1), y0(M) + y1(0), y0(M + 1) + y1(1), ..., y0(N - 1) + y1(L - 2), y1(L - 1), ...}

Figure A.1: (a) Overlap-save. (b) Overlap-add.

Appendix B

DWT FSM Examples

Daubechies (9,7) filters

This filterbank has been used extensively in the image compression algorithms proposed in the literature. The factorization of the analysis polyphase matrix (adapted from [46]) is

    P_a(z) = [ ζ  0 ; 0  1/ζ ] [ 1  δ(1 + z^-1) ; 0  1 ] [ 1  0 ; γ(1 + z)  1 ] [ 1  β(1 + z^-1) ; 0  1 ] [ 1  0 ; α(1 + z)  1 ]

where α = -1.586134342, β = -0.05298011854, γ = 0.8829110762, δ = 0.4435068522, and ζ = 1.149604398. Based on this factorization, the forward transform is

    x_0^0(n) = x(2n)
    x_1^0(n) = x(2n + 1)
    x_1^1(n) = x_1^0(n) + α(x_0^0(n) + x_0^0(n + 1))
    x_0^1(n) = x_0^0(n) + β(x_1^1(n) + x_1^1(n - 1))
    x_1^2(n) = x_1^1(n) + γ(x_0^1(n) + x_0^1(n + 1))
    x_0^2(n) = x_0^1(n) + δ(x_1^2(n) + x_1^2(n - 1))
    x_0^3(n) = ζ x_0^2(n)
    x_1^3(n) = x_1^2(n)/ζ

and the inverse transform is

    x_0^2(n) = x_0^3(n)/ζ
    x_1^2(n) = ζ x_1^3(n)
    x_0^1(n) = x_0^2(n) - δ(x_1^2(n) + x_1^2(n - 1))
    x_1^1(n) = x_1^2(n) - γ(x_0^1(n) + x_0^1(n + 1))
    x_0^0(n) = x_0^1(n) - β(x_1^1(n) + x_1^1(n - 1))
    x_1^0(n) = x_1^1(n) - α(x_0^0(n) + x_0^0(n + 1))
    x(2n + 1) = x_1^0(n)
    x(2n) = x_0^0(n)

As one can see, using the (9,7) wavelet filters, a total of 4 state transitions are needed to transform a raw input sample into a wavelet coefficient. This process is shown in Fig. B.1. Assume there are 9 samples, {x^0(0), x^0(1), ..., x^0(8)}, loaded in memory initially. The first elementary matrix e^0(z) is lower triangular, so the state transition updates odd samples with their two neighboring even samples. For example, x^0(1) is updated into x^1(1) = x^0(1) + α(x^0(0) + x^0(2)). The same updating also occurs for samples {x^0(3), x^0(5), x^0(7)}. Notice that samples {x^0(0), x^0(8)} remain un-updated because they are needed by neighboring blocks, e.g., x^0(8) is needed by x^0(9) and x^0(0) is needed by x^0(-1) (not shown in the figure). Consequently, x^0(0) and x^0(8) are preserved as the state information at state 0.

Figure B.1: State transitions of the Daubechies (9,7) filters using factorization (B.1). The state information consists of 4 samples (the overlap buffer size) in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x^0(9), x^0(10)}.

The next elementary matrix e^1(z) is upper triangular, so it updates even samples using odd samples. For example, x^1(2) is updated into x^2(2) = x^1(2) + β(x^1(1) + x^1(3)), and so are samples {x^1(4), x^1(6)}. Again, x^1(1) and x^1(7) are preserved as the state information at state 1. The same process continues until x^3(4) is updated into the final transform coefficient x^4(4). The state information near the right boundary consists of the samples shown in shaded boxes in the figure, i.e., {x^3(5), x^2(6), x^1(7), x^0(8)}. So the overlap buffer size for one level of wavelet decomposition using the Daubechies (9,7) filters is 4 samples. These partially updated samples constitute the only information one needs to buffer for the transform of the new input data pair {x^0(9), x^0(10)}. The operations are shown as dashed lines in the figure. As one can see, all these operations are based on the state information, which is preserved in the memory buffer.

(2,10) filters

This filter has been found to give excellent performance for lossless image compression. The factorization of the analysis polyphase matrix is

    P_a(z) = [ 1  0 ; (3/64)(z^2 - z^-2) + (22/64)(z - z^-1)  1 ] [ 1  1/2 ; 0  1 ] [ 1  0 ; -1  1 ]

Based on this factorization, the forward transform is

    x_0^0(n) = x(2n)
    x_1^1(n) = x(2n + 1) - x(2n)
    x_0^2(n) = x_0^0(n) + x_1^1(n)/2
    x_1^2(n) = x_1^1(n)
    x_0^3(n) = x_0^2(n)
    x_1^3(n) = x_1^2(n) + (3/64)(x_0^2(n + 2) - x_0^2(n - 2)) + (22/64)(x_0^2(n + 1) - x_0^2(n - 1))

and the inverse transform is

    x_1^2(n) = x_1^3(n) - (3/64)(x_0^3(n + 2) - x_0^3(n - 2)) - (22/64)(x_0^3(n + 1) - x_0^3(n - 1))
    x_0^2(n) = x_0^3(n)
    x_1^1(n) = x_1^2(n)
    x_0^0(n) = x_0^2(n) - x_1^1(n)/2
    x(2n) = x_0^0(n)
    x(2n + 1) = x_1^1(n) + x(2n)

As one can see, the first two state transitions are basically the same as those of the (9,7) filters. Assume initially there are 10 samples in the memory as shown in Fig. B.2.
The last transition is more interesting and is detailed here. The elementary matrix e^2(z) is

    e^2(z) = [ 1  0 ; (3/64)(z^2 - z^-2) + (22/64)(z - z^-1)  1 ]

This is a lower triangular matrix, so odd samples get updated.

Figure B.2: State transitions of the (2,10) filters using factorization (B.1). The state information consists of 4 samples in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x^0(10), x^0(11)}.

For example, x^2(5) is updated into

    x^3(5) = x^2(5) + (3/64)(x^2(8) - x^2(0)) + (22/64)(x^2(6) - x^2(2))    (B.1)

On the other hand, x^2(7) cannot be fully updated because x^0(10) is not available (not in the buffer yet). However, it is partially updated as

    x^2(7) = x^2(7) + (3/64)(-x^2(2)) + (22/64)(x^2(8) - x^2(4))    (B.2)

This partial updating then frees sample x^2(2) from the buffer. In other words, to fully update x^2(7), no samples with indices smaller than 7 are needed. The same partial updating is also performed for sample x^2(9) as

    x^2(9) = x^2(9) + (3/64)(-x^2(4)) + (22/64)(-x^2(6))    (B.3)

The only samples which have to be buffered are {x^2(6), x^2(7), x^2(8), x^2(9)}. So the overlap buffer size is 4 samples. When the next new input pair {x^0(10), x^0(11)} comes, the operations in dashed lines are executed. As a result, samples {x^2(6), x^2(7)} are completely transformed and thus can be removed from the buffer. However, samples {x^0(10), x^0(11)} can only be partially updated and thus have to be buffered. This process continues until all inputs are transformed.
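The (2,10) forward steps, including the long update of the last transition, can be sketched in whole-signal form (illustrative only: index clamping at the ends replaces the partial-update bookkeeping described in this appendix, and the sign convention of the long update follows Eq. (B.1)):

```python
def dwt_2_10_forward(x):
    """(2,10) forward lifting: d = odd - even, s = even + d/2, then
    d += (3/64)(s[n+2] - s[n-2]) + (22/64)(s[n+1] - s[n-1])  (cf. Eq. B.1),
    with index clamping at the boundaries."""
    assert len(x) % 2 == 0
    s0 = x[0::2]
    d = [o - e for e, o in zip(s0, x[1::2])]   # highpass predict
    s = [e + hp / 2 for e, hp in zip(s0, d)]   # lowpass update
    g = lambda i: s[min(max(i, 0), len(s) - 1)]
    d = [d[n] + (3 / 64) * (g(n + 2) - g(n - 2))
              + (22 / 64) * (g(n + 1) - g(n - 1)) for n in range(len(d))]
    return s, d
```

On a constant input every difference in the long update vanishes, so the highpass output is identically zero; in the streaming implementation the same computation is spread over several partial updates so that samples like x^2(2) can be released from the buffer early.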
CDF (4,2) filters

The scaling function of the CDF (4,2) filters is a cubic B-spline, which is used frequently in computer graphics for interpolation. The factorization of the analysis polyphase matrix (adapted from [46]) is

    P_a(z) = [ 1/2  0 ; 0  2 ] [ 1  (3/16)(1 + z^-1) ; 0  1 ] [ 1  0 ; -(1 + z)  1 ] [ 1  -(1/4)(1 + z^-1) ; 0  1 ]

Based on this factorization, the forward transform is

    x_0^0(n) = x(2n)
    x_1^0(n) = x(2n + 1)
    x_0^1(n) = x_0^0(n) - (1/4)(x_1^0(n) + x_1^0(n - 1))
    x_1^1(n) = x_1^0(n) - (x_0^1(n) + x_0^1(n + 1))
    x_0^2(n) = x_0^1(n) + (3/16)(x_1^1(n) + x_1^1(n - 1))

and the inverse transform is

    x_0^1(n) = x_0^2(n) - (3/16)(x_1^1(n) + x_1^1(n - 1))
    x_1^0(n) = x_1^1(n) + (x_0^1(n) + x_0^1(n + 1))
    x_0^0(n) = x_0^1(n) + (1/4)(x_1^0(n) + x_1^0(n - 1))

In this case, the state transitions are basically the same as those of the (9,7) filters. The overlap buffer size is 3 samples, as shown in Fig. B.3.

Figure B.3: State transitions of the CDF (4,2) filters using factorization (B.4). The state information consists of 3 samples in shaded boxes. Dashed lines represent operations necessary for the transform of the new input sample pair {x^0(8), x^0(9)}.

Appendix C

MDTC Transform Design Algorithm

The basic algorithm we use is the same as that described in [61]. Here we provide a different derivation for a special channel model. We assume that channels fail independently with an equal channel failure probability p (description loss probability). One example is the best-effort network, where independent packet loss is a reasonable model for each individual connection [75].
We mention that it is straightforward to generalize the algorithm to the case of unequal channel failure probabilities.

The above constrained optimization problem can be transformed into an unconstrained one using a Lagrangian multiplier λ. The cost function J can be written as

J = D_s(Λ, Θ) + λ ρ(Λ, Θ)

Let R_X = {σ_{ij}}, i, j = 1, 2, ..., M, be the correlation matrix of the input vector X, and let R_Y = {σ_{Y,ij}}, i, j = 1, 2, ..., M, be the correlation matrix of the transform output Y. Then we have

R_Y = T(Λ, Θ) R_X T(Λ, Θ)^t = R(Θ) S(Λ) R_X S(Λ)^t R(Θ)^t = R(Θ) R_S R(Θ)^t

where R_S = S(Λ) R_X S(Λ)^t = {a_i^2 σ_i^2}, i = 1, 2, ..., M, is a diagonal matrix.

C.0.1 Redundancy bit rate ρ

Let us first derive the redundancy rate ρ for this correlating transform T. Assume the coding rate is R_X for the input vector X and that the components of X are identically distributed. With optimal bit allocation and high-resolution quantization, the achieved distortion is [95]

D_0 = h ρ_X^2 2^{-2 R_X}

where h = (1/12) {∫ [f(x)]^{1/3} dx}^3 is a constant integral determined by the component density function f(x), and ρ_X^2 = (∏_{i=1}^{M} σ_i^2)^{1/M} is the geometric mean of the component variances. After the transform, the distortion remains unchanged since T is implemented losslessly; the bit rate, however, is increased to R_Y. We have

D_0 = h ρ_Y^2 2^{-2 R_Y}

where ρ_Y^2 = (∏_{i=1}^{M} σ_{Y,i}^2)^{1/M}. The redundancy bit rate ρ is therefore the increase in bit rate from R_X to R_Y:

ρ(Λ, Θ) = R_Y - R_X = (1/(2M)) log_2 ( ∏_{i=1}^{M} σ_{Y,i}^2 / ∏_{i=1}^{M} σ_i^2 )

C.0.2 Side distortion D_s

Now we derive the side distortion when there are erasures, i.e., when some descriptions are lost. We first need to recover the lost descriptions from the received ones. Assume m out of M descriptions are lost.
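The redundancy-rate formula above is easy to sanity-check numerically: ρ is the half log-ratio of the geometric means of the output and input component variances. The sketch below is a direct transcription; the function name is invented here.

```python
import math

def redundancy_rate(var_x, var_y):
    """rho = (1/(2M)) * log2( prod(var_y) / prod(var_x) ).

    var_x, var_y: per-component variances of X and Y (same length M).
    Sums of logs are used instead of products to avoid overflow.
    """
    M = len(var_x)
    log_ratio = sum(math.log2(vy) for vy in var_y) - sum(math.log2(vx) for vx in var_x)
    return log_ratio / (2 * M)
```

For example, doubling every component variance (M = 2) costs ρ = (1/4)·log2(4) = 0.5 bit per component.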
Without loss of generality, we can partition Y into received and lost portions [61], i.e., Y = [y_r y_l]^t. For notational simplicity we take r = M - m and l = m, so that y_r is an (M - m)-dimensional vector which is available at the decoder, while y_l is an m-dimensional vector which is not. Denote the decoder-reconstructed descriptions as Ŷ = [ŷ_r ŷ_l]^t; since the received descriptions are kept as they are, Ŷ = [y_r ŷ_l]^t.

The MMSE estimator of y_l based on y_r is the conditional mean, i.e., ŷ_l = E[y_l | y_r]. For jointly Gaussian vectors this reduces to the LMMSE estimator

ŷ_l = R_{lr} R_{rr}^{-1} y_r

with mean-squared estimation error (MSE)

E ||e_y||^2 = E ||y_l - ŷ_l||^2 = tr(R_{ee})

where tr(Q) is the trace of the matrix Q and

R_{rr} = E[y_r y_r^t]
R_{ll} = E[y_l y_l^t]
R_{lr} = R_{rl}^t = E[y_l y_r^t]
R_{ee} = E[e_y e_y^t] = R_{ll} - R_{lr} R_{rr}^{-1} R_{rl}

Once Ŷ is found, the estimate X̂ = T^{-1} Ŷ can be obtained via the inverse correlating transform T^{-1}. For simplicity, denote the transpose of T^{-1} by T^{-t}. The final mean-squared reconstruction error between X and X̂ is

E(||e_X||^2) = E ||X - X̂||^2
            = E ||T^{-1}(Y - Ŷ)||^2
            = tr( E[ T^{-1} (Y - Ŷ)(Y - Ŷ)^t T^{-t} ] )
            = tr( T^{-1} E[(Y - Ŷ)(Y - Ŷ)^t] T^{-t} )

Writing Y and Ŷ in their partitioned forms, we have

E[(Y - Ŷ)(Y - Ŷ)^t] = \begin{pmatrix} 0 & 0 \\ 0 & R_{ee} \end{pmatrix}

Partition T^{-1} correspondingly,

T^{-1} = \begin{pmatrix} T_{rr} & T_{rl} \\ T_{lr} & T_{ll} \end{pmatrix}, \qquad T^{-t} = \begin{pmatrix} T_{rr}^t & T_{lr}^t \\ T_{rl}^t & T_{ll}^t \end{pmatrix}

After some simplification, we finally get the reconstruction mean-squared error, i.e., the side distortion D_m when m descriptions are lost, as

D_m = E ||e_X||^2 = tr(T_{rl} R_{ee} T_{rl}^t) + tr(T_{ll} R_{ee} T_{ll}^t)

The average side distortion D_s is the weighted sum

D_s = Σ_{m=1}^{M} P_m D_m

with P_m = \binom{M}{m} p^m (1 - p)^{M-m}, the probability that m descriptions are lost. To summarize, we list the steps of the side distortion calculation.
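The D_m computation can be sketched numerically. This is a minimal illustration assuming NumPy; the transform T, the correlation matrix, and the loss pattern are all illustrative, and the "partition into [y_r y_l]" of the text is realized here by an index permutation.

```python
import numpy as np

def side_distortion(T, R_y, lost):
    """D_m = tr(T_rl R_ee T_rl^t) + tr(T_ll R_ee T_ll^t) for one loss pattern.

    T: M x M correlating transform, R_y: correlation matrix of Y = T X,
    lost: indices of the lost descriptions.
    """
    M = R_y.shape[0]
    recv = [i for i in range(M) if i not in lost]
    order = recv + list(lost)                  # reorder Y as [y_r, y_l]
    Ry = R_y[np.ix_(order, order)]
    r = len(recv)
    Rrr, Rrl = Ry[:r, :r], Ry[:r, r:]
    Rlr, Rll = Ry[r:, :r], Ry[r:, r:]
    # LMMSE error correlation: R_ee = R_ll - R_lr R_rr^{-1} R_rl
    Ree = Rll - Rlr @ np.linalg.solve(Rrr, Rrl)
    # Columns of T^{-1} follow the same reordering as Y
    Tinv = np.linalg.inv(T)[:, order]
    Trl, Tll = Tinv[:r, r:], Tinv[r:, r:]
    return np.trace(Trl @ Ree @ Trl.T) + np.trace(Tll @ Ree @ Tll.T)
```

As a check, with T the identity and two unit-variance components with correlation 0.5, losing either one leaves the LMMSE error 1 - 0.5^2 = 0.75.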
1. For m = 1 : 1 : M, where m is the number of lost descriptions:
   • Calculate the side distortion D_m.
     - Partition R_Y to find R_{rr}, R_{ll}, and R_{lr}. Partition T^{-1} to find T_{rl} and T_{ll}.
     - Calculate the correlation of the estimation error, R_{ee} = R_{ll} - R_{lr} R_{rr}^{-1} R_{rl}.
     - Calculate the side distortion as D_m = tr(T_{rl} R_{ee} T_{rl}^t) + tr(T_{ll} R_{ee} T_{ll}^t).
   • Calculate the event probability P_m = \binom{M}{m} p^m (1 - p)^{M-m} (m descriptions are lost).
2. Calculate the average side distortion D_s = Σ_{m=1}^{M} P_m D_m.

Appendix D

KLVT Vector Space Partition

Let ⟨a_1, a_2, ..., a_N⟩ denote a vector space A with basis set {a_i, 1 ≤ i ≤ N}; that is, any vector in A can be represented as a linear combination of the basis vectors. A is also called the span of this basis set. Consider a random sequence x = {x_i, 1 ≤ i ≤ N}. Let R be an N × N (assuming N is even) correlation matrix, with eigenmatrix

Φ = [e_1 e_2 ... e_N]

where {e_i, 1 ≤ i ≤ N} are the eigenvectors and {λ_i, 1 ≤ i ≤ N} the corresponding eigenvalues. Hence R can be written as

R = λ_1 e_1 e_1^H + λ_2 e_2 e_2^H + ... + λ_N e_N e_N^H = Σ_{i=1}^{N} λ_i e_i e_i^H

which is also called the spectral decomposition of R on its eigenspace E = ⟨e_1, e_2, ..., e_N⟩ in the context of signal processing. In fact, each e_i e_i^H constitutes an eigen-subspace of dimensionality 1, with no correlation between any two of them. Now we merge every two consecutive eigen-subspaces to form a new set of eigen-subspaces as follows:

E_1 = ⟨e_1, e_2⟩,  E_2 = ⟨e_3, e_4⟩,  ...,  E_{N/2} = ⟨e_{N-1}, e_N⟩

We thus obtain another set of uncorrelated eigen-subspaces {E_i, 1 ≤ i ≤ N/2}, each of dimensionality 2, whose union is still the eigenspace E.
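The two summary steps above reduce to a binomial-weighted sum once the per-loss-count distortions D_m are known. The sketch below is self-contained and illustrative: it takes the D_m values as given inputs, and the function name is invented here.

```python
import math

def average_side_distortion(D_values, p):
    """D_s = sum_{m=1}^{M} C(M, m) p^m (1-p)^(M-m) * D_m.

    D_values: [D_1, ..., D_M], the side distortion for m lost descriptions.
    p: the (equal, independent) description loss probability.
    """
    M = len(D_values)
    total = 0.0
    for m, Dm in enumerate(D_values, start=1):
        P_m = math.comb(M, m) * p**m * (1 - p) ** (M - m)  # event probability
        total += P_m * Dm
    return total
```

For M = 2 and p = 0.5, the loss events m = 1 and m = 2 have probabilities 0.5 and 0.25, so D = [1, 2] averages to 1.0.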
E = ∪_{i=1}^{N/2} E_i

Projected onto this new set of eigen-subspaces, R can be expanded as follows:

R = E_1 Λ_1 E_1^H + E_2 Λ_2 E_2^H + ... + E_{N/2} Λ_{N/2} E_{N/2}^H = Σ_{i=1}^{N/2} E_i Λ_i E_i^H

where {E_i, 1 ≤ i ≤ N/2} are N × 2 matrices of rank 2, i.e., with column spaces of dimensionality 2, and the Λ_i are 2 × 2 eigenmatrices. Notice that {E_i, 1 ≤ i ≤ N/2} is a very special partition of the eigenspace E, in which each subspace has its basis vectors drawn from the eigenvector set and every Λ_i, 1 ≤ i ≤ N/2, is diagonal. In fact, these eigen-subspaces {E_i, 1 ≤ i ≤ N/2} can be given basis sets other than the eigenvectors; in other words, the Λ_i, 1 ≤ i ≤ N/2, need not be diagonal either. As long as the subspaces form a valid decorrelation partition, R will be block-diagonalized.
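The closing claim can be checked numerically: rotating the basis inside each 2-dimensional eigen-subspace keeps R block-diagonal, even though the 2 × 2 blocks are no longer diagonal. The sketch below assumes NumPy, and the example matrix R is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
R = A @ A.T                        # symmetric "correlation" matrix, N = 4

w, V = np.linalg.eigh(R)           # columns of V are eigenvectors e_1..e_4
theta = 0.3                        # rotate within each consecutive eigenvector pair
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = V.copy()
B[:, 0:2] = V[:, 0:2] @ G          # new (non-eigen) basis for E_1
B[:, 2:4] = V[:, 2:4] @ G          # new (non-eigen) basis for E_2

Lam = B.T @ R @ B                  # block-diagonal; blocks need not be diagonal
off_block = Lam[0:2, 2:4]          # cross-subspace block, should vanish
```

Here each 2 × 2 block of Lam equals G^t diag(λ_{2i-1}, λ_{2i}) G, which has off-diagonal entry cosθ·sinθ·(λ_{2i} - λ_{2i-1}): nonzero whenever the paired eigenvalues differ, while the cross-subspace blocks stay zero.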
Asset Metadata

Creator: Jiang, Wenqing (author)
Core Title: Contribution to transform coding system implementation
School: Graduate School
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: engineering, electronics and electrical; OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Ortega, Antonio (committee chair); Medioni, Gerard (committee member); Zhang, Zhen (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-62394
Unique identifier: UC11337735
Identifier: 3018008.pdf (filename); usctheses-c16-62394 (legacy record id)
Document Type: Dissertation
Rights: Jiang, Wenqing
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA