HIGH FIDELITY MULTICHANNEL AUDIO COMPRESSION

by

Dai Yang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2002

Copyright 2002 Dai Yang

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3094388. Copyright 2002 by Yang, Dai. All rights reserved. UMI Microform 3094388. Copyright 2003 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-1695

This dissertation, written by Dai Yang, under the direction of her Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: August 6, 2002
DISSERTATION COMMITTEE: Chairperson

Dedication

Dedicated with love to my husband Ruhua He, my son Joshua S. He, my parents Junhui Yang and Zongduo Dai.

Acknowledgements

I would like to express my most profound gratitude to my advisor Dr. C.-C. Jay Kuo, who has been a tireless and enthusiastic source of good ideas. I have benefited greatly from Dr.
Kuo's extensive knowledge, his invaluable experience in research, and his kind personality, and I will continue to benefit from them throughout my life. Dr. Kuo has been a strong influence in both my educational and personal pursuits, and I will forever be grateful to him for sharing his successful career with me.

My sincere gratitude also goes to my co-advisor Dr. Chris Kyriakakis for his guidance and support throughout my thesis work. Without these, it would have been impossible for me to finish my Ph.D. program.

I also wish to thank Dr. Hongmei Ai for her valuable discussions on my research, her help and encouragement with my thesis, and, most importantly, the friendship that we have shared throughout my graduate study at USC.

My thanks go out to all group members under Dr. Kuo's guidance. Postdoctoral researchers and fellow students have helped me with useful discussions and entertaining conversations over the years. My graduate study in this group was greatly enriched by their company, and it is my privilege to have spent time with them.

Finally, I would like to thank my husband Ruhua. His unconditional support and love made it possible for me to finish my program on time. I also wish to thank my parents and parents-in-law for their delight in helping to take care of my son during my research.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Motivation and Overview
    1.1.1 Redundancy Inherent in Multichannel Audio
    1.1.2 Quality-Scalable Single Compressed Bitstream
    1.1.3 Embedded Multichannel Audio Bitstream
    1.1.4 Error-Resilient Scalable Audio Bitstream
  1.2 Contributions of the Research
    1.2.1 Inter-Channel Redundancy Removal Approach
    1.2.2 Audio Concealment and Channel Transmission Strategy for Heterogeneous Network
    1.2.3 Quantization Efficiency for Adaptive Karhunen-Loeve Transform
    1.2.4 Progressive Syntax-Rich Multichannel Audio Codec Design
    1.2.5 Error-Resilient Scalable Audio Coding
  1.3 Outline of the Dissertation

2 Inter-Channel Redundancy Removal and Channel-Scalable Decoding
  2.1 Introduction
  2.2 Inter-Channel Redundancy Removal
    2.2.1 Karhunen-Loeve Transform
    2.2.2 Evidence for Inter-Channel De-Correlation
    2.2.3 Energy Compaction Effect
    2.2.4 Frequency-Domain versus Time-Domain KLT
  2.3 Temporal Adaptive KLT
  2.4 Eigen-Channel Coding and Transmission
    2.4.1 Eigen-Channel Coding
    2.4.2 Eigen-Channel Transmission
  2.5 Audio Concealment for Channel-Scalable Decoding
  2.6 Compression System Overview
  2.7 Complexity Analysis
  2.8 Experimental Results
    2.8.1 Multichannel Audio Coding
    2.8.2 Audio Concealment with Channel-Scalable Coding
    2.8.3 Subjective Listening Test
  2.9 Conclusion

3 Adaptive Karhunen-Loeve Transform and its Quantization Efficiency
  3.1 Introduction
  3.2 Vector Quantization
  3.3 Efficiency of KLT De-Correlation
  3.4 Temporal Adaptation Effect
  3.5 Complexity Analysis
  3.6 Experimental Results
  3.7 Conclusion
4 Progressive Syntax-Rich Multichannel Audio Codec
  4.1 Introduction
  4.2 Progressive Syntax-Rich Codec Design
  4.3 Scalable Quantization and Entropy Coding
    4.3.1 Successive Approximation Quantization (SAQ)
      4.3.1.1 Description of the SAQ Algorithm
      4.3.1.2 Analysis of Error Reduction Rates
      4.3.1.3 Analysis of Error Bounds
    4.3.2 Context-based QM Coder
  4.4 Channel and Subband Transmission Strategy
    4.4.1 Channel Selection Rule
    4.4.2 Subband Selection Rule
  4.5 Implementation Issues
    4.5.1 Frame, Subband or Channel Skipping
    4.5.2 Determination of the MNR Threshold
  4.6 Complete Algorithm Description
  4.7 Experimental Results
    4.7.1 Results Using MNR Measurement
      4.7.1.1 MNR Progressive
      4.7.1.2 Random Access
      4.7.1.3 Channel Enhancement
    4.7.2 Subjective Listening Test
  4.8 Conclusion

5 Error-Resilient Design
  5.1 Introduction
  5.2 WCDMA Characteristics
  5.3 Layered Coding Structure
    5.3.1 Advantages of the Layered Coding
    5.3.2 Main Features of Scalable Codec
  5.4 Error-Resilient Codec Design
    5.4.1 Unequal Error Protection
    5.4.2 Adaptive Segmentation
    5.4.3 Frequency Interleaving
    5.4.4 Bitstream Architecture
    5.4.5 Error Control Strategy
  5.5 Experimental Results
  5.6 Conclusion
  5.7 Discussion and Future Work
    5.7.1 Discussion
      5.7.1.1 Frame Interleaving
      5.7.1.2 Error Concealment
    5.7.2 Future Work

6 Conclusion

Bibliography

Appendix A Descriptive Statistics and Parameters
  A.1 Mean
  A.2 Variance
  A.3 Standard Deviation
  A.4 Standard Error of the Mean
  A.5 Confidence Interval

Appendix B Karhunen-Loeve Expansion
  B.1 Definition
  B.2 Features and Properties

Appendix C Psychoacoustics
  C.1 Hearing Area
  C.2 Masking
    C.2.1 Masking of Pure Tones
    C.2.2 Temporal Effects

Appendix D MPEG Advanced Audio Coding
  D.1 Overview of MPEG-2 Advanced Audio Coding
  D.2 Preprocessing
  D.3 Filter Bank
  D.4 Temporal Noise Shaping
  D.5 Joint Stereo Coding
    D.5.1 M/S Stereo Coding
    D.5.2 Intensity Stereo Coding
  D.6 Prediction
  D.7 Quantization and Coding
    D.7.1 Overview
    D.7.2 Non-Uniform Quantization
    D.7.3 Coding of Quantized Spectral Values
    D.7.4 Noise Shaping
    D.7.5 Iteration Process
  D.8 Noiseless Coding
    D.8.1 Sectioning
    D.8.2 Grouping and Interleaving
    D.8.3 Scale Factors
    D.8.4 Huffman Coding

List of Tables

2.1 Comparison of computational complexity between MAACKLT and AAC
3.1 Absolute values of non-redundant elements of the normalized covariance matrix calculated from original signals
3.2 Absolute values of non-redundant elements of the normalized covariance matrix calculated from KLT de-correlated signals
3.3 Absolute values of non-redundant elements of the normalized covariance matrix calculated from scalar quantized KLT de-correlated signals
3.4 De-correlation results with SQ
3.5 De-correlation results with VQ
4.1 MNR comparison for MNR progressive profiles
4.2 MNR comparison for Random Access and Channel Enhancement profiles
5.1 Characteristics of WCDMA error patterns
5.2 Experimental results of the frequency interleaving method
5.3 Average MNR values of reconstructed audio files through different WCDMA channels
A.1 Weaning weights of four Charolais steers (in pounds)
A.2 Variance calculation using deviations from the mean
A.3 Distribution of t two-tailed tests
D.1 Huffman codebooks used in AAC

List of Figures

2.1 Inter-channel decorrelation via KLT
2.2 Absolute values of elements in the lower triangular normalized covariance matrix for 5-channel "Herre"
2.3 Absolute values of elements in the lower triangular normalized covariance matrix for 10-channel "Messiah"
2.4 Absolute values of elements in the lower triangular normalized covariance matrix after KLT for 5-channel "Herre"
2.5 Absolute values of elements in the lower triangular normalized covariance matrix after KLT for 10-channel "Messiah"
2.6 Comparison of accumulated energy distribution for (a) 5-channel "Herre" and (b) 10-channel "Messiah"
2.7 Normalized variances for (a) 10-channel "Messiah" and (b) 5-channel "Messiah", where the vertical axis is plotted in the log scale
2.8 (a) Frequency-domain and (b) time-domain representations of the center channel from "Herre"
2.9 Absolute values of off-diagonal elements for the normalized covariance matrix after (a) frequency-domain and (b) time-domain KL transforms with test audio "Herre"
2.10 Absolute values of off-diagonal elements for the normalized covariance matrix after (a) frequency-domain and (b) time-domain KL transforms with test audio "Messiah"
2.11 De-correlation efficiency of temporal adaptive KLT
2.12 The overhead bit rate versus the number of channels and the adaptive period
2.13 The modified AAC encoder block diagram
2.14 The empirical probability density functions of normalized signals in 5 eigen-channels generated from test audio "Herre"
2.15 The empirical probability density functions of normalized signals in the first 9 eigen-channels generated from test audio "Messiah"
2.16 The block diagram of the proposed MAACKLT encoder
2.17 The block diagram of the proposed MAACKLT decoder
2.18 The MNR comparison for (a) 10-channel "Herbie" using frequency-domain KLT, (b) 5-channel "Herre" using frequency-domain KLT, and (c) 5-channel "Herre" using time-domain KLT
2.19 The mean MNR improvement for temporal-adaptive KLT applied to the coding of 10-channel "Messiah", where the overhead information is included in the overall bit rate calculation
2.20 MNR comparison for 5-channel "Herre" when packets of one channel from the (a) L/R and (b) Ls/Rs channel pairs are lost
2.21 Subjective listening test results
3.1 (a) The de-correlation efficiency and (b) the overhead bit rate versus the number of bits per element in SQ
3.2 (a) The de-correlation efficiency and (b) the overhead bit rate versus the number of bits per vector in VQ
3.3 The magnitude of the lower triangular elements of the normalized covariance matrix calculated from de-correlated signals, where the adaptive time is equal to (a) 0.05, (b) 0.2, (c) 3, and (d) 10 seconds
3.4 (a) Adaptive MNR results and (b) adaptive overhead bits for SQ and VQ for 5-channel "Messiah"
3.5 MNR result using test audio "Messiah" coded at (a) 64 kbit/s/ch, (b) 48 kbit/s/ch, (c) 32 kbit/s/ch, and (d) 16 kbit/s/ch
3.6 MNR result using test audio "Ftbl" coded at (a) 64 kbit/s/ch, (b) 48 kbit/s/ch, (c) 32 kbit/s/ch, and (d) 16 kbit/s/ch
4.1 The adopted context-based QM coder with six classes of contexts
4.2 Subband width distribution
4.3 Subband scanning rule, where the solid line with arrow means all subbands inside this area are scanned, and the dashed line means only those non-significant subbands inside the area are scanned
4.4 The block diagram of the proposed PSMAC encoder
4.5 Illustration of the progressive quantization and lossless coding blocks
4.6 Listening test results for multi-channel audio sources
4.7 Listening test results for single channel audio sources. The cases where no confidence intervals are shown correspond to the situation when all four listeners happened to give the same score to the given sound clip
5.1 A simplified example of how subbands are selected from layers 0 to 3
5.2 Example of frequency interleaving
5.3 The bitstream architecture
5.4 Mean MNR values of reconstructed audio files through different WCDMA channels
A.1 Areas of the normal curve
C.1 Illustration of the hearing area, i.e. the area between the threshold in quiet and the threshold of pain. Also indicated are areas encompassed by music and speech, and the limit of damage risk. The ordinate scale is expressed not only in sound pressure level but also in sound intensity. The dotted part of the threshold in quiet stems from subjects who frequently listen to very loud music
C.2 Illustration of the threshold in quiet, i.e. the just-noticeable level of a test tone as a function of its frequency, registered with the method of tracking. Note that the threshold is measured twice between 0.3 and 8 kHz
C.3 The level of a test tone masked by ten harmonics of 200 Hz as a function of the frequency of the test tone. Levels of the individual harmonics of equal size are given as the parameter
C.4 Schematic drawing to illustrate and characterize regions within which premasking, simultaneous masking and postmasking occur. Note that postmasking uses a different time origin than premasking and simultaneous masking
D.1 The block diagram of the AAC encoder
D.2 The block diagram of the AAC decoder
D.3 Three AAC profiles

Abstract

With the popularization of high-quality audio, there is an increasing demand for effective compression and transmission technology. Despite the success of current perceptual audio coding techniques, some problems still remain open and need improvement.
This dissertation addresses two system issues that arise in practice in high-fidelity audio coding technology: (i) an inter-channel redundancy removal approach designed for high-quality multichannel audio compression, and (ii) a progressive syntax-rich multichannel audio coding algorithm and its error-resilient codec design.

The first contribution of this dissertation is the development of a Modified Advanced Audio Coding with Karhunen-Loeve Transform (MAACKLT) algorithm. In MAACKLT, we exploit the inter-channel redundancy inherent in most multichannel audio sources and prioritize the transformed channel transmission policy. Experimental results show that, compared with MPEG AAC (Advanced Audio Coding), the MAACKLT algorithm not only reconstructs multichannel audio with better quality at a regular low bit rate of 64 kbit/s/ch but also achieves coarse-grain quality scalability.

The second contribution of this dissertation is the design of a Progressive Syntax-Rich Multichannel Audio Coding (PSMAC) system. PSMAC inherits the efficient inter-channel de-correlation block of the MAACKLT algorithm while adding a scalable quantization coding block and a context-based QM noiseless coding block. The final bitstream generated by this multichannel audio coding system provides fine-grain scalability and three user-defined functionalities which are not available in other existing multichannel audio codecs. The reconstructed audio files generated by our proposed algorithm achieve excellent performance in a formal subjective listening test at various bit rates. Based on the PSMAC algorithm, we extend its error-free version to an error-resilient codec by re-organizing the bitstream and modifying the noiseless coding modules.
The performance of the proposed algorithm has been tested under different error patterns in WCDMA channels using several single-channel audio materials. Our experimental results show that the proposed approach has excellent error resiliency at a regular user bit rate of 64 kbit/s.

Chapter 1

Introduction

1.1 Motivation and Overview

Ever since the beginning of the twentieth century, the art of sound coding, transmission, recording, mixing, and reproduction has been constantly evolving. Starting from monophonic technology, multichannel audio technologies have gradually been extended to include stereophonic, quadraphonic, 5.1-channel, and larger configurations. Compared with traditional mono or stereo audio, multichannel audio provides end users with a more compelling experience and is becoming more and more appealing to music producers. As a result, an efficient coding scheme is needed for the storage and transmission of multichannel audio, and this subject has attracted much attention recently.

Among several existing multichannel audio compression algorithms, Dolby AC-3 and MPEG Advanced Audio Coding (AAC) are the two most prevalent perceptual digital audio coding systems. Dolby AC-3 is the third-generation digital audio compression system from Dolby Laboratories and has been adopted as the audio standard for High Definition Television (HDTV) systems. It is capable of providing indistinguishable audio quality at 384 kbit/s for 5.1 channels [A/5]. AAC is currently the most powerful multichannel audio coding algorithm in the MPEG family. It can support up to 48 audio channels and provide perceptually lossless audio at 320 kbit/s for 5.1 channels [BB97].
In general, these low bit rate multichannel audio compression algorithms not only utilize transform coding to remove statistical redundancy within each channel, but also take advantage of the human auditory system to hide lossy coding distortions.

1.1.1 Redundancy Inherent in Multichannel Audio

Despite the success of AC-3 and AAC, not much effort has been made to reduce the inter-channel redundancy inherent in multichannel audio. The only technique used in AC-3 and AAC to eliminate redundancy across channels is called "joint coding", which consists of Intensity/Coupling and Mid/Side (M/S) stereo coding. Coupling is adopted based on the psychoacoustic evidence that, at high frequencies (above approximately 2 kHz), the human auditory system localizes sound primarily based on the envelopes of critical-band-filtered signals that reach the ears, rather than on the signals themselves [Dav93, TDD+94]. M/S stereo coding is applied only to the lower frequency coefficients of Channel-Pair-Elements (CPEs). Instead of directly coding the original signals in the left and right channels, it encodes the sum and the difference of the signals in the two symmetric channels [BBQ+96, JF92].

Our experimental results show that high correlation is very likely to be present between every pair of channels, not only within CPEs, in all frequency regions, especially for multichannel audio signals that are captured and recorded in a real space [YAKK00b]. Since neither AAC nor AC-3 exploits this property to reduce redundancy, neither of them can efficiently compress this kind of multichannel audio content. On the other hand, if the input multichannel audio signals presented to the encoder module have little correlation between channels, encoding at the same bit rate would result in higher reconstructed audio quality.
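The sum/difference idea behind M/S stereo coding can be illustrated with a short sketch. Everything here is a toy: the synthetic signals and the 0.1 noise level are invented for the example, and real codecs apply M/S per frequency band and only when it pays off, but the energy argument is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
common = rng.standard_normal(1024)              # sound shared by both channels
left = common + 0.1 * rng.standard_normal(1024)
right = common + 0.1 * rng.standard_normal(1024)

mid = 0.5 * (left + right)                      # "sum" channel
side = 0.5 * (left - right)                     # "difference" channel

# Perfect reconstruction: L = M + S, R = M - S.
assert np.allclose(left, mid + side)
assert np.allclose(right, mid - side)

# For highly correlated channels the side signal carries far less energy,
# which is where the bit-rate saving comes from.
print(np.var(side) < 0.1 * np.var(mid))         # prints True
```

The transform is lossless by itself; the gain comes from the side channel needing far fewer bits to code at the same perceptual quality.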
Therefore, better compression performance can be achieved if inter-channel redundancy can be effectively removed via a suitable transform, together with the redundancy removal techniques available in existing multichannel audio coding algorithms. One possibility for reducing cross-channel redundancy is to use inter-channel prediction [Fuc93] to improve the coding performance. However, a recent study [KJ01] argues that this kind of technique is not applicable to perceptual audio coding.

1.1.2 Quality-Scalable Single Compressed Bitstream

As the world evolves into the information era, media compression for a pure storage purpose is far from enough. The design of a multichannel audio codec that takes network transmission conditions into account is also important. When a multichannel audio bitstream is transmitted through a heterogeneous network to multiple end users, a quality-scalable bitstream is much more desirable than a non-scalable one.

The quality scalability of a multichannel audio bitstream makes it possible for the entire multichannel sound to be played at various degrees of quality for end users with different receiving bandwidths. To be more precise, when a single quality-scalable bitstream is streamed to multiple users over the Internet via multicast, some lower priority packets can be dropped, and a certain portion of the bitstream can still be transmitted successfully to reconstruct multichannel sound of different quality according to different users' requirements or their available bandwidth. This is called multicast streaming [WHZ00]. With non-scalable bitstreams, the server has to send different users different unicast bitstreams, which is certainly a waste of resources.
Since they were not designed for audio delivery over heterogeneous networks, the bitstreams generated by most existing multichannel audio compression algorithms, such as AC-3 or AAC, are not scalable by nature.

1.1.3 Embedded Multichannel Audio Bitstream

Similar to the quality-scalable single audio bitstream mentioned in the previous section, the most distinctive property of an embedded multichannel audio compression technique lies in its network transmission applications. In the scenario of audio coding, the embedded code contains all lower rate codes "embedded" at the beginning of the bitstream. In other words, bits are ordered in importance, and the decoder can reconstruct audio progressively. With an embedded codec, an encoder can terminate the encoding at any point, thus allowing a target rate or a distortion metric to be met exactly. Typically, some target parameter, such as the bit count, is monitored in the encoding process; when the target is met, the encoding simply stops. Similarly, given a bitstream, the decoder can cease decoding at any point and produce reconstructions corresponding to all lower-rate encodings. The property of being able to terminate the encoding or decoding of an embedded bitstream at any specific point is extremely useful in systems that are either rate- or distortion-constrained [Sha93].

MPEG-4 version-2 audio coding supports fine-grain bit rate scalability [PKKS97, ISOb, ISOg, ISOh, HAP+98] in its Generic Audio Coder (GAC). It has a Bit-Sliced Arithmetic Coding (BSAC) tool, which provides scalability in steps of 1 kbit/s per audio channel for mono or stereo audio material. Several other scalable mono or stereo audio coding algorithms [ZL01, VA01, SAK99] have been proposed in recent years. However, not much work has been done on progressively transmitting multichannel audio sources.
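The "bits ordered in importance" idea can be made concrete with a toy bit-plane code (a hypothetical sketch, not BSAC or any standardized tool): most significant bit-planes of all coefficients are emitted first, so any truncated prefix of the stream decodes to a coarser reconstruction.

```python
import numpy as np

def embed(values, n_planes=8):
    """Emit bits in importance order: the most significant bit-plane
    of every coefficient first (a toy embedded code)."""
    q = np.abs(values).astype(np.int64)
    bits = []
    for p in range(n_planes - 1, -1, -1):
        bits.extend(((q >> p) & 1).tolist())
    return bits

def reconstruct(bits, n_values, n_planes=8):
    """Decode a (possibly truncated) prefix of the embedded bitstream."""
    q = np.zeros(n_values, dtype=np.int64)
    for i, b in enumerate(bits):
        p = n_planes - 1 - i // n_values
        q[i % n_values] |= b << p
    return q

vals = np.array([200, 50, 13, 3])
stream = embed(vals)
full = reconstruct(stream, 4)                    # all bits: exact recovery
half = reconstruct(stream[:len(stream) // 2], 4) # truncated: coarser values
```

Stopping after half the bits still yields usable (if coarse) magnitudes, which is the behavior an embedded audio decoder exploits when it ceases decoding early.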
Most existing multichannel audio codecs, such as AAC or AC-3, can only provide fixed-bit-rate perceptually lossless coding at about 64 kbit/s/ch. In order to transfer high quality multichannel audio through a network of time-varying bandwidth, an embedded audio compression algorithm is highly desirable.

1.1.4 Error-Resilient Scalable Audio Bitstream

Current coding techniques for high quality audio mainly focus on coding efficiency, which makes them extremely sensitive to channel errors. A few bit errors may lead to a long period of error propagation and cause catastrophic results, including a reconstructed sound file of unacceptable perceptual quality or even a crash of the decoder. An audio bitstream transmitted over the network should be error resilient; that is, it should be designed to be robust to channel errors so that the error impact is as small as possible. During periods when packets are corrupted or lost, the decoder should be able to perform error concealment and allow the output audio to be reproduced at acceptable quality. Compared with existing work on image, video, or speech coding, the amount of work on error-resilient audio is relatively small. Techniques for robust coding and error concealment of compressed speech signals have been discussed for years. However, since speech and audio signals have different applications, straightforwardly applying these techniques to audio does not generate satisfactory results in general.

1.2 Contributions of the Research

Based on the current status of multichannel audio compression discussed in previous sections, we propose in this dissertation three audio coding algorithms, all built upon the basic coding structure of MPEG Advanced Audio Coding (AAC).
The first is called Modified Advanced Audio Coding with Karhunen-Loeve Transform (MAACKLT). The second is called Progressive Syntax-Rich Multichannel Audio Codec (PSMAC). The third is called Error-Resilient Scalable Audio Coding (ERSAC). Major contributions of this dissertation are summarized below.

1.2.1 Inter-Channel Redundancy Removal Approach

As mentioned in Section 1.1.1, not much effort has been made to reduce the inter-channel redundancy inherent in multichannel audio sources. In our research, we carefully observe the inter-channel correlation present in multichannel audio materials and propose an effective channel redundancy removal approach. Specific contributions along this direction are listed below.

• Observation of channel correlation
Based on our observation, most multichannel audio materials of interest exhibit two types of channel correlation. The first type shows high correlation only between CPEs, but little correlation between other channel pairs. The other type shows high correlation among all channels.

• Proposal of an effective inter-channel redundancy removal approach
An inter-channel de-correlation method via KLT is adopted in the pre-processing stage to remove the redundancy inherent in the original multichannel audio signals. Audio channels after KLT show little correlation between channels.

• Study of the efficiency of the proposed inter-channel de-correlation method
Experimental results show that the KLT pre-processing approach not only significantly de-correlates input multichannel audio signals but also considerably compacts the signal energy into the first several eigen-channels. This provides strong evidence of KLT's data compaction capability. Moreover, the energy compaction efficiency increases with the number of input channels.
• Frequency-domain KLT versus time-domain KLT
It is observed that applying KLT to frequency-domain signals achieves better performance than directly applying KLT to time-domain signals, as shown by experimental results in Section 2.8. Thus, intra-channel signal de-correlation and energy compaction procedures should be performed after time-domain signals are transformed into the frequency domain via MDCT in the AAC encoder.

• Temporal adaptive KLT
A multichannel audio program often comprises different periods, each of which has its unique spectral signature. In order to achieve the highest information compactness, the de-correlation transform matrix must adapt to the characteristics of different periods. Thus, a temporal-adaptive KLT method is proposed, and the trade-off between the adaptive window size and the overhead bit rate is analyzed.

• Eigen-channel compression
Since signals in de-correlated eigen-channels have different characteristics from signals in the original physical channels, the MPEG AAC coding blocks are modified accordingly so that they are more suitable for compressing audio signals in eigen-channels.

1.2.2 Audio Concealment and Channel Transmission Strategy for Heterogeneous Networks

Based on the results of our KLT pre-processing approach, the advantage of this method is further explored. Once signals in the original physical channels are transformed into independent eigen-channels, the energy accumulates much faster as the number of channels increases. This implies that, when transmitting data of a fixed number of channels with our algorithm, more information content will be received at the decoder side and better quality of the reconstructed multichannel audio can be achieved. Possible channel transmission and recovery strategies for MPEG AAC and our algorithm are studied and compared.
The following two contributions have been made in this work.

• Channel importance sequence
It is desirable to re-organize the bitstream such that bits of more important channels are received first at the decoder side for audio decoding. This should result in the best audio quality given a fixed amount of received bits. According to the channel energy and, at the same time, considering the sound effects caused by different channels, the channel importance of both original physical channels and KL-transformed eigen-channels is studied. A channel transmission strategy is determined according to this channel importance criterion.

• Audio concealment and channel-scalable decoding
When packets belonging to less important channels are dropped, an audio concealment strategy must be enforced in order to reconstruct full multichannel audio. A channel-scalable decoding method based on this audio concealment strategy is proposed for bitstreams generated by AAC and the proposed algorithm. Experimental results show that our algorithm has a much better scalable capability and can reconstruct multichannel audio of better quality, especially at lower bit rates.

1.2.3 Quantization Efficiency for Adaptive Karhunen-Loeve Transform

The quantization method for the Karhunen-Loeve Transform matrix and the discussion of the temporal adaptive KLT method in the MAACKLT algorithm are relatively simple. Some questions arise in improving the efficiency of the quantization scheme. For example, can we improve the coding performance by reducing the overhead involved in transmitting the KL transform matrix? If the number of bits required to quantize each KL transform matrix is minimized, can we achieve much better inter-channel de-correlation efficiency by updating the KLT matrix much more frequently?
With these questions in mind, we investigate the impact of different quantization methods and their efficiency for the adaptive Karhunen-Loeve Transform. The following two areas are addressed in this research.

• Scalar quantizer versus vector quantizer
The coding efficiency and the bit requirements of the scalar quantizer and the vector quantizer are carefully analyzed. Although a scalar quantizer applied to the KL transform matrix gives much better inter-channel de-correlation efficiency, vector quantization methods, which have a smaller bit requirement, achieve better performance in terms of the final MNR values of the reconstructed sound file.

• Long versus short temporal adaptive period for KLT
We study how to choose a moderately long temporal adaptive period so that the optimal trade-off between the inter-channel de-correlation efficiency and the overhead bit rate can be achieved.

1.2.4 Progressive Syntax-Rich Multichannel Audio Codec Design

Inspired by progressive image coding and the MPEG AAC system, a novel progressive syntax-rich multichannel audio compression algorithm is proposed in this dissertation. The distinctive feature of the multichannel audio bitstream generated by our embedded algorithm is that it can be truncated at any point and still reconstruct full multichannel audio, which is extremely desirable in network transmission. The novelty of this algorithm includes the following.

• Subband selection strategy
A subband selection strategy based on the Mask-to-Noise Ratio (MNR) is proposed. An empirical MNR threshold is used to determine the importance of a subband in each channel so that the most sensitive frequency regions can be reconstructed first.
• Layered coefficient coding
A dual-threshold strategy is adopted in our implementation. At each layer, the MNR threshold is used to determine subband significance, while the coefficient magnitude threshold is used to determine coefficient significance. According to these selection criteria, within each selected subband, transformed coefficients are quantized in layers and transmitted into the bitstream so that a coarse-to-fine multiprecision representation of these coefficients can be achieved at the decoder side.

• Multiple-context lossless coding
A context-based QM coder is used in the lossless coding part of the proposed algorithm. Six classes of contexts are carefully selected in order to increase the coding performance of the QM coder.

• Three user-defined profiles
Three user-defined profiles are designed in this codec: MNR progressive, random access, and channel enhancement. With these profiles, the PSMAC algorithm provides end users with versatile functionalities.

1.2.5 Error-Resilient Scalable Audio Coding

In order to improve the performance of the PSMAC algorithm when its bitstream is transmitted over erroneous channels, we extend its error-free codec to an error-resilient scalable audio codec (ERSAC) over WCDMA channels by re-organizing the bitstream and modifying the noiseless coding part. The distinctive features of the ERSAC algorithm are presented below.

• Unequal error protection
Compared with an equal error protection scheme, the unequal error protection method gives higher priority to critical bits and, therefore, better protection. It offers improved perceived signal quality at the same channel signal-to-noise ratio.
• Adaptive segmentation
In order to minimize the error propagation effect, the bitstream is dynamically segmented into several variable-length segments such that it can be re-synchronized at the beginning of each segment even when errors happen. Within each segment, bits can be independently decoded. In this way, errors are confined to one segment and will not propagate and affect the decoding of neighboring segments.

• Frequency interleaving
To further improve error resilience, bits belonging to the same time period but different frequency regions are divided into two groups and sent in different packets. Therefore, even when the packets that contain bits for the header or the data part of the base layer are corrupted, not all information for the same time position is lost, so the end user is still able to reconstruct a poorer version of the sound with some frequency components missing.

1.3 Outline of the Dissertation

This dissertation consists of several chapters, organized as follows. Chapter 2 and Chapter 3 are devoted to the Modified Advanced Audio Coding with Karhunen-Loeve Transform (MAACKLT) algorithm. Chapter 2 presents the inter-channel redundancy removal approach and the channel-scalable decoding method, while Chapter 3 studies the quantization and adaptation properties of the Karhunen-Loeve Transform (KLT) employed in the MAACKLT algorithm. Chapter 4 describes the Progressive Syntax-Rich Multichannel Audio Codec (PSMAC). Chapter 5 extends the work done in Chapter 4 to an error-robust codec design. Finally, all main results achieved in this thesis are summarized in Chapter 6. Some related research background knowledge is included in the Appendices.
Chapter 2
Inter-Channel Redundancy Removal and Channel-Scalable Decoding

2.1 Introduction

In this chapter1, we present a new algorithm called MAACKLT, which stands for Modified Advanced Audio Coding with Karhunen-Loeve Transform (KLT). In MAACKLT, a temporal-adaptive KLT is applied in the pre-processing stage to remove inter-channel redundancy. Then, de-correlated signals in the KL-transformed channels, called eigen-channels, are compressed by a modified AAC main profile encoder module. Finally, a prioritized eigen-channel transmission policy is enforced to achieve quality scalability.

In this work, we show that the proposed MAACKLT algorithm provides a coarse-grain scalable audio solution. That is, even if packets of some eigen-channels are

1 Part of this chapter represents work published before; see [YAKK00b, YAKK00a, YAKK02c].
The computational complexity of MAACKLT is compared with that of MPEG AAC in section 2.7. Experimental results are shown in Section 2.8. Finally, concluding remarks are given in Section 2.9. 2.2 Inter-C hannel R edundancy R em oval 2.2.1 Karhunen-Loeve Transform For a given time instance, removing inter-channel redundancy would result in a significant bandwidth reduction. This can be done via an orthogonal transform M V = U. Among several commonly used transforms, including the Discrete Cosine 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Transform (DCT), the Fourier Transform (FT), and the Karhunen-Loeve Transform (KLT), the signal-dependent KLT is adopted in the pre-processing stage because it is theoretically optimal in de-correlating signals across channels. Figure 2.1 illustrates how KLT is performed on multichannel audio signals, where the columns of the KL transform matrix is composed by eigenvectors calculated from the covariance matrix associated with original multichannel audio signals. Original multichannel audio Eigen-channel audio signals signals with high correlation with little correlation between channels between channels KL Transform Correlated Decorrelated Matrix Component Component r M X V f * N u f J S * J V . - » ■ Figure 2.1: Inter-channel decorrelation via KLT. Suppose that an input audio signal has n channels. Then, we can form an n x n KL transform matrix M composing of n eigenvectors of the cross-covariance matrix associated with these n channels. Let V (i) denote the vector whose n elements are the iih sample value in channel 1,2,..., n, i.e. V (i) = [xi,x2,..., xn]T, i = l,2 ,...,k , 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. where Xj is the ith sample value in channel j (1 < j < n), k represents the number of samples in each channel, and [*]T represents the transpose of [*]. 
The mean vector μ_V and the covariance matrix C_V are defined as

μ_V = E[V] = (1/k) Σ_{i=1..k} V(i),

C_V = E[(V − μ_V)(V − μ_V)^T] = (1/k) Σ_{i=1..k} (V(i) − μ_V)(V(i) − μ_V)^T.

The KL transform matrix M is

M = [m_1, m_2, ..., m_n]^T,

where m_1, m_2, ..., m_n are eigenvectors of C_V. The covariance of the KL-transformed signals is

C_U = E[(U − μ_U)(U − μ_U)^T] = M E[(V − μ_V)(V − μ_V)^T] M^T = M C_V M^T = diag(λ_1, λ_2, ..., λ_n),

where λ_1, λ_2, ..., λ_n are eigenvalues of C_V. Thus, the transform produces statistically de-correlated channels in the sense of having a diagonal covariance matrix for the transformed signals. Another property of KLT, which can be used in the reconstruction of the original channels, is that the inverse transform matrix of M is equal to its transpose. Since C_V is real and symmetric, the matrix formed by its normalized eigenvectors is orthonormal. Therefore, we have V = M^T U in reconstruction. From KL expansion theory [Hay96], we know that selecting the eigenvectors associated with the largest eigenvalues minimizes the error between the original and reconstructed channels. This error goes to zero if all eigenvectors are used. KLT is thus optimal in the least-square-error sense.

2.2.2 Evidence for Inter-Channel De-Correlation

Multichannel audio sources can be roughly classified into three categories. Those belonging to class I are mostly used in broadcasting, where signals in one channel may be completely different from another. Either the broadcast programs differ from channel to channel, or the same program is broadcast in different languages. Audio sources in class I normally contain relatively independent signals in each channel and present little correlation among channels. Therefore, this type of audio source does not fall into the scope of the high quality multichannel audio compression discussed here.
The second class of multichannel audio sources can be found in most film soundtracks, which are typically in the 5.1-channel format. Most of this kind of program material has a symmetry property among CPEs and presents high correlation within CPEs, but little correlation across CPEs and SCEs (Single Channel Elements). Almost all existing multichannel audio compression algorithms, such as AAC and Dolby AC-3, are mainly designed to encode audio material that belongs to this category. Figure 2.2 shows the normalized covariance matrix generated from one sample audio of class II, where the normalized covariance matrix is derived from the cross-covariance matrix by multiplying each coefficient by the reciprocal of the square root of the product of the individual variances. Since the magnitudes of the non-diagonal elements in a normalized covariance matrix provide a convenient and useful way to measure the degree of inter-channel redundancy, they are used as a correlation metric throughout the paper.

Figure 2.2: Absolute values of elements in the lower triangular normalized covariance matrix for 5-channel "Herre".

        C       L       R       Ls      Rs
  C     1
  L     .0053   1
  R     .0066   .9153   1
  Ls    .0166   .0210   .0156   1
  Rs    .0086   .0036   .0065   .2570   1

A third, emerging class of multichannel audio sources consists of material recorded in a real space with multiple microphones that capture the acoustical characteristics of that space. Audio of class III is becoming more prevalent with the introduction of consumer media such as DVD-Audio. This type of audio signal has considerably larger redundancy inherent among channels, especially adjacent channels, as graphically shown in Figure 2.3, which corresponds to the normalized covariance matrix
Figure 2.3: Absolute values of elements in the lower triangular normalized covariance matrix for 10-channel "Messiah".

        C       L       R       Lw      Rw      Lh      Rh      Ls      Rs      Bs
  C     1
  L     .6067   1
  R     .3231   .1873   1
  Lw    .0484   .0745   .0064   1
  Rw    .0298   .0236   .0854   .2564   1
  Lh    .1464   .0493   .0887   .0791   .0921   1
  Rh    .1235   .1655   .0310   .0031   .0439   .0356   1
  Ls    .1260   .1189   .0407   .0724   .0705   .0671   .0384   1
  Rs    .1148   .2154   .1827   .0375   .0826   .0952   .0147   .0570   1
  Bs    .2156   .1773   .1305   .0540   .1686   .1173   .1606   .0383   .0714   1

derived from a test sequence named "Messiah". As shown in the figure, a large degree of correlation is present between not only CPEs (e.g., the left/right channel pair and the left-surround/right-surround channel pair) but also SCEs (e.g., the center channel) and any other channels.

The work presented in this research focuses on improving the compression performance for multichannel audio sources that belong to classes II and III. It will be demonstrated that the proposed MAACKLT algorithm not only achieves good
results for class III audio sources, but also improves the coding performance to a certain extent for class II audio sources compared with the original AAC.

Two test data sets are used to illustrate the de-correlation effect of KLT. One is a class III 10-channel audio piece called "Messiah"2, a piece of classical music recorded live in a concert hall, obtained from signals mixed from 16 microphones placed in various locations in the hall. The other is a class II 5-channel music piece called "Herre"3, which was used in the MPEG-2 AAC standard (ISO/IEC 13818-7) conformance work. These test sequences are chosen because they contain a diverse range of frequency components played by several different instruments, which makes them very challenging for inter-channel de-correlation and subsequent coding experiments. In addition, they provide good samples for result comparison between the original AAC and the proposed MAACKLT algorithm.

Figures 2.4 and 2.5 show the absolute values of elements in the lower triangular part of the normalized cross-covariance matrix after KLT for the 5-channel set "Herre" and the 10-channel set "Messiah". These figures clearly indicate that the KLT method achieves a high degree of de-correlation. Note that the non-diagonal elements are not exactly zero because we are dealing with an approximation of KLT during calculation. We predict that, by removing redundancy in the input audio with KLT, a much better coding performance can be achieved by encoding each channel independently, which will be verified in later sections.

2 The 10 channels include Center (C), Left (L), Right (R), Left Wide (Lw), Right Wide (Rw), Left High (Lh), Right High (Rh), Left Surround (Ls), Right Surround (Rs), and Back Surround (Bs).
3 The 5 channels include C, L, R, Ls, and Rs.

Figure 2.4: Absolute values of elements in the lower triangular normalized covariance matrix after KLT for 5-channel "Herre" (rows/columns are eigen-channels 1-5).

        1       2       3       4       5
  1     1
  2     .0006   1
  3     .0016   .0004   1
  4     .0013   .0004   .0026   1
  5     .0004   .0004   .0026   .0009   1

2.2.3 Energy Compaction Effect

The KLT pre-processing approach not only significantly de-correlates the input multichannel audio signals but also considerably compacts the signal energy into the first several eigen-channels. Figures 2.6 (a) and (b) show how energy accumulates with an increasing number of channels for the original audio channels and the de-correlated eigen-channels. As clearly shown in these two figures, energy accumulates much faster in the case of eigen-channels than original channels, which provides strong evidence of the data compaction of KLT.
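The accumulated-energy comparison can be sketched numerically (hypothetical NumPy code; the synthetic correlated channels stand in for the test sets):

```python
import numpy as np

def accumulated_energy(variances):
    """Cumulative fraction of total energy as channels are added,
    taking channels in decreasing-variance order."""
    v = np.sort(np.asarray(variances))[::-1]
    return np.cumsum(v) / np.sum(v)

rng = np.random.default_rng(0)
base = rng.standard_normal(8192)
# Five channels sharing a strong common component.
x = np.vstack([base + 0.3 * rng.standard_normal(8192) for _ in range(5)])

c = np.cov(x)
orig_curve = accumulated_energy(np.diag(c))            # original channels
eig_curve = accumulated_energy(np.linalg.eigvalsh(c))  # eigen-channels
# eig_curve rises much faster: here the first eigen-channel alone
# already carries most of the total energy.
```

Since the eigenvalues and the channel variances share the same total (the trace of the covariance matrix), both curves end at 1; the difference is entirely in how quickly they get there.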
This implies that, when transmitting data of a fixed number of channels with the proposed MAACKLT algorithm, more information content will be received at the decoder side, and better quality of the reconstructed multichannel audio can be achieved.

Figure 2.5: Absolute values of elements in the lower triangular normalized covariance matrix after KLT for 10-channel "Messiah" (rows/columns are eigen-channels 1-10).

        1       2       3       4       5       6       7       8       9       10
  1     1
  2     .0012   1
  3     .0006   .0009   1
  4     .0001   .0014   .0010   1
  5     .0004   .0019   .0014   .0003   1
  6     .0001   .0007   .0006   .0005   .0005   1
  7     .0005   .0017   .0012   .0000   .0005   .0011   1
  8     .0004   .0016   .0009   .0002   .0005   .0007   .0005   1
  9     .0002   .0009   .0008   .0004   .0001   .0006   .0001   .0007   1
  10    .0001   .0013   .0008   .0005   .0002   .0010   .0001   .0001   .0010   1

Another convenient way to measure the amount of data compaction is via the eigenvalues of the cross-covariance matrix associated with the KL-transformed data. In fact, these eigenvalues are nothing but the variances of the eigen-channels, and the variance of a set of signals reflects its degree of jitter, or its information content. Figures 2.7 (a) and (b) are plots of the variances of eigen-channels associated with the "Messiah" test set consisting of 10 and 5 channels, respectively.

Figure 2.6: Comparison of accumulated energy distribution for (a) 5-channel "Herre" and (b) 10-channel "Messiah".

As shown in the figures, the variance drops dramatically with the order of the eigen-channels. The steeper the variance drop, the more efficient the energy compaction achieved.
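This variance drop can be illustrated numerically (hypothetical NumPy code; the synthetic channels, dominated by one shared component, stand in for the test sets). The eigenvalues of the cross-covariance matrix give the eigen-channel variances directly:

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.standard_normal(8192)
# Ten synthetic channels dominated by one common component.
x = np.vstack([base + 0.3 * rng.standard_normal(8192) for _ in range(10)])

c = np.cov(x)
# Eigenvalues of the cross-covariance matrix are the eigen-channel
# variances; sort in decreasing order and normalize by the largest.
eig_var = np.sort(np.linalg.eigvalsh(c))[::-1]
norm_eig = eig_var / eig_var[0]

# Original channel variances, normalized the same way, stay nearly flat.
orig_var = np.sort(np.diag(c))[::-1]
norm_orig = orig_var / orig_var[0]
```

Here `norm_eig` falls off steeply after the first eigen-channel while `norm_orig` stays close to 1 across all ten channels, mirroring the log-scale variance plots of Figure 2.7.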
These experimental results also show that the energy compaction efficiency increases with the number of input channels. The area under the variance curve reflects the amount of information to be encoded. As illustrated in these two figures, this area is substantially smaller for the 10-channel set than for the 5-channel set. As the number of input channels decreases, the final compression performance of MAACKLT tends to be more influenced by the coding power of the AAC main profile encoder.

Figure 2.7: Normalized variances for (a) 10-channel "Messiah" and (b) 5-channel "Messiah", where the vertical axis is plotted in the log scale.

2.2.4 Frequency-Domain versus Time-Domain KLT

In all previous discussion, we considered only the case of applying KLT to time-domain signals across channels. However, it is also possible to apply the inter-channel de-correlation procedure after time-domain signals are transformed into the frequency domain via the MDCT (Modified Discrete Cosine Transform) in the AAC encoder.

One frame of the audio signal from the center channel of "Herre" in the frequency domain and in the time domain is shown in Figures 2.8 (a) and (b), respectively. The energy compaction property can be clearly seen from a simple comparison between the time-domain and frequency-domain plots. Generally speaking, applying KLT to frequency-domain signals achieves better performance than directly applying KLT to time-domain signals. In addition, a certain degree of delay and
reverberant sound copies may exist in time-domain signals among different channels, which is especially true for class III multichannel audio sources. The delay and reverberation effects impair the time-domain KLT's de-correlation capability; however, they may not have as much impact on frequency-domain signals. Figures 2.9 and 2.10 show the absolute values of the off-diagonal non-redundant elements of the normalized covariance matrices generated from frequency- and time-domain KL transforms with test audio "Herre" and "Messiah", respectively. Clearly, the frequency-domain KLT has a much better inter-channel de-correlation capability than the time-domain KLT. This implies that applying KLT to frequency-domain signals should lead to a better coding performance, which will be verified by the experimental results shown in Section 2.8. All results discussed hereafter use the frequency-domain KLT method unless otherwise mentioned.

Figure 2.8: (a) Frequency-domain and (b) time-domain representations of the center channel from "Herre".

Figure 2.9: Absolute values of off-diagonal elements for the normalized covariance matrix after (a) frequency-domain and (b) time-domain KL transforms with test audio "Herre".

2.3 Temporal Adaptive KLT

A multichannel audio program comprises different periods, each of which has its unique spectral signature. For example, a piece of music may begin with a piano prelude followed by a chorus. In order to achieve the highest information compactness, the de-correlation transform matrix should be adaptive to the characteristics of different periods.
In this section, we present a temporal-adaptive KLT approach, in which the covariance matrix (and, consequently, the corresponding KL transform matrix) is updated from time to time. Each adaptive period is called a "block". Figure 2.11 shows the variance of each eigen-channel for one non-adaptive and two temporal-adaptive approaches applied to test set "Messiah". Compared with the non-adaptive method, the adaptive method achieves a smaller variance for each eigen-channel. Furthermore, the shorter the adaptive period, the higher the inter-channel de-correlation achieved. The only drawback of the temporal-adaptive approach over the non-adaptive approach is the overhead bits, which have to be transmitted to the decoder so that the multichannel audio can be reconstructed into its original physical channels. Due to the increased number of blocks, the shorter the adaptive period, the larger the overhead bit rate. The trade-off between this "block" size and the overhead bit rate will be discussed below.

Figure 2.10: Absolute values of off-diagonal elements for the normalized covariance matrix after (a) frequency-domain and (b) time-domain KL transforms with test audio "Messiah".

Since the inverse KLT has to be performed at the decoder side, the information of the transform matrix should be included in the coded bitstream. As mentioned before, the inverse KLT matrix is the transpose of the forward KLT matrix, which is composed of eigenvectors of the cross-covariance matrix.
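The transpose relation can be demonstrated with a minimal sketch (toy random data; the 16-bit quantization of the covariance matrix described next is omitted here for clarity). The decoder is given only the covariance matrix, recomputes its eigenvectors, and inverts the transform by transposition:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 1024))   # one block of 5-channel audio samples
x[1:] += x[0]                        # introduce inter-channel correlation

# Encoder side: forward KLT with eigenvectors of the cross-covariance matrix.
cov = np.cov(x)
_, evecs = np.linalg.eigh(cov)
eigen_ch = evecs.T @ x               # de-correlated eigen-channels

# Only the covariance matrix travels in the bitstream; the decoder
# recomputes the same eigenvectors from it and inverts by transposition.
_, evecs_dec = np.linalg.eigh(cov)
x_rec = evecs_dec @ eigen_ch         # inverse KLT = multiply by the transpose

print(np.allclose(x, x_rec))         # prints True
```

Because the eigenvector matrix is orthogonal, multiplying by its transpose undoes the forward transform exactly (up to floating-point error).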
To reduce the overhead bit rate, elements of the covariance matrix are included in the bitstream instead of those of the KLT matrix: since the covariance matrix is real and symmetric, we only have to send the lower (or upper) triangular part that contains the non-redundant elements. As a result, the decoder also has to calculate the eigenvectors of the covariance matrix before the inverse KLT can be performed.

Figure 2.11: De-correlation efficiency of temporal adaptive KLT.

Only one covariance matrix has to be coded for the non-temporal-adaptive approach. However, for the temporal-adaptive approach, one covariance matrix must be coded for each block. Assume that n channels are selected for simultaneous inter-channel de-correlation, and the adaptive period is K seconds, i.e. each block contains K seconds of audio. The size of the covariance matrix is n x n, and the number of non-redundant elements is n x (n + 1)/2. In order to reduce the overhead bit rate, the floating-point covariance matrix is quantized to 16 bits per element. Therefore, the total bit requirement for each covariance matrix is 8n x (n + 1) bits, and the overhead bit rate r_overhead is

r_overhead = 8n(n + 1) / (nK) = 8(n + 1) / K

in bits per second per channel (bit/s/ch). The above equation suggests that the overhead bit rate increases approximately linearly with the number of channels. The overhead bit rate is, however, inversely proportional to the adaptive time (or the block size).

Figure 2.12: The overhead bit rate versus the number of channels and the adaptive period.
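The overhead expression can be evaluated directly. A small sketch, using the channel counts shown in Figure 2.12 and a 10-second adaptive period:

```python
def overhead_bit_rate(n_channels, adaptive_period_s):
    """Overhead in bit/s/ch for one 16-bit-quantized covariance matrix
    (n*(n+1)/2 non-redundant elements, 8*n*(n+1) bits total) per
    adaptive block of K seconds:
        r_overhead = 8*n*(n+1) / (n*K) = 8*(n+1) / K
    """
    n, k = n_channels, adaptive_period_s
    return 8.0 * n * (n + 1) / (n * k)

# Channel counts from Figure 2.12, at a 10-second adaptive period.
for n in (3, 5, 7, 10):
    print(f"{n:2d} channels: {overhead_bit_rate(n, 10.0):4.1f} bit/s/ch")
```

Even for 10 channels the result is 8.8 bit/s/ch, consistent with the text's observation that a 10-second block keeps the overhead below 10 bit/s/ch.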
Figure 2.12 illustrates the overhead bit rate for different channel numbers and block sizes. The optimal adaptive time is around 10 seconds. At this block size, inter-channel redundancy can be efficiently removed at a reasonable cost in overhead bits. From this figure, we know that a 10-second block generates less than 10 bit/s/ch of overhead. Compared with the typical bit rate of 64 kbit/s/ch, this overhead is so small that it can be neglected.

2.4 Eigen-Channel Coding and Transmission

2.4.1 Eigen-Channel Coding

The main profile of the AAC encoder is modified to compress the audio signals in the de-correlated eigen-channels. The detailed encoder block diagram is given in Figure 2.13, where the shaded parts represent coding blocks that differ from the original AAC algorithm. The major difference between Figure 2.13 and the original AAC encoder block diagram is the KLT block added after the filter bank. When the original input signals are transformed into the frequency domain, the cross-channel KLT is performed to generate the de-correlated eigen-channel signals. Masking thresholds are then calculated from the KL transformed signals in the perceptual model. The KLT related overhead information is sent into the bitstream afterwards.

The original AAC is typically used to compress class II audio sources. Its M/S stereo coding block is specifically used for symmetric CPEs. It encodes the mean and difference of CPEs instead of two independent SCEs, which reduces the redundancy existing in symmetric channel pairs. In the proposed algorithm, since inter-channel de-correlation has been performed in an earlier stage and audio signals after KLT are
from independent eigen-channels with little correlation between any channel pairs, the M/S coding block is no longer needed. Thus, the M/S coding block of the AAC main profile encoder is disabled.

Figure 2.13: The modified AAC encoder block diagram.

The AAC encoder module originally assigns an equal amount of bits to each input channel. However, since the signals entering the iteration loops are no longer the original multichannel audio in the new system, the optimality of this strategy has to be investigated. Experimental results indicate that the compression performance is strongly influenced by the bit assignment scheme for the de-correlated eigen-channels.

According to bit allocation theory [GG91], the optimal bit assignment for identically distributed normalized random variables under the high rate approximation, without nonnegativity or integer constraints on the bit allocations, is

b_i = b̄ + (1/2) log2(σ_i² / ρ²),   (2.2)

where b̄ = b/k is the average number of bits per parameter, k is the number of parameters, and ρ² = (∏_{i=1}^{k} σ_i²)^{1/k} is the geometric mean of the variances of the random variables. It is verified by experimental data that the normalized probability density functions of signals in eigen-channels are almost identical. They are given in Figures 2.14 and 2.15. This optimal bit allocation method is adopted for rate/distortion control when coding the eigen-channel signals.
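Equation (2.2) translates directly into code. The sketch below uses illustrative variances rather than measured eigen-channel statistics:

```python
import numpy as np

def optimal_bit_allocation(variances, avg_bits):
    """High-rate optimal bit allocation of Eq. (2.2):
        b_i = b_bar + 0.5 * log2(sigma_i^2 / rho^2),
    with rho^2 the geometric mean of the variances.  As in the cited
    theory, no integer or non-negativity constraints are applied.
    """
    v = np.asarray(variances, dtype=float)
    log_gm = np.mean(np.log2(v))          # log2 of the geometric mean
    return avg_bits + 0.5 * (np.log2(v) - log_gm)

# Eigen-channel variances fall off steeply after the KLT, so the early
# eigen-channels receive more than the average number of bits.
bits = optimal_bit_allocation([8.0, 2.0, 0.5, 0.125], avg_bits=4.0)
print(bits)    # -> [5.5 4.5 3.5 2.5]; the allocations sum to k * b_bar
```

Note that the total budget is preserved: the deviations from the average cancel because the geometric-mean term centers the log-variances.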
Figure 2.14: The empirical probability density functions of normalized signals in 5 eigen-channels generated from test audio "Herre".

2.4.2 Eigen-Channel Transmission

Figures 2.6 (a) and (b) show that the signal energy accumulates faster in eigen-channel form than in original multichannel form. This implies that, with a proper channel transmission and recovery strategy, when transmitting the same number of eigen-channels as of original multichannels, the eigen-channel approach should result in higher quality reconstructed audio since more energy is transmitted. It is desirable to re-organize the bitstream so that the bits of more important channels are received at the decoder side first for audio decoding. This should result in the best audio quality for a fixed amount of received bits. When this re-organized audio bitstream is transmitted over a heterogeneous network, for those users with a limited bandwidth, the network can drop packets belonging to less important channels.

Figure 2.15: The empirical probability density functions of normalized signals in the first 9 eigen-channels generated from test audio "Messiah".

The first instinct about the metric of channel importance would be the energy of the audio signal in each channel. However, this metric does not work well in general.
For example, for some multichannel audio sources, especially those belonging to class II, since they are re-produced artificially in a music studio, a side channel that normally does not contain the main melody may even have a larger energy than the center channel. Based on our experience with multichannel audio, loss or significant distortion of the main melody in the center channel would be much more annoying than loss of melodies in the side channels. In other words, the location of channels also plays an important role. Therefore, for a regular 5.1 channel configuration, the order of channel importance from greatest to least should be:

1. Center channel,
2. L/R channel pair,
3. Ls/Rs channel pair,
4. Low frequency channel.

Between channel pairs, their importance can be determined by their energy values. This rule is adopted in the experiments below.

After KLT, the eigen-channels are no longer the original physical channels, and sounds from different physical channels are mixed in every eigen-channel. Thus, the spatial dependency of eigen-channels is less trivial. We observe from experiments that although one eigen-channel may contain sounds from more than one original physical channel, there still exists a close correspondence between eigen-channels and physical channels. To be more precise, audio of eigen-channel 1 sounds similar to that of the center channel, audio of eigen-channels 2 and 3 sounds similar to that of the L/R channel pair, etc. Therefore, if eigen-channel 1 is lost in transmission, we would end up with a very distorted center channel. Moreover, it happens that eigen-channel 1 may sometimes not be a channel with very large energy, and it could easily be discarded if channel energy were adopted as the metric of channel importance.
Thus, the channel importance of eigen-channels should be similar to that of the physical channels. That is, eigen-channel 1 corresponds to the center channel, eigen-channels 2 and 3 correspond to the L/R channel pair, and eigen-channels 4 and 5 correspond to the Ls/Rs channel pair. Within each channel pair, the importance is still determined by their energy values.

2.5 Audio Concealment for Channel-Scalable Decoding

Consider the scenario in which an AAC-coded multichannel bitstream is transmitted over a heterogeneous network such as the Internet. For end-users who do not have enough bandwidth to receive full channel audio, some packets have to be dropped. In this section, we consider the bitstream of each channel as one minimum unit for audio reconstruction. When the bandwidth is not sufficient, we may drop the bitstreams of a certain number of channels to reduce the bit rate. This is called channel-scalable decoding, which has an analogy in MPEG video coding, i.e. dropping B frames while keeping only I and P frames.

For an AAC channel pair, the M/S stereo coding block replaces the low frequency coefficients of the symmetric channels with their mean and half-difference at the encoder, i.e.

spec_l[i] ← (spec_l[i] + spec_r[i]) / 2,   (2.3)
spec_r[i] ← (spec_l[i] − spec_r[i]) / 2,   (2.4)

where spec_l[i] and spec_r[i] are the ith frequency-domain coefficients in the left and right channels of the channel pair, respectively. The intensity coupling coding block replaces the high frequency coefficients of the left channel with a value proportional to the envelope of the sound signal in the symmetric channel, and sets the right channel high frequency coefficients to zero, i.e.
spec_l[i] ← (spec_l[i] + spec_r[i]) × √(E_l[sfb] / E_s[sfb]),   (2.5)
spec_r[i] ← 0,   (2.6)

where E_l[sfb], E_r[sfb] and E_s[sfb] represent, respectively, the energy values of the left channel, the right channel, and the sum of the left and right channels in the subband that sample i belongs to. The corresponding energy ratios are sent to the bitstream as scaling factors. At the decoder end, the low frequency coefficients of the left and right channels are reconstructed via

spec_l[i] ← spec_l[i] + spec_r[i],   (2.7)
spec_r[i] ← spec_l[i] − spec_r[i].   (2.8)

For high frequency coefficients, audio signals in the left channel remain the same as they are received from the bitstream, while those in the right channel are reconstructed via

spec_r[i] = scale × spec_l[i],   (2.9)

where scale is the inverse of the scaling factor.

When packets of one channel of a channel pair are dropped, we drop the frequency coefficients of the right channel while keeping all other side information, including the scaling factors. Therefore, what we receive at the decoder side are just the coefficients of the left channel. For low frequency coefficients, these correspond to the mean value of the original frequency coefficients of the left and right channels. For high frequency coefficients, they correspond to the energy envelope of the symmetric channel. That is, we have

spec_l[i] → (spec_l[i] + spec_r[i]) / 2,   (2.10)
spec_r[i] → 0,   (2.11)

for the low frequency part and

spec_l[i] → (spec_l[i] + spec_r[i]) × √(E_l[sfb] / E_s[sfb]),   (2.12)
spec_r[i] → 0,   (2.13)

for the high frequency part.

Note that since the scaling factors are contained in the received bitstream, the reconstruction of high frequency coefficients in the right channel remains the same as in the original AAC once data of all channels are received. Therefore, only the low frequency coefficients in the right channel need to be recovered.
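The low-frequency M/S relations and the mean-value concealment described above can be sketched as follows (toy coefficient arrays are assumed for illustration; the high-frequency intensity path is omitted):

```python
import numpy as np

def ms_encode_low(spec_l, spec_r):
    """M/S coding of the low band: store the mean and half-difference."""
    return (spec_l + spec_r) / 2.0, (spec_l - spec_r) / 2.0

def ms_decode_low(spec_m, spec_s):
    """Normal decoding: sum and difference restore the original pair."""
    return spec_m + spec_s, spec_m - spec_s

def conceal_low(spec_m):
    """Concealment: with the difference channel dropped, both output
    channels fall back to the received mean coefficients."""
    return spec_m.copy(), spec_m.copy()

spec_l = np.array([1.0, -0.5, 0.25, 2.0])
spec_r = np.array([0.8, -0.3, 0.35, 1.6])

spec_m, spec_s = ms_encode_low(spec_l, spec_r)
assert np.allclose(ms_decode_low(spec_m, spec_s), (spec_l, spec_r))

rec_l, rec_r = conceal_low(spec_m)    # right-channel packets lost
# Each concealed channel equals the mean of the original pair.
print(np.allclose(rec_r, (spec_l + spec_r) / 2.0))   # prints True
```

This makes the concealment trade-off concrete: the pair collapses to its mid signal, so the stereo image is lost but the content survives.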
The strategy used to reconstruct these coefficients is simply to set the values of the right channel coefficients equal to the values of the received left channel coefficients. This is nothing else but the mean value of the coefficients of the original channel pair, i.e.

spec_r[i] = spec_l[i] → (spec_l[i] + spec_r[i]) / 2.   (2.14)

Audio concealment for the proposed eigen-channel coding scheme is relatively simple. All coefficients in dropped channels are set to 0, and then a regular decoding process is performed to reconstruct the full multichannel audio. In the situation where packets of two or more channels are dropped, a reconstructed dropped channel may have a much smaller energy than the other channels after the inverse KLT. In order to obtain better reconstructed audio quality, an energy boost-up process can be enforced so that the signal in each channel has a similar amount of energy.

To illustrate that the proposed MAACKLT algorithm has a better quality-degradation property than AAC (via the audio concealment process described in this section), we perform experiments with lossy channels, where packets are dropped from a coded bitstream, in Section 2.8.

2.6 Compression System Overview

The block diagram of the proposed compression system is illustrated in Figure 2.16. It consists of four modules: (1) data partitioning, (2) Karhunen-Loeve transform, (3) dynamic range control, and (4) the modified AAC main profile encoder. In the data partitioning module, audio signals in each channel are partitioned into sets of non-overlapping intervals, i.e. blocks. Each block contains K frames, where K is a pre-defined value. Then, data in each block are sequentially fed into the KLT module to perform inter-channel de-correlation. In the KLT module, multichannel block data are de-correlated to produce a set of statistically independent eigen-channels.
The KLT matrix consists of eigenvectors of the cross-covariance matrix associated with the multichannel block set. The covariance matrix is first estimated and then quantized to 16 bits per element. The quantized covariance coefficients are sent to the bitstream as overhead.

As shown in Figure 2.1, eigen-channels are generated by multiplying the KLT matrix with the block data set. Therefore, after the transform, the sample values in eigen-channels may have a larger dynamic range than those of the original channels. To avoid any possible data overflow in the later compression module, data in eigen-channels are rescaled in the dynamic range control module so that the sample values input to the modified AAC encoder module do not exceed the dynamic range of regular 16-bit PCM audio files. This rescaling information is also sent to the bitstream as overhead.

Figure 2.16: The block diagram of the proposed MAACKLT encoder.
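A minimal sketch of the dynamic range control mapping follows. The per-channel peak-based scale factor and the unquantized floating-point side information are assumptions for illustration; the text does not specify the exact mapping:

```python
import numpy as np

PCM16_MAX = 32767.0

def map_to_16bit(eigen_ch):
    """Rescale each eigen-channel into the 16-bit PCM range.  The
    per-channel scale factors are the 'mapping info' sent as overhead
    (kept as plain floats here; their coding is not specified in the text).
    """
    peaks = np.max(np.abs(eigen_ch), axis=1, keepdims=True)
    scales = peaks / PCM16_MAX
    mapped = np.round(eigen_ch / scales).astype(np.int16)
    return mapped, scales

def inverse_map(mapped, scales):
    """Decoder side: restore the original dynamic range."""
    return mapped.astype(float) * scales

rng = np.random.default_rng(2)
eigen_ch = 1e5 * rng.standard_normal((5, 1024))  # KLT output can exceed 16 bits

mapped, scales = map_to_16bit(eigen_ch)
restored = inverse_map(mapped, scales)
print(mapped.dtype, np.max(np.abs(restored - eigen_ch)))
```

The rounding step introduces only a tiny error relative to the signal amplitude, so the modified AAC encoder can treat the mapped data like ordinary 16-bit PCM.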
Signals in the de-correlated eigen-channels are compressed in the next module by a modified AAC main profile encoder. The AAC main profile encoder is modified in our algorithm so that it is more suitable for compressing the audio signals in eigen-channels. To enable channel scalability, a transmission strategy control block is adopted in this module right before the compressed bitstream is formed.

The block diagram of the decoder is shown in Figure 2.17. The mapping information and the covariance matrix, together with the coded information for the eigen-channels, are extracted from the received bitstream. If data of some eigen-channels are lost due to the network condition, the eigen-channel concealment block is enabled. Then, the signal values in the eigen-channels are reconstructed by the AAC main profile decoder. The mapping information is used to restore the decoded eigen-channels from the 16-bit dynamic range back to their original range. The inverse KLT matrix can be calculated from the extracted covariance matrix by transposing its eigenvectors. Then, the inverse KLT is performed to generate the reconstructed multichannel block sets. These block sets are finally combined to produce the reconstructed multichannel audio signals.

Figure 2.17: The block diagram of the proposed MAACKLT decoder.

2.7 Complexity Analysis

Compared with the original AAC compression algorithm, the additional computational complexity required by the MAACKLT algorithm mainly comes from the KLT
pre-processing module, which includes the generation of the cross-covariance matrix, the calculation of its eigenvalues and eigenvectors, and the matrix multiplication required by the KLT.

Table 2.1: Comparison of computational complexity between MAACKLT and AAC.

Encoding             Time used (seconds)
                     MAACKLT   AAC      Extra Time   Percent
Messiah 1-sec AP     344.28    318.13   26.15         8.2%
Messiah 5-sec AP     340.43    318.13   22.30         7.0%
Messiah 10-sec AP    339.44    318.13   21.31         6.7%
Messiah NonA         337.62    318.13   19.49         6.1%
Herre NonA           112.15    101.23   10.92        10.8%

Decoding             Time used (seconds)
                     MAACKLT   AAC      Extra Time   Percent
Messiah 1-sec AP      16.92     12.30    4.62        37.6%
Messiah 5-sec AP      16.04     12.30    3.74        30.4%
Messiah 10-sec AP     15.60     12.30    3.30        26.8%
Messiah NonA          14.66     12.30    2.36        19.2%
Herre NonA             2.75      2.42    0.33        13.6%

Table 2.1 lists the running times of MAACKLT and AAC for both the encoder and the decoder at a typical bit rate of 64 kbit/s/ch, where "n-sec AP" denotes the MAACKLT algorithm with a temporal adaptive period of n seconds and "NonA" denotes the non-adaptive MAACKLT algorithm. The input test audio signals are the 20-second 10-channel "Messiah" and the 8-second 5-channel "Herre". The system used to generate the above results is a Pentium III 600 PC with 128 MB RAM.

These results indicate that the coding time of MAACKLT is still dominated by the AAC compression and de-compression parts. When the optimal 10-second temporal adaptive period is used for test audio "Messiah", the additional KLT computational time is less than 7% of the total encoding time at the encoder side, while the MAACKLT algorithm takes only about 26.8% longer than the original AAC at the decoder side. The MAACKLT algorithm with a shorter adaptive period takes a little more time in encoding and decoding since more KL transform matrices need to be generated. Note also that we have not made any attempt to optimize our experimental code.
A much lower encoding/decoding time for MAACKLT is expected if the source code for the KLT pre-processing part is carefully re-written to optimize the performance.

In channel-scalable decoding, when packets belonging to less important channels are dropped during transmission over the heterogeneous network, the audio concealment part adds a negligible amount of additional complexity to the MAACKLT decoder. The decoding time remains about the same as that of regular bit rate decoding at 64 kbit/s/ch when all packets are received at the decoder side.

2.8 Experimental Results

2.8.1 Multichannel Audio Coding

The proposed MAACKLT algorithm has been implemented and tested under the PC Windows environment. We added an inter-channel redundancy removal block and a channel transmission control block to the basic source code structure of MPEG-2 AAC [ISOa, ISOc]. The proposed algorithm is conveniently parameterized to accommodate various input parameters, such as the number of audio channels, the desired bit rate, and the window size of temporal adaptation.

We have tested the coding performance of the proposed MAACKLT algorithm using three 10-channel audio data sets, "Messiah", "Band" and "Herbie", and one 5-channel audio data set, "Herre", at a typical rate of 64 kbit/s/ch. Test materials "Messiah" and "Band" are class III audio files, while "Herbie" and "Herre" are class II audio files. Figures 2.18 (a) and (b) show the mean Mask-to-Noise-Ratio (MNR) comparison between the original AAC⁴ and the MAACKLT scheme for the 10-channel set "Herbie" and the 5-channel set "Herre", respectively. The mean MNR values in these figures are calculated via

mean MNR_subband = (Σ_channel MNR_channel,subband) / (number of channels).   (2.15)

The mean MNR improvement shown in these figures is calculated via

mean MNR improvement = (Σ_subband (mean MNR_subband^MAACKLT − mean MNR_subband^AAC)) / (number of subbands).   (2.16)

The experimental results shown in Figures 2.18 (a) and (b) are generated using the frequency-domain non-adaptive KLT method. These plots clearly indicate that MAACKLT outperforms AAC in the objective MNR measurement for most subbands and achieves a mean MNR improvement of more than 1 dB for both test audio sets. This implies that, compared with AAC, MAACKLT can achieve a higher compression ratio while maintaining similar indistinguishable audio quality. It is worthwhile to mention that no software optimization has been performed for any codec used in this section, and all coding blocks adopted from AAC have not been modified to improve the performance of our codec.

⁴All audio files generated by AAC in this section are processed by the AAC main profile encoder and decoder.

Figure 2.18: The MNR comparison for (a) 10-channel "Herbie" using frequency-domain KLT, (b) 5-channel "Herre" using frequency-domain KLT, and (c) 5-channel "Herre" using time-domain KLT.

Figure 2.18 (c) shows the mean MNR comparison between AAC and MAACKLT with the time-domain KLT method using the 5-channel set "Herre". Compared with
the result shown in Figure 2.18 (b), we confirm that frequency-domain KLT achieves a better coding performance than time-domain KLT.

The experimental result for the temporal-adaptive approach for the 10-channel set "Messiah" is shown in Figure 2.19. This result verifies that a shorter adaptive period de-correlates the multichannel signal better but sacrifices coding performance by adding overhead to the bitstream. On the other hand, if the covariance matrix is not updated frequently enough, inter-channel redundancy cannot be removed to the largest extent. As shown in the figure, to balance these two constraints, the optimal adaptive period for "Messiah" is around 10 seconds.

Figure 2.19: The mean MNR improvement for temporal-adaptive KLT applied to the coding of 10-channel "Messiah", where the overhead information is included in the overall bit rate calculation.

2.8.2 Audio Concealment with Channel-Scalable Coding

As described in Section 2.5, when packets of one channel from a channel pair are lost, we can conceal the missing channel at the decoder side. Experimental results show that the quality of the recovered channel pair from the AAC bitstream is much worse than that from the MAACKLT bitstream when it is transmitted under the same network condition. Take the test audio "Herre" as an example. If one channel of the L/R channel pair is lost, the reconstructed R channel using the AAC bitstream has obvious distortion and discontinuity in several places, while the reconstructed right channel using the MAACKLT bitstream has little distortion and is much smoother. If one channel of the Ls/Rs channel pair is lost, the reconstructed Rs channel using the AAC bitstream has larger noise in the first one to two seconds in comparison with that of MAACKLT.
The corresponding MNR values are compared in Figures 2.20 (a) and (b), where AAC and MAACKLT are used and the missing channels are concealed, when packets of one channel from the L/R and Ls/Rs channel pairs are lost. We see clearly that MAACKLT achieves better MNR values than AAC by about 2 dB per subband in both cases.

For a typical 5.1 channel configuration, when packets of more than two channels are dropped, which implies that at least one channel pair's information is lost, some lost channels can no longer be concealed from the received AAC bitstream. In contrast, the MAACKLT bitstream can still be concealed to obtain full 5.1 channel audio of poorer quality.

Figure 2.20: MNR comparison for 5-channel "Herre" when packets of one channel from the (a) L/R and (b) Ls/Rs channel pairs are lost.

Although the recovered channel pairs do not sound exactly the same as the original ones, reconstructed full multichannel audio gives the listener a much better acoustical effect than three-channel or mono audio. Take the 5-channel "Messiah", which includes the C, L, R, Ls and Rs channels, as an example. In the worst case, when packets of four channels are dropped and only data of the most important channel are received at the decoder side, the MAACKLT algorithm can still recover 5-channel audio. Compared with the original sound, the recovered Ls and Rs channels lose most of the reverberant sound effect. This is because the inverse KLT can only recover the information present in the received channels. Since eigen-channel 1 does not contain much reverberant sound, the MAACKLT decoder can hardly recover these reverberant sound effects in the Ls and Rs channels.
Similar experiments were also performed using the test audio "Herre". However, the advantage of MAACKLT over AAC is not as obvious as with the test audio "Messiah". The reason can easily be found from the original covariance matrix shown in Figure 2.2. It indicates that little correlation exists between SCE and CPE for class II test audio such as "Herre". Thus, once one CPE is lost, little information can be recovered from other CPEs or SCEs.

2.8.3 Subjective Listening Test

In order to further confirm the advantage of the proposed algorithm, a formal subjective listening test according to ITU recommendations [111, 128a, 128b] was conducted in an audio lab to compare the coding performance of the proposed MAACKLT algorithm with that of the MPEG AAC main profile codec. At the bit rate of 64 kbit/s/ch, the reconstructed sound clips are supposed to have quality indistinguishable from the original ones, which means that the difference between MAACKLT and AAC would be small enough that non-professionals could hardly tell. Therefore, instead of inviting a large number of non-expert listeners, four well-trained professionals participated in the listening test [128b]. During the test, for each test sound clip, subjects listened to three versions of the same clip, i.e. the original one followed by two processed ones (one by MAACKLT and one by AAC, in random order). Subjects were allowed to listen to these files as many times as they wished until they were comfortable giving scores to the two processed sound files for each test material. The five-grade impairment scale given in Recommendation ITU-R BS.1284 [128a] was adopted in the grading procedure and utilized for the final data analysis. Four multichannel audio materials, i.e.
"Messiah", "Band", "Herbie" and "Herre", are all used in this subjective listening test. According to ITU-R BS.1116-1 [111], audio files selected for a listening test should only contain short durations (10 to 20 seconds long), so all test files coded by MAACKLT are generated by the non-adaptive frequency-domain KLT method. Figure 2.21 shows the listening test results, where bars represent the score given to each test material coded at 64 kbit/s/ch. The dark shaded area on the top of each bar represents the 95% confidence interval, where the middle line shows the mean value and the other two lines at the boundary of the dark shaded area represent the upper and lower confidence limits [RADH87].

Figure 2.21: Subjective listening test results (A = MPEG AAC, M = MAACKLT, for each of "Messiah", "Band", "Herbie" and "Herre").

It is clear from Figure 2.21 that the proposed MAACKLT algorithm outperforms MPEG AAC in all four test materials.

2.9 Conclusion

We presented a new channel-scalable high-fidelity multichannel audio compression scheme called MAACKLT based on the existing MPEG-2 AAC codec. This algorithm explores the inter- and inner-channel correlation in the input audio signal and allows channel-scalable decoding. The compression technique utilizes KLT in the pre-processing stage to remove the inter-channel redundancy, then compresses the resulting relatively independent eigen-channel signals with a modified AAC main profile encoder module, and finally uses a prioritized transmission policy to achieve quality scalability. The novelty of this technique lies in its unique and desirable capability to adaptively vary the characteristics of the inter-channel de-correlation transform as a function of the covariance of a short period of music, and in its ability to reconstruct audio of different quality levels from a single bitstream.
It achieves a good coding performance especially for input audio sources whose channel number goes beyond 5.1. In addition, it outperforms AAC according to both objective (MNR measurement) and subjective (listening) tests at the typical low bit rate of 64 kbit/s/ch while maintaining a similar computational complexity for both the encoder and decoder modules. Moreover, compared with AAC, the channel-scalable property of MAACKLT enables the decoder to conceal lost channels and deliver full multichannel audio of reasonable quality without any additional cost.

Chapter 3

Adaptive Karhunen-Loeve Transform and its Quantization Efficiency

3.1 Introduction

Based on today's most distinguished multichannel audio coding system, a Modified Advanced Audio Coding with Karhunen-Loeve Transform (MAACKLT) method was proposed in Chapter 2 to perceptually losslessly compress a multichannel audio source. This method utilizes the Karhunen-Loeve Transform (KLT) in the pre-processing stage of the powerful multichannel audio compression tool, i.e. MPEG Advanced Audio Coding (AAC), to remove inter-channel redundancy and further improve the coding performance. However, as described in Chapter 2, each element of the covariance matrix, from which the KLT matrix is derived, is scalar quantized to 16 bits. This results in a 240-bit overhead for each KL transform matrix for typical 5-channel audio contents. Since the bit budget is the most precious resource in the coding technique, every effort must be made to minimize the overhead due to the additional pre-processing stage while maintaining a similarly high-quality coding performance.
Moreover, the original MAACKLT algorithm did not fully explore the KLT temporal adaptation effect. In this research, we investigate the KLT de-correlation efficiency versus the quantization accuracy and the temporal KLT adaptive period. Extensive experiments on the quantization of the covariance matrix by using scalar and vector quantizers have been performed. Based on these results, the following two interesting points are concluded.

• Coarser quantization can dramatically degrade the de-correlation capability in terms of the normalized covariance matrix of de-correlated signals. However, the degradation of the decoded multichannel audio quality is not as obvious.

• Shorter temporal adaptation of KLT does not significantly improve the de-correlation efficiency while it considerably increases the overhead. Thus, a moderately long adaptation time is a good choice.

It is shown in this work that, with vector quantization, we can reduce the overhead from more than 200 bits to less than 3 bits per KL transform while maintaining comparable coding performance. Even with scalar quantization, a much lower overhead bit rate can still generate decoded audio of comparable quality. Our experimental results indicate that although a coarser quantization of the covariance matrix gives a poorer de-correlation effect, the reduction of bits in the overhead is able to compensate for this degradation, resulting in a similar coding performance in terms of the objective MNR measurement. The rest of this chapter¹ is organized as follows. In Section 3.2, we introduce vector quantization and its application to the MAACKLT algorithm.
In Sections 3.3 and 3.4, we explore how the quantization method and the temporal adaptive scheme affect the KLT de-correlation efficiency and the coding performance by applying scalar and vector quantizers to encode the KLT matrix at a range of different bit rates. In Section 3.5, we examine computational complexity issues. Some experimental results are presented in Section 3.6. Finally, concluding remarks are given in Section 3.7.

3.2 Vector Quantization

The MAACKLT algorithm described in Chapter 2 dealt only with scalar quantization of the covariance matrix. If the input audio material is short or if the KLT matrix is updated more frequently, the overhead that results from transmitting the covariance matrix will increase significantly, which will degrade the coding performance of the MAACKLT algorithm to a certain extent. To alleviate this problem, we have to resort to a look-up-table (LUT) approach. Here, a stored table of pre-calculated covariance matrices is searched to find the one that approximates the estimated covariance matrix of the current block of the input audio. This approach

¹Part of this chapter represents work published before; see [YAKK01a].

yields substantial savings in the overhead bit rate since only pointers to the table, instead of the entire covariance matrix itself, will be transmitted to the receiver. Vector quantization (VQ) [GG91, PA93, Equ89] provides an excellent choice to implement the LUT idea. In vector quantization, we identify a set of possible vectors, called the codebook, at both the encoder and the decoder side. The VQ encoder pairs up each source vector with the closest matching vector (i.e. "codeword") in the codebook, thus "quantizing" it.
The actual encoding is then simply a process of sequentially listing the identities of the codewords that match most closely the vectors making up the original data. The decoder has a codebook identical to the encoder's, and decoding is a trivial matter of piecing together the vectors whose identities have been specified. Vector quantizers in this work consider the entire set of non-redundant elements of each covariance matrix as an entity, or a vector. The identified codebook should be general enough to include the characteristics of different types of multichannel audio sources. Since VQ allows for direct minimization of the quantization distortion, it results in smaller quantization distortion than scalar quantizers (SQ) at the same bit rate. In other words, VQ demands a smaller number of bits for source data coding while keeping the quantization error similar to that achieved with a scalar quantizer. Four different five-channel audio pieces (each containing center (C), left (L), right (R), left surround (Ls) and right surround (Rs) channels) are used to generate more than 80,000 covariance matrices. Each covariance matrix is treated as one training vector X, which is composed of the fifteen non-redundant elements of the covariance matrix as shown below:

x_1
x_2   x_3
x_4   x_5   x_6
x_7   x_8   x_9   x_10
x_11  x_12  x_13  x_14  x_15

where x_1, x_2, ..., x_15 are elements in the lower triangular part of the covariance matrix. During the codebook generation procedure, the Generalized Lloyd Algorithm (GLA) was run on the training sequence by using the simple squared-error distortion measurement, i.e.

d(X, Q(X)) = Σ_{i=1}^{15} [x_i − Q(X)_i]²,

where Q(X) represents the quantized value of X. The same distortion measurement is used with the full-search method during the encoding procedure.
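The GLA training and full-search encoding just described can be sketched as follows. This is a minimal pure-Python illustration on generic vectors (such as the 15-element covariance vectors above); the function names are ours, not from the MAACKLT implementation:

```python
import random

def vq_encode(v, codebook):
    """Full search: return the index of the codeword with minimum
    squared-error distortion to the source vector v."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))

def train_codebook(vectors, size, iterations=20, seed=0):
    """Generalized Lloyd Algorithm: alternate nearest-codeword
    assignment and centroid update under squared-error distortion."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            cells[vq_encode(v, codebook)].append(v)
        for k, cell in enumerate(cells):
            if cell:  # the centroid minimizes squared error over its cell
                codebook[k] = [sum(xs) / len(cell) for xs in zip(*cell)]
    return codebook
```

The decoder side is then just the table lookup `codebook[index]`, matching the LUT idea described above.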
3.3 Efficiency of KLT De-Correlation

The magnitudes of the non-diagonal elements in a normalized covariance matrix provide a convenient metric to measure the degree of inter-channel correlation. The normalized covariance matrix is derived from the cross-covariance matrix by multiplying each coefficient with the reciprocal of the square root of the product of the corresponding individual variances, i.e.

C_N(i, j) = C(i, j) / sqrt(C(i, i) C(j, j)),

where C_N(i, j) and C(i, j) are elements of the normalized covariance matrix and the cross-covariance matrix in row i and column j, respectively. Tables 3.1 and 3.2 show the absolute values of the non-redundant elements (i.e. elements in only the lower or the upper triangle) of the normalized covariance matrix calculated from original signals and from KLT de-correlated signals, respectively, where no quantization is performed during the KLT de-correlation. From these tables, we can easily see that KLT reduces the inter-channel correlation from around the order of 10^-1 to the order of 10^-4.

Table 3.1: Absolute values of non-redundant elements of the normalized covariance matrix calculated from original signals.

1
5.36928147e-1  1
3.26056331e-1  1.02651220e-1  1
1.17594877e-1  8.56662289e-1  5.12340667e-3  1
7.46899187e-2  1.33213668e-1  1.15962389e-1  6.55651089e-2  1

In order to investigate how the de-correlation efficiency is affected by various quantization schemes, a sequence of experiments, including SQ and VQ with a different

Table 3.2: Absolute values of non-redundant elements of the normalized covariance matrix calculated from KLT de-correlated signals.
1
1.67971275e-4  1
2.15059591e-4  1.01530173e-3  1
4.19255484e-4  4.03864289e-4  2.56863610e-4  1
3.07486032e-4  4.23535476e-4  3.48484672e-4  5.20389082e-5  1

Table 3.3: Absolute values of non-redundant elements of the normalized covariance matrix calculated from scalar quantized KLT de-correlated signals.

1
1.67971369e-4  1
2.15059518e-4  1.01530166e-3  1
4.19255341e-4  4.03863772e-4  2.56863464e-4  1
3.07486076e-4  4.23536876e-4  3.48484820e-4  5.20396538e-5  1

number of bits per element/vector, was performed. Table 3.3 shows the absolute values of the non-redundant elements of the normalized covariance matrix calculated from KLT de-correlated signals, where each element of the covariance matrix is scalar quantized into 32 bits. The values in Table 3.3 are almost identical to those in Table 3.2, with less than 0.0001% distortion per element. This suggests that, with a 32-bit-per-element scalar quantizer, we can almost faithfully reproduce the covariance matrix with negligible quantization error. Figures 3.1 and 3.2 illustrate how the de-correlation efficiency and the corresponding overhead change with SQ and VQ, respectively. It is observed that the simple Mean Square Error (MSE) measurement is not a good choice when evaluating the de-correlation efficiency. A better measure is the average distortion D, which is the summation of the magnitudes of the lower triangular elements of the normalized covariance matrix, i.e.

D = Σ_{i=2}^{n} Σ_{j=1}^{i−1} |C_N(i, j)|,   (3.1)

where C_N is the normalized covariance matrix of the signals after KLT de-correlation. The overhead in terms of bits per KLT matrix is calculated via

OH_S = B_S × N,   (3.2)
OH_V = B_V,   (3.3)

where B_S denotes the number of bits per element for SQ, N is the number of non-redundant elements per matrix, and B_V denotes the number of bits per codeword for VQ (recall that one KLT matrix is quantized into one codeword).
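The normalized covariance matrix, the average distortion D of Equation (3.1), and the per-matrix overheads of Equations (3.2)-(3.3) are all straightforward to compute; the following sketch uses illustrative names of our own:

```python
def normalized_covariance(C):
    """C_N(i, j) = C(i, j) / sqrt(C(i, i) * C(j, j))."""
    n = len(C)
    return [[C[i][j] / (C[i][i] * C[j][j]) ** 0.5 for j in range(n)]
            for i in range(n)]

def average_distortion(CN):
    """Eq. (3.1): sum of |C_N(i, j)| over the strictly lower triangle."""
    return sum(abs(CN[i][j]) for i in range(1, len(CN)) for j in range(i))

def overhead_bits_sq(n, bits_per_element):
    """Eq. (3.2): OH_S = B_S * N, with N = n(n+1)/2 non-redundant elements."""
    return bits_per_element * n * (n + 1) // 2

def overhead_bits_vq(bits_per_codeword):
    """Eq. (3.3): OH_V = B_V (one codeword per KLT matrix)."""
    return bits_per_codeword
```

For n = 5 and 16 bits per element this reproduces the 240-bit overhead cited in Section 3.1.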
For 5-channel audio material, N is equal to 15. Figure 3.1 (a) suggests that there is no significant degradation in de-correlation efficiency when the number of bits is reduced from 32 bits per element to 14 bits per element. However, a further reduction in the number of bits per element results in a dramatic increase of the distortion D given in Equation (3.1). From Equation (3.2), we know that the overhead increases linearly as the number of bits per element increases, with a gradient equal to N. This is confirmed by Figure 3.1 (b), in which the overhead is plotted as a function of the number of bits per element in the logarithmic scale. Compared with Figure 3.1 (a), we observe that the overhead OH increases much more rapidly than the distortion D decreases when the number of bits per element increases from 14 to 32. This indicates that, when transmitting the covariance matrix with a higher bit rate, the improvement in de-correlation efficiency is actually not sufficient to compensate for the loss due to a higher overhead rate.

Figure 3.1: (a) The de-correlation efficiency and (b) the overhead bit rate versus the number of bits per element in SQ.

The minimum number of bits per element for SQ is 2, since we need one bit for the sign and at least one bit for the absolute value of each element. A further reduction in the number of bits per element can be achieved by using VQ. Figure 3.2 (a) illustrates that the average distortion D increases almost linearly when the number of bits per vector decreases from 16 bits per vector to 7 bits per vector, and then slows down when the bits per vector further decrease. Compared with Figure 3.1, it is verified that VQ results in smaller quantization distortion than SQ at any given bit rate.
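The text notes that SQ spends one sign bit plus at least one magnitude bit per element. As a minimal illustration of such a sign-magnitude scalar quantizer, consider the sketch below; the uniform mapping and the range parameter `xmax` are our assumptions, not the dissertation's exact design:

```python
def sq_quantize(x, bits, xmax=1.0):
    """Sign-magnitude scalar quantizer: 1 sign bit plus (bits - 1)
    uniform magnitude bits over [0, xmax].  Returns (sign, magnitude)."""
    levels = (1 << (bits - 1)) - 1          # largest magnitude index
    mag = min(levels, round(abs(x) / xmax * levels))
    return (1 if x < 0 else 0, mag)

def sq_dequantize(sign, mag, bits, xmax=1.0):
    """Map the code pair back to a reconstruction value."""
    levels = (1 << (bits - 1)) - 1
    value = mag * xmax / levels
    return -value if sign else value
```

Note that for bits = 2 this quantizer degenerates to the three values {0, +xmax, -xmax}, consistent with the footnote to Table 3.4 about the 2-bit case.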
Figure 3.2 (b) shows how the overhead varies with the number of bits per

Figure 3.2: (a) The de-correlation efficiency and (b) the overhead bit rate versus the number of bits per vector in VQ.

covariance matrix. Note that VQ reduces the overhead bit rate by more than a factor of N (which is 15 for 5-channel audio) with respect to SQ. Tables 3.4 and 3.5 show the average distortion D, the overhead information and the average MNR value for SQ and VQ, respectively, where the average MNR value is calculated as below:

mean MNR_subband = (Σ_channel MNR_channel,subband) / (number of channels),   (3.4)

average MNR = (Σ_subband mean MNR_subband) / (number of subbands).   (3.5)

The test audio material used to generate the results in this section is a 5-channel performance of "Messiah" with the KLT matrix updated every second. As shown in Table 3.4, a smaller number of bits per element results in a higher distortion in

Table 3.4: De-correlation results with SQ.

bit/element   D         Overhead (bit/matrix)   Ave MNR (dB/sb)
2             2.14      30                      N/A (a)
4             9.13e-1   60                      56.56
6             2.91e-1   90                      56.37
8             7.29e-2   120                     56.02
10            2.05e-2   150                     56.08
12            7.36e-3   180                     56.00
14            5.24e-3   210                     55.93
16            4.93e-3   240                     55.91
32            4.89e-3   480                     55.84

(a) Using 2 bits per element, which quantizes each element into values of either 0 or ±1, leads to problems in later compression steps.

exchange for a smaller overhead and, ultimately, a larger MNR value.
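Equations (3.4) and (3.5) are two nested averages and can be transcribed directly; in this sketch the names are illustrative and `mnr[ch][sb]` stands for the MNR of channel `ch` in subband `sb` (in dB):

```python
def mean_mnr_per_subband(mnr):
    """Eq. (3.4): average the MNR across channels for each subband."""
    n_ch = len(mnr)
    n_sb = len(mnr[0])
    return [sum(mnr[ch][sb] for ch in range(n_ch)) / n_ch
            for sb in range(n_sb)]

def average_mnr(mnr):
    """Eq. (3.5): average the per-subband means across subbands."""
    per_sb = mean_mnr_per_subband(mnr)
    return sum(per_sb) / len(per_sb)
```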
Thus, although a smaller number of bits per element of the covariance matrix results in a larger average distortion D, the decrease of the overhead bit rate actually compensates for this distortion and improves the MNR value at the same coding rate. A similar argument applies to the VQ case, as shown in Table 3.5, with only one minor difference. That is, the MNR value is a nearly monotonic function for the SQ case, while it moves up and down slightly in a local region (fluctuating within 1 dB per subband) for the VQ case. However, the general trend is the same, i.e. a larger overhead in KLT coding degrades the final MNR value. We also noticed that even when using 1 bit per vector, vector quantization of the covariance matrix still gives good MNR results.

Table 3.5: De-correlation results with VQ.

bit/vector   D       Overhead (bit/matrix)   Ave MNR (dB/sb)
1            1.92    1                       56.73
2            1.87    2                       56.61
3            1.81    3                       56.81
4            1.78    4                       56.87
5            1.73    5                       56.12
6            1.62    6                       56.23
7            1.58    7                       56.88
8            1.47    8                       56.96
9            1.37    9                       56.42
10           1.28    10                      55.97
11           1.16    11                      56.08
12           1.04    12                      56.28
13           0.948   13                      55.83
14           0.848   14                      55.72
15           0.732   15                      56.19
16           0.648   16                      55.87

Our conclusion is that it is beneficial to reduce the overhead bit rate used in the coding of the covariance matrix of KLT, since a larger overhead has a negative impact on the rate-distortion tradeoff.

3.4 Temporal Adaptation Effect

A multichannel audio program in general comprises several different periods, each of which has its unique spectral signature. For example, a piece of music may begin
The MAACKLT algorithm utilizes a temporal-adaptive approach, in which the covariance matrix is updated frequently. On one hand, the shorter the adaptive time, the more efficient the inter channel cle-correlation mechanism. On the other hand, since the KLT covariance matrix has to be coded for audio decoding, a shorter adaptive time contributes to a larger overhead in bit rates. Thus, it is worthwhile to investigate the tradeoff so that a good balance between this adaptive time and the final coding performance can be reached. In Figure 3.3, we show the magnitude of the lower triangular elements of the normalized covariance matrix calculated from de-correlated signals by using different adaptive periods, where no quantization has been applied yet. These figures suggest that there is no significant improvement of the cle-correlation efficiency when the KLT adaptive time decreases from 10 seconds to 0.05 second. As the overhead dramatically increases with the shorter adaptive time, the final coding performance may be degraded. In order to find the optimal KLT adaptive time, a thorough investigation is performed for both SQ and VQ. First, let us look at how adaptive time affects the overhead bit rate. Suppose n channels are selected for simultaneous inter-channel de-correlation, the adaptive 69 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Eigen-channel Eigen-channel Eigen-channel Eigen-channel (a) (b) 8 x 10 -3 Eigen-channel Eigen-channel Eigen-channel Eigen-channel (c) (d) Figure 3.3: The magnitude of the lower triangular elements of the normalized co- variance matrix calculated from de-correlated signals, where the adaptive time is equal to (a) 0.05, (b) 0.2, (c) 3, and (d) 10 seconds. time is K seconds, i.e. each sub-block contains K seconds of audio, and M bits are transmitted to the decoder for each KL transform. The overhead bit rate r^erhead is T overhead. 
r_overhead = M / (nK)   (3.6)

in bits per second per channel (bit/s/ch). This equation suggests that the overhead bit rate increases linearly with the number of bits used to encode and transmit the KL transform matrix. The overhead bit rate is, however, inversely proportional to the number of channels and the adaptive time. If SQ is used in the encoding procedure, each non-redundant element has to be sent. For n-channel audio material, the size of the covariance matrix is n × n, and the number of non-redundant elements is n(n + 1)/2. If B_S bits are used to quantize each element, the total bit requirement for each KLT is n(n + 1)B_S/2. Thus the overhead bit rate r_overhead^SQ for SQ is equal to

r_overhead^SQ = (n + 1)B_S / (2K).

The overhead bit rate r_overhead^VQ for VQ is simpler. It is equal to

r_overhead^VQ = B_V / (nK),

where B_V represents the number of bits used for each KLT covariance matrix.

Figure 3.4: (a) Adaptive MNR results and (b) adaptive overhead bits for SQ and VQ for 5-channel "Messiah".

The average MNR value (in dB) and the overhead bit rate (in the logarithmic scale) versus the adaptive time for both SQ and VQ are shown in Figures 3.4 (a) and (b), respectively. The test material is 5-channel "Messiah", with 8 bits per element for SQ and 4 bits per vector for VQ for KLT de-correlation. The total coding bit rate (including bits for the overhead and the content) is kept the same for all points on the two curves in Figure 3.4 (a). We have the following observations from these figures. First, for the SQ case, the average MNR value remains about the same, with less than 0.3 dB variation per subband, when the adaptive time varies from 1 to 10 seconds.
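Equation (3.6) and its SQ/VQ specializations above can be checked numerically; the sketch below (illustrative function names) verifies, for example, that the SQ form agrees with the generic formula when M = n(n+1)B_S/2:

```python
def overhead_rate(M, n, K):
    """Eq. (3.6): r_overhead = M / (n * K) in bit/s/ch, where M is the
    number of bits sent per KL transform, n the channel count and K
    the adaptive time in seconds."""
    return M / (n * K)

def overhead_rate_sq(B_s, n, K):
    """SQ sends n(n+1)/2 elements of B_s bits each:
    r = (n + 1) * B_s / (2K)."""
    return (n + 1) * B_s / (2 * K)

def overhead_rate_vq(B_v, n, K):
    """VQ sends one B_v-bit codeword per transform: r = B_v / (nK)."""
    return B_v / (n * K)
```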
However, when the adaptive time is decreased further, the overhead effect starts to dominate, and the average MNR value decreases dramatically. On the other hand, when the adaptive time becomes larger than 10 seconds, the average MNR value also decreases, which implies that the coding performance degrades if the KLT matrix is not updated frequently enough. For VQ, the changing pattern of the average MNR value versus the adaptive time is similar to that of SQ. However, compared with the scalar case, the average MNR value starts to degrade earlier, at about 5 seconds. This is probably due to the fact that VQ gives less efficient de-correlation, so that more frequent adaptation of the KLT matrix generates a better coding result. As shown in Figure 3.4 (a), it is clear that the average MNR generated by using VQ is always better than that of SQ, and the difference becomes significant when the overhead becomes the dominant factor, i.e. for KLT adaptive times less than 1 second.

3.5 Complexity Analysis

The main concern of a VQ scheme is its computational complexity at the encoder side. For each D-dimensional vector, we need O(DS) operations to find the best matched codeword from a codebook of size S using the full search technique. For n-channel audio, each covariance matrix is represented by a vector of dimension n(n + 1)/2. Thus, for each KLT, we need O(n²S) operations. For the scalar case, quantizing each element requires O(1) operations, and for each covariance matrix, we need O(n²) operations. Suppose that the input audio is L seconds long, and that the KLT matrix is updated every K seconds. Then, there will be in total ⌈L/K⌉ covariance matrices to be quantized². This means we need

O(⌈L/K⌉ n²) = O(Ln²/K) operations for scalar quantization,
O(⌈L/K⌉ n²S) = O(Ln²S/K) operations for vector quantization.
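These operation counts can be tallied in a small helper; the function below is a purely illustrative sketch of ours, counting the dominant quantization work only:

```python
import math

def klt_quantization_ops(L, K, n, S=None):
    """Rough operation count for quantizing all KLT covariance matrices
    of an L-second input updated every K seconds: about n^2 operations
    per matrix for SQ, and n^2 * S per matrix for VQ with a codebook of
    size S (full search)."""
    matrices = math.ceil(L / K)          # number of matrices to quantize
    per_matrix = n * n if S is None else n * n * S
    return matrices * per_matrix
```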
Thus, for a given test audio source, the complexity is inversely proportional to the KLT adaptive time for either quantization scheme. For VQ, the complexity is also proportional to the codebook size. Compared with SQ, VQ requires more operations by a factor of S. To reduce the computational complexity, we should limit the codebook size and set the KLT adaptation time as long as possible while keeping the desired coding performance.

²Here ⌈x⌉ represents the smallest integer which is greater than or equal to x.

Experimental results shown in Section 3.3 suggest that a very small codebook size is usually good enough to generate the desired compressed audio. By choosing a small codebook size and keeping the KLT adaptation time long enough, we not only limit the additional computational complexity, but also save on the overhead bit requirement. At the decoder side, VQ demands just a table look-up procedure, and its complexity is comparable to that of SQ.

3.6 Experimental Results

We tested the modified MAACKLT method by using two five-channel audio sources, "Messiah" and "Ftbl", at different bit rates varying from a typical rate of 64 kbit/s/ch to a very low bit rate of 16 kbit/s/ch. Figures 3.5 and 3.6 show the mean MNR comparison between SQ and VQ for the test audio "Messiah" and "Ftbl", respectively, where the KLT matrix adaptation time is set to 10 seconds. The mean MNR values in these figures are calculated by Equation (3.4). In order to show the general result of the scalar and vector cases, a moderate bit rate (i.e. 8 bits per element for SQ and 4 bits per vector for VQ) is adopted here. From these figures, we see that, compared with SQ, VQ generates comparable mean MNR results at all bit rates, and VQ even outperforms SQ at all bit rates for some test sequences such as "Messiah".
Figure 3.5: MNR results using the test audio "Messiah" coded at (a) 64 kbit/s/ch, (b) 48 kbit/s/ch, (c) 32 kbit/s/ch, and (d) 16 kbit/s/ch.

3.7 Conclusion

To enhance the MAACKLT algorithm proposed earlier, we examined extensively in this research the relationship between the coding of the KLT covariance matrix with different quantization methods, the KLT de-correlation efficiency, and the frequency of the KLT information update. Specifically, we investigated how different quantization methods affect the final coding performance by using objective MNR measurements.

Figure 3.6: MNR results using the test audio "Ftbl" coded at (a) 64 kbit/s/ch, (b) 48 kbit/s/ch, (c) 32 kbit/s/ch, and (d) 16 kbit/s/ch.

It is demonstrated that reducing the overhead bit rate generally provides a better tradeoff for the overall coding performance. This can be achieved by adopting the smallest possible bit rate to encode the covariance matrix together with a moderately long KLT adaptation period to generate the desired coding performance. Besides, a small codebook size in VQ does not increase the computational complexity significantly.

Chapter 4

Progressive Syntax-Rich Multichannel Audio Codec

4.1 Introduction

Most of today's multichannel audio codecs can only provide bitstreams with a fixed bit rate, which is specified during the encoding phase.
When this kind of bitstream is transmitted over variable-bandwidth networks, the receiver can either successfully decode the full bitstream or ask the encoder site to re-transmit a bitstream with a lower bit rate. The best solution to address this problem is to develop a scalable compression algorithm, which is able to transmit and decode the bitstream with a bit rate that can adapt to a dynamically varying environment (e.g. the instantaneous capacity of a transmission channel). This capability offers a significant advantage in transmitting contents over channels with a variable channel capacity or over connections for which the available channel capacity is unknown at the time of encoding. To achieve this goal, a bitstream generated by scalable coding schemes consists of several partial bitstreams, each of which can be decoded on its own in a meaningful way. In this way, transmission and decoding of a subset of the total bitstream will result in a valid decodable signal at a lower bit rate and quality. MPEG-4 version-2 audio coding supports fine-grain bit rate scalability [PKKS97, ISOb, ISOg, ISOh, HAB+98] in its Generic Audio Coder (GAC). It has a Bit-Sliced Arithmetic Coding (BSAC) tool, which provides scalability in steps of 1 kbit/s per audio channel for mono or stereo audio material. Several other scalable mono or stereo audio coding algorithms [ZL01, VA01, SAK99] were proposed in recent years. However, not much work has been done on progressively transmitting multichannel audio sources. In this work, we propose a progressive syntax-rich multichannel audio codec (PSMAC) based on MPEG AAC. In PSMAC, the inter-channel redundancy inherent in the original physical channels is first removed in the pre-processing stage by using the Karhunen-Loeve Transform (KLT).
Then, most coding blocks in the AAC main profile encoder are employed to generate the spectral coefficients. Finally, a progressive transmission strategy and a context-based QM coder are adopted to obtain a fully quality-scalable multichannel audio bitstream. The PSMAC system not only supports fine grain bit rate scalability for the multichannel audio bitstream, but also provides several other desirable functionalities, such as random access and channel enhancement, which have not been supported by other existing multichannel audio codecs.

Moreover, compared with the BSAC tool provided in MPEG-4 version-2 and most other scalable audio coding tools, a more sophisticated progressive transmission strategy is employed in PSMAC. PSMAC not only encodes spectral coefficients from MSB to LSB and from low to high frequency, so that the decoder can reconstruct these coefficients more and more precisely, with an increasing bandwidth, as the receiver collects more bits from the bitstream, but also utilizes the psychoacoustic model to control the subband transmission sequence so that the most sensitive frequency areas are reconstructed more precisely. In this way, bits used to encode coefficients in the non-sensitive frequency areas can be saved and used to encode coefficients in the sensitive frequency areas. Compared with the algorithm without this subband selection strategy, a perceptually more appealing audio can be reconstructed, especially at a very low bit rate such as 16 kbit/s/ch. The side information required to encode the subband transmission sequence is handled carefully in our implementation, so that the overall overhead does not have a significant impact on the audio quality even at very low bit rates. Note that Shen et al. [SAK99] proposed a subband selection rule to achieve progressive coding.
However, Shen's scheme demands a large amount of overhead in coding the selection order. Experimental results show that, when compared with MPEG AAC, the decoded multichannel audio generated by the proposed PSMAC's MNR progressive mode has comparable quality at high bit rates, such as 64 kbit/s/ch or 48 kbit/s/ch, and much better quality at very low bit rates, such as 32 kbit/s/ch or 16 kbit/s/ch. We also demonstrate that our PSMAC codec can provide better quality of single-channel audio when compared with the MPEG-4 version-2 generic audio coder at several different bit rates.

The rest of this chapter¹ is organized as follows. Section 4.2 gives an overview of the proposed syntax-rich design. Section 4.3 and Section 4.4 describe how progressive quantization is employed in our system. Section 4.5 discusses some implementation issues. Section 4.6 illustrates the complete compression system. Some experimental results are shown in Section 4.7. Finally, concluding remarks are given in Section 4.8.

4.2 Progressive Syntax-Rich Codec Design

In the proposed progressive syntax-rich codec, the following three user-defined profiles are provided.

1. MNR Progressive: If this flag is on, it should be possible to decode the first N bytes of the bitstream, where N, in terms of bit rate, is a user-specified value or a value that the current network parameters allow.

2. Random Access: If this flag is present, the codec will be able to independently encode a short period of audio more precisely than other periods. It allows users to randomly access a certain part of the audio that is of more interest to end users.

¹ Part of this chapter represents work published before, see [YAKK01b, YAKK02a, YAK02].

3.
Channel Enhancement: If this flag is on, the codec will be able to independently encode an audio channel more precisely than other channels, either because these channels are of more interest to end users or because the network situation does not allow the full multichannel audio bitstream to be received in time.

The MNR progressive profile is the default mode. For the other two profiles, i.e. the random access mode and the channel enhancement mode, the MNR progressive feature is still provided as a basic functionality, and decoding of the bitstream can be stopped at any arbitrary point. With these three profiles, the codec provides the versatile functionality that is desired in a variable-bandwidth network with different user access bandwidths.

4.3 Scalable Quantization and Entropy Coding

The major difference between the proposed progressive audio codec and other existing non-progressive audio codecs such as AAC lies in the quantization module and the entropy coding module. The dual iteration loop used in AAC to calculate the quantization step size for each frame's and each channel's coefficients is replaced by a progressive quantization block. The Huffman coding module used in AAC to encode quantized data is replaced by a context-based QM coder. They are explained in detail below.

4.3.1 Successive Approximation Quantization (SAQ)

The most important component of the quantization module is called successive approximation quantization (SAQ). The SAQ scheme, which is adopted by most embedded wavelet coders for progressive image coding, is crucial to the design of embedded coders. The motivation for successive approximation is built upon the goal of developing an embedded code that is analogous to finding a binary-representation approximation to a real number [Sha93].
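This binary-representation analogy can be made concrete with a short sketch for a single positive coefficient. This is an illustrative toy, not the dissertation's implementation; the function name, the scalar input, and the stopping rule are assumptions made for the demo. A significance bit places the value in [T, 2T) with reconstruction 1.5T, and each refinement bit then halves the remaining uncertainty interval:

```python
def saq_approximate(value, t, refinement_layers):
    """Successive approximation of a value in (t, 2t): one significance
    bit, then refinement bits that halve the uncertainty interval."""
    assert t < value < 2 * t
    recon = 1.5 * t      # significance: midpoint of [t, 2t)
    bits = [1]
    step = 0.5 * t       # current refinement step, halved each layer
    for _ in range(refinement_layers):
        step *= 0.5
        if value >= recon:   # refine upward
            bits.append(1)
            recon += step
        else:                # refine downward
            bits.append(0)
            recon -= step
    return bits, recon

bits, recon = saq_approximate(0.7, t=0.5, refinement_layers=10)
# After M refinement layers the error is bounded by t / 2^(M+1),
# so each extra bit roughly halves the reconstruction error.
assert abs(0.7 - recon) <= 0.5 / 2 ** 11
```

Collecting such per-layer bits across all coefficients, most significant layer first, is what makes a bitstream embedded: truncating it simply stops the refinement early.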
Instead of coding every quantized coefficient as one symbol, SAQ processes the bit representation of the coefficients, sliced into bit layers, in the order of their importance. Thus, SAQ provides a coarse-to-fine, multiprecision representation of the amplitude information. The bitstream is organized such that a decoder can immediately start reconstruction based on the partially received bitstream. As more and more bits are received, more accurate coefficients and higher quality multichannel audio can be reconstructed.

4.3.1.1 Description of the SAQ Algorithm

SAQ sequentially applies a sequence of thresholds T_0, T_1, ..., T_{N+1} for refined quantization, where these thresholds are chosen such that T_k = T_{k-1}/2. The initial threshold T_0 is selected such that |C(i)| < 2T_0 for all transformed coefficients in one subband, where C(i) represents the ith spectral coefficient in the subband.

To implement SAQ, two separate lists, the dominant list and the subordinate list, are maintained both at the encoder and the decoder sides. At any point of the process, the dominant list contains the coordinates of those coefficients that have not yet been found to be significant, while the subordinate list contains the magnitudes of those coefficients that have been found to be significant. The process that updates the dominant list is called the significant pass, and the process that updates the subordinate list is called the refinement pass.

In the proposed algorithm, SAQ is adopted as the quantization method for each spectral coefficient within each subband. The algorithm is listed below.

Successive Approximation Quantization (SAQ) Algorithm

1. Initialization: For each subband, find the maximum absolute value C_max over all coefficients C(i) in the subband, and set the initial quantization threshold to T_0 = C_max/2 + BIAS, where BIAS is a small constant.

2.
Construction of the significant map (significance identification): For each C(i) contained in the dominant list, if |C(i)| > T_k, where T_k is the threshold of the current layer (layer k), add i to the significant map, remove i from the dominant list, and encode it with '1s', where 's' is the sign bit. Moreover, modify the coefficient's value to

C(i) <- C(i) - 1.5 T_k, if C(i) > 0,
C(i) <- C(i) + 1.5 T_k, otherwise.

3. Construction of the refinement map (refinement): For each C(i) contained in the significant map, encode the bit at layer k with a refinement bit and change the value of C(i) to

C(i) <- C(i) - 0.25 T_k, if C(i) > 0,
C(i) <- C(i) + 0.25 T_k, otherwise.

4. Iteration: Set T_{k+1} = T_k/2 and repeat Steps 2-4 for k = 0, 1, 2, ...

4.3.1.2 Analysis of Error Reduction Rates

The following two points have been observed before [Ket al.97]:

- The coding efficiency of the significant map is always better than that of the refinement map at the same layer.
- The coding efficiency of the significant map at the kth layer is better than that of the refinement map at the (k-1)th layer.

In the following, we provide a formal proof by analyzing the error reduction capability of the significant pass and the refinement pass, respectively. First, let us consider the error reduction capability for the bit-layer coding of coefficient C(i), for all i, in the significant pass. Since the sign of each coefficient is coded separately, we assume C(i) > 0 below without loss of generality. Suppose that C(i) becomes significant at layer k. This means T_k < C(i) < T_{k-1} = 2T_k, and its value is modified accordingly. Then, the error reduction Δ_1 due to the coding of this bit can be found as

Δ_1 = C(i) - |C(i) - 1.5 T_k|.
Note that, at any point of the process, the value of |C(i)| is nothing else but the remaining coding error. Since T_k < C(i) < 2T_k, we have -0.5 T_k < C(i) - 1.5 T_k < 0.5 T_k, i.e. |C(i) - 1.5 T_k| < 0.5 T_k. Consequently,

Δ_1 = C(i) - |C(i) - 1.5 T_k| > 0.5 T_k.

Now, let us calculate the error reduction for the bit-layer coding of a coefficient in the refinement pass. Similar to the previous case, we assume C(j) > 0. At layer k, suppose C(j) is being refined, and its value is modified accordingly. The corresponding error reduction is

Δ_2 = C(j) - |C(j) - 0.25 T_k|.

Two cases have to be considered:

1. If C(j) > 0.25 T_k,

Δ_2 = C(j) - C(j) + 0.25 T_k = 0.25 T_k.

2. If C(j) < 0.25 T_k,

Δ_2 = C(j) + C(j) - 0.25 T_k = 2C(j) - 0.25 T_k < 0.5 T_k - 0.25 T_k = 0.25 T_k.

Thus, we conclude that

Δ_2 = C(j) - |C(j) - 0.25 T_k| <= 0.25 T_k < 0.5 T_k < Δ_1.

That is, the error reduction of the significant pass is always greater than that of the refinement pass at the same layer. Similarly, at layer (k-1), the error reduction for coefficient C(j), for all j, caused by the refinement pass is

Δ_3 = C(j) - |C(j) - 0.25 T_{k-1}| <= 0.25 T_{k-1} = 0.5 T_k < Δ_1,

which demonstrates that the error reduction of the significant pass at layer k is greater than or equal to that of the refinement pass at layer (k-1).

According to the above analysis, a refinement-significant map coding order is proposed and adopted in our progressive multichannel audio codec. That is, the transmission of the kth refinement map of subband i is followed immediately by the transmission of the (k+1)th significant map of subband i.

4.3.1.3 Analysis of Error Bounds

Suppose the ith coefficient C(i) has a value T_0/2^{R+1} < |C(i)| < T_0/2^R.
Then, its binary representation can be written as

C(i) = sign x [a_0 T + a_1 (T/2) + a_2 (T/4) + ...]
     = sign x Σ_{k=0}^∞ a_k (T/2^k),

where T = T_0/2^{R+1}, T_0 is the initial threshold, and a_0, a_1, a_2, ... are binary values (either 0 or 1). In the SAQ algorithm, C(i) is represented by

C(i) = sign x [1.5 a_0 T + 0.5 b_1 (T/2) + 0.5 b_2 (T/4) + ...]
     = sign x [1.5 a_0 T + 0.5 Σ_{k=1}^∞ b_k (T/2^k)],

where a_k and b_k are related via a_k = 0.5(b_k + 1) for all k = 1, 2, 3, ..., or equivalently

b_k = 1,  if a_k = 1,
b_k = -1, if a_k = 0,   for all k = 1, 2, 3, ...

Based on the first M+1 bits a_0, a_1, a_2, ..., a_M, the value R_1(i) reconstructed by using the binary representation is

R_1(i) = sign x [a_0 T + a_1 (T/2) + a_2 (T/4) + ... + a_M (T/2^M)]
       = sign x Σ_{k=0}^M a_k (T/2^k).

Based on the first M+1 bits a_0, b_1, b_2, ..., b_M, the value R_2(i) reconstructed by using SAQ is

R_2(i) = sign x [1.5 a_0 T + 0.5 b_1 (T/2) + 0.5 b_2 (T/4) + ... + 0.5 b_M (T/2^M)]
       = sign x [1.5 a_0 T + 0.5 Σ_{k=1}^M b_k (T/2^k)].

Thus, the error introduced by the binary representation for this coefficient is

E_1(i) = |C(i) - R_1(i)| = |Σ_{k=M+1}^∞ a_k (T/2^k)| <= Σ_{k=M+1}^∞ T/2^k = T/2^M.

Similarly, the error introduced by SAQ for this coefficient is

E_2(i) = |C(i) - R_2(i)| = 0.5 |Σ_{k=M+1}^∞ b_k (T/2^k)| <= 0.5 Σ_{k=M+1}^∞ T/2^k = T/2^{M+1}.

We conclude that the upper bounds of both the error E_1(i) caused by the binary representation and the error E_2(i) caused by SAQ decay exponentially as the number M of incoming bits increases linearly.

4.3.2 Context-based QM Coder

The QM coder is a binary arithmetic-coding algorithm designed to encode data formed by a binary symbol set. It was the result of an effort by the JPEG and JBIG committees, in which the best features of various arithmetic coders are integrated.
The QM coder is a lineal descendant of the Q-coder, significantly enhanced by improvements in its two building blocks, i.e. interval subdivision and probability estimation [PM93]. Based on Bayesian estimation, a state-transition table, which consists of a set of rules to estimate the statistics of the bitstream depending on the next incoming symbols, can be derived. The efficiency of the QM coder can be improved by introducing a set of context rules. The QM arithmetic coder achieves a very good compression result if the context is properly selected to summarize the correlation between coded data.

Six classes of contexts are used in the proposed embedded audio codec, as shown in Figure 4.1. They are the general context, the constant context, the subband significance context, the coefficient significance context, the coefficient refinement context and the coefficient sign context. The general context is used in the coding of the configuration information. The constant context is used to encode each channel's header information.

Figure 4.1: The adopted context-based QM coder with six classes of contexts.

As their names suggest, the subband significance context, the coefficient significance context, the coefficient refinement context and the coefficient sign context are used to encode the subband significance, coefficient significance, coefficient refinement and coefficient sign bits, respectively. These contexts are adopted because different classes of bits may have different probability distributions. In principle, separating their contexts should increase the coding performance of the QM coder.
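The benefit of separate contexts can be illustrated with a simplified stand-in for the QM coder's probability estimator. The QM coder proper uses a state-transition table and interval subdivision; the adaptive-count model, the class names, and the example bit statistics below are all assumptions made for this demo, which only measures ideal (entropy) code lengths:

```python
import math

class AdaptiveBinaryModel:
    """Simplified stand-in for the QM coder's probability estimator:
    Laplace-smoothed adaptive counts instead of a state-transition table."""
    def __init__(self):
        self.counts = [1, 1]  # smoothed counts of 0s and 1s seen so far

    def cost(self, bit):
        """Ideal code length (in bits) for `bit`, then update the model."""
        p = self.counts[bit] / sum(self.counts)
        self.counts[bit] += 1
        return -math.log2(p)

def total_cost(streams, separate_contexts):
    """Code several bit classes with one shared model or one per class."""
    if separate_contexts:
        models = [AdaptiveBinaryModel() for _ in streams]
    else:
        shared = AdaptiveBinaryModel()
        models = [shared] * len(streams)
    return sum(m.cost(b) for m, s in zip(models, streams) for b in s)

# Two bit classes with opposite statistics, e.g. significance bits that
# are mostly 0 and refinement bits that are mostly 1 (made-up numbers):
sig = [0] * 90 + [1] * 10
ref = [1] * 90 + [0] * 10
assert total_cost([sig, ref], True) < total_cost([sig, ref], False)
```

Because the two bit classes have opposite statistics, letting each class adapt its own model costs noticeably fewer ideal bits than forcing both classes through one shared model, which is the rationale for the six context classes above.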
4.4 Channel and Subband Transmission Strategy

4.4.1 Channel Selection Rule

In the embedded multichannel audio codec, we should place the most important bits (in the rate-distortion sense) into the cascaded bitstream first, so that the decoder can reconstruct multichannel audio of optimal quality given a fixed number of bits received. Thus, the importance of the channels should be determined for an appropriate ordering of the bitstream. For the normal 5.1 channel configuration, it was observed in Chapter 2 that the channel importance order is eigen-channel 1, followed by eigen-channels 2 and 3, and then followed by eigen-channels 4 and 5. Within each channel pair, the importance is determined by their energy. This policy is used in this work.

4.4.2 Subband Selection Rule

In principle, any quality assessment of an audio channel can be performed either subjectively, by employing a large number of expert listeners, or objectively, by using an appropriate measuring technique. While the first choice tends to be an expensive and time-consuming task, the use of objective measures provides quick and reproducible results. An optimal measuring technique would be a method that produces the same results as subjective tests while avoiding all problems associated with the subjective assessment procedure. Nowadays, the most prevalent objective measurement is the Mask-to-Noise-Ratio (MNR) technique, which was first introduced by Brandenburg [Bra87] in 1987. It is the ratio of the masking threshold with respect to the error energy. In our implementation, the masking is calculated from the general psychoacoustic model of the AAC encoder.
The psychoacoustic model calculates the maximum distortion energy which is masked by the signal energy, and outputs the Signal-to-Mask-Ratio (SMR). A subband is masked if the quantization noise level is below the masking threshold, so that the distortion introduced by the quantization process is not perceptible to human ears. As discussed earlier, SMR represents the human auditory response to the audio signal. If the SNR of an input audio signal is high enough, the noise level will be suppressed below the masking threshold and the quantization distortion will not be perceived. SNR can be easily calculated by

SNR = Σ_i |S_original(i)|² / Σ_i |S_original(i) - S_reconstruct(i)|²,

where S_original(i) and S_reconstruct(i) represent the ith original and the ith reconstructed audio signal values, respectively. MNR is then just the difference of SNR and SMR (in dB), i.e.

SNR = MNR + SMR.

A side benefit of the SAQ technique is that an operational rate vs. distortion plot (or, equivalently, an operational rate vs. the current MNR value) for the coding algorithm can be computed on-line. The basic ideas behind the subband selection rules are simple. They are:

1. The subband with a better rate reduction capability should be chosen earlier to improve the performance.

2. The subband with a smaller number of coefficients should be chosen earlier to reduce the computational complexity, if the rate reduction performances of two subbands are close.

The first rule implies that we should allocate more bits to those subbands with larger SMR values or smaller MNR values. In other words, we should send out bits belonging to those subbands with larger SMR or smaller MNR values first. The second rule tells us how to decide the subband scanning order.
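In code, the SNR definition and the relation SNR = MNR + SMR given above are direct. This is an illustrative sketch: the signal values and the SMR figure are made up, since a real SMR comes from the AAC psychoacoustic model.

```python
import math

def snr_db(original, reconstructed):
    """SNR per the definition above: signal energy over error energy, in dB."""
    signal = sum(s * s for s in original)
    noise = sum((s - r) ** 2 for s, r in zip(original, reconstructed))
    return 10.0 * math.log10(signal / noise)

def mnr_db(original, reconstructed, smr_db):
    """MNR = SNR - SMR (all in dB); the distortion is masked when MNR > 0."""
    return snr_db(original, reconstructed) - smr_db

orig = [1.0, -0.5, 0.25, 0.8]
recon = [0.9, -0.45, 0.3, 0.75]
mnr = mnr_db(orig, recon, smr_db=20.0)  # a 20 dB SMR is a stand-in value
```

With these numbers the SNR is about 20.5 dB, so the example subband clears its 20 dB SMR by roughly half a decibel; it is exactly this per-subband margin that drives the selection rules above.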
From the subband formation in MPEG AAC, we know that the number of coefficients in each subband is non-decreasing with the subband number. Figure 4.2 shows the subband width distribution used in AAC for 44.1 kHz and 48 kHz sampling frequencies and long block frames. Thus, a sequential subband scanning order from the lowest number to the highest number is adopted in this work.

Figure 4.2: Subband width distribution.

In order to save bits, especially at very low bit rates, only information corresponding to the lower subbands will be sent into the bitstream, and as the number of layers increases, more and more subbands will be added. Figure 4.3 shows how subbands are scanned for the first several layers. At the base layer, priority is given to lower frequency signals, so only subbands numbered up to L_B will be scanned. As the enhancement layers' information goes into the bitstream, the subband scanning upper limit, i.e. L_E1, L_E2 and L_E3, increases, until this limit reaches the effective psychoacoustic upper bound of all subbands, N. In our implementation, L_E3 = N, which means that after the third enhancement layer, all subbands will be scanned. Here, the subband scanning upper limits L_B, L_E1 and L_E2 are empirically determined values which trade the coding efficiency off against the coding performance.

In PSMAC, a dual-threshold coding technique is proposed in the progressive quantization module. One threshold is the MNR threshold, which is used in subband selection. The other is the magnitude threshold, which is used for coefficient quantization in each selected individual subband. A subband whose current MNR value is smaller than the current MNR threshold is called a significant subband. Similar to the SAQ process for coefficient quantization, two lists, i.e.
the dominant subband list and the subordinate subband list, are maintained at the encoder and the decoder sides. The dominant subband list contains the indices of those subbands that have not yet become significant, and the subordinate subband list contains the indices of those subbands that have already become significant.

Figure 4.3: Subband scanning rule, where a solid line with an arrow means all subbands inside this area are scanned, and a dashed line means only those non-significant subbands inside the area are scanned.

The process that updates the dominant subband list is called the subband significant pass, and the process that updates the subordinate subband list is called the subband refinement pass. Different coefficient magnitude thresholds are maintained in different subbands, since we would like to deal with the most important subbands first and obtain the best result with only a little information from the source. Moreover, since sounds in different subbands have different sensitivities to human ears according to the psychoacoustic model, it is worthwhile to consider each subband independently instead of considering all subbands in one frame simultaneously. We summarize the subband selection rule below.

Subband Selection Procedure

1. MNR threshold calculation: Determine empirically the MNR threshold value for channel i at layer k. Subbands with smaller MNR values at the current layer are given higher priority.

2.
Subband dominant pass: For those subbands that are still in the dominant subband list, if subband j in channel i has a current MNR value smaller than the current MNR threshold, add subband j of channel i into the significant map, remove it from the dominant subband list, and send a 1 to the bitstream, indicating that this subband is selected. Then, perform coefficient SAQ for this subband. For subbands whose MNR values are not smaller than the threshold, send a 0 to the bitstream, indicating that this subband is not selected at this layer.

3. Subband refinement pass: For subbands already in the subordinate list, perform coefficient SAQ.

4. MNR value update: Re-calculate and update the MNR values for the selected subbands.

5. Repeat Steps 1-4 until the bitstream meets the target rate.

4.5 Implementation Issues

4.5.1 Frame, Subband or Channel Skipping

As mentioned earlier, each subband has its own initial coefficient magnitude threshold. This threshold has to be included in the bitstream as overhead so that the decoder can start to reconstruct the coefficients once the layered information is available. In our implementation, the initial coefficient magnitude threshold T_{i,j}(0) for channel i and subband j is rounded to the nearest power of 2 that is no smaller than C_{i,j}^max, i.e.

T_{i,j}(0) = 2^{p_{i,j}},  p_{i,j} = ceil(log2 C_{i,j}^max),

where C_{i,j}^max is the maximum magnitude over all coefficients in channel i and subband j. In order to save bits, the maximum power p_i^max = max_j(p_{i,j}) over all subbands in channel i will be included in the bitstream the first time channel i is
For a frame with its maximum value 6'”"“ equal to 0, i.e. max(C™lx) = 0, Vj, which means all coefficients in channel i in this frame have value 0, then a special indicator will be set to let the decoder know it should skip this frame. Similarly, if 6'™°' has value 0, another special indicator is set to tell the decoder that it should always skip this subbancl. In some cases when the end user is only interested in reconstructing some channels, channel skipping can also be adopted. 4.5.2 Determ ination of the M N R threshold At each layer, the MNR threshold for each channel is determined empirically. Two basic rules are adopted when calculating this threshold. 1. The MNR threshold should allow a certain number of subbancls to pass at each layer. Since the algorithm sends 0 to the bit stream for each un-selected subband which is still in the significant subband list, if the MNR threshold is so small that it allows too few subbands to pass, too many overhead bits will be gen erated. As a result, this will degrade the performance of the progressive audio codec. 98 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2. Adopt a maximum MNR threshold. If the MNR threshold calculated by using the above rule is greater than a pre defined maximum MNR threshold then the current MNR threshold I f LihJb ' for channel % at kth layer will be set to This is based on the assumption that a higher MNR value does not provide higher perceptual audio quality perceived by the human auditory system. 4.6 C om plete A lgorithm D escription The block diagram of a complete encoder is shown in Figure 4.4. The perceptual model, the filter bank, the temporal noise shaping (TNS), and the intensity blocks in our progressive encoder are borrowed from the AAC main profile encoder. The inter-channel redundancy removal procedure via KLT is implemented after the input audio signals are transformed into the MDCT domain. 
Then, a dynamic range control block follows to avoid any possible data overflow in later compression stages. Masking thresholds are then calculated in the perceptual model based on the KL transformed signals. The progressive quantization and lossless coding parts are finally used to construct the compressed bitstream. The information generated at the first several coding blocks will be sent into the bitstream as the overhead. Figure 4.5 provides more details of the progressive quantization block. The chan nel and the subbancl selection rules are used to determine which subbancl in which channel should be encoded at this point, and then coefficients within this selected 99 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Reproduced w ith permission o f th e copyright owner. Further reproduction prohibited without permission. Input audio signal Transformed Coefficients Syntax control = = =► KLT TNS Filter bank Intensity Coupling Noiseless Coding Perceptual Model Progressive Quantization Dynamic Range Control Data control Data Syntax control Legend Bitstream Multiplex Coded Bit Stream Figure 4.4: The block-diagram of the proposed PSMAC encoder. o o Reproduced w ith permission o f th e copyright owner. Further reproduction prohibited without permission. Transformed Coefficients Syntax control Channel Subband C oefficients’ Selection i r Selection SAQ NR progressive? Random access? hannel enhance? N o N o M NR update Context-based Binary QM Coder Exceed bit budget? N oiseless Finish one layer or this channe Coding Y es Finish one layer J o r all channel?^ Y es Progressive j | Quantization! Legend Data Data control Syntax control Figure 4.5: Illustration of the progressive quantization and lossless coding blocks. subband will be quantized via SAQ. The user defined profile parameter is used for the syntax control of the channel selection and the subband selection. 
Finally, based on several different contexts, the layered information together with all overhead bits generated during previous coding blocks will be losslessly coded by using the context-based QM coder. The encoding process performed by using the proposed algorithm will stop when the bit budget is exhausted. It can cease at any time, and the resulting bitstream con tains all lower rate coded bitstreams. This is the so-called fully embedded property. The capability to terminate the decoding of an embedded bitstream at any specific point is extremely useful in systems that are either rate-constrained or distortion- constrained. 4.7 E xperim ental R esults The proposed PSMAC system has been implemented and tested. The basic audio coding blocks [ISOc] inside the MPEG AAC main profile encoder, including the psy choacoustic model, filter bank, temporal noise shaping and intensity/coupling, are still adopted. Furthermore, an inter-channel removal block, a progressive quantiza tion block and a context-based QM coder block are added to construct the PSMAC audio codec. Two kinds of experimental results are shown in this sections. They are results measured by objective metric, i.e. Mask-to-Noise Ratio (MNR), and results measured in subjective metric, i.e. listening test score. It is worthwhile to mention 102 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. that no software optimization has been performed in any codec we used and the cod ing blocks adopted from AAC has not been modified to improve the performance of our codec. Moreover, test audio that has the worst performance generated by the MPEG reference software has not been selected in the experiment. 4.7.1 Results using M N R measurement Two multichannel audio materials are used in this experiment to compare the perfor mance of the proposed PSMAC algorithm with MPEG AAC’s [ISOc] main profile codec. 
One is a one-minute long ten-channel 2 audio material called ”Messiah”, which is a piece of classical music recorded live in a real concert hall. Another one is an eight-second long five-channel 3 music called ’ ’Herre”, which was used in the MPEG-2 AAC standard (ISO/IEC 13818-7) conformance work. 4.7.1.1 M N R Progressive The performance comparison of MPEG AAC and the proposed PSMAC for the normal MNR progressive mode are shown in Table 4.1. The average MNR shown in the table is calculated by Equation 3.4 and 3.5 Table 4.1 shows the MNR values to compare the performance of the non-progressive algorithm AAC and the proposed PSMAC algorithm when working in the MNR pro gressive profile. Values in this table clearly show that our codec outperforms AAC in c lu d in g center, left, right, left surround, right surrond, back surround, left high, right high, left wide and right wide. in c lu d in g center, left, right, left surround and right surrond. 103 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.1: MNR comparison for MNR progressive profiles Bit rate (bit/s/ch) Average MNR values (dB/subband/ch) Herre Messiah AAC PSMAC AAC PSMAC 16k -0.90 6.00 14.37 21.82 32k 5.81 14.63 32.40 34.57 48k 17.92 22.32 45.13 42.81 64k 28.64 28.42 54.67 47.84 for both testing materials at lower bit rates and only has a small performance degra dation at higher bit rates. In addition, the bitstream generated by MPEG AAC only achieves an approximate bit rate and is normally a little bit higher than the desired one while our algorithm achieves a much more accurate bit rate in all experiments carried out. 4.7.1.2 Random Access The MNR result after the base layer reconstruction for the random access mode by using the test material ’ ’Herre” is shown in Table 4.2. When listening to the reconstructed music, we can clearly hear the quality difference between the enhance period and the rest of the other period. 
The MNR values given in Table 4.2 verify the above claim by showing that the mean MNR value for the enhanced period is about 10 dB per subband better than that of the other periods. It is common to prefer a certain part of a piece of music to others. With the random access profile, the user can individually access a period of music with better quality than others when the network condition does not allow a full high-quality transmission.

Table 4.2: MNR comparison for Random Access and Channel Enhancement profiles (dB/subband/ch).

     Random Access              Channel Enhancement
other     enhanced      Enhanced channel        Other channels
area      area          w/o enh.  w/ enh.       w/o enh.  w/ enh.
3.99      13.94         8.42      19.23         1.09      -2.19

4.7.1.3 Channel Enhancement

The performance result using the test material "Herre" for the channel enhancement mode is also shown in Table 4.2. Here, the center channel has been enhanced with enhancement parameter 1 and coded at a bit rate of 16 kbit/s/ch. Since we have to separate the quantization and coding control of the enhanced physical channel, and to ease the implementation, KLT is disabled in the channel enhancement mode. Compared with the normal MNR progressive mode, we find that the enhanced center channel has an average of more than 10 dB per subband MNR improvement, while the quality of the other channels is only degraded by about 3 dB per subband. When subjectively listening to the reconstructed audio, the version with the center channel enhanced performs much better and is more appealing than the one without channel enhancement at the very low bit rate of 16 kbit/s/ch. This is because the center channel of "Herre" contains more musical information than the other channels, and a better reconstructed center channel gives listeners better overall quality, which is basically true for most multichannel audio materials.
Therefore, this experiment suggests that, within a narrower bandwidth, audio generated by the channel enhancement mode of the PSMAC algorithm can provide the user a more compelling experience with either a better reconstructed center channel or a channel that is more interesting to a particular user.

4.7.2 Subjective Listening Test

In order to further confirm the advantage of the proposed algorithm, a formal subjective listening test according to ITU recommendations [111, 128a, 128b] was conducted in an audio lab to compare the coding performance of the proposed PSMAC algorithm and that of the MPEG AAC main profile codec. The same group of listeners described in Section 2.8.3 participated in the listening test. During the test, for each test material, subjects listened to three versions of the same sound clip, i.e. the original one followed by two processed ones (one by PSMAC and one by AAC, in random order). Subjects were allowed to listen to these files as many times as needed until they were comfortable giving scores to the two processed sound files. The five-grade impairment scale given in Recommendation ITU-R BS.1284 [128a] was adopted in the grading procedure and utilized for the final data analysis. Besides "Messiah" and "Herre", another two 10-channel audio materials, "Band" and "Herbie", were added to this subjective listening test. According to ITU-R BS.1116-1 [111], audio files selected for a listening test should only be of short duration, i.e. 10 to 20 seconds long. Figure 4.6 shows the score given to each test material coded at four different bit rates during the listening test for multichannel audio materials. The solid vertical
line represents the 95% confidence interval, where the middle line shows the mean value and the other two lines at the boundary of the vertical line represent the upper and lower confidence limits [RADH87].

[Figure 4.6: Listening test results for multi-channel audio sources. Four panels (Messiah, Band, Herbie, Herre) plot subjective scores (0-5) for MPEG AAC (A) and PSMAC (P) at 16k, 32k, 48k and 64k bit/s/ch.]

It is clear from Figure 4.6 that at lower bit rates, such as 16 kbit/s/ch or 32 kbit/s/ch, the proposed PSMAC algorithm outperforms MPEG AAC for all four test materials, while at higher bit rates, such as 48 kbit/s/ch or 64 kbit/s/ch, PSMAC achieves comparable or slightly degraded subjective quality when compared with MPEG AAC.

To demonstrate that the PSMAC algorithm achieves excellent coding performance even for single-channel audio files, another listening test for mono sound was carried out as well.

[Figure 4.7: Listening test results for single-channel audio sources. Three panels (GSPI, TRPT, VIOO) plot subjective scores for MPEG-4 BSAC (B) and PSMAC (P) at 16k, 32k, 48k and 64k bit/s/ch. The cases where no confidence intervals are shown correspond to the situation when all four listeners happened to give the same score to the given sound clip.]
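The 95% confidence intervals in Figures 4.6 and 4.7 can be reproduced with a normal-approximation sketch; the thesis does not state whether a z- or t-interval was used, so z = 1.96 is an assumption here. With four identical scores the interval collapses to a point, matching the missing bars noted in Figure 4.7.

```python
import math
import statistics

def confidence_interval(scores, z=1.96):
    """Return (lower, mean, upper) of an approximate 95% interval."""
    m = statistics.mean(scores)
    if len(scores) < 2 or len(set(scores)) == 1:
        return (m, m, m)  # all listeners agree: zero-width interval
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return (m - half, m, m + half)
```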
Three single-channel test audio materials, called "GSPI", "TRPT" and "VIOO", are used in this experiment. Here we compare the performance of the standard fine-grain scalable audio coder provided by MPEG-4 BSAC [ISOb, ISOh] with the single-channel mode of the proposed PSMAC algorithm. Figure 4.7 shows the listening test results for single-channel audio materials. From this figure, we can clearly see that at lower bit rates, e.g. 16 kbit/s/ch and 32 kbit/s/ch, our algorithm generates better sound quality for all test sequences. At higher bit rates, e.g. 48 kbit/s/ch and 64 kbit/s/ch, our algorithm outperforms MPEG-4 BSAC for two out of three test materials and is only slightly worse in the "TRPT" case.

4.8 Conclusion

A progressive syntax-rich multichannel audio coding algorithm was presented in this research. This algorithm utilizes the KLT in the pre-processing stage to remove inter-channel redundancy inherent in the original multichannel audio source. Then, rules for channel selection and subband selection are developed, and the SAQ process is used to determine the importance of coefficients and their layered information. At the last stage, all information is losslessly compressed using the context-based QM coder to generate the final multichannel audio bitstream. The distinct advantages of the proposed algorithm over most existing multichannel audio codecs lie not only in its progressive transmission property, which can achieve precise rate control, but also in its rich-syntax design. Compared with the new MPEG-4 BSAC tool, PSMAC provides a more delicate subband selection strategy such that the information which is more sensitive to the human ear is reconstructed earlier and more precisely at the decoder side.
Experimental results showed that PSMAC has performance comparable to the non-progressive MPEG AAC at several different bit rates on the multichannel test materials, while it achieves better reconstructed audio quality than the MPEG-4 BSAC tools on the single-channel test materials. Moreover, the advantage of the proposed algorithm over the other existing audio codecs is more obvious at lower bit rates.

Chapter 5

Error-Resilient Design

5.1 Introduction

High-quality audio communication is becoming an increasingly important part of the global information infrastructure. Compared with speech, audio communication requires a much larger volume of data to be transmitted in a timely manner, and a highly efficient compression scheme for the storage and transmission of audio data is critical. Extensive research on audio coding has been conducted in both academia and industry for years. Several standards, including AC-3, MPEG-1, MPEG-2 and MPEG-4 [BB97, A/5, ISOa, ISOe, ISOf], have been established in the past decade. Earlier standards, e.g. MPEG-1, MPEG-2 and AC-3, were primarily designed for coding efficiency and only allow a fixed bit rate coding structure. These algorithms are not ideal for audio delivery over noisy wireless IP networks with a time-varying bandwidth, since they do not take error resilience and VBR traffic into consideration. Recent technological developments have led to several mobile systems aiming at personal communications services (PCS), supporting both speech and data transmission. Mobile users usually communicate over wireless links characterized by lower bandwidths, higher transmission error rates, and more frequent disconnections in comparison to wired networks.
To transmit high-quality audio through an IP network with VBR (variable bit rate) traffic, a scalable audio compression algorithm, which is able to transfer audio signals from coarse to fine quality progressively, is desirable. However, to achieve a good coding gain, most existing scalable techniques adopt variable-length coding in their entropy coding part, which makes the entire bitstream susceptible to channel noise. The traditional channel coding scheme protects all bits equally, without giving important bits higher protection, so that even a small bit error rate may lead to reconstructed audio with annoying distortion or unacceptable perceptual quality. Since most audio compression standards and network protocols, such as the MPEG-4 version 2 audio codec, were designed for wired audio communications, they would not be effective if straightforwardly applied to the wireless case. A scalable bitstream with joint source-channel coding is truly needed in a wireless audio streaming system.

MPEG-4 version 2 supports audio fine-grain scalability and error-robust coding [ISOe, ISOf]. Its error-resilient AAC coder does not have the progressive property. Its BSAC utilizes Segmented Binary Arithmetic (SBA) coding to avoid error propagation within spectral data. However, this feature alone is not sufficient to protect the audio data in an effective manner over the wireless channel. Compared to work on error-resilient image/video coding, the number of papers on error-resilient audio coding is relatively small. Data partitioning and reversible variable-length codes were adopted by Zhou et al. in [ZZXZ01] to provide the error-resilient feature to the scalable audio codec in [ZL01]. Based on the framework in [ZZXZ01], Wang et al. incorporated an unequal error protection scheme in [WZZZ01]. In Chapter 4, we proposed a progressive high-quality audio coding algorithm, which has been shown to outperform MPEG-4 version 2's scalable audio codec. In this work, we extend the error-free progressive audio codec to an error-resilient scalable audio codec (ERSAC) by re-organizing the bitstream and modifying its noiseless coding part. The proposed error-resilient scalable audio codec uses MPEG Advanced Audio Coding (AAC) as the baseline together with an error-robust scalable transmission module, which is specifically designed for WCDMA channels. In the proposed ERSAC codec, a dynamic segmentation scheme first divides the audio bitstream into several variable-length segments. In order to achieve good error resiliency, the length of each segment is adaptively determined by the characteristics of WCDMA channels. The arithmetic coder and its probability table are re-initialized at the beginning of each segment, so that synchronization can be achieved at the decoder side even when an error occurs. Bits within each segment are ordered in such a way that more important bits are placed near the synchronization point. In addition, an unequal error protection scheme is adopted to improve the robustness of the final bitstream, where Reed-Solomon codes are used to protect data bits, and the parameters of each Reed-Solomon code are determined by the WCDMA
In Chapter 4, we proposed a progressive high quality audio coding algorithm, which has been shown to outperform MPEG-4 version 2’s scalable audio codec. In this work, we extend the error-free progressive audio codec to an error-resilient scalable audio codec (ERSAC) by re-organizing the bitstream and modifying its noiseless coding part. The proposed error-resilient scalable audio codec actually uses the MPEG Advanced Audio Coding (AAC) as the baseline together with an error robust scalable transmission module, which is specifically designed for WCDMA channels. In the proposed ERSAC codec, a dynamic segmentation scheme is first used to divide the audio bitstream into several variable-length segments. In order to achieve good error resiliency, the length of each segment is adaptively determined by the characteristics of WCDMA channels. The arithmetic coder and its probability table are re-initialized at the beginning of each segment, so that synchronization can be achieved at the decoder side even when error occurs. Bits within each segment are ordered in such a way that more important bits are placed near the synchronization point. In addition, an unequal error protection scheme is adopted to improve ro bustness of the final bitstream, where Reed-Solomon codes are used to protect data bits, and the parameters of each Reed-Solomon code is determined by the WCDMA 113 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. channel condition. Moreover, a frequency interleaving technique is adopted when data packetization is performed so that the frequency information belongs to the same period is sent in different packets. In this way, even if some packets belongs to the header or the base layer are corrupted, we still can hear a poorer quality period of sound with some frequency component lost (unless packets corresponding to the same period of sound are corrupted at the same time). 
We test the performance of our algorithm using several single-channel audio materials under different error pat terns of WCDMA channels. Experimental results show that the proposed approach has excellent error resiliency at a regular user bit rate of 64 kb/s. The rest of this chapter1 is organized as follows. Some characteristics of the WCDMA channel are summarized in Section 5.2. The layered audio coding structure is described in Section 5.3. Section 5.4 explains the detailed error-resilient technique in the proposed algorithm. Some experimental results are shown in Section 5.5, and concluding remarks are given in Section 5.6. Some discussions and future work directions are addressed in Section 5.7. 5.2 W C D M A C haracteristics The third generation (3G) mobile communication systems have been designed for effective wireless multimedia communication [HT01]. The WCDMA (wideband Direct-Sequence Code Division Multiple access) technology has been adopted by the : P art of this chapter represents work published before, see [YAKK02b] 114 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. UMTS standard as the physical layer for air interface. WCDMA has the following characteristics. • WCDMA is designed to be deployed in conjunction with GSM. • The chip rate of 3.84 Mcps used leads to a carrier bandwidth of approximately 5 MHz. • WCDMA supports highly variable user data rates; in other words, the concept of obtaining Bandwidth on Demand (BoD) is well supported. • WCDMA supports two basic modes of operation: Frequency Division Duplex (FDD) and Time Division Duplex (TDD). • WCDMA supports the operation of asynchronous base stations. • WCDMA employs coherent detection on uplink and downlink signals based on the use of pilot symbols or common pilot. 
• The WCDMA air interface has been crafted in such a way that advanced CDMA receiver concepts can be deployed by the network operator as a system option to increase capacity and/or coverage. Two reference error-resilient simulation studies [ITU98, ITU99] for the charac terization of the radio channel performance of the 1.9 GHz WCDMA air interface were recently carried out by the ITU-Telecommunications Standardization Sector. In the study of 1998, which is referred to as study #1, only six simulation results for fixed data bit rate of 64 kb/s were obtained. In the study of 1999, which is 115 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. referred to as study #2, simulation results were extended to four different data bit rates, including 32 kb/s, 64 kb/s, 128 kb/s and 384 kb/s. Study # 2 also replaced convolutional codes by turbo codes so that an even better channel performance can be achieved. In this work, we only consider error-resilient coding for single-channel audio coded at 64 kb/s. The main characteristics of all error patterns corresponding to 64 kb/s contained in two studies are listed in Table 5.1. Each error pattern file is of 11,520,000 bit long corresponding to 3 minutes of data transmission. The bit error is a binary one. Within a byte the least significant bit is transmitted first. Table 5.1: Characteristics of WCDMA error patterns. 
Study #  File #  File name                 Mobile speed (km/h)  Average BER
1        0       wcdma-64kb-005hz-4        3                    8.2e-5
1        1       wcdma-64kb-070hz-4        40                   1.2e-4
1        2       wcdma-64kb-211hz-4        120                  9.4e-5
1        3       wcdma-64kb-005hz-3        3                    1.4e-3
1        4       wcdma-64kb-070hz-3        40                   1.3e-3
1        5       wcdma-64kb-211hz-3        120                  9.7e-4
2        6       wcdma_64kb_50kph_7e-04    50                   6.6e-4
2        7       wcdma_64kb_50kph_2e-04    50                   1.7e-4
2        8       wcdma_64kb_3kph_5e-04     3                    5.1e-4
2        9       wcdma_64kb_3kph_2e-04     3                    1.6e-4
2        10      wcdma_64kb_3kph_7e-05     3                    7.2e-5
2        11      wcdma_64kb_3kph_3e-06     3                    3.4e-6
2        12      wcdma_64kb_50kph_6e-05    50                   6.0e-5
2        13      wcdma_64kb_50kph_3e-06    50                   3.4e-6

5.3 Layered Coding Structure

5.3.1 Advantages of Layered Coding

The most popular and efficient way to construct a scalable bitstream is to use the layered coding structure. When the information from the first and most important layer, called the base layer, is successfully retrieved from the bitstream at the decoder side, a rough outline of the entire sound file can be recovered. As information from more and more higher-level layers, called enhancement layers, is successfully retrieved from the bitstream, sound of better and better quality can be reconstructed. When the bitstream is transmitted over error-prone channels, such as wired and/or wireless IP networks, the advantage of a layered coded bitstream over fixed-rate bitstreams is notable. For a fixed-rate bitstream, when an error occurs during transmission, the decoder can only reconstruct the period before the error and the period after the decoder regains the synchronization lost to the error. The resulting sound file may contain a lost period of several milliseconds to several seconds, depending on how soon synchronization is regained at the decoder side. If the bitstream cannot be re-synchronized, the data after the error may be completely lost, which results in a partially reconstructed sound file.
However, when an error occurs during transmission of a layered coded bitstream, unless the error occurs in the base layer, the decoder can still recover the sound file, with sound quality degradation in the enhancement layers for some period of time. Experiments suggest that, even if it contains a poorer-quality period, a reconstructed sound file of full length gives listeners a better sensation than a sound file with some completely lost periods.

5.3.2 Main Features of the Scalable Codec

The major difference between the proposed scalable codec design and a traditional fixed bit rate codec lies in the quantization module and the entropy coding module. The ERSAC algorithm inherits the basic ideas of progressive quantization and the context-based QM coder from the PSMAC algorithm to achieve fine-grain bit rate scalability. In order to classify and protect bits according to their importance, bits are re-ordered so that bits of the same priority are grouped together for easier protection. Compared with the PSMAC algorithm, the major modification in the progressive quantization module lies in how the subband significance bits are transmitted. In PSMAC, the bits which indicate subband significance are sent together with the coefficient bits, while in the ERSAC algorithm, these bits are sent in the header. In PSMAC, the thresholds used to determine subband significance are MNR values, updated after each coding layer, while in ERSAC, SMR values are adopted for the determination of subband significance in every layer. Thus, the subband selection sequence might not be the same under the PSMAC and ERSAC algorithms for some input audio files. However, experiments show that the perceptual quality of the reconstructed sound files is quite similar under these two slightly different subband selection rules.
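ERSAC's SMR-driven selection, detailed in the next paragraphs with thresholds Ti and a per-layer subband range Li, can be sketched as follows. The SMR values, thresholds and per-layer counts below are invented for illustration; the real thresholds are empirical.

```python
def select_subbands(smr, thresholds, counts):
    """Layer-by-layer significance: at layer i the first counts[i] subbands
    are in scope, and a still-insignificant subband becomes significant
    once its SMR reaches thresholds[i]. Returns the list per layer."""
    significant = set()
    per_layer = []
    for t, n in zip(thresholds, counts):
        for sb in range(n):
            if sb not in significant and smr[sb] >= t:
                significant.add(sb)
        per_layer.append(sorted(significant))
    return per_layer

def significance_bits(significant, num_subbands):
    # header bitmap: "1" marks a significant subband, "0" a non-significant one
    return "".join("1" if sb in significant else "0"
                   for sb in range(num_subbands))
```

For example, with SMRs [5, 12, 3, 9], thresholds [10, 8, 2] and ranges [2, 4, 4], subband 1 turns significant at layer 0, subband 3 at layer 1, and the rest at layer 2.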
In ERSAC, at layer i, i ≤ 3, an empirical threshold Ti based on the Signal-to-Mask Ratio (SMR) is calculated. Only those subbands whose SMR values are greater than or equal to Ti are selected and become significant. At the next layer, i.e. layer i + 1, the SMR values of newly included subbands together with the remaining non-significant subbands from previous layers are compared with Ti+1, and an updated significant subband list is created. Figure 5.1 provides an example of how subbands are selected from layer 0 to layer 3. To better illustrate this procedure, L0, L1, L2 and L3 are set to 6, 10, 14 and 18, respectively, in this example. If the bit budget is not exhausted after the 3rd enhancement layer, more layers can be encoded; all subbands are included in the significant list from that point on. At the encoder, the subband significance information is included in the header, where a binary digit "1" represents a significant subband and "0" represents a non-significant subband. Thus, in the example given in Figure 5.1, the subband significance information bits should be 0101111011011001100101011. Whenever the encoder finds a significant subband, it visits the coefficients inside this subband, performs progressive quantization on the coefficients and then does the entropy coding. Here, we adopt the Successive Approximation Quantization (SAQ) scheme to quantize the magnitude of the coefficients and the context-based QM coder
to noiselessly code all generated bits. A detailed description of the coefficients' SAQ and the context-based QM coder can be found in Chapter 4.

[Figure 5.1: A simplified example of how subbands are selected from layers 0 to 3. At the base layer, subbands #1-#6 are examined and #2, #4, #5, #6 become significant; each enhancement layer widens the examined range (to #10, #14 and #18) and marks further subbands significant.]

5.4 Error-Resilient Codec Design

5.4.1 Unequal Error Protection

When a bitstream is transmitted over a network, errors may occur due to channel noise. Even with channel coding, errors can still be found in received audio data. In particular, for highly compressed audio signals, a small bit error rate can lead to highly annoying or perceptually unacceptable distortion. Instead of demanding an even lower bit error rate (BER), which is expensive to achieve in a wireless environment, the use of joint source and channel coders has been studied in [YFT+99, SS99, WZZZ01] and shown to be promising in achieving good perceptual quality without complex processing. Most coded audio bitstreams contain certain bits that are more sensitive to transmission errors than others in audio reconstruction. Unequal error protection (UEP) offers a mechanism to relate the transmission error sensitivity to the error protection capability. A UEP system typically has the same average transmission rate as a corresponding equal error protection (EEP) system but offers improved perceived signal quality at the same channel signal-to-noise ratio. In order to prevent the complete loss of transmitted audio, the critical bits need to be well protected from channel errors. Examples of critical bits include the headers and the most significant bits of source symbols.
The loss of header information usually leads to catastrophic sound distortion, while an error in a most significant bit results in higher degradation than errors in other bits. Thus, high-priority bits need to be protected using channel coding or other methods. But the redundancy due to channel coding reduces compression efficiency, so a good tradeoff between the rates of the source coder and the channel coder has to be considered. The study of error-correcting codes began in the late 1940s. Among the many error-correcting codes [MS77, LDJC83], such as Hamming, BCH, cyclic and Reed-Muller codes, the Reed-Solomon code is chosen in our implementation because of its excellent performance in correcting burst errors, which are the most common case in wireless channel transmission. Reed-Solomon codes are block-based error-correcting codes with a wide range of applications in digital communications and storage. The number and the type of errors that can be corrected by Reed-Solomon codes depend on the code parameters. For a fixed number of data symbols, a code that detects and corrects fewer bit errors needs fewer parity check symbols, thus producing fewer redundancy bits. In this work, data in the compressed audio bitstream are protected according to their error sensitivity classes. Experimental results suggest that errors in both the headers and the base layer lead to unacceptable reconstructed audio quality, while the same amount of errors in the enhancement layers results in less perceptual distortion as the enhancement layer number goes higher. Therefore, bits in the header and the base layer are given the same highest priority, bits in the first enhancement layer are given moderate priority, and
bits in the second and higher enhancement layers are given the lowest priority during the error protection procedure. To further determine the error-correcting capability of the Reed-Solomon codes used for each error sensitivity class, a more detailed analysis is performed on all WCDMA error patterns. Since the Reed-Solomon code is a byte-based error-correcting code, the mean and the standard deviation of the byte error rate are calculated for each error file. Files with similar byte error rate characteristics are combined into one group. All fourteen error patterns are finally divided into four groups, whose virtual mean and standard deviation values are then empirically determined. Based on these virtual statistical data, the target error-correcting capability of the Reed-Solomon code is finally calculated by the following formula:

e_target(g, c) = mean_byte(g) + f_byte(g, c) × std_byte(g),    (5.1)

where e_target(g, c), mean_byte(g) and std_byte(g) are the target error-correcting capability (in percentage), the virtual mean and the virtual standard deviation of the byte error rate for group g and error sensitivity class c, respectively, and f_byte(g, c) is a function of the group number g and the error sensitivity class number c.

5.4.2 Adaptive Segmentation

Although the arithmetic coder has excellent coding efficiency and is adopted by almost all layered source coding techniques, the arithmetic coder, together with other variable-length codes, is known to be highly susceptible to channel errors due to synchronization loss at the decoder side, which leads to error propagation, the loss of some source symbols and even the crash of the decoder. To prevent these undesirable results and, at the same time, to control the redundancy generated by the UEP scheme, an adaptive segmentation strategy is developed in this work. That is, the generated bitstream is partitioned into several independent segments.
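Equation (5.1) above turns group statistics into a target byte-error-correcting capability. The parity sizing below uses the standard Reed-Solomon fact that an RS(n, k) code over GF(256) corrects t = (n − k)/2 byte errors per codeword; the f factor stands in for the empirically tuned f_byte(g, c), and the numbers in the example are invented.

```python
import math

def target_capability(mean_byte, std_byte, f):
    # Eq. (5.1): e_target(g, c) = mean_byte(g) + f_byte(g, c) * std_byte(g)
    return mean_byte + f * std_byte

def rs_parity_symbols(e_target, n=255):
    """Smallest parity count letting RS(n, k) correct e_target * n byte
    errors per codeword: n - k = 2t parity symbols correct t errors."""
    t = math.ceil(e_target * n)
    return 2 * t
```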
By "independent", we mean that the entropy coding module at the decoder side can independently decode each segment. This is achieved as follows. At the beginning of each segment, the arithmetic coder is restarted, its probability tables are refreshed, and some stuffing bits are appended at the end of each segment so that each segment is byte-aligned. In this way, several independent synchronization points are provided in the bitstream, and errors can only propagate until the next synchronization point, which means that errors are confined to their corresponding segments and do not affect the decoding of other segments. The determination of the segment length is another issue to be addressed. Since the arithmetic coder has to be restarted and flushed for each segment, segments that are too short considerably degrade the entropy coding efficiency. Thus, a good tradeoff between coding performance and error-resilient capability should be sought. The use of the bit error rate as a parameter to determine the segment length provides an intuitive and straightforward solution. However, after some exploration, we found that the error distribution pattern should also be taken into consideration.
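One way to capture the error distribution pattern, used by the segment-length rule of Eq. (5.2), is the error occurrence period: the gap in bits between neighboring errors separated by at least eight error-free bits. A sketch, where f_occur stands in for the empirically tuned per-group factor:

```python
import statistics

def occurrence_periods(error_bits, guard=8):
    """Gaps between neighboring errors with >= guard clean bits between them."""
    positions = [i for i, b in enumerate(error_bits) if b]
    return [b - a for a, b in zip(positions, positions[1:])
            if b - a - 1 >= guard]

def segment_length(periods, f_occur):
    # Eq. (5.2): seglen(g) = mean_occur(g) + f_occur(g) * std_occur(g)
    return statistics.mean(periods) + f_occur * statistics.pstdev(periods)
```

Errors closer together than the guard distance are treated as one burst, so bursty files yield longer periods (and hence longer segments) than their raw BER alone would suggest.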
Then, the virtual mean and the virtual stan dard deviation value of the error occurrence period are empirically determined for each group. Finally, the segment length is calculated via ■seglen(g) = ni(uin(n T M V {g) + foccur(g) x stdoccur(g), (5.2) where seglen(g), meanoccur(g) and stdoccur{g) are the segment length, the virtual mean and the virtual standard deviation of the error occurrence period for group < 7, respectively, and f O C cur(g) is a function of group number g. Note that the group number g in Eq (5.2) may not be the same as that in Eq (5.1). 5.4.3 Frequency Interleaving Traditionally, all bits belonging to the same time position are packed together and sent into th e b itstre a m . W hen errors h a p p e n in th e global header or th e d a ta p a rt of the base layer, the corresponding period of sound data cannot be reconstructed. Instead, it may have to be replaced by silence or other error concealment technology. 125 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Simple experiments show that substituting the corrupted period with silence in a reconstructed sound file generates an unpleasant sound effect. So far, there is no effective error concealment technology to recover the lost period in audio. In order to improve the performance under this situation, a novel frequency interleaving method is proposed and incorporated in the ERSAC algorithm. With this new method, bits corresponding to different frequency components in the same time position are divided into two groups. Then, even if some packets are corrupted during transmission, the decoder can still be able to reconstruct a poorer quality version of the sound with some frequency component missing. Figure 5.2 depicts a simple example on how the frequency interleaving is imple mented in ERSAC. In this figure, only significant subbancls for each layer in Fig ure 5.1 are shown. Adjacent significant subbands are divided into different groups. 
Bits belonging to different groups will not be included in the same packet. In this way, bits in different subbands, which correspond to different frequency intervals, are interleaved so that a perceptually better decoded sound file can be reconstructed when some packets are corrupted.

To show the advantage of the frequency interleaving method, a simple experiment is carried out. During the experiment, we artificially corrupt two packets for each test sound file. Both packets belong to the data part of the base layer in the bitstream. The corresponding time positions of these packets are chosen such that one packet contains information for a smooth period and the other contains information for a period with rapid melody variations.

[Figure 5.2: Example of frequency interleaving. The significant subbands of the base layer and each enhancement layer are alternately assigned to group A and group B.]

With one packet lost in the base layer, coefficients corresponding to certain frequency areas cannot be reconstructed. Therefore, the reconstructed sound clips may contain a period with defects. However, the degree of the perceptual impairment differs a lot from sample to sample. Table 5.2 lists the experimental results for the frequency interleaving method.

Table 5.2: Experimental results of the frequency interleaving method.
                                       Input file
Affected area                          VIOO                              TRPT                            GSPI
Smooth area                            Can only be noticed when          Can be noticed but              Can be noticed and is
                                       listened to really carefully      not annoying                    a little annoying
Area with rapid melody variations      Hardly noticeable                 Hardly noticeable               Can be noticed but
                                                                                                         not annoying

We see from this table that, for some input sound files, such as the one named "VIOO", users can hardly detect the defect in the reconstructed file. For some other input sound files, such as the one named "TRPT", users are able to catch the defect in the smooth period, but the perceptual impairment is at the level of "perceived but not annoying". For input sound files like "GSPI", which has a wide range of frequency components, users can easily detect the defect, which may be somewhat annoying. On the other hand, if no frequency interleaving is enforced, when corrupted packets reside in the header or the data part of the base layer, no information for the corresponding time period can be recovered, and that period can only be played back as silence if no concealment technique is involved. A sound file with a sudden silence period inserted is much more annoying than one constructed with the proposed frequency interleaving method.

5.4.4 Bitstream Architecture

The bitstream architecture of the proposed algorithm is illustrated in Figure 5.3. The entire bitstream is composed of a global header and several layers of information. Bits in lower layers represent the values of perceptually more important coefficients. In other words, the bitstream contains all lower bit rate codes at the beginning so that it can provide different QoS to different end-users. Let us look at the details of each layer. Within each layer there are many variable-length segments, and each segment can be independently decoded.
At the beginning of each segment there is a segment header, whose bits are utilized to indicate the synchronization point. The data part within each segment is partitioned into several packets. One packet is considered a basic unit input into the Reed-Solomon coding block, where parity check bits are appended after data bits. At the end of each segment there are some stuffing bits so that the whole segment is byte-aligned.

5.4.5 Error Control Strategy

When the end user receives a bitstream, the Reed-Solomon decoder is employed to detect and correct any possible errors that occurred during channel transmission.

[Figure 5.3: The bitstream architecture. The base layer is followed by enhancement layers; each layer is divided into segments #1..#n with independent synchronization points, and each segment into packets #1..#k. Legend: Hg: Global Header, Hs: Segment Header, S: Stuffing, P: Parity Check.]

Once an uncorrectable error is detected, its position information, such as the layer number, the segment number and the packet number, will be marked. If this uncorrectable error occurs in the global header or the data part of the base layer, all bits belonging to the corresponding time position will not be reconstructed, and this period of sound will be replaced with silence. Some error concealment strategy may be involved to make the final audio file sound smoother. However, if the uncorrectable error occurs in the data part of any other enhancement layer, all bits belonging to the corresponding time position will simply not be used to refine the spectral data, so this error will not cause unpleasant distortions. Experimental results show that uncorrectable errors in layer two or higher have little impact on the final audio file. Normal listeners can hardly perceive any impairment unless they listen carefully.
If these errors are in layer one, normal listeners may perceive a slight but not annoying impairment in the final audio file. Another type of error happens to bits belonging to the segment header. When this type of error occurs, it may cause the decoder to lose synchronization and stop decoding earlier than expected. In this scenario, the error affects frames well beyond one specific segment, and the resulting audio file corresponds to a lower rate reconstruction with poorer quality.

5.5 Experimental Results

The proposed ERSAC system has been implemented and tested. The basic audio coding blocks [ISOc] of the MPEG AAC main profile encoder, including the psychoacoustic model, filter bank, and temporal noise shaping, are adopted to generate spectral data. An error-resilient progressive quantization block and a context-based QM coder block are added at the end to construct the error-robust scalable audio bitstream. Three single-channel sound files, i.e. GSPI, TRPT and VIOO, which are downloaded and processed from the MPEG Sound Quality Assessment Material, are selected to test the coding performance of the proposed algorithm. The Mask-To-Noise Ratio (MNR) values are adopted here as the objective sound quality measurement. Figure 5.4 and Table 5.3 show the experimental results for the three test materials using different WCDMA error pattern files, where the mean MNR and the average MNR values are calculated by Equations 3.4 and 3.5. Based on the results shown in Figure 5.4 and Table 5.3, we have the following observations.

1. No error: there are no uncorrectable errors (in GSPI error patterns 0, 4, 5, 9, 10, 11, 12, 13; TRPT error patterns 0, 1, 2, 9, 10, 11, 12, 13; VIOO error patterns 0, 1, 10, 11, 13).
This happens when either no error occurs during the period when the bitstream is transmitted over the WCDMA channel, or errors occur but are corrected by ERSAC's error detection scheme. Since more than half of the experiment cases belong to this category, the proposed ERSAC algorithm shows an excellent error-resilient capability.

2. Error case 1: an error occurs in the global header or the data part of the base layer (none is observed in our experiment). When this happens, the decoder has no way to reconstruct the affected period of the sound file. This period will then be error concealed by repetition of the data in the previous period.

3. Error case 2: an error occurs in the segment header (none is observed in our experiment). When this happens, the decoder may lose synchronization and will not be able to continue decoding the rest of the bitstream, which means the decoder will stop refining all coefficients' values. If this happens in lower layers, e.g. layer 0 or layer 1, the reconstructed audio will have poor quality and the end user may easily perceive the distortion. However, if this happens in higher layers, e.g. layer 2 or higher, errors will not have a big impact on the reconstructed sound file.

4. Error case 3: an error occurs in the data part of layer 1 or higher (observed in all remaining cases). When this happens, the decoder will stop refining coefficients in the affected period, and the reconstructed sound file has slightly degraded quality, which falls in the perceptible but not annoying degree of distortion.

[Figure 5.4: Mean MNR values of reconstructed audio files through different WCDMA channels (one panel each for GSPI, TRPT, and VIOO; mean MNR in dB plotted against subband index).]
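The decoder-side reactions described in Section 5.4.5 and observed in the error cases above can be summarized as a small decision function. The names below are illustrative; ERSAC's implementation is not given in this form in the text.

```python
# Sketch of the error-control decisions: what the decoder does when
# the Reed-Solomon stage reports an uncorrectable error. Layer 0 is
# the base layer; the string labels are illustrative.

def handle_uncorrectable(part: str, layer: int) -> str:
    if part == "global_header" or (part == "data" and layer == 0):
        # The affected period cannot be reconstructed: conceal it,
        # e.g. by repeating data from the previous period.
        return "conceal_period"
    if part == "segment_header":
        # Synchronization may be lost; decoding stops early and the
        # output falls back to a lower-rate reconstruction.
        return "stop_refining"
    # Data error in an enhancement layer: skip refinement for the
    # affected coefficients only.
    return "skip_refinement"

print(handle_uncorrectable("data", 2))   # skip_refinement
```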
5.6 Conclusion

We presented an error-resilient scalable audio coding algorithm, which is an extension of our previous work on progressive audio compression. Compared with other existing audio codecs, this algorithm not only preserves the scalable property, but also incorporates an error-robust scheme specifically designed for WCDMA channels. Based on the characteristics of the simulated WCDMA channel, a joint source-channel coding method was developed in this work. The novelty of this technique lies in its unique unequal error protection, adaptive segmentation coding structure and frequency interleaving technique. Experimental results showed that the proposed ERSAC achieved good performance using all simulated WCDMA error pattern files at a regular user bit rate of 64 kb/s.

Table 5.3: Average MNR values of reconstructed audio files through different WCDMA channels.

Error pattern   Error pattern file name       Ave. MNR (dB/subband)
file #                                        GSPI     TRPT     VIOO
0               wcdma-64kb-005hz-4            83.24    57.82    46.57
1               wcdma-64kb-070hz-4            82.97    57.82    46.57
2               wcdma-64kb-211hz-4            83.18    57.82    46.53
3               wcdma-64kb-005hz-3            83.25    57.64    46.25
4               wcdma-64kb-070hz-3            83.24    57.49    46.47
5               wcdma-64kb-211hz-3            83.24    57.77    46.43
6               wcdma_64kb_50kph_7e-04        82.72    57.24    46.38
7               wcdma_64kb_50kph_2e-04        83.36    57.80    46.31
8               wcdma_64kb_3kph_5e-04         81.96    56.82    46.23
9               wcdma_64kb_3kph_2e-04         83.39    57.86    46.45
10              wcdma_64kb_3kph_7e-05         83.39    57.86    46.61
11              wcdma_64kb_3kph_3e-06         83.24    57.82    46.57
12              wcdma_64kb_50kph_6e-05        83.39    57.86    46.58
13              wcdma_64kb_50kph_3e-06        83.24    57.82    46.57
5.7 Discussion and Future Work

5.7.1 Discussion

5.7.1.1 Frame Interleaving

Error-resilient algorithms designed for image or video codecs normally contain a block interleaving procedure, where bits belonging to adjacent areas are packed separately so that any propagated error within a packet will not affect a large area. This is done because human eyes are more sensitive to low frequency components and less sensitive to high frequency ones, so errors spread over a larger area are less visible. Similarly, a frame interleaving technique could be considered for error-resilient audio codec design. However, unlike for image or video, experimental results show that human ears are capable of catching spread-out impairments in sound files. In fact, a longer, evenly distorted period is less annoying and more tolerable than a period in which distortion switches on and off frequently. Therefore, no frame interleaving is adopted in the proposed algorithm; packets are simply sent according to their original time sequence.

5.7.1.2 Error Concealment

Several error concealment techniques have been proposed for speech communications [GLWW86, Jay81, WGDP88, VR93, RW85], such as SOLA, WSOLA, frame repetition and waveform substitution. However, these methods are not suitable for high quality audio because speech and audio serve different applications. The main purpose of speech is communication, while the main purpose of audio is entertainment. Thus, as long as people can understand, some noise in the background of speech is tolerable, which is certainly not the case for high quality audio. One common practice of existing speech error concealment methods is the addition of background noise. As a result, the error-concealed speech has good intelligibility while just having some additional noise in the background.
However, adding noise to an audio file is normally not tolerated, which makes none of the available error concealment methods suitable for high quality audio.

5.7.2 Future Work

In our current work, the ERSAC algorithm has only been implemented for single channel material, and it can be extended to accommodate stereo or even multichannel error-resilient codecs. Although only mono or stereo audio applications are needed in today's wireless communication systems, we can foresee the need of sending multichannel audio files over wired or wireless networks in the future. Thus, error-resilient multichannel audio coding remains an open research topic. When input sound files with more than one channel are incorporated into the error-resilient codec, channel dependency should be taken into account, and could be utilized to develop an efficient error concealment strategy.

Chapter 6

Conclusion

The important results achieved in this dissertation are summarized as follows:

• Modified Advanced Audio Coding with Karhunen-Loeve Transform (MAACKLT)

MAACKLT is a new quality-scalable high-fidelity multichannel audio compression algorithm based on MPEG-2 Advanced Audio Coding (AAC). The Karhunen-Loeve Transform (KLT) is applied to multichannel audio signals in the pre-processing stage to remove inter-channel redundancy. Then, signals in de-correlated channels are compressed by a modified AAC main profile encoder. Finally, a channel transmission control mechanism is used to re-organize the bitstream so that the multichannel audio bitstream has a quality-scalable property when it is transmitted over a heterogeneous network.
Experimental results show that, compared with AAC, the proposed algorithm achieves a better performance under the objective Mask-to-Noise-Ratio (MNR) measurement while maintaining a similar computational complexity at the regular bit rate of 64 kbit/s/ch. When the bitstream is transmitted to narrow-band end users at a lower bit rate, packets of some channels can be dropped, and slightly degraded yet full-channel audio can still be reconstructed in a reasonable fashion without any additional computational cost.

• Progressive Syntax-Rich Multichannel Audio Codec (PSMAC)

Based on AAC, we developed a progressive syntax-rich multichannel audio codec in this dissertation. It not only supports fine grain bit rate scalability for the multichannel audio bitstream, but also provides several other desirable functionalities which are not available in other existing multichannel audio codecs. Moreover, compared with other existing scalable audio coding tools, a more sophisticated progressive transmission strategy is employed in PSMAC. A formal subjective listening test shows that the proposed algorithm achieves excellent performance at several different bit rates when compared with MPEG AAC using multichannel test material, and when compared with MPEG version-2's BSAC using single channel test material.

• Error-Resilient Scalable Audio Coding (ERSAC)

Inheriting the basic coding structure of PSMAC, the ERSAC algorithm extends the error-free progressive audio codec to an error-resilient scalable audio codec by re-organizing the bitstream and modifying the noiseless coding module. Progressive quantization, a dynamic segmentation scheme, a frequency interleaving technique and an unequal error protection scheme are adopted in
the proposed algorithm to construct the final error-robust layered audio bitstream. The performance of the proposed algorithm is tested under different error patterns of WCDMA channels with several test audio materials. Our experimental results show that the proposed approach achieves excellent error resilience at a regular user bit rate of 64 kb/s.

Bibliography

[111] Recommendation ITU-R BS.1116-1. Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems.
[128a] Recommendation ITU-R BS.1284. Methods for the subjective assessment of sound quality - general requirements.
[128b] Recommendation ITU-R BS.1285. Pre-Selection Methods for the Subjective Assessment of Small Impairments in Audio Systems.
[A/52] ATSC Document A/52. Digital Audio Compression Standard (AC-3).
[BB97] K. Brandenburg and M. Bosi. ISO/IEC MPEG-2 Advanced Audio Coding: Overview and applications. In AES 103rd convention, AES preprint 4641, New York, September 1997.
[BBQ+96] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa. ISO/IEC MPEG-2 Advanced Audio Coding. In AES 101st convention, AES preprint 4382, Los Angeles, November 1996.
[Bla83] J. Blauert. Spatial Hearing. MIT Press, 1983.
[Bra87] K. Brandenburg. Evaluation of quality for audio encoding at low bit rates. In AES 82nd convention, AES preprint 2433, London, 1987.
[Dav93] M. Davis. The AC-3 multichannel coder. In AES 95th convention, AES preprint 3774, New York, October 1993.
[Equ89] W. H. Equitz. A new vector quantization clustering algorithm. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(10), October 1989.
[Fuc93] H. Fuchs. Improving joint stereo audio coding by adaptive inter-channel prediction.
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 39-42, 1993.
[GG91] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic, 1991.
[GL83] G. H. Golub and C. F. Van Loan. Matrix Computations. Baltimore, MD: Johns Hopkins Univ. Press, 1983.
[GLWW86] D. J. Goodman, G. B. Lockhart, O. J. Wasem, and W.-C. Wong. Waveform substitution techniques for recovering missing speech segments in packet voice communications. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(6), December 1986.
[HAB+98] J. Herre, E. Allamanche, K. Brandenburg, M. Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N. Iwakami, T. Norimatsu, M. Tsushima, and T. Ishikawa. The integrated filterbank based scalable MPEG-4 audio coder. In AES 105th convention, AES preprint 4810, San Francisco, CA, September 26-29 1998.
[Hay96] S. Haykin. Adaptive Filter Theory. Prentice Hall, third edition, 1996.
[HJ96] J. Herre and J. Johnston. Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS). In AES 101st convention, AES preprint 4384, Los Angeles, CA, November 1996.
[HT01] H. Holma and A. Toskala. WCDMA for UMTS, Radio Access for Third Generation Mobile Communications. Wiley, revised edition, 2001.
[ISOa] ISO/IEC JTC1/SC29/WG11 N1650. IS 13818-7 (MPEG-2 Advanced Audio Coding, AAC).
[ISOb] ISO/IEC JTC1/SC29/WG11 N2205. Final Text of ISO/IEC FCD 14496-5 Reference Software.
[ISOc] ISO/IEC JTC1/SC29/WG11 N2262. ISO/IEC TR 13818-5, Software Simulation.
[ISOd] ISO/IEC JTC1/SC29/WG11 N2425. MPEG-4 Audio verification test results: Audio on Internet.
[ISOe] ISO/IEC JTC1/SC29/WG11 N2503. Information Technology - Coding of Audio-Visual Objects - Part 3. ISO/IEC IS 14496-3:1999.
[ISOf] ISO/IEC JTC1/SC29/WG11 N2803.
Information Technology - Coding of Audio-Visual Objects - Part 3: Audio Amendment 1: Audio Extensions. ISO/IEC 14496-3:1999/AMD 1:2000.
[ISOg] ISO/IEC JTC1/SC29/WG11 N2803. Text ISO/IEC 14496-3 Amd 1/FPDAM.
[ISOh] ISO/IEC JTC1/SC29/WG11 N4025. Text of ISO/IEC 14496-5:2001.
[ITU98] ITU-T SG-16. WCDMA Error Patterns at 64kb/s, June 1998.
[ITU99] ITU-T SG-16. WCDMA Error Patterns, January 1999.
[Jay81] N. S. Jayant. Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure. IEEE Transactions on Communications, 29(2), February 1981.
[JF92] J. Johnson and A. Ferreira. Sum-difference stereo transform coding. In IEEE ICASSP, pages 569-571, 1992.
[JHDG96] J. Johnson, J. Herre, M. Davis, and U. Gbur. MPEG-2 NBC audio - stereo and multichannel coding methods. In AES 101st convention, AES preprint 4383, Los Angeles, CA, November 1996.
[K et al.97] C.-C. Jay Kuo and et al. Multithreshold wavelet codec (MTWC), November 1997. Doc. N. WG1N665.
[KJ01] S. Kuo and J. D. Johnston. A study of why cross channel prediction is not applicable to perceptual audio coding. IEEE Signal Processing Letters, 8(9), September 2001.
[LDJC83] S. Lin and D. J. Costello, Jr. Error Control Coding: Fundamentals and Applications. Prentice Hall, Englewood Cliffs, NJ, 1983.
[Lee99] J. Lee. Optimized quadtree for Karhunen-Loeve transform in multispectral image coding. IEEE Transactions on Image Processing, 8(4):453-461, April 1999.
[Mat] SQAM Sound Quality Assessment Material. http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/.
[MS77] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland, Netherlands, 1977.
[PA93] K. K. Paliwal and B. S. Atal. Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Transactions on Speech and Audio Processing, 1(1), January 1993.
[PB86] J.
Princen and A. Bradley. Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-34(5), October 1986.
[PFTV92] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes in C, the Art of Scientific Computing. Cambridge University Press, second edition, 1992.
[PKKS97] S. Park, Y. Kim, S. Kim, and Y. Seo. Multi-layer bit-sliced bit-rate scalable audio coding. In AES 103rd convention, AES preprint 4520, New York, NY, September 26-29 1997.
[PM93] W. Pennebaker and J. Mitchell. JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, New York, 1993.
[PS00] T. Painter and A. Spanias. Perceptual coding of digital audio. Proceedings of the IEEE, 88(4), April 2000.
[RADH87] R. A. Damon, Jr. and W. R. Harvey. Experimental Design, ANOVA, and Regression. Harper & Row, New York, 1987.
[RW85] S. Roucos and A. Wilgus. High quality time-scale modification of speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 493-496, 1985.
[SAK99] Y. Shen, H. Ai, and C.-C. Kuo. A progressive algorithm for perceptual coding of digital audio signals. In the 33rd Annual Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, October 1999.
[Sha93] J. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12), December 1993.
[SS99] D. Sinha and C.-E. W. Sundberg. Unequal error protection methods for perceptual audio coders. In 1999 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2423-2426, Phoenix, AZ, March 15-19 1999.
[STR95] J. Saghri, A. Tescher, and J. Reagan. Practical transform coding of multispectral imagery. IEEE Signal Processing Magazine, 12(1):32-43, January 1995.
[TDD+94] C. Todd, G. Davidson, M.
Davis, L. Fielder, B. Link, and S. Vernon. AC-3: Flexible perceptual coding for audio transmission and storage. In AES 96th convention, AES preprint 3796, Amsterdam, February 1994.
[VA01] M. S. Vinton and E. Atlas. A scalable and progressive audio codec. In IEEE ICASSP 2001, Salt Lake City, Utah, USA, May 2001.
[VR93] W. Verhelst and M. Roelands. An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 554-557, 1993.
[WGDP88] O. J. Wasem, D. J. Goodman, C. A. Dvorak, and H. G. Page. The effects of waveform substitution on the quality of PCM packet communications. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(3), March 1988.
[WHZ00] D. Wu, Y. T. Hou, and Y. Zhang. Transporting real-time video over the internet: Challenges and approaches. Proceedings of the IEEE, 88(12):1855-1877, December 2000.
[WV91] R. Waal and R. Veldhuis. Subband coding of stereophonic digital audio signals. In IEEE ICASSP, pages 3601-3604, 1991.
[WZZZ01] G. Wang, Q. Zhang, W. Zhu, and J. Zhou. Channel-adaptive error protection for scalable audio over channels with bit errors and packet erasures. In IEEE Globecom'01, November 2001.
[YAK02] D. Yang, H. Ai, and C.-C. Kuo. Progressive multichannel audio codec (PMAC) with rich features. In International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, May 13-17 2002.
[YAKK00a] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. An exploration of Karhunen-Loeve transform for multichannel audio coding. In Proc. SPIE on Digital Cinema and Microdisplays, volume 4207, pages 89-100, Boston, MA, November 5-8 2000.
[YAKK00b] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo.
An inter-channel redundancy removal approach for high-quality multichannel audio compression. In AES 109th convention, AES preprint 5238, Los Angeles, CA, September 2000.
[YAKK01a] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. Adaptive Karhunen-Loeve transform for enhanced multichannel audio coding. In Proc. SPIE on Mathematics of Data/Image Coding, Compression, and Encryption IV, volume 4475, pages 43-54, San Diego, CA, July 29-August 3 2001.
[YAKK01b] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. Embedded high-quality multichannel audio coding. In Conference on Media Processors, Part of the Symposium on Electronic Imaging 2001, San Jose, CA, January 21-26 2001.
[YAKK02a] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. Design of progressive syntax-rich multichannel audio codec. In Proc. SPIE, pages 121-132, San Jose, CA, January 2002.
[YAKK02b] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. Error-resilient design of high fidelity scalable audio coding. In Proc. SPIE on Digital Wireless Communications IV, volume 4740, Orlando, FL, April 1-5 2002.
[YAKK02c] D. Yang, H. Ai, C. Kyriakakis, and C.-C. Kuo. High fidelity multichannel audio coding with Karhunen-Loeve transform. Submitted for second-round review, February 2002.
[YFT+99] C. W. Yung, H. F. Fu, C. Y. Tsui, R. S. Cheng, and D. George. Unequal error protection for wireless transmission of MPEG audio. In ISCAS'99, Proc. of the 1999 IEEE International Symposium on Circuits and Systems, pages 342-345, Orlando, FL, May 30-June 2 1999.
[ZF90] E. Zwicker and H. Fastl. Psychoacoustics, facts, and models. Springer-Verlag, Berlin, 1990.
[ZL01] J. Zhou and J. Li. Scalable audio streaming over the internet with network-aware rate-distortion optimization. In IEEE International Conference on Multimedia and Expo 2001, Tokyo, Japan, August 2001.
[ZZXZ01] J. Zhou, Q. Zhang, Z. Xiong, and W. Zhu. Error resilient scalable audio coding (ERSAC) for mobile applications.
In Proc. Multimedia Signal Processing Workshop, Cannes, France, October 2001.

Appendix A

Descriptive Statistics and Parameters

A.1 Mean

One of several items of interest in a set of observations is a measure of the central or representative overall value. While such values as the mode, the median, and the midpoint of the range of values are occasionally considered, the most commonly used measure is the average, generally referred to as the arithmetic mean, or simply the mean. The mean is obtained by dividing the total of all observations by the number of observations, as given below:

ȳ = (Σi Yi) / n = Y. / n,    (A.1)

where ȳ is the common symbol for the mean, n is the number of observations in the sample, and Y. represents the sum of the observations. Table A.1 lists one sample data set. The mean for the data in this table is ȳ = Y./n = 1795/4 = 448.75 and is an estimate of the unknown population parameter μ.

Table A.1: Weaning weights of four Charolais steers (in pounds)

steer number    weight
1               420 = Y1
2               480 = Y2
3               430 = Y3
4               465 = Y4
                sum = 1795 = Σi Yi = Y.

A.2 Variance

When dealing with biological data, one is continually confronted with variability among observations. While various measures of this variability are used, the one of primary interest in statistical analysis is known as the variance. The variance in a sample or a group of observations is measured as the sum of the squares of the deviations from the sample mean divided by one fewer than the number of observations. It has been shown that the division by the total number of observations leads to a biased estimate of the population variance, while the division by one fewer
than the total number of observations leads to an unbiased estimate. The formula for calculating the variance in any sample set of data can be written

s² = Σi (Yi − ȳ)² / (n − 1),    (A.2)

where s² is the symbol used for the variance of the observations in the sample and n is the number of observations. The value s² is an estimate of the population variance, a parameter designated σ². The deviation Yi − ȳ is often written yi, since Σi yi = 0.

Table A.2: Variance calculation using deviations from the mean

Yi           Yi − ȳ      (Yi − ȳ)² = yi²
420          −28.75      826.5625
480           31.25      976.5625
430          −18.75      351.5625
465           16.25      264.0625
Y. = 1795     0.00       Σi yi² = 2418.7500

ȳ = Y./n = 1795/4 = 448.75
s² = Σi (Yi − ȳ)² / (n − 1) = Σi yi² / (n − 1) = 2418.7500/3 = 806.25

To calculate the variance of the observations in Table A.1, we present Table A.2, which shows the operations involved in calculating the variance using (A.2). The variance of 806.25 is an estimate of the population variance σ². While calculating the variance using (A.2) is relatively easy with only a few observations, it is quite cumbersome when a large number of observations are involved. It can be shown algebraically that

s² = Σi (Yi − ȳ)² / (n − 1) = (Σi Yi² − Y.² / n) / (n − 1).    (A.3)

The right-hand side of the equation, known as the "working formula," is easier to use with a large number of observations. By using (A.3) to calculate the variance of the observations given in Table A.1, we have

s² = (Σi Yi² − Y.²/n) / (n − 1) = (807,925 − (1795)²/4) / (4 − 1) = (807,925.00 − 805,506.25) / 3 = 2418.75/3 = 806.25.

A.3 Standard Deviation

While it is mathematically convenient to use squared deviations as a measure of dispersion or variation about the mean, it is normal to think of this variation in terms of the original values. We can merely take the square root of the variance to return to the original scale of measurement.
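The arithmetic in Tables A.1 and A.2 can be verified directly; the short check below reproduces the mean from (A.1) and the variance from both the definitional formula (A.2) and the working formula (A.3).

```python
# Numerical check of Eqs. (A.1)-(A.3) on the Table A.1 weights.
weights = [420, 480, 430, 465]          # Y1..Y4
n = len(weights)

mean = sum(weights) / n                 # (A.1): 1795/4

# (A.2): definitional formula via squared deviations from the mean.
var_def = sum((y - mean) ** 2 for y in weights) / (n - 1)

# (A.3): the "working formula".
var_work = (sum(y * y for y in weights) - sum(weights) ** 2 / n) / (n - 1)

print(mean)       # 448.75
print(var_def)    # 806.25
print(var_work)   # 806.25
```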
The square root of the variance is known as the standard deviation and is expressed as

    s = √s² = √( Σ_i (Y_i - ȳ)² / (n - 1) ),    (A.4)

where s is the symbol for the standard deviation of a sample from the population and is an estimate of the parameter σ. For the example under discussion, we have

    s = √806.25 = 28.39.    (A.5)

Figure A.1: Areas of the normal curve.

If the data of interest follow the normal curve, the population of observations would be visually described by the normal curve as shown in Figure A.1. That is, if one were able to plot the values of all individuals in a population, their frequencies would follow a bell-shaped curve, or distribution, with a clustering of observations not greatly different in size from the mean and a reduction in the number of observations as the size of the deviation from the mean increases in either direction. Statistical theory tells us that the area under the curve included by μ ± 1σ would include 68.26 percent of the variates or observations, μ ± 2σ would include 95.46 percent of the observations, and μ ± 3σ would include 99.73 percent of the values. It can also be stated that 50 percent of the values would fall between μ ± 0.674σ, 95 percent of the values would fall between μ ± 1.96σ, and 99 percent of the values would fall between μ ± 2.576σ. Assuming that the estimates of the population standard deviation and the mean from our sample are the population parameters (i.e., σ = 28.38 and μ = 448.75), we would expect 50 percent of the weaning weights of Charolais steers to fall between 429.62 and 467.88 lb, 95 percent of the weaning weights to fall between 392.96 and 504.54 lb, and 99 percent to fall between 375.62 and 521.88 lb.
It should be noted that four observations provide an extremely small number from which to make such estimations.

A.4 Standard Error of the Mean

If a number of samples are drawn from a population, the mean of each of the samples could be calculated, and we would find that we had variability among these sample means just as we had variability among individual observations in a single sample. If a large number of sample means were calculated, the mean of these sample means would be an estimate of the parameter μ. The means would also be distributed in a normal curve similar to that of the original population. However, the variation among the means would be smaller than that among individual observations, since the mean of a sample will deviate less from the overall mean than will some members of the sample. The variance among a group of sample means would be measured as

    s_ȳ² = Σ_i (ȳ_i - ȳ.)² / (k - 1),    (A.6)

where s_ȳ² represents the variance of a group of sample means, ȳ_i represents an individual sample mean, ȳ. represents the mean of all sample means, and k is the number of means included. The value s_ȳ² would be an estimate of the parameter σ_ȳ². A common situation arises when we have only one sample mean and wish to estimate the variance to be expected in a distribution of several means. This variance is estimated as

    s_ȳ² = s² / n,    (A.7)

where s² is the variance calculated among the observations within the sample, and n is the number of observations in the sample. The square root of the variance of the means, s_ȳ, would be the standard deviation of the mean. However, the standard deviation of a statistic such as a mean, whether calculated as the square root of (A.6) or estimated as the square root of (A.7), is normally referred to as the standard error of the statistic, in this case the standard error of the mean. The term standard deviation is usually reserved to describe the variation among individual observations.
In our sample, the standard error of the mean would be

    s_ȳ = s / √n = 28.39 / √4 = 14.20.

It can be seen that the standard error of the mean is inversely related to the square root of the sample size. As the sample size increases, the standard error of the mean decreases. The standard error of the mean serves the same purpose for the distribution of sample means as does the standard deviation for a distribution of individual observations. The standard error of the mean is an estimate of the parameter σ_ȳ, and a range of μ ± 1σ_ȳ includes 68.26 percent of a population of means, while a range of μ ± 1.96σ_ȳ includes 95 percent of means. The standard error of the mean is used frequently in statistics to indicate what amount of variation would be expected with continued sampling of a population of means.

A.5 Confidence Interval

The standard error of a statistic is used frequently to develop what is termed a confidence interval. A confidence interval is a range between upper and lower limits, which is expected to include the true population value of a parameter at a selected level of probability. The upper and lower limits are referred to as confidence limits and are a function of a t value for a given level of probability and the standard error of the statistic. The necessary t values are found in Table A.3 and are derived from the t distribution developed by William S. Gossett (Student, 1908) and perfected by R. A. Fisher (1926). W. S. Gossett (1876-1937) was a brewer and statistician who published under the name of Student. R. A. Fisher (1890-1962) was one of the pioneers in the field of statistics and made outstanding contributions in a great many areas of statistical theory and application.
Let us define the statistic

    t = (ȳ - μ) / s_ȳ,    (A.8)

where ȳ, s_ȳ and μ are the sample mean, the standard error of the mean and the population parameter, respectively. The curve for the t distribution is symmetric, and as the degrees of freedom increase, the t distribution comes closer to the normal curve in form. Values for degrees of freedom of ∞ are those of the normal distribution. The t distribution has found great utility in statistical procedures, and its application with regard to statistics other than the mean, and their standard errors, can be found in [RADH87]. If we wished to develop the 95 percent confidence interval about the mean of a set of observations, we would calculate the confidence limits as

    Lower confidence limit = ȳ - t_0.05 s_ȳ,    (A.9)
    Upper confidence limit = ȳ + t_0.05 s_ȳ,    (A.10)

where the t value is found in the table of the t distribution, i.e. Table A.3, under the column headed the 0.05 level of probability and in the row for n - 1 degrees of freedom, where n is the number of observations in the sample. The confidence interval about the population mean can then be written

    ȳ - t_0.05 s_ȳ < μ < ȳ + t_0.05 s_ȳ.    (A.11)

The 95 percent confidence interval for the mean of the four observations in Table A.1 would be from 448.75 - (3.182)(14.20) to 448.75 + (3.182)(14.20), or 403.57 to 493.93. Calculation of this interval leads to the statement that we feel 95 percent confident that the population mean μ lies between 403.57 and 493.93 lb. The interval is commonly written

    ȳ ± t_0.05 s_ȳ,    (A.12)

which in our example would be 448.75 ± 45.18.

Table A.3: Distribution of t: two-tailed tests.
Probability of a larger value of t, sign ignored

    df    0.500   0.400   0.300   0.200   0.100   0.050    0.020    0.010    0.001
    1     1.000   1.376   1.963   3.078   6.314   12.706   31.821   63.657   636.619
    2     0.816   1.061   1.386   1.886   2.920   4.303    6.965    9.925    31.598
    3     0.765   0.978   1.250   1.638   2.353   3.182    4.541    5.841    12.941
    4     0.741   0.941   1.190   1.533   2.132   2.776    3.747    4.604    8.610
    5     0.727   0.920   1.156   1.476   2.015   2.571    3.365    4.032    6.859
    6     0.718   0.906   1.134   1.440   1.943   2.447    3.143    3.707    5.959
    7     0.711   0.896   1.119   1.415   1.895   2.365    2.998    3.499    5.405
    8     0.706   0.889   1.108   1.397   1.860   2.306    2.896    3.355    5.041
    9     0.703   0.883   1.100   1.383   1.833   2.262    2.821    3.250    4.781
    10    0.700   0.879   1.093   1.372   1.812   2.228    2.764    3.169    4.587
    11    0.697   0.876   1.088   1.363   1.796   2.201    2.718    3.106    4.437
    12    0.695   0.873   1.083   1.356   1.782   2.179    2.681    3.055    4.318
    13    0.694   0.870   1.079   1.350   1.771   2.160    2.650    3.012    4.221
    14    0.692   0.868   1.076   1.345   1.761   2.145    2.624    2.977    4.140
    15    0.691   0.866   1.074   1.341   1.753   2.131    2.602    2.947    4.073
    16    0.690   0.865   1.071   1.337   1.746   2.120    2.583    2.921    4.015
    17    0.689   0.863   1.069   1.333   1.740   2.110    2.567    2.898    3.965
    18    0.688   0.862   1.067   1.330   1.734   2.101    2.552    2.878    3.922
    19    0.688   0.861   1.066   1.328   1.729   2.093    2.539    2.861    3.883
    20    0.687   0.860   1.064   1.325   1.725   2.086    2.528    2.845    3.850
    21    0.686   0.859   1.063   1.323   1.721   2.080    2.518    2.831    3.819
    22    0.686   0.858   1.061   1.321   1.717   2.074    2.508    2.819    3.792
    23    0.685   0.858   1.060   1.319   1.714   2.069    2.500    2.807    3.767
    24    0.685   0.857   1.059   1.318   1.711   2.064    2.492    2.797    3.745
    25    0.684   0.856   1.058   1.316   1.708   2.060    2.485    2.787    3.725
    26    0.684   0.856   1.058   1.315   1.706   2.056    2.479    2.779    3.707
    27    0.684   0.855   1.057   1.314   1.703   2.052    2.473    2.771    3.690
    28    0.683   0.855   1.056   1.313   1.701   2.048    2.467    2.763    3.674
    29    0.683   0.854   1.055   1.311   1.699   2.045    2.462    2.756    3.659
    30    0.683   0.854   1.055   1.310   1.697   2.042    2.457    2.750    3.646
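As an illustrative check (added here; not part of the original appendix), the standard error of Section A.4 and the 95 percent confidence interval of Section A.5 can be reproduced in a few lines of Python, using the tabulated value t_0.05 = 3.182 for 3 degrees of freedom:

```python
import math

# Reproduce the standard error (A.7) and the 95% confidence interval (A.11)
# for the weaning weights of Table A.1.
weights = [420, 480, 430, 465]
n = len(weights)
ybar = sum(weights) / n                                          # 448.75
s = math.sqrt(sum((y - ybar) ** 2 for y in weights) / (n - 1))   # ~28.39
se = s / math.sqrt(n)                                            # ~14.20
t_05 = 3.182                           # Table A.3, n - 1 = 3 df, 0.05 level

lower = ybar - t_05 * se
upper = ybar + t_05 * se
print(round(se, 2), round(lower, 2), round(upper, 2))  # 14.2 403.57 493.93
```

The interval matches the 403.57 to 493.93 lb range computed in Section A.5.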
Appendix B

Karhunen-Loeve Expansion

B.1 Definition

Let the M-by-1 vector u(n) denote a data sequence drawn from a wide-sense stationary process of zero mean and correlation matrix R. Let q_1, q_2, ..., q_M be the eigenvectors associated with the M eigenvalues of the matrix R. Vector u(n) may be expanded as a linear combination of these eigenvectors as follows:

    u(n) = Σ_{i=1}^{M} c_i(n) q_i.    (B.1)

The coefficients of the expansion are zero-mean, uncorrelated random variables defined by the inner product

    c_i(n) = q_i^H u(n),  i = 1, 2, ..., M.    (B.2)

The representation of the random vector u(n) described by (B.1) and (B.2) is the discrete-time version of the Karhunen-Loeve expansion. In particular, (B.2) is the "analysis" part of the expansion in that it defines c_i(n) in terms of the input vector u(n). On the other hand, (B.1) is the "synthesis" part of the expansion in that it reconstructs the original input vector u(n) from c_i(n). Given the expansion of (B.1), the definition of c_i(n) in (B.2) follows directly from the fact that the eigenvectors q_1, q_2, ..., q_M form an orthonormal set, assuming they are all normalized to have unit length. Conversely, this same property may be used to derive (B.1), given (B.2).

B.2 Features and Properties

The coefficients of the expansion are random variables characterized by the following properties:

    E[c_i(n)] = 0,  i = 1, 2, ..., M,    (B.3)

and

    E[c_i(n) c_j*(n)] = λ_i if i = j, and 0 if i ≠ j.    (B.4)

Equation (B.3) states that all coefficients of the expansion have zero mean, which follows directly from (B.2) and the fact that the random vector u(n) is itself assumed to have zero mean. Equation (B.4) states that the coefficients of the expansion
The second equation can be easily derived as below: E[ci(n)c*{n)\ = E[(q?u(n))(qfu(n))H} = < 1 ? E{u(n)u{n)H}qj = qfR q* = A^qfqj { A;, i = j 0, i ^ j For a physical interpretation of the Karhunen-Loeve expansion, we may view eigenvectors qx, q2, . . qm as coordinates of an M-climensional space. Random vector u(n) can be represented by the set of its projections Ci(n), c2(n), ..., C m { ’ h ) onto these axes, respectively. Moreover, we deduce from (B.l) that M J 2 \ci(n)\2 = \u(n)\2 (B.5) i = 1 where \u(n)\ is the Euclidean norm of u(n). That is, coefficient C j(n ) has an energy equal to that of the observation vector u (n) measured along the ith coordinate. Nat urally, this energy is a random variable whose mean value equals the ith eigenvalue, as shown by 160 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. E[\ci(n)\2] = Aj, i = 1,2,..., M. (B.6) This result follows directly from (B.2) and (B.4). 161 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A p p en d ix C P sychoacoustics Starting from the first generation of PCM/DPCM coding to the second generation of perceptual audio coding, psychoacoustics plays an important role in reducing the bit rate in audio compression. Most current audio coders achieve compression by exploiting the fact that ’’irrelevant” signal information is not detectable by even a well trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating several psychoacoustic principles in the coder design such as the absolute hearing threshold, simultaneous masking and temporal masking, etc. [PSOO]. This section reviews some fundamental knowledge of psychoacoustic principles that are commonly used in perceptual audio coding. C .l H earing A rea The hearing area is a plane in which audible sounds can be displayed. 
In its normal form, the hearing area is plotted with the frequency on a logarithmic scale as the abscissa, and the sound pressure level in dB on a linear scale as the ordinate. This means that two logarithmic scales are used, because the level is related to the logarithm of the sound pressure. The critical-band rate may also be used as the abscissa. This scale is more relevant to the features of our hearing system than the frequency. The usual display of the human hearing area is shown in Fig. C.1. On the right, the ordinate scales are either the sound intensity in Watt per square meter (W/m²) or the sound pressure in Pascal (Pa). The sound pressure level is given for a free-field condition relative to 2 x 10⁻⁵ Pa. The sound intensity level is plotted relative to 10⁻¹² W/m². A range of about 15 decades in the intensity, or 7.5 decades in the sound pressure (corresponding to a range of 150 dB in the sound pressure level), is encompassed by the ordinate scale. As to the abscissa, we must realize that our hearing organ produces sensations for pure tones within three decades in frequency, ranging from 20 Hz to 20 kHz. The actual hearing area represents that range, which lies between the threshold in quiet (the limit towards low levels) and the threshold of pain (the limit towards high levels). These thresholds are given in Fig. C.1 as solid and broken lines, respectively. These limits hold for pure tones in the steady-state condition, i.e. for tones lasting longer than about 100 ms. If speech is resolved into spectral components, the region it normally occupies can also be illustrated in the hearing area. In Fig. C.1, the range encompassed by speech sounds is indicated by the area hatched from the top left to the bottom right, starting near 100 Hz and ending near 7 kHz. The area as indicated holds for normal speech, for example, as delivered in a small lecture hall.

Figure C.1: Illustration of the hearing area, i.e. the area between the threshold in quiet and the threshold of pain. Also indicated are the areas encompassed by music and speech, and the limit of damage risk. The ordinate scale is expressed not only in the sound pressure level but also in the sound intensity. The dotted part of the threshold in quiet stems from subjects who frequently listen to very loud music.

The components of music encompass a larger distribution in the hearing area, as indicated in Fig. C.1 by different hatching. It starts at low frequencies near 40 Hz and reaches about 10 kHz. Including pianissimo and fortissimo, the dynamic range of music starts at sound pressure levels below 20 dB and reaches levels in excess of 95 dB. Extreme and rare cases are ignored for the spectral distributions of music and speech displayed. It can be seen, however, that both areas are well above the threshold in quiet, which will be explained in more detail later. The threshold in quiet is, as a function of the frequency, the sound pressure level at which a pure tone is just audible. This threshold can be measured quite easily by experienced or inexperienced subjects. The reproducibility of the threshold in quiet for a single subject is high and lies normally within ±3 dB. The frequency dependence of the threshold in quiet can be measured precisely and quickly by Bekesy tracking. A recording produced in this manner is shown in Fig. C.2.
The components of music encompass a larger distribution in the hearing area as given in Fig. C.l 163 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 140 dB 120 140 dB 120 threshold of pain 100 .lim it of damage risk 100 music speech 40 -o . threshold in quiet 02 .05 .1 .2 .5 1 2kHz 5 10 20 100 w m 2 1 10-2 1 0 - 4 — c " O o 1 0 -1 0 C O 10-12 200 Pa 20 0.2 £ z > C O 2 - 1 O '2 < 8 Q. 2-10-3 E Z J 2 1 0'4 W 2 4 0 -5 frequency of test tone Figure C.l : Illustration of the hearing area, i.e. the area between the threshold on quiet and the threshold of pain. Also indicated are areas encompassed by music and speech, and the limit of damage risk. The ordinate scale is not only expressed in the sound pressure level but also in the sound intensity. The dotted part of the threshold in quiet stems from subjects who frequently listen to very loud music. by different hatching. It starts at low frequencies near 40 Hz and reaches about 10 kHz. Including pianissimo and fortissimo, the dynamic range of music starts at sound pressure levels below 20 clB and reaches levels in excess of 95 clB. Extreme and rare cases are ignored for spectral distributions of music and speech displayed. It can be seen, however, that both areas are well above the threshold in quiet, which will be explained in more detail later. The threshold in quiet is a function of the frequency, where the sound pressure level of a pure tone is just audible. This threshold can be measured quite easily by experienced or inexperienced subjects. The reproducibility of the threshold in quiet for a single subject is high and lies normally within ±3 dB. 164 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The frequency dependence of the threshold in quiet can be measured precisely and quickly by Bekesy tracking. A recording produced in this manner is shown in Fig. C.2. 
A whole recording from low to high frequencies lasts about 15 minutes. In order to show the reproducibility of such a tracking of threshold in quiet, two trackings, one with an upward sweep and another with a downward sweep in the frequency, are shown in Fig. C.2 for a frequency range between 0.3 and 8 kHz. Excursions of the zigzag reach as much as 12 dB (i.e. about ±6 dB). The middle of this zigzag curve is defined as the threshold in quiet. dB tu c o « 4 - J ■ 4 — » C O 0 4-* 4— o -20 .02 .05 .1 .2 .5 1 2kHz 5 10 20 frequency of test tone Figure C.2: Illustration of the threshold in quiet, i.e. the just-noticeable level of a test tone as a function of its frequency, registered with the method of tracking. Note that the threshold is measured twice between 0.3 and 8 kHz. 165 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C.2 M asking Masking plays a very important role in everyday life. For a conversation on pave ments of a quiet street, for example, little speech power is necessary for speakers to understand each other. However, if a loud truck passes by, the conversation is severely disturbed. By keeping the speech power constant, one person can no longer hear from the other. There are two ways to deal with the masking phenomenon. We can either wait until the truck passed and then continue our conversation or raise our voice to produce more speech power and greater loudness. Our partner can hear the speech sound again. Similar effects take place in most pieces of music. One instrument may be masked by another if one of them produces high levels while the other remains faint. If the loud instrument pauses, the faint one becomes audible again. These are typical examples of simultaneous masking. To measure the effect of masking quantitatively, the masked threshold is usually determined. 
The masked threshold is the sound pressure level of a test sound (usually a sinusoidal test tone) necessary to be just audible in the presence of a masker. The masked threshold, in all but a very few special cases, always lies above the threshold in quiet. It is identical with the threshold in quiet when the frequencies of the masker and the test sound are very different. If the masker is increased steadily, there is a continuous transition between an audible (unmasked) test tone and one that is totally masked. This means that, besides total masking, partial masking also occurs. Partial masking reduces the loudness of a test tone but does not mask the test tone completely. This effect often takes place in conversations. Masking effects can be measured not only when the masker and the test sound are present simultaneously, but also when they are not simultaneous. In the latter case, the test sound has to be a short burst or sound impulse, which may be presented before the masker stimulus is switched on. The masking effect produced under these conditions is called pre-stimulus masking, shortened to "premasking". This effect is not very strong. However, if the test sound is presented after the masker is switched off, then a quite pronounced effect may occur. Since the test sound is present after the termination of the masker, the effect is called post-stimulus masking, shortened to "postmasking".

C.2.1 Masking of Pure Tones

Several different masking effects are studied in Zwicker's book. We will discuss the masking of pure tones by complex tones here, since this is the most common situation occurring in music. Most instrumental sounds in music are composed of a fundamental tone and many harmonics. The difference in timbre produced by different musical instruments depends on the frequency spectra of their harmonics.
Whereas a flute produces primarily one signal component (i.e. the fundamental), a trumpet produces many harmonic partials and therefore elicits a much broader masking effect than a flute. Fig. C.3 shows the thresholds of pure tones masked by a complex tone composed of a 200 Hz fundamental frequency and nine higher harmonics, all with the same amplitude but random in phase. The masked thresholds are given for sound pressure levels of 40 dB and 60 dB of each partial. On the logarithmic frequency scale, the distance between partials is relatively large at low frequencies, but becomes very small between the ninth and tenth harmonic. Accordingly, the dips between harmonics become smaller and smaller with increasing frequency of the test tone. In the frequency range between 1.5 kHz and 2 kHz, the maxima and minima can hardly be distinguished. At frequencies above the last harmonic (in our case 2 kHz), the masked thresholds are flatter towards higher frequencies at higher levels of the masking complex. At frequencies one to two octaves above the highest spectral component, the masked thresholds approach the threshold in quiet. In music, many complex tones, each composed of many harmonics, are used at the same time. This means that the corresponding masking effect can be assumed to produce shapes similar to those outlined in Fig. C.3. However, the minima between the lines become even smaller because the density of lines is higher. It should be noted that non-random phase conditions of the components lead to temporal envelopes of the sound that can be described as impulsive. Consequently, temporal effects in masking may become a crucial factor in determining the masked threshold. Effects of this kind are discussed in the following section.
Figure C.3: The level of a test tone masked by ten harmonics of 200 Hz as a function of the frequency of the test tone. The levels of the individual harmonics, of equal size, are given as the parameter (40 dB and 60 dB per tone).

C.2.2 Temporal Effects

Masking in steady-state conditions, with long-lasting test and masking sounds, was described in the previous sections. However, the transmission of information in music or speech implies a strong temporal structure of the sound. Loud sounds are followed by faint sounds and vice versa. In speech, vowels generally represent the loudest parts whereas consonants are relatively faint. A plosive consonant is a typical example of a sound that is often masked by a preceding loud vowel. The effect occurs not only because of the reverberation of the room in which speech is received, but also in free-field conditions, because of the temporal effects of masking which characterize our hearing system. To measure these effects quantitatively, maskers of a limited duration are given and the masking effects are tested with short test-tone bursts or short pulses. Further, the short signal is shifted in time relative to the masker, as illustrated in Fig. C.4.

Figure C.4: Schematic drawing to illustrate and characterize the regions within which premasking, simultaneous masking and postmasking occur. Note that postmasking uses a different time origin (the delay time t_d) than premasking and simultaneous masking (the time after masker onset, Δt).

In Fig. C.4, a 200 ms masker masks a short tone burst of as small a duration as possible, negligible in relation to the duration of the masker. In such a case, it is advantageous to use two different time scales.
At first, the value Δt corresponds to the time relative to the onset of the masker; both positive and negative values exist. The second time scale starts at the end of the masker. This time is often called the delay time and is indicated by t_d. It is convenient to use as the ordinate not the sound pressure level of the test-tone burst, but the level above the threshold of this sound. This level is referred to as the sensation level. Three different temporal regions of masking relative to the presentation of the masker stimulus can be differentiated. Premasking takes place during that period of time before the masker is switched on. In this period, negative values of Δt apply. The period of pre-stimulus masking is followed by simultaneous masking, when the masker and the test sound are presented simultaneously. In this condition, Δt is positive. After the end of the masker, post-stimulus masking, normally called postmasking, occurs. During the time scale given by the positive delay time t_d, the masker is not physically present; nevertheless, it still produces masking. The effect of postmasking corresponds to a decay in the effect of the masker and is more or less expected. Premasking, however, represents an effect that is unforeseen, because it appears during a time before the masker is switched on. This does not mean, of course, that our hearing system is able to listen into the future. Rather, the effect is understandable if one realizes that each sensation - including premasking - does not exist instantaneously, but requires a build-up time to be perceived. If we assume a quick build-up time for loud maskers and a slower build-up time for faint test sounds, then we can understand why premasking exists. The time during which premasking can be measured is relatively short and lasts only about 20 ms.
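The three regions of Fig. C.4 can be summarized in a small sketch (an illustration added here, not part of the original text; the function name is hypothetical). The 20 ms premasking span and the roughly 200 ms postmasking span are the values quoted in this section:

```python
# Classify a test-burst time into the masking regions of Fig. C.4.
# Times are in milliseconds; delta_t is measured from masker onset,
# and the masker lasts masker_ms (200 ms in the figure).
PRE_MS = 20     # premasking extends ~20 ms before masker onset
POST_MS = 200   # postmasking ends after ~200 ms of delay time t_d

def masking_region(delta_t, masker_ms=200):
    if -PRE_MS <= delta_t < 0:
        return "premasking"
    if 0 <= delta_t <= masker_ms:
        return "simultaneous masking"
    t_d = delta_t - masker_ms       # delay time, measured from masker offset
    if 0 < t_d <= POST_MS:
        return "postmasking"
    return "no masking"

print(masking_region(-10))   # premasking
print(masking_region(100))   # simultaneous masking
print(masking_region(250))   # postmasking (t_d = 50 ms)
```

Note the change of time origin for postmasking, mirroring the two time scales of Fig. C.4.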
Postmasking, on the other hand, can last longer than 100 ms and ends after about a 200 ms delay. Therefore, postmasking is the dominant non-simultaneous temporal masking effect.

Appendix D

MPEG Advanced Audio Coding

MPEG AAC (Advanced Audio Coding) is the newest and most powerful member of the MPEG family for high-quality, digital audio compression. AAC allows for a flexible selection of operating modes and configurations. Applications of MPEG AAC range from low bit-rate Internet audio to multichannel broadcasting services. The review presented here is mostly based on the MPEG-2 AAC document ISO/IEC 13818-7 [ISOa]. The following sections describe the main features of the AAC multichannel coding system.

D.1 Overview of MPEG-2 Advanced Audio Coding

Efficient audio coding removes redundant information from audio signals. Correlations between audio samples and the statistics of the sample representation are exploited in order to remove redundancy. Frequency-domain and time-domain masking properties of the human auditory system [ZF90] are also exploited in order to remove
In order to allow the tradeoff between quality and memory/processing power requirements, the AAC system offers three profiles: the Main Profile, the Low Com plexity (LC) Profile, and the Scalable Sampling Rate (SSR) Profile, where the main profile is the highest quality profile. 1. Main Profile In this configuration, the AAC system provides the best audio quality at any given data rate. W ith exception of the preprocessing tool, all parts of AAC tools may be used. The memory and the processing power required in this configuration are higher than those required in the low complexity profile con figuration. It should also be noted that a main profile AAC decoder can decode a low-complexity-profile encoded bit stream. 2. Low C om plexity Profile In this configuration, the prediction and preprocessing tools are not employed and the TNS order is limited. While the quality performance of the LC profile 173 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Input time signal Quantized Spectrum 13818-7 Coded Audio Stream Bitstream Multiplex Previou 5 Frame Iteration Loops M/S TNS Prediction Filter Bank Intensity/ Coupling Quantizer Scale Factors Noiseless Coding Data ’ Control — Legend Perceptual Model Rate/Distortion Control Process Figure D.l: The block diagram of the AAC encoder. 174 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 13818-7 Coded Audio Stream Bitstream Demultiplex Output Time Signal M/S TNS Intensity/ Coupling Prediction Filter Bank Inverse Quantizer Scale Factors Noiseless Decoding Data ‘ Control Legend Figure D.2: The block diagram of the AAC decoder. 175 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. is very high, the memory and the processing power requirements are consider ably reduced in this configuration. 3. 
Scalable Sampling Rate Profile

In this configuration, the preprocessing tools are required. They include a polyphase quadrature filter (PQF), gain detectors and gain modifiers. The prediction module is not used in this profile, and the TNS order and bandwidth are limited. The SSR Profile has a lower complexity than that of the Main Profile or the LC Profile, and it can provide a frequency-scalable signal.

Figure D.3: Three AAC profiles.

D.2 Preprocessing

The preprocessing block is only used in the SSR Profile. It consists of a polyphase quadrature filter (PQF), gain detectors and gain modifiers. The PQF has four unique bandwidth outputs. At the sampling rate of 48 kHz, it can provide four output bandwidths of 24 kHz, 18 kHz, 12 kHz and 6 kHz. The gain detectors produce the gain control data compliant with the 13818-7 syntax. This information consists of the number of gain data, the index of position and the index of level. The gain control is applied in order to suppress pre-echo. The amplitudes of each PQF band are controlled independently by the gain detectors, and gain control can be applied in conjunction with all types of window sequences. The time resolution of gain control is approximately 0.7 ms at the 48 kHz sampling rate. The step size of gain control is 2^n, where n is an integer between -4 and 11. The signal can be amplified or attenuated by gain control. The gain modifiers control the gain of each PQF band. The modification function (MOD) is calculated by the gain control data decoding and gain control function setting processes. Gain-controlled signals are derived by applying MOD to the corresponding signal bands.
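As a small illustration (added here; the helper name and the range check are assumptions, only the 2^n step rule and the index range come from the text), the mapping from a gain level index to a linear amplitude factor is:

```python
# Gain control step size per the SSR description: a factor of 2**n,
# with the integer level index n restricted to the range [-4, 11].
def gain_factor(n):
    if not (-4 <= n <= 11):
        raise ValueError("gain level index must lie between -4 and 11")
    return 2.0 ** n

print(gain_factor(-4), gain_factor(0), gain_factor(11))  # 0.0625 1.0 2048.0
```

A step size that is a power of two keeps both the encoder-side attenuation and the decoder-side compensation exactly invertible in fixed-point arithmetic.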
D.3 Filter Bank

A fundamental component of the MPEG AAC audio coder is the conversion of time domain signals at the input of the encoder into an internal time-frequency representation, and its reverse process in the decoder. This conversion is done by a forward modified discrete cosine transform (MDCT) in the encoder, and an inverse modified discrete cosine transform (IMDCT) in the decoder. MDCT and IMDCT employ a technique called time domain aliasing cancellation (TDAC). Additional information about TDAC and the window-overlap-add process can be found in [PB86]. The analytical expression for the MDCT can be written as

    X_{i,k} = 2 * sum_{n=0}^{N-1} x_{i,n} cos( (2*pi/N) * (n + n0) * (k + 1/2) ),   0 <= k <= N - 1.

Since the sequence X_{i,k} is odd-symmetric, coefficients from 0 to N/2 - 1 uniquely specify the transform. The analytical expression of the IMDCT is

    x_{i,n} = (2/N) * sum_{k=0}^{N/2-1} X_{i,k} cos( (2*pi/N) * (n + n0) * (k + 1/2) ),   0 <= n <= N - 1,

where

    n  = sample index,
    N  = transform block length,
    i  = block index,
    n0 = (N/2 + 1)/2.

In the encoder, this processing takes the appropriate block of time samples, modulates them by an appropriate window function, and performs the MDCT to ensure good frequency selectivity for the filter bank. Each block of input samples is overlapped by 50% with the immediately preceding block and the following block. The length N of the transform block can be set to either 2048 or 256 samples. Because the window function has a significant effect on the filter-bank frequency response, the MPEG AAC filter bank has been designed to allow a change in the window shape to best adapt to input signal conditions.
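The MDCT/IMDCT pair and the TDAC property can be checked numerically. The sketch below implements the two formulas above directly and uses a sine window (which satisfies the Princen-Bradley condition; AAC also allows a Kaiser-Bessel-derived window — the window choice and block length here are illustrative assumptions):

```python
import numpy as np

def mdct(x):
    """Forward MDCT of one windowed block of N samples -> N/2 coefficients."""
    N = len(x)
    n0 = (N / 2 + 1) / 2
    k = np.arange(N // 2)
    n = np.arange(N)
    C = np.cos(2 * np.pi / N * np.outer(k + 0.5, n + n0))
    return 2.0 * C @ x

def imdct(X):
    """Inverse MDCT of N/2 coefficients -> N (time-aliased) samples."""
    M = len(X)
    N = 2 * M
    n0 = (N / 2 + 1) / 2
    n = np.arange(N)
    k = np.arange(M)
    C = np.cos(2 * np.pi / N * np.outer(n + n0, k + 0.5))
    return (2.0 / N) * C @ X

N = 64                                           # a short block for the demo
win = np.sin(np.pi / N * (np.arange(N) + 0.5))   # sine window (Princen-Bradley)
rng = np.random.default_rng(0)
sig = rng.standard_normal(4 * N)

hops = range(0, len(sig) - N + 1, N // 2)        # 50% overlap
coeffs = [mdct(win * sig[h:h + N]) for h in hops]

out = np.zeros_like(sig)
for h, X in zip(hops, coeffs):
    out[h:h + N] += win * imdct(X)               # window again and overlap-add

# time-domain aliasing cancels wherever two windowed blocks overlap
err = np.max(np.abs(out[N // 2:-N // 2] - sig[N // 2:-N // 2]))
```

Note that each block of N samples yields only N/2 coefficients, so despite the 50% overlap the transform is critically sampled, and the interior of the signal is reconstructed exactly once the aliasing of adjacent blocks cancels.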
The shape of the window is varied simultaneously in the encoder and the decoder to allow the filter bank to efficiently separate spectral components of the input for a wider variety of input signals. The use of the 2048-sample time-domain transform improves coding efficiency for signals with complex spectra, but may create problems for transient signals. Quantization errors extending more than a few milliseconds before a transient event are not effectively masked by the transient itself. This leads to a phenomenon called pre-echo, in which the quantization error from one transform block is spread in time and becomes audible. Long transforms are thus inefficient for coding signals which are transient in nature; transient signals are best encoded with relatively short transform lengths. Unfortunately, short transforms produce inefficient coding of steady-state signals due to their poorer frequency resolution. MPEG AAC circumvents this problem by allowing the block length of the transform to vary as a function of signal conditions. Signals that are short-term stationary are best accommodated by the long transform, while transient signals are generally reproduced more accurately by short transforms. The transition between long and short transforms is seamless in the sense that aliasing is completely canceled in the absence of transform coefficient quantization.

D.4 Temporal Noise Shaping

A novel concept in perceptual audio coding is represented by the temporal noise shaping (TNS) tool of AAC [HJ96]. This tool is motivated by the fact that, despite the advanced state of today's perceptual audio coders, the handling of transient and pitched input signals still presents a major challenge. This is mainly due to the problem of maintaining the masking effect in the reproduced audio signal.
In particular, coding is difficult because of the temporal mismatch between the masking threshold and the quantization noise (also known as the pre-echo problem). The TNS technique permits the coder to exercise control over the temporal fine structure of the quantization noise even within a filter-bank window. The concept of this technique can be outlined as follows.

• Time/frequency duality considerations. The concept of TNS uses the duality of time and frequency to extend known coding techniques by a new variant. It is well known that signals with an "un-flat" spectrum can be coded efficiently either by directly coding spectral values ("transform coding") or by applying predictive coding methods to time signals. Consequently, the corresponding dual statement relates to the coding of signals with an "un-flat" time structure, i.e. transient signals. Efficient coding of transient signals can thus be achieved either by directly coding time domain values or by employing predictive coding methods on spectral data. Such predictive coding of spectral coefficients in the frequency domain in fact constitutes the dual concept to the intra-channel prediction tool described in Section D.6. While intra-channel prediction over time increases the coder's spectral resolution, prediction over frequency enhances its temporal resolution.

• Noise shaping by predictive coding. If an open-loop predictive coding technique is applied to a time signal, the quantization error in the final decoded signal is known to be adapted in its Power Spectral Density (PSD) to the PSD of the input signal. Dual to this, if predictive coding is applied to spectral data over frequency, the temporal shape of the quantization error signal will be adapted to the temporal shape of the input signal at the output of the decoder.
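The prediction-over-frequency duality can be made concrete with a small numpy sketch (the test signal, the filter order, and the least-squares fit are illustrative assumptions; a real TNS implementation follows the AAC syntax, restricts the filter to a target frequency range, and quantizes the filter coefficients):

```python
import numpy as np

# Spectrum of a transient-like signal: an impulse in time has a sinusoidal
# envelope along the frequency axis, which a low-order predictor over
# frequency captures almost perfectly.
X = np.cos(0.3 * np.arange(256) + 0.7)
order = 2

# Fit the prediction filter by least squares (a real coder would use the
# autocorrelation method / Levinson-Durbin recursion here).
A = np.column_stack([X[order - 1:-1], X[:-order]])  # rows [X[k-1], X[k-2]]
b = X[order:]
a, *_ = np.linalg.lstsq(A, b, rcond=None)

# Encoder: filter the spectral coefficients along frequency, keep the residual.
res = X.copy()
for k in range(order, len(X)):
    res[k] = X[k] - a @ X[k - order:k][::-1]

# Decoder: inverse (all-pole) filter along frequency.  Quantization noise
# added to res would then be shaped to follow the signal's time envelope.
Xhat = res.copy()
for k in range(order, len(X)):
    Xhat[k] = res[k] + a @ Xhat[k - order:k][::-1]

gain = np.sum(X**2) / np.sum(res**2)                # prediction gain
```

Without quantization the decoder recovers the spectrum exactly, and the residual energy is far below the signal energy, which is exactly the coding gain that TNS exploits.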
This effectively puts the quantization noise under the actual signal and avoids problems of temporal masking, either in transient or pitched signals. This type of predictive coding of spectral data is referred to as the Temporal Noise Shaping (TNS) method. Since TNS processing can be applied either to the entire spectrum or to only a part of it, time-domain noise control can be applied in any necessary frequency-dependent fashion. In particular, it is possible to use several predictive filters operating on distinct frequency (coefficient) regions.

D.5 Joint Stereo Coding

For further enhancement of its capabilities, MPEG-2 AAC includes two well-known techniques for joint stereo coding of signals: Mid/Side (M/S) stereo coding (also known as "sum/difference coding") and intensity stereo coding. Both joint coding strategies can be combined by selectively applying them to different frequency regions. By using M/S stereo coding, intensity stereo coding, and L/R (independent) coding as appropriate, it is possible to avoid expensive overcoding due to Binaural Masking Level Depression, which accounts for noise imaging. Moreover, this approach very frequently achieves a significant saving in data rate. The concept of joint stereo coding in MPEG-2 AAC is discussed in detail in [JHDG96].

D.5.1 M/S Stereo Coding

M/S stereo coding is used to control the imaging of coding noise, as compared to the imaging of the original signal. In particular, this technique is capable of addressing the issue of "Binaural Masking Level Depression" [Bla83], where a signal at lower frequencies (below 2 kHz) can show up to a 20 dB difference in masking thresholds depending on the phase of the signal and noise present (or the lack of correlation in the case of noise). A second important issue is that of high-frequency time-domain imaging on transient or pitched signals.
In either case, the properly coded stereo signal can require more bits than two transparently coded monophonic signals. In MPEG-2 AAC, M/S stereo coding is applied within each channel pair of the multichannel signal, i.e. between a pair of channels that are arranged symmetrically on the left/right listener axis. In this way, imaging problems due to spatial unmasking are avoided to a large degree. M/S stereo coding can be used in a flexible way by selectively switching in time (on a block-by-block basis), as well as in frequency (on a scale-factor-band by scale-factor-band basis) [JF92]. The switching state (M/S stereo coding "on" or "off") is transmitted to the decoder as an array of signaling bits ("ms_used"). This can accommodate short time delays between the L and R channels, and still accomplish both image control and signal-processing gain. While the amount of time delay allowed is limited, it is greater than the interaural time delay and allows for control of most critical image issues [JF92].

D.5.2 Intensity Stereo Coding

The second important joint stereo coding strategy for exploiting inter-channel irrelevancy is the well-known generic concept of intensity stereo coding [WV91]. This idea has been widely utilized in the past for stereophonic and multichannel coding under various names, such as dynamic crosstalk and channel coupling. Intensity stereo coding exploits the fact that the perception of high-frequency sound components mainly relies on the analysis of their energy-time envelopes [Bla83]. Thus, it is possible for certain types of signals to transmit a single set of spectral values shared among several audio channels with virtually no loss in sound quality.
The original energy-time envelopes of the coded channels are preserved approximately by means of a scaling operation, such that each channel signal is reconstructed with its original level after decoding. MPEG-2 AAC provides two mechanisms for applying generic intensity stereo coding.

• The first one is based on the "channel pair" concept as used for M/S stereo coding and implements an easy-to-use, straightforward coding concept that covers most normal needs without introducing noticeable signaling overhead into the bit stream. For simplicity, this mechanism is referred to as the AAC intensity stereo coding tool. While the intensity stereo coding tool only implements joint coding within each channel pair, it may be used for coding of both two-channel and multichannel signals.

• The second one, a more sophisticated mechanism, is not restricted by the channel pair concept and allows better control of coding parameters. This mechanism is called the AAC coupling-channel element. It provides two functionalities. First, coupling channels may be used to implement generalized intensity stereo coding, where channel spectra can be shared across channel boundaries, including sharing among different channel pairs. Second, the coupling-channel element can perform a downmix of additional sound objects into the stereo audio, so that, e.g., a commentary channel can be added to an existing multichannel program ("voice-over"). Depending on the profile, certain restrictions apply regarding consistency between coupling channels and target channels in terms of the window sequence and window shape parameters.
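Both joint-stereo mechanisms reduce to a few lines of per-band arithmetic. The sketch below is illustrative only: the band boundaries, the test signals, and the energy-based switching heuristic (a stand-in for the encoder's perceptual decision) are assumptions, not values from the standard. It shows per-band M/S coding with "ms_used" signaling, followed by intensity coding that transmits one shared spectrum per band plus per-channel scaling to preserve the energy envelope:

```python
import numpy as np

rng = np.random.default_rng(1)
bands = [(0, 32), (32, 64), (64, 128)]          # hypothetical scale factor bands
L = rng.standard_normal(128)
R = 0.9 * L + 0.1 * rng.standard_normal(128)    # strongly correlated channel pair

# --- M/S stereo coding: per-band switch, signaled by ms_used bits ---
ms_used, ca, cb = [], L.copy(), R.copy()
for lo, hi in bands:
    M = 0.5 * (L[lo:hi] + R[lo:hi])
    S = 0.5 * (L[lo:hi] - R[lo:hi])
    # heuristic: use M/S when the side signal is much weaker than either channel
    flag = bool(np.sum(S**2) < 0.25 * min(np.sum(L[lo:hi]**2), np.sum(R[lo:hi]**2)))
    ms_used.append(flag)
    if flag:
        ca[lo:hi], cb[lo:hi] = M, S
Lr, Rr = ca.copy(), cb.copy()
for (lo, hi), flag in zip(bands, ms_used):      # decoder inverts per band
    if flag:
        Lr[lo:hi] = ca[lo:hi] + cb[lo:hi]
        Rr[lo:hi] = ca[lo:hi] - cb[lo:hi]

# --- intensity stereo coding: one shared spectrum per band + scaling ---
shared, sl, sr = np.zeros(128), [], []
for lo, hi in bands:
    s = L[lo:hi] + R[lo:hi]                     # combined spectrum for the band
    e = np.sum(s**2)
    shared[lo:hi] = s
    sl.append(np.sqrt(np.sum(L[lo:hi]**2) / e))  # scales preserve each channel's
    sr.append(np.sqrt(np.sum(R[lo:hi]**2) / e))  # per-band energy envelope
Li, Ri = np.zeros(128), np.zeros(128)
for (lo, hi), a, b in zip(bands, sl, sr):
    Li[lo:hi] = a * shared[lo:hi]
    Ri[lo:hi] = b * shared[lo:hi]
```

M/S is lossless (L and R are recovered exactly), while intensity coding only preserves each band's energy, which is why it is reserved for high-frequency regions where the ear mainly tracks energy-time envelopes.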
Thus, MPEG-2 AAC provides appropriate coding tools for many types of stereophonic audio, from traditional two-channel recordings to 5- to 7-channel surround sound material.

D.6 Prediction

Prediction is used to improve redundancy reduction. It is especially effective in handling stationary parts of a signal, which are the most demanding parts in terms of the required data rate. Because a short window in the filter bank is used to handle signal changes (i.e. non-stationary signal characteristics), prediction is only used for long windows. For each channel, prediction is applied to the spectral components resulting from the spectral decomposition of the filter bank. For each spectral component up to 16 kHz, there is one corresponding predictor, resulting in a bank of predictors, where each predictor exploits the auto-correlation between spectral component values of consecutive frames. The overall coding structure, using a filter bank with high spectral resolution, implies the use of backward-adaptive predictors. In this structure, predictor coefficients are calculated from preceding quantized spectral components in the encoder as well as in the decoder, to achieve high coding efficiency; no additional side information is needed for the transmission of predictor coefficients, as would be required for forward-adaptive predictors. A second-order backward-adaptive lattice-structure predictor is used for each spectral component, so that each predictor works on the spectral component values of the two preceding frames. Predictor parameters are adapted to the current signal statistics on a frame-by-frame basis by using an LMS adaptation algorithm. If prediction is activated, the quantizer is fed with the prediction error instead of the original spectral component, resulting in a coding gain.
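The backward-adaptive idea can be illustrated with a small sketch. Note the simplifications: AAC specifies a second-order lattice structure, whereas this sketch uses a plain transversal predictor with a normalized-LMS update, and the signal, step size, and noise level are illustrative assumptions:

```python
import numpy as np

# One spectral line observed over successive frames; a stationary (slowly
# varying) component is well predicted from its two previous frame values.
rng = np.random.default_rng(3)
frames = 500
x = np.cos(0.05 * np.arange(frames)) + 0.01 * rng.standard_normal(frames)

mu = 0.05                 # adaptation step size (illustrative value)
w = np.zeros(2)           # second-order predictor coefficients
err = np.zeros(frames)
for t in range(2, frames):
    past = np.array([x[t - 1], x[t - 2]])
    e = x[t] - w @ past   # prediction error: this is what would be quantized
    err[t] = e
    # backward adaptation: only already-available past values are used, so
    # the decoder can run the identical update and no coefficients need to
    # be transmitted as side information
    w += mu * e * past / (past @ past + 1e-9)

# prediction gain after the adaptation has settled
gain = np.sum(x[100:] ** 2) / np.sum(err[100:] ** 2)
```

A real decoder would adapt on *quantized* values (as the encoder does) so that both predictor banks stay in lockstep.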
In order to guarantee that prediction is only used if it results in a coding gain, an appropriate predictor control is required, and a small amount of predictor control information has to be transmitted to the decoder. For predictor control, predictors are grouped into scale factor bands. The predictor control information for each frame is determined in two steps. First, it is determined for each scale factor band whether or not prediction gives a coding gain, and all predictors belonging to a scale factor band are switched on or off accordingly. Then, it is determined whether the overall coding gain by prediction in the current frame compensates at least for the additional bits needed for the predictor side information. Only in this case is prediction activated and the side information transmitted; otherwise, prediction is not used in the current frame and only one signaling bit is transmitted.

D.7 Quantization and Coding

D.7.1 Overview

While all preceding steps perform some kind of preprocessing of the audio data, the real data rate reduction is done during the quantization process. The primary goal of this module is to quantize the spectral data in such a way that the quantization noise fulfills the demands of the psychoacoustic model. At the same time, the number of bits needed to code this quantized spectrum must stay below a certain limit, which is normally the average number of bits available for a block of audio data; this value depends on the sampling frequency and the desired data rate. In the AAC coder, a bit reservoir permits a variable bit distribution between consecutive audio blocks on a short-time basis. There are two constraints in this process: to fulfill the demands of the psychoacoustic model and to keep the number of needed bits below a certain number. Thus, the following two problems of the quantization process have to be addressed.
What should be done when the demands cannot be fulfilled with the available number of bits? And what should be done if not all bits are needed to meet the demands? The strategy used to optimize the quantization process is not standardized; the only requirement is that the produced bit stream is compliant with the syntax described in [ISOa]. One possible strategy is to use two nested iteration loops, as described in this section. One important issue is the tuning between the psychoacoustic model and the quantization process, which can be viewed as one of the secrets of audio coding, since it requires a lot of experience and know-how. The main features of the AAC quantization and coding process are:

• Non-uniform quantization.

• Huffman coding of spectral values with different tables.

• Noise shaping by amplification of groups of spectral values (the so-called scale factor bands). The information about the amplification is stored in scale factors.

• Huffman coding of differential scale factors.

D.7.2 Non-Uniform Quantization

The main advantage of a non-uniform quantizer is the built-in noise shaping depending on the amplitude of the coefficients: the increase of the signal-to-noise ratio with increasing signal energy is much lower than that of a linear quantizer. The range of quantized values is limited to ±8191. The quantizer_stepsize parameter represents the global quantizer step size; thus, the quantizer can be changed in steps of 1.5 dB.

D.7.3 Coding of Quantized Spectral Values

Quantized coefficients created by the quantizer are coded using Huffman codes. A highly flexible coding method allows the use of several Huffman tables for one spectrum. Two- and four-dimensional tables, with and without signs, are available. The lossless coding process is described in detail in Section D.8.
To calculate the number of bits needed to code a spectrum of quantized data, the coding process has to be performed, and the numbers of bits needed for the spectral data and the side information have to be accumulated.

D.7.4 Noise Shaping

The use of a non-uniform quantizer alone is not sufficient to fulfill the psychoacoustic demands; an additional method to shape the quantization noise is required. AAC uses the individual amplification of groups of spectral coefficients, the so-called scale factor bands. To fulfill the requirements as efficiently as possible, it is desirable to shape the quantization noise in units similar to the critical bands of the human auditory system. Since the AAC system offers a relatively high frequency resolution for long blocks of 23.43 Hz/line at the 48 kHz sampling frequency, it is possible to build groups of spectral values which reflect the bandwidth of critical bands very closely. Figure 4.2 shows the width of the scale factor bands for long blocks at the 44.1 kHz or 48 kHz sampling frequency. Note that the width of the scale factor bands is limited to 32 coefficients except for the last scale factor band, and that the total number of scale factor bands is 49 for long blocks. The AAC system offers the possibility to individually amplify scale factor bands in steps of 1.5 dB. Noise shaping is achieved because amplified coefficients have larger amplitudes and will therefore generally exhibit a higher signal-to-noise ratio after quantization. On the other hand, larger amplitudes normally need more bits to code, i.e. the bit distribution between scale factor bands is changed implicitly. The amplification has to be reversed in the decoder. For this reason, the amplification information, which is stored in the scale factors (in units of 1.5 dB steps), has to be transmitted to the decoder.
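AAC's non-uniform quantizer is commonly described as a 3/4-power law; the sketch below combines it with the 1.5 dB (factor 2^(1/4)) step-size increments described above. The exponent 3/4 and the rounding constant 0.4054 are values commonly quoted for AAC encoders and are assumptions here, not taken from the text:

```python
import numpy as np

def quantize(x, s=0):
    """3/4-power-law quantizer; each unit of s changes the effective step
    size by a factor of 2**(1/4), i.e. approximately 1.5 dB."""
    q = np.sign(x) * np.floor((np.abs(x) * 2.0 ** (-s / 4.0)) ** 0.75 + 0.4054)
    return np.clip(q, -8191, 8191)          # quantized values limited to +-8191

def dequantize(q, s=0):
    return np.sign(q) * np.abs(q) ** (4.0 / 3.0) * 2.0 ** (s / 4.0)

x = np.array([100.0, -100.0, 3.0])
q = quantize(x)
xr = dequantize(q)

# amplifying a scale factor band is equivalent to using a finer effective
# step for that band (here 8 steps of ~1.5 dB = ~12 dB), lowering its noise
q_fine = quantize(x, s=-8)
err_coarse = np.abs(dequantize(q) - x)
err_fine = np.abs(dequantize(q_fine, s=-8) - x)
```

The power law makes the effective step grow with amplitude, which is the built-in noise shaping mentioned in Section D.7.2; the per-band step offset is the mechanism behind the scale-factor noise shaping of Section D.7.4.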
D.7.5 Iteration Process

The decision on which scale factor bands are to be amplified is, within certain limits, up to the encoder. Thresholds calculated by the psychoacoustic model are the most important criteria, but not the only ones, since the number of bits that can be used is limited. As already mentioned above, the iteration process described here is just one method to perform noise shaping; this method is, however, known to produce very good audio quality. Two nested loops, the so-called inner and outer iteration loops, are used to determine the optimal quantization. The description given here has been simplified to facilitate the understanding of the process. The task of the inner iteration loop is to change the quantizer step size until the given spectral data can be coded with the number of available bits. For this purpose, an initial quantizer step size is chosen, the spectral data are quantized, and the number of bits necessary to code the quantized data is counted. If this number is higher than the number of available bits, the quantizer step size is increased and the whole process is repeated. The task of the outer iteration loop is to amplify the spectral coefficients in all scale factor bands in such a way that the demands of the psychoacoustic model are fulfilled as much as possible.

1. At the beginning, no scale factor band is amplified.

2. The inner loop is called.

3. For each scale factor band, the distortion caused by the quantization is calculated.

4. The real distortion is compared with the allowed distortion calculated by the psychoacoustic model.

5. If this result is the best result so far, it is stored. This is needed, since the iteration process does not necessarily converge.

6.
Scale factor bands with a real distortion higher than the allowed distortion are amplified. At this point, different methods can be applied to determine the scale factor bands to be amplified.

7. If all scale factor bands have been amplified, the iteration process stops. The best result is restored.

8. If there is no scale factor band with a real distortion above the allowed distortion, the iteration process stops as well.

9. Otherwise, the process is repeated with new amplifications.

Some other conditions which cause termination of the outer iteration loop are not mentioned above. Since amplified parts of the spectrum need more bits for coding while the number of available bits is constant, the quantization step size has to be changed in the inner iteration loop to decrease the number of used bits. This mechanism moves bits from spectral regions where they are not needed to spectral regions where they are needed. It is also the reason why the result after an amplification in the outer loop may be worse than before; the best result found during the iterations is therefore restored after the termination of the iteration process. Quantization and coding of short blocks are similar to those for long blocks; however, grouping and interleaving have to be taken into account. Both mechanisms are described in detail in Section D.8.

D.8 Noiseless Coding

The input to the noiseless coding module is the set of 1024 quantized spectral coefficients. As a first step, a method of noiseless dynamic range compression may be applied to the spectrum. Up to four coefficients can be coded separately as magnitudes in excess of one, with a value of ±1 left in the quantized coefficient array to carry the sign. The clipped coefficients are coded as integer magnitudes and an offset from the base of the coefficient array to mark their location.
Since the side information for carrying clipped coefficients costs some bits, this noiseless compression is applied only if it results in a net saving of bits.

D.8.1 Sectioning

The noiseless coding module segments the set of 1024 quantized spectral coefficients into sections, so that a single Huffman codebook is used to code each section. For reasons of coding efficiency, section boundaries can only be at scale factor band boundaries, so that for each section of the spectrum one must transmit the length of the section, in scale factor bands, and the Huffman codebook number used for the section. Sectioning is dynamic. It typically varies from block to block, so that the number of bits needed to represent the full set of quantized spectral coefficients is minimized. This is done by using a greedy merge algorithm, starting from the maximum possible number of sections, each of which uses the Huffman codebook with the smallest possible index. Sections are merged if the resulting merged section yields a lower total bit count, with merges that yield the greatest bit count reduction done first. If the sections to be merged do not use the same Huffman codebook, then the codebook with the higher index must be used. Sections often contain only coefficients whose value is zero. For example, if the audio input is band-limited to 20 kHz or lower, then the highest coefficients are zero. Such sections are coded with Huffman codebook zero, an escape mechanism indicating that all coefficients are zero; no Huffman codewords need to be sent for such a section.

D.8.2 Grouping and Interleaving

If the window sequence consists of eight short windows, then the set of 1024 coefficients is actually a matrix of 8 by 128 frequency coefficients representing the time-frequency evolution of the signal over the duration of the eight short windows.
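The greedy section merge of Section D.8.1 can be sketched in a few lines. The per-codebook bit costs and the per-section side-information overhead below are illustrative assumptions, not the values from the standard; only the merge logic (higher codebook wins, best saving merged first, stop when no merge saves bits) follows the description above:

```python
# each section is (codebook index, number of coefficients); initially one
# section per scale factor band, using the smallest usable codebook
sections = [(1, 32), (1, 32), (2, 32), (0, 32), (0, 32), (0, 32)]
COST = [0.0, 1.5, 2.0, 2.5]     # illustrative average bits/coefficient
OVERHEAD = 9                    # illustrative side-info bits per section

def bits(secs):
    """Total bits: per-section overhead plus coefficient coding cost."""
    return sum(OVERHEAD + COST[cb] * n for cb, n in secs)

def merge_pass(secs):
    """Repeatedly merge the adjacent pair that saves the most bits."""
    while True:
        best_gain, best_i = 0, None
        for i in range(len(secs) - 1):
            (c1, n1), (c2, n2) = secs[i], secs[i + 1]
            gain = bits(secs[i:i + 2]) - bits([(max(c1, c2), n1 + n2)])
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:              # no merge saves bits: done
            return secs
        c1, n1 = secs[best_i]
        c2, n2 = secs[best_i + 1]
        secs = secs[:best_i] + [(max(c1, c2), n1 + n2)] + secs[best_i + 2:]

merged = merge_pass(list(sections))
```

With these toy costs, the two codebook-1 bands merge and the trailing all-zero (codebook-0) bands collapse into one free section, while merging a zero section into a non-zero one never pays off.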
Although the sectioning mechanism is flexible enough to efficiently represent the eight zero sections, grouping and interleaving provide greater coding efficiency. As explained earlier, coefficients associated with contiguous short windows can be grouped such that they share scale factors among all scale factor bands within the group. In addition, coefficients within a group are interleaved by interchanging the order of scale factor bands and windows. To be more specific, it is assumed that, before interleaving, the set of 1024 coefficients c is indexed as c[g][w][b][k], where g is the index on groups, w is the index on windows within a group, b is the index on scale factor bands within a window, k is the index on coefficients within a scale factor band, and the right-most index varies most rapidly. After interleaving, the coefficients are indexed as c[g][b][w][k]. This has the advantage of combining all zero sections due to band-limiting within each group.

D.8.3 Scale Factors

The coded spectrum uses one quantizer per scale factor band. The step sizes of these quantizers are specified as a set of scale factors and a global gain that normalizes the scale factors. In order to increase compression, scale factors associated with scale factor bands that have only zero-valued coefficients are not transmitted. Both the global gain and the scale factors are quantized in 1.5 dB steps. The global gain is coded as an 8-bit unsigned integer, and the scale factors are differentially encoded relative to the previous scale factor (or to the global gain for the first scale factor) and then Huffman coded. The dynamic range of the global gain is sufficient to represent full-scale values from a 24-bit PCM audio source.
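The differential coding of scale factors described above can be sketched directly (the Huffman stage that would code the resulting deltas, and the example values, are omitted/assumed here):

```python
def encode_scalefactors(global_gain, sfs):
    """Code each scale factor as a delta from its predecessor; the first
    one is coded relative to the global gain."""
    deltas, prev = [], global_gain
    for sf in sfs:
        deltas.append(sf - prev)
        prev = sf
    return deltas                     # these deltas would be Huffman coded

def decode_scalefactors(global_gain, deltas):
    """Accumulate the deltas back into absolute scale factors."""
    sfs, prev = [], global_gain
    for d in deltas:
        prev += d
        sfs.append(prev)
    return sfs
```

Since neighboring scale factors tend to be similar, the deltas cluster around zero, which is exactly the skewed distribution a Huffman code exploits.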
D.8.4 Huffman Coding

Huffman coding is used to represent n-tuples of quantized coefficients, with the Huffman code drawn from one of 11 codebooks. The spectral coefficients within the n-tuples are ordered (low to high), and the n-tuple size is two or four coefficients. The maximum absolute value of the quantized coefficients that can be represented by each Huffman codebook, and the number of coefficients in each n-tuple for each codebook, are shown in Table D.1. There are two codebooks for each maximum absolute value, each representing a distinct probability distribution function; the best fit is always chosen. In order to save on codebook storage (an important consideration in a mass-produced decoder), most codebooks represent unsigned values. For these codebooks, the magnitude of the coefficients is Huffman coded and the sign bit of each non-zero coefficient is appended to the codeword.

Codebook index   n-Tuple size   Maximum absolute value   Signed values
      0               -                    0                  -
      1               4                    1                 yes
      2               4                    1                 yes
      3               4                    2                 no
      4               4                    2                 no
      5               2                    4                 yes
      6               2                    4                 yes
      7               2                    7                 no
      8               2                    7                 no
      9               2                   12                 no
     10               2                   12                 no
     11               2                16 (ESC)              no

Table D.1: Huffman codebooks used in AAC.

Two codebooks require special note, namely codebook 0 and codebook 11. As mentioned previously, codebook 0 indicates that all coefficients within a section are zero. Codebook 11 can represent quantized coefficients that have an absolute value greater than or equal to 16. If the magnitude of one or both coefficients is greater than or equal to 16, a special escape coding mechanism is used to represent those values: the magnitude of the coefficients is limited to no greater than 16, the corresponding 2-tuple is Huffman coded, and the sign bits, as needed, are appended to the codeword.
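The escape mechanism of codebook 11 then transmits the true magnitude separately. The bit layout sketched below (N ones, a zero separator, then an (N+4)-bit word, with value = 2^(N+4) + word) follows the escape-sequence format commonly documented for 13818-7 and is an assumption beyond the text above:

```python
def escape_encode(value):
    """Encode a magnitude >= 16 as <N ones><0><(N+4)-bit word>."""
    assert value >= 16
    n = 0
    while value >= 2 ** (n + 5):      # find N with 2**(N+4) <= value < 2**(N+5)
        n += 1
    word = value - 2 ** (n + 4)
    return "1" * n + "0" + format(word, f"0{n + 4}b")

def escape_decode(bits):
    """Invert escape_encode: count the ones, skip the zero, read the word."""
    n = 0
    while bits[n] == "1":
        n += 1
    word = int(bits[n + 1:n + 1 + n + 4], 2)
    return 2 ** (n + 4) + word
```

For example, magnitude 16 needs no prefix ones and a 4-bit word of zero, while larger magnitudes grow the prefix one bit at a time, so rare large values stay representable without enlarging the Huffman codebook itself.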