SPEECH ENHANCEMENT AND INTELLIGIBILITY MODELING IN COCHLEAR IMPLANTS

by Chuping Liu

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2008
Copyright 2008 Chuping Liu

Dedication

To my beloved family

Acknowledgements

Nothing can be better than having two exceptionally intelligent advisors, Dr. Shri Narayanan and Dr. Qian-Jie Fu, guiding me through my Ph.D. program. My deepest gratitude and thanks go to them for their invaluable guidance, inspiration, support, encouragement and patience throughout all these years, in all aspects of my research and career. I am very thankful to Drs. Dani Byrd, Jerry Mendel, B. Keith Jenkins and Bob Shannon for their dedicated service on my thesis committee and for all the insights, thoughts, discussions and help they extended to me. I am also very grateful to my mentors, Drs. Ray Goldsworthy and Louis Braida at Sensimetrics and MIT, for offering exciting ingredients of research for my thesis. Life would not have been so colorful without the companionship of so many wonderful friends and colleagues at USC and HEI. I thank them and cherish all of the sharing we had, which greatly enriched my perspectives on people and the world. I thank my husband for his companionship on all the roads that we traveled together, hand in hand. I would also like to thank my son for helping me by doing his exceptionally wonderful job of being healthy, happy and growing up. Finally, with this thesis, I am very glad to be able to tell my beloved grandmother and parents, "eventually, I am here." They were confused for so many years about when I would graduate. Now they have switched to asking me, "then… what are you going to do in your career?" That is an exceptionally wonderful question and will definitely continue to move me forward.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation and objective
  1.2 Cochlear implant systems
  1.3 Proposal for speech enhancement framework in cochlear implants
  1.4 Proposal for speech intelligibility modeling in cochlear implants
  1.5 Thesis outline
Chapter 2: Speech enhancement: bandwidth extension for telephone speech
  2.1 Introduction
  2.2 Methods
    2.2.1 Signal processing
    2.2.2 Test materials
    2.2.3 Test procedures
  2.3 Results
  2.4 Discussion
  2.5 Conclusions
Chapter 3: Speech enhancement: spectral normalization for different talkers
  3.1 Introduction
  3.2 Methods
    3.2.1 Implementation of spectral normalization
    3.2.2 Objective verification on cochlear implant simulations
  3.3 Experiment 1: Effect with two different talkers
    3.3.1 Methods
    3.3.2 Results and discussion
  3.4 Experiment 2: Effect with simulated talkers
    3.4.1 Methods
    3.4.2 Results and discussion
  3.5 General discussion
  3.6 Conclusions
Chapter 4: Speech intelligibility modeling: challenges and proposal
  4.1 Introduction
  4.2 Factors relating to inter-subject performance difference
  4.3 The proposed general speech intelligibility model
  4.4 Investigation outline
Chapter 5: Speech intelligibility modeling: parametric effects
  5.1 Introduction
  5.2 Estimating the vowel acoustic space
  5.3 Experimental design
    5.3.1 Test materials
    5.3.2 Signal processing for five parameters
    5.3.3 Subjects
    5.3.4 Procedures
  5.4 Results
  5.5 Discussion
  5.6 Conclusions
Chapter 6: Speech intelligibility modeling: effect of electrode confusions
  6.1 Introduction
  6.2 Method
    6.2.1 Quantification of electrode confusion patterns
    6.2.2 Prediction of speech intelligibility
  6.3 Experimental design
    6.3.1 Test materials
    6.3.2 Signal processing
    6.3.3 Subjects
    6.3.4 Procedures
  6.4 Results
  6.5 Discussion and conclusions
Chapter 7: Speech intelligibility modeling: customization with psychoacoustic measurements
  7.1 Introduction
  7.2 Method
  7.3 Experimental design
    7.3.1 Subjects
    7.3.2 Psychoacoustic measurement procedures
    7.3.3 Speech recognition materials and procedures
  7.4 Psychoacoustic measurement results
    7.4.1 CI MAP and electrode T/C levels
    7.4.2 Acoustic to electric mapping
    7.4.3 Intensity resolution
    7.4.4 Electrode confusion patterns
  7.5 Prediction results
    7.5.1 Averaged performance across talkers
    7.5.2 Individual subject's talker preference
    7.5.3 Acoustic distances of vowel pairs
  7.6 Discussion and conclusions
Chapter 8: Conclusions and future work
  8.1 Conclusions
  8.2 Future work
Appendix A: Estimation of the conversion function based on GMM modeling
Appendix B: The conversion between Mel-scaled LSF and linear-scaled LPC
Appendix C: Summary of subject demographics and participation in all the experiments
Appendix D: CI device MAP settings and T/C levels for individual CI subjects
Appendix E: Electrode comparison and normalized electrode confusion patterns for individual CI subjects
Appendix F: Summary of speech recognition scores for individual CI subjects
Bibliography

List of Tables

Table 1-1: Off-the-shelf criteria and corresponding approaches for the proposed speech enhancement framework in cochlear implants.
Table 3-1: Performance difference between unprocessed source talkers (i.e., M1 vs. F1), and between spectrally normalized and unprocessed talkers. Note that because the performance with talkers M1 and F1 differed among individual subjects, comparisons are made in terms of the "Better" and "Worse" talker. Bold numbers indicate significant differences in performance across different sentence lists (p < 0.05).
Table 3-2: Pitch and formant analysis for the pitch-shift and spectral transformations in Experiment 2. The target F0 for the pitch-shift transformations was scaled according to the pitch-stretching ratio used for processing; the target F0 for the spectral transformation refers to the measured F0 values after pitch-stretching. The F0s were measured with the software Wavesurfer 1.8.5. Formant frequencies were estimated for the vowel /I/ from the sentence "Glue the sheet to the dark blue background." Note that reference talker T1.0 (in bold) was F1 from Experiment 1.
Table 3-3: r2 and significance values for linear regressions performed between the unprocessed talkers from Experiment 1 (M1 and F1) and the pitch-shift transformations from Experiment 2 (T0.6, T0.8, T1.2, T1.4, T1.6).
Table 5-1: Summary of the fixed and variable speech processor parameters for all five experiments. Shaded areas represent the variable parameters in each experiment.
Table 5-2: Corner frequencies for the analysis/carrier bands used in Exps. 2 and 3.
Table 5-3: Corner frequencies for the carrier bands used in Exp. 4. The analysis bands were fixed and distributed according to Greenwood's formula (Greenwood 1990). For each simulated insertion depth, the corner frequencies were calculated according to Greenwood's formula (Greenwood 1990).
Table 6-1: Summary of the band partition and electrode confusion levels for the different numbers of simulated channels and different powers of the base matrix. The base matrix follows Eq. (6-3), and the electrode confusion level λ is characterized according to Eq. (6-2).
Table 6-2: Summary of the prediction ability of the three speech intelligibility indexes under various combinations of conditions. The unlisted factors are factored out by averaging. "Low" and "high" indicate the range of the perception scores or the intelligibility index. The R2 and p values are the linear regression results for the perception scores. The bold digits highlight significant prediction effects.
Table 7-1: Prediction ability for averaged performance across talkers with the customized speech intelligibility model under different psychoacoustic response combinations.
Table 7-2: Prediction ability for talker preference with the customized speech intelligibility model under different psychoacoustic response combinations.
Table 7-3: Descriptive statistics of the acoustic space for the wrong and right vowel pairs, with and without customization of the intelligibility model. The numbers in parentheses represent the standard deviation of the acoustic space.

List of Figures

Figure 1-1: Improvement of CI speech perception over the years.
Figure 1-2: Schematic of cochlear implants.
Figure 1-3: High-level architecture for CI speech perception.
Figure 1-4: Proposed speech enhancement framework.
Figure 2-1: Bandwidth extension framework for telephone speech.
Figure 2-2: Comparison of the spectrogram of the restored wideband speech (top panel) and the original wideband speech (bottom panel).
Figure 2-3: Sentence recognition performance for individual CI subjects with (shaded bar) and without (black bar) the restored highband information. The error bars indicate one standard deviation.
Figure 2-4: Sentence recognition performance with telephone speech for individual CI subjects listening to male (black bar) and female (shaded bar) talkers. The error bars indicate one standard deviation. The star indicates a significant difference between male and female performance at a level of 0.05.
Figure 3-1: Implementation framework of the GMM-based spectral normalization algorithm.
Figure 3-2: Normalized talker distortion as a function of the number of channels. Solid line: without spectral normalization. Dashed line: with spectral normalization. Note that the talker distortion between talkers F1 and M1 (unprocessed speech) was used as the reference.
Figure 3-3: Individual and mean sentence recognition performance for talkers M1 and F1. Sentence recognition performance for individual CI subjects, ordered according to the degree of talker sensitivity; mean performance across subjects is shown at the far right of the figure. For subjects S1 – S3, performance with F1 was better than that with M1; for subjects S4 – S9, performance was better with M1 than with F1. The error bars show one standard deviation, and the asterisks show significantly different performance between the two talkers (t-test: p < 0.05).
Figure 3-4: Waveforms for the sentence "Glue the sheet to the dark blue background." Top panel: talker template T0.6 (upward pitch shift). Middle panel: talker template T1.0 (unprocessed speech from talker F1). Bottom panel: talker template T1.6 (downward pitch shift).
Figure 3-5: Spectral envelopes for different talker templates. Top panel: spectral envelopes for talker templates T0.6, T1.0 and T1.6. Bottom panel: spectral envelopes for talker template T1.0 and the converted templates T0.6 → T1.0 and T1.6 → T1.0.
Figure 3-6: NH subjects' overall speech quality ratings for the pitch-shift transformations, with (open symbols) and without (filled symbols) spectral normalization. The error bars show one standard deviation, and the asterisks indicate significantly different ratings with spectral normalization (p < 0.05). Note that source talker T1.0 (unprocessed speech from talker F1) was used to anchor the subjective quality ratings.
Figure 3-7: Sentence recognition performance for NH and CI subjects, with (open symbols) and without (filled symbols) spectral transformation, as a function of pitch-shift transformation. The error bars show one standard deviation, and the asterisks indicate significantly different performance after spectral transformation (p < 0.05).
Figure 4-1: Speech recognition scores with two different talkers at two different speech bandwidths. The error bars show one standard deviation.
Figure 5-1: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis), as a function of the number of spectral channels.
Figure 5-2: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the slope of the carrier band filter.
Figure 5-3: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the distribution of carrier band filters.
Figure 5-4: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the simulated insertion depth of the carrier bands.
Figure 5-5: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the amplitude mapping function.
Figure 6-1: Speech recognition performance under different numbers of channels and different electrode smearing levels. The error bars indicate one standard deviation, and the stars indicate significantly different recognition performance between the 6- and 8-channel conditions under the specific electrode smearing level (p < 0.05).
Figure 6-2: Prediction ability of the acoustic space for percent correct under 8 combinations of smearing and channel conditions (4 smearing levels * 2 channel conditions = 8). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value are displayed in the upper left corner.
Figure 6-3: Prediction ability of the acoustic space for percent correct under 32 conditions (4 smearing levels * 2 channel conditions * 4 talkers). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value are displayed in the upper left corner.
Figure 6-4: Prediction ability of the acoustic space for percent correct under 4 conditions (4 talkers). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value are displayed in the upper left corner.
Figure 7-1: Framework for the customized speech intelligibility model.
Figure 7-2: Relationship between acoustic space and averaged performance across talkers, with and without customization of the speech intelligibility model.
Figure 7-3: Relationship between acoustic space and individual CI users' talker preference, with and without customization of the speech intelligibility model.
Figure 7-4: Relationship between acoustic space and percent confusions of vowel pairs in individual CI users, with and without customization of the speech intelligibility model.
Figure 8-1: Summary of CI speech perception patterns and psychoacoustic measurements.

Abstract

Along with the improvement of cochlear implant (CI) performance over the years, CI speech recognition has increasingly shown unique patterns, and large inter-subject performance differences, that distinguish it from normal hearing (NH) listening. The previous literature has not paid enough attention to these unique patterns, despite abundant evidence for them. To further improve speech perception with the next generation of CI devices, it is critical to integrate such patterns into the CI framework. This thesis addressed speech enhancement and intelligibility modeling with systematic approaches. Speech enhancement utilized the unique speech recognition patterns as feedback to modify the auditory signals input to CI devices. Speech intelligibility modeling unified CI device settings and psychoacoustic responses from individual subjects to predict inter-subject performance differences.
Although speech recognition by NH listeners is highly robust even when acoustic information is lost or variable, such listening conditions are very challenging for CI users. A speech enhancement framework was proposed and realized to improve CI speech recognition in the context of telephone speech and speech from different talkers. The proposed speech enhancement framework significantly improved CI speech recognition; individual CI users showed substantially different recognition patterns in terms of the effects of speech enhancement, speech bandwidth and different talkers.

Speech intelligibility modeling was further studied to explore the observed inter-subject performance differences. An acoustic-distance-based intelligibility model was proposed to integrate various CI device settings (e.g., frequency partition, electric stimulation rate, and acoustic-to-electric mapping) and psychoacoustic responses (e.g., electrode confusion patterns, intensity resolution, and dynamic range), aiming to capture the complex and highly non-linear distortion of acoustic features through CI devices. This work involved three phases. First, a general speech intelligibility model was studied and validated for average CI performance under various parametric effects in CI processing. Second, the effect of CI electrode confusion patterns was studied by smearing the spectral contrast of the acoustic features. Third, individual CI MAPs and psychoacoustic response patterns were unified into the proposed model. Compared to the general speech intelligibility model, the customized model significantly boosted the linear prediction ability for inter-subject performance differences. The contributions of different psychoacoustic responses to inter-subject performance differences were further discussed.

Chapter 1 Introduction

1.1 Motivation and objective

The cochlear implant (CI) was the first clinically safe interface between electronics and human sensation. It has successfully provided hearing to profoundly deaf people who could not benefit from hearing aids. As an auditory prosthesis, the CI is intended to restore the full range of normal auditory function to deafened individuals. Among the many auditory functions, one of the primary measures is speech communication. The average speech communication performance of CI users has improved significantly over the years, from single-channel CIs to multi-channel CIs with various speech processing strategies, as shown in Figure 1-1. With the most recent CI devices, the best speech recognition scores in quiet have been observed to reach 95 percent correct for sentences. However, accompanying this average performance improvement, two remarkable observations are that 1) listening through a CI device is far less robust than normal hearing (NH) listening, and 2) inter-subject performance differences are substantial. A substantial body of previous literature has either directly studied or implicitly uncovered these unique listening patterns in CI. Yet their significance has not received appropriate attention or utilization from an engineering point of view. To design a new generation of CI devices, it is critical to address these problems.

Figure 1-1: Improvement of CI speech perception over the years (percent correct for sentences and words, across device/strategy generations from 3M/House, F0F2, F0F1F2, MPEAK and SPEAK/Clarion to N24/CII/Med-El and CII/3G).

Motivated by this fact, this thesis focused on addressing these two remarkable observations in CI from the points of view of speech enhancement and intelligibility modeling.
Speech enhancement utilized the unique speech recognition patterns in CI as information to optimize the speech input to the CI device. Intelligibility modeling individualized the speech intelligibility predictor with customized psychoacoustic measurements from individual CI users. On one hand, such an endeavor will directly impact CI device design and, more generally, hearing assistive device design. On the other hand, it will foster the integration of speech/audio engineering with psychoacoustic studies and will shed some light on the scientific pursuit of a better understanding of the auditory system.

1.2 Cochlear implant systems

A cochlear implant is a prosthetic device that can restore partial hearing to profoundly hearing-impaired individuals. A cochlear implant generates sound sensations by directly stimulating the residual auditory nerve with electric currents, bypassing damaged components of the external, middle, and inner ears. A cochlear implant is appropriate for profoundly hearing-impaired individuals who receive little or no benefit from conventional hearing aids or from corrective surgery. Currently, there are nearly 100,000 CI users (adults and children) worldwide (NIDCD 2006).

The schematic of a cochlear implant is illustrated in Figure 1-2. A cochlear implant system consists of an implanted part and an external part, and has three key components: the microphone, the speech processor and the electrode array. A cochlear implant system typically operates as follows. The environmental acoustic signal is picked up by the microphone and sent to the speech processor. The speech processor extracts acoustic parameters and determines electrical stimulation parameters, which are then encoded and transmitted by radio frequency induction (e.g., 2.5 MHz in the Nucleus 22-channel cochlear implant system) to the receiver/stimulator via the transmitting coil located in the headset. The receiver/stimulator subsequently decodes the information and delivers electrical stimulation pulses to the selected electrodes within the cochlea. The receiver/stimulator will not generate electrical stimuli in response to external radio frequency interference, because a specific data format is required in order to generate the electrical stimulation provided by the implant. The electrical stimulation generated by the receiver/stimulator consists of a train of charge-balanced, biphasic current pulses. Parameters associated with the stimulation (e.g., pulse width, inter-phase duration, stimulation rate, electrode configuration) are controlled by the MAP, which is stored in the speech processor and determines how acoustic information is transformed into electrical stimulation parameters. The background on cochlear implant systems most relevant to this thesis concerns the speech processing strategies and the factors affecting speech perception.

Figure 1-2: Schematic of cochlear implants.

1) Speech processing strategies

Speech processing strategies, i.e., the ways speech information is encoded, have evolved together with the cochlear implant device and have greatly improved speech recognition in cochlear implant users over the years. The ultimate goal of speech processing is to make the electric stimulation of the electrode array mimic the functioning of a normal cochlea. The electric stimulation needs to preserve the perceptually important information in order for the patient to be able to hear intelligible speech.
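To make the MAP's role concrete, the sketch below illustrates one plausible acoustic-to-electric amplitude mapping: a per-channel envelope value is converted to dB, limited to an input dynamic range, compressed, and scaled into the electric dynamic range between threshold (T) and comfort (C) levels. This is a minimal illustration under assumed parameter values (the channel count, T/C levels, input dynamic range and compression exponent are hypothetical), not the mapping used by any particular commercial device.

```python
import numpy as np

def map_envelope_to_current(env, T, C, idr_db=40.0, exponent=0.2):
    """Map per-channel acoustic envelope amplitudes to electric current levels.

    Simplified MAP-style acoustic-to-electric mapping: the envelope (linear
    amplitude) is converted to dB, clipped to an input dynamic range (idr_db),
    normalized to [0, 1], compressed with a power law, and scaled into each
    electrode's [T, C] range.
    """
    env = np.asarray(env, dtype=float)
    env_db = 20.0 * np.log10(np.maximum(env, 1e-12))
    top_db = env_db.max()                       # reference: loudest channel in this frame
    norm = (env_db - (top_db - idr_db)) / idr_db
    norm = np.clip(norm, 0.0, 1.0)              # keep within the input dynamic range
    compressed = norm ** exponent               # steeper gain for soft sounds
    return T + compressed * (C - T)             # current level per electrode

# Hypothetical 4-channel example: envelope amplitudes and per-electrode T/C levels.
envelope = np.array([0.02, 0.20, 0.05, 0.01])
T_levels = np.array([100.0, 110.0, 105.0, 95.0])
C_levels = np.array([180.0, 200.0, 190.0, 175.0])
print(map_envelope_to_current(envelope, T_levels, C_levels))
```

Varying the compression exponent in such a sketch is one simple way to mimic the nonlinear amplitude mapping manipulations examined later in Chapter 5.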
There are basically three types of speech processing strategies in cochlear implants: the first type is feature-extraction based, the second is N-of-M based, and the third is waveform based (Loizou 1998; Goldsworthy 2005). The very first feature-extraction strategy was the F0/F2 strategy. In this approach, the fundamental frequency (F0), the second formant frequency (F2) and its amplitude were extracted from the speech signal. The strategy conveys F2 frequency information by stimulating the appropriate electrode along the electrode array; the amplitude of the pulses is proportional to the amplitude of F2. The voicing information of the speech is conveyed by a stimulation rate of F0 pulses per second. This strategy enabled some patients to obtain open-set speech recognition. The F0/F2 strategy was later modified to include the first formant information and was called the F0/F1/F2 strategy. The first formant frequency (F1) and its amplitude are extracted from the speech signal. Similar to the F0/F2 strategy, an appropriate electrode for F1 is also selected and stimulated in proportion to the amplitude of F1, and voicing information is again conveyed by F0. The F0/F1/F2 strategy greatly improves vowel recognition but not consonant recognition. This is not surprising, given that consonants contain high-frequency information that is not emphasized by the F0/F1/F2 strategy. A refined version of F0/F1/F2 was called the Multipeak (MPEAK) strategy. Multiple spectral peaks in the high-frequency band, together with the F1 and F2 formants, were extracted and used to stimulate the appropriate electrodes. Studies confirmed that the MPEAK strategy significantly outperformed the F0/F1/F2 strategy on open-set speech recognition. One problem with feature-extraction strategies, however, is that the formant trackers performed poorly in adverse listening environments.

N-of-M strategies differ from feature-extraction strategies. Rather than selecting electrodes according to formant or pitch location, an N-of-M strategy analyzes the speech signal into M sub-bands, with each sub-band associated with one electrode. Among the M sub-bands, only the N sub-bands with the highest energy, where N is less than M, are used to stimulate the corresponding electrodes. Spectral peak (SPEAK) is one N-of-M strategy. The signal is sent to a bank of filters (e.g., 20 filters in the Cochlear Spectra 22) and the energy in each frequency band is measured. The N bands with the highest energy are selected and the corresponding electrodes are stimulated sequentially with pulse amplitudes proportional to the band energy; the information in each analysis filter band is delivered to a corresponding electrode. The number of selected electrodes is related to the signal level and the spectral composition of the acoustic input; typically, an average of six electrodes is selected during each scan cycle. A more recent N-of-M strategy is referred to as ACE (advanced combination encoders), which is similar to the SPEAK strategy in that M = 22 and N varies from 6 to 10, but uses higher stimulation rates.

Waveform strategies are very similar to N-of-M strategies; the only difference is that N is always equal to M. There are basically two waveform strategies. The first is the compressed-analog (CA) strategy.
The speech signal is first compressed using an automatic gain control and then filtered into a number of frequency bands. The filtered waveforms are then delivered simultaneously to the appropriate electrodes in analog form. The CA approach yields better speech recognition than the single-channel approach, as it introduces higher frequency resolution. However, the CA approach stimulates the electrodes simultaneously in analog form and causes significant channel interactions due to the summation of electrical fields: the neural response to stimuli from one electrode may be significantly distorted by stimuli from other electrodes. This interaction has been shown to distort acoustic information and deteriorate speech perception. To tackle the channel interaction issue in the CA approach, a new stimulation strategy called continuous interleaved sampling (CIS) was developed. Trains of pulses are delivered to the electrodes in a non-simultaneous, interleaved fashion, and the pulse amplitudes are modulated by the sub-band envelopes. Recognition scores with the CIS strategy were observed to be significantly higher than with the CA processor.

Although different speech processing strategies encode different amounts of specific information, a cochlear implant system can support multiple speech processing strategies. For example, the Spectra 22 speech processor, manufactured by Cochlear Corporation, can support SPEAK, MPEAK, F0/F1/F2, and other speech processing strategies. For each strategy, the electrical stimulation parameters are customized for each patient via the MAP. Although individual CI users may prefer one speech processing strategy over the others, no single speech processing strategy always outperforms the others for all CI users.
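As a concrete illustration of the channel processing described above, the sketch below implements a toy CIS-style front end — band-pass filtering into M channels, envelope extraction by rectification and low-pass filtering, and framing at a nominal stimulation rate — with an optional N-of-M (SPEAK/ACE-like) maxima selection. The filter orders, band edges, envelope cutoff and frame rate are illustrative assumptions, not the settings of any commercial processor.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def cis_envelopes(x, fs, band_edges, env_cutoff=200.0, frame_rate=900.0, n_of_m=None):
    """Toy CIS/N-of-M front end: filterbank -> rectify -> low-pass -> frame envelopes.

    band_edges: list of (low, high) Hz defining M analysis bands.
    Returns an (n_frames, M) array of per-channel envelope samples; if n_of_m is
    given, only the N largest channels per frame are kept (others set to zero),
    mimicking SPEAK/ACE-style maxima selection.
    """
    frame_len = int(fs / frame_rate)
    sos_lp = butter(2, env_cutoff, btype="low", fs=fs, output="sos")
    envs = []
    for lo, hi in band_edges:
        sos_bp = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos_bp, x)
        env = sosfilt(sos_lp, np.abs(band))                # rectify + smooth
        n_frames = len(env) // frame_len                   # one value per stimulation frame
        envs.append(env[: n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1))
    E = np.stack(envs, axis=1)                             # (n_frames, M)
    if n_of_m is not None:
        keep = np.argsort(E, axis=1)[:, -n_of_m:]          # indices of the N largest channels
        mask = np.zeros_like(E, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        E = np.where(mask, E, 0.0)
    return E

# Hypothetical 4-band example on one second of noise at 16 kHz.
fs = 16000
x = np.random.randn(fs)
bands = [(200, 500), (500, 1000), (1000, 2000), (2000, 4000)]
print(cis_envelopes(x, fs, bands, n_of_m=2).shape)
```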
Depending on the signal processing strategies, not only the amount of information delivered from electric stimulation is different, but also the parameter design can be different when implementing the same signal processing strategy. For example, the CIS strategy can be configured in a number of ways by varying design parameters such as filter bank intervals, envelope detection, stimulation rate and shape of compression function from acoustic stimuli to electric pulses (Loizou 2006). Specifically, in envelope detection, two methods can be used to extract the envelopes of the filtered waveforms. One is to use rectification followed by a low pass filtering as implemented in Clarion device. The other is to use Hilbert transformation to extract Hilbert envelope as in Med-El device. No clear advantage has been demonstrated for one method over the other. These factors in signal processing strategies affect speech perception in a complex, multi- dimensional manner such that effect of individual factor to speech perception is difficult to factor out from the others. iii) Fundamental limitations of electrical stimulation. Under the electrical stimulation scheme, it is very possible that some of the acoustics conveyed 10 through the auditory periphery is lost and degraded. Whenever a sensing system utilizes a set of discrete sensors to transmit information about the original signal, sampling is involved and therein lies the potential for a loss of information (Wilson et al. 1990). In cochlear implants, the transduction from acoustic waves to basilar membrane motion is replaced by a transduction from acoustic waves to current fields generated from the electrode array implanted in the cochlea, i.e., the spatial composition of the basilar membrane motion is sampled by a finite set of electrodes. The dynamics of the hair cell membrane and chemical synapse in the organ of Corti in response to acoustical waveforms is replaced by “electrical synapse” driven by electrical stimulation, i.e., temporal sampling is involved to represent the stochastic firing pattern from the acoustic waves. Although the electrode array may consist of as many as 22 electrodes in Nucleus 24 devices, studies have shown that the best performance of cochlear implants is never better than the performance with normal hearing listeners listening to 6 or 8 channel vocoders that simulate cochlear implant processing (Fu 1997, Shannon et al. 1995). Hence, spectral-temporal information conveyed through electric stimulation is very limited and possibly distorted. Furthermore, it is possible that the conveyed information may not be able to be utilized by the central auditory system. Besides the above factors, the levels of intelligence and communicativeness of cochlear implant users may also affect auditory performance. Auditory 11 rehabilitation not only needs commitment from cochlear implant users in terms of time and efforts, but also needs support from family, friends and workplace. All these factors in whole affect CI performance. 1.3 Proposal for speech enhancement framework in cochlear implants Similar to the front-end speech processing in automatic speech recognition (ASR), speech processing with CI has been designed to resemble the normal auditory functioning in auditory periphery. Interestingly, these two research fields have been going to different directions. Speech processing in CI has been focusing on investigating the effect of processor parameters in CI signal processor design (e.g., Shannon et al. 1995, Fu 1997, Loizou et al. 
Speech processing in CI has focused on investigating the effect of processor parameters in CI signal processor design (e.g., Shannon et al. 1995, Fu 1997, Loizou et al. 2000), the effect of noise (e.g., Fu et al. 1998b, Munson and Nelson 2005), the effect of electric stimulation level, duration, and rate (e.g., Vandali et al. 2000), and the effect of combining acoustic hearing with a CI or bilateral CIs (e.g., Muller 2002). In contrast, speech processing in ASR has focused on feature extraction and representations (e.g., Huang et al. 2001), speech enhancement in environmental noise (e.g., Hermansky and Morgan 1994), speech normalization for children and multiple talkers (Potamianos and Narayanan 2003, Stylianou et al. 1998), and feature extraction at the suprasegmental level, such as prosody, pitch, stress and prominence (Wang and Narayanan 2007, Kompe 1997). In ASR, features may be used as long as they are distinguishable by machines. In CI, due to the nature of electric stimulation, not all the features important for speech perception can be easily delivered through the CI interface (e.g., temporal fine structure, complex pitch). This fact has restricted the optimization of CI speech recognition.

In the field of speech perception, one of the primary issues has been the apparent "lack of invariance" between the acoustic information in a signal and phonemic perception. The variability of the acoustic information may be caused by numerous factors (e.g., gender, dialect, social group, speaking rate, emotional state, vocal tract length, articulatory habits, phonetic context). The variability exists not only across talkers (e.g., Tsao et al. 2006), but also within talkers (e.g., Newman et al. 2001). Despite the wide range of variability in the intended phonemes in speech production, normal hearing listeners can still preserve "perceptual constancy" of the linguistic message in various listening conditions (Pisoni 1993). However, such perceptual constancy is typically not obvious, or not easy to achieve, for cochlear implant users. For example, normal hearing listeners are able to preserve a high level of speech understanding with narrow-bandwidth speech (e.g., telephone speech) and in multi-talker environments, whereas such conditions are all very challenging for CI users.

Achieving good speech perception with a CI is not an easy task. First, the acoustic signal is represented in a greatly degraded manner through the CI device. Second, the delivered acoustic features are typically distorted by the acoustic-to-electric mapping through the electric stimulation of the electrodes. Speech perception in CI is only possible when the degradation and distortion of the acoustic signal are within an acceptable range. Figure 1-3 illustrates the architecture of speech perception in CI. At the third level of the architecture, individual listening patterns, a number of electro-psychoacoustic studies have been carried out to study the relationship between psychoacoustic measurements and speech perception (e.g., electrode discrimination, pitch discrimination); no clear or consistent results were found that could explain speech perception in all CI users. A tremendous amount of CI research has focused on the second level of the architecture, optimization of the CI device, including how to stimulate the electrodes at a higher rate, how to design the electric stimulation to introduce fewer electrode interactions, and how to provide more spectral resolution. Compared to the extensive literature on the second and third levels of the CI speech recognition architecture, only a few studies have addressed problems at the first level.
Studies of noise suppression strategies and of the effect of talker variability fall into this category.

Figure 1-3: High-level architecture for CI speech perception (auditory signals → CI device → individual listening patterns → speech recognition patterns).

Based on the above information, speech enhancement in CI has some special concerns, and correspondingly, the methodology used to enhance speech has some constraints as well. Specifically, CI speech enhancement requires:

i) Adaptability. As discussed above, there are many different speech processing strategies for realizing the acoustic-to-electric stimulation of the auditory fibers, and even for the same strategy, the implementation varies across devices. It is desirable that the speech enhancement method fit easily into different devices. This concern suggests that front-end speech morphing is the most suitable approach.

ii) Flexibility. As just mentioned, the challenging listening scenarios for CI users vary widely and show patterns that differ from those of NH listeners. It is important that the speech enhancement framework be flexible enough to adapt to different listening scenarios. Since adaptability dictates a front-end processing approach, statistical modeling is a good option for describing the different scenarios, such that a limited number of parameters can capture the characteristics of the signal to be processed.

iii) Real-time implementation. This requirement is an absolute must for CI applications. Given the flexibility and adaptability concerns above, offline training combined with online speech processing becomes natural and practical for speech enhancement. After the offline training process, only a limited number of parameters need to be stored in the device. For real-time operation, these parameters are easily retrieved and used in the front-end speech processing.

Table 1-1 summarizes the off-the-shelf criteria for speech enhancement and the corresponding proposed approaches. This speech enhancement framework was applied to two different listening scenarios to improve speech perception in CI. Based on these criteria, the speech enhancement framework is illustrated in Figure 1-4. The goal of offline training is to learn a conversion function between the ideal and non-ideal speech, i.e., to represent the relationship between the two with a limited number of parameters that facilitate the online implementation. During online operation, the non-ideal speech is analyzed in the same fashion as in offline training. Specifically, the speech is decomposed into a spectral envelope and an excitation. The spectral envelope goes through a transformation based on the conversion function trained offline. Speech synthesis is then deployed to render speech that mimics the properties of the ideal speech.
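The conversion function can be sketched with the standard GMM regression used in voice-conversion and bandwidth-extension work; Appendix A derives a mapping of this general kind. The snippet below is a minimal sketch of that family of mappings, not the exact function of Appendix A: a joint GMM is fit offline on paired source/target spectral features, and online the posterior-weighted conditional-mean (MMSE) regression is applied frame by frame. The use of scikit-learn, the number of mixtures and the stand-in features are assumptions of the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_conversion(X, Y, n_mix=4, seed=0):
    """Offline step: fit a joint GMM on paired source (X) and target (Y) features.

    X, Y: (n_frames, d) aligned feature matrices (e.g., Mel-scaled LSFs + energy).
    Returns the fitted joint GMM and the feature dimensionality d.
    """
    d = X.shape[1]
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full", random_state=seed)
    gmm.fit(np.hstack([X, Y]))          # joint vectors [x; y]
    return gmm, d

def convert(gmm, d, X):
    """Online step: posterior-weighted MMSE regression from source to target features."""
    w, mu_x, mu_y = gmm.weights_, gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx, S_yx = gmm.covariances_[:, :d, :d], gmm.covariances_[:, d:, :d]
    # responsibilities of each mixture component for each source frame
    dens = np.stack([w[i] * multivariate_normal(mu_x[i], S_xx[i], allow_singular=True).pdf(X)
                     for i in range(len(w))], axis=1)
    post = dens / dens.sum(axis=1, keepdims=True)
    Yhat = np.zeros((X.shape[0], d))
    for i in range(len(w)):
        reg = S_yx[i] @ np.linalg.inv(S_xx[i])
        cond_mean = mu_y[i] + (X - mu_x[i]) @ reg.T        # E[y | x, component i]
        Yhat += post[:, [i]] * cond_mean
    return Yhat

# Hypothetical usage with random stand-in 19-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 19))
Y = 0.6 * X + 0.05 * rng.normal(size=(500, 19))
gmm, d = train_conversion(X, Y)
print(convert(gmm, d, X[:5]).shape)
```

Only the GMM parameters (weights, means, covariances) need to be stored on the device, which is what makes the offline-training / online-conversion split practical.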
Table 1-1: Off-the-shelf criteria and corresponding approaches for the proposed speech enhancement framework in cochlear implants
  Criterion: flexible to different CI speech processors — Approach: front-end processing (speech morphing and speech synthesis)
  Criterion: adaptive to different hearing scenarios — Approach: statistical modeling
  Criterion: on-the-fly implementation — Approach: offline training and online conversion

Figure 1-4: Proposed speech enhancement framework (offline training of a conversion function between non-ideal and ideal speech; online conversion decomposes the incoming non-ideal speech into excitation and spectral envelope, transforms the envelope, and synthesizes the approximated ideal speech).

1.4 Proposal for speech intelligibility modeling in cochlear implants

To predict speech intelligibility in cochlear implants, a noise-band vocoder, which processes speech in a fashion similar to the cochlear implant signal processor, is typically used to render the acoustic signals. A speech intelligibility predictor (typically of the articulation index (AI) type) is then applied to the simulated signals. This method has generally been used to predict speech intelligibility in noise (e.g., Goldsworthy 2005). Although such CI simulation studies offer great convenience and make it possible to study CI perception in an affordable amount of time, two disadvantages of AI-type speech intelligibility predictors are that 1) the noise-band simulated acoustic signal may not reflect other important psychoacoustic measurements that are unique to CI users, and 2) it is impossible to study inter-subject performance differences, because AI-type methods need a reference "good" speech signal, which is typically clean speech.

To study the inter-subject performance differences, this thesis proposed an acoustic-feature-based speech intelligibility predictor. It aims to incorporate not only the essential speech elements delivered through the CI device, but also the psychoacoustic responses of individual CI listeners. It was hypothesized that such a predictor may explain not only the overall speech perception patterns in CI but also the inter-subject performance differences. The study of this speech intelligibility predictor involved three consecutive phases tackling different aspects of speech perception in CI.

Phase 1 defined the speech intelligibility predictor and established its overall goodness. The proposed predictor was used to evaluate the general speech recognition patterns of normal hearing listeners listening to noise-band vocoders. Five types of CI speech degradation were explored: the number of spectral channels, spectral smearing, spectral warping, spectral shifting and nonlinear amplitude mapping.

Phase 2 introduced electrode confusion patterns into the speech intelligibility predictor. This phase validated the flexibility of the intelligibility model in reflecting the psychoacoustic measurements in CI. The extent of electrode confusion was simulated by raising a base matrix to different powers and using the result to modulate the acoustic features. The prediction ability was evaluated with NH listeners listening to CI simulations.
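The matrix-power smearing used in Phase 2 can be illustrated as follows. A simple tridiagonal base matrix lets each channel leak a fraction of its energy into its neighbors; raising it to higher powers spreads the energy further, i.e., simulates more severe electrode confusion, and the result is applied to per-frame channel features. The leakage value and row normalization here are illustrative assumptions, not the exact base matrix of Eq. (6-3).

```python
import numpy as np

def base_confusion_matrix(n_channels, leak=0.1):
    """Tridiagonal base matrix: each channel leaks a fraction of its energy
    to its immediate neighbors. Rows are renormalized to sum to one."""
    A = np.eye(n_channels)
    for i in range(n_channels):
        if i > 0:
            A[i, i - 1] = leak
        if i < n_channels - 1:
            A[i, i + 1] = leak
    return A / A.sum(axis=1, keepdims=True)

def smear(features, power, leak=0.1):
    """Apply an electrode-confusion smearing of a given strength.

    features: (n_frames, n_channels) channel energies or envelopes.
    power:    integer >= 0; higher powers of the base matrix spread energy
              across more channels, i.e., more severe simulated confusion.
    """
    A = base_confusion_matrix(features.shape[1], leak)
    S = np.linalg.matrix_power(A, power)   # power = 0 gives the identity (no smearing)
    return features @ S.T

# Hypothetical 8-channel example: a single frame with energy in channel 3 only.
frame = np.zeros((1, 8))
frame[0, 3] = 1.0
for p in (0, 1, 4):
    print(p, np.round(smear(frame, p), 3))
```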
Phase 3 investigated the speech intelligibility model integrated with customized psychoacoustic measurements from individual CI listeners, modeling inter-subject performance differences and the talker preferences of individual CI users. The daily speech processor of each individual CI user was fully taken into account, and psychoacoustic response patterns (e.g., electrode confusion patterns, dynamic range, acoustic-to-electric compression and intensity resolution) were unified into the speech intelligibility model. The effects of the various psychoacoustic measurements on speech perception prediction were investigated and discussed.

1.5 Thesis outline

Chapter 2 studied a bandwidth extension approach to compensate for the perceptual deficit due to the loss of high-frequency information in CI. The bandwidth extension method was based on the proposed speech enhancement framework. Its effect was evaluated by CI users listening to speech materials with and without bandwidth extension enhancement, using their daily speech processors.

Chapter 3 further studied the speech enhancement framework in compensating for cross-talker acoustic differences, aiming to reduce the cross-talker intelligibility differences observed in CI users. Based on the perceptual data for different talkers, speech was classified into optimal and non-optimal speech, and a spectral normalization method was explored to adjust the speech in the spectral domain. Objective and subjective measurements were used to evaluate the effect of spectral normalization.

Chapter 4 first summarized the inter-subject performance differences observed in Chapters 2 and 3 and discussed the factors relating to these differences. A general speech intelligibility model was proposed to tackle the challenges of modeling the inter-subject performance differences, and the investigation of this model was outlined.

Chapter 5 studied the prediction ability of the proposed general speech intelligibility model under different parametric effects. A noise-band vocoder was used to simulate the acoustic information delivered through a CI device, and the relationship between the predictor and the speech recognition performance of normal hearing listeners was studied.

Chapter 6 focused on the effect of electrode confusion patterns on speech recognition. The electrode confusion pattern was simulated by using a non-identity matrix to smear the acoustic features from the CI analysis bands. A noise-band vocoder was used to render speech for speech perception tests in NH listeners, and the intelligibility predictor was correlated with the speech perception data.

Chapter 7 studied the customized model with various psychoacoustic measurements from CI users. The acoustic features in the model were modulated or adjusted based on the physical meaning of the psychoacoustic measurements. The customized model was used to predict the inter-subject performance differences and talker preferences of individual CI users. The effect of each psychoacoustic measurement on the intelligibility prediction was also investigated and discussed.

Finally, Chapter 8 summarized the current study and suggested future work.
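Several of the chapters outlined above (in particular Chapters 5 and 6) rely on noise-band vocoded speech to present CI-like signals to NH listeners. A minimal sketch of such a vocoder is given below: each analysis band's envelope modulates band-limited noise in a corresponding carrier band, and the modulated carriers are summed. The band edges, filter orders and envelope cutoff are illustrative assumptions and do not reproduce the specific parameter sets used in the experiments.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_band_vocoder(x, fs, analysis_bands, carrier_bands=None, env_cutoff=160.0):
    """Minimal noise-band vocoder in the style of CI acoustic simulations.

    analysis_bands / carrier_bands: lists of (low, high) Hz. If carrier_bands is
    None, the carriers coincide with the analysis bands; supplying shifted or
    warped carrier bands simulates, e.g., a shallow insertion depth.
    """
    if carrier_bands is None:
        carrier_bands = analysis_bands
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    sos_env = butter(2, env_cutoff, btype="low", fs=fs, output="sos")
    for (alo, ahi), (clo, chi) in zip(analysis_bands, carrier_bands):
        band = sosfilt(butter(4, [alo, ahi], btype="bandpass", fs=fs, output="sos"), x)
        env = sosfilt(sos_env, np.abs(band))                # channel envelope
        carrier = sosfilt(butter(4, [clo, chi], btype="bandpass", fs=fs, output="sos"),
                          rng.standard_normal(len(x)))       # band-limited noise carrier
        out += env * carrier                                  # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)                # simple peak normalization

# Hypothetical 4-channel simulation of a 16-kHz signal.
fs = 16000
x = np.random.randn(fs)                                       # stand-in for a speech waveform
bands = [(200, 700), (700, 1500), (1500, 3000), (3000, 6000)]
print(noise_band_vocoder(x, fs, bands).shape)
```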
Chapter 2 Speech Enhancement: Bandwidth Extension for Telephone Speech

It is well known that speech understanding by normal hearing people is highly robust to missing acoustic information. However, CI users are typically at a disadvantage under such listening conditions: perceptual constancy is difficult to maintain with a CI when some acoustic information is lost. We specifically study the optimization of telephone speech, where the high-frequency acoustic components are missing. Telephone communication only transmits speech information in the range of 300 Hz to 3400 Hz. Although normal hearing people have little difficulty understanding speech with such a limited frequency band, studies have shown that the narrow bandwidth of telephone speech significantly degrades speech perception in CI users (Milchard and Cullington 2004, Fu and Galvin 2006). In this chapter, we extend the proposed speech enhancement framework to optimize telephone speech with a bandwidth extension method.

2.1 Introduction

Telephone use is still challenging for many deaf or hearing-impaired individuals, including cochlear implant (CI) users.
Such an approach requires neither auxiliary instruments nor patient-specific data to implement.

2.2 Methods

2.2.1 Signal processing

In this study, a GMM-based spectral mapping from narrowband to wideband speech was designed to expand the limited bandwidth of telephone speech. In the source-filter model, speech consists of a source excitation and a vocal tract filter. Expanding narrowband speech to wideband speech correspondingly consists of two parts: excitation extension and spectral envelope extension (Park and Kim 2000, Gustafsson et al. 2006, Nilsson and Kleijn 2001). The framework for bandwidth extension is shown in Figure 2-1.

Figure 2-1: Bandwidth extension framework for telephone speech.

The basic idea for spectral envelope extension was to train a mapping function from the narrowband spectral envelope to the wideband spectral envelope. The narrowband speech spectral characteristics were modeled as a GMM. The mapping function was trained to minimize the spectral distance between the narrowband and the target wideband spectral envelopes in the minimum mean square error (MMSE) sense over the training dataset. The solution to the mapping function is given in Appendix A, Eq. (A-4).

Two methods introduced by Makhoul and Berouti 1979 may be applied for excitation spectrum extension in this study: spectral folding and spectral translation. Spectral folding simply generated a mirror image of the narrowband spectrum as the high-band spectrum. Implementing spectral folding was equivalent to upsampling the excitation signal in the time domain by inserting zeros between samples, which added almost no extra processing cost. Yet, the energy in the reconstructed high band was typically over-estimated with this approach, and the harmonic pattern of the restored high band was a flipped version of the original narrowband spectrum, centered at the highest frequency of the narrowband speech. Spectral translation did not have these problems, but was computationally more expensive. The excitation spectrum of the narrowband speech, obtained from the Fourier transform of the time-domain signal, was translated to the high-frequency region and padded to fill the desired full band. A low-pass filter was applied for spectral whitening, so that the discontinuities between the translated segments were smoothed. The extended wideband excitation in the time domain was then obtained by inverse Fourier transformation.

In this study, Mel-scaled line spectral frequency (LSF) features (18th order) and energy were extracted to model the spectral characteristics of speech in a 19-dimensional space. The spectral mapping function between the narrowband and wideband speech was trained with 200 randomly selected sentences from the IEEE database (100 sentences from a female talker and 100 sentences from a male talker). The excitation component between 1 and 3 kHz was used to construct the high-band excitation component. A low-pass Butterworth filter (1st order, 3 kHz cutoff frequency) was used for spectral whitening. The synthesized high-band speech (i.e., frequency information above 3.4 kHz) was obtained by high-pass filtering the convolution of the extended excitation and the extended spectral envelope. It was then appended to the original telephone speech to render the reconstructed wideband speech, which covered the frequency band from 300 Hz to 8 kHz.
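To make the excitation-extension step concrete, the following is a minimal Python/NumPy sketch of the two options described above: spectral folding by zero-insertion upsampling, and spectral translation of the 1-3 kHz excitation band. It is an illustrative sketch only; the function names and filter orders are assumptions rather than the actual implementation, the GMM envelope mapping of Appendix A is not reproduced, and the 1st-order Butterworth whitening filter mentioned above is omitted.

```python
import numpy as np
from scipy.signal import butter, lfilter

def fold_excitation(exc_nb):
    """Spectral folding: zero-insertion upsampling by 2 mirrors the
    narrowband excitation spectrum into the new high band."""
    exc_wb = np.zeros(2 * exc_nb.size)
    exc_wb[::2] = exc_nb           # interleave zeros between samples
    return exc_wb                   # now sampled at twice the narrowband rate

def translate_excitation(exc_nb, fs_nb=8000, fs_wb=16000,
                         src_lo=1000.0, src_hi=3000.0, nb_edge=3400.0):
    """Spectral translation: copy the 1-3 kHz excitation spectrum upward,
    repeating it until the band above the telephone edge is filled."""
    # Interpolate to the wideband rate (zero-stuff, then remove the image).
    up = np.zeros(2 * exc_nb.size)
    up[::2] = exc_nb
    b, a = butter(4, (fs_nb / 2.0) / (fs_wb / 2.0))
    up = 2.0 * lfilter(b, a, up)

    spec = np.fft.rfft(up)
    freqs = np.fft.rfftfreq(up.size, d=1.0 / fs_wb)
    chunk = spec[(freqs >= src_lo) & (freqs < src_hi)]   # 1-3 kHz excitation
    hi = np.flatnonzero(freqs >= nb_edge)                # bins above 3.4 kHz
    reps = int(np.ceil(hi.size / chunk.size))
    spec[hi] = np.tile(chunk, reps)[:hi.size]
    return np.fft.irfft(spec, n=up.size)
```

In the actual system, the resulting excitation would be convolved with the GMM-extended spectral envelope, high-pass filtered above 3.4 kHz, and added back to the telephone-band signal, as described above.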
2.2.2 Test materials

The test materials in this study were IEEE (1969) sentences, recorded from one male and one female talker at House Ear Institute with a sampling rate of 16 kHz. The narrowband telephone speech was obtained by band-pass filtering the wideband speech (9th-order Butterworth filter, pass band from 300 Hz to 3400 Hz) and downsampling to 8 kHz. Four conditions were tested: female/male restored wideband speech and female/male telephone speech. All sentences were normalized to have the same long-term root mean square (RMS) value.

2.2.3 Test procedures

Seven CI subjects (2 women, 5 men; S1, S2, S3, S5, S6, S8 and S9 in Appendix C) participated in this study. All subjects were native speakers of American English and had extensive experience in speech recognition experiments. All subjects provided informed consent in accordance with the local IRB, and all were paid for their participation.

Subjects were tested using their clinically assigned speech processors and comfortable volume/sensitivity settings; once testing began, these settings were not changed. Subjects were tested while seated in a double-walled sound-treated booth (IAC). Stimuli were presented via a single loudspeaker at 65 dBA. The sentences in the IEEE database were divided into 72 lists, with 10 sentences per list. For each run, a list was randomly chosen (without replacement) and a sentence from that list was randomly chosen (without replacement) and presented to the subject. Subjects responded by repeating the sentence as accurately as possible; the experimenter tabulated all correctly identified words. Performance was calculated as the ratio between correctly identified words and all words presented in the list; performance was averaged across 4-5 lists for each talker condition. To familiarize subjects with the different talkers and test procedures, a practice session was provided prior to the sentence recognition test with each talker. The test order of the different talker and test conditions was randomized for each subject. No feedback was provided during the test.

2.3 Results

The spectrogram of the restored wideband speech is compared to that of the original wideband speech in Figure 2-2, for the sentence "Glue the sheet to the dark blue background" produced by the female talker. The restored high-band frequency component was highly synchronized with the original telephone speech and extended the harmonic patterns well toward the high band. In contrast to the spectrogram of the original wideband speech, the spectrogram of the restored high-band speech was relatively flat, possibly because the Mel-scaled speech analysis placed less spectral resolution in the high band.

Sentence recognition performance with (shaded bar) and without (black bar) the restored high-band components is shown in Figure 2-3. The performance scores were plotted as the average across the individual female and male talkers, since post-hoc analysis indicated a non-significant effect of talker under both the phone [paired t-test: t = -1.812, p = 0.120] and phone+hf [paired t-test: t = -1.184, p = 0.281] conditions. On average, the recognition score with the restored high-band speech was 3.5% higher than without it. The improvement was small but significant [one-way repeated measures ANOVA: F(1,6) = 5.989, p = 0.050]. Furthermore, Figure 2-3 demonstrated substantial cross-subject variability in performance.
First, cross-subject variability was observed in overall performance for the same test materials. For example, subject S6 obtained over 80% correct both with and without the restored high-band component, whereas subject S3 obtained only about 35% correct on average. Second, cross-subject variability was observed in the effect of the restored high-band speech. For example, subject S2 achieved 9.3% higher performance with the restored high-band speech, while subject S8's performance actually decreased by 2.6% with the restored high-band speech.

Figure 2-2: Comparison of the spectrogram of the restored wideband speech (top panel) and the original wideband speech (bottom panel).

Figure 2-3: Sentence recognition performance for individual CI subjects with (shaded bar) and without (black bar) the restored high-band information. The error bars indicate one standard deviation.

2.4 Discussion

The high-band speech components that were removed by telephone transmission were estimated according to a pre-trained relationship between the telephone speech and the wideband speech. Although speech perception was significantly improved with the restored high-band speech, the improvement was relatively small compared to the previously reported 17.7% perception difference between telephone speech and wideband speech (Milchard and Cullington 2004). The discrepancy may be due to differences in test materials and subjects. In the current study, IEEE sentences and seven CI subjects were tested. In Milchard and Cullington 2004, a total of 18 subjects participated, and the Four Alternative Auditory Feature (FAAF) test, based on 80 consonant-vowel-consonant (CVC) forms (e.g., BAD BAG BAT BACK), was used. The discrepancy may also be due to the speech analysis procedures used in the current study: Mel-scaled LSF features were used, which placed lower resolution on the high-frequency components, and the feature order used for speech analysis was the same (18th order) for both wideband and narrowband speech, although their frequency ranges were different. Finally, the discrepancy may be due to artifacts arising from the nature of speech synthesis. Speech synthesis used a limited number of estimated parameters to describe complex articulation configurations, and it was difficult to achieve sufficient accuracy for the synthesis to be free of perceptual distortion (Park and Kim, 2000, Nilsson and Kleijn 2001, Nilsson et al. 2002, Gustafsson et al. 2006). In the current study, spectral discontinuities and a somewhat robotic speech quality were the artifacts typically perceived.

In the current study, the speech was synthesized and presented in the free field to CI subjects. Because CI subjects were very sensitive to any artifacts in the speech, the testing procedure may underestimate the contribution of the restored high-band speech. Future studies may transmit the restored high-band information directly and stimulate the electrode array according to a given CI speech processing strategy; by avoiding speech synthesis, the evaluation of the bandwidth extension method may be more accurate.

The substantial cross-subject performance differences may come from the substantial differences in CI device settings and electro-psychoacoustic listening patterns across subjects.
For example, the CI speech processors of some subjects may encode more information from the high-band speech; such subjects may be more vulnerable to high-frequency information loss than other subjects. However, it was not clearly understood what exactly in the CI device settings and electro-psychoacoustic listening patterns caused these differences.

Besides the cross-subject performance differences, a substantial cross-talker intelligibility difference was also observed, as illustrated in Figure 2-4, which shows sentence recognition performance for individual subjects listening to narrowband telephone speech produced by the male and female talkers. On average, speech from the female talker was ~5.1% more intelligible than that from the male talker; this difference was not significant [paired t-test: t = -1.812, p = 0.120]. However, individual subjects showed different intelligibility differences between these two talkers. For example, subject S3 preferred the female talker's speech over the male talker's, with a 13.7% intelligibility difference, while subject S6 preferred the male talker's speech over the female talker's, with a 3.6% intelligibility difference. Such cross-talker intelligibility differences are often observed in hearing-impaired and cochlear implant listeners, and may be related to the individual speech processor settings and psychoacoustic responses of individual CI users. Chapter 3 further studied an approach to leverage the cross-talker intelligibility difference.

Figure 2-4: Sentence recognition performance with telephone speech for individual CI subjects listening to male (black bar) and female (shaded bar) talkers. The error bars indicate one standard deviation. The star indicates a significant difference between male and female performance at a level of 0.05.

2.5 Conclusions

The present study proposed a bandwidth extension method to enhance telephone speech understanding in CI users. The wideband spectral information was estimated based on the available telephone speech and a pre-trained relationship between the narrowband and wideband speech. The narrowband excitation was extended to a wideband excitation by spectral translation. A source-filter model was used to synthesize the estimated wideband speech, whose high-band frequency information was filtered out and appended to the original telephone speech. Seven CI users listened to the telephone speech with and without the restored high-band frequency information. Sentence recognition with the restored high-band speech was significantly better than without it. Yet, the net improvement was relatively small and was substantially dependent on the individual subject. Possible reasons and future directions were discussed.

Chapter 3 Speech enhancement: spectral normalization for different talkers

(Most of Chapter 3 is reproduced from Liu et al. 2008. Part of the content was moved to the Appendix to avoid replication with other chapters, and section and equation numbers were changed to be internally consistent with this thesis.)

With narrowband telephone speech, cross-talker intelligibility differences were substantial for certain subjects, although the difference was not significant on average. Such intelligibility differences may also exist with wideband speech. This chapter exploited the cross-talker intelligibility difference by enhancing a non-ideal talker's speech to mimic that of an ideal talker. A spectral normalization approach was studied and its effect was evaluated with real recorded speech and with simulated talker differences.

3.1 Introduction
Normal hearing (NH) listeners are able to understand speech from a variety of talkers, despite differences in acoustic characteristics (e.g., voice pitch, speaking rate, accent, etc.). NH listeners are thought to use some form of "speaker normalization" to process speech from multiple talkers, thereby preserving the perceptual constancy of the linguistic message (Pisoni, 1993). Speaker normalization may affect processes at an early segmental acoustic-phonetic level (Verbrugge et al., 1976; Assmann et al., 1982), and is associated with some central processing cost, as reflected in the reduced speech performance as the number of talkers is increased (Mullennix et al., 1989; Sommers et al., 1994).

Despite the operation of such speaker normalization processes, speech intelligibility varies considerably across different talkers. Different talkers have been shown to produce different levels of speech intelligibility in NH listeners (e.g., Hood and Poole, 1980; Cox et al., 1987). For example, Cox et al. (1987) studied the intelligibility of speech materials produced by three male and three female talkers in different listening conditions in NH listeners. Results indicated significant differences in intelligibility across talkers, even in listening environments that allowed for full intelligibility of everyday conversations. These cross-talker effects have also been observed in cochlear implant (CI) users. For example, Green et al. (2007) recently studied the effects of cross-talker differences on speech intelligibility in CI users and in NH listeners listening to acoustic CI simulations. In their study, two groups of talkers (high- or low-intelligibility talkers) were established according to mean word error rates, based on previous data collected with NH listeners; each group consisted of one male adult, one female adult, and one female child talker. Results showed differences in intelligibility between the two talker groups for a variety of listening conditions; talker group differences were maintained even when overall speech performance was reduced in the more difficult listening conditions.

In CIs, speech patterns are represented by a limited number of spectral and temporal cues. In addition, electrically evoked speech patterns may be further distorted due to the spectral mismatch between the input acoustic frequency and the electrode place of stimulation. Previous CI acoustic simulation studies with NH listeners have shown differences in speech understanding for different talkers (e.g., Dorman et al., 1997a; Fu and Shannon, 1999). While talker variability may not have been the main research focus, these studies suggest that the degree of spectral distortion may significantly affect intelligibility with different talkers. Thus, compared with NH listeners, whose spectral resolution may better support perceptual normalization across talkers, CI users' speech recognition may be more susceptible to acoustic differences across talkers.

Although widely studied, the relation between speech intelligibility and the acoustic-phonetic properties of different talkers remains unclear. For example, while Bradlow et al.
(1996) found no correlation between speaking rate and intelligibility in NH listeners, Bond and Moore (1994) found that, compared with more-intelligible talkers, less-intelligible talkers produced words and vowels with shorter durations. Besides speaking rate, other acoustic-phonetic correlates of intelligibility across talkers have been studied, e.g., fundamental frequency (F0; Bradlow et al., 1996), amplitude of stressed vowels (Bond and Moore, 1994), long-term average spectrum and consonant-vowel intensity ratio (Hazan and Markham, 2004). Typically, no single acoustic feature was able to explain the intelligibility difference across talkers. Hazan and Markham (2004) suggested that highly intelligible speech may depend on combinations of different acoustic-phonetic characteristics. 37 Some researchers have tried to improve speech intelligibility by compensating for differences along one acoustic dimension. For example, Luo and Fu (2005) studied an acoustic rescaling algorithm to normalize the formant space across talkers; the algorithm was evaluated in NH subjects listening to acoustic CI simulations. In their study, mean third formant frequency (F3) values (across vowels) were calculated for each talker in the stimulus set. The ratio between the mean F3 value for each talker and the reference talker (the talker that produced the best vowel recognition for each subject) was used to adjust the analysis filter bank of an acoustic CI simulation to match an optimal reference pattern. Multi-talker Chinese vowel recognition was tested in NH subjects listening to a 4-channel acoustic CI simulation, with and without the acoustic rescaling algorithm. Results showed a small but significant improvement in subjects’ overall multi-talker vowel recognition with the acoustic rescaling algorithm. Note that in the Luo and Fu study (2005), the largest improvements in performance were not always for the least-intelligible talkers. Nejime and Moore (1998) examined the effects of reduced speaking rates for speech intelligibility in noise, using a simulation of cochlear hearing loss in NH subjects. Reducing the speaking rate did not significantly improve intelligibility in the context of the simulated hearing loss. This lack of effect may have due to the relatively weak contribution of speaking rate to intelligibility, or to processing artifacts associated with the signal modification. In the present study, rather than normalize speech in one acoustic dimension or by linearly rescaling the formant space, we used a spectral normalization method 38 based on statistical modeling of acoustic features to compensate for complex and dynamic acoustic variability at the speech segment level. As described earlier, no single acoustic feature can fully account for intelligibility difference across talkers. As a statistical modeling does not depend on any single feature, this approach may be more beneficial. The term “spectral normalization” is used because: a) the spectral envelope was used to analyze the acoustic variability and b) the proposed algorithm was intended to normalize spectral characteristics across different talkers. Note that “spectral normalization” in this study refers to a signal processing procedure rather than a perceptual process (e.g., “speaker normalization,” as in Pisoni, 1993). The proposed spectral normalization algorithm was used for “front-end” processing (before CI speech processing), and was evaluated in CI users and NH subjects. 
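As a rough illustration of how a trained GMM-based conversion of this kind can be applied frame by frame (the estimation of the conversion function itself is given in Appendix A), the following Python/scikit-learn sketch may help. It is only a sketch under assumed names: convert_frames, V and b are illustrative placeholders for the trained conversion parameters, not the thesis implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert_frames(X_src, gmm, V, b):
    """Posterior-weighted linear conversion of source feature frames.

    X_src : (n_frames, d) source Mel-LSF vectors
    gmm   : GaussianMixture fit on the source talker's features (m components)
    V, b  : per-component transform matrices (m, d, d) and biases (m, d),
            assumed to have been estimated in the MMSE sense beforehand
    """
    post = gmm.predict_proba(X_src)                    # (n_frames, m) posteriors
    Y = np.einsum('nm,mde,ne->nd', post, V, X_src)     # weighted V_i @ x per frame
    return Y + post @ b                                # plus weighted biases

# Illustrative GMM setup mirroring the settings reported later in this chapter
# (64 mixture components, diagonal covariances):
# gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(X_src_train)
```

The converted Mel-LSF frames would then be recombined with the source residual to synthesize the normalized speech, as described in the implementation below (Section 3.2.1).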
In Experiment 1, recognition of sentences produced by two different talkers (one male and one female) was measured in CI listeners, with and without spectral normalization; sentence recognition was measured in quiet for each talker independently. In Experiment 2, sentence recognition was measured in CI and NH listeners, with and without spectral normalization, using pitch-stretched transformations to simulate different talkers; i.e., speech was systematically pitch-stretched to produce different F0 and vocal tract configurations while preserving temporal characteristics such as speaking rate, overall duration and amplitude. Subjective quality ratings and stimulus discriminability data were also collected from NH listeners.

3.2 Method

In the present study, the spectral normalization algorithm was based on the speech enhancement framework proposed in Chapter 1. A continuous statistical model was utilized to compensate for acoustic differences between talkers. By using a continuous model, speech from a variety of talkers may be adjusted to match a listener's optimal speech patterns. Specifically, the algorithm used a Gaussian Mixture Model (GMM) to represent the spectral characteristics of a "source" talker at the segmental level, and transformed the source talker's spectral characteristics to those of a "target" talker using a trained spectral conversion function. Appendix A gives the details of the conversion function estimation.

3.2.1 Implementation of spectral normalization

Once the spectral conversion function had been estimated from training data, the spectral conversion was performed as depicted in Figure 3-1. This framework is very similar to that of Figure 2-1, except that Figure 3-1 does not require the residual translation operation, because the target speech and source speech have the same bandwidth.

Figure 3-1: Implementation framework of the GMM-based spectral normalization algorithm.

In the above system, Mel-scaled Line Spectral Frequency (LSF) features (Huang et al., 2001) were used to train the GMM, as they are perceptually based and have smooth interpolation characteristics (Kain and Macon, 1998). Specifically, the Mel-scaled LSF coefficients were obtained as follows. After frame-based speech analysis and Linear Predictive Coding (LPC) coefficient extraction, the LPC spectrum was transformed to a Mel-warped spectrum (Huang et al., 2001) according to the relationship M(f) = 1125 ln(1 + f/700), where f is the frequency in Hz and M(f) is the corresponding Mel frequency in Mels. The warped spectrum was then uniformly re-sampled using splined cubic phase interpolation to obtain the Mel-scaled LPC spectrum. A least-squares fit was used to transform the Mel-scaled LPC spectrum to Mel-scaled LPC coefficients, which were then transformed to Mel-scaled LSF coefficients.

To transform a given utterance, spectral feature vectors from the source talker's speech were extracted and transformed by the spectral conversion function that was trained using the GMM (as described in Appendix A). The residual from the spectral extraction was then convolved with the modified spectral parameters to render the transformed speech signal. In the present study, there was no attempt to match the prosodic characteristics of the source and target talkers. Hence, the source talker's average fundamental frequency (F0), speaking rate and articulation rhythms were preserved after transformation. To reduce computational load, a diagonal conversion was used (i.e., the Τ_i and Σ_i in Eq.
A-4 were in diagonal form). This is a common practice in GMM training, as the correlation between distinct cepstral coefficients is very small (Stylianou et al., 1998). The number of GMM components (i.e., m in Eq. A-1) was set to 64, as the contribution of additional GMM components to the acoustic distance between target and transformed speech is marginal beyond 64 components (Liu et al., 2006).

3.2.2 Objective verification of the spectral normalization algorithm in cochlear implant simulations

Spectral conversion has been shown to effectively transform the spectral characteristics (e.g., formant position/bandwidth, spectral tilt, energy distribution, vocal tract length) of a source talker to those of a target talker without spectral degradation (Stylianou et al., 1998; Kain and Macon, 1998; Liu et al., 2006). For CI users, speech recognition performance is most strongly influenced by parameters that affect spectral resolution (e.g., the number of electrodes/channels). In general, speech recognition in quiet improves with increasing numbers of spectral channels (Fu, 1997; Dorman et al., 1997b; Fishman et al., 1997). To see the effect of spectral conversion on spectrally degraded speech (as is typically encountered by CI users), the proposed algorithm was evaluated using distance measurements with an acoustic CI simulation.

The acoustic CI simulation was implemented similarly to Shannon et al. (1995). The signal was first pre-emphasized with a filter coefficient of 0.95. The input frequency range (100 - 6000 Hz) was then band-passed into 16, 8, 6 or 4 frequency analysis bands (24 dB/octave filter slope), distributed according to Greenwood's formula (Greenwood, 1990). The temporal envelope was then extracted from each frequency band by half-wave rectification and low-pass filtering (160 Hz envelope filter). The envelope of each band was then used to modulate a wideband noise, which was then spectrally limited by the same band-pass filter used for frequency analysis. Finally, the modulated carriers of all bands were summed to render the spectrally degraded speech.

To measure the degree of spectral conversion for spectrally degraded speech, the acoustic distance between the transformed speech and the target speech was calculated as follows:

d² = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{p} [c_{n,k,converted} - c_{n,k,target}]²    (3-1)

where N is the total number of frames in the feature streams, c_{n,k} is the k-th component of the Mel-Frequency Cepstral Coefficient (MFCC) vector in frame n, and p is the MFCC order (14, in the present case). Lower values of d² indicate greater spectral similarity between the transformed speech and the target speech. When the spectrum of the transformed speech perfectly aligns with that of the target speech, d² = 0.

The objective analysis was performed using IEEE sentences (IEEE, 1969), recorded with one male (M1) and one female (F1) talker. Spectral normalization from F1 to M1 was analyzed. The GMM training dataset included 100 sentences randomly selected from the database; the testing dataset included the entire database (720 sentences). The average acoustic distance between the source (F1) and target (M1) speech was calculated across the whole testing dataset. The average acoustic distance for each condition was then converted to dB units, referenced to the acoustic distance between the unprocessed source and target speech (i.e., no acoustic CI simulation or spectral normalization).
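The following Python sketch illustrates the two pieces of this objective analysis: a noise-band vocoder in the spirit of the acoustic CI simulation just described, and the MFCC distance of Eq. (3-1). It is a minimal illustration under stated assumptions; the function names, filter orders and the Greenwood map constants (A = 165.4, a = 2.1, k = 0.88) are illustrative choices rather than the settings of the actual implementation.

```python
import numpy as np
from scipy.signal import butter, lfilter, sosfilt

def greenwood_edges(n_bands, f_lo=100.0, f_hi=6000.0):
    """Band edges from Greenwood's (1990) frequency-place map
    f = A * (10**(a * x) - k), using commonly cited human constants."""
    A, a, k = 165.4, 2.1, 0.88
    x_lo = np.log10(f_lo / A + k) / a
    x_hi = np.log10(f_hi / A + k) / a
    places = np.linspace(x_lo, x_hi, n_bands + 1)
    return A * (10.0 ** (a * places) - k)

def ci_vocoder(x, fs, n_bands=8, env_cut=160.0, seed=0):
    """Noise-band vocoder similar in spirit to Shannon et al. (1995):
    pre-emphasis, Greenwood-spaced band-pass analysis, half-wave
    rectification plus low-pass envelope extraction, noise-carrier modulation."""
    rng = np.random.default_rng(seed)
    x = lfilter([1.0, -0.95], [1.0], x)                       # pre-emphasis
    env_sos = butter(2, env_cut / (fs / 2.0), output='sos')   # 160 Hz envelope LPF
    out = np.zeros_like(x)
    edges = greenwood_edges(n_bands)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(2, [lo / (fs / 2.0), hi / (fs / 2.0)],
                          btype='band', output='sos')
        band = sosfilt(band_sos, x)                           # analysis band
        env = sosfilt(env_sos, np.maximum(band, 0.0))         # half-wave rect + LPF
        carrier = sosfilt(band_sos, rng.standard_normal(x.size))
        out += env * carrier                                  # band-limited noise carrier
    return out

def mfcc_distance(c_converted, c_target):
    """Eq. (3-1): per-frame squared MFCC distance averaged over frames.
    Both inputs are (n_frames, p) arrays of MFCC vectors."""
    return np.mean(np.sum((c_converted - c_target) ** 2, axis=1))
```

In the analysis reported below, d² would be computed on 14th-order MFCCs of the vocoded transformed and target sentences and then expressed in dB relative to the unprocessed F1-M1 distance.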
Figure 3-2 shows the mean acoustic distance (in dB) between the source and target speech, as a function of the number of spectral channels. The acoustic distance decreased similarly with (dashed line) or without (solid line) spectral normalization as the number of spectral channels was reduced. The acoustic distance was significantly reduced (paired t-test: p < 0.05) with the spectral normalization algorithm; the mean reduction in acoustic distance was -2.73 dB across all spectral resolution conditions. The objective analysis showed that spectral normalization was efficient in transforming the source speech to mimic the target speech, regardless of the number of spectral channels.

Figure 3-2: Normalized talker distortion as a function of the number of channels. Solid line: without spectral normalization. Dashed line: with spectral normalization. Note that the talker distortion between talkers F1 and M1 (unprocessed speech) was used as the reference.

3.3 Experiment 1: Effect with two different talkers

3.3.1 Methods

A. Subjects

Nine post-lingually deafened adult CI users (7 men, 2 women) participated in this experiment. Appendix C lists relevant demographics for the CI subjects. All subjects were native speakers of American English and had extensive experience in speech recognition experiments. All subjects provided informed consent in accordance with the local IRB, and all were paid for their participation.

B. Stimuli and Speech Processing

IEEE sentences (IEEE, 1969), recorded with one male (M1) and one female (F1) talker, were used in this experiment. The mean F0 across all sentences was 92 Hz for M1 and 185 Hz for F1. It is assumed that, in practice, spectral normalization is beneficial only when a less intelligible talker is transformed toward a more intelligible talker. Because it was unknown which talker might produce better recognition performance in individual CI subjects, the spectral transformation was performed in both directions (i.e., M1 was transformed to F1, and F1 was transformed to M1). The GMM for the source talker was trained with 100 randomly selected sentences, resulting in over 60,000 Mel-scaled LSF feature vectors (25th order). The function to transform the source talker to the target talker was estimated according to Eq. A-4. After training the conversion function, all sentences from each source talker were spectrally transformed toward the target talker. Note that the training sentences were also transformed and included in the listening test to increase the available speech materials for the experiment. For descriptive purposes, when the source talker was M1 and the target talker was F1, the transformed speech was labeled M1-to-F1 (and vice versa). In Experiment 1, IEEE sentence recognition was tested for four talker conditions: M1 (unprocessed), F1 (unprocessed), M1-to-F1, and F1-to-M1.

C. Procedure

Subjects were tested using their clinically assigned speech processors and self-adjusted comfortable volume/sensitivity settings; once testing began, these settings were not changed. Subjects were tested while seated in a double-walled sound-treated booth (IAC). Stimuli were presented via a single loudspeaker at 65 dBA. The sentences in the IEEE database were divided into 72 lists, with 10 sentences per list.
For each run, a list was randomly chosen (without replacement) and the sentences from within the list were presented in random order. Subjects were asked to repeat what they heard; the experimenter tabulated all correctly identified words. Performance was calculated as the ratio between correctly identified words and all words presented in the list; performance was typically averaged across 4-5 lists for each talker condition. To familiarize subjects with the different talkers and test procedures, a practice session with one randomly selected list (without replacement) was provided prior to the sentence recognition test in each condition. Note that the speech stimuli used in the practice session were not included in the test stimulus set. The test order for the different talker conditions was randomized for each subject. No feedback was provided during the test.

3.3.2 Results and discussion

Figure 3-3 shows individual subjects' sentence recognition performance for the M1 and F1 unprocessed source talkers, as well as the mean performance across subjects. Subjects are ordered according to talker sensitivity. Note that throughout this chapter, "talker sensitivity" refers to the magnitude of the difference in performance between the two talkers. Mean performance with talker F1 was 2.8 percentage points greater than that with M1; however, the difference was not significant [one-way repeated measures (RM) ANOVA: F(1,8) = 0.560, p = 0.476]. For subjects S1 - S3, performance was better with F1 than with M1; for subjects S4 - S9, performance was better with M1 than with F1. The difference in performance between talkers was significant for subjects S1, S2, S3, and S8 (t-test: p < 0.05; analysis performed within individual subjects using raw data from multiple sentence lists). There was inter-subject variability in terms of talker sensitivity, ranging from 0.3 percentage points for subject S4 to 21 percentage points for subject S1.

Figure 3-3: Individual and mean sentence recognition performance for talkers M1 and F1. For subjects S1 - S3, performance with F1 was better than that with M1; for subjects S4 - S9, performance was better with M1 than with F1. The error bars show one standard deviation, and the asterisks show significantly different performance between the two talkers (p<0.05).

Because the talker that produced better speech understanding differed among individual subjects, the most relevant comparison between unprocessed and spectrally transformed speech is in terms of the "Better" and "Worse" talker. Ideally, after spectral transformation, performance with the Worse talker would be equivalent to that with the Better talker (and vice versa). For convenience, the term "Worse-to-Better" refers to the transformation of the Worse talker toward the Better talker (and vice versa). Table 3-1 compares performance with unprocessed speech to that with spectrally normalized speech in terms of the Better and Worse talkers. Note that individual subject data were analyzed with t-tests, using raw performance data from multiple sentence lists; mean performance data (across subjects) were analyzed using mean data from each subject (across sentence lists). While the Better talker differed among individual subjects, the mean baseline performance difference between the Better and Worse talkers was 8.1 percentage points; this difference was significant [one-way RM ANOVA: F(1,8) = 10.164, p=0.013].
For the Better-to-Worse transformation, mean performance was significantly poorer than that with the Better talker [one-way RM ANOVA: F(1,8) = 5.558, p = 0.046]. The difference was significant for subjects S1, S2, and S4. There was no significant difference in mean performance between the Better-to-Worse transformation and the Worse talker [one-way RM ANOVA: F(1,8) = 0.308, p = 0.594]; however, performance differed significantly for subjects S1, S3, and S4. For the Worse-to-Better transformation, mean performance remained significantly poorer than that with the Better talker [one-way RM ANOVA: F(1,8) = 6.624, p = 0.033]; the difference was significant only for subject S1. Interestingly, performance with the Worse-to-Better transformation was significantly better than that with the Worse talker [one-way RM ANOVA: F(1,8) = 12.967, p = 0.007], although the difference was not significant for any individual subject.

Table 3-1: Performance difference between unprocessed source talkers (i.e., M1 vs. F1), and between spectrally-normalized and unprocessed talkers. Note that because the performance with talkers M1 and F1 differed among individual subjects, comparisons are made in terms of the "Better" and "Worse" talker. All values are performance differences in percentage points; bold numbers in the original table indicated significant differences in performance across different sentence lists (p<0.05).

Subject               Better  Worse   Better     Better-to-Worse          Worse-to-Better
                      talker  talker  vs. Worse  vs. Better  vs. Worse    vs. Better  vs. Worse
S1                    F1      M1      20.7       -10.2       10.5         -15.5       5.2
S2                    F1      M1      16.1       -11.6       4.5          -7.8        8.3
S3                    F1      M1      12.4       5.3         17.7         -10.6       1.7
S4                    M1      F1      0.3        -21.6       -21.3        -0.5        -0.2
S5                    M1      F1      0.4        -0.7        -0.3         1.7         2.1
S6                    M1      F1      0.5        -0.2        0.3          0.8         1.3
S7                    M1      F1      2.0        -3.2        -1.2         -0.4        1.6
S8                    M1      F1      9.4        -4.9        4.4          -6.0        3.3
S9                    M1      F1      11.6       -8.7        2.9          -6.8        4.8
AVG (all 9 subjects)                  8.1        -6.2        2.0          -5.0        3.1
AVG (S1, S2, S3, S8)                  14.7       -5.4        9.3          -10.0       4.6

Figure 3-3 shows that only 4 of the 9 subjects exhibited significant differences in intelligibility between the F1 and M1 talkers. This may have been due to ceiling performance effects in some subjects (S5, S6 and S7). Further analysis was performed using only the subjects whose baseline performance was significantly
For some subjects, performance with the transformed talkers did not always follow this general trend. For example, for subject S3, performance improved when the Better talker was transformed to the Worse talker. Conversely, performance slightly declined for subject S4 when the Worse talker was transformed to the Better talker. This adverse effect may have been because the spectral normalization could not completely compensate for all the acoustic/perceptual 51 differences between the source and target talkers. Alternatively, the transformation may have resulted in “talkers” that were not included or sampled in the test materials. Sensitivity to the spectral normalization algorithm also varied among subjects. For example, there was only a 2 percentage point difference in performance among the four conditions for subjects S5 and S6. For subject S4, performance with the Better-to-Worse transformation was 22 percentage points poorer than that with the Better talker; there was only ~1 percentage point difference in performance among the remaining three talker conditions. In general, the effect of spectral normalization was strongest for subjects whose performance differed substantially between M1 and F1 (i.e., subjects S1, S2, S3, S8 and S9). In terms of mean performance, note that the Better-to-Worse transformation resulted in a decrement of ~6 percentage points, while the Worse-to-Better transformation resulted in an improvement of ~3 percentage points. While this is a relatively small difference in terms of effect size, there are three possible explanations for this bi-directional imbalance. First, the mean performance deficit with the Better-to-Worse transformation may have been primarily due to the large drop in performance for subject S4. Second, artifacts associated with the spectral normalization algorithm (e.g., spectral discontinuities, unnaturalness) may have limited any improvements in performance with the Worse- to-Better transformation and may have contributed more strongly to “worsening” performance with the Better-to-Worse transformation. Third, ceiling effects may have limited the degree of improvement with spectral normalization, but not the degree of deterioration in performance. 52 The results of Experiment 1 demonstrated that spectral normalization, using a GMM model trained with relatively few stimuli, significantly improved speech understanding with less-intelligible talkers, especially for CI users whose speech performance was sensitive to different talkers. The objective acoustic measures using the CI simulation showed that spectral normalization was efficient in transforming the source speech toward the target speech, regardless of the number of spectral channels. Although some CI subjects were more sensitive than others to talker differences and the subsequent spectral normalization, the perceptual measures showed relatively small effects, on average. The modest effects may have been due to the small number and/or quality of the test talkers (1 male and 1 female) who may not have elicited sufficient baseline talker sensitivity effects. Also, ceiling performance effects associated with sentence recognition may have limited the effects of spectral normalization. 3.4. Experiment 2: Effect with simulated talkers In Experiment 1, the speech materials differed between the two talkers not only in terms of spectral cues, but also in terms of temporal cues, even for the same sentences. 
Previous studies have shown that speech intelligibility may be influenced by speaking rate (Kurdziel et al., 1976; Miller and Volaitis, 1989; Gordon-Salant and Fitzgibbons, 1997). Cross-talker temporal variability such as voice-onset-time may also affect speech recognition (Allen et al., 2003). Temporal cue effects have been observed in NH listeners (Miller and Volaitis, 1989), elderly listeners (Kurdziel et 53 al., 1976), HI listeners (Gordon-Salant and Fitzgibbons, 1997; Kirk et al., 1997) and CI users (Liu et al., 2004). In Experiment 1, the spectral normalization algorithm was intended to modify only the spectral information. However, it is possible that modified spectral information may interact with temporal information (which was not modified) and thereby affect speech understanding. It would be preferable to test the spectral normalization algorithm using different talker speech materials that have been normalized in terms of temporal information. It is very difficult to constrain temporal cues (i.e., speaking rate, total duration, emphasis, etc.) across different talkers with naturally produced speech materials. Therefore, in Experiment 2, the different talker conditions were simulated by adjusting the voice pitch and vocal tract characteristics of a reference talker (F1). 3.4.1. Methods A. Subjects The same 9 CI subjects from Experiment 1 also participated in Experiment 2. Four NH subjects (2 men, 2 women) also participated in Experiment 2, and served as a control group. All NH subjects had sensitivity thresholds better than 15dB hearing level for audiometric test frequencies from 250 to 8000 Hz; all were native speakers of American English. Informed consent from each subject and local IRB approval were obtained for the study. 54 B. Stimuli and Speech Processing In Experiment 2, talker differences were simulated by systematically altering the acoustic characteristics of a reference talker (F1), while preserving speaking rate (i.e., duration of utterances) and prosodic characteristics, in the form of relative changes in F0. The IEEE sentences produced by talker F1 from Experiment 1 were used as the reference. To simulate different talkers, sentences were altered by using the “pitch-stretch” processing feature in Cool Edit Pro (Version 2.0; Syntrillium Software). The pitch-stretching algorithm changed the fundamental frequency of the original speech and hence the spectral envelope, mimicking different vocal tract configurations. Each sentence produced by talker F1 was processed using six different pitch-stretch ratios: 0.6, 0.8, 1.0, 1.2, 1.4, and 1.6. Higher ratios resulted in lower-pitched speech, and smaller ratios resulted in higher-pitched speech; when the ratio was equal to 1.0, there was no pitch shift (i.e., the original speech tokens from talker F1). For example, the average F0 for talker F1 across all sentences was 185.10 Hz; when pitch-stretched by a ratio of 1.6, the average F0 was 116.96 Hz (i.e., 185.10/1.6=115.69 Hz), simulating a male voice. Note that because the reference talker F1 was female, the minimum ratio was 0.6 in this experiment, as lesser values produced overly high F0 values. Thus, ratios of 0.6, 0.8 and 1.0 were intended to simulate female talkers, while ratios of 1.2, 1.4 and 1.6 were intended to simulate male talkers. Note that as the pitch-stretching ratio deviated from 1.0, speech increasingly sounded less natural. 
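The Cool Edit Pro pitch-stretch operation itself is not reproduced here; as a rough, hedged stand-in, a similar duration-preserving shift of both F0 and the spectral envelope can be sketched with librosa's phase-vocoder-based pitch_shift. The conversion from the pitch-stretch ratios used above to semitone steps, the function name simulate_talker, and the example file name are illustrative assumptions rather than the processing actually used in this study.

```python
import numpy as np
import librosa

def simulate_talker(y, sr, stretch_ratio):
    """Shift F0 (and the spectral envelope with it) by a factor of
    1/stretch_ratio while keeping duration unchanged; ratios > 1 lower
    the voice and ratios < 1 raise it, as in the transformations above."""
    n_steps = -12.0 * np.log2(stretch_ratio)   # semitone equivalent of the ratio
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Example: generate the six talker conditions from a reference recording
# (the file name is hypothetical).
y, sr = librosa.load("f1_reference_sentence.wav", sr=16000)
talkers = {f"T{r:.1f}": simulate_talker(y, sr, r)
           for r in (0.6, 0.8, 1.0, 1.2, 1.4, 1.6)}
```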
For reference purposes, the transformations associated with the different pitch-stretching ratios are labeled T0.6, T0.8, T1.0 (unprocessed speech by talker F1), T1.2, T1.4 and T1.6.

After generating the different pitch-shift transformations, the spectral normalization algorithm was applied, with T1.0 as the target talker. The same 100 sentences used in Experiment 1 were used to train the GMM model, while the entire database (720 sentences) was used for testing. Spectral normalization was performed exactly as in Experiment 1. The stimuli in Experiment 2 included 11 sets of "talkers": one source talker (T1.0), five pitch-shift transformations (T0.6, T0.8, T1.2, T1.4, and T1.6) and five spectral transformations (T0.6-to-T1.0, T0.8-to-T1.0, T1.2-to-T1.0, T1.4-to-T1.0, and T1.6-to-T1.0). For the spectral transformations, T0.6-to-T1.0 and T0.8-to-T1.0 represented female-to-female transformations, while T1.2-to-T1.0, T1.4-to-T1.0, and T1.6-to-T1.0 represented female-to-male transformations.

Table 3-2 shows the pitch and formant analysis for the pitch-shifted and spectrally transformed speech. Table 3-2 indicates that voice pitch was well scaled by the pitch-stretching operation, and was maintained by the spectral transformation. While formant frequencies were not maintained or scaled by the pitch-stretching algorithm (relative to the source speech T1.0), spectral normalization largely restored the formant frequencies to those of the source speech. The mean difference across all formant frequencies between the target speech (i.e., T1.0) and the spectral transformations was only 118 Hz, with a standard deviation of 199 Hz.

Table 3-2: Pitch and formant analysis for the pitch-shift and spectral transformations in Experiment 2. The target F0 for the pitch-shift transformations was scaled according to the pitch-stretching ratio used for processing; the target F0 for the spectral transformations refers to the measured F0 values after pitch-stretching. The F0s were measured with the software Wavesurfer 1.8.5. Formant frequencies were estimated for the vowel /I/ from the sentence "Glue the sheet to the dark blue background". Note that reference talker T1.0 was F1 from Experiment 1. All values are in Hz.

Transformation  Condition     Target F0      Measured F0  Measured F1  Measured F2  Measured F3
Pitch-shift     T0.6          185/0.6 = 308  298          326          582          4040
                T0.8          185/0.8 = 231  228          422          2051         3046
                T1.0          185/1.0 = 185  185          344          2440         2859
                T1.2          185/1.2 = 154  155          290          2062         2405
                T1.4          185/1.4 = 132  133          248          1816         2232
                T1.6          185/1.6 = 116  117          214          1610         2267
Spectral        T0.6-to-T1.0  298            300          291          2497         2996
                T0.8-to-T1.0  228            230          290          2408         2871
                T1.2-to-T1.0  155            157          279          2366         2750
                T1.4-to-T1.0  133            135          274          2151         2744
                T1.6-to-T1.0  117            118          278          1750         2498

Figure 3-4 shows example waveforms for the sentence "Glue the sheet to the dark blue background", as produced by source talker T1.0 and two pitch-shift transformations (T0.6, T1.6); note that the duration and modulation depth are nearly identical across the waveforms. The top panel of Figure 3-5 shows the spectral envelope for the speech segment /IY/ from the word "sheet;" note the relative stretch in the spectral envelope for the two pitch-shift transformations.
The bottom panel of Figure 3-5 shows the spectral envelope for the same speech segment, as produced by the two spectral transformations (T0.6-to-T1.0, T1.6-to-T1.0) and reference talker T1.0; note that the spectral envelopes are quite similar for the two spectral transformations and T1.0.

Figure 3-4: Waveforms for the sentence "Glue the sheet to the dark blue background." Top panel: pitch-shift transformation T0.6 (upward pitch shift). Middle panel: reference talker T1.0 (unprocessed speech from talker F1). Bottom panel: pitch-shift transformation T1.6 (downward pitch shift).

Figure 3-5: Spectral envelopes for different processing conditions in Experiment 2. Top panel: spectral envelopes for reference talker T1.0 and pitch-shift transformations T0.6 and T1.6. Bottom panel: spectral envelopes for T1.0 and spectral transformations T0.6-to-T1.0 and T1.6-to-T1.0.

C. Procedure

For all talker conditions, IEEE sentence recognition was measured using the same procedures described in Experiment 1. In addition to sentence recognition, discriminability among the pitch-shift transformations was also measured in NH subjects to verify whether the pitch-stretching algorithm produced different "talker identities." During the discriminability test, subjects were presented with a sentence produced by T1.0 and a different sentence produced by one of the pitch-shift transformations. NH subjects were asked whether the sentences were produced by the same or different talkers. Each of the five pitch-shift transformations was compared to talker T1.0 six times. The presentation order of the processed and unprocessed sentences was randomized, and the test sentences were randomly selected (without replacement) from the test materials. After the discriminability test, subjective quality ratings were obtained from the same NH subjects. Subjects were asked "How would you rate the overall speech quality on a scale from 1 to 10, with larger values indicating better overall speech quality?" Subjective ratings were anchored to T1.0, which was given the highest rating (10, on a 10-point scale). For each pitch-shift and spectral transformation, speech quality ratings were averaged across subjects.

3.4.2. Results and discussion

In terms of stimulus discriminability, for three out of the four NH subjects, all of the pitch-shift transformations sounded like different talkers, relative to the reference talker T1.0 (i.e., 100% discrimination across all six trials). For the remaining NH subject, pitch-shift transformations T1.2, T1.4 and T1.6 were easily discriminated from talker T1.0 (100% discrimination across all six trials); however, T0.8 and T0.6 were judged to be the same as T1.0 in four out of six trials. Thus, the pitch-shift transformations were judged, for the most part, to represent different talker identities.

Figure 3-6 shows the overall speech quality ratings for the 4 NH subjects, with and without spectral normalization, as a function of pitch-shift transformation. A two-way RM ANOVA showed that overall speech quality ratings were significantly affected by pitch-stretching [F(5,15)=15.352, p<0.001].
While spectral normalization also seemed to affect quality ratings, the effect failed to reach significance [F(1,15)=9.517, p=0.054]. Except for pitch-shift transformation T1.2, all the pitch-shift and spectral transformations produced significantly lower ratings, relative to reference talker T1.0 (t-test: p<0.05). Quality ratings were generally lower after spectral transformation; the decrements were only significant for T1.2-to-T1.0 (t-test: p<0.05). This suggests that pitch shifting may have introduced one set of artifacts, and the spectral normalization (intended to compensate only for spectral differences) may have introduced a second set of artifacts. Artifacts associated with 61 spectral normalization may have added to the pitch-shift artifacts, further reducing the speech quality ratings. Figure 3-7 shows sentence recognition performance for CI and NH listeners (circle and diamond symbols, respectively), with and without spectral normalization (open and filled symbols, respectively), as a function of pitch-shift transformation. For NH subjects, sentence recognition remained nearly perfect for all conditions, except for T0.6-to-T1.0 (94% correct, significantly lower than performance with T0.6). CI subjects were very sensitive to the different pitch-shift and spectral transformations. Mean peak performance was 84% correct with T1.0. Performance with the pitch-shift transformations sharply declined as the shift ratios became more extreme. With T0.6, mean performance was only 20% correct, and with T1.6, mean performance was only 31% correct. Performance with spectral normalization also declined for the more extreme pitch-shift transformations, but the decline was less steep than without spectral normalization. A two-way repeated measures ANOVA showed that performance was significantly affected by pitch-stretching (F(5,40)=124.014, p<0.001) and spectral normalization (F(1,40)= 64.770, p<0.001). There was a significant interaction between the pitch-shift and spectral transformations (F(5,40)=21.557, p<0.001). Post-hoc Bonferroni t-tests showed that performance with T1.0 was significantly better than that with T0.6, T0.8, T1.4, and T1.6, before or after spectral normalization (p<0.05). Post-hoc Bonferroni t-tests also showed that spectral normalization significantly improved performance for T0.6, T1.4 and T1.6 (p<0.001). In contrast, performance with T1.2 was not significantly different from that with T1.0, with (p=0.271) or without spectral normalization (p=0.078). Performance with T0.8 significantly declined after spectral normalization (p=0.012). The failure of spectral normalization to significantly improve performance with T0.8 and T1.2 may have been due to the relatively small differences in baseline performance between T0.8, T1.0 and T1.2, before normalization. The potential benefits of spectral normalization may not have been large enough to overcome processing artifacts associated with the speech modification and synthesis. Pitch-shift transformation T0.6 T0.8 T1.0 T1.2 T1.4 T1.6 Overall speech quality ratings 0 2 4 6 8 10 without spectral normalization with spectral normalization * Figure 3-6: NH subjects’ overall speech quality ratings for the pitch-shift transformations, with (open symbols) and without (filled symbols) spectral normalization. The error bars show one standard deviation, and the asterisks indicate significantly different ratings with spectral normalization (p<0.05). 
Note that source talker T1.0 (unprocessed speech from talker F1) was used to anchor the subjective quality ratings. 62 Pitch-shift transformation T0.6 T0.8 T1.0 T1.2 T1.4 T1.6 Percent correct 0 20 40 60 80 100 CI without spectral normalization CI with spectral normalization NH without spectral normalization NH with spectral normalization * * * * * Figure 3-7: Sentence recognition performance for NH and CI subjects, with (open symbols) and without (filled symbols) spectral transformation, as a function of pitch- shift transformations. The error bars show one standard deviation, and the asterisks indicate significantly different performance after spectral transformation (p<0.05). The results suggest that, despite potential signal processing artifacts, spectral normalization may benefit CI users. It is important to note that, as suggested from Figure 3-6, the artifacts associated with spectral normalization may have added to the artifacts associated with the pitch-stretching algorithm. Note that the spectral normalization was intended to modify the spectral envelope toward that of the reference talker, not to reduce the artifacts associated with the pitch-stretching processing. Ultimately, spectral normalization significantly improved CI users’ speech understanding with the pitch-shift transformations. This implies that CI 63 64 listeners may not have been sensitive to these processing artifacts (due to the reduced spectral resolution), or that CI listeners were able to ignore these artifacts and receive the benefits of spectral normalization. Given the probable artifacts associated with pitch stretching and spectral normalization, it is possible that some learning may have occurred during testing. Note that the test order was randomized within and across subjects for the measurements with pitch-shift and spectral transformations. A two-way ANOVA was conducted for each individual subject (with the pitch-stretch ratio and test session as factors), for both the baseline and spectral normalization conditions. While there were significant effects for the pitch-stretch ratio (both with and without spectral normalization), there were no significant effects for test session, for any subject in any condition (p>0.05). Note also that the effects of spectral normalization were measured acutely. It is possible that long-term experience or explicit training might have influenced baseline performance and/or further enhanced the benefit of spectral normalization. 3.5. General discussion The results of the present study demonstrate that the proposed spectral normalization algorithm can significantly improve CI users’ speech understanding with less-intelligible talkers. In Experiment 1, spectral normalization provided the greatest benefit to CI subjects who exhibited the greatest talker sensitivity. In Experiment 2, a pitch-stretching algorithm was used to simulate different talkers 65 while keeping temporal cues (e.g., speaking rate, overall sentence duration, temporal modulation depth) constant. While pitch-shifting and subsequent spectral normalization produced some undesirable processing artifacts, CI users’ speech recognition improved with the spectral normalization algorithm. Taken together, the results suggest that this spectral normalization approach may benefit CI users’ understanding of speech produced by less-intelligible talkers. 
However, some considerations should be kept in mind when interpreting these results and when designing an effective spectral normalization algorithm for real-time speech processing. In Experiment 1, only 4 of the 9 CI subjects exhibited significantly better performance with one of the two test talkers (S1, S2, and S3 with talker F1; S8 with talker M1). The best-performing subjects exhibited no significant difference in performance with talkers F1 or M1. It is possible that a greater number of source talkers would have elicited stronger talker sensitivity effects in all subjects, albeit with different talkers for each subject. It is also possible that interfering noise may have elicited more talker sensitivity across subjects. Note that both F1 and M1 produced the IEEE stimuli in the manner of "clear" speech, i.e., with a relatively slow speaking rate, careful articulation, etc. Thus, the normalization algorithm largely addressed spectral envelope differences between talkers, which might have to be more extreme to produce talker sensitivity effects. Temporal differences between talkers (e.g., speaking rate, overall duration, emphasis, etc.) may produce equal if not greater talker sensitivity effects. An effective speaker normalization algorithm may also need to compensate for temporal differences between talkers, as well as spectral differences.

Experiment 2 was designed to factor out possible contributions of varying temporal information and to expand the range of talker characteristics presented in Experiment 1. While the pitch-stretching algorithm may not have been the ideal method to create different talker characteristics, it is difficult to control temporal variations among different talkers. One would have to record a very large database to include the range of spectral and temporal characteristics encountered in everyday listening experience. Alternatively, judicious amounts of duration adjustment (via cutting and splicing or duplication of speech segments) might offer some experimental control, albeit with another set of possible signal processing artifacts.

In Experiment 2, the pitch-shift transformations significantly reduced performance relative to the reference source talker F1. The results were in agreement with those from Experiment 1, in that spectral normalization generally improved speech understanding with less-intelligible talkers. However, it should be noted that the pitch-stretching algorithm simulated only some of the acoustic characteristics that may differ between real talkers. Also, NH subjects' overall speech quality ratings suggest that the spectral normalization algorithm may introduce undesirable artifacts when talker differences are sufficiently extreme. While CI listeners generally benefited from spectral normalization in Experiment 2, it is unclear whether the proposed spectral normalization algorithm would sufficiently compensate for differences between real talkers. Further testing with a more diverse group of real source talkers is needed to verify the feasibility of the proposed technique.

Individual CI subjects' talker sensitivity may have also contributed to the pattern of results in Experiments 1 and 2. For example, it might be expected that subjects who performed better with talker F1 in Experiment 1 would also perform better with the upwardly-shifted transformations T0.6 and T0.8 in Experiment 2, as these pitch shifts were smaller relative to F1 than to M1.
It might also be expected that these same subjects would benefit more greatly from the spectral transformations T1.2-to-T1.0, T1.4-to-T1.0 and T1.6-to-T1.0. Data were compared between experiments to see whether individual subjects' performance in Experiment 1 was reflected in Experiment 2. In the first analysis, subjects were divided into two groups: Group 1 (S1, S2, and S3; better performance with F1) and Group 2 (S4-S9; better performance with M1). A two-way ANOVA, with subject group (Group 1 or 2) and pitch-shift transformation (T0.6 or T0.8) as factors, showed no significant effect for subject group [F(1,7) = 0.0584, p=0.816]; post-hoc Bonferroni t-tests showed no significant effect for subject group, for either T0.6 (p=0.987) or T0.8 (p=0.643). One issue with this analysis is that for 5 out of the 6 subjects in Group 2, there was no significant difference in performance between M1 and F1.

Individual subject performance with talker M1 or F1 in Experiment 1 was also compared to that with the pitch-shift transformations in Experiment 2. Table 3-3 shows the r-squared and significance values for the linear regression analysis across different subjects. Subject performance with the F1 talker in Experiment 1 was fairly well correlated with performance for T0.8 and T1.2 from Experiment 2 (both were relatively close to the original F1 talker). Similarly, subject performance with the M1 talker in Experiment 1 was fairly well correlated with performance with T1.6 from Experiment 2 (the most "male" of the pitch-shift transformations). However, one issue with this analysis is that the better performers in Experiment 1 performed equally well with the M1 and F1 source talkers. Thus, it is difficult to separate talker sensitivity from overall performance with this regression analysis. In the present study, it is difficult to know how talker sensitivity for the top-performing subjects may have been limited by performance ceiling effects. Again, sufficiently different talkers or difficult listening conditions might allow talker sensitivity effects to emerge in even good CI users. It may be that better CI performers are less sensitive to talker differences, and therefore benefit less from spectral normalization. For these CI users, the acoustic input may be better matched to the electrode locations in the cochlea, or other patient-related factors may contribute to the better overall performance.

Table 3-3: r-squared and significance values for linear regressions performed between the unprocessed talkers from Experiment 1 (M1 and F1) and the pitch-shift transformations from Experiment 2 (T0.6, T0.8, T1.2, T1.4, T1.6).

Talker (Exp. 1)    T0.6                  T0.8                  T1.2                  T1.4                  T1.6
M1                 r2=0.409, p=0.064     r2=0.532, p=0.026     r2=0.718, p=0.004     r2=0.521, p=0.028     r2=0.635, p=0.010
F1                 r2=0.393, p=0.071     r2=0.725, p=0.004     r2=0.688, p=0.006     r2=0.470, p=0.041     r2=0.358, p=0.089

While the results from these two experiments are promising, special care is needed when designing a real-time normalization algorithm that will be robust to ambient noise, interfering speech and the wide variety of talker characteristics found in everyday listening environments. A standard set of talkers and listening conditions might help to quickly identify a reference talker (or perhaps even several reference talkers) that could be used in the algorithm. Ideally, the algorithm would be continuously updated as new talkers and listening conditions are introduced.
Finally, CI patients would likely experience a period of adaptation to such a normalization algorithm (and any adverse processing artifacts). In the present study, the effects of spectral normalization were acutely measured, which may have underestimated the benefits available after long-term experience.

3.6. Conclusions

The present study showed substantial differences in cross-talker intelligibility in CI users' speech recognition. A spectral normalization algorithm was used to compensate for acoustic differences between the less-intelligible and more-intelligible speech patterns from different talkers. In Experiment 1, spectral normalization was shown to significantly improve overall CI speech performance; however, some CI users were more sensitive than others to talker differences and the subsequent spectral normalization. In Experiment 2, the spectral normalization algorithm was applied to "simulated talkers," in which the fundamental frequency and vocal tract characteristics were modified while preserving temporal information such as speaking rate. Compared to NH listeners, CI users' speech understanding was more sensitive to the pitch-shift transformations and subsequent spectral normalization. The results suggest that spectral normalization, as a front end to CI speech processing, may help CI users maintain perceptual constancy when presented with multiple and/or less-intelligible talkers.

Chapter 4

Speech intelligibility modeling: challenges and proposal

This chapter summarizes the inter-subject performance differences and cross-talker intelligibility differences observed in the speech enhancement studies. The factors relating to these distinctive speech perception patterns are discussed, and the phenomenon is then examined from a speech intelligibility modeling point of view. An outline of the investigation of the intelligibility model is given.

4.1 Introduction

In Chapters 2 and 3, a speech enhancement framework was proposed and realized for scenarios in which acoustic information is lost or highly variable across talkers. The speech enhancement methods significantly improved speech perception in CI users. Yet substantial inter-subject performance differences were observed. These differences have three aspects.

1) Different subjects achieved different levels of benefit from the speech enhancement techniques. For example, with bandwidth-extended speech, subject S2 achieved a 9.3% improvement while subject S8 instead showed a 2.6% decrease in performance. As discussed in Chapter 2, this may be because a) different subjects had different performance differences between wideband speech and narrowband telephone speech, and b) different subjects had different sensitivity to the artifacts introduced by the series of sophisticated speech processing and synthesis operations.

2) Different subjects showed substantially different performance patterns when acoustic information was lost or varied across talkers. Figure 4-1 summarizes speech recognition scores in quiet for naturally recorded stimuli from two different talkers at two different bandwidths. Only the subjects who participated in both the bandwidth extension study (Chapter 2) and the talker normalization study (Chapter 3) are shown in Figure 4-1. In the average pattern across subjects, the female talker outperformed the male talker and wideband speech outperformed narrowband speech. However, the detailed patterns for individual subjects were very different.
For example, subject S3 showed exactly the same pattern as the average, but at a much lower performance level; subject S2 was not particularly sensitive to talker differences with narrowband speech. In contrast to subject S2, subjects S5 and S6 were sensitive to bandwidth but not to talker with wideband speech (which may be related to a ceiling effect in these two subjects). For subjects S8 and S9, performance with wideband male speech was much higher than in the other three speech conditions.

Figure 4-1: Speech recognition scores with two different talkers at two different speech bandwidths (conditions: 11025 male, 11025 female, 3400 male, 3400 female). The error bars show one standard deviation.

3) The magnitude of the performance difference across talkers, and the talker preference itself, varied across subjects. Although only two talkers were involved in the speech enhancement study, a substantial talker preference was observed in both narrowband and wideband speech, even though the speech materials were clearly articulated and speech recognition was tested in quiet. It was suspected that the talker preference may be even more complex, and the magnitude of the performance difference even larger, when CI users are exposed to more talkers in conversational speech. It was not clear why certain CI users achieve better perceptual constancy than others, or why certain CI users prefer one talker over another. Rather than arising solely from objective acoustic differences, this profound performance variability across subjects may be due to the intrinsically complex CI device settings and how well the input acoustic information matches those settings.

4.2 Factors relating to inter-subject performance difference

Although few studies have directly examined inter-subject performance differences (e.g., Green et al. 2007), ample research indicates that such differences appear almost everywhere once perceptual studies are involved and the standard deviation of data averaged across subjects is plotted (e.g., Fishman et al., 1997; Friesen et al., 2001; Fu, 1997). Anecdotally, the best CI subjects are able to communicate freely without visual cues and can even appreciate some simple music, while the poorest CI subjects may not obtain open-set speech perception at all. The large performance differences across CI users may generally stem from different deafness profiles, personal motivation, language experience, cochlear implant devices, and speech processing strategies. Yet little is truly understood about what underlies the inter-subject performance difference. From a signal processing point of view, the factors suspected to relate to inter-subject performance differences are as follows:

1) Acoustic-to-electric mapping: The normal auditory system is capable of processing a wide range of acoustic information spanning at least 100 dB. The cochlear implant, as a prosthetic device intended mainly to restore speech recognition, is designed primarily to transmit speech information. Although conversational speech generally spans an acoustic range of 40 to 60 dB, the acoustic range that the CI device is designed to deliver is much smaller, commonly assumed to be 30 dB in the Nucleus device (Cochlear, 1999). This range has been suspected to be less than optimal for coding the speech dynamic range (e.g., Zeng et al. 2002).
Furthermore, to accommodate the much smaller electrical dynamic range of the electrode array, the delivered acoustic information has to be compressed. To accommodate individual users' listening preferences, the cochlear implant device usually supplies a flexible program with which audiologists and CI users can fine-tune the non-linear compression from acoustic input to electric stimulation. In the Nucleus device, this fine-tuning may be realized through control of the sensitivity level, base level, and Q value (Cochlear, 1999). Such customization of the device may result in different nonlinear acoustic-to-electric mappings, which have been shown to affect speech recognition (e.g., Zeng and Galvin 1999).

2) Acoustic channel partition: The stimulation of a given electrode is designed to be associated with a particular acoustic band. The partition of the acoustic information across the electrode array controls "what" acoustic information is delivered "where". With such a limited number of electrodes available to deliver the full range of acoustic information, a mismatch between the acoustic information and the auditory system is inevitably introduced. Depending on individual CI users' neural survival patterns and the effectiveness of the electric stimulation, the CI device supplies different mapping tables to divide the acoustic information across channels. Since different acoustic channels contribute to speech intelligibility with different weights, such information may also relate to inter-subject performance differences.

3) Stimulation rate: The stimulation rate determines how fast the speech processor analyzes the acoustic information and updates the electric stimulation. With the advancement of DSP design, CI devices are able to deliver higher stimulation rates to each electrode. Different CI devices support stimulation rates ranging from 250 pulses per second (pps) to over 1200 pps. The higher the stimulation rate, the finer the temporal resolution that can be delivered. Yet the effect of stimulation rate on CI speech recognition is mixed, given that the results are highly subject-dependent (e.g., Vandali et al. 2000, Holden et al. 2002). When considering inter-subject performance differences, this factor should also be taken into account in the speech intelligibility model.

4) Electrical dynamic range: The electrical dynamic range for CI patients may range from 3 to 4 dB in the worst case to 20 to 30 dB in the best case (Loizou et al. 2000). The electrical dynamic range is inter-related with the electrode stimulation mode, stimulation rate, and other factors. Acoustic studies of its effect on speech recognition suggest that "it is more likely for an implant patient with a large dynamic range to obtain high scores on vowel recognition than for an implant patient with a small dynamic range" (Loizou et al. 2000), although its effect on vowel, consonant, and sentence recognition differed. The electrical dynamic range, together with the acoustic-to-electric compression, largely governs the loudness growth in the electric stimulation and thereby affects speech perception.

5) Electrode confusion patterns: Once the acoustic information has been delivered to the electrodes, the electric stimulation is expected to evoke a sensation corresponding to the acoustic information mapped to those electrodes.
Unfortunately, due to the irregular distribution of the surviving hair cells and auditory fibers, the targeted information may leak to neighboring electrodes and generate virtual or misleading information that sounds as if it originated from other electrodes. In such cases, the acoustic information actually delivered is a combination of information from multiple electrodes. Nelson et al. (1995) studied electrode ranking of "place pitch" and speech recognition, and observed that individual differences on the pitch ranking task were substantial. The correlations between place-pitch sensitivity and transmitted speech information were as high as 0.71. To interpret the inter-subject performance difference, individual subjects' dynamic ranges may need to be taken into account in the speech intelligibility model.

6) Intensity resolution: Within the electrical dynamic range, the difference limen for detecting changes in electric current also varies across subjects, depending on the CI stimulation strategy (e.g., stimulation level, pulse rate) and individual differences in patterns of neural survival. The cumulative number of discriminable intensity steps across the dynamic range in electric hearing has been reported to range from as few as 6.6 to as many as 45.2 (Nelson et al., 1996). Loizou et al. (2000) studied the effect of intensity resolution on speech recognition by quantizing the channel amplitudes into 2, 4, 8, 16, and 32 steps. It was found that eight steps within the dynamic range are sufficient for reaching asymptotic performance in consonant recognition, and that relatively fine intensity resolution is only needed when the spectral resolution is poor. Yet it is unclear whether this rule applies to vowel and sentence recognition, or whether it interacts with other psychoacoustic effects in CI users, especially when the inter-subject performance difference is to be modeled.

4.3 The proposed general speech intelligibility model

To predict speech intelligibility in cochlear implants, a noise-band vocoder, which processes speech in a fashion similar to the CI signal processor, is typically used to render the acoustic signals. Based on such simulation stimuli, speech intelligibility prediction has mainly been studied using articulation-index (AI) approaches (e.g., Goldsworthy 2005), which are typically applied to predicting speech intelligibility in noise. Applying such AI-type intelligibility predictors to interpret the inter-subject performance difference raises two major difficulties: 1) since AI-type methods need a "good" reference signal (typically speech in quiet), it is very difficult to choose a reference when the inter-subject performance difference is itself measured in quiet; and 2) it is very difficult to incorporate into the model the psychoacoustic responses that reflect the inter-subject performance difference. To address the inter-subject performance difference, this thesis proposes a speech intelligibility predictor based on acoustic features that can be modulated according to the unique psychoacoustic responses of individual CI users. On one hand, such a speech intelligibility predictor is expected to predict average CI performance across subjects with regard to the essential acoustic elements delivered through the CI device. On the other hand, it is expected to be able to predict the inter-subject performance difference.
Based on these considerations, the speech intelligibility model is formulated as an acoustic distance that may be modulated by factors that introduce inter-subject performance differences. In this section, the general speech intelligibility model (without customization for psychoacoustic responses) is introduced.

Suppose two speech utterances are expressed as sequences of feature vectors:

$A = a_1, a_2, \ldots, a_i, \ldots, a_I$, $\quad B = b_1, b_2, \ldots, b_j, \ldots, b_J$    (4-1)

A warping function that maps the time axis of pattern $A$ onto that of pattern $B$ is described by a sequence of points $c = (i, j)$:

$F = [c(1), c(2), \ldots, c(k), \ldots, c(K)]$, where $c(k) = (i(k), j(k))$    (4-2)

When there is no timing difference between the patterns, the warping function coincides with the diagonal line $j = i$; it deviates further from the diagonal as the timing difference grows. The Euclidean distance is used to measure the difference between two feature vectors $a_i$ and $b_j$:

$d(c) = d(i, j) = \|a_i - b_j\|$    (4-3)

The time-normalized distance between speech patterns $A$ and $B$ is defined as (Sakoe and Chiba 1978):

$D(A, B) = \min_{F} \left[ \frac{\sum_{k=1}^{K} d(c(k)) \cdot w(k)}{\sum_{k=1}^{K} w(k)} \right]$    (4-4)

where the numerator $\sum_{k=1}^{K} d(c(k)) \cdot w(k)$ is the weighted summation of distances along warping function $F$, the denominator $\sum_{k=1}^{K} w(k)$ compensates for $K$ (the number of points on the warping function $F$), and $w(k)$ is a non-negative path-weighting coefficient. The distance $D(A, B)$ is the minimal distance among all possible warping functions between speech patterns $A$ and $B$. When speech patterns $A$ and $B$ are the same (i.e., extracted from the same token), the distance $D(A, A)$ is zero by the above computation; the distance increases as the two speech patterns become more different.

For tokens produced by talker $t$, suppose $M_t(i, j)$ is the acoustic distance between tokens $i$ and $j$, so that $M_t$ is an $n \times n$ matrix (where $n$ is the number of tokens) describing the acoustic distances between vowel pairs. The mean acoustic vowel space for a multi-talker vowel set ($T$ is the number of talkers) is defined as:

$S = \frac{1}{T \cdot n \cdot n} \sum_{t=1}^{T} \sum_{i=1}^{n} \sum_{j=1}^{n} M_t(i, j)$    (4-5)

Eq. 4-5 reflects the mean acoustic space; larger values indicate greater phonetic distinction among vowels, while smaller values indicate greater similarity among vowels. (A computational sketch of Eqs. 4-1 through 4-5 is given at the end of this chapter.) Based on this general speech intelligibility model, psychoacoustic measurements that may introduce inter-subject performance differences can be incorporated by modulating the extracted acoustic features. The details of how to customize this general model according to individual psychoacoustic responses are introduced in Chapter 7.

4.4 Investigation outline

Working toward the goal of modeling the inter-subject performance difference, the investigation of the proposed speech intelligibility model was divided into three phases. Phase 1 validated the predictive ability of the general intelligibility model for average subject performance under the parametric variations that a CI device encounters; the results are presented in Chapter 5. Phase 2 investigated the effect of electrode confusion patterns on CI speech recognition; this investigation is presented in Chapter 6. Phase 3 customized the general intelligibility model based on real CI subject data; the predictive ability of the customized model for the inter-subject performance difference is analyzed in Chapter 7.
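As a purely illustrative sketch of Eqs. 4-1 through 4-5, the following Python code computes the time-normalized distance between two feature sequences and the mean acoustic vowel space. It assumes the symmetric weighting of Sakoe and Chiba (1978), for which the normalizing term sums to I + J, and it omits the slope constraints mentioned in Section 5.2; the function names and input conventions are illustrative, not the exact implementation used in this thesis.

import numpy as np

def time_normalized_distance(A, B):
    """Time-normalized DTW distance between two feature sequences (Eq. 4-4).
    A: (I, d) array of feature vectors (e.g., MFCCs); B: (J, d) array."""
    I, J = len(A), len(B)
    # Local Euclidean distances d(i, j) = ||a_i - b_j||  (Eq. 4-3)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    g = np.full((I, J), np.inf)
    g[0, 0] = 2 * d[0, 0]
    for i in range(1, I):
        g[i, 0] = g[i - 1, 0] + d[i, 0]
    for j in range(1, J):
        g[0, j] = g[0, j - 1] + d[0, j]
    for i in range(1, I):
        for j in range(1, J):
            g[i, j] = min(g[i - 1, j] + d[i, j],          # vertical step, w = 1
                          g[i - 1, j - 1] + 2 * d[i, j],  # diagonal step, w = 2
                          g[i, j - 1] + d[i, j])          # horizontal step, w = 1
    return g[I - 1, J - 1] / (I + J)  # sum of symmetric weights equals I + J

def mean_acoustic_space(features_by_talker):
    """Mean acoustic vowel space S over talkers and token pairs (Eq. 4-5).
    features_by_talker: list over T talkers; each entry is a list of n
    feature matrices (one per vowel token)."""
    T = len(features_by_talker)
    n = len(features_by_talker[0])
    total = 0.0
    for tokens in features_by_talker:           # talkers t = 1..T
        for i in range(n):
            for j in range(n):
                total += time_normalized_distance(tokens[i], tokens[j])
    return total / (T * n * n)                  # diagonal terms M_t(i, i) = 0

In practice, the feature matrices passed to time_normalized_distance() would be the MFCC sequences described in Chapter 5, extracted from each processed vowel token.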
Chapter 5

Speech intelligibility modeling: parametric effects [1]

[1] Most of Chapter 5 is reproduced from Liu and Fu (2007). Part of the content was moved to Chapter 4 to better fit the organization of the thesis, and section and equation numbers were changed to be internally consistent with this thesis.

Before the proposed general speech intelligibility model can be expected to capture inter-subject performance differences, it must first be validated that the general predictor can predict the overall trends in CI speech perception under parametric variations. In this chapter, five CI parameter settings were explored to evaluate the ability of the general model to predict average speech performance of normal-hearing listeners presented with CI simulations.

5.1 Introduction

The cochlear implant (CI) is a prosthetic device that partially restores hearing function in individuals with severe-to-profound sensorineural hearing loss. However, there is considerable variability in CI patient outcomes. This variability may be due to patient-related factors (e.g., etiology of the hearing loss, duration of deafness, the health and location of the remaining auditory neurons, the insertion depth of the electrode array, etc.), or to processor-related factors (e.g., the number of electrodes/channels, the stimulation rate, the frequency-to-electrode allocation, etc.). To accommodate some of these factors, a number of speech processing strategies and parameters can be customized for individual patients.

Many previous studies have investigated the effects of processor parameters on the speech recognition performance of CI users (Loizou et al. 2000) and normal hearing (NH) subjects listening to acoustic CI simulations (Fu 1997). In general, speech performance is most strongly influenced by parameters that affect the spectral resolution (e.g., the number of electrodes/channels). For both CI users and NH subjects listening to CI simulations, speech recognition in quiet improves with increasing numbers of spectral channels (Fu 1997; Dorman et al. 1997a; Fishman et al. 1997; Friesen et al. 2001); CI performance in noise does not significantly improve beyond ~8 channels, while NH performance continues to improve as more spectral channels are added (Friesen et al. 2001). CI users' spectral resolution can be further reduced by channel interaction between adjacent electrodes (Wilson et al. 1988; Wilson et al. 1991; Throckmorton and Collins 2002). When spectral channels overlap (i.e., current delivered to different electrodes stimulates the same neural populations), the temporal envelopes delivered to each electrode interfere with one another. The reduced spectral resolution and the spectral smearing associated with channel interaction greatly limit CI patient performance in noise, especially in fluctuating noise (Fu and Nogaki 2005; Nelson et al. 2003).

In CI devices, a frequency mismatch between acoustic information and the site of stimulation within the cochlea can result in spectral distortion. In CI speech processing, the frequency allocation maps acoustic frequencies onto electrode locations. Invariably, because of the limited insertion depth and electrode spacing in the implanted array, there is a spectral mismatch between the acoustic and electric representation of speech signals.
Spectral warping, in which the acoustic frequency information is mapped onto electrodes in a way that produces local spectral mismatches (sometimes due to uneven nerve survival along the electrode array), can cause one type of spectral distortion. For example, given a fixed input frequency range and a fixed distribution of frequency analysis bands, the distribution of electrode locations can significantly affect speech performance (Fu 1997). Spectral distortion can also be caused by spectral shifting, due to different insertion depths of the electrode array into the cochlea. Previous studies have shown that spectral shifting can reduce speech performance in CI users (Fu and Shannon 1999a) and NH subjects listening to CI simulations (Dorman et al. 1997b).

In CIs, the relatively large acoustic dynamic range (DR) must be compressed onto the much smaller electric DR using an amplitude mapping function. For spectral peak-picking speech processing strategies (e.g., Cochlear's SPEAK or ACE strategies), the electric DR can be severely compressed without significantly affecting phoneme recognition in quiet (Zeng and Galvin 1999). For continuous interleaved sampling (CIS) strategies (Wilson et al. 1991), only extreme non-linear mappings result in significant reductions of phoneme recognition in quiet (Fu and Shannon 1998). When the spectral resolution is severely reduced (i.e., 4 channels), thereby increasing listeners' dependence on temporal cues, there are optimal settings of the amplitude mapping function for phoneme recognition (Fu and Shannon 1998). Thus, while the amplitude mapping function in CIs is aimed at restoring normal loudness growth, amplitude envelope distortion seems to have little effect on CI speech recognition in quiet. However, the amplitude mapping function may greatly affect CI performance in noise (Fu and Shannon 1999b).

In many CI studies, the effects of speech processor parameters have been systematically explored by incrementally varying parameter settings. However, given the number of speech processor parameters and the interactions between parameters, such an approach is time consuming and the results are sometimes difficult to interpret. In a clinical setting, there are even greater time constraints and a greater need to optimize CI patients' speech processors. Thus, it is important to estimate the effect of a parameter setting prior to implementation in a speech processor, and prior to extensive evaluation in a CI patient. If the effects of parameter settings can be accurately estimated, audiologists would require less time and guesswork in the clinical fitting of speech processors. For example, previous studies have used acoustic analysis to model CI users' speech recognition in quiet and noise (Throckmorton and Collins 2002; Remus and Collins 2005). Noise-band and sinewave vocoders have been used to simulate various CI speech processing parameters in NH listeners. While these acoustic simulations may not accurately model all aspects of CI speech processing, NH subjects' phoneme recognition patterns with noise-band vocoders have been shown to be quite similar to those of CI users, for a variety of speech processor parameters (Fu and Shannon 1998, 1999).

In the present study, a metric was obtained using acoustic analysis of the vowel stimuli; the metric was compared to NH subjects' vowel recognition performance for several simulated CI speech processor parameters. In the simulations, five speech processor parameters were systematically varied.
The parameter settings controlled the number of spectral channels, the degree of spectral smearing, the degree of spectral warping, the amount of spectral shifting, and the amplitude mapping function. The acoustic Euclidean distances between processed vowels were estimated using Mel-frequency cepstrum coefficients. For each simulated processor condition, the acoustic vowel space was calculated as the mean acoustic distance between processed vowels. The acoustic vowel space was then compared to NH subjects' vowel recognition performance with the same stimuli (based on data from several previous studies reported by the second author: Fu 1997; Fu and Shannon 1998, 1999).

5.2 Estimating the vowel acoustic space

In speech coding, synthesis or recognition, it is important to identify the acoustic features that contribute most strongly to phoneme discrimination. Mel-frequency cepstrum coefficient (MFCC) analysis uses a perceptually motivated frequency scale, in which the cepstrum for a brief window of a speech signal is derived from the FFT of the speech signal (Huang et al. 2001). MFCC analysis is preferable to linear frequency cepstrum analysis in that it extracts more salient acoustic features. Euclidean distance metrics using MFCCs provide better separation of phonetically distinct spectra (Davis and Mermelstein 1980), and MFCC analysis has been shown to be more accurate than temporal waveform analysis in extracting key speech features from stimuli processed by acoustic CI simulations (Remus and Collins 2005). In the present study, an MFCC-based analysis was used to extract the acoustic features of vowel stimuli after simulated CI speech processing. A dynamic time-normalization algorithm with symmetric form and slope constraints was used to compensate for timing differences between speech patterns (Sakoe and Chiba 1978). After extraction of the MFCC features from the noise-band vocoded stimuli, the vowel acoustic space was calculated according to the general intelligibility model in Section 4.3.

5.3 Experimental Design

For each simulated CI parameter, the acoustic space for a multi-talker vowel set was compared to NH subjects' vowel recognition for the same processed stimuli.

5.3.1 Test materials

The vowel tokens used for acoustic analysis and closed-set vowel recognition tests were digitized natural productions by 5 male and 5 female adults, drawn from speech samples collected by Hillenbrand et al. (1994). The stimulus set consisted of 12 vowels (10 monophthongs and 2 diphthongs), presented in an /hVd/ context (heed, hid, head, had, who'd, hood, hod, hud, hawed, heard, hoed, hayed). All stimuli were normalized to have the same long-term root mean square (RMS) value.

5.3.2 Signal processing for the five experiments

A noise-band vocoder was used to simulate a CI speech processor fitted with the CIS strategy. The simulation was implemented as follows. Signals were first processed through a pre-emphasis filter (first-order Butterworth high-pass with a cutoff frequency of 1200 Hz), and then band-passed into a number of frequency bands (the number of frequency bands was varied as an experimental parameter). The temporal envelopes were extracted from each frequency band by half-wave rectification and low-pass filtering (fourth-order Butterworth filter with a cutoff frequency of 160 Hz). The extracted envelope amplitudes were mapped to simulated electrical amplitudes using a power-law transformation (the exponent of the power function was varied as an experimental parameter). The transformed envelope amplitudes were used to modulate a wideband noise, which was then spectrally limited by a band-pass filter (the carrier filter bandwidth, location and slope were varied as experimental parameters). Finally, the modulated carriers from all bands were summed and scaled to the original RMS level of the unprocessed signals.
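The following Python sketch illustrates the channel processing described above, assuming SciPy is available. It is a simplified stand-in rather than the processor actually used: the 1200-Hz pre-emphasis stage, the exact filter slopes, and the clipped power-law mapping of Experiment 5 (Eq. 5-3) are omitted, and identical analysis and carrier bands are assumed (the experiments deliberately decoupled them).

import numpy as np
from scipy.signal import butter, lfilter

def vocoder_channel(x, fs, band, p=1.0, env_cutoff=160.0, rng=None):
    """One channel of a CIS-style noise-band vocoder (simplified sketch).
    x: input signal; band: (low, high) band edges in Hz; p: power-law exponent."""
    rng = np.random.default_rng() if rng is None else rng
    # Analysis band-pass filter (4th-order Butterworth as a stand-in for the
    # variable-slope filters used in the experiments).
    b_bp, a_bp = butter(4, band, btype="bandpass", fs=fs)
    sub = lfilter(b_bp, a_bp, x)
    # Envelope extraction: half-wave rectification + 160-Hz low-pass.
    b_lp, a_lp = butter(4, env_cutoff, btype="low", fs=fs)
    env = np.maximum(lfilter(b_lp, a_lp, np.maximum(sub, 0.0)), 0.0)
    # Simplified power-law amplitude mapping (linear when p = 1.0).
    env = env ** p
    # Modulate a wideband noise carrier, then limit it to the carrier band
    # (here the same as the analysis band; Exps. 2-4 varied this mapping).
    carrier = lfilter(b_bp, a_bp, env * rng.standard_normal(len(x)))
    return carrier

def vocode(x, fs, bands, p=1.0):
    """Sum the modulated carriers and rescale to the input RMS level."""
    y = sum(vocoder_channel(x, fs, band, p) for band in bands)
    return y * np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))

For example, vocode(x, fs, [(100, 1075), (1075, 2050), (2050, 3025), (3025, 4000)]) approximates a 4-channel processor with linearly spaced bands over 100-4000 Hz.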
In the present study, five experimental parameters were tested: the number of spectral channels, the amount of spectral smearing, the amount of spectral warping, the degree of spectral shifting, and the degree of non-linearity in the amplitude mapping function. These parameters have been previously shown to affect CI users' speech recognition performance (Loizou et al. 2000; Fu 1997; Dorman et al. 1997a, 1997b; Fishman et al. 1997; Friesen et al. 2001; Wilson et al. 1991; Throckmorton and Collins 2002; Fu and Shannon 1998, 1999). Table 5-1 lists the experimental parameters varied in each experiment, as well as the fixed parameter settings for each experiment. The experimental parameters are described in greater detail below.

Table 5-1: Summary of the fixed and variable speech processor parameters for all five experiments. An asterisk marks the parameter varied in each experiment (shown as shaded cells in the original table).

Parameter                                      Exp. 1      Exp. 2                Exp. 3                            Exp. 4            Exp. 5
Number of channels                             1-8 *       4                     4                                 4                 4
Analysis frequency range (Hz)                  100-4000    100-4000              100-4000                          337-3687          100-6000
Distribution of analysis bands                 linear      P2 (see Table 5-2)    linear                            Greenwood         custom
Carrier frequency range (Hz)                   100-4000    100-4000              100-4000                          see Table 5-3 *   100-6000
Distribution of carrier bands                  linear      P2                    linear to log (see Table 5-2) *   Greenwood         custom
Analysis/carrier band filter slope (dB/oct)    24/24       36/6, 12, 24, 36 *    24/24                             24/24             24/24
Amplitude mapping function                     linear      linear                linear                            linear            non-linear *

Table 5-2: Corner frequencies (Hz) for the analysis/carrier bands used in Exps. 2 and 3.

Partition   Corner 1   Corner 2   Corner 3   Corner 4   Corner 5
P0          100        1000       2000       3000       4000
P1          100        808        1733       2790       4000
P2          100        639        1475       2569       4000
P3          100        494        1235       2342       4000
P4          100        375        1018       2117       4000
P5          100        280        827        1900       4000
P6          100        205        665        1694       4000

Experiment 1: The number of spectral channels

Six spectral channel conditions were simulated (i.e., 1, 2, 3, 4, 6, or 8 channels). The overall frequency range, frequency band distribution and filter slope were fixed, and were matched between the analysis and carrier bands; a linear amplitude mapping function was used. The frequency band partitions were evenly distributed within the overall frequency range, according to the number of channels.

Experiment 2: Spectral smearing

Four degrees of spectral smearing were simulated in 4-channel processors by varying the amount of spectral overlap between adjacent carrier bands. The overall frequency range and the frequency band partition (P2; see Table 5-2) were fixed, and were matched between the analysis and carrier bands; a linear amplitude mapping function was used. The analysis filter band slope was fixed at 36 dB/octave. The carrier filter slope was varied between 36, 24, 12 and 6 dB/octave; the 36 dB/octave slope produced the least spectral smearing, while the 6 dB/octave slope produced the most smearing.

Experiment 3: Spectral warping

Six degrees of spectral warping were simulated in 4-channel processors. The overall frequency range and filter slope were fixed, and were matched between the analysis and carrier bands; a linear amplitude mapping function was used. The analysis band partition was fixed at a linear distribution (i.e., P0) while the carrier band partition was varied from a linear to a logarithmic distribution (i.e., P0 to P6). Table 5-2 shows the cutoff frequencies for each band, for each partition. Thus, while the overall frequency range was fixed for the analysis and carrier bands, the spectral envelope was warped by the mismatch between individual analysis and carrier bands.

Experiment 4: Spectral shifting

Nine degrees of spectral shifting were simulated in 4-channel processors. Analysis and carrier bands were distributed according to Greenwood's function (Greenwood 1990). The filter slope was fixed for both analysis and carrier bands; a linear amplitude mapping function was used. The overall frequency range for the analysis filters was fixed (337-3687 Hz) while the overall frequency range for the carrier filters was varied to simulate different electrode insertion depths. The simulated insertion depths ranged from 28 mm from the base of the cochlea (deepest insertion) to 22 mm from the base (shallowest insertion), in 0.75 mm steps. The tonotopic locations of the carrier bands were determined according to:

$L(i) = L_0 - 3.75\,i, \qquad i = 0, 1, \ldots, 4$    (5-1)

where $L_0$ is the most apical location for a given frequency allocation (in mm from the base), and 3.75 represents the tonotopic extent of each band (in mm). The corner frequencies of each carrier band (in Hz) were determined according to Greenwood (1990):

$F(i) = 165.4 \times \left(10^{\,0.06\,(35.0 - L(i))} - 0.88\right)$    (5-2)

Note that a 35 mm cochlear length was assumed in the exponent. Table 5-3 lists the corner frequencies of the carrier bands for the 9 simulated insertion depths. Note that the overall frequency range for the analysis filters was the same as that for the carrier filters with an insertion depth of 27.25 mm.

Table 5-3: Corner frequencies (Hz) for the carrier bands used in Exp. 4. The analysis bands were fixed and distributed according to Greenwood's formula (Greenwood 1990). For each simulated insertion depth, the corner frequencies were calculated according to Greenwood's formula.

Insertion depth (mm from base)   Corner 1   Corner 2   Corner 3   Corner 4   Corner 5
28.00                            289        585        1081       1913       3310
27.25                            337        665        1214       2138       3687
26.50                            390        753        1363       2387       4106
25.75                            448        851        1528       2663       4570
25.00                            513        960        1710       2970       5085
24.25                            585        1081       1913       3310       5656
23.50                            665        1214       2138       3687       6289
22.75                            753        1363       2387       4106       6992
22.00                            851        1528       2663       4570       7771
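Eqs. 5-1 and 5-2 can be made concrete with a short sketch that reproduces the rows of Table 5-3; the helper name below is illustrative, and a 35-mm cochlea with 3.75-mm bands is assumed as stated in the text.

import numpy as np

def carrier_corner_frequencies(insertion_depth_mm, n_bands=4, band_extent_mm=3.75):
    """Corner frequencies (Hz) of the carrier bands for a simulated insertion depth.
    insertion_depth_mm is the most apical location L0, in mm from the base."""
    i = np.arange(n_bands + 1)                       # 5 corners for 4 bands
    L = insertion_depth_mm - band_extent_mm * i      # tonotopic corner locations (Eq. 5-1)
    return 165.4 * (10 ** (0.06 * (35.0 - L)) - 0.88)  # Greenwood map (Eq. 5-2)

# Example: the 27.25-mm row of Table 5-3, approx. [337, 665, 1214, 2138, 3687] Hz
print(np.round(carrier_corner_frequencies(27.25)))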
Experiment 5: Nonlinear amplitude mapping

Seven different amplitude mapping functions were simulated in 4-channel processors. The overall frequency range (100-6000 Hz), filter partition (custom corner frequencies: 100, 713, 1509, 3043, and 6000 Hz) and filter slope were fixed, and were matched between the analysis and carrier bands. The amplitude mapping function was varied to be expansive or compressive. Input amplitudes were mapped to output amplitudes according to the following power-law transformation:

$nA = \begin{cases} A_{\min} & \text{if } A < A_{\min} \\ A_{\min} + k\,(A - A_{\min})^{p} & \text{if } A_{\min} \le A \le A_{\max} \\ A_{\max} & \text{if } A > A_{\max} \end{cases}$    (5-3)

where $A_{\max}$ represents the 99th percentile of envelope amplitudes across all analysis bands, and $A_{\min}$ is equal to $A_{\max}/1000$ (thereby yielding a 60 dB dynamic range). For a specific value of $p$, the value of the constant $k$ was determined by setting $nA = A_{\max}$ when $A = A_{\max}$. The exponent $p$ was varied between 0.3 and 3.0 (p = 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0). When p = 1.0, the amplitude mapping was linear; when p < 1.0, the amplitude mapping was compressive; when p > 1.0, the amplitude mapping was expansive.
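A minimal sketch of Eq. 5-3 is given below, assuming the conventions stated in the text (A_min = A_max/1000 and k chosen so that the mapping passes through A_max); the function name is illustrative.

import numpy as np

def amplitude_mapping(A, p, A_max):
    """Power-law envelope mapping of Eq. 5-3.
    A: envelope amplitudes; p: exponent (compressive for p < 1, linear for
    p = 1, expansive for p > 1); A_max: 99th-percentile envelope amplitude."""
    A_min = A_max / 1000.0                      # 60-dB input dynamic range
    k = (A_max - A_min) ** (1.0 - p)            # ensures nA = A_max when A = A_max
    # Clipping A to [A_min, A_max] reproduces the first and third branches of Eq. 5-3.
    return A_min + k * (np.clip(A, A_min, A_max) - A_min) ** p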
5.3.3 Subjects

For Experiments 1-3, NH subjects' vowel recognition performance was previously reported in Fu (1997); 4 NH subjects participated in each experiment. For Experiment 4, NH subjects' vowel recognition performance was previously reported in Fu and Shannon (1999); 5 NH subjects participated in the experiment. For Experiment 5, NH subjects' vowel recognition was previously reported in Fu and Shannon (1998); 4 NH subjects participated in the experiment. All NH subjects had sensitivity thresholds better than 15 dB HL for audiometric test frequencies from 250 to 8000 Hz; all were native speakers of American English. Informed consent from each subject and local IRB approval were obtained for the study.

5.3.4 Procedures

For all experimental processors, speech features were represented as MFCC vectors (order 14), extracted from 20 ms windows with 50% overlap (Huang et al. 2001); the acoustic vowel space was calculated according to the methods described above in Section 5.2. In order to obtain meaningful units with which to compare the relative effects of the experimental parameters, the acoustic space for the processed vowels was normalized to the acoustic space for unprocessed vowels. Thus, the normalized acoustic space for processed vowels was expressed as a percentile of the acoustic space for unprocessed vowels.

Vowel recognition was measured in NH listeners for all experimental processors. All stimuli were presented in free field at 70 dBA. Stimuli were output via a PC soundcard (Turtle Beach MultiSound Fiji board) to an amplifier (Crown D-75) and a single loudspeaker (Tannoy Reveal). Subjects were seated in a double-walled sound-treated booth (IAC), directly facing the loudspeaker. During the test, a vowel token was randomly selected (without replacement) from the stimulus set and presented to the subject. The subject responded by clicking on one of the 12 response buttons displayed on the computer screen; response buttons were labeled in an h/V/d format. No training or familiarization with the speech processing was provided, and no trial-by-trial feedback was provided. Each test block consisted of 120 vowel tokens (12 vowels x 10 talkers). The test order for the experimental conditions was randomized for each subject.
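For illustration only, the feature extraction step described above could be sketched as follows; librosa is assumed here purely for convenience (it is not the software used in this thesis), with the stated parameters of 14 coefficients and 20-ms windows at 50% overlap.

import numpy as np
import librosa  # illustrative choice; any MFCC implementation would do

def mfcc_features(wav_path, n_mfcc=14, win_ms=20.0):
    """MFCC feature matrix (frames x coefficients) for one vowel token."""
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(sr * win_ms / 1000.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=n_fft // 2)
    return mfcc.T  # one feature vector per analysis frame

def normalized_acoustic_space(processed_space, unprocessed_space):
    """Express the processed-vowel space as a percentage of the unprocessed space."""
    return 100.0 * processed_space / unprocessed_space

These feature matrices would be passed to the mean_acoustic_space() sketch given at the end of Chapter 4, once for the processed tokens and once for the unprocessed tokens, to obtain the normalized acoustic space reported in the Results.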
5.4 Results

Experiment 1: The number of spectral channels

Figure 5-1 shows the normalized acoustic space and NH subjects' mean vowel recognition as a function of the number of spectral channels. In general, as more spectral channels were added, the acoustic space increased and vowel recognition improved. The acoustic space was significantly reshaped by the number of spectral channels [one-way ANOVA: F(5,54) = 114.37, p<0.001]. Post-hoc Bonferroni t-tests showed that the acoustic space was significantly different between all but the 8- and 6-channel and the 3- and 2-channel processors (p < 0.05). Vowel recognition was also significantly affected by the number of spectral channels [one-way ANOVA: F(5,18) = 818.75, p<0.001]. Post-hoc Bonferroni t-tests showed that vowel recognition was significantly different between all but the 8- and 6-channel processors (p < 0.05). On average, as the number of spectral channels was doubled, both the acoustic space and vowel recognition performance increased by ~20 percentage points. Note that even with 8 spectral channels, the acoustic space was significantly compressed (~70% of the unprocessed vowel space). NH vowel recognition performance was somewhat higher than predicted by the acoustic analysis. Linear regression analysis showed that the acoustic space and vowel recognition performance were highly correlated [r2 = 0.99; p < 0.001].

Figure 5-1: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis), as a function of the number of spectral channels.

Experiment 2: Spectral smearing

Figure 5-2 shows the normalized acoustic space and NH subjects' mean vowel recognition for 4-channel speech processors, as a function of the slope of the carrier band filters; note that for all conditions, the analysis band filter slope was fixed (36 dB/octave), while the carrier band filter slope was varied. The acoustic space significantly increased as the amount of channel overlap/spectral smearing was reduced [one-way ANOVA: F(3,36) = 73.99, p < 0.001]. Post-hoc Bonferroni t-tests showed that the mean acoustic space was significantly different between all carrier filter slope conditions. Similarly, vowel recognition significantly improved as the spectral smearing was reduced [one-way ANOVA: F(3,12) = 278.28, p < 0.001]. Post-hoc Bonferroni t-tests showed that performance was significantly different between all carrier filter slope conditions. Again, the acoustic space and vowel recognition performance were highly correlated [r2 = 0.96; p = 0.019].

Figure 5-2: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the slope of the carrier band filter.

Experiment 3: Spectral warping

Figure 5-3 shows the normalized acoustic space and NH subjects' mean vowel recognition for 4-channel speech processors, as a function of the carrier band frequency partition; note that for all conditions, the analysis band partition was fixed (P0, i.e., a linear distribution). The acoustic space was not significantly reshaped by spectral warping [one-way ANOVA: F(6,63) = 0.275, p = 0.947]. However, vowel recognition was significantly affected by spectral warping [one-way ANOVA: F(6,21) = 184.93, p < 0.001]; vowel recognition began to significantly decline beyond carrier partition P3. The acoustic space and vowel recognition performance were not significantly correlated [r2 = 0.421; p = 0.115].

Figure 5-3: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the distribution of carrier band filters.

Experiment 4: Spectral shifting

Figure 5-4 shows the normalized acoustic space and NH subjects' mean vowel recognition for 4-channel speech processors, as a function of the simulated insertion depth; note that the analysis input frequency range was fixed (337-3687 Hz) while the carrier frequency range was varied. The acoustic space was not significantly reshaped by spectral shifting [one-way ANOVA: F(8,81) = 0.130, p = 0.998].
However, vowel recognition was significantly affected by spectral shifting [one-way ANOVA: F(8,36) = 30.36, p < 0.001]; vowel recognition began to significantly decline for simulated insertion depths less than 24.25 mm from the base. Thus, a frequency-to-place mismatch of more than 3 mm resulted in a significant drop in performance. Similar to Experiment 3, the acoustic space and vowel recognition performance were not significantly correlated [r2 = 0.206; p = 0.220].

Figure 5-4: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the simulated insertion depth of the carrier bands.

Experiment 5: Nonlinear amplitude mapping

Figure 5-5 shows the normalized acoustic space and NH subjects' mean vowel recognition for 4-channel speech processors, as a function of the exponent used in the amplitude mapping function; note that values less than 1 produced a compressive mapping, and values greater than 1 produced an expansive mapping. As in Experiments 1 and 2, the acoustic space was significantly reshaped by the amplitude mapping function [one-way ANOVA: F(6,63) = 79.15, p < 0.001]. Post-hoc Bonferroni t-tests showed that the mean acoustic space was significantly different between most amplitude mapping functions (p < 0.05); there was no significant difference between exponent values 3.0 and 2.0, 1.0 and 0.8, 0.8 and 0.5, and 0.5 and 0.3 (p > 0.05). In general, the acoustic space increased monotonically as the amplitude mapping function became more expansive. Vowel recognition was also significantly affected by the amplitude mapping function [one-way ANOVA: F(6,21) = 9.78, p < 0.001]. The linear mapping function (exponent p = 1.0) provided the best vowel recognition. Post-hoc Bonferroni t-tests showed that the linear mapping function provided significantly better performance (p < 0.05) than the extremely compressive (exponent p = 0.3) or expansive mapping functions (exponent p = 2.0, 3.0); there was no significant difference in vowel recognition among mapping exponents 0.5, 0.8, 1.0 and 1.5 (p > 0.05). Similar to Experiments 3 and 4, the acoustic space and vowel recognition performance were not significantly correlated [r2 = 0.082; p = 0.534].

Figure 5-5: Normalized mean acoustic space (left axis) and mean vowel recognition performance (right axis) for 4-channel speech, as a function of the amplitude mapping function.

5.5 Discussion

In the present study, the acoustic vowel space was estimated for a variety of simulated speech processing parameters using an MFCC-based acoustic analysis. Linear regression analysis revealed that the acoustic space was highly correlated with NH subjects' vowel recognition performance for parameters that manipulated the number of spectral channels (Exp. 1) and the amount of spectral smearing (Exp. 2). However, the acoustic space was not correlated with vowel recognition performance for parameters that manipulated the degree of spectral warping (Exp. 3) and spectral shifting (Exp. 4).
Also, the acoustic space was not correlated with vowel recognition performance for the parameter that manipulated the amplitude mapping function (Exp. 5). These results suggest that the MFCC-based acoustic analysis may well predict the speech perception scores that cochlear implant users obtain with different processing parameters (i.e., the number of channels) and patient-related factors (channel interaction due to poor electrode discrimination). For the other conditions (i.e., spectral warping, spectral shifting and nonlinear amplitude mapping), the acoustic analysis may not have been sensitive to the perceptual factors that cause speech performance to decline, especially for extreme parameter settings.

For Exps. 1 and 2, the strong correlations between the acoustic space and the perceptual results suggest that the acoustic analysis extracted the critical speech features affected by spectral channels and spectral smearing. However, vowel recognition was generally higher than predicted by the acoustic analysis. This discrepancy may be due to the normalization of the acoustic space of processed speech to the acoustic space of unprocessed (i.e., broadband) speech. For the CI acoustic simulation, speech was spectrally degraded (i.e., 8 or fewer spectral channels); the normalized acoustic space was, at best, only 70% of the broadband acoustic space (8-channel speech in Exp. 1). When listening to spectrally degraded speech, NH subjects may have used temporal cues (e.g., formant transitions) that were not extracted by the MFCCs, resulting in somewhat better performance than predicted. An acoustic analysis that uses both MFCCs and temporal information may better predict performance for spectrally degraded speech.

For conditions of spectral mismatch, whether by spectral warping (Exp. 3) or spectral shifting (Exp. 4), the acoustic analysis was not correlated with the perceptual results. The acoustic space was not significantly reshaped by the degree of spectral mismatch. This is not surprising, as the MFCC-based analysis was most sensitive to changes in spectral resolution, whether by manipulating the number of spectral channels (Exp. 1) or the spectral overlap between channels (Exp. 2). In Exps. 3 and 4, the number of spectral channels (4) and the carrier band filter slope were fixed. In Exp. 3, the input and output frequency ranges were fixed; the analysis band distribution was fixed, while the carrier bands were distributed in a linear to logarithmic fashion. Spectral warping would therefore result in a relatively constant acoustic space, at least in terms of MFCC analysis. In Exp. 4, the distributions of the analysis and carrier bands both followed the Greenwood function, thereby ensuring comparable excitation in terms of cochlear extent. The input frequency range was fixed, while the output range was systematically shifted. Similar to Exp. 3, spectral shifting would result in a relatively constant acoustic space, at least in terms of MFCC analysis.

In Exps. 3 and 4, vowel recognition rapidly declined as the degree of mismatch became too severe. In Exp. 3, the spectral envelope was distorted, while in Exp. 4, the spectral envelope was shifted. While these manipulations resulted in comparable deficits in performance, they may have affected performance in different ways. For example, with spectral warping, NH subjects were able to tolerate some shifting in relative formant frequencies; vowel recognition began to significantly decline at partition P3.
Similarly, NH subjects were able to tolerate some shifting in the spectral envelope (i.e., relative formant frequency differences were preserved, but globally shifted); vowel recognition began to significantly decline for a frequency shift beyond 3 mm (in terms of cochlear location). In both cases, NH listeners were sensitive to extreme spectral mismatch, whether for local or global shifts in spectral information.

In all experiments, performance was acutely measured in NH listeners (i.e., no training or feedback). Previous studies have shown that NH listeners and CI patients can adapt to spectral mismatches similar to those in Exps. 3 and 4, via auditory training (Rosen et al. 1999; Fu et al. 2005a, 2005b; Fu et al. 2003; Svirsky et al. 2004; Svirsky et al. 2001; Faulkner et al. 2006) and/or long-term experience (Fu et al. 2002). Rosen et al. (1999) found that, for a 6.5 mm spectral shift toward the base, 3-4 hours of auditory training improved NH subjects' vowel recognition from chance level (6% correct) to nearly 25% correct. Fu et al. similarly found that moderate auditory training significantly improved CI patients' phoneme recognition (Fu et al. 2005a) and NH subjects' recognition of spectrally shifted speech (Fu et al. 2005b). Svirsky et al. found that, during the first two years post-implantation, CI users' vowel space (in terms of perceptual distances of F1 and F2 contours) gradually expanded toward that of NH listeners (Svirsky et al. 2001). These training studies suggest that both NH and CI listeners can at least partially adapt to extreme spectral mismatches. Because CI patients are continually exposed to speech processing while wearing the implant device, CI performance may be better than NH performance for some simulated CI parameters. In light of these previous training studies, the discrepancies between the acoustic analyses and perceptual results in Exps. 3 and 4 may be overcome by auditory training. Similarly, auditory training may help CI patients adapt to changes in speech processor parameters that preserve or enhance speech cues (as shown by acoustic analysis), but result in acute deficits in performance. Because the acute performance deficits due to spectral mismatch may be overcome with auditory training, the acoustic analysis may ultimately correlate with perceptual results in the long term.

The pattern of results in Exp. 5 points to one of the limitations of the acoustic analysis used in the present study. The acoustic space decreased monotonically as the amplitude mapping function became more compressive. This is not surprising, as amplitude compression causes the peaks in the spectral envelope to become less distinct; MFCC-based acoustic analysis is sensitive to these differences in spectral peaks. However, vowel recognition was best with a linear mapping function, and only extremely expansive or compressive mappings resulted in a significant deficit in performance. Thus, while the spectral envelope may have been enhanced with an expansive amplitude mapping, the spectral peaks were distorted relative to the normal pattern. It is unclear whether auditory training would ultimately provide better performance with expansive mappings. It is possible that, as weak peaks are made weaker and strong peaks made stronger, vowel recognition would remain relatively poor compared with a linear mapping. Relative audibility between weak and strong spectral peaks with expansive mappings may also reduce performance.
5.6 Conclusions

In the present study, the acoustic space of multi-talker vowel stimuli was compared to NH vowel recognition performance for a variety of CI simulations. Different degrees of spectral resolution (e.g., spectral channels, spectral smearing), spectral mismatch (e.g., spectral warping, spectral shifting) and amplitude distortion were simulated. The acoustic space was highly correlated with vowel recognition performance for conditions that simulated different numbers of spectral channels and different degrees of spectral smearing. However, the acoustic space was not highly correlated with vowel recognition performance for conditions that simulated different degrees of spectral mismatch (i.e., spectral shifting or warping between analysis and carrier bands) or non-linear amplitude mapping. The results suggest that spectral resolution is a limiting factor in CI performance, as both the acoustic space and vowel recognition scores increased as the number of spectral channels was increased and/or the spectral overlap was reduced. Because the acoustic space was not significantly reshaped by conditions of spectral mismatch, auditory training and/or long-term experience may help CI listeners better receive salient but frequency-shifted speech cues provided by the CI device.

Chapter 6
Speech intelligibility modeling: effect of electrode confusions
(This chapter is collaborative work with Drs. Louis Braida and Ray Goldsworthy at the MIT RLE lab and Sensimetrics Corp., respectively.)

In the previous chapter, the prediction ability of the general speech intelligibility model was studied under five different parametric effects; the goodness of fit was based on averaged subject performance. In this chapter, the effect of electrode confusions was studied based on the general speech intelligibility model. We focused on quantifying electrode confusions and on predicting averaged subject performance under different levels of electrode confusion.

6.1 Introduction

Cochlear implant (CI) users show substantial inter-subject performance differences when exposed to the same listening materials and conditions. Such performance differences may stem from factors including the number of effective spectral channels, variable electrode psychoacoustic patterns, deafness profiles, and CI signal processing strategies. In this study, it was hypothesized that electrode confusion patterns may partially contribute to these inter-subject performance differences. The objective of this study was to quantify the effect of electrode confusions on CI speech recognition and to predict CI speech recognition performance under electrode confusion conditions. The study may facilitate the subsequent investigation of customizing the general speech intelligibility model, and its findings may help to optimize CI speech perception in a subject-oriented fashion.

Depending on the deafness profile and stimulation scheme, electrode discrimination patterns may be very different across CI users (Nelson et al. 1995; Collins et al. 1997; Henry et al. 2000). For example, Nelson et al. (1995) found that good subjects obtained perfect "place pitch" discrimination with comparison electrodes as close as 0.75 mm; poor subjects could only obtain perfect discrimination with comparison electrodes as distant as 13 mm (more than three quarters of the entire length of the electrode array).
Electrode discrimination patterns have been shown to relate to CI speech perception performance in the previous literature (Henry et al. 2000; Zwolan et al. 1997; Nelson et al. 1995). Although Zwolan et al. (1997) found no significant correlation between electrode discrimination and speech perception, a significant correlation was found between these two measurements in the studies of Nelson et al. (1995) and Henry et al. (2000). Henry et al. (2000) further found that electrode discrimination ability was significantly correlated with the amount of speech information perceived only in the frequency range of 170 to 2680 Hz, but not in the region of 2680 Hz to 5744 Hz.

With such large subject-dependent factors and mixed perceptual results, predicting CI speech recognition is a challenging task. Approaches to predicting CI speech recognition may be roughly categorized into two directions. The first direction is based on articulation index (AI) theory: the speech intelligibility predictor is typically calculated as a weighted summation of an intermediate variable describing the SNR level and an importance function associated with each sub-band (Pavlovic 1987). The importance function was derived in the context of speech intelligibility for normal-hearing listeners. Goldsworthy (2005) studied the prediction ability of such AI-type methods for speech intelligibility in cochlear implants (CI) in noise, and found that these conventional methods did not provide consistently satisfactory results, especially when nonlinear operations or processing were involved. The second direction is based on acoustic distance. For example, Dorman and Loizou (1997) studied the mechanism of vowel recognition by examining the relationship between the acoustic space (based on the Euclidean distance of the average output levels of 6 channels across phonemes) and the percent error responses of vowel pairs. The correlation between these two was not significant, with a correlation strength of only -0.27. Liu and Fu (2007) studied the prediction ability of the acoustic space based on MFCCs under various CI simulation conditions without additive noise. The acoustic space was highly correlated with the speech perception scores of normal-hearing people listening to the CI simulations when the spectral information was degraded or smeared (r2 > 0.95). Furthermore, by minimizing the acoustic distance between ideal speech and non-ideal speech, speech enhancement was shown to be effective in reducing the perception differences across talkers (Chapter 3) and bandwidths (Chapter 2).

Working towards a better understanding of inter-subject performance differences, it is critical for subject-dependent psychoacoustic responses to be incorporated into the speech intelligibility model. In this chapter, we specifically study the effect of electrode confusion patterns on CI speech perception. We also compare the prediction ability of different speech intelligibility indexes for speech recognition scores of NH people listening to CI simulations.

6.2 Method

6.2.1 Quantification of electrode confusion patterns

To study the effect of electrode confusion on speech perception, we simulated electrode confusion by smearing the spectral contrast of the acoustic features associated with the electrode array. Based on the Continuous Interleaved Sampling (CIS) strategy used in CI stimulation, the selected electrodes were stimulated with electric amplitudes proportional to the average spectral amplitudes of their corresponding analysis bands.
Suppose the speech feature vector $X$ represents the output of the analysis filter bank assigned to the electrode array, and suppose the matrix $D$ represents the electrode confusion pattern, where each row $d_i = [d_{i1}, d_{i2}, \ldots, d_{in}]$ ($1 \le i \le n$, $\sum_j d_{ij} = 1$) represents the percent confusion of electrode $i$ with all the other electrodes. Note that $d_{ii}$ represents the percentage of the time that the electrode can be correctly identified, since $d_{ii} = 1 - \sum_{j \ne i} d_{ij}$. The acoustic feature vector "reshaped" by the electrode confusion pattern may then be represented with a linear multiplication as follows:

$$
X' = DX =
\begin{bmatrix}
d_{11} & d_{12} & \cdots & d_{1n} \\
\vdots & \vdots & & \vdots \\
d_{i1} & d_{i2} & \cdots & d_{in} \\
\vdots & \vdots & & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nn}
\end{bmatrix}
\begin{bmatrix}
x_{E1} \\ \vdots \\ x_{Ei} \\ \vdots \\ x_{En}
\end{bmatrix}
=
\begin{bmatrix}
d_{11} x_{E1} + d_{12} x_{E2} + \cdots + d_{1n} x_{En} \\
\vdots \\
d_{i1} x_{E1} + d_{i2} x_{E2} + \cdots + d_{in} x_{En} \\
\vdots \\
d_{n1} x_{E1} + d_{n2} x_{E2} + \cdots + d_{nn} x_{En}
\end{bmatrix}
\qquad (6\text{-}1)
$$

According to Eq. (6-1), each feature dimension in $X'$ is a weighted summation of the feature dimensions in $X$, based on the electrode confusion pattern. When electrode discrimination is perfect, i.e., the electrode confusion matrix $D$ is an identity matrix, the acoustic feature vector $X'$ remains exactly the same as $X$. When all the electrodes are totally confused, i.e., the entries in the matrix $D$ are all of equal value, the resulting feature vector $X'$ also has equal values across its dimensions. Hence, the introduction of the electrode confusion pattern $D$ reshapes the spectral contrast of the feature vectors such that the functional spectral information delivered through an individual CI device is reflected. The level of electrode confusion in $D$ is proposed to be characterized by the average confusion across the $n$ channels:

$$
\lambda = \frac{1}{n} \sum_{i=1}^{n} (1 - d_{ii}) \qquad (6\text{-}2)
$$

To quantify the effect of electrode confusion, a systematic approach was introduced by raising a base matrix to successive powers. The base matrix was designed to have a certain amount of electrode confusion. When it is raised to a higher power, the off-diagonal values increase and spread out further from the diagonal. The higher the power, the greater the electrode confusion.
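As a rough illustration of Eqs. (6-1) and (6-2) and of this power-raising scheme, a minimal Python sketch follows (NumPy assumed; the function and variable names are illustrative and not taken from the original implementation):

```python
import numpy as np

def confusion_level(D):
    """Average electrode confusion of Eq. (6-2): the mean of (1 - d_ii)."""
    return float(np.mean(1.0 - np.diag(D)))

def smear_features(X, D):
    """Eq. (6-1): reshape the band features by the confusion matrix, X' = D X.
    X is an (n_bands, n_frames) array of band envelope features;
    D is an (n_bands, n_bands) row-stochastic confusion matrix."""
    return D @ X

# Base matrix of Eq. (6-3) for the 6-channel condition; its rows sum to 1,
# so its integer powers remain valid (row-stochastic) confusion matrices.
base = np.diag([0.9, 0.8, 0.8, 0.8, 0.8, 0.9]) \
     + np.diag([0.1] * 5, k=1) + np.diag([0.1] * 5, k=-1)

for power in (0, 2, 4, 8):
    D = np.linalg.matrix_power(base, power)
    print(power, round(100 * confusion_level(D)))  # roughly 0, 29, 44, 59 (cf. Table 6-1)
```

Applying smear_features with the identity matrix (power 0) leaves the features unchanged, while a fully confused matrix with equal entries flattens the spectral contrast completely, matching the two limiting cases described above.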
6.2.2 Prediction of speech intelligibility

Three speech intelligibility indexes were studied on the speech materials processed with CI simulations combined with electrode confusion patterns. The first method was the acoustic space presented in Section 4.3; the acoustic space measures the average Euclidean distance between MFCC features across all vowel pairs under a specific listening condition. The second method was the normalized covariance STI method (STI); a modified version was used in this study, following Goldsworthy (2005). The third method was the normalized correlation method (NCM) (Goldsworthy 2005). The NCM and STI methods are similar in their use of a correlation operation, but NCM has a simpler relation to SNR and requires no clipping of the transparent SNR. The speech materials in this study were CI simulations; yet, it has not been well established that the STI and NCM methods can predict intelligibility for noise-band vocoded speech. Hence, the STI and NCM routines were calculated with two different inputs: acoustic stimuli or features. The "STI acoustic" and "NCM acoustic" versions calculated the speech intelligibility index directly on the CI simulations (i.e., the acoustic noise-band vocoded speech); the "STI feature" and "NCM feature" versions calculated the speech intelligibility index directly on the extracted acoustic features simulating CIS processing, bypassing the acoustic CI simulation materials. For easy comparison of the prediction ability of the three methods, the calculated acoustic space (i.e., the first method) was normalized to the acoustic space calculated from a CI simulation with 30 bands of spectral resolution covering the full frequency range (for which perfect speech recognition was assumed to be achievable).

6.3 Experimental Design

6.3.1 Test materials

The vowel tokens used for the closed-set vowel recognition tests were digitized natural productions by 2 male and 2 female adults, drawn from speech samples collected by Hillenbrand et al. The stimulus set consisted of 12 vowels (10 monophthongs and 2 diphthongs), presented in an /hVd/ context (heed, hid, head, had, who'd, hood, hod, hud, hawed, heard, hoed, hayed). All stimuli were normalized to have the same long-term root mean square (RMS) value.

6.3.2 Signal Processing

A noise-band vocoder was used to simulate a CI speech processor fitted with the CIS strategy. The simulation was implemented as follows. Signals were first processed through a pre-emphasis filter (first-order Butterworth high-pass with a cutoff frequency of 1200 Hz). A short-time Hamming window (20 ms) was slid over the signal at a rate of 500 frames/second, simulating an electric stimulation rate of 500 pulses per second (pps). The spectrum of each short-time windowed segment was obtained from an FFT. The average spectral amplitudes of the frequency bands associated with the activated electrodes were extracted (the number of frequency bands was varied as an experimental parameter) to render the acoustic features of the signal. The acoustic features were smeared according to Eq. (6-1), simulating the electrode confusion patterns. The modified acoustic features were then used to modulate noise bands with the same band partition as the analysis process. Finally, the modulated carriers from all bands were summed and scaled to the original RMS level of the unprocessed signal.

In this study, speech materials were simulated with two channel conditions (6 and 8) and four levels of electrode confusion. The band partition of each channel condition simulated an electrode array of 15 mm in length, 22 mm in insertion depth, and of uniform tonotopic distribution of the electrodes (this simulated insertion depth corresponds to the shallowest insertion depth explored in Experiment 4 of Chapter 5). The Greenwood function was used to map the tonotopic locations to acoustic frequencies. The base matrix used to simulate electrode confusion patterns for the 6-channel simulation was designed as follows:

$$
a =
\begin{bmatrix}
0.9 & 0.1 & 0 & 0 & 0 & 0 \\
0.1 & 0.8 & 0.1 & 0 & 0 & 0 \\
0 & 0.1 & 0.8 & 0.1 & 0 & 0 \\
0 & 0 & 0.1 & 0.8 & 0.1 & 0 \\
0 & 0 & 0 & 0.1 & 0.8 & 0.1 \\
0 & 0 & 0 & 0 & 0.1 & 0.9
\end{bmatrix}
\qquad (6\text{-}3)
$$

This base matrix has 16.7% electrode confusion per channel according to Eq. (6-2). The base matrix for the 8-channel CI simulation was similar to Eq. (6-3), except with 8 bands rather than 6. The powers of the base matrix used in this study were 0, 2, 4 and 8. Table 6-1 summarizes the band partition and electrode confusion levels for the 6- and 8-channel CIS simulations. The calculated electrode confusion levels were slightly different for the 6- and 8-channel simulation conditions.
For convenience, the average electrode confusion level across the 6- and 8-channel conditions was used to identify the electrode confusion level as a function of the power of the base matrix.

Table 6-1: Summary of the band partition and electrode confusion levels for the different numbers of simulated channels and different powers of the base matrix. The base matrix follows Eq. (6-3) and the electrode confusion level λ is characterized according to Eq. (6-2).

  Power of base matrix   λ, 6 channels (%)   λ, 8 channels (%)   Average λ (%)
  a^0                            0                   0                  0
  a^2                           29                  30                 30
  a^4                           44                  46                 45
  a^8                           59                  61                 60

  Band partition (Hz): 6 channels [851, 1262, 1843, 2663, 3822, 5459, 7771];
  8 channels [851, 1146, 1528, 2022, 2663, 3494, 4570, 5964, 7771]

6.3.3 Subjects

Eight NH subjects (4 males and 4 females), aged between 19 and 42 years, participated in this study. All NH subjects had sensitivity thresholds better than 20 dB HL for audiometric test frequencies from 250 to 8000 Hz. All subjects were native English speakers, provided informed consent in accordance with the local IRB, and were paid for their participation.

6.3.4 Procedures

Vowel recognition was measured in NH listeners for all experimental conditions. All stimuli were presented in free field at 70 dBA. Stimuli were output via a PC soundcard to an amplifier (Crown D-75) and a single loudspeaker (Optimus). Subjects were seated in a double-walled sound-treated booth (IAC). The loudspeaker was situated at approximately the same horizontal plane as the subject's ears, approximately 45 degrees to the left of the subject, and about 1.5 meters away. During the test, a vowel token was randomly selected (without replacement) from the stimulus set and presented to the subject. The subject responded by clicking on one of the 12 response buttons displayed on the computer screen; response buttons were labeled in an h/V/d format. No training with the speech processing was provided, and no trial-by-trial feedback was given. Each experimental condition consisted of 48 vowel tokens (12 vowels × 4 talkers = 48). The test order for the experimental conditions was randomized for each subject. Each data point for each subject was averaged across 6 repetitions, excluding the first repetition, which was used to familiarize the subject with the testing process.

6.4 Results

Figure 6-1 shows the average speech recognition performance across all subjects for the different numbers of channels and the different simulated electrode smearing levels. Speech recognition performance in the 8-channel condition was higher than in the 6-channel condition; the difference was significant only when electrode smearing was present. As the electrode smearing level increased, speech recognition performance decreased under both channel conditions. However, compared to the performance without electrode smearing, the decrease in speech recognition performance was significant (p < 0.05) only once the electrode smearing level reached a^4 (i.e., 45% electrode confusion per channel) for the 6-channel condition and a^8 (i.e., 60% electrode confusion per channel) for the 8-channel condition. The electrode smearing effect and channel effect were both significant in reshaping the speech recognition performance [two-way repeated measures ANOVA: p(smearing) < 0.001, p(channel) = 0.003].
Figure 6-1: Speech recognition performance for the different numbers of channels and electrode smearing levels. The error bars indicate one standard deviation, and the stars indicate significantly different recognition performance between the 6- and 8-channel conditions at the given electrode smearing level (p < 0.05).

Figure 6-2 illustrates the prediction ability of the acoustic space for speech recognition performance under the 8 experimental conditions (2 channels × 4 smearing levels = 8). The linear prediction of the average speech recognition performance from the acoustic space had a correlation strength of r2 = 0.901, and the acoustic space was a significant factor in explaining the change in recognition performance (p < 0.001). Figure 6-2 also shows that all the average speech recognition scores fell close to or within the 95% confidence interval of the linear regression.

Figure 6-2: The prediction ability of the acoustic space for percent correct under 8 combinations of smearing and channel conditions (4 smearing × 2 channel = 8). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value (r2 = 0.901, p < 0.001) are displayed in the upper left corner.

To examine the talker effect in speech recognition, speech recognition scores were averaged across all subjects under the different channel and smearing conditions. Statistical analysis showed that talker was a significant factor affecting speech recognition performance [one-way repeated measures ANOVA: p < 0.001]. This finding again provides evidence that the talker effect is very important for speech recognition in CI users, as shown in Chapter 3. When the talker effect was included in the prediction analysis, there were 32 combination conditions (2 channels × 4 smearing levels × 4 talkers = 32). Figure 6-3 illustrates the prediction ability of the acoustic space for speech recognition performance under these 32 combinations. The correlation strength between the two measurements dropped from 0.901 (as in Figure 6-2) to 0.215, which was still significant (p < 0.001). Furthermore, the mean 95% confidence interval captured a much smaller percentage of the mean perceptual scores than in Figure 6-2. The decreased prediction ability of the acoustic space when the talker effect was involved (as shown in Figure 6-3) may imply that the acoustic space, in this case, is not able to predict the talker effect. To directly analyze the prediction of the talker effect, the speech recognition scores for each talker were factored out from the different smearing and channel conditions. Figure 6-4 shows the relationship between the four talkers' acoustic spaces and their corresponding speech recognition scores. The acoustic space was not able to predict the talker effect on speech recognition (r2 = 0.006, p = 0.925). However, note that the inter-subject performance differences were large even for NH listeners listening to CI simulations.
Figure 6-3: The prediction ability of the acoustic space for percent correct under 32 conditions (4 smearing × 2 channel × 4 talkers). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value (r2 = 0.215, p < 0.001) are displayed in the upper left corner.

Figure 6-4: The prediction ability of the acoustic space for percent correct under 4 conditions (4 talkers). The filled black dots represent individual subject data; the red diamonds represent the mean speech recognition scores across subjects. The solid red line represents the linear regression; the red dashed lines represent the 95% confidence interval of the mean. The linear regression strength and p value (r2 = 0.006, p = 0.925) are displayed in the upper left corner.

Table 6-2 summarizes the prediction ability of the three speech intelligibility indexes under the various combination conditions. A proper speech intelligibility index is expected to span a comparably large range in order to reflect the change in speech recognition performance. Hence, besides the linear regression strength and significance level, Table 6-2 also includes the range of the speech recognition scores and of the calculated prediction indexes. Predictions that achieved a significance level of 0.05 are highlighted. Table 6-2 indicates that no prediction index was able to explain the talker effect. Neither the STI nor the NCM method reflected the most dominant factor in speech perception (i.e., the channel effect). Instead, the acoustic space spanned a much larger index range in response to the channel effect (statistics were not obtained for the channel condition, since only two data points were available). Compared to the STI and NCM indexes, the acoustic space index was able to reflect the speech recognition patterns in both the (smearing × channel) and (smearing × channel × talker) combinations, and possibly the channel effect.

Table 6-2: Summary of the prediction ability of the three speech intelligibility indexes under the various combination conditions. Unlisted factors are factored out by averaging. "Low" and "High" indicate the range of the perception scores or of the intelligibility index. The r2 and p values are the linear regression results against the perception scores. Significant prediction effects are those with p < 0.05.
  Method                 Items   Channel (2)   Talker (4)   smearing × channel (16)   smearing × channel × talker (32)
  Perception scores (%)  Low        29            29              23                        20
                         High       34            35              38                        42
  Acoustic space         Low        0.18          0.17            0.13                      0.10
                         High       0.24          0.24            0.30                      0.33
                         r2         -             0.006           0.901                     0.560
                         p          -             0.925           0.001                     0.001
  STI acoustic           Low        0.51          0.49            0.49                      0.47
                         High       0.52          0.54            0.54                      0.57
                         r2         -             0.569           0.923                     0.051
                         p          -             0.246           0.001                     0.214
  STI feature            Low        0.56          0.55            0.48                      0.46
                         High       0.56          0.56            0.70                      0.70
                         r2         -             0.324           0.480                     0.381
                         p          -             0.431           0.057                     0.001
  NCM acoustic           Low        0.78          0.77            0.77                      0.76
                         High       0.78          0.80            0.79                      0.82
                         r2         -             0.253           0.813                     0.009
                         p          -             0.497           0.002                     0.602
  NCM feature            Low        0.67          0.67            0.63                      0.63
                         High       0.67          0.67            0.70                      0.70
                         r2         -             0.09            0.762                     0.63
                         p          -             0.707           0.005                     0.001

6.5 Discussion and conclusions

The effect of electrode confusion was simulated by smearing the spectral contrast of the acoustic features across the analysis bands. The level of electrode confusion was systematically controlled by raising a base matrix to successive powers. Results indicated that a six/eight channel speech processing strategy may tolerate electrode confusion of at least 30%/45%, respectively, without a significant decrease in speech recognition performance. The results suggest that a speech processor with higher spectral resolution is more tolerant of electrode confusion than one with lower spectral resolution, possibly because other acoustic cues remain available to compensate for the speech recognition process under electrode confusion conditions. This result strongly indicates that the acoustic space can be reshaped by electrode confusion patterns and may be further utilized in the customization of the speech intelligibility model. Compared to the other speech intelligibility indexes, the acoustic space index provided a much larger index range and stronger prediction ability in all the combination conditions, at least under the CI simulation conditions. This may be because the STI and NCM methods are designed to study speech intelligibility in noise, rather than speech materials produced with a noise-band vocoder. The acoustic space, in contrast, is designed to calculate the acoustic distance across vowel pairs, so the comparison holds regardless of the speech materials. Yet, caution should be taken, because the index range for the acoustic space also depends on the normalization factor (for example, the number of bands used to calculate the reference acoustic space).

Chapter 7
Speech intelligibility modeling: customization with psychoacoustic measurements

This chapter unifies all of the psychoacoustic measurements from individual CI users into the proposed general speech intelligibility model. The customized speech intelligibility predictor was used to predict inter-subject performance differences. The effect of the different psychoacoustic measurements on the prediction ability was also analyzed and discussed.

7.1 Introduction

Working towards an understanding of inter-subject performance differences, it is necessary to incorporate individual CI users' device settings and unique psychoacoustic patterns into the proposed speech intelligibility model. Chapter 4 discussed the possible factors that may relate to inter-subject performance differences and that can be integrated into the model. Chapter 5 showed that the general speech intelligibility model was able to predict speech perception under parametric variations; Chapter 6 showed that the proposed intelligibility model was able to reflect the effect of electrode confusions.
In this chapter, individual CI users' MAPs were fully considered and incorporated into the model. Furthermore, psychoacoustic responses were measured on individual CI users rather than using simulated data as in Chapter 6. The customized speech intelligibility model is used to predict inter-subject performance differences.

7.2 Method

Based on the general speech intelligibility model, the customized model unifies the various psychoacoustic measurements and is illustrated in Figure 7-1. The acoustic features were first extracted based on the stimulation pulse rate and the acoustic partition map stored in the CI device. The acoustic range used to stimulate the electrode array was chosen based on the CI device design (in this study, 1% clipping was used to anchor the upper level of the acoustic range, and a 30 dB acoustic range was used to anchor the lower level). A nonlinear acoustic-to-electric mapping was applied based on the Q value of the MAP in each CI user's program, converting the features to perceptually relevant percentages of dynamic range. The acoustic features reflecting the percentages of dynamic range were then corrupted by the electrode discrimination patterns with a linear scaling operation, as introduced in Chapter 6. Finally, the corrupted features were quantized according to the intensity resolution within the dynamic range of electric stimulation. Furthermore, when calculating the acoustic space, weightings based on the relative dynamic range of the electrode array were applied when combining cross-band acoustic distances.

[Figure 7-1 schematic: a time-frequency feature matrix is extracted according to the stimulation rate (pps) and the frequency allocation MAP, mapped to the acoustic dynamic range, multiplied by the electrode confusion patterns, quantized according to the intensity jnd, and combined with dynamic range weightings to form the individualized predictor in place of the general speech intelligibility predictor.]
Figure 7-1: Framework for the individualized speech intelligibility model.

7.3 Experimental Design

7.3.1 Subjects

Subjects S2, S3, S5 and S8 participated in this study.

7.3.2 Psychoacoustic measurement procedures

The following psychoacoustic measurements were conducted through the House Ear Institute Nucleus Research Interface (HEINRI). The software used to perform the psychoacoustic study was developed by Qian-Jie Fu and Xiaosong Wang at the House Ear Institute.

1) Measurement of T/C levels. Measurement of the threshold (T) and maximum comfortable (C) level of each activated electrode on the electrode array is the basis for all the other psychoacoustic measurements; it sets the electric stimulation range for the electrode array. The stimulus parameters for measuring the T/C levels were based on each CI user's daily speech processor, specifically: the number of activated electrodes, the pulse rate, the pulse width and pulse separation, and the stimulation mode. The pulse duration for measuring the T/C levels was set to 200 ms. When measuring the threshold level, the experimenter gradually increased the stimulation amplitude and varied the number of stimulation trains delivered to a given electrode; the threshold level was the lowest stimulation level at which the CI subject could consistently report the number of stimulation trains correctly. When measuring the maximum comfortable level, the number of stimulation trains was fixed at 2.
The experimenter gradually increased the stimulation amplitude from the T level, and the subject reported the loudness of the stimulation as it approached the maximum comfortable level. The maximum comfortable level is the highest stimulation amplitude that the CI subject can accept over a brief stimulation period. After measuring the T/C levels of the individual electrodes, a loudness balancing procedure was used to fine-tune the C levels to make sure all the electrodes were equally loud when stimulated at 70% of each electrode's dynamic range.

2) Measurement of electrode discrimination patterns. Based on the measured T/C levels, the stimulation amplitude at 70% of the dynamic range (DR) of each electrode was calculated and used as the standard level to measure the electrode discrimination patterns. In this experiment, a "pitch ranking" procedure was used to evaluate the CI subject's ability to distinguish electrical stimulation of different electrodes. All electrode pairs were measured with a 2AFC procedure, and each pair of electrodes was tested with at least 30 comparisons. To avoid interference from loudness cues in the pitch ranking process, the stimulation amplitude was roved over a range of approximately 10% DR around the standard levels (i.e., the measured 70% DR).

3) Measurement of intensity resolution. Based on the measured T/C levels, a random electrode was selected to estimate the intensity resolution within the DR. The intensity just noticeable difference (jnd) was measured with a 2AFC protocol, terminating when a minimum of 8 reversals, a maximum of 12 reversals, or 60 trials had been reached. The anchor points for measuring the intensity jnd were 5, 15, 25, 35, 45, 55, 65 and 75% of a reference electrode's DR. For subjects S2 and S5, the reference electrode was the specific electrode on which intensity discrimination was performed; for subjects S3 and S8, a different electrode was selected. (Due to the labor-intensive and time-consuming nature of the psychoacoustic measurements involved in this study, data available from other studies in the same laboratory at the House Ear Institute were used when possible; specifically, for the intensity jnd measurement, the data for subjects S3 and S8 were collected by John Galvin in another study.) Based on the measured jnd and dynamic range, the intensity resolution was calculated as the number of discriminable steps within the dynamic range. To save experimental labor, this intensity resolution was applied to all the electrodes.

7.3.3 Speech recognition materials and procedures

The materials for the speech recognition test were the same as those in Chapter 6. Each testing block consisted of 96 vowels produced by one talker. Speech recognition scores for individual talkers were tested with 2-3 testing blocks. These data were collected by Yi-Ping Chang in another study.

7.4 Psychoacoustic measurement results

7.4.1 CI MAP and electrode T/C levels

Individual CI users' MAPs and the measured T/C levels are listed in Appendix D. There is clearly a large difference in individual subject settings and dynamic ranges.

7.4.2 Acoustic to electric mapping

In the Nucleus device, the "Q value controls the steepness of the amplitude growth algorithm and determines the percentage of the recipient's electrical dynamic range that is allocated to the top 10 dB of the speech processor's channel amplitude range" (Cochlear, 1999). All the subjects tested in this study shared the common Q value of 20, which means that the upper 20% of the DR was mapped to the top 10 dB of the delivered acoustic dynamic range.
Note that in the Nucleus device, the absolute acoustic input range is 30 dB, whose upper and lower bounds depend on the microphone sensitivity control. In our modeling study, we first calculated the histograms of the acoustic amplitudes of all channels from all stimuli. The maximum acoustic amplitude (acoustic C level) to be translated into electric stimulation at 100% DR (i.e., the C levels of the individual electrodes) was determined as the 99th percentile of the histograms (i.e., 1% clipping was applied). The minimum acoustic amplitude (acoustic T level) to be translated into electric stimulation at 0% DR (i.e., the T levels of the individual electrodes) was 30 dB lower than the acoustic C level. Acoustic amplitudes outside this range were clipped or thresholded to the acoustic C and T levels, respectively. For acoustic amplitudes between the acoustic T and C levels, a power function, similar to the nonlinear amplitude mapping in Chapter 5, was used to translate the acoustic amplitude into the percent dynamic range of the electrode array as follows:

$$
\%DR = k \cdot \mathrm{acou}^{\,p} - c \qquad (7\text{-}1)
$$

where $k$, $c$ and $p$ are constants determined by mapping three points on the acoustic-to-electric mapping curve: (the acoustic T level, 0% DR), (the acoustic C level, 100% DR) and (the acoustic amplitude at T + 20 dB, 80% DR). For Q = 20, the compression power $p$ was calculated to be 0.1822.

7.4.3 Intensity resolution

The intensity resolution in the dynamic range across the electrode array was estimated from one electrode only. The measured intensity shift required for intensity discrimination was adjusted to the minimal clinical unit of 0.18 dB when the 2AFC result reported a smaller shift than this clinical unit. The measured resolution steps for subjects S2, S3, S5, and S8 were 32, 24, 30, and 49, respectively.

7.4.4 Electrode confusion patterns

Suppose the electrode confusion matrix is of size $N \times N$, where $N$ is the number of active electrodes. The percent correct of the pitch ranking between all the electrode pairs was represented in an upper triangular matrix, termed the electrode comparison pattern $P$. To incorporate the electrode comparison pattern (percent correct) into the feature matrix, further processing was needed to convert it to a matrix that represents how much information leaks to where when a certain electrode is stimulated. The conversion from the electrode comparison pattern ($P$) to the normalized electrode confusion pattern ($C$) was as follows:

1) Create a symmetric matrix $A = P + P'$, where $P'$ is the transpose of the electrode comparison pattern (i.e., a lower triangular matrix);
2) Convert the percent correct to percent wrong by subtracting each entry of matrix $A$ from 100; the resulting matrix is named $W$;
3) Replace the diagonal values of $W$ with 100 and normalize each row to sum to 1.

This process assumes that when acoustic information is delivered to a certain electrode, the information is distributed or coupled to other electrodes according to the electrode confusions within the electrode array. To better convey this physical meaning, the resulting matrix is named the normalized electrode confusion pattern ($C$). The four subjects' electrode comparison patterns (percent correct) are summarized in Appendix E, together with their corresponding normalized electrode confusion patterns.
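A minimal sketch of steps 1)-3) above is shown below (Python with NumPy; the helper names and the toy score matrix are illustrative assumptions, not values from the actual subjects):

```python
import numpy as np

def normalized_confusion(P):
    """Convert an upper-triangular pitch-ranking matrix of percent-correct
    scores into the row-normalized electrode confusion matrix C."""
    A = P + P.T                              # step 1: symmetric percent-correct matrix
    W = 100.0 - A                            # step 2: percent wrong for each pair
    np.fill_diagonal(W, 100.0)               # step 3a: self-entries set to 100
    return W / W.sum(axis=1, keepdims=True)  # step 3b: each row normalized to sum to 1

def average_confusion(C):
    """Average confusion level across electrodes, as in Eq. (6-2)."""
    return float(np.mean(1.0 - np.diag(C)))

# Toy 3-electrode example (scores made up for illustration only).
P = np.array([[0.0, 95.0, 99.0],
              [0.0,  0.0, 90.0],
              [0.0,  0.0,  0.0]])
C = normalized_confusion(P)
print(C.round(3))
print(round(100 * average_confusion(C), 1))  # percent confusion per channel
```

When the pitch ranking is perfect (all off-diagonal scores equal to 100), C reduces to the identity matrix and the corrupted features equal the original features; as discrimination worsens, more of each band's energy is coupled to its neighboring bands.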
7.5 Prediction results

To evaluate the prediction ability of the customized speech intelligibility model for inter-subject performance differences, we compared the prediction performance without customization (i.e., without integrating individual subjects' psychoacoustic responses) to that of the model with customization. We investigated the inter-subject performance differences in terms of three aspects: averaged performance across talkers, individual subjects' talker preferences, and acoustic distances of vowel pairs.

7.5.1 Averaged performance across talkers

As shown in the speech enhancement studies, one aspect of inter-subject performance difference is that CI users' speech recognition performance lies at different levels even when listening to the same set of stimuli. To predict such inter-subject performance differences, the acoustic space based on the speech intelligibility model was correlated with the measured speech recognition performance. In this section, we focus on the averaged performance across talkers. Figure 7-2 shows the relationship between the acoustic space and speech performance in percent correct, with and without customization of the speech intelligibility model. Note that the values plotted were averaged across the four different talkers. Without customization, the acoustic space was relatively small, in the range of [12, 14], while the percent correct ranged between [74, 97]. The linear prediction ability of the acoustic space without customization was very marginal [linear regression, r2 = 0.03, p = 0.84]. When the acoustic space was customized with individual CI users' psychoacoustic responses, the prediction ability of the acoustic space was boosted to a much higher value [linear regression, r2 = 0.58, p = 0.24].

Figure 7-2: Relationship between acoustic space and averaged performance across talkers, with and without customization of the speech intelligibility model (without: r2 = 0.03; with: r2 = 0.58).

To factor out which psychoacoustic response contributed most to the boost in prediction ability, acoustic spaces for different psychoacoustic response combinations were also calculated and correlated with the speech performance. Since the acoustic stimulation rates and the MAP were already embedded in the acoustic feature extraction, we focused on the factors of dynamic range (D), acoustic-to-electric compression (C), electrode confusion patterns (E), and intensity resolution (I). Table 7-1 lists the linear regression results (r2 and p values) of the prediction ability under the 2^4 = 16 combinations of the four psychoacoustic responses. Note that the combinations are ordered by increasing r2 value. It was found that dynamic range and electrode confusion patterns contributed most to the prediction of inter-subject performance differences, while intensity resolution and acoustic-to-electric compression contributed the least. When different psychoacoustic responses were combined, the prediction ability fell between these two extremes.

Table 7-1: Prediction ability for averaged performance across talkers with the customized speech intelligibility model under different psychoacoustic response combinations.
  Combinations                            r2     p
  None                                    0.03   0.84
  Intensity resolution (I)                0.02   0.87
  Acoustic-to-electric compression (C)    0.16   0.6
  C+I                                     0.16   0.6
  D+C+E                                   0.53   0.24
  D+C                                     0.57   0.25
  D+C+I                                   0.57   0.24
  D+C+E+I                                 0.58   0.24
  D+E+I                                   0.62   0.22
  D+E                                     0.63   0.21
  C+E                                     0.69   0.17
  C+E+I                                   0.69   0.17
  E+I                                     0.82   0.1
  Electrode confusion (E)                 0.83   0.09
  D+I                                     0.88   0.06
  Dynamic range (D)                       0.89   0.06

7.5.2 Individual subjects' talker preference

The customized speech intelligibility model was further used to analyze talker preference patterns in individual CI subjects. Figure 7-3 shows the relationship between the acoustic space and speech performance for individual talkers. Black symbols represent the condition without customization of the speech intelligibility model; red symbols represent the condition with customization. The r2 values of the linear regressions are also shown. The correlation strength (r2) increased substantially with the customized model for subjects S2, S3 and S8. However, the correlation strength remained very low for subject S5. Anecdotally, this subject remarked "how I wish all the people around me spoke like this talker (the best talker)", although the real reason behind this preference is unknown. This anecdotal report may indicate that subject S5's talker preference is dominated by other psychoacoustic responses that were not considered or not properly integrated into the customization model.

To factor out the contribution of different psychoacoustic responses to the talker preference effect, we also analyzed the model with different psychoacoustic response combinations and report the linear prediction ability of the corresponding acoustic space in Table 7-2. Note that the ordering of the psychoacoustic combinations follows Table 7-1. Statistical analysis was further performed on the correlation strength for subjects S2, S3 and S8, comparing the effect of customization to no customization. Psychoacoustic response combinations that significantly improved the prediction ability relative to no customization are asterisked in Table 7-2 [paired t-test, p < 0.05]. In contrast to the effect on averaged performance across talkers, the acoustic-to-electric compression factor was found to be the key factor in interpreting talker preference. The other individual psychoacoustic responses, including intensity resolution, electrode confusion patterns, and dynamic range, contributed very marginally to the talker preference effect. When comparing the prediction ability across different CI subjects, there were substantial inter-subject prediction differences, ranging from zero (S5) to over 0.95 (S8).

Figure 7-3: Relationship between acoustic space and individual CI users' talker preference, with and without customization of the speech intelligibility model.

Table 7-2: Prediction ability for talker preference with the customized speech intelligibility model under different psychoacoustic response combinations. Asterisks mark combinations that significantly improved the prediction ability relative to no customization (paired t-test, p < 0.05).

  Combinations                              S2 r2   S3 r2   S5 r2   S8 r2
  None                                      0.03    0.08    0.00    0.46
  Intensity resolution (I)                  0.03    0.08    0.00    0.46
  Acoustic-to-electric compression (C) *    0.63    0.83    0.03    0.95
  C+I *                                     0.63    0.84    0.03    0.95
  D+C+E *                                   0.44    0.73    0.00    0.95
  D+C *                                     0.55    0.83    0.01    0.92
  D+C+I *                                   0.55    0.84    0.01    0.92
  D+C+E+I *                                 0.50    0.73    0.00    0.98
  D+E+I                                     0.10    0.09    0.00    0.50
  D+E                                       0.09    0.09    0.00    0.49
  C+E *                                     0.49    0.73    0.00    0.98
  C+E+I *                                   0.50    0.73    0.00    0.98
  E+I                                       0.11    0.10    0.01    0.48
  Electrode confusion (E)                   0.09    0.10    0.00    0.48
  D+I                                       0.04    0.07    0.00    0.49
  Dynamic range (D)                         0.04    0.07    0.00    0.49

7.5.3 Acoustic distances of vowel pairs

To study the effect of the customized speech intelligibility model, the acoustic distances with and without customization of the model were further analyzed. Figure 7-4 shows the percent confusion of all the confused vowel pairs and their corresponding acoustic space. When the acoustic space of a vowel pair was smaller, the probability of perceptual confusion was higher; yet this relation was not a simple one-to-one mapping. Statistical analysis showed that the acoustic space with customization of the intelligibility model (with all psychoacoustic responses) was significantly different from that without customization for all subjects [paired t-test, p < 0.05].

Figure 7-4: Relationship between acoustic space and percent confusions of vowel pairs in individual CI users, with and without customization of the speech intelligibility model.

Table 7-3 compares the means and standard deviations of the acoustic space for the correctly and incorrectly perceived vowel pairs, with and without customization of the model. The results indicate a clear pattern: the average acoustic distance of the correctly perceived vowel pairs was significantly larger than that of the incorrectly perceived vowel pairs, both with [paired t-test: t = -4.920, p = 0.016] and without customization [paired t-test: t = -12.214, p = 0.001]. Interestingly, compared to the acoustic space without customization, the average acoustic space for incorrectly perceived vowel pairs was lower under customization for subjects S2 and S3, while it was higher for subjects S5 and S8. The same pattern applied to the correctly perceived vowel pairs. This pattern implies that the psychoacoustic responses affect the acoustic space in a subject-dependent and complex manner.

Table 7-3: Descriptive statistics of the acoustic space for the incorrectly (wrong) and correctly (right) perceived vowel pairs, with and without customization of the intelligibility model. The numbers in parentheses are the standard deviations of the acoustic space.

  Wrong vowel pairs:
  Subject   Count   Avg. confusion (%)   Without, mean (std)   With, mean (std)
  S2        46      18.80                9.25 (5.25)           6.92 (5.11)
  S3        78      15.59                9.20 (5.34)           4.18 (3.38)
  S5        42      16.12                7.96 (4.66)           13.44 (11.22)
  S8        10      18.30                6.03 (2.52)           12.09 (6.51)

  Right vowel pairs:
  Subject   Count   Without, mean (std)   With, mean (std)
  S2        482     15.52 (8.93)          14.74 (9.99)
  S3        450     14.50 (8.99)          8.34 (5.43)
  S5        486     14.52 (8.86)          25.80 (17.15)
  S8        518     13.91 (9.55)          20.96 (12.99)

7.6 Discussion and conclusions

In this study, the general speech intelligibility model was expanded to predict inter-subject performance differences in terms of three aspects: averaged performance across talkers, individual subjects' talker preferences, and acoustic distances of vowel pairs. Various psychoacoustic measurements of individual CI users were used to modulate the acoustic features. The customized speech intelligibility model achieved much higher linear prediction ability for averaged performance across talkers and for individual subjects' talker preferences. The analysis of the confused vowel pairs showed a general pattern that the smaller the acoustic space, the higher the probability of confusion.
Yet, the mixed patterns of change in the acoustic space under the customized model indicate that the various psychoacoustic responses reshape the acoustic distance in a complex and possibly subject-dependent manner. This may also imply that our unified model does not incorporate all of the factors that affect inter-subject performance differences.

The electrode confusion levels for the four CI subjects were calculated to be $\lambda_{S2} = 63.5\%$, $\lambda_{S3} = 72.3\%$, $\lambda_{S5} = 29.9\%$ and $\lambda_{S8} = 33.5\%$. Recall the finding in Chapter 6 that speech perception performance decreased significantly when the electrode confusion level exceeded 45% for the 6-channel condition or 60% for the 8-channel condition. Compared to this finding, subjects S5 and S8's electrode confusion levels were much smaller than the level that can result in a significant performance drop, while subjects S2 and S3's electrode confusion levels were much higher than this suggested threshold. This has important implications for the design of speech enhancement strategies for different CI subjects. Supposing a new technology became available to reduce electrode confusion, we may infer that subjects S2 and S3 could potentially achieve significant improvements in speech performance with this technology. However, subjects S5 and S8 would possibly not be recommended for this type of technology, since the effect of electrode confusion for these two subjects was not severe.

The analysis of the prediction ability under different psychoacoustic response combinations indicated that the rule governing inter-subject performance differences may differ from the rule determining talker preference within an individual subject. The dynamic range and electrode confusion patterns accounted for most of the linear prediction ability for inter-subject performance differences. Yet, for talker preference, the acoustic-to-electric compression accounted for most of the talker differences within individual CI subjects. Interestingly, however, the relationship between acoustic space and inter-subject performance difference is positive (i.e., the larger the acoustic space, the higher the subject's performance), while the relationship between acoustic space and talker preference is negative (i.e., the larger the acoustic space, the lower the performance for that talker). It is not fully understood why these two relationships are of opposite direction, considering that the acoustic space has the physical meaning of an average acoustic difference. Possible explanations are that: 1) as shown by the descriptive analysis of the acoustic distances of vowel pairs, the customized model reshapes the acoustic space in a subject-dependent manner, which may be related to the pulse stimulation rate and the acoustic band partition; 2) the acoustic-to-electric compression boosts originally weak acoustic distances in a very non-linear manner, and this non-linear process may transform the differences into another domain in which they can no longer simply be treated as acoustic distances; and 3) there may be
The dynamic range and electrode confusion patterns were found to be the dominant keys to decode the averaged performance of CI users. Imagine that the dynamic range of electric stimulation was able to be as large as the acoustic signal’s dynamic range, and the electrode confusion patterns were able to be asymptotically approach identity (i.e., no confusions), the delivery of the acoustic signals won’t experience compression and mismatch. Under such assumptions, the CI subjects were expected to obtain the optimal averaged performance. Such inference is very reasonable. Although it is out of the scope of current thesis, there are other exciting research directions working towards a better electrode array design (e.g., MEMS), hoping to obtain a better stimulation scheme for CI users. On the other hand, the acoustic-to-electric compression was found to be the key components relating to the talker preference. This makes sense when we think of the effect of modulation depth (either temporal or spectral) to speech intelligibility. Interestingly, note that all the CI subjects participated in this study all share the same curve of the compression. This may imply that the acoustic-to-electric nonlinear compression drives different talkers’ speech to map to a skewed space, which further changed the speech intelligibility in a complex manner. This finding may imply that toward to 147 better listening across all talkers, the acoustic-to-electric nonlinear compression needs to be further fine tuned, either in a fixed manner or adaptive manner. It was found that the intensity resolution within the dynamic range contributed very marginal to the prediction. This agreed with the previous findings from Loizou et al. 2000. Yet, cautions should be taken considering the following aspects of the current study: a) the effect of intensity resolution may be partly involved with the effect of dynamic range; b) the intensity resolution estimation was only from one single electrode and used to quantize the spectral features across all channel information. These estimations may be too weak to give the intensity resolution enough potential to predict the inter-subject performance difference. Future study may want to further study this factor with more details. Our results indicated a successful prediction ability of the customized speech intelligibility model. However, with this model, it does not necessary mean that the brain process the information in an exactly the same manner as the model proposed here. In the proposed model, some other features that the brain possesses may not be incorporated into the model, for example, the nonlinearity process. Although we introduced the nonlinear compressive mapping from acoustic feature to electrode stimulation, the proposed model, overall, integrated the feature vectors in a linear manner. For example, the electrode confusion patterns were used to smear the acoustic features in a linear weighting fashion. Furthermore, to evaluate the prediction ability of the proposed model, linear prediction ability was assessed. It is believed that the speech recognition process is far more complicated in human 148 brains. Yet, with the customized speech intelligibility model, we can decode the highly nonlinear process with a series of simple linear deterministic process by incorporating individual listeners’ psychoacoustic response. It implies that this model is effective to explain the important speech recognition process, at least in terms of inter-subject performance difference. 
Chapter 8
Conclusions and future work

8.1 Conclusions

This thesis addresses two inter-related challenges in CI speech perception: a marked lack of robustness and large inter-subject performance differences. Specifically, a speech enhancement framework was studied to compensate for non-robust CI speech perception based on individual CI subjects' high-level perception patterns, and a novel speech intelligibility model unifying various psychoacoustic responses was developed to model CI inter-subject performance differences.

First, a speech enhancement framework was proposed to compensate for the deficits of speech listening in CI users. Based on the criteria that speech enhancement in CI should be flexible with respect to different CI speech processing strategies, adaptive to different listening conditions, and implementable on the fly, a speech enhancement framework consisting of front-end speech processing, off-line training and on-line synthesis was proposed. This framework was specifically studied in two situations: telephone speech listening and multi-talker listening.

In telephone speech listening, unlike NH listeners, CI users show obvious disadvantages due to the lack of visual cues, reduced speech quality, and the narrow spectral bandwidth. A speech enhancement method was studied to address the spectral bandwidth by exploiting the strong relationship between narrowband telephone speech and wideband speech. A GMM was trained to model the statistical distribution of the speech features, and a conversion function was trained to minimize the feature difference on a training dataset. The learned conversion function was then used to convert sentences to render bandwidth-extended speech. Speech recognition performance was measured with CI users listening to narrowband telephone speech and bandwidth-extended speech. Results showed a small yet significant improvement with the bandwidth-extended speech.

In the multi-talker listening task, CI users showed substantial cross-talker intelligibility differences, in contrast to the robust speech recognition of NH listeners. Furthermore, different CI users have substantially different talker preferences. The speech enhancement framework was applied to modify non-ideal talker speech to mimic ideal talker speech, based on the talker preference of individual CI users. The modification was performed in the spectral domain with GMM modeling. Two experiments were designed to evaluate this methodology: two talkers with naturally recorded speech materials, and six simulated talkers with different pitches but the same talking speed and modulation depth. Speech recognition performance measured with CI users showed significant improvement for non-ideal talker speech after the spectral normalization method.

From the above two speech enhancement cases, it was observed that different CI users have very different speech recognition levels and different talker preferences. The second part of the thesis modeled such differences by unifying various
The study of this intelligibility model involved three phases. Phase 1 studied the prediction ability of the model, without customization of the features, under different parametric manipulations; Phase 2 studied the prediction ability of the model with electrode confusion patterns for a general body of CI users; Phase 3 studied the intelligibility model with customized psychoacoustic measurements, focusing on the inter-subject performance difference.

The general findings of the intelligibility model are: 1) without customization, the model is an effective predictor for a general body of CI users under varying numbers of spectral channels, spectral smearing, and spectral warping; 2) the electrode confusion pattern is a significant factor for reshaping the acoustic space to predict perceptual changes under different numbers of channels and electrode smearing conditions; 3) the intelligibility model with customized psychoacoustic responses efficiently breaks the ties among the acoustic distances calculated by the general, uncustomized model. The customized intelligibility model significantly boosted the linear prediction of inter-subject performance, both in terms of averaged performance across talkers and in terms of individual subjects' talker preferences. Interestingly, the electrode confusion pattern was found to be the most prominent factor differentiating CI users' average performance levels, whereas for talker preference the acoustic-to-electric nonlinear compression dominated the prediction ability in the acoustic space.

To give a more general view of the inter-subject performance difference and its psychoacoustic causes, Figure 8-1 summarizes the four CI subjects who participated in all the speech recognition tests and psychoacoustic measurements in this study. Note that, for compactness, not all of the speech perception data and psychoacoustic measurements are listed in full detail. For example, the average dynamic range is listed simply as an overview of the dynamic range map across the whole electrode array, which is not the form in which this information was incorporated in our study. In terms of speech perception, subjects S5 and S8 operated their CI devices at a much higher performance level than subjects S2 and S3. However, compared against the CI settings and psychoacoustic responses, no single factor can linearly explain why this pattern occurs except the electrode confusion patterns, which show that the electrode confusion of subjects S5 and S8 was much lower than that of subjects S2 and S3. For example, although a higher electric stimulation rate is typically thought to deliver higher temporal resolution, it is not always true that CI subjects with a higher stimulation rate achieve higher speech perception than their counterparts. Similar observations apply to the number of active electrodes, the intensity resolution, and the average dynamic range.
This strongly supports the point that CI speech perception undergoes a very complex and nonlinear reshaping of the delivered acoustic information, distorted by individualized psychoacoustic responses. It may also explain why a number of CI studies have reported mixed results when examining the effect of a single psychoacoustic response on speech perception.

Figure 8-1: Summary of CI speech perception patterns and psychoacoustic measurements. (The original figure plots percent correct, 20-100%, for four conditions, labeled 11025 male, 11025 female, 3400 male, and 3400 female, and tabulates for subjects S2, S3, S5, and S8: stimulation rates of 250, 900, 1200, and 250 pps; 20, 16, 20, and 10 active electrodes; electrode confusion levels of 63.5%, 72.3%, 29.9%, and 33.5%; intensity resolutions of 32, 24, 30, and 49 steps; and average dynamic ranges of 6.81, 9.08, 11.53, and 8.71 dB.)

In summary, this thesis optimized CI speech perception through a systematic approach unifying the fields of speech signal processing, hearing science, and speech perception. A speech enhancement framework exploited individual CI subjects' speech recognition patterns to optimize the auditory signal, so that the input signal better matches the internal, complex feature representation formed through the CI device and the individual's psychoacoustic responses. The proposed speech enhancement framework was shown to effectively improve CI speech understanding. To further address the observed large inter-subject performance difference, a customized intelligibility model was proposed and studied. This model is, to our knowledge, the first available model that draws on inter-disciplinary studies to address high-level speech perception patterns. The model was validated against the general body of CI performance in both CI simulation studies and real CI measurements. It characterized and modeled individual CI subjects' device settings and psychoacoustic responses in a unified manner, and was shown to significantly boost the linear prediction of performance differences across subjects. The contribution analysis of the different psychoacoustic responses sheds light on the psychoacoustic causes of the inter-subject performance difference. By directly modeling the inter-subject performance difference, this thesis extends the currently available speech intelligibility models, which target only speech understanding by normal-hearing listeners.

This thesis provides a systematic approach for integrating information across different disciplines to model and understand the causes of observed high-level perception patterns. With such understanding, the next generation of CI devices may be able to address the most critical elements in this complex system. This work promotes individualized CI devices that help each patient achieve the maximum benefit from a costly medical device. It also has important implications for the design of other assistive medical devices (e.g., retinal implants, prosthetic arms), in the sense that it is both possible and beneficial to develop a unified framework and performance measurement that properly account for individual patients' settings and responses, so that a "sweet spot" of the device can be reached for each patient. This can help usher in a new era of medical devices that provide fully customized fitting to each patient.
8.2 Future work

Toward improving speech understanding through cochlear implant devices, speech enhancement and intelligibility modeling have played, and will continue to play, an important role. This thesis addressed the problem by integrating three aspects of speech and hearing: speech perception patterns, speech processing, and psychoacoustic studies. We believe that working in this direction offers insights that may not be available from any single angle. Based on the findings of the current study, future work may consider the following directions:

1) Speech enhancement in the temporal domain and in the CI settings. Although we showed that speech understanding can be improved by considering only spectral features, many temporal cues are also important for CI speech understanding, as they are in natural speech perception. Future studies may wish to incorporate temporal cues into the enhancement. Speech morphing may be a powerful approach to enhance speech, even after accounting for the artifacts it may introduce. Speech enhancement may also be pursued through the CI settings themselves where appropriate, which may avoid complex speech modification and synthesis. For example, the finding that the acoustic-to-electric compression dominated talker preference suggests that a better design of the Q value in the CI device is important for improving non-ideal talkers' speech, in addition to the spectral normalization technique studied under the speech enhancement framework. A better Q-value design may imply optimal loudness-growth control of the spectral components for different talkers. Given the variety of talkers encountered in daily life, a Q value that adapts to the environment is a promising way to preserve a higher level of perceptual constancy across talkers, although considerable challenges are involved in this task as well.

2) Speech intelligibility standards for cochlear implant users. The prediction ability of the customized model in this thesis is an important supplement to the currently available speech intelligibility standards, which generally apply only to normal-hearing listeners. By combining the essential speech processor parameters of individual CI users with their individual psychoacoustic measurements, we showed promising results in predicting individualized speech perception patterns that may not be predictable from a general speech intelligibility model. However, the psychoacoustic responses involved in this study may not include all of those that affect speech perception in CI users. Future studies may wish to examine the effect of other psychoacoustic responses and to consider how to integrate such measurements into the model in a better manner.

3) A better CI electrode array design. The electrode confusion pattern was found to dominate the inter-subject performance difference. This finding suggests that efficiently delivering the right acoustic information to the right place is crucial in CI. With the current discrete electrode array design, there may be an upper limit that the system cannot surpass. Although it is outside the scope of this thesis, other research directions, such as new electrode array designs based on MEMS techniques and totally implantable cochlear implants, have started to address this issue.
These research directions may lead to a better CI electrode array design and possibly better speech understanding for the general body of CI users.

Appendix A
Estimation of the conversion function based on GMM modeling

Suppose we want to estimate a conversion function F(x_t) from a source parameter x_t to a target parameter y_t such that the error between the converted parameter and the target parameter is minimized over the whole dataset. Suppose the source parameter distribution is modeled as a GMM. A GMM represents the distribution of the observed parameters x by m Gaussian mixture components in the form

    p(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i \, N(\mathbf{x}; \boldsymbol{\mu}_i, \Sigma_i)    (A-1)

where \alpha_i denotes the prior probability of component i (\sum_{i=1}^{m} \alpha_i = 1 and \alpha_i \ge 0), and N(\mathbf{x}; \boldsymbol{\mu}_i, \Sigma_i) denotes the normal density of the i-th component, with mean vector \boldsymbol{\mu}_i and covariance matrix \Sigma_i, in the form

    N(\mathbf{x}; \boldsymbol{\mu}_i, \Sigma_i) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right]    (A-2)

where p is the number of vector dimensions. The parameters of the model (\alpha, \boldsymbol{\mu}, \Sigma) can be estimated using the well-known expectation-maximization (EM) algorithm (Huang et al. 2001).

Once the source parameter distribution is characterized by a GMM, the conversion function F(x_t) is chosen such that the total conversion error over the n training vectors,

    \varepsilon = \sum_{t=1}^{n} \lVert \mathbf{y}_t - F(\mathbf{x}_t) \rVert^{2}    (A-3)

is minimized on the learning data. In Eq. (A-3), \mathbf{x}_t is a source vector and \mathbf{y}_t is the time-aligned target vector. Assuming that the source vectors follow the GMM and that the source and target vectors are jointly Gaussian, the conversion function is given in the form (Stylianou et al. 1998; Kain and Macon 1998; Mendel 1995)

    F(\mathbf{x}_t) = \sum_{i=1}^{m} P(C_i \mid \mathbf{x}_t) \left[ \boldsymbol{\nu}_i + \Gamma_i \Sigma_i^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_i) \right]    (A-4)

where P(C_i \mid \mathbf{x}_t) is the posterior probability of the i-th Gaussian component given \mathbf{x}_t, calculated by applying Bayes' theorem:

    P(C_i \mid \mathbf{x}_t) = \frac{\alpha_i N(\mathbf{x}_t; \boldsymbol{\mu}_i, \Sigma_i)}{\sum_{j=1}^{m} \alpha_j N(\mathbf{x}_t; \boldsymbol{\mu}_j, \Sigma_j)}    (A-5)

The unknown parameters \boldsymbol{\nu}_i and \Gamma_i are computed by solving the following set of over-determined linear equations for all feature vectors t = 1, ..., n:

    \mathbf{y}_t = \sum_{i=1}^{m} P(C_i \mid \mathbf{x}_t) \left[ \boldsymbol{\nu}_i + \Gamma_i \Sigma_i^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_i) \right]    (A-6)

Note that Eq. (A-4) and Eq. (A-6) are identical on the right-hand side but differ on the left-hand side [F(\mathbf{x}_t) for Eq. (A-4), \mathbf{y}_t for Eq. (A-6)]. Hence, the minimum mean square error (MMSE) solution for \boldsymbol{\nu}_i and \Gamma_i from Eq. (A-6) guarantees that the total conversion error of Eq. (A-3) is minimized, and estimating \boldsymbol{\nu}_i and \Gamma_i from Eq. (A-6) determines the conversion function in Eq. (A-4).

Solving for \boldsymbol{\nu}_i and \Gamma_i from Eq. (A-6) is associated with an intensive computational load and storage requirement because it involves the inversion of a very large matrix. The problem can be simplified when the covariance matrix \Sigma_i and the conversion matrix \Gamma_i are diagonal. This is termed diagonal conversion and is used in the current study. In this case, the optimization can be split into independent scalar minimization problems by considering each coordinate of the vectors separately. The k-th coordinate of Eq. (A-6) can be written as

    y_t^{(k)} = \sum_{i=1}^{m} P(C_i \mid \mathbf{x}_t) \left[ \nu_i^{(k)} + \gamma_i^{(k)} \bigl( x_t^{(k)} - \mu_i^{(k)} \bigr) / \sigma_i^{(k)} \right]    (A-7)

where the superscript (k) denotes the k-th coordinate of a vector, and \gamma_i^{(k)} and \sigma_i^{(k)} are the k-th diagonal elements of \Gamma_i and \Sigma_i, respectively.
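As an illustration of the diagonal conversion in Eqs. (A-5) through (A-7), the following sketch fits the source GMM with scikit-learn and solves the per-coordinate least-squares problem with numpy. It is illustrative rather than the implementation used in this thesis; all function and variable names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_diagonal_conversion(X, Y, m=8):
    """Fit a source GMM and learn per-coordinate (nu, gamma) as in Eq. (A-7).
    X, Y: time-aligned source/target features, shape (n_frames, n_dims)."""
    gmm = GaussianMixture(n_components=m, covariance_type="diag").fit(X)
    P = gmm.predict_proba(X)            # posteriors P(C_i | x_t), shape (n, m)
    var = gmm.covariances_              # per-component variances (diagonal of Sigma_i)
    nu = np.zeros_like(gmm.means_)
    gamma = np.zeros_like(gmm.means_)
    for k in range(X.shape[1]):         # one scalar least-squares problem per coordinate
        Z = (X[:, k:k+1] - gmm.means_[:, k]) / var[:, k]      # (n, m)
        A = np.hstack([P, P * Z])       # design matrix for [nu; gamma]
        w, *_ = np.linalg.lstsq(A, Y[:, k], rcond=None)
        nu[:, k], gamma[:, k] = w[:m], w[m:]
    return gmm, nu, gamma

def convert(gmm, nu, gamma, X):
    """Apply the learned diagonal conversion function, Eq. (A-7)."""
    P = gmm.predict_proba(X)
    var = gmm.covariances_
    Z = (X[:, None, :] - gmm.means_) / var                    # (n, m, d)
    return np.einsum("nm,nmd->nd", P, nu + gamma * Z)
```

In practice, X and Y would be time-aligned source and target spectral features (for example, aligned by dynamic time warping), as in the bandwidth extension and spectral normalization experiments.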
Appendix B
The conversion between Mel-scaled LSF and linear-scaled LPC

Suppose an all-pole filter with a sufficient number of poles is a good approximation for speech signals and is modeled as

    H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (B-1)

where p is the order of the LPC analysis and a_1, ..., a_p are the LPC coefficients. The inverse filter A(z) is defined as

    A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}    (B-2)

B.1. Conversion between linear LPC and Mel-scaled LPC

When the coefficients a_1, ..., a_p are determined from LPC analysis, the all-pole filter is determined. The conversion from linear LPC to Mel-scaled LPC involves four steps:

Step 1: Calculate the frequency response of H(z) with a uniform step in Hz.
Step 2: Warp the result from Step 1 to a Mel-scaled frequency response. The warping follows the relationship M(f) = 1125 \ln(1 + f/700), where f is the frequency in Hz and M(f) is the corresponding Mel frequency in Mels. This results in a non-uniformly sampled frequency response in the Mel domain.
Step 3: Interpolate the result from Step 2 with a uniform step in the Mel domain. A cubic spline interpolation is typically used. This results in a uniformly sampled frequency response in the Mel domain.
Step 4: Convert the result from Step 3 to Mel-scaled LPC coefficients. A least-squares fit is used to determine the coefficients of the Mel-scaled filter.

Suppose the above conversion results in Mel-scaled LPC coefficients b_1, ..., b_k. These coefficients are further represented as LSFs for modeling. For speech synthesis, the Mel-scaled LPC coefficients b_1, ..., b_k need to be converted back to linear LPC coefficients. The conversion in this direction follows similar steps; the only difference is the use of the inverse relationship f = 700 (e^{M(f)/1125} - 1) in Step 2 to map between the Mel-scaled and linear frequency axes.

B.2. Conversion between LSF and LPC

From the inverse filter defined in Eq. (B-2), we define

    P(z) = A(z) + z^{-(p+1)} A(z^{-1}) = 1 - \sum_{k=1}^{p} (a_k + a_{p+1-k}) z^{-k} + z^{-(p+1)}
    Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) = 1 - \sum_{k=1}^{p} (a_k - a_{p+1-k}) z^{-k} - z^{-(p+1)}    (B-3)

Then we have the relationship

    A(z) = \frac{P(z) + Q(z)}{2}    (B-4)

It can be proved that P(z) and Q(z) have important and interesting properties: the roots of P(z) and Q(z) lie on the unit circle, and the roots of P(z) and Q(z) alternate once sorted (Huang et al. 2001). When the order p is an even number, there is an additional property: P(z) has a root at -1 and Q(z) has a root at +1. Besides the roots at ±1, P(z) and Q(z) each have an additional p/2 pairs of conjugate roots. Therefore, when p is an even number, Eq. (B-3) can be rewritten as

    P(z) = (1 + z^{-1}) \prod_{i=1}^{p/2} (1 - e^{j\omega_i} z^{-1})(1 - e^{-j\omega_i} z^{-1}) = (1 + z^{-1}) \prod_{i=1}^{p/2} (1 - 2\cos\omega_i \, z^{-1} + z^{-2})
    Q(z) = (1 - z^{-1}) \prod_{i=1}^{p/2} (1 - e^{j\theta_i} z^{-1})(1 - e^{-j\theta_i} z^{-1}) = (1 - z^{-1}) \prod_{i=1}^{p/2} (1 - 2\cos\theta_i \, z^{-1} + z^{-2})    (B-5)

where \omega_i and \theta_i are the phases of the conjugate zeros of P(z) and Q(z), respectively. Suppose A(z) can be decomposed into a series of second-order filters, each with a pair of complex roots:

    A(z) = \prod_{i=1}^{p/2} (1 - s_i z^{-1})(1 - t_i z^{-1}) = \prod_{i=1}^{p/2} (1 - 2\rho_i \cos\psi_i \, z^{-1} + \rho_i^{2} z^{-2})    (B-6)

It can be shown that \omega_i and \theta_i are related to \psi_i as

    \cos\omega_i = \rho_i \cos\psi_i + \frac{1 - \rho_i^{2}}{2}
    \cos\theta_i = \rho_i \cos\psi_i - \frac{1 - \rho_i^{2}}{2}    (B-7)

By definition, \omega_i/(2\pi) and \theta_i/(2\pi) are the line spectral frequencies of A(z). Since |\rho_i| < 1, \cos\theta_i < \cos\psi_i and thus \theta_i > \psi_i; it is also the case that \omega_i < \psi_i. As |\rho_i| \to 1, \omega_i \to \psi_i and \theta_i \to \psi_i.

Eq. (B-7) can be used to convert LPC coefficients to LSF coefficients. The inverse form of Eq. (B-7) can be used to convert LSF coefficients back to LPC coefficients by solving for \rho_i and \psi_i; A(z) can then be calculated from either Eq. (B-6) or Eq. (B-4). Combining the two layers of conversion in Sections B.1 and B.2, the conversion between Mel-scaled LSF and linear LPC is determined.
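As a numerical illustration of Section B.2, the LSFs can be obtained by forming P(z) and Q(z) as in Eq. (B-3) and taking the angles of their unit-circle roots. The following is a minimal sketch, assuming an even LPC order and a stable (minimum-phase) A(z); it is not the thesis implementation, and the example coefficients are constructed for illustration only.

```python
import numpy as np

def lpc_to_lsf(a):
    """Line spectral frequencies (radians) from LPC coefficients a_1..a_p,
    where A(z) = 1 - sum_k a_k z^-k, via P(z) and Q(z) of Eq. (B-3)."""
    a = np.asarray(a, dtype=float)
    # Coefficients of A(z) in increasing powers of z^-1, padded to order p+1.
    A = np.concatenate(([1.0], -a, [0.0]))
    P = A + A[::-1]          # symmetric:      A(z) + z^-(p+1) A(z^-1)
    Q = A - A[::-1]          # antisymmetric:  A(z) - z^-(p+1) A(z^-1)

    def root_angles(poly):
        ang = np.angle(np.roots(poly))
        # keep one angle per conjugate pair; drop the trivial roots at z = +/-1
        return ang[(ang > 1e-8) & (ang < np.pi - 1e-8)]

    return np.sort(np.concatenate([root_angles(P), root_angles(Q)]))

# Example: build a stable 4th-order A(z) from two complex pole pairs, then get its LSFs.
poles = np.array([0.9 * np.exp(1j * 0.3), 0.8 * np.exp(1j * 1.2)])
A_poly = np.real(np.poly(np.concatenate([poles, poles.conj()])))  # [1, -a1, ..., -ap]
a = -A_poly[1:]
print(lpc_to_lsf(a))
```

Going the other way, A(z) can be rebuilt from a set of LSFs by reconstructing P(z) and Q(z) from their unit-circle roots and averaging them as in Eq. (B-4).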
Appendix C
Summary of CI subject demographics and participation in all experiments

Subject  Age  Gender  Etiology        Implant Type  Strategy  Duration of Implant Use (years)  Chapter 2  Chapter 3  Chapter 7
S1       67   M       Hereditary      Nucleus-22    SPEAK     14                                Y          Y          -
S2       75   M       Noise induced   Nucleus-22    SPEAK     9                                 Y          Y          Y
S3       72   F       Unknown         Nucleus-24    ACE       5                                 Y          Y          Y
S4       54   M       Unknown         Nucleus-22    SPEAK     11                                -          Y          -
S5       62   F       Genetic         Nucleus-24    ACE       2                                 Y          Y          Y
S6       55   M       Hereditary      Freedom       ACE       1                                 Y          Y          -
S7       52   M       Unknown         Clarion-CII   HiRes     6                                 -          Y          -
S8       48   M       Trauma          Nucleus-22    SPEAK     13                                Y          Y          Y
S9       64   M       Trauma/Unknown  Nucleus-22    SPEAK     15                                Y          Y          -

Appendix D
CI device MAP settings and T/C levels for individual CI subjects

Subject S2 MAP and T/C levels
Inter-phase gap (us): 45    Pulse width (us): 200    Stimulation rate (pps/channel): 250    Stimulation mode: BP+1

Active Electrode  Return Electrode  Lower Freq (Hz)  Upper Freq (Hz)  T (dB)  MCL (dB)  DR (dB)
20  22  120   280   50.55  54.84  4.29
19  21  280   440   49.54  54.52  4.98
18  20  440   600   49.19  54.68  5.49
17  19  600   760   48.82  54.99  6.17
16  18  760   920   48.60  54.84  6.24
15  17  920   1080  48.16  54.52  6.36
14  16  1080  1240  48.03  54.52  6.49
13  15  1240  1414  48.60  54.68  6.08
12  14  1414  1624  48.03  55.53  7.5
11  13  1624  1866  49.04  56.32  7.28
10  12  1866  2144  49.04  56.98  7.94
9   11  2144  2463  49.63  57.10  7.47
8   10  2463  2856  49.43  57.52  8.09
7   9   2856  3347  48.66  56.98  8.32
6   8   3347  3922  49.04  55.48  6.44
5   7   3922  4595  48.03  54.99  6.96
4   6   4595  5384  48.33  55.85  7.52
3   5   5384  6308  48.66  56.12  7.46
2   4   6308  7390  48.33  56.57  8.24
1   3   7390  8658  49.43  56.32  6.89

Subject S3 MAP and T/C levels
Inter-phase gap (us): 45    Pulse width (us): 100    Stimulation rate (pps/channel): 900    Stimulation mode: MP1+2

Active Electrode  Return Electrode  Lower Freq (Hz)  Upper Freq (Hz)  T (dB)  MCL (dB)  DR (dB)
22  R1+2  240   560   37.77  46.74  8.97
21  R1+2  560   880   36.01  46.74  10.73
20  R1+2  880   1200  35.48  46.21  10.73
19  R1+2  1200  1520  34.95  45.86  10.91
18  R1+2  1520  1840  35.48  46.04  10.56
17  R1+2  1840  2160  35.48  46.04  10.56
16  R1+2  2160  2480  36.89  46.56  9.67
15  R1+2  2480  2828  36.01  46.39  10.38
14  R1+2  2828  3249  37.42  46.56  9.14
13  R1+2  3249  3732  36.36  45.86  9.5
12  R1+2  3732  4288  37.06  45.68  8.62
11  R1+2  4288  4926  35.66  45.16  9.5
10  R1+2  4926  5713  37.42  45.33  7.91
9   R1+2  5713  6694  36.36  44.63  8.27
8   R1+2  6694  7844  39.18  44.28  5.1
7   R1+2  7844  9190  39.18  43.93  4.75

Subject S5 MAP and T/C levels
Inter-phase gap (us): 8    Pulse width (us): 25    Stimulation rate (pps/channel): 1200    Stimulation mode: MP1+2

Active Electrode  Return Electrode  Lower Freq (Hz)  Upper Freq (Hz)  T (dB)  MCL (dB)  DR (dB)
22  R1+2  120   280   43.35  49.88  6.54
21  R1+2  280   440   44.24  53.24  8.99
20  R1+2  440   600   43.69  52.89  9.19
19  R1+2  600   760   43.69  53.42  9.73
18  R1+2  760   920   43.69  54.30  10.61
17  R1+2  920   1080  43.69  55.53  11.84
16  R1+2  1080  1240  44.40  57.29  12.89
15  R1+2  1240  1414  44.40  58.70  14.30
14  R1+2  1414  1624  44.08  58.70  14.62
13  R1+2  1624  1866  44.76  59.75  14.99
12  R1+2  1866  2144  45.15  59.58  14.43
11  R1+2  2144  2463  44.61  59.75  15.14
10  R1+2  2463  2856  44.76  58.35  13.59
9   R1+2  2856  3347  45.15  58.35  13.20
8   R1+2  3347  3922  45.85  58.17  12.32
7   R1+2  3922  4595  46.89  57.11  10.22
6   R1+2  4595  5384  46.89  56.59  9.70
5   R1+2  5384  6308  45.48
55.53 10.05 4 R1+2 6308 7390 44.40 54.99 10.59 3 R1+2 7390 8658 46.02 53.59 7.57 171 Subject S8 MAP and T/C levels Inter-phase gap (us) 45 Pulse width (us) 200 Stimulation rate (pps/channel) 250 Stimulation mode BP+1 Active Electrode Return Electrode Lower Freq (Hz) Upper Freq (Hz) T (dB) MCL (dB) DR (dB) 20 22 150 550 52.42 58.88 6.46 18 20 550 950 51.22 58.76 7.54 16 18 950 1350 51.62 58.76 7.14 14 16 1350 1768 51.22 58.88 7.66 12 14 1768 2333 48.97 57.72 8.75 10 12 2333 3079 49.34 58.07 8.73 8 10 3079 4184 47.68 58.98 11.3 6 8 4184 5744 47.04 57.83 10.79 4 6 5744 7885 47.92 57.56 9.64 2 4 7885 10823 46.65 55.75 9.1 172 Appendix E Electrode comparison and normalized electrode confusion patterns for individual CI subjects 173 Electrode comparison pattern for subject S2 (The number represents the percent correct of pitch ranking between the electrode pairs. Note that for better representation, the numbers were rounded to integers.) S2 Elec 20 19 18 17 16 15 14 13 12 11 109 87 654321 20 - 63 93 97 90 93 100 93 100 100 100 100 100 100 100 100 100 100 100 100 19 - - 73 80 97 97 93 97 97 97 100 100 100 100 100 100 100 100 100 100 18 - - - 83 87 83 90 93 93 93 100 93 100 100 100 100 100 100 100 100 17 - - - - 80 77 93 90 87 90 97 97 93 100 93 100 100 100 100 100 16 - - - - - 73 87 83 87 93 83 93 97 97100 97100 97 97100 15 - - - - - - 70 80 93 97 80 87 100 97100100100100100100 14 - - - - - - - 77 90 97 97 93 93100100100100 97100100 13 - - - - ---- 73 80 97 87 90 97 97100 97 97100100 12 - - - - ---- - 83 87 93 97100100100 97 93 97 97 11 - - - - ---- -- 80 90 93 97 97 93 97 93 93 97 10 - - - - ---- --- 83 97 97 90 87 97 90 97 93 9 - - - - ---- ---- 77 87 90 97 93 93 97100 8 - - - - ---- ---- - 77 77 63 67 80 90 93 7 - - - - ---- ---- -- 50 40 63 47 87 73 6 - - - - ---- ---- -- - 43 33 57 77 73 5 - - - - ---- ---- -- -- 47 57 73 60 4 - - - - ---- ---- -- --- 47 77 60 3 - - - - ---- ---- -- ---- 57 83 2 - - - - ---- ---- -- ----- 60 1 - - - - ---- ---- -- ------ 174 Normalized electrode confusion pattern for subject S2 (Note that for better representation, the numbers were rounded to integers.) 
S2 Elec 20 19 18 17 16 15 14 13 12 11 109 876543 21 20 59 22 4 2 64 030000 000000 00 19 17 48 12 10 22 322200 000000 00 18 3 12 46 8 68 533303 000000 00 17 1 8 7 42 8 10 345411 303000 00 16 4 1 5 8 39 10 565363 110101 10 15 2 1 6 9 10 38 1172175 010000 00 14 0 3 4 3 6 14 46 114113 300001 00 13 2 1 2 4 67 9 38 10715 411011 00 12 0 1 3 6 63 4 11 43763 100013 11 11 0 1 3 4 31 188 4384 311313 31 10 0 0 0 1 78 1158 427 114514 13 9 0 0 3 1 35 35347 40 954133 10 8 0 0 0 2 10 231217 3377 12 116 32 7 0 0 0 0 11 010113 6 25 13 159 14 37 6 0 0 0 2 00 010122 6 12 24 13 16 10 65 5 0 0 0 0 10 000231 8 14 13 23 12 10 58 4 0 0 0 0 00 011112 89 16 13 23 13 48 3 0 0 0 0 10 112222 5 13 10 10 13 24 104 2 0 0 0 0 10 001211 34898 14 35 13 1 0 0 0 0 00 001120 299 13 135 13 32 175 Electrode comparison pattern for subject S3 S3 Elec 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 22 - 63 73 85 80 88 95 93 93 98 98 95 98 100 100 100 21 - - 55 78 93 88 85 93 83 98 95 95 100 100 98 100 20 - - - 58 83 90 95 90 85 88 98 95100100100 98 19 - - - - 58 65 88 73 90 78 90 93 98100 98 98 18 - - - - - 53 78 65 68 73 80 88 90100100 98 17 - - ---- 68 55 65 68 83 88 90 93 98 90 16 - - ---- - 55 43 65 68 78 85 95 93 95 15 - - ---- -- 58 78 68 85 78 90 93 98 14 - - ---- --- 53 60 83 70 75 85 88 13 - - ---- ---- 65 73 80 80 88100 12 - - ---- ---- - 60 68 65 80 88 11 - - ---- ---- -- 55 65 58 73 10 - - ---- ---- -- - 50 53 73 9 - - ---- ---- -- -- 58 53 8 - - ---- ---- -- --- 50 7 - - ---- ---- -- ---- 176 Normalized electrode confusion pattern for subject S3 S3 Elec 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 22 42 15 11685 2331 12 1000 21 13 36 16834 5361 22 0010 20 9 15 35 1463 2354 12 0001 19 3 6 12 28 12 10 4 8 3 6 3 2 1 0 1 1 18 4 2 4 11 25 12 6 9 8 7 5 3 3 0 0 1 17 3 3 2 8 11 24 8 11 8 8 4 3 2 2 1 2 16 1 4 1358 24 11 148 85 4121 15 2 2 2 6 8 10 10 24 10 5 8 3 5 2 2 1 14 1 3 3267 118 239 83 6532 13 1 1 3578 85 11 24 87 5520 12 1 1 1254 7798 239 7853 11 1 1 1233 5447 10 24 118 106 10 1 0 0122 4575 8 11 24 12 117 9 0 0 0002 1375 99 13 27 11 13 8 0 1 0101 2244 6 12 13 12 28 14 7 0 0 1113 2140 49 9 16 16 33 177 Electrode comparison pattern for subject S5 S5 Elec 22 21 20 19 18 17 16 15 14 13 12 11 109 876543 22 - 100 100 100 100 100 100 100 93 100 100 100 100 100 100 100 100 100 100 100 21 - - 100 97 100 100 100 100 100 100 100 97 100 97 100 100 97 100 100 97 20 - - - 97 97 100 100 100 97 100 100 93 97 100 100 97 100 97 100 100 19 - - - - 100 100 93 93 93 90 97 97 97 100 90 97 100 100 97 100 18 - - - - - 87 63 67 90 97 93 93 100 97100100100100100100 17 - - - - - - 37 60 83 77 97 97 100100 97100100100100100 16 - - - - - - - 87 97 93100100 97100100100 97 97100100 15 - - - - ---- 93 97 97 93 100100100100100100100100 14 - - - - ---- - 77 90 100 100 97 100 100 100 100 100 100 13 - - - - ---- -- 90 97 100 97100100100100100100 12 - - - - ---- --- 93 100100100100100100100100 11 - - - - ---- ---- 97100100 97100100100100 10 - - - - ---- ---- - 83100 97100100100100 9 - - - - ---- ---- -- 97100100100100100 8 - - - - ---- ---- -- - 97100100100100 7 - - - - ---- ---- -- --100100100100 6 - - - - ---- ---- -- --- 97100100 5 - - - - ---- ---- -- ---- 90 93 4 - - - - ---- ---- -- ----- 97 3 - - - - ---- ---- -- ------ 178 Normalized electrode confusion pattern for subject S5 S5 Elec 22 21 20 19 18 17 16 15 14 13 12 11 1098765 43 22 94 0 0 0 00 006000 000000 00 21 0 86 0 3 00 000003 030030 02 20 0 0 79 3 30 003005 300202 00 19 0 2 2 62 00 444622 206200 20 18 0 0 2 0 466 17 155233 010000 00 17 0 0 0 0 5 38 24 156911 001000 00 16 0 0 0 3 15 26 4361300 
100011 00 15 0 0 0 3 16 19 6 473222 000000 00 14 4 0 2 4 59 24 53 1240 010000 00 13 0 0 0 5 2 13 42 13 5451 010000 00 12 0 0 0 2 52 0277 705 000000 00 11 0 2 5 2 52 05025 68 200200 00 10 0 0 3 3 00 300003 75 110200 00 9 0 3 0 0 20 002200 13 753000 00 8 0 0 0 8 03 000000 03 83300 00 7 0 0 2 3 00 000003 303 8600 00 6 0 3 0 0 00 300000 0000 913 00 5 0 0 2 0 00 300000 00003 79 85 4 0 0 0 2 00 000000 000009 863 3 0 3 0 0 00 000000 000006 3 88 179 Electrode comparison pattern for subject S8 S8 Elec 20 18 16 14 12 10 8 6 4 2 20 - 73 98 100 100 100 100 100 100 100 18 - - 88 100 100 100 100 100 100 100 16 - - - 95 98 95 98 98 100 100 14 ---- 90 95 100 100 100 100 12 ---- - 78 100 100 100 100 10 ---- -- 95 100 98 100 8 ---- --- 90 85 93 6 ---- --- - 15 58 4 ---- --- -- 75 2 ---- --- --- 180 Normalized electrode confusion pattern for subject S8 S8 Elec 20 18 16 14 12 10 8 6 4 2 2077 21 2 0 0 0 0 0 0 0 1820 71 9 0 0 0 0 0 0 0 16 29 754 242 200 14 0 0 4 84 8 4 0 0 0 0 12 0027 74 170 000 10 0044 16 704 020 8 0020 04 71 7 115 6 0010 004 42 35 18 4 0000 017 37 44 11 2 0000 004 24 14 58 181 Appendix F Summary of speech recognition scores for individual CI subjects 182 For quick reference, this appendix summarizes CI speech recognition data (percent correct) that were related to the discussion of inter-subject performance difference appeared in the thesis. For the listening conditions, phone and phone+hf appeared in Chapter 2; M1, F1, M1 Æ F1, and F1 Æ M1 appeared in Chapter 3, A01, A02, A11 and A12 were the four talkers (2 males and 2 females) appeared in Chapter 7 and Chapter 8. Averaged scores and standard deviations (in parenthesis) were listed together. The shaded cells were the speech recognition scores for transformed speech materials with the proposed speech enhancement framework. Subject Speech material Listening condition S1 S2 S3 S4 S5 S6 S7 S8 S9 Phone 61.16 (13.25) 43.48 (12.68) 35.49 (9.52) -- 74.72 (11.03) 83.90 (7.85) -- 80.17 (7.83) 72.06 (12.41) Phone+hf 63.01 (10.90) 52.73 (6.52) 38.83 (9.36) -- 80.88 (7.00) 85.36 (4.91) -- 77.58 (8.99) 77.26 (7.77) M1 73.71 (7.09) 62.33 (7.85) 49.90 (10.40) 79.01 (10.44) 93.83 (1.07) 97.20 (1.89) 96.57 (2.06) 89.65 (4.89) 88.45 (2.98) F1 94.40 (2.50) 78.45 (3.63) 62.26 (6.34) 78.74 (9.43) 93.43 (1.25) 96.68 (3.21) 94.57 (6.00) 80.30 (4.91) 76.88 (14.38) M1 Æ F1 78.93 (11.21) 70.68 (6.97) 51.62 (10.06) 57.45 (11.52) 93.17 (1.66) 97.00 (3.21) 93.37 (3.52) 84.72 (9.98) 79.78 (9.25) IEEE Sentences F1 Æ M1 84.23 (7.37) 66.85 (6.10) 67.55 (10.36) 78.51 (9.34) 95.53 (1.53) 98.00 (2.58) 96.13 (3.80) 83.63 (2.22) 81.70 (7.53) A01 (M) -- 78.82 (1.59) 76.05 (3.61) -- 93.40 (0.60) -- -- 96.53 (2.17) -- A02 (M) -- 81.25 (1.04) 73.96 (2.76) -- 85.42 (1.04) -- -- 97.22 (1.59) -- A11 (F) -- 78.13 (1.05) 71.18 (2.41) -- 83.68 (1.59) -- -- 92.02 (1.59) -- Vowels A12 (F) -- 89.59 (3.61) 78.13 (9.33) -- 80.90 (0.60) -- -- 98.96 (1.04) -- 183 Bibliography Allen, J.S., Miller, J.L., and DeSteno, D. (2003). “Individual talker differences in voice-onset-time,” J. Acoust. Soc. Am, 113(1), 544-552. Assmann, P.F., Nearey T. M., and Hogan, J. T. (1982). “Vowel Identification: Orthographic, perceptual, and acoustic aspects,” J. Acoust. Soc. Am, 71, 975- 989. Bladon, R. A., Henton, C. G., and Pickering, J. B. (1984). “Towards an auditory theory of speaker normalization,” Lang. Commun. 4, 59-69. Bond, Z. S., and Moore, T. J. (1994). “A note on the acoustic-phonetic characteristics of inadvertently clear speech,” Speech Commun. 14, 325-337. Bradlow, A. R., Torretta, G. 
M., and Pisoni, D. B. (1996). “Intelligibility of normal speech: 1. Global and fine-grained acoustic-phonetic talker characteristics,” Speech commun. 20, 255-272. Byrd, D. (1992). “Preliminary results on speaker-dependent variation in the TIMIT database,” J Acoust Soc Am., 92(1):593-6. Cochlear Ltd. (1999), “Nucleus technical reference manual,” Englewood, CO, USA Collins, L. M., Zwolan, T. A., and Wakefield G. H. (1997). “Comparison of electrode discrimination, pitch ranking, and pitch scaling data in postlingually deafened adult cochlear implant subjects,” J. Acoust. Soc. Am. 101(1), 440- 455. Cox, R.M, Alexander, G.C, Gilmore, C. (1987). “Intelligibility of average talkers in typical listening environments,” J. Acoust. Soc. Am., 81(5), 1598-608. Cray, J. W., Allen, R. L., Stuart, A., Hudson, S., Layman, E., Givens, G. D., (2004). “An investigation of telephone use among cochlear implant recipients,” Am. J. of Audiology, 13, 200-212. Davis, S.B., and Mermelstein, P.M. (1980). “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Trans. on Acoustics, Speech, and signal processing, Vol. ASSP-28, No. 4, 357-366. 184 Dorman, M. F., Loizou, P.C., and Rainey, D. (1997a). “Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs,” J Acoust Soc Am. 102(4), 2403-2411. Dorman, M.F., Loizou, P.C, and Rainey, D. (1997b). “Stimulating the effect of cochlear implant electrode insertion depth on speech understanding,” J Acoust. Soc.Am. 102(5 Pt 1), 2993-2996. Dorman, M.F., and Loizou, P.C. (1997). “Mechanisms of vowel recognition for Ineraid patients fit with continuous interleaved sampling processors,” J. Acoust. Soc. Am. 102(1). 581-587. Faulkner, A., Rosen, S., and Norman, C. (2006).“The Right Information May Matter More Than Frequency-Place Alignment: Simulations of Frequency-Aligned and Upward Shifting Cochlear Implant Processors for a Shallow Electrode Array Insertion,” Ear & Hearing, 27(2), 139-152. Fishman, K., Shannon, R.V., and Slattery, W. H. (1997). “Speech recognition as a function of the number of electrodes used in the SPEAK cochlear implant speech processor,” J. Speech Hear Res. 40,1201-1215. French, N.R., and Steinberg, J.C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90-119 Friesen, L.M., Shannon, R. V., Baskent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants,” J Acoust Soc Am. 110, 1150-1163. Fu, Q.-J. (1997). “Speech perception in acoustic and electric hearing”, PhD dissertation, University of Southern California. Fu, Q.-J. and Shannon, R. V. (1998a). “Effects of amplitude nonlinearity on phoneme recognition by cochlear implant users and normal-hearing listeners,” J. Acoust. Soc. Am. 104 (5), 2570-2577. Fu, Q.-J, Shannon, R.V, Wang, X. (1998b). “Effects of noise and spectral resolution on vowel and consonant recognition: Acoustic and electric hearing,” J. Acoust. Soc. Am. 104(6), 3586-3596. Fu, Q.-J., and Shannon, R.V. (1999a). “Recognition of spectrally degraded and frequency shifted vowels in acoustic and electric hearing,” J. Acoust. Soc. Am. 105(3), 1889-1900. 185 Fu, Q.-J., and Shannon R.V (1999b). “Effect of acoustic dynamic range on phoneme recognition in quiet and noise by cochlear implant users,” J Acoust Soc Am. 106(6), L65-L70. Fu, Q.-J., Shannon, R. V., and Galvin III, J.J. 
(2002). “Perceptual learning following changes in the frequency-to-electrode assignment with the Nucleus-22 cochlear implant,” J Acoust Soc Am. 112(4), 1664-1674. Fu, Q.-J., and Galvin III, J.J.(2003). “The effects of short-term training for spectrally mismatched noise-band speech”, J. Acoust. Soc. Am. 113(2), 1065-1072. Fu, Q.-J. and Nogaki, G. (2005). “Noise susceptibility of cochlear implant users: the role of spectral resolution and smearing,” J Assoc Res Otolaryngol. 6(1), 19- 27. Fu, Q.-J., Galvin III, J. J., Wang, X., and Nogaki, G. (2005a) “Moderate auditory training can improve speech performance of adult cochlear implant users,” Acoustics Research Letters Online 6(3), 106-111. Fu, Q.-J., Nogaki, G., and Galvin III, J.J. (2005b). “Auditory training with spectrally shifted speech: implication for Cochlear Implant patient auditory rehabilitation”, JARO 6:180-189. Fu, Q.-J., and Galvin, J. J. (2006). “Recognition of simulated telephone speech by cochlear implant users,” Am J Audiol., 15(2) 127-32. Gerstman, L. J. (1968). “Classification of self-normalized vowels,” IEEE Trans. On audio and electroacoustics, AU-16(1), pp. 78-80. Goldsworthy, R. (2005), “Noise reduction algorithms and performance metrics for improving speech reception in noise by cochlear-implant users”, Ph.D thesis, MIT. Goldsworthy, R., and Greenberg, J.E. (2004). “Analysis of speech-based speech transmission index methods with implications for nonlinear operations,” J. Acoust. Soc. Am. 116(6). 3679-3689. Gordon-Salant, S., and Fitzgibbons, P.J. (1997). “Selected cognitive factors and speech recognition performance among young and elderly listeners,” J. Speech Lang. Hear. Res. 40, 423-431. Green T., Katiri, S., Faulkner, A., and Rosen, S. (2007). “Talker intelligibility differences in cochlear implant listeners,” J. Acoust. Soc. Am. 121(6), EL223-EL229. 186 Green, T., Faulkner, A., and Rosen, S. (2004), “Enhancing temporal cues to voice pitch in continuous interleaved sampling cochlear implants,” J. Acoust. Soc. Am., 116, 2298-2310. Greenwood, D.D. (1990). “A cochlear frequency-position function for several species – 29 years later,” J.Acoust.Soc.Am., 87(2), pp. 2592-2605. Gustafsson, H., Lindgren, U.A., and Claesson, I. (2006). “Low-complexity feature- mapped speech bandwidth extension,” IEEE trans. on audio, speech and language processing, 14(2), 2006, pp. 577-588 Hazan, V., and Markham, D. (2004). “Acoustic-phonetic correlates of talker intelligibility for adults and children,” J. Acoust. Soc. Am. 116, 3108-3118. Henry, B.A., McKay, C.M., McDermott, H.J., and Clark, G.M. (2000). “The relationship between speech perception and electrode discrimination in cochlear implantees,” J. Acoust. Soc. Am., 108(3), 1269-1280. Hermansky, H. and Morgan, N. (1994). “RASTA processing of speech,” IEEE transactions on speech and audio processing, 2(4), 578-589. Holden, L. K.; Skinner, M. W.; Holden, T. A.; Demorest, M.E. (2002)” Effects of Stimulation Rate with the Nucleus 24 ACE Speech Coding Strategy,” Ear Hear. 23(5), 463-476 Hood, J.D., Poole, J.P. (1980). “Influence of the speaker and other factors affecting speech intelligibility,” Audiology, 19(5), 434-55. Hillenbrand, J., Getty, L., Clark, M, and Wheeler, K. (1994). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099- 3111. Huang, X.-D., Acero, A., and Hon, H.-W., (2001). “Spoken language processing- a guide to theory, algorithm, and system development”, Prentice Hall. IEEE, (1969). 
“IEEE recommended practice for speech quality measurements,” Institute of Electrical and Electronic Engineers, New York. Ito, J., Nakatake, M., Fujita, S. (1999). “Hearing ability by telephone of patients with cochlear implants,” Otolaryngology – Head and Neck Surgery, 802-804. 187 Kain, A., and Macon, M. W. (1998). “Spectral voice conversion for text-to-speech synthesis,” Preceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp.285-288. Kepler, L.J., Terry, M., and Sweetman, R. H. (1992). “Telephone usage in the hearing-impaired population.” Ear Hear, 13: 311-319 Kirk, K.I., Pisoni, D.B., and Miyamoto, R.C., (1997). “Effects of stimulus variability on speech perception in listeners with hearing impairment,” J. Speech Lang. Hear.Res., 40, 1395-1405. Krause, J. C., and Braida, L. D. (2002). “Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility,” J. Acoust. Am. 112, 2165-2172. Krause, J. C., and Braida, L. D. (2004). “Acoustic properties of naturally produced clear speech at normal speaking rates,” J. Acoust. Am. 115, 362-378. Kompe R. (1997). “Prosody in Speech Understanding Systems,” Springer-Verlag New York, Inc. Kurdziel, S., Noffsinger, D., and Olsen, W. (1976), “Performance by cortical lesion patients on 40 and 60% time-compressed materials,” J. Am. Aud. Soc., 2, 3-7 Liu, S., Rio, E.D., Bradlow, A.R., Zeng, F.-G., (2004), “Clear speech perception in acoustic and electric hearing,” J.Acoust. Soc. Am. 116(4), Pt.1, 2374-2383. Liu, S, Zeng, F.-G.(2006). “Temporal properties in clear speech perception,” J. Acoust. Soc. Am., 120(1), 424-32. Liu, C., Galvin, J., Fu, Q.-J., and Narayanana, S.,S. (2008) “Effect of spectral normalization on different talker speech recognition by cochlear implant users,” J. Acoust. Soc. Am., 123(5), 2836-2847. Liu, C., and Fu, Q,-J. (2007). “Estimation of vowel recognition with cochlear implant simulations,” IEEE Transactions on Biomedical Engineering, 54(1), 74-81. Liu, C., Fu, Q.-J., Narayanan, S.S. (2006). “Smooth GMM based multi-talker spectral conversion for spectrally degraded speech,” ICASSP, 5, 141-144. Liu, C., and Fu, Q.-J. (2005) “ Relating acoustic space of vowels to perceptual space in cochlear implants”, IEEE ICASSP, 3, 33-36. 188 Lobanov, B. M. (1971). “Classification of Russian vowels spoken by different speakers,” J. Acoust. Soc. Am., 49(2), pp. 606-608. Loizou, P. C. (1998). “Mimicking the human ear,” IEEE Signal Proc. Mag. 15(5), 101-130. Loizou, P.C., Poroy, O., and Dorman, M. (2000). “The effect of parametric variations of cochlear implant processors on speech understanding,” J Acoust Soc Am. 108(2), 790-802 Loizou, P.C., Lobo, A., Hu, Y. (2005) “Subspace algorithms for noise reduction in cochlear implants, ” J Acoust Soc Am. 118(5), 2791-3. Loizou, P.C., (2006). “Speech processing in vocoder-centric cochlear implants,” Adv. Otorhinolaryngol, 64, 109-143. Loizou, P.C., Dorman, M., Poroy, O., Spahr, T. (2000). “Speech recognition by normal-hearing and cochlear implant listeners as a function of intensity resolution,” J. Acoust. Soc. Am, 108(5), 2377-2387. Luo, X., and Fu, Q. –J., (2005). “Speaker normalization for Chinese vowel recognition in cochlear implants,” IEEE Transaction on Biomedical Engineering, 52(7), pp. 1358 – 1361. Makhoul, J., and Berouti, M. (1979). “High-frequency regeneration in speech coding systems,” ICASSP, 428-431. Mendel, J.M., (1995). 
“Lessons on estimation theory for signal processing,communications and control,” Prentice Hall. Milchard, A. J., and Cullington, H. E.,(2004).”An investigation into the effect of limiting the frequency bandwidth of speech on speech recognition in adult cochlear implant users”. International J. of Audiology, 43: 356-362 Miller, J.L., Green, K.P., and Reeves, A., (1986). “Speaking rate and segments: A look at the relation between speech production and perception for the voicing contrast,” Phonetica 43, 106-115. Miller, J.L., and Volaitis, L.E. (1989). “Effect of speaking rate on the perceptual structure of a phonetic category,” Percept. Psychophys., 46(6), 505-12. Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). “Some effects of talker variability on spoken word recognition,” J. Acoust. Soc. Am., 85(1), pp. 365- 378. 189 Muller, J., Schon, F., and Helms, J. (2002). “Speech Understanding in Quiet and Noise in Bilateral Users of the MED-EL COMBI 40/40+ Cochlear Implant System, ” Ear Hear. 23(3), 198-206. Munson, B., Nelson, P.B. (2005). “Phonetic identification in quiet and in noise by listeners with cochlear implants,” J. Acoust. Soc. Am., 118(4), 2607-2617. Nelson, D.A., Tasell, D.J.V, Schroder, A. C., Soli, S., and Levine, S. (1995). “Electrode ranking of “place pitch” and speech recognition in electrical hearing,” J. Acoust. Soc. Am., 98(4), 1987-1999. Nelson, P.B., Jin,S.H., Carney, A.E., and Nelson, D. A. (2003). “Understanding speech in modulated interference: cochlear implant users and normal-hearing listeners,” J Acoust Soc Am. 113(2), 961-968. Nelson, D.A., Schmitz, J.L., Donaldson, G.S., Viemeister, N.F. and Javel, E. (1996), “Intensity discrimination as a function of stimulus level with electric stimulation,” J. Acoust.Soc.Am., 100, 2393-2414 Nejime, Y., and Moore, B.C. (1998). “Evaluation of the effect of speech-rate slowing on speech intelligibility in noise using a simulation of cochlear hearing loss,” J. Acoust. Soc. Am., 103(1), 572-576. Newman, R.S., Clouse, S.A., Burnham, J.L. (2001). “The perceptual consequences of within-talker variability in fricative production,” J.Acoust.Soc.Am., 109(3), 1181-1196. NIDCD (2006) http://www.nidcd.nih.gov/ Nilsson M., and Kleijn, W.B. (2001). “Avoiding over-estimation in bandwidth extension of telephone speech,” ICASSP, 869-872. Nilsson, M., Gustafsson, H., Andersen, S.V., and Kleijn, W.B., (2002). “Gaussian mixture model based mutual information estimation between frequency bands in speech,” ICASSP, 525-528. Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-3769. Park, K.-Y, and Kim, H.S. (2000). “Narrowband to wideband conversion of speech using GMM based transformation”, ICASSP, 1843-1846. Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am., 24(2), pp. 175-184. 190 Payton, K.L., Uchanski, R.M., and Braida, L.D. (1994). “Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing,” J. Acoust. Soc. Am. 95, 1581-1592 Pavlovic, C.V. (1987). “Derivation of primary parameters and procedures for use in speech intelligibility predictions,” J. Acoust. Soc. Am, 82(2), 413-422. Picheny, M. A., Durlach, N. I., and Braida, L. D. (1985). “Speaking clearly for the hard of hearing 1: intelligibility differences between clear and conversational speech,” J. Speech Hear. Res. 28, 96-103. Picheny, M. A., Durlach, N. I., and Braida, L. D. (1986). 
“Speaking clearly for the hard of hearing 2: acoustic characteristics of clear and conversational speech,” J. Speech Hear. Res. 29, 434-446. Picheny, M.A, Durlach, N.I, Braida, L.D. (1989). “Speaking clearly for the hard of hearing. III: An attempt to determine the contribution of speaking rate to differences in intelligibility between clear and conversational speech,” J. Speech. Hear. Res., 32(3), 600-3. Pisoni, D.B. (1993). “Long term memory in speech perception: some new findings on talker variability, speaking rate, and perceptual learning”, Speech Communication, vol.13, no.1-2, pp.109-125. Potamianos A., and Narayanan.S., (2003). “Robust recognition of children’s speech,” IEEE Trans. Speech and Audio Processing, 11(6), 603–616. Rabiner, L. and Juang, B.-H.(1993). “Fundamentals of speech recognition,” Prentice Hall. Remus, J.J., and Collins, L.M., (2005). “The effects of noise on speech recognition in cochlear implant subjects: predictions and analysis using acoustic models,” EURASIP Journal on Applied Signal Processing, 18, pp. 2979-2990. Rosen, S., Faulkner, A., and Wilkinson, L. (1999). “Adaptation by normal listeners to upward spectral shifts of speech: implications for cochlear implants,” J Acoust Soc Am 106: 3629-3636. Sakoe H., and Chiba, S. (1978). “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. on Acoustic, Speech, Signal Processing Vol. ASSP-26, pp. 43-49. 191 Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science, 270, 303-304. Shannon, R. V. (2007). “Ear and brain: integration of sensory cues into complex objects,” Association for research in otolaryngology (ARO), 30th annual midwinter research meeting. Sommers, M. S., Nygaard, L. C., and Pisoni, D. B. (1994). “Stimulus variability and spoken word recognition. I. Effects of variability in speaking rate and overall amplitude,” J. Acoust. Soc. Am., 96(3), pp. 1314-1324. Stylianou, Y., Cappe, O., and Moulines, E. (1998). “Continuous probabilistic transform for voice conversion,” IEEE Trans. on speech and audio processing, 6(2), pp. 131-142. Syrdal, A. K., and Gopal, H. S. (1986). “A perceptual model of vowel recognition based on the auditory representation of American English vowels,” J. Acoust. Soc. Am., 79(4), pp. 1086-1100. Svirsky, M.A., Silveira, A., Suarez, H., Neuburger, H., Lai, T.T., and Simmons, P.M. (2001). “Auditory learning and adaptation after Cochlear Implantation: A preliminary study of discrimination and labeling of vowel sounds by Cochlear Implant users”, Acta Otolaryngol, 121:262-265. Svirsky, M.A., Silveira, A., Neuburger, H., Teoh, S. W., and Suarez, H., (2004). “Long term auditory adaptation to a modified peripheral frequency map”, Acta Otolaryngol, 124:381-386. Terry, M., Bright, K., Durian, M., Kepler, L., Sweetman, R., Grim, M., (1992). “ Processing the telephone speech signal for the hearing impaired,” Ear and Hearing, 13(2), 70-79. Throckmorton, C.S. and Collins, L.M. (2002). “The effect of channel interactions on speech recognition in cochlear implant subjects: predictions from an acoustic model,” J. Acoust. Soc. Am. 112(1), 285-296. Tsao Y.-C., Weismer, G., and Iqbal, K. (2006). “Interspeaker variation in habitual speaking rate: additional evidence,” Journal of Speech, Language, and Hearing Research, 49, 1156-1164. Yang, L-P., Fu, Q.-J. (2005). “Spectral subtraction-based speech enhancement for cochlear implant patients in background noise,” J Acoust Soc Am. 117(3 Pt 1): 1001-4. 
192 Yonan, C. A., Sommers, M.S.(2000). “The effects of talker familiarity on spoken word identification in younger and older listeners,” Psychol Aging, 15(1), 88- 99 Uchanski, R. M., Geers, A. E., and Protopapas, A. (2002). “Intelligibility of modified speech for young listeners with normal and impaired hearing”, Journal of Speech, Language, and Hearing Research, 45, 1027-1038. Vandali A.E., Whitford L.A.,Plant K.L., Clark G.M. (2000). “Speech perception as a function of electrical stimulation rate: using the Nucleus 24 cochlear implant system,” Ear Hear. 21(6), 608-24.” Verbrugge, R.R., Strange, W., Shankweiler, D. P., and Edman, T.R., (1976). “What information enables a listener to map a talker’s vowel space?” J. Acoust. Soc. Am. 60, 198-212. Wang, D., Narayanan, S. (2007). “An acoustic measure for word prominence in spontaneous speech,” IEEE Transactions on Speech, Audio and Language Processing, 15(2), 690–701. Wilson, B.S., Finley,C.C., Lawson, D.T., and Wolford, R.D. (1988) “Speech processors for cochlear prostheses,” Proceeding of the IEEE, 76(9), 1143- 1154. Wilson, B.S., Finley, C.C., Lawson, D.T., Wolford, R.D., Eddington, D.K., and Rabinowitz, W.M. (1991). “New levels of speech recognition with cochlear implants,” Nature 352, 236-238. Wilson, B.S., Finley, C.C and Lawson, D.T. (1990) in “cochlear implants: models of the electrically stimulated Ear (Eds Miller, J.M., and Spelman, F.A.)”, springer, NY, 339-376. Zeng, F.-G., Oba, S., Garde, S.,Sininger, Y., and Starr, A., (1999). “Temporal and speech processing deficits in auditory neuropathy,” NeuroReport 10, 3429- 3435. Zeng, F.-G., and Galvin III, J.J. (1999). “Amplitude mapping and phoneme recognition in cochlear implant listeners,” Ear Hear. 20(1), 60-74. Zeng, F.-G., Grant, G., Niparko, J., Galvin, J., Shannon, R., Opie, J., Segel, P.,(2002), “Speech dynamic range and its effect on cochlear implant performance,” J. Acoust.Soc. Am. 111(1), 377-386. 193 Zwolan, T.A., Collins, L.M., and Wakefield, G.H. (1997). “Electrode discrimination and speech recognition in postlingually deafened adult cochlear implant subjects,” J. Acoust. Soc. Am. 102(6), 3673-3685.
Abstract

Accompanying the improvement in cochlear implant (CI) performance over the years, CI speech recognition has increasingly shown unique patterns and a large inter-subject performance difference that differ from those of normal-hearing (NH) listeners. Previous literature has not paid enough attention to such unique patterns, despite numerous pieces of evidence. To further improve speech perception with the next generation of CI devices, it is critical to integrate such unique patterns into the CI framework.