DIGITAL SIGNAL PROCESSING TECHNIQUES FOR MUSIC STRUCTURE ANALYSIS

by

Yu Shiu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2007

Copyright 2007 Yu Shiu

Table of Contents

List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
   1.1 Significance of the Research
   1.2 Background of the Research
   1.3 Contributions of the Research
   1.4 Outline of the Dissertation

Chapter 2: Research Framework and Music Background Review
   2.1 Beat-Level Analysis
      2.1.1 Tempo Induction and Beat Tracking
      2.1.2 Feature Extraction
   2.2 Measure-Level Analysis
   2.3 Structure-Level Analysis
      2.3.1 Repetition Extraction and Application
      2.3.2 Music Genres and Their Structure Analysis

Chapter 3: Music Signal Processing via Tempo and Measure Analysis
   3.1 Basic Tempo Analysis: Music Onset Detection and Period Estimation
   3.2 Advanced Tempo Analysis: Musical Beat Tracking
   3.3 Feature Extraction
      3.3.1 Low-level Feature Vector: Pitch Class Profile
   3.4 Measure Analysis and Their Similarity Measurement
      3.4.1 STFT-based Similarity Matrix and Its Shortcomings
      3.4.2 Construction of Measure-level Similarity Matrix
      3.4.3 Optimized Distance Matrix Calculation Using Dynamic Time Warping (DTW)
   3.5 Conclusion

Chapter 4: High-Level Music Structure Analysis and Decomposition
   4.1 Introduction
   4.2 Similarity Matrix Computation Using Filtered PCP
   4.3 Similarity Segment Detection via Viterbi Algorithm
   4.4 Post-processing Techniques
      4.4.1 Decomposition of Overlapping Similar Parts via Boundary Point Selection
      4.4.2 Similarity Variation along Temporal Segments
   4.5 Algorithm for Automatic Music Structure Analysis
   4.6 Experimental Results and Discussion
   4.7 Conclusion

Chapter 5: Musical Beat Tracking with Kalman Filters
   5.1 Introduction
   5.2 Musical Data Pre-processing
      5.2.1 Onset Detection
      5.2.2 Period Estimation
   5.3 Beat Tracking with Kalman Filters
      5.3.1 Linear Dynamic Model of Beat Progression
      5.3.2 Kalman Filter Algorithm
   5.4 LM (Local Maximum) Measurement Selection Method
   5.5 PDA (Probabilistic Data Association) Measurement Selection Method
      5.5.1 Measurement Validation
      5.5.2 Description of PDA Method
   5.6 Enhanced PDA (EPDA) Measurement Selection Method
   5.7 Experimental Results
      5.7.1 Experimental Data and Setup
      5.7.2 Performance Metrics
      5.7.3 P-Score Evaluation for MIREX Data Set
      5.7.4 P-Score Evaluation for Billboard Data Set
      5.7.5 Performance of Longest Tracked Music Segment Ratio (LTMSR)

Chapter 6: Musical Beat Tracking with a Hidden Markov Model (HMM)
   6.1 Introduction
   6.2 States and Observations of Proposed Beat-Tracking HMM
      6.2.1 States and Observations
      6.2.2 Observation Probability Determination
   6.3 State Transition Modeling
      6.3.1 Periodic Left-to-Right (PLTR) Model
      6.3.2 Determination of State Transition Probability a_ij
   6.4 State Decoding via Viterbi Algorithm
   6.5 Experimental Results
      6.5.1 Experimental Data and Setup
      6.5.2 P-Score Performance Evaluation
      6.5.3 Comparison of KF-based and HMM-based Musical Beat Tracking Algorithms

Chapter 7: Conclusion and Future Work
   7.1 Conclusion
   7.2 Future Work

Bibliography

List Of Tables

3.1 Comparison of the averaged distances with and without DTW.
4.1 The performance of chorus detection in terms of recall (R), precision (P) and F-measure (F).
4.2 The performance of chorus detection in terms of recall (R), precision (P) and F-measure (F).
5.1 P-score comparison of Kalman filtering with LM, PDA and EPDA.
5.2 P-score comparison of the KF-based beat tracking algorithm with LM, PDA and EPDA.
5.3 Longest Tracked Music Segment Ratio (LTMSR) with LM, PDA and EPDA.
6.1 Performances of HMM-based musical beat tracking with various settings applied to the MIREX data set.
6.2 Performances of HMM-based musical beat tracking with various settings applied to the Billboard Top10 data set.
6.3 Performance comparison between Kalman-filter-based and HMM-based musical beat tracking algorithms.

List Of Figures

2.1 A three-level framework of music signal processing and structure analysis.
3.1 The block diagram of a music tempo estimation system.
3.2 The music onset signal for part of the Carpenters' For all we know.
3.3 The autocorrelation function (ACF) of an onset signal.
3.4 The block diagram of a phase-locked loop (PLL) system.
3.5 The non-return-to-zero (NRZ) data format in a communication system.
3.6 Conversion of edges of NRZ data to pulses.
3.7 An example of music beat location tracking.
3.8 An example of music beat location tracking for non-ideal onset signals.
3.9 Block diagram of calculating PCP.
3.10 Attack - Decay - Sustain - Release pattern.
3.11 (a) The small window method and (b) the tempo-based window method.
3.12 Measure-level similarity matrix, where s_ij represents the similarity of measure i and measure j.
3.13 The measure-level distance matrix of Nirvana's Smells Like Teen Spirit.
3.14 The calculation of measure-level similarity: (a) linear one-by-one correspondence between ordered shortest notes; (b) distorted correspondence between ordered shortest notes.
3.15 DTW distance matrix.
3.16 DTW's local continuity constraint on the path.
3.17 The measure-level distance matrix using DTW for Nirvana's Smells Like Teen Spirit.
3.18 Comparison of distance between similar measures with and without DTW.
4.1 The relative intensity of individual notes, where the x-axis represents the note number (note that 69 is the note number for A4).
4.2 The measure-level similarity matrix constructed using the conventional PCP feature vectors.
4.3 The measure-level similarity matrix constructed using the filtered PCP feature vectors.
4.4 Two similar segments, j_1 to j_M and i_1 to i_M, shown in a similarity matrix.
4.5 The time-state space of the Viterbi algorithm on the lower-triangular similarity matrix.
4.6 Different paths in the similarity matrix, where path A goes along the diagonal direction while path B deviates from the diagonal direction.
4.7 Illustration of two overlapping parts in a song due to the detected segment from (i_1, j_1) to (i_M, j_M).
4.8 An illustration of the endpoint adjustment post-processing for two overlapping similar parts in a song, where the similarity matrix is derived for Joan Jett's I love rock n' roll and the path detected by the Viterbi algorithm is given by the plus sign.
4.9 Detection of structural difference based on the difference of the average similarity values in blocks B_1, B_2 and B_3 along the diagonal and the first sub-diagonal lines.
4.10 The variation of similarity values for U2's Vertigo, where its similarity matrix is shown in Fig. 4.3.
4.11 The similarity matrix of Nirvana's Smells Like Teen Spirit, where the path detected by the Viterbi algorithm is indicated by the plus sign "+".
4.12 The time variation of the averaged similarity value of Nirvana's Smells Like Teen Spirit.
4.13 The detected temporal segments with high similarity values, and the analyzed structure of Nirvana's Smells Like Teen Spirit.
5.1 The framework of the proposed musical beat tracking system.
5.2 Measurement selection in the conventional Kalman filter with the local maximum (LM) method.
5.3 Billy Joel's We didn't start the fire: (top) the spectrogram of a music segment from 30:00 to 35:00 seconds, where the y-axis represents the frequency from 0 to 8 kHz; (bottom) the musical onset signal as a function of time (in seconds) for the same music segment.
5.4 The performance of the KF-based beat tracking algorithm with LM (the x-coordinate) and EPDA (the y-coordinate) for each music clip from the MIREX data set.
6.1 The framework of the musical beat tracking system.
6.2 The onset histogram conditioned on the beat and non-beat state types with the bin width equal to (a) 0.5 and (b) 0.25, where the x-axis is the music onset intensity while the y-axis is the frequency of occurrence.
6.3 Observation probabilities for the beat state type, i.e., P(o(t)|B), and the non-beat state type, i.e., P(o(t)|N), derived from the MIREX data set.
6.4 (a) A general LTR state transition model and (b) the LTR model with beat and non-beat state types labeled.
6.5 The PLTR model with large-loop state transitions for modeling the periodic repetition of beats.
6.6 The probability distribution Prob(d) with p = 0.75 and 0.25.
6.7 A complete LTR model with only a single self-loop on the beat state type.
6.8 The probability distribution Prob(d) for the PLTR model with a self-loop at every state, parameterized by self-transition probability p = 0.4 and p = 0.6.
6.9 An example of the decoded state sequence based on the 4th music clip in MIREX: (a) the musical onset, (b) the log-likelihood difference between the beat and the non-beat state types, (c) the decoded state sequence, where "1" represents the beat state and "0" represents the non-beat state, as a function of time (in seconds).
6.10 An example of the decoded state sequence based on the 7th music clip in MIREX: (a) the musical onset, (b) the log-likelihood difference between the beat and the non-beat state types, (c) the decoded state sequence, where "1" represents the beat state and "0" represents the non-beat state, as a function of time (in seconds).

Abstract

Automatic music structure analysis from audio signals is an interesting topic that receives much attention these days. Its objective is to find the music structure by decomposing the music audio signals into sections and detecting the repetitive parts. The technique will benefit music data analysis, indexing, retrieval and management. In this research, a three-level framework of music structure analysis is proposed. The first level is the beat level: musical audio signals are analyzed via tempo analysis, and the beat is derived as the basic temporal unit for each music piece. Then, feature vectors are extracted for each basic unit. The second level is the measure level, where a similarity matrix between measures can be constructed based on the multiple feature vectors in one measure. The third level is the structure level, whose elements, for example in a pop or rock song, consist of repetitive parts such as verses and choruses and non-repetitive parts such as the intro, outro and bridge. A technique based on dynamic programming is proposed to search for similar parts of a song. With post-processing, the musical sections of the song can be extracted and their boundaries estimated.

Many digital signal processing (DSP) techniques are proposed in the thesis to address low-level music signal processing problems. They are used to analyze the characteristics of musical audio signals. Specifically, techniques based on the phase-locked loop (PLL), the Kalman filter and the hidden Markov model (HMM) are developed for musical beat tracking. The beat locations are estimated on-line for the first two techniques and off-line for the last technique. To further tackle incorrect measurements of beats, an enhanced probabilistic data association (PDA) method that considers both the prediction residual and the music onsets' intensities is applied to the original Kalman filter. On the other hand, for HMM-based musical beat tracking, a special state space is built to model the beats' periodic progression, and the Viterbi algorithm is used to estimate the beats' locations by decoding the musical audio signal into a sequence of beat states and non-beat states. Moreover, dynamic time warping (DTW) is used to calculate the optimized distance between two segments of music signals and thus helps in building the measure-level similarity matrix. Finally, the measure-level similarity matrix is analyzed and repetitive parts of a song such as verses and choruses are identified via a dynamic-programming-based technique. These are pioneering efforts in the music signal processing field, which appears to be a new frontier in digital signal processing.

Chapter 1
Introduction

1.1 Significance of the Research

Automatic music structure analysis from audio signals is an interesting topic that receives more and more attention these days. The technique can be used for music data analysis, indexing, retrieval and management. With the wide access to digital music, many individuals and organizations will benefit from automatic music structure analysis, including music listeners, instrument players (both amateurs and professionals), commercial music distribution companies and digital music libraries, etc.
The objective of automatic music structure analysis is to decompose the music structure into segments and detect repetitive parts in a song for structure extraction.

Automatic music structure analysis helps solve the problem of music transcription. In music transcription, one transcribes the music content from audio signals to notes. One has to evaluate the short-time spectrum of musical audio signals and map it to notes represented by their pitch and duration. This provides a direct (or low-level) analysis of music signals. With the help of the global (or high-level) information from the music structure, it is possible to do music transcription at reduced complexity with high accuracy.

A three-level framework of music structure analysis is proposed in this research. The first level is the beat level: musical audio signals are analyzed via tempo analysis and the basic temporal unit for each music piece is derived. Thus, feature vectors can be extracted for each basic unit. The second level is the measure level, where a similarity matrix between measures can be constructed based on the multiple feature vectors in one measure. The third level is the structure level, whose elements, for example in a pop or rock song, consist of repetitive parts such as verses and choruses and non-repetitive parts such as the intro, outro and bridge. A technique based on the Viterbi algorithm is proposed to search for similar parts of a song. With post-processing, the musical sections of the song can be extracted and their boundaries estimated.

Many digital signal processing (DSP) techniques are proposed in the thesis to address low-level music signal processing problems. They are used to analyze the characteristics of musical audio signals. Specifically, techniques based on the phase-locked loop (PLL), the Kalman filter and the hidden Markov model (HMM) are developed for musical beat tracking. The beat locations are estimated on the fly. With the beat as the basic temporal unit, feature vectors are then extracted from the audio signal using the pitch class profile (PCP). Similarity matching is conducted at the measure level to produce a similarity matrix. Finally, the measure-level similarity matrix is analyzed and repetitive parts of a song such as verses and choruses are identified. These are pioneering efforts in the music signal processing field, which appears to be a new frontier in digital signal processing.

1.2 Background of the Research

In this thesis, we are concerned with computational problems related to music structure analysis. Some related concepts and research works are briefly reviewed below.

• Musical Tempo Induction
Musical tempo induction is to estimate the period of music onset signals. A lot of previous research was conducted on MIDI data [20]. With the availability of digital music in recent years, researchers have turned their attention to audio data recorded in performance [10, 21, 27]. Goto [18] and Scheirer [35] made pioneering studies on music tempo induction and beat tracking. They developed methods to extract onset signals from musical audio data and estimate the period from them. In a recent review article, Gouyon et al. [22] compared the performance of seven methods in tempo induction. Most algorithms can achieve an accuracy of 65% or above for music of different genres with a constant tempo, under the condition that any metrical tempo is accepted as the correct answer. However, it is difficult to obtain a particular metrical level, such as the beat or measure, for musical audio signals without ambiguity.
Furthermore, there is still room for performance improvement since the accuracy rate is still in the range of 60-70%.

• Musical Beat Tracking
Musical beat tracking is to estimate the beat locations in a song. Some previous work focused on on-line musical beat tracking [23, 35, 36]. Scheirer [35] used a comb filter to estimate the tempo and then proposed a method to estimate the beat locations. The beat tracking algorithms in [23, 36] adopt a particle filtering technique. Particle filtering is more generic than Kalman filtering since it makes no assumption on the linearity of the tracking system or the Gaussianity of the underlying signals. However, its complexity is significantly higher. Besides, it does not address the problem of inaccurate measurements. Others studied off-line musical beat tracking methods [30, 27, 1]. Klapuri [27] used the comb filter from [35] but applied it in a graphical framework so as to estimate several metrical levels in each music piece simultaneously. According to [22], it gives the best performance among the seven participants. However, the computational complexity is extremely high since the graphical model uses three metrical levels and each of them has over 100 states.

• Similarity Matrix
Foote [11], [12] proposed to use the similarity matrix for music structure analysis. The x- and y-axes of the similarity matrix represent the index numbers of a sequence of uniformly divided segments of a music piece. An element of the similarity matrix shows the similarity degree between two such segments. Foote adopted the similarity matrix mainly as a tool for audio segmentation and summarization. The measure of audio novelty was also discussed in [12]. When we see a dramatic change in the similarity degrees along the diagonal line of the similarity matrix, it often implies the occurrence of a different part of a song.

• Chorus Section Detection
Goto [16], [17] proposed a technique to detect the chorus section of popular songs. Basically, he detected the most prominent segments parallel to the diagonal line in the similarity matrix and identified the corresponding parts as the chorus section. Since the chorus part contains the familiar tunes of a popular song, it can be a good representation for audio summarization.

1.3 Contributions of the Research

Several major contributions are made in this research. They are detailed below.

• A three-level framework of automatic music structure analysis is proposed. Each of the three levels has its own function: (i) the note level for synchronization and feature extraction, (ii) the measure level for similarity matching and comparison, and (iii) the structure level for music structure extraction and analysis. Under this framework, we can treat the three levels separately. It makes the music structure analysis task more systematic, and the methods for each level can be improved individually.

• A technique based on the phase-locked loop (PLL) is proposed to track the positions of beat pulses. Beat pulses are not regular in musical audio signals in the sense that they may have varying intensity levels and unequal spacing. In addition, musical onset signals are not purely periodic but a linear combination of several frequency components whose frequencies are integer multiples of the fundamental frequency, known as harmonics. With the analogy to the clock and data recovery (CDR) problem, the PLL can track the beat pulses if the frequency of the voltage-controlled oscillator (VCO) is well initialized.

• We propose a dynamic time warping (DTW) technique to calculate the optimal distance between two measures of a song.
Musical audio signals performed by human beings have the intrinsic characteristics of imperfect synchronization and timing variation. Even two parts in a song that are supposed to be identical might have subtle differences. With an appropriate local adjustment in matching time instances (i.e., time warping), DTW yields the optimal similarity between measures.

• We incorporate the musical tempo information in the similarity matrix construction. The musical tempo information, such as the period and the beat locations, provides basic time units that are meaningful to human perception. Using them effectively solves the synchronization problem at the beat level. This in turn leads to an accurate similarity matrix at the measure level. The newly proposed measure-level similarity matrix gives a compact and accurate representation when the musical tempo information is accurate.

• A technique based on the Viterbi algorithm is proposed to detect the repetitive parts of a song. The detection of the repetitive parts, including the verse and chorus parts in modern popular and rock songs, is a key component for the success of automatic music structure analysis. The Viterbi algorithm helps overcome the relatively weak similarity between verse parts and improves the accuracy of repetitive part detection.

• A Kalman filter (KF)-based music beat tracking algorithm with the probabilistic data association (PDA) measurement selection method is proposed to conduct musical beat tracking on-line. Many musical beat tracking algorithms suffer from poor measurement selection methods. This is particularly true when large music onsets occur in the neighborhood of the beat location or when there is no music onset on certain beats. PDA and enhanced PDA are developed to address this problem. PDA uses the prediction residual, while enhanced PDA uses the prediction residual as well as the music onset intensity to improve the measurement selection process. It enhances the tracking performance significantly.

• Another technique based on the hidden Markov model (HMM) is proposed to perform musical beat tracking. The HMM provides a probabilistic framework for musical beat tracking, where prior knowledge of the beat period can be incorporated. It also provides a satisfactory beat tracking performance.

1.4 Outline of the Dissertation

The dissertation is organized as follows. A three-level system framework is proposed and related music knowledge is reviewed in Chapter 2. Musical tempo estimation, feature extraction and the similarity matrix are discussed in Chapter 3. A method for music structure analysis on the similarity matrix is discussed and its performance evaluation is given in Chapter 4. Then, two advanced musical beat tracking algorithms are presented: the Kalman-filter-based musical beat tracking algorithm in Chapter 5 and the HMM-based musical beat tracking algorithm in Chapter 6. Finally, concluding remarks are given and future research directions are pointed out in Chapter 7.

Chapter 2
Research Framework and Music Background Review

As shown in Fig. 2.1, a three-level framework is adopted in our research for music signal processing and music structure analysis. The three levels are: (i) the beat level, (ii) the measure level and (iii) the structure level. They represent the fine, medium and coarse representations of a music piece. This framework provides a bottom-up process for dealing with musical audio signals. The output from a lower level serves as the input to its higher level.
With the help of music knowledge, meaningful music segmentation can be achieved under this framework.

First, the musical audio signal is used as the input to the beat-level subsystem. At the beat level, the musical tempo is estimated from the onset signals. The estimated beat, its half or its double provides a basic temporal unit for further music analysis, and feature vectors that represent music contents can be extracted at this level. Second, in the measure-level subsystem, measures of music signals are compared with each other through their feature vectors to calculate their similarity degree. We can record the similarity degrees of all measures in a song in one similarity matrix, whose elements give the similarity value between a pair of measures in the song. Finally, based on the derived similarity matrix and prior knowledge of the music structure, a structure-level subsystem segments the song into the verse, chorus, intro, outro and bridge parts so as to achieve the function of semantic segmentation.

More details of each level will be discussed below. In the meantime, some background music knowledge and techniques for computational music analysis will also be reviewed.

Figure 2.1: A three-level framework of music signal processing and structure analysis (music signals → tempo estimation and feature extraction at the note level → similarity matrix at the measure level → structure analysis at the structure level).

2.1 Beat-Level Analysis

There exist two fundamental problems in the beat-level subsystem: (i) the length of the basic unit for music analysis and (ii) the features to be extracted to represent music contents.

2.1.1 Tempo Induction and Beat Tracking

Unlike an arbitrary audio signal, musical signals have structural characteristics in both the time and frequency domains. One of their characteristics in the time domain is the musical tempo, which shows the speed of beats (or, equivalently, the number of beats in a given duration). It is often expressed in the unit of "beats per minute" (BPM), which counts the number of beats in one minute. When people listen to music and foot-tap with it, it is not difficult to find that there are secondary pulses with twice or half the speed of the foot-tapping. In other words, there are several metrical levels in a music structure. The ratio between the lengths of adjacent metrical levels is typically two. However, in some music pieces, such as Greensleeves, the tempo ratio between adjacent metrical levels can be three. Even though beats are the most obvious metrical level, it is worthwhile to point out that the smallest metrical level is actually composed of the shortest note. For example, the eighth or sixteenth note is often the shortest note in modern popular and rock songs. Adjacent levels are notes of different lengths, e.g. the sixteenth, eighth, quarter, half and whole notes.

Existing tempo estimation methods consist of two steps: 1) extraction of musical onset signals and 2) computation of their period. Onset signals capture the intensity change of energy [35, 3, 39] or of other low-level features [24, 21, 31]. They often occur when a drum is hit or a musical note is played. Different music instruments show different characteristics in their onset signals. Struck-string instruments such as the piano have clear onset signals, while bowed-string instruments such as the violin and cello have rather unclear intensity changes. For the computation of periodicity, the autocorrelation function and its modifications were used in [15, 21, 38], while Scheirer [35] used comb-filter banks to extract the period.

There are two approaches to perform musical beat tracking: on-line and off-line. On-line musical beat tracking algorithms track the beat location on the fly. Since it is difficult to acquire clear onset signals in a noisy or cluttered environment, maintaining the tracking performance is the main challenge. On the other hand, off-line musical beat tracking emphasizes the correctness of the estimated beat times.

2.1.2 Feature Extraction

Finding features for effective music content representation is fundamental to music database management. Features based on the short-time Fourier transform (STFT), such as the Mel-frequency cepstrum coefficients (MFCC) and the linear prediction cepstrum coefficients (LPCC), have been discussed extensively in previous work. They can be justified by the characteristics of the human auditory system and were developed specifically for speech signal processing. They provide a compact feature representation in a multi-dimensional vector for each frame of music signals. However, they do not take music knowledge such as notes' pitch and harmonic relationships into consideration. Thus, even though these features can be used to describe music signals in the frequency domain, they may not be efficient for music information extraction and summarization.

Features of music contents include rhythm, melody and harmony. Rhythm refers to the pattern of music signals in the time domain that is formed by different durations and stresses. Percussion instruments play an important role for rhythm detection in a song: the duration between percussion pulses and their intensity forms the rhythm pattern. Melody in pop and rock songs mostly corresponds to the vocal sounds from the lead singer and occasionally refers to a music instrument in solo performance. It is the backbone of pop and rock songs because it carries the main messages through the lyrics. Even with state-of-the-art signal processing techniques, melody lines are difficult to extract and separate from the music accompaniment. Harmony refers to the chords of the music accompaniment. Chords are a small collection of pitches played simultaneously or played in order (for example, arpeggiated chords). The number of chord types is limited and they set the basic tone for the music.

2.2 Measure-Level Analysis

The medium metrical level of a song is its measure structure, which denotes the notes between two bar lines in a music score. The time signature in the score shows the relation between the duration of beats and that of measures. For example, a 2/4 time signature means there are two beats in a measure and a quarter note is used as one beat. Pop and rock songs tend to have a simple metrical structure and, hence, most of them use the 4/4 time signature, which means there are four beats in a measure and a quarter note is used as the beat. In our work, we do not estimate the measure from music signals directly. Rather, the time signature is assumed to be known.

In the measure-level subsystem, the tempo information about the measure is used for temporal alignment. Then, the similarity degree between each pair of measures within a song is calculated. The similarity matrix has been used by researchers [11], [17] in the past to explore the self-similarity characteristics within a music piece. The element S(i,j) of a similarity matrix reflects the similarity degree between frame i and frame j. Here, we choose the frame unit to be a measure. The repetition parts within one song have high similarity, which results in a segment parallel to the diagonal line in the similarity matrix.

2.3 Structure-Level Analysis

2.3.1 Repetition Extraction and Application

Two important properties of music, i.e., repetition and variation, play key roles in the music structure. Repetition provides a stable framework that makes listeners familiar with the setting of the song. It also gives a sense of regularity so that listeners can follow the flow of sound effortlessly. On the other hand, variation offers listeners a contrast of emotion and strength among different parts of a song. Sometimes, it yields an unexpected effect so as to create a unique impression on the audience.

There is a gap between low-level features extracted from music signals and the semantic segmentation of the music structure. The similarity matrix built from low-level features does not address the structural aspect of music signals directly. Chorus detection was studied in [17] and [29]. Since chorus parts are in essence identical signals, they are easier to detect than parts that are only partially identical, say with identical melody or harmony but different lyrics. One example is the verse parts. The difference in lyrics seriously affects the music signals in the strength of onset pulses and the relative magnitude of overtones. It sometimes affects the pitch and melody of the music signals, too, which occurs when the singer shouts or howls off the melody and goes off-key, as in rock music. As a result, verse parts are not as clear as chorus parts in a similarity matrix. Their detection demands careful examination and advanced processing.

2.3.2 Music Genres and Their Structure Analysis

The target music genres in this research are popular songs, rock songs, folk songs, country music and other singing music like hymns and children's songs. They possess the characteristic of repetition and a varying degree of variation. This work does not deal with classical music, jazz music, blues music and electronica/techno music. They might have the property of repetition as well, but, in most cases, they have different music structures. For example, classical music does not focus on the vocal performance of singers as much as pop and rock music does. Instead, it uses a different combination of music instruments and emphasizes the instrumental performance. Jazz and blues music rely heavily on improvisation or "jamming" and tend to vary more freely. Electronica or techno music uses samplers, mixers or other modern electronic devices to produce new experimental sounds.

For singing music from pop, rock, folk, country and hymns to children's songs, repetition is a key element of the music structure. Basically, there are three forms of music structures [9]. The first form is the AAA form. The letter A in the AAA form represents one complete melody line. It may repeat several times, not limited to three. Each repetition uses identical melody with different lyrics. The structure of most hymns and folk songs belongs to this form. The second form is the verse/chorus form. Over 90 percent of modern pop and rock songs belong to this form. It has two different types of repetition patterns, verse and chorus, which provide alternating states of emotion and strength. Other elements of music in the verse/chorus form include the intro (the beginning part of a song), the outro (the ending part of a song), and the bridge (in the midst of verse and chorus repetitions, often providing variation in timbre through a music instrument solo or in musical tone through a key change). The last form is the AABA form, where the letter A represents the repetition of melody and harmony with different lyrics and the letter B offers a variation that has different melody and lyrics. This work focuses on the verse/chorus form. The techniques used for the verse/chorus form may be applicable to other music forms, too.

Chapter 3
Music Signal Processing via Tempo and Measure Analysis

In this chapter, we apply digital signal processing techniques to low-level music analysis tasks such as the extraction of the beat frequency and the beats' locations. They are presented in Sec. 3.1 and Sec. 3.2, respectively. Then, we discuss ways to extract feature vectors based on the pitch class profile in Sec. 3.3. Afterwards, we discuss the similarity degree between any two measures of a song and record the data in a similarity matrix for the further analysis considered in Chapter 4.

3.1 Basic Tempo Analysis: Music Onset Detection and Period Estimation

Music tempo plays an important role in human perception of music signals. It provides a temporal structure for delivering melody and rhythm. When listening to music, even non-professional listeners have the ability to foot-tap with the music tempo. Since the tempo normally provides a steady speed of beats, listeners can follow the progression of the music accordingly.

Formally, music tempo refers to the speed at which music is played, and it is often measured by the frequency of beats, that is, the number of beats in a given time. Beats are pulses in a song that occur relatively regularly. Most other metrical elements of music signals do not have such consistent characteristics. For example, in a popular song with a 4/4 time signature, the quarter note is used as the beat. Other elements in the metrical structure, such as eighth notes or half notes, do not occur regularly. For a specific metrical element like the eighth note, pulses might occur at some time instances but not at others; they do not occur as regularly as the pulses of beats. In fact, a specific combination of pulses from different metrical levels constitutes a music rhythmic pattern. In this work, we focus on the regular part of the pulses, that is, the music tempo, since it provides a uniform temporal framework for a song. Under this framework, the music signal can be decomposed into various time units such as shortest notes, beats and measures. The beat-level and measure-level analysis will leverage the musical tempo information.

Previous work on music tempo analysis was reviewed in [20] and [22]. The former examined music tempo analysis systems for both MIDI and audio data and studied other relevant issues such as tempo tracking, time signature determination, rhythmic quantization, etc. The latter reviewed six state-of-the-art music tempo analysis systems for audio data and compared their performance on tempo induction for a large music collection of diverse genres. The common goal of all these systems is to estimate the value of the music tempo based on short excerpts of 20 or 30 seconds. That is, they focus on the detection of the frequency of beat pulses, but not on the specific position of each beat pulse (the phase of the period).
We attempt to extract the frequency as well as the phase information here, since they are both valuable for locating beats or measures of music signals. Our method is built upon the phase-locked loop (PLL) tracking technique.

Fig. 3.1 depicts a procedure to estimate the music tempo from musical audio signals. There are two main components: (i) onset detection and (ii) period estimation. Once the period is estimated, we can derive the tempo accordingly, which is simply the inverse of the period.

Figure 3.1: The block diagram of a music tempo estimation system (onset detection by subband filterbank, envelope extraction, down-sampling, first-order differentiation and half-wave rectification, followed by period estimation).

By following the framework proposed by Scheirer [35], the music onset detection task includes five subtasks as shown in Fig. 3.1. The filterbank consists of one IIR lowpass filter with a cutoff frequency at 200 Hz, four bandpass filters with passbands between 200 Hz and 400 Hz, 400 Hz and 800 Hz, 800 Hz and 1,600 Hz, and 1,600 Hz and 3,200 Hz, respectively, and one highpass filter with a cutoff frequency at 3,200 Hz. With the filterbank, the input music signal is filtered into six subband signals. For each subband signal, the energy envelope is extracted and down-sampled, the first-order difference function is calculated, and then a half-wave rectifier is applied. The resulting signal for each subband is the onset signal as a function of time. The signal is down-sampled to 200 Hz, which reduces the data size and gives the onset signal a time resolution of 5 msec.

An example of the extracted music onset signal is depicted in Fig. 3.2, which is a 2.5-second excerpt from the Carpenters' For all we know. A large intensity indicates a significant change in the energy envelope of one subband signal, which corresponds to the location of an onset. As shown in the figure, there are five primary onsets in the 2.5-second excerpt. All other onsets are relatively small when compared to these five main ones. We see that these primary onsets last for a certain duration.

Figure 3.2: The music onset signal for part of the Carpenters' For all we know.

The last block in Fig. 3.1 estimates the period based on the extracted onset signals. Several techniques were reviewed in [22]. For example, the autocorrelation function (ACF) and the comb-filter method [35] can be used to calculate the period of beat pulses. We plot the ACF of the first 20 seconds of the Police's Every breath you take for the first subband (below 200 Hz) in Fig. 3.3. There are several peaks with strong resonances. Usually, the beat-level pulses, which have the most prominent and consistent characteristics in the onset signals, contribute to the highest peak in the comb-filter method or the second highest peak in the ACF (the highest ACF peak occurs at zero lag, which represents the self-similarity of the onset signal). Thus, they are selected as the period of the onset signals.

Figure 3.3: The autocorrelation function (ACF) of an onset signal.
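The onset-detection and period-estimation chain of Fig. 3.1 can be summarized in the following sketch. It is not the exact implementation used in this work: the filter orders, the envelope-smoothing cutoff and the helper names (onset_signal, acf_period) are assumptions made only for illustration, while the six band edges, the roughly 200 Hz envelope rate and the choice of the strongest non-zero-lag ACF peak follow the text above.

```python
# Minimal sketch of Fig. 3.1, assuming a mono signal `x` sampled at `fs` Hz.
import numpy as np
from scipy.signal import butter, lfilter

def onset_signal(x, fs, env_rate=200):
    """Sum of half-wave-rectified envelope differences over six subbands."""
    edges = [(None, 200), (200, 400), (400, 800),
             (800, 1600), (1600, 3200), (3200, None)]
    hop = int(fs // env_rate)                 # downsample to roughly 200 Hz
    onset = None
    for lo, hi in edges:
        if lo is None:                        # lowpass band
            b, a = butter(4, hi / (fs / 2), btype='low')
        elif hi is None:                      # highpass band
            b, a = butter(4, lo / (fs / 2), btype='high')
        else:                                 # bandpass band
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        sub = lfilter(b, a, x)
        b_e, a_e = butter(2, 10 / (fs / 2), btype='low')   # smooth the energy envelope
        env = lfilter(b_e, a_e, np.abs(sub))[::hop]        # envelope + downsampling
        diff = np.maximum(np.diff(env), 0.0)               # first-order diff + half-wave rectify
        onset = diff if onset is None else onset + diff
    return onset

def acf_period(onset, min_lag=20, max_lag=300):
    """Strongest non-zero-lag ACF peak, returned as a lag in 5 ms samples."""
    acf = np.correlate(onset, onset, mode='full')[len(onset) - 1:]
    return min_lag + int(np.argmax(acf[min_lag:max_lag]))
```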
3.2 Advanced Tempo Analysis: Musical Beat Tracking

In some applications, it is valuable to know the exact time instants of beats. For example, if the locations of all beats are detected accurately, the feature extraction and similarity comparison tasks can be greatly simplified in the later stages.

In the last section, we performed basic tempo analysis by examining the onset signal as well as the frequency of beats (or the period). However, the peaks of the onset signal do not have a one-to-one correspondence to the beats, as shown in Fig. 3.2. Even though the extracted period information is more reliable, this information alone is not sufficient to locate the beats accurately. It is observed in our experiments that, under the uniform-period assumption, we can lose track of the beat location for some time due to cumulative errors. Even by fusing the onset signal and the period information, the resulting beat locations are still not reliable, and the methodology remains rather heuristic. In this section, we propose to use the phase-locked loop (PLL) technique to estimate the location of each beat.

The PLL technique has been widely used in communication and control for more than a decade. It is an adaptive method that synchronizes an output signal with a periodic input signal in both frequency and phase [6]. As shown in Fig. 3.4, there are three main components in a PLL system, namely a phase detector, a lowpass filter and a voltage-controlled oscillator (VCO). The operation of a PLL system can be stated simply as follows. The phase detector compares the input signal u_1 and the synthesized output signal u_2 from the VCO and calculates the phase error u_d between them. If the phase of the input signal leads the VCO output signal, the positive phase error u_d will speed up the VCO output to reduce the phase error. On the other hand, if the phase of the input signal lags behind the VCO output signal, the negative phase error u_d will slow down the VCO output to reduce the phase error, too. The lowpass filter is used to remove the high-frequency part of the phase error signal u_d to make it more robust. When the system becomes steady, the PLL system can synchronize the VCO output signal and the input signal in both frequency and phase. It is worthwhile to point out that musical onset signals are different from the sinusoidal or square waves widely used in communication or control systems. Instead, they consist of a train of pulses of different magnitudes. Thus, we need to find a suitable PLL technique to address our current problem.

Figure 3.4: The block diagram of a phase-locked loop (PLL) system.

There is one PLL application known as clock and data recovery (CDR). In such a communication system, the receiver estimates the data clock from the received signal and then recovers the content of the data [6], [2]. In the CDR problem, we consider a data sequence encoded in the non-return-to-zero (NRZ) format as illustrated in Fig. 3.5, which is often used in asynchronous communication. In this representation, the high and low voltage values correspond to data "1" and "0", respectively. After a "1" is sent, the voltage does not drop back to the low voltage. Thus, it is difficult to detect the clock period directly from the NRZ-format data. In particular, when a long sequence of "1"s or "0"s is sent over the channel, there is little information revealed about the clock. That is, we do not see magnitude peaks of the Fourier spectrum that correspond to the fundamental clock frequency and its harmonics. To tackle a signal in the NRZ format, we can first transform it into a sequence of pulses. One commonly used method is to cascade a first-order differentiator with a half-wave rectifier. Then, the NRZ data in Fig. 3.5 can be converted to the pulse sequence shown in Fig. 3.6.
This procedure is similar to extracting onset signals from musical audio signals.

Figure 3.5: The non-return-to-zero (NRZ) data format in a communication system.

Figure 3.6: Conversion of edges of NRZ data to pulses.

The procedure of applying the PLL to musical beat tracking can be stated as follows.

• Initialization of the PLL System
We estimate the period from the given musical onset signal and use it as the initial period of the VCO output of the PLL system.

• Music beat tracking via PLL
We use data "0" and "1" in the CDR problem to indicate the non-existence or existence of a clear pulse in the music onset signal. Then, the PLL can be applied to this equivalent CDR problem to recover the clock and data, where the data indicate the beat locations.

The components of the PLL system and their parameters have to be selected properly for our current application. First, a multiplier is used as the phase detector. Given the input musical onset signal u_1 and the VCO output signal u_2, the output of the phase detector can be written as

$$u_d = u_1 \cdot u_2, \qquad (3.1)$$

where u_1 is normalized to a range between 0 and 1, and u_2 is a square wave whose values are in the set {−1, 0, 1}. Second, the lowpass filter calculates the average of the phase error u_d and feeds its output u_f to the VCO. Given a sampling rate of 200 Hz for musical onset signals, a first-order IIR filter with a cutoff frequency of 2 Hz is applied. Third, we have to decide the center frequency of the VCO. With the analogy to the clock in the CDR problem, the frequency that corresponds to the "shortest period" in the ACF is selected as the center frequency of the VCO. In other words, the highest frequency that exists in the musical onset signals is used as the "clock".

In the example of the Police's Every breath you take shown in Fig. 3.3, the shortest period is 51 samples and its inverse is used as the center frequency of the VCO. The input musical onset signals are then quantized by this clock rate along the time axis. The angular frequency of the VCO output signal is given by [6]

$$\omega_2 = \omega_0 + K_0 \cdot u_f, \qquad (3.2)$$

where $\omega_0$ is the center frequency of the VCO, $K_0$ is a pre-defined constant, and $K_0 u_f$ determines the amount of deviation from the center frequency. If $K_0 u_f$ is positive, $\omega_2$ will increase and the VCO output signal u_2 will proceed faster. If $K_0 u_f$ is negative, $\omega_2$ will decrease and the VCO output signal u_2 will proceed more slowly. The VCO output signal u_2 is actually decided by $\theta_2$, which is the integral of $\omega_2$ over time and confined between $-\pi$ and $\pi$. The output square-wave signal takes the value 1 or −1 depending on $\theta_2$ via

$$u_2 = \begin{cases} 1, & \text{if } 0 \le \theta_2 < \pi, \\ -1, & \text{if } -\pi \le \theta_2 < 0. \end{cases} \qquad (3.3)$$

An example of music beat location tracking for the first 20 seconds of the Police's Every breath you take is shown in Fig. 3.7, where the x-axis represents time (in units of down-sampled samples). From top to bottom, we show (i) the input musical onset signal u_1, (ii) the output of the phase detector u_d, (iii) the lowpass-filtered signal u_f, (iv) the total phase $\theta_2$ of the VCO output u_2, which is the integral of $\omega_2$, (v) the VCO output u_2, calculated from $\theta_2$ via Eq. (3.3), and (vi) the input musical onset signal along with the tracked boundaries between two beats. We see that the PLL system locks onto the input signal u_1 in the 7th period in this example.

Figure 3.7: An example of music beat location tracking.
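A compact sketch of the tracker defined by Eqs. (3.1)-(3.3) is given below, assuming the onset signal u1 is sampled at 200 Hz and normalized to [0, 1], and that the initial period comes from the ACF step of Sec. 3.1. The loop gain k0 and the one-pole lowpass coefficient are illustrative assumptions; the thesis only specifies the 2 Hz cutoff.

```python
# A minimal PLL beat-tracking sketch built from Eqs. (3.1)-(3.3).
import numpy as np

def pll_track(u1, period0, fs=200.0, k0=0.05, f_cut=2.0):
    omega0 = 2.0 * np.pi / period0             # VCO center frequency (rad/sample)
    alpha = np.exp(-2.0 * np.pi * f_cut / fs)  # one-pole IIR lowpass coefficient
    theta, u_f, u2 = 0.0, 0.0, 1.0
    beats = []
    for n, sample in enumerate(u1):
        u_d = sample * u2                          # Eq. (3.1): multiplier phase detector
        u_f = alpha * u_f + (1.0 - alpha) * u_d    # lowpass-filtered phase error
        omega2 = omega0 + k0 * u_f                 # Eq. (3.2): VCO frequency update
        theta += omega2                            # integrate to get the total phase
        if theta >= np.pi:                         # wrap theta back into [-pi, pi)
            theta -= 2.0 * np.pi
            beats.append(n)                        # one VCO cycle done: boundary between beats
        u2 = 1.0 if theta >= 0.0 else -1.0         # Eq. (3.3): square-wave VCO output
    return np.asarray(beats)                       # beat boundaries in 5 ms samples
```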
If there are small or no pulses for a short while, the PLL system will continue to proceed with the center frequency $\omega_0$. If the musical onset signal resumes again soon, the PLL will be able to keep the tracking performance without much interruption. This kind of pause is common in music performance. For example, when a singer finishes singing a chorus, he might pause for a while intentionally to prepare for the next verse or chorus. As long as the pause is not too long, the PLL stays in the locked state. One example, Debbie Gibson's Lost in your eyes, is shown in Fig. 3.8. Even though we see small or no onset pulses in the 3rd, 7th, 9th and 11th periods, the PLL system continues with the proper center frequency $\omega_0$. Of course, if the lack of onset pulses lasts too long, the PLL system may fall into an unlocked state.

Figure 3.8: An example of music beat location tracking for non-ideal onset signals.

3.3 Feature Extraction

The beat-level subsystem aims at representing the music content by low-level features. Music signals are basically multi-dimensional time-series signals: they proceed with time and change their music contents accordingly. However, the difference between human perception of music and the numerical representation of the signals is very large. Signal processing techniques are needed to fill the gap by representing the music content as closely to the human perception's perspective as possible. In this section, the technique for calculating the pitch class profile (PCP) from music signals is introduced.

3.3.1 Low-level Feature Vector: Pitch Class Profile

Figure 3.9: Block diagram of calculating PCP (Hamming window and FFT, peak picking, frequency mapping and energy addition on 12 semitones, and normalization).

The PCP is used for chord recognition and key-finding on musical audio data in [13], [37], [14]. Each element in the vector represents the relative intensity of one of the 12 pitch classes, i.e., A, A♯, B, C, C♯, D, D♯, E, F, F♯, G and G♯. It is calculated once for each basic time unit, which is selected to be the length of one half beat. For example, for a 4/4 time signature, the quarter note is one beat, so the duration of an eighth note is the basic time unit.

The calculation of the PCP includes several steps, which are shown in Fig. 3.9. First, a Hamming window is applied to the music signal and its discrete Fourier transform (DFT) is calculated. Next, peaks that correspond to dominant harmonic components are picked from the magnitude spectrum, and their frequencies are mapped to one of the 12 pitch classes according to

$$\text{Pitch Class Number} = \mathrm{mod}\big(\mathrm{round}(12 \times \log_2(f/440)),\, 12\big) + 1, \qquad (3.4)$$

where round(·) rounds its operand to the nearest integer, f is the frequency of a peak, and 440 Hz is the frequency of the note A4. Third, the energy of the peaks in the magnitude spectrum is added to the elements of the PCP feature vector according to the pitch class number. That is, the energies of all the peaks that have pitch class number i are added to the i-th element of the PCP vector. Each element of a PCP vector thus represents the relative intensity of one pitch class. Finally, the PCP vectors are normalized to unit length, since we are only concerned with the energy distribution pattern of the PCP vector.
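The following sketch mirrors the steps of Fig. 3.9 and Eq. (3.4) for a single half-beat frame. The simple local-maximum peak picker, its relative threshold and the 27.5 Hz lower bound are assumptions made for illustration; the thesis does not spell out its exact peak-detection rule. The chroma index here is 0-based, whereas Eq. (3.4) is 1-based.

```python
# A minimal PCP (chroma) sketch for one analysis frame `frame` at sample rate `fs`.
import numpy as np

def pcp(frame, fs, ref=440.0):
    windowed = frame * np.hamming(len(frame))          # Hamming window
    spec = np.abs(np.fft.rfft(windowed))               # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    thresh = 0.05 * spec.max() if spec.max() > 0 else 0.0
    chroma = np.zeros(12)
    for k in range(1, len(spec) - 1):
        # keep local spectral maxima above a small threshold (dominant harmonics),
        # ignoring very low frequencies where the pitch mapping is unreliable
        if spec[k] > thresh and spec[k] > spec[k - 1] and spec[k] > spec[k + 1] and freqs[k] > 27.5:
            pc = int(np.mod(np.round(12.0 * np.log2(freqs[k] / ref)), 12))  # Eq. (3.4), 0-based
            chroma[pc] += spec[k] ** 2                  # add the peak's energy to its pitch class
    norm = np.linalg.norm(chroma)
    return chroma / norm if norm > 0 else chroma        # unit-length PCP vector
```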
The PCP feature vectors capture the tonal characteristics of music signals and offer a different pattern for each chord used in a song. Moreover, if PCP feature vectors are extracted successively from a music segment, the sequence of PCP feature vectors can reflect the chord progression via the pattern change of the PCP. For modern popular and rock songs, a song basically consists of the main melody part sung by the singer and the harmonic part produced by music instruments such as guitars and pianos. Even though the two parts play different roles in a song, chords are assigned to the song and both parts need to conform to the harmonic constraint of the same chord. That is, if the music in a measure (between bar lines) belongs to the C chord, both the melody and the harmonic parts will most likely contain the C, E and G (Do-Mi-Sol) notes, while the F (Fa) note is unlikely to appear. For example, in Oasis' Don't look back in anger, the chord sequence is played several times through the whole song. The PCP sequences for two segments that have the same chord sequence will show a similar pattern.

3.4 Measure Analysis and Their Similarity Measurement

A similarity matrix is a matrix in which each element represents the similarity of two corresponding feature vectors. Say there exists a similarity matrix S = [s_ij]. The element s_ij is the similarity between feature vectors f_i and f_j. If the two feature vectors f_i and f_j are "similar", the similarity value s_ij will be high. Conversely, if the two feature vectors f_i and f_j are dissimilar, the similarity value s_ij will be low. A song often consists of several repetitions, such as chord sequences or melodies, so there are many pairs of feature vectors that are similar to each other and will have high similarity values.

The similarity matrix has been used in previous works. Foote [12] used the similarity matrix of a music piece along with its novelty measure for audio summarization. His work aimed at detecting the boundaries between two different segments such as music and speech. Goto [16] examined the problem of chorus detection based on the high similarity of chorus sections.

3.4.1 STFT-based Similarity Matrix and Its Shortcomings

Previous works [8] and [16] use the short-time Fourier transform (STFT) with small window and step sizes, and the similarity is calculated between feature vectors extracted from the resulting frames of data. However, there are some problems in using a small step size or a large window size.

• Large Data Size
First of all, the use of a fixed window step size in the STFT can achieve reliable similarity matrix computation for the targeted music piece only when the step size is small. Thus, it is inevitable that the size of the resulting similarity matrix is large and the computational burden of further processing increases accordingly. For example, to capture the time-varying characteristics of music signals, the step size of the sliding window is in the range of tens of milliseconds, say 10 ms to 50 ms. Thus, the size of the similarity matrix would be around 18,000 by 18,000 to 3,600 by 3,600 for a 3-minute-long pop song. It imposes a huge computational burden if further processing is needed.

Figure 3.10: Attack - Decay - Sustain - Release pattern.
• Data Redundancy
The small step size does not provide much meaningful information when it is shorter than the shortest note in a music piece. For a quarter-note beat at a tempo of 120 BPM, suppose the shortest note is a sixteenth note; its duration is 125 msec. When a 10 msec step size is used, there are 12 consecutive feature vectors that refer to the same sixteenth note. These 12 feature vectors provide no more information than a single feature vector applied exactly to the sixteenth note's duration.

• Phase Synchronization Issue for Large Windows
If a large window size is used, different positions of the window relative to the data will result in different feature vectors. Thus, the similarity of two identical music signals cannot be fully achieved. For a single played note, the musical audio signal does not have a steady energy envelope over the whole note duration. Instead, it shows the characteristics of ADSR (Attack Decay Sustain Release) as shown in Fig. 3.10. Attack and decay are the increasing and decreasing regions of the sound energy, respectively. They carry more energy and are more periodic than the other two parts, sustain and release, which are the durations when the note is held and released. Suppose there are two identical notes in a music piece but in different phrases, and a large window is applied to the first one centered on the attack region while a large window is applied to the second one centered on the sustain region. The similarity of these two notes will not be as high as it should be because the applied windows are shifted and the resulting feature vectors are not highly similar.

Figure 3.11: (a) The small window method and (b) the tempo-based window method.

All these problems can be solved by using a window size that corresponds to the shortest note. It provides a coarse-level similarity matrix and reduces the data size significantly. For a 3-minute-long pop song with a tempo of 120 BPM, the size of the similarity matrix decreases from 18,000 by 18,000 to 1,440 by 1,440 if the shortest note is a sixteenth note, or to 720 by 720 if the shortest note is an eighth note. Fig. 3.11 shows the two different window designs. Fig. 3.11(a) shows the original small window and step size; the step size is smaller than the window size so that adjacent windows overlap at their boundaries. Fig. 3.11(b) shows the tempo-based window, which is as long as the duration of the shortest note. The window size in (b) is longer than in (a) and there is no need for adjacent windows to overlap, which means the step size is equal to the window size. Thus, for a given song, the number of feature vectors with the tempo-based window is much smaller than that with the original small window, and the goal of data reduction is achieved.

3.4.2 Construction of Measure-level Similarity Matrix

Figure 3.12: Measure-level similarity matrix, where s_ij represents the similarity of measure i and measure j.

In order to further reduce the data and help deduce higher-level information, a measure-level similarity matrix is used to replace the note-level similarity matrix. That is, the element s_ij of the similarity matrix S represents the similarity of measures i and j, as shown in Fig. 3.12, instead of that of shortest notes i and j. Thus, the number of rows or columns of the measure-level similarity matrix equals the number of measures in the targeted song. For a 3-minute-long pop song, given a tempo of 120 BPM and a (4/4) time signature, the number of measures is

\frac{3 \cdot 60}{\frac{60}{120} \cdot 4} = 90.

The size of the similarity matrix is reduced from 1,440 by 1,440 to 90 by 90.
The similarity degree s_ij between measures i and j is a single value computed between two feature vector sequences, each of which consists of 16 feature vectors if the shortest note is a sixteenth note, or 8 feature vectors if the shortest note is an eighth note. One way to calculate the similarity between two feature sequences is

s_{ij} = \frac{1}{N} \sum_{k=1}^{N} p_{ik}^{T} p_{jk},   (3.5)

where \{p_{i1}, p_{i2}, p_{i3}, \ldots, p_{iN}\} and \{p_{j1}, p_{j2}, p_{j3}, \ldots, p_{jN}\} are the feature vector sequences for measures i and j, respectively, and N is the number of shortest notes in a measure. One example of the used feature vector is the PCP, in which case p_{ik} and p_{jk} are both 12-by-1 column vectors. The inner product p_{ik}^T p_{jk} of two PCP feature vectors, which are both normalized to unit length, measures the degree of similarity between the k-th feature vectors of measures i and j. That is, two PCP feature vectors of the same order in the two measures are compared: the similarity between the first PCP from measure i and the first PCP from measure j is calculated, then that between the second PCP from measure i and the second PCP from measure j, and so on, until the N-th PCPs. The similarity of the two feature sequences is the average of the N similarity values of the feature vector pairs, and it always lies between 0 and 1. Values close to 1 indicate a high similarity between the two corresponding measures, while values far from 1 indicate a low similarity.

Sometimes it is convenient to transform similarity values between two measures into distance (dissimilarity) values via

d_{ij} = 1 - s_{ij}, \qquad D = \mathbf{1} - S,

where D is the distance matrix, \mathbf{1} is the all-ones matrix and S is the similarity matrix. Fig. 3.13 below shows an example of the measure-level distance matrix of Nirvana's Smells Like Teen Spirit. The number of measures is 147 and the shortest note used is the eighth note. The darker the color, the higher the similarity and the lower the distance value.
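For illustration, the construction of the measure-level similarity and distance matrices can be sketched directly from Eq. (3.5). This is a minimal sketch assuming the PCP vectors have already been grouped into an array of shape (number of measures, N, 12) with unit-length rows; the function name and shapes are illustrative.

```python
import numpy as np

def measure_similarity_matrix(pcp_by_measure):
    """Measure-level similarity and distance matrices following Eq. (3.5).

    `pcp_by_measure` is assumed to be an array of shape (num_measures, N, 12)
    holding the N unit-length PCP vectors of each measure.
    """
    num_measures = pcp_by_measure.shape[0]
    S = np.zeros((num_measures, num_measures))
    for i in range(num_measures):
        for j in range(num_measures):
            # average inner product of the k-th PCP pairs, k = 1..N
            S[i, j] = np.mean(np.sum(pcp_by_measure[i] * pcp_by_measure[j], axis=1))
    D = 1.0 - S          # element-wise distance d_ij = 1 - s_ij
    return S, D
```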
Qualitatively, the distance matrix shows several characteristics. First, the diagonal represents the similarity between a measure and itself; therefore, the similarity values along the diagonal are all equal to 1, from the first measure pair to the last. Second, the similarity matrix is symmetric about the diagonal: the part above the diagonal and the part below it are identical because a symmetric distance measure is used, that is,

s_{ij} = s_{ji} \quad \text{for all } 1 \le i \le N, \; 1 \le j \le N.   (3.6)

Third, an off-diagonal matrix element s_ij represents the similarity between measure i and measure j. What is important is that high similarity values often occur for successive measures. This is expected because, for example, chorus parts are highly similar repeated parts within a song, and their mutual similarity shows up in the similarity matrix. In Fig. 3.13, one chorus part starts at the 61st measure and lasts until the 68th measure, and a second chorus part starts at the 113th measure and lasts until the 120th measure, so the elements \{s_{61,113}, s_{62,114}, \ldots, s_{68,120}\} of the similarity matrix have relatively high similarity values. They constitute an off-diagonal line parallel to the main diagonal, and each point on it has a high similarity value. Finally, as stated in the previous section, the PCP reflects the tonal characteristics of music signals. On the similarity matrix, some blocks show a darker color than adjacent blocks, which suggests that the measures in the dark blocks share the same or closely related chords.

Figure 3.13: The measure-level distance matrix of Nirvana's Smells Like Teen Spirit.

Figure 3.14: The calculation of measure-level similarity: (a) linear one-by-one correspondence between ordered shortest notes; (b) distorted correspondence between ordered shortest notes.

3.4.3 Optimized Distance Matrix Calculation Using Dynamic Time Warping (DTW)

Fig. 3.14(a) shows the ordered comparison of PCP feature vectors in two measures i and j: the k-th shortest note of measure i is compared with the k-th shortest note of measure j and the similarity between them is calculated. However, the linear one-by-one correspondence of shortest notes sometimes does not hold, and the distance matrix calculated via Eqs. (3.5) and (3.6) is not optimal, for the following reasons. First, in real musical audio signals, whether from studio recordings or live performances, exact temporal synchronization of the repeated parts within a song is never perfect. One performer might be late by a short duration at the beginning of a song but then catch up with the tempo afterwards. Another performer might intentionally extend the duration of one or several notes to express his interpretation of the song while keeping the other notes intact, which is the situation depicted in Fig. 3.14. Second, the phase of a beat or a measure can only be estimated with some estimation error. The pulses of beats themselves can last more than 100 msec, and hence the onset signal is not a delta function at a specific time instant but looks more like an ADSR pattern multiplied by a Walsh function, i.e., a train of square pulses. The estimation error of onsets might be as large as 200 msec.

Dynamic Time Warping (DTW) [33] provides a powerful tool to calculate the optimal distance between two temporal segments that may have some noisy variation along the temporal domain. In other words, DTW attempts to remove variations in the singing speed and irregular timing of the tempo. The main idea is that DTW does not match two feature sequences one by one in the exact order, as shown in Fig. 3.14(a), but tries to "tweak" adjacent features back and forth so as to find the best match, as shown in Fig. 3.14(b).

Figure 3.15: DTW distance matrix.

In order to calculate the distance between two feature sequences of measures i and j, each of which consists of N shortest notes, the two feature sequences are placed along the x-axis and y-axis of a distance matrix, respectively, as shown in Fig. 3.15. Note that the distance matrix here should not be confused with those in Fig. 3.12 and Fig. 3.13. The previous distance matrix (or similarity matrix) represents the distance relationship among the measures of a song, while the distance matrix here represents the distance relationship among the PCP vectors of two measures i and j. The goal of the distance matrix here is to spread out all possible distances between pairs of PCP feature vectors and let DTW find the optimal path with the shortest distance from the first shortest note to the last one. In Fig. 3.15, an element d_pq is the distance between the p-th feature vector from measure i and the q-th feature vector from measure j. The calculation of the distance is based on their inner product (note again that the used PCP feature vectors are unit vectors), as shown below.
d_{pq} = 1 - P_p^T P_q = 1 - \sum_{k=1}^{M} p_{pk} p_{qk},   (3.7)

where M is the number of elements in a feature vector; for the PCP, M is 12. Every possible distance between the features of measure i and measure j is calculated and placed in the distance matrix. The search for the optimal path starts from the bottom-left corner d_{11}, which is the distance between the two first shortest notes, and stops at the top-right corner d_{NN}, which is the distance between the two last shortest notes. The DTW algorithm uses the following recursion to find the path with the minimum cumulative distance on the distance matrix:

C(p,q) = d(p,q) + \min\{\, C(p-1,q-1),\; C(p-1,q),\; C(p,q-1) \,\}.   (3.8)

Figure 3.16: DTW's local continuity constraint on the path.

Eq. (3.8) imposes a local continuity constraint on the path. It considers only three possible choices, i.e., (p-1,q), (p,q-1) and (p-1,q-1), from which to proceed into position (p,q) in one step, as shown in Fig. 3.16. There is another constraint that specifies the legal range within which the path may deviate from the conventional 45-degree matching line. It represents the constraint that the degree of "tweak" on each feature vector is limited; all tweaks that exceed the limit are not allowed. Therefore, the search space is limited to the area between the pair of dashed lines parallel to the diagonal in Fig. 3.15, and DTW has to find the best path within this range. This is called the global constraint. In our implementation, the upper dashed line stays 4 shortest notes above the diagonal while the lower one stays 4 shortest notes below it.

DTW achieves a smaller distance for every measure pair within a song than the ordered matching of Eq. (3.5). Take Nirvana's Smells Like Teen Spirit as an example. There are two long similar segments: the first starts at the 13th measure and ends at the 44th measure, while the second starts at the 49th measure and ends at the 80th measure. They have low distance values in a sub-diagonal segment from (49,13) to (80,44) in the similarity matrix. The distance matrix using DTW is shown in Fig. 3.17, while that without DTW was shown in Fig. 3.13. The sub-diagonal segment ending at (80,44) in the matrix using DTW is more apparent than in the matrix without DTW.

Figure 3.17: The measure-level distance matrix using DTW for Nirvana's Smells Like Teen Spirit.

The distance decrease for the two segments due to DTW is shown in Fig. 3.18. The upper dash-dotted line represents the distance without DTW while the lower dashed line is the distance with DTW. The distance values for the two similar segments without DTW are much higher than those with DTW. The averaged distances over the whole similarity matrix and over the similar segments only are compared in the first and second rows of Table 3.1. We see that DTW can greatly reduce the distance for similar segments.

Figure 3.18: Comparison of distance between similar measures with and without DTW.

Table 3.1: Comparison of the averaged distances with and without DTW.

                           Without DTW   With DTW   Distance Decrease
Whole Similarity Matrix    0.1956        0.1820     6.95%
Similar Segments           0.2140        0.1210     43.46%
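The DTW computation of a single measure-pair distance can be sketched as follows. This is a minimal sketch, not the implementation used in this work: each measure is assumed to be an (N, 12) array of unit-length PCP vectors, the local distance is taken as one minus the inner product to match the distance convention above, the global band is ±4 shortest notes, and the final normalization by N is an illustrative choice.

```python
import numpy as np

def dtw_measure_distance(P_i, P_j, band=4):
    """DTW distance between two measures of PCP vectors (Eqs. (3.7)-(3.8)).

    `P_i` and `P_j` are (N, 12) arrays of unit PCP vectors. The recursion
    follows Eq. (3.8) and the search is restricted to a band of +/- `band`
    shortest notes around the diagonal (the global constraint).
    """
    N = len(P_i)
    local = 1.0 - P_i @ P_j.T            # local[p, q] = 1 - inner product
    D = np.full((N, N), np.inf)          # cumulative distances C(p, q)
    for p in range(N):
        for q in range(max(0, p - band), min(N, p + band + 1)):
            best_prev = 0.0 if p == 0 and q == 0 else min(
                D[p - 1, q - 1] if p > 0 and q > 0 else np.inf,
                D[p - 1, q] if p > 0 else np.inf,
                D[p, q - 1] if q > 0 else np.inf,
            )
            D[p, q] = local[p, q] + best_prev
    return D[N - 1, N - 1] / N           # cumulative distance, normalized by N
```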
3.5 Conclusion

In this chapter, we examined three low-level music signal processing tasks: (i) tempo analysis at the beat level, (ii) time-frequency feature extraction at the beat level and (iii) similarity analysis at the measure level. For tempo analysis, we aimed at estimating the period and the phase of music signals so as to extract the beat pattern, which offers the basic time unit in human perception of music known as the beat. Then, we considered the feature extraction problem for music signals, whose results can be used for similarity comparison between beats and measures; we proposed a scheme to extract feature vectors based on time-frequency analysis aligned with beats. Finally, we presented a scheme to measure the similarity of different parts of a song. It turns out that the measure provides a proper unit for this objective, and we described a systematic way to construct a measure-level similarity matrix. These low-level musical signal processing results are used for the high-level music structure analysis studied in Chapter 4.

Chapter 4
High-Level Music Structure Analysis and Decomposition

4.1 Introduction

The music structure of most modern popular and rock songs is of the verse-chorus form [9]. Under this form, the chorus is repeated with the same melody, chords and lyrics, while verses are repeated with the same melody and chords but different lyrics. The verse parts serve a "story telling" purpose: a singer uses them to tell the story and prepare for the coming of the chorus. The chorus serves a "sing along" purpose, and it has more dynamics and timbre types. Usually, a song contains repetitions of and alternations between several verses and the chorus. There are other elements in a song that are non-repetitive. Examples include the "intro", which is the beginning part of a song, the "outro", which is the ending part of a song, and the "bridge", which is a section that offsets the predictability of the verse and chorus or makes the transition from one pattern to another. Practically, a bridge lies either between a verse and a chorus or between multiple choruses. The bridge is sometimes played by a solo instrument to provide "freshness" of timbre [9].

Some examples of the structure of pop and rock songs are given below. Here, the letters I, V, C, B and O denote intro, verse, chorus, bridge and outro, respectively.

• Pixies' Wave of Mutilation: I V C B V C O
• Nirvana's Smells Like Teen Spirit: I V C V C B V C O
• U2's Vertigo: I V C V C B C O
• Oasis' Don't look back in anger: I V C V C V C O
• Talking Heads' Psycho Killer: I V C V C B V C O

Accurate detection of verses and choruses is a key component in the success of automatic music structure analysis. However, compared with choruses, verses often have weaker similarity among themselves. Even though verses share identical or similar melodies on the music score, they do not exhibit high similarity in their feature vectors due to different lyrics. This makes their robust detection more difficult.

Two techniques are proposed in this chapter to enhance the detection performance of repetitive segments. First, the relative intensities of all notes are examined. In particular, low-frequency notes (lower than A3) and high-frequency notes (higher than A6) are removed from the pitch class profile (PCP) feature calculation if they have dominating intensities, since they are primarily contributed by musical instruments rather than human voices. Second, the Viterbi algorithm is used to find the optimal path in the lower-triangular part of a similarity matrix. Even though there may exist low-similarity parts in a verse or chorus segment, the Viterbi algorithm can determine the globally optimal segment while ignoring locally low similarity values.
Finally, we introduce post-processing steps to decompose the music structure into verses, choruses and non-repetitive parts such as intro, bridge and outro.

The rest of this chapter is organized as follows. The construction of a measure-level similarity matrix using a modified pitch class profile (PCP) is described in Sec. 4.2. Long similarity segment detection via the Viterbi algorithm is presented in Sec. 4.3. Some post-processing techniques are presented in Sec. 4.4. A procedure for automatic music structure analysis is described in Sec. 4.5. Experimental results are given and discussed in Sec. 4.6. Concluding remarks are given in Sec. 4.7.

4.2 Similarity Matrix Computation Using Filtered PCP

Given a song, we first construct a measure-level similarity matrix, as described in Sec. 3.4. The similarity degree between two measures is calculated using the correlation of the pitch class profile (PCP) vectors. However, computing the similarity matrix from the PCP feature vectors in a straightforward manner has a problem: the accompanying background music from instruments such as basses and guitars may provide repetitive phrases throughout the whole song, and their intensities are so strong that they dominate the PCP vectors. The similarity matrix built from such PCP feature vectors may then fail to reveal the repetitive patterns of choruses and verses properly. Instead, it shows the repetitive pattern of the accompanying phrases in the form of many short segments in the similarity matrix.

To suppress the effect of repetitive accompanying musical phrases, the intensity of each individual semitone between note A1 (55 Hz) and A8 (7040 Hz) is first examined over the whole song. The note number can be computed as

\text{Note Number for a Semitone} = \lfloor 12 \times \log_2(f/440) \rfloor + 69,   (4.1)

where 69 is the note number of A4. Then, if the notes lower than A3 (220 Hz) or higher than A6 (1760 Hz) have very strong intensities compared with those in the middle range (i.e., between A3 and A6), their frequency components are removed from the calculation of the PCP. The resulting PCP is called the filtered PCP.

The middle range between A3 and A6 is where most vocal sounds are located, and this interval often corresponds to the frequency range of the main melody. However, its energy in the spectrum is often lower than that of the accompanying music in the low-frequency region and of the percussion sounds in the high-frequency region. Many rock songs have strong music accompaniment from electric basses and guitars as well as percussion. The similarity matrices built from the plain PCP suffer seriously from the low- and high-frequency components and cannot clearly show the similarity between verse and chorus parts. One example is U2's Vertigo, where the notes' intensities below A3 (note number 57) and above A6 (note number 93) are much higher than those between A3 and A6, and the similarity in the middle range is masked by the low- and high-frequency components. After they are removed in the calculation of the PCP, the similarity of the verse and chorus is revealed more clearly.
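A sketch of the filtering step is given below. It assumes that the spectral peaks of the current frame and a per-semitone energy profile of the whole song are already available; the `dominance` threshold is a hypothetical parameter, not a value taken from this work, and the pitch-class index here is 0-based.

```python
import numpy as np

A3_NOTE, A6_NOTE = 57, 93   # note numbers for A3 and A6 (A4 = 69)

def note_number(freq_hz):
    """Eq. (4.1): map a frequency in Hz to its semitone note number."""
    return int(np.floor(12 * np.log2(freq_hz / 440.0))) + 69

def filtered_pcp(peak_freqs, peak_energies, note_energy, dominance=5.0):
    """Filtered PCP for one frame.

    `note_energy` is an assumed dictionary {note number: energy accumulated
    over the whole song}. Peaks whose note lies outside [A3, A6] are dropped
    when that note's total energy exceeds `dominance` times the average
    energy of the middle (A3-A6) range.
    """
    mid = [e for n, e in note_energy.items() if A3_NOTE <= n <= A6_NOTE]
    mid_avg = np.mean(mid or [0.0])
    pcp = np.zeros(12)
    for f, e in zip(peak_freqs, peak_energies):
        n = note_number(f)
        if (n < A3_NOTE or n > A6_NOTE) and note_energy.get(n, 0.0) > dominance * mid_avg:
            continue                      # suppress dominating low/high notes
        pcp[n % 12] += e
    norm = np.linalg.norm(pcp)
    return pcp / norm if norm > 0 else pcp
```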
Fig. 4.2 and Fig. 4.3 show the similarity matrices obtained using the conventional PCP and the filtered PCP, respectively. The former provides strong similarity lines that correspond to the partial chorus parts around the 25th, the 50th, the 80th and the 90th measure, but it does not indicate the similarities around the 30th, the 60th and the 100th measure, which are revealed in the latter. In addition, only the latter shows the dramatic change of the average similarity through cascaded block areas along the diagonal, which correspond to the chords and keys in use. For example, the average similarity between the 5th and the 13th measure is larger than that between the 14th and the 21st measure; the variance of the average similarity within each segment is small while it is large between segments. On the contrary, the former, based on the plain PCP, does not show any "cascaded block areas" and appears noisy. The latter, based on the filtered PCP, has clear sub-diagonal lines and provides more information about similar segments. The calculation of the boundaries corresponding to the dramatic change of the average similarity will be discussed in detail in Sec. 4.4.

4.3 Similarity Segment Detection via Viterbi Algorithm

Given a measure-level similarity matrix, our next task is to study the global musical structure based on the pattern analysis of the similarity matrix. For example, we are interested in finding repetitive parts, since they represent verses or choruses. As shown in Fig. 4.4, a sub-diagonal interval in the similarity matrix with consecutive high similarity values from (i_1, j_1) to (i_M, j_M) indicates a strong similarity between two segments in the song composed of consecutive measures j_1, ..., j_M and i_1, ..., i_M, respectively.

Figure 4.1: The relative intensity of individual notes, where the x-axis represents the note number (note that 69 is the note number for A4).

Figure 4.2: The measure-level similarity matrix constructed using the conventional PCP feature vectors.

Figure 4.3: The measure-level similarity matrix constructed using the filtered PCP feature vectors.

The two segments could be a chorus, a verse, or the combination of a chorus and a verse.

Figure 4.4: Two similar segments, j_1 to j_M and i_1 to i_M, shown in a similarity matrix.

Generally speaking, the chorus parts have strong similarity among themselves and show very apparent segments along the sub-diagonals of the similarity matrix. In contrast, the similarity values between two verse parts are not as obvious; they do not constitute solid segments but are broken into short segments with low similarity values in between. Thus, the verse parts are more difficult to detect, and new techniques have to be developed for this purpose. A systematic approach to the detection of similar segments based on the Viterbi algorithm is described next.

Once the similarity matrix is given, we would like to detect segments along its sub-diagonals that have consecutive high similarity values. Since the matrix is symmetric, we can focus on the lower-triangular part only. To overcome the problem of the weaker similarity of verses, the Viterbi algorithm is used to detect these line segments reliably. The algorithm starts from the first measure of the music signal in the bottom-left corner of the similarity matrix. Originally, the x- and y-axes of the similarity matrix both represent the measure index of a given song. To perform the Viterbi algorithm, we interpret the x-axis as the "time", the y-axis as the "state", and the element s_ij as the probability at time i and state j. Thus, a higher similarity degree implies a larger probability.
The Viterbi algorithm attempts to find a more similar segment and, thus, a higher cumulative probability. The time-state space of the Viterbi algorithm is illustrated in Fig. 4.5, where the dashed line represents the diagonal of the similarity matrix; the time-state space is located below this diagonal. A circle in the time-state space represents a state, which is one of the possible measures below the diagonal. The element s_ij of the similarity matrix represents the probability that the current measure i (the time) is similar to a previous measure j (the state).

Figure 4.5: The time-state space of the Viterbi algorithm on the lower-triangular similarity matrix.

For each time index i and state index j, the Viterbi algorithm updates the cumulative probabilities of different paths along time and finds the path with the highest probability. We use Q(i-1,k) to denote the largest cumulative probability from some initial time i_0 to time i-1 and state k. Then, the largest cumulative probability Q(i,j) from time i_0 to time i and state j can be written as

Q(i,j) = \Big[ \max_k P_T(j,k)\, Q(i-1,k) \Big] P_S(i,j),   (4.2)

where P_T(j,k) is the transition probability from state k to state j and P_S(i,j) = s(i,j) is the probability at time i and state j. The best previous state for time i and state j is the one that maximizes P_T(j,k)Q(i-1,k). Thus, it can be expressed as

R(i,j) = \arg\max_k P_T(j,k)\, Q(i-1,k).   (4.3)

Given appropriate initial conditions, the Viterbi algorithm recursively calculates Q(i,j) and R(i,j) for 2 \le i \le L, where L is the number of measures in the song, and for 1 \le j < i. At time i = L (the last measure), the maximum of Q(L,j) over all states 1 \le j < L and the corresponding previous state can be found via

Q^* = \max_{1 \le j \le L-1} Q(L,j),   (4.4)
R^*_L = \arg\max_{1 \le j \le L-1} Q(L,j).   (4.5)

Since the optimal path should lie in the lower-triangular part of the similarity matrix, we require i > j for Q(i,j). Backtracking is then applied to R^*_i in order to find the previous state R^*_{i-1}. The optimal state sequence can be found accordingly and expressed as

R^*_2, R^*_3, \cdots, R^*_L,   (4.6)

which is also called the optimal path in the similarity matrix. Since the path lies only in the lower-triangular part of the similarity matrix, the only initial state is

Q(2,1) = P_{Init} \cdot s_{21},   (4.7)

where P_{Init} is an arbitrary positive constant.

We can enforce a preference among the traversing directions of the optimal path by choosing a proper state transition probability. Since adjacent measures share the same chords and keys, they tend to have a higher similarity degree than measures that are far apart. This implies that s(i,j) tends to have a larger value if |i-j| is small. However, we are more interested in finding similar segments located in other parts of the song, where |i-j| is relatively large, so the similarity caused by location vicinity should be somewhat de-emphasized. In other words, we need a weighting scheme that lowers the s(i,j) values in this region.

Since only sub-diagonal lines are pertinent to the similarity of segments composed of consecutive measures, the state transition probability P_T(j,k) is selected to reflect such a preference. That is, for state j, we choose

P_T(j,k) = \begin{cases} P_{T0}, & j = k+1, \\ P_{T1} = \dfrac{1-P_{T0}}{L-1}, & \text{otherwise}, \end{cases}   (4.8)

where L is the number of measures in the song. Furthermore, we demand P_{T0} > P_{T1} = \frac{1-P_{T0}}{L-1} to guarantee the preference along the sub-diagonal direction.
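A compact sketch of this search is given below. It is an illustrative implementation rather than the exact code used in this work: it operates in the log domain to avoid numerical underflow and keeps only the ratio P_T0/P_T1 (the `ratio` argument), since adding the same constant to all transitions at each step does not change the selected path.

```python
import numpy as np

def detect_similarity_path(S, ratio=1.5):
    """Viterbi search for the optimal path in the lower-triangular part of S.

    Log-domain version of Eqs. (4.2)-(4.8): time runs over measures i, states
    are previous measures j < i, and the transition to j = k + 1 (staying on a
    sub-diagonal) is preferred by the factor `ratio` = P_T0 / P_T1.
    """
    L = S.shape[0]
    log_S = np.log(S + 1e-12)
    log_stay, log_jump = np.log(ratio), 0.0
    Q = np.full((L, L), -np.inf)          # cumulative log-probabilities Q(i, j)
    R = np.zeros((L, L), dtype=int)       # back-pointers R(i, j)
    Q[1, 0] = log_S[1, 0]                 # initial state, Eq. (4.7)
    for i in range(2, L):
        for j in range(i):                # states strictly below the diagonal
            prev = np.full(i - 1, -np.inf)
            for k in range(i - 1):
                trans = log_stay if j == k + 1 else log_jump
                prev[k] = trans + Q[i - 1, k]
            R[i, j] = int(np.argmax(prev))
            Q[i, j] = prev[R[i, j]] + log_S[i, j]
    # backtrack from the best state at the last measure, Eqs. (4.4)-(4.6)
    path = [int(np.argmax(Q[L - 1, :L - 1]))]
    for i in range(L - 1, 1, -1):
        path.append(R[i, path[-1]])
    return list(reversed(path))           # optimal state sequence R*_2 ... R*_L
```

The triple loop makes this sketch O(L^3); for the 100-to-150-measure songs considered here that is inexpensive.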
Practically, in the design of P_{T0}, we may examine the ratio of P_{T0} to P_{T1}, which indicates the degree of preference placed on the 45-degree direction. The larger the ratio, the less probable it is that the optimal path will deviate from the 45-degree line. This can be explained with the example given in Fig. 4.6, where Path A represents a path that goes along the diagonal direction while Path B represents a path with state transitions that are not along the diagonal direction. The log-likelihood functions of Paths A and B can be written as

\log P_A = \log s_{ij} + \log P_{T0} + \log s_{i+1,j+1} + \log P_{T0} + \log s_{i+2,j+2} + \log P_{T0},   (4.9)
\log P_B = \log s_{ij} + \log P_{T1} + \log s_{i+1,j'+1} + \log P_{T0} + \log s_{i+2,j+2} + \log P_{T1},   (4.10)

respectively. The log-likelihood ratio between the scores of Path A and Path B is

\log\frac{P_A}{P_B} = 2\log\frac{P_{T0}}{P_{T1}} + \log\frac{s_{i+1,j+1}}{s_{i+1,j'+1}}.   (4.11)

If s_{i+1,j+1} = s_{i+1,j'+1} in (4.11), we will select Path A since it has a higher score as a result of P_{T0} > P_{T1}. More generally, to ensure that P_A is larger than P_B, we can demand

\frac{P_{T0}}{P_{T1}} \ge \left( \frac{S_{i+1,j'+1}}{S_{i+1,j+1}} \right)^{1/2},   (4.12)

even when S_{i+1,j+1} is smaller than S_{i+1,j'+1}. For example, if S_{i+1,j'+1} is 2 times larger than S_{i+1,j+1}, P_{T0} needs to be more than \sqrt{2} times larger than P_{T1} to keep the preference along the 45-degree line. Thus, the ratio of P_{T0} to P_{T1} can be regarded as a measure of the degree of preference placed on the diagonal direction: the larger the ratio, the less probable it is that the path deviates from the sub-diagonal line. In our experiments, we choose the ratio between P_{T0} and P_{T1} to be 1.5.

Figure 4.6: Different paths in the similarity matrix, where Path A goes along the diagonal direction while Path B deviates from the diagonal direction.

The use of the Viterbi algorithm and a careful selection of the state transition probabilities help resolve the weak-similarity problem of verses. Even though several measures may have lower similarity within a segment consisting of verses, the Viterbi algorithm will not immediately remove them from the optimal path but considers the cumulative similarity of a segment of consecutive measures. In particular, if a verse part is followed by a chorus part, the preference along the diagonal direction tends to group the weak verse part with the strong chorus part into one large segment.

4.4 Post-processing Techniques

The optimal path obtained by the Viterbi algorithm, as given in Eq. (4.6), has the following interesting properties.

1. It includes all high-similarity sub-diagonal lines in the lower-triangular part of the similarity matrix, which correspond to the chorus parts.

2. It may traverse lines of relatively weaker similarity, which correspond to the verse parts.

3. If there are no strong similar segments, the optimal path may stay either on the first lower sub-diagonal line, which is closest to the diagonal, due to the preference for the 45-degree orientation and the initial condition, or along the best path for the i-th measure corresponding to the current time.

After the detection of the optimal path, some post-processing steps need to be performed to extract the music structure. A simple one is to filter out the portions of a detected path with low similarity values so that only the lines with high similarity values are kept. Two more advanced post-processing techniques are described below.
4.4.1 Decomposition of Overlapping Similar Parts via Boundary Point Selection

The detected optimal path may correspond to two overlapping similar intervals. For example, consider a detected segment that runs from (i_1, j_1) to (i_M, j_M), as shown in Fig. 4.7. If i_1 \le j_M, the two corresponding similar parts, (i_1 \cdots i_M) and (j_1 \cdots j_M), overlap in the interval (i_1 \cdots j_M). Then, we have to trim their boundaries so that j_M < i_1. This can be done as follows. Suppose \lambda_1 and \lambda_2 are the numbers of measures to be adjusted at the two ends of the line segment, (i_1, j_1) and (i_M, j_M), respectively. After shifting the boundaries, they should satisfy

j_M - \lambda_2 = i_1 + \lambda_1 - 1.   (4.13)

Since both \lambda_1 and \lambda_2 are non-negative integers, their best combination can be searched exhaustively under the constraint (4.13) so as to minimize the accumulated similarity of the two trimmed ends:

\arg\min_{\lambda_1,\lambda_2} \; \sum_{q=0}^{\lambda_1} S(i_1+q, j_1+q) + \sum_{p=0}^{\lambda_2} S(i_M-p, j_M-p).   (4.14)

For example, as shown in Fig. 4.8, the longest detected segment for Joan Jett's I love rock n' roll runs from (38,4) to (81,47). Without any modification, this means that the song has two similar parts: one from the 38th measure to the 81st measure and the other from the 4th measure to the 47th measure. However, these two parts overlap from the 38th measure to the 47th measure. We can perform the end point adjustment based on the optimization of (4.14). Then, the one long segment is broken into two, and we can conclude that the two similar parts are actually from the 14th measure to the 47th measure and from the 48th measure to the 81st measure.

Figure 4.7: Illustration of two overlapping parts in a song due to the detected segment from (i_1, j_1) to (i_M, j_M).

Figure 4.8: An illustration of the end point adjustment post-processing for two overlapping similar parts in a song, where the similarity matrix is derived for Joan Jett's I love rock n' roll and the path detected by the Viterbi algorithm is indicated by the plus sign.
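The exhaustive search of Eqs. (4.13)-(4.14) is small enough to sketch directly. The snippet below is illustrative: it assumes the measure indices address the similarity matrix S directly, and it allows a boundary shift of zero, as in the example above.

```python
import numpy as np

def split_overlapping_segment(S, i1, j1, iM, jM):
    """Trim a detected segment (i1,j1)-(iM,jM) so its two occurrences do not overlap.

    Search over (lam1, lam2) satisfying jM - lam2 = i1 + lam1 - 1 (Eq. (4.13))
    for the pair that minimizes the accumulated similarity of the trimmed ends
    (Eq. (4.14)). Returns the two non-overlapping similar parts.
    """
    overlap = jM - i1 + 1                 # number of overlapping measures
    if overlap <= 0:
        return (j1, jM), (i1, iM)         # already disjoint
    best, best_lam = np.inf, (0, overlap)
    for lam1 in range(overlap + 1):
        lam2 = overlap - lam1             # enforce the constraint of Eq. (4.13)
        cost = sum(S[i1 + q, j1 + q] for q in range(lam1 + 1)) \
             + sum(S[iM - p, jM - p] for p in range(lam2 + 1))
        if cost < best:
            best, best_lam = cost, (lam1, lam2)
    lam1, lam2 = best_lam
    return (j1 + lam1, jM - lam2), (i1 + lam1, iM - lam2)
```

Applied to the example above with i1, j1, iM, jM = 38, 4, 81, 47, the returned parts would be (14, 47) and (48, 81).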
4.4.2 Similarity Variation along Temporal Segments

There is an interesting observation about similarity matrices: the similarity values are clustered in rectangular blocks, which helps segment the diagonal in a straightforward manner. As shown in Figs. 4.2, 4.3 and 4.8, the "rectangular blocks" are visually apparent since the similarity values within them are relatively uniform. Sometimes they show a particular pattern within a block while differing between adjacent blocks. Thus, the diagonal, whose index represents the measure index of the song, can be segmented by the change of the average similarity value. Usually, the resulting segments are too fine to be used directly in high-level music structure analysis. However, the boundaries of these changes often correspond to change points of chords or keys, which imply potential changes in the music structure, such as the transition from a verse to a chorus or from a chorus to the outro. In this section, we propose a method to measure the degree of change along these fine intervals and use the results to detect the boundaries of verses and choruses. As shown in Fig. 4.9, we examine the diagonal and apply three rectangular blocks around the current measure i; for each rectangular block, we compute the average similarity value.

The block size is typically 4 measures by 4 measures, which assumes that the chords or keys in use will not change twice within 4 measures. In Fig. 4.9, we use W_1, W_2 and W_3 to denote the average similarity values in blocks B_1, B_2 and B_3. Note that B_1 and B_2 are blocks along the diagonal, while B_3 is the block along the first sub-diagonal. To detect the change of the similarity degree at the boundary of blocks B_1 and B_2, we examine the change of the average similarity value via

\Delta S_{1,2} = \max(|W_1 - W_3|, |W_2 - W_3|).   (4.15)

A larger \Delta S_{1,2} indicates a higher probability that there exists a structural difference between blocks B_1 and B_2. An example is shown in Fig. 4.10 for the similarity matrix of Fig. 4.3, where the x-axis represents the measure index and the y-axis represents the change of the average similarity value. Peaks in Fig. 4.10 are potential segmentation positions for verse and chorus parts.

Figure 4.9: Detection of structural difference based on the difference of the average similarity values in blocks B_1, B_2 and B_3 along the diagonal and the first sub-diagonal lines.

Figure 4.10: The variation of similarity values for U2's Vertigo, whose similarity matrix is shown in Fig. 4.3.

4.5 Algorithm for Automatic Music Structure Analysis

For a given similarity matrix, we find its optimal path via the Viterbi algorithm. Then, the optimal path can be broken into multiple short segments depending on connectivity. Each segment can be characterized by two parameters, namely its length and its average similarity value. Thus, we can assign each segment a score by

\mathrm{Score}_i = \frac{L_i}{L_{max}} \cdot S_i,   (4.16)

where i is the segment index, L_i is the length of the i-th segment, L_{max} is the maximum length over all segments in a similarity matrix and S_i is the average similarity value of the i-th segment. Intuitively, the longer a segment is and the higher average similarity value it has, the more probable it is that the segment contains repetitive parts.

Furthermore, it is observed that the average similarity values of chorus parts tend to be clustered in one class, denoted by C_C, while those of non-repetitive parts tend to be clustered in another class, denoted by C_{NC}. The similarity values of verse parts can fall in either C_C or C_{NC}. An algorithm to achieve automatic music structure analysis is described below.

1. Non-repetitive segments are removed by evaluating the score in Eq. (4.16). We sort all segments by their scores and filter out those with low scores. The remaining segments are then related to the repetitive parts of the song.

2. Since a verse always goes together with a chorus part and does not occur alone, the time-domain patterns of the remaining segments include VC (verse and then chorus), CV (chorus and then verse), C (chorus only) and partial C (partial chorus). The first two usually occur in the first half of a song while the last two occur in the second half. The average similarity values of adjacent temporal segments are examined to discriminate between chorus and verse parts. A segment whose average similarity value in a block of the similarity matrix is high is assigned to C (chorus or partial chorus). In modern popular and rock songs, the final chorus may include a certain part of the previous chorus, multiple repetitions of the previous chorus, or multiple repetitions of a certain part of the previous chorus.
Regardless of the format, all of them are treated as C.

3. After Step 2, the remaining segments may still contain combinations of verse and chorus parts, such as VC or CV. We examine the boundary points of the temporal segments included in the detected segment so as to find the boundary between C and V. The boundary point that achieves the largest difference of the average similarity values is assigned as the boundary between the verse part and the chorus part. The part with the larger average similarity value is assigned to C and the part with the smaller average similarity value is assigned to V.

4. All repetitive parts are assigned to either C or V in Steps 1-3. The remaining unassigned segments are non-repetitive parts. The one at the beginning of a song is assigned to the I (intro) part, the one at the end of a song is assigned to the O (outro) part, and the ones in the middle of a song with a relatively long duration are assigned to the B (bridge) part. The duration of a bridge in modern popular and rock songs is usually no more than 8 measures.

To give an example, the similarity matrix of Nirvana's Smells Like Teen Spirit and the variation of the average similarity degree along its detected segments are shown in Fig. 4.11 and Fig. 4.12, respectively.

After the first step, only the segments from detected lines 6 and 9 in Fig. 4.11 are kept, as shown in the upper part of Fig. 4.13. Line 6 corresponds to the segment from the 13th measure to the 48th measure and from the 49th measure to the 84th measure. Line 9 corresponds to the segment from the 49th measure to the 81st measure and from the 101st measure to the 133rd measure. The other segments are eliminated because of their small similarity values. In the second step, the similarity values along detected lines 6 and 9 include both high and low values; thus, they belong to the CV or VC pattern. The segmentation position between C and V is estimated by examining all positions that exhibit a dramatic change of the average similarity value: the 25th, 61st and 113th measures are the segmentation positions between the verse and chorus parts in each segment. Finally, the remaining segments are assigned to one of the three non-repetitive parts, intro, outro and bridge. The segment between the 1st and the 12th measure is assigned to the intro part I and the segment after the 134th measure is assigned to the outro part O. The segment between the second and third verse-chorus combinations is the bridge part, ranging from the 85th measure to the 100th measure. The complete music structure, along with its positions and durations, is shown in the lower part of Fig. 4.13.

Figure 4.11: The similarity matrix of Nirvana's Smells Like Teen Spirit, where the path detected by the Viterbi algorithm is indicated by the plus sign "+".

Figure 4.12: The time variation of the averaged similarity value of Nirvana's Smells Like Teen Spirit.

Figure 4.13: The detected temporal segments with high similarity values, and the analyzed structure of Nirvana's Smells Like Teen Spirit.

4.6 Experimental Results and Discussion

The method using the Viterbi algorithm to detect the repetitive parts is verified by testing on a collection of 40 popular and rock songs from the 1980s and 1990s. Examples of rock songs are Nirvana's Smells Like Teen Spirit and Oasis' Don't look back in anger. Examples of popular songs are The Beach Boys' Kokomo and Debbie Gibson's Lost in your eyes.
The musical signals are 16 bits per sample, down-sampled to 22,050 Hz and converted to a mono channel. Each song comes with knowledge of its musical tempo, either from previous work or from published music sheets. The Hamming window has the duration of an eighth note. Since all data in our collection have the (4/4) time signature and use the quarter note as the beat, the window has a length of 250 msec for a tempo of 120 beats per minute (BPM). Filtered PCP feature vectors are then calculated for each windowed musical signal without overlap.

The performance is first evaluated based on the correctness of classifying the music structure into verse and chorus parts. A song can be decomposed into a sequence of repetitive elements, verse (V) and chorus (C), and other non-repetitive elements, intro (I), bridge (B) and outro (O), which are the non-repetitive parts at the beginning, middle and end of the target song, respectively. The total correctness rate for the whole data set is 31/40 = 77.50%. Errors are mostly due to the complicated structure of the target song, where it is difficult to discriminate verses from choruses and to handle multiple verse patterns.

Then, the performance is evaluated based on the correct retrieval rate, which is of interest for the information retrieval application. The commonly used F-measure [16] is defined as the harmonic mean of recall R and precision P, i.e.,

F = \frac{2RP}{R+P},   (4.17)

where R is the ratio of the number of measures correctly detected by our method to the number of correct measures in the test song, and P is the ratio of the number of measures correctly detected by our method to the total number of detected measures. Among the songs that were correctly decomposed, the R, P and F values for the chorus and verse parts are shown in Table 4.1 and Table 4.2, respectively. It is clear that the performance improves due to the use of the filtered PCP feature vectors. It is also clear that the performance of chorus detection is higher than that of verse detection, since the chorus parts have higher similarity values than the verse parts.

Table 4.1: The performance of chorus detection in terms of recall (R), precision (P) and F-measure (F).

               R       P       F
Original PCP   81.4%   79.9%   80.6%
Filtered PCP   89.3%   86.4%   87.8%

Table 4.2: The performance of verse detection in terms of recall (R), precision (P) and F-measure (F).

               R       P       F
Original PCP   61.7%   58.0%   59.8%
Filtered PCP   71.2%   66.5%   68.8%

In some songs, for example Oasis' Don't look back in anger, the two verse parts have very low similarity, since the vocal sounds produced by different lyrics are very different even though they share the same melody. It can be even worse for rock songs whose lyrics are shouted by the singers, since the vocal signals deviate substantially from harmonic musical signals.
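For reference, the measure-level recall, precision and F-measure of Eq. (4.17) can be computed as follows; in this illustrative sketch the detected and ground-truth labels are assumed to be given as sets of measure indices.

```python
def f_measure(detected, reference):
    """Recall, precision and F-measure (Eq. (4.17)) at the measure level.

    `detected` and `reference` are sets of measure indices labelled, for
    example, as chorus by the algorithm and by the ground truth, respectively.
    """
    hits = len(detected & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(detected) if detected else 0.0
    denom = recall + precision
    f = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f
```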
4.7 Conclusion

A framework for automatic music structure analysis from audio signals was proposed in this chapter. The similarity matrix that records the similarity degree between measures was introduced. Then, the Viterbi algorithm was used to detect long similarity segments along the sub-diagonals of the matrix. Several post-processing techniques were proposed to fine-tune the decomposition procedure so that a song can be decomposed into repetitive parts (i.e., verses and choruses) and non-repetitive parts (i.e., intro, bridge and outro). The performance of the proposed scheme was demonstrated.

Chapter 5
Musical Beat Tracking with Kalman Filters

Two musical beat tracking algorithms are proposed in this and the next chapter of this thesis. The first algorithm uses Kalman filtering with probabilistic data association (PDA) for measurement selection. The Kalman filter has been extensively developed over the last four decades for target tracking; here, we apply it to on-line musical beat tracking. The beat tracking mechanism is built upon a linear dynamic model of beat progression. Then, the PDA method is used to improve the beat selection accuracy among noisy measurements. The second beat tracking algorithm is developed based on the hidden Markov model (HMM). A left-to-right state space is designed specifically to model beat progression. When the tempo is perceptually fixed, the variation among beat locations can be well modeled by the HMM.

The two proposed beat tracking algorithms share the same framework, as shown in Fig. 5.1. The input is the digital music signal. The musical onset signal and its period are first estimated. Given these estimates, the two beat tracking algorithms estimate the beat locations accordingly. Musical beat tracking with Kalman filters will be presented in this chapter, while the HMM-based beat tracking will be discussed in Chapter 6.

Figure 5.1: The framework of the proposed musical beat tracking system (musical audio signals → onset detection → period estimation → Kalman-filter-based beat tracking algorithm → beats' positions).

5.1 Introduction

When listening to music, most people can catch the speed of the music and follow it by foot-tapping, head-shaking or hand-clapping along with the beats. However, it is challenging for a computer to comprehend the tempo and melody of digital music well. Accurate beat tracking plays an important role in music transcription and information retrieval. In this research, we are concerned with real-time musical beat tracking from acoustic data. The tempo value and the beat locations of a music piece are estimated on-the-fly as it is played. The tracking system predicts the next beat location and updates the tempo value based on all received data using a state-space model. In some sense, it mimics the human behavior of tapping along as the music progresses and, as a result, the technique can be used in applications such as automatic musical accompaniment.

Automatic musical beat tracking by computers can be done either on-line or off-line. On-line beat tracking algorithms [7, 15, 23, 36] attempt to detect the beat locations from audio waveforms on the fly. In contrast, off-line beat tracking algorithms [10, 21, 20, 22, 27, 35] determine all beat positions using both the causal and the non-causal information from a given music piece; this is also known as batch processing.

Although musical beat tracking techniques have been extensively studied, only a few of them apply to real-time (or causal) audio processing, e.g., [23], [35], [36]. Scheirer [35] used a comb filter to estimate the tempo and the beat location with an open-loop approach; that is, new estimates do not take past prediction residuals into account. The beat tracking methods in [23] and [36] adopt the particle filtering technique. The particle filter is more general than the Kalman filter since it makes no assumption on the linearity of the tracking system or the Gaussianity of the underlying signals. However, its complexity is significantly higher and it does not address the problem of incorrect measurements, as discussed later in this chapter.
The goal of musical beat tracking is to find the time instances of all beats in a music piece. When beat pulses are strong and the duration between adjacent beats is perceptually clear, automatic beat tracking can be done easily. However, there are several challenges that may deteriorate the beat tracking performance. The first challenge comes from rest notes and missed-beat syncopation, which lead to beats without a significant onset pulse. Rest notes from music instruments hide beat tracking cues. Missed-beat syncopation has similar characteristics in that it does not place an onset pulse at the expected beat position but at a small shift from it. In both cases, the lack of clear onsets makes beat tracking difficult. Second, there exists variability in human performance. Even if a performer attempts to keep the duration between two adjacent beats constant throughout the whole music piece, the actual duration tends to vary over time. Besides, the performer may want to change the beat period intentionally from time to time to achieve certain effects such as improvisation or music jamming. Third, some music pieces have a time-varying tempo and, consequently, a time-varying beat period. Two types of tempo change frequently occur, namely the abrupt tempo change (which jumps between two distinct tempos) and the gradual tempo change (which cruises within a range of tempos).

The rest of this chapter is organized as follows. Music data pre-processing for beat tracking is examined in Sec. 5.2. Then, the Kalman-filter-based beat tracking algorithm is described in Sec. 5.3. To tackle noisy measurements, the local maximum (LM) method, the probabilistic data association (PDA) method and the enhanced PDA (EPDA) method are discussed in Secs. 5.4, 5.5 and 5.6, respectively. Experimental results are shown in Sec. 5.7 to demonstrate the performance of the proposed beat tracking algorithm.

5.2 Musical Data Pre-processing

5.2.1 Onset Detection

The musical onset signal captures the intensity change of the musical content along time. It can reflect two types of music content changes: 1) instantaneous noise-like pulses caused by percussion instruments; and 2) changes of music pitches/harmonies due to new note arrivals. In this work, the cepstral distance method [33] is used to calculate musical onsets. The process is detailed below.

We first represent the music content using mel-scale frequency cepstral coefficients (MFCC), c_m(n), for each shifting window of 20 msec with 50% overlap (10 msec), where m = 0, 1, ..., L is the order of the cepstral coefficient and n is the time index. Note that the mel scale accounts for human perception of spectral bands and divides the spectrum non-uniformly, emphasizing the frequency regions to which human listeners are most sensitive. An MFCC with small (or large) m represents low-frequency (or high-frequency) changes on the mel-scale spectrum; for example, if a mel-scale spectrum has large oscillations in the mel-scale frequency domain, the high-order MFCCs will have larger amplitudes. We would like to approximate musical signals with only a few MFCCs. Low-order MFCCs are chosen since they are highly correlated with the mel-scale energy envelope. Besides the 0-th order MFCC c_0(n), which is exactly the mel-scale energy, three low-order coefficients c_1(n), c_2(n) and c_3(n) are also selected.

Then, the selected MFCCs are averaged over p consecutive frames (i.e., c_m(n), c_m(n-1), \cdots, c_m(n-p+1)) to obtain the smoothed coefficient \bar{c}_m(n) at time index n. In our implementation, we choose p = 3 so that a fast change in the music content can still be captured. Finally, we compute the change of spectral content by examining the MFCC difference between two adjacent smoothed cepstral coefficients, \bar{c}_m(n-1) and \bar{c}_m(n). The mel-scale cepstral distance

d(n) = \sum_{m=1}^{L} \big(\bar{c}_m(n) - \bar{c}_m(n-1)\big)^2   (5.1)

is chosen as the musical onset detection function at time n. If d(n) is above a threshold at time n = n_0, we say that an onset is detected at n_0. It is worthwhile to point out that energy changes caused by percussion instruments are usually reflected well by the change in \bar{c}_0(n), while harmonic changes due to new note arrivals are reflected by the change in \bar{c}_1(n) and the other low-order smoothed MFCCs.
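The onset detection function of Eq. (5.1) is easy to sketch once the MFCCs are available. The snippet below is an illustration, not the exact implementation used here: it assumes an (n_frames, L+1) MFCC array from any front end computed on 20 ms windows with 10 ms hops.

```python
import numpy as np

def onset_detection_function(mfcc, p=3, L=3):
    """Cepstral-distance onset detection function d(n) of Eq. (5.1).

    `mfcc` holds c_0(n)..c_L(n) per frame. Each coefficient is smoothed over
    the p most recent frames, and d(n) is the squared difference of the
    smoothed coefficients of orders 1..L between adjacent frames.
    """
    n_frames = mfcc.shape[0]
    smoothed = np.zeros((n_frames, L + 1))
    for n in range(n_frames):
        lo = max(0, n - p + 1)
        smoothed[n] = mfcc[lo:n + 1, :L + 1].mean(axis=0)   # average of p frames
    d = np.zeros(n_frames)
    d[1:] = np.sum((smoothed[1:, 1:] - smoothed[:-1, 1:]) ** 2, axis=1)
    return d
```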
5.2.2 Period Estimation

The tempo is assumed to be perceptually fixed in our beat tracking system. Before we conduct the actual beat tracking task, the music tempo, or its inverse (i.e., the music period), has to be roughly estimated. One way to estimate the period is to use the autocorrelation function (ACF), which can be calculated as

A(N) = \sum_{n=-\infty}^{\infty} d(n)\, d(n-N),   (5.2)

where d(n) is the musical onset signal and N is a delay parameter. If the musical onset signal were a periodic function with period N_0, it could be ideally approximated by a periodic impulse train of the form

d(n) = \sum_{p=-P}^{P} \delta(n - p \cdot N_0),   (5.3)

where P is a positive integer. The signal d(n) in Eq. (5.3) results in an ACF A(N) with clear peaks at kN_0, k = 0, 1, 2, \cdots.

However, the ACF of real-world musical onset signals does not exhibit such ideal behavior. It is usually difficult to find the exact peak for the period; all we can tell is that the period corresponds to one of many detected peaks. Furthermore, there often exists confusion between the real period and its double/half period (or triple/one-third period in the triplet case). Some researchers tackle this problem with the help of other metrical units such as meters [21, 20, 27]. Here, since our focus is beat tracking, we do not address the problem explicitly but choose a period manually among the peaks of the autocorrelation function as the input parameter. In practice, we may select multiple period candidates and run multiple beat tracking algorithms in parallel for verification purposes. After a while, we can eliminate unlikely period candidates and turn off their corresponding beat tracking algorithms.

5.3 Beat Tracking with Kalman Filters

The Kalman filter has been widely used in target tracking applications. It models tracking inaccuracy with additive white Gaussian noise and tracks the target via sequential computation. In this section, we show how to apply the Kalman filter to musical beat tracking.

5.3.1 Linear Dynamic Model of Beat Progression

To apply the Kalman filter, the first step is to set up a linear dynamic system of equations of the following form:

x(k+1) = \Phi(k+1|k)\, x(k) + \mu(k),   (5.4)
y(k) = M(k)\, x(k) + \upsilon(k),   (5.5)

where k is a discrete time index, x(k) is the state vector, y(k) is the measurement (or observation), \mu(k) is the system noise, and \upsilon(k) is the measurement noise. In the context of music beat progression, we follow the framework of [7, 23, 36] and choose the state vector and the measurement as

x(k) = [\tau(k), \Delta(k)]^T,   (5.6)
y(k) = \tau(k),   (5.7)

where \tau(k) and \Delta(k) are the beat location and the instantaneous period, respectively. The instantaneous period, \Delta(k), is defined as the time difference between the current and the next beats:

\Delta(k) = \tau(k+1) - \tau(k).   (5.8)
Ideally, if there is no tempo change, period \Delta(k+1) should be the same as period \Delta(k); namely,

\Delta(k+1) = \Delta(k).   (5.9)

Based on the above discussion, the state transition matrix \Phi(k+1|k) can be written as

\Phi(k+1|k) = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},   (5.10)

and the observation matrix M(k) is of the form

M(k) = \begin{bmatrix} 1 & 0 \end{bmatrix}.   (5.11)

Thus, the linear dynamic system can be summarized as

x(k+1) = \Phi(k+1|k)\, x(k) + \mu(k) = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} x(k) + \mu(k),   (5.12)

and

y(k) = M(k)\, x(k) + \upsilon(k) = \begin{bmatrix} 1 & 0 \end{bmatrix} x(k) + \upsilon(k),   (5.13)

where x(k) = [\tau(k), \Delta(k)]^T is the state vector. For the Kalman filter to apply, we have to make some assumptions on the statistics of the system noise as well as the measurement noise. It is assumed that they are all zero-mean white Gaussian random processes. The covariance matrix of \mu(k) is

Q(k) = \begin{bmatrix} \sigma_{\mu 1}^2 & 0 \\ 0 & \sigma_{\mu 2}^2 \end{bmatrix},   (5.14)

and the variance of \upsilon(k) is \sigma_{\upsilon}^2.

5.3.2 Kalman Filter Algorithm

With the linear dynamic system given in (5.12) and (5.13), the Kalman filtering process [26] can be applied easily. The steps are summarized below. Note that we use \hat{f}(i|j) to denote the estimate of variable f at time i given all measurements up to time j. With proper initialization, we perform the following iteration in the time domain with k = 1, 2, \cdots.

1. Suppose that the current time is k. We store the state vector \hat{x}(k|k) and its covariance matrix P(k|k), where

\hat{x}(k|k_0) \triangleq E[x(k) | Y^{k_0}],   (5.15)
P(k|k_0) \triangleq \mathrm{Cov}[x(k) | Y^{k_0}],   (5.16)

and Y^{k_0} \triangleq \{y(j), j = 1, \ldots, k_0\} is the set of measurements up to time k_0.

2. Predict the state vector at time k+1,

\hat{x}(k+1|k) = \Phi(k+1|k)\, \hat{x}(k|k),   (5.17)

and compute the predicted error covariance matrix as

P(k+1|k) = \Phi(k+1|k)\, P(k|k)\, \Phi^T(k+1|k) + Q(k).   (5.18)

3. Compute the filter gain (also called the Kalman gain) by

K(k+1) = P(k+1|k)\, M^T(k+1) \cdot [M(k+1)\, P(k+1|k)\, M^T(k+1) + R(k+1)]^{-1}.   (5.19)

4. After receiving measurement y(k+1), use the prediction residual y(k+1) - E(y(k+1)|Y^k) to re-estimate \hat{x} via

\hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1) \cdot [y(k+1) - E(y(k+1)|Y^k)]
              = \hat{x}(k+1|k) + K(k+1) \cdot [y(k+1) - M(k+1)\, \hat{x}(k+1|k)].   (5.20)

5. After receiving measurement y(k+1), compute the new error covariance matrix P(k+1|k+1) by

P(k+1|k+1) = [I - K(k+1)\, M(k+1)] \cdot P(k+1|k).   (5.21)

6. Set the current time to k+1 and go back to Step 1.

It is assumed that the main error source in the measurement equation (5.5) or (5.13) is the measurement error, which is modeled by the noise term \upsilon(k) in the conventional Kalman filtering framework. However, there can also be uncertainty associated with the measurements [4], which results from confusion between the target signal of interest and other non-target signals that bear properties similar to those of the target signals. In musical beat tracking, the target signal is the beat location. Although most beats have large musical onsets, there are notes and percussion sounds that are not beats but still produce large musical onsets. Since non-beat notes and/or percussion sounds may possess the onset property and the beat location measurement y(k+1) is selected based on the musical onset signal d(n), the true beat location can be masked. We will discuss several ways to address this problem.
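The recursion of Eqs. (5.17)-(5.21) for the beat-progression model can be sketched as follows, assuming a measurement has already been selected for every step; the noise variances and initial covariance are illustrative placeholders rather than values from this work.

```python
import numpy as np

def kalman_beat_tracker(measurements, tau0, delta0,
                        q_var=(1e-3, 1e-4), r_var=1e-2):
    """Kalman filter for beat progression with state x = [tau, Delta]^T.

    `measurements` is a sequence of observed beat locations y(k) in seconds;
    `tau0`/`delta0` initialize the first beat location and period.
    """
    Phi = np.array([[1.0, 1.0], [0.0, 1.0]])      # state transition, Eq. (5.10)
    M = np.array([[1.0, 0.0]])                    # observation matrix, Eq. (5.11)
    Q = np.diag(q_var)                            # system noise covariance, Eq. (5.14)
    R = np.array([[r_var]])                       # measurement noise variance
    x = np.array([tau0, delta0])                  # x(k|k)
    P = np.eye(2)                                 # P(k|k)
    estimates = []
    for y in measurements:
        # prediction step, Eqs. (5.17)-(5.18)
        x_pred = Phi @ x
        P_pred = Phi @ P @ Phi.T + Q
        # gain and update steps, Eqs. (5.19)-(5.21)
        K = P_pred @ M.T @ np.linalg.inv(M @ P_pred @ M.T + R)
        resid = y - float(M @ x_pred)             # prediction residual
        x = x_pred + K.ravel() * resid
        P = (np.eye(2) - K @ M) @ P_pred
        estimates.append(x.copy())                # [beat location, period]
    return np.array(estimates)
```

In the full system, the measurement fed to each update is produced by one of the selection schemes discussed next.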
5.4 LM (Local Maximum) Measurement Selection Method

Measurement selection in the conventional Kalman filter is performed with the local maximum (LM) method, as shown in Fig. 5.2. Simply speaking, LM selects the time instance that has the maximum musical onset within a fixed window around the predicted beat location \hat{\tau}(k+1|k) as measurement y(k+1). Mathematically, this can be written as

    y(k+1) = \tau(k+1) = \arg\max_{|m - \hat{\tau}(k+1|k)| < w/2} d(m),    (5.22)

where d(m) is the onset signal, w is the window width, and

    \hat{\tau}(k+1|k) = \hat{\tau}(k|k) + \hat{\Delta}(k|k)    (5.23)

is the predicted beat location.

Figure 5.2: Measurement selection in the conventional Kalman filter with the local maximum (LM) method.

To give an example, we show the spectrogram and the onset signal of a segment from Billy Joel's We Didn't Start the Fire in Fig. 5.3. We see that musical onsets of non-beat notes and/or percussion sounds may deteriorate the performance of the Kalman filter with the LM measurement selection method. The top sub-figure of Fig. 5.3 is the spectrogram from 30 sec to 35 sec (relabeled as 0 to 5 seconds in the figure) while the bottom sub-figure plots the musical onset signal. The x-axis in both figures represents time in seconds. This song has strong beats with a regular period from percussion sounds. It is clear that the onset behaves like a pulse train with a fixed interval between consecutive pulses in the first 3 seconds. However, beat notes do not have stronger musical onsets than non-beat notes between 3.0 sec and 4.5 sec.

Figure 5.3: Billy Joel's We Didn't Start the Fire: (top) the spectrogram of a music segment from 30 to 35 seconds, where the y-axis represents the frequency from 0 to 8 kHz; (bottom) the musical onset signal as a function of time (in seconds) for the same music segment.

LM fails when the beat note does not have the strongest musical onset in the neighborhood of the predicted beat location \hat{\tau}(k+1|k). For example, around 4 sec there are three strong pulses, denoted by letters A, B and C. A and C are true beats while B is a note at the half-beat metrical level. However, B has a musical onset larger than that of A. When the Kalman filter with LM is applied, the predicted next beat location \hat{\tau}(k+1|k) is A. However, B is selected as the new measurement y(k+1) since it has the largest musical onset within the pre-fixed window around A. The newly estimated \hat{\tau}(k+1|k+1) then lies between A and B (instead of at A, which would be the case had the measurement been A). Thus, at time k+1 it is possible to select D as measurement y(k+2), since the newly estimated beat location \hat{\tau}(k+1|k+1) has moved closer to B. Once D is chosen as measurement y(k+2) (instead of C), the Kalman filter tracks wrong pulses as beats from this point on.
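A minimal sketch of the LM selection rule of Eq. (5.22) is shown below, assuming the onset signal is sampled at 100 Hz; the choice of window width as a fraction of the predicted period is an illustrative assumption.

```python
import numpy as np

def lm_select(d, tau_pred, period_pred, fs=100, win_frac=0.25):
    """Local-maximum measurement selection of Eq. (5.22): pick the time of the
    largest onset inside a window of width w centered on the predicted beat.
    win_frac (window width as a fraction of the period) is an illustrative choice."""
    w = win_frac * period_pred                     # window width in seconds
    lo = max(0, int((tau_pred - w / 2) * fs))
    hi = min(len(d), int((tau_pred + w / 2) * fs) + 1)
    m = lo + int(np.argmax(d[lo:hi]))              # sample index of the local maximum
    return m / fs                                  # measurement y(k+1) in seconds
```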
Another error source of Kalman-filter-based musical beat tracking is the lack of musical onset pulses in certain music segments. In the transition from one section to another, or at the end of the whole music piece, rest notes are frequently used. In some extreme cases there are not even percussion sounds, so no significant musical onset change can be detected around the beats. The lack of musical onset pulses may thus cause the Kalman filter to lose track. In particular, when the only pulses present are due to non-beat notes or percussion sounds, LM always fails since it erroneously uses the largest musical onset as the new measurement.

As a matter of fact, beats are more perceptual than directly measurable in the musical onset feature domain. When there is no audible melody or rhythm, people can continue counting beats as if the music were still there. Once the melody and the rhythm appear again, people pick up the music and align their tapping with the new stream. Human listeners keep a memory of recent music signals and thus possess the capability of staying on the right track. As demonstrated in the next section, the philosophy of PDA is similar to that of human beat tracking: keep flexibility in choosing the correct measurements and thereby enhance the robustness of maintaining the right track.

5.5 PDA (Probabilistic Data Association) Measurement Selection Method

To overcome the weakness of the LM method, we adopt another mechanism called the probabilistic data association (PDA) method in this section. Probabilistic data association is a probabilistic method used with the Kalman filter to associate measurements with the target of interest in a clutter environment. The term clutter refers to detections arising from nearby objects, false alarms and any other non-targets of interest [4]. PDA helps the Kalman filter maintain the track by avoiding the choice of a single measurement from several candidates, as the LM method does. Instead, it considers all candidate measurements and their associations with the current track. Under a Bayesian framework, the probabilities of all associations are computed.

5.5.1 Measurement Validation

Before presenting the PDA method, we first introduce a concept called measurement validation. A measurement is "validated" if it can be a correct measurement with a reasonable probability. In other words, the measurement validation process aims to remove measurements that are very unlikely to be correct. A validation region is set up probabilistically: it is a multi-dimensional probabilistic threshold applied to the continuous-valued measurements. PDA considers only measurements within the validation region, so the computational load is significantly reduced by removing non-validated measurements.

The validation region is a region in the measurement space where measurement y will be found with non-trivial probability. To derive it mathematically, we start from the predicted measurement

    \hat{y}(k+1|k) = M(k+1) \hat{x}(k+1|k),    (5.24)

whose prediction residual can be written as

    \tilde{y}(k+1|k) \triangleq y(k+1) - \hat{y}(k+1|k)
                     = M(k+1) ( x(k+1) - \hat{x}(k+1|k) ) + \upsilon(k+1)
                     = M(k+1) \tilde{x}(k+1|k) + \upsilon(k+1).    (5.25)

Thus, the covariance matrix of the predicted measurement \hat{y}(k+1|k) becomes

    S(k+1) \triangleq E[ \tilde{y}(k+1|k) \tilde{y}'(k+1|k) | Y^k ]
           = M(k+1) P(k+1|k) M'(k+1) + R(k+1),    (5.26)

where R(k+1) = \sigma_{\upsilon}^2 is the measurement noise variance,

    P(k+1|k) = E[ \tilde{x}(k+1|k) \tilde{x}'(k+1|k) | Y^k ] = \Phi(k+1|k) P(k|k) \Phi'(k+1|k) + Q(k),    (5.27)

and Y^k consists of all sets of validated measurements Y(j) from time index j = 1 to k, denoted as

    Y^k = \{ Y(j) \}_{j=1}^{k},    (5.28)

where Y(j) comprises the validated measurements at time index j, with the number of validated measurements equal to m_j,

    Y(j) \triangleq \{ y_i(j) \}_{i=1}^{m_j}.    (5.29)

If the real measurement y(k+1) at time k+1 conditioned on Y^k is normally distributed with mean equal to the predicted measurement \hat{y}(k+1|k) and covariance S(k+1), the probability distribution of y(k+1) can be written as

    Pr[ y(k+1) | Y^k ] = N\{ \hat{y}(k+1|k), S(k+1) \}.    (5.30)

Then, a region can be defined in the measurement space via [4]

    \tilde{V}(k+1)(\gamma) \triangleq \{ y : [y - \hat{y}(k+1|k)]' S^{-1}(k+1) [y - \hat{y}(k+1|k)] \le \gamma \},    (5.31)

where S(k+1) is the covariance matrix of the predicted measurement \hat{y}(k+1|k). Note that \tilde{V}(k+1)(\gamma) is a region that contains measurements of non-trivial probability. Thus, measurements inside \tilde{V}(k+1)(\gamma) are considered "valid". The probability of measurements outside is very small and, thus, they are discarded.
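The gating step implied by Eq. (5.31) is simple for the scalar beat-location measurement, where the region reduces to an interval. The sketch below assumes the candidate onset peaks near the predicted beat have already been found by some peak-picking routine; the numerical values in the example are illustrative.

```python
import numpy as np

def validate_measurements(candidates, y_pred, S, gamma=9.0):
    """Keep only candidate beat-location measurements inside the validation
    region of Eq. (5.31). For a scalar measurement the normalized squared
    residual (y - y_pred)^2 / S is compared against the gate threshold gamma."""
    candidates = np.asarray(candidates, dtype=float)
    nis = (candidates - y_pred) ** 2 / S          # normalized innovation squared
    return candidates[nis <= gamma]

# illustrative use: onset peaks near the predicted beat at 10.48 s,
# with S = 0.002 s^2 and the 99.7% gate (gamma = 9)
peaks = [10.31, 10.46, 10.52, 10.71]
valid = validate_measurements(peaks, y_pred=10.48, S=0.002, gamma=9.0)
```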
It is shown in [4] that the weighted norm of the prediction residual,

    [y - \hat{y}(k+1|k)]' S^{-1}(k+1) [y - \hat{y}(k+1|k)],

is Chi-square distributed with the number of degrees of freedom equal to the dimension of the measurement. In musical beat tracking, the measurement dimension is 1. Then, by choosing \gamma = 9 in (5.31), the probability for the region to include the true measurement is 99.7%. For comparison, the choice \gamma = 4 results in a probability of 95.4% [4].

Eqs. (5.11), (5.26) and (5.31) can be used to derive the validation region for the proposed Kalman-filter-based musical beat tracking algorithm. It is equal to

    \tilde{V}(k+1)(\gamma) \triangleq \{ y : \frac{ [y - \hat{y}(k+1|k)]^2 }{ p_{11} + \sigma_{\upsilon}^2 } \le \gamma \},    (5.32)

where p_{11} is the (1,1) element of P(k+1|k), i.e., the variance of the predicted beat location. Given \gamma, p_{11} and \sigma_{\upsilon}^2, a validation region can be calculated for each predicted measurement \hat{y}(k+1|k) as

    \hat{y}(k+1|k) - \sqrt{ \gamma (p_{11} + \sigma_{\upsilon}^2) } \le y \le \hat{y}(k+1|k) + \sqrt{ \gamma (p_{11} + \sigma_{\upsilon}^2) }.    (5.33)

5.5.2 Description of the PDA Method

The PDA method uses a weighted average of the estimates from candidate measurements within the validation region to replace \hat{x}(k+1|k+1). The weight is chosen to be the probability that measurement y_i(k+1) is the correct measurement. Mathematically, PDA decomposes the estimate into a linear combination of the estimates under all measurements within the validation region:

    \hat{x}(k+1|k+1) = E[ x(k+1) | Y^{k+1} ]
      = \sum_{i=0}^{m_{k+1}} E[ x(k+1) | \theta_i(k+1), Y^{k+1} ] Pr\{ \theta_i(k+1) | Y^{k+1} \}
      = \sum_{i=0}^{m_{k+1}} \hat{x}_i(k+1|k+1) \beta_i(k+1),    (5.34)

where

    \beta_i(k+1) \triangleq Pr\{ \theta_i(k+1) | Y^{k+1} \}, \quad i = 0, 1, ..., m_{k+1}.    (5.35)

In the above equation, \theta_0(k+1) represents the event that none of the measurements originates from the target of interest and \beta_0(k+1) is the probability of that event. Similarly, \theta_i(k+1), i = 1, 2, ..., m_{k+1}, represents the event that y_i(k+1) is the correct measurement originating from the target of interest and \beta_i(k+1) is the corresponding probability. Since all events are mutually exclusive and exhaustive, we have

    \sum_{i=0}^{m_{k+1}} \beta_i(k+1) = 1.    (5.36)

To make the Kalman filter algorithm compatible with PDA, some steps in the Kalman filter algorithm need modification. We begin with the update of the state vector \hat{x}(k+1|k+1) from \hat{x}(k+1|k) in Eq. (5.20), rewritten below:

    \hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1) [ y(k+1) - M(k+1) \hat{x}(k+1|k) ].    (5.37)

If candidate measurement y_i(k+1) is used, the estimate \hat{x}_i(k+1|k+1) can be represented in a way similar to Eq. (5.37) as

    \hat{x}_i(k+1|k+1) = \hat{x}(k+1|k) + K(k+1) [ y_i(k+1) - M(k+1) \hat{x}(k+1|k) ].    (5.38)

Based on Eqs. (5.34) and (5.38), \hat{x}(k+1|k+1) can be written as

    \hat{x}(k+1|k+1) = \sum_{i=0}^{m_{k+1}} \hat{x}_i(k+1|k+1) \beta_i(k+1)
      = \sum_{i=0}^{m_{k+1}} \{ \hat{x}(k+1|k) + K(k+1) [ y_i(k+1) - M(k+1) \hat{x}(k+1|k) ] \} \beta_i(k+1)
      = \hat{x}(k+1|k) \sum_{i=0}^{m_{k+1}} \beta_i(k+1) + K(k+1) \sum_{i=0}^{m_{k+1}} [ y_i(k+1) - M(k+1) \hat{x}(k+1|k) ] \beta_i(k+1)
      = \hat{x}(k+1|k) + K(k+1) \phi(k+1),    (5.39)

where

    \phi(k+1) \triangleq \sum_{i=1}^{m_{k+1}} [ y_i(k+1) - M(k+1) \hat{x}(k+1|k) ] \beta_i(k+1).    (5.40)

By comparing Eqs. (5.20) and (5.39), we see that the original update is kept intact except for the replacement of the prediction residual y(k+1) - M(k+1)\hat{x}(k+1|k) by \phi(k+1). Since the quantity y(k+1) - M(k+1)\hat{x}(k+1|k) is called the innovation, \phi(k+1) can be viewed as the equivalent innovation of the PDA method, which is the weighted average of the innovations from all validated measurements i = 1, ..., m_{k+1}.
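The combined innovation of Eq. (5.40) and the state update of Eq. (5.39) can be sketched as follows. The association probabilities beta are assumed to be supplied (their exact computation follows [4]); the helper name and the numbers in the usage example are illustrative only.

```python
import numpy as np

def pda_update(x_pred, K, M, y_valid, beta):
    """PDA state update of Eq. (5.39): the single innovation is replaced by the
    weighted average of the innovations of all validated measurements (Eq. (5.40)).
    beta = [beta_0, beta_1, ..., beta_m]; beta_0 is the 'no correct measurement' event."""
    y_valid = np.asarray(y_valid, dtype=float)
    innovations = y_valid - float(M @ x_pred)             # residual of each candidate
    phi = np.sum(np.asarray(beta[1:]) * innovations)      # equivalent innovation, Eq. (5.40)
    return x_pred + K.ravel() * phi                       # Eq. (5.39)

# illustrative use with two validated candidates and hand-picked weights
# (the weights must sum to one, Eq. (5.36))
x_pred = np.array([10.48, 0.50])           # predicted [tau, delta]
K = np.array([[0.6], [0.1]])
M = np.array([[1.0, 0.0]])
x_new = pda_update(x_pred, K, M, y_valid=[10.46, 10.52], beta=[0.1, 0.55, 0.35])
```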
Another modification required is the computation of P(k+1|k+1), the covariance of \hat{x}(k+1|k+1). We only state the result below and refer to [4] for more details:

    P(k+1|k+1) = \beta_0(k+1) P(k+1|k) + [1 - \beta_0(k+1)] P^0(k+1|k+1) + \tilde{P}(k+1),    (5.41)

where P^0(k+1|k+1) = [I - K(k+1) M(k+1)] P(k+1|k) is the original covariance of \hat{x}(k+1|k+1) given in Eq. (5.21), and

    \tilde{P}(k+1) \triangleq K(k+1) \Big[ \sum_{i=1}^{m_{k+1}} \beta_i(k+1) \nu_i(k+1) \nu_i'(k+1) - \phi(k+1) \phi'(k+1) \Big] K'(k+1),    (5.42)

where \nu_i(k+1) = y_i(k+1) - M(k+1)\hat{x}(k+1|k) is the innovation of the i-th validated measurement and \phi(k+1) is the equivalent innovation of Eq. (5.40).

5.6 Enhanced PDA (EPDA) Measurement Selection Method

PDA uses the weighted average of the innovations from all validated measurements y_i as the equivalent innovation, as shown in Eq. (5.40). The weight \beta_i(k+1), also known as the association probability, is related to the distance between candidate measurement y_i(k+1) and the prediction \hat{x}(k+1|k), as defined in Eq. (5.35): the smaller the distance, the larger \beta_i(k+1). Intuitively, if measurement y_i(k+1) is closer to the prediction, its contribution weighs more. However, in music beat tracking, humans use not only the closeness between the measurement and the predicted beat location but also the intensity of musical onsets as cues to pick the next beat location. Thus, we need to modify the definition of the association probability \beta_i(k+1). The resulting method is called the enhanced PDA (EPDA) method.

It is worthwhile to point out that modification of the association probability has been considered before in various contexts to improve tracking performance, e.g., in visual object tracking [25, 34] and radar applications [5]. The former uses both the prediction residual and image similarity as cues for object tracking, while the latter uses the prediction residual and the intensity of the reflected radar signal as cues for airplane tracking.

In [25], the intensity of the observed signal is introduced into the association probability calculation via

    \beta_i(k+1) \triangleq Pr\{ \theta_i(k+1) | I_Y(k+1), Y^{k+1} \}
                 \propto Pr\{ I_Y(k+1) | \theta_i(k+1), Y^{k+1} \} \, Pr\{ \theta_i(k+1) | Y^{k+1} \},    (5.43)

where I_Y(\cdot) is the distribution function of the measurement intensity (i.e., the musical onset intensity). Note that the term Pr\{\theta_i(k+1)|Y^{k+1}\} on the right-hand side of Eq. (5.43) is the \beta_i(k+1) of Eq. (5.35), which considers only the prediction residual. As shown in Eq. (5.43), the modified \beta_i(k+1) is the product of two terms: one contributed by the musical onset intensity and the other by the prediction residual.

The term contributed by the musical onset intensity in Eq. (5.43) can be further decomposed [25] as

    P\{ I_Y(k+1) | \theta_i(k+1), Y^{k+1} \} = \frac{ I_i(y_i) }{ I_0(y_i) } \prod_{j=1}^{m_{k+1}} I_0(y_j),    (5.44)

for i = 1, 2, ..., m_{k+1}, where I_i(y_i) is the probability distribution of the intensity of y_i when it is the correct measurement and I_0(y_i) is the probability distribution of the intensity of y_i when it is an incorrect measurement. We can rewrite (5.44) as

    P\{ I_Y(k+1) | \theta_i(k+1), Y^{k+1} \} = I_i(y_i) \prod_{j=1, j \ne i}^{m_{k+1}} I_0(y_j).    (5.45)

It is difficult to compute I_i(y_i), 1 \le i \le m_{k+1}, in (5.45) efficiently and accurately for two reasons. First, the number of candidate measurements, m_{k+1}, is determined dynamically by the validation region \tilde{V}(k+1)(\gamma) at each time step k. Second, for a particular i and k+1, there are not enough samples to estimate the probability distribution I_i(k+1) accurately. To address this issue, the probability distribution I_i(k+1) is replaced by a general probability distribution of the intensity of measurement y(\cdot) for all i and k when y(\cdot) is a correct measurement. That is, we have

    I_i(k+1) \simeq I_B, \quad i = 1, 2, ..., m_{k+1},

where I_B is the probability distribution of musical onset intensities whose corresponding measurements are truly beats. Thus, I_B can be found by collecting the intensities of musical onsets that correspond to beats. On the other hand, I_0(y_i) is the probability distribution of the intensity of candidate measurements y_i that do not correspond to a beat. Again, it is difficult to obtain an accurate distribution for each specific i and k+1. Instead, we replace it by a general probability distribution

    I_0(k+1) \simeq I_N,    (5.46)

where I_N is the probability distribution of musical onset intensities whose corresponding measurements are not beats. The details on building the probability distributions I_B and I_N from the musical audio signals and the ground-truth beat labels will be discussed in Chap. 6.
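One hedged reading of the EPDA re-weighting of Eqs. (5.43)-(5.46) is sketched below: the residual-based weights are multiplied by an onset-intensity likelihood ratio and renormalized. The Gamma-distributed intensity models echo the shape/scale values reported for the MIREX data in Chap. 6, used here purely as placeholders; the renormalization step and the function name are my own illustrative choices, not the exact procedure of [25].

```python
import numpy as np
from scipy.stats import gamma

# illustrative intensity models: I_B for beat onsets, I_N for non-beat onsets
# (shape/scale values are placeholders; in the text they are learned from labeled data)
I_B = gamma(a=0.949, scale=0.904)
I_N = gamma(a=0.581, scale=1.045)

def epda_weights(beta_residual, intensities):
    """EPDA association probabilities (Eq. (5.43)): the residual-based weights
    beta_1..beta_m are scaled by the intensity likelihood ratio I_B/I_N
    (Eqs. (5.44)-(5.46)) and renormalized so that the weights sum to one."""
    beta = np.asarray(beta_residual, dtype=float)        # [beta_0, beta_1, ..., beta_m]
    x = np.asarray(intensities, dtype=float)             # onset intensity of each candidate
    ratio = I_B.pdf(x) / np.maximum(I_N.pdf(x), 1e-12)
    w = np.concatenate(([beta[0]], beta[1:] * ratio))
    return w / w.sum()
```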
5.7 Experimental Results

5.7.1 Experimental Data and Setup

Two data sets were used in our experiments. The first data set was the MIREX 2006 beat tracking competition practice data [1]. It consisted of twenty 30-sec music clips of diverse genres, including classical, pop, rock, blues and foreign-language pop songs. The audio signal was in 16-bit waveform format with a sampling rate of 44.1 kHz. Beats in each clip were annotated and verified by more than 40 listeners. The tempo was not known in advance to the listeners, so they might use different periods to label beats. A single period was used for tracking in our experiment. Since our algorithm did not detect the period automatically, the period agreed upon by most listeners was adopted.

The second data set consisted of 20 Billboard Top-10 songs from the 80's. The genres include pop, rock, and some adult contemporary from singer-songwriters such as Billy Joel. The audio signal was in 16-bit waveform format with a sampling rate of 44.1 kHz. For each song, a 60-sec music clip was segmented from the original song. The candidate tempos were estimated from the autocorrelation function and manually selected. Afterwards, beats were labeled based on the selected period to serve as the ground truth. To be compatible with the MIREX 2006 data set, only the first 30 seconds of the Billboard Top-10 clips were used in the automatic beat tracking test.

The first 5 seconds were used to initialize the state vector, including the beat \tau(0) and the period \Delta(0). The remaining 25 seconds of each music clip were used to evaluate the beat tracking performance. There was no abrupt tempo change in most music clips; their tempos were perceptually constant with mild fluctuation. Musical onsets were detected from the audio waveform at a sampling rate of 100 Hz and used in the proposed musical beat tracking algorithm. To build the probability distributions I_B and I_N of music onset intensity in EPDA, all clips in the data set except the test clip were used for parameter estimation.

5.7.2 Performance Metrics

Two metrics were used to evaluate the musical beat tracking performance. The first one is similar to the P-score evaluation in MIREX 2006 [1], which evaluates the correct rate of detected beats while accounting for false alarms. It is defined as

    P = \frac{1}{N_{max}} \sum_{n=-\infty}^{\infty} \sum_{m=-w}^{w} \tau_d(n) \tau_g(n-m),    (5.47)

where \tau_d is the unit pulse at a detected beat location, \tau_g is the unit pulse at a ground-truth beat location, 2w is the tolerance window size, and

    N_{max} = \max(N_d, N_g),    (5.48)

where N_d and N_g are the numbers of detected and ground-truth beats, respectively. Note that \tau_d and \tau_g take values 0 and 1 only. The window size 2w was chosen to be 20% of the beat duration throughout the experiment. The value of P lies between 0 and 1, and higher is better. If all detected beats are correct, i.e., N_d = N_g = N_{max}, then P = 1. If no correct beats are detected at all, the cross-correlation term in Eq. (5.47) equals 0 so that P = 0. If there are false alarms, the P value is penalized by a larger N_d value.

The second performance metric is the longest music segment with all its beats correctly tracked [19, 28]. This longest segment is normalized by the total evaluated duration of the clip (i.e., 25 sec for the MIREX data set), so its value also lies between 0 and 1. It shows how long the beat tracking algorithm can maintain accurate tracking once it starts to track. Even if a single beat is missed, this metric drops significantly. For example, for a clip of duration 25 sec, if all beats are correctly detected except one at 15 sec, the metric drops from 100% (= 25/25) to 60% (= 15/25). In contrast, the P-score can still be as high as 96%.
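A simplified reading of the P-score of Eq. (5.47) can be computed as below: each detected beat that falls within the tolerance window of some ground-truth beat is counted as correct, and the count is normalized by max(N_d, N_g). This counting version is an assumption standing in for the full pulse-train cross-correlation; the 20% tolerance follows the text.

```python
import numpy as np

def p_score(detected, ground_truth, period, tol=0.2):
    """Approximate P-score of Eq. (5.47): fraction of beats matched within
    +/- tol*period, normalized by max(N_d, N_g) so false alarms are penalized."""
    detected = np.asarray(detected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    w = tol * period
    hits = sum(int(np.any(np.abs(ground_truth - t) <= w)) for t in detected)
    return hits / max(len(detected), len(ground_truth))
```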
5.7.3 P-Score Evaluation for the MIREX Data Set

For the MIREX data set, the average performance over all music clips is shown in Table 5.1. We compare the performance of the Kalman-filter-based (KF-based) beat tracking algorithm with the LM, PDA and EPDA measurement selection methods described in Secs. 5.4, 5.5 and 5.6, respectively. We observe that LM, PDA and EPDA achieve 74.08%, 72.67% and 87.33%, respectively. EPDA improves over LM by an average of 13.25%. In contrast, PDA has a performance similar to LM.

Table 5.1: P-score comparison of Kalman filtering with LM, PDA and EPDA.

             LM        PDA       EPDA
    MIREX    74.08%    72.67%    87.33%

The performance of LM and EPDA for each music clip can be further compared via the scatter plot in Fig. 5.4, where each square represents a music clip from the MIREX data set and its x- and y-coordinates represent the performance using LM and EPDA, respectively. There are 14 squares in a cluster in the top-right corner of the figure, where the LM performance is between 80% and 90% and the EPDA performance ranges from 95% to 100%. The remaining 6 squares stay close to or above the 45-degree line. Thus, although EPDA may not improve the performance for some music clips, it does not degrade the performance much. There is one square in the bottom-left corner for which both LM and EPDA perform poorly. It is train13.wav in the MIREX data set, which has a very fast tempo (178 BPM) and very strong musical onsets at the half-beat metrical level, which causes serious confusion in beat detection.

Figure 5.4: The performance of the KF-based beat tracking algorithm with LM (the x-coordinate) and EPDA (the y-coordinate) for each music clip from the MIREX data set.

5.7.4 P-Score Evaluation for the Billboard Data Set

The P-score performance for the Billboard Top-10 data set is shown in Table 5.2, along with that of MIREX from Table 5.1 for comparison. For the Billboard Top-10 data set, the P-scores of LM, PDA and EPDA all improve, but to different degrees. The improvement of PDA is the most significant while EPDA still offers the best performance. In fact, except for poorer performance on Elton John's Candle in the Wind, all songs with EPDA achieve a P-score higher than 94%. In contrast, LM achieves similar performance on the MIREX and Billboard Top-10 data sets; there is only a 3.72% difference between the two data sets using LM. As described earlier, MIREX has rather diverse genres while songs in the Billboard Top-10 data set are more homogeneous. Since all of them received great commercial success, they appealed to the average listener's music taste.
Generally speaking, the Billboard data set has more regular beats throughout each music clip than the MIREX data set.

Table 5.2: P-score comparison of the KF-based beat tracking algorithm with LM, PDA and EPDA.

                 LM        PDA       EPDA
    MIREX        74.08%    72.67%    87.33%
    Billboard    77.80%    87.81%    94.68%

5.7.5 Performance of Longest Tracked Music Segment Ratio (LTMSR)

The other performance metric is the longest continuously correctly tracked music segment ratio (LTMSR). It emphasizes the robustness of tracking in a varying environment, which could result from beat variation, rest notes and noisy measurements. The metric is the ratio between the longest correctly tracked music segment and the overall length of each music clip. The results are shown in Table 5.3. For the MIREX data set, the performance of PDA is worse than that of LM. This implies that the onset intensity may play a more important role than the prediction residual in beat tracking for the MIREX data set. EPDA outperforms LM and PDA by 12.36% and 26.71%, respectively. For the Billboard Top-10 data set, PDA achieves slightly better performance than LM. The performance of EPDA is 90.88%, which is significantly better than LM and PDA. Similar to the P-score results, LM, PDA and EPDA all have better LTMSR performance on the Billboard Top-10 data set than on the MIREX data set.

Table 5.3: Longest Tracked Music Segment Ratio (LTMSR) with LM, PDA and EPDA.

                 LM        PDA       EPDA
    MIREX        66.18%    51.83%    78.54%
    Billboard    73.42%    78.98%    90.88%

Chapter 6
Musical Beat Tracking with a Hidden Markov Model (HMM)

6.1 Introduction

A musical beat tracking algorithm using the hidden Markov model (HMM) is proposed in this chapter to determine beat locations. The framework of a generic beat tracking system is shown in Fig. 6.1. The input to the system is the musical audio signal. The system consists of three main modules. First, the musical onset signal is estimated from the audio signal. Second, the period of the musical onset signal is estimated. Finally, the HMM-based musical beat tracking algorithm is used to estimate all beat locations. In this chapter, we assume that the onset detection module and the period estimation module are both available; thus, we focus only on the HMM-based beat tracking module.

Figure 6.1: The framework of the musical beat tracking system: musical audio signals pass through onset detection, period estimation, and the HMM-based beat tracking algorithm to produce beat positions.

The HMM has been widely used in speech recognition and other applications for several decades [33]. It can effectively model the dynamics of a class of time series such as speech signals. Typically, there are multiple hidden states and observations defined in a given HMM. The distribution of these hidden states and the distribution of observations conditioned on a given hidden state are all described by a probabilistic model. Furthermore, the temporal dependency of a time-varying signal is modeled by the state transition characterized by a first-order Markov chain. Sequential optimization of the resulting probabilistic signal model, known as the Viterbi algorithm, is adopted in place of its deterministic counterpart, i.e., dynamic programming. Parameters of the probabilistic signal model and its dynamics are learned from a large amount of training data to capture the statistical characteristics of a particular audio input segment (e.g., a speech sound unit).

To apply the HMM to music beat tracking, we consider probabilistic models of the states and their associated observations in Sec. 6.2 and the state transition model in Sec. 6.3.
Then, the state decoding problem with the Viterbi algorithm is described in Sec. 6.4. Finally, experimental results are shown in Sec. 6.5 to demonstrate the performance of the proposed HMM-based beat tracking algorithm.

6.2 States and Observations of the Proposed Beat-Tracking HMM

Consider an HMM with N states and M observation symbols, denoted by

    \lambda = (A, B, \pi),    (6.1)

where A = \{a_{i,j}\} is the state transition probability distribution, B = \{b_j(k)\} is the observation symbol probability distribution in state j, and \pi is the initial state distribution [32]. The time axis is uniformly partitioned by a basic time unit so that k = 1, 2, ... denotes the discrete time index. The application of the HMM to the music beat tracking problem is detailed below.

6.2.1 States and Observations

The proposed beat-tracking HMM has N_0 states, which can be classified into two types, namely the beat state type (B) and the non-beat state type (N), depending on whether or not a beat occurs inside a time unit. Observations are the quantized musical onset intensities in each time unit.

Not all beats yield strong musical onsets. When there is no obvious change in music tempo, a music beat is more a humanly perceived object than a directly measurable signal. For example, in the transition between two sections of a pop song, the first section may end with rest notes so that there is essentially no musical onset intensity until the start of the second section. However, beat processing still continues in human perception, because human listeners continue counting the beats via prediction and wait for the incoming music activity of the second section. On the other hand, the occurrence of a strong musical onset does not necessarily imply the occurrence of a beat. Musical onsets reflect changes of music activity in the audio spectrum caused by changes of music notes and/or instruments. However, they are not necessarily part of the beats, which maintain the temporal framework of the music. In the HMM, the relationship between states and observations is modeled by the observation symbol probability distribution B given in Eq. (6.1).

To estimate the observation probability distributions of the beat and non-beat state types, training data are used. Both the musical onset signal and the annotated beat locations are given in the training data. Musical onset values are collected in the vicinity of each annotated beat location to estimate the observation probability distribution of the beat state type. Likewise, the remaining musical onset values that lie away from the annotated beat times are used to estimate the observation probability distribution of the non-beat state type.

Since beat annotation is done by human listeners, there exists a difference between the annotated and the exact locations of beats. In some cases, the difference can be as large as several tens of milliseconds; given that the sampling rate used for musical onsets is 100 Hz, this corresponds to several samples. This imprecision may assign observations to the wrong states and result in errors in the estimated probability distributions.

To address the imprecision problem, a small window is applied around the location of each annotated beat, within which the local maximum is searched. The time corresponding to the local maximum is regarded as the location of the beat. Moreover, only onsets from the music segment around this peak are used as observations of the beat state type, while for observations of the non-beat state type only musical onsets outside the window are considered. For example, if there are two annotated beats at two consecutive time instances t_1 and t_2, musical onsets within 20% of (t_2 - t_1) around either t_1 or t_2 are used as observations of the beat state type, while onset values between t_1 + 0.1(t_2 - t_1) and t_2 - 0.1(t_2 - t_1) are used as observations of the non-beat state type.
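The collection of beat and non-beat observations described above can be sketched as follows. The 100 Hz onset rate and the 20% window follow the text; the function name and the use of the median inter-beat interval are illustrative assumptions.

```python
import numpy as np

def collect_observations(d, beat_times, fs=100, win_frac=0.2):
    """Split onset values into beat / non-beat observations.  Each annotated
    beat time is refined to the local onset maximum inside a window of width
    win_frac * (inter-beat interval); onsets inside the window feed the beat
    state type, onsets outside all windows feed the non-beat state type."""
    d = np.asarray(d, dtype=float)
    beats = np.asarray(beat_times, dtype=float)
    ibi = float(np.median(np.diff(beats)))          # inter-beat interval in seconds
    half = int(0.5 * win_frac * ibi * fs)           # half window in onset samples
    beat_obs, refined = [], []
    for t in beats:
        c = int(round(t * fs))
        lo, hi = max(0, c - half), min(len(d), c + half + 1)
        peak = lo + int(np.argmax(d[lo:hi]))        # refine to the local maximum
        refined.append(peak)
        beat_obs.extend(d[max(0, peak - half):peak + half + 1])
    mask = np.ones(len(d), dtype=bool)
    for p in refined:
        mask[max(0, p - half):p + half + 1] = False
    return np.asarray(beat_obs), d[mask]            # (beat obs, non-beat obs)
```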
6.2.2 Observation Probability Determination

Musical onsets are one-dimensional signals with non-negative values. To estimate the observation probability for a given state type (denoted by P(o(k)|j = B) or P(o(k)|j = N) for the beat and the non-beat state types, respectively), we consider both parametric and non-parametric approaches.

The data set from the MIREX 2006 beat tracking competition is used to determine the observation probabilities conditioned on the beat and non-beat state types. It consists of twenty 30-sec music clips with annotated beat locations. Histograms of music onsets conditioned on the beat/non-beat state types are shown in Fig. 6.2, where the bin centers are uniformly located from 0.25 to 8.25 with bin width 0.5 in (a), and from 0.125 to 8.125 with bin width 0.25 in (b). The x-axis denotes the music onset intensity while the y-axis is the frequency of occurrence.

Figure 6.2: The onset histogram conditioned on the beat and non-beat state types with bin width equal to (a) 0.5 and (b) 0.25, where the x-axis is the music onset intensity and the y-axis is the frequency of occurrence.

We see from Fig. 6.2 (a) that the onset distribution for non-beats, P(o(k)|j = N), concentrates heavily on small onset values with few large values. The non-beat state type has a much higher probability than the beat state type in the first bin. For the second bin, centered at 0.75, the beat and non-beat types have comparable probabilities around 20% (0.190 for beats and 0.207 for non-beats). For the third bin, centered at 1.25, and above, the musical onset probability for beats is larger than that for non-beats. By refining the bin width from 0.5 to 0.25, we get better resolution for small onset values, as shown in Fig. 6.2 (b), at the cost of higher complexity and larger storage.

Observation probabilities P(o(t)|B) and P(o(t)|N) can also be approximated by a parametric model of the following form:

    f(x | k, \theta) = \frac{ e^{-x/\theta} }{ \theta^k \Gamma(k) } x^{k-1},    (6.2)

where \theta is a scale parameter, k is a shape parameter, and

    \Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t} \, dt    (6.3)

is the Gamma function. To find parameters k and \theta, the maximum likelihood estimation (MLE) method can be used. To approximate the histograms in Fig. 6.2, the parameters are k = 0.949 and \theta = 0.904 for the beat state type and k = 0.581 and \theta = 1.045 for the non-beat state type. Observation probabilities P(o(t)|B) and P(o(t)|N) plotted with these two parameter sets are shown in Fig. 6.3. We see a close match between the results from the histogram analysis and the Gamma model.

Figure 6.3: Observation probabilities for the beat state type, i.e., P(o(t)|B), and the non-beat state type, i.e., P(o(t)|N), derived from the MIREX data set.
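A minimal sketch of the maximum likelihood fit of the Gamma model of Eq. (6.2) is shown below, assuming scipy is available and using the observation arrays from the collection step sketched earlier. Pinning the location parameter at zero is an assumption made so that only the shape k and scale theta are estimated, matching the two-parameter form in the text.

```python
from scipy.stats import gamma

def fit_gamma_observation_model(beat_obs, nonbeat_obs):
    """MLE fit of the Gamma model of Eq. (6.2) for P(o|B) and P(o|N);
    floc=0 fixes the location at zero."""
    kB, _, thetaB = gamma.fit(beat_obs, floc=0)
    kN, _, thetaN = gamma.fit(nonbeat_obs, floc=0)
    return (kB, thetaB), (kN, thetaN)

# observation likelihoods used later by the Viterbi decoder, e.g.
# p_beat = gamma.pdf(o, a=kB, scale=thetaB); p_nonbeat = gamma.pdf(o, a=kN, scale=thetaN)
```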
6.3 State Transition Modeling

6.3.1 Periodic Left-to-Right (PLTR) Model

The left-to-right (LTR) model (also called the Bakis model) [33], shown in Fig. 6.4 (a), is often adopted to model a time-varying signal, where each node denotes a state. In the LTR model, the allowed transition from the current state s(k) to the next state s(k+1) is either to proceed to the next state on the right or to stay at the current state via self-looping; no transition is allowed from a state to any state on its left. The LTR model is used to model signals that change over time in a successive manner. Since the right-hand side represents the direction in which time proceeds, the left-to-right state transition corresponds to time progression. There are N_0 states in the proposed HMM, where the first state belongs to the beat state type while the others belong to the non-beat state type, as shown in Fig. 6.4 (b).

Figure 6.4: (a) A general LTR state transition model and (b) the LTR model with beat and non-beat state types labeled.

The pure LTR model given in Fig. 6.4 passes through the beat state only once and thus models a single beat. To model the temporal progression of multiple beats, we add a large-loop state transition from the rightmost state back to the leftmost state and call the result the periodic left-to-right (PLTR) model, shown in Fig. 6.5. In contrast with the pure LTR model, the PLTR model can represent a periodic phenomenon such as music beats.

Figure 6.5: The PLTR model with a large-loop state transition for modeling the periodic repetition of beats.

6.3.2 Determination of the State Transition Probability a_{ij}

To capture beat progression with the PLTR model, the total number of states, denoted by N_0 in Fig. 6.5, and the self-loop transition probability a_{ii} have to be designed carefully. The former determines the minimum beat period, since it takes at least N_0 state visits to go from one beat to the next. The latter determines the beat period distribution, as discussed below.

Let a_{ii} = p. To demonstrate the effect of p, a single state with a self-loop transition is discussed first. Since the self-loop transition probability is p and the probability of exiting the state is 1 - p, the probability distribution Prob(d) as a function of duration d can be computed as

    Prob(d) = p^{d-1} (1 - p).    (6.4)

Consequently, the expected duration is

    E\{d\} = \sum_{d=1}^{\infty} d \, Prob(d) = \sum_{d=1}^{\infty} d \, p^{d-1} (1-p) = \frac{1}{1-p}.    (6.5)

The plots of Prob(d) as a function of d for p = 0.75 and p = 0.25 are shown in Fig. 6.6. Clearly, a larger p value favors a longer duration. The expected durations for p = 0.75 and 0.25 are 4.0 and 1.33, respectively.

Figure 6.6: The probability distribution Prob(d) with p = 0.75 and 0.25.

Next, to understand the relationship between the number of states, N_0, and the beat period T_0, we consider the case with a single self-loop state transition in the PLTR model, as shown in Fig. 6.7. It is straightforward to show that

    T_0 = (N_0 - 1) + \frac{1}{1-p},

where 1/(1-p) is the expected duration in the B state and N_0 - 1 is the time for the signal to go through the remaining N_0 - 1 states. Thus, N_0 can be determined from

    N_0 = T_0 + 1 - \frac{1}{1-p}.

Figure 6.7: A complete LTR model with only a single self-loop on the beat state type.

However, adding a single self-loop to the beat state alone results in an undesirable distribution of the beat duration, since it is an exponentially decaying function as shown in Fig. 6.6, giving an asymmetrical probability distribution around the target beat period T_0. This problem can be overcome by adding self-loop transitions to all non-beat states as well, as shown in Fig. 6.5. The probability distribution Prob(d) for the PLTR model with a self-loop at every state can be derived as

    Prob(d) = \binom{d-1}{N_0-1} p^{d-N_0} (1-p)^{N_0}.    (6.6)

Two plots of Prob(d) with p = 0.4 and p = 0.6 are shown in Fig. 6.8. We see from the figure that the two plotted probability distributions are symmetrical around the beat period T_0. Moreover, the model with p = 0.6 gives a flatter probability distribution than that with p = 0.4. Simply speaking, the p value controls the width of the envelope centered around T_0.

Figure 6.8: The probability distribution Prob(d) for the PLTR model with a self-loop at every state, parameterized by self-transition probability p = 0.4 and p = 0.6.

The expected beat period of the PLTR model with N_0 states is

    E\{d_{N_0}\} = N_0 E\{d\} = \frac{N_0}{1-p}.    (6.7)

By setting E\{d_{N_0}\} = T_0, we can determine the value of N_0 via

    N_0 = T_0 (1-p).    (6.8)

For example, since both curves in Fig. 6.8 are used to model beat period T_0 = 72, we get N_0 = 44 from Eq. (6.8).

The p value can also be roughly estimated from (6.8) by

    p = \frac{T_0 - N_0}{T_0}.    (6.9)

Note that the value of T_0 - N_0 is approximately one half of the window size. Since p = 1/16 or 1/32 is used in this work, the corresponding window size is about one-eighth (T_0/8) or one-sixteenth (T_0/16) of the beat period, respectively.

Finally, the initial probability distribution is assumed to be uniform over the N_0 states, namely

    \pi_i = \frac{1}{N_0}, \quad i = 0, ..., N_0 - 1.    (6.10)
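Building the PLTR transition matrix described above for a given beat period is mechanical. The sketch below is a minimal illustration, assuming the beat period T_0 is expressed in onset samples and using the relation of Eq. (6.8); the function name and example values are placeholders.

```python
import numpy as np

def build_pltr(T0, p):
    """Construct the PLTR transition matrix A for beat period T0 (in time units)
    and self-loop probability p.  State 0 is the beat state; every state has a
    self-loop with probability p, moves right with probability 1-p, and the
    last state loops back to the beat state (the large loop of Fig. 6.5)."""
    N0 = max(2, int(round(T0 * (1.0 - p))))        # Eq. (6.8)
    A = np.zeros((N0, N0))
    for i in range(N0):
        A[i, i] = p                                # self-loop
        A[i, (i + 1) % N0] = 1.0 - p               # advance (wraps back to state 0)
    pi = np.full(N0, 1.0 / N0)                     # uniform initial distribution, Eq. (6.10)
    return A, pi

# e.g. a beat period of 50 onset samples (0.5 s at 100 Hz) with p = 1/16
A, pi = build_pltr(T0=50, p=1.0 / 16)
```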
6.4 State Decoding via the Viterbi Algorithm

Given an HMM, \lambda, and a sequence of observations o(t), t = 1, 2, ..., T, the Viterbi algorithm is used to decode the optimal state sequence \hat{S} = \{\hat{s}(1), \hat{s}(2), ..., \hat{s}(T)\} that has the largest posterior probability conditioned on the observation sequence o(t) among all possible state sequences. Mathematically, this can be written as

    \hat{S} = \{\hat{s}(1), ..., \hat{s}(T)\} = \arg\max_{s(1), ..., s(T)} P( s(1), ..., s(T) \,|\, o(1), ..., o(T) ).    (6.11)

By the Bayes rule, P(A|B) = P(B|A) P(A) / P(B). Furthermore, since P(o(1), o(2), ..., o(T)) is the same for any given observation sequence o(t), it does not affect the optimal state sequence \hat{S}. Thus, Eq. (6.11) can be rewritten as

    \hat{S} = \arg\max_{s(1), ..., s(T)} P( s(1), ..., s(T) ) \, P( o(1), ..., o(T) \,|\, s(1), ..., s(T) ).    (6.12)

The first term P(s(1), ..., s(T)) in (6.12) can be decomposed into a product of transition probabilities between states at consecutive time instances t-1 and t. Given the model \lambda and the Markov assumption, the decomposition can be written as

    P( s(1), s(2), ..., s(T) ) = P(s(1)) P(s(2)|s(1)) P(s(3)|s(2)) \cdots P(s(T)|s(T-1)),    (6.13)

where the transition probabilities from state i to state j, P(s(t+1) = s_j | s(t) = s_i) \triangleq a_{ij}, are specified by the HMM as discussed in the last section. The second term P(o(1), ..., o(T) | s(1), ..., s(T)) in (6.12) can be decomposed into the product of terms P(o(t)|s(t)) for t = 1, ..., T as

    P( o(1), ..., o(T) \,|\, s(1), ..., s(T) ) = P(o(1)|s(1)) \cdot P(o(2)|s(2)) \cdots P(o(T)|s(T)).    (6.14)

This is valid since the HMM assumes that observation o(t_0) depends only on state s(t_0) and is independent of observations at t \ne t_0. The observation probability distribution b_{s_i}(o) \triangleq P(o|s_i) was discussed in Sec. 6.2.

Thus, we arrive at the following optimization problem from Eqs. (6.11), (6.13) and (6.14):

    \hat{S} = \arg\max_{s(1), ..., s(T)} P(s(1)) P(o(1)|s(1)) \cdot P(s(2)|s(1)) P(o(2)|s(2)) \cdots P(s(T)|s(T-1)) P(o(T)|s(T))
            = \arg\max_{s(1), ..., s(T)} P(s(1)) P(o(1)|s(1)) \prod_{t=2}^{T} P(s(t)|s(t-1)) P(o(t)|s(t)),    (6.15)

where P(s(1)) is the initial state probability. It is well known that the Viterbi algorithm can be applied to (6.15) to decode the best state sequence s(1), s(2), ..., s(T), from which beat locations are read off as the times of the decoded beat states.
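A log-domain Viterbi decoder for the PLTR beat-tracking HMM can be sketched as below, combining the transition matrix and the Gamma observation model from the earlier sketches. It is only an illustration under those assumptions; the clamping of zero onsets and the fixed choice of state 0 as the beat state are implementation conveniences.

```python
import numpy as np
from scipy.stats import gamma

def viterbi_beats(onsets, A, pi, beat_params, nonbeat_params, fs=100):
    """Decode the most likely state sequence (Eq. (6.15)) in the log domain and
    return the times of the beat state (state 0).  beat_params/nonbeat_params
    are the (shape, scale) Gamma parameters of P(o|B) and P(o|N)."""
    onsets = np.maximum(np.asarray(onsets, dtype=float), 1e-6)   # avoid log of 0
    N0, T = A.shape[0], len(onsets)
    logA = np.log(A + 1e-300)
    # per-frame log observation likelihoods: row 0 is the beat state
    logB = np.tile(gamma.logpdf(onsets, a=nonbeat_params[0], scale=nonbeat_params[1]), (N0, 1))
    logB[0] = gamma.logpdf(onsets, a=beat_params[0], scale=beat_params[1])
    delta = np.log(pi + 1e-300) + logB[:, 0]
    psi = np.zeros((T, N0), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logA                 # cand[i, j] = delta[i] + log a_ij
        psi[t] = np.argmax(cand, axis=0)
        delta = cand[psi[t], np.arange(N0)] + logB[:, t]
    states = np.empty(T, dtype=int)                  # backtrack the best path
    states[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1][states[t + 1]]
    return np.where(states == 0)[0] / fs             # beat times in seconds
```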
One example of state sequence decoding is shown in Fig. 6.9. The test music is the 4th music clip in the MIREX beat tracking competition data, where a female vocalist performs in a jazz/R&B style. The accompanying music and percussion instruments provide a clear, stylized jazz rhythm against the singer's vocal. In some segments, beats are easy to follow due to strong musical onsets and concrete periodicity; in others, beats are irregular and therefore difficult to follow. The beats of this music clip cannot be followed as easily as in most pop and rock songs, in which percussion instruments are heavily used.

Results for the musical segment from 20 to 30 seconds are shown in Fig. 6.9. Fig. 6.9(a) shows the musical onset extracted from the musical audio signal. Fig. 6.9(b) shows the difference between the log-likelihoods of the observation under the beat and the non-beat state types: a larger value means the observation is more likely to come from a beat state, while a smaller value means it is less likely to come from a beat state.

Figure 6.9: An example of the decoded state sequence based on the 4th music clip in MIREX: (a) the musical onset, (b) the log-likelihood difference between the beat and the non-beat state types, (c) the decoded state sequence, where "1" represents the beat state and "0" represents the non-beat state, as a function of time (in seconds).

It is difficult for a human to find beat locations via visual inspection of Fig. 6.9(a). In particular, the musical onsets look "congested" between 20 and 22 seconds and between 24 and 26 seconds. Even though beats usually correspond to larger musical onsets, they are buried in the busy music activity caused by vocal sounds and musical instruments. The HMM provides a way to clean up the data: it incorporates the information of periodic beat cycles, so only sequences of onsets that are periodic with the pre-designed period T_0 can achieve a high likelihood over time. Fig. 6.9(c) shows the resulting state sequence, where the value 1 represents a beat state and the value 0 represents a non-beat state. Beats can be extracted from the "congested" musical onsets in Fig. 6.9(a) because the proposed HMM applies the beat periodicity constraint. Some onsets, e.g., the ones at 25.93 sec, 27.35 sec and 28.79 sec, stand out from other onsets because they not only have large individual intensities but also constitute together a short pulse train. Within a small time window, they have a larger likelihood of being beats than other combinations. However, they are filtered out over the larger time window consisting of 6-7 beats, as shown in this example.

Another example is taken from 15 to 25 seconds of the 7th music clip in the MIREX beat tracking competition data, which is a classical piece performed by an orchestra. Trumpet-like instruments provide the temporal framework of beats throughout the music clip. As shown in Fig. 6.10(a), the musical onsets of beats and non-beats yield clearly different intensities except for the segment between 18 and 20 seconds, and the beats show very regular and strong musical onsets; one can almost tell the beat locations by visual inspection. We see from Fig. 6.10(c) that whenever there are beat states in the decoded state sequence, they often occur consecutively. This is due to the "width" of the musical onsets around a beat: whenever there is a beat, it usually corresponds to a segment of musical onsets with a high intensity level and therefore large observation probabilities P(o|s_i = B). Under the periodic constraint of the state space, the HMM can adjust the beat duration when modeling a quasi-periodic signal.
Figure 6.10: An example of the decoded state sequence based on the 7th music clip in MIREX: (a) the musical onset, (b) the log-likelihood difference between the beat and the non-beat state types, (c) the decoded state sequence, where "1" represents the beat state and "0" represents the non-beat state, as a function of time (in seconds).

6.5 Experimental Results

6.5.1 Experimental Data and Setup

The two data sets used to evaluate Kalman-filter-based musical beat tracking were also used in the evaluation of HMM-based musical beat tracking: the MIREX 2006 beat tracking competition practice data and the Billboard Top-10 songs from the 80's. The first 5 seconds of each music clip were used to calculate the beat period via the autocorrelation function. We do not attempt to discriminate among different metrical levels; the metrical level with strong intensity is chosen manually. The remaining portion of each music clip was used for the performance evaluation of HMM-based musical beat tracking.

The observation probabilities of the beat and the non-beat state types were trained using the leave-one-out technique. For example, when the performance on the 1st music clip of the MIREX 2006 data was tested with the parametric approach, the parameters of the Gamma distribution were estimated from the remaining 19 music clips, and the resulting probability model was used to calculate the observation probabilities for the 1st music clip.

6.5.2 P-Score Performance Evaluation

The P-score and the longest correctly tracked music segment are the two metrics widely used to evaluate the performance of musical beat tracking. Since the HMM-based musical beat tracking aims at off-line beat tracking, the former metric is more appropriate. The P-score is defined as

    P = \frac{1}{N_{max}} \sum_{n} \sum_{m=-w}^{w} \tau_d(n) \tau_g(n-m),    (6.16)

where \tau_d and \tau_g take the value 1.0 at the detected and the ground-truth beat locations, respectively,

    N_{max} = \max(N_d, N_g),    (6.17)

and N_d and N_g are the numbers of detected and ground-truth beats. The window size used throughout our experiments is 20% of the beat period.

We first evaluated the performance of several HMM settings applied to the MIREX data set. The first setting was the method of observation probability calculation: 1) non-parametric (histogram analysis with bin width 0.25) and 2) parametric via the Gamma distribution. The second setting was the value of the self-loop probability a_{ii} \triangleq p for all states; the performance with p = 1/16 and p = 1/32 was compared. The P-score performance is shown in Table 6.1. We see very similar performance under the different settings.

Table 6.1: Performance of HMM-based musical beat tracking with various settings applied to the MIREX data set.

                Non-Parametric   Gamma Distribution
    p = 1/16    74.65%           77.76%
    p = 1/32    78.63%           76.65%

Then, we evaluated the performance of the same HMM settings applied to the Billboard Top-10 data set. The results are shown in Table 6.2.

Table 6.2: Performance of HMM-based musical beat tracking with various settings applied to the Billboard Top-10 data set.

                Non-Parametric   Gamma Distribution
    p = 1/16    91.67%           91.42%
    p = 1/32    96.23%           97.53%

Compared with the MIREX results in Table 6.1, the Billboard results in Table 6.2 are significantly better. Recall that the Billboard Top-10 data set consists mainly of genres such as pop and rock, while the MIREX data set has diverse genres ranging from classical music and jazz to folk music. Pop and rock use a lot of percussion instruments; usually, they have more regular and stronger beats than other genres such as classical and jazz music.
Regular beats fit the HMM well, since the model enforces the periodicity of beats. Stronger beats yield higher observation probabilities for the beat state type and are thus more easily decoded by the Viterbi algorithm.

6.5.3 Comparison of KF-based and HMM-based Musical Beat Tracking Algorithms

The P-score performance of the Kalman-filter-based and HMM-based musical beat tracking algorithms was also compared. For this comparison, the first 30 seconds of the MIREX 2006 and the Billboard Top-10 data sets were tested, respectively. The settings of the Kalman filter and the HMM were chosen based on the best performance each could achieve. The results are shown in Table 6.3. For the MIREX 2006 data set, the Kalman filter with EPDA outperforms the HMM by 8.7%. For the Billboard Top-10 data set, the HMM outperforms the Kalman filter by 1.55%.

Table 6.3: Performance comparison between the Kalman-filter-based and HMM-based musical beat tracking algorithms.

                Kalman Filter   HMM
    MIREX       87.33%          78.63%
    Billboard   94.68%          96.23%

Chapter 7
Conclusion and Future Work

7.1 Conclusion

In this thesis, a three-level framework for automatic music structure analysis was proposed. The framework consists of beat-level feature extraction, a measure-level similarity matrix and structure-level music structure decomposition. Under this framework, the music structure was investigated, and several signal processing techniques were developed to support the framework.

In Chapter 3, a technique based on the phase-locked loop was proposed to detect beat period boundaries from the musical onset signals. Pulses in the beat-level metrical structure are not regular: sometimes there is no corresponding pulse at all, or the corresponding pulse has very small intensity. Given an initial estimate of the music tempo, the music tempo tracking system can track the beat pulses and locate the boundaries of beat-level pulses adaptively.

In Chapter 3, a technique based on dynamic time warping (DTW) was also proposed to estimate the optimal distance between two measures of music signals in a song. Because of imperfect synchronization, which is either intrinsic in the music signal or due to inaccurate detection of measure boundaries, two close measures, where one measure is the repetition of the other, may not yield a low distance. The DTW-based technique helps overcome the imperfect synchronization problem by warping the corresponding feature vectors to achieve the optimal distance between two measures.

In Chapter 3, a measure-level similarity matrix was further proposed, where each element in the matrix represents the degree of similarity between two measures of the musical signal. It incorporates the music tempo information, such as the period and phase of the beats and measures, into the construction of the similarity matrix. The resulting matrix has the advantages of a small data size and less data redundancy, which alleviates the computational burden in high-level music structure decomposition.

In Chapter 4, the music structure of a popular or rock song was decomposed. A technique based on the Viterbi algorithm was applied to the lower-triangular measure-level similarity matrix to find the best path that has the highest cumulative probability. Since only sub-diagonal lines are pertinent to the similarity of segments composed of consecutive measures, the state transition probability is selected accordingly to reflect such a preference.
This preference overcomes the weak similarity in the verse parts, and the detected path includes mostly segments with high similarity values. Furthermore, post-processing is used to remove the parts of the path that correspond to small similarity values. The resulting segments in the similarity matrix represent the repetitive relationship between different parts of the song. Within the repetitive parts, the verse parts are then separated from the chorus parts by evaluating the similarity values.

In Chapter 5, the problem of on-line musical beat tracking was discussed. A technique based on Kalman filtering with probabilistic data association (PDA) was proposed. The beat progression can be modeled by a linear dynamic system with an appropriate choice of the state vector. One key issue in applying the Kalman filter (KF) is the choice of the measurement for the next beat. Three noisy measurement selection methods were considered: the local maximum (LM), PDA, and enhanced PDA (EPDA) methods. EPDA considers both the prediction residual and the music onset intensity. It was demonstrated that the KF-based beat tracking algorithm with EPDA provides excellent performance.

In Chapter 6, another musical beat tracking technique, based on the HMM, was proposed. The prior knowledge of the beat period was incorporated in the state space and the distribution of music onset intensities was studied statistically. The Viterbi algorithm was used to determine the best locations for a beat sequence. It was demonstrated that the HMM-based musical beat tracking algorithm performs well on data with regular and strong beats.

7.2 Future Work

Several assumptions were made in the three-level framework of the automatic music structure analysis system. To make our current work more complete, it is necessary to relax these assumptions and investigate them further.

• General Forms of Music Structures

In Chapter 2, three forms of music structure are mentioned. Only the verse-chorus form, which is widespread in modern popular and rock songs, is discussed in Chapter 4. Despite their differences in music structure, all three forms are composed of repetitive parts. Thus, our system could be extended to cope with them. One major problem is that the repetitive parts in the AAA and AABA forms are similar to the verse parts in the verse-chorus form: they share the same melody and chords but differ in lyrics, and the latter results in audio signals with many variations. Another problem is the confusion between AABA and certain verse-chorus forms such as VCVCBVC, where A is seen as VC. In order to cope with all three forms in one system, feature vectors should be developed to evaluate not only the general similarity of the audio signals but also the similarity of particular characteristics, for example, similarity of melody and similarity of lyrics. These would not only make the verse-chorus discrimination more reliable but also make the classification of the three forms possible.

• Beat Tracking for Music with Time-Varying Tempos

It was assumed in Chapter 5 that the tempo of the underlying music is perceptually constant. However, tempo change is used as a means of expressive performance in certain music pieces. There are two types of tempo change: abrupt change and gradual change. The former provides a strong contrast between musical sections, together with changes of music content such as melody, rhythm and chord. For example, consider two tempos, T_1 and T_2, in a rock song, where a musical section of tempo T_1 is followed by another section of tempo T_2 and then by a third section of tempo T_1.
The section of tempo T_2 attracts the listener's attention since it falls outside the expectation imposed by the section of tempo T_1. An additional component is needed to detect such changes of the perceptually constant tempo so that the KF-based beat tracking can avoid losing track.

Gradual tempo change often occurs in classical, jazz or blues music. It provides a sense of flowing music by changing the tempo over time. The tempo change is usually not random: it often starts from a low tempo T_1, gradually increases to a high tempo T_2, and then decreases to T_1 again. The tempo change cycle could be modeled by modifying the linear dynamic model in the Kalman filter, say, by adding a third element to the state vector x(k) to represent the acceleration of beat progression. Since beats are no longer perceptually constant, there will be a severe problem in associating the real next beat with its music onset. Thus, the PDA mechanism needs to be enhanced as well. This appears to be an interesting problem for future study.

Bibliography

[1] "Music information retrieval evaluation exchange (MIREX) competition," 2006, http://www.music-ir.org/mirexwiki/index.php.
[2] D. Abramovitch, "Phase-locked loops: A control centric tutorial," Proceedings of the American Control Conference, 2002.
[3] M. Alghoniemy and A. Tewfik, "Rhythm and periodicity detection in polyphonic music," IEEE Workshop on Multimedia Signal Processing, pp. 185-190, 1999.
[4] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association. Orlando, Florida: Academic Press, 1988.
[5] Y. Bar-Shalom and X. R. Li, Multitarget-Multisensor Tracking: Principles and Techniques. Storrs, CT: Yaakov Bar-Shalom, 1995.
[6] R. E. Best, Phase-locked loops: theory, design and applications, 2nd ed. McGraw Hill, 1993.
[7] A. T. Cemgil, B. Kappen, P. Desain, and H. Honing, "On tempo tracking: tempogram representation and Kalman filtering," Journal of New Music Research, 2001.
[8] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," Proceedings of ISMIR, pp. 81-85, 2002.
[9] S. Davis, The Craft of Lyric Writing, 1st ed. Writer's Digest Books, 1985.
[10] S. Dixon, "Automatic extraction of tempo and beat from expressive performances," Journal of New Music Research, vol. 30, no. 1, pp. 39-58, 2001.
[11] J. Foote, "Visualizing music and audio using self-similarity," Proc. ACM International Conference on Multimedia, 1999.
[12] ——, "Automatic audio segmentation using a measure of audio novelty," IEEE International Conference on Multimedia and Expo, 2000.
[13] T. Fujishima, "Realtime chord recognition of musical sounds: a system using Common Lisp Music," International Computer Music Conference, pp. 464-467, 1999.
[14] E. Gomez, "Tonal description of polyphonic audio for music content processing," INFORMS Journal on Computing, 2005.
[15] M. Goto, "An audio-based real-time beat tracking system for music with or without drums," Journal of New Music Research, vol. 30, no. 2, pp. 159-171, 2001.
[16] ——, "A chorus-section detecting method for musical audio signals," ICASSP, 2003.
[17] ——, "SmartMusicKIOSK: music-playback interface based on chorus-section detection method," Journal of the Acoustical Society of America, vol. 115, no. 5, p. 2494, 2004.
[18] M. Goto and Y. Muraoka, "Music understanding at the beat level: real-time beat tracking for audio signals," Proceedings of IJCAI-95 Workshop on Computational Auditory Scene Analysis, pp. 68-75, 1995.
[19] ——, "Issues in evaluating beat tracking systems," Proc. IJCAI-97 Workshop on Issues in AI and Music, pp. 9-16, 1997.
[20] F. Gouyon and S. Dixon, "A review of automatic rhythm description systems," Computer Music Journal, vol. 29, no. 1, pp. 34-54, 2005.
[21] F. Gouyon and P. Herrera, "Determination of the meter of musical audio signals: seeking recurrences in descriptors of beat segments," Proceedings of the Audio Engineering Society, 114th Convention, 2003.
[22] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano, "An experimental comparison of audio tempo induction algorithms," IEEE Trans. on Speech and Audio Processing (submitted), 2006.
[23] S. Hainsworth and M. Macleod, "Beat tracking with particle filtering algorithms," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 91-94, 2003.
[24] J. Herre, M. Cremer, C. Uhle, and J. Robden, "Proposal for a core experiment on AudioTempo," MPEG2001/8415, 2002.
[25] C.-M. Huang, D. Liu, and L.-C. Fu, "Visual tracking in cluttered environments using the visual probabilistic data association filter," IEEE Transactions on Robotics, vol. 22, no. 6, 2006.
[26] A. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, 1970.
[27] A. Klapuri, "Musical meter estimation and music transcription," Proc. Cambridge Music Processing Colloquium, 2003.
[28] A. P. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Trans. on Speech and Audio Processing, vol. 14, no. 1, pp. 342-355, 2006.
[29] L. Lu, M. Wang, and H.-J. Zhang, "Repeating pattern discovery and structure analysis from acoustic music data," Proc. ACM International Workshop on Multimedia Information Retrieval, 2004.
[30] M. E. P. Davies and M. Plumbley, "Beat tracking with a two state model," ICASSP, 2005.
[31] E. Pampalk, A. Rauber, and D. Merkl, "Content-based organization and visualization of music archives," Proc. ACM International Conference on Multimedia, pp. 570-579, 2002.
[32] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.
[33] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[34] C. Rasmussen and G. Hager, "Probabilistic data association methods for tracking complex visual objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, 2001.
[35] E. Scheirer, "Tempo and beat analysis of acoustic musical signals," Journal of the Acoustical Society of America, vol. 103, 1998.
[36] W. A. Sethares, R. D. Morris, and J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 2, pp. 275-285, 2005.
[37] A. Sheh and D. Ellis, "Chord segmentation and recognition using EM-trained hidden Markov models," International Symposium on Music Information Retrieval, 2003.
[38] G. Tzanetakis, G. Essl, and P. Cook, "Human perception and computer extraction of musical beat strength," Proc. Digital Audio Effects Conference, pp. 257-261, 2002.
[39] B. Vercoe, "Computational auditory pathways to music understanding," Perception and Cognition of Music, pp. 307-326, 1997.
Abstract
Automatic music structure analysis from audio signals is an interesting topic that has received much attention in recent years. Its objective is to find the music structure by decomposing the music audio signal into sections and detecting the repetitive parts. This technique will benefit music data analysis, indexing, retrieval and management. In this research, a three-level framework for music structure analysis is proposed. The first level is the beat level: musical audio signals are analyzed via tempo analysis, the beat is derived as the basic temporal unit of each music piece, and feature vectors are then extracted for each basic unit. The second level is the measure level, where a similarity matrix between measures is constructed from the multiple feature vectors within each measure. The third level is the structure level, whose elements, for example in a pop or rock song, consist of the repetitive parts such as verses and choruses and the non-repetitive parts such as the intro, outro and bridge. A technique based on dynamic programming is proposed to search for the similar parts of a song. With post-processing, the musical sections of the song can be extracted and their boundaries estimated.
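As a rough illustration of the measure-level step summarized above, the sketch below builds a similarity matrix from beat-synchronous pitch class profile (PCP) vectors. The function name, the fixed number of beats per measure and the plain cosine similarity are simplifying assumptions made only for this example; the full system described in the dissertation computes measure-to-measure distances with a more elaborate procedure.

import numpy as np

def measure_similarity_matrix(pcp, beats_per_measure=4):
    """pcp: (num_beats, 12) array of per-beat PCP vectors (assumed given)."""
    num_measures = pcp.shape[0] // beats_per_measure
    # Stack the beats of each measure into one feature vector per measure.
    measures = pcp[:num_measures * beats_per_measure]
    measures = measures.reshape(num_measures, -1)
    # Cosine similarity between every pair of measures.
    norms = np.linalg.norm(measures, axis=1, keepdims=True) + 1e-12
    unit = measures / norms
    return unit @ unit.T   # (num_measures, num_measures)

# Example with random data standing in for real PCP features:
rng = np.random.default_rng(0)
S = measure_similarity_matrix(rng.random((64, 12)))
print(S.shape)   # (16, 16)

Repeated sections such as recurring choruses appear as high-similarity off-diagonal stripes in such a matrix, which is what the structure-level search exploits.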
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Investigations in music similarity: analysis, organization, and visualization using tonal features
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Computational analysis of expression in violin performance
Source-specific learning and binaural cues selection techniques for audio source separation
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Quantitative modeling of emotion perception in music
Classification and retrieval of environmental sounds
Human activity analysis with graph signal processing techniques
Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
Hybrid methods for music analysis and synthesis: audio key finding and automatic style-specific accompaniment
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Efficient coding techniques for high definition video
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Signal processing methods for interaction analysis of functional brain imaging data
Novel optimization tools for structured signals recovery: channels estimation and compressible signal recovery
Syntax-aware natural language processing techniques and their applications
Active state tracking in heterogeneous sensor networks
Advanced features and feature selection methods for vibration and audio signal classification
Novel techniques for analysis and control of traffic flow in urban traffic networks
Human motion data analysis and compression using graph based techniques
Asset Metadata
Creator: Shiu, Yu (author)
Core Title: Digital signal processing techniques for music structure analysis
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 09/24/2007
Defense Date: 08/20/2007
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: dynamic programming, hidden Markov model, Kalman filter, music segmentation, musical information retrieval, OAI-PMH Harvest
Language: English
Advisor: Kuo, C.-C. Jay (committee chair), Chew, Elaine (committee member), Narayanan, Shrikanth S. (committee member)
Creator Email: atoultaro@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m826
Unique identifier: UC166176
Identifier: etd-Shiu-20070924 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-556981 (legacy record id), usctheses-m826 (legacy record id)
Legacy Identifier: etd-Shiu-20070924.pdf
Dmrecord: 556981
Document Type: Dissertation
Rights: Shiu, Yu
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu