Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
(USC Thesis Other)
STATISTICAL INFERENCE FOR DYNAMICAL, INTERACTING MULTI-OBJECT SYSTEMS WITH EMPHASIS ON HUMAN SMALL GROUP INTERACTIONS

by Viktor Rozgic

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2011

Copyright 2011 Viktor Rozgic

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Motivation and Challenges
1.2 Addressed problems and contributions
1.2.1 Block sampling methods for multi-object tracking
1.2.2 Localization and tracking of multiple speakers using microphone array
1.2.3 Multi-modal speaker segmentation and identification
1.2.4 Multi-modal multi-channel dyadic interaction database
1.2.5 Approach-avoidance behavior in dyadic interactions: Ordinal regression approach to approach-avoidance label estimation
1.3 Dissertation outline
Chapter 2: Block sampling methods for multi-object tracking
2.1 Multi-object state space models
2.1.1 Classical model
2.1.2 Proposed model of multi-object dynamics
2.2 Probabilistic inference in models of multi-object dynamics
2.2.1 Related methods
2.2.2 Proposed algorithm
2.3 Block proposal distributions
2.3.1 Connectivity graph in observation-time space
2.3.2 Multi-frame object-to-observation associations
2.3.3 Decoupled object dynamics: Gibbs sampling of truncated Gaussian distributions
2.3.4 Connectivity graph parametrization and path sampling
2.3.5 Variational method
2.3.6 Adaptive MCMC for block sampling
2.3.7 Comparison with related methods
2.4 Results
2.4.1 Synthetic datasets
2.4.2 Evaluation metric
2.4.3 Experiments
Chapter 3: Localization and tracking of multiple speakers using microphone array
3.1 Proposed Method
3.1.1 Statistical Model
3.1.2 Mixture Particle Filter
3.1.3 Particle Reclustering
3.1.4 Trajectory Reconstruction Algorithm
3.1.5 Detection of Speaker Appearances and Disappearances
3.2 Experimental Results and Discussion
3.3 Conclusions
Chapter 4: Multi-modal speaker segmentation and identification
4.1 Proposed method
4.2 Statistical Model
4.2.1 Multi-view video tracking
4.2.2 Microphone array likelihood model
4.2.3 Speaker identification
4.2.4 Modality fusion
4.2.5 Speaker Identity and Activity Decoding
4.3 Results and discussion
Chapter 5: Multi-modal multi-channel dyadic interaction database
5.1 Purpose of multi-modal recording environment and dyadic interaction database
5.2 Recording Environment and Hardware
5.3 Database
5.3.1 Collection protocol
5.3.2 Manual post-processing and data annotation
5.3.3 Illustrative Interaction Statistics
5.4 Results and Discussion
5.4.1 Approach and Avoidance - Expert Annotation
5.4.2 Interaction descriptors - role dependency
5.4.3 Analysis of non-verbal features for A-A estimation
Chapter 6: Approach-avoidance behavior in dyadic interactions: Ordinal regression approach to approach-avoidance label estimation
6.1 Proposed algorithm for estimation of ordinal labels
6.1.1 Label ordering inspired collection of binary classifiers
6.1.2 Cumulative logit logistic regression with proportional odds
6.1.3 Fitting CLLRMP on classifier score vectors
6.1.4 Hidden Markov model with OLR based likelihood
6.2 Dyadic interaction dataset
6.3 Results and discussion
6.3.1 Features, estimator training and evaluation methodology
6.3.2 Experiments
6.4 Conclusions
Appendix
.1.1 Journal
.1.2 Conference
.1.3 Talks and Presentations
Baseline microphone array likelihood model

List of Tables

3.1 Brief description of the PRIM algorithm
3.2 Statistical properties of the mMPF-TbD-SOCPD algorithm
3.3 Performance on speaker detection task: strong decision
3.4 Performance on speaker detection task: weak decision
4.1 Model Parameters: Varying Neighborhood Sizes
4.2 Performance vs. State Transition Model Parameter p(a_{t+1} = a_t | a_t) for neighborhood size A = 10
4.3 Performance vs. Neighborhood Size A for transition parameter (p_{i,i} = 0.99)
4.4 Segmentation Performance: Viterbi Decoding, No Limit on Number of Active Speakers
4.5 Segmentation Performance: Bayes Decoding, No Limit on Number of Active Speakers
4.6 Segmentation Performance: Viterbi decoding, Maximally 2 Active Speakers per Frame
4.7 Segmentation Performance: Bayes filtering, Maximally 2 Active Speakers per Frame
5.1 Interaction and dialogue event counts for different interaction roles: Q - question, BC - backchannel, UI/SI - unsuccessful/successful interruption
5.2 Approach-avoidance (A-A) label values for different interaction roles
5.3 Mutual information between motion capture features and A-A labels
6.1 Estimation accuracies for AA labels (window: 6 s; V and AV: video-only and audio-and-video)
6.2 CPL-LR inputs: original features vs. SVM outputs

List of Figures

2.1 Proposed sequential block sampling algorithm
2.2 Proposed statistical model of multi-object dynamics
2.3 Sampling Importance Resampling
2.4 Sequential independence Metropolis-Hastings sampler
2.5 Sequential block MCMC sampling
2.6 Connectivity graph G(Y, x_0, e_0) = (V, E)
2.7 Graph partition into multiple simple-paths and isolated nodes.
2.8 Multi-frame object-to-observation assignment. (Left: original model; Right: alternative model factorization)
2.9 Conditional path sampling algorithm.
2.10 Variational approximation
2.11 Object trajectory sets. Initialized with: 5 (upper left), 10 (upper right), 20 (lower left) and 40 (lower right) objects at t = 0.
2.12 OSPA metric evaluation
2.13 Performance evaluation: detection based observations, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 5 active objects at t = 0
2.14 Performance evaluation: detection based observations, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 10 active objects at t = 0
2.15 Performance evaluation: detection based observations, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 20 active objects at t = 0
2.16 Performance evaluation: detection based observations, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 40 active objects at t = 0
2.17 Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 5 active objects at t = 0
2.18 Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 10 active objects at t = 0
2.19 Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 20 active objects at t = 0
2.20 Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric (p = 2, c ∈ {30, 100, 5000}), 40 active objects at t = 0
3.1 Microphone array.
x_t - position of source S; ∠(m_i c S) - true angle between source S and microphone pair (m_i, m_j); β(x_t) - angle that defines source position in the xOy plane; y^{i,j} - Direction of Arrival estimated by microphone pair (m_i, m_j)
3.2 Left: instrumented conference room (ceiling camera views); Right: 16-microphone array with the omni-directional camera above it.
4.1 Proposed architecture for multimodal fusion. Parameters Pd,A and A are learned from training data. Parameter p_{i,i} defines state transition and parameters (α, β) define likelihood fusion model.
4.2 HMM: a_t - speaker activity indicator vector; θ - unknown trajectory-to-identity association parameter
4.3 Sample participant arrangement: x_t - participants' locations, y_t^{MA} - local maxima of the SPR-GCC-PHAT function, φ_{i,j} - angular distance between the i-th observation and the j-th participant
4.4 Sample associations for the given state: p(y^{MA} | a, x, r) = p(φ_{1,1}) p(φ_{4,3}) J 3 and p(y^{SID} | a, θ) = p(y^{SID} | ID_1, ID_4)
5.1 Left: Outlook of the recording environment: (A) camera array; (B) microphone arrays; (C) motion capture cameras; (D) shotgun microphone; Right: Recording system architecture: blue - video recording system, red - motion capture Vicon system, green - microphone recording system, PC 1 and PC 2 synchronized via dedicated 1394 bus and synchronized with PC 3 using shutter signal
5.2 (a) Marker placement: 4 head markers, 3 back markers, 2 chest markers and 7 markers on each arm; (b) Reconstructed marker locations and derived features
5.3 Speaker turn duration histograms for different roles
5.4 Histogram of probability of the velocity of movement of hands and angular velocity of the head.
As we can see, the inactive participant is significantly less animated than the active speaker.
6.1 Proposed method and HMM extension.
6.2 MoCap markers and body/head orientation features.
6.3 Dependency of AA category estimation accuracies on feature processing window length for different estimation techniques.
6.4 Differences between confusion matrices (left column: (SVM-OLR) - (SVM); right column: (HMM-SVM-OLR) - (SVM)).

Abstract

In this dissertation we propose contributions that address the problems in behavioral signal processing for small-group interactions from three important perspectives. We propose algorithmic contributions to general statistical inference methods for interacting dynamical systems, in particular multi-object tracking problems. We propose multi-modal, multi-channel signal processing methods to address particular aspects of the small group interaction, with emphasis on speaker segmentation and speaker/participant tracking. Finally, we present a recording environment and a collected dyadic interaction database, and propose methods for estimation of approach-avoidance behavior labels based on non-verbal interaction cues.

In the first part of this dissertation we present a class of sequential block sampling algorithms for tracking an unknown and variable number of objects. The proposed algorithms are applicable to multi-object tracking scenarios in which the only available observations are detector outputs, and also to scenarios where both detector outputs and more complex observations, which figure in the data association free likelihood models, are available. The proposed algorithms provide a way to construct block proposal distributions using detection based observations. Key parts of the proposed algorithms are methods for sampling block proposal distributions.
We propose two novel methods for this purpose: one is based on a variational approximation scheme and the other represents an adaptive MCMC sampling scheme. Samples from block proposal distributions are further used in the sequential MCMC (or SMC) framework. We tested the proposed schemes on two synthetic datasets. Results demonstrate the benefits of processing longer observation sequences in multi-object tracking problems in a more efficient manner than the classical sequential sampling schemes.

In the second part, we present a multi-target tracking algorithm for tracking multiple speakers by a microphone array. As the microphone array observations do not provide an easy way to design speaker location detectors, we propose a mixture particle filter for tracking multiple acoustic sources in the track-before-detection (TbD) framework. This method belongs to the same class of sequential signal processing algorithms (SMC or MCMC) as the block sampler proposed in the first part; the major difference is that the block sampler belongs to the detect-before-tracking class of algorithms. The sound source trajectories reconstructed by the mixture particle filter do not necessarily correspond to speech only. Therefore, we apply an adapted optimal change point algorithm to segment the obtained sound source trajectories into speech and non-speech segments. The algorithm is tested on a multi-participant meeting database as a separate module and as a part of a multi-modal system for automatic meeting monitoring. In both cases it provided significant improvements on the speaker detection and segmentation tasks.

In the third part, we present a modality fusion algorithm that exploits complementary properties of video tracking, microphone array localization and speaker identification and solves the problem of speaker segmentation in presence of the overlapped speech.
In this paper we address improvements to our multimodal system for tracking of meeting participants and speaker segmentation with a focus on the microphone array modality. We propose an algorithm that uses Directions of Arrival estimated for each microphone pair as observations and performs tracking of an unknown number of acoustically-active meeting participants and subsequent speaker segmentation. The proposed algorithm is unique from multiple perspectives. First, we suggest a hidden Markov model architecture that performs fusion of three modalities: a multi-camera system for participant localization, a microphone array for speaker localization, and a speaker identification system. Second, we present a novel likelihood model for the microphone array observations for dealing with overlapped speech. We propose a modification of the Steered Power Response Generalized Cross Correlation Phase Transform (SPR-GCC-PHAT) function that takes into account possible microphone occlusions. We employ the multi-object detect-before-tracking approach and use the local maxima of the modified SPR-GCC-PHAT functions as sound source detectors. Multiple detection locations are fused into the joint likelihood by the joint probabilistic data association. This places the original speaker segmentation problem in the multi-object tracking framework, where it is solved using Bayesian filtering/smoothing methods.

This concludes the exposition on the core signal processing algorithms closely related to multi-object tracking; the last part of the dissertation is dedicated to the analysis and automation of human behavior coding in small group interactions.

We present a new multi-modal database for analysis of participant behaviors in dyadic interactions. This database contains multiple channels with close- and far-field audio, a high definition camera array and motion capture data.
Presence of the motion capture allows precise analysis of the body language low-level descriptors and its comparison with similar descriptors derived from video data. Data is manually labeled by multiple human annotators using psychology-informed guides. We analyzed the relation between approach-avoidance (A-A) behavior and various non-verbal body language and acoustic features, and the influence of the audio and video channels on experts' labeling decisions. We also analyzed the dependency of the statistical interaction descriptors and A-A labels on participants' roles.

At the end, we propose an ordinal regression (OR) algorithm and its extension applicable to time series for estimation of the approach-and-avoidance (AA) behavior quantifiers (labels) in human dyadic interactions. The proposed algorithm transforms the ordinal regression into multiple binary classification problems, solves them by independent score outputting classifiers and fits the cumulative logit logistic regression model with proportional odds (CLLRMP) on the classifier score vectors. The time series extension treats labels as states of the hidden Markov model with likelihood based on the probabilistic CLLRMP output. We compare performances of the proposed algorithm applying the weighted binary SVMs in the second step (SVM-OLR), its extension (HMM-SVM-OLR) and the baseline multi-class SVM. The HMM-SVM-OLR achieves the highest estimation accuracy.

Chapter 1: Introduction

1.1 Motivation and Challenges

Human communication is a dynamic process where communicative goals are achieved through interaction that employs multi-modal cues: speech and visual cues, including explicit and implicit information such as paralinguistic phenomena and body language. Although complex, the verbal and non-verbal behaviors in dyadic or small group interactions follow patterns that have been the research focus of psychologists for a long time.
For example, psychologists have developed many coding schemes, such as the Couple Interaction Rating System (CIRS) [27] and the Rapid Marital Interaction Coding Scheme (RMICS) [28], for annotation of couples interaction and family therapy sessions. These schemes are based on recognition of low-level verbal and non-verbal cues (e.g. gaze, body orientation, turn taking patterns, presence of negative words, tone of voice etc.). Inferences by the experts can be made using these low-level cues towards deriving high-level behavior codes (e.g. acceptance, positivity, blame, negativity, approach-avoidance etc.) with direct influence on evaluation and planning of the therapy process.

Developments in signal processing methods focused on extraction of the low-level descriptors, speech (speaker diarization [20]), audio-visual (tracking humans [21], head orientation [10], facial feature extraction [63], hand tracking [39]) and natural language processing, have opened possibilities to step forward towards automatic assessment of intermediate and high level behavioral codes. While in more traditional applications the low-level descriptors are used as features in the more complex and not necessarily human behavior related tasks (automatic content retrieval [31], interaction type classification [41] and content summarization [43]), an emerging research area, behavior signal processing [62], includes social and emotional aspects of interaction, focuses on estimation of the high and intermediate level behavior labels from automatically extracted low-level descriptors and designs alternative intermediate level labels that are intuitive and strongly related to the high level labels. For example, in emotions research valence and activation are used as an intermediate representation for categorical emotion classification.
Our research interest is probabilistic modeling of dynamic interactive multi-agent systems and development of accompanying techniques for inference, learning and pattern discovery, with emphasis on modeling human perception, behavior and interaction patterns. From the engineering perspective we envision non-intrusive personal monitoring systems for analysis of a user's behavior and perception that can provide feedback and serve as a soft diagnostic tool.

We emphasize three main challenges in the behavioral signal processing domain. The first challenge is the treatment of out-of-model variables. Features used for assessment of human behavior are extracted from the communication channels in which different goals and motivations compete for the bandwidth. That makes model learning an extremely challenging task due to the huge number of out-of-model variables. The second challenge is related to the inability to assess quantifiers for some variables of interest. This is not an issue related to the experiment design but to the nature of the variables. It is expressed by a low agreement even between expert labelers, or by inconsistencies in labeling the same input by a single labeler. Finally, identification and design of minimal sensing configurations, stepping out from the controlled recording conditions and generalizing conclusions over different datasets represent an additional technological and research challenge.

1.2 Addressed problems and contributions

Completed work presented in this dissertation addresses two aspects of the behavioral signal processing. First, we propose statistical models and inference algorithms for extraction of low-level descriptors for small group interactions with emphasis on block sampling methods for multi-object tracking and multi-modal speaker segmentation techniques.
Second, we present a multi-modal, multi-channel environment for human interaction monitoring and a novel dyadic interaction database, and propose a method for estimation of the ordinal approach-avoidance behavior labels from automatically extracted audio-visual non-verbal features.

1.2.1 Block sampling methods for multi-object tracking

Multi-object tracking (MOT) problems involve filtering in non-linear and non-Gaussian state spaces with emphasized multi-modality of the filtering distribution. Therefore it is not possible to use traditional, Kalman filter based, methods and it is necessary to use approximate inference. In our work we use sequential Monte Carlo filtering techniques with Markov chain Monte Carlo steps [3]. In general, Markov chain Monte Carlo steps are introduced to tackle the particle impoverishment problem, i.e. to rejuvenate the sample set after the resampling step. Most of the solutions proposed in the literature suggest rejuvenation of the particles obtained in the last filtering step, and even more advanced methods [32][33] represent special instances of such an approach. Although the recent [17] and [15] presented a method to conduct block SMC sampling, there are no reported practical applications of this to the MOT. On the other side, problems caused by application of poor proposal distributions are amplified in the MOT applications. If the observations in the MOT are based on object detections, the proposed filter effectively addresses two problems: it searches for a good model (detection-to-object assignment) and performs possibly non-linear and non-Gaussian filtering for the chosen assignment. If the proposal distribution is not good enough to drive samples towards good models, particle impoverishment will occur.
If observations are such that the likelihood models do not involve combinatorial data association, the effect is the same, and construction of the proposal is even more difficult due to the association-free nature of the likelihood. The conclusion is that in high dimensional spaces, besides advanced filtering algorithms, it is necessary to have an efficient observation driven proposal distribution.

If observations are the object detectors' outputs, we can proceed in the detect-before-tracking framework using one of the classical data association schemes: Joint Probabilistic Data Association [56], Multiple Hypothesis Testing [47], Optimal 2D assignment [49] etc. When available, the association-free likelihoods characteristic for track-before-detection (e.g. color histogram distances between color image patches [45]) can provide more detail on the correct target state capture, usually at the price of more expensive likelihood evaluation. However, it is much easier to construct observation driven proposals for the detection based likelihood models.

We propose an efficient block MCMC sampling scheme and a class of methods for design of the block proposal distributions based on the detection observations. The proposed sampling schemes are applicable in two scenarios: when the list of the detections is the only available observation, and when, besides generally fast and computationally inexpensive detectors, we can access and want to evaluate observations used in data-association free likelihoods [59][45]. In the latter case detection based observations are used to propose samples that are further evaluated by more complex observation models.

The problem of constructing block proposal distributions for sequential multi-object tracking has not been addressed in the literature.
For this part we emphasize ideas presented in [44] and [58], where the authors perform block sampling (using the Markov chain Monte Carlo data association and cross entropy method) for the detection based observations in the model with Gaussian single-object transition and Gaussian likelihood distributions.

We test two variations of the proposed algorithm for construction of the block proposal distribution, based on variational approximations and an adaptive MCMC sampling method. Results on two synthetic datasets show advantages of the proposed methods over the sampling importance resampling with or without history rejuvenation steps.

1.2.2 Localization and tracking of multiple-speakers using microphone array

Following our work [8] we note that the accuracy of the microphone array data fusion method represents a bottleneck for the performance of multimodal tracking of user dynamics. In this work we address this issue by focusing on two problems related to the microphone array modality: tracking of an unknown number of acoustically active participants and active speaker segmentation.

We employ a modified Mixture Particle Filter (mMPF), based on work by Vermaak et al. [60], to track an unknown number of acoustic sources. The mMPF employs as observations the angular estimates of source locations obtained using the Fractional Lower Order Statistics Phase Transform (FLOS-PHAT) method [22] for Time Difference Of Arrival (TDOA) estimation for each microphone pair. The nature of the observations is such that it is difficult to design a robust frame-level detector of acoustic source appearances and disappearances.
For that reason two modifications on the original MPF algorithm are proposed: first, the particle re-clustering step is modified to take into account both spatial position and weights of particles; and second, MPF is placed within the Track-before-Detection (TbD) framework [64] where sources are detected by accumulation of acoustic evidence over time and source trajectories are reconstructed by the optimal two-index [65] assignment of mixture components in consecutive frames. In this formulation the disappearance of acoustic sources is detected when trajectory discontinuity occurs.

In order to discriminate trajectories that belong to active speakers (dominant acoustic sources) from the other acoustic sources (e.g. noise produced by other participants such as paper rattling, coughing etc. as well as sound reflections on surfaces such as the projection screens etc.) we apply a Sequential Optimal Change Point Detection (SOCPD) algorithm [57] on each reconstructed trajectory. As proposed in Kligys et al. [34], we use separate likelihood statistics for detection of speaker appearances and disappearances and propose a method to compute these statistics from the particle representations of trajectories.

Although the MPF interpretation of Vermaak et al. [60] implicitly falls into the TbD category, no particular solution for the trajectory reconstruction was discussed. Other tracking applications of the MPF, such as in [45], do not follow the TbD framework and employ heuristics for detection of appearances and disappearances. For the optimal Bayesian filtering setup, Kligys et al. [34] present a more elaborate treatment of the detection of appearances and disappearances than [5], which proposes an optimal detection method for the particle filtering task. Our method preserves the desirable properties of both frameworks, MPF and TbD, and offers a consistent treatment of trajectory reconstruction and speaker appearance/disappearance detection.
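
The idea of running separate statistics for appearances and disappearances can be illustrated with a generic one-sided CUSUM (Page) test applied in alternation. This is only a minimal sketch and not the SOCPD algorithm of [57] or the statistics of [34]: the function name `segment_speech`, the thresholds `h_on` and `h_off`, and the assumption that a per-frame speech versus non-speech log-likelihood ratio is already available for a trajectory are all hypothetical.

```python
def segment_speech(llr, h_on, h_off):
    """Split a trajectory into speech segments with two alternating CUSUM tests.

    llr   -- per-frame log-likelihood ratios, log p(frame | speech) - log p(frame | non-speech)
             (assumed to be given; how they are computed is outside this sketch)
    h_on  -- threshold of the appearance statistic (hypothetical tuning parameter)
    h_off -- threshold of the disappearance statistic (hypothetical tuning parameter)

    Returns a list of (start, end) frame indices, end exclusive, marking the frames
    at which an appearance and the following disappearance were declared.
    """
    segments = []
    active = False      # current decision: is the source speaking?
    stat = 0.0          # CUSUM statistic of the test that is currently running
    start = None
    for t, x in enumerate(llr):
        # While inactive, evidence for speech is accumulated; while active,
        # evidence for non-speech is accumulated (sign of the ratio flipped).
        step = -x if active else x
        stat = max(0.0, stat + step)
        threshold = h_off if active else h_on
        if stat > threshold:
            if active:
                segments.append((start, t))
            else:
                start = t
            active = not active
            stat = 0.0  # restart the statistic for the opposite test
    if active:
        segments.append((start, len(llr)))
    return segments


# Toy trajectory: 5 non-speech frames, 10 speech frames, 10 non-speech frames.
toy_llr = [-1.0] * 5 + [1.0] * 10 + [-1.0] * 10
toy_segments = segment_speech(toy_llr, h_on=2.5, h_off=2.5)
```

On the toy input each change is declared a few frames after it actually happens. This delay is the usual cost of accumulating evidence over time, and the thresholds trade it against the false alarm rate.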
We test the proposed algorithm in our multi-modal smart room [9]. The proposed 16-microphone, 120-channel data fusion technique, combined with the other modalities, brings significant improvement to the overall performance of the smart-room system on speaker tracking and segmentation tasks.

1.2.3 Multi-modal speaker segmentation and identification

Spontaneous interactions usually result in significant speaker overlap, which degrades the quality of automatic speaker segmentation through speaker identification (SID) methods. In some situations, e.g. in meetings where participants form sub-groups and start multiple conversations, the portion of the overlapped speech is very high and SID methods exhibit poor performance. Recently proposed methods [2, 37] suggest that the speaker segmentation based on microphone array estimation of the direction of arrival (DoA) [7, 12] outperforms segmentation based on SID techniques [48]. However, the usual monitoring setups in meeting environments include small on-the-desk microphone arrays with limited spatial resolution [9, 21], making it difficult to accurately disambiguate densely spaced speakers based on DoA cues only. Furthermore, most methods rarely handle overlapped speech at the modeling level and fail to advantageously combine the microphone array and the SID methods.

We suggest a novel design of the speaker segmentation and identification system based on the hidden Markov model (HMM) (Fig. 4.2) in which the state is obtained by concatenation of the binary speaker activity indicators for all present participants. The unknown parameter in this HMM is the mapping of participants' locations to elements of the set of all possible participants' identities. The known parameters are the participants' locations obtained from the video tracking module. This model gives us three levels of flexibility. We can:
• pick the most appropriate method to decode state sequences: Bayesian filtering [16], Viterbi decoding [46], Markov chain Monte Carlo [25].
• use any available likelihood model, based either on microphone array observations only, or on SID observations only, or their combination.
• allow for an easy modification of processing techniques in any modality. Particularly, the system that we proposed in [9] can track an unknown number of participants based on a combination of the ceiling camera background subtraction and omnidirectional camera face tracking systems. In this work we have opted for a simpler solution, tracking of color markers, in order to focus on the advances in fusion and microphone array processing.

Besides the described specific modality combination (Fig. 4.1) our main contribution includes a statistical model that enables the microphone array modality to detect multiple overlapping speakers (Section 4.2.2). For this purpose we suggest a modification of the Steered Power Response Generalized Cross-Correlation Phase Transform (SPR-GCC-PHAT) function [12] in which we re-weight GCC-PHAT functions for different microphone pairs based on their visibility from the different points in the meeting room. Instead of the usual practice where only the global maximum of the SPR-GCC-PHAT function (i.e. the location of the most prominent sound source) is used in sound source localization, we suggest extraction of multiple local maxima of the modified SPR-GCC-PHAT function. We treat these maxima as the microphone array observations and use the Joint Probabilistic Data Association (JPDA) model [47] to assign them to the active speaker locations. This way we are able to compute the joint likelihood of the microphone array observations when locations of the active speakers are obtained from the video tracking module.

In order to address the speaker overlap by the speaker identification (SID) system we train Gaussian mixture models (GMM) for every single speaker and all combinations of two overlapped speakers. We obtain training data for the models of overlapped speech by mixing single speaker channels with equal average energies.
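
One simple way to mix two channels with equal average energies is sketched below. It assumes that each channel is a list of float samples and that "average energy" means mean squared amplitude; the name `mix_equal_energy` and the choice to rescale only the second channel are illustrative and not taken from the dissertation.

```python
import math


def average_energy(samples):
    """Mean squared amplitude of a channel."""
    return sum(v * v for v in samples) / len(samples)


def mix_equal_energy(chan_a, chan_b):
    """Sum two single-speaker channels after equalizing their average energies.

    The second channel is rescaled so that its average energy matches that of
    the first one; both channels are truncated to the shorter length.
    """
    n = min(len(chan_a), len(chan_b))
    a, b = chan_a[:n], chan_b[:n]
    energy_a, energy_b = average_energy(a), average_energy(b)
    if energy_a == 0.0 or energy_b == 0.0:
        raise ValueError("cannot equalize against a silent channel")
    gain = math.sqrt(energy_a / energy_b)   # amplitude gain applied to channel b
    return [x + gain * y for x, y in zip(a, b)]


# Toy example: the second channel is four times weaker in energy, so it gets gain 2.
speaker_1 = [1.0, -1.0, 1.0, -1.0]
speaker_2 = [0.5, 0.5, -0.5, -0.5]
overlapped = mix_equal_energy(speaker_1, speaker_2)
```

In practice the energies would more likely be measured on speech-active frames only, so that long pauses in one channel do not distort the gain; that refinement is left out of the sketch.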
Speaker activity indicators together with participants' locations define the locations in space occupied by active speakers, which in combination with the location-to-identity mapping provide the active speaker identities necessary to compute the SID likelihoods.

Further, we are able to combine probabilistically (Section 4.2.4) the microphone array modality and the speaker identification modality. Since our speaker identification modality (Section 4.2.3) computes likelihoods that a speech frame is produced by one or two concurrent speakers from the known pool of possible participants, we model the joint likelihood of all acoustic observations given the locations and identities of the active speakers as a product of the microphone array likelihood to the power of α and the speaker identification likelihood to the power of β. Different choices of the power coefficient pair (α, β) define different fusion models. We argue that the independence assumption is justified by the fact that the microphone array and speaker identification observations are based on different characteristics and processing of the signals.

We propose (Section 4.2.5) two methods for the estimation of the unknown association parameter and the hidden state sequence. The first method, sequential Bayesian filtering [16], is appropriate for applications where both the performance and the sequential processing are important. The second method, Viterbi decoding [46], should be the method of choice for applications where only the performance is important and data can be post-processed.

We tested the automatic segmentation performance using precision and recall measures [40] for sequences of states obtained by both decoding algorithms.
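The power-weighted product of the microphone array and SID likelihoods described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names, the candidate labels and the likelihood values are hypothetical and are not taken from the system of Chapter 4. In the log domain the fused score is a weighted sum, and setting one exponent to zero recovers a single-modality model.

```python
import math

def fused_log_likelihood(log_lik_ma, log_lik_sid, alpha=1.0, beta=1.0):
    """Log of (p_MA ** alpha) * (p_SID ** beta) for one frame and one candidate state."""
    return alpha * log_lik_ma + beta * log_lik_sid

def best_state(candidates, alpha=1.0, beta=1.0):
    """Return the candidate label with the highest fused log-likelihood.

    candidates maps a label to a pair (log_lik_ma, log_lik_sid).
    """
    return max(
        candidates,
        key=lambda s: fused_log_likelihood(*candidates[s], alpha, beta),
    )

# Hypothetical per-frame likelihoods for two candidate active-speaker states.
frame = {
    "speaker_A": (math.log(0.6), math.log(0.2)),
    "speaker_B": (math.log(0.3), math.log(0.7)),
}

ma_only = best_state(frame, alpha=1.0, beta=0.0)   # microphone array only
sid_only = best_state(frame, alpha=0.0, beta=1.0)  # speaker identification only
fused = best_state(frame, alpha=1.0, beta=1.0)     # equal-weight fusion
```

In this toy frame the two modalities disagree, and the equal-weight fusion follows the modality with the stronger evidence ratio; other (α, β) pairs would shift that balance.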
In Section 4.3 we present algorithm analysis and compare performances of baseline likelihood models (the SID likelihood model and a recently proposed likelihood model [21] based on the global maxima of the SPR-GCC-PHAT function) with our JPDA likelihood model for microphone array (MA) observations and the proposed combination of the MA and SID likelihood models.

1.2.4 Multi-modal multi-channel dyadic interaction database

First, we present our multi-modal recording environment aimed at collection and informed analysis of human behaviors in collaboration with psychology experts. We describe the collection and present some initial analysis of the first part of the database, three hours of data consisting of multiple five-minute dyadic interactions; a product of the collaboration between the USC Viterbi School of Engineering and the USC Department of Psychology. (The described part of the database is collected through role-playing, but we analyze real data [4] in parallel and intend to collect real couple interaction data in this environment in the future.) Each short interaction represents an argument on one of nine suggested topics where each participant is trying to provide evidence that supports her/his point of view. Some of the topics are confrontations about cheating in a relationship, a drinking problem, stealing from a roommate etc. Data is manually transcribed, segmented in speaker turns and annotated by experts with the approach/avoidance labels. The recording environment contains an array of 10 high-definition video cameras, multiple microphone arrays (13 microphones in total), 2 lapel microphones and a 12-camera motion capture system. This design allows collection of synchronized high quality signals in a controlled environment and enables investigation of advanced signal processing techniques.
For instance, the corpus of real married couple interactions used in [4], although at the moment more realistic in terms of the impact on the field of psychology, restricts the use and development of algorithms. It was not designed and collected to also favor automatic processing, and the included audio-video recordings may be of considerably low quality. In addition to the psychology domain data used in [4], our lab has already released an acted multimodal database (http://sail.usc.edu/iemocap) of emotional interactions. We also intend to disseminate to the community this database, which is richer in realism and sensing.

An additional important advantage of the database is the availability of both motion capture and video data. This allows us to (a) analyze the relation of the body language features obtained from the motion capture output with domain labels; (b) perform training and testing of algorithms that extract equivalent or similar features from video; and (c) analyze information loss through the video processing and refine the algorithms appropriately.

In relation to this, we present the second contribution: statistical dependence of interaction descriptors (e.g., turn durations, number of questions, backchannels, successful and unsuccessful interruptions) on participants' roles, and analysis of the relation between various non-verbal features obtained from the audio and the motion capture data and approach/avoidance labels.

1.2.5 Approach-avoidance behavior in dyadic interactions: Ordinal regression approach to approach-avoidance label estimation

In our previous work [53], we introduced the multimodal dyadic interaction database and used it for analysis of relations between various non-verbal features and AA labels as defined by psychologists [27]. These labels belong to the ordered set of nine categories, ranging from complete avoidance to complete approach.
In this paper we address the estimation of the ordinal AA labels for the same dyadic interactions using low-level non-verbal signal features. These features represent the basic statistics (mean, minimum, maximum and standard deviation) of the various video (body orientation, head orientation, hand movement, measure of how open the postures are) and audio (pitch, energy) based measurements calculated on a feature processing window.

In order to address the ordinal nature of the AA labels we propose a new ordinal regression algorithm. This algorithm is applicable to any ordinal regression problem and consists of three main steps: (1) we transform the ordinal regression problem to multiple binary classification problems defined by the label ordering; (2) we solve the binary classification problems independently using any classification method that outputs a (possibly non-probabilistic) classification score; (3) we fit the cumulative logit logistic regression model with proportional odds (CLLRMP) on vectors obtained by concatenation of scores from the binary classifiers. Additionally, we propose a simple extension of the proposed algorithm applicable to time series of ordinal labels. In the extended algorithm, we model the label sequence using a hidden Markov model with the likelihood based on the probabilistic output from the CLLRMP. For the AA label estimation the feature vectors used are continuous and have no missing values, and we choose to apply the weighted binary SVM classifiers [11] with native non-probabilistic scores.

The two-step ordinal regression algorithm proposed in [19] is somewhat similar to the method we propose. While the first steps are identical, in the second step this algorithm employs probabilistic binary classifiers. The vector of their outputs should represent an estimate of the cumulative distribution of the ordinal label; however, since the binary classifiers are trained independently there is no guarantee that the estimate is monotonically non-decreasing.
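Steps (1) and (3) of the algorithm above can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in Chapter 6: the function names are hypothetical, three categories are used instead of nine, and the weights and thresholds are hand-picked rather than fitted. The second function evaluates a proportional-odds cumulative logit model of the form P(y ≤ j | s) = sigmoid(θ_j − w·s) on a score vector s; with ordered thresholds the cumulative probabilities are non-decreasing by construction, which is the property that independently trained probabilistic classifiers do not guarantee.

```python
import math

def binary_targets(labels, num_categories):
    """Step (1): for each threshold j = 1..K-1 the binary target is [label > j]."""
    return [[1 if y > j else 0 for y in labels] for j in range(1, num_categories)]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def proportional_odds_probs(scores, weights, thresholds):
    """Category probabilities under P(y <= j | s) = sigmoid(theta_j - w . s).

    scores     : outputs of the K-1 binary classifiers for one sample
    weights    : one shared weight vector (the proportional-odds assumption)
    thresholds : increasing cut points theta_1 < ... < theta_{K-1}
    """
    eta = sum(w * s for w, s in zip(weights, scores))
    cdf = [sigmoid(theta - eta) for theta in thresholds] + [1.0]
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]

# Three ordered categories (1 < 2 < 3) give two binary problems.
targets = binary_targets([1, 3, 2], num_categories=3)

# Hypothetical (not fitted) parameters and classifier scores.
probs = proportional_odds_probs(scores=[0.5, -0.2],
                                weights=[1.0, 1.0],
                                thresholds=[-1.0, 1.0])
```

In practice the weights and thresholds would be estimated by maximum likelihood on the concatenated classifier scores; the sketch only shows how the fitted model turns a score vector into a valid distribution over ordered categories.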
We evaluate the proposed estimation methods using leave-one-session-out cross-validation. We present evaluation results for 4 experiments: (1) analysis of dependency between the estimation accuracy and lengths of windows used to calculate the feature statistics; (2) comparison of average estimation accuracies for proposed estimators and the baseline multi-class SVM and analysis of variability in estimation accuracy for different sessions; (3) analysis of differences between confusion matrices for different estimators; (4) comparison of estimation accuracies for the SVM-OLR and the estimator obtained by fitting CLLRMP directly on the original feature vectors.

1.3 Dissertation outline

In Chapter 2 we address the multi-object tracking (MOT) problem. We describe the statistical model used for MOT in Section 2.1. In Section 2.2 we describe a sequential method for block MCMC (or SMC) sampling. In Section 2.3 we describe the proposed method for construction of the block proposal distribution based on the variational distribution approximation scheme. We describe both the independence MCMC block sampling scheme based on a mixture of the independence distributions and local MCMC moves in Section 2.3.6. In Section 2.4 we describe the evaluation method used, provide results on two synthetic datasets and compare performance of the proposed algorithms with the baseline algorithms.

Chapter 3 describes a Track-before-Detection multi-object tracking algorithm for multi-speaker tracking using a microphone array. Two main contributions, the modified mixture particle filter for multi-speaker tracking and the optimal sequential change point algorithm for speech segmentation, are presented in Sections 3.1.2 and 3.1.5 respectively. Speaker tracking and segmentation performance were tested on the USC smart room dataset and we present the results in Section 3.2.

Chapter 4 is dedicated to the multi-modal algorithm that exploits complementary properties of video tracking, microphone array localization and speaker identification and solves the problem of speaker segmentation in the presence of overlapped speech. A joint probabilistic data association likelihood function, applied on observations derived from the proposed modified steered power response GCC-PHAT function, is incorporated in the time series model (Section 4.2.2), which is further used for speaker segmentation. Segmentation results on part of the USC smart room dataset are presented in Section 4.3.

Chapter 5 describes a novel recording environment and a dyadic interaction database. Section 5.2 describes modalities and hardware architecture used in the recording environment. In Section 5.3 we present the collected dyadic interaction database with the details on the collection protocol, data postprocessing and annotation. In Section 6.3 we analyze relations between various non-verbal features and the approach-avoidance labels, and role-dependent characteristics of the participants' non-verbal behavior.

Chapter 6 is dedicated to the algorithm proposed for estimation of the ordinal approach-avoidance labels using non-verbal features. In Section 6.1.1 we present the transformation of the ordinal regression problem to multiple binary classification problems. In Section 6.1.2 we present the cumulative logit logistic regression model with proportional odds (CLLRMP). The implications of the choice to fit CLLRMP to the classifier scores instead of the feature vectors are discussed in Section 6.1.3. The extension to time series data is presented in Section 6.1.4. We propose an ordinal regression method for estimation of approach-avoidance labels and its extension applicable to time series of ordinal data. In Section 6.1 we describe details of the proposed estimation methods. In Section 6.3.1 we describe details on feature extraction,
estimator training and evaluation methodology, and in Section 6.3.2 we present results on different evaluation experiments.

Chapter 2: Block sampling methods for multi-object tracking

Many dynamic processes can be perceived as multi-object motion and interaction phenomena. By the term object we do not refer only to physical objects, but to an abstract representation of an entity that we observe (including physical objects). For example, in the classical multi-target tracking tasks [56], targets move, enter and exit tracking regions. Thinking about a single object as a collection of a target's attributes of interest (e.g., location and velocity [36], appearance etc.), we get multiple objects that move, enter and exit the tracking space. The objects usually live in the same abstract space, and their interactions can be either direct, in that space (e.g., interactions between ants [33]), or indirect, in the observation space, which is a projection of the state space (e.g., occlusions in video). The observations in multi-object tracking applications include pixelized images from cameras and medical imaging devices, lidar, radar and sonar measurements, various measurements from multi-sensor networks etc.

Multi-object motion phenomena are usually represented by multi-object dynamic models that belong to the class of probabilistic, generative, discrete-in-time hidden Markov models (HMM). Introducing multi-object states and observations as collections of, respectively, unobservable and observable variables of interest, a probabilistic multi-object dynamic model is defined by: (1) domains of the multi-object states and observations; (2) a multi-object state transition distribution describing the relation between multi-object states in consecutive time frames; and (3) an observation likelihood function that describes the relation between multi-object states and observations in a single time frame.
What distinguishes the multi-object dynamic models from other models in the fairly general HMM class is that the number of objects present in the monitored space is unknown and varies with time. This variability is caused by the fact that objects can both enter and exit the tracking region or appear and disappear inside the tracking space. The number of objects can be represented either as the dimension of the multi-object state or as a parameter in the assumed parametric form of the multi-object state distributions. In the first case, multiple single-object states are concatenated to create the multi-object state in a higher dimensional space. In the second case, the number of objects is assumed to be equal to the number of components in the mixture representation of the multi-object state distributions. As a consequence, in the second case inference is performed using more complex distributions in the space with the lower single-object state dimension. In the case that the only goal is to evaluate prior and posterior multi-object state distributions, apart from differences in computational complexities and storage requirements of evaluation, the choice of the multi-object state representation is not of crucial importance. On the other hand, the choice of the multi-object state representation has a dominant influence on the sampling efficiency. Not all representations are equally appropriate for all inference methods; for example, vectors of variable dimension in the representation demand more careful handling in the Markov chain Monte Carlo sampling schemes [33] and possibly application of reversible jumps. It is possible to avoid technical problems related to the variable dimension of random vectors by fixing the number of single-object states in the multi-object state vector, i.e. the maximal number of objects present in a single frame.
A binary indicator vector with the length equal to this number is introduced and its indices mark sub-vectors in the multi-object state that correspond to the objects present in the tracking space.

While the multi-object state representations differ, all related multi-object state transition distributions must incorporate objects' birth and death-update processes (defining dynamics of new object appearances and existing object disappearances), single-object state transition distributions and potentially object interaction models.

The choice of the single-object state representation is dictated by the problem of interest. In a wide range of applications, from radar to video tracking, it is accepted practice to model the single-object state as a vector of the object's location in the tracking space and its derivatives (velocity, acceleration). These kinematics-inspired representations can be augmented to incorporate switching

The observation likelihoods used in multi-object dynamic models belong to two general categories, association-free and association dependent likelihood functions. The association-free likelihoods interpret each observation as a single entity, evaluating it conditioned on the full multi-object state. On the other hand, the association dependent likelihoods interpret observations as vectors of multiple concatenated single-object observations and clutter observations, defined by the object detection and the clutter generation processes.

A relation between observable and unobservable variables at any sampling instance is modeled by a likelihood function that can belong to one of two broad classes depending on whether, possibly after augmentation of the unobservable state with data association variables, it can be represented as a product of functions, each depending on a single object state. The first class of likelihood functions treats an observable state as a single entity and scores it conditioned on the full hidden state.
Observable states used for the second class of likelihood functions can be perceived as outputs from single-frame object detectors. These detectors are many-to-one mappings that transform one original observable state into the set of detections. Each detection preserves a part of the information related to the object location and appearance. As mentioned, in this case the unobservable state is augmented by the object-to-observation assignment variable. In the remaining part of the paper we intuitively refer to the first class of likelihood functions as association-free likelihoods and to the second one as association dependent likelihoods.

Both classes of likelihood models have their advantages and disadvantages. In general, association-free likelihoods are more difficult to learn. More training data is necessary to marginalize effects of variables that influence observations and are not part of the statistical model used. Also, evaluating the association-free likelihoods can be expensive since it is necessary to score the full observation and not only the detection regions. While the association dependent likelihoods are easier to evaluate, they are designed for augmented hidden states that contain object-to-observation association variables, which can increase the computational cost of inference. The association dependent likelihood functions can be factorized and evaluated as a product of multiple single-object and clutter likelihoods, conditioned on the multi-object states augmented with the data-association variables that define object-to-observation (or observation-to-object) mappings.

Figure 2.1: Proposed sequential block sampling algorithm

Multi-object tracking is the inference task in which the goal is to estimate the number and states of objects present in the tracking space from observations. Assuming that an observed multi-object motion phenomenon is formalized by the probabilistic multi-object dynamic model, multi-object tracking is a matter of statistical inference.
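The factorized form of the association dependent likelihood mentioned above can be made concrete with a short sketch. It assumes, purely for illustration, isotropic Gaussian single-object likelihoods centered at the object positions and uniform clutter on a region of known volume; the function names and the 1-based association convention (0 meaning "no observation") are hypothetical choices, and associations that share an observation are treated as impossible.

```python
import math

def log_gauss_iso(y, mean, sigma):
    """Log-density of an isotropic Gaussian N(y | mean, sigma^2 I)."""
    d = len(y)
    sq = sum((yi - mi) ** 2 for yi, mi in zip(y, mean))
    return -0.5 * sq / sigma ** 2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)

def association_log_likelihood(observations, positions, assoc, sigma, volume):
    """Sum of single-object terms plus uniform clutter terms, in the log domain.

    assoc[k] = j (1-based) if object k produced observation j, 0 otherwise.
    """
    used = [j for j in assoc if j > 0]
    if len(used) != len(set(used)):
        return float("-inf")  # two objects claim the same observation
    total = 0.0
    for k, j in enumerate(assoc):
        if j > 0:
            total += log_gauss_iso(observations[j - 1], positions[k], sigma)
    num_clutter = len(observations) - len(used)
    total -= num_clutter * math.log(volume)  # each clutter point is uniform
    return total

# Hypothetical frame: three detections, two objects, one clutter point.
obs = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
pos = [(0.1, 0.0), (5.0, 5.2)]
ll_right = association_log_likelihood(obs, pos, [1, 2], sigma=0.5, volume=100.0)
ll_swapped = association_log_likelihood(obs, pos, [2, 1], sigma=0.5, volume=100.0)
ll_shared = association_log_likelihood(obs, pos, [1, 1], sigma=0.5, volume=100.0)
```

Because the score is a sum of per-object terms once the association is fixed, changing the association of one object only changes one term, which is what makes this class of likelihoods cheap to evaluate.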
In this paper we propose three main contributions. First, we propose modifications of the classical model for multi-object tracking with the association dependent likelihood. In Figure 2.1 we present a flowchart of the proposed block sampling algorithm.

The chapter is organized as follows. In Section (2.1) we: (1) describe the classical model for multi-object motion phenomena (Section (2.1.1)); and (2) present proposed modifications addressing tracking of objects with limited maximal velocity in bounded observation spaces (Section (2.1.2)). In Section (2.2) we present: (1) a brief overview of existing methods for block processing for multi-object tracking; (2) a review of SMC and MCMC methods for sequential processing; and (3) the proposed method for sequential block sampling based on block proposal distributions (Section (2.2.2)).

In Section (2.3) we describe a method to construct block proposal distributions for the proposed block sampling model. First, we introduce the connectivity graph on all observations on the interval of interest (Section (2.3.1)). We describe the multi-frame association variable that introduces a partition of the connectivity graph into multiple simple paths in Section (2.3.2). After the multi-frame object-to-observation variable is known, the multi-object state estimation problem decouples into multiple independent single-object state estimation problems. We present a Gibbs sampling method that allows us to efficiently sample multivariate truncated Gaussian distributions in Section (2.3.3). In Section (2.3.5) we describe the proposed variational method to approximate the original multi-object state posterior distribution with a parameterized distribution based on the parametrization of the multi-frame association variable (Section (2.3.4)) by means of minimization of the inclusive Kullback-Leibler divergence.
We discuss relations of the proposed method to existing algorithms in Section (2.3.7).

In Section (2.4) we present evaluation results for the proposed algorithm and two baseline algorithms in terms of the optimal subpattern assignment metric (Section (2.4.2)). We test algorithms on two synthetic datasets, one based on the association dependent likelihood (Section (2.4.3.1)) and the other based on the thresholded Rayleigh GMTI likelihood model.

2.1 Multi-object state space models

In this section we introduce a statistical state space model for tracking an unknown and variable number of objects in the presence of kinematic constraints. This model represents a modification of the classical state space model [56] used in the multi-object tracking literature, which assumes linear Gaussian independent single-object state and observation equations when object-to-observation (or observation-to-object) associations are known. The original classical model assumes a constant number of objects in the tracking space [36], while its extensions introduce object birth-and-death models that are independent of objects' states.

The proposed model is motivated by a spectrum of tracking scenarios where objects are tracked in bounded spaces, objects' kinematics (e.g. velocities) are constrained and an unknown number of objects present in the tracking space varies with time. Formally, we assume the following properties of the state equation: (a) bounded single-object state domains; (b) truncated Gaussian single-object state transition distributions; and (c) state dependent object disappearance probabilities. The model supports applications with either only an object-to-observation association dependent likelihood function or both association dependent and association-free likelihoods. Association dependent likelihoods make the model applicable to all tasks where it is possible to train and use object detectors. When available,
an additional association-free likelihood allows us to exploit information contained in observations that is not addressed by the design of object detectors' outputs.

Let us first introduce the notation and define variables in the proposed model. We assume a discrete time state-space model and index time instances by $t \in \mathbb{N}_0$.

We define the multi-object state as a pair of unobservable variables, a binary vector of object existence indicators $e_t$ and a vector obtained by concatenation of single-object states $x_t$. The binary object existence indicators are introduced to simplify the treatment of variability in the number of objects in the tracking space and allow all vectors in the model to have the same fixed length. The dimension $K$ of the vector $e_t = (e_{t,k})_{k=1}^{K} \in \{0,1\}^K$ indicates the maximal allowed number of objects in the tracking space, and its coordinates take value one (zero) when the corresponding objects exist (do not exist). Therefore, this representation is applicable when the number of objects in the tracking space, $K_t = \sum_{k=1}^{K} e_{t,k}$, does not exceed $K$ at any time. The single-object state $x_{t,k}$ is a vector obtained by concatenation of the object's position and velocity vectors that belong respectively to spaces $\mathcal{X}^p \subseteq \mathbb{R}^d$ and $\mathcal{X}^v \subseteq \mathbb{R}^d$. In other words, we concatenate vectors $x_{t,k} \in \mathcal{X}$ ($k = 1, \ldots, K$) to get $x_t = (x_{t,k})_{k=1}^{K}$, where $\mathcal{X} = \mathcal{X}^p \times \mathcal{X}^v$. For each index $k \in \{1, \ldots, K\}$ corresponding to an object in the tracking space ($e_{t,k} = 1$), the vector $x_{t,k}$ represents a single-object state $x_{t,k} = (x^p_{t,k}, x^v_{t,k})$, and for each $k$ for which there is no object ($e_{t,k} = 0$), the vector $x_{t,k}$ serves as a place-holder for an absent object and is equal to (an arbitrary) $x_0 \in \mathcal{X}$.

We denote observations used in the association-free likelihood function as $z_t$ and observations used in the object-to-observation association dependent likelihood as $y_t$. Each observation $z_t$ ($t \in \mathbb{N}_0$) belongs to the space $\mathcal{Z}$.
Observations $y_t$ are obtained by concatenation of $M_t \in \mathbb{N}_0$ vectors, $y_t = (y_{t,j})_{j=1}^{M_t}$, where each $y_{t,j}$ ($j = 1, \ldots, M_t$) belongs to the single-object observation space $\mathcal{Y} \subseteq \mathbb{R}^d$. Therefore, the multi-object observation space $\mathcal{Y}_t = \mathcal{Y}^{M_t}$ corresponding to the association dependent observations has a time-varying dimension $M_t d$.

We introduce two additional vectors, the object detection $d_t = (d_{t,k})_{k=1}^{K}$ and the object-to-observation association $a_t = (a_{t,k})_{k=1}^{K}$. Each element $d_{t,k}$ of the binary object detection vector that does not correspond to an existing object ($e_{t,k} = 0$) is zero, while the elements corresponding to existing objects ($e_{t,k} = 1$) can be either one or zero depending on whether the object is detected or not. If an existing object is detected ($e_{t,k} = d_{t,k} = 1$), the corresponding element in the object-to-observation association vector, $a_{t,k} = j \in \{1, \ldots, M_t\}$, denotes that the object generated observation $y_{t,j}$. Elements of the object-to-observation association vector that correspond to non-detected existing and non-existing objects are zero. Various additional problem dependent constraints can be posed on $a_t$ [61]; in particular, we assume that different objects cannot be assigned to the same observation.

2.1.1 Classical model

Let us first describe the multi-object state equation and observation equation of the classical model with an association dependent likelihood, i.e., multi-object state and observation domains, the multi-object state transition distribution and the observation likelihood function.

The model assumes a bounded tracking space $\mathcal{Y} \subset \mathbb{R}^d$, an unbounded single-object state domain $\mathbb{R}^{2d}$ and the single-object observation domain identical to the single-object position domain $\mathbb{R}^d$.
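The fixed-length bookkeeping introduced with the existence, detection and association vectors at the start of Section 2.1 can be checked mechanically. The sketch below is only an illustration with hypothetical function names: it encodes the stated constraints (only existing objects can be detected, detected objects point to a valid observation index, all other entries of the association vector are zero, and no two objects share an observation) and derives the object and clutter counts from the vectors.

```python
def is_consistent(e, d, a, num_obs):
    """Check the constraints tying existence e, detection d and association a."""
    if not (len(e) == len(d) == len(a)):
        return False
    for ek, dk, ak in zip(e, d, a):
        if dk == 1 and ek == 0:
            return False  # only existing objects can be detected
        if dk == 1 and not (1 <= ak <= num_obs):
            return False  # a detected object must point at a real observation
        if dk == 0 and ak != 0:
            return False  # undetected or absent objects have a_k = 0
    assigned = [ak for ak in a if ak > 0]
    return len(assigned) == len(set(assigned))  # no shared observations

def counts(e, d, num_obs):
    """Number of existing objects, detected objects and clutter observations."""
    num_existing = sum(e)
    num_detected = sum(d)
    num_clutter = num_obs - num_detected
    return num_existing, num_detected, num_clutter

# Hypothetical frame with K = 4 slots and M_t = 3 observations.
e = [1, 0, 1, 1]
d = [1, 0, 0, 1]
a = [2, 0, 0, 1]
ok = is_consistent(e, d, a, num_obs=3)
summary = counts(e, d, num_obs=3)
```

Keeping all three vectors at the same length K means that a sampler can propose changes to existence, detection or association for one slot without resizing any state vector.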
The multi-object state transition distribution is defined by three processes: (1) the object birth process generating new objects in each time frame, where the number of new objects has the Poisson distribution with the expectation equal to $\lambda_B V(\mathcal{Y})$ ($\lambda_B$ is a birth rate, $V(\mathcal{Y})$ is the tracking space volume); (2) the Bernoulli death-update process, where in each time frame an existing object can either disappear with the constant probability $p_D$ or continue to exist with the probability $1 - p_D$; (3) the object state transition process, where the updated single-object states are obtained using the almost constant velocity Gaussian transition model and the new single-object states are distributed uniformly in $\mathcal{Y} \times [-v_{\max}, v_{\max}]^d$ ($v_{\max}$ is the maximal velocity in any space axis direction).

The augmented multi-object state transition distribution is defined by three additional processes: (1) the Bernoulli single-object detection process, where each existing object creates an observation with the probability $p_d$ and does not get detected with the probability $1 - p_d$; (2) the Poisson clutter generation process, where the number of observations that do not represent object detections has the Poisson distribution with the expectation equal to $\lambda_{fa} V(\mathcal{Y})$ ($\lambda_{fa}$ is a false detection rate, $V(\mathcal{Y})$ is the tracking space volume); (3) the object-to-observation association process, where the total number of observations is the sum of the number of detected objects and the number of clutter observations, and the object-to-observation association variable defines which observations are generated by the detected objects and which represent clutter.

Finally, the observation likelihood function can be factorized into the product of single-observation likelihoods. Likelihoods for the observations generated by the detected objects are Gaussian distributions with the mean equal to the object's position, and the likelihoods of the clutter observations are uniform distributions on $\mathcal{Y}$.

A few inconsistencies in the classical model are obvious.
The object existence indicators are updated by the death-update process before the new positions and velocities for updated objects are generated using (infinite support) Gaussian distributions. Therefore, an updated object can leave the bounded tracking space without setting its existence indicator to zero. The same holds for the observation generation process described by the Gaussian likelihood distribution: a detected object can generate an observation outside the bounded tracking region. While it is possible to use sampling with rejection to address these issues, the evaluation of probabilities in the classical model is based on Gaussian assumptions that fail if rejection is introduced. Finally, it is unnatural to have the probability of death (and update) independent of objects' locations, since objects close to the boundary of the tracking space are more likely to leave it than objects far from it.

Let us describe in more detail the modifications proposed to address inconsistencies of the classical model. In order to make the notation more compact we denote the collection of unobservable variables that constitute the multi-object state as $s_t = (e_t, x_t)$ and the collection of unobservable variables that augment the multi-object state as $s^a_t = (d_t, a_t)$. In order to analyze distributions on the interval $[1, T]$ we introduce shorthand notations $S^a$, $S$, $M$, $Y$ and $Z$ for variable sequences $s^a_{1:T}$, $s_{1:T}$, $M_{1:T}$, $y_{1:T}$ and $z_{1:T}$ respectively.

2.1.2 Proposed model of multi-object dynamics

We factorize the joint distribution of all unobservable modeling variables $S^a$ and $S$ and observable variables $M$ and $Y$ conditioned on the multi-object state at $t = 0$ as:

$$
\begin{aligned}
p(S^a, S, M, Y \mid s_0) = \prod_{t=1}^{T} \; & p(s_t \mid s_{t-1}) && \text{(2.1a)} \\
\cdot \; & p(s^a_t, M_t \mid s_t) && \text{(2.1b)} \\
\cdot \; & p(y_t \mid s^a_t, s_t, M_t). && \text{(2.1c)}
\end{aligned}
$$

In order to factorize the joint distribution of multi-object states $S$ and observations $Z$ we must perform summation over all values of the auxiliary variable $M$.
Since we obtain $M_t$ and $y_t$ using the deterministic detectors $M_t = m(z_t)$ and $y_t = y(z_t)$, the summation has a single non-zero term. The distributions of the multi-object state transitions (2.1a) and the multi-object state augmentation variables conditioned on the multi-object states for $M_t = m(z_t)$ (2.1b) remain the same, while the multi-object state likelihood (2.1c) is the problem dependent distribution $p(z_t \mid e_t, x_t)$. It is possible to integrate out all augmentation variables, leading to the factorization (2.2) of the joint distribution of multi-object states $S$ and observations $Z$:

$$p(S, Z \mid s_0) = \prod_{t=1}^{T} p(z_t \mid s_t)\, p(s_t \mid s_{t-1}) \qquad \text{(2.2)}$$

The graphical model corresponding to factorizations (2.1a)-(2.1c) and (2.2) is presented in Figure (2.2).

Figure 2.2: Proposed statistical model of multi-object dynamics

2.1.2.1 Transition model for augmented multi-object states

Factors (2.1a) denote multi-object state transitions and can be further factorized as:

$$p(s_t \mid s_{t-1}) = p(e_t \mid e_{t-1}, x_{t-1})\; p(x_t \mid x_{t-1}, e_t, e_{t-1}).$$

The term describing objects' kinematics can be represented as a product of terms that correspond to updates of existing objects (2.3a) and terms that correspond to newborn objects (2.3b):

$$
\begin{aligned}
p(x_t \mid x_{t-1}, e_t, e_{t-1}) = \; & \prod_{k:\; e_{t,k}=1,\; e_{t-1,k}=1} p(x_{t,k} \mid x_{t-1,k}) && \text{(2.3a)} \\
\cdot \; & \prod_{k:\; e_{t,k}=1,\; e_{t-1,k}=0} p_0(x_{t,k}). && \text{(2.3b)}
\end{aligned}
$$

The single-object state distribution for the newborn objects (2.3b) is uniform on $\mathcal{X} = \mathcal{X}^p \times \mathcal{X}^v$, and the single-object state transitions for the updated objects (2.3a) are truncated Gaussian distributions based on the near constant velocity model:

$$
p(x_{t,k} \mid x_{t-1,k}) =
\begin{cases}
\mathcal{N}(x_{t,k} \mid \ldots), & x_{t,k} \in \mathcal{X} \\
0, & \text{otherwise}
\end{cases}
$$

with $2d$-dimensional square mean transformation and covariance matrices [...] and where $\tau_k = t_k - t_{k-1}$ denotes the time interval between the sampling instances $k$ and $k-1$ and $\sigma_x$ denotes acceleration variance.
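A truncated Gaussian near constant velocity transition of this kind can be sampled in several ways. The sketch below uses plain rejection sampling, which is not the Gibbs sampler that the chapter develops in Section (2.3.3), and it assumes a common discrete white-noise acceleration form of the near constant velocity model with a box-shaped state domain; these are assumptions for illustration, as are all names (`sample_truncated_ncv`, `pos_bounds`, `v_max`, `sigma_acc`).

```python
import random

def sample_truncated_ncv(pos, vel, tau, sigma_acc, pos_bounds, v_max, rng,
                         max_tries=10000):
    """Draw (position, velocity) from a near constant velocity Gaussian
    restricted to a bounded position box and a velocity limit.

    Assumed motion model per axis (white-noise acceleration acc ~ N(0, sigma_acc^2)):
        new_pos = pos + tau * vel + 0.5 * tau^2 * acc
        new_vel = vel + tau * acc
    Proposals outside the bounded domain are rejected and redrawn.
    """
    for _ in range(max_tries):
        new_pos, new_vel = [], []
        for p, v in zip(pos, vel):
            acc = rng.gauss(0.0, sigma_acc)
            new_pos.append(p + tau * v + 0.5 * tau ** 2 * acc)
            new_vel.append(v + tau * acc)
        inside = all(lo <= p <= hi for p, (lo, hi) in zip(new_pos, pos_bounds))
        slow = all(abs(v) <= v_max for v in new_vel)
        if inside and slow:
            return new_pos, new_vel
    raise RuntimeError("no sample accepted; the truncation region has very low mass")

# Object close to the right edge of a 10 x 10 region and moving towards it.
rng = random.Random(0)
samples = [
    sample_truncated_ncv(pos=[9.5, 5.0], vel=[1.0, 0.0], tau=1.0, sigma_acc=1.0,
                         pos_bounds=[(0.0, 10.0), (0.0, 10.0)], v_max=2.0, rng=rng)
    for _ in range(200)
]
```

Rejection becomes inefficient when most of the untruncated Gaussian mass lies outside the domain, as for the object near the boundary here, which is one motivation for a dedicated truncated Gaussian sampler.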
The transition distribution of the existence indicators is:

$$
p(e_t \mid e_{t-1}, x_{t-1}) = \mathrm{Poiss}\Big(\sum_{k=1}^{K} \mathbb{I}\big(e_{t,k}(1 - e_{t-1,k})\big) \,\Big|\, \lambda_B V(\mathcal{Y})\Big) \cdot \prod_{k:\; e_{t-1,k}=1,\; e_{t,k}=0} p_D(x_{t-1,k}) \prod_{k:\; e_{t-1,k}=1,\; e_{t,k}=1} \big(1 - p_D(x_{t-1,k})\big),
$$

where the state dependent single-object death probability $p_D(x_{t-1,k})$ is defined through the survival probability $1 - p_D(x_{t-1,k}) = (1 - p_D)\Pr(x_{t,k} \in \mathcal{Y} \mid x_{t-1,k})$. This probability must be evaluated numerically by integration over $\mathcal{X}^p$, and we avoid an increase in the numerical complexity of the algorithm by precomputing it for different values of quantized positions and velocities. In general, it is possible to extend the same approach and construct an application dependent object termination probability $p_D(x_{t-1,k})$.

Factors (2.1b) describe dependencies between the multi-object state augmentation variables and the multi-object state variables:

$$p(s^a_t, M_t \mid s_t) = \prod_{k=1}^{K} p(a_{t,k} \mid e_{t,k}, M_{t,k})\, p(M_{t,k} \mid d_{t,k}, e_{t,k})\, p(d_{t,k} \mid e_{t,k}).$$

The number of existing objects is $K_t = \sum_{k=1}^{K} e_{t,k}$ and the number of detected objects is $M^d_t = \sum_{k=1}^{K} d_{t,k}$, where the probability of detection for each existing object is $p_d$. All observations $y_{t,j}$ ($j = 1, \ldots, M_t$) that do not belong to detected objects represent false detections, and the number of false detections $M^{fa}_t = M_t - M^d_t$ has the Poisson distribution with the parameter $\lambda_{fa} V(\mathcal{Y})$. The distributions of the multi-object state augmentation variables are:

$$
\begin{aligned}
p(a_t \mid e_t, M_t) &= \frac{M^{fa}_t!}{M_t!}, \\
p(M_t \mid d_t, e_t) &= \mathrm{Poiss}\big(M^{fa}_t \mid \lambda_{fa} V(\mathcal{Y})\big) = \mathrm{Poiss}\Big(M_t - \sum_{k=1}^{K} d_{t,k} \,\Big|\, \lambda_{fa} V(\mathcal{Y})\Big), \\
p(d_t \mid e_t) &= p_d^{M^d_t} (1 - p_d)^{K_t - M^d_t}.
\end{aligned}
$$

2.1.2.2 Observation likelihood models

In general, the association-free likelihood $p(z_t \mid x_t, e_t)$ cannot be further factorized, while conditioning on the data-association variable $a_t$ allows factorization of the association based likelihood $p(y_t \mid x_t, e_t, a_t)$.
Assuming the number of false detections $M^{fa}_t$ and the number of detected objects $M^d_t$ defined in Section (2.1.2.1), the association-based likelihood can be expressed as a product over observations [equation not legible in the source], where the single-object likelihood is the truncated Gaussian distribution:

$$p(y_{t,a_{t,k}} \mid x_{t,k}) = \begin{cases} \mathcal{N}(y_{t,a_{t,k}} \mid x^p_{t,k}, \Sigma_y), & y_{t,a_{t,k}} \in \mathcal{Y}(x^p_{t,k}) \cap \mathcal{Y} \\ 0, & \text{otherwise} \end{cases}$$

with $\Sigma_y$ denoting a fixed covariance matrix and $\mathcal{Y}(x^p_{t,k}) \subseteq \mathbb{R}^d$ denoting a bounded convex region with a fixed size and shape centered at $x^p_{t,k}$.

2.2 Probabilistic inference in models of multi-object dynamics

In this section we address approximate inference for the non-Gaussian model of multi-object dynamics described in Section (2.1.2). We propose an inference algorithm based on sequential Markov chain Monte Carlo block sampling. The algorithm is motivated by the idea to improve sampling efficiency using block proposal distributions based on longer sequences of observations.

2.2.1 Related methods

Models of multi-object dynamics are non-Gaussian. Even when the single-object transitions and association-dependent likelihoods are Gaussian distributions, the multi-object state transition is non-Gaussian due to the object birth and death-update processes. Additionally, association-free likelihoods are often non-Gaussian and, as in the model proposed in Section (2.1.2), the single-object transitions and association-dependent likelihoods can be non-Gaussian. In problems with a moderate to large number of objects, the dimension of the multi-object state space prohibits application of numerical approximation methods, and it is necessary to use approximate inference algorithms belonging either to the class of sampling or to the class of optimization approaches. The model's divergence from the conjugate exponential model structure makes variational inference methods difficult to apply. That leaves sampling methods, in particular sequential Monte Carlo (SMC) [61][36] and MCMC [32][33], as inference methods of choice.
However, approximate smoothing by sampling becomes too computationally expensive for large $T$. Therefore, it is necessary to break these problems into a sequence of manageable inference problems, i.e. to use sample approximations obtained for smaller $T$ to construct sample approximations for larger $T$. We start this section with an overview of related sampling and optimization algorithms for sequential, block and block-sequential inference in models of multi-object dynamics.

2.2.1.1 Sampling importance resampling with MCMC moves

A class of inference methods based on sampling avoids complications related to smoothing by replacing it with a sequence of filtering problems. The goal in these methods is to use the sample approximation of $p(x_\tau, e_\tau \mid z_{1:\tau})$ for $\tau = t - 1$ to obtain the sample approximation for $\tau = t$. This class of one-step sequential methods can be implemented both in the SMC (importance sampling-resampling) [16] and in the MCMC [33] framework. The resulting methods, for detection-based observations, follow from the Chapman-Kolmogorov equation:

$$p(x_{t-1:t}, e_{t-1:t} \mid z_{1:t}) \propto \frac{p(z_t \mid x_t, e_t)\, p(x_t, e_t \mid x_{t-1}, e_{t-1})}{q(x_t, e_t \mid x_{t-1}, e_{t-1}, z_t)}\; q(x_t, e_t \mid x_{t-1}, e_{t-1}, z_t)\, p(x_{t-1}, e_{t-1} \mid z_{1:t-1}),$$

where $q(x_t, e_t \mid x_{t-1}, e_{t-1}, z_t)$ is the proposal distribution. The only formal prerequisite is that the support of $q$ dominates the support of the posterior distribution of interest. However, a good choice of the proposal distribution is crucial for the performance of sampling algorithms. While it is possible to use the multi-object state transition prior $p(x_t, e_t \mid x_{t-1}, e_{t-1})$ as the proposal distribution, it should not be the method of choice [16][17]. Namely, sampling from the multi-object state transition prior does not necessarily place samples in the regions of high posterior probability. This problem is particularly pronounced when sampling from the multi-object existence prior, since target appearances and disappearances are sampled without taking observations into account.
Therefore, many samples end up away from the high probability regions of the posterior distribution, and these samples do not contribute efficiently to a good posterior approximation. Getting many unimportant samples leads to the sample impoverishment phenomenon, i.e. no matter whether SMC or MCMC is used, most samples will be rejected and the effective sample number will be small. The same holds for other inappropriate proposal distributions. An intuitive rule is that the proposal distribution should represent a reasonably good approximation of the posterior distribution, with possibly heavier tails.

Assuming that the distribution $p(x_{t-1}, e_{t-1} \mid z_{1:t-1})$ is approximated by its $N$ weighted samples, $p(x_{t-1}, e_{t-1} \mid z_{1:t-1}) \approx \sum_{n=1}^{N} w^n_{t-1}\, \delta_{(x^n_{t-1}, e^n_{t-1})}$, the sampling importance resampling method is summarized in Algorithm 2.3. The resampling step is introduced to control the degeneracy [16] of the sample set. The resampling can be applied only if the degeneracy measure of the weighted sample set becomes too large.

    for n = 1 to N (all particles) do
        SAMPLE $x^n_{t-1}, e^n_{t-1} \sim p(x_{t-1}, e_{t-1} \mid z_{1:t-1})$
        SAMPLE $x^n_t, e^n_t \sim q(x_t, e_t \mid x^n_{t-1}, e^n_{t-1}, z_t)$
        COMPUTE WEIGHT $w^n_t \propto w^n_{t-1}\, \dfrac{p(z_t \mid x^n_t, e^n_t)\, p(x^n_t, e^n_t \mid x^n_{t-1}, e^n_{t-1})}{q(x^n_t, e^n_t \mid x^n_{t-1}, e^n_{t-1}, z_t)}$
    end for
    RESAMPLE $\{(x^n_t, e^n_t)\}_{n=1}^{N} \sim \text{Multi}(w^1_t, \ldots, w^N_t)$

Figure 2.3: Sampling Importance Resampling

    for n = 1 to N (all particles) do
        SAMPLE $x^n_{t-1}, e^n_{t-1} \sim p(x_{t-1}, e_{t-1} \mid z_{1:t-1})$
        SAMPLE $x^{n,0}_t, e^{n,0}_t \sim q(x_t, e_t \mid x^n_{t-1}, e^n_{t-1}, z_t)$
        for m = 1 to $N_{total}$ do
            SAMPLE $x'_t, e'_t \sim q(x_t, e_t \mid x^n_{t-1}, e^n_{t-1}, z_t)$
            CALCULATE acceptance ratio:
                $a = \min\left(1,\; \dfrac{p(z_t \mid x'_t, e'_t)\, p(x'_t, e'_t \mid x^n_{t-1}, e^n_{t-1})\, q(x^{n,m-1}_t, e^{n,m-1}_t \mid x^n_{t-1}, e^n_{t-1}, z_t)}{p(z_t \mid x^{n,m-1}_t, e^{n,m-1}_t)\, p(x^{n,m-1}_t, e^{n,m-1}_t \mid x^n_{t-1}, e^n_{t-1})\, q(x'_t, e'_t \mid x^n_{t-1}, e^n_{t-1}, z_t)}\right)$
            DRAW $u \sim \text{Unif}[0, 1]$
            if $u < a$ then ACCEPT sample: $(x^{n,m}_t, e^{n,m}_t) = (x'_t, e'_t)$
            else REJECT sample: $(x^{n,m}_t, e^{n,m}_t) = (x^{n,m-1}_t, e^{n,m-1}_t)$
        end for
    end for

Figure 2.4: Sequential independence Metropolis-Hastings sampler

If the resampling is applied at each sampling instance, all sample weights are equal. An MCMC equivalent, an independence Metropolis-Hastings sampler, is presented in Algorithm 2.4.

It is difficult to construct a good proposal distribution $q(x_t, e_t \mid x_{t-1}, e_{t-1}, z_t)$ based on the joint observations $z_t$. In most multi-object tracking studies based on models with the joint observation likelihood this issue is avoided by using the multi-object state transition prior as the proposal distribution. In models with detection-based observations, heuristic proposals based on the current observation are used (e.g. [45]). Although this can be satisfactory for some scenarios, in general it is not enough. First, in the presence of a high false alarm rate there are many candidates for new births and the number of true objects tends to be overestimated. Once accepted, a newborn object has a low probability of termination and stays active even if its birth was initiated by a false alarm.

An option to address this problem is to apply MCMC moves [3] which rejuvenate the sequence of outputs from the SIR algorithm. Even if a longer sample sequence (e.g. of length $L$) is rejuvenated using MCMC steps, this is done by a transition kernel (Equation 2.4) that represents a product of proposal distributions $q(x_\tau, e_\tau \mid x_{\tau-1}, e_{\tau-1}, z_\tau)$ ($\tau = t - L + 1, \ldots, t$) that use one observation at a time:

$$K_{MCMC}(\cdot, \cdot) = \ldots \qquad (2.4)$$

While addressing the sample impoverishment problem, this approach fails to incorporate long-term multi-object state dependencies.
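The two one-step schemes of Figures 2.3 and 2.4 can be sketched generically as follows. This is a minimal illustration, not the implementation used in the thesis: the state is an arbitrary Python object, and the callables `propose`, `log_target` (standing for $\log p(z_t \mid x_t, e_t) + \log p(x_t, e_t \mid x_{t-1}, e_{t-1})$) and `log_proposal` are hypothetical placeholders supplied by the user.

```python
import math
import random


def sir_step(particles, weights, propose, log_target, log_proposal, rng):
    """One sampling importance resampling step (weights must be positive)."""
    proposed, log_w = [], []
    for prev, w in zip(particles, weights):
        new = propose(prev, rng)
        proposed.append(new)
        # Weight update: previous weight times target over proposal.
        log_w.append(math.log(w) + log_target(new, prev) - log_proposal(new, prev))
    top = max(log_w)
    unnorm = [math.exp(v - top) for v in log_w]  # subtract max for stability
    total = sum(unnorm)
    norm_w = [v / total for v in unnorm]
    # Multinomial resampling; afterwards all weights are equal.
    resampled = rng.choices(proposed, weights=norm_w, k=len(proposed))
    return resampled, [1.0 / len(resampled)] * len(resampled)


def independence_mh(prev, propose, log_target, log_proposal, n_steps, rng):
    """Independence Metropolis-Hastings chain for one particle history `prev`."""
    current = propose(prev, rng)
    chain, n_accepted = [], 0
    for _ in range(n_steps):
        cand = propose(prev, rng)
        log_a = (log_target(cand, prev) - log_proposal(cand, prev)) - (
            log_target(current, prev) - log_proposal(current, prev))
        if rng.random() < math.exp(min(0.0, log_a)):
            current = cand
            n_accepted += 1
        chain.append(current)
    return chain, n_accepted
```

When the proposal coincides with the target, every candidate is accepted, which gives a quick sanity check of the acceptance ratio.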
2.2.2 Proposed algorithm

As noted in the previous section, it can be beneficial to use a longer observation history for construction of the proposal distribution. In this section we propose an MCMC sequential block sampling scheme.

Let us assume that it is possible to perform block sampling on intervals of length $T$. Our goal is to sample $p((x, e)_{1:\tau} \mid z_{1:\tau})$ for arbitrary $\tau$ using the block sampling approach. Assuming that it is possible to sample from $p((x, e)_{1:t} \mid z_{1:t})$, we do not necessarily want to generate samples from $p((x, e)_{1:t+1} \mid z_{1:t+1})$ in the following step. We allow for steps of length $L \geq 1$, and our goal is to generate samples from $p((x, e)_{1:t-L+T} \mid z_{1:t-L+T})$ using the block proposal distribution $q((x, e)_{t-L+1:t-L+T} \mid (x, e)_{t-L}, z_{t-L+1:t-L+T})$ (see Figure 2.5).

[Figure 2.5: Sequential block MCMC sampling]

The posterior distribution $p((x, e)_{1:t-L+T} \mid z_{1:t-L+T})$ can be factorized, up to a ratio of normalization constants whose indices are not legible in the source, as:

$$p((x, e)_{1:t-L+T} \mid z_{1:t-L+T}) = \cdots\; p(z_{t-L+1:t-L+T} \mid (x, e)_{t-L+1:t-L+T})\, p((x, e)_{t-L+1:t-L+T} \mid (x, e)_{t-L})\, p((x, e)_{1:t-L} \mid z_{1:t}),$$

where the normalization constant $Z_{\tau_1:\tau_2}$ is equal to $p(z_{\tau_1:\tau_2})$. Since we assumed that it is possible to sample from $p((x, e)_{1:t} \mid z_{1:t})$, it trivially follows that these samples can also be used as samples from $p((x, e)_{1:t-L} \mid z_{1:t})$ by discarding $(x, e)_{t-L+1:t}$. This augmentation of the [...] However, in order to evaluate $p((x, e)_{1:t-L} \mid z_{1:t})$ it is necessary to integrate out the discarded variables. In order to avoid evaluation of this integral we propose to sample from the distribution:

$$q((x, e)_{1:t-L+T} \mid z_{1:t-L+T}) = p((x, e)_{1:t-L} \mid z_{1:t})\, q((x, e)_{t-L+1:t-L+T} \mid (x, e)_{t-L}, z_{t-L+1:t-L+T}).$$

As noted before, we used $(X, E)$ and $Z$ as the unobservable and observable model variables in all expressions just for notational simplicity. All equalities hold if we use $(X, E, A)$ as unobservable (i.e. state) variables and the detection-based observations $(M, Y)$.
Therefore, the independence Metropolis-Hastings sampler can be used for inference both for joint and detection-based likelihoods. Also, the presented independence Metropolis-Hastings sampler has its SMC equivalent, and both algorithms are closely related to the block SMC algorithm [15].

2.3 Block proposal distributions

The statistical model we proposed in Section (2.1) is generative, and observations are created in a bottom-up manner. At each time frame we first update the previous multi-object state using the birth and death-update models, and then generate observations using the chosen likelihood distribution. In particular, for the association-dependent observation model we first generate the data detection variable, then the object-to-observation variable, the number of false detections and, at the end, observations for all detected objects and false detections. In this model all object-to-observation variables are defined at the single frame level and are conditionally independent given the multi-object state sequence. Therefore, in order to construct a proposal distribution that uses observations in multiple consecutive time frames we need to solve multiple data association problems for each frame independently. This makes the model inappropriate for construction of block proposal distributions and motivates us to propose an alternative representation that supports multi-frame object-to-observation assignments.

In this section we introduce the directed connectivity graph, with vertices that include all observations in multiple consecutive frames. The edge set contains links between all vertex pairs whose corresponding observations could be generated by the same object, taking into account the time and space distance between observations and the limit on object velocities. Each path that connects observations generated by a single object defines a multi-frame object-to-observation assignment.
Further, we propose a parametric distribution that approximates the posterior augmented multi-object state distribution in the original model and allows us to sample paths on the connectivity graph. We use this distribution to sample multiple non-intersecting paths that represent multiple object-to-observation assignments and define a partition of the connectivity graph. The obtained multi-frame assignment allows us to sample objects' existence indicators and states independently. In particular, we use an efficient Gibbs sampling technique to generate samples from truncated multivariate Gaussian distributions [35].

The key point of the described block sampling procedure is learning the graph parameter values that lead to good multi-frame assignments. For this purpose we propose a variational algorithm that iteratively updates parameters of the approximating distribution to minimize the Kullback-Leibler divergence between the original posterior distribution and the parametric approximation.

Our method is related to DDMCMC [44] and CEDA [58], which are also based on the connectivity graph idea. Besides the fact that DDMCMC and CEDA are presented as batch rather than sequential processing methods, they differ from the method we propose in the formulation of the processing goals. DDMCMC samples the original posterior distribution using a fixed path-transforming transition kernel, and the CEDA method searches for the high probability regions in the posterior distribution, while our method approximates the posterior by an easy-to-sample distribution. Additionally, we address models with non-Gaussian object state transitions and likelihood distributions, while DDMCMC and CEDA are presented only for Gaussian distributions. We provide a brief overview of the DDMCMC and CEDA methods in Section (2.3.7).

In order to simplify notation we write all expressions for the proposal distribution $q(E, X, D, A \mid Y, M, x_0, e_0)$ on the interval $[1, T]$.
Proposal distributions for all other intervals of length $T$ have the same functional forms with shifted time indices.

2.3.1 Connectivity graph in observation-time space

As the first step towards sampling multi-frame data associations, let us define the directed connectivity graph $G = G(y_{1:T}, x_0, e_0)$ on the interval $[1, T]$ (Figure (2.6)). Vertices of this graph belong to $T + 1$ time layers with time indices $\tau = 0, \ldots, T$. Vertices in the layer $\tau = 0$ correspond to positions $x^p_{0,k}$ of the $K_0$ objects present at $\tau = 0$. Vertices in layer $\tau \in \{1, \ldots, T\}$ correspond to observations $\{y_{\tau,j} : j = 1, \ldots, M_\tau\}$. Let us denote the vertex set as $V = \{v_k : k = 1, \ldots, |V|\}$, where the time layer of the vertex $v_k$ is $\tau(v_k)$. We draw an edge from vertex $v_i$ to vertex $v_j$ if three conditions hold: the distance between the corresponding observations in the observation space is smaller than $(\tau(v_j) - \tau(v_i))\, V_{max}$; the vertices belong to time layers no further apart than $\Delta T$ ($\tau(v_j) - \tau(v_i) \leq \Delta T$); and the vertices do not belong to the same time layer ($\tau(v_j) \neq \tau(v_i)$).

[Figure 2.6: Connectivity graph $G(Y, x_0, e_0) = (V, E)$]

Intuitively, edges are drawn between vertices that can represent consecutive observations generated by the same object. The maximal distance in time $\Delta T$ is introduced to prune the connectivity graph by excluding edges of low probability. Namely, the probability that an existing object will not be detected in $\Delta T$ consecutive frames decays as the power $(1 - p_d)^{\Delta T}$.

2.3.2 Multi-frame object-to-observation associations

Let us introduce a variable $\Pi = (\Pi_0, \Pi_1, \ldots, \Pi_P)$ that defines a good partition on the connectivity graph. We say that a partition is good if the sets $\Pi_p$ ($p = 1, \ldots, P$) contain non-intersecting simply connected paths and $\Pi_0$ is the set of isolated graph vertices. A good partition defines a multi-frame object-to-observation assignment for detected objects. Namely, vertices on each simple path $\Pi_p$ ($p = 1, \ldots, P$) correspond to observations generated by a single object, while isolated vertices in $\Pi_0$ correspond to false detections. An example of the graph partition is presented in Figure (2.7).

As we discussed earlier, the original posterior augmented multi-object state distribution based on the generative model proposed in Section (2.1.2) is difficult to sample, and we can evaluate it only up to an unknown normalization constant. Our goal is to approximate it by a parameterized distribution that we can both sample and evaluate more easily. For this purpose we introduce a parameterized discriminative model, presented in Figure (2.8), that is based on the described partition variable.

[Figure 2.7: Graph partition into multiple simple paths and isolated nodes.]

[Figure 2.8: Multi-frame object-to-observation assignment. (Left: original model; Right: alternative model factorization)]

Since any good partition $\Pi$ contains exactly the same information as the sequences of object detections $D = d_{1:T}$ and object-to-observation associations $A = a_{1:T}$ together, the augmented multi-object state can be presented either as $(X, E, \Pi)$ or as $(X, E, D, A)$, with $\Pi = (D(\Pi), A(\Pi))$.

The conditional independence structure introduced by the parameterized discriminative model defines the factorization of the approximating augmented multi-object state posterior distribution given in Equation (2.5):

$$q(X, E, \Pi \mid Y, x_0, e_0, G, \theta) = q(X \mid Y, E, A(\Pi), x_0, e_0)\, q(E \mid D(\Pi))\, q(\Pi \mid G, \theta)$$
$$\qquad = q(\Pi \mid G, \theta) \prod_{k} q(x_{1:T,k} \mid Y, e_{1:T,k}, a_{1:T,k}, x_{0,k})\, q(e_{1:T,k} \mid d_{1:T,k}). \qquad (2.5)$$

The distribution $q(\Pi \mid G, \theta)$ allows us to sample multi-frame associations. Conditioned on multi-frame associations, both the distribution of object existence indices $q(E \mid D(\Pi))$ and the posterior distribution of the concatenated single-object states $q(X \mid Y, E, A(\Pi), x_0, e_0)$ can be factorized with product terms corresponding to single-object trajectories.
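The edge rule of Section 2.3.1 can be sketched in a few lines. This is an illustrative sketch only: the representation of time layers as a list of lists of position tuples, and the names `build_connectivity_graph`, `v_max` and `max_gap` (standing for $V_{max}$ and $\Delta T$), are assumptions made for the example.

```python
import math


def build_connectivity_graph(layers, v_max, max_gap):
    """Build the directed connectivity graph from layered observations.

    layers[tau] is the list of positions in time layer tau; layer 0 holds the
    positions of objects present at tau = 0. A vertex is the pair (tau, j).
    """
    vertices = [(tau, j) for tau, obs in enumerate(layers) for j in range(len(obs))]
    edges = []
    for (ti, i) in vertices:
        for (tj, j) in vertices:
            gap = tj - ti
            # Edges point forward in time and span at most max_gap layers.
            if gap <= 0 or gap > max_gap:
                continue
            # The object must be able to cover the distance at speed below v_max.
            if math.dist(layers[ti][i], layers[tj][j]) < gap * v_max:
                edges.append(((ti, i), (tj, j)))
    return vertices, edges
```

For example, with `layers = [[(0, 0)], [(1, 0), (5, 5)], [(2, 0)]]`, `v_max = 1.5` and `max_gap = 2`, the far-away observation `(5, 5)` stays isolated and would be a candidate false detection.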
Before we describe in more detail the multi-frame association and concatenated single-object state distributions, let us say a few words about the distribution of object existence indices. Let us assume that for the $k$th detected object we have detections only at frames $1 \leq t_1 \leq \ldots \leq t_D \leq T$. The distribution $q(e_{1:T,k} \mid d_{1:T,k})$ is defined in the following way: (1) the detected object exists on the interval $[t_1, t_D]$, i.e. $e_{t_1:t_D,k} = 1$; (2) the probability that the object exists without detection from frame $t_0$ to frame $t_1$ ($1 < t_0 < t_1$) is $(1 - p_D)^{t_1 - t_0 - 1}(1 - p_d)^{t_1 - t_0}$; and (3) the probability that the object exists without detection from frame $t_D$ to frame $t_{D+1}$ ($t_D \leq t_{D+1} \leq T$) is $p_D (1 - p_D)^{t_{D+1} - t_D - 1}(1 - p_d)^{t_{D+1} - t_D}$.

2.3.3 Decoupled object dynamics: Gibbs sampling of truncated Gaussian distributions

Gibbs sampling schemes [35][14] are an accepted and natural method to sample multivariate jointly truncated Gaussian distributions. The marginal distributions of the jointly truncated Gaussian distribution can neither be calculated in closed form nor evaluated exactly. However, distributions of a single variable conditioned on all other variables are truncated Gaussian [35].

In our multi-object tracking scenario the single-object smoothing distribution $q(x_{1:T,k} \mid Y, e_{1:T,k}, a_{1:T,k}, x_{0,k})$ can be factorized as in Equation (2.6):

$$q(x_{1:T,k} \mid Y, e_{1:T,k}, a_{1:T,k}, x_{0,k}) \propto \prod_{t=1}^{T} p(y_{t,a_{t,k}} \mid x_{t,k})\, p(x_{t,k} \mid x_{t-1,k}, e_{t,k}, e_{t-1,k}). \qquad (2.6)$$

The factorized joint distribution is truncated Gaussian, with bounds on the object velocity space induced by the velocity limit and bounds on the object position space. In frames in which a particular object is not detected, the bounds on its position coincide with the bounds of the observation space. In frames in which the object is detected, its position lies in the intersection of the observation space and the rectangular box surrounding the assigned observation.
Therefore, we can obtain a closed form expression for the joint distribution in the Gaussian form and apply the described bounds to it. From here we trivially obtain the Gaussian functional forms and bounds for each one-dimensional distribution in the Gibbs sampling scheme. For an observation space that is a subset of $\mathbb{R}^d$, the number of one-dimensional variables to sample per object is $2d$. Finally, we sample each univariate truncated Gaussian distribution using the inverse cumulative distribution function method.

2.3.4 Connectivity graph parametrization and path sampling

Our goal is to introduce a parametric distribution which we can use to sample multiple non-intersecting paths on the connectivity graph. Some methods do this by defining partitions on the graph through erasing some of its edges, and some grow the multi-path set path-by-path until all vertices are used. The main difficulty in the first class of methods is to impose the constraint that each set in the partition has to be a simple path. In the second class, there are two options: (1) to sample paths independently and reject the sample if any vertex reoccurs; and (2) to apply conditional sampling, i.e., to remove the visited vertices immediately after sampling them, which leads to non-intersecting paths. In the latter case the sampling distribution is not stationary and a more careful approach to evaluation of the used probabilities is necessary.

In this section we introduce a parametrization on consecutive edge transitions at each vertex. Using edge transitions rather than vertex transitions allows us to incorporate object kinematics constraints more efficiently. Our parametrization uses half as many parameters as the CEDA algorithm [58], which is also based on edge transitions, and explains the specific structure that distinguishes path sampling on the described connectivity graphs from path sampling on arbitrary directed graphs. Let us describe the parametrization we propose.
For simplicity, let us drop the time layer indices and denote the vertex set as $V = \{v_1, \ldots, v_{|V|}\}$. Let $L_k$ and $R_k$ be the numbers of, respectively, left and right neighbors of $v_k$, i.e., vertices connected with $v_k$ by edges directed towards and away from $v_k$. Additionally, we introduce an auxiliary left neighbor and an auxiliary right neighbor that symbolize that the vertex $v_k$ has no incoming or outgoing edges. Let $\mathcal{L}_k = \{v_{l^k_i} \mid i = 0, \ldots, L_k\}$ be the set of left neighbors and $\mathcal{R}_k = \{v_{r^k_j} \mid j = 0, \ldots, R_k\}$ the set of right neighbors of $v_k$, where we reserve the indices $l^k_0$ and $r^k_0$ for the auxiliary neighbors. For each vertex $v_k$ we define a matrix $\Theta^k$ with $L_k + 1$ rows and $R_k + 1$ columns. Each element $\theta^k_{i,j}$ represents the probability that a path through $v_k$ contains its neighbors $v_{l^k_i}$ and $v_{r^k_j}$.

Note that although the connectivity graph is directed by the ordering of time indices, our intention is to sample paths on it both forward and backward in time. The direction in which we sample paths on the graph should not change our treatment of the object kinematics. The row and column sums in the parameter matrix are respectively $\frac{1}{L_k + 1}$ and $\frac{1}{R_k + 1}$. The sum of the $i$th row represents the probability that a path through $v_k$ contains $v_{l^k_i}$, and the sum of the $j$th column represents the probability that a path through $v_k$ contains $v_{r^k_j}$. The following equations define the conditional distributions that the right (left) neighbor $v_{r^k_j}$ ($v_{l^k_i}$) of the vertex $v_k$ belongs to the path that contains $v_k$ and its left (right) neighbor; they follow by dividing $\theta^k_{i,j}$ by the corresponding row or column sum:

$$p(v_{r^k_j} \mid v_k, v_{l^k_i}) = (L_k + 1)\, \theta^k_{i,j}, \qquad p(v_{l^k_i} \mid v_k, v_{r^k_j}) = (R_k + 1)\, \theta^k_{i,j}. \qquad (2.7)$$

    Set the path counter $c = 1$.
    Initialize the set of used vertices $V^* = \emptyset$.
    Initialize the left and right vertex counters $l = 1$, $r = 1$.
    repeat
        Sample the first vertex $v^{(0)}$ uniformly from $V \setminus V^*$.
        Sample the left and right neighbors $v^{(-1)}$ and $v^{(1)}$ of $v^{(0)}$ from the multinomial distribution defined by all elements of the parameter matrix $\Theta^{v^{(0)}}(V \setminus V^*)$.
        Expand the path to the right:
        while $v^{(r)} \neq \emptyset$ do
            Update the right vertex counter $r = r + 1$.
            Sample $v^{(r)}$ from the multinomial distribution proportional to the column elements of the parameter matrix $\Theta^{v^{(r-1)}}(V \setminus V^*)$.
        end while
        Expand the path to the left:
        while $v^{(-l)} \neq \emptyset$ do
            Update the left vertex counter $l = l + 1$.
            Sample $v^{(-l)}$ from the multinomial distribution proportional to the row elements of the parameter matrix $\Theta^{v^{(-l+1)}}(V \setminus V^*)$.
        end while
        if $l = 1$ and $r = 1$ then
            Set $\Pi_0 = \Pi_0 \cup \{v^{(0)}\}$.
            Update the set of used vertices $V^* = V^* \cup \{v^{(0)}\}$.
        else
            Set the path $\Pi_c = \{v^{(-l+1)}, \ldots, v^{(r-1)}\}$.
            Update the set of used vertices $V^* = V^* \cup \Pi_c$.
            Update the path counter $c = c + 1$.
        end if
    until $V^* = V$

Figure 2.9: Conditional path sampling algorithm.

Equation (2.7) implies that left-to-right and right-to-left transitions from the vertex $v_k$ are kinematically equivalent, while the difference in the transition distributions comes from the structure of the connectivity graph, i.e. the difference in the number of incoming and outgoing edges.

We sample a good partition, i.e. multiple objects' paths, as described in Algorithm (2.9).

Let us assume that Algorithm (2.9) generated a path $\tilde{\pi} = v_{i_{-T_2}:i_{T_1}} = (v_{i_{-T_2}}, \ldots, v_{i_0}, \ldots, v_{i_{T_1}})$ starting with the vertex $v_{i_0}$ and ending with nodes $v_{i_{-T_2}}$ and $v_{i_{T_1}}$ that are either auxiliary empty nodes denoting path termination or belong to the boundaries of the time interval of interest $[0, T]$. The sequence $(i_{-T_2}, \ldots, i_0, \ldots, i_{T_1})$ is a sequence of indices belonging to consecutive vertices on the path. Let us denote the probability of this path as $q(v_{i_{-T_2}:i_{T_1}}, i_0, V^*)$, where $i_0$ denotes the index of the first chosen vertex and $V^*$ the set of nodes that are not used in previously generated paths in the same partition (path collection).
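The conditional path sampling idea of Figure 2.9 can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the thesis verbatim: each vertex `k` carries a dictionary `theta[k]` mapping a pair `(left, right)` of neighbor ids to a non-negative weight, `None` plays the role of the auxiliary neighbor, and weights are renormalized over the still-unused vertices at every draw. All function and variable names are hypothetical.

```python
import random

_FREE = object()  # marker: this side of the pair is not fixed yet


def _draw(theta_k, unused, rng, fixed_left=_FREE, fixed_right=_FREE):
    """Draw a (left, right) neighbour pair of one vertex among unused vertices."""
    items = []
    for (left, right), weight in theta_k.items():
        ok_l = (left == fixed_left) if fixed_left is not _FREE else (
            left is None or left in unused)
        ok_r = (right == fixed_right) if fixed_right is not _FREE else (
            right is None or right in unused)
        if ok_l and ok_r and weight > 0:
            items.append(((left, right), weight))
    if not items:
        return (None, None)  # no admissible continuation: terminate the path
    pairs, weights = zip(*items)
    return rng.choices(pairs, weights=weights, k=1)[0]


def sample_partition(vertices, theta, rng):
    """Sample non-intersecting paths and a clutter set covering all vertices."""
    unused = set(vertices)
    paths, clutter = [], []
    while unused:
        start = rng.choice(sorted(unused))
        unused.discard(start)
        left, right = _draw(theta[start], unused, rng)
        unused.discard(left)
        unused.discard(right)
        path = [start]
        prev, cur = start, right
        while cur is not None:  # expand the path to the right
            path.append(cur)
            _, nxt = _draw(theta[cur], unused, rng, fixed_left=prev)
            unused.discard(nxt)
            prev, cur = cur, nxt
        prev, cur = start, left
        while cur is not None:  # expand the path to the left
            path.insert(0, cur)
            nxt, _ = _draw(theta[cur], unused, rng, fixed_right=prev)
            unused.discard(nxt)
            prev, cur = cur, nxt
        if len(path) == 1:
            clutter.append(start)  # isolated vertex: false detection
        else:
            paths.append(path)
    return paths, clutter
```

Removing visited vertices immediately is what keeps the sampled paths non-intersecting; the price is that the sampling distribution changes from draw to draw, as noted in the text.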
The probability of the path irrespective of the order in which vertices were chosen can be obtained by marginalization over the first chosen node and is given by Equation (2.8):

$$q(v_{i_{-T_2}:i_{T_1}}, V^*) = \sum_{k=-T_1}^{T_2} q(v_{i_{-T_2}:i_{T_1}}, i_k, V^*) = \frac{1}{|V^*|} \prod_{j=-T_1}^{T_2} \theta^{i_j}_{i_{j-1}, i_{j+1}}(V^*) \sum_{k=-T_1}^{T_2} \Big( \prod_{j=-T_1}^{k-1} \big(L_{i_j}(V^*) + 1\big) \prod_{j=k+1}^{T_2} \big(R_{i_j}(V^*) + 1\big) \Big), \qquad (2.8)$$

where $L_{i_j}(V^*)$ and $R_{i_j}(V^*)$ are the numbers of left and right neighbors of the vertex $v_{i_j}$ in the set $V^*$.

Let us further assume that a sequence of paths $\tilde{\Pi} = (\tilde{\pi}_1, \ldots, \tilde{\pi}_P)$ was generated and let $(\tilde{\pi}_{j_1}, \ldots, \tilde{\pi}_{j_N})$ ($1 \leq j_1 \leq \ldots \leq j_N \leq P$) be the sequence of paths that contain more than one vertex. The order in which these paths are sampled is important because it defines the order in which we assign paths to coordinates of the vectors $x_{0:T}$ and $e_{0:T}$. On the other side, the order in which we generate the remaining paths is not important because we assign these paths to the clutter set $\pi_0$ of the partition $\Pi = (\pi_0, \pi_1, \ldots, \pi_N)$, where $\pi_n = \tilde{\pi}_{j_n}$ ($n = 1, \ldots, N$). The total number of permutations of $\tilde{\Pi}$ that map to the same $\Pi$ is ( P - N)N + 1 . Therefore, the probability of the partition $\Pi$ can be obtained by summing over all corresponding permutations.

2.3.5 Variational method

We propose a variational method for minimization of the Kullback-Leibler divergence (2.9) between the posterior distribution $p(X, E, \Pi \mid Y, M, x_0, e_0)$ (or $p(X, E, \Pi \mid Z, x_0, e_0)$) and the parameterized block proposal distribution (??) over the parameters of the block proposal distribution. To simplify notation, we drop all conditioning variables from expressions for distributions of interest, e.g., instead of $p(X, E, \Pi \mid Y, M, x_0, e_0)$ we write $p(X, E, \Pi)$. We choose the optimization criterion in the form of the inclusive divergence $\text{KL}(p(X, E, \Pi) \,\|\, q(X, E, \Pi \mid \theta))$ rather than the exclusive divergence $\text{KL}(q(X, E, \Pi \mid \theta) \,\|\, p(X, E, \Pi))$, as our intent is to use the obtained approximation as the proposal distribution in the block sampling algorithm.
Namely, the exclusive divergence forces $q$ to be zero whenever $p$ is zero. As an effect of this property, parts of the support of $p$ can be excluded from the support of $q$ (hence the name exclusive divergence [?]) and the approximating distribution $q$ concentrates probability mass around modes of $p$. On the other side, the inclusive divergence forces $q > 0$ whenever $p > 0$, including the support of $p$ in the support of $q$. Therefore, approximations based on the inclusive divergence optimization provide better approximations in the low probability regions (tails) and are better suited to be used as proposal distributions.

$$\theta^* = \arg\min_{\theta} \text{KL}\big(p(X, E, \Pi) \,\|\, q(X, E, \Pi \mid \theta)\big) = \arg\max_{\theta} \sum_{X, E, \Pi} p(X, E, \Pi) \log \frac{q(X, E, \Pi \mid \theta)}{p(X, E, \Pi)} \qquad (2.9)$$

Let us assume an approximating distribution of the form $q(X, E, \Pi \mid \theta) = p(X \mid E, \Pi)\, q(E, \Pi \mid \theta)$. Since the true posterior distribution is $p(X, E, \Pi) = p(X \mid E, \Pi)\, p(E, \Pi)$, the conditional distribution $p(X \mid E, \Pi)$ cancels and the Kullback-Leibler divergence can be calculated using only the discrete variables $E, \Pi$ (Equation (2.10)):

$$\theta^* = \arg\min_{\theta} \text{KL}\big(p(E, \Pi) \,\|\, q(E, \Pi \mid \theta)\big) = \arg\max_{\theta} \sum_{E, \Pi} p(E, \Pi) \log \frac{q(E, \Pi \mid \theta)}{p(E, \Pi)}. \qquad (2.10)$$

The sum in the criterion is intractable due to the large number of partitions, and it cannot be directly approximated by Monte Carlo methods since it is not possible to sample from $p(E, \Pi)$ directly. In the following theorem we define a lower bound of the original criterion function, with the idea to optimize this bound instead of the original criterion. In order to define an appropriate lower bound, let us first prove the following lemma.

Lemma 2.3.1 Let $p$ be a discrete distribution with the finite support $S$ ($|S| = N$), and let $q$ be a positive function defined on $S$ such that $\sum_{i=1}^{N} q_i = Q$. Then the inequalities (2.11) [not legible in the source] hold, where $D_{\chi^2}(p \,\|\, q)$ and $\text{KL}(p \,\|\, q)$ are respectively the $\chi^2$ and Kullback-Leibler divergences.

Proof Since the logarithm is a concave function, differentiable on $\mathbb{R}^+$, it holds:

$$\frac{1}{b}(a - b) \geq \log a - \log b \geq \frac{1}{a}(a - b).$$
(2.12)

Substituting $a = p_i$ and $b = q_i$ for each $i = 1, \ldots, N$ into (2.12), and using the fact that the arithmetic mean is greater than the geometric mean, we obtain per-index inequalities [the intermediate display equations are not legible in the source]. After summation over all indices we get the inequality of interest.

The following theorem defines a lower bound of the optimization criterion.

Theorem 2.3.2 Let $p$ and $r$ be two discrete distributions with identical finite supports $S$, and let $q$ be a positive function defined on $S$ that satisfies the expectation constraint $\sum_i p_i q_i = 1$. Then the inequality (2.13) holds:

$$-\text{KL}(p \,\|\, r) \geq -\text{KL}(pq \,\|\, pr). \qquad (2.13)$$

Proof From the previous lemma, two chains of inequalities follow: (2.14), which bounds $-\text{KL}(p \,\|\, r)$ from below through $-D_{\chi^2}(p \,\|\, r)$, and (2.15), which bounds $\sum_{i=1}^{N} p_i r_i - 1$ from below by $-\text{KL}(pq \,\|\, pr)$ [the full displays are not legible in the source]. Using the Lagrange multipliers technique it is easy to prove that for any distribution $p$ the inequality linking the right-hand side of (2.14) to $\sum_{i=1}^{N} p_i r_i - 1$ holds for all distributions $r$. Therefore, from (2.14) and (2.15) we get the inequality of interest.

Based on this theorem, instead of maximizing the original criterion (2.10), we maximize its lower bound $F(q, \theta) = -\text{KL}\big(p(E, \Pi)\, q(E, \Pi) \,\|\, p(E, \Pi)\, q(E, \Pi \mid \theta)\big)$ over the parameters $\theta$ under the expectation constraint $\sum_{E, \Pi} p(E, \Pi)\, q(E, \Pi) = 1$ using an iterative EM algorithm.

The EM procedure alternates expectation and maximization steps, optimizing the lower bound $F(q, \theta)$ over (variational, parameter-free) functions $q$ and parameters $\theta$. After we initialize the procedure with parameters $\theta^0$, the EM algorithm generates a sequence of estimates $(\theta^m, q^m)$ ($m \in \mathbb{N}_0$). In the $m$th expectation step we find the optimal parameter-free function $q^m$ (2.16), which maximizes $F(q, \theta^m)$ under the expectation constraints using Lagrange multipliers.
[Equation (2.16) is not legible in the source.]

After we substitute the optimal variational function from the expectation step into the variational lower bound and discard terms that do not depend on the parameters $\theta$, we get the following function of the parameters $\theta$:

$$F(q^m, \theta) = \sum_{E, \Pi} q(E, \Pi \mid \theta^m)\, p(E, \Pi) \log q(E, \Pi \mid \theta).$$

In the maximization step the goal is to obtain new parameter estimates $\theta^{m+1}$ by maximization of $F(q^m, \theta)$ over the parameters $\theta$. Since the sums in $F(q^m, \theta)$ are intractable we use their sample approximation, where samples $(X^n, E^n, \Pi^n)_{n=1}^{N}$ are drawn from the distribution $q(X, E, \Pi \mid \theta^m)$ and $p(E, \Pi)$ is approximated with the value $p(X^n, E^n, \Pi^n)$:

$$F(q^m, \theta) \propto \sum_{n=1}^{N} p(X^n, E^n, \Pi^n) \log q(E^n, \Pi^n \mid \theta). \qquad (2.17)$$

Finally, we obtain $\theta^{m+1}$ by maximization of the sample approximation over the parameters $\theta$. Instead of the conditional distribution $p(X, E, \Pi \mid Y)$ it is enough to evaluate the joint distribution $p(X, E, \Pi, Y)$, which does not influence the optimization procedure. Thus, instead of criterion (2.17) we optimize the following criterion:

$$F(q^m, \theta) \propto \sum_{n=1}^{N} p(X^n, E^n, \Pi^n, Y) \log q(\Pi^n \mid \theta). \qquad (2.18)$$

Finding derivatives over all parameters $\theta$ leads to the following parameter update equations:

$$\theta^k_{l^k_i, r^k_j} = \frac{\sum_{n=1}^{N} p(X^n, E^n, \Pi^n, Y)\, \mathbb{I}(\Pi^n, l^k_i, r^k_j)}{\sum_{n=1}^{N} p(X^n, E^n, \Pi^n, Y)} \qquad (2.19)$$

for all vertices $k$ and all pairs of its neighbors $(l^k_i, r^k_j)$. With $\mathbb{I}(\Pi^n, l^k_i, r^k_j)$ we denote an indicator function that takes the value 1 if the vertices $(l^k_i, k, r^k_j)$ belong to a trajectory in the partition $\Pi^n$.

We summarize the described variational approximation method for construction of the block proposal distribution in Algorithm (2.10). Note that we augment observations on the interval $[1, T]$ with the locations of the objects present at $t = 0$ to get the connectivity graph $G(Y, x_0, e_0)$.
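The update (2.19) is a weighted frequency count, which the following sketch makes explicit. The data layout is an assumption made for illustration: each sampled partition is a list of paths (lists of vertex ids, with false detections stored as single-vertex paths), `weights[n]` stands for the evaluated value $p(X^n, E^n, \Pi^n, Y)$, and `None` marks the auxiliary neighbor at a path end. The function name `update_theta` is hypothetical.

```python
def update_theta(partitions, weights):
    """Weighted-count estimate of the edge-transition parameters.

    Returns theta[k][(left, right)]: the weighted fraction of sampled
    partitions in which vertex k has the given neighbours on its path.
    """
    total = sum(weights)
    theta = {}
    for paths, weight in zip(partitions, weights):
        for path in paths:
            # Pad with auxiliary neighbours so that path ends are counted too.
            padded = [None] + list(path) + [None]
            for left, k, right in zip(padded, padded[1:], padded[2:]):
                per_vertex = theta.setdefault(k, {})
                per_vertex[(left, right)] = per_vertex.get((left, right), 0.0) + weight / total
    return theta
```

Because each vertex appears exactly once in every good partition, the entries of each `theta[k]` sum to one in this sketch.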
INITIALIZE iteration counter $m = 0$ and parameters $\theta^0$
while (not converged) do
    for $n = 1$ to $N$ do
        SAMPLE $\Pi^{m,n} \propto q(\Pi \mid G(Y, x_0, e_0), \theta^{m})$
        SAMPLE $E^{m,n} \propto q(E \mid \Pi^{m,n})$
        SAMPLE $X^{m,n} \propto q(X \mid Y, x_0, e_0, (E, \Pi)^{m,n})$
        EVALUATE $p((X, E, \Pi)^{m,n}, Y \mid x_0, e_0)$
    end for
    UPDATE $\theta^m$ using Eqn. (2.19)
    ITERATION COUNTER $m = m + 1$
end while

Figure 2.10: Variational approximation

The quality of the obtained approximation depends critically on the parametrization of the approximating distribution $q$. In order to make the approximating distribution more flexible we assume it has a more complex form, with a $C$-component mixture of multinomial distributions as the distribution of the partition variable. Each component in the mixture has the form introduced in Section (2.3.4):

$$q(X, E, \Pi \mid \Theta) = q(X, E \mid \Pi) \sum_{c=1}^{C} \alpha_c\, q_c(\Pi \mid \theta_c),$$

where $\Theta$ contains all parameters $(\theta_c)_{c=1}^{C}$ and the parameters $\alpha = (\alpha_c)_{c=1}^{C}$ that define a multinomial distribution over the mixture components. The mixture form of the approximating distribution allows us to capture more complex distributions of the partition variable. The parameter update equations can be easily extended from (2.19).

2.3.6 Adaptive MCMC for block sampling

In this section we introduce an adaptation mechanism to the proposed independence block MCMC sampling algorithm. It is known that MCMC samplers with local moves represent a better choice than independence samplers when it is not possible to have proposal distributions close to the target distribution [38] [50]. Therefore, we choose to combine an MCMC sampling scheme with local moves from [44] with the variational distribution approximation method we proposed in the previous section in a joint adaptive sampling scheme.

Adaptive MCMC samplers can be classified in three groups: (1) adaptive strategies within ordinary MCMC; (2) algorithms where adaptation diminishes; and (3) algorithms with chain regeneration.
Possibilities for adaptation in ordinary MCMC sampling schemes are rather limited [18] and chain regeneration in high dimensional spaces happens rarely [24]. The remaining class of algorithms with diminishing adaptation is the most active research field; however, most of the proposed algorithms introduce additional restrictions which make them difficult to apply in practice.

We propose an adaptation scheme for block trajectory sampling based on the diminishing adaptation algorithm proposed in [30]. This algorithm, originally designed for problems in which the target distribution is expensive to evaluate, is particularly useful in applications where it is necessary to generate a large number of samples from the target distribution: inverse problems or, as in our case, optimization problems. In Algorithm 1 we describe a way to combine the local MCMC sampling technique [44] and the variational method we proposed in Section 2.3.5.

Let us adopt the following shorthand notation:

(2.20)

for the mixture of the local MCMC and the independence Metropolis-Hastings kernel. We denote the update history as $h^i$, initialize it with an empty set $h^0 = \emptyset$ and update it after we generate a sample $(X, E, \Pi)^*$ from $q_i((X, E, \Pi) \mid (X, E, \Pi)^{i-1}, h^{i-1})$ according to the following rule:

$$h^i = \begin{cases} h^{i-1} \cup (X, E, \Pi)^{*}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{AR} \\ h^{i-1} \cup (X, E, \Pi)^{*}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{MCMC} \text{ and rejected} \\ h^{i-1} \cup (X, E, \Pi)^{i-1}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{MCMC} \text{ and accepted.} \end{cases} \qquad (2.21)$$

In words, we can augment the history with accepted samples generated by the independence proposal $p_i^{AR}$, rejected samples from the local MCMC transition distribution $p_i^{MCMC}$, or accepted samples from the local MCMC transition distribution after the chain moves to another state. The mixture proposal distribution $q_i((X, E, \Pi) \mid (X, E, \Pi)^{i-1}, h^{i-1})$ can be calculated using only the samples in the update history $h^{i-1}$. This way, we do not have to wait to generate a large number of samples with fixed parameter values to estimate the new parameter values.
Instead, we update the parameters after each sample is generated, incrementally improving the proposal distribution.

Algorithm 1 Adaptive Independence Metropolis-Hastings Sampler
    SET $h^0 = \emptyset$
    INITIALIZE state $(X, E, \Pi)^0$
    for $i = 1$ to $N$ (all samples) do
        GENERATE $(X, E, \Pi)^* \sim q_i((X, E, \Pi) \mid (X, E, \Pi)^{i-1}, h^{i-1})$
        CALCULATE ACCEPTANCE RATIO
            $$\alpha = \min\left(1, \frac{p((X, E, \Pi)^*)\, q_i((X, E, \Pi)^{i-1} \mid (X, E, \Pi)^*, h^{i-1})}{p((X, E, \Pi)^{i-1})\, q_i((X, E, \Pi)^* \mid (X, E, \Pi)^{i-1}, h^{i-1})}\right) \qquad (2.22)$$
        SET
            $$(X, E, \Pi)^i = \begin{cases} (X, E, \Pi)^{*}, & \text{accepted} \\ (X, E, \Pi)^{i-1}, & \text{rejected} \end{cases} \qquad (2.23)$$
        UPDATE HISTORY
            $$h^i = \begin{cases} h^{i-1} \cup (X, E, \Pi)^{*}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{AR} \\ h^{i-1} \cup (X, E, \Pi)^{*}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{MCMC} \text{ and rejected} \\ h^{i-1} \cup (X, E, \Pi)^{i-1}, & \text{if } (X, E, \Pi)^{*} \sim p_i^{MCMC} \text{ and accepted} \end{cases} \qquad (2.24)$$
        UPDATE PARAMETERS $\theta^m$ using Eqn. (2.19) and samples from $h^i$.
    end for

2.3.7 Comparison with related methods

Both the DDMCMC algorithm [44] and the CEDA algorithm [58] address the problem of multi-object tracking on finite time intervals and share many similarities. Both algorithms assume the classical multi-object dynamics model with Gaussian single-object transition distributions and observation likelihoods. In order to use all observations on the interval of interest both methods use the partition variable $\Pi$ on the connectivity graph $G(Y, x_0, e_0)$ in the way presented in Sections (2.3.1) and (2.3.2).

Both methods pose the tracking as an optimization problem. The DDMCMC directly optimizes the multi-frame association posterior, $\hat{\Pi} = \arg\max_{\Pi} p(\Pi \mid Y)$, by searching for the optimal partition of the connectivity graph. For that purpose the authors design a Markov chain with reversible moves that can transform any good partition $\Pi$ into another good partition $\Pi'$ in a finite number of steps, using a transition kernel based on intuitive moves. Intuitively, starting with a greedy initial guess for $\hat{\Pi}$, the proposed Markov chain locally explores the partition space.
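For concreteness, the adaptive scheme of Algorithm 1 (Section 2.3.6) can be sketched on a toy discrete target. The code below is only an illustration under simplifying assumptions: the state is a single integer on a ring rather than a triple $(X, E, \Pi)$; the independence proposal is a smoothed histogram of the update history instead of the variational distribution; the history rule is reduced to "add rejected proposals, add the abandoned state after an accepted move"; and the names `adaptive_mixture_mh`, `w_local` and `counts` are hypothetical. The target must be strictly positive and have at least three states.

```python
import random


def adaptive_mixture_mh(target, n_iter, w_local=0.5, seed=0):
    """Toy adaptive Metropolis-Hastings sampler on states 0..K-1 (K >= 3).

    target: unnormalized, strictly positive target probabilities.
    The proposal mixes a symmetric local move (+/-1 on a ring) with an
    independence proposal learned from the update history.
    """
    rng = random.Random(seed)
    k = len(target)
    counts = [1.0] * k  # history histogram, one pseudo-count per state (assumption)
    x = rng.randrange(k)
    chain = []

    def q(y, x_from):
        # Density of the mixture proposal q(y | x_from, history).
        local = 0.5 if (y - x_from) % k in (1, k - 1) else 0.0
        indep = counts[y] / sum(counts)
        return w_local * local + (1.0 - w_local) * indep

    for _ in range(n_iter):
        if rng.random() < w_local:
            y = (x + rng.choice((-1, 1))) % k                  # local move
        else:
            y = rng.choices(range(k), weights=counts)[0]       # independence move
        # Acceptance ratio in the form of Eq. (2.22), with the full mixture density.
        ratio = (target[y] * q(x, y)) / (target[x] * q(y, x))
        if rng.random() < min(1.0, ratio):
            if y != x:
                counts[x] += 1.0  # accepted: the abandoned state enters the history
            x = y
        else:
            counts[y] += 1.0      # rejected: the proposal enters the history
        chain.append(x)
    return chain


if __name__ == "__main__":
    samples = adaptive_mixture_mh([1.0, 2.0, 4.0, 2.0, 1.0], 20000, seed=1)
    print([round(samples.count(s) / len(samples), 3) for s in range(5)])
```

The proposal is updated after every iteration, so no batch of samples with fixed parameters is needed. The local component keeps the chain mobile while the learned independence component is still poor.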
On the other side, in [58] the maximization is conducted using the cross entropy criterion. Both methods search the space of good partitions $\Pi$ in order to find the optimal partition $\hat{\Pi} = \arg\max_{\Pi} p(\Pi \mid Y)$. Additionally, the authors assume that P is the number of existing (instead of detected) objects and that an object becomes active exactly when it gets its first observation. Although it is a reasonable practical approximation that each existing object will produce observations on a reasonably long interval, in situations where the detection probability $P_d$ is low an object is not necessarily detected at the same sampling instance at which it becomes active.

Since no object interactions are incorporated in the transition model, the posterior distribution $p(\Pi \mid Y)$ is obtained in closed form by integration over the multi-object state variables $X$. Namely, under the assumptions that the single-object state transitions and observation likelihoods are Gaussian distributions, these integrals can be calculated in closed form using the Kalman filter.

2.4 Results

We evaluated the performance of the proposed sequential block sampling method on two synthetic multi-object tracking datasets using the optimal subpattern assignment (OSPA) metric [55] and compared it with the performance of two baseline methods, SIR and SIR with MCMC history rejuvenation moves. In this section we describe the datasets used and the OSPA metric, and present the evaluation results.

2.4.1 Synthetic datasets

In our experiments we used two synthetic datasets, both representing the motion of a varying number of objects observed in a square region $\mathcal{Y} = [0, 5000]^2 \subset \mathbb{R}^2$ on the time interval $[0, 300]$ at 100 discrete time instances $t_k = 3k$ ($k = 1, \ldots, 100$).

Figure 2.11: Object trajectory sets. Initialized with: 5 (upper left), 10 (upper right), 20 (lower left) and 40 (lower right) objects at $t = 0$.
For both datasets we used the same four sets of trajectories (Figure (2.11)), generated using the multi-object transition model described in Section (2.1.2.1) and initialized at $t = 0$ respectively with 5, 10, 20 and 40 objects. We used the following set of multi-object transition model parameters: (1) probability of death $p_D = 0.005$; (2) birth rate $\lambda_B$ leading to an expected 0.3 object births per frame; (3) velocity variance $\sigma_x = 0.5$; and (4) velocity limit $v_{max} = 10$.

We generated 100 independent observation sets per trajectory set for both datasets. For the first dataset we assumed only the object detection based likelihood described in Section (2.1.2.2). For the second dataset we used the ground moving target indicator (GMTI) observation model [36] with two likelihoods, the original association-free likelihood with binary observations and its association dependent approximation.

The modeling parameters we used to generate observations for the first dataset are: (1) false detection rate $\lambda_{fa}$ equivalent to an expected 20 false detections per frame; (2) object detection probability $P_d = 0.7$; and (3) truncated Gaussian single-object likelihood centered at the object's position $\mu$, with the covariance matrix $\Sigma_Y = 15^2 I_2$ and with support $\mathcal{Y} \cap [\mu - 25, \mu + 25]^2$.

Let us describe in more detail the observation model used in the second dataset. We divided the observation region in 10000 square cells, each of size 50 x 50, and assumed that each cell acts as a binary Rayleigh object detector. We assumed the background false alarm rate $P_{fa} = 0.002$ and the signal-to-noise ratio $SNR = 17$, matching the expected number of 20 false detections and the object detection probability 0.7 of the first dataset.

2.4.2 Evaluation metric

Development of multi-object tracking algorithms was not followed by the development of performance evaluation techniques. Measures used for evaluation in single object tracking problems (e.g.
root mean square error) can be extended, by the optimal assignment step [51], to tracking scenarios with a fixed number of objects. When the number of targets varies there are two types of errors: state estimate errors for existing objects, and mismatch between the estimated and the correct number of objects. In these cases, the usual practice is to evaluate each type of error separately, which makes comparisons of the performance of different algorithms on different datasets more difficult.

The classical multi-object dynamics model, described in Section (2.1.1), represents a special case from the evaluation perspective. Namely, the quality of the object-to-observation assignment directly determines the quality of the multi-object state estimates, due to the existence of a closed form solution for the single-object tracking problems when the object-to-observation associations are known. Therefore, for this model it might be sufficient to evaluate algorithm performance using only the data-association criteria, i.e. the normalized correct associations (NCA) and the incorrect-to-correct association ratio (ICAR) proposed in [44]. On the other side, while the NCA and ICAR are intuitive performance measures, they do not represent metrics in the mathematical sense.

Two recently proposed metrics, the optimal mass transfer (OMAT) metric [29] and the optimal subpattern assignment (OSPA) metric [54], are appropriate evaluation criteria for multi-object tracking problems. Furthermore, they satisfy the mathematical definition of a metric on finite sets, have a clear and intuitive interpretation and are sensitive to both types of estimation errors. Let us briefly describe the OMAT and the OSPA metrics.

Let $K_t^0$ be the correct number of objects existing at $t$, and $x_t^0 = (x_{t,1}^0, \ldots, x_{t,K_t^0}^0)$ the vector of correct object states. According to the notation introduced in Section (2.1), the estimated number of objects at $t$ is $K_t$ and the estimated multi-object state is $(e_t, x_t)$.
The OMAT metric [29] of order $p$ ($1 \leq p < \infty$) with distance function $d$ is defined as:

(2.25)

where the minimization is performed over all transportation matrices $C$ of size $K_t$ by $K_t^0$, with non-negative entries, row sums equal to $1/K_t$ and column sums equal to $1/K_t^0$. The order $p$ defines the sensitivity of the OMAT metric to outliers, since taking the $p$-th power ($p > 1$) emphasizes large distances between true object positions and estimates. The OMAT metric is in general more sensitive to differences between the estimated and the correct number of objects than the Hausdorff metric [29], the latter traditionally used for comparison of binary images. However, the OMAT metric exhibits a geometry dependent behavior: it penalizes object number estimate errors more heavily when objects are far apart. Also, the OMAT metric is not defined when the number of correct and/or estimated objects is zero. The OSPA metric [54] represents an extension of the OMAT metric that addresses the mentioned issues.

In order to define the OSPA metric let us introduce the truncated distance with threshold parameter $c$ between the object's position estimate $x_{t,\cdot}$ and the object's true position $x_{t,\cdot}^0$,

$$d_c(x_{t,\cdot}, x_{t,\cdot}^0) = \min\big(c, d(x_{t,\cdot}, x_{t,\cdot}^0)\big),$$

and let $\pi_{K_t^0}$ be the set of all permutations of ground truth object indices. If the correct number of objects is greater than or equal to the estimated number of objects, $K_t^0 \geq K_t$, the OSPA metric of order $p$ is defined as:

and for $K_t > K_t^0$:

The order parameter $p$ has the same interpretation as for the OMAT metric, and the threshold parameter $c$ determines how much weight is put on errors in the state estimates of existing objects compared to errors in the estimated number of objects. Namely, large values of $c$ emphasize errors in the estimated number of objects. The OSPA metric can be efficiently evaluated following the steps presented in Algorithm 2.12.

(A) Find the $m$-element subset of $x_t^0$ that is closest to $x_t$ in the $p$-th order OMAT metric (2.25).
(B) For each existing object $x_{t,k}^0$ ($e_{t,k} = 1$) compute $a_k$ equal to the truncated distance $d_c$ from the assigned estimate.
(C) Compute $\left(\frac{1}{K_t^0} \sum_{k=1}^{K_t^0} a_k^p\right)^{1/p}$.

Figure 2.12: OSPA metric evaluation

In the experiments we evaluated the compared tracking methods by the OSPA metric using three sets of parameters $(p, c) \in \{(2, 30), (2, 100), (2, 5000)\}$. We chose $p = 2$ because it provides smooth distance curves (compared to $p = 1$), and since the Euclidean distance is a traditionally accepted distance measure in tracking applications. We analyzed tracking performance from the perspective of both object localization and object set cardinality, by varying the threshold parameter $c$ from 30, equal to $2\sigma_y$, to 5000, equal to the size of the observation space.

2.4.3 Experiments

We compared the performance of the proposed sequential block sampling algorithm with the following baseline algorithms: (1) the SIR algorithm with $N = 10000$ samples per frame; and (2) SIR with $N = 1000$ samples and 10 MCMC iterations per sample for rejuvenation of a 10 frame history. In the SIR algorithm we used the multi-object transition distribution as the proposal distribution, while in SIR with MCMC moves we used a mixture of the multi-object transition distribution and a heuristically designed distribution based on optimal 2D assignment [56] in a single frame.

We applied the proposed sequential block sampling method with $T = 10$ frames in a block and an $L = 5$ frame block overlap. We generated $N = 1000$ samples per block using the block proposal distribution with parameters optimized in 10 iterations.

2.4.3.1 Association dependent likelihood

It can be seen that sampling importance resampling based methods estimate the number of active objects with a higher variance and non-zero biases. In other words, SIR based methods tend to overestimate the number of active objects. This can be explained by the fact that they rely on proposals based on a single observation. Namely, SIR algorithms tend to propose object births at the locations of false alarms and, due to the low object termination probability, keep these false trajectories alive even if no observations are assigned to them in the consecutive frames.

[Figure 2.13 panels: estimated number of objects for SIR, SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive); OSPA curves of the same four methods versus time.]
Figure 2.13: Performance evaluation: detection based observations, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 5 active objects at $t = 0$

[Figure 2.14 panels: estimated number of objects for SIR, SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive); OSPA curves of the same four methods versus time.]
Figure 2.14: Performance evaluation: detection based observations, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 10 active objects at $t = 0$
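For reference, the OSPA evaluation used in these comparisons (Section 2.4.2) can be sketched in a few lines. The sketch follows the standard OSPA definition from the literature: the optimal assignment cost of the truncated distances plus a penalty $c^p$ for each unmatched object, normalized by the larger set cardinality. It finds the optimal assignment by brute force over permutations, so it is only suitable for small object sets. The function name `ospa` and the representation of object states as coordinate tuples are assumptions of this example.

```python
import math
from itertools import permutations


def ospa(estimates, truth, c, p=2):
    """Brute-force OSPA distance of order p with cut-off c between two point sets."""
    m, n = len(estimates), len(truth)
    if m == 0 and n == 0:
        return 0.0
    if m > n:
        # OSPA is symmetric; keep the smaller set in the first argument.
        estimates, truth, m, n = truth, estimates, n, m

    def d_c(a, b):
        # Truncated Euclidean distance.
        return min(c, math.dist(a, b))

    # Best assignment of the m points of the smaller set to distinct points
    # of the larger set.
    best = min(
        sum(d_c(estimates[i], truth[j]) ** p for i, j in enumerate(perm))
        for perm in permutations(range(n), m)
    )
    # Each of the n - m unmatched objects is charged the maximal cost c.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (100.0, 100.0)]
    print(ospa([(3.0, 4.0)], truth, c=30))    # one missed object
    print(ospa([(3.0, 4.0)], truth, c=5000))  # cardinality error dominates
```

With a small cut-off the result is driven by localization errors. With a cut-off comparable to the size of the observation region, a single missed or spurious object dominates the value, which is the behavior exploited by the three choices of $c$ in the experiments.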
[Figure 2.15 panels: estimated number of objects for SIR, SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive); OSPA curves of the same four methods versus time.]
Figure 2.15: Performance evaluation: detection based observations, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 20 active objects at $t = 0$

[Figure 2.16 panels: estimated number of objects for SIR, SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive); OSPA curves of the same four methods versus time.]
Figure 2.16: Performance evaluation: detection based observations, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 40 active objects at $t = 0$

A small value of the parameter $c$ (30) emphasizes object localization errors and puts less weight on the potential mismatch between the true and the estimated object number. A medium value of $c$ (100) balances the influence of localization and cardinality errors, and the extremely large value of $c$ (5000) takes into account only the errors in the object cardinality estimate. For all values of $c$ the results show similar trends: block sampling schemes tend to estimate the number of objects better and respond to changes in the object number faster. Also, as a consequence of the possibility to sample the true object location posterior by Gibbs sampling, block sampling schemes provide better object localization performance.

Additional inspection reveals that for SIR algorithms the OSPA metric tends to increase with the number of active objects, even for a small $c$. This phenomenon, not present for the block sampling schemes, points to the particle impoverishment problem of SIR. Namely, block sampling schemes are able to discover modes of the posterior distribution of observation partitions. Therefore, particles are not wasted on incorrect models and the full particle set explores the space of object locations. On the other side, SIR methods, while running with 10 times more samples, have a much smaller effective sample size.

[Figure 2.17 panels: estimated number of objects for SIR, SIR-MCMC steps and Method 1 (var-mixture); OSPA curves for SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive) versus time.]
Figure 2.17: Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 5 active objects at $t = 0$

In Figures (2.17) - (2.20) we present the evaluation results for multi-object tracking with the GMTI likelihood model. All conclusions related to the relative performance of the different algorithms hold. Overall, localization errors are slightly higher than with the association dependent likelihood model.
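The effective sample size mentioned in the discussion of particle impoverishment is commonly estimated from the importance weights as $1 / \sum_n \tilde{w}_n^2$, where $\tilde{w}_n$ are the normalized weights. The sketch below uses this common estimator; the text does not state which estimator was used, so both the formula and the function name `effective_sample_size` are assumptions of this example.

```python
def effective_sample_size(weights):
    """Common ESS estimate 1 / sum(w_n^2) computed from unnormalized weights."""
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


if __name__ == "__main__":
    # Equal weights: every particle contributes.
    print(effective_sample_size([1.0] * 1000))
    # One dominant particle among 10000: the set is effectively impoverished.
    print(effective_sample_size([1.0] + [1e-6] * 9999))
```

A particle set with many samples but a few dominant weights therefore behaves like a much smaller set, which is the sense in which a sampler with ten times more particles can still have the smaller effective sample size.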
[Figure 2.18 panels: estimated number of objects for SIR, SIR-MCMC steps and Method 1 (var-mixture); OSPA curves for SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive) versus time.]
Figure 2.18: Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 10 active objects at $t = 0$
[Figure 2.19 panels: estimated number of objects for SIR, SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive); OSPA curves versus time.]
Figure 2.19: Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 20 active objects at $t = 0$

[Figure 2.20 panels: estimated number of objects for SIR, SIR-MCMC steps and Method 1 (var-mixture); OSPA curves for SIR-MCMC steps, Method 1 (var-mixture) and Method 2 (adaptive) versus time.]
Figure 2.20: Performance evaluation: GMTI with the Rayleigh likelihood model, OSPA metric ($p = 2$, $c \in \{30, 100, 5000\}$), 40 active objects at $t = 0$

Chapter 3: Localization and tracking of multiple-speakers using microphone array

3.1 Proposed Method

The algorithm we propose can be summarized in four main steps (see Fig. 3.1): (i) obtain the posterior distribution of the source locations at each time-frame through the update equations of the mixture particle filter (MPF) (Section 3.1.2); (ii) extract modes of the posterior distribution using the patient rule induction algorithm (PRIM) (Section 3.1.3); (iii) reconstruct source trajectories by assignment of modes discovered in consecutive time-frames.

[Figure 3.1 block diagram labels: Microphone Array, mMPF mixture update, particle reclustering, trajectory reconstruction, SOCPD speaker detection, Performance Evaluation.]
Figure 3.1: Microphone array. $x_t$ - position of source S; ∠(micS) - true angle between source S and microphone pair $(m_i, m_j)$; $\beta(x_t)$ - angle that defines source position in the XoY plane; $y_t^{i,j}$ - Direction of Arrival estimated by microphone pair $(m_i, m_j)$
For this purpose we apply the two index assignment algorithm (Subsection 3.1.4); and (iv) perform speaker segmentation using the Sequential Optimal Change-Point Detection (SOCPD) algorithm on each reconstructed trajectory (Subsection 3.1.5). The statistical model used is described in the following subsection.

3.1.1 Statistical Model

We assume that the acoustically active source is represented by its location $x_t$ in the quantized 3-D meeting room space.

Our analysis is based on time delay estimates derived by the algorithm described in [22]. Pairwise delays between $M$ microphones $(m_1, \ldots, m_M)$ are estimated and transformed to Direction of Arrival Angles (DoAA), producing an $M(M-1)/2$-dimensional observation vector $y_t$ every time-frame. Given a 16-channel microphone array in our smart room this results in a 120-dimensional observation vector. As shown in Fig. 3.1, the observation coordinate $y_t^{i,j} \in [0, \pi]$ denotes the DoAA for the microphone pair $(m_i, m_j)$. Let us denote the $t$-tuple of all observation vectors up to time $t$ as $y_{1:t}$.

We assume that the Markovian assumption holds and describe the active source kinematics by the transition distribution $p(x_t \mid x_{t-1})$ and the initial state distribution $p(x_0)$. We compute the observation likelihood $p(y_t \mid x_t)$ as:

(3.1)

where $R(x_t)$ denotes the set of all microphone pairs $(m_i, m_j)$ ($i > j$) for which the distance from the source location $x_t$ to the DoAA $y_t^{i,j}$ (see Fig. 3.1) is smaller than some limiting distance. Both the transition distribution and the observation likelihood are learned from a supervised training dataset (see Section 3.2).

Since the goal is to track multiple acoustically active participants, the posterior distribution of interest $p(x_t \mid y_{1:t})$ will contain encoded information on the position of each sound source, and hence it is natural to represent it with a mixture model.
This approach preserves the low dimensionality of the state space and has clear computational advantages over methods that employ concatenation of the position vectors of the different sources [47]. A computationally efficient approximation of the optimal Bayesian solution is obtained by formulating the tracking problem in the sequential Monte Carlo framework [16], particularly using the mixture particle filter [60] method.

3.1.2 Mixture Particle Filter

In this subsection we give a brief overview of the MPF method and describe our choice of the sampling distribution. Vermaak et al. [60] proposed mixture particle filters (MPF) to enable maintaining the multi-modality of the posterior distribution:

$$p(x_t \mid y_{1:t}) = \sum_{c=1}^{M_t} \alpha_{c,t}\, p_c(x_t \mid y_{1:t}),$$

where $\alpha_{c,t}$ represents the weight of the $c$th mixture component at time $t$. The set $M_t^c = \{(x_{t,c}^i, w_{t,c}^i) : i = 1..N_c\}$ defines a particle approximation of the distribution $p_c(x_t \mid y_{1:t})$, where $N_c$, $x_{t,c}^i$ and $w_{t,c}^i$ denote respectively the number of particles, the position of the $i$th particle and its weight.

MPFs show an elegant way to update the particle representation of different mixture components by separate particle filters, where the only interaction between different components appears in the particle weight update equations. For more details see Vermaak et al. [60].

A key aspect of all sequential Monte Carlo algorithms is the choice of an appropriate sampling distribution. Particularly, in multi-source tracking scenarios the sampling distribution has to drive particles towards regions where the new sources occur. Therefore, the transition distribution $p(x_t \mid x_{t-1})$ does not represent a good choice, since it captures only the kinematics of the existing target. In order to overcome this difficulty we use a sampling distribution in the form of a linear combination:

$$p(x_t \mid x_{t-1}, y_t) = \gamma\, p(x_t \mid x_{t-1}) + (1 - \gamma)\, q(x_t \mid y_t).$$

The distribution $q(x_t \mid y_t)$ is constructed based on agreement between DoAA estimates from different microphone pairs.
Note that the triplet $(m_i, m_j, y_t^{i,j})$ defines a conic surface which contains all possible source locations that are indistinguishable from the perspective of the microphone pair $(m_i, m_j)$. For each $x_t$ we count how many conic surfaces pass sufficiently close to it and compute the distribution $q(x_t \mid y_t)$ by normalization of the obtained counts.

3.1.3 Particle Reclustering

In order to avoid incorrect identification of particle-mixture pairings we perform particle reclustering. In the MPF algorithm [60] reclustering is performed by a combination of the k-means clustering algorithm and split-merge heuristics. The k-means approach performs clustering based on the positions of particles. This approach suffers since particles are drawn from the sampling distribution and therefore their positions do not follow the true posterior distribution. This problem cannot be overcome by resampling on the full particle set because it would undermine the preservation of multi-modality, which is the main principle of MPF.

We propose to solve this problem and determine the number of mixture components without split-merge heuristics by the Patient Rule Induction Method (PRIM) [26], briefly described in Table 3.1. Our idea is to use PRIM to detect regions in which the particle approximation has high probability density and adopt these regions as mixture components. This way reclustering is done using both the spatial properties of particles and their weights. We define a mixture component as a set of particles in the interior of a 3D bounding box with sides parallel to the coordinate planes and proceed according to the algorithm in Table 3.1.

Table 3.1: Brief description of the PRIM algorithm
(1) Initialize the bounding 3D box with sides parallel to the coordinate planes so that it contains all particles.
(2) Repeat steps 2a and 2b while there are more than $N_1$ particles in the box.
(2a) Cut off $f$ percent of the total number of particles in the box by a plane parallel to one of the box's sides. Choose the side so that the probability density in the remaining part is the largest possible.
(2b) If the new density is smaller than the old one go to 3.
(3) Repeat steps 3a and 3b while there are fewer than $N_2$ particles in the box.
(3a) Expand the box along one side so that the total number of particles increases by $F$ percent. Choose the side so that the probability density in the resulting box is the largest possible.
(3b) If the new density is smaller than the old one go to 4.
(4) Particles in the obtained box define one mixture component; remove them from the tracking region and repeat steps 1-3 with the remaining particles.

3.1.4 Trajectory Reconstruction Algorithm

Without the particle reclustering step, MPF performs trajectory maintenance implicitly: a mixture component $M_t^m$ is obtained by propagation of the particles from a mixture component $M_{t-1}^m$. The reclustering step interferes with this natural trajectory evolution and redefines mixture components in such a way that a component $M_t^m$ can contain particles obtained by propagation from different components at time $t-1$. Therefore an additional mechanism for trajectory reconstruction is required.

We propose to reconstruct the trajectories of acoustic sources by assignment of mixture components in consecutive time-frames. For this purpose we define a metric that describes the similarity between two components. We pose the assignment as an optimization problem where the goal is to maximize the total similarity between assigned mixture components. This problem can be solved within the integer programming framework. If a mixture component at time $t$ ($t-1$) is not assigned to a component from $t-1$ ($t$) we initialize (terminate) a trajectory.

Since particle approximations of the mixture components' posterior distributions do not necessarily have identical support sets, it is hard to find a good measure of similarity between them.
In order to overcome this problem we fit a Gaussian distribution on each mixture component and use the obtained Gaussians to compute inter-component distances.

Let $s(t, k, m)$ denote the symmetrized Kullback-Leibler divergence [13] between the Gaussian distributions fitted on mixture components $M_t^k$ and $M_{t+1}^m$. We define the goodness of assignment for these components as:

The cost of not assigning components is defined as $d(t, 0, m) = d(t, k, 0) = C$. The second term in the exponent favors assignment of components similar in position and shape, while the first term favors assignment of components with similar probability mass. The constants $\lambda$ and $C$ are determined empirically to fit the application.

Let the link variable $c_{k,m}$ for components $k$ and $m$ take value 1 if the components are assigned and value 0 if they are not assigned. The optimal assignment is the one that maximizes the total goodness of assignment,

$$\arg\max_{c_{k,m}} \sum_{k=0}^{M_t} \sum_{m=0}^{M_{t+1}} d(t, k, m)\, c_{k,m},$$

under the constraint that each mixture component at time $t$ can be assigned to at most one other component at time $t - 1$ and vice versa.

This problem is solved by the integer programming technique. Details on how integer programming algorithms work can be found in Wolsey [65].

3.1.5 Detection of Speaker Appearances and Disappearances

Reconstructed trajectories have three possible origins: an active sound source (meeting participant), a temporary fluctuation in the posterior probability caused by reflections, or just a reclustering artifact. Our goal is to determine which trajectories belong to meeting participants and to segment these trajectories in order to discover intervals that correspond to the verbal activity of participants, i.e. to perform speaker segmentation. For this purpose we propose to apply the SOCPD algorithm on each trajectory.
A high likelihood that a certain segment of the reconstructed trajectory is produced by a large amount of acoustic evidence (many microphone pairs point in that direction) indicates that such a segment corresponds to the dominant acoustic activity, i.e. speech. Further, we conclude that the trajectory on which a speech segment is detected corresponds to a meeting participant. The proposed SOCPD algorithm acts as additional logic that sequentially discovers start and end points of speech segments on reconstructed trajectories. We use separate likelihood statistics for detection of speaker appearances and disappearances (see Klygis et al. [34]) and propose a way to compute these statistics from particle representations of mixture components obtained by the mMPF-TbD tracking algorithm.

Let us assume that the trajectory is represented as a sequence of particle sets $M_t = \{(x_{m,t}, w_{m,t}) : m = 1, \dots, N_t\}$ for $t = 1, \dots, T_{max}$. Note that for notational simplicity we drop the mixture component indices. We define a log-likelihood ratio at time $t$ as:

$$l_t := \log \frac{p(y_t \mid M_{t-1})}{p_0(y_t)}. \tag{3.2}$$

This ratio measures how likely it is that the observations $y_t$ are produced by the sound source at $M_{t-1}$. Since particles from $M_{t-1}$ are independent, the likelihood $p(y_t \mid M_{t-1})$ can be computed as:

$$p(y_t \mid M_{t-1}) = \sum_{m=1}^{N_{t-1}} w_{m,t-1}\, p(y_t \mid x_{m,t-1}), \tag{3.3}$$

where $p(y_t \mid x_{m,t-1})$ is defined by Equation (3.1). Distribution $p_0$ represents a uniform distribution on the observation space. Note that we condition on $M_{t-1}$ instead of $M_t$, which is dependent on the observation $y_t$. This does not represent a problem in our scenario since the time sampling rate is high enough.

The generalized likelihood ratio $AD_{t-1}$ represents the likelihood that a speaker becomes active at time $t_1$ and stops his activity at time $t_2 < t$:

[equation missing from transcript]

The statistic $A_t = \max_{t_1 < t} \sum_{\tau = t_1}^{t} l_\tau$ represents the likelihood that the speaker becomes active at some time $t_1 < t$ and is still active at time $t$.
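The computation of Equations (3.2) and (3.3) from a particle set, and of the statistic $A_t$ from its definition, can be sketched as follows. This is a minimal illustration under stated assumptions: `obs_likelihood` stands in for the single-particle likelihood of Equation (3.1), which is not reproduced here, `p0` is the constant value of the uniform density, and the start time is allowed to coincide with $t$ so that the first frame is covered. All function names are hypothetical.

```python
import math


def log_likelihood_ratio(particles, obs_likelihood, y, p0):
    """l_t of Eq. (3.2) for one frame.

    particles: list of (x, w) pairs describing M_{t-1}, weights summing to 1.
    obs_likelihood(y, x): assumed stand-in for p(y_t | x_{m,t-1}), Eq. (3.1).
    p0: value of the uniform density on the observation space.
    """
    # Eq. (3.3): weighted average of the per-particle likelihoods.
    mixture = sum(w * obs_likelihood(y, x) for x, w in particles)
    return math.log(mixture / p0)


def running_activity_statistic(llrs):
    """A_t for every t: best cumulative log-likelihood ratio over all
    start times t1 up to and including t (brute force, O(T^2))."""
    stats = []
    for t in range(len(llrs)):
        stats.append(max(sum(llrs[t1:t + 1]) for t1 in range(t + 1)))
    return stats
```

For example, the ratio sequence `[-1.0, 2.0, -0.5, 1.0]` gives `[-1.0, 2.0, 1.5, 2.5]`: a single negative frame lowers the statistic but does not reset it once enough evidence has accumulated.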
Therefore the statistic $D_t = AD_{t-1} - A_t$ is a measure of the likelihood that a speaker is not active at time $t$. The notation used is: $D_t$ - disappeared before time $t$, $A_t$ - active at time $t$, and $AD_t$ - appeared and disappeared up to time $t$.

Recursive update rules are given as:

$$AD_t = \max(AD_{t-1}, A_t) = \max(A_t, A_{t-1}, \dots)$$
$$A_t = l_t + \max(0, A_{t-1})$$

where $AD_0 = A_0 = 0$. According to [57], moments of speaker appearance and disappearance can be determined by application of appropriate thresholds on the statistics $AD_t$ and $D_t$, respectively.

To summarize, speaker appearance is detected at the moment $t_1$ at which the statistic $A_t$ goes over the first threshold. Speaker disappearance is detected at the time $t_2 > t_1$ at which $D_t$ becomes greater than the second threshold. After a disappearance is detected, the statistic $AD_{t_2}$ is set to zero and the algorithm is ready to detect a new speech interval.

3.2 Experimental Results and Discussion

We tested the proposed algorithm on the dataset collected in the University of Southern California smart room [8]. Four sessions with an approximate length of 15 minutes each were monitored with multiple modalities: a ceiling 4-camera tracking system, a 360° camera, a single microphone for speaker ID, and a circular 16-microphone array. Microphones were placed in the center of the meeting desk on a ring with 15 cm radius, as shown in Fig. 3.2.

Figure 3.2: Left: instrumented conference room (ceiling camera views); Right: 16-microphone array with the omni-directional camera above it.

The participants were given multiple topics on which to debate. While they were completely free to follow their beliefs, they were also given a list of arguments to help them along if they needed them. Mostly the interaction ended up being very spontaneous, with people seriously believing and arguing for their points of view. This induces frequent changes in the speaker activity, i.e., dynamic turn taking. Monitoring started immediately prior to the people entering the conference room.
The average turn duration was 6.727 seconds and in 9.7% of the total speech, different speakers overlapped. More details on the meeting dynamics can be found in Busso et al. [8].

The participants' positions obtained through human annotation from the ceiling multi-camera tracking system are accepted as the reference. Note that the accuracy in geometric space is limited due to the non-point source nature of the human speech production system. The audio data was annotated manually in order to get accurate speaker segmentation. Directions of arrival extracted by processing the Time Difference of Arrival for each microphone pair are used as observations. We partitioned the dataset into testing (3 sessions) and training (1 session) sets and learned observation likelihoods and transition distributions from the training set.

3.2.0.0.1 Tracking performance: In the first experiment we evaluated the tracking performance on intervals on which a participant speaks. For this experiment we use the mMPF-TbD algorithm. All reconstructed trajectories were analyzed and the one closest to the reference trajectory of a participant was assigned to that participant. The average angular error between projections of the estimated and true participant's position on the XoY plane (see Fig. 3.1) on speech intervals was 7.46°. Note that the nature of the observations (120 DoAs) makes it difficult to design a reliable frame-level detector of the active speaker's position in this scenario. As the relatively low angular errors show, the proposed mMPF-TbD algorithm accumulates evidence through consecutive frames, and discovers and maintains tracks for both acoustically dominant and inferior speakers (in 9.7% of total speech time, more than one speaker was active).

3.2.0.0.2 Speaker segmentation: In the second experiment we evaluate the performance of the SOCPD algorithm on the speaker segmentation task.
Since in our dataset speech represents the most prominent acoustic activity, it was possible to manually determine appropriate threshold values that enable the SOCPD algorithm to recognize speech segments on reconstructed trajectories. In Table 3.2 we present statistical properties of the speaker segmentation algorithm which give insights into the behavior of the algorithm in terms of the meeting dynamics.

Table 3.2: Statistical properties of the mMPF-TbD-SOCPD algorithm

avg. duration of speech interval [sec]                                 6.727
avg. appearance detection delay [sec]                                  0.421
avg. disappearance detection delay [sec]                               0.426
avg. duration of false appearance [sec]                                0.545
avg. duration of false disappearance [sec]                             0.531
no. of false disapp. per speech interval [overlapping speakers]        0.307
no. of false disapp. per speech interval [non-overlapping]             0.011
avg. duration of non-detected interval [sec]                           1.056
total no. of non-detected intervals [overlapping speakers]             45
total no. of non-detected intervals [non-overlapping]                  5

Note that 90% (45/50) of non-detected speech intervals take place in segments when multiple participants speak at the same time. Also, the average duration of the non-detected speech intervals (1.056 sec) is significantly shorter than the overall average (6.727 sec), which implies that most speech segments are lost in situations when multiple sources compete for detection. The same holds for false disappearances (non-existing pauses detected within longer speech segments), which are approximately 30 times more likely to occur in intervals where speakers overlap. Values for average delays in detection of start and end points of speech intervals, as well as the average duration of the falsely detected speech segments, are given in Table 3.2.

3.2.0.0.3 Multimodal Fusion: In the third experiment we explore the benefits derived by the proposed algorithm on the performance of our multimodal system [9] on the multi-modal speaker segmentation task.
We introduce two criteria for judging speaker segmentation quality on 1 sec intervals: the strong decision criterion, where speaker detection is considered correct if the speaker is active in at least 50% of the 1 sec time interval; and the weak decision criterion, where speaker detection is considered correct if the speaker is active in any part of the 1 sec interval.

Our multimodal system employs a ceiling 4-camera system providing visual hulls of the participants, a 360° camera for face tracking, a speaker identification system providing the identities of the current speaker (in this case, equivalent to the seating arrangement), and the 16-microphone array system. In the fusion algorithm (see Fig. 3.1a) the ceiling cameras and the 360° camera system are used to detect the number of meeting participants and their locations. In the previous implementation [8] the microphone array system was providing the angular position of the active speaker, estimated as the mode of the distribution obtained by projecting the directions of arrival for each microphone pair on the XoY plane at each time frame. In the new implementation we provide estimates of the angular positions of active speakers in the XoY plane obtained by the mMPF-TbD algorithm only on intervals in which speakers were actually detected by the SOCPD algorithm. Therefore, the new algorithm introduces two types of improvement: first, on intervals on which multiple speakers were detected it provides multiple angles; and second, estimates of angular positions of speakers are provided only on intervals on which speakers were actually detected. Speaker detection and localization is performed by probabilistic assignment of the angular speakers' positions obtained by the microphone array algorithm to the participants' locations obtained by the video tracking system.
The fusion of outputs from the microphone array algorithm and the speaker identification system allows the multimodal system to learn the identities of participants and perform speaker segmentation and localization in parallel. An overview of the multimodal fusion algorithm is presented in Fig. 3.1a. For more details see [9] and [8].

Performance improvements for both multimodal configurations, Mic. Array + Video and Mic. Array + Video + SID, and for both performance criteria are presented in Tables 3.3 and 3.4. It is evident that the proposed microphone array algorithm has a significant impact on the overall system performance. The relative gain in the tables is the relative reduction of the detection error; for the first row of Table 3.3, for example, (83.70 - 81.97) / (100 - 81.97) = 9.60%.

Table 3.3: Performance on speaker detection task: strong decision

                                   Old system    New system    relative
                                   detection     detection     gain
Mic. Array + Video                 81.97%        83.70%        9.60%
Mic. Array + Video + Speaker ID    83.25%        86.36%        18.57%

Table 3.4: Performance on speaker detection task: weak decision

                                   Old system    New system    relative
                                   detection     detection     gain
Mic. Array + Video                 88.48%        90.22%        15.10%
Mic. Array + Video + Speaker ID    90.57%        93.81%        34.36%

Even though the performance of the separate speaker identification (SID) system on the speaker detection task (for known assignment of participant identities to spatial locations) is relatively low, 60.10% for the strong and 67.85% for the weak detection criterion (see [9]), it provides complementary information to the microphone array algorithm and improves overall segmentation performance. This is due to the fact that the SOCPD and SID algorithms detect the active speaker in different manners: SOCPD does so by monitoring the competition for observations between different acoustic sources, while SID recognizes spectral differences between different speakers and silence. This complementarity adds a new aspect to the multimodal fusion algorithm.

3.3 Conclusions

In this work we presented improvements in our multimodal system for tracking of meeting participants and speaker segmentation.
We achieved these improvements by fusing information obtained by the 16 acoustic channels. We proposed an algorithm that can perform tracking of the acoustically active participants and extraction of speech intervals using Directions-of-Arrival estimated for each microphone pair as observations.

Tracking of acoustically active sources was done by use of the modified mixture particle filter (mMPF) in the Track-before-Detection (TbD) framework. We modified the original MPF and applied the patient rule induction method (PRIM) to discover mixture components in the posterior distribution. Trajectories were reconstructed by the optimal assignment of discovered mixture components in consecutive time frames. We formulated the optimal mixture component assignment as an integer programming problem and proposed a metric that describes the distance between mixture components. Tracking performance on segments with multiple overlapping speakers shows that the mMPF-TbD algorithm can successfully maintain multiple trajectories.

We proposed a novel way to address the speaker segmentation problem by the sequential change-point detection (SOCPD) method. We presented a way to compute the statistics used in SOCPD from a particle representation of a reconstructed trajectory. With appropriately tuned threshold values, the SOCPD algorithm applied on a particular trajectory discovered time intervals of dominant acoustic activity (speech).

Application of the proposed algorithm in the multimodal setup brought a relative speaker detection improvement of 18.57% according to the strong decision criterion and 34.36% according to the weak decision criterion.

Chapter 4: Multi-modal speaker segmentation and identification

4.1 Proposed method

Our multimodal speaker segmentation algorithm consists of 4 main steps (Figure 4.1):

Figure 4.1: Proposed architecture for multimodal fusion. Parameters $p_d$, $\lambda$ and $A$ are learned from training data.
Parameter $p_{i,i}$ defines the state transition and parameters $(\alpha, \beta)$ define the likelihood fusion model.

1. We track locations of the meeting participants using the multi-target tracking algorithm described in Chapter 2. The resulting locations are subsequently used as a meta-feature by the other modalities.
2. We extract local maxima of the modified SPR-GCC-PHAT function and treat them as microphone array observations. Furthermore, we compute the likelihood of these observations given the positions of active speakers obtained from the first step. We propose a joint probabilistic model for association of microphone array observations to positions of active speakers (Section 4.2.2).
3. We compute likelihoods of speaker identification observations (MFCCs) given speakers' identities. Likelihoods are modeled as Gaussian mixtures for single speakers and overlapped speaker pairs (Section 4.2.3).
4. We decode unknown identity-to-participant associations and speaker activity indicators. We perform fusion of the microphone array and speaker identification likelihoods in the HMM framework, for which we define a state transition model. The speaker activity indicator sequence is decoded by both Bayesian filtering and the Viterbi algorithm (Section 4.2.4).

4.2 Statistical Model

Let us introduce the notation that we use throughout the following sections. Let $K_t$ be the number of participants present in frame $t$. Their positions are contained in the joint vector $x_t = (x_{t,1}^T, \dots, x_{t,K_t}^T)^T$, where for each $k$ ($k = 1, \dots, K_t$) the vector $x_{t,k} \in \mathbb{R}^3$ represents the location of the $k$th participant in the three-dimensional tracking space.

Let $L_{1:t}$ be the total number of different participants' trajectories registered on the interval $[1, t]$. We define the trajectory index vector $e_t = (e_{t,1}, \dots, e_{t,K_t})$, such that the trajectory of the $k$th participant at frame $t$ has the index $e_{t,k}$ in the list of all trajectories detected on the interval $[1, t]$.
All trajectory index coordinates assigned to the same participant's trajectory obtain the same value from the set $\{1, \dots, L_{1:t}\}$, and this value differs from the values assigned to other trajectories.

In the following presentation we assume that at each frame $t$ the video tracking system provides us the vectors $x_t$ and $e_t$. In other words, we treat these vectors as known parameters.

The speech activity of the participants present is represented by the binary activity indicator vector $a_t = (a_{t,1}, \dots, a_{t,K_t}) \in \{0, 1\}^{K_t}$, where $a_{t,k} = 1$ denotes that the $k$th participant is speaking and $a_{t,k} = 0$ denotes that the $k$th participant is silent. The total number of active speakers at time $t$ is $\sum_{k=1}^{K_t} a_{t,k}$.

We assume that all participants belong to a finite pool of people whose identities are known in advance. Let us denote the set of these identities as $\mathcal{I} = \{1, \dots, I\}$. Assuming that the total number of trajectories $L_{1:T}$ registered on the full interval of interest $[1, T]$ is smaller than $L$, we define the trajectory-to-identity mapping vector $\theta = (\theta_1, \dots, \theta_L)$ whose coordinates take values from $\mathcal{I}$. This vector does not evolve through time, and it represents an unknown modeling parameter.

Furthermore, we define a hidden Markov model (see Figure 4.2) in which the states are the speaker activity indicators $a_t$ and $\theta$ is an unknown parameter. As mentioned earlier, vectors $x_t$ and $e_t$ are known parameters. In this model we use both the microphone array observations $y_t^{MA}$ and the speaker ID observations $y_t^{SID}$.

The rapid processing rate, stemming from the short duration of each data frame, makes the synchronous switching of multiple speaker activity indicators very unlikely. Therefore, we allow only those state transitions in which the Hamming distance between consecutive states is less than or equal to 1. We define two different state transition models that incorporate this constraint.
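The Hamming-distance constraint on state transitions can be made concrete with a short sketch. It enumerates binary activity vectors, optionally caps the number of simultaneously active speakers, keeps a self-transition probability, and spreads the remaining mass equally over the states that differ in exactly one indicator. The function name and the parameters `p_stay` and `max_active` are illustrative assumptions, not values or interfaces taken from this work.

```python
from itertools import product


def build_transition_model(num_participants, p_stay, max_active=None):
    """Return (states, transitions) for binary activity vectors.

    Only transitions with Hamming distance <= 1 are allowed.
    transitions[a][b] is the probability of moving from state a to b.
    max_active (assumed option) caps the number of active speakers.
    """
    states = [s for s in product((0, 1), repeat=num_participants)
              if max_active is None or sum(s) <= max_active]
    allowed = set(states)
    transitions = {}
    for a in states:
        # Neighbours: flip exactly one activity indicator.
        neighbours = []
        for k in range(num_participants):
            b = a[:k] + (1 - a[k],) + a[k + 1:]
            if b in allowed:
                neighbours.append(b)
        if neighbours:
            row = {b: (1.0 - p_stay) / len(neighbours) for b in neighbours}
            row[a] = p_stay
        else:
            # Degenerate case (no participants): the state cannot change.
            row = {a: 1.0}
        transitions[a] = row
    return states, transitions
```

With four participants and `max_active=2` this yields 1 + 4 + 6 = 11 states, while leaving `max_active` unset yields all 16 binary vectors.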
In the first one we do not allow more than two active speakers per frame, while in the second one we pose no limitation on the number of active speakers. The total numbers of states for these models are $\sum_{j=0}^{2} \binom{K_t}{j}$ and $2^{K_t}$, respectively. We specify a (high) transition probability of staying in the same state, $p(a_{t+1} = a_t \mid a_t)$, for both models, while assuming that all other allowed state transitions are equally probable.

Figure 4.2: HMM: $a_t$ - speaker activity indicator vector; $\theta$ - unknown trajectory-to-identity association parameter

4.2.1 Multi-view video tracking

There are many existing techniques in the computer vision community that can be applied for tracking humans in meeting room environments, e.g. [16, 21]. In this work we apply an adaptive background subtraction algorithm on each of the camera views and use the results to reconstruct the room height map. This map is further used as the raw observation stream. Local maxima of the height map are extracted and used to construct the proposal distribution as described in Chapter 2. For the purpose of this application we have used history rejuvenation with depth L = 2. For the association free likelihood model we use a background subtraction algorithm followed by the 3D reconstruction. For the data association free likelihood model we propose a 3D modification of the precision-recall model proposed for the image tracking space by [59]. Learning of the model parameters was done by establishing ground truth target locations using color markers (mini paper hats) on the participants' heads. By updating the list of all registered participant trajectories on the interval $[1, t]$ we get the vector $e_t$, where its $k$th coordinate $e_{t,k}$ is an index in the trajectory list.

4.2.2 Microphone array likelihood model

Classical microphone array processing algorithms compute the time domain GCC-PHAT function [7] and estimate the direction of arrival from the global maximum of this function.
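As a point of reference for the classical approach, the textbook GCC-PHAT computation for a single microphone pair can be sketched as below. This is a generic illustration, not the modified SPR-GCC-PHAT function proposed in this work: it uses a plain O(n^2) discrete Fourier transform to stay dependency-free, works with circular lags in samples, and all function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig1, sig2):
    """Circular GCC-PHAT of two equal-length frames, indexed by lag 0..n-1."""
    spec1, spec2 = dft(sig1), dft(sig2)
    cross = [a * b.conjugate() for a, b in zip(spec1, spec2)]
    # Phase transform: keep only the phase of the cross-spectrum.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def estimate_delay(sig1, sig2):
    """Delay (in samples) of sig1 relative to sig2 at the global maximum."""
    r = gcc_phat(sig1, sig2)
    n = len(r)
    lag = max(range(n), key=lambda i: r[i])
    # Map lags in the upper half of the circular range to negative delays.
    return lag if lag <= n // 2 else lag - n
```

Taking only the global maximum, as `estimate_delay` does, is exactly what limits the classical method to the dominant speaker; converting the delay to a direction of arrival additionally requires the array geometry and sampling rate, which are not modeled here.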
This way each microphone pair observes only the dominant speaker and it is hard to get a correct solution on segments with overlapping speakers. Other solutions, based on the steered power response [12, 21], are used in a similar manner, where the global maximum of the SPR-GCC-PHAT function determines the location of the dominant speaker.

We propose a modification of the SPR-GCC-PHAT speaker localization algorithm. First, we define a 3D rectangular grid that covers the tracking space. Second, we extract multiple local maxima of the SPR-GCC-PHAT function on the grid and treat their locations as observations. We model the association of these observations to the locations of the active speakers (output of the video tracking system) using a joint probabilistic data association model [47]. This model allows active speakers without assigned observations and observations without assigned active speakers.

Observations $y_t^{MA} = (y_{t,1}^{MA}, \dots, y_{t,M_t}^{MA})$ correspond to the $M_t$ local maxima that are not smaller than $\gamma \in [0, 1]$ times the value of the global maximum of the modified SPR-GCC-PHAT function given by Equation (4.1). Parameter $\gamma$ can be tuned to fit an application.

[Equation (4.1) missing from transcript]

Functions $S_{m_1}$ and $S_{m_2}$ represent the Fourier transforms of the 100 ms Hamming-windowed speech segments recorded by the microphone pair $m = (m_1, m_2)$. $M$ is the total number of microphone pairs and $\mathcal{F}^{-1}$ the inverse Fourier transform operator. We introduce the weighting coefficients $\alpha_m(y)$ that are equal to one if the location $y$ is visible by both microphones in the pair $m$ and equal to zero otherwise. The additional coefficient $\frac{M}{\sum_{m=1}^{M} \alpha_m(y)}$ de-penalizes the function value in points on the SPR-GCC-PHAT grid that are not visible from all microphone locations.

The total number of local maxima $M_t$ is equal to the sum of the number of observations $M_t^a$ that correspond to the active speakers and the number of observations $M_t^c$ that represent clutter.
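The extraction of observations as thresholded local maxima on a 3D grid can be sketched as follows. The sketch assumes the function values are already available as a nested list `grid[i][j][k]` of non-negative numbers, and it treats a grid point as a local maximum when no point in its 3 x 3 x 3 neighborhood is strictly larger; that neighborhood choice and the function name are assumptions made for illustration.

```python
from itertools import product


def thresholded_local_maxima(grid, gamma):
    """Return grid indices of local maxima that are at least
    gamma times the global maximum (gamma in [0, 1])."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    global_max = max(v for plane in grid for row in plane for v in row)
    # All 26 neighbour offsets of a 3 x 3 x 3 neighbourhood.
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    peaks = []
    for i, j, k in product(range(nx), range(ny), range(nz)):
        value = grid[i][j][k]
        if value < gamma * global_max:
            continue
        is_peak = True
        for di, dj, dk in offsets:
            a, b, c = i + di, j + dj, k + dk
            inside = 0 <= a < nx and 0 <= b < ny and 0 <= c < nz
            if inside and grid[a][b][c] > value:
                is_peak = False
                break
        if is_peak:
            peaks.append((i, j, k))
    return peaks
```

Raising `gamma` towards 1 recovers the classical single-observation behavior, while lower values admit weaker peaks that may belong to secondary speakers or to clutter.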
We model $M_t^c$ as a Poisson random variable with parameter $\lambda$. Further, we define the association vector $r_t = (r_{t,1}, \dots, r_{t,M_t})$, where $r_{t,i} = k$ ($k = 1, \dots, K_t$) for $a_{t,k} = 1$ means that the observation $y_{t,i}^{MA}$ is assigned to the active speaker $k$. If $r_{t,i} = 0$, then the $i$th observation is not assigned to any speaker. An example of a possible data association is given in Figures 4.3 and 4.4, where we omit time indices for simplicity. The likelihood of the microphone array observations can be obtained by averaging over all possible associations and is given by Equation (4.2).

Figure 4.3: Sample participant arrangement: $x_t$ - participants' locations, $y_t^{MA}$ - local maxima of the SPR-GCC-PHAT function, $\varphi_{i,j}$ - angular distance between the $i$th observation and the $j$th participant

$$p(y_t^{MA} \mid a_t, x_t) = \sum_{r_t} p(y_t^{MA} \mid a_t, r_t, x_t)\, p(r_t \mid M_t^a, M_t^c)\, p(M_t^c) \tag{4.2}$$

If the detection probability for an active speaker is $p_d$ then:

[equation missing from transcript]

Also, under the reasonable assumption that all valid assignments are equally probable:

[equation missing from transcript]

Therefore, the final expression for the microphone array observation likelihood $p(y_t^{MA} \mid a_t, x_t)$ is

[equation missing from transcript]

where $V$ is the number of SPR-GCC-PHAT grid vertices.

Figure 4.4: Sample associations for the given state, with $p(y^{MA} \mid a, x, r)$ involving the factors $p(\varphi_{1,1})\, p(\varphi_{4,3})$ and $p(y^{SID} \mid a, \theta) = p(y^{SID} \mid ID_1, ID_4)$

4.2.2.1 Learning of the likelihood model parameters

As shown in [21] and [9], in the spherical coordinate system with the origin in the center of the small-radius microphone array, source localization techniques based on the TDoA determine the angular coordinates of the source much more precisely than its radial coordinate. Therefore, we model the distribution $p(y_{t,i}^{MA} \mid x_{t, r_{t,i}})$ as a function of the angular distance (e.g. $\varphi_{1,1}$ and $\varphi_{4,3}$ in Figure 4.3) from the observation to the speaker location, measured from the center of the microphone array. For the given neighborhood size $A$ we learn this distribution and the other unknown model parameters, $p_d$ and $\lambda$, from the training data.
The probability $p_d$ is equal to the percentage of participant-frames in which active speakers get at least one observation in the neighborhood of size $A$. The value $\lambda$ is the expected number of observations per frame which do not fall in the $A$-neighborhood of any active speaker. Finally, the probability distribution of a single observation given a speaker location is learned as a function of the discretized angular distance from the speaker to the observation, where this probability is zero for distances greater than $A$.

4.2.3 Speaker identification

Speaker identification systems based on single-speaker Gaussian mixture models (GMM) for a known set of speakers do not perform well in the presence of overlapped speech. In order to tackle this difficulty we train GMMs both for single speakers and for combinations of two overlapped speakers. For the two-speaker models the corresponding single-speaker channels were mixed with equal average energy. Our identification algorithm employs MFCCs extracted on 100 ms segments.

Variables $a_t$ and $\theta$, in combination with the participants' locations obtained from the video tracking module, respectively define the locations in space occupied by active speakers and assign identities of the participants to the particular locations. Therefore, their combination determines the identities of the active speakers (Figure 4.4) and the speaker identification system can provide the likelihood $p(y_t^{SID} \mid a_t, \theta)$.

4.2.4 Modality fusion

Under the assumption that the microphone array and speaker identification likelihoods are independent, the joint likelihood can be represented as:

[equation missing from transcript]

where different choices of the parameter pair $(\alpha, \beta) \in [0, 1]^2$ define different likelihood models. We use the following parameter combinations:
(i) $(\alpha, \beta) = (0, 1)$: speaker identification only;
(ii) $(\alpha, \beta) = (1, 0)$: microphone array only;
(iii) $(\alpha, \beta) = (1, \beta)$, $\beta \le 1$: modality fusion.
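One plausible reading of the three parameter combinations, stated here as an assumption because the fusion equation itself is not legible in this transcript, is that $\alpha$ and $\beta$ act as exponents on the two likelihoods, i.e. as weights on their logarithms. Under that assumption the fusion can be sketched as follows; the function names are hypothetical.

```python
def fused_log_likelihood(log_p_ma, log_p_sid, alpha, beta):
    """Weighted sum of log-likelihoods (assumed exponent-style fusion).

    (alpha, beta) = (0, 1): speaker identification only
    (alpha, beta) = (1, 0): microphone array only
    (alpha, beta) = (1, beta), beta <= 1: modality fusion
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    return alpha * log_p_ma + beta * log_p_sid


def fuse_state_scores(log_ma, log_sid, alpha, beta):
    """Apply the fusion to every state; both inputs map state -> log-likelihood."""
    return {state: fused_log_likelihood(log_ma[state], log_sid[state], alpha, beta)
            for state in log_ma}
```

With this form, choosing $\beta < 1$ shrinks the differences between the speaker identification log-likelihoods of competing states, so the microphone array term has relatively more influence on the decision.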
Due to the limitation that only models for single and two overlapped speakers are available, the first and the third parameter combination can be used only with the first transition model (maximum two active speakers at a time), while the second parameter combination can be used with both state transition models (no limit on the number of active speakers). We found that the speaker identification likelihood model does not provide reliable disambiguation between states directly connected in the transition model, and therefore we discount differences between likelihoods of these states by the parameter $\beta$.

4.2.5 Speaker Identity and Activity Decoding

In this section we describe two approaches for decoding of the optimal state sequence and the unknown trajectory-to-identity association parameter. For situations where it is necessary to compute estimates of the unknown variables sequentially, each time a new observation arrives, we propose the sequential Bayesian filtering approach. In this case the optimal state sequence $a^*_{1:T}$ is defined as:

[Equation (4.3) missing from transcript]

and in order to compute it we provide sequential update equations for the state filtering distribution $p(a_t \mid y_{1:t}^{MA,SID}, x_{1:t}, e_{1:t})$ (Section 4.2.5.1).

In situations where we can afford to process the whole session, the interval $[1, T]$, in a batch, we define the optimal parameter value as $\theta^* = \arg\max_{\theta} p(\theta \mid y_{1:T}^{MA,SID}, x_{1:T}, e_{1:T})$ and the optimal sequence as the path that maximizes the joint posterior state probability (Section 4.2.5.2):

[Equation (4.4) missing from transcript]

In the following sections we simplify the notation by leaving out the conditioning on the known parameters $x_t$ and $e_t$.

4.2.5.1 Sequential Bayesian Filtering

Let us assume that the joint filtering distribution $p(a_{t-1}, \theta \mid y_{1:t-1}^{MA,SID})$ at time $t-1$ is known from the previous update step. Then we compute the predictive distribution as:

$$p(a_t, \theta \mid y_{1:t-1}^{MA,SID}) = \sum_{a_{t-1}} p(a_t \mid a_{t-1})\, p(a_{t-1}, \theta \mid y_{1:t-1}^{MA,SID}). \tag{4.5}$$

Since there is no observation history at time $t = 1$, we initialize the update process by the following distribution:

[Equation (4.6) garbled in transcript; its two cases are "all allowed $\theta$" and "otherwise"]

The number of possible trajectory-to-identity associations is $|\Theta| = I(I-1)\cdots(I-L+1)$ and the size of the joint state-parameter space is $2^{K_t} |\Theta|$.

We obtain the state filtering distribution by marginalization over the space of the unknown parameter $\theta$, i.e. by averaging over all trajectory-to-identity associations:

$$p(a_t \mid y_{1:t}^{MA,SID}) \propto \sum_{\theta} p(y_t^{MA,SID} \mid a_t, \theta)\, p(a_t, \theta \mid y_{1:t-1}^{MA,SID}). \tag{4.7}$$

After the new observation $y_t^{MA,SID}$ is introduced we get the updated distribution of the unknown parameter, $p(\theta \mid y_{1:t}^{MA,SID})$, by marginalization over the state space:

$$p(\theta \mid y_{1:t}^{MA,SID}) \propto \sum_{a_t} p(y_t^{MA,SID} \mid a_t, \theta)\, p(a_t, \theta \mid y_{1:t-1}^{MA,SID}). \tag{4.8}$$

Finally, since on the interval $[1, t]$ only $L_{1:t}$ trajectories were registered, the meaningful part of the parameter distribution is $p(\theta_{1:L_{1:t}} \mid y_{1:t}^{MA,SID})$, and we obtain it by marginalization over the remaining coordinates of $\theta$.

We apply all mentioned steps for the estimation of the state and the parameter filtering distributions sequentially for each $t = 1, \dots, T$, which we summarize in the following algorithm:

Algorithm 2 Sequential Bayesian Filtering
for t = 1 to T do
    (A) Compute $p(a_t, \theta \mid y_{1:t-1}^{MA,SID})$:
        if t = 1 then
            Use Equation (4.6).
        else
            Use Equation (4.5).
        end if
    (B) Compute $p(a_t \mid y_{1:t}^{MA,SID})$. Use Equation (4.7).
    (C1) Compute $p(\theta \mid y_{1:t}^{MA,SID})$. Use Equation (4.8).
    (C2) Compute $p(\theta_{1:L_{1:t}} \mid y_{1:t}^{MA,SID})$ by marginalization.
end for

The total complexity of the forward sequential filtering algorithm on the interval $[1, T]$ is $O(T \cdot 2^{2L} \cdot |\Theta|)$.

4.2.5.2 Viterbi Decoding

In this algorithm we use the same update equations (4.6), (4.5) and (4.8) to get the distribution of the unknown association parameters. The main difference is that at the final time $T$ the total number of detected trajectories and the information about their overlap in time are available. Therefore, at $T$ the set of all possible trajectory-to-identity associations $\Theta$ is known exactly. We incorporate this information in Equation (4.6), which allows us to work with valid associations only and decreases the overall computational complexity.

Algorithm 3 Viterbi Decoding
for t = 1 to T do
    (A) Compute $p(a_t, \theta \mid y_{1:t-1}^{MA,SID})$:
        if t = 1 then
            Use Equation (4.6).
        else
            Use Equation (4.5).
        end if
    (B) Compute $p(\theta \mid y_{1:t}^{MA,SID})$. Use Equation (4.8).
end for
(C) Find the optimal parameter $\theta^*$. Use Equation (4.4).
(D) Find the optimal state sequence $a^*_{1:T}$. Use Equation (4.4).

Steps (C) and (D) have complexity $O(T \cdot 2^{2L})$ and therefore the total complexity of the Viterbi decoding algorithm is effectively the same as for the sequential Bayesian filter.

4.3 Results and discussion

We test the proposed algorithms on two datasets. The first set represents a reading session where four participants read a given text so that their turns significantly overlap. The correct segmentation for this dataset is obtained manually. The second set is a semi-synthetic set obtained, similarly to [37], by combining and overlapping single-speaker segments recorded in the meeting room environment by four different speakers. The total length of the sessions is 15 minutes with 27.4% of overlapped speech, where the average durations of the segments with one and two active speakers are respectively [illegible].4 s and 3.3 s.

Two additional datasets are used to learn the parameters of the microphone array likelihood model $(p_d, \lambda)$ and the probability distribution of a single observation given the single speaker location. These sets represent regular meeting sessions with 4 participants, in which the speakers overlap on 8% of the total session length.
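The learning of $p_d$ and $\lambda$ from annotated training frames, following the definitions of Section 4.2.2.1, can be sketched as below. The sketch makes simplifying assumptions that are not taken from this work: each frame is given as a pair (active speaker azimuths, observation azimuths) in degrees, the angular distance is computed on the azimuth only, and the function names are hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two azimuths in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def learn_pd_and_clutter_rate(frames, neighbourhood):
    """Estimate (p_d, lambda) from training frames.

    frames: list of (speakers, observations); both are lists of azimuths.
    p_d: fraction of active participant-frames with at least one
         observation inside the neighbourhood of the speaker.
    lambda: mean number of observations per frame that fall outside
            the neighbourhood of every active speaker.
    """
    active_frames = detected = clutter = 0
    for speakers, observations in frames:
        for s in speakers:
            active_frames += 1
            if any(angular_distance(s, o) <= neighbourhood for o in observations):
                detected += 1
        clutter += sum(
            1 for o in observations
            if all(angular_distance(s, o) > neighbourhood for s in speakers)
        )
    p_d = detected / active_frames if active_frames else 0.0
    lam = clutter / len(frames) if frames else 0.0
    return p_d, lam
```

Widening `neighbourhood` can only move observations from the clutter count into detections, which matches the qualitative trend expected when the neighborhood size grows.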
Model parameter learning is described in more detail at the end of Section 4.2.2.1.

In our experiments not all microphones in the array are visible from all SPR-GCC-PHAT grid points, due to the occlusions caused by the omnidirectional camera placed in the center of the array. Therefore, microphone array observations are extracted as the local maxima of the modified SPR-GCC-PHAT function computed in points of the rectangular 20 cm grid. For practical purposes, we define the local maxima as the regional maxima in the 3 x 3 x 3 connected neighborhoods on the grid. Processing is done on 100 ms signal segments passed through the Hamming window with 100 ms frame shift.

We have computed model parameters for different neighborhood sizes $A \in \{5, 10, 15, 20, 25, 30, 35, 40\}$ and three different sets of observations. The first set contains all extracted regional maxima; the second set, all regional maxima greater than 55% of the global maximum; and the third set, all regional maxima greater than 75% of the global maximum. For all neighborhoods $A$, the values $p_d$ and $\lambda$ rise with an increase in the number of used observations. Ideally, we want a high speaker detection probability $p_d$ and a low probability of false speaker detections. Since the performance with the observation threshold levels 55% and 75% is better than the performance with all extracted local maxima, we present results only for the second and the third observation set.

Probability $p_{fd}$ presented in Table 4.1, although not a modeling parameter, represents the percentage of participant-frames in which a non-active speaker gets at least one observation in the $A$-neighborhood and gives insight into the dependency of the false speaker detections on the number of extracted local maxima. Table 4.1 contains the model parameters computed for different values of the speaker neighborhood $A$.
Parameters in the column group $R_{LM} > 0$ are computed using all regional maxima, while the following two column groups, $R_{LM} > 0.55 R_{GM}$ and $R_{LM} > 0.75 R_{GM}$, correspond respectively to the cases where only the regional maxima higher than 55% and 75% of the global maximum $R_{GM}$ are used as observations. Values $p_d$ and $p_{fd}$ rise with relaxation of the neighborhood size A while the expected number of false alarms $\lambda$ falls. Ideally, we want high speaker detection probability $p_d$ and low probability of false detections $p_{fd}$. Value $p_{fd}$ (Table 4.1) rises significantly with the number of extracted local maxima. Having this in mind, we conduct all experiments for the threshold levels 55% and 75%.

Table 4.1: Model Parameters: Varying Neighborhood Sizes

         R_LM > 0             R_LM > 0.55 R_GM     R_LM > 0.75 R_GM
A [°]    Pd     Pfd    λ      Pd     Pfd    λ      Pd     Pfd    λ
5        31.1   4.8    4.51   28.6   1.1    1.57   27.0   0.8    1.06
10       74.5   17.9   3.96   68.0   2.8    1.07   64.0   1.9    0.59
15       84.0   32.0   3.84   75.5   3.7    0.97   70.7   2.4    0.50
20       89.4   43.0   3.77   79.2   4.8    0.92   73.8   2.7    0.46
25       92.3   50.9   3.73   81.8   5.8    0.89   76.3   3.2    0.43
30       94.3   56.5   3.71   82.8   7.0    0.88   77.1   3.7    0.42
35       95.0   60.5   3.70   83.1   7.4    0.87   77.4   4.0    0.42
40       95.7   66.6   3.69   83.3   8.5    0.87   77.5   4.4    0.41

Observations for the speaker identification are MFCC coefficients computed for 100ms frames aligned with the microphone array frames. Gaussian mixture models (GMM) for silence and single participants were trained on 30s training samples, while GMMs for two overlapped participants were trained on samples obtained by overlapping two single speaker samples with equal average energy.
We evaluated the speaker segmentation performance for 100ms frames using precision (P), recall (R) and F-measure ($F = \frac{2PR}{P+R}$). This type of evaluation is a standard for speaker segmentation type of problems [37, 40] and gives more insight into the performance than the number of correctly detected speaker-frames.

$P = 100 \cdot \frac{\#\text{ of found true active speaker-frames}}{\#\text{ of found active speaker-frames}}$   (4.9)

$R = 100 \cdot \frac{\#\text{ of found true active speaker-frames}}{\#\text{ of true active speaker-frames}}$   (4.10)

For the presentation of the experimental results we use the following notation:
• MA0.55 and MA0.75 denote microphone array likelihood models, (α, β) = (1, 0), with observations obtained with threshold levels 55% and 75% respectively.
• MA1.00 denotes the baseline microphone array likelihood model presented in [21]. This method uses only the global maximum of the SPR-GCC-PHAT function as the observation, while the likelihood of this observation given locations of participants is modeled as a product of likelihoods for each participant. Single participant likelihoods take a high constant value when the observation is in the A-neighborhood of the active speaker or when it is not in the A-neighborhood of a participant that is not speaking. Otherwise, they take a low constant value. Both constants and the neighborhood size A are chosen to maximize the F-measure value. For more details on the baseline model see Appendix .2.
• SID denotes a likelihood model based on MFCC coefficients extracted in the speaker identification module. We use this model as the second baseline.
• MAxx&SID denotes the combination of the likelihood models defined in Section 4.2.4. The parameter pair used for fusion is (α, β) = (1, 0.5).

We performed exhaustive evaluations of the system performance for different state transition matrices and neighborhood sizes.
In this work we present results for the optimal neighborhood size A = 10 and two different state transition models which have in common that not more than one speaker can change activity between two frames and that all allowed state transitions except stay-in-the-same-state, $p(a_{t+1} = a_t \mid a_t) = 0.99$, are equally likely. The only difference is that in one model we introduce the constraint that not more than two speakers can be active in one frame. We present results for two types of HMM state decoding: forward maximum likelihood decoding by the sequential Bayesian filtering and forward-backward optimal sequence decoding by the Viterbi algorithm.

In order to validate the choice A = 10 and $p(a_{t+1} = a_t \mid a_t) = 0.99$, which we used throughout the experiments, we present Tables 4.2 and 4.3. All values in these tables are obtained for the best likelihood model MA0.75&SID with the parameter pair (α, β) = (1, 0.5). The observed trend is that higher values of the state transition parameter improve the overall segmentation performance, while the neighborhood size A = 10 maximizes performance.

Table 4.2: Performance vs. State Transition Model Parameter $p(a_{t+1} = a_t \mid a_t)$ for neighborhood size A = 10

p_ii   0.7    0.8    0.9    0.95   0.975  0.99
P      94.6   95.0   95.3   95.5   95.7   95.8
R      85.6   86.9   88.6   90.6   91.0   93.1
F      89.9   90.8   91.8   93.0   93.3   94.4

Table 4.3: Performance vs. Neighborhood Size A for transition parameter $p_{ii} = 0.99$

A [°]  5      10     15     20     25     30     35     40
P      87.5   95.8   95.9   95.6   95.6   95.8   95.6   95.6
R      81.4   93.1   91.2   92.3   92.2   92.3   92.3   92.6
F      84.4   94.4   93.6   93.9   93.9   93.9   93.9   94.0

We tested the performance of the microphone array likelihood models MA0.55 and MA0.75 that we propose against the baseline model MA1.00 in the setup where we posed no limitations on the number of active speakers. We present these results for both decoding schemes in Tables 4.4 and 4.5.
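The two state transition models described above can be illustrated with a short sketch. It is a minimal construction under stated assumptions, not the code used for the experiments: the function name `build_transition_matrix`, the representation of a state as a tuple of 0/1 activity flags, and the handling of a state with no allowed move are hypothetical.

```python
import itertools
import numpy as np

def build_transition_matrix(num_speakers=4, p_stay=0.99, max_active=None):
    """Activity-state transition matrix: at most one speaker changes activity
    per frame, staying has probability p_stay, and the remaining probability
    mass is split equally among the allowed moves.
    max_active (optional) removes states with more active speakers."""
    states = [s for s in itertools.product((0, 1), repeat=num_speakers)
              if max_active is None or sum(s) <= max_active]
    n = len(states)
    T = np.zeros((n, n))
    for i, a in enumerate(states):
        # Allowed moves: states that differ from `a` in exactly one speaker.
        neighbors = [j for j, b in enumerate(states)
                     if sum(x != y for x, y in zip(a, b)) == 1]
        if not neighbors:          # degenerate case: nowhere to go
            T[i, i] = 1.0
            continue
        T[i, i] = p_stay
        for j in neighbors:
            T[i, j] = (1.0 - p_stay) / len(neighbors)
    return states, T

states_free, T_free = build_transition_matrix(4)                 # no limit
states_lim, T_lim = build_transition_matrix(4, max_active=2)     # at most 2 active
print(len(states_free), len(states_lim))
```

With four participants the unconstrained model has 16 states, while limiting the number of simultaneously active speakers to two leaves 11 states; in both cases every row of the matrix sums to one.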
The first three columns represent overall precision (P), recall (R) and F-measure; the following three columns contain the same performance measures on segments with one active speaker; and the last three columns contain performance measures on segments with overlapped speech.

Table 4.4: Segmentation Performance: Viterbi Decoding, No Limit on Number of Active Speakers

method   P      R      F      P1     R1     F1     P2     R2     F2
MA1.00   98.6   76.5   86.1   98.7   98.2   98.4   99.6   47.7   64.5
MA0.55   90.6   88.4   89.5   86.2   88.8   87.5   98.4   87.9   92.9
MA0.75   91.5   90.4   91.0   87.3   90.2   88.7   98.8   90.8   94.6

Table 4.5: Segmentation Performance: Bayes Decoding, No Limit on Number of Active Speakers

method   P      R      F      P1     R1     F1     P2     R2     F2
MA1.00   96.7   76.7   85.6   96.0   95.7   95.9   100    51.6   68.1
MA0.55   85.9   89.0   87.4   79.0   89.8   84.1   98.2   88.0   92.8
MA0.75   86.1   90.7   88.4   78.0   91.9   85.3   97.9   89.2   93.3

For both decoding schemes, our likelihood models (MA0.55 and MA0.75) give better overall F-measure performance than the baseline (MA1.00). When employing the Viterbi decoding scheme (Table 4.4) on segments with only one active speaker, the relative advantage of the baseline over our model MA0.75 is 10.9% (F1 of 98.4% vs. 88.7%). Our model presents a balanced performance both on the segments with a single active speaker (88.7%) and on overlapped speech (94.6%). The relative improvement over the baseline on segments with overlapped speech is 46.7% (F2 of 94.6% vs. 64.5%).

Tables 4.6 and 4.7 contain results for the transition model that allows at most two active speakers per frame. This transition model allows us to compare all likelihood models.
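The relative figures quoted above can be checked directly from the tabulated values. The sketch below assumes the F-measure is the usual harmonic mean of precision and recall, 2PR/(P+R), and that "relative" advantage means the difference divided by the lower reference value; the helper names `f_measure` and `relative_gain` are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old

# Viterbi decoding, no limit on active speakers (Table 4.4).
print(round(relative_gain(98.4, 88.7), 1))   # baseline vs. MA0.75, one active speaker
print(round(relative_gain(94.6, 64.5), 1))   # MA0.75 vs. baseline, overlapped speech
print(round(f_measure(98.8, 90.8), 1))       # F2 of MA0.75 from its P2 and R2
```

Because the tables list P and R rounded to one decimal, an F value recomputed from them can differ from the tabulated F in the last digit.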
Table 4.6: Segmentation Performance: Viterbi decoding, Maximally 2 Active Speakers per Frame

method       P      R      F      P1     R1     F1     P2     R2     F2
SID          74.7   90.5   81.8   69.8   98.5   81.7   85.1   79.9   82.4
MA1.00       98.7   78.1   87.2   98.6   98.8   98.7   100    50.6   67.2
MA0.55       93.3   88.8   91.0   90.0   90.7   90.3   99.3   86.4   92.4
MA0.75       94.8   90.7   92.7   92.2   92.5   92.4   99.6   88.2   93.6
MA1.00&SID   97.7   89.0   93.1   97.7   98.4   98.1   100    77.9   87.6
MA0.55&SID   95.5   91.3   93.3   92.8   94.0   93.4   99.9   87.1   93.1
MA0.75&SID   95.8   93.1   94.4   93.6   95.2   94.4   99.9   90.3   94.9

Table 4.7: Segmentation Performance: Bayes filtering, Maximally 2 Active Speakers per Frame

method       P      R      F      P1     R1     F1     P2     R2     F2
SID          72.0   87.7   79.1   66.9   95.4   78.7   82.8   77.5   80.1
MA1.00       96.0   76.2   84.9   94.9   94.4   94.7   100    52.0   68.4
MA0.55       86.9   89.4   88.2   80.5   91.3   85.6   98.7   86.9   92.4
MA0.75       87.9   90.8   89.3   82.0   93.2   87.3   98.7   87.5   92.8
MA1.00&SID   91.8   88.5   90.1   88.1   97.5   92.5   99.5   76.7   86.6
MA0.55&SID   90.5   90.4   90.4   85.6   93.6   89.4   99.5   86.2   92.4
MA0.75&SID   91.2   91.1   91.1   87.1   94.7   90.8   99.8   86.4   92.6

As with the transition model that poses no limit on the number of active speakers, the proposed MA likelihood models perform better than the baseline ones. The proposed modality fusion model brings further improvement of the performance. This validates our assumption on the complementarity of the SID and MA likelihood models. Note that the combination of the baseline MA1.00 and the SID likelihoods degrades the MA1.00 performance on the segments with a single active speaker and improves performance on the segments with overlapped speech. On the other hand, the combination of SID with our models MA0.55 and MA0.75 improves performance on all segments for the Viterbi decoding scheme.
4.4 Conclusions

The speaker segmentation system presented in this work is novel from three main perspectives.

First, the proposed joint probabilistic data association model (JPDA) uses not only the global maximum of the SPR-GCC-PHAT function as the microphone array (MA) observation, but also the multiple regional maxima, which allows better handling in regions of speaker overlap. Our JPDA model for the MA observations outperforms the classical speaker segmentation methods on segments with overlapped speech, whether these are based on SID [48] or the baseline MA based technique [21]. Furthermore, careful thresholding of the extracted regional maxima and the choice of the fusion parameters that emphasize advantages of both SID and MA likelihood models bring additional performance improvements.

Second, we suggest a hidden Markov model of the speaker activity state evolution which can work with the proposed MA likelihood model only, or perform fusion with the likelihood model obtained from the speaker identification (SID) system. This multimodal architecture performs fusion of the video tracking, MA time delay processing and SID systems and allows for improvements in each modality.

Finally, we propose two probabilistic algorithms that solve the interesting problem of parallel estimation of the unknown trajectory-to-identity association parameter and the state sequence.

Chapter 5: Multi-modal multi-channel dyadic interaction database

5.1 Purpose of multi-modal recording environment and dyadic interaction database

This chapter presents two contributions. First, we present our multi-modal recording environment aimed at collection and informed analysis of human behaviors in collaboration with psychology experts.
We describe the collection and present some initial analysis of the first part of the database of three hours of data consisting of multiple five minute dyadic interactions; a product of the collaboration between the USC Viterbi School of Engineering and the USC Department of Psychology 1. Each short interaction represents an argument on one of nine suggested topics where each participant is trying to provide evidence that supports her/his point of view. Some of the topics are confrontations about cheating in a relationship, a drinking problem, stealing from a roommate etc. Data is manually transcribed, segmented in speaker turns and annotated by experts with the approach/avoidance labels. The recording environment contains an array of 10 high-definition video cameras, multiple microphone arrays (13 mic total), 2 lapel microphones and a 12-camera motion capture system.

1 The described part of the database is collected through role-playing, but we in-parallel analyze real data [4] and intend on collecting real-couple interaction data in this environment in the future.

This design allows collection of synchronized high quality signals in a controlled environment and enables investigation of advanced signal processing techniques. For instance, the corpus of real married couple interactions used in [4], although at the moment more realistic in terms of the impact to the field of psychology, restricts the use and development of algorithms. It was not designed and collected to also favor automatic processing and the included audio-video recordings may be of considerably low quality. In addition to these psychology domain data used in [4], our lab has already released an acted multimodal database (http://sail.usc.edu/iemocap) of emotional interactions. We also intend to disseminate to the community this database, which is richer in realism and sensing.

An additional important advantage of the database is the availability of both motion capture and video data.
This allows us to (a) analyze the relation of the body language features obtained from the motion capture output with domain labels; (b) perform training and testing of algorithms that extract equivalent or similar features from video and (c) analyze information loss through the video processing and refine the algorithms appropriately.

In relation to this, we present the second contribution: statistical dependence of interaction descriptors, e.g., turn durations, number of questions, backchannels, successful and unsuccessful interruptions, on participants' roles, and analysis of the relation between various non-verbal features obtained from the audio and the motion capture data and approach/avoidance labels.

In Section 5.2 we provide details on the recording system architecture. Section 5.3 describes the collected database and the annotation process. In Section 5.4 we present the feature extraction process, analyze the audio turn taking, head orientation and hand movement patterns, and present analysis of the relation between non-verbal features and the approach/avoidance labels.

5.2 Recording Environment and Hardware

A physical outlook of the recording environment and a schematic description of the recording architecture used for the data collection are presented in Figure 5.1.

Figure 5.1: Left: Outlook of the recording environment: (A) camera array; (B) microphone arrays; (C) motion capture cameras; (D) shotgun microphone; Right: Recording system architecture: blue - video recording system, red - motion capture Vicon system, green - microphone recording system, PC1 and PC2 synchronized via dedicated 1394 bus and synchronized with PC3 using shutter signal

Our sensing capabilities include:
• Vicon motion capture system: 12 motion capture cameras that track and record positions of 23 markers on participants' upper bodies (Figure 5.2) at 120fps rate.
Note that the markers are placed in a way that leaves the skin on the lower arms and face visible, allowing algorithm development from the video channel.
• PointGrey camera array: 10 Flea2 PointGrey cameras recording 2 frontal close-up and 8 ceiling far-field views of the interaction at 30fps with resolution 1024 × 768.
• Microphones: three 4-microphone T-arrays, a lapel microphone for each participant and a shotgun microphone, all recording interaction audio at 48kHz with 24 bit precision.

We used 2 PCs with solid state hard drives in RAID 0 configuration to achieve the necessary writing speeds. The used configuration supports recording from eight 2Mpix cameras at 30fps in raw format with 8bpp without dropped frames. Due to the huge volume of the recorded data and our processing goals we opted for a resolution of 0.7Mpix per camera-frame. Cameras were synchronized using a dedicated 1394 connection between PCs and PointGrey's Multisync software. We developed the recording C++ software using PointGrey's SDK. Audio was recorded using another PC and two daisy-chained 8-channel MOTU-896 devices. The important issue of audio-visual synchronization was addressed by bringing the shutter signal from one of the cameras as an input to the MOTU audio device and recording it synchronously with all audio signals. The synchronization of the motion capture stream with the audio-visual components is done using a director's clap at the beginning and the end of each recording session. The audio-visual synchronization precision is practically one audio sample, and the synchronization with the motion capture system is defined by the frame capture rate and is approximately 10ms. A schematic representation of the recording system with connections between different modalities is given in Figure 5.1.

Figure 5.2: (a) Marker placement: 4 head markers, 3 back markers, 2 chest markers and 7 markers on each arm; (b) Reconstructed marker locations and derived features
5.3 Database

The dyadic multimodal database will include several levels of realism from the human aspect side. We have initially started our collection with unscripted role-playing scenarios and we intend to continue with more realistic data of couples interacting on conflictual topics of their choosing. We also want to solicit feedback from the broad scientific community and guide the future collection appropriately.

For the first part of the database participants were given time to prepare for arguments on a chosen subset of nine proposed topics and encouraged to be passionate in arguing their position during conversations. Suggested topics were open ended (e.g., a couple arguing over a fling with a friend, or friends arguing over a fling with one's boy/girlfriend) and participants were drawing from their own experiences to create a back story that supports their point of view. It was suggested to choose the back stories in a way that makes the discussion as natural as possible for the participant. The interactions were segmented, transcribed and labeled with the approach/avoidance labels.

5.3.1 Collection protocol

The data collection protocol contains two main stages. In the first stage, two days before the scheduled collection, participants were given a list of nine scenarios with instructions on how to interact. The second stage happens at the scheduled collection time and represents a sequence of preparation and data collection steps. At the beginning of each session participants were introduced to each other and time was given to them to pick 4 - 6 scenarios they would like to discuss during the collection. Participants interact in scenarios that require them to be familiar with each other (couples or friends), so before the recording of each scenario they spend an additional 5 - 10min to share information that they consider necessary for the role play.

After recording all interactions of the same couple we recorded additional reference head orientation data.
We also recorded visual and audio information of the environment such as scene and noise backgrounds and data necessary for the joint calibration of all modalities.

Our data collection is on-going. The first part of the database described here contains approximately 3 hours of data. One third of the interactions contain couples of the opposite sex while the rest involve interlocutors who are of the same sex, mostly female. In these cases participants are acting as friends.

A subset of the data corresponding to eight sessions (45min) is fully annotated and this is the dataset portion we use for the analysis presented in this chapter.

5.3.2 Manual post-processing and data annotation

The multi-view motion capture system is designed to track markers on the human body in a 3D coordinate system. However, since the participants were asked to have a natural interaction, marker occlusions happen very often. The proprietary Vicon iQ software cannot reliably reconstruct full marker trajectories (after an occlusion, a wrong label is usually assigned to the occluded marker) and a manual intervention is necessary. We manually corrected marker labels as needed to enable trajectory reconstruction.

We split the annotation process in two parts. The first part is conducted by labelers who can speak and write English and the second part is conducted by psychology-domain experts trained in coding schemes such as the CIRS. Labels in the first group include all labels derived from audio: speaker segmentation, transcription, dialogue acts on sentence level (question, statement, backchannel) and turn taking labels (successful and unsuccessful interruptions). No labeling of the video channel for low-level interaction descriptors was performed since these labels can be extracted directly from the motion capture output. Labels in the second group will include subject-interaction level labels, e.g., presence of blame, attitude (positive vs.
negative), acceptance and approach-avoidance, or labels on the sub-interaction level.

5.3.3 Illustrative Interaction Statistics

Extracted features include speaker ID based segmentation, speech segmentation using voice activity detection [23], energy, pitch, 13 MFCCs and a microphone array based speaker segmentation [52]. In addition, we calculate a range of functionals, e.g., mean, standard deviation, minimum and maximum. For both the active speaker and the inactive participant we estimate the number of interruptions and the total interruption duration normalized by the turn length.

Table 5.1: Interaction and dialogue event counts for different interaction roles: Q - question, BC - backchannel, UI/SI - unsuccessful/successful interruption

session ID   role        Q    BC   UI   SI
1,2,3        initiator   34   11   7    7
             other       18   1    5    3
4,5          initiator   11   1    2    4
             other       7    0    3    0
6,7,8        initiator   32   6    21   27
             other       4    6    7    18
all          initiator   77   18   30   38
             other       29   7    15   21

Features derived from the motion capture data are chosen according to the approach and avoidance coding manual in a way that intuitively describes relative participant orientation, movement and body posture. For both participants we extract the following functionals in 3sec intervals with 1sec shift: (a) head/body orientation angle relative to the other participant; (b) arm velocity measure representing the average velocity of scaled arm markers maximized over the left and right hand and (c) a measure of how much the body posture is opened/closed in terms of the average distance of the left and right forearms from the chest markers.

Based on the best data and annotation we analyzed speaker turn taking, as summarized in Table 5.1 and Figure 5.3. We can observe that dialogue initiators use longer turns to communicate their messages. As can be seen from Table 5.1, initiators use more questions, interrupt the other person more often and, according to the number of backchannels, tend to be more active listeners.
In addition, our analysis has shown that the turn initiator tends to have significantly more turns of 10 seconds or more while the other person has about 30% more turns of shorter duration.

Figure 5.3: Speaker turn duration histograms for different roles (horizontal axis: turn duration [s])

In addition we provide some analysis of the motion capture data as it relates to speaker activity. Figure 5.4 represents the velocity of the head and hand motion. As we can see, the active speaker demonstrates significantly more movement than the listener. The listener has negligible head and hands movement for twice as long as the active speaker, while the active speaker demonstrates a consistently higher velocity of movement.

Figure 5.4: Histogram of probability of the velocity of movement of hands and angular velocity of the head (panels: velocity of hands estimated by MoCap; angular velocity of head estimated by MoCap; curves: inactive participant, active speaker). As we can see the inactive participant is significantly less animated than the active speaker.

5.4 Results and Discussion

In this section we present our analysis results of the relation between different low-level descriptors and the approach/avoidance labels on the speaker turn level.

For this purpose we extract standard short term audio features and calculate their functionals for each speaker turn. In order to do this, we separated speech and non-speech regions using an unsupervised voice activity detector [23]. On the speech regions we extract energy, pitch and 13 MFCC coefficients on 25ms windows every 10ms. These features were extracted using Praat software [6] for two close talk microphone channels. Additionally, on the speech regions we run a microphone array speaker segmentation algorithm [52] on multiple microphone array channels.
For the active speaker we calculate the following functionals of the audio features: mean, standard deviation, minimum and maximum. For both the active and the inactive speaker we estimate the number of interruptions and the total interruption duration normalized by the turn length.

Features derived from the motion capture data are chosen according to the approach and avoidance coding manual [27] in a way that intuitively describes relative participant orientation, movement and body posture. For both participants we extract the following functionals of the motion capture marker locations: time spent orienting head/body towards the other participant normalized with the turn length, and mean and standard deviation of: (a) angular difference between the reference head/body orientation and the average head/body orientation angle relative to the other participant; (b) upper/lower arm angular velocity amplitudes; (c) chest to wrist distances normalized with the arm length and (d) minimal distance between participants during the turn normalized with the minimal distance at the beginning of the interaction.

5.4.1 Approach and Avoidance - Expert Annotation

The recorded data represents a flow of verbal and non-verbal cues and, in order to avoid a priori interaction segmentation, for the purpose of our analysis experts provided us with continuous-in-time and discrete-in-value [-4, -3, ..., 4] approach-avoidance labels for each participant. Each interaction is annotated by a single expert in two ways: (a) using multi-view video only and (b) using multi-view video with audio. Additionally, two test interactions are labeled by three annotators to give an initial insight into annotator agreement. Labels for video only are obtained by following labeling rules related to the gaze, relative inter-participant body and head orientation and qualification of the body pose as opened or closed. Besides visual cues, engagement in the conversation, dialogue management and turn taking behavior were used for the audio-visual labeling.
5.4.2 Interaction descriptors - role dependency

All suggested scenarios have common role profiles: one participant is initiating the discussion with a clear message and goal in mind, e.g., confront a friend about her/his drinking problem/cheating or insist on changes in holiday/weekend plans, and the other participant is trying to express his point, which can end in agreement or participants may stay confronted. We examined the dependence of important dialog properties on participants' roles.

Table 5.2: Approach-avoidance (A-A) label values for different interaction roles, mean (standard deviation)

session ID   role        audio and video   video only
1,2,3        initiator   2.16 (0.49)       1.05 (0.79)
             other       1.88 (0.55)       1.39 (0.58)
4,5          initiator   2.66 (0.83)       -0.14 (0.89)
             other       1.39 (0.52)       -0.07 (0.32)
6,7,8        initiator   2.25 (0.52)       1.73 (0.72)
             other       1.88 (0.48)       1.01 (0.58)
all          initiator   2.31 (0.62)       0.91 (0.75)
             other       1.78 (0.54)       1.14 (1.05)

In Table 5.2 we present the mean and standard deviation of the approach-avoidance labels for each role. From Table 5.2 it can be seen that different roles were less discriminative in the visual cue domain, while the addition of acoustic cues made interaction roles more separable.

5.4.3 Analysis of non-verbal features for A-A estimation

We analyzed the relation of features derived from the motion capture output to the A-A labels derived from video only and from combined audio and video. For that purpose we calculated features derived from audio and motion capture output in 3sec intervals with 1sec shift as described in Section 5.3. In Table 5.3 we present mutual information (MI) values for the chosen set of features. The MI values are estimated by discretizing each feature separately using the k-means algorithm with 10 clusters and calculating mutual information between the discretized feature variables and the discrete A-A labels. The MI is calculated by concatenation of samples (the feature and the label) for all sessions and for all participants.
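The MI estimate described above (per-feature k-means discretization followed by a plug-in mutual information between cluster indices and labels) can be sketched as follows. This is only an illustration under stated assumptions: the function names `kmeans_1d` and `mutual_information`, the quantile-based initialization, the fixed iteration count and the use of base-2 logarithms (bits) are hypothetical choices; the text does not specify them.

```python
import numpy as np

def kmeans_1d(x, k=10, n_iter=50):
    """Discretize a scalar feature with a simple 1-D k-means.
    Centers are initialized at evenly spaced quantiles (assumption)."""
    x = np.asarray(x, dtype=float).ravel()
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            members = x[assign == c]
            if members.size:
                centers[c] = members.mean()
    return np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)

def mutual_information(a, b):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)          # joint histogram of (a, b)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)    # marginal of a
    pb = joint.sum(axis=0, keepdims=True)    # marginal of b
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

# Toy data: a feature that separates two label values perfectly.
feature = np.array([0.0] * 50 + [10.0] * 50)
labels = np.array([0] * 50 + [1] * 50)
print(mutual_information(kmeans_1d(feature, k=2), labels))
```

In this toy case the discretized feature determines the label exactly, so the estimate equals the label entropy; for real features the value depends on the number of clusters and on the amount of data.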
The measures of how opened/closed the body posture is and of the body orientation angle have the highest relation to the A-A labels. They also show higher MI for A-A labeling from the video only stream (see Table 5.3), which is expected as these features are based on motion capture and do not include audio features. Although the acoustic features (pitch, energy) do not exhibit high MI, we can still observe that MI has higher values for the A-A labeling of the audio-visual as opposed to the video only streams. The low MI value for these features implies that an alternative set of acoustic functionals should be examined.

Table 5.3: Mutual information between motion capture features and A-A labels

description                   functional   video only   audio and video
body orientation              mean         0.42         0.40
                              min          0.40         0.37
                              max          0.47         0.37
opened/closed hands vs body   mean         0.45         0.27
                              min          0.51         0.32
                              max          0.43         0.24
hands motion                  mean         0.11         0.12
                              var          0.13         0.15
pitch                         mean         0.08         0.12
                              var          0.07         0.12
energy                        mean         0.13         0.19
                              var          0.14         0.17

Chapter 6: Approach-avoidance behavior in dyadic interactions: Ordinal regression approach to approach-avoidance label estimation

6.1 Proposed algorithm for estimation of ordinal labels

In our previous work [53], we introduced the multimodal dyadic interaction database and used it for analysis of relations between various non-verbal features and AA labels as defined by psychologists [27]. These labels belong to the ordered set of nine categories, ranging from complete avoidance to complete approach. In this chapter we address the estimation of the ordinal AA labels for the same dyadic interactions using the low-level non-verbal signal features. These features represent the basic statistics (mean, minimum, maximum and standard deviation) of the various video (body orientation, head orientation, hand movement, measure of how opened the postures are) and audio (pitch, energy) based measurements calculated on the feature processing window.
In order to address the ordinal nature of the AA labels we propose a new ordinal regression algorithm. This algorithm is applicable to any ordinal regression problem and consists of three main steps: (1) we transform the ordinal regression problem to multiple binary classification problems defined by the label ordering; (2) we solve the binary classification problems independently using any classification method that outputs a (possibly non-probabilistic) classification score; (3) we fit the cumulative logit logistic regression model with proportional odds (CLLRMP) on vectors obtained by concatenation of scores from the binary classifiers. Additionally, we propose a simple extension of the proposed algorithm applicable to time series of ordinal labels. In the extended algorithm, we model the label sequence using the hidden Markov model with likelihood based on the probabilistic output from the CLLRMP.

For the AA label estimation the used feature vectors are continuous and have no missing values, and we choose to apply the weighted binary SVM classifiers [11] with native non-probabilistic scores.

The two step ordinal regression algorithm proposed in [19] is somewhat similar to the method we propose. While the first steps are identical, in the second step this algorithm employs probabilistic binary classifiers. The vector of their outputs should represent an estimate of the cumulative distribution of the ordinal label; however, since the binary classifiers are trained independently there is no guarantee that the estimate is monotonically non-decreasing.

We evaluate the proposed estimation methods using leave-one-session-out cross validation.
We present evaluation results for 4 experiments: (1) analysis of the dependency between the estimation accuracy and the lengths of the windows used to calculate the feature statistics; (2) comparison of average estimation accuracies for the proposed estimators and the baseline multi-class SVM, and analysis of the variability in estimation accuracy across sessions; (3) analysis of differences between confusion matrices for different estimators; (4) comparison of estimation accuracies for the SVM-OLR and the estimator obtained by fitting the CLLRMP directly on the original feature vectors.

In Section 6.1.1, we present the transformation of the ordinal regression problem to multiple binary classification problems. In Section 6.1.2 we present the CLLRMP. The implications of the choice to fit the CLLRMP to the classifier scores instead of the feature vectors are discussed in Section 6.1.3. The extension to time series data is presented in Section 6.1.4.

6.1.1 Label ordering inspired collection of binary classifiers

Let us introduce the notation used in our paper. We assume that the feature vectors $y$ take values from the space $\mathcal{Y}$, and that the ordinal labels $o$ belong to the set $\mathcal{O}$. For simplicity, we denote the elements of $\mathcal{O}$ as integers, $\mathcal{O} = \{1, 2, \ldots, K\}$. We map each ordinal categorical label $o$ to a vector of $K-1$ binary indicators $b(o) = (b_1(o), \ldots, b_{K-1}(o))$ in such a way that the $k$th indicator $b_k$ takes value 1 if $o \in \{1, \ldots, k\}$ and value 0 if $o \in \{k+1, \ldots, K\}$. In other words, the described mapping is a redundant, label-ordering inspired, error correcting code.

Figure 6.1: Proposed method and HMM extension.

We transform the original training dataset $Y = \{(y_n, o_n)\}_{n=1}^{N}$ to $K-1$ datasets $Y_k = \{(y_n, b_{k,n})\}_{n=1}^{N}$ ($k = 1, \ldots, K-1$) with binary labels defined by the described mapping, $b_{k,n} = b_k(o_n)$. On each dataset $Y_k$ we train a binary classifier $BC_k$. The collection of trained classifiers $(BC_1, \ldots, BC_{K-1})$ maps every feature vector $y_n$ in the training set into the vector of classifier scores $f_n = [f_{1,n}, \ldots, f_{K-1,n}]^T$. Therefore, we say that the collection of trained binary classifiers maps $Y$ to $F = \{(f_n, o_n)\}_{n=1}^{N}$.

6.1.2 Cumulative logit logistic regression with proportional odds

We fit the CLLRMP [1] to the dataset $F = \{(f_n, o_n)\}_{n=1}^{N}$ obtained in the previous step. Intuitively, the CLLRMP approximates the logits ($\mathrm{logit}(x) = \log\frac{x}{1-x}$) of the cumulative label distributions by linear functions, with equal slope, of the input vectors (in this case, the score vectors $f$). Formally, the CLLRMP is defined by:

$$\ln \frac{p(o \in \{1, \ldots, k\} \mid f)}{p(o \in \{k+1, \ldots, K\} \mid f)} = w_{0,k} + w^T f, \qquad (6.1)$$

for $k = 1, \ldots, K-1$, where the model parameters $w_0 = [w_{0,1} \ldots w_{0,K-1}]^T$ and $w = [w_1 \ldots w_{K-1}]^T$ are respectively the intercept and slope coefficients. The optimal values of the model parameters can be learned from the dataset $F$ in the maximum likelihood sense [42].

An important property of the CLLRMP is that it imposes the stochastic ordering of labels corresponding to different input vectors [1]. This means that it is possible to compare values of the cumulative distribution functions of labels for different score vectors $f$. This property is summarized by the following equation that follows trivially from Equation 6.1:

$$p(o \le i \mid f) = p\left(o \le j \,\Big|\, f + \frac{w_{0,i} - w_{0,j}}{K-1}\, w^{-1}\right), \qquad (6.2)$$

where $w^{-1} = [1/w_1 \ldots 1/w_{K-1}]^T$ is the vector of inverse slope coefficients, so that $w^T w^{-1} = K-1$. In the following section we discuss the importance of the stochastic ordering in the case when we use weighted binary SVMs in the second step of the proposed estimation method.

6.1.3 Fitting CLLRMP on classifier score vectors

Let us briefly discuss the implications of the stochastic ordering property imposed by the CLLRMP when it takes the classifier scores as input. The binary classifiers map the original feature vector $y$ to the score vector $f$ whose coordinates have a clear interpretation: if the label of $y$ belongs to $\{k+1, \ldots, K\}$ ($\{1, \ldots, k\}$), then $f_k$ takes low (high) values.
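As a concrete illustration of the two steps described so far, the label-to-indicator mapping of Section 6.1.1 and the cumulative model of Equation 6.1 can be sketched in a few lines. This is a minimal sketch, not the implementation used in the experiments: the function names `ordinal_to_binary` and `cllrmp_label_probs` are hypothetical, NumPy is an assumed dependency, and the parameters $w_0$ and $w$ are taken as given rather than fitted by maximum likelihood.

```python
import numpy as np


def ordinal_to_binary(labels, num_classes):
    """Map labels in {1, ..., K} to K-1 indicators, b_k = 1 if label <= k else 0."""
    labels = np.asarray(labels)
    thresholds = np.arange(1, num_classes)  # k = 1, ..., K-1
    return (labels[:, None] <= thresholds[None, :]).astype(int)


def cllrmp_label_probs(scores, intercepts, slope):
    """Per-label probabilities implied by the cumulative logit model (Eq. 6.1).

    scores:     (N, K-1) classifier score vectors f_n
    intercepts: (K-1,) intercepts w_{0,k}, assumed non-decreasing in k
    slope:      (K-1,) slope vector w shared by all K-1 cumulative logits
    """
    scores = np.asarray(scores, dtype=float)
    intercepts = np.asarray(intercepts, dtype=float)
    slope = np.asarray(slope, dtype=float)

    # logit p(o <= k | f) = w_{0,k} + w^T f, one column per k.
    logits = intercepts[None, :] + (scores @ slope)[:, None]
    cdf = 1.0 / (1.0 + np.exp(-logits))

    # Pad with p(o <= 0) = 0 and p(o <= K) = 1, then difference to get p(o = k | f).
    n = scores.shape[0]
    cdf = np.hstack([np.zeros((n, 1)), cdf, np.ones((n, 1))])
    return np.diff(cdf, axis=1)
```

For example, `ordinal_to_binary([1, 2, 3], 3)` gives one row of indicators per label, and `cllrmp_label_probs` returns one row of $K$ label probabilities per score vector. Because the slope is shared, changing $f$ shifts all $K-1$ cumulative logits by the same amount, which is the mechanism behind the stochastic ordering property.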
Assuming that the classifiers can successfully solve the binary classification tasks, there exists a label induced partition of the space of $f$ whose elements are convex sets. Assuming that we "move" $f$ along the line connecting arbitrary $f_1, f_2 \in F$, it is desirable to have a model such that changes in the cumulative distributions of labels $s(f)$ conditioned on $f$ reflect the intersections that $f$ makes with the label induced boundaries. This is exactly the stochastic ordering property. The CLLRMP should fit the classifier score vectors $f$ better than the original feature vectors $y$ since elements of the label induced partition on $\mathcal{Y}$ are not necessarily convex.

6.1.4 Hidden Markov model with OLR based likelihood

To address dependencies between consecutive labels we represent them as a sequence of variables that form a discrete Markov model with the transition matrix $T = [p(s_t = l_i \mid s_{t-1} = l_j)]_{i,j=1}^{K}$. Having labels as unobservable variables and feature vectors as observable variables implies an HMM structure. However, it is difficult to learn a good generative model for the likelihood due to the dimensionality of the feature vectors and the influence of out-of-model variables on the feature vectors. Therefore we adopt a hybrid HMM (Figure 6.1) which exploits the advantages of the discriminatively trained CLLRMP through the likelihood function $p(y_t \mid s_t) \propto \frac{p(s_t \mid y_t)}{p(s_t)}$.

6.2 Dyadic interaction dataset

The employed dataset is a part of the acted dyadic interaction database [53] that consists of 3 hours of audio, video and motion capture data split in multiple 5-10 min sessions. Each session contains an interaction based on unscripted role-play on one out of 9 conflictual topics that include cheating in a relationship, arguing over a drinking problem etc. Topics are selected such that same sex participants act as friends and opposite sex participants as couples. In order to improve the chances of recording realistic interactions participants undergo two preparation stages.
A couple of days prior to the data collection participants are introduced to the pool of topics and, on the collection day, they are asked to discuss the topics with their peers and agree on 4-6 interaction scenarios.

The hardware architecture allows us to record sessions with 10 HD Flea 2 cameras (30 fps), a 12 sensor Vicon motion capture (MoCap) system (120 fps), three 4-microphone T-arrays, two lapel microphones and one shotgun microphone (48 kHz) without dropped frames. All modalities are synchronized with a sub-10 ms synchronization precision.

Sessions are manually post-processed (to correct errors in the MoCap automatic trajectory reconstruction) and annotated in two ways. The first set of annotations includes transcription and segmentation of the audio on the speaker turn-taking level, augmented by the basic sentence level dialogue acts. The second set of annotations is conducted by trained [27] psychology domain experts, to provide subject interaction level labels including acceptance, presence of blame, attitude and approach-avoidance.

Experts provide us with the continuous-in-time and discrete-in-value approach-avoidance labels for each participant. The approach-avoidance labels belong to an ordered set of nine categories ranging from complete avoidance to complete approach. Labelers provide two sets of labels, one using only the multi-view video and the other using both video and audio. Since the labeling and particularly the post-processing are time demanding, at present we have 8 fully annotated interaction sessions with a total duration of 45 min. This 8-session subset of the full dataset is used in the experiments presented in this chapter. The AA labels for each of the 8 sessions belong to the same set $\{-1, 0, 1, 2, 3\}$.

6.3 Results and discussion

In Section 6.3.1 we describe details of feature extraction, estimator training and evaluation methodology, and in Section 6.3.2 we present results of different evaluation experiments.
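Before turning to the experiments, the hybrid HMM of Section 6.1.4 can be made concrete with a short sketch. It decodes a label sequence using likelihoods proportional to the ratio of the label posterior and the label prior. Several things here are assumptions rather than details taken from the text: the use of Viterbi decoding (the chapter does not state how the state sequence is inferred), the use of the label prior as the initial state distribution, the function name `viterbi_scaled_likelihood`, and the small constant `eps` added for numerical safety.

```python
import numpy as np


def viterbi_scaled_likelihood(posteriors, prior, transition, eps=1e-12):
    """Most likely label sequence in a hybrid HMM with scaled likelihoods.

    posteriors: (T, K) array, row t holds p(s_t = k | y_t) from a discriminative model
    prior:      (K,) label prior p(s = k), also used as initial distribution (assumption)
    transition: (K, K) array with transition[i, j] = p(s_t = i | s_{t-1} = j)
    """
    posteriors = np.asarray(posteriors, dtype=float)
    prior = np.asarray(prior, dtype=float)
    transition = np.asarray(transition, dtype=float)

    # log p(y_t | s_t) up to a label-independent constant: log posterior - log prior.
    log_lik = np.log(posteriors + eps) - np.log(prior + eps)
    log_trans = np.log(transition + eps)

    num_steps, num_states = log_lik.shape
    delta = np.log(prior + eps) + log_lik[0]
    back = np.zeros((num_steps, num_states), dtype=int)

    for t in range(1, num_steps):
        # cand[i, j]: best score of paths in state j at t-1 that move to state i at t.
        cand = log_trans + delta[None, :]
        back[t] = np.argmax(cand, axis=1)
        delta = cand[np.arange(num_states), back[t]] + log_lik[t]

    # Backtrack from the best final state.
    path = np.empty(num_steps, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(num_steps - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With a "sticky" transition matrix the decoder smooths over isolated frames whose posterior only weakly favors a different label, whereas with a uniform transition matrix it reduces to a frame-by-frame decision.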
6.3.1 Features, estimator training and evaluation methodology

For each session and each participant we extract 7 MoCap features: the relative inter-participant head (2 angles) and body orientation (2 angles), two measures of the body posture (leaning angle: the angle between spine axis and horizontal plane; body open-closed measure: sum of the triangular square areas defined by elbow, wrist and chest markers for both hands) and the hand velocity measure (maximum of left and right hand velocities). Additionally, we extract two acoustic features, pitch and energy. We get the MoCap features directly from the MoCap marker coordinates every 10 ms and the acoustic features by processing 25 ms speech frames with 10 ms shift. Acoustic features are extracted using Praat software. For each feature we get 4 functionals (feature statistics), mean, minimum, maximum and standard deviation, on 6 s (also 3 s and 4.5 s) functional windows with 1 s shift. Note that the statistics of the audio features can be calculated only in regions where the speaker is active. If a participant does not speak in a particular frame we set all coordinates of the feature vector that correspond to the audio feature statistics to zero. By doing this we avoid occurrences of missing features. For estimation of the video-only and audio-and-video based categories we use, respectively, the 28-dimensional vector of MoCap functionals and the full 36-dimensional functional vector.

Figure 6.2: MoCap markers and body/head orientation features.

Since AA label counts in different sessions are unbalanced we use the weighted multi-class SVM classifier as a baseline and the weighted binary SVMs in the SVM-OLR. Parameter values for the CLLRMP are chosen to be optimal in the maximum likelihood sense.

6.3.2 Experiments

First, we present results that demonstrate the influence of the feature processing window length on the average estimation accuracy for different estimation methods (Figure 6.3).
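The window length varied in this experiment enters through the functional computation described in Section 6.3.1. The sketch below shows one way to compute the four functionals on sliding windows; it is an illustration under stated assumptions, not the extraction code used for the thesis. The function name `window_functionals`, the 100 Hz frame rate (implied by the 10 ms frame shift), and the ordering of the output coordinates (all means, then minima, maxima and standard deviations) are assumptions, and the zeroing of audio statistics for non-speaking frames is not handled.

```python
import numpy as np


def window_functionals(frames, frame_rate_hz=100, window_s=6.0, shift_s=1.0):
    """Compute mean, min, max and std of each feature on sliding windows.

    frames: (T, D) array of frame-level features sampled at frame_rate_hz
    Returns an array of shape (num_windows, 4 * D).
    """
    frames = np.asarray(frames, dtype=float)
    win = int(round(window_s * frame_rate_hz))  # frames per functional window
    hop = int(round(shift_s * frame_rate_hz))   # frames between window starts

    rows = []
    for start in range(0, frames.shape[0] - win + 1, hop):
        seg = frames[start:start + win]
        rows.append(np.concatenate(
            [seg.mean(axis=0), seg.min(axis=0), seg.max(axis=0), seg.std(axis=0)]
        ))
    # The reshape keeps a well-defined 2-D shape even when no full window fits.
    return np.array(rows).reshape(-1, 4 * frames.shape[1])
```

Applied to the 7 MoCap features this yields the 28-dimensional functional vectors mentioned above; with the two acoustic features added it yields 36 dimensions. Passing `window_s=3.0` or `window_s=4.5` gives the shorter windows compared in Figure 6.3.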
[Figure 6.3 plot. Panels: audio & video, video. Horizontal axis: processing window length [sec] (3, 4.5, 6). Vertical axis: accuracy. Legend: SVM, SVM-OLR, HMM-SVM-OLR.]

Figure 6.3: Dependency of AA category estimation accuracies on feature processing window length for different estimation techniques.

All estimation methods achieve higher accuracies for longer functional windows (Figure 6.3). We did not experiment with windows above 6 s due to the limited size of the dataset. It can be seen that the multi-class SVM benefits the most from the increasing window size and the HMM-SVM-OLR the least, although the performance of the HMM-SVM-OLR is consistently the best. This does not come as a surprise, as the HMM-SVM-OLR conditions the current state on the previous state and therefore exploits a context longer than the window.

The experts' perception of AA labels differs depending on whether they use video-only or both audio and video in their annotation process. The SVM-OLR suffers 9.6% lower average accuracy in the estimation of video-and-audio based AA labels when trained on the vision based (MoCap) features than when trained on audio-visual data (MoCap and audio features). This suggests that the proposed small set of audio derived non-verbal features captures some of the same information that influences the experts' perception. However, all estimators achieve higher accuracies for the video-only AA labels, which may imply that the audio feature set should be extended (dialogue acts, word frequencies etc.), but may also imply that there is more variability in interpretations of the audio-based information.

Table 6.1: Estimation accuracies for AA labels.
(window: 6 s; V: video-only, AV: audio-and-video)

          SVM             SVM-OLR         HMM-SVM-OLR
Session   AV[%]   V[%]    AV[%]   V[%]    AV[%]   V[%]
S1        65.9    70.7    70.3    74.7    72.5    78.1
S2        59.3    63.9    64.3    69.5    67.1    71.1
S3        66.1    70.6    70.7    75.5    73.3    76.8
S4        66.7    71.8    71.4    74.9    74.6    77.1
S5        66.9    72.0    71.9    75.5    74.3    77.9
S6        61.9    66.7    66.8    71.6    69.0    73.0
S7        62.7    67.3    67.5    71.1    70.1    74.1
S8        65.6    70.8    70.4    74.3    72.7    77.2
AVG       64.3    69.1    69.0    73.2    71.5    75.6

All discussed trends related to the difference in the average estimation accuracy for the video-only and the video-and-audio based AA labels remain valid on the session level. Additionally, accuracies on sessions S2, S6 and S7 are lower than on the remaining sessions. One explanation for this is a mismatch between the AA label prior estimated by label counts on the training set and the label distribution on the testing session. This mismatch can be quantitatively represented by the symmetric Kullback-Leibler distance, $\tfrac{1}{2}\left(KL(P_i \,\|\, P_{-i}) + KL(P_{-i} \,\|\, P_i)\right)$, between the AA label category counts for the $i$th session ($P_i$) and all other sessions ($P_{-i}$). This measure takes values 0.3, 0.09 and 0.07 for sessions S2, S6 and S7 respectively, while its average for all remaining sessions is 0.02. The indicated mismatch has negative implications on the weighted training for the multi-class SVM and SVM-OLR, and on the scaling of the HMM-SVM-OLR likelihood.

Further, we analyze the importance of the proper treatment of the category ordering and dynamics and its influence on the category confusion patterns. For this purpose we subtract the confusion distribution matrices of the SVM baseline for the video-only and the audio-and-video based AA labels from the corresponding matrices for the SVM-OLR and the HMM-SVM-OLR. Positive diagonal entries of the difference matrices provide insight into class conditioned accuracy improvements, and negative off-diagonal elements describe differences in category confusion patterns between the proposed methods and the multi-class SVM.
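Both diagnostics used in this analysis are simple to compute. The sketch below illustrates the symmetric Kullback-Leibler distance between label count vectors and the difference of two confusion distribution matrices. It rests on assumptions that are not stated in the text: natural logarithms, a factor of one half in the symmetric distance, row-normalized confusion matrices (rows indexed by the true label), and the hypothetical function names `symmetric_kl` and `confusion_distribution`.

```python
import numpy as np


def symmetric_kl(counts_p, counts_q, eps=1e-12):
    """0.5 * (KL(P || Q) + KL(Q || P)) for two vectors of label counts."""
    p = np.asarray(counts_p, dtype=float)
    q = np.asarray(counts_q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps guards against log(0) when a label is absent from one session.
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return 0.5 * (kl_pq + kl_qp)


def confusion_distribution(true_labels, predicted_labels, labels):
    """Row-normalized confusion matrix: entry (i, j) is p(predicted = j | true = i)."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t], index[p]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)  # rows with no samples stay zero
```

A difference matrix is then obtained as `confusion_distribution(y_true, y_proposed, labels) - confusion_distribution(y_true, y_baseline, labels)`, so that positive diagonal entries indicate a per-class accuracy gain of the proposed estimator over the baseline.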
We present the difference matrices in the form of color maps (see Figure 6.4), where dark (light) shades represent negative (positive) values.

[Figure 6.4 plot. Color maps d(SVM-OLR, SVM) and d(HMM-SVM-OLR, SVM); both axes: A-A labels (-1, 0, 1, 2, 3); color scale ticks: -2.5%, -5.0%, -7.5%, -10%.]

Figure 6.4: Differences between confusion matrices (left column: (SVM-OLR) - (SVM); right column: (HMM-SVM-OLR) - (SVM)).

Similar shades of the main diagonal entries in each colormap show that both the SVM-OLR and the HMM-SVM-OLR improve estimation accuracies for all classes uniformly. Lighter elements on the main diagonal in the right colormap column show the performance advantage of the HMM-SVM-OLR. The dark shade of cells corresponding to the neighboring category pairs, $(c_1, c_2) \in \{(i, j) : |i - j| = 1\}$, shows that the CLLRMP fits the binary SVM outputs well and is able to distinguish similar categories. As explained in Section 6.1.3, SVM outputs fit the CLLRMP better than the original feature vectors. This is experimentally confirmed by the comparison of accuracies for the SVM-OLR and the CLLRMP fitted on the original feature vectors (see Table 6.2).

Table 6.2: CLLRMP inputs: original features vs. SVM outputs.

          SVM-OLR         CLLRMP
          AV[%]   V[%]    AV[%]   V[%]
AVG       69.0    73.2    56.2    58.3

Since the SVM-OLR and the HMM-SVM-OLR exploit label ordering and dynamics, they improve estimates in situations in which the SVM predicts frequent changes and/or changes to very distant labels for consecutive frames. Therefore even the cells farther from the main diagonal, $(c_1, c_2) \in \{(-1, 1), (0, 2), (2, 0), (3, 1)\}$, get a dark shade.

6.4 Conclusions

From an application perspective, we addressed the estimation of specific behavioral categories, approach-avoidance (AA), in dyadic human interactions using audio and MoCap derived features. From an algorithmic perspective, we proposed two estimation schemes that exploit the ordering and dynamics of AA labels, the SVM-OLR and the HMM-SVM-OLR.
The SVM-OLR transforms the original feature space by multiple binary SVM classifiers and fits the CLLRMP on the classifier outputs. The HMM-SVM-OLR is a hybrid Markov model that uses a likelihood function proportional to the ratio of the label posterior probability from the SVM-OLR and the label prior. Experimental results on the multimodal dyadic interaction dataset showed advantages of the ordinal regression based methods over the multi-class SVM baseline. Results displayed a consistent increase in both average and single-session estimation accuracies with an increase of the feature processing window length. The HMM-SVM-OLR outperformed the SVM-OLR and the multi-class SVM and achieved a leave-one-session-out average accuracy of 75.6% for a 6 s window.

We discussed: (1) variability in single session estimation performances caused by the lack of label proportion balance between sessions; (2) differences between confusion matrices for the proposed estimators and the multi-class SVM caused by the CLLRMP properties; and (3) differences in performance when the CLLRMP is fitted on SVM outputs and on the original feature vectors, influenced by the stochastic ordering property.

Our ongoing work includes collection and preprocessing of a larger dataset, extraction of additional audio based features (including language transcripts), and extraction of video features that can replace the MoCap features. Our future work on AA label estimation will include the identification of reliably distinguishable AA categories, analysis of labeling process reproducibility and development of time series models that can exploit inputs from multiple labelers.

Bibliography

[1] A. Agresti. Analysis of Ordinal Categorical Data. Wiley, 2010.
[2] J. Ajmera, G. Lathoud, and I. McCowan. Clustering and segmenting speakers and their locations in meetings. In Proc. of the ICASSP, 2004.
[3] C. Berzuini and W. R. Gilks. Resample-move with cross-model jumps. Springer Verlag, 2001.
[4] M. Black, A. Katsamanis, C-C. Lee, A. Lammert, B. R. Baucom, A. Christensen, G. G.
Georgiou, and S. Narayanan. Automatic classification of married couples' behavior using audio features. Submitted to Interspeech 2010, 2010.
[5] Y. Boers and H. Driessen. A particle filter multi target track before detect application: Some special aspects. In Proc. International Conference on Information Fusion, 2004.
[6] P. P. G. Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341-345, 2001.
[7] M. Brandstein and D. Ward. Microphone Arrays: Signal Processing Techniques and Applications. Prentice Hall, 2001.
[8] C. Busso, P. G. Georgiou, and S. S. Narayanan. Real-time monitoring of participants interaction in a meeting using audio-visual sensors. In Proc. ICASSP, 2007.
[9] C. Busso, S. Hernanz, C. W. Chu, S. I. Kwon, S. Lee, P. G. Georgiou, I. Cohen, and S. Narayanan. Smart room: Participant and speaker localization and identification. In Proc. of the ICASSP, 2005.
[10] C. Canton-Ferrer, C. Segura, J. R. Casas, M. Pardas, and J. Hernando. Audiovisual head orientation estimation with particle filtering in multisensor scenarios. EURASIP Journal on Advances in Signal Processing, 2008.
[11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
[12] J. Chen, J. Benesty, and Y. Huang. Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing, 2006.
[13] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[14] P. Damien and S. G. Walker. Sampling truncated normal, beta and gamma densities. Journal of Computational and Graphical Statistics, 10(2):206-215, 2001.
[15] A. Doucet, M. Briers, and S. Senecal. Efficient block sampling strategies for sequential monte carlo methods. Journal of Computational and Graphical Statistics, 3(15):693-711, 2006.
[16] A. Doucet, N. DeFreitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.
[17] A. Doucet and A. M. Johansen.
A tutorial on particle filtering and smoothing: fifteen years later. Springer Verlag, 2009.
[18] S. Erland. Approximating hidden Gaussian Markov random fields, Trondheim. PhD thesis, Norwegian University of Science and Technology, 2003.
[19] E. Frank and M. Hall. A simple approach to ordinal classification. In Proc. ECML, pages 145-156, 2001.
[20] G. Friedland, O. Vinyals, Y. Huang, and C. Müller. Prosodic and other long-term features for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 2009.
[21] D. Gatica-Perez, G. Lathoud, J. M. Odobez, and I. McCowan. Audiovisual probabilistic tracking of multiple speakers in meetings. IEEE Trans. on Audio, Speech and Language Processing, 2007.
[22] P. G. Georgiou, P. Tsakalides, and C. Kyriakakis. Alpha-stable robust modeling of background noise for enhanced sound source localization. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999.
[23] P. K. Ghosh, A. Tsiartas, and S. S. Narayanan. Robust voice activity detection using long term signal variability. IEEE Transactions on Audio, Speech and Language Processing, 2010. Accepted.
[24] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov chain Monte Carlo in Practice. Chapman&Hall, London, 201ll.
[25] P. J. Green. Nonlinear Dynamics and Statistics. Birkhauser, 2001.
[26] T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning. Springer-Verlag, 2001.
[27] C. Heavey, D. Gill, and A. Christensen. Couples interaction rating system 2 (CIRS2). University of California, Los Angeles, 2002.
[28] R. E. Heyman, R. L. Weiss, and J. M. Eddy. Marital interaction coding system: Revision and empirical evaluation. Behavioural Research and Therapy, 33:737-746, 1995.
[29] J. R. Hoffman and R. P. S. Mahler. Multitarget miss distance via optimal assignment. IEEE Trans. on Systems, Man and Cybernetics - Part A: Systems and Humans, 34(3):327-336, 2004.
[30] L. Holden, R. Hauge, and M. Holden.
Adaptive independent metropolis-hastings. Annals of Applied Probability, (19):395-413, 2009.
[31] A. Jaimes, T. Omura, T. Nagamine, and K. Hirata. Memory cues for meeting video retrieval. In Proc. ACM Workshop of archival and retrieval of personal experiences, 2004.
[32] Z. Khan, T. Balch, and F. Dellaert. Mcmc-based particle filtering for tracking a variable number of interacting targets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(11):1905-1918, 2005.
[33] Z. Khan, T. Balch, and F. Dellaert. Mcmc data association and sparse factorization updating for real time multitarget tracking with merged and multiple measurements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1960-1972, 2006.
[34] S. Kligys, B. Rozovsky, and A. Tartakovsky. Detection algorithms and track before detect architecture based on nonlinear filtering for infrared search and track systems. Technical report, CAMS-University of Southern California, 1998.
[35] J. H. Kotecha and P. M. Djuric. Gibbs sampling approach for generation of truncated multivariate gaussian random variables. In Proc. ICASSP, 1999.
[36] C. Kreucher, M. Morelande, K. Kastella, and A. O. Hero. Particle filtering for multi-target detection and tracking. IEEE Trans. on Aerospace and Electronic Systems, (41):1396-1414, 2005.
[37] G. Lathoud and I. McCowan. Location based speaker segmentation. In Proc. ICASSP, 2003.
[38] J. S. Liu. Metropolized independent sampling with comparison to rejection sampling and importance sampling. Statistical Computing, (6):113-119, 1996.
[39] J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. Book Series Lecture Notes in Computer Science, Springer, 1843:3-19, 2000.
[40] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel. Performance measures for information extraction. In Proc. of DARPA Broadcast News Workshop, 1999.
[41] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):305-317, 2005.
[42] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42(2):109-142, 1980.
[43] I. Mikic, K. Huang, and M. Trivedi. Activity monitoring and summarization for an intelligent meeting room. In Proc. IEEE Workshop on Human Motion, 2000.
[44] S. Oh, S. Russell, and S. Sastry. Markov chain monte carlo data association for general multiple-target tracking problems. In Proc. IEEE International Conference on Decision and Control (CDC), 2004.
[45] K. Okuma, A. Taleghani, N. De Freitas, J. J. Little, and D. G. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proc. the European Conference on Computer Vision, 2004.
[46] L. R. Rabiner. A tutorial on hmm and selected applications in speech recognition. Proc. of IEEE, 1989.
[47] C. Rago, P. Willett, and R. Streit. A comparison of the jpdaf and pmht tracking. In Proc. ICASSP, 1995.
[48] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Trans. on Speech and Audio Processing, 1995.
[49] B. Ristic. A comparison of mht and 2d assignment algorithm for tracking with an airborne pulse doppler radar. In Proceedings of the Fifth International Symposium on Signal Processing and Its Applications, pages 341-344, 1999.
[50] G. O. Roberts and J. S. Rosenthal. Markov chain monte carlo: Some practical implications and theoretical results. Canadian Journal of Statistics, (26):5-31, 1998.
[51] R. L. Rothrock and O. E. Drummond. Performance metrics for multiple-sensor, multiple-target tracking. In Proc. Signal and Data Processing of Small Targets, pages 521-531, 2000.
[52] V. Rozgic, C. Busso, P. G. Georgiou, and S. Narayanan. Speaker tracking and segmentation with microphone array using mixture particle filter: Improvement of multimodal meeting monitoring system. In Proc.
of Multi Media Signal Processing Conference, 2007.
[53] V. Rozgic, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou, and S. Narayanan. A new multichannel multimodal dyadic interaction database. In Proc. IS, 2010.
[54] D. Schuhmacher, B-T. Vo, and B-N. Vo. A consistent metric for performance evaluation of multi-object filters. IEEE Trans. on Signal Processing, 56(8):3447-3457, 2008.
[55] D. Schuhmacher and A. Xia. A new metric between point process distributions. Advances in Applied Probability, 40(3), 2008.
[56] Y. Bar-Shalom. Multitarget/Multisensor Tracking: Applications and Advances. Artech House, 2000.
[57] D. Siegmund. Sequential Analysis. Springer-Verlag: New York, 1985.
[58] D. Sigalov and N. Shimkin. Data association in multi target tracking using cross entropy based algorithms. 2010.
[59] K. Smith. Bayesian methods for visual multi-object tracking with applications to human activity recognition. PhD Thesis, EPFL, 2001.
[60] J. Vermaak, A. Doucet, and P. Perez. Maintaining multimodality through mixture tracking. In Proc. IEEE International Conference on Computer Vision, October 2003.
[61] J. Vermaak, S. Godsill, and P. Perez. Monte carlo filtering for multi-target tracking and data association. IEEE Trans. on Aerospace and Electronic Systems, 2005.
[62] A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27:1743-1759, 2009.
[63] D. Vukadinovic and M. Pantic. Fully automatic facial feature point detection using gabor feature based boosted classifiers. In SMC, 2005.
[64] P. R. Williams. Performance bounds for track-before-detect target detection. In Proc. Signal and Data Processing of Small Targets, 1998.
[65] L. A. Wolsey. Integer Programming. John Wiley & Sons, 1998.

Appendix

.1 Publications

.1.1 Journal

[1] V. Rozgic, F. Septier, P. G. Georgiou and S. S.
Narayanan, Block Sampling MCMC algorithm for multi-object tracking, to be submitted to IEEE Signal Processing Letters, Nov 2010
[2] V. Rozgic, F. Septier, P. G. Georgiou and S. S. Narayanan, Adaptive MCMC block sampling for multi-object tracking, to be submitted to IEEE Transactions on Signal Processing, Dec 2010
[3] V. Rozgic, A. Katsamanis, B. Baucom, P. G. Georgiou and S. S. Narayanan, Learning and inference for ordinal regression time series models: estimation of approach-avoidance in dyadic interactions, to be submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, Jan 2011
[4] V. Rozgic, K. J. Han, P. G. Georgiou and S. S. Narayanan, Multimodal speaker segmentation and identification in presence of overlapped speech segments, Journal of Multimedia, Special Issue on Data Semantics and Multimedia Information Management, 2009
[5] M. Li, V. Rozgic, G. Thatte, A. Emken, M. Annavaram, U. Mitra, D. Spruijt-Metz and S. S. Narayanan, Multimodal physical activity recognition by fusing temporal and cepstral information, In IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2010

.1.2 Conference

[1] V. Rozgic, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou and S. S. Narayanan, Estimation of ordinal approach-avoidance labels in dyadic interactions: ordinal logistic regression approach, submitted to the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2010
[2] V. Rozgic, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou and S. S. Narayanan, A new multichannel multimodal dyadic interaction database, In Proceedings of InterSpeech (IS), 2010
[3] V. Rozgic, K. J. Han, P. G. Georgiou and S. S. Narayanan, Multimodal speaker segmentation in presence of overlapped speech segments, In Proceedings of the IEEE International Symposium on Multimedia (ISM), 2008
[4] V. Rozgic, C. Busso, P. G. Georgiou and S. S.
Narayanan, Multimodal meeting monitoring: Improvements on speaker tracking and segmentation through a modified mixture particle filter, In Proceedings of IEEE International Workshop on Multimedia Signal Processing (MMSP), 2007
[5] J. Silva, V. Rangarajan, V. Rozgic and S. S. Narayanan, Information theoretic analysis of direct articulatory measurements for phonetic discrimination, In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007
[6] G. Thatte, V. Rozgic, M. Li, S. Ghosh, U. Mitra, S. S. Narayanan, M. Annavaram and D. Spruijt-Metz, Optimal allocation of time-resources for multi-hypothesis activity-level detection, In Proceedings of the International Conference on Distributed Computing in Sensor Systems (DCOSS), 2009
[7] G. Thatte, V. Rozgic, M. Li, U. Mitra, S. S. Narayanan, M. Annavaram and D. Spruijt-Metz, Optimal time-resource allocation for activity-detection via multimodal sensing, In Proceedings of the International Conference on Body Area Networks (BodyNets), 2009
[8] D. Spruijt-Metz, M. Li, G. Thatte, G. Sukhatme, M. Annavaram, S. Ghosh, V. Rozgic, U. Mitra, N. Medvidovic, B. Belcher and S. Narayanan, Differentiating physical activity modalities in youth using heartbeat waveform shape and differences between adjacent waveforms, In Proceedings of ICDAM, 2009
[9] S. Lee, M. Annavaram, G. Thatte, V. Rozgic, M. Li, Urbashi Mitra, S. S. Narayanan and D.
Spruijt-Metz, Sensing for obesity: KNOWME implementation and lessons for an architect, In Proceedings of the Workshop on Biomedicine in Computing: Systems, Architectures, and Circuits (BIC), 2009

.1.3 Talks and Presentations

[1] Analysis of approach and avoidance behavior in the new multimodal dyadic interaction database, Human Communication Dynamics workshop, Institute for Creative Technologies, Playa Vista, CA, USA, Aug 2010
[2] Sequential Monte Carlo multi-target tracking algorithms with block resample-move MCMC steps, Sequential Monte Carlo Methods - Final workshop, Statistical and Applied Mathematical Sciences Institute (SAMSI), NC, USA, Nov 2009
[3] Sequential Monte Carlo multi-target tracking algorithms with one at the time resample move MCMC steps, Sequential Monte Carlo Methods - Mid-program workshop, Statistical and Applied Mathematical Sciences Institute (SAMSI), NC, USA, Feb 2009
[4] Performance of the resample move algorithm on the synthetic multi-target tracking dataset, Sequential Monte Carlo Program - Statistical and Applied Mathematical Sciences Institute (SAMSI), NC, USA, Dec 2008

.2 Baseline microphone array likelihood model

In order to validate our proposed likelihood model for the microphone array observations we compare it with the recently proposed model [21] that uses only the global maximum of the SPR-GCC-PHAT function. For completeness we briefly present here this baseline model.

For the observations $y_t^{MA} = (y_{t,1}^{MA}, \ldots, y_{t,M_t}^{MA})$, locations of the participants $x_t = (x_{t,1}^T, \ldots, x_{t,K_t}^T)^T$ and the speaker activity vector $a_t = (a_{t,1}, \ldots, a_{t,K_t})$ the joint likelihood is defined as:

[equation not preserved in the transcript]

where the probabilities $P_1$ and $P_0$ are given by the following equations:

[equations not preserved in the transcript; the surviving fragment is the condition $(\exists j) : \|y_{t,j} - x_{t,i}\| \le A$]

The constant $A$ defines the neighborhood size and the constants $L_1$ and $L_0$ are chosen to favor observation existence for each active speaker, and observation absence for all participants that are not speaking.
Therefore the ratio $L_1 / L_0$ has to be significantly greater than one.
Abstract
In this dissertation we propose contributions that address the problems in behavioral signal processing for small-group interactions from three important perspectives. We propose algorithmic contributions to general statistical inference methods for interacting dynamical systems, in particular multi-object tracking problems. We propose multi-modal, multi-channel signal processing methods to address particular aspects of the small group interaction, with emphasis on speaker segmentation and speaker/participant tracking. Finally, we present a recording environment and a collected dyadic interaction database, and propose methods for estimation of approach-avoidance behavior labels based on non-verbal interaction cues.

In the first part of this dissertation we present a class of sequential block sampling algorithms for tracking an unknown and variable number of objects. The proposed algorithms are applicable to multi-object tracking scenarios in which the only available observations are detector outputs, and also to scenarios where both detector outputs and more complex observations, which figure in the data-association free likelihood models, are available. The proposed algorithms provide a way to construct block proposal distributions using detection based observations. Key parts of the proposed algorithms are methods for sampling block proposal distributions. We propose two novel methods for this purpose: one is based on a variational approximation scheme and the other represents an adaptive MCMC sampling scheme. Samples from block proposal distributions are further used in the sequential MCMC (or SMC) framework. We tested the proposed schemes on two synthetic datasets. Results demonstrate the benefits of processing longer observation sequences in multi-object tracking problems in a more efficient manner than the classical sequential sampling schemes.

In the second part, we present a multi-target tracking algorithm for tracking multiple speakers by a microphone array.
As the microphone array observations do not provide an easy way to design speaker location detectors we propose a mixture particle filter for tracing multiple acoustic sources track-before-detection (TbD) framework. This method belongs to the same class of sequential signal processing algorithms (SMC or MCMC) as the block sampler proposed in the first part, while the major difference is that block sampler belongs to the detect-before-tracking class of algorithms. The sound source trajectories reconstructed by by the mixture particle filter do not necessarily correspond to speech only. Therefore, we apply an adapted optimal change point algorithm to segment obtained sound source trajectories into speech and non-speech segments. The algorithm is tested on a multi-participant meeting database as a separate module and as a part of a multi-modal system for automatic meeting monitoring. In both cases it provided significant improvements on the speaker detection and segmentation tasks. ❧ In the third part, we present a modality fusion algorithm that exploits complementary properties of video tracking, microphone array localization and speaker identification and solves the problem of speaker segmentation in presence of the overlapped speech. In this paper we address improvements to our multimodal system for ❧ tracking of meeting participants and speaker segmentation with a focus on the microphone array modality. We propose an algorithm that uses Directions-of-Arrival estimated for each microphone pair as observations and performs tracking of an unknown number of acoustically-active meeting participants and subsequent speaker ❧ segmentation. The proposed algorithm is unique from multiple perspectives. First, we suggest a hidden Markov model architecture that performs fusion of three modalities: a multi-camera system for participant localization, a microphone array for speaker localization, and a speaker identification system
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Interaction dynamics and coordination for behavioral analysis in dyadic conversations
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Multimodal and self-guided clustering approaches toward context aware speaker diarization
Human behavior understanding from language through unsupervised modeling
Modeling and regulating human interaction with control affine dynamical systems
Establishing cross-modal correspondences for media understanding.
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Modeling expert assessment of empathy through multimodal signal cues
Dynamic graph analytics for cyber systems security applications
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
Multimodality, context and continuous dynamics for recognition and analysis of emotional states, and applications in healthcare
Context-aware models for understanding and supporting spoken interactions with children
Provable reinforcement learning for constrained and multi-agent control systems
Speech recognition error modeling for robust speech processing and natural language understanding applications
Machine learning in interacting multi-agent systems
Interaction and topology in distributed multi-agent coordination
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Efficient inverse analysis with dynamic and stochastic reductions for large-scale models of multi-component systems
Computational methods for modeling nonverbal communication in human interaction
Asset Metadata
Creator
Rozgić, Viktor (author)
Core Title
Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
08/09/2011
Defense Date
08/09/2011
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
behavioral signal processing, multi-modal signal processing, OAI-PMH Harvest, statistical inference
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Narayanan, Shrikanth S. (committee chair), Georgiou, Panayiotis G. (committee member), Schaal, Stefan (committee member)
Creator Email
rozgic@gmail.com, rozgic@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c127-653978
Unique identifier
UC1354693
Identifier
usctheses-c127-653978 (legacy record id)
Legacy Identifier
etd-RozgiVikto-267-0.pdf
Dmrecord
653978
Document Type
Dissertation
Rights
Rozgić, Viktor
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
behavioral signal processing
multi-modal signal processing
statistical inference