EXPLOITING LATENT RELIABILITY INFORMATION FOR CLASSIFICATION TASKS

by

Komath Naveen Kumar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2016

Copyright 2016 Komath Naveen Kumar

Dedicated to all my teachers and my parents.

Contents

Dedication
List of Tables
List of Figures
Abstract
1 Introduction
2 Online sensing rate adjustment
  2.1 Motivation
  2.2 Random Sensing Network over Stationary Fields
  2.3 Adaptive Sensing
    2.3.1 Oversensing
    2.3.2 Undersensing
  2.4 Problem Setup
  2.5 Adaptive Field Monitoring Scheme
    2.5.1 Target Localization
    2.5.2 Parameter estimation
    2.5.3 Reliability metric for reconstruction
    2.5.4 Measuring motion by target tracking
  2.6 Feedback Mechanism for Rate Adjustment
  2.7 Stability Analysis
  2.8 Experiments
  2.9 Simulation Results
  2.10 Conclusion
3 Object Classification in Sidescan Sonar Images
  3.1 Problem Definition
    3.1.1 Dataset Characteristics
  3.2 Segmentation
    3.2.1 Mean-Shift clustering based Segmentation
    3.2.2 Choosing the segmentation target: Highlights or Shadows
    3.2.3 Denoising using iterative hierarchical clustering
  3.3 Feature Extraction
    3.3.1 Zernike Moments
  3.4 Using Shadows for Object Classification
    3.4.1 Shadow features
4 Intelligibility Classification in Pathological Speech
  4.1 Dataset
  4.2 Preprocessing and Features
5 Reliability Aware Classification
  5.1 Reliability as a latent factor
  5.2 Generative Model for reliability
    5.2.1 Illustration
    5.2.2 ML Parameter Estimation
    5.2.3 Inference of posterior distribution of unknown variables
  5.3 Discriminative Model for reliability
    5.3.1 Formulation and Notation
  5.4 Training the reliability aware model
    5.4.1 The EM algorithm
6 Experiments and Results
  6.1 Generative: Reliability aware Bayesian Model
    6.1.1 Synthetic Experiments: accuracy vs. reliability ρ
    6.1.2 Object classification experiment
    6.1.3 Discussion
  6.2 Discriminative: Reliability aware Max-Ent model
7 Multimodal mixture of experts
  7.1 Introduction
  7.2 Dataset and experimental setup
  7.3 Feature Design
    7.3.1 Audio features
    7.3.2 Video features
  7.4 Systems for Emotion Prediction
    7.4.1 Audio Only, Video Only and Early Fusion
    7.4.2 Late Fusion Model
    7.4.3 Proposed Mixture of Experts (MoE)-based Fusion Model
  7.5 Evaluation
  7.6 Results and Observations
  7.7 Conclusions
8 Reliability in the crowd
  8.1 Introduction and background
  8.2 Multiple annotators reliability model
    8.2.1 Parameter estimation
    8.2.2 Experiments
  8.3 Data dependent reliability
  8.4 Experiments and Results
9 Future Work
  9.1 Latent annotator reliability models
  9.2 Label Stacking
Reference List

List of Tables

2.1 Control Feedback Rules
6.1 Table shows the decision rules for the different classification schemes for which results are presented later in Fig. 6.2.
6.2 Average reliability parameters ρ for the shadow and highlight features on different datasets. The values are obtained after training a doubly unreliable model as shown in Fig. 6.3.
6.3 Results of reliability-aware binary classification of speech intelligibility for different feature sets. All accuracy figures are in %. The chance accuracy is 50.3%. Note that improvement in classification accuracy is obtained when the feature sets are more unreliable (highlighted).
6.4 Results of reliability aware binary classification on the 7-class object classification task in underwater images.
7.1 Performance of different models in predicting continuous in time and scale arousal-valence curves
8.1 Results are shown on synthetic datasets where the annotators were created using pre-determined average reliability ρ_m's. The estimated ρ_m's roughly follow the trend of the applied ρ_m's. Also note that the proposed reliability aware approach outperforms the baseline majority voting method on all tasks.
8.2 Table shows the correlation between reliability posteriors on the dataset and self-reported confidence.
8.3 Results using the Mixture of Annotator Reliabilities (MoAR) and baseline methods. Note that the proposed MoAR model, being an ensemble method, is able to improve over the oracle baseline.
9.1 Results comparing label stacking for different classifiers. Label stacking improves results for SVM training, which might imply that selective training by weighting samples differently is necessary.

List of Figures

2.1 Per-node sensing rate λ_1s vs. the collection interval T, for N = 1000 nodes, S = 10, T_p = 0.2 sec, b = 4, α_s = 244 packets and γ_0 = 15 dB.
2.2 The adaptive sensing scheme.
2.3 (a)-(d) show the effect when the field changes faster than the sensing rate. (a) is the original map observed within a single T_coh while (c) shows the same field after an observation interval T = 10×T_coh. Notice how (d) is more similar to the blurred map (c) than (a).
2.4 Block diagram for the adaptive feedback mechanism. G denotes the true measurements on the time-varying field.
2.5 Modified discrete step update scheme using quantized directions. Asterisk (*) indicates points on the grid on the map.
2.6 Figure on the left shows a sample map with sequential update steps for each point converging at local peaks (marked by diamond symbols). The figure on the right shows the window of width W = 3 during the current iteration and the general direction of the gradient from the center.
2.7 Steps in rate adjustment: (b) shows the samples chosen at random and (d) is the map generated using the parameters estimated on (c). Notice how (d) is approximately similar to (a) when detection is accurate.
2.8 Figure shows the (a) HF and (b) LF components of the model-based error, and the minimum average motion per frame for the (c) L = 5 and (d) L = 10 cases. The system is in open loop control, i.e. the error metrics are not being compensated for, which can be noted from the flat T line.
2.9 Four states of the control algorithm
2.10 Figure shows the internal states of the algorithm as it makes various rate adjustment decisions. States are the same as described in Table 2.1.
2.11 Closed loop control
2.12 Averaged MSE and F-score evaluation metrics for a) no feedback, b) the correlation baseline mechanism and c) the proposed feedback algorithm at different velocity and decay parameter settings.
3.1 The four types of object in the NSWC sidescan sonar database
3.2 Sample objects from three different databases: SSPS, NSWC and NURC. The data illustrate the differing characteristics across objects and domains.
3.3 An example from the NURC database showing highlight (right) and shadow (left) segmented from the original image (center) using the mean shift clustering based segmentation scheme.
3.4 Stages in segmentation of an image (a) from the SSPS database: (b) mean shift segmentation into shadow (brown) and highlight (white) and (d) iterative hierarchical clustering (bottom-right). ((c) indicates the effect of doing a connected component analysis instead.)
3.5 Zernike Moment magnitude features for 296 samples from the NSWC database. The solid red line is an indicator of the class labels and the vertical axis corresponds to different feature dimensions. Note how features are similar within one class in spite of the objects being oriented at different angles, and vary across different classes.
3.6 Average shadows from each class in the NSWC database. (Negatives of the images are shown to highlight shadows.)
5.1 Bayesian model proposed in [1] for reliability-aware fusion of feature sets for object classification in underwater sonar images. The dashed line between features X and Y indicates that the dependence between them is contingent on the reliability R. Latent variables are denoted by unshaded nodes.
5.2 The reliability node r controls the link between the class label and the shadow features. The shaded nodes represent variables that can be directly observed, whereas the unshaded node corresponding to r is assumed to be hidden. Also, the arrows between pairs of nodes indicate the only direct conditional dependencies.
5.3 Data generated using the model in Fig. 5.2 above. Red crosses indicate the centers for each of the class clusters. For generating the shadow features, P(r) = 0.5 was used, i.e. 50% of the samples were randomly selected to have unreliable shadow features.
6.1 Performance degradation with change in reliability ρ. Accuracies indicated are on the 5000-sample synthetically-generated dataset. The features presented are for ρ = 0.8.
GMM indicates training a Gaussian model per class and classification using a MAP rule.
6.2 Comparison of classification accuracies on three different databases using different feature sets and fusion schemes. Reliability aware fusion yielded the best performance on the NURC dataset, where the shadow features were most unreliable (see the ρ̂ values). Also, in general, dimension reduction helps since the parameters of the Gaussian models can be better estimated.
6.3 QRSHY: Doubly unreliable model.
7.1 Plot showing the variation in scaled video compressibility and scaled arousal value for a sample movie
7.2 Schematic representation showing the formation of the Histogram of Face Area (HFA) for a sample video
7.3 Schematic representation of the Proposed Mixture of Experts (MoE)-based Fusion Model
7.4 Plots showing the variation in ρ_ar and ρ_vl with α for the Late Fusion Model
9.1 Latent annotator reliability Bayesian model. A indicates the latent annotator type.

Abstract

Researchers often need to work with noisy human annotations that are inherently subjective and challenging for the annotators to provide. In my thesis, I propose a model to account for the reliability of annotations associated with different samples while training classification models. Reliability is modeled as a latent factor that controls the dependence between observed features and their corresponding annotated class labels. An expectation-maximization algorithm is used to estimate these latent reliability scores for maximum-entropy models in a mixture-of-experts like framework.
I test the robustness of the proposed approach on multiple speech, multimedia and affective computing tasks, and show that the method is able to exploit latent reliability information about the inherently noisy aspects of training data. Additionally, the reliability models also lend themselves easily to the crowdsourcing scenario, where the challenge lies in estimating the wisdom of a crowd of annotators from their opinions.

Chapter 1
Introduction

Despite significant advances in machine learning techniques, we find that certain pattern recognition problems are intrinsically more difficult than others. For example, in a classification task certain classes may be more "noise-like" or difficult to model than others. The challenge in these tasks often lies in adequately modeling the variability in observations within a given dataset. Consider the binary classification task for intelligibility in pathological speech [2]. The non-intelligible class in this case often presents more variability than the intelligible class. Thus, direct modeling of the variability in the feature distribution might not always be the best approach to deal with the heterogeneity in such cases. In theory, we would like to isolate and learn from only those aspects of this variability that are useful for the classification task. This information could be either in the form of knowledge of certain useful features, or of training samples that are more informative than others in the dataset. In this work, I use the attribute "reliable" to refer to such aspects that are informative in the context of a pattern recognition task. In practice, this reliability information is usually hidden and must be jointly estimated during learning.

Algorithms such as boosting [3] try to solve this problem by weighting data samples differently to sequentially train a cascade of weak learners. Misclassified samples are given more weight in subsequent stages of training such that the learners are complementary.
In fact, boosting belongs to an increasingly popular class of machine learning algorithms referred to as ensemble methods [4, 5]. Ensemble methods such as random forests [6, 7] are also able to jointly learn the features that are useful for classification. Yet another variation of ensemble methods is known as mixture-of-experts [8, 9], in which each "expert" classifier is proficient in modeling certain regions of the feature space.

In most practical classification tasks, heterogeneity in the dataset often results from outliers. This may be due to noisy observations or because some of the features are not informative. In order to achieve robustness to this heterogeneity, it is important for the algorithms to first distinguish between what is "reliable" and what is not for learning. The most common techniques to deal with this problem tend to approach it in an ad-hoc fashion. For example, outlier removal methods based on the distribution of features might prune the dataset to remove any unreliable samples. However, this pruning is often not motivated by the classification task for which it is required. Alternatively, the Random Sampling Consensus (RANSAC) [10] paradigm helps to find the largest subset of the data whose samples have consensus among themselves with respect to a classification model and can be assumed to contain only inliers. At the other end of the spectrum, one might consider methods that assess the reliability of their output in a post-hoc fashion. Using a model to assess the confidence of a decision, however, comes with the strong assumption that the model was trained on clean data to begin with.

Finally, the class of robust learning algorithms ensures that model training is insensitive to noisy samples. A good example of such an algorithm is the Support Vector Machine (SVM) [11] classifier, which maximizes the margin between the class boundary and the nearest samples from each class, known as the support vectors [12].
Since the class boundary depends only on a few reliable support vectors, the presence of other outliers makes little difference to the SVM result. Along these lines, we proposed a Bayesian network for feature fusion in [1] that learns the reliability of each sample and feature set for the classification task during training. Reliability is defined as a latent factor in the generative model that controls the dependence between the features and the class label. We extend this idea of reliability as a latent factor to define the uncertainty associated with different aspects of a classification task, including features, labels or multiple annotators.

This document is arranged as follows. In Chapter 2, I first motivate the importance of reliability estimation using a compressed-sensing application, where the reliability of target detection is used to guide an adaptive rate control algorithm that compensates for changes in a monitored time-varying field. An unsupervised estimate of reliability is used as a feedback metric to decide if the sensing rate needs to be changed. Later, I abstract the notion of reliability for supervised classification tasks and discuss models for reliability aware classification in the context of two problems: object classification in underwater sidescan sonar images (Chapter 3) and intelligibility classification of pathological speech (Chapter 4). These tasks and their preprocessing steps are discussed first, followed by a generative and a discriminative model for reliability aware classification in Chapter 5. Experimental results obtained from these models are presented in Chapter 6. A similar idea for reliability can also be exploited on multimodal datasets such as movies, as presented in Chapter 7. Finally, we present extensions of the proposed reliability aware model to crowdsourcing, where multiple labels are provided by different annotators. Results on an emotion prediction and a face age estimation task are presented using this model in Chapter 8. Possible future directions are discussed in Chapter 9.
Chapter 2
Online sensing rate adjustment

2.1 Motivation

The emergence of the compressed sensing framework presents significant potential for efficient sensing and sampling systems [13, 14, 15] by helping to reduce sample complexity under realistic communication constraints. Sensor network technology greatly benefits from the compressed sensing paradigm [16, 17, 18, 19, 20, 21, 22, 23, 24]. As an example, large-scale networks deployed for long-term monitoring of dynamic fields, e.g. underwater environments, typically need to account for power consumption considerations. An efficient scheme in such cases calls for a joint optimization of the sensing and communication constraints.

To address this issue, Fazel et al. proposed a Random Access Compressed Sensing (RACS) scheme in [25, 26] for energy-efficient monitoring of sensing fields. The proposed sensing scheme depends on integrating information from the communication and channel access modules into the data acquisition process. However, the proposed RACS scheme is designed for stationary sensing fields, where the process being monitored is assumed to remain fixed during sensing. It is assumed that the coherence properties of the underlying field are known a priori and remain unchanged throughout the full sensing duration. This assumption might be justified for monitoring of natural phenomena, where the rate of variations in the process typically remains constant. It is also common to find similar assumptions of stationarity in related works on object detection or classification in underwater fields [27, 28]. However, when the field being monitored undergoes a varying rate of change (e.g., when the process is impacted by a target that is moving at an unknown or variable speed), such assumptions may not hold.

In a more recent work, Kerse et al. [29] addressed this problem by proposing to unify target detection with reconstruction using a standard sparse identification technique.
Whiletheirmethoddoesn’trequireanyaprioiknowledgeofthenumberoftargets orcoherencepropertiesoftheunderlyingfield,itreliesonknowledgeoftheexacttarget signaturemodelsforbothtargetlocalizationandtracking. Weproposearateadjustment method that does not require explicit knowledge of the target signature. Rather knowl- edge of the family of parametric models to which the fieldmight belong is sufficient as modelparameterscanbejointlyestimated. Similar to the work in [29], we consider that the end-goal in sensing is to detect or track targets, and incorporate these data processing aspects into the joint sensing and communication scheme. We propose a framework to adjust the sensing rate by estimating different attributes of the field to make an informed decision. Our adaptive rateadjustmentprocedureforcompresssensingiterativelyadjuststheper-nodesensing ratetocapturethevariationsinthetheunderlyingfield. First,wetreatthefieldasquasi- stationary and apply random access compress sensing within each stationary segment. Second, we compute two heuristic metrics that seek to tie in the end goal of target detection/tracking with the rate adjustment scheme. Using the data collected in each segment, the fusion center (FC) relies on a detection algorithm to first determine the currentstateitisin. Finally,dependingonthecurrentstateacontrolalgorithminstructs theFCtochangethesensingrateifrequired. A high rate of sensing would typically bode well for target detection/tracking, but it is not energy efficient. On the other hand a low rate of sensing does not necessarily lead to poor performance in detection. Thus there exists a trade-off between energy efficiencyandthetargetdetectionaccuracywhichtherateadjustmentseekstoexploit. 5 2.2 RandomSensingNetworkOverStationaryFields ConsideragridnetworkconsistingofN =P×Qsensors,withP andQsensorsinthex andydirections,respectively. Theunderlyingassumptionisthatmostsignalsofinterest (naturalorman-made)haveasparserepresentationinthespatial-frequencydomain. 
We denote the sparsity of the signal by S. The data from the distributed sensors is conveyed to the FC, where a full map of the sensing field is reconstructed. This map can be used for detection, as will be shown in Sec. 2.5.

Inspired by the theory of compressed sensing, the architecture proposed in [25, 26] employs random sensing, i.e., transmission of sensor data from only a random subset of all the nodes. For a stationary field, each sensor node measures the signal of interest at random time instants — independently of the other nodes — at a rate of λ_1 measurements per second. It then encodes each measurement along with the node's location tag into a packet, which is digitally modulated and transmitted to the FC in a random access fashion.

Because of the random nature of channel access, packets from different nodes may collide, creating interference at the FC, or they may be distorted as a result of the communication noise. A packet is declared erroneous if it does not pass the cyclic redundancy check or a similar verification procedure. Since the recovery is achieved using a randomly selected subset of all the nodes' measurements, we let the FC discard the erroneous packets as long as there are sufficiently many packets remaining to allow for the reconstruction of the field.

Figure 2.1: Per-node sensing rate λ_1s vs. the collection interval T, for N = 1000 nodes, S = 10, T_p = 0.2 sec, b = 4, α_s = 244 packets and γ_0 = 15 dB.

The FC thus collects the useful packets over a collection interval of duration T. The interval T is assumed to be much shorter than the coherence time of the process, such that the process can be approximated as fixed during one such interval. Let R_xx(τ) denote the temporal autocorrelation of the process, which quantifies the average correlation between two samples of the process separated by time τ. The coherence time T_coh
is defined as the time lag during which the samples of the signal are sufficiently correlated, i.e., R_xx(T_coh) = q·R_xx(0), where q is the desired level of correlation (e.g., q = 98%).

The measurements collected at the FC can then be expressed as

    y = RΨv + z    (2.1)

where z represents the sensing noise, Ψ is the inverse Discrete Fourier Transform matrix, v is the sparse vector of Fourier coefficients, and R is a K×N matrix — with K corresponding to the number of useful packets collected during T — which models the selection of correct packets. Each row consists of a single 1 in the position corresponding to the sensor contributing the useful packet. The FC can form R from the correctly received packets, since they carry the location tag. We emphasize the distinction between the sensing noise z, which arises due to limitations in the sensing devices, and the communication noise, which is a characteristic of the transmission system. The sensing noise appears as an additive term in Eq. (2.1), whereas the communication noise results in packet errors and its effect is captured in the matrix R.

The FC then recovers the map of the field using sparse approximation algorithms [30]. It suffices to ensure that the FC collects a minimum number of packets, N_s = O(S log N), picked uniformly at random from different sensors, to guarantee accurate reconstruction of the field with very high probability. The random nature of the system architecture necessitates a probabilistic approach to system design using the notion of sufficient sensing probability [25], denoted by P_s. This is the probability with which full field reconstruction is guaranteed at the FC. Setting this probability to a desired target value, system optimization under a minimum-energy criterion yields the necessary design parameter, i.e., the per-node sensing rate λ_1.
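As a concrete reading of the measurement model in Eq. (2.1), the sketch below builds a one-dimensional toy instance in pure Python. The sizes N, S and K, the random seed and the noise level are illustrative choices for this sketch, not values from the RACS design: Ψ is the inverse DFT matrix, v is S-sparse, and R selects the K sensors whose packets arrived intact.

```python
import cmath
import random

random.seed(7)

N, S, K = 64, 3, 20   # sensors, sparsity, useful packets (toy values)

# S-sparse vector of Fourier coefficients v
v = [0.0] * N
for idx in random.sample(range(N), S):
    v[idx] = random.uniform(1.0, 2.0)

def inv_dft(coeffs):
    """Field samples u = Psi v, with Psi the inverse DFT matrix."""
    n_pts = len(coeffs)
    return [sum(vk * cmath.exp(2j * cmath.pi * k * n / n_pts)
                for k, vk in enumerate(coeffs)) / n_pts
            for n in range(n_pts)]

u = inv_dft(v)

# R picks out the K correctly received packets; each packet carries the
# node's location tag, so the FC knows which rows of Psi it observed.
received = sorted(random.sample(range(N), K))
sigma = 0.01   # sensing-noise level (illustrative)
y = [u[n] + random.gauss(0, sigma) for n in received]

print(len(y))  # K noisy measurements, i.e. y = R Psi v + z
```

In RACS the FC would then recover v from y with a sparse approximation solver operating on the K observed rows of Ψ; that recovery step is omitted here.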
The minimum per-node sensing rate can be expressed in terms of the system parameters as [26]

    λ_1s = −(b+1)/(2NT_p·b) · W_0( (2NT_p·b)/(b+1) · e^(b/γ_0) · (1/T) · log(1 − α_s/N) )    (2.2)

where T_p is the packet duration, b is the packet detection threshold, α_s is the average number of packets that need to be collected in one observation interval T to meet the sufficient sensing probability, γ_0 is the nominal received SNR, and W_0(·) is the principal branch of the Lambert W function. (More details can be found in [26].) Note that, as shown in Fig. 2.1, λ_1s depends on the collection interval T, which in turn must be lower than T_coh for the stationarity assumption to hold.

2.3 Adaptive Sensing

In the above framework, the minimum per-node sensing rate λ_1s is determined based on the properties of the field and is then kept fixed throughout the entire sensing process. However, real underwater fields are seldom stationary, and the coherence properties of a non-stationary dynamic field usually vary significantly over time. This calls for the design of an adaptive random sensing network for the monitoring of temporally varying fields. The objective is to transition from passive monitoring, where only the map of the sensing field is reconstructed, to an active framework where the relevant information is extracted (e.g., target detection/tracking) and exploited to instruct the sensor nodes to adjust their sensing rates. To achieve this goal, we employ a detection method, which uses the collected data to determine the attributes of the underlying field. The FC can then use this information to determine any appropriate modifications to the per-node sensing rate, i.e., increase, maintain or decrease the sensing rate. This cycle of sensing-decision-adjustment is illustrated in Fig. 2.4.

Figure 2.2: The adaptive sensing scheme.

For a given collection interval T, the corresponding per-node sensing rate can be determined using Eq. (2.2), as shown in Fig. 2.1. The proper choice of T, however, depends on the rate of variations in the field and is adaptively tuned.
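Eq. (2.2) is straightforward to evaluate numerically. The sketch below is a best-effort reading of the formula, using a small Newton iteration for the principal branch W_0 rather than a library routine; the parameter values are those quoted in the caption of Fig. 2.1 (γ_0 = 15 dB converted to linear scale), and the collection interval T = 1000 s is an illustrative choice.

```python
import math

def lambert_w0(x, tol=1e-12):
    """Principal branch of the Lambert W function via Newton's method.
    Real-valued for x >= -1/e, which covers the arguments used here."""
    w = x if x > -0.3 else -0.3   # simple starting point for small |x|
    for _ in range(100):
        e = math.exp(w)
        step = (w * e - x) / (e * (1.0 + w))
        w -= step
        if abs(step) < tol:
            break
    return w

def lambda_1s(T, N, T_p, b, alpha_s, gamma_0):
    """Per-node sensing rate of Eq. (2.2), as parsed above."""
    c = 2.0 * N * T_p * b / (b + 1.0)
    arg = c * math.exp(b / gamma_0) * math.log(1.0 - alpha_s / N) / T
    return -lambert_w0(arg) / c

# Parameters from the caption of Fig. 2.1; gamma_0 = 15 dB -> 10**1.5 linear
rate = lambda_1s(T=1000.0, N=1000, T_p=0.2, b=4, alpha_s=244, gamma_0=10**1.5)
print(rate)
```

Since log(1 − α_s/N) is negative, the W_0 argument is negative and the resulting rate is positive; larger T gives a smaller required per-node rate, matching the trend in Fig. 2.1.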
In particular, we use an approach based on target detection, where we assume that an object/target model of interest is known beforehand. This is a common assumption in most supervised pattern recognition tasks. Given a reconstructed map, the location of the targets is first estimated. Using this knowledge, we then estimate the parameters of the object model from the map. These parameters describe the detection system's understanding of the field, and can be used to generate the map of the field. Comparing this model-based map with the observed one reconstructed using sparse approximation algorithms provides us with a reliability metric for reconstruction. In other words, if there is a difference between what the algorithm expects to see and what it sees, it indicates an error in either the acquisition of the field or the algorithm's understanding of the map. In either case, the FC needs to adapt its sensing rate. We specifically discuss our methods in the context of the following two cases:

2.3.1 Oversensing

This situation corresponds to the case where redundant sensing occurs because the per-node sensing rate is much larger than the rate of change of the field, i.e. T << T_coh. Although this case favors reconstruction using the RACS architecture, it leads to a waste of communication resources and is also energy inefficient. Thus we seek to lower the sensing rate in this case to an optimal point such that the accuracy of our end goal is not affected. Fig. 2.7 shows the result of oversensing for the example discussed in Section 2.5. In this case, we devise a scheme to estimate the motion of targets using multiple frames.

2.3.2 Undersensing

In this scenario, the rate of sensing per node is insufficient and the field changes within one collection interval, since T > T_coh. Thus the per-node sensing rate λ_1s needs to be increased. The targets are no longer steady within the duration of a collection interval T, leading to blurring, which makes target detection challenging.
The estimation task in this case is further complicated by the fact that the packets may have been collected from different frames during the interval T. This scenario leads to a violation of the stationarity assumptions made initially. The reconstructed map is thus blurred because different packets originate from different frames (Fig. 2.3). In this case, we use a specific error metric based on model-fitting to estimate the reliability of reconstruction and object detection.

Figure 2.3: (a)-(d) show the effect when the field changes faster than the sensing rate. (a) is the original map observed within a single T_coh while (c) shows the same field after an observation interval T = 10×T_coh. Panels: (a) last frame from the original map, (b) sensed image, (c) original map under motion blur, (d) recovered map after frame 10. Notice how (d) is more similar to the blurred map (c) than (a).

2.4 Problem Setup

We use the following field model to simulate our test case. Suppose a field with M targets is being observed. Each target in the field is assumed to generate a signature (e.g., heat, sound, etc.) decaying exponentially with distance from its location. At time t, the process observed by sensor node i at coordinate (x_i, y_i) is given by Eqn. (2.3):

    u_i(t) = Σ_{m=1}^{M} A_m · exp(−p·√((x_i − a_m(t))² + (y_i − b_m(t))²))    (2.3)

where a_m(t), b_m(t) are the coordinates of the m-th target at time t, A_m is its strength, and p is the decay rate of the sources. The process then evolves over time as the sources move along random trajectories. Similar models are commonly found in the energy-based localization literature for static sensor networks [31, 32, 33] when the targets being detected are not moving.

Initially, the FC has no knowledge of the location of the targets or the rate of variation in the field, i.e., the speed at which the targets are moving.
It thus instructs the data collection to begin with an initial sensing rate λ_1s^init. The initial sensing rate is determined using historical data by setting the desired parameters in Eq. (2.2). Once the map of the field G_sa is recovered using sparse approximation methods, the FC may use the rate-adjustment algorithm to decide if the sensing duration T needs to be adjusted. The sensors employ the adjusted sensing rate in the ensuing sensing duration. We discuss metrics for the proposed rate adjustment algorithm for this family of field models, although the framework itself is generic.

For simulating a map of the field as defined above, we start with a set of randomly chosen parameters. In addition, each target is assigned a random velocity and direction of movement. To simulate the undersensing case, we consider that collection occurs over N_b coherence time intervals (referred to as frames), where T ≈ N_b T_coh, N_b ∈ Z^+. The randomly sampled packets are then collected uniformly over the last N_b frames (Fig. 2.3). This effectively leads to motion blurring of the targets as mentioned before. To simulate the oversensing case, the collection interval is reduced to T' = T_coh while each target's velocity is scaled by T/T_coh to make the targets appear to be moving more slowly. To deal with the issue of a finite field size, targets moving out of the field are replaced by new targets starting from the same location and assigned a new random velocity and direction. We demonstrate our adaptive scheme to adjust the per-node sensing rate in RACS using a simulated example field. Using a simulated example allows us to control the coherence parameter T_coh and monitor the effect of changing the collection interval T in accordance with the adaptive algorithm.
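As an illustration, the field model of Eq. (2.3) takes only a few lines to simulate. The sketch below is not the dissertation's simulation code; the grid size, decay p, and target parameters are arbitrary illustrative choices.

```python
import numpy as np

def field_map(P, Q, targets, p):
    """Evaluate the Eq. (2.3) field model on a P x Q grid of sensor
    locations. `targets` is a list of (a_m, b_m, A_m) tuples giving each
    target's position and strength; p is the signature decay rate."""
    x, y = np.mgrid[0:P, 0:Q]
    G = np.zeros((P, Q))
    for a, b, A in targets:
        G += A * np.exp(-p * np.hypot(x - a, y - b))
    return G

# Two targets on a 20 x 50 field; each signature peaks at its target.
targets = [(5.0, 10.0, 1.0), (15.0, 40.0, 2.0)]
G = field_map(20, 50, targets, p=0.3)
```

Moving the targets between frames (updating a_m(t), b_m(t)) and mixing packets drawn from several frames reproduces the motion blurring of the undersensing case.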
2.5 Adaptive Field Monitoring Scheme

Although the algorithms discussed below have been adapted to this particular family of models used in the simulation example, I have attempted to lay out broad steps whenever possible and provide a general scheme to be used in such cases. These algorithm stages are presented in the block diagram shown in Fig. 2.4.

Figure 2.4: Block diagram for the adaptive feedback mechanism, with stages Random Access, Compressed Sensing (G_sa), Target Detection, Parameter Estimation, Model-based reconstruction (e_r), Target tracking (e_m), and Feedback. G denotes the true measurements on the time-varying field.

2.5.1 Target Localization

Before taking any decisions about the current sensing state, we first detect and localize the targets in the image. Similar to [29], by involving target detection in the feedback process, we would like to ensure that the rate of sensing is optimized for the end goal of target detection.

Traditionally, most energy-based localization methods [31, 32, 33] assume that the number of targets is known in advance. In addition, they assume a field model for target signature decay, allowing for parameter estimation by model-fitting. More recently, Kerse et al. proposed a method for direct target localization based on a standard sparse identification technique [29]. The advantage of their method is that target localization can be performed in a single step without first having to reconstruct the field. The number of targets can also be jointly estimated in this process using a sparsity constraint. However, the method still depends on exact knowledge of the field model.

To overcome this drawback, we propose target localization using an adaptive clustering of the reconstructed field based on local gradient ascent. The proposed method only requires that the target signatures are locally monotonically decaying away from the target. The technique is adapted from the mode-seeking mean shift algorithm [34], which is frequently used for unsupervised clustering of data. Consequently, it can be thought of as finding local peaks in the data histogram.
In the task at hand, we are instead interested in finding local peaks in intensity. We also modify the algorithm slightly to adapt to the discrete search space of this problem. Specifically, we start with a random initial point X on the map. Next, a center of mass X_c for the values inside a window of size W around the point X, weighted by the intensities I(x), is computed as shown in Eqn. (2.4). The line from X to X_c then gives the direction of the gradient. This mean-shift based gradient ascent technique is then used to update the location of X till a peak is found.

To make the algorithm better suited to a space of discrete sensor locations, we quantize the gradient to 8 directions. Depending on the direction of the gradient, X is then shifted to one of its 8-connected points. This entire procedure is repeated multiple times with new random initial points till the peaks have been discovered. In practice, it is not necessary to traverse all points in the image to discover all the peaks (Fig. 2.6), and we restrict the algorithm to a fixed number of iterations. The parameter W in this algorithm serves as a mask over which the gradient can be estimated more robustly. The target detection procedure is explained in Algorithm 1.

Algorithm 1: Modified gradient ascent algorithm for target detection on the acquired map (see Fig. 2.6).
    Input: P × Q map G_sa reconstructed by RACS
    Output: Locations of targets µ_1, µ_2, ..., where the number of targets is initially unknown
    Set maximum number of iterations to M, size of window to W
    Cluster index K ← 1
    for iter ← 1 to M do
        Randomly select an initial location X on the map
        while X hasn't already been visited do
            Mark X as visited and belonging to cluster K    // assuming we're going to find a new target
            Compute the new center of mass X_c within a window of size W around X
            Compute the direction from X to X_c and quantize it into 8 bins between {−π/8, π/8, 3π/8, ...} (Fig. 2.5)
            Update the position of X to one of the 8 adjacent positions based on the quantized direction (Fig. 2.5)
        end while
        Mark all points in the trail leading up to X as belonging to the same cluster as X
        if X belongs to the new cluster K then
            Append X as µ_K to the set of targets    // new target found
            K ← K + 1
        end if
    end for

    X_c = \frac{\sum_s I(s) K(s - X)\, s}{\sum_s I(s) K(s - X)}, \quad s \in P \times Q \text{ grid}    (2.4)

    K(s - x) = 1 if ||s − x||_c ≤ W, and 0 otherwise

Figure 2.5: Modified discrete step update scheme using quantized directions. Asterisk (*) indicates points on the grid on the map.

2.5.2 Parameter Estimation

Once the positions a_m, b_m of the targets have been roughly identified, the algorithm uses a method of comparison by synthesis to match the acquired map G_sa against a model. This is achieved by first estimating parameters of the object model defined in Eq. (2.3) given the observation G_sa. The parameters of the model, denoted by G_model, comprise the target locations (a_m, b_m), their respective strengths A_m, and the decay of the field denoted above as p.

We estimate the parameters via nonlinear regression [35] using a least squares error formulation as shown in Eqn. (2.5). For this purpose we use the nlinfit tool in MATLAB. Since the optimization is not convex, the choice of initial points is important. We use the target object locations estimated earlier and a reasonable value for the decay as our initial point. The estimated parameters are then used to generate a map of the field G_det (Fig. 2.7(d)). G_det represents what the algorithm expects the field to look like based on the detected targets and knowledge of the model.

Figure 2.6: The figure on the left shows a sample map with sequential update steps for each point converging at local peaks (marked by diamond symbols). The figure on the right shows the window of width W = 3 during the current iteration and the general direction of the gradient from the center.
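A compact sketch of the quantized gradient-ascent update from Algorithm 1 is given below. It is an illustrative reimplementation, not the thesis code: the window size, stopping rule, and the single-target test map are hypothetical choices.

```python
import numpy as np

# The 8 quantized step directions (dx, dy), one per angle k*pi/4, k = 0..7.
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def ascend(G, start, W=3, max_steps=500):
    """Climb the intensity map G from `start` by repeatedly moving toward
    the intensity-weighted center of mass of a (2W+1)-sided window
    (Eq. (2.4)), with each step quantized to one of the 8 neighbors."""
    P, Q = G.shape
    x, y = start
    for _ in range(max_steps):
        x0, x1 = max(0, x - W), min(P, x + W + 1)
        y0, y1 = max(0, y - W), min(Q, y + W + 1)
        win = G[x0:x1, y0:y1]
        xs, ys = np.mgrid[x0:x1, y0:y1]
        cx = (win * xs).sum() / win.sum()   # center of mass, Eq. (2.4)
        cy = (win * ys).sum() / win.sum()
        if abs(cx - x) < 1e-9 and abs(cy - y) < 1e-9:
            break                            # window is balanced: local peak
        k = int(np.round(np.arctan2(cy - y, cx - x) / (np.pi / 4))) % 8
        nx, ny = x + STEPS[k][0], y + STEPS[k][1]
        if not (0 <= nx < P and 0 <= ny < Q) or G[nx, ny] <= G[x, y]:
            break                            # stepping would go downhill
        x, y = nx, ny
    return x, y

# Single-target field: ascent from a corner should end near the peak (10, 15).
xs, ys = np.mgrid[0:20, 0:30]
G = np.exp(-0.3 * np.hypot(xs - 10, ys - 15))
peak = ascend(G, (2, 2))
```

Running `ascend` from many random starts and merging trails that reach the same peak gives the cluster/target list of Algorithm 1.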
    \widehat{a, b, A, p} = \arg\min_{a_m, b_m, A_m, p} \sum_{i=1}^{P} \sum_{j=1}^{Q} \left( \sum_{m=1}^{M} G^{model}_{ij}(a_m, b_m, A_m, p) - G^{sa}_{ij} \right)^2    (2.5)

    G^{det} = G^{model}(\widehat{a, b, A, p})    (2.6)

Figure 2.7: Steps in rate adjustment: (a) original map with different components, (b) shows the samples chosen at random, (c) the map reconstructed via RACS, and (d) the map generated using the parameters estimated on (c). Notice how (d) is approximately similar to (a) when detection is accurate.

2.5.3 Reliability Metric for Reconstruction

We obtain a reliability metric for reconstruction based on the error in the model-fitting described above. We compare the model-based map G_det against the non-model-based map G_sa recovered via sparse approximation by projecting both of them on the DFT matrix, and decomposing the error into a high frequency (HF) and low frequency (LF) term. The intuition is that G_sa should be similar to G_det except for any errors resulting from issues in either sensing or localization. A gross mismatch between the coefficients will be captured by the LF term, indicating an inaccurate detection, while a large value of the HF term (denoted by e_r) is characteristic of a poor reconstruction due to violation of the assumptions in RACS (Fig. 2.8 (a) & (b)). The cutoff frequency for the filters can be chosen from a previous estimate of the field. The error metric e_r thus comes in handy in the undersensing case, when the acquired map G_sa is either blurred or poorly reconstructed.

2.5.4 Measuring Motion by Target Tracking

In addition to compensating for the dynamic nature of the field in the undersensing case, we would also like to eliminate any redundant sensing. Recall that in the oversensing case, the field changes very slowly, yielding perfect reconstruction and detection.
This makes it necessary to monitor the field over multiple frames to detect any signs of oversensing. To quantify motion in the field, we track the location of detected targets over the last L frames.

Note that this is not trivial, since the number of targets detected is not guaranteed to be consistent from one frame to another. Moreover, the target indices assigned by the detection algorithm are non-unique. To deal with this issue, we use a tracking approach based on dynamic programming that ensures tracking even if the object is not detected in some of the intermediate frames. More specifically, we recursively minimize the total distance moved by a target over L frames. If the targets are spaced sufficiently apart, this ensures tracking of the slowest moving target over multiple frames.

Suppose (â_i^t, b̂_i^t) denotes the location of the i-th target detected in frame t, and let d^t_{ij} be defined as the distance between the i-th target detected in frame t and the j-th target detected in frame t+1. Then the net distance D(n_1, n_2, ..., n_L) moved by a target over L frames and detected at the indices {n_1, n_2, ..., n_L} is given by Eqn. (2.7).

    D(n_1, n_2, \ldots, n_L) = \sum_{t=1}^{L-1} d^t_{n_t n_{t+1}}, \quad 1 \le n_t \le M_t    (2.7)

    d^t_{ij} = \sqrt{(a^t_i - a^{t+1}_j)^2 + (b^t_i - b^{t+1}_j)^2}    (2.8)

where M_t is the total number of targets detected in each frame. The optimization problem then boils down to finding the sequence of object location indices {n_1, n_2, ..., n_L} that minimizes the sum of total point-to-point distances over L frames. To normalize over the length of the temporal window, we use the minimum average distance moved as a tracking-based metric, as shown in Eqn. (2.9). This metric (denoted by e_m) is expected to be low when the targets move slowly in the oversensing case, while it is expected to be much higher when the objects get blurred in the undersensing case.

    e_m = \frac{1}{L} \min_{n_1, \ldots, n_L} D(n_1, \ldots, n_L)    (2.9)

2.6 Feedback Mechanism for Rate Adjustment

After the error metrics have been computed, the algorithm determines if the current sensing rate needs to be changed. This information is fed back to the FC, which makes any necessary changes, thereby establishing a closed-loop control. The objective of control is to minimize the reconstruction error based metric e_r while at the same time preventing the tracking-based metric e_m from going below a certain threshold.

Figure 2.8: The (a) HF and (b) LF components of the model-based error, and the minimum average motion per frame for the (c) L = 5 and (d) L = 10 cases. The system is in open-loop control, i.e., the error metrics are not being compensated for, which can be noted from the flat T line.

We propose a dual threshold feedback scheme to keep the system in this "stable" state. A lower threshold th_m and an upper threshold th_r are defined for the metrics e_m and e_r respectively. In the control mode (e_m ≤ th_m or e_r ≥ th_r), the collection interval T is adjusted by a fixed scale (κ > 1) depending on the value of the metrics. The first condition is related to oversensing and leads to an increase in T, while the second condition indicates undersensing, which is dealt with by decreasing T, thereby increasing the sensing rate (see Table 2.1).

The mutually opposing nature of the two conditions gives rise to a "stable" or buffer region (State B in Table 2.1) for T where it is not updated. In this mode, the width of the buffer region is calibrated according to the field by adjusting the threshold values such that the current value of T ensures that the control is in the "stable" zone. Thus the aim of this adaptive control is to obtain the "tightest" stable region of control for T.

Note that it is not feasible to guess a value for th_m or th_r, making this adaptive approach necessary. In practice, the thresholds can be initialized with extreme values such that the buffer region is wide and the system starts in State B. In subsequent intervals, the thresholds are updated in the calibration mode to shrink the buffer region such that it just contains the stable operating point. There is a risk of overshrinking the buffer region, which might lead to the system appearing to be in both the undersensing and oversensing states simultaneously, as the buffer region collapses upon itself (State C). The thresholds are then adjusted by predefined increments α, β such that the control is returned to State B.

Table 2.1: Control Feedback Rules

    State   Condition                 Decision   Update
    A       e_m < th_m, e_r < th_r    T ↑        T = κT
    B       e_m > th_m, e_r < th_r    Narrow     th_r = th_r − α, th_m = th_m + β
    C       e_m < th_m, e_r > th_r    Widen      th_r = th_r + α, th_m = th_m − β
    D       e_m > th_m, e_r > th_r    T ↓        T = T/κ

2.7 Stability Analysis

To provide an intuition of how the scheme works, we present a graphical illustration of the proposed feedback algorithm in Fig. 2.9. At any time instant, the system state can be represented as a point on the e_r–e_m plane. The relative position of this point with respect to the thresholds th_r and th_m is then used to decide the next course of action. Since the error metrics e_r and e_m are correlated with each other, the operating region of the system lies on a line with a positive slope passing through the origin, as shown in Fig. 2.9, such that the system only transitions between states A and D. Once the operating region of the system shifts, the thresholds th_r and th_m are calibrated again.
In general, the algorithm tries to maintain a narrow buffer region of control by keeping the operating point (e_r, e_m) close to (th_r, th_m). This can be achieved either by updating the controlled sensing time period T (states A and D) or by calibrating the thresholds (states B and C).

By observing how the states of the algorithm transition, we present an argument to show that the proposed algorithm is Bounded-Input-Bounded-Output (BIBO) stable. This means that for a finite coherence time T_coh, the controlled variable, viz. the sensing time period T, is always bounded. This can be shown by considering each of the four states of operation A, B, C, D of the proposed algorithm above.

Figure 2.9: Four states of the control algorithm on the e_r–e_m plane, with the oversensing and undersensing regions separated by the thresholds (th_r, th_m).

Suppose we are currently in state A (oversensing) and the feedback algorithm responds by increasing T. This leads to an increase in both e_m and e_r, since fewer samples are sensed per unit time. This, in turn, prevents the system from staying in state A and is most likely to cause a transition to states B or C. Similarly, if the system is in state D (undersensing), the algorithm responds by decreasing T. This has the net effect that both e_r and e_m decrease, taking the system out of state D. When the system is in states B or C, the thresholds are accordingly varied so that the operating point is maintained close to the adapted thresholds. This ensures that the sensing time period T, and in turn the error metrics e_r and e_m, cannot grow unbounded for a given finite T_coh. As an example, note the state transitions in Fig. 2.10, which shows the internal states of the algorithm for a simulated control scenario.

Figure 2.10: Internal states of the algorithm as it makes various rate adjustment decisions. States are the same as described in Table 2.1.

2.8 Experiments

Figure 2.8 above showed the values of the metrics e_r and e_m in an open-loop operation for a particular T_coh profile. Fig. 2.8 (c) and (d) suggest that a larger L might be related to a slower system response time, as can be seen around t = 80, 100 when the field's T_coh changes. A larger L might, on the other hand, lead to a smoother control because of the larger context being used. The error metrics based on model fitting (Fig. 2.8 (a) and (b)), by contrast, are computed from the current frame and thus respond instantaneously to changes in field coherence. It is also worth noting here that the HF errors typically correspond to the reconstruction error in the undersensing case, while the LF error serves as a sanity check for detection.

Next, we present an example of closed-loop control, using the feedback mechanism (Table 2.1) discussed earlier to compensate for the error metrics, as shown in Figure 2.11. Note that the collection time interval T now attempts to trace the underlying hidden parameter T_coh as the system tries to keep both e_r and e_m within bounds. Figure 2.10 provides better insight into the system by showing the feedback algorithm's underlying internal state. Recall that states A and D correspond to oversensing and undersensing respectively, while in states B and C the thresholds are adapted to achieve a tight buffer region of stability.

Figure 2.11: Closed-loop control (panels show T against T_coh, the model error e_r, and the tracking-based metric e_m for L = 5).

2.9 Simulation Results

We evaluate the performance of the proposed feedback algorithm using two evaluation metrics. First, the mean square error (MSE) between the estimated sampling duration T (blue line) and the coherence time ground truth T_coh (red line) is computed as a measure of how closely the algorithm is able to track and adapt to changes in coherence time.
Secondly, we calculate an F-score measure for the accuracy of target detection in a particular frame. This is a reasonable metric for our task, since the end goal of such an application would be target detection. To calculate the F-score of target detection in a frame, we check how many of the localized targets correspond to actual target locations. The F-score weighs in both the precision and recall of target detection as follows:

    Precision = (# targets that were correctly localized) / (# targets that were localized)

    Recall = (# targets that were correctly localized) / (# targets originally present)

    F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (2.10)

We compare the results against a baseline method based on correlation of the acquired field G_sa with the previously acquired frame, as suggested in [36]. Control decisions are taken depending on fixed high and low thresholds on the correlation metric. The sampling duration T is changed using the scaling parameter κ as earlier. We present results for the average MSE and F-score metrics for simulations using different parameter settings for the velocity and decay p in Eqn. (2.3).

Figure 2.12: Averaged MSE and F-score evaluation metrics for (a) no feedback, (b) the correlation baseline mechanism, and (c) the proposed feedback algorithm, at different velocity and decay parameter settings.

Figure 2.12b shows the values of the evaluation metrics averaged over each run for the baseline correlation method. From the MSE and F-score plots it is evident that the algorithm's performance drops with an increase in the decay parameter p. This is expected, since targets with a lesser spread cannot affect the correlation metric significantly. This is unlike our feedback algorithm, which shows better performance as the decay parameter p increases, since targets with lesser spread are easier to localize and hence the field reconstruction is more accurate (Figure 2.12c). A similar trend can be seen in the plots of the average MSE metric, which appears to be somewhat inversely correlated with the F-score metric. An increase in velocity, on the other hand, appears to have a positive effect on both algorithms. This can again be easily understood by considering that an increase in the velocity of the targets makes it easier to discriminate between the oversensing, adequate sensing and undersensing states. This holds for both algorithms, and hence we notice a general increase in F-score and decrease in MSE with increase in velocity. Finally, F-score being a normalized metric, we also note that our proposed detection-based feedback mechanism outperforms the baseline correlation-based method.

2.10 Conclusion

In this work we present a rate adjustment scheme for Random Access Compressed Sensing (RACS) to monitor and compensate for time-varying fields. Adjustment of the sensing rate for RACS is motivated by the trade-off between energy efficiency and the accuracy of tracking targets of interest. Although estimating the coherence time might seem to be the best approach for sensing a varying field, we note that it is not directly related to our end goal of target detection in the field. We also observe that the reconstruction error for RACS has a complex relation to the current coherence time of the field and factors like the position of targets, their velocity, etc. Thus, by making the algorithm depend on the detection and localization of targets, we ensure that the rate adjustment is tied to our actual objective.

We propose two unsupervised metrics to inform the rate adjustment scheme.
A model-based field reconstruction is performed after target localization for each acquired frame, and the model fit error e_r is used as a metric to detect cases where the FC is undersensing. To account for oversensing in operation, we define a measure of the motion of detected targets over the past L frames. This motion-based metric e_m indicates any redundancy in sensing and can be used to decrease the collection time interval T if needed.

We propose a dual threshold based feedback mechanism using these error metrics for sensing rate adjustments. The technique assumes minimal prior knowledge and adapts the thresholds in an online fashion using reasonable assumptions on the field. We also show that the proposed control mechanism is BIBO stable. In addition, we compare our results against a baseline algorithm that uses time-correlation of the acquired fields, and show that the proposed rate adjustment mechanism performs better on average in terms of target detection accuracy. In the future, in addition to testing this method on other fields, we would also like to consider better stochastic decision processes and update rules.

Chapter 3

Object Classification in Sidescan Sonar Images

3.1 Problem Definition

To enable this empirically-grounded study, we rely on realistic data sets for developing and testing the proposed classification schemes. In particular, three data sets have been used in this work: the NURC¹ dataset collected by DRDC Atlantic² and NURC (NATO) [37], the NSWC³ Scrubbed image dataset, and the SSPS multiresolution sonar imagery dataset collected at NSWC, Panama City, comprising over 1000 images in all. These datasets contain sidescan sonar imagery in the form of 8-bit grayscale images, each containing one or more synthetic mine-like objects. Each object is approximately 10-20 pixels in width and can belong to multiple classes, based on its shape (Fig. 3.1).

¹ NURC - NATO Undersea Research Center
² DRDC - Defence R&D Canada
³ NSWC - Naval Surface Warfare Center
Objects in the NURC dataset can belong to any of the 7 classes described as Cone, Cylinder, Junk, Rock, Sphere, Wedding Cake or Wedge, while objects in the NSWC and SSPS datasets are labeled as belonging to classes A through D. This work deals with the problem of classifying the objects in these datasets, assuming that the object has already been localized.

Figure 3.1: The four types of object in the NSWC sidescan sonar database.

3.1.1 Dataset Characteristics

While the object might usually be clearly visible as a bright highlight because of the strong reflection of sonar waves, substantial variations in object appearance can occur as a result of different sea-bed environments around the object. In addition, different angles of viewing the object can make the task confusing even to the expert human eye. Thus, one class of popular techniques used for this problem compares the shapes of the object shadows against expected shapes generated through simulations of their 3-dimensional templates [38]. However, the datasets of interest in our work provide no additional information about the object shapes, ruling out the possibility of using such methods that rely on expert knowledge of expected object properties. No information other than object location and class is provided in these datasets, thus specifying the scope of the classification scenario of interest in this work.

In addition to variabilities within a database, each database has its own characteristics. Such differences among databases often make it difficult to generalize any particular technique/algorithm and can considerably affect experiment design. As an example, notice how the echo/highlight is a dominant feature of images from the NSWC and NURC databases (Fig. 3.2). However, in the SSPS dataset, highlights are a minor feature compared to shadows, which are more informative about the object shape. Similarly, we find that images in the NSWC or NURC databases often lack a noticeable shadow.
These observations underscore the importance of seeking an appropriate data characterization that is cognizant of the domain.

    Database   #Samples   #Classes
    NSWC       296        4
    SSPS       442        4
    NURC       1038       7

Figure 3.2: Sample objects from three different databases: SSPS (Class B), NSWC (Class 1) and NURC (Wedge). The data illustrate the differing characteristics across objects and domains.

3.2 Segmentation

Although we restrict our interest to object classification in this paper, classification in underwater sonar images is typically preceded by object detection and segmentation stages. In this work, we assume that the position of the object in the image is roughly known. Nevertheless, object segmentation is essential to locate the exact position of the object and extract information about its shape. A similar segmentation technique is also required to extract auxiliary features from shadows.

Shadow segmentation in sonar images can be challenging at times, owing to the nature of sonar imagery and other factors such as the texture of the sea floor or the beam angle, as can be seen in Fig. 3.3. These can either cause the shadow to be absent at times, or lead to incorrect segmentations resulting in unreliable shadow information in some of the images. When present, the shadows can sometimes be very discriminative, owing to their much larger size in comparison to the echo/object highlight. In other cases, a shadow can be either completely absent or lost amongst the clutter on the sea floor. This lack of reliable shadow information prevents fusion of shadows directly at the feature level and necessitates a scheme that can cope with the missing or noisy shadows for certain samples. This adaptive fusion scheme is discussed in further detail in Section 5.2.

On the NURC and SSPS databases, we employ a segmentation method based on mean-shift clustering [34] for both objects and their shadows, followed by denoising. On the NSWC database, segmentation methods are only applied for extracting shadows, since the position of objects is precisely known. We first describe the general segmentation algorithm, followed by the modifications that specialize it to either highlight or shadow segmentation.

Figure 3.3: An example from the NURC database showing the highlight (right) and shadow (left) segmented from the original image (center; object class: Wedge) using the mean shift clustering based segmentation scheme.

3.2.1 Mean-Shift Clustering based Segmentation

In this work we adapt a segmentation scheme based on Mean Shift Clustering (MSC) [34]. MSC is used to facilitate segmentation by adaptively clustering the sonar images using intensity values. MSC is a mode seeking algorithm that is popular for use in unsupervised clustering. Given initial points in a data distribution, the MSC algorithm climbs in the direction of the density gradient to the nearest local mode of the distribution. In general MSC, the density at a point can be measured using an arbitrary kernel function K(x). In our case we measure the density at a point in terms of the number of points within a circle of fixed radius R_b, referred to as the bandwidth in MSC.

The algorithm proceeds by computing the mean of all points within this specified radius (bandwidth) of the initial point v_0. The point is then moved to the mean in the next iteration, as shown in Eq. (3.1). This procedure is repeated till the mean converges to a local mode of the distribution. Convergence is achieved when the update in the mean is below a certain threshold, i.e., ||v_{n+1} − v_n|| ≤ ε. All points traversed in the process are then assigned to this mode/cluster.

    v_{n+1} = \frac{\sum_p K(p - v_n)\, p}{\sum_p K(p - v_n)}, \quad p \in dom(v)    (3.1)

    K(p - v) = 1 if ||p − v|| ≤ R_b, and 0 otherwise

This process is applied for several random initial points until all the points in the image have been assigned to some cluster.
ThebandwidthparameterR b ,providescontrolover howfine-grainedtheclusteringis. Asmallerbandwidthmightleadtoalargenumberof clusters that are similar to each other, while a larger bandwidth will merge the similar 32 clusters. Since the clustering occurs in the colorspace R b has the same units as that of thecolorintensity. 3.2.2 Choosingthesegmentationtarget: HighlightorShadows The value of R b is dependent on the object of segmentation, shadow or highlight. To segment shadows a smaller bandwidth is chosen, accompanied by the knowledge that theshadowregionshavethelowestintensityvalues. Wefindthatinourdatasetsof8bit images (0-255), a bandwidthR b = 10 manages to differentiate the shadow from back- ground clutter. Remember that a higher bandwidth ensures larger and more coherent clusters while a smaller bandwidth value yields smaller, more fragmented clusters. If segmentationfails,itcanbeverifiedbysettingathresholdratioofthenumberofpixels in the segmented shadow to that in the image. An adaptive scheme is then used which employsonalowerbandwidth(R b = 5onourdatasets)toobtainabettersegmentation. For segmenting the highlight, a larger bandwidth parameter (R b = 15) is used which fuses all dark regions with low intensity values around the bright highlight into one cluster with the lowest mean intensity, while the pixels belonging to the highlight are accumulated in all other clusters. This approach is based on the observation that the highlight in an image is typically much brighter compared to the shadow or clutter around the object. The algorithms are presented formally in Algorithm 2. C i ,S ′ ,H ′ used in the algorithm are sets of tuples of pixel indices indicating each cluster, the seg- mentedshadowandhighlightrespectively.F S ,F H arebinarymasksofthesamesizeas theoriginalimageFencodingeachsegmentation. 33 Algorithm2:ShadowandhighlightsegmentationusingMSC. 
Input: Image F containing N pixels
Output: Shadow segmentation binary mask F_S of the same size as F
    Cluster F using MSC with R_b = 10. Suppose this yields d clusters.
    Sort clusters in ascending order of mean intensity value: C_0, C_1, ..., C_{d-1}
    if |C_0| / N < 0.5 then
        Shadow S' ← C_0
    else
        Cluster the image using MSC with R_b = 5.
        Sort clusters in ascending order of their mean intensity values: C_0, C_1, ..., C_{d'-1}
        Shadow S' ← C_0
    end if
    Encode the set of points in S' as a binary mask F_S, where F_S(i,j) = 1 if (i,j) ∈ S' and 0 otherwise.

Input: Image F containing N pixels
Output: Highlight segmentation binary mask F_H of the same size as F
    Cluster image F using MSC with R_b = 15. Suppose this yields q clusters.
    Sort clusters in ascending order of their mean intensity values: C_0, C_1, ..., C_{q-1}
    Highlight H' ← C_1 ∪ C_2 ∪ ... ∪ C_{q-1}
    Encode the set of points in H' as a binary mask F_H, as shown for F_S above.

3.2.3 Denoising using iterative hierarchical clustering

Although mean shift clustering manages to isolate the shadow and highlight to a large degree, the segmentation still contains some noise (Fig. 3.4b). Noise results because clustering occurs in the color space, and a cluster is not required to form a spatially contiguous mass. Thus, the segmented shadow is often fragmented, which renders simple denoising approaches, such as selecting the largest connected component, useless (Fig. 3.4c). We perform denoising based on a hierarchical clustering method to deal with this problem. Specifically, we use an agglomerative hierarchical clustering (AHC) [12] technique that initializes by assigning one cluster to each point. At each stage, coherent clusters are fused together, creating a hierarchy that defines how each point is related to the others. The clustering is stopped when a certain cutoff criterion on the consistency [39] of clusters is met.
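Stepping back briefly, the adaptive shadow branch of Algorithm 2 above can be sketched as follows. Here `cluster_fn` is a stand-in for the MSC clusterer (any callable returning an integer label per pixel), and `darkest_cluster` is a hypothetical helper; the bandwidths (10, then 5) and the 0.5 area-ratio threshold follow the text.

```python
# Sketch of Algorithm 2's adaptive shadow-segmentation logic. `cluster_fn`
# stands in for the MSC clusterer; this is illustrative, not the author's code.
import numpy as np

def darkest_cluster(flat, labels):
    """Binary mask of the cluster with the lowest mean intensity (C_0)."""
    uniq = np.unique(labels)
    means = [flat[labels == k].mean() for k in uniq]
    return labels == uniq[int(np.argmin(means))]

def segment_shadow(image, cluster_fn, ratio_thresh=0.5):
    """Return a binary shadow mask F_S for a 2-D intensity image."""
    flat = image.ravel().astype(float)
    labels = cluster_fn(flat, R_b=10.0)
    shadow = darkest_cluster(flat, labels)
    # If the darkest cluster covers too much of the image, the segmentation
    # is deemed to have failed; retry with the lower bandwidth.
    if shadow.sum() / flat.size >= ratio_thresh:
        labels = cluster_fn(flat, R_b=5.0)
        shadow = darkest_cluster(flat, labels)
    return shadow.reshape(image.shape)
```

The highlight branch is analogous: cluster with R_b = 15 and take the union of every cluster except the darkest one.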
From one such round of clustering, we select the largest cluster and proceed to cluster it similarly. This iterative scheme is stopped when the current cluster cannot be divided into further clusters by AHC according to the consistency cutoff parameter provided. We use an empirically determined consistency cutoff value of 1.5 for our experiments. The algorithm is stated below in Algorithm 3. The C_i denote sets of tuples of pixel indices belonging to each cluster, as before.

Figure 3.4: Stages in segmentation of an image (a) from the SSPS database - (b) mean shift segmentation into shadow (brown) and highlight (white) and (d) iterative hierarchical clustering. ((c) indicates the effect of doing a connected component analysis instead.)

The segmentation obtained after the above denoising is used to extract features from the corresponding parts of the image.

3.3 Feature Extraction

The role of feature extraction in pattern recognition problems is critical. The general consensus in practice is that there is no universally best feature for a problem, given the object heterogeneity and data uncertainty common in real-world applications. The general approach hence has been to find features tuned to particular datasets, and it is no different in the domain of underwater sonar images [40].

Algorithm 3: Iterative hierarchical clustering algorithm used for denoising the segmentation
Input: MSC segmentation S'
Output: Denoised segmentation S_ahc and shadow binary mask F_S
    S_ahc ← S'
    repeat
        Cluster S_ahc using AHC. Suppose this yields q clusters.
        Sort clusters in descending order by number of points as C_0, C_1, ..., C_{q-1}
        S_ahc ← C_0
    until q = 1
    Encode the set of points in S_ahc as a binary mask F_S, where F_S(i,j) = 1 if (i,j) ∈ S_ahc and 0 otherwise.
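The iterative denoising of Algorithm 3 can be sketched with SciPy's hierarchical-clustering utilities. The inconsistency cutoff of 1.5 follows the text, but the choice of linkage method and the exact cutoff behavior here are illustrative assumptions, not the author's implementation.

```python
# Sketch of Algorithm 3: iteratively re-cluster the largest AHC cluster of
# shadow-pixel coordinates until AHC can no longer split it under the
# inconsistency cutoff. Illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def denoise_segmentation(points, cutoff=1.5):
    """points: (N, 2) array of (row, col) pixel indices of the raw mask."""
    current = np.asarray(points, dtype=float)
    while len(current) > 2:
        Z = linkage(current, method="single")
        labels = fcluster(Z, t=cutoff, criterion="inconsistent")
        if labels.max() == 1:          # AHC cannot split further: stop
            break
        # Keep the largest cluster C_0 and iterate on it.
        sizes = np.bincount(labels)
        current = current[labels == sizes.argmax()]
    return current
```

Because each split strictly shrinks the retained cluster, the loop is guaranteed to terminate; the surviving points define the denoised binary mask F_S.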
For the purposes of object classification, we desire that the algorithm be insensitive to small variations in intensity and orientation of the image. Some methods try to mitigate such variations within the classification algorithms [41, 42, 43], while others deal with them at the feature level [44, 45]. Moreover, since most classification algorithms suffer from the curse of dimensionality [46], we would also like to characterize the object as compactly as possible. Zernike moments meet the above needs and are popularly used as invariant features for object classification and shape-based content retrieval tasks [47, 48, 49].

3.3.1 Zernike Moments

Zernike moments of an image are computed via an orthogonal transform in the polar domain, where the order of the representation controls the degree of generalizability. We use magnitudes of Zernike moments, which have been shown to have rotational invariance properties for object recognition [47]. Their robustness to variabilities in underwater images has also been well established [50, 51]. To compute the Zernike moment Z_nm of order (n, m), we find the projection of the image F onto the basis function V_nm, as shown in Eq. (3.4). To simplify notation, we restrict ourselves to square images. In addition, we transform the coordinate system of the image such that the origin is at the center of mass and the entire object just fits within the unit circle a^2 + b^2 = 1. Here a, b are related to their polar counterparts ρ, θ as a = ρ cos(θ), b = ρ sin(θ), ρ ≤ 1.

    V_nm(a, b) = V_nm(ρ, θ) = w_nm(ρ) e^{-jmθ}                                              (3.2)

    w_nm(ρ) = Σ_{s=0}^{(n-|m|)/2} (-1)^s (n-s)! / [ s! ((n+|m|)/2 - s)! ((n-|m|)/2 - s)! ] ρ^{n-2s}   (3.3)

    ζ_nm(F) = (n+1)/π Σ_x Σ_y F(a, b) V*_nm(a, b),   a^2 + b^2 ≤ 1                          (3.4)

    where 0 ≤ n ≤ N; |m| ≤ n; n - |m| is even

The range of n selects the order of Zernike moments and the degree of generalizability of the description. From our previous work [28] with sidescan sonar images, we found
From our previous work [28] with sidescan sonar images, we found thatrepresentationsoftheorderN=10containsufficientinformationforobjectclassifi- cationthatalsogeneralizeswell,yielding36uniquemomentssatisfyingtheconstraints above. Note that after rotation by an angle α, only the phases of the Zernike moments (ζ α nm )dependontheobject’sorientation(Eq.(3.5))andthus,themagnitudesofZernike momentsarerotationallyinvariant, ζ α nm =ζ 0 nm e −jα (3.5) This property can also be seen in Fig.3.5 which shows the magnitude of Zernike moments corresponding to objects grouped together by their class. We extract Zernike 37 moments from the pixels corresponding to the highlight (F⊙F H ) 4 and compute their magnitude. The thick red line on the graph indicates the true class label of the object. ThesimilarityofZernikemagnitudefeatureswithinaclassisevidentinspiteofdifferent objectsbeingindifferentorientations. Toformallyverifythis,weperformedaone-sided F-testforratioofvariances,tocomparethecovarianceoffeatureswithinaclass(Σ class ) to that over all samples across all classes (Σ all ). The null hypothesis in each case was tr(Σ class ) < tr(Σ all ) and the test statistic for class k is shown in Eqn.(3.8). Rows of the matrix Q contain features ζ only for samples from the class k whereas G contains featuresfromallsamplesinthedataset. Σ k =E(Q 0 Q T 0 ) (3.6) Σ all =E(G 0 G T 0 ) (3.7) χ 2 = tr(Σ k ) tr(Σ all ) (3.8) ResultsoftheF-testindicatethatineachcasethereissufficientevidence(significantat 5% confidence level) to reject the null hypothesis that the two variances are equal. In otherwordsthevarianceoffeatureswithinaclassissignificantlysmallerthantheaver- agevarianceoverallclasses. Hencethefeatureshouldcontainsufficientdiscriminative information. 38 Objects sorted by class Magnitude of zernike moments 50 100 150 200 250 5 10 15 20 25 30 Class labels Class1 Class2 Class3 Class4 Figure 3.5: Zernike Moment magnitude features for 296 samples from the NSWC database. 
The solid red line is an indicator of the class labels, and the vertical axis corresponds to different feature dimensions. Note how features are similar within one class in spite of the objects being oriented at different angles, and vary across different classes.

Figure 3.6: Average shadows from each class in the NSWC database. (Negatives of the images are shown to highlight the shadows.)

3.4 Using Shadows For Object Classification

3.4.1 Shadow features

While computing feature representations from shadows, intensity values are neglected. Only the shape, represented by the binary mask F_S, is encoded using the previously mentioned Zernike features. One of the major challenges in using shadow and object features in the same framework is that the features extracted from shadows might often not be as reliable as those from objects. This can be either because shadow features are missing for some objects or because, if present, they do not provide sufficient discrimination for classification tasks by themselves. This can be seen in Fig. 3.6 for the average shadows extracted on the NSWC dataset for each class. Note that it is harder to spot differences between the shadows for classes 1, 3 and 4, while class 2 easily stands out and might help improve classification. This observation suggests that the shadow features are useful only in certain cases to resolve the ambiguities posed by the object shapes, hence making reliability-aware fusion of feature sets important.

Based on this observation, we propose a reliability-aware feature fusion scheme. Once the features s = ζ(F_S), h = ζ(F ⊙ F_H) have been computed for all samples, we assume that not all samples have equally reliable shadow features and express this reliability in a data-driven fashion. We test this idea by comparing it against naive fusion schemes such as feature fusion and score fusion.

^4 ⊙ indicates entrywise product.
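For concreteness, the Zernike moment computation of Eqs. (3.2)-(3.4) can be sketched as follows. This is an illustrative implementation, not the exact code used in this work: centering on the object's center of mass is omitted, and the pixel grid is simply mapped into the unit circle.

```python
# Sketch of Eqs. (3.2)-(3.4): the Zernike radial polynomial and the moment
# of a square image over the unit circle. Illustrative only.
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """w_nm(rho) of Eq. (3.3); requires n - |m| even."""
    m = abs(m)
    rho = np.asarray(rho, dtype=float)
    w = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n + m) // 2 - s)
                * factorial((n - m) // 2 - s)))
        w = w + c * rho ** (n - 2 * s)
    return w

def zernike_moment(F, n, m):
    """zeta_nm(F) of Eq. (3.4), summed over pixels inside the unit circle."""
    N = F.shape[0]
    coords = (2.0 * np.arange(N) - N + 1) / (N - 1)   # map pixel grid to [-1, 1]
    a, b = np.meshgrid(coords, coords)
    rho = np.sqrt(a ** 2 + b ** 2)
    theta = np.arctan2(b, a)
    inside = rho <= 1.0
    V = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(F[inside] * np.conj(V[inside]))
```

The rotational-invariance property of Eq. (3.5) can be checked numerically: rotating the image by 90 degrees changes only the phase of ζ_nm, leaving its magnitude unchanged.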
Chapter 4
Intelligibility Classification in Pathological Speech

Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice caused by illness or surgery leading to some physical or biological insult to the speech production system. This condition severely affects the communication capability of the subject because of degraded speech intelligibility. Hence, speech intelligibility assessment is an important part of their treatment. Assessment of speech intelligibility by experts can be expensive and time consuming, and hence automatic assessment of speech intelligibility is of considerable interest.

Lack of intelligibility in pathological speech can depend on a variety of factors, where each factor contributes differently. At the same time, the labels assigned to each sample depend on speech perception, leaving scope for subjectivity in the annotations. This feature and label variability makes this domain particularly appealing for reliability-aware modeling.

4.1 Dataset

In this work we use the NKI CCRT Speech Corpus [52], which contains sentence-level speech audio spoken in Dutch, with intelligibility evaluated by professional listeners. The speech audio was collected at three stages based on the Chemo-Radiation Treatment (CRT) of the patients: before CRT, 10 weeks after CRT and 12 months after CRT. The speech audio data consist of read speech waveforms of 17 Dutch sentences spoken by 55 head and neck cancer patients. This corpus provides an evaluator weighted estimator (EWE) for each utterance as a perceptual intelligibility score. The EWE is the weighted mean of evaluation scores from multiple evaluators, where the weight is the correlation coefficient of a single evaluator's scores with the unweighted mean of evaluations from all evaluators [53].

4.2 Preprocessing and Features

We use the three sets of features used for automatic intelligibility classification in [54].
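As a minimal sketch, the EWE described in Section 4.1 can be computed as follows; `ewe` is a hypothetical helper name, not the corpus's official scoring script, and edge cases such as negatively correlated raters are not handled specially here.

```python
# Sketch of the evaluator weighted estimator (EWE): each rater's scores are
# weighted by their correlation with the unweighted mean over all raters.
# Illustrative only.
import numpy as np

def ewe(scores):
    """scores: (n_raters, n_utterances) array -> (n_utterances,) EWE."""
    scores = np.asarray(scores, dtype=float)
    mean_score = scores.mean(axis=0)
    # Weight for each rater: Pearson correlation with the unweighted mean.
    weights = np.array([np.corrcoef(s, mean_score)[0, 1] for s in scores])
    return weights @ scores / weights.sum()
```

When all raters agree, every weight is 1 and the EWE reduces to the plain mean of the evaluations.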
These feature sets - prosody, pronunciation and voice quality - derive information from different aspects of speech pathology and are described briefly below.

Prosodic features

We observed that NI speakers often have difficulty in pronouncing a few specific speech sounds, resulting in atypical prosodic and intonational shape. Additionally, the pitch trajectory of NI speakers' data was often not smooth. Motivated by these observations, the following phone- and utterance-level features were derived from the pitch contours of each utterance.

• Utterance-level features included the [0.1 0.25 0.5 0.75 0.9] quantiles, interquartile range of pitch and its delta, normalized L0-norm, normalized utterance duration, the sum of the normalized L0-norm ratio and the normalized utterance duration, the z-score of each phone duration, and the variance of pitch.

• Phone-level features included the variance and pitch stylization parameters obtained by fitting quadratic polynomials for each phone.

These features were designed in a sentence-independent fashion in order to obtain a sufficient number of samples for classifier training.

Voice quality features

We used three types of voice quality features, viz. harmonics-to-noise ratio (HNR), jitter and shimmer, for intelligibility classification. They have been popularly used for vocal disorder assessment of sustained vowel sounds, e.g., /AA/. Since the databases that this study uses consist of running read speech, we concatenated the vowel segments of each utterance instead. Then, we estimated statistics, such as the [.05 .1 .25 .5 .75 .9 .95] quantiles, mean, maximum and minimum, over the segments for each utterance.

Pronunciation features

Under the hypothesis that vocal organ malfunction may cause pronunciation variation, contributing to intelligibility loss, we also tested pronunciation features for intelligibility classification.
The statistics of spectral features include the [.05 .1 .25 .5 .75 .9 .95] quantiles, interquartile range, and third-order polynomial coefficients (except the residual term) of the first, second and third formants and their bandwidths, and their derivatives, for each vowel segment in each utterance. We also estimated the maximum and standard deviation of cepstral-mean-normalized 39-dimensional MFCCs extracted from utterance-level speech samples excluding silence regions. The temporal features included average syllable duration, the ratio of pause duration (without silence before and after the speech audio) to the number of syllables, and average vowel duration.

Forward feature selection was then used on each feature set to further reduce the number of features to 6 (prosody), 2 (pronunciation) and 5 (voice quality).

Chapter 5
Reliability Aware Classification

5.1 Reliability as a latent factor

We proposed a Bayesian model in [1] for taking into account the latent reliability of each feature modality. The central assumption in this model was to introduce feature reliability as a latent random variable R that controls the dependence between features X and the class labels Y. When a feature is unreliable (R = 0), it is assumed to be generated at random from a garbage model, irrespective of the class label. This assumption is formalized in Eq. (5.1), where Θ represents the reliable model and Φ represents the parameters corresponding to the unreliable model. We shall carry this notation forward for models throughout the rest of the paper.

Figure 5.1: Bayesian model proposed in [1] for reliability-aware fusion of feature sets for object classification in underwater sonar images. The dashed line between features X and Y indicates that the dependence between them is contingent on the reliability R. Latent variables are denoted by unshaded nodes.

    Pr(X, Y | R) = Pr(X, Y; Θ)^R ( Pr(Y; Φ) Pr(X; Φ) )^{1-R}                                (5.1)

If we assume that the marginal distribution of the class label Y does not depend on the reliability R, i.e. P(Y; Φ) = P(Y; Θ), we get Eq. (5.2).
    Pr(X | Y, R) = Pr(X | Y; Θ)^R Pr(X; Φ)^{1-R}                                            (5.2)

This reliability model was used in [1] to distinguish between unreliable and reliable features for an object classification task in underwater sonar images. The proposed Bayesian model (Fig. 5.1) was used to compare average reliabilities of features on different datasets [1].

While this model provides good insight into the reliability of different features, it suffers from the data-sparsity issue that is common to most generative models. A generative model must parametrically describe the feature distributions Pr(X|Y; Θ) and Pr(X; Φ) in order to generate features. As a result, parameter estimation quickly becomes infeasible as the number of dimensions increases, because of the lack of data samples. In comparison, discriminative models only learn parameters for the conditional distributions of labels given features, and hence use the data more efficiently during training.

5.2 Generative Model for reliability

To formulate the reliability-aware fusion problem as a Bayesian network, we assume that there exists a latent binary random variable r that lets us choose how the partially reliable feature is generated. We assume the following structure for the graphical model, resulting from a naive Bayes assumption on the shadow (s) and highlight (h) features. In addition, we assume that the class-conditional probability for shadow features is a mixture model of reliable and unreliable features, as shown in Eq. (5.4).

Figure 5.2: The reliability node r controls the link between the class label and the shadow features. The shaded nodes represent variables that can be directly observed, whereas the unshaded node corresponding to r is assumed to be hidden. The arrows between pairs of nodes indicate the only direct conditional dependencies.

Using local Markov properties, the joint distribution can be factorized as follows:

    P(r, s, h, y) = P(y) P(h | y) P(s | r, y) P(r)                                          (5.3)

where s, h are the shadow and highlight features, with dimensions d_s and d_h, respectively:

    s ∈ R^{d_s};  h ∈ R^{d_h};  y ∈ {1, ..., k};  r ∈ {0, 1}

This model will be referred to as the RSHY model.
Thus we use the reliability of features to refer to their dependence with respect to the class labels, i.e., P(s | y) = P(s) when unreliable. For r = 0, i.e. when the shadow feature is completely unreliable, the model says that the shadow feature was generated independently of the class label, while r = 1 indicates that the shadow feature is generated from a class-dependent mixture distribution. Another way to think about this model is that it smoothly interpolates between the conditionally dependent and independent models for the shadow features.

    P(s | r, y) = P(s | y)^r P(s)^{1-r}                                                     (5.4)

For simplicity we assume all marginal and conditional distributions in this model to be Gaussian, which gives rise to three sets of Gaussian parameters. The generative process for each object can be described as follows:

Algorithm 4: Generative process for the RSHY model
    Choose class label y ∼ Multinomial(η_1, η_2, ..., η_K)
    If y = k, choose highlight features h ∼ Gaussian(μ_Θk, Σ_Θk)
    Choose r ∼ Binomial(ρ)
    if r = 0 then
        Choose shadow features s ∼ Gaussian(μ_Φ, Σ_Φ)
    else
        Choose shadow features s ∼ Gaussian(μ_Ωk, Σ_Ωk) if y = k
    end if

where the model parameters are defined as follows:
• η_k: prior probability for each class
• ρ: probability of the shadow features of a sample being reliable on average
• Θ_k: Gaussian model for the highlight feature class-conditional probability
• Ω_k: Gaussian model for the shadow feature class-conditional probability
• Φ: Gaussian model for the shadow feature marginal probability

5.2.1 Illustration

Before diving into further details, it might be useful to demonstrate via an example the data-characteristic assumption that the model is based upon. This is particularly simple in our case since our model, being a generative model, can easily be sampled from, for instance using the random parameter settings adopted here. Fig. 5.3 shows a distribution of N = 1000 feature points generated using the model. We restrict the dimensionality to 2 for ease of interpretation.
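Sampling from the RSHY model, as described by Algorithm 4 above, can be sketched as follows. The parameter containers (`theta`, `omega` as per-class (mean, covariance) pairs, `phi` for the garbage model) are illustrative assumptions about how the fitted parameters are stored.

```python
# Sketch of Algorithm 4: sampling one (y, r, h, s) tuple from the RSHY model.
# Parameter names mirror the text; the containers are illustrative.
import numpy as np

def sample_rshy(eta, rho, theta, omega, phi, rng):
    """theta[k], omega[k]: (mean, cov) per class; phi: garbage (mean, cov)."""
    y = rng.choice(len(eta), p=eta)              # y ~ Multinomial(eta)
    h = rng.multivariate_normal(*theta[y])       # h ~ N(mu_Theta_k, Sigma_Theta_k)
    r = rng.binomial(1, rho)                     # r ~ Binomial(rho)
    if r == 1:
        s = rng.multivariate_normal(*omega[y])   # reliable: class-dependent model
    else:
        s = rng.multivariate_normal(*phi)        # unreliable: garbage model
    return y, r, h, s
```

Drawing many such tuples with a moderate ρ reproduces the qualitative picture of Fig. 5.3: reliable shadow samples cluster by class, while unreliable ones scatter according to the class-independent garbage model.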
Note that the set of unreliable samples for shadow features is expected to have a larger scatter in the feature space, and their distribution does not depend on which class they belong to. This figure also illustrates the possible issues in parameter estimation in case the patterns from two classes, or the "garbage model", have a significant overlap.

Figure 5.3: Data generated using the model in Fig. 5.2 above. Red crosses indicate the centers of each of the class clusters. For generating the shadow features, P(r) = 0.5 was used, i.e. 50% of the samples were randomly selected to have unreliable shadow features.

5.2.2 ML Parameter Estimation

We estimate the parameters of the proposed Bayesian network model using the Expectation-Maximization (EM) algorithm [55]. The variable r is a hidden variable in this model. For ease of analysis we shall henceforth represent the class labels y using the 1-of-K encoding, i.e. if y_i = k originally, we use the notation Σ_k y_ik = 1, y_ik = 1 to indicate that the i-th sample belongs to the k-th class. Ideally, we would maximize the following log-likelihood:

    L = log( Π_{i=1}^N P(r_i, s_i, h_i, y_i) )
      = Σ_{i=1}^N [ Σ_{k=1}^K y_ik ( log η_k + r_i log N(s_i; Ω_k) + log N(h_i; Θ_k) )
          + r_i log ρ + (1 - r_i)( log N(s_i; Φ) + log(1 - ρ) ) ]                           (5.5)

where N(x; Ψ) refers to the probability of x under the multivariate normal distribution defined by the model Ψ.

E-Step

Since r_i for each sample is a hidden random variable, we compute the following posterior distribution f:

    P(r_i | s_i, h_i, y_i) = P(r_i, s_i, h_i, y_i) / Σ_{r_i = 0,1} P(r_i, s_i, h_i, y_i)     (5.6)

This is used to compute the expected value of the log-likelihood function. We first compute a soft value for each r_i, which represents the uncertainty in our knowledge of the hidden variable.
    γ_i = E_f(r_i) = P(r_i = 1 | s_i, h_i, y_i)
        = ρ Π_{k=1}^K N(s_i; Ω_k)^{y_ik} / [ ρ Π_{k=1}^K N(s_i; Ω_k)^{y_ik} + (1 - ρ) N(s_i; Φ) ]   (5.7)

Then the expected log-likelihood under f is given as:

    L' = Σ_{i=1}^N [ Σ_{k=1}^K y_ik ( log η_k + γ_i log N(s_i; Ω_k) + log N(h_i; Θ_k) )
        + γ_i log ρ + (1 - γ_i)( log N(s_i; Φ) + log(1 - ρ) ) ]                             (5.8)

M-Step

Now, since the hidden variable r_i has been replaced by its expected value γ_i, we can directly calculate the maximum likelihood estimates of the parameters. The constrained optimization is changed into an unconstrained optimization problem using the method of Lagrange multipliers, and the following intuitive update equations are obtained for each of the parameters:

    η_k = Σ_{i=1}^N y_ik / N                                                                (5.9)

    ρ = Σ_{i=1}^N γ_i / N                                                                   (5.10)

    μ_Θk = Σ_i y_ik h_i / Σ_i y_ik;       Σ_Θk = Σ_i y_ik (h_i - μ_Θk)(h_i - μ_Θk)^T / Σ_i y_ik              (5.11)

    μ_Ωk = Σ_i γ_i y_ik s_i / Σ_i γ_i y_ik;   Σ_Ωk = Σ_i γ_i y_ik (s_i - μ_Ωk)(s_i - μ_Ωk)^T / Σ_i γ_i y_ik   (5.12)

    μ_Φ = Σ_i (1 - γ_i) s_i / Σ_i (1 - γ_i);  Σ_Φ = Σ_i (1 - γ_i)(s_i - μ_Φ)(s_i - μ_Φ)^T / Σ_i (1 - γ_i)     (5.13)

Note that the weighting terms in the update equations for the models Θ, Ω, Φ provide some intuition as to what they represent. The Θ parameters are learned on highlight features per class, while Ω and Φ are trained selectively on shadow features from the reliable or unreliable samples, respectively. We iterate the E- and M-steps back and forth until the data log-likelihood converges. The EM algorithm ensures that the data log-likelihood is non-decreasing after each iteration.

Note that only the parameters ρ, μ_Ωk, Σ_Ωk, μ_Φ, Σ_Φ need to be updated at each iteration, while the parameters η_k, μ_Θk, Σ_Θk are estimated only once.

5.2.3 Inference of posterior distribution of unknown variables

Once the parameters are estimated, we can infer the posterior distribution over class labels or reliability for the test samples.

Inferring class labels

For classification, we can now directly estimate the posteriors over class labels given the shadow and highlight features.
    P(y_ik = 1 | s_i, h_i) = Σ_{r=0,1} P(r, s_i, h_i, y_ik = 1) / Σ_{j=1}^K Σ_{r=0,1} P(r, s_i, h_i, y_ij = 1)    (5.14)

        = η_k N(h_i; Θ_k) ( ρ N(s_i; Ω_k) + (1 - ρ) N(s_i; Φ) )
          / Σ_{j=1}^K η_j N(h_i; Θ_j) ( ρ N(s_i; Ω_j) + (1 - ρ) N(s_i; Φ) )                 (5.15)

Estimating the reliability of a sample

Alternatively, we might be interested in only estimating the reliability of a sample given its features. This inference can also easily be done as follows:

    P(r_i | s_i, h_i) = Σ_{k=1}^K P(r_i, s_i, h_i, y_ik = 1) / Σ_{r_i=0,1} Σ_{k=1}^K P(r_i, s_i, h_i, y_ik = 1)   (5.16)

    P(r_i = 1 | s_i, h_i) = Σ_{k=1}^K η_k N(h_i; Θ_k) ρ N(s_i; Ω_k)
          / Σ_{k=1}^K η_k N(h_i; Θ_k) ( ρ N(s_i; Ω_k) + (1 - ρ) N(s_i; Φ) )                 (5.17)

Note that although the reliability variable r was only tied to the shadow features, inference involves both feature sets. This is due to the "upstream" nature of inference in the network. The estimated reliabilities for samples in the test set can now be used in another classifier. The reliabilities for samples in the training set can simply be obtained from the estimated γ_i values at the end of all iterations.

5.3 Discriminative Model for reliability

We propose to extend the reliability assumption to a discriminative framework to provide robustness to data-sparsity issues. In a discriminative model, only the parameters that correspond to the conditional distribution of class labels Y given features X are learned. This is unlike a generative model, such as the Bayesian network shown in Fig. 5.1 above, that additionally tries to describe a "generative story" for the features. We modify the reliability assumption in the context of discriminative models as follows, where the first factor is the reliable model and the second the unreliable one:

    Pr(Y, R | X) = Pr(Y | X; Θ)^R Pr(Y; Φ)^{1-R} Pr(R | X)                                  (5.18)

This allows us to write the complete-data conditional likelihood in terms of the parameters of the models Θ and Φ, which can be estimated by maximizing the conditional data log-likelihood. The sample reliability R is a latent variable in this model, and hence we use Expectation-Maximization (EM) to optimize the data log-likelihood. This is described in detail next.
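Before turning to the discriminative formulation, note that inference in the generative RSHY model (Eqs. (5.15) and (5.17)) reduces to a few lines once the parameters are fitted. The sketch below assumes the parameters are stored as per-class (mean, covariance) pairs; the helper name `infer_rshy` is illustrative.

```python
# Sketch of the class-posterior (Eq. (5.15)) and reliability-posterior
# (Eq. (5.17)) inference for a single test sample. Illustrative only.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def infer_rshy(s, h, eta, rho, theta, omega, phi):
    K = len(eta)
    ph = np.array([mvn.pdf(h, *theta[k]) for k in range(K)])       # N(h; Theta_k)
    ps_rel = np.array([mvn.pdf(s, *omega[k]) for k in range(K)])   # N(s; Omega_k)
    ps_unrel = mvn.pdf(s, *phi)                                    # N(s; Phi)
    joint = eta * ph * (rho * ps_rel + (1 - rho) * ps_unrel)
    post_y = joint / joint.sum()                                   # Eq. (5.15)
    post_r = (eta * ph * rho * ps_rel).sum() / joint.sum()         # Eq. (5.17)
    return post_y, post_r
```

Both posteriors share the same normalizer, so in practice they are computed together from one pass over the class-conditional densities.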
5.3.1 Formulation and Notation

We use the 1-of-K encoding to denote class labels in this paper. The class label Y_i for the i-th sample is represented using K binary labels for K classes, where Σ_{k=1}^K Y_ik = 1 and Y_ik = 1 indicates that the sample belongs to the k-th class. The corresponding D-dimensional feature vectors for each of the N samples are denoted by X_1, X_2, ..., X_N.

We assume that if a sample is reliable (R_i = 1) for the classification task, the observed class label Y_i for the sample depends on its features X_i according to the reliable model Θ. In this paper, we assume Θ to be a MaxEnt model parametrized by W ∈ R^{D×K}, where {W_1, ..., W_K} are the weight vectors for each class. According to the MaxEnt model, the probability of the i-th sample belonging to class k given features X_i is given by the normalized exponential or softmax function, as shown in Eq. (5.19), which we denote by Ψ_ik. In practice, any model that can be trained using soft labels can be used instead of the MaxEnt model.

When a sample is unreliable (R_i = 0) for the classification task, we assume its class label Y_i to have been generated by rolling a K-faced die with probability η_k, k = {1, ..., K}, for each face. This marginal distribution for Y acts as the garbage model Φ, as shown in Eq. (5.20). In addition, it is reasonable to assume that the reliability of a sample depends on where it lies in the feature space. We assume that the reliability R_i ∈ {0, 1} can be modeled by a logistic regression on the features X_i parametrized by r ∈ R^D, as shown in Eq. (5.21), where σ(.) denotes the sigmoid function.

    Pr(Y_ik = 1 | X_i; Θ) = e^{W_k^T X_i} / Σ_{j=1}^K e^{W_j^T X_i} = Ψ_ik(W)               (5.19)

    Pr(Y_ik; Φ) = η_k                                                                       (5.20)

    Pr(R_i = 1 | X_i) = σ(r^T X_i) = ρ_i(r)                                                 (5.21)

5.4 Training the reliability aware model

It can be shown that the proposed model is a mixture-of-experts model, and hence parameter learning can easily be done using the EM algorithm. Since the proposed model is discriminative, we try to maximize the conditional total data log-likelihood.
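The two learned components of Eqs. (5.19) and (5.21) can be sketched as follows; `psi` and `rho` are illustrative helper names (the garbage model of Eq. (5.20) is just the vector η and needs no code).

```python
# Sketch of Eq. (5.19), the softmax reliable model Psi, and Eq. (5.21),
# the logistic reliability gate rho. Illustrative only.
import numpy as np

def psi(W, X):
    """Psi_ik = softmax over classes of W_k^T x_i; W: (D, K), X: (N, D)."""
    z = X @ W
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rho(r, X):
    """rho_i = sigma(r^T x_i); r: (D,), X: (N, D)."""
    return 1.0 / (1.0 + np.exp(-(X @ r)))
```

Each row of `psi(W, X)` is a probability distribution over the K classes, and `rho` maps each sample to its probability of being reliable.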
Assuming that each sample is drawn independently from the distribution, we can factorize the total conditional likelihood of the data D = {X, Y, R} given the parameters (W, η, r), as shown in Eq. (5.22):

    Pr[D | W, η, r] = Π_{i=1}^N Pr(Y_i, R_i | X_i; W, η, r)                                 (5.22)

Substituting from Eq. (5.18) leads to the following log-likelihood function, using the notation introduced earlier:

    L = ln Pr[D | W, η, r]
      = Σ_{i=1}^N [ R_i ln ρ_i + (1 - R_i) ln(1 - ρ_i) ]
        + Σ_{i=1}^N Σ_{k=1}^K Y_ik [ R_i ln Ψ_ik + (1 - R_i) ln η_k ]                       (5.23)

We would like to find the maximum-likelihood parameter estimates (W*, η*, r*) that maximize L in Eq. (5.23).

5.4.1 The EM algorithm

This optimization problem is complicated by the fact that R is a latent variable and should not occur in the parameter estimation equations. Hence, we employ the EM algorithm [55], which instead maximizes a lower-bound function of L. We represent this function as L'; it can be obtained by computing the expected value of L with respect to the posterior distribution of the latent variable R. This is shown in Eq. (5.24), where E_f[.] denotes the expectation operator with respect to the distribution f:

    L' = E_{f: Pr(R_i | X_i, Y_i)}[L]                                                       (5.24)

The EM algorithm then makes use of an iterative procedure in which we alternate between computing L' (E-step) and finding ML estimates of the parameters by maximizing L' (M-step). Note that L' is no longer a function of the latent variable R, thereby simplifying the optimization.

Expectation-step

Given the observations D and parameters W, η, r, we first compute γ_i = E_f(R_i) with respect to the posterior distribution of R_i as follows:

    γ_i = Pr(R_i = 1 | X_i, Y_i; W, η, r)
        ∝ Pr(Y_i | R_i = 1, X_i) P(R_i = 1 | X_i)
        = ρ_i Π_{k=1}^K (Ψ_ik)^{Y_ik} / [ ρ_i Π_{k=1}^K (Ψ_ik)^{Y_ik} + (1 - ρ_i) Π_{k=1}^K η_k^{Y_ik} ]   (5.25)

Substituting γ_i for R_i in Eq. (5.23), we obtain L':

    L' = Σ_{i=1}^N [ γ_i ln ρ_i + (1 - γ_i) ln(1 - ρ_i) ]                                   (5.26)
       + Σ_{i=1}^N Σ_{k=1}^K Y_ik [ γ_i ln Ψ_ik + (1 - γ_i) ln η_k ]                        (5.27)

Maximization-step

Once the expected log-likelihood function L' has been computed using the current value of γ_i, we estimate optimal values of the parameters W, r, η.
Estimating η: The method of Lagrange multipliers, with the additional constraint Σ_{k=1}^K η_k = 1, can be used to directly obtain the following update equation for the unreliable model parameter η_k for the k-th class (the normalizer Σ_i (1 - γ_i) ensures that the η_k sum to one):

    η_k = Σ_{i=1}^N Y_ik (1 - γ_i) / Σ_{i=1}^N (1 - γ_i)                                    (5.28)

Estimating W and r: There are no closed-form estimates for the parameters W and r, and they must be optimized using gradient ascent on L'. It is easy to note that the objective function for the parameter W corresponds to the multinomial logit cost function in Eq. (5.27), where each sample is weighted by γ_i, while for the parameter r the objective function corresponds to ordinary logistic regression using the γ_i as soft labels instead (Eq. (5.26)). Thus, both optimization problems are convex with unique optima. We optimize these functions using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization algorithm [56, 57] implemented in the SciPy toolkit [58]. The L-BFGS algorithm is convenient because it does not require direct computation of the Hessian matrix, which can often turn out to be singular.

At each iteration of the E- and M-steps we compute the expected log-likelihood function L', until the value of the update falls below a given threshold ε. Typically, the number of iterations required for convergence depends not only on the number of samples N but also on the number of classes K, since K affects the number of parameters. Once the optimal parameter values W*, η*, r* have been learned, we can use them to classify a new test sample Z by computing the class posterior for the k-th class as shown below:

    Pr(Y_ik = 1 | Z) = Ψ_k(W*, Z) σ(r*^T Z) + η*_k ( 1 - σ(r*^T Z) )                        (5.29)

Chapter 6
Experiments and Results

Experimental results of reliability-aware classification are presented in this chapter for the classification problems discussed in Chapters 3 and 4. In each case, the baseline algorithm is the corresponding reliability-blind model, which is equivalent to always setting R = 1 in the reliability-aware model.
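One EM round of the discriminative training procedure of Section 5.4 (the E-step of Eq. (5.25), the closed-form η update of Eq. (5.28), and L-BFGS updates for W and r via SciPy) can be sketched as follows. The helper name `em_round` is hypothetical, and a full trainer would iterate it until the change in L' falls below ε.

```python
# Sketch of one EM round for the discriminative reliability model.
# Illustrative only; regularization and convergence checks are omitted.
import numpy as np
from scipy.optimize import minimize

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def em_round(X, Y, W, r, eta):
    """X: (N, D); Y: (N, K) one-hot; W: (D, K); r: (D,); eta: (K,)."""
    N, D = X.shape
    K = Y.shape[1]
    # E-step, Eq. (5.25): posterior reliability gamma_i per sample.
    psi_v = softmax(X @ W)
    rho_v = 1.0 / (1.0 + np.exp(-(X @ r)))
    p_rel = rho_v * (psi_v ** Y).prod(axis=1)
    p_unrel = (1 - rho_v) * (eta ** Y).prod(axis=1)
    gamma = p_rel / (p_rel + p_unrel)
    # M-step: closed form for eta, Eq. (5.28).
    eta_new = ((1 - gamma)[:, None] * Y).sum(axis=0) / (1 - gamma).sum()
    # M-step: gamma-weighted multinomial logit for W (the Eq. (5.27) term).
    def neg_ll_W(w_flat):
        P = softmax(X @ w_flat.reshape(D, K))
        return -np.sum(gamma[:, None] * Y * np.log(P + 1e-12))
    W_new = minimize(neg_ll_W, W.ravel(), method="L-BFGS-B").x.reshape(D, K)
    # M-step: logistic regression on soft labels gamma for r (Eq. (5.26) term).
    def neg_ll_r(rv):
        q = 1.0 / (1.0 + np.exp(-(X @ rv)))
        return -np.sum(gamma * np.log(q + 1e-12)
                       + (1 - gamma) * np.log(1 - q + 1e-12))
    r_new = minimize(neg_ll_r, r, method="L-BFGS-B").x
    return W_new, r_new, eta_new, gamma
```

Both inner optimizations are the convex problems noted in the text, so L-BFGS converges to their unique optima without needing an explicit Hessian.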
6.1 Generative: Reliability aware Bayesian Model

6.1.1 Synthetic Experiments: accuracy vs. reliability ρ

To understand the role of feature-set reliability in fusion performance, we perform a simulation experiment by varying the degree of reliability ρ of the shadow features. We synthetically generated 2-D highlight and shadow features for N = 5000 samples as described earlier, using random parameter values in the graphical model. Classification experiments were performed using a 60-40 split of the synthetic data with both the proposed and traditional schemes. By synthetically generating these features, we are able to control the average reliability ρ in order to study the robustness of classification to changes in data reliability. Compared to the example in Fig. 5.3 used for illustration, the dataset generated for this experiment was made more realistic by adding significant overlap between the clusters from each class (Fig. 6.1). The separability can be adjusted during synthesis by controlling the ratio of intra-class to inter-class variance (i.e. Fisher's criterion), and this causes the accuracy to taper off at lower ρ when the dataset is not easily separable. It is worth mentioning here that, since the absolute value of accuracy is also contingent on the inherent separability of the classes (over which we often have little control in a real dataset), it is more useful to study the relative trends in the accuracy curve as ρ changes rather than its absolute value.

Figure 6.1: Performance degradation with change in reliability ρ. Accuracies indicated are on the 5000-sample synthetically generated dataset. The features presented are for ρ = 0.8.
GMM indicates training a Gaussian model per class and classifying with a MAP rule.

The decrease in accuracy in this case (Fig. 6.1) results from poor estimation of the garbage model. If the majority of the samples are unreliable, then even a slight overlap of the reliable clusters in the data can make the data samples appear noise-like, causing the algorithm to incorrectly assign them low weight during training. For the synthetic data experiment, we observe that the proposed method performs better than the reliability-blind case (GMM).

6.1.2 Object classification experiment

The proposed RSHY model presented in Ch. 3 was tested on an object classification task on each of the three sonar image databases: NURC, SPSS and NSWC. Shadow features were modeled as unreliable, and a latent reliability variable was assigned to the shadow features. The proposed Bayesian network model and the baseline Gaussian models are trained on the Zernike moment magnitude features described earlier. For a fair comparison, we chose identical baseline models, i.e., class-conditional Gaussian models (per-class feature models). We train separate Gaussian models on the shadow (Λ_s) and highlight (Λ_h) features and a joint model (Λ_sh) on both. Finally, we also compare against a uniformly weighted score-level fusion, which averages the posteriors from the highlight and shadow models as shown in Table 6.1. Feature-level and score-level fusion schemes, being among the most commonly used blind fusion strategies, are adopted as our baselines. Each Gaussian model is learned in a naive Bayes fashion, that is, the covariance matrices are assumed to be diagonal. This alleviates the data-sparsity problem to some extent. In addition, we perform dimension reduction using Principal Component Analysis (PCA). This reduces the number of parameters to be trained, improving the accuracy of the model. Results are presented for different numbers of feature dimensions in Fig. 6.2.
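The class-conditional Gaussian baseline with diagonal covariances (the naive Bayes assumption above) and a MAP decision rule can be sketched as follows; the toy data and class structure are assumptions, not the thesis experiments.

```python
import numpy as np

class DiagGaussianNB:
    """Per-class Gaussian feature models with diagonal covariance,
    classified with a MAP rule (uniform class priors assumed here)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        return self

    def predict(self, X):
        # per-class diagonal-Gaussian log-likelihood for each sample
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[ll.argmax(axis=1)]

# two well-separated toy classes in 3 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
model = DiagGaussianNB().fit(X, y)
acc = (model.predict(X) == y).mean()
```

Restricting the covariances to be diagonal keeps the parameter count linear in the feature dimension, which is the data-sparsity motivation given above.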
All experiments are performed using 5-fold cross-validation, in which the dataset is divided into 5 parts/folds, and the average classification accuracy is reported at the end. For the proposed Bayesian network model, once the graphical model (Λ_rsh) is trained, we infer the posteriors for the class labels as described earlier. Additionally, all features are z-score normalized so that each feature dimension is zero-mean and unit-variance. This ensures that any mismatch in magnitude between the training and test sets is minimized.

Scheme             Predicted class posteriors
Shadows            P(y_ik = 1 | s_i, Λ_s)
Highlight          P(y_ik = 1 | h_i, Λ_h)
Feature Fusion     P(y_ik = 1 | [s_i h_i], Λ_sh)
Score Fusion       (P(y_ik = 1 | s_i, Λ_s) + P(y_ik = 1 | h_i, Λ_h)) / 2
Rel. Aware Fusion  P(y_ik = 1 | s_i, h_i, Λ_rsh)

Table 6.1: Decision rules for the different classification schemes whose results are presented later in Fig. 6.2.

We compare the performance of the proposed algorithm on the NURC, SPSS and NSWC datasets described earlier. From the results, we note that the proposed algorithm is useful to different degrees on different datasets (Fig. 6.2), due to the variable reliability of the shadow/object features in each. On the NURC dataset the algorithm performs better than traditional fusion schemes, owing to the inherent lack of reliability in the shadow features, reflected in a ρ value of approximately 0.3 (Table 6.2). In the SPSS dataset the shadow features are more reliable on average (ρ ≈ 0.7), leading to performance on par with the feature-fusion results. However, in the NSWC dataset the highlight features are highly discriminative, owing to the precisely known object locations; thus all fusion approaches perform worse on NSWC than using the highlight features alone. In general, the classification performance on a dataset appears to be correlated with the reliability of the features on that dataset. When the shadow features of a dataset are unreliable, the algorithm manages to improve over the baseline method of blind fusion.
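As a minimal illustration of the score-fusion rule from Table 6.1 (averaging the class posteriors of the two unimodal models and taking the MAP class), with assumed toy posteriors:

```python
import numpy as np

def score_fusion(post_shadow, post_highlight):
    """Uniformly weighted score-level fusion: average the class
    posteriors from the shadow and highlight models, then pick the
    MAP class for each sample."""
    fused = 0.5 * (post_shadow + post_highlight)
    return fused.argmax(axis=1)

# toy posteriors for 2 samples over 3 classes (illustrative values)
ps = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
ph = np.array([[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
print(score_fusion(ps, ph))  # [1 1]
```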
While these results provide some intuition as to when reliability-aware classification might be useful, we further explore the dependence of the model on this notion of feature reliability. To understand this in more detail, we associate a latent reliability variable Q with the highlight features as well (Fig. 6.3). This lets us compare the reliability of both feature sets simultaneously for a particular dataset. We train this doubly unreliable model on the three datasets and present the values of the reliability parameter ρ in Table 6.2. The ρ values match our intuition of feature reliability on the three datasets and also explain why modeling shadow reliability only helps on the NURC dataset.

Dataset  shadow  highlight
NURC     0.39    0.70
SPSS     0.68    0.55
NSWC     0.42    0.95

Table 6.2: Average reliability parameters ρ for the shadow and highlight features on the different datasets. The values are obtained after training the doubly unreliable model shown in Fig. 6.3.

6.1.3 Discussion

The varied reliability of the shadow features in these sonar images makes their direct fusion with the highlight features, as is typically done in pattern recognition, a non-robust approach. Instead, each sample of the weak feature set is weighted according to its worth for the task at hand. We proposed a reliability-aware Bayesian network graphical model to exploit this notion of sample reliability in terms of the dependence in distributions. This allows us to combine the feature sets efficiently and extract additional classification performance from the otherwise unreliable shadow features. The proposed framework clearly establishes the case for reliability-aware classification of noisy feature modalities. While the method is promising, it suffers from data-sparsity issues resulting from the large number of model parameters. This curse of dimensionality is a well-known problem in generative learning because of poor parameter estimation.
Moreover, reliability weighting tries to further selectively use the data at hand for estimating "reliable" models. Apart from providing a normalized quantitative measure for comparing the reliability of two feature sets (QRSHY: Fig. 6.3), the single simple assumption in this framework (Eq. (5.4)) can be easily extended to other models.

6.2 Discriminative: Reliability-aware Max-Ent Model

We demonstrate the proposed algorithm on a speech-intelligibility classification experiment for pathological speech [59]. The term pathological speech here refers to atypicalities in voice resulting from disease or surgery of the vocal tract. These often affect speech intelligibility, and automatic assessment is therefore of considerable interest. Moreover, since manual assessment of intelligibility is a subjective task, it can result in biased labels, making the estimation of sample and label reliability an interesting yet challenging task.

For our experiments, we use the NKI CCRT Speech Corpus [60], consisting of 2385 sentence-level utterances from 55 speakers undergoing treatment for inoperable tumors of the head and neck. This dataset, released at the Interspeech 2012 Speaker Trait Challenge [61], contains binary intelligibility annotations for each utterance, where the binary labels (I/NI) were created by thresholding EWE scores [?] obtained from multiple annotators. While the challenge focused mainly on obtaining a high classification accuracy for intelligibility [59], the crowdsourced and noisy aspects of the annotations were not considered in any detail.

From each utterance, we extract features relating to the prosody, voice quality and pronunciation aspects of the speech signal. These features are popular in speech signal processing for studying paralinguistic information such as emotional state, and for designing goodness-of-pronunciation measures. In [?]
the authors studied these feature sets in detail in the context of pathological speech and selected a subset of 13 features (prosody: 6, pronunciation: 2, voice quality: 5) that were shown to improve the average classification accuracy on fusion. We use these selected feature sets for our experiments in this paper. Lack of intelligibility in pathological speech can depend on a variety of aspects of the speech signal, with each factor contributing differently. This feature variability makes this domain particularly appealing for reliability-aware modeling.

Feature        Logistic  Proposed  ρ_avg
voice quality  58.2      59.8      0.43
prosody        67.1      66.7      0.73
pronunciation  55.1      56.2      0.16
all            68.0      67.8      0.78

Table 6.3: Results of reliability-aware binary classification of speech intelligibility for different feature sets. All accuracy figures are in %. The chance accuracy is 50.3%. Note that an improvement in classification accuracy is obtained when the feature sets are more unreliable (highlighted).

We perform experiments using 5-fold cross-validation on the NKI CCRT dataset. For each data fold, we train the reliability-aware model using the features and labels extracted from the remaining folds. The trained model is then used to infer class labels on the test fold. We compare our proposed approach against a logistic regression model, which uses the same algorithm as the reliable model in our approach; the baseline, however, is reliability-blind and assumes all samples to be equally reliable. The same feature sets are used to test both algorithms. We perform separate experiments using each of the feature sets (voice quality, prosody and pronunciation) and also test a system using all features together. The results obtained on this dataset are shown in Table 6.3. We observe that reliability-aware modeling helps improve the classification accuracy on the pronunciation and voice quality feature sets.
However, no improvement in classification accuracy is obtained for the prosody feature set, which also happens to be the most informative feature set for predicting speech intelligibility.

To better understand these results, we perform further analysis using average reliability scores as a diagnostic. The average reliability ρ_avg over the entire dataset can be computed by making a Bernoulli assumption on the latent variable R, i.e., Pr(R_i = 1) = ρ_avg. Simplifying the model in this way lets us characterize the reliability of each feature set by a single average reliability score ρ_avg. The ρ_avg values obtained for each feature set are shown in Table 6.3 and match our hypothesis that the proposed reliability-aware classification method performs better when the feature set is inherently more unreliable (highlighted in Table 6.3). This may be why our model is unable to extract any further performance by exploiting reliability information in the experiments on the prosody feature set. We also note that, for similar reasons, feature-fusion strategies with reliability-aware classification do not help improve the classification performance. This may be because we currently use a single reliability model r for all features together, whereas the experimental results clearly suggest that the degree of reliability of each feature set is different.

In addition, we performed a similar experiment on the NURC dataset for the object classification task in underwater sidescan sonar images. Zernike moment magnitude features are extracted from each segmented target and shadow as before. These 36-dimensional shadow and target features are then used separately for classifying the object type in each image. The results are shown in Table 6.4.

Feature       Max. Ent.  Proposed
highlight     45.7       47.6
shadow        33.7       32.8
all           48.9       51.4
score fusion  51.3       53.7

Table 6.4: Results of reliability-aware classification on the 7-class object classification task in underwater images.
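Under the Bernoulli assumption described above, the diagnostic ρ_avg reduces to the mean posterior reliability over the dataset; a minimal sketch with made-up posterior values:

```python
import numpy as np

def average_reliability(gamma):
    """Average reliability rho_avg under a Bernoulli model for R:
    Pr(R_i = 1) = rho_avg, estimated as the mean of the per-sample
    posterior reliabilities gamma_i."""
    return float(np.mean(gamma))

# illustrative posterior reliabilities for five samples
gamma = np.array([0.9, 0.8, 0.1, 0.95, 0.3])
print(average_reliability(gamma))  # ≈ 0.61
```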
Discussion

We observe that the proposed reliability-aware classification improves classification accuracy on the pronunciation and voice quality feature sets. However, there is no improvement for the prosody feature set, which is more reliable at predicting intelligibility than the other two. These observations are also supported by the average reliability score (ρ_avg) obtained for each feature set under the Bernoulli reliability model. These results hint at the ability of the proposed reliability-aware classification method to better exploit the latent reliability factor when the feature set has more unreliable samples. Similar results are obtained on the object classification task.

Additionally, we note that feature fusion with reliability-aware classification does not help improve the classification performance. This may be because we use a single combined reliability model r for all features, whereas the results clearly suggest that the degree of reliability of each feature set is different. Our experiments on the speech-intelligibility classification task suggest that reliability-aware classification helps more when the feature set is noisy, i.e., when more unreliable samples are present. Further investigation might reveal the role reliability plays in the noise robustness of the model.

Figure 6.2: Comparison of classification accuracies (in %) against the number of feature dimensions on three different databases, using different feature sets and fusion schemes (Highlight, Shadows, Score-fusion, Feature-fusion, RSHY): (a) NURC database (ρ̂ ≈ 0.3), (b) SPSS database (ρ̂ ≈ 0.7), (c) NSWC database (ρ̂ ≈ 0.4).
Reliability-aware fusion yielded the best performance on the NURC dataset, where the shadow features were most unreliable (see the ρ̂ values). In general, dimension reduction also helps, since the parameters of the Gaussian models can be better estimated.

Figure 6.3: QRSHY: the doubly unreliable model.

Chapter 7
Multimodal Mixture of Experts

7.1 Introduction

Autonomous emotion recognition systems find their place in numerous applications. Computers with the ability to recognize the emotion evoked by media content can be used to build better human-assistive systems [62]. Software capable of recognizing the continuous dynamic emotion evoked by videos can be used for building better personalized video recommendation systems. Emotion recognition systems are very helpful in autonomous video summarization and key event detection tasks [63, 64]. Moreover, the emotion profile of a movie, i.e., the continuous dynamic emotion evoked by the movie, can be used as a hidden layer in predicting outcomes such as the success and gross income of a movie.

Related Works: Affective analysis in music has been an actively researched area [65], and several well-performing systems exist for predicting emotion in music. Compared to emotion prediction in music, emotion prediction in movies is a much more challenging and complex task. In movies, there is a complex interplay between the audio and video modalities that determines the perceived emotion. This interaction between the modalities is highly dynamic, in the sense that the relative contribution of each modality to emotion prediction may change over the course of the movie. For example, consider a movie scene that begins with an accompanying musical soundtrack, where the music fades away as the scene proceeds. In such a scene, musical cues would initially be important in setting up the mood, but as the scene proceeds, the visual cues might contribute more to the perceived emotion. Therefore, a multimodal framework which can dynamically capture the interaction between the modalities is necessary for determining the evoked emotion in movies.
Much of the research done in the field of emotion prediction from audio-visual content has focused on accomplishing specific tasks. Chen et al. [63] aimed to detect violent scenes in a movie. In [66], Nepal et al. focused on automatically detecting goal segments in basketball videos. Recent works have tried to determine categorical emotions in media content: Jiang et al. [67] and Kang et al. [68] proposed systems for predicting categorical emotion labels for videos. Some researchers have narrowed their attention to affective analysis of movies from specific genres; Xu et al. analyzed horror and comedy videos by performing event detection using audio cues [69].

A system capable of determining the emotion evoked by a video continuously over time can be very useful in all the above tasks. In this work, we therefore determine the emotion evoked by a video by predicting arousal-valence curves that are continuous in scale and time, and validate them against human-annotated values. We put forth a set of audio and video features that can be used for the task, and also propose several new video features such as Video Compressibility and Histogram of Face Area (HFA). We explore different fusion models and show how the complementary information present in the audio and video modalities can be exploited to predict emotion ratings. Finally, we propose a Mixture of Experts (MoE)-based fusion model that jointly learns optimal fusion weights for the audio and video modalities in a data-driven fashion, and present a learning algorithm based on hard Expectation-Maximization (EM) for this model.

7.2 Dataset and experimental setup

In the current work we use the dataset described in [64, 70]. The dataset consists of 12 video clips, each from a different movie and around 30 min long. The movies in the dataset have won Academy Awards and are from different genres. For each video clip there are two curves, one for intended/evoked valence and the other for arousal. These curves range from −1 to 1 and are sampled at a rate of 25 Hz.
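One way to check how slowly such 25 Hz curves vary is to measure the fraction of their spectral energy below a cutoff frequency; a sketch on a synthetic slowly varying curve (the signal and the 0.2 Hz cutoff are illustrative assumptions):

```python
import numpy as np

def low_freq_energy_fraction(signal, fs, f_cut):
    """Fraction of spectral energy at frequencies <= f_cut Hz,
    computed from the magnitude spectrum of the signal."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spec[freqs <= f_cut].sum() / spec.sum()

# toy curve sampled at 25 Hz: a slow 0.05 Hz drift plus a tiny fast ripple
fs = 25.0
t = np.arange(0, 120, 1.0 / fs)
curve = np.sin(2 * np.pi * 0.05 * t) + 0.01 * np.sin(2 * np.pi * 2.0 * t)
print(low_freq_energy_fraction(curve, fs, 0.2))  # > 0.99
```

A curve with almost all of its energy below 0.2 Hz can safely be summarized at a much coarser sampling interval.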
On performing a frequency-response analysis of these curves, we found that more than 99% of their energy is contained in frequencies below 0.2 Hz. This implies that the emotions vary slowly with time, and that it is sufficient to sample the arousal and valence ratings at 5 sec intervals. We therefore split all movies into non-overlapping 5 sec samples, giving us around 3800 samples with one valence and one arousal rating associated with each sample. For each of these samples, we separate the audio and video channels and extract audio and video features from each, as described in Sec. 7.3.

7.3 Feature Design

Various features have been proposed in the literature for the emotion recognition task. In addition to this list of audio and video features, we propose two novel features: Video Compressibility and Histogram of Facial Area (HFA). The features used in the current work are described below.

7.3.1 Audio features

Mel Frequency Cepstral Coefficients (MFCC) and Chroma: MFCC and Chroma [71] features have been widely used in emotion recognition tasks. We extract MFCC features for each 25 ms window with a 10 ms shift, and Chroma features for each 200 ms window with a 50 ms shift. We also compute Delta MFCC and Delta Chroma features, which are the time derivatives of the MFCC and Chroma features respectively. Finally, for each sample, we compute statistics (mean, min, max) of the previously mentioned features.

Audio Compressibility: This audio feature was introduced in [72], where it was shown to be highly correlated with human-annotated arousal and valence ratings. For an audio clip, the Audio Compressibility feature is defined as the ratio of the size of the losslessly compressed audio clip to that of the raw audio clip. We compress the raw audio with the FLAC lossless codec [73] using ffmpeg [74] and include the ratio of the size of the compressed audio clip to that of the original audio clip as a feature.

Harmonicity: The presence of music in media content helps trigger an emotional response in viewers [75].
To capture this information, we use the Harmonicity feature, which was introduced in [76] as a measure of the presence of music in an audio clip. We first divide a sample audio clip into 50 ms mini-clips and extract the pitch in each 50 ms clip using the aubio tool [77]. The Harmonicity of the sample audio clip is then taken as the ratio of the number of 50 ms clips that have a pitch to the total number of 50 ms clips.

7.3.2 Video features

Shot Frequency: Cuts in a video have been widely used by cinematographers to set the pace of the action [78]. In order to capture this information, we follow an approach similar to that in [79]. We detect the shots, i.e., the sequences of frames recorded continuously by a camera, in the sample video clip using ffmpeg [74] and count the total number of shots present.

Histogram of Optical Flow (HOF): Motion or activity in a scene affects the emotional response of viewers [80]. To capture this information we use the HOF feature [81]. First, we extract the Lucas-Kanade optical flow [82] for all frames except those near a shot boundary. Frames near a shot boundary are excluded because they would exhibit spuriously high optical-flow values due to the discontinuity. For each frame with computed optical flow, we construct an 8-bin histogram as follows. For each optical flow vector [x, y] in a frame, we calculate its angle with the positive x-axis, tan^{-1}(y/x), and find the bin in which it lies, using the fact that the i-th bin represents angles in [(i−1)π/4, iπ/4]. Its contribution to that bin is taken proportional to its L2 norm, √(x² + y²). We opted for an 8-bin histogram because it is robust and sufficient for the task. For each sample video clip we then compute statistics of the HOF features across its frames.

3D Hue Saturation Value (HSV) Histogram: Color has a significant influence on the human affective system [83, 84]. This information is captured using the 3D HSV feature. First we convert the frames from RGB to HSV color space.
Then for each frame we construct a 3D histogram as follows. We quantize each of the hue, saturation and value channels into 4 equal-sized intervals, so a pixel has 4 choices for hue, 4 for saturation and 4 for value, and can therefore lie in any of 4 × 4 × 4 = 64 bins. Finally, for each sample video clip, we compute statistics of the 3D HSV histogram features across all its frames.

Video Compressibility: Along the lines of audio compressibility, we define a video compressibility feature to capture aspects of motion and change in a video. Most video compression algorithms exploit the redundancy in video information over time using motion and change prediction. As we expect these factors to be correlated with the perceived emotion ratings, we use video compressibility as a compact feature that combines their effects over a clip. To calculate video compressibility for a sample video clip, we first compress the raw video with the lossless huffyuv codec [85] using ffmpeg [74] and then calculate the ratio of the size of the compressed video to that of the original raw video. We have found that the video compressibility feature has a correlation of −0.25 with the human-annotated arousal values; the p-value for this correlation is essentially 0, indicating that the correlation is significant. We have plotted the variation of the scaled arousal values and the scaled video compressibility for a movie sample in Fig. 7.1, where we can clearly observe how video compressibility and arousal vary inversely.

Figure 7.1: Plot showing the variation in scaled video compressibility and scaled arousal value for a sample movie.

Histogram of Facial Area (HFA): Face closeups have frequently been employed by cinematographers to attract the attention of the audience and evoke aroused emotions [86]. We attempt to extract this information using the HFA feature. We begin by carrying out face detection in all the frames using a deep convolutional network based face detector [87].
Of all the faces detected in a frame, the one with the largest area is taken as the primary face. For a sample video clip, we detect the primary faces in all the frames. All frames containing a face are then binned according to the primary face area to construct a histogram. We construct a 3-bin histogram, with the bins representing small, medium and large faces. Fig. 7.2 shows the formation of the HFA for a sample video clip.

7.4 Systems for Emotion Prediction

As described in Sec. 7.2, it is sufficient to predict the valence and arousal ratings for a movie at 5 sec intervals. To accomplish this task we split each movie into non-overlapping 5 sec sample clips and extract the above-mentioned audio and video features from them. We learn different regression models that predict the arousal and valence ratings corresponding to each sample from the extracted features. We perform a leave-one-movie-out cross-validation. For all experiments, we learn independent models for arousal and valence using the sample clips from the training movies. These models are then used to predict arousal and valence ratings for every 5 sec sample of the held-out test movie. To incorporate temporal context information, we apply temporal Gaussian smoothing to the predicted values. This ensures that the predicted value for each clip is consistent with its neighbors. The length of the smoothing window is denoted l_ar for arousal and l_vl for valence. For each model, l_ar and l_vl are chosen through a grid search on the cross-validation sets so as to maximize the performance of that model.

Figure 7.2: Schematic representation showing the formation of the Histogram of Face Area (HFA) for a sample video.

7.4.1 Audio Only, Video Only and Early Fusion

In the audio-only and video-only models we predict the arousal and valence values using only the audio features or only the video features. We learn a simple linear regression model to predict the arousal and valence values from the features of each sample clip.
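The per-clip linear regression predictor just described amounts to ordinary least squares with a bias term; a minimal sketch (the toy features and target weights are assumptions):

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least-squares regression with an appended bias column,
    as used for the audio-only and video-only predictors."""
    Xb = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.c_[X, np.ones(len(X))] @ w

# toy per-clip features with a known linear target (noise-free)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.3
w = fit_linear(X, y)
```

On noise-free data the solver recovers the generating weights exactly, including the 0.3 bias.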
We also tried other regression models such as Support Vector Regression and Gaussian Process Regression, but since they did not yield much improvement in the predictions, we focused on simple linear regression. In the early fusion model we simply concatenate the audio and video features and learn a linear regression model on the fused feature vector.

7.4.2 Late Fusion Model

In the late fusion model, we learn two independent models, one from only the video features and the other from only the audio features. We then fuse the predictions from the two models to give the final prediction. Let y^{(v)} be the prediction from the video features and y^{(a)} the prediction from the audio features; the final prediction y^{(pre)} is then given by Eqn. 7.1. Note that the value of α remains the same for all samples across all the cross-validation folds, and its value is chosen so as to maximize the correlation between the actual and predicted values. We further analyze the performance of the late fusion model with changing α in Sec. 7.6 using Fig. 7.4.

y^{(pre)} = \alpha\, y^{(v)} + (1-\alpha)\, y^{(a)}, \quad \alpha \in [0,1] \qquad (7.1)

7.4.3 Proposed Mixture of Experts (MoE)-based Fusion Model

In the MoE-based model we have two experts, one that uses the audio features and the other that uses the video features. Along with these experts, we have a gating function which determines the contribution of each expert to the final prediction. The final prediction of the MoE-based model is very similar to that of the late fusion model, except that here α is not fixed: its value depends on the audio and video features of the current sample. The MoE-based model can thus be thought of as comprising two independent learners and a gating function, where the gating function decides the contribution of each learner, as shown in Fig. 7.3.

Figure 7.3: Schematic representation of the proposed Mixture of Experts (MoE)-based fusion model.
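The choice of the fixed late-fusion weight α in Eqn. 7.1 can be sketched as a grid search that maximizes the Pearson correlation with the ground truth; the toy predictions below are assumptions for illustration:

```python
import numpy as np

def best_alpha(y_true, y_video, y_audio, grid=np.linspace(0, 1, 101)):
    """Grid search over the late-fusion weight alpha (Eqn. 7.1),
    choosing the value that maximizes Pearson correlation
    between fused predictions and the ground truth."""
    best_a, best_corr = 0.0, -np.inf
    for a in grid:
        fused = a * y_video + (1 - a) * y_audio
        corr = np.corrcoef(y_true, fused)[0, 1]
        if corr > best_corr:
            best_a, best_corr = a, corr
    return best_a, best_corr

# toy ground truth with two equally noisy unimodal predictors
rng = np.random.default_rng(0)
y = rng.normal(size=500)
y_v = y + rng.normal(scale=1.0, size=500)
y_a = y + rng.normal(scale=1.0, size=500)
a_star, c_star = best_alpha(y, y_v, y_a)
```

With equally noisy predictors, the optimal α should land near 0.5, and fusion can only match or beat either predictor alone since α = 0 and α = 1 are in the grid.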
Let y_1, y_2, ..., y_n be our target labels, x^{(a)}_1, x^{(a)}_2, ..., x^{(a)}_n the audio features, x^{(v)}_1, x^{(v)}_2, ..., x^{(v)}_n the video features, and x^{(z)}_1, x^{(z)}_2, ..., x^{(z)}_n the features used for determining α for each sample. In general one can choose any feature set for x^{(z)}_i. In our case we first concatenated all the audio and video features and then performed Principal Component Analysis (PCA) [88] to reduce their dimension; the principal components explaining 90% of the variance were retained to construct x^{(z)}_i. The predicted label for the i-th sample, y^{(pre)}_i, is given by Eqn. 7.2, where ω_a, ω_v and ω_z are the parameters associated with the model:

y_i^{(pre)} = \alpha_i\, y_i^{(v)} + (1-\alpha_i)\, y_i^{(a)}, \quad \text{where } y_i^{(a)} = \omega_a^\top x_i^{(a)},\; y_i^{(v)} = \omega_v^\top x_i^{(v)},\; \alpha_i = \frac{1}{1+e^{-\omega_z^\top x_i^{(z)}}} \qquad (7.2)

In order to learn the parameters of the model, we follow an algorithm similar to hard expectation-maximization, as described next. The loss function L(ω_a, ω_v, ω_z) depends on the parameters of the model and is given by Eqn. 7.3:

L(\omega_a, \omega_v, \omega_z) = \sum_{i=1}^{n}\left\{y_i - y_i^{(pre)}\right\}^2 = \sum_{i=1}^{n}\left\{y_i - \alpha_i y_i^{(v)} - (1-\alpha_i) y_i^{(a)}\right\}^2 \qquad (7.3)

The task of the learning algorithm is to estimate the parameters ω_a, ω_v and ω_z that minimize this loss function. We adopt a coordinate-descent approach and subdivide the algorithm into two steps, corresponding to the individual learners and the gating function respectively. We start by randomly initializing the parameter values, and then repeat the following steps iteratively until convergence.

STEP I: In this step we fix ω_z and minimize the loss function by estimating optimal values for ω_a and ω_v. Since every α_i depends only on ω_z and x^{(z)}_i, the α_i are also fixed in this step.
\min_{\omega_a, \omega_v} \sum_{i=1}^{n}\left\{y_i - \alpha_i y_i^{(v)} - (1-\alpha_i) y_i^{(a)}\right\}^2 = \min_{\omega_a, \omega_v} \sum_{i=1}^{n}\left\{y_i - \omega_v^\top\left(\alpha_i x_i^{(v)}\right) - \omega_a^\top\left((1-\alpha_i) x_i^{(a)}\right)\right\}^2 \qquad (7.4)

From Eqn. 7.4 it is clear that ω_a and ω_v can be found by solving the linear regression problem that predicts y_i for all i using [α_i x^{(v)}_i, (1 − α_i) x^{(a)}_i]^⊤ as the feature vector.

STEP II: In this step we minimize the loss function with respect to ω_z, keeping ω_a and ω_v fixed. This is achieved through a gradient-descent formulation, where the gradient is given by Eqn. 7.5:

\frac{\partial L}{\partial \omega_z} = 2\sum_{i=1}^{n}\left[y_i - \alpha_i y_i^{(v)} - (1-\alpha_i) y_i^{(a)}\right]\alpha_i(1-\alpha_i)\left(y_i^{(a)} - y_i^{(v)}\right)x_i^{(z)} \qquad (7.5)

7.5 Evaluation

As briefly mentioned in Sec. 7.4, we perform a leave-one-movie-out cross-validation to test the performance of the different models. For each model we compute the mean absolute Pearson correlation coefficient (PCC) between the predicted and ground-truth labels over all movies: ρ_ar refers to the mean of the absolute PCC between the predicted and ground-truth arousal values, and ρ_vl to the mean absolute PCC between the predicted and ground-truth valence values. For arousal the PCCs are positive in all cases, so the mean absolute PCC is the same as the mean PCC. For valence, however, we sometimes obtain a negative PCC. This can be attributed to the fact that, unlike arousal, valence requires much higher-level cognitive processing, and similar audio-visual features can elicit very different valence. For example, a fighting scene can evoke a very different valence response depending on whether the hero or the villain is dominating. Similarly, a laughing scene takes the opposite sign on the valence scale depending on whether it is the hero or the villain who is laughing. The proposed models are unable to capture this aspect of valence and sometimes give inverted predictions for valence, resulting in a significant negative PCC with the ground-truth valence.

Out of the 12 movies in the dataset, 2 are animated.
Since the video features of an animated movie would be very different from those of a regular movie, we have excluded the 2 animated movies from the video and fusion models. We have evaluated the audio model twice, once with the animated movies and once without them. The audio model with the animated movies is referred to as Audio Only_1, and the audio model without them as Audio Only_2.

7.6 Results and Observations

Table 7.1 shows the ρ_ar and ρ_vl values for the different models. We consider the Early Fusion model as our baseline. It can be seen that the fusion models perform better than the individual audio or video models. This shows that the audio and visual modalities contain complementary information, and that their fusion helps in emotion prediction. Also, in all the models the prediction of arousal is better than that of valence. This can be attributed to the fact that valence prediction requires higher semantic information and cognitive processing, and is therefore far more challenging than arousal prediction. Furthermore, there is a large variance in the results of all the models. This can be attributed to the fact that the dataset has movies belonging to many different genres, and a common model is unable to describe all of them. The proposed MoE-based fusion model, which dynamically adjusts the contributions of the audio and video modalities, outperforms all the other models. Overall, considering the complexity of the task, the MoE-based model does a good job of predicting the valence and arousal curves.

As mentioned in Sec. 7.4, we apply a Gaussian smoothing window at the end of each model to incorporate the context information. This increases the agreement between neighbors. We found that l_vl > l_ar for all the models, which clearly shows that valence requires longer context information than arousal. Further, we investigated how the audio and video modalities contribute to the final prediction in the case of the late fusion model. We have plotted the change in ρ_ar and ρ_vl with changing α in Fig. 7.4.
Please note that α = 0 corresponds to the Video Only model and α = 1 corresponds to the Audio Only model. From the plots, we can see that α = 0.56 for the best performing arousal system and α = 0.91 for the best performing valence system. We can conclude that in our model audio and video contribute almost equally to arousal prediction, but for valence prediction audio contributes more.

Figure 7.4: Plots showing the variation in ρ_ar and ρ_vl with α for the Late Fusion model. (a) ρ_ar vs α; (b) ρ_vl vs α.

7.7 Conclusions

In this paper we addressed the problem of tracking time-varying continuous emotion ratings using multimodal cues of media content. We suggested a list of audio and video features suitable for the task, including novel features like Video Compressibility and HFA. We compared and analyzed the performance of audio-only, video-only and fusion models. Further, we proposed a MoE-based fusion model which dynamically fuses the information from the audio and video channels and outperforms the other models. We also presented a hard-EM based learning algorithm for the MoE-based model. The MoE-based model in general performs well on the emotion recognition task, except sometimes for valence when high-level semantic information is required. Future research and development of systems that can capture the semantic information in a video can help improve the emotion prediction models.
Model                     ρ_ar         ρ_vl
Audio Only_1              0.56±0.23    0.24±0.15
Audio Only_2              0.54±0.23    0.24±0.15
Video Only                0.49±0.18    0.16±0.12
Baseline (Early Fusion)   0.58±0.17    0.22±0.12
Late Fusion               0.59±0.20    0.24±0.14
Proposed MoE              0.62±0.16    0.29±0.16

Table 7.1: Performance of different models in predicting arousal-valence curves that are continuous in time and scale.

Chapter 8
Reliability in the crowd

8.1 Introduction and background

For many tasks involving subjective annotations, it is often difficult to reach a consensus on the label because of the annotators' individual biases in perception. In such cases, it is often beneficial to instead have multiple annotators label the dataset. A decision on the final ground truth is then made after considering all the labels assigned to each sample. In such cases, the wager is on the proverbial "wisdom of the crowd". However, extracting this "wisdom" from the crowd is usually not straightforward. This forms the primary interest of most crowdsourcing and multiple annotator studies.

The simplest and most popular approach to this problem is to use a statistic of all the labels for a given sample, such as the mode (majority voting) or the mean (for continuous labels), as the consensus. Other popular variants are latent ground truth models that assume each annotator distorts the ground truth through a custom noisy channel model [89]. One drawback of these distortion models is that they can often become parameter heavy if the original label set is large. Some approaches have also considered incorporating features during inference [90, 91, 92], which allows them to jointly estimate a model to predict the latent ground truth. All these methods estimate the hidden ground truth labels along with the distortions that each annotator performs. Although the evaluation should ideally compare against the ground truth, this information is often unavailable. Alternative methods of evaluation include:

• Evaluating noisy labels by annotators as predicted by the distortion models.
• Using metrics of overall agreement with annotators as a proxy metric for the accuracy of the estimated ground truth.
The fundamental challenge in collecting such datasets is to obtain a dataset that poses sufficient ambiguity for human annotators, while providing easy access to the underlying ground truth. We will refer to this ground truth label as the oracle.

8.2 Multiple annotators reliability model

In this section, I present how the reliability-aware classification model discussed in Chapter 5 lends itself naturally to the multiple annotators scenario. For this, I assume that the reliable model Θ, which represents the objective and feature-dependent part of the annotation, is shared among all the annotators. On the other hand, the unreliable model Φ, which represents the subjective bias, is specific to each annotator and is denoted by Φ_m for the m-th annotator.

Then, using the notation Y_{ik}^{(m)} to denote the m-th annotator's label for the k-th class on the i-th sample, we can rewrite the conditional distributions as shown in Eqs. (8.1), (8.2) and (8.4).

Pr(Y_i^{(m)} \mid X_i, R_i^{(m)}) = Pr(Y_i^{(m)} \mid X_i; \Theta)^{R_i^{(m)}} \, Pr(Y_i^{(m)}; \Phi_m)^{1 - R_i^{(m)}}    (8.1)
Pr(Y_{ik}^{(m)} = 1; \Phi_m) = \eta_{mk}    (8.2)
Pr(Y_{ik}^{(m)} = 1 \mid X_i; \Theta) = \Psi_k(X_i; W)    (8.3)
Pr(R_i^{(m)} = 1) = \rho_m    (8.4)

where Ψ_k denotes the softmax function as before with parameters W, and η_m corresponds to the parameters of the unreliable categorical model for the m-th annotator. A special case of this model is presented by Rodrigues et al. [93], where they assume a uniform η_mk = 1/K for all annotators and all K classes. Each annotator's reliability is drawn from a Bernoulli distribution with parameter ρ_m and is assumed to be independent of the sample itself.

8.2.1 Parameter estimation

The parameters of this proposed model are similar to those of the models proposed earlier. We have the parameter matrix W for the reliable maxent model and categorical parameters η_1, ..., η_M for each annotator. In addition, we need to estimate the average Bernoulli reliability ρ_m for each annotator.
This parameter estimation can be performed as before using the EM algorithm, where in the E-step we first estimate the posteriors γ_i^{(m)} = Pr(R_i^{(m)} = 1 | Y_i^{(m)}, X_i). This is used to compute the expected total log-likelihood L' as before. The parameter W can then be estimated by maximizing L' via gradient descent.

L' = \sum_i \sum_k \sum_m \gamma_i^{(m)} Y_{ik}^{(m)} \log \Psi_k(X_i, W)    (8.5)

Note that closed-form estimates can be obtained for the other parameters ρ_m and η_m. Once the parameters have been estimated, we can use the reliable model to predict the true label as shown in Eq. (8.6).

Pr(Y_k \mid X) = \Psi_k(X; W^*)    (8.6)

To test the proposed model, we perform two sets of experiments. First, we validate against synthetic datasets by simulating different annotator labels under different unreliable models. We train our proposed model to estimate these parameters and compare them against the reference. Second, we use real annotated datasets where the oracle label is known. On these real datasets we evaluate the accuracy of predicting the oracle label. The performance is then compared against a baseline of training a similar model using majority voted labels.

8.2.2 Experiments

Simulated multiple annotators

For each row shown in Table 8.1 we choose different ρ_m's for each of the synthetically generated annotators. These are shown in the second column. The unreliable parameters η_m for each annotator are randomly generated. Finally, we train the proposed model to estimate back the ρ_m's, as shown in the last column.

Dataset      Applied ρ           Oracle   Majority   Proposed   Estimated ρ
Pathology    0.63, 0.91, 0.11    68.8     66.5       68.5       0.49, 0.79, 0.02
Pathology    0.2, 0.4, 0.8       68.8     65.9       68.5       0.10, 0.28, 0.59
Parkinsons   0.1, 0.2, 0.6       64.8     59.9       61.9       0.24, 0.20, 0.59
Diabetes     0.2, 0.4, 0.5, 0.7  78.26    73.04      77.84      0.30, 0.53, 0.58, 0.91
Diabetes     0.1, 0.2, 0.4, 0.6  78.26    73.04      77.84      0.18, 0.13, 0.45, 0.72
5EMO         N/A                 49.5     45.1       48.4       0.68, 0.72, 0.71, 0.75

Table 8.1: Results on synthetic datasets where the annotators were created using pre-determined average reliability ρ_m's.
The estimated ρ_m's roughly follow the trend of the applied ρ_m's. Also note that the proposed reliability-aware approach outperforms the baseline majority voting method on all tasks.

Ann#   avg. confidence   ρ_m    Spearman's corr.
1      3.60              0.68   0.30
2      3.97              0.72   0.18
3      4.09              0.72   0.11
4      4.24              0.75   0.16
all    -                 -      0.21

Table 8.2: Correlation between reliability posteriors on the dataset and self-reported confidence.

Real annotators: 5EMO

In addition to the experiments with simulated annotators, we also perform experiments on the 5EMO dataset, which consists of real annotators. The dataset contains speech utterances from 3 different speakers that are categorized as belonging to one of 5 emotion categories: Hot Anger, Cold Anger, Sad, Happy and Neutral. In addition to the labels from the 4 annotators, the intended or target emotion is also known for each speech utterance. Additionally, each annotator reported their confidence in labeling each sample on a Likert scale of 1 to 5. We train our proposed reliability-aware multiple annotator model on this dataset as before and obtain the results shown in the last row of Table 8.1.

Correlation with self-reported annotator confidence

In addition, to test whether the proposed method is intuitively interpretable, I computed the correlation between the posteriors of the latent reliability variable γ_i^{(m)} and the annotator's self-reported confidence on each sample. An average Spearman's correlation was computed over all the annotators; results for individual annotators are shown in Table 8.2. These correlations indicate that the latent reliability variable in the proposed model agrees well with each annotator's self-reported assessment of their reliability. It must be noted, however, that such self-reported measures of confidence are subjective and might not always assess the annotator's ability well.

8.3 Data-dependent reliability

While performing further experiments with more real datasets, we realized that the Bernoulli assumption on reliability R is often insufficient in practice, i.e.,
the assumption that certain samples are reliable at random does not hold in certain cases. To alleviate this issue, we bring the complete data-dependent reliability assumption to the multiple annotator model. Recall that the data-dependent reliability model in Section 5 was modeled by logistic regression on the feature space as Pr(R_i | X_i) = σ(r^⊤ X_i).

While this can easily be extended to the multiple annotator case by training M such logistic regression models, we instead adopt a somewhat different approach. This approach is motivated by the multimodal mixture-of-experts model, where the reliability measure α for each modality is defined relative to the others, unlike the absolute definition of reliability that we have been using for each annotator so far. This requires the additional assumption that the vector of annotator reliabilities R^{(1)}, ..., R^{(M)} is a one-hot vector indicating which annotator is more reliable on a certain sample. In other words, R_i^{(m)} = 1 indicates that the m-th annotator's labels are reliable for the i-th sample.

We thus modify the proposed model by imposing the constraint that Σ_m Pr(R_i^{(m)} | X_i) = 1, which implies that at least one of the annotators is always expected to be reliable. This additional constraint solves the issue of data sparsity faced by the previous model, as all samples are used to train the reliable model parameter W. Alternatively, this model is equivalent to learning regions in feature space where different annotators are reliable. Hence, we shall refer to this model as MoAR: Mixture of Annotator Reliabilities. This data-dependent approach is somewhat similar to multiple annotator models such as [92, 91], where the annotator-specific model depends on features. In this work, we assume that the annotator reliability can be generated using a softmax transformation of the features. We denote the parameters of this softmax by U_m for the m-th annotator.
Pr(Y_{ik}^{(m)}, R_i^{(m)} \mid X_i) = Pr(Y_{ik}^{(m)} \mid R_i^{(m)}, X_i) \, Pr(R_i^{(m)} \mid X_i)    (8.7)

Pr(R_i^{(m)} \mid X_i) = \frac{\exp(U_m^{\top} X_i)}{\sum_p \exp(U_p^{\top} X_i)}    (8.8)

8.4 Experiments and Results

Results for the MoAR approach are presented on two datasets where the underlying ground truth is known. Evaluation was done by comparing the reliable model Θ's prediction against the oracle label. Recall that the 5EMO dataset has an emotion class associated with each speech utterance; each of the 4 annotators judged the utterances as belonging to one of the 5 emotion classes. The FGnet age estimation database, on the other hand, comprises face images where the oracle label for the age of each person is known. These images were presented to 10 human annotators who attempted to guess the age [94]. For our experiments, we use these four age categories: 0-6 years, 6-13 years, 13-22 years and older than 22 years.

For each of the datasets, 5EMO and FGnet, the baseline systems are first trained using the oracle and majority labels. Next, the reliability-aware system with a Bernoulli reliability assumption is trained, followed by the MoAR model. Results are shown in Table 8.3.

Method     5EMO   FGnet
Oracle     49.5   69.3
Majority   45.1   62.9
R-aware    48.4   60.7
MoAR       54.0   70.4

Table 8.3: Results using the Mixture of Annotator Reliabilities (MoAR) and baseline methods. Note that the proposed MoAR model, being an ensemble method, is able to improve over the oracle baseline.

Note that while the simpler reliability-aware approach performs poorly compared to the majority voting baseline, the proposed data-dependent MoAR model performs significantly better, indicating the need to model reliability in a data-dependent fashion. In fact, we note that the proposed MoAR is able to outperform the Oracle system on both datasets, in spite of having seen only the noisy multiple annotator labels. This should not really come as a surprise, as MoAR is an ensemble approach and a more complex model than the single maxent model trained on oracle labels.
Further investigation is required to see under what circumstances MoAR gives comparable performance to using the ground truth labels.

Chapter 9
Future Work

9.1 Latent annotator reliability models

For most of the multiple annotator models presented so far, the annotations are represented by a uniform label matrix Y, i.e., the same set of annotators labels all samples. This restriction is not essential, however, and all the proposed methods can easily be extended to the non-uniform case. In fact, in crowdsourcing applications, it is often the case that each annotator only labels a small subset of the samples. The label matrix is thus sparse in practice, which makes per-annotator modeling of reliability infeasible owing to a lack of sufficient data for each annotator. In addition, there are scenarios where the annotators might want to stay anonymous. In all these scenarios, it often makes more sense to model annotator reliability along data-driven annotator types. Such a latent annotator model could allow us to learn diverse annotator types using fewer parameters.

A proposed model for future work is shown below in Eq. (9.1), where we associate a discrete latent variable A with the annotator type. Each annotation then depends on the annotator type A if the annotation was unreliable, and on the sample features otherwise. This is consistent with our original reliability assumption. The only difference is that the labels for A are unknown and will be inferred jointly using a data-dependent distribution.

Figure 9.1: Latent annotator reliability Bayesian model over the variables X, Y, A and R. A indicates the latent annotator type.

Pr(Y \mid X, A, R) = Pr(Y \mid X; \Theta)^{R} \, Pr(Y \mid A; \Phi)^{(1-R)}    (9.1)
Pr(R, A \mid X) = Pr(R \mid A, X) \, Pr(A \mid X)    (9.2)
Pr(R \mid A, X) = \Psi(X; W)    (9.3)

The application of the model could easily be extended even to cases where the annotator identities are known. The key idea is that modeling annotations along data-driven annotator types, rather than actual identities, might allow creating annotator models that are complementary to each other, thus helping the model generalize.
The next section proposes a possible baseline method that instead treats the training problem as one of robust training in the presence of uncertain labels.

9.2 Label Stacking

On closely studying the cost function of the multiple annotator reliability model in Eq. (8.5), we notice that it is equivalent to synthetically extending the dataset by copying the features and pairing each copy with a different set of labels. For instance, if the original training dataset is represented as the pair [X, Y], then the extended dataset can be represented as

[X  Y^{(1)}]
[X  Y^{(2)}]
[...   ...]
[X  Y^{(K)}]

We refer to this method of training with uncertain multiple labels as label stacking. We performed label stacking training for three different classifiers: Logistic Regression, Support Vector Machine (RBF kernel) and Naive Bayes. We compared the results on the 5EMO dataset. The results are presented in Table 9.1 below.

Method           Logistic   SVM (RBF)   Gaussian NB
Oracle           47.5       42.9        44.7
Best annotator   43.0       37.5        40.4
Label stack      44.1       45.4        40.9
Majority vote    44.5       37.9        41.8

Table 9.1: Results comparing label stacking for different classifiers. Label stacking improves results for SVM training, which might imply that selective training by weighting samples differently is necessary.

Initial results suggest that, of the 3 classifiers we tested this method on, only the SVM shows an improvement with label stacking. We think this might be because the SVM cost function uses the samples selectively during training: each sample during SVM training is selectively weighted and penalized. Hence, when presented with multiple labels for each training sample, the SVM is able to weight them differently. This might be similar to what our proposed reliability model achieves on multiple labels. The only difference is that label stacking does not require any priors on annotator tendencies in the form of reliability models. This direction of robust multi-label training looks promising and will require further investigation.
Reference List

[1] N. Kumar, U. Mitra, and S. Narayanan, "Robust object classification in underwater sidescan sonar images by using reliability aware fusion of shadow features," in IEEE Journal of Oceanic Engineering, 2014.
[2] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. Narayanan, "Automatic intelligibility classification of sentence-level pathological speech," in Computer Speech and Language: Special Issue Next Generation Computational Paralinguistics, 2014.
[3] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[4] H. Drucker, C. Cortes, L. D. Jackel, Y. LeCun, and V. Vapnik, "Boosting and other ensemble methods," Neural Computation, vol. 6, no. 6, pp. 1289–1301, 1994.
[5] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems. Springer, 2000, pp. 1–15.
[6] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[7] P. O. Gislason, J. A. Benediktsson, and J. R. Sveinsson, "Random forests for land cover classification," Pattern Recognition Letters, vol. 27, no. 4, pp. 294–300, 2006.
[8] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
[9] D. J. Miller and H. S. Uyar, "A mixture of experts classifier with learning based on both labelled and unlabelled data," in Advances in Neural Information Processing Systems, 1997, pp. 571–577.
[10] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[12] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[13] E. J. Candes, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Comm. Pure Appl. Math., vol. 59, pp. 1207–1223, 2005.
[14] ——, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, March 2006.
[15] E. Candes and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Problems, vol. 23, pp. 969–985, 2006.
[16] W. Bajwa, J. Haupt, A. Sayeed, and R. Nowak, "Compressive wireless sensing," in 5th Int. Conf. Information Processing in Sensor Networks (IPSN'06), April 2006, pp. 134–142.
[17] ——, "Joint source-channel communication for distributed estimation in sensor networks," IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3629–3653, October 2007.
[18] C. T. Chou, R. Rana, and W. Hu, "Energy efficient information collection in wireless sensor networks using adaptive compressive sensing," in IEEE 34th Conference on Local Computer Networks (LCN), 2009, pp. 443–450.
[19] C. Luo, F. Wu, J. Sun, and C. W. Chen, "Compressive data gathering for large-scale wireless sensor networks," in Proceedings of the 15th Annual International Conference on Mobile Computing and Networking, ser. MobiCom, 2009, pp. 145–156.
[20] J. Meng, H. Li, and Z. Han, "Sparse event detection in wireless sensor networks using compressive sensing," in 43rd Annual Conference on Information Sciences and Systems (CISS), March 2009, pp. 181–185.
[21] R. Masiero, G. Quer, D. Munaretto, M. Rossi, J. Widmer, and M. Zorzi, "Data acquisition through joint compressive sensing and principal component analysis," in Proceedings of the 28th IEEE Conference on Global Telecommunications, ser. GLOBECOM'09, 2009, pp. 1271–1276.
[22] R. Masiero, G. Quer, M. Rossi, and M. Zorzi, "A Bayesian analysis of compressive sensing data recovery in wireless sensor networks," in ICUMT'09, 2009, pp. 1–6.
[23] S. Lee, S. Pattem, M. Sathiamoorthy, B. Krishnamachari, and A. Ortega, "Spatially-localized compressed sensing and routing in multi-hop sensor networks," in GeoSensor Networks, ser. Lecture Notes in Computer Science. Springer Berlin/Heidelberg, 2009, vol. 5659, pp. 11–20.
[24] D. Motamedvaziri, V. Saligrama, and D. Castanon, "Decentralized compressive sensing," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, Sept. 2010, pp. 607–614.
[25] F. Fazel, M. Fazel, and M. Stojanovic, "Random access compressed sensing for energy-efficient underwater sensor networks," IEEE Journal on Selected Areas in Communications (JSAC), vol. 29, no. 8, Sept. 2011.
[26] ——, "Random access compressed sensing over fading and noisy communication channels," to appear in IEEE Transactions on Wireless Communication.
[27] S. Reed, Y. Petillot, and J. Bell, "An automatic approach to the detection and extraction of mine features in sidescan sonar," Oceanic Engineering, IEEE Journal of, vol. 28, no. 1, pp. 90–105, 2003.
[28] N. Kumar, Q. Tan, and S. Narayanan, "Object classification in sidescan sonar images with sparse representation techniques," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 1333–1336.
[29] K. Kerse, F. Fazel, and M. Stojanovic, "Target localization and tracking in a random access sensor network," in Signals, Systems and Computers, 2013 Asilomar Conference on. IEEE, 2013, pp. 103–107.
[30] J. A. Tropp and S. J. Wright, "Computational methods for sparse solution of linear inverse problems," Proceedings of the IEEE, special issue, Applications of Sparse Representation and Compressive Sensing, vol. 98, no. 6, pp. 948–958, June 2010.
[31] D. Li and Y. H. Hu, "Energy-based collaborative source localization using acoustic microsensor array," EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 321–337, 2003.
[32] X. Sheng and Y.-H. Hu, "Maximum likelihood multiple-source localization using acoustic energy measurements with wireless sensor networks," Signal Processing, IEEE Transactions on, vol. 53, no. 1, pp. 44–53, 2005.
[33] W. Meng, W. Xiao, and L. Xie, "An efficient EM algorithm for energy-based multisource localization in wireless sensor networks," Instrumentation and Measurement, IEEE Transactions on, vol. 60, no. 3, pp. 1017–1027, 2011.
[34] Y. Cheng, "Mean shift, mode seeking, and clustering," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 17, no. 8, pp. 790–799, 1995.
[35] P. W. Holland and R. E. Welsch, "Robust regression using iteratively reweighted least-squares," Communications in Statistics - Theory and Methods, vol. 6, no. 9, pp. 813–827, 1977.
[36] F. Fazel, M. Fazel, and M. Stojanovic, "Random access sensor networks: Field reconstruction from incomplete data," in Information Theory and Applications Workshop (ITA), 2012. IEEE, 2012, pp. 300–305.
[37] J. Fawcett, A. Crawford, D. Hopkin, V. Myers, and B. Zerr, "Computer-aided detection of targets from the citadel trial Klein sonar data," Defence Research and Development Canada Atlantic TM, vol. 115, 2006.
[38] V. Myers and J. Fawcett, "A template matching procedure for automatic target recognition in synthetic aperture sonar imagery," Signal Processing Letters, IEEE, vol. 17, no. 7, pp. 683–686, 2010.
[39] C. T. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters," Computers, IEEE Transactions on, vol. 100, no. 1, pp. 68–86, 1971.
[40] J. Stack, "Automation for underwater mine recognition: current trends and future strategy," in Proceedings of SPIE, vol. 8017, 2011.
[41] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 4, pp. 509–522, 2002.
[42] J. Revaud, G. Lavoué, and A. Baskurt, "Improving Zernike moments comparison for optimal similarity and rotation angle retrieval," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 4, pp. 627–636, 2009.
[43] R. P. Wurtz, "Object recognition robust under translations, deformations, and changes in background," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, no. 7, pp. 769–775, 1997.
[44] D. G. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2.
IEEE, 1999, pp. 1150–1157.
[45] L. A. Torres-Mendez, J. C. Ruiz-Suarez, L. E. Sucar, and G. Gomez, "Translation, rotation, and scale-invariant object recognition," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 30, no. 1, pp. 125–130, 2000.
[46] D. Donoho, "High-dimensional data analysis: The curses and blessings of dimensionality," AMS Math Challenges Lecture, pp. 1–32, 2000.
[47] A. Khotanzad and Y. Hong, "Invariant image recognition by Zernike moments," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 12, no. 5, pp. 489–497, 1990.
[48] W.-Y. Kim and Y.-S. Kim, "A region-based shape descriptor using Zernike moments," Signal Processing: Image Communication, vol. 16, no. 1, pp. 95–102, 2000.
[49] S. Li, M.-C. Lee, and C.-M. Pun, "Complex Zernike moments features for shape-based image retrieval," Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, vol. 39, no. 1, pp. 227–237, 2009.
[50] O. Pizarro and H. Singh, "Toward large-area mosaicing for underwater scientific applications," Oceanic Engineering, IEEE Journal of, vol. 28, no. 4, pp. 651–672, 2003.
[51] N. Kumar, A. Lammert, B. Englot, F. Hover, and S. Narayanan, "Directional descriptors using Zernike moment phases for object orientation estimation in underwater sonar images," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 1025–1028.
[52] L. van der Molen, M. van Rossum, A. Ackerstaff, L. Smeele, C. Rasch, and F. Hilgers, "Pretreatment organ function in patients with advanced head and neck cancer: clinical outcome measures and patients' views," BMC Ear, Nose and Throat Disorders, vol. 9, no. 1, p. 10, 2009.
[53] M. Grimm and K. Kroschel, "Evaluation of natural emotions using self assessment manikins," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pp. 381–385.
[54] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S.
Narayanan, "Automatic intelligibility classification of sentence-level pathological speech," Computer Speech & Language, 2014.
[55] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1–38, 1977.
[56] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, vol. 45, no. 1-3, pp. 503–528, 1989.
[57] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization," ACM Transactions on Mathematical Software (TOMS), vol. 23, no. 4, pp. 550–560, 1997.
[58] E. Jones, T. Oliphant, and P. Peterson, "SciPy: Open source scientific tools for Python," http://www.scipy.org/, 2001.
[59] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. Narayanan, "Intelligibility classification of pathological speech using fusion of multiple subsystems," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[60] R. P. Clapham, L. van der Molen, R. van Son, M. van den Brekel, and F. J. Hilgers, "NKI-CCRT corpus: speech intelligibility before and after advanced head and neck cancer treated with concomitant chemoradiotherapy," 2012.
[61] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, "The Interspeech 2012 speaker trait challenge," in Proceedings of Interspeech. International Speech Communication Association (ISCA), 2012.
[62] R. W. Picard, Affective Computing. MIT Press, Cambridge, 1997, vol. 252.
[63] L.-H. Chen, H.-W. Hsu, L.-Y. Wang, and C.-W. Su, "Violence detection in movies," in Proc. of the Eighth IEEE International Conference on Computer Graphics, Imaging and Visualization (CGIV), 2011, pp. 119–124.
[64] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y.
Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553–1568, 2013.
[65] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbull, "Music emotion recognition: A state of the art review," in Proc. ISMIR, 2010, pp. 255–266.
[66] S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic detection of 'goal' segments in basketball videos," in Proc. of the Ninth ACM International Conference on Multimedia, 2001, pp. 261–269.
[67] Y.-G. Jiang, B. Xu, and X. Xue, "Predicting emotions in user-generated videos," in Proc. of the AAAI Conference on Artificial Intelligence (AAAI), 2014.
[68] H.-B. Kang, "Affective content detection using HMMs," in Proc. of the Eleventh ACM International Conference on Multimedia, 2003, pp. 259–262.
[69] M. Xu, L.-T. Chia, and J. Jin, "Affective content analysis in comedy and horror videos by audio emotional event detection," in Proc. of the IEEE International Conference on Multimedia and Expo (ICME), 2005, pp. 622–625.
[70] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi, "A supervised approach to movie emotion tracking," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 2376–2379.
[71] T. Fujishima, "Realtime chord recognition of musical sound: A system using Common Lisp Music," in Proc. of the International Computer Music Conference (ICMC), 1999, pp. 464–467.
[72] N. Kumar, R. Gupta, T. Guha, C. Vaz, M. Van Segbroeck, J. Kim, and S. S. Narayanan, "Affective feature design and predicting continuous affective dimensions from music," in MediaEval Workshop, Barcelona, Spain, 2014.
[73] J. Coalson, "FLAC - free lossless audio codec," Internet: http://flac.sourceforge.net, 2008.
[74] F. Bellard, M. Niedermayer et al., "FFmpeg," Available from: http://ffmpeg.org, 2012.
[75] G. C. Bruner, "Music, mood, and marketing," Journal of Marketing, pp. 94–104, 1990.
[76] T. Giannakopoulos, A. Pikrakis, and S.
Theodoridis, "Music tracking in audio streams from movies," in Proc. IEEE MMSP, 2008, pp. 950–955.
[77] "aubio," http://aubio.org/, accessed: 2015-06-30.
[78] B. Adams, C. Dorai, and S. Venkatesh, "Novel approach to determining tempo and dramatic story sections in motion pictures," in Proc. of the IEEE International Conference on Image Processing (ICIP), vol. 2, 2000, pp. 283–286.
[79] T. Guha, N. Kumar, S. S. Narayanan, and S. L. Smith, "Computationally deconstructing movie narratives: An informatics approach," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2264–2268.
[80] B. H. Detenber, R. F. Simons, and G. G. Bennett Jr, "Roll 'em!: The effects of picture motion on emotional responses," Journal of Broadcasting & Electronic Media, vol. 42, no. 1, pp. 113–127, 1998.
[81] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2006, pp. 428–441.
[82] B. D. Lucas, T. Kanade et al., "An iterative image registration technique with an application to stereo vision," in Proc. of the Seventh International Joint Conference on Artificial Intelligence (IJCAI), vol. 81, 1981, pp. 674–679.
[83] A. Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized TV," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90–100, 2006.
[84] P. Valdez and A. Mehrabian, "Effects of color on emotions," Journal of Experimental Psychology: General, vol. 123, no. 4, p. 394, 1994.
[85] B. Rudiak-Gould, "Huffyuv v2.1.1 manual," 2004.
[86] C. R. Plantinga and G. M. Smith, Passionate Views: Film, Cognition, and Emotion. Johns Hopkins University Press, 1999.
[87] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3476–3483.
[88] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1, pp. 37–52, 1987.
[89] A.P.DawidandA.M.Skene,“Maximumlikelihoodestimationofobservererror- ratesusingtheemalgorithm,”Appliedstatistics,JSTOR,pp.20–28,1979. [90] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy, “Supervised learning from multiple experts: whom to trust when everyoneliesabit,”inProceedingsofthe26thAnnualinternationalconferenceon machinelearning. ACM,2009,pp.889–896. [91] K. Audhkhasi and S. Narayanan, “A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 4, pp.769–783,2013. [92] Y. Yan, R. Rosales, G. Fung, M. W. Schmidt, G. H. Valadez, L. Bogoni, L. Moy, and J. G. Dy, “Modeling annotator expertise: Learning when everybody knows a bitofsomething.”inAISTATS,2010,pp.932–939. [93] F. Rodrigues, F. Pereira, and B. Ribeiro, “Learning from multiple annotators: Distinguishing good from random labelers,” Pattern Recognition Letters, vol. 34, no.12,pp.1428–1436,2013. 102 [94] Y. E. Kara, G. Genc, O. Aran, and L. Akarun, “Modeling annotator behaviors for crowdlabeling,”Neurocomputing,vol.160,pp.141–156,2015. 103
Abstract
Researchers often need to work with noisy human annotations that are inherently subjective and challenging for annotators to provide. In my thesis, I propose a model that accounts for the reliability of the annotations associated with different samples while training classification models. Reliability is modeled as a latent factor that controls the dependence between observed features and their corresponding annotated class labels. An expectation-maximization algorithm is used to estimate these latent reliability scores for maximum-entropy models in a mixture-of-experts-like framework. I test the robustness of the proposed approach on multiple speech, multimedia, and affective computing tasks and show that the method is able to exploit latent reliability information about the inherently noisy aspects of training data. Additionally, the reliability models lend themselves naturally to the crowdsourcing scenario, where the challenge lies in distilling the wisdom of a crowd of annotators from their individual opinions.
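The core idea in the abstract — treating per-sample reliability as a latent variable and estimating it with EM while training a classifier — can be illustrated with a deliberately simplified sketch. This is not the thesis's actual maximum-entropy mixture-of-experts model; it is a hypothetical two-component version (label either follows a logistic model or is uniform noise), and all function names, priors, and constants here are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def em_reliability(X, y, n_iter=20):
    """EM for a toy latent-reliability model: each annotation is either
    'reliable' (label follows a logistic model of the features) or
    'unreliable' (label is uniform noise, independent of features)."""
    n = len(y)
    r = np.full(n, 0.9)   # posterior P(sample i is reliable), init near 1
    pi = 0.9              # prior probability that a sample is reliable
    clf = LogisticRegression()
    for _ in range(n_iter):
        # M-step: refit the classifier, down-weighting unreliable samples
        clf.fit(X, y, sample_weight=r)
        # E-step: posterior reliability from how well the model explains y
        p_model = clf.predict_proba(X)[np.arange(n), y]
        p_noise = 0.5     # uniform over the two classes
        r = pi * p_model / (pi * p_model + (1 - pi) * p_noise)
        pi = r.mean()
    return clf, r

# toy data: two separable classes, with the last 10 labels flipped (noisy)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y[-10:] = 0  # corrupt some annotations
clf, r = em_reliability(X, y)
```

In this sketch the corrupted samples receive low posterior reliability, so they contribute little to the final decision boundary — the same intuition that drives the latent-reliability framing in the thesis, where the classifier is a maximum-entropy model rather than plain logistic regression.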
Asset Metadata
Creator: Kumar, Komath Naveen (author)
Core Title: Exploiting latent reliability information for classification tasks
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 09/28/2016
Defense Date: 08/01/2016
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: Bayesian models, latent variable models, machine learning, maximum entropy models, multiple annotator, OAI-PMH Harvest, reliability, wisdom of crowd
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Narayanan, Shrikanth (committee chair); Gratch, Jonathan (committee member); Ortega, Antonio (committee member)
Creator Email: knaveen87@gmail.com, komathnk@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-306905
Unique Identifier: UC11281189
Identifier: etd-KumarKomat-4821.pdf (filename); usctheses-c40-306905 (legacy record id)
Legacy Identifier: etd-KumarKomat-4821.pdf
Dmrecord: 306905
Document Type: Dissertation
Rights: Kumar, Komath Naveen
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA