ON OPTIMAL SIGNAL REPRESENTATION FOR STATISTICAL LEARNING AND PATTERN RECOGNITION

by

Jorge Silva

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2008

Copyright 2008 Jorge Silva

Dedication

To my beloved Andrea and Vicente.

Acknowledgments

There are many people who helped me during the course of this journey, and I want to take this opportunity to express my thanks.

First, to my advisor and mentor Shrikanth Narayanan, who has been an unlimited source of advice, knowledge, inspiration, and support. I want to express my deep gratitude to him for all these years of dedication guiding my work and professional formation. His guidance struck an excellent balance between a rigorous critical vision and a positive, enthusiastic attitude. He always found a way to constructively highlight the useful and interesting aspects of proposed ideas, while providing directions to refine the missing elements and put them on the right track. This was fundamental in making me believe in unexplored directions, refine and mature them, and finally bridge the difficult gap of making them happen. He shared his deep technical knowledge and vision with generosity and hope, which was fundamental in all phases of my progress. These words do not fairly express how important my advisor's role was during these years, not only on the professional side, which he fulfilled with supreme dedication, but as a human being who understands how difficult it is to make this journey compatible with his students' personal lives. In this respect, almost anonymously but with great determination, he took care to ensure a relaxed, human environment for all his students. This may seem almost unimportant in our highly competitive contemporary communities, but for me it was a fundamental dimension that helped me achieve my academic goals while keeping a balance with my health and personal life. In many ways his role helped me understand that academia is what I would like to follow as a professional career, and during the process he became an excellent model to follow. To conclude, I want to thank Shri for giving me the opportunity to be part of this great lab and for all the good things that came during the process. It is definitely one of the most amazing and unpredictably rich experiences of my life.

I want to thank the committee members C.-C. Jay Kuo, Fernando Ordóñez, Antonio Ortega and Giuseppe Caire for their time and disposition and for all the constructive comments provided. I also want to express my gratitude to all the fantastic professors of the Electrical Engineering Department at USC who, with their knowledge and passion, inspired my academic formation.

I also want to thank all the SAIL lab members who helped me in so many ways by discussing ideas, proofreading my foreign writings, and in general just for being good friends and lab mates. In this regard, special thanks to Joseph Tepperman, Sungbok Lee, Viktor Rozgic, Vivek Rangarajan, Shankar Ananthakrishnan, Carlos Busso, Abe Kazemzadeh, Matt Black, Tom Murray, Panayiotis G. Georgiou and Selina Chu. I also want to express my appreciation to the EE administrative staff, especially Diane Demetras, Tim Boston, Gloria Halfacre, Talyia Veal, Allan Weber and Mary Francis, for their professional dedication to the students. You were just fantastic in making our lives in the department smooth and pleasant.

Finally, let me thank my wife Andrea Peña and my few-months-old son Vicente for your company and love.
Thank you, Andrea, for your enormous effort and dedication during these five years. I wouldn't have been able to make it without you. Father, thanks for all the blessings.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction

Chapter 2: Minimum Probability of Error Signal Representation
  2.1 Introduction
    2.1.1 Contributions
  2.2 Preliminaries: Bayes Decision Approach
  2.3 Signal Representation Results for the Bayes Approach
    2.3.1 Tradeoff between Bayes and the Estimation Error
    2.3.2 Bayes-Estimation Error Tradeoff: Finite Alphabet Case
  2.4 Minimum Probability of Error Signal Representation (MPE-SR)
    2.4.1 Approximating the MPE-SR: The Cost-Fidelity Formulation
  2.5 Applications
    2.5.1 Classification Trees: CART and Minimum Cost Tree Pruning Algorithms
    2.5.2 Linear Discriminant Analysis
    2.5.3 Application to Transform-based Representations
  2.6 Summary
  2.7 Technical Derivations
    2.7.1 Proof of Theorem 2.2: Bayes-Estimation Error Tradeoff
    2.7.2 Proof of the Bayes-Estimation Error Tradeoff: Finite Alphabet Case
    2.7.3 Proof that Maximum Likelihood estimation is consistent with respect to a sequence of finite embedded representations
    2.7.4 Proof that Maximum Likelihood estimation is consistent for the Gaussian parametric assumption

Chapter 3: Optimized Wavelet Packet Decomposition based on Minimum Probability of Error Signal Representation
  3.1 Introduction
    3.1.1 Organization and Contribution
  3.2 Preliminaries
    3.2.1 Minimum Probability of Error Signal Representation (MPE-SR)
  3.3 Tree-Indexed Filter Bank Representations: The Wavelet Packets
    3.3.1 Tree-Indexed Basis Collections and Subspace Decomposition
    3.3.2 Rooted Binary Tree Representation
    3.3.3 Analysis-Measurement Process
  3.4 MPE-SR for Wavelet Packets: The Minimum Cost Tree Pruning Problem
    3.4.1 Tree-Embedded Feature Representation Results
    3.4.2 Studying Additive Properties for the Mutual Information Tree Functional
    3.4.3 Minimum Cost Tree Pruning Problem
    3.4.4 Connections with the Family Pruning Problem with General Size-based Penalty
  3.5 Non-parametric Estimation of the CMI Gains
    3.5.1 Quantized CMI Construction
    3.5.2 Darbellay-Vajda Data-Dependent Partition
  3.6 Experiments
    3.6.1 Frame Level Phone Classification from Speech Signal
    3.6.2 Analysis of the MI Gain and Optimal Tree Pruning
    3.6.3 Frame Level Phone Recognition
  3.7 Discussion and Future Work
  3.8 Technical Derivations
    3.8.1 Proof of Proposition 3.1
    3.8.2 Proof of Proposition 3.2
    3.8.3 Proof of Proposition 3.3
    3.8.4 Additive Property of the Mutual Information Tree Functional ρ(·): Theorem 3.1
    3.8.5 Proof of Proposition 3.4
    3.8.6 Dynamic Programming Solution for the Optimal Tree Pruning Problem: Theorem 3.2
    3.8.7 Proof of Theorem 3.3
    3.8.8 Asymptotically Sufficient Results for the Product CMI Construction

Chapter 4: Divergence Estimation based on Data-Dependent Partitions: Strong Consistency and Applications
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 The Divergence
    4.2.2 Partition schemes and Lugosi-Nobel complexity notions
    4.2.3 Vapnik-Chervonenkis type of inequality
  4.3 Data-Dependent Partition for Divergence Estimation: Problem Statement
  4.4 Universally Sufficient Data-Dependent Partitions
  4.5 Main Universal Consistency Result
  4.6 Statistical Equivalent Data-Dependent Partitions
    4.6.1 Gessaman's statistically equivalent partition
  4.7 Tree-Structured Partition Schemes
    4.7.1 Basic notation
    4.7.2 Tree-structured data-dependent partitions
    4.7.3 Statistical equivalent splitting rules
  4.8 Summary and Final Remarks
  4.9 Technical Derivations
    4.9.1 Proof of Theorem 4.1
    4.9.2 Proof of Theorem 4.2: Shrinking cell conditions for universally sufficient data-dependent partitions
    4.9.3 Proof of Lemma 4.2
    4.9.4 Details on some almost surely derivations
    4.9.5 Shrinking Cell Condition for the Gessaman's Partition
  4.10 Proof of Lemma 4.3
    4.10.1 Reducing the problem to a bounded measurable space
    4.10.2 Formulation of a sufficient condition
    4.10.3 -good median cuts
    4.10.4 Shrinking cell condition for balanced TSP

Chapter 5: On Universally Consistent Histogram based Estimates for the Mutual Information
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Data-Dependent Partition Schemes
    5.2.2 Vapnik-Chervonenkis Inequalities
  5.3 Mutual Information Estimate based on Data-Dependent Partitions
  5.4 Strong Consistency for the Mutual Information Estimate
    5.4.1 Controlling the Approximation Error
    5.4.2 The Main Consistency Result
  5.5 Applications
    5.5.1 Statistical Equivalent Data-Dependent Partitions
    5.5.2 Tree-Structured Partition Schemes
  5.6 Experimental Simulations
    5.6.1 Performance results and dependencies on design variables
    5.6.2 Bias and Standard deviation Analysis
  5.7 Final Discussion and Future Work
  5.8 Technical Derivations
    5.8.1 Proof of Proposition 5.1
    5.8.2 Proof of Theorem 5.2
    5.8.3 Proof of Theorem 5.3
    5.8.4 Proof of Theorem 5.4

References

List of Tables

3.1 Correct phone classification (CPC) rates (mean and standard deviation) for the minimum cost tree pruning (MCTP) solutions, using the proposed empirical Mutual Information (MI) and energy as fidelity criteria. As a reference, performances are provided for linear discriminant analysis (LDA) and non-parametric discriminant analysis (NDA). Performances obtained using 10-fold cross validation and a GMM-based classifier.

5.1 Bias for the non-parametric mutual information estimates (histogram-based using the Gessaman partition scheme (GESS), tree-structured partition (TSVQ), classical product partition (PROD) and a kernel plug-in estimate (KERN)) obtained from 1000 independent realizations of the empirical process. Performance values are reported with respect to different sampling lengths {11, 33, 58, 101, 179, 564, 3164, 5626} of the empirical process and for the different correlation coefficient scenarios, r = 0, 0.3, 0.5, 0.8, reported in this respective order.

5.2 Variance for the non-parametric mutual information estimates obtained from 1000 independent realizations of the empirical process. These results are associated with the bias values reported in Table 5.1 and they follow the same organization.

List of Figures

3.1 Filter bank decomposition given the tree structure of Wavelet packet bases for the case of the ideal Sinc half-band two channel filter bank. Case A: Octave-band filter bank characterizing a Wavelet type of basis representation. Case B: Short time Fourier transform (STFT) type of basis representation.

3.2 Topology of the full rooted binary tree T_full and representation of the tree-indexed subspace WP decomposition.

3.3 Darbellay-Vajda data-dependent partition algorithm for estimating the conditional mutual information.

3.4 Graphical representation of the CMI magnitude obtained by splitting the basic two channel filter bank across scale (level of decomposition, horizontal axis) and frequency bands (vertical axis) in the WP decomposition.

3.5 Example of the notation and topology of a tree-indexed WP representation.

4.1 A: Example of Gessaman's statistically equivalent partition for a two dimensional bounded space.
B: Example of a tree-structured data-dependent partition and its tree-indexed structure. Each internal node has a label indicating the spatial coordinate used to split its associated rectangular set.

5.1 Empirical mutual information trajectories of Gessaman and Tree-Structured Vector Quantization (TSVQ) schemes. Values are reported as a function of the sampling length of a realization of the empirical process and for different design parameter values for τ. Data simulated with correlation coefficient r = 0 and plotted in a log-scale of the sampling length.

5.2 Empirical mutual information trajectories of Gessaman and Tree-Structured Vector Quantization (TSVQ) schemes. Values are reported as a function of the sampling length of a realization of the empirical process and for different design parameter values for τ. Data simulated with correlation coefficient r = 0.5 and plotted in a log-scale of the sampling length.

5.3 Empirical mutual information trajectories for different estimation techniques. ML: maximum likelihood plug-in estimate under Gaussian parametric assumption; TSVQ: tree-structured data-dependent partition; GESS: Gessaman's statistically equivalent blocks; kernel: kernel plug-in estimate; PR-class: product classical histogram-based estimate. Performances are reported for one realization of the empirical process across sampling lengths in the log-scale. Data simulated with correlation coefficient r = 0.3.

5.4 Empirical mutual information trajectories for different estimation techniques using the same setting presented in Figure 5.3. Data simulated with correlation coefficient r = 0.8.

Abstract

This work presents contributions on two important aspects of the role of signal representation in statistical learning problems, in the context of deriving new methodologies and representations for speech recognition and the estimation of information theoretic quantities.

The first topic focuses on the problem of optimal filter bank selection using Wavelet Packets (WPs) for speech recognition applications. We propose new results showing an estimation-approximation error tradeoff across a sequence of embedded representations. These results were used to formulate the minimum probability of error signal representation (MPE-SR) problem as a complexity regularization criterion. Restricting this criterion to filter bank selection, algorithmic solutions are provided by exploiting the dyadic tree structure of WPs. These solutions are stipulated in terms of a set of conditional independence assumptions for the acoustic observation process, in particular a Markov tree property across the indexed structure of WPs. On the technical side, this work presents contributions on the extension of minimum cost tree pruning algorithms and their properties to affine tree functionals. For the experimental validation, a phone classification task ratifies the goodness of Wavelet Packets as an analysis scheme for non-stationary time-series processes, and the effectiveness of the MPE-SR in providing cost-effective discriminative filter bank solutions for pattern recognition.

The second topic addresses the problem of data-dependent partitions for the estimation of mutual information and Kullback-Leibler divergence (KLD). This work proposes general histogram-based estimates considering non-product data-driven partition schemes.
The main contribution is the stipulation of sufficient conditions that make these histogram-based constructions strongly consistent for both problems. The sufficient conditions consider combinatorial complexity indicators for partition families and the use of large deviation type inequalities (Vapnik-Chervonenkis inequalities). On the application side, two emblematic data-dependent constructions are derived from this result, one based on statistically equivalent blocks and the other on a tree-structured vector quantization scheme. A range of design values was stipulated to guarantee strongly consistent estimates for both frameworks. Furthermore, experimental results under controlled settings demonstrate the superiority of these data-driven techniques in terms of a bias-variance analysis when compared with conventional product histogram-based and kernel plug-in estimates.

Chapter 1
Introduction

This thesis provides contributions in two novel research directions of optimal signal representation for statistical learning problems. The work was in the context of proposing new methodologies and representations for speech recognition and for the estimation of information theoretic quantities. The first topic focuses on the role of structural bases for pattern recognition and their application to the largely unexplored problem of optimal filter bank selection for the acoustic speech process. The second focuses on the role of data-driven vector quantization in obtaining consistent histogram-based estimators for mutual information and Kullback-Leibler divergence (KLD).

In the first chapter the problem of minimum probability of error signal representation (MPE-SR) in pattern recognition is addressed in general terms. This thesis provides new results that justify, under general conditions, a formal tradeoff between estimation and approximation for a collection of embedded signal descriptions. These results are used to formulate the optimal signal representation in pattern recognition as a complexity regularization problem, and specifically as the solution of a cost-fidelity problem, using mutual information as the fidelity criterion. In the application of these ideas, formal connections with well-known results in pattern recognition are provided, specifically classification trees (the CART pruning algorithms) and Fisher linear discriminant analysis.

In the second chapter, this thesis extends the scope of the MPE-SR formulation to the domain of structural bases, Wavelet Packets being the specific case of study. Wavelet Packets (WPs) provide a family of dyadic tree-structured filter bank representations and attractive tools for the analysis of complex spatio-temporal phenomena. In particular, speech acoustic processing was the application focus in this work. In this domain, the problem of MPE-SR filter bank decomposition was addressed by studying properties of the fidelity and cost terms of the problem that admit algorithmic solutions: minimum cost tree pruning algorithms. These sufficient conditions were studied in detail and new results and algorithmic solutions were obtained. Evaluation of these solutions was conducted in a controlled phone classification task. In this context, the proposed algorithm provided solutions with the expected filter-bank structure, in terms of what is known from studies of the human auditory system, better performance than alternative filter bank selection schemes, and, interestingly, promising results when comparing performance with conventional acoustic features.

The last two chapters address the second main topic of this thesis.
This is the study of universal histogram-based estimates for mutual information and KLD. In these two contexts, we propose histogram-based data-driven constructions and sufficient conditions to make them universally consistent for the two problem scenarios. These consistency results use statistical learning tools to bound the estimation and approximation errors present in these problems. In particular, we extend the scope of the tools proposed by Lugosi and Nobel for the problems of classification and density estimation using data-dependent partitions. As expected from classical results on differential entropy estimation, stronger sufficient conditions are needed to get consistent information theoretic estimates than the ones needed to obtain consistent density estimation in the L1 sense. Furthermore, we have particularized these results for the well-adopted statistically equivalent blocks and dyadic tree-structured vector quantizations. In both application domains, we obtain a range of design values for which consistency is guaranteed. Furthermore, experimental evaluation is provided, where the superiority of data-driven techniques for the estimation of mutual information is clearly demonstrated with respect to classical non-adaptive histogram-based constructions and kernel plug-in estimates.

The thesis is structured in 4 self-contained chapters, presenting in detail the motivation, scope, and contributions of the aforementioned aspects of this work.

Chapter 2
Minimum Probability of Error Signal Representation

This work revisits and extends the problem of minimum probability of error signal representation (MPE-SR) within the Bayes decision framework, focusing on the issue of finite training data. Results are presented that justify addressing the MPE-SR as a complexity-regularized optimization problem, formally demonstrating the empirically well understood tradeoff between signal representation quality and learning complexity. Under specific modeling assumptions it is also shown that the MPE-SR reduces to two emblematic scenarios in pattern recognition: classification trees (CART tree pruning algorithms), and some versions of Fisher linear discriminant analysis. (Index Terms: signal representation for pattern recognition, Bayes decision framework, complexity regularization, mutual information, decision trees, linear discriminant analysis.)

2.1 Introduction

The notion of optimal signal representation is a fundamental problem that the signal processing community has been addressing from different angles and under multiple research contexts. The formulation and solution of this problem has provided significant contributions in the lossy compression, estimation and denoising realms [30, 57]. The overarching motivation is to find parsimonious representations for a particular family of signals, where the fact that few coordinates capture most of the target signal energy has proved to be of significant importance in improving compression and denoising techniques [57, 10].

In the context of pattern recognition, signal representation issues are naturally associated with feature extraction (FE). In contrast to compression and denoising scenarios, where the objective is to design bases that allow optimal representation of the observation source, for instance in the mean square error sense, in pattern recognition we seek representations that capture an unobserved finite alphabet phenomenon, the class identity, from the observed signal. Consequently, a suitable optimality criterion is associated with minimizing the risk of the resulting decision, as considered in the Bayes decision framework [25, 5].
Assuming that we know the observation-class distribution, the minimum risk decision can be obtained. However, in practice this distribution is typically unknown, which introduces the learning aspect of the problem. The Bayes framework proposes to estimate this distribution based on a finite amount of training data [25, 5]. It is well known that the accuracy of this estimation process is affected by the dimensionality of the observation space (the curse of dimensionality), which is proportional to the disagreement between the real and the estimated distributions, the estimation error effect of the learning problem. Hence, an integral part of feature extraction (FE) is to control the estimation error by finding good parsimonious signal transformations, particularly necessary in scenarios where the original raw observation measurements lie in a high-dimensional space and a limited amount of training data is available (relative to the raw dimension of the problem), as in most speech classification [53], image classification [26] and hyper-spectral classification scenarios [37, 41].

Despite the difficulty of finding feature representations of lower complexity that capture the most discriminant aspects of the full measurement-observation space, the problem is a well motivated one and good approximations have been presented under specific modeling assumptions [36, 52, 75, 53]. Nevertheless, there has not been a concrete general formulation of the ultimate problem, which is to find the minimum probability of error signal representation (MPE-SR) constrained on a given amount of training data or on any additional operational cost that may constrain the decision task. Such a formulation would provide better theoretical support and justification for the aforementioned FE problem in pattern recognition. New and concrete results in the direction of formalizing the MPE-SR problem have recently been presented by Vasconcelos [75], who formalizes a tradeoff between the Bayes error and an information-theoretic indicator of the estimation error and connects this result with the concept of optimal signal representation. The present work is motivated by, and is built upon, these ideas.

2.1.1 Contributions

The central result presented in this work is the stipulation of general sufficient conditions that guarantee a formal tradeoff between Bayes error and estimation error across sequences of embedded feature transformations. These sufficient conditions take into consideration not only the embedded structure of the feature representation family, but also the consistent nature of the family of empirical observation-class distributions estimated across the sequence of transformations, explicitly incorporating the role of the learning phase of the problem and significantly improving the result in [75].

Following that, the Bayes-estimation error tradeoff is used to formulate the MPE-SR problem as a complexity-regularized optimization, with an objective function that combines a fidelity indicator, which represents the Bayes error, and a cost term, associated with the complexity of the representation, which reflects the estimation error. We show that the solution of this problem relies on a particular sequence of representations, which is the solution of a cost-fidelity problem. Interestingly, restricting the problem and invoking some approximations, the well known CART pruning algorithm [5] and Fisher linear discriminant analysis [25] offer computationally efficient solutions for this cost-fidelity problem. Consequently, we are able to demonstrate that these well-known techniques are intrinsically addressing the MPE-SR problem.

The paper is organized as follows.
Section 2.2 introduces the general problem formulation, terminology, and a review of some results that will be used in this work. Section 2.3 presents the Bayes-estimation tradeoff and Section 2.4 the MPE-SR problem and its cost-fidelity approximation. Sections 2.5.1 and 2.5.2 show how the MPE-SR can be addressed practically in two emblematic problem scenarios: classification trees (CART pruning algorithms) and linear discriminant analysis.

2.2 Preliminaries: Bayes Decision Approach

Let $X:(\Omega,\mathcal{F},\mathbb{P})\to(\mathcal{X},\mathcal{F}_X)$ be an observation random vector taking values in a finite dimensional Euclidean space $\mathcal{X}=\mathbb{R}^K$, and $Y:(\Omega,\mathcal{F},\mathbb{P})\to(\mathcal{Y},\mathcal{F}_Y)$ be a class label random variable with values in a finite alphabet space $\mathcal{Y}$; $(\Omega,\mathcal{F},\mathbb{P})$ denotes the underlying probability space. Given that $\mathcal{X}=\mathbb{R}^K$, a natural choice for $\mathcal{F}_X$ is the Borel sigma field [34], and for $\mathcal{F}_Y$ the power set of $\mathcal{Y}$. Knowing the joint distribution $P_{X,Y}$ on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$, where $\sigma(\mathcal{F}_X\times\mathcal{F}_Y)$ refers to the product sigma field [34], the problem is to find a decision function $g(\cdot)$ from $\mathcal{X}$ to $\mathcal{Y}$ such that, for a given realization of $X$, we infer its discrete counterpart $Y$ with the minimum expected cost, or minimum risk, given by $\mathbb{E}_{X,Y}[l(g(X),Y)]$, where $l(y_1,y_2)$ denotes the risk of labeling an observation with the value $y_1$ when its true label is $y_2$, $\forall y_1,y_2\in\mathcal{Y}$. The minimum risk decision is called the Bayes decision rule, where for the emblematic 0-1 risk function, $l(y_1,y_2)=\delta(y_1,y_2)$, the Bayes rule in (2.1) minimizes the probability of error:

$$g_{P_{X,Y}}(\bar{x}) = \arg\max_{y\in\mathcal{Y}} P_{X,Y}(\bar{x},y), \quad \forall \bar{x}\in\mathcal{X}. \qquad (2.1)$$

In this case the minimum probability of error (MPE) can be expressed by [21]

$$L_{\mathcal{X}} = \mathbb{P}\left(\left\{u\in\Omega : g_{P_{X,Y}}(X(u))\neq Y(u)\right\}\right) = \mathbb{E}_{X,Y}\left[I_{\{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, g_{P_{X,Y}}(x)\neq y\}}(X,Y)\right] = 1-\mathbb{E}_X\left[\max_{i\in\mathcal{Y}} P_{Y|X}(i|X)\right], \qquad (2.2)$$

where $I_A(\cdot)$ is the indicator function of a set $A$. The subscript notation in $L_{\mathcal{X}}$ indicates that this is an indicator of the discrimination power of the observation space $\mathcal{X}$. The following lemma states a version of the well known result that a transformation of the observation space $\mathcal{X}$ cannot provide discrimination gain.

LEMMA 2.1 (Theorem 3, [75]) Consider $f:(\mathcal{X},\mathcal{F}_X)\to(\mathcal{X}',\mathcal{F}'_X)$ to be a measurable mapping. If we define $X'\equiv f(X)$ as a new observation random variable, with joint probability distribution $P_{X',Y}$ induced by $f(\cdot)$ and $P_{X,Y}$ [4], we have that

$$L_{\mathcal{X}'} \geq L_{\mathcal{X}}. \qquad (2.3)$$

From the lemma, it is natural to say that the transformation $f$ is a sufficient statistic for the inference problem if $L_{\mathcal{X}'}=L_{\mathcal{X}}$.

In practice we do not know the joint distribution $P_{X,Y}$. Instead we may have access to independent and identically distributed (i.i.d.) realizations of $(X,Y)$, $D_N\equiv\{(x_i,y_i): i\in\{1,..,N\}\}$, which in the Bayes approach are used to characterize an estimate of the joint observation-class distribution, the empirical distribution denoted by $\hat{P}_{X,Y}$. This estimated distribution $\hat{P}_{X,Y}$ is used to define the plug-in empirical Bayes rule, using (2.1), which we denote by $\hat{g}_{\hat{P}_{X,Y}}(\cdot)$. Note that the risk of the empirical Bayes rule in (2.4) differs from the Bayes error $L_{\mathcal{X}}$ as a consequence of what is called the estimation error effect in the learning process:

$$\mathbb{P}\left(\left\{u\in\Omega : \hat{g}_{\hat{P}_{X,Y}}(X(u))\neq Y(u)\right\}\right). \qquad (2.4)$$

It is well understood that the estimation error introduces performance degradation with respect to the Bayes error bound $L_{\mathcal{X}}$, and that the magnitude of this error is a function of some notion of complexity of the observation space [60, 37, 75]. This implies a strong relationship between the number of training examples and the complexity of the observation space, justifying the widely adopted dimensionality reduction during feature extraction.
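To make the distinction between (2.1), (2.2) and (2.4) concrete, the following is a minimal numerical sketch, not taken from the thesis, assuming a finite observation alphabet so that all quantities can be computed exactly; the joint pmf and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true joint pmf P_{X,Y} on a finite alphabet (|X| = 8, |Y| = 2).
P = rng.dirichlet(np.ones(16)).reshape(8, 2)

def bayes_rule(joint):
    """Eq. (2.1): g(x) = argmax_y P(x, y)."""
    return joint.argmax(axis=1)

def risk(rule, joint):
    """Probability of error of a decision rule under the true joint pmf."""
    return sum(joint[x, y] for x in range(joint.shape[0])
               for y in range(joint.shape[1]) if rule[x] != y)

L = risk(bayes_rule(P), P)                      # Bayes error, Eq. (2.2)

# Plug-in rule from N i.i.d. samples: empirical pmf -> empirical Bayes rule, Eq. (2.4).
N = 50
idx = rng.choice(16, size=N, p=P.ravel())
P_hat = np.bincount(idx, minlength=16).reshape(8, 2) / N
L_hat = risk(bayes_rule(P_hat), P)              # risk of the plug-in rule (>= L)

print(f"Bayes error L = {L:.3f}, plug-in rule risk = {L_hat:.3f}")
```

The gap between the two printed numbers is exactly the estimation error effect discussed above.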
In this work, we focus on studying aspects of optimal feature representation for classification, assuming the Bayes decision approach and that the learning framework satisfies certain assumptions that will be detailed in the next section. Under these assumptions, we can consider two signal representation aspects that affect the performance of a Bayes decision framework. One relates to the signal representation quality, associated with the Bayes error, and the other to the signal space complexity, which quantifies the effect of the estimation error in the problem. The formalization of this tradeoff and its modeling implications are the main topics addressed in the following sections.

2.3 Signal Representation Results for the Bayes Approach

Let us start with a result that provides an analytical expression to bound the performance deviation of the empirical Bayes rule with respect to the Bayes error.

THEOREM 2.1 (Theorem 4, [75]) Let us consider the joint observation-class distribution $P_{X,Y}$ and its empirical counterpart $\hat{P}_{X,Y}$, assuming that they only differ in their class conditional probabilities (i.e., $\hat{P}_Y(\{y\})=P_Y(\{y\})$, $\forall y\in\mathcal{Y}$). Then the following inequality holds, involving the performance of the empirical Bayes rule $\hat{g}(\cdot)$ and the Bayes bound, Eq. (2.2):

$$\mathbb{P}\left(\left\{u\in\Omega : \hat{g}_{\hat{P}_{X,Y}}(X(u))\neq Y(u)\right\}\right) - L_{\mathcal{X}} \leq \Delta g_{MAP}(\hat{P}_{X,Y}), \qquad (2.5)$$

where

$$\Delta g_{MAP}(\hat{P}_{X,Y}) \equiv \sqrt{2\ln 2}\,\sum_{y\in\mathcal{Y}} P_Y(\{y\})\cdot\sqrt{\min\left\{D\!\left(P_{X|Y}(\cdot|y)\,\|\,\hat{P}_{X|Y}(\cdot|y)\right),\, D\!\left(\hat{P}_{X|Y}(\cdot|y)\,\\|\,P_{X|Y}(\cdot|y)\right)\right\}} \qquad (2.6)$$

and $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence (KLD) [40] between two probability distributions on $(\mathcal{X},\mathcal{F}_X)$, given by

$$D(P_1\|P_2) = \int_{\mathcal{X}} p_1(x)\cdot\log\frac{p_1(x)}{p_2(x)}\,dx,$$

where $p_1$ and $p_2$ are the pdfs of $P_1$ and $P_2$, respectively (we assume that the distributions are absolutely continuous with respect to the Lebesgue measure, so the KLD can be defined using probability density functions [32]).

Note that $\Delta g_{MAP}(\hat{P}_{X,Y})$ is the $P_Y$-average of a non-decreasing function of the KLD between the conditional class probabilities and their empirical counterparts. The KLD has a well known interpretation as a statistical discrimination measure between two probabilistic models [40, 32, 16]; here, however, it is an indicator of the performance deviation, relative to the fundamental performance bound, caused by the statistical mismatch incurred in estimating the class-conditional probabilities. Vasconcelos proves this result for the case when the classes are equally likely [75] (Theorem 4). The proof of Theorem 2.1 is a simple extension of that result and is not reported here for space considerations.

Remark 2.1 A necessary condition for $\Delta g_{MAP}(\hat{P}_{X,Y})$ to be well defined is that the empirical conditional class distributions are absolutely continuous with respect to the other associated distributions, see (2.6). This assumption is not unreasonable because the empirical joint distribution is induced by i.i.d. realizations of the true distribution. As a result, it is assumed for the rest of the paper.

The next result shows the extension of Theorem 2.1 to the case when the observation random variable $X$ takes values in a finite alphabet set (or quantizations of $\mathcal{X}$), denoted by $\mathcal{A}_X$.

COROLLARY 2.1 Let $(X,Y)$ be a random vector taking values in the finite product space $\mathcal{A}_X\times\mathcal{Y}$, with $P_{X,Y}$ and $\hat{P}_{X,Y}$ being the probability and the empirical probability, respectively. Assuming that $P_{X,Y}$ and $\hat{P}_{X,Y}$ only differ in their class-conditional probabilities, then (2.5) and (2.6) hold, where $D(P_{X|Y}(\cdot|y)\|\hat{P}_{X|Y}(\cdot|y))$ in this context denotes the discrete version of the KLD [40, 32]:

$$D\!\left(P_{X|Y}(\cdot|y)\,\|\,\hat{P}_{X|Y}(\cdot|y)\right) = \sum_{x\in\mathcal{A}_X} P_{X|Y}(x|y)\cdot\log\left(\frac{P_{X|Y}(x|y)}{\hat{P}_{X|Y}(x|y)}\right).$$
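To make the finite alphabet bound concrete, here is a minimal sketch, not from the thesis, that evaluates $\Delta g_{MAP}$ in (2.6) for frequency-count estimates of the class-conditional pmfs. The alphabet size, priors and sample size are illustrative, and a small count floor is used so the empirical pmfs stay absolutely continuous, as required by Remark 2.1.

```python
import numpy as np

rng = np.random.default_rng(1)

def kld(p, q):
    """Discrete KLD D(p||q) in bits (Corollary 2.1)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def delta_g_map(P_cond, P_hat_cond, prior):
    """Eq. (2.6): sqrt(2 ln 2) * sum_y P_Y(y) * sqrt(min{D(P||Phat), D(Phat||P)})."""
    total = 0.0
    for y, py in enumerate(prior):
        d = min(kld(P_cond[y], P_hat_cond[y]), kld(P_hat_cond[y], P_cond[y]))
        total += py * np.sqrt(d)
    return np.sqrt(2 * np.log(2)) * total

# Illustrative setup: two classes over an alphabet of 16 symbols.
prior = np.array([0.5, 0.5])
P_cond = np.stack([rng.dirichlet(np.ones(16)), rng.dirichlet(np.ones(16))])

# Frequency-count (ML) estimates from N samples per class, with a small floor
# so the divergences in (2.6) remain finite.
N = 200
counts = np.stack([np.bincount(rng.choice(16, N, p=P_cond[y]), minlength=16)
                   for y in range(2)]) + 1e-3
P_hat_cond = counts / counts.sum(axis=1, keepdims=True)

print("estimation-error bound Delta g_MAP =", delta_g_map(P_cond, P_hat_cond, prior))
```

Increasing N drives the bound toward zero, which is the behavior exploited in the tradeoff results that follow.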
2.3.1 Tradeoff between Bayes and the Estimation Error

The following result introduces aspects of signal representation into the classification problem by characterizing a tradeoff between the Bayes error and the estimation error. Before that, we need to introduce the notion of an embedded space sequence, which provides a sort of order relationship among a family of feature observation spaces, and the notion of consistent probability measures associated with an embedded space sequence.

Definition 2.1 Let $\{F_i(\cdot): i=1,..,n\}$ be a family of measurable transformations with the same domain $(\mathcal{X},\mathcal{F}_X)$, taking values in $\{\mathcal{X}_1,..,\mathcal{X}_n\}$, where $\{\mathcal{X}_1,..,\mathcal{X}_n\}$ has increasing finite dimensionality, i.e., $\dim(\mathcal{X}_i)<\dim(\mathcal{X}_{i+1})$, $\forall i\in\{1,..,n-1\}$. We say that $\{F_i(\cdot): i=1,..,n\}$ is dimensionally embedded if, $\forall i\in\{1,..,n-1\}$, there exists $\pi_{i+1,i}(\cdot)$, a measurable mapping from $(\mathcal{X}_{i+1},\mathcal{F}_{i+1})$ to $(\mathcal{X}_i,\mathcal{F}_i)$ (for all practical purposes $\mathcal{X}_i$ is a finite dimensional Euclidean space and $\mathcal{F}_i$ refers to the Borel sigma field), such that

$$F_i(x) = \pi_{i+1,i}(F_{i+1}(x)), \quad \forall x\in\mathcal{X}.$$

In this context, we also say that $\{\mathcal{X}_1,..,\mathcal{X}_n\}$ is dimensionally embedded with respect to $\{F_i(\cdot): i=1,..,n\}$ and $\{\pi_{i+1,i}(\cdot): i=1,..,n-1\}$.

Definition 2.2 Let $\{\mathcal{X}_i: i=1,..,n\}$ be a sequence of dimensionally embedded spaces, where $\pi_{i+1,i}:(\mathcal{X}_{i+1},\mathcal{F}_{i+1})\to(\mathcal{X}_i,\mathcal{F}_i)$ is the measurable mapping stated in Definition 2.1. Associated with those spaces, let us consider a probability measure $\hat{P}_i$ defined on $(\mathcal{X}_i,\mathcal{F}_i)$, $\forall i\in\{1,..,n\}$. The family of probability measures $\{\hat{P}_i: i=1,..,n\}$ is consistent with respect to the embedded sequence if, $\forall i,j\in\{1,..,n\}$, $i<j$, $\forall B\in\mathcal{F}_i$,

$$\hat{P}_i(B) = \hat{P}_j\!\left(\pi_{j,i}^{-1}(B)\right),$$

where $\pi_{j,i}(\cdot)\equiv\pi_{j,j-1}(\pi_{j-1,j-2}(\cdots\pi_{i+1,i}(\cdot)\cdots))$.

Definition 2.2 is equivalent to saying that if we induce a probability measure on $(\mathcal{X}_i,\mathcal{F}_i)$ using the measurable mapping $\pi_{j,i}(\cdot)$ and the probability measure $\hat{P}_j$ on the space $(\mathcal{X}_j,\mathcal{F}_j)$, the induced measure is equal to $\hat{P}_i$. Consequently, the probabilistic description of the sequence of embedded spaces is univocally characterized by the most informative probability space, $(\mathcal{X}_n,\mathcal{F}_n,\hat{P}_n)$, and the family of measurable mappings $\{\pi_{j,i}(\cdot): j>i\}$ of the embedded structure presented in Definition 2.1.

THEOREM 2.2 Let $(X,Y)$ be the joint observation-class random variables with distribution $P_{X,Y}$ on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$, where $\mathcal{X}=\mathbb{R}^K$ for some $K>0$. Let $\{F_i(\cdot): i=1,..,n\}$ be a sequence of representation functions, with $F_i(\cdot):(\mathcal{X},\mathcal{F}_X)\to(\mathcal{X}_i,\mathcal{F}_i)$ measurable, $\forall i\in\{1,..,n\}$. In addition, let us assume that $\{F_i(\cdot): i=1,..,n\}$ is a family of dimensionally embedded transformations, satisfying $F_i(\cdot)=\pi_{j,i}(F_j(\cdot))$ for all $j>i$ in $\{1,..,n\}$. Then, considering the family of observation random variables $\{X_i=F_i(X): i=1,..,n\}$, the Bayes error satisfies the following relationship:

$$L_{\mathcal{X}_{i+1}} \leq L_{\mathcal{X}_i}, \quad \forall i\in\{1,..,n-1\}. \qquad (2.7)$$

If in addition we have a family of empirical probability measures $\{\hat{P}_{X_i,Y}: i=1,..,n\}$, with $\hat{P}_{X_i,Y}$ on $(\mathcal{X}_i\times\mathcal{Y},\sigma(\mathcal{F}_i\times\mathcal{F}_Y))$ and conditional class distribution families $\{\hat{P}_{X_i|Y}(\cdot|y): i=1,..,n\}$ consistent with respect to $\{\mathcal{X}_i: i=1,..,n\}$, $\forall y\in\mathcal{Y}$, then the following relationship for the estimation error applies:

$$\Delta g_{MAP}(\hat{P}_{X_i,Y}) \leq \Delta g_{MAP}(\hat{P}_{X_{i+1},Y}), \quad \forall i\in\{1,..,n-1\}. \qquad (2.8)$$

This result presents a formal tradeoff between the Bayes and estimation errors by considering a family of representations of monotonically increasing complexity. In other words, by increasing complexity we improve the theoretical performance bound (the Bayes error) that we could achieve, but as a consequence we increase the estimation error, which upper bounds the maximum deviation from the Bayes error bound, per Theorem 2.1.
The proof of this result is presented in Appendix 2.7.1.

The following corollary of Theorem 2.2 shows the important case when the embedded sequence of spaces is induced by coordinate projections (equivalent to a feature selection approach). In this scenario, the consistency condition on the empirical distributions can be considered natural and is consequently implicitly assumed in the statement. This result was originally presented in [75] (Theorem 5).

COROLLARY 2.2 (Theorem 5, [75]) Let $\mathcal{X}=\mathbb{R}^K$ and let the family of coordinate projections $\pi^K_m(\cdot):\mathbb{R}^K\to\mathbb{R}^m$, $m\leq K$, be given by $\pi^K_m(x_1,...,x_m,..,x_K)=(x_1,...,x_m)$, $\forall(x_1,...,x_K)\in\mathbb{R}^K$. Let $P_{X,Y}$ and $\hat{P}_{X,Y}$ be the joint probability measure and its empirical counterpart, respectively, defined on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$. Given that the coordinate projections are measurable, it is possible to induce those distributions on the sequence of embedded subspaces $\{\mathcal{X}_1,..,\mathcal{X}_K\}$ characterized by $\mathcal{X}_i=\pi^K_i(\mathcal{X})$, $\forall i\in\{1,..,K\}$. Then the Bayes-estimation error tradeoff is satisfied, i.e., $L_{\mathcal{X}_{i+1}}\leq L_{\mathcal{X}_i}$ and $\Delta g_{MAP}(\hat{P}_{X_{i+1},Y})\geq\Delta g_{MAP}(\hat{P}_{X_i,Y})$, $\forall i\in\{1,..,K-1\}$.

From this corollary, a natural approach to ensure that the family of empirical class conditional distributions $\{\hat{P}_{X_i|Y}(\cdot|y): i=1,..,n\}$ is consistent across a dimensionally embedded space sequence $\{\mathcal{X}_i: i=1,..,n\}$ is to constructively induce $\hat{P}_{X_i|Y}(\cdot|y)$ using the empirical distribution on the most informative representation space, $\hat{P}_{X_n|Y}(\cdot|y)$ on $(\mathcal{X}_n,\mathcal{F}_n)$, and the measurable mappings $\pi_{n,i}(\cdot)$, $\forall i<n$, associated with the embedded space sequence, per Definition 2.1. For instance, this construction is appealing when assuming parametric class conditional distributions, like Gaussian mixture models (GMMs), and standard families of transformations, like linear operators, where inducing those distributions implies simple operations on the parameters of $\hat{P}_{X_n|Y}(\cdot|y)$. This type of construction was considered in [75] and will also be illustrated in Section 2.5.2.

2.3.2 Bayes-Estimation Error Tradeoff: Finite Alphabet Case

As for Theorem 2.1, we also extend Theorem 2.2 to the case when the family of representation functions $\{F_i(\cdot): i=1,..,n\}$ takes values in finite alphabet sets and consequently induces quantizations of $\mathcal{X}$. In this scenario, the concept of embedded representation is better characterized by properties of the induced family of partitions. The following definition formalizes this idea.

Definition 2.3 Let us consider the space $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$ and a family of measurable functions $\{F_i(\cdot): i=1,..,n\}$ taking values in finite alphabet sets $\{\mathcal{A}_i: i=1,..,n\}$, i.e., $F_i(\cdot):(\mathcal{X},\mathcal{F}_X)\to(\mathcal{A}_i,2^{\mathcal{A}_i})$, with $|\mathcal{A}_i|<\infty$. The family of representations $\{F_i(\cdot): i=1,..,n\}$ is embedded if $|\mathcal{A}_i|<|\mathcal{A}_{i+1}|$, $\forall i\in\{1,..,n-1\}$, and $\forall j,i\in\{1,..,n\}$, $j>i$, there exists a function $\pi_{j,i}(\cdot):\mathcal{A}_j\to\mathcal{A}_i$ such that

$$F_i(x) = \pi_{j,i}(F_j(x)), \quad \forall x\in\mathcal{X}.$$

Remark 2.2 Every representation function $F_i(\cdot)$ produces a quantization of $\mathcal{X}$ given by $Q_{F_i}\equiv\{F_i^{-1}(\{a\}): a\in\mathcal{A}_i\}\subset\mathcal{F}_X$, where the embedded condition implies that, $\forall i,j$, $1\leq i<j\leq n$, $Q_{F_j}$ is a refinement of $Q_{F_i}$ ($\bar{Q}$ is a refinement of $Q$ if, $\forall A\in Q$, $\exists\,\bar{Q}_A\subset\bar{Q}$ such that $A=\bigcup_{B\in\bar{Q}_A}B$; notation $Q_{F_i}\preceq Q_{F_j}$), and then $Q_{F_1}\preceq Q_{F_2}\preceq\cdots\preceq Q_{F_n}$.
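As a concrete illustration of Definitions 2.2 and 2.3 (and anticipating Proposition 2.1 below), the following minimal sketch, with an illustrative scalar quantizer that is not taken from the thesis, builds a fine partition $Q_{F_2}$, coarsens it through a mapping $\pi_{2,1}$ to obtain $Q_{F_1}$, and checks that frequency-count (ML) estimates are consistent: aggregating the fine-cell empirical masses through $\pi_{2,1}$ reproduces the coarse-cell empirical masses.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)                      # i.i.d. observations on X = R

# F2: fine quantizer with 8 cells; F1 = pi_21 o F2: coarse quantizer with 4 cells.
edges = np.linspace(-3, 3, 7)                 # 8 cells on the real line
F2 = lambda s: np.digitize(s, edges)          # values in {0,...,7}
pi_21 = lambda a: a // 2                      # merges pairs of adjacent fine cells
F1 = lambda s: pi_21(F2(s))                   # values in {0,...,3}, embedded by construction

# Frequency-count (ML) empirical pmfs on each alphabet.
p2_hat = np.bincount(F2(x), minlength=8) / x.size
p1_hat = np.bincount(F1(x), minlength=4) / x.size

# Consistency (Definition 2.2): P1_hat(B) = P2_hat(pi_21^{-1}(B)) for every coarse cell B.
p1_induced = np.array([p2_hat[pi_21(np.arange(8)) == b].sum() for b in range(4)])
assert np.allclose(p1_hat, p1_induced)
print("coarse ML pmf:", p1_hat, "\ninduced from fine pmf:", p1_induced)
```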
For the next result we also make use of the assumption of consistency for the empirical distributions across a sequence of embedded representations, which extends naturally from the continuous case presented in Definition 2.2.

THEOREM 2.3 Let $(X,Y)$ be the joint observation-class random vector and $\{F_i(\cdot): i=1,..,n\}$ a family of embedded representations taking values in finite alphabet sets $\{\mathcal{A}_i: i=1,..,n\}$. Considering the quantized observation random variables $\{X_i\equiv F_i(X): i=1,..,n\}$, the Bayes error satisfies $L_{\mathcal{A}_{i+1}}\leq L_{\mathcal{A}_i}$, $\forall i\in\{1,..,n-1\}$. If in addition we have empirical probabilities $\hat{P}_{X_i,Y}$ on the family of representation spaces $\mathcal{A}_i\times\mathcal{Y}$ (in this case we consider the power set of $\mathcal{A}_i\times\mathcal{Y}$ as the sigma field, and consequently omit it), with conditional class probabilities $\{\hat{P}_{X_i|Y}(\cdot|y): i=1,..,n\}$ consistent with respect to $\{F_i(\cdot): i=1,..,n\}$, $\forall y\in\mathcal{Y}$, then the estimation error satisfies $\Delta g_{MAP}(\hat{P}_{X_{i+1},Y})\geq\Delta g_{MAP}(\hat{P}_{X_i,Y})$, $\forall i\in\{1,..,n-1\}$. (The proof is presented in Appendix 2.7.2.)

Note that in this case the tradeoff is obtained as a function of the cardinality of these spaces. In particular, the cardinality is the natural choice for characterizing feature complexity, because it is proportional to the estimation error across the embedded sequence of spaces. The following proposition states the validity of the consistency condition for the important scenario when the empirical distribution is obtained using the maximum likelihood (ML) criterion (frequency counts).

Proposition 2.1 For a given amount of training data, i.i.d. realizations of $(X,Y)$, the ML estimator of $P_{X_i|Y}(\cdot|y)$, $\forall y\in\mathcal{Y}$, obtained on the range of a family of finite alphabet embedded representations $\{F_i(\cdot): i=1,..,n\}$, per Definition 2.3, satisfies the consistency condition stated in Definition 2.2. (The proof is presented in Appendix 2.7.3.)

It is important to emphasize that having a family of embedded representations, in either the continuous or the finite alphabet version, is not enough to establish the result about the evolution of the estimation error across this embedded sequence of increasing complexity, Theorems 2.2 and 2.3. The additional necessary element is a consistent family of empirical distributions (see the proofs of the theorems for details). This last condition is clearly a function of the learning methodology used for estimating the conditional class distributions.

As explained in [75], this tradeoff formally justifies the fact that, in the process of dimensionality-cardinality reduction, a better estimate of the underlying observation-class distribution is obtained, in the KLD sense, at the expense of increasing the underlying Bayes error. In particular, from these results, by constraining attention to a sequence of embedded representations there is one that minimizes the probability of error and, from the results presented here, it is the one that achieves the optimal Bayes-estimation error tradeoff. Hence, it is natural to think that, given a rich collection of feature transformations for $X$, not necessarily embedded, there is one for which this tradeoff between "representation quality" and "complexity" achieves an optimal solution, a solution that is connected with the MPE-SR problem. This is the topic addressed in the next section.
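Before moving on, here is a minimal simulation sketch, illustrative and not from the thesis, of the tradeoff formalized by Theorem 2.2 and Corollary 2.2: with a small training set, plug-in Gaussian classifiers fit on nested coordinate projections of increasing dimension typically first improve and then degrade in test error, reflecting the competing Bayes-error and estimation-error terms. The data model, sample sizes and regularization constant are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n_train, n_test = 20, 40, 20000

# Two Gaussian classes in R^K; only the first few coordinates are discriminative.
mu = np.zeros(K); mu[:3] = 1.5
sample = lambda n: (np.r_[rng.normal(mu, 1.0, (n, K)), rng.normal(0.0, 1.0, (n, K))],
                    np.r_[np.ones(n, int), np.zeros(n, int)])
Xtr, ytr = sample(n_train // 2)
Xte, yte = sample(n_test // 2)

def plugin_error(m):
    """Plug-in Gaussian Bayes rule on the first m coordinates (Corollary 2.2 setting)."""
    A, B = Xtr[ytr == 1, :m], Xtr[ytr == 0, :m]
    mus = [A.mean(0), B.mean(0)]
    covs = [np.cov(A, rowvar=False) + 1e-3 * np.eye(m),
            np.cov(B, rowvar=False) + 1e-3 * np.eye(m)]
    def loglik(Z, mu_c, S):
        d = Z - mu_c
        return -0.5 * (np.einsum('ij,ij->i', d @ np.linalg.inv(S), d)
                       + np.linalg.slogdet(S)[1])
    pred = (loglik(Xte[:, :m], mus[0], covs[0])
            > loglik(Xte[:, :m], mus[1], covs[1])).astype(int)
    return np.mean(pred != yte)

for m in (1, 2, 3, 5, 10, 20):
    print(f"projection dim {m:2d}: test error = {plugin_error(m):.3f}")
```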
2.4 Minimum Probability of Error Signal Representation (MPE-SR)

In this section the above signal representation results are used to address the MPE-SR problem. Let us consider again $\{(x_i,y_i): i=1,..,N\}$, i.i.d. realizations of the observation-class random variables $(X,Y)$ with distribution $P_{X,Y}$ on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$. In addition, let us consider a family of measurable functions $\mathcal{D}$, where any $f(\cdot)\in\mathcal{D}$ is defined on $\mathcal{X}$ and takes values in a transform space $\mathcal{X}_f$. Every representation function $f(\cdot)$ induces an empirical distribution $\hat{P}_{X_f,Y}$ on $(\mathcal{X}_f\times\mathcal{Y},\sigma(\mathcal{F}_f\times\mathcal{F}_Y))$, based on the training data and an implicit learning approach, and hence the empirical Bayes rule

$$\hat{g}_f(x) = \arg\max_{y\in\mathcal{Y}} \hat{P}_{X_f,Y}(x,y), \quad \forall x\in\mathcal{X}_f. \qquad (2.9)$$

Then the oracle MPE-SR problem reduces to

$$f^* = \arg\min_{f\in\mathcal{D}}\; \mathbb{E}_{X,Y}\left[I_{\{(x,y)\in\mathcal{X}\times\mathcal{Y}:\, \hat{g}_f(f(x))\neq y\}}(X,Y)\right], \qquad (2.10)$$

where the last expected value is taken with respect to the true joint distribution $P_{X,Y}$ on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{F}_X\times\mathcal{F}_Y))$. Note that from Lemma 2.1 and the optimality of the Bayes rule, $\forall f(\cdot)\in\mathcal{D}$, $\mathbb{E}_{X,Y}\left[I_{\{(x,y):\, \hat{g}_f(f(x))\neq y\}}(X,Y)\right]\geq L_{\mathcal{X}_f}\geq L_{\mathcal{X}}$, where $L_{\mathcal{X}_f}$ denotes the Bayes error associated with $\mathcal{X}_f$. The MPE criterion therefore tries to find the representation framework whose performance is the closest to $L_{\mathcal{X}}$, the fundamental bound of the problem. Using Theorem 2.1, where an upper bound for the risk of the empirical Bayes rule was derived, i.e., $\mathbb{E}_{X,Y}\left[I_{\{(x,y):\, \hat{g}_f(f(x))\neq y\}}(X,Y)\right]\leq\Delta g_{MAP}(\hat{P}_{X_f,Y})+L_{\mathcal{X}_f}$, $\forall f\in\mathcal{D}$, we can then follow the direction proposed by the structural risk minimization (SRM) principle [71] to reduce (2.10) to the following optimization problem:

$$f^* \approx \arg\min_{f\in\mathcal{D}}\; \Delta g_{MAP}(\hat{P}_{X_f,Y}) + \left(L_{\mathcal{X}_f}-L_{\mathcal{X}}\right). \qquad (2.11)$$

Here we have introduced the normalization factor $L_{\mathcal{X}}$ to make explicit that this regularization problem implies finding the optimal tradeoff between approximation quality, $L_{\mathcal{X}_f}-L_{\mathcal{X}}$, and estimation error, $\Delta g_{MAP}(\hat{P}_{X_f,Y})$. The MPE-SR is thus naturally formulated as a complexity regularized optimization problem whose objective function consists of a weighted combination of a fidelity criterion, reflecting the Bayes error bound, and a cost term, penalizing the complexity of the representation scheme. Note that the solution of this optimization problem is an oracle type of result, because neither the fidelity nor the cost term in (2.11) can be obtained directly in practice; both require knowledge of the true distributions. Appropriate approximations for the fidelity and cost terms are needed to address this problem in practice.

2.4.1 Approximating the MPE-SR: The Cost-Fidelity Formulation

For approximating $\Delta g_{MAP}(\hat{P}_{X_f,Y})$, from Theorems 2.2 and 2.3 we have that this complexity indicator is proportional to the dimensionality or cardinality of the representation, respectively, and consequently a function proportional to those terms can be adopted. On the other hand, for the fidelity term $L_{\mathcal{X}_f}$, the first natural candidate to consider is the empirical risk (ER) [71, 21] associated with the family of empirical Bayes rules in (2.9). In favor of this choice is the existence of distribution free bounds that control the uniform deviation of the ER with respect to the risk (the celebrated Vapnik-Chervonenkis inequality [21, 71]). However, this choice of fidelity indicator raises the problem of addressing the resulting complexity regularized ER minimization, a problem that has an algorithmic solution only in very restrictive settings. In this regard, Section 2.5.1 presents an emblematic scenario where this problem can be efficiently solved for a family of tree-structured vector quantizations.

However, in numerous important cases the ER is impractical because the solution of the resulting complexity regularization requires an exhaustive search. An alternative for approximating the Bayes risk $L_{\mathcal{X}_f}$ is to use one of the information theoretic quantities such as the family of Ali-Silvey distances [35, 54], formally justified for the binary hypothesis testing problem, or the widely adopted mutual information (MI) [27, 53, 13, 39].
It is well known that MI and probability of error are connected by Fano's inequality, and tightness has been shown asymptotically by the second Shannon coding theorem [16, 32]. Importantly, in our problem scenario MI satisfies the same monotonic behavior under a sequence of embedded transformations as the Bayes risk; on the practical side, the empirical MI, denoted by $\hat{I}(X_f,Y)$ (for every $f\in\mathcal{D}$ the empirical MI $\hat{I}(X_f,Y)$ can be obtained from the available empirical distribution $\hat{P}_{X_f,Y}$), can under some problem settings offer algorithmic solutions for the resulting complexity regularized problem. One example of this is presented in the following sections and another was recently shown in [64] for addressing the MPE filter bank decomposition using Wavelet Packets (WPs).

Returning to the problem, generically denoting by $\hat{I}(f)$ and $R(f)$ the approximated fidelity and cost terms for $f\in\mathcal{D}$, respectively, (2.11) can be approximated by

$$f^*(\lambda) = \arg\min_{f\in\mathcal{D}}\; \Psi(\hat{I}(f)) + \lambda\cdot\Phi(R(f)), \qquad (2.12)$$

where, considering the tendency of the new fidelity-cost indicators, $\Psi(\cdot)$ should be a strictly decreasing real function and $\Phi(\cdot)$ a strictly increasing function from $\mathbb{N}$ to $\mathbb{R}$. Noting that the real dependency between Bayes and estimation errors in terms of our new fidelity-complexity values, $\hat{I}(f)$ and $R(f)$, is hidden and, furthermore, problem dependent, $\Psi$, $\Phi$ and $\lambda$ provide degrees of freedom for approximating the oracle MPE-SR in (2.11). It is interesting to note that, independent of those degrees of freedom, under the proposed approximation the optimal solution $f^*(\lambda)$ can be expressed as

$$f^*(\lambda) = \arg\min_{f\in\{f^*_k:\, k\in\mathcal{K}(\mathcal{D})\}}\; \Psi(\hat{I}(f)) + \lambda\cdot\Phi(R(f)), \qquad (2.13)$$

with $\{f^*_k: k\in\mathcal{K}(\mathcal{D})\}\subset\mathcal{D}$ the solutions of the following cost-fidelity problem:

$$f^*_k = \arg\max_{f\in\mathcal{D},\; R(f)\leq k} \hat{I}(f), \qquad (2.14)$$

$\forall k\in\mathcal{K}(\mathcal{D})$, where $\mathcal{K}(\mathcal{D})\equiv\{R(f): f\in\mathcal{D}\}\subset\mathbb{N}$. Hence the approximated MPE-SR solution in (2.12) can be restricted, without any loss, to what we call the optimal achievable cost-fidelity family $\{f^*_k: k\in\mathcal{K}(\mathcal{D})\}$. Note that the cardinality of $\mathcal{K}(\mathcal{D})$ could be significantly smaller than $|\mathcal{D}|$ and, as a result, than the domain of solutions of the original problem. Finally, the empirical risk minimization criterion among $\{f^*_k: k\in\mathcal{K}(\mathcal{D})\}\subset\mathcal{D}$, for instance using cross validation [21, 5], can be the final decision step for solving (2.13), as has been widely used for addressing a similar complexity regularization problem in the context of regression and classification trees [5, 51].
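The following minimal sketch, with illustrative candidate names and fidelity-cost values that are not from the thesis, shows the two-stage recipe just described: the cost-fidelity frontier (2.14) keeps, for each cost budget k, the candidate representation with the highest empirical fidelity, and the penalized criterion (2.13) is then evaluated only on that frontier.

```python
# Cost-fidelity selection sketch: candidates are (name, fidelity Ihat, cost R) triples.
candidates = [("f_a", 0.35, 1), ("f_b", 0.52, 2), ("f_c", 0.50, 2),
              ("f_d", 0.61, 4), ("f_e", 0.63, 8)]

def cost_fidelity_frontier(cands):
    """Eq. (2.14): for each achievable budget k, the best candidate with cost R(f) <= k."""
    frontier = []
    for k in sorted({c[2] for c in cands}):
        feasible = [c for c in cands if c[2] <= k]
        frontier.append(max(feasible, key=lambda c: c[1]))
    return frontier

def select(frontier, lam, Psi=lambda i: -i, Phi=lambda r: float(r)):
    """Eq. (2.13): penalized selection restricted to the frontier.
    Psi is strictly decreasing in the fidelity, Phi strictly increasing in the cost."""
    return min(frontier, key=lambda c: Psi(c[1]) + lam * Phi(c[2]))

frontier = cost_fidelity_frontier(candidates)
for lam in (0.0, 0.02, 0.1):
    print(f"lambda = {lam:4.2f} -> selected representation:", select(frontier, lam)[0])
```

As the complexity weight lambda grows, the selected representation moves from the highest-fidelity candidate toward cheaper ones, which is exactly the behavior the regularized criterion is designed to trade off.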
2.5 Applications

By stipulating a learning technique and a family of representation functions, we can characterize different sub-problems associated with the MPE-SR complexity regularized formulation. Two emblematic cases where this particularization offers algorithmic solutions are presented in this section, while another interesting case was presented by the authors previously for the problem of optimal WP filter bank decomposition [64].

2.5.1 Classification Trees: CART and Minimum Cost Tree Pruning Algorithms

Let $\mathcal{X}=\mathbb{R}^K$ be a finite dimensional Euclidean space, and let $\mathcal{D}$ be a family of vector quantizers (VQs) with a binary tree structure [8, 50, 62]. Then the MPE-SR problem reduces to finding an optimal binary classification tree topology. Interestingly, the tree pruning algorithms originally proposed by Breiman, Friedman, Olshen and Stone (BFOS) [5] can be shown to address an instance of the complexity regularized problem presented in Section 2.4. (The seminal work of Breiman et al. [5] addresses the more general case of classification and regression trees (CART); in the context of this work we only highlight results concerning the classification part.) In this section we formally present this connection, as well as how the Bayes-estimation error tradeoff justifies this solution in this context.

Let us introduce some basic terminology; the interested reader is referred to [5, 58] for a more systematic exposition. Using Breiman et al.'s conventions [5], a tree $T$ is represented by a collection of nodes in a graph, with implicit left and right mappings reflecting the parent-child relationship among them. $T$ is a rooted binary tree if every nonterminal node has two descendants, where we denote by $\mathcal{L}(T)\subset T$ the sub-collection of leaves or terminal nodes (nodes that do not have a descendant). In addition, $|T|$ denotes the norm of the tree, which is the cardinality of $\mathcal{L}(T)$. Let $S$ and $T$ be two binary trees; if $S\subset T$ and both have the same root node, we say that $S$ is a pruned version of $T$, and we denote this relationship by $S\preceq T$. Finally, we denote by $T_{full}$ the tree formed by all the nodes in the graph and by $t_{root}\in T$ its root.

The tree structure is used to index a family of vector quantizations of $\mathcal{X}$. To formalize this idea, we consider that every node $t\in T_{full}$ has an associated measurable subset $\mathcal{X}_t\subset\mathcal{X}$, such that $\mathcal{X}_{t_{root}}=\mathcal{X}$ and, if $t_1$ and $t_2$ are the direct descendants of a nonterminal node $t$, then $\mathcal{X}_t=\mathcal{X}_{t_1}\cup\mathcal{X}_{t_2}$ and $\mathcal{X}_{t_1}\cap\mathcal{X}_{t_2}=\emptyset$. Therefore, any rooted binary tree $T\preceq T_{full}$ induces a measurable partition of the observation space given by $V_T=\{\mathcal{X}_t: t\in\mathcal{L}(T)\}$, where, importantly, if $T_1\preceq T_2$ then $V_{T_2}$ is a refinement of $V_{T_1}$. With this concept in mind, we can define a pair of tree-indexed representations $[T,f_T(\cdot)]$, $\forall T\preceq T_{full}$, with $f_T(\cdot)$ a measurable function from $(\mathcal{X},\mathcal{F}_X)$ to $\mathcal{L}(T)$ such that $\mathcal{X}_t=f_T^{-1}(\{t\})$, $\forall t\in\mathcal{L}(T)$. Hence $f_T(\cdot)$ induces the previously defined measurable partition $V_T$ on $(\mathcal{X},\mathcal{F}_X)$. Formally, then, the family of tree-indexed representations is given by $\mathcal{D}=\{f_T(\cdot): T\preceq T_{full}\}$.

Following the convention in [51], a classification tree is a triple $[T,f_T(\cdot),g_T(\cdot)]$, where $g_T(\cdot)$, from $\mathcal{L}(T)$ to $\mathcal{Y}$, is the classifier that infers $Y$ based on the quantized random variable $X_{f_T}\equiv f_T(X)$. In particular, the Bayes rule is

$$g_T(t) = \arg\max_{y\in\mathcal{Y}} P_{X_{f_T},Y}(t,y), \quad \forall t\in\mathcal{L}(T),$$

with Bayes error $R(T)=\mathbb{P}(\{u\in\Omega: g_T(X_{f_T}(u))\neq Y(u)\})$. Breiman et al. [5] (Chapter 9) show that $R(T)$ can be written as an additive non-negative function of the terminal nodes of $T$; more precisely,

$$R(T) = \sum_{t\in\mathcal{L}(T)} R(t), \qquad (2.15)$$

where for the 0-1 cost function $R(t)=\mathbb{P}(X_{f_T}(u)=t)\cdot\left(1-\max_{y\in\mathcal{Y}}P_{Y|X_{f_T}}(y|t)\right)$. As we know, if $P_{X,Y}$ is available the best performance is obtained for the finest representation, i.e., $[T_{full},f_{T_{full}}(\cdot),g_{T_{full}}(\cdot)]$; however, our case of interest is when, under the constraint of finite i.i.d. samples of $(X,Y)$, $D_N=\{(x_i,y_i): i=1,..,N\}$, we want to address the MPE-SR problem formulated in Section 2.4. In this case the maximum likelihood (ML) empirical distribution $\hat{P}_{X_{f_T},Y}$ is considered $\forall T\preceq T_{full}$, which reduces the problem to a family of classification trees $[T,f_T(\cdot),\hat{g}_T(\cdot)]$, with the empirical Bayes decision $\hat{g}_T(\cdot)$ corresponding to the majority vote decision rule [5]. The next result shows that the Bayes-estimation error tradeoff holds for a sequence of embedded representations in $\mathcal{D}$.

Proposition 2.2 Let us take a sequence of embedded trees $T_1\preceq T_2\preceq T_3\preceq\cdots\preceq T_k$, subsets of $T_{full}$. Then $R(T_{i+1})\leq R(T_i)$, $\forall i\in\{1,..,k-1\}$. On the other hand, considering $[T_i,f_{T_i}(\cdot),\hat{g}_{T_i}(\cdot)]$, $\forall i\in\{1,..,k\}$, and the estimation error of these empirical Bayes decisions, denoted by $\Delta g(\hat{P}_{X_{f_{T_i}},Y})$, $\forall i\in\{1,..,k\}$ (Theorem 2.1), it follows that $\Delta g(\hat{P}_{X_{f_{T_i}},Y})\leq\Delta g(\hat{P}_{X_{f_{T_{i+1}}},Y})$, $\forall i\in\{1,..,k-1\}$.
Proof: We have developed all the machinery needed to prove this result. We know that the family of representations $\{f_{T_1}(\cdot),..,f_{T_k}(\cdot)\}$ is embedded, and by Proposition 2.1 the induced empirical distributions (conditional class probabilities) are consistent with respect to the embedded representation family. Consequently, the result extends directly from Theorem 2.3.

This result provides a strong justification for the complexity regularized MPE-SR formulation presented in Section 2.4. For that, we can use the additive structure of the Bayes error in (2.15) to consider the empirical Bayes error and cardinality as the fidelity and complexity indicators for (2.12), which are in fact additive tree functionals [8, 61]. Indeed, under an additive cost assumption for the term $\Phi$ in (2.12), the solution to this problem is the well known CART pruning algorithm [5], which finds an algorithmic solution for

$$T^*_n(\alpha) = \arg\min_{T\preceq T_{full}}\; \hat{R}(T)+\alpha\cdot|T|, \qquad (2.16)$$

with $\hat{R}(T)=\frac{1}{N}\sum_{i=1}^{N} I_{\{(t,y)\in\mathcal{L}(T)\times\mathcal{Y}:\, \hat{g}_T(t)\neq y\}}(f_T(x_i),y_i)$ the empirical risk and $\alpha\cdot|T|$ the penalization term. (From this point we use the tree index $T$ to refer either to the representation function $f_T(\cdot)$ or to the empirical Bayes classifier $[T,f_T(\cdot),\hat{g}_T(\cdot)]$, depending on the context.) More precisely, Breiman et al. [5] (Chapter 10) use the additivity of the objective function in (2.16) to formulate a dynamic programming solution for $T^*_n(\alpha)$ in $O(|T_{full}|)$. Moreover, they proved that there exists a sequence of optimal embedded representations, denoted by $T_{full}=T^*_1\succeq T^*_2\succeq\cdots\succeq T^*_m=\{t_{root}\}$, which are the solutions of (2.16) for all possible values of the complexity weight $\alpha\in\mathbb{R}^+$. More precisely, $\exists\,\alpha_0=0<\alpha_1<\cdots<\alpha_m=\infty$ such that, $\forall i\in\{1,..,m\}$,

$$T^*_n(\alpha) = T^*_i, \quad \forall\alpha\in[\alpha_{i-1},\alpha_i). \qquad (2.17)$$

Note that this result connects the MPE-SR tree pruning problem with the solutions of our cost-fidelity problem, as Scott [61] has recently pointed out. The reason is that this family of optimal embedded sequences is the solution to the cost-fidelity problem (2.14), which is expressed here by

$$T^*_j = \arg\min_{T\preceq T_{full},\; |T|\leq m-j-1} \hat{R}(T), \quad \forall j\in\{1,..,m\}. \qquad (2.18)$$

Scott coined the solutions of (2.18) the minimum cost trees, and presented a general algorithm to solve the problem in $O(|T_{full}|^2)$ [61]. The connection of this cost-fidelity problem with a more general complexity regularized objective criterion was also presented, where the cost term $\alpha\cdot|T|$ is substituted by a general size-based penalty $\alpha\cdot\Phi(|T|)$, with $\Phi(\cdot)$ a non-decreasing function. Moreover, an algorithm based on the characterization of the operational cost-fidelity boundary was presented for finding explicitly $\alpha_0<\alpha_1<\cdots<\alpha_m$ as in (2.17). Scott's work is the first that formally presented the connections between the general CART complexity regularized pruning problem and the solution of a cost-fidelity problem. Here we provide the context to show that the algorithms used to solve the cost-fidelity problem are implicitly addressing the ultimate MPE-SR problem.
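A minimal sketch of the dynamic programming idea behind (2.16); the tree topology and node risks below are illustrative and this is not the thesis (or CART) implementation. Because both the empirical risk and the leaf count are additive over terminal nodes, each subtree can decide independently whether pruning itself to a single leaf is cheaper than keeping its two optimally pruned children.

```python
# Bottom-up cost-complexity pruning for Eq. (2.16): objective R_hat(T) + alpha * |T|.
# Each node maps to (node_risk, left_child, right_child); node_risk is the empirical
# risk R_hat(t) incurred if the subtree rooted at t is collapsed to a single leaf.
tree = {
    "root": (0.40, "l", "r"),
    "l": (0.15, "ll", "lr"), "r": (0.20, None, None),
    "ll": (0.05, None, None), "lr": (0.04, None, None),
}

def prune(node, alpha):
    """Return (optimal penalized cost of the subtree, set of surviving leaves)."""
    risk, left, right = tree[node]
    leaf_cost = risk + alpha                      # collapse this subtree to one leaf
    if left is None:                              # terminal node: nothing to prune
        return leaf_cost, {node}
    lc, lleaves = prune(left, alpha)
    rc, rleaves = prune(right, alpha)
    keep_cost = lc + rc                           # additivity over terminal nodes
    if leaf_cost <= keep_cost:
        return leaf_cost, {node}
    return keep_cost, lleaves | rleaves

for alpha in (0.0, 0.05, 0.2):
    cost, leaves = prune("root", alpha)
    print(f"alpha = {alpha:4.2f}: leaves = {sorted(leaves)}, penalized cost = {cost:.2f}")
```

Sweeping alpha from 0 upward collapses the tree through a nested sequence of pruned subtrees, which is the embedded family described in (2.17).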
Considering a finite amount of training data $\{(x_i,y_i): i=1,..,N\}$ and maximum likelihood (ML) estimation techniques [25], the empirical distributions $\{\hat{p}_{X|Y}(\cdot|y): y\in\mathcal{Y}\}$ and $\hat{p}_X(\cdot)$ are Gaussian and Gaussian mixtures, respectively, characterized by the empirical means and covariance matrices
$$\hat{\mu}_y=\frac{1}{N_y}\sum_{i=1}^N \mathbb{I}_{\{y\}}(y_i)\cdot x_i \quad\text{and}\quad \hat{\Sigma}_y=\frac{1}{N_y}\sum_{i=1}^N \mathbb{I}_{\{y\}}(y_i)\cdot(x_i-\hat{\mu}_y)(x_i-\hat{\mu}_y)^\dagger,$$
with $N_y=|\{1\leq i\leq N: y_i=y\}|$, $\forall y\in\mathcal{Y}$. (Here $R(m,n)$ denotes the collection of $m\times n$ matrices with entries in $\mathbb{R}$.)

Proposition 2.3 Let $A_1,..,A_n$ be a family of full-rank linear transformations taking values in $\mathbb{R}^{k_1},..,\mathbb{R}^{k_n}$, with $0<k_1<k_2<\cdots<k_n\leq K$. In addition, assume that the sequence of transformations is dimensionally embedded, per Definition 2.1, i.e., for all $j>i$ there exists $B_{j,i}\in R(k_i,k_j)$ such that $A_i=B_{j,i}\cdot A_j$. Under the Gaussian parametric assumption, the empirical sequence of class-conditional pdfs $\{\hat{p}_{A_iX|Y}(\cdot|y): i=1,..,n\}$, estimated across $\mathbb{R}^{k_1},..,\mathbb{R}^{k_n}$ (ML criterion), characterizes a sequence of consistent probability measures with respect to $\mathbb{R}^{k_1},..,\mathbb{R}^{k_n}$, in the sense presented in Definition 2.2. Proof provided in the Appendix.

This last result formally extends Theorem 2.2 to the case of embedded sequences of full-rank linear transformations $A_1,..,A_n$, and provides justification for addressing the MPE-SR problem using the cost-fidelity approach. In this context, as considered by Padmanabhan et al. [53], the empirical mutual information is used as the objective indicator. Then the solution of the MPE-SR problem resides in the solution of
$$A_k^*=\arg\max_{A\in R(k,K)} \hat{I}(A),\quad \forall k\in\{1,..,K\}, \qquad (2.19)$$
where $\hat{I}(A)$ denotes the empirical mutual information between $AX$ and $Y$. Let us write $I(A)=H(AX)-H(AX|Y)$. Under the Gaussian assumption and considering $A\in R(k,K)$, it follows that $H(AX|Y=y)=\frac{k}{2}\log(2\pi)+\frac{1}{2}\log\left|A\Sigma_y A^\dagger\right|+\frac{k}{2}$. Given that $AX$ has a Gaussian mixture distribution, a closed-form expression is not available for its differential entropy. Padmanabhan et al. [53] proposed an upper bound based on the well-known fact that the Gaussian law maximizes differential entropy under second-moment constraints. Then, denoting $\Sigma\equiv E(XX^\dagger)-E(X)E(X)^\dagger$, we have $H(AX)\leq\frac{k}{2}\log(2\pi)+\frac{1}{2}\log\left|A\Sigma A^\dagger\right|+\frac{k}{2}$, and consequently
$$I(A)\leq\frac{1}{2}\log\frac{\left|A\Sigma A^\dagger\right|}{\prod_{y\in\mathcal{Y}}\left|A\Sigma_y A^\dagger\right|^{P(Y=y)}}.$$
Using this bound, the cost-fidelity problem reduces to
$$A_k^*=\arg\max_{A\in R(k,K)} \log\frac{\left|A\hat{\Sigma}A^\dagger\right|}{\prod_{y\in\mathcal{Y}}\left|A\hat{\Sigma}_y A^\dagger\right|^{\hat{P}(Y=y)}}, \qquad (2.20)$$
where $\hat{\Sigma}_y$ and $\hat{\Sigma}$ are the empirical class-conditional and unconditional covariance matrices, respectively. $\hat{\Sigma}$ can be written as $\hat{\Sigma}_w+\hat{\Sigma}_b$ [53], with $\hat{\Sigma}_w=\sum_{y\in\mathcal{Y}}\hat{P}_Y(\{y\})\cdot\hat{\Sigma}_y$ the within-class and $\hat{\Sigma}_b=\sum_{y\in\mathcal{Y}}\hat{P}_Y(\{y\})\cdot(\hat{\mu}-\hat{\mu}_y)(\hat{\mu}-\hat{\mu}_y)^\dagger$ the between-class scatter matrices used in linear discriminant analysis [25]. As pointed out in [53], under the additional assumption that the class-conditional covariance matrices are identical, the problem reduces to
$$A_k^*=\arg\max_{A\in R(k,K)} \log\frac{\left|A\hat{\Sigma}A^\dagger\right|}{\left|A\hat{\Sigma}_w A^\dagger\right|},$$
which is exactly the objective function used for finding the optimal linear transformation in multiple discriminant analysis (MDA), the case $k=1$ being the Fisher linear discriminant analysis problem [25].
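To make the criterion in (2.20) concrete, the following is a minimal sketch assuming NumPy and synthetic two-class Gaussian data; the helper names (empirical_scatter, mi_bound_objective), the random candidate projection, and the toy data are illustrative and not part of the original formulation. It computes the empirical class-conditional covariances, the within/between-class scatter, and evaluates the Gaussian MI bound for a candidate $A\in R(k,K)$.

```python
import numpy as np

def empirical_scatter(X, y):
    """Empirical class covariances plus the within/between scatter of Section 2.5.2."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    mu = X.mean(axis=0)
    Sigma_y = {}
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        Sigma_y[c] = np.cov(Xc, rowvar=False, bias=True)
        Sw += p * Sigma_y[c]                         # within-class scatter
        d = (mu - Xc.mean(axis=0))[:, None]
        Sb += p * (d @ d.T)                          # between-class scatter
    return classes, priors, Sigma_y, Sw, Sb

def mi_bound_objective(A, X, y):
    """Objective of (2.20): log|A Sigma A'| - sum_y P(y) log|A Sigma_y A'|."""
    classes, priors, Sigma_y, Sw, Sb = empirical_scatter(X, y)
    Sigma = Sw + Sb                                  # unconditional covariance
    num = np.linalg.slogdet(A @ Sigma @ A.T)[1]
    den = sum(p * np.linalg.slogdet(A @ Sigma_y[c] @ A.T)[1]
              for c, p in zip(classes, priors))
    return num - den

# Toy usage: a random rank-3 projection of synthetic 10-dimensional two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 10)), rng.normal(1, 2, (200, 10))])
y = np.repeat([0, 1], 200)
A = rng.normal(size=(3, 10))                         # candidate A in R(3, 10)
print(mi_bound_objective(A, X, y))
```

In practice the maximization over $A$ would not be done by scoring random candidates; in the equal-covariance case it reduces to the usual generalized eigenvalue solution of Fisher/multiple discriminant analysis.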
Asinclassificationtrees,bytakingadvantageofthis hierarchical structure the operational cost-fidelity problem in (2.14) can be addressed using polynomial time algorithms [8, 61, 5]. In this direction, initial promising results 27 have been presented for a family of filter-bank representations induced by the Wavelet Packets,inthecontextofaframe-basedphoneclassificationtask[64]. 2.6 Summary Thispaperfocusedontheminimumprobabilityoferrorsignalrepresentation(MPE-SR) problem. The contributions can be summarized in three folds. First, important gener- alization for conditions that guarantee a tradeoff between Bayes and estimation errors under sequence of embedded transformations for both continuous and final alphabet (vector quantization) settings were provided. Second, the use of this tradeoff to formu- late the MPE-SR as a complexity regularized optimization criterion, and an approach to address this oracle result in practice, as the solution of a cost-fidelity problem, was given. Finally, the MPE-SR was connected with two classical feature transformation techniquesusedinpatternrecognition,byshowingthatunderspecificassumptionsboth representinstancesofthepresentedcost-fidelityformulation. 28 2.7 TechnicalDerivations 2.7.1 ProofofTheorem2.2: Bayes-EstimationErrorTradeoff Proof: Giventhat{F i (·) :i = 1,..,n}isasequenceofembeddedtransformations, i.e., ∀i ∈ {1,..,n−1}, there exists a measurable mapping π i+1,i : X i+1 → X i such that X i = π i+1,i (X i+1 ), from Lemma 2.1 we have that L X i+1 ≤ L X i . Concerning the estimation error inequality across the sequence of embedded spaces, a sufficient conditiongivenTheorem2.1,istoprovethat D (X i ,F i ) (P X i |Y (·|y)|| ˆ P X i |Y (·|y))≤D (X i+1 ,F i+1 ) (P X i+1 |Y (·|y)|| ˆ P X i+1 |Y (·|y)), D (X i ,F i ) ( ˆ P X i |Y (·|y)||P X i |Y (·|y))≤D (X i+1 ,F i+1 ) ( ˆ P X i+1 |Y (·|y)||P X i+1 |Y (·|y)), (2.21) ∀i ∈ {1,..,n−1} and ∀y ∈ Y. We will focus on proving the first set of inequalities in (2.21), where the same argument applied for the other family. Here D (X i ,F i ) (P X i |Y (·|y)|| ˆ P X i |Y (·|y)) denotes the KLD of the conditional class probability P X i |Y (·|y) with respect to the empirical counterpart in (X i ,F i ). The fact of consider- ingthedependencywithrespecttothemeasurablespaceintheKLDnotation, whichis usuallyimplicit,isconceptuallyimportantfortherestoftheproof. The main idea is to represent the empirical distribution as an underlying measure defined on the original measurable space (X,F X ). This is possible using the fact that thefunctionsin{F i (·) :i = 1,..,n}aremeasurable[4]. Consequentlygiventheempir- icalclassconditionalprobability ˆ P X i |Y (·|y)intherepresentationspace(X i ,F i ),wecan induce a probability measure ˆ P X|Y (·|y) in the measurable space (X,σ(F i )), withσ(F i ) is the smallest sigma field that makes F i (·) a measurable transformation [34], where 29 it is clear that σ(F i ) ⊂ F X [34]. More precisely, σ(F i ) = F −1 i (B) :B∈F X and ˆ P X|Y (·|y)isconstructedby ∀A∈σ(F i ),∃B∈F i , st.A = F −1 i (B) and ˆ P X|Y (A|y) = ˆ P X i |Y (B|y). (2.22) Bytheconsistencepropertyof n ˆ P X i |Y (·|y) :i = 1,..,n o ,itiseasytoshowthatthereis auniquemeasure ˆ P X|Y (·|y)definedon(X,σ(F n ))thatrepresentsthefamilyofempiri- caldistributions n ˆ P X i |Y (·|y) :i = 1,..,n o usingtheprocedurepresentedin(2.22) 13 . As aconsequence,theempiricalmeasure ˆ P X|Y (·|y)isuniquelycharacterizedinX usingthe finest sigma fieldσ(F n ). 
On the other hand, the original probability measureP X|Y (·|y) is originally defined on (X,F X ) and given that σ(F n ) ⊂ F X , it extends naturally to (X,σ(F i )),∀i∈{1,..,n}. ThenextstepistorepresenttheKLDin(2.21),asaKLDintheoriginalobservation spaceX relativetoaparticularsigmafield. Usingaclassicalresultfrommeasuretheory [34],itispossibletoprovethat[32](Lemma5.2.4), D (X,σ(F i )) (P X|Y (·|y)|| ˆ P X|Y (·|y)) =D (X i ,F i ) (P X i |Y (·|y)|| ˆ P X i |Y (·|y)). (2.23) Finallyfromproving(2.21),wemakeuseofthefollowinglemma. LEMMA2.2 ([32], Lemma 5.2.5) Let us consider two measurable spaces (X,F) and (X, ¯ F), such that ¯ F is a refinement ofF, in other wordsF ⊂ ¯ F. In addition, let us 13 Itisimportanttonotethatthissequenceofinducedsigmafieldscharacterizesafiltration[4],inother words σ(F i ) ⊂ σ(F i+1 ), because of the existence of a measurable mapping π i+1,i (·) from (X i ,F i ) to (X i+1 ,F i+1 ). 30 consider two probability measures P 1 and P 2 defined on (X, ¯ F), then assuming that P 1 P 2 ,thefollowinginequalityholds, D (X, ¯ F) (P 1 ||P 2 )≥D (X,F) (P 1 ||P 2 ). (2.24) In our context we have P X|Y (·|y) and ˆ P X|Y (·|y) defined on (X,σ(F i+1 )) and conse- quently on (X,σ(F i )), because σ(F i+1 ) is a refinement of σ(F i ), then (2.21) follows directlyfromLemma2.2. 2.7.2 Proof of the Bayes-Estimation Error Tradeoff: Finite Alpha- betCase Proof: This proof follows the same arguments as the one presented in Appendix 2.7.1. Let us denote the Bayes rule for (X i ,Y) byg P X i ,Y (·) with error probabilityL A i , givenbyL A i =P X i ,Y ( n (x,y)∈A i ×Y :g P X i ,Y (x)6=y o ),∀i∈{1,..,n}. By the assumption that the family{F i (·) :i = 1,..,n} is embedded, we have that ∀0≤i<j ≤n,X i ≡ F i (X) =π j,i (F j (X)) =π j,i (X j ). Consequently using Lemma 2.1, the Bayes error inequality,L A i+1 ≤ L A i ,∀i ∈ {1,..,n−1}, follows directly. For theestimationerrorinequality,wewillprovethefollowingsufficientcondition: D(P X i |Y (·|y)|| ˆ P X i |Y (·|y))≤D(P X +1 i|Y (·|y)|| ˆ P X i+1 |Y (·|y)), D( ˆ P X i |Y (·|y)||P X i |Y (·|y))≤D( ˆ P X +1 i|Y (·|y)||P X i+1 |Y (·|y)), (2.25) ∀i ∈ {1,..,n−1} and∀y ∈ Y. We focus on proving one of the set of inequalities in (2.25),theproofoftheothersetofinequalitiesisequivalent. 31 Withoutlossofgeneralityletusconsideragenericpair(io,yo)∈{1,..,n−1}×Y. Let ˆ P io betheempiricaldistributionon(X,σ io )inducedbythemeasurabletransforma- tion F io (·)andtheprobabilityspace(A i , ˆ P X io |Y (·|yo)). Inthiscaseσ io isthesigmafiled induced by the partition Q io ≡ F −1 io ({a}) :a∈A io , and consequently the measure ˆ P io isunivocallycharacterizedby[34] ˆ P io (F −1 i ({a})) = ˆ P X io |Y ({a}|yo), ∀a∈A io . (2.26) The same process can be used to induce a measure ˆ P io+1 on (X,σ io+1 ). Note that given that the family of representations is embedded, we have that Q io+1 is a refine- ment of Q io in X and consequently σ io ⊂ σ io+1 , [34]. Then we have that ˆ P io+1 is also well defined on (X,σ io ). Moreover, by the consistence property of the conditional classprobabilities n ˆ P X i |Y (·|yo) :i = 1,..,n o onthefamilyofrepresentationfunctions {F i (·) :i = 1,..,n}, we want to show that the two measures agree onσ io . For that we justneedtoshowthattheyagreeonthesetofeventsthatgeneratethesigmafield,i.e.,in Q io = F −1 io ({a}) :a∈A io ,becauseQ io isapartitionandinparticularasemi-algebra [34]. 
Then without loss of generality let us consider the event F −1 io ({a}), then we have that: ˆ P io+1 (F −1 io ({a})) = ˆ P io+1 (F −1 io+1 (π −1 io+1,io ({a}))) = ˆ P X io+1 |Y (π −1 io+1,io ({a})|yo) = ˆ P X io |Y ({a}|yo) = ˆ P io (F −1 io ({a})), ∀a∈A io . (2.27) The first equality is because of the the fact that F i (·) = π io+1,io (F j (·)) — embedded property of the representation family, the second by (2.26), the third by the consistence 32 property of the conditional class probabilities and the last again by definition ofP io (·), (2.26). Consequently, we can just consider ¯ P ≡ ¯ P io+1 as the empirical probability measurewelldefinedon(X,σ io+1 )and(X,σ io ). Alsonotethattheoriginalprobability measureP X|Y (·|yo) is well defined on (X,σ io+1 ) and (X,σ io ) by the measurability of F io and F io+1 ,respectively[34]. Finally,fromthedefinitionoftheKLD[32](Chapter5): D(P X io |Y (·|yo)|| ˆ P X io |Y (·|yo)) =D (X,σ io ) (P X|Y (·|yo)|| ˆ P) (2.28) D(P X io+1 |Y (·|yo)|| ˆ P X io+1 |Y (·|yo)) =D (X,σ io+1 ) (P X|Y (·|yo)|| ˆ P) (2.29) where, D (X,σ io ) (P X|Y (·|yo)|| ˆ P) = X A∈Q io P X|Y (A|yo)log P X|Y (A|yo) ˆ P(A) D (X,σ io+1 ) (P X|Y (·|yo)|| ˆ P) = X A∈Q io+1 P X|Y (A|yo)log P X|Y (A|yo) ˆ P(A) andusingtheLemma2.2presentedinAppendix2.7.1,consideringthatσ io ⊂σ io+1 ,and the equations (2.28) and (2.29), we prove the sufficient condition stated in (2.25) and consequentlytheresult. 33 2.7.3 ProofthatMaximumLikelihoodestimationisconsistentwith respecttoasequenceoffiniteembeddedrepresentations Proof: Let{F i (·) :i = 1,..,n} be the family of embedded representation func- tionstakingvaluesinfinitealphabetsets{A i :i = 1,..,n},respectively. Foreveryrep- resentation spaceA i , the empirical distribution is obtained by the ML criterion [25, 5], wheretheconditionalclassdistributionisgivenby, ˆ P X i |Y ({a}|y) = P N k=1 I {(a,y)} (F i (x k ),y k ) N y , (2.30) ∀i ∈ {1,..,n}, ∀a ∈ A i and∀y ∈ Y, where N y ≡ P N k=1 I {y} (y k ) is assumed to be strictly greater than zero. For the proof we will use the induced probability measure on theoriginalobservationspace(X,F X ),thatwedefineby ˆ P i|y (F −1 i ({a}))≡ ˆ P X i |Y ({a}|y), (2.31) for all i ∈ {1,..,n} and y ∈ Y. By (2.30), it is straightforward to show that for any a∈A i ˆ P i|y (F −1 i ({a})) = P N k=1 I {(F −1 i ({a}),y)} (x k ,y k ) N y . (2.32) Without loss of generality, let us consider io,jo ∈ {1,..,n} and yo ∈ Y, such that io < jo. For proving the consistence condition of the ML empirical distributions, we justneedtoshowthat ˆ P X io |Y ({a}|yo) = ˆ P X jo |Y (π −1 jo,io ({a})|yo), ∀a∈A io . (2.33) By Remark 2.2, we have that the induced quantizationQ F jo ≡ F −1 jo ({a}) :a∈A jo isarefinementofQ F io ≡ F −1 io ({a}) :a∈A io . Then,anyatom F −1 jo ({a})indexedby 34 a∈A io , can be expressed as disjoint unions of atoms inQ F jo ; more precisely, we have that: F −1 io ({a}) = [ b∈π −1 jo,io ({a}) F −1 jo ({b}) = F −1 jo (π −1 jo,io ({a})) (2.34) wherefinallyby(2.31)and(2.32),wehavethat: ˆ P X io |Y ({a}|yo) = P N k=1 I {(F −1 io (a),yo)} (x k ,y k ) N yo = P N k=1 I {(F −1 jo (π −1 jo,io ({a})),yo)} (x k ,y k ) N yo = ˆ P X jo |Y (π −1 jo,io ({a})|yo). (2.35) 2.7.4 Proof that Maximum Likelihood estimation is consistent for theGaussianparametricassumption Proof: Without loss of generality, let us consider just f 1 (x) = A 1 · x and f 2 (x) = A 2 ·x, with A 1 ∈ R(k1,K) and A 2 ∈ R(k2,K) (0 < k1 < k2 < K). 
We need to show that ˆ P f 2 (X)|Y (·|y) defined on (R k2 ,B k2 ) is consistent with respect to ˆ P f 1 (X)|Y (·|y) defined on (R k1 ,B k1 ), in the sense that ˆ P f 2 (X)|Y (·|y) induces ˆ P f 1 (X)|Y (·|y) by the measurable mappingB 2,1 : (R k2 ,B k2 )→ (R k1 ,B k1 ). However under the Gaus- sianparametricassumption,thisconditionreducestocheckingthefirstandsecondorder statisticsoftheinvolveddistributions. Consideringthetrainingdata,itisdirecttoshow that the empirical mean and covariance matrix for ˆ P f 2 (X)|Y (·|y) is given byA 2 ˆ μ y and A 2 ˆ Σ y A † 2 , respectively, where ˆ μ y and ˆ Σ y are the respective empirical values in the orig- inalobservationspaceX. Analogousresultsholdforthecaseof ˆ P f 1 (X)|Y (·|y). 35 GiventhatlineartransformationspreservethemultivariateGaussiandistribution,we havethat ˆ P f 2 (X)|Y (·|y)inducesaGaussiandistributionon(R k1 ,B k1 )withmeanB 2,1 A 2 ˆ μ and covariance matrixB 2,1 A 2 ˆ Σ y A † 2 B † 2,1 . Finally, given that the linear transformations f 1 (·)andf 2 (·)preservetheconsistencestructureof R k1 , R k2 ,wehavethatB 2,1 A 2 =A 1 whichissufficienttoprovetheresult. 36 Chapter3 OptimizedWaveletPacket DecompositionbasedonMinimum ProbabilityofErrorSignal Representation ThisworkaddressestheproblemofoptimalWaveletpacket(WP)filterbankdecompo- sition based on the minimum probability of error signal representation (MPE-SR) prin- ciple. The problem is formulated as a complexity regularized optimization, where the tree-indexed structure of the WP family is explored to find conditions for reducing this problem to a type of minimum cost tree pruning, a method well-understood in regres- sion and classification trees (CART) and tree-structured vector quantization (TSVQ). For estimating the conditional mutual information (CMI), the fidelity criterion adopted intheoptimaltreepruningproblem,anon-parametricapproachbasedonproductadap- tivepartitionsisproposedextendingtheDarbellay-Vajdatree-structureddatadependent algorithm. Finally,experimentalevaluationwithinanautomaticspeechrecognitiontask shows that MPE-SR solutions for the WP decomposition problem are consistent with 37 well understood empirically determined speech features, and the derived feature rep- resentations yield competitive performances with respect to standard feature extraction techniques. 1 3.1 Introduction Wavelet packets (WPs) and general multi-rate filter banks [77, 47, 11] have emerged as important signal representation schemes for compression, detection and classifica- tion [17, 26, 57, 75, 80]. This basis family is particularly appealing for the analysis of pseudo-stationary time series processes and quasi periodic random fields, such as the acoustic speech signals, and texture image sources [9, 7, 43], where a filter bank analysis has shown to be suitable for de-correlating the process into its basic innova- tion components. In pattern recognition (PR), filter bank structures have been the basic signal analysis block for several acoustic and image classification applications, notably includingautomaticspeechrecognition(ASR)andtextureclassification. Inthisdomain, an interesting problem is to determine the optimal filter bank structure for a given clas- sificationtask[7,43],ortheequivalentoptimalbasisselectionproblem[12,59]. Inpatternrecognition(PR),theoptimalsignalrepresentationproblemcanbeassoci- ated with the feature extraction (FE). It is well-known that if the joint class observation distributionisavailable,theBayesdecisionprovidesameansofminimizingtheaverage risk[21]. 
However,inpracticethejointdistributionistypicallynotavailable,andinthe Bayes decision approach this distribution is estimated from a finite amount of training data[21,5,25]. Itisalsowellunderstoodthattheaccuracyofthisestimationisaffected by the dimensionality of the observation space. Hence, an integral part of FE is to 1 Index Term: Minimum probability of error Signal Representation, Bayes decision approach, basis selection,tree-structuredbasesandWaveletpacket(WP),complexityregularization,mutualinformation, minimum cost tree pruning, family pruning problem, data-dependent partition, nonparametric mutual informationestimation. 38 address the problem of optimal dimensionality reduction, particularly necessary in sce- narioswheretheoriginalraw-observationmeasurementslieinahighdimensionalspace, and a limited amount of training data is available, such as in most speech classification [56],imageclassification[26]andhyper-spectralclassificationscenarios[37,41]. Toward addressing of this problem, Vasconcelos [75] has formalized the minimum probabilityoferrorsignalrepresentation(MPE-SR)principle. Undercertainconditions, thisworkpresentsatradeoffbetweentheBayeserrorandaninformationtheoreticindi- cator for the estimation error across a sequence of embedded feature representations of increasingdimensionality,andconnectsthisresultwiththenotionofoptimalsignalrep- resentationforpatternrecognition. In[63]theseresultswereextendedtoamoregeneral theoretical setting, introducing the idea of family of consistent distributions associated with an embedded sequence of feature representations. Furthermore, [63] addresses the MPE-SR problem as solution of an operational cost-fidelity problem using mutual information(MI)asthefidelitycriterion[16]anddimensionalityasthecostterm. The focus of this study is to extend the MPE-SR principle for the important family of filter bank feature representations induced by the Wavelet packets (WPs) [77, 11]. The idea is to take advantage of the WP tree structure to characterize sufficient condi- tions that guarantee algorithmic solutions for the cost-fidelity problem. This approach was motivated by algorithmic solutions obtained for the case of tree-structured vector quantization (TSVQ) in lossy compression [8, 57] and TSVQ for non-parametric clas- sificationandregressionproblems[61,5,51]. Basisselection(BS)problemsfortree-structuredbasesfamilyinPRhavebeenpro- posed independently in [59, 26]. Saito et al. [59], extending the idea of BS for signal representation in [12], proposed a fidelity criterion that measures inter-class discrimi- nation, in the Kullback-Leibler divergence (KLD) sense [40], considering the average 39 energy of the transform coefficients for every basis. Etermad et al. [26] used an empir- icalfidelitycriterionbasedonFisher’sclassseparabilitymetric[25]. Boththeseefforts usedthetree-structureoftheWPsfordesigninggreedypruning-growingalgorithmsfor thebasisselection. Thosealgorithmsmakelocaldecisionsforsplitting-pruningthebasic two-channelfilterbank,whichgeneratesthebasesfamily[77],asawaytoapproximate the optimal tree-indexed basis for their respective optimality criteria. The estimation- approximation error tradeoff was not formally considered in these basis selection algo- rithms,whiledimensionallyreductionisaddressedinapost-processingstage. 
In this work we address a problem that is distinct from the aforementioned approaches,intermsofboththesetoffeaturerepresentationsorthedictionaryobtained from the WP bases collection, and the optimality criterion used to formulate the basis selection problem. For feature representation, we consider an analysis-measurement framework that projects the signal into different filter bank sub-space decompositions and then compute measurements for the resulting subspaces as a way to obtain a sequence of successively refined features. The filter bank energy measurement is the focusinthispaper,motivatedbyitsusedinseveralacoustic[55,43]andimageclassifi- cationproblems[26,59,42,7]. Inthiswayafamilyoftree-embeddedfeaturerepresen- tations is obtained. Concerning the basis selection, the Bayes-estimation error tradeoff is explicitly considered as objective function in terms of a complexity regularization formulation, where the embedded structure of the WP feature family is used to study conditionsthatguaranteealgorithmicsolutions—optimaltreepruningalgorithms. 3.1.1 OrganizationandContribution This work is organized in two parts. In the first part, WP tree-structured feature rep- resentations are characterized in terms of an analysis-measurement framework, where 40 the notion of dimensionally embedded feature representations is introduced. Then, suf- ficient conditions are studied in the adopted fidelity criterion, with respect to the tree- structure of the WP basis family, which allow for implementing the MPE cost-fidelity problem using dynamic programing (DP) techniques. Those conditions are based on a conditional independent structure of the family of random variables induced from the analysis-measurement process, where the cost-fidelity problem reduces to a minimum costtreepruningproblem[61]. Finally,theoreticalresultsandalgorithms,withpolyno- mialcomplexityinthesizeoftheWPtreearepresented,extendingideasbothfromthe context of regression and classification trees [5, 51] and the general single and family pruningproblemsrecentlypresentedbyScott[61]. In the second part, we address implementation issues and provide some experimen- tal results. First, a non-parametric data-driven approach is derived for estimating the conditional mutual information (CMI) [16, 32]. This is a necessary building block for computingtheadoptedfidelitycriterion—empiricalmutualinformation(MI)—given theaforementionedconditionalindependentassumptions. Inthiscontext,weextendthe Darbellay-Vajda tree-structureddata-dependentpartition[18],originallyformulatedfor estimatingMIbetweentwocontinuousrandomvariables,intoourscenariofortheCMI. WeconsideraproductpartitionstructurefortheproposedCMIestimator,whichsatisfies desirable asymptotically properties (weak consistency). For experimental evaluation, solutionsfortheoptimalWPdecompositionproblemareevaluatedinaspeechphonetic classification task, where the solutions of the proposed optimal filter bank decomposi- tionareevaluatedandcomparedwithsomestandardfeatureextractiontechniques. The rest of the paper is organized as follows. Section 3.2 provides basic notations andsummarizestheMPE-SRformulationconsideredinthiswork. Section3.3presents the WP bases family and its indexing in terms of a tree-structured feature representa- tion. Section 3.4 addresses the MPE-SR problem for the WP indexed family in terms 41 of minimum cost tree pruning. Section 3.5 is devoted to the non-parametric CMI esti- mation. Finally,Section3.6presentsexperimentalevaluationsandSection3.7provides finalremarks. 
ProofsareprovidedintheAppendix. 3.2 Preliminaries Let X:(Ω,F, P) → (X,F X ) denote the observation random variable with values in X = R K (for some fixed K ∈ N), and Y:(Ω,F, P) → (Y,F Y ) be the class rv with values in a finite alphabet setY, where (Ω,F, P) refers to the underlying probability space 2 . The pattern recognition (PR) problem is to find the mapping from the set of measurable transformations from (X,F X ) to (Y,F Y ) with the minimum risk, given byargmin g(·) E X,Y [l(g(X),Y)],wherel(y 1 ,y 2 )representsthepenalizationoflabeling an observation with the valuey 1 , when its true label is given byy 2 ,∀y 1 ,y 2 ∈ Y. This optimalsolutionisknownastheBayesrule,wherefortheemblematic0-1costfunction, l(y 1 ,y 2 ) =δ(y 1 ,y 2 ),itreducestothemaximumaposterior(MAP)decision,g P X,Y (¯ x) = argmax y∈Y P X,Y (¯ x,y),∀¯ x ∈ X, with the corresponding Bayes error given byL X = P u∈ Ω :g P X,Y (X(u))6=Y(u) [21]. In practice the joint distribution P X,Y is unknown, and a set of independent and identically distributed (i.i.d.) realizations of the pair (X,Y), denoted by D N = {(x i ,y i ) :i∈{1,..,N}}, is assumed. In the Bayes decision approach, D N is used to obtain an empirical observation-class distribution ˆ P X,Y . that is in turn used to derive the empirical Bayes rule, ˆ g ˆ P X,Y (·) = argmax y∈Y ˆ P X,Y (·,y). It is well known that the risk of the empirical Bayes rule, deviates from L X as a consequence of estima- tion errors [60, 37, 75, 63]. This implies a strong dependency between the number of 2 AnaturalchoiceforF X istheBorelsigmafieldB(R K )[34],andforF Y thepowersetofY. 42 trainingexamplesandthecomplexityoftheobservationspace,thatjustifiesdimension- ality reduction as a fundamental part of feature extraction (FE). For addressing this FE problem, we revisit the minimum probability of error signal representation (MPE-SR) principle[75,63]. 3.2.1 Minimum Probability of Error Signal Representation (MPE- SR) Let D be a dictionary of feature transformations, where any f(·) ∈ D is a mapping fromtheoriginalsignalspaceX toatransformspaceX f ,equippedwithjointempirical distribution ˆ P X f ,Y on(X f ×Y,σ(F f ×F Y ))obtainedfromD N andanimplicitprobabil- ity estimation approach. Consequently, we have a collection of empirical Bayes rules, n ˆ g f (·) = argmax y∈Y ˆ P X f ,Y (·,y) :f(·)∈ D o . TheMPE-SRproblem[63]isgivenby f ∗ = argmin f∈D P X,Y ({(x,y)∈X ×Y : ˆ g f (f(x))6=y}), (3.1) where P X,Y refers to the true joint distribution on (X ×Y,σ(F X ×F Y )). Note that ∀f(·)∈ D,P X,Y ({(x,y)∈X ×Y : ˆ g f (f(x))6=y})≥ L X f ≥ L X [21], then the MPE- SRcriterionideallychoosestherepresentationfunctionwhoseperformanceistheclos- esttoL X . Usingthefollowingupperbound,P X,Y ({(x,y)∈X ×Y : ˆ g f (f(x))6=y})≤ Δg MAP ( ˆ P X f ,Y )+L X f proposedbyVasconcelosin[75],theobjectivecriterionof(3.1) canbeapproximatedbythisboundresultingin 3 , ˆ f ∗ = argmin f∈D L X f −L X +Δg MAP ( ˆ P X f ,Y ). (3.2) 3 The normalization factorL X was included to make explicit the approximation error partL X f −L X in(3.2). 43 This last problem, as desired, makes explicit that the optimal decision needs to find the best tradeoff between signal representation quality (approximation error) and learning complexity (estimation error). In particular, Δg MAP ( ˆ P X,Y ) quantifies the estimation error [75] which is a non-decreasing closed-form of the Kullback-Leibler divergence (KLD) [40] between the true conditional class probabilities and their empirical coun- terparts. 
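As a purely illustrative aside (not part of the original derivation), the following sketch simulates the tradeoff just described with a toy Gaussian model: as the dimensionality of the representation grows, the error of the rule built from the true class means keeps decreasing, while the plug-in (empirical Bayes) rule built from a small training set eventually degrades because of estimation error. The model, sample sizes, and names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_mean_error(mu0, mu1, X, y):
    """Error of the MAP rule for identity-covariance Gaussians with equal priors."""
    d0 = ((X - mu0) ** 2).sum(axis=1)
    d1 = ((X - mu1) ** 2).sum(axis=1)
    return float(np.mean((d1 < d0).astype(int) != y))

n_train, n_test, per_coord = 20, 20000, 0.3
for dim in [1, 2, 5, 10, 20, 50, 100]:
    mu0, mu1 = np.zeros(dim), per_coord * np.ones(dim)
    Xtr = np.vstack([rng.normal(mu0, 1.0, (n_train, dim)),
                     rng.normal(mu1, 1.0, (n_train, dim))])
    ytr = np.repeat([0, 1], n_train)
    Xte = np.vstack([rng.normal(mu0, 1.0, (n_test, dim)),
                     rng.normal(mu1, 1.0, (n_test, dim))])
    yte = np.repeat([0, 1], n_test)
    bayes_err = nearest_mean_error(mu0, mu1, Xte, yte)            # rule with true means
    plug_err = nearest_mean_error(Xtr[ytr == 0].mean(axis=0),     # empirical Bayes rule
                                  Xtr[ytr == 1].mean(axis=0), Xte, yte)
    print(f"dim={dim:4d}  Bayes error={bayes_err:.3f}  plug-in error={plug_err:.3f}")
```

The gap between the two error curves is the kind of estimation-error penalty that the term quantified above is meant to capture.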
In practice, neither term in (3.2) is directly available since they requireP X,Y . To address this problem from observed data D N , [63] proposes the use of empirical mutualinformation(MI)[16,32]asafidelityindicatortoapproximatetheBayeserror 4 , and a function proportional to the dimensionality ofX f for the estimation error term 5 . Consequently,thenewcomplexityregularizationproblemisformulatedas 6 , ˆ f ∗ (λ) = argmin f∈D −I(f(X);Y)+λ·Φ(R(f)), (3.3) where R(f) denotes the dimensionality ofX f . Note that independent of Φ, ˆ f ∗ (λ) the domain of solutions of (3.3) resides in a sequence of feature transformations which are thesolutiontothefollowingcost-fidelityproblem [63], f ∗ k = arg max f∈D R(f)≤k I(f(X);Y), ∀k∈K(D), (3.4) 4 Fano’s inequality (Chapter 2.11 in [16]) characterizes a lower bound for the probability of error of anydecisionframeworkg(·):X →Y thattriestoinferY asafunctionofX andoffersthetightestlower boundfortheBayesrule. 5 Supportingthischoice,Theorem2in[63]showsthattheestimationerrorismonotonicallyincreasing with the dimensionality of the space under some general dimensionally embedded consistency assump- tions. 6 Since the real dependency between Bayes and estimation error in terms of I(f(X);Y) and R(f) is hidden and, furthermore, problem dependent, Φ and λ provide degrees of freedom for approximating it andconsequentlyapproachingthesolutionof(3.2). 44 with K(D) = {R(f) :f ∈ D} ⊂ N. In practice, the solution for the approximated MPE-SR in (3.3) requires to know the complexity fidelity tradeoff represented by λ. This again can be empirically obtained fromD N by evaluating the empirical risk in an independenttestsetorbycross-validation[75] 7 . Next we particularize this learning-decision framework to our case of interest, the family of filter bank representations induced by Wavelet packets (WPs). First we show how the alphabet of feature transformations is created using an analysis-measurement process,andhowthetreestructureoftheWPsisusedtoindexthisdictionaryoffeature transformations. This abstraction will be crucial to address the cost-fidelity problem algorithmically,aspresentedinSection3.4. 3.3 Tree-Indexed Filter Bank Representations: The WaveletPackets WPsallowdecomposingtheobservationspaceintosubspacesassociatedwithdifferent frequency bands [77]. This basis family is characterized by a tree structure induced by its filter bank implementation, that recursively iterates a two channel orthonormal filter bank. In the process of cascading this basic block of analysis, it is possible to generate a rich collection of orthonormal bases for L 2 (Z) [68] — the space of finite energy sequences — associated with different time-scale representation properties [11, 77]. Emblematic examples for these bases include the Wavelet basis, which recursively iterates the low frequency band generating a multiresolution type of analysis [47], and the short-time Fourier transform (STFT) with a balanced filter bank structure [57, 11], 7 The same final re-sampling is considered for solving a related problem in the context of regression andclassificationtree(CART)[5,51]. 45 Figure3.1: Filterbankdecompositiongiventhetree-structuredofWaveletpacketbases for the case of the ideal Sinc half-band two channel filter bank. Case A: Octave-band filter bank characterizing a Wavelet type of basis representation. Case B: Short time Fouriertransform(STFT)typeofbasisrepresentation. illustrated in Fig. 3.1. For a comprehensive treatment of WPs, we refer to the excellent expositionsin[77,57,12]. 
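Before formalizing the tree indexing, a minimal sketch of the two cases in Fig. 3.1, assuming the PyWavelets package; the leaf paths, the db4 filter (the filter used later in Section 3.6), and the function name are chosen only for illustration. Each leaf path of the wavelet-packet tree selects one frequency band, and the subband energies play the role of the measurement introduced later in Section 3.3.3.

```python
import numpy as np
import pywt  # PyWavelets

def wp_band_energies(x, leaf_paths, wavelet="db4"):
    """Energy of the subbands selected by the leaves of a wavelet-packet tree.

    leaf_paths are PyWavelets node paths built from 'a' (low-pass) and 'd'
    (high-pass), e.g. 'aa', 'ad', 'd'."""
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, mode="symmetric")
    return np.array([np.sum(np.asarray(wp[p].data) ** 2) for p in leaf_paths])

x = np.random.randn(1024)                        # one analysis frame (K = 1024)
# Case A: octave-band (Wavelet-type) tree, iterating only the low-pass branch.
octave_leaves = ["aaaa", "aaad", "aad", "ad", "d"]
# Case B: balanced (STFT-like) tree, all nodes at level 2.
balanced_leaves = ["aa", "ad", "da", "dd"]
print(wp_band_energies(x, octave_leaves))
print(wp_band_energies(x, balanced_leaves))
```

Iterating only the low-pass branch reproduces the octave-band tree of Case A, while expanding every node to a fixed depth gives the balanced tree of Case B.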
3.3.1 Tree-IndexedBasisCollectionsandSubspaceDecomposition Here, as considered in [12, 59], we use the WP two-channel filter bank implementation to hierarchically index the WP bases collection and its subspace decomposition. Let X = R K again be our finite-dimensional raw observation space. Then, the applica- tion of the basic block of analysis — two channel filter bank and down-sampling by 2 [77] — decomposesX into two subspacesX 1 0 andX 1 1 , respectively associated with its low and hight frequency content. This process can be represented as an indexed- orthonormal basis that we denote by B = ψ 1 0,k1 ,ψ 1 1,k2 :k 1 ∈A 1 0 ,k 2 ∈A 1 1 , where X = X 1 0 L X 1 1 , beingX 1 i ≡ span ψ 1 i,k :k∈A 1 i ,i ∈ {0,1}. The indexed structure 46 ofB is represented by the way its basis elements are dichotomized in terms of the fil- ter bank index setsA 1 0 andA 1 1 , which are responsible for the sub-space decomposition. In any of the resulting sub-band spaces, X 1 0 and X 1 1 , we can reapply the basic block of analysis to generate a new indexed basis. By iterating this process, it is possible to construct a binary tree-structured collection of indexed bases for X. For instance, by iterating this decomposition recursivelyl-times from one step to another in every sub- band space, we can generate the indexed basis S j∈{0,...,2 l −1} ψ l j,k :k∈A l j , where X = L j∈{0,...,2 l −1} X l j , andX l j ≡ span ψ l j,k :k∈A l j , ∀j ∈ 0,...,2 l −1 . It is importanttomentionthatthisconstructionensuresthatinanyiterationwehavethefol- lowingrelationshipX l j =X l+1 2j L X l+1 2j+1 ,∀l∈{0,..,K o −1},∀j∈ 0,..,2 l −1 and, hence ψ l+1 2j,k 1 ,ψ l+1 2j+1,k 2 :k 1 ∈A l+1 2j ,k 2 ∈A l+1 2j+1 is an embedded indexed basis for the subspaceX l j . Finally from this construction, it is clear that there is a one-to-one map- pingbetweenafamilyoftreesinacertaingraphandthefamilyofWPbases,whichwe formalizenext. 3.3.2 RootedBinaryTreeRepresentation We represent the generative process of producing a particular indexed basis in the WP family by a rooted binary tree [61]. Let K o = blog 2 (K)c be the maximum number of iteration of this sub-band decomposition process, given our finite dimensional setting 8 . Let G = (E,V) be a graph with E = (0,0),(1,0),(1,1),(2,0),(2,1),(2,2),(2,3),··· ,(K o ,0)···(K 0 ,2 Ko −1) , and V the collection of arcs on E ×E that characterizes a full rooted binary tree with root v root ≡ (0,0),asillustratedinFig. 3.2.A.Insteadofrepresentingthetreeasacollection of arcs inG, we use the convention proposed by Breiman et al. [5], where sub-graphs 8 Withoutlossofgenerality,weconsiderK = 2 Ko fortherestofthepapertosimplifytheexposition. 47 Figure 3.2: Topology of the full rooted binary treeT full and representation of the tree indexedsubspaceWPdecomposition. are represented by subset of nodes of the full graph. In this context, any pruned ver- sion of the full rooted binary tree represents a particular way of iterating the basic two channelblockanalysisoftheWP. Beforecontinuingwiththeexposition,letusintroducesomebasicterminology. We usethebasicconceptsofchild,parent,path,leafandrootusedingraphtheory[14]. We describearootedbinarytreeT ={v 0 ,v 1 ,....}⊂E ascollectionofnodeswithonlyone with degree 2, the root node, and the remaining nodes with degree 3 (internal nodes) and 1 (leaf nodes) 9 . We defineL(T) as the set of leaves of T andI(T) as the set of internalnodesT,consequently,L(T)∪I(T) =T. WesaythatarootedbinarytreeS is a subtreeofT ifS⊂T. 
Inthepreviousdefinition,iftherootsofS andT arethesame thenS is a pruned subtree ofT, denoted byS T. In addition if the root ofS is an internal node ofT thenS is called a branch ofT. In particular, we denote the largest branch ofT rooted atv ∈ T asT v . We define the size of the treeT as the number of terminalnodes,i.e.,thecardinalityofL(T),anddenoteitby|T|. Finally, letT full = E denote the full binary tree, Fig. 3.2, then the WP bases can be indexed by{T :T T full }. More precisely, for anyT T full the induced tree 9 Thedegreeofanodeisthenumberofarcsconnectingthenodewithitsneighbors. 48 indexed basis is given byB T = S (l,j)∈L(T) ψ l j,k :k∈A l j and its filter bank subspace decompositionby X l j : (l,j)∈L(T) . 3.3.3 Analysis-MeasurementProcess In association with the subspace decomposition, we consider a final measurement step for feature extraction. Let B T a basis element in the collection, the analysis- measurementmappingisgivenby, m T (x) = M l j (x) (l,j)∈L(T) , ∀x∈ R K , (3.5) whereM l j (x) represents a measurement of the signal components in the subspaceX l j . In particular, we considerM l j (x) = F x,ψ l j,k k∈A l j with F(·) representing a non- injective function. While in the development of this work we focus on the sub-space energy as the measurement function, because of its wide use in several acoustic and image classification problems, the formulation and results presented in the following sectionscanbeextendedtoincludeamoregeneralfamilyoffeaturemeasurements. 3.4 MPE-SR for Wavelet Packets: The Minimum Cost TreePruningProblem Let X and Y be the observation and class label random variables, respectively. The analysis-measurement induces a family of random variables {m T (X) :T T full }. 49 For the MPE-SR formulation, we consider D Ko = {m T (·) :T T full }, the dictio- nary of transformations withK o levels of decomposition, where from Section 3.2.1 the approximatedMPE-SRreducestothefollowingcost-fidelityproblem, T k∗ = arg max TT full |T|≤k I(m T (X);Y), (3.6) ∀k∈ 1,...,|T full | = 2 Ko . The solution of this problem turns to finding the sub-band decomposition ofX that maximizes the empirical MI for a given number of frequency bands, which can be seen as an optimal band allocation problem. Note that without some additive property on the tree functionals involved in (3.6), an exhaustive search is needed for solving it, which grows exponentially with the size of the problem. The next subsections derive general sufficient conditions to address this problem using DP techniques. 3.4.1 Tree-EmbeddedFeatureRepresentationResults We begin with some basic tree-embedded properties of our dictionary of feature rep- resentations. To simplify notation letρ(T) = I(m T (X);Y) denote our target MI tree functional,andX l j =M l j (X)denotetherandommeasurementofX inthesubspaceX l j in(3.5). Proposition3.1 Thecollection{m T (X) :T T full }isembeddedinthesensethat, H(X l j |X l+1 2j ,X l+1 2j+1 ) = 0, ∀(l,j)∈I(T full ). (3.7) 50 Furthermore, for any sequence of rooted binary trees T 1 ··· T m , {m T 1 (X),··· ,m Tm (X)}isprobabilistically-embedded inthesensethat, H(m T i (X)|m T k (X)) = 0, ∀1≤i<k≤m, (3.8) where H refers to the differential entropy [16]. The proof is presented in Appendix 3.8.1. Proposition3.2 LetusconsiderT ¯ T,thenwehavethatρ(T)≤ρ( ¯ T),wheretheMI differenceisgivenby ρ( ¯ T)−ρ(T) =I X l j (l,j)∈ ¯ T\T ;Y| X l 0 j 0 (l 0 ,j 0 )∈T (3.9) =I X l j (l,j)∈ ¯ T\T ;Y|m T (X) . (3.10) This result is a consequence of the tree-embedded structure of {m T (X) :T T full } in(3.8). 
TheproofisgiveninAppendix3.8.2 10 . 3.4.2 StudyingAdditivepropertiesfortheMutualInformationTree functional LetT 6= T full be a rooted binary tree and letT + l,j denote the tree induced by splitting an admissible leaf node (l,j) ∈ L(T)\L(T full ) 11 . From Proposition 3.2, the MI gainρ(T + l,j )−ρ(T)isequaltoI X l+1 2j ,X l+1 2j+1 ;Y| X l 0 j 0 (l 0 ,j 0 )∈L(T) ,whichisnotalocal function of (l,j), but a function of the statistical dependency of the complete holding tree structure represented by X l 0 j 0 (l 0 ,j 0 )∈L(T) . The following result presents sufficient 10 Note that from Proposition 3.2, there exits a solution for (3.6), such that T k∗ = k, ∀k ∈ {1,...,|T full |}. 11 L(T + l,j ) ={(l+1,2j),(l+1,2j +1)}∪L(T)\{(l,j)}. 51 conditions to simplify this dependency, which requires the introduction of a Markov tree assumption on the conditional independence structure of the set of full random measurements X l j : (l,j)∈T full . Proposition3.3 Let X l j : (l,j)∈T full be the family of full random mea- surements. If ∀(l,j) ∈ I(T full ), n X ¯ l ¯ j : ( ¯ l, ¯ j)∈T full(l,j) \{(l,j)} o and n X ¯ l ¯ j : ( ¯ l, ¯ j)∈T full \T full(l,j) o areconditionallyindependentgivenX l j andgivenboth X l j andY,then ρ(T + l,j )−ρ(T) =I X l+1 2j ,X l+1 2j+1 ;Y|X l j , (3.11) ∀T T full such that (l,j)∈L(T). The proof is direct from the conditional indepen- dentassumptions. andispresentedinAppendix3.8.3. For the rest of this exposition we consider the Markov tree assumption presented in Proposition 3.3 for solving (3.6). We denote the CMI gain presented in (3.11) by Δρ(l,j) ≡ I X l+1 2j ,X l+1 2j+1 ;Y|X l j , well defined ∀(l,j) ∈ I(T full ). In addition, let T be a non-trivial tree (i.e., |I(T)| > 0) and (l,j) ∈ I(T), then ρ(T (l,j) ) ≡ I X ˆ l ˆ j ( ˆ l, ˆ j)∈L(T (l,j) ) ;Y denotestheMIofmeasurementsrestrictedtothebrachT (l,j) . Finally,letusdefine ρ T (l,j)≡I m T (l,j) (X);Y|X l j =ρ(T (l,j) )−I X l j ;Y (3.12) therootedCMIgainforthebrachofT rootedat(l,j)∈I(T). In this context, ρ(T) can be expressed as a function of the local CMI gains {Δρ(l,j) : (l,j)∈I(T full )} and the MI of the root node I(X 0 0 ;Y). The following resultsformalizethispointandthegeneraladditivepropertyofourMItreefunctional. 52 THEOREM3.1 LetT, ¯ T bebinarytreessuchthatT ¯ T. Thenthefollowingresults hold: ρ( ¯ T) =ρ(T)+ X (l,j)∈I( ¯ T)\I(T) Δρ(l,j) (3.13) =ρ(T)+ X (l.j)∈L(T) ρ¯ T (l,j) (3.14) Inparticularfrom(3.13),∀T T full wehavethat ρ(T) =I X 0 0 ;Y + X (l,j)∈I(T) Δρ(l,j). (3.15) TheproofispresentedinAppendix3.8.4. The following proposition presents the pseudo-additive property of ρ(·) when a rooted binarytreeispartitionedintermsofitsprimaryleftandrightbranches. Proposition3.4 LetT T full be a non-trivial tree (i.e.,|I(T)| > 0). Then if (l,j)∈ I(T)wehavethat, ρ T (l,j) = Δρ(l,j)+ρ T (l+1,2j)+ρ T (l+1,2j +1), (3.16) while for (l,j)∈L(T) by definitionρ T (l,j) = 0. The proof is presented in Appendix 3.8.5. 53 From(3.13)weobservethatρ(·)isadditivewithrespecttotheinternalnodesofthe tree, which implies thatρ(·) is an affine tree functional [8] 12 . Moreover, by definition (3.12), ρ(T) =I(X 0 0 ;Y)+ρ T (v root ), ∀T T full , (3.17) thenfrom(3.16),wehaveawayofcharacterizingρ(T)asanadditivecombinationofa rootdependenttermandρ(·)evaluatedinitsprimaryleftandrightbranches. The proposed Markov tree assumption depends on the goodness of the index bases familytodecomposetheobservationprocessintoconditionalindependentcomponents. 
Given that we are working with the WP family of bases, their frequency band decomposition provides good de-correlation for wide-sense stationary random processes, and independent components for stationary Gaussian processes [24, 31]. Assuming that the observation source has locally stationary behavior, similar to the assumptions behind the short-term frame-by-frame analysis of the acoustic speech process, the Markov tree assumption can be considered a good approximation and used to find computationally efficient solutions for our cost-fidelity problem. This last point is addressed in further detail in the next two subsections.

3.4.3 Minimum Cost Tree Pruning Problem

The cost-fidelity problem in (3.6) can be formalized as a minimum cost tree pruning problem [61, 5, 8]. In particular, we need to solve
$$T^{k*}=\arg\max_{T\preceq T_{full},\,|T|=k}\rho(T)=\arg\max_{T\preceq T_{full},\,|T|=k}\rho_T(v_{root}), \qquad (3.18)$$
$\forall k\in\{1,..,|T_{full}|=2^{K_o}\}$. (A tree functional $\rho(\cdot)$ is affine if, for any rooted binary trees $S\preceq T$, $\rho(T)=\rho(S)+\sum_{v\in L(S)}\left[\rho(T_v)-\rho(\{v\})\right]$, where $\{v\}$ denotes a trivial binary tree; for our MI tree functional this property follows from (3.14).) Let us define $T_v$ as the largest branch of $T_{full}$ rooted at $v\in T_{full}$, and $T_v^{k*}\preceq T_v$ as the solution of the following more general optimal tree pruning problem,
$$T_v^{k*}=\arg\max_{T\preceq T_v,\,|T|=k}\rho_T(v),\quad \forall k\in\{1,..,|T_v|\}. \qquad (3.19)$$
The next result presents a DP solution for this problem using the additive properties of our fidelity indicator.

THEOREM 3.2 Let us consider an arbitrary internal node $v\in I(T_{full})$ and denote its left and right children by $l(v)$ and $r(v)$, respectively (in our previous notation, an internal node $v=(l,j)$ has $l(v)=(l+1,2j)$ and $r(v)=(l+1,2j+1)$). Assume that we know the solution of (3.19) for the child nodes $l(v)$ and $r(v)$, i.e., we know $\{T_{l(v)}^{k_1*},\, T_{r(v)}^{k_2*} : k_1=1,..,|T_{l(v)}|,\ k_2=1,..,|T_{r(v)}|\}$. Then the solution of (3.19) for the parent node is given by $T_v^{k*}=[v,\, T_{l(v)}^{\hat{k}_1*},\, T_{r(v)}^{\hat{k}_2*}]$ (using Scott's nomenclature [61], $[v,T_1,T_2]$ denotes the binary tree with root $v$, left branch $T_1$ and right branch $T_2$), where
$$(\hat{k}_1,\hat{k}_2)=\arg\max_{\substack{(k_1,k_2)\in\{1,..,|T_{l(v)}|\}\times\{1,..,|T_{r(v)}|\}\\ k_1+k_2=k}} \rho_{T_{l(v)}^{k_1*}}(l(v))+\rho_{T_{r(v)}^{k_2*}}(r(v)), \qquad (3.20)$$
$\forall k\in\{1,..,|T_v|\}$. In particular, when $v$ equals the root of $T_{full}$, the solution of the optimal pruning problem in (3.18) is given by $T^{k*}=[v_{root},\, T_{l(v_{root})}^{\hat{k}_1*},\, T_{r(v_{root})}^{\hat{k}_2*}]$. The proof is presented in Appendix 3.8.6.

This DP solution is a direct consequence of solving (3.19) for a parent node as a function of the solutions of the same problem for its direct descendants. In particular, if we index all the nodes from top to bottom, such that $index(v)>\max\{index(l(v)),index(r(v))\}$, then we can solve an ordered sequence of optimal tree pruning problems, from the terminal nodes of $T_{full}$ (where the solution is trivial) to the root. The algorithm presented by Scott [61] for minimum cost tree pruning with an additive fidelity tree functional extends directly to this problem. Bohanec et al. [2] showed that the computational complexity of this algorithm is $O(|T_{full}|^2)$ for balanced trees, which is in fact our case.

The next section revisits the original complexity regularized problem in (3.3) and shows that, under additional conditions on the penalization term, the problem reduces to finding a more restrictive sequence of optimal tree-embedded representations. Using this tree-embedded structure, a more efficient algorithmic solution can in theory be derived.
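The following is a minimal sketch of the bottom-up recursion in Theorem 3.2, assuming the CMI gains $\Delta\rho(l,j)$ of (3.11) are already available (here they are synthetic placeholders); the function and variable names are illustrative. For each node it tabulates, for every admissible number of leaves $k$, the best achievable rooted CMI gain, i.e. the quantity maximized in (3.19)-(3.20).

```python
import numpy as np
from itertools import product

def minimum_cost_tree_pruning(delta_rho, depth):
    """Bottom-up DP of Theorem 3.2 on a balanced full binary tree of the given depth.

    delta_rho maps an internal node (l, j) to the CMI gain of splitting it.
    Returns best[v][k], the largest rooted CMI gain over pruned subtrees of T_v with
    exactly k leaves, and split[v][k] = (k1, k2), the leaf allocation achieving it."""
    best, split = {}, {}
    for l in range(depth, -1, -1):
        for j in range(2 ** l):
            v = (l, j)
            n_leaves = 2 ** (depth - l)
            best[v] = np.full(n_leaves + 1, -np.inf)
            best[v][1] = 0.0                       # the trivial tree {v}
            split[v] = {}
            if l == depth:                         # terminal node of T_full
                continue
            left, right = (l + 1, 2 * j), (l + 1, 2 * j + 1)
            child_max = 2 ** (depth - l - 1)
            for k1, k2 in product(range(1, child_max + 1), repeat=2):
                val = delta_rho[v] + best[left][k1] + best[right][k2]
                if val > best[v][k1 + k2]:
                    best[v][k1 + k2] = val
                    split[v][k1 + k2] = (k1, k2)
    return best, split

# Toy usage on a depth-3 tree with synthetic (assumed) CMI gains.
depth = 3
rng = np.random.default_rng(0)
gains = {(l, j): float(rng.uniform(0, 1.0 / (l + 1)))
         for l in range(depth) for j in range(2 ** l)}
best, split = minimum_cost_tree_pruning(gains, depth)
for k in range(1, 2 ** depth + 1):
    print(k, round(best[(0, 0)][k], 3))            # rho(T^{k*}) minus I(X_0^0; Y)
```

Reading back the recorded $(k_1,k_2)$ splits from the root recovers the optimal pruned tree for every $k$, and the nested loop over $(k_1,k_2)$ is what yields the quadratic complexity mentioned above for balanced trees.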
3.4.4 Connections with the Family Pruning Problem with General Size-basedPenalty As presented in Section 3.2.1, the approximated MPE-SR problem can be expressed as thefollowing“singletreepruningproblem”withgeneralizedsize-basedpenalties[61], T ∗ (λ) = arg min TT full −ρ(T)+λΦ(|T|), (3.21) with Φ:N→ R + a non-decreasing function and λ∈ R + reflecting the tradeoff between the fidelity and cost terms. In this case, −ρ(T) represents the MI loss for having a coarse representation of the raw observation, and Φ(|T|) is the regularization term that penalizesdimensionality. Proposition1in[61]showsthatwhenΦisstrictlyincreasing, 56 then there exitsλ 0 = 0<λ 1 <···λ m =∞ and a sequence of pruned treesR 1 ,..,R m (with|R 1 |>|R 2 |...>|R m | = 1),suchthat∀i∈{1,..,m}, T ∗ (λ) =R i , ∀λ∈ [λ i−1 ,λ i ). (3.22) This results characterizes the full range of solutions for (3.21) as a function of the rela- tive weight between the fidelity and penalization terms (or the optimal achievable cost- fidelity region [8]). The problem of finding{λ 0 ,,..λ m−1 } and the associated solutions {R 1 ,..,R m }wascoinedasthe“familypruningproblem”undergeneralsizebasedpenal- ties [61]. It is not difficult to see thatR j is an admissible solution of the minimum cost tree pruning (3.18) fork j = |R j |, i.e. R j = T k j ∗ , and consequently we can consider that{R 1 ,..,R m }⊂ T k∗ :k = 1,..,2 K−o . Interestingly, if the cost tree functional is additive,thefollowingresultcanbestated. THEOREM3.3 If Φ(|T|) = |T|, then the solution of the family pruning problem admitsanembeddedstructure,i.e.,R m R m−1 ···R 1 . Indirectly, the proof of this theorem can be obtained from the fact that this set of solutions characterizes an operational rate-distortion region associated with two mono- tone affine tree functionals: one being the MI, our fidelity criterion, and the other, the cost of the tree given by|T|, which is additive 15 . We derived an alternative algebraic proof for this result based on Breiman et al.’s derivations (Chapter 10.2) [5], presented inAppendix3.8.7. By Theorem 3.3, wehave that the familypruning problem admitsa nested solution. Consequently,asimpleralgorithmcanbeusedforfinding{R 1 ,..,R m }. Thealgorithmis 15 Inthegeneralcaseofmonotoneaffinefunctionals,Chouetal. (Lemma1)[8]showedthatthesolution foranoperationalratedistortionproblemallowsanestedsub-treestructure. 57 presentedin[8],whichisbasedonagraphicalrepresentationoftheproblem. Thecom- plexity of this algorithm isO(|T full |log(|T full |)) for the case of balanced trees. Fur- thermore, Theorem 3.3 and the algorithm mentioned can be extended to a more general family of sub-additive penalties — functionals dominated by an additive cost, details arepresentedin(Theorem2)[61]. Finally,asintheCARTpruningalgorithm[5,61,51],thetruevalueofλthatreflects the right tradeoff between the fidelity and cost term of the problem is unknown. The problem then reduces to finding the optimal λ ∗ and consequentlyT ∗ (λ ∗ ). In practice the empirical data D N is used for this final decision. This is done by considering the empirical risk minimization (ERM) criterion across the set of empirical Bayes rules defined for every member of{R 1 ,..,R m } (embedded under additive and sub-additive costterm),ortothemorecompletefamilyofminimumcosttrees T k∗ :k = 1,..,2 Ko . ThisisthestepwherethesetofempiricalBayesrulescomeintoplayandwherefeature extractionandclassificationareoptimizedjointlyforthetask. 
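As a complement, a small sketch (hypothetical names, assuming the per-size fidelities $\rho(T^{k*})$ have already been computed, e.g. by the DP sketch above) that traces the solutions of (3.21) with the additive penalty $\Phi(|T|)=|T|$ as $\lambda$ sweeps from 0 to $\infty$, i.e. the breakpoints $\lambda_i$ and the sizes of the nested family $\{R_1,..,R_m\}$ of Theorem 3.3.

```python
import numpy as np

def family_pruning(rho):
    """Trace the solutions of (3.21) with Phi(|T|) = |T| as lambda sweeps over [0, inf).

    rho[k-1] holds the fidelity of the minimum cost tree with k leaves, k = 1..|T_full|.
    Returns a list of (lambda_i, k_i): for lambda in [lambda_i, lambda_{i+1}) the
    optimizer of -rho(T) + lambda*|T| has k_i leaves."""
    ks = np.arange(1, len(rho) + 1)
    k = int(ks[np.argmax(rho)])          # lambda = 0 favours the highest-fidelity size
    out = [(0.0, k)]
    while k > 1:
        smaller = ks[ks < k]
        # lambda at which each smaller tree ties with the current one; take the first tie
        lambdas = (rho[k - 1] - rho[smaller - 1]) / (k - smaller)
        i = int(np.argmin(lambdas))
        k = int(smaller[i])
        out.append((float(lambdas[i]), k))
    return out

# Usage with the DP output of the previous sketch (names assumed from that sketch):
# rho_root = np.array([best[(0, 0)][k] for k in range(1, 2 ** depth + 1)])
# for lam, k in family_pruning(rho_root):
#     print(f"lambda >= {lam:.3f} -> |T| = {k}")
```

The final choice of $\lambda$ (equivalently, of one member of the family) would then be made by empirical risk on held-out data or cross-validation, as described above.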
Consideringthatadditive assumption for the cost term is difficult to be rigorously justified in our Bayes decision setting,andthatre-samplingisusedasthefinaldecisionstep,itisreasonabletoconsider thefullminimumcosttreepruningfamilyasthedomainforthisfinalempiricaldecision. Finally what we haven’t addressed so far, and that was implicitly assumed in the MPE-SR solution is how to estimate our fidelity criterion in (3.18) and (3.21) based on the empirical data. The solution adopted in this work is based on a non-parametric techniques,whichisthefocusofthenextsection. 3.5 Non-parametricEstimationoftheCMIGains The solution of the minimum cost tree pruning problem requires the estimation of the conditional mutual information (CMI) quantities 58 I X l+1 2j ,X l+1 2j+1 ;Y|X l j : (l.j)∈L(T full ) , by Theorem 3.1. To solve this prob- lem a non-parametric approach is adopted based on vector quantization (VQ) [18]. In this section we propose a quantized CMI construction, state its asymptotic desirable properties and finally introduce the role of data-dependent VQ for the problem, where an an algorithm is presented based on Darbellay-Vajda tree-structured data-dependent partition[18,21]. 3.5.1 QuantizedCMIConstruction Our basic problem is to estimate Δρ = I(X 1 ,X 2 ;Y|X 3 ) based on i.i.d. realizations of the joint phenomenon. Without loss of generality let X 1 , X 2 and X 3 be tree continu- ous random variables in (R,B(R)) andY the finite alphabet class random in (Y,F Y ). We denote by P X i the probability of X i on (R,B(R)), and we assume it has a prob- ability density function (pdf) given by p X i . The same is assumed for the joint prob- ability of (X 1 ,X 2 ,X 3 ), with pdf p X 1 ,X 2 ,X 3 defined on (R 3 ,B(R 3 )) and for the class conditional probabilities denoted byP X 1 ,X 2 ,X 3 |Y (·|y) with corresponding pdfs given by p X 1 ,X 2 ,X 3 |Y (·|y),∀y∈Y. Our CMI construction follows Darbellay et al. [18], by using quantized versions of X 1 ,X 2 and X 3 by the following type of product partition, Q 1,2×3 ≡ Q 1,2 × Q 3 = R 1,2 i ×R 3 j :j = 1,..,n;i = 1,..,n , where Q 1,2 = R 1,2 i :i = 1,..,n and Q 3 = R 3 j :j = 1,..,n aremeasurablepartitionsof(R 2 ,B(R 2 ))and(R,B(R)),respectively. BasedonthisproductpartitionourquantizedCMIisgivenby, Δρ(Q 1,2×3 ) =I Q 1,2×3 (X 1 ,X 2 ,X 3 ;Y)−I Q 3 (X 3 ;Y), (3.23) 59 where for any arbitrary continuous random variable X in (R,B(R)) and partition Q of R k , I Q (X;Y) refers to 16 P y∈Y P A∈Q P X,Y (A×{y})log P X,Y (A×{y}) P X (A)P Y ({y}) . It is well known that quantization reduces the magnitude of information quantities [32, 18], which is also the case for our quantized CMI construction, i.e., Δρ(Q 1,2×3 ) ≤ Δρ = I(X 1 ,X 2 ;Y|X 3 ). Then it is interesting to study the approximation properties of the proposed product CMI construction. In other words, find if this suggested construction can achieve Δρ by systematically increasing the resolution of a sequence of product quantizers—anotionofasymptoticsufficientpartitionsfortheCMIestimation. Inthis direction, we have extend the work of Darbellay et al. [18] showing general sufficient conditions in the asymptotic structure of a sequence of nested product partitions for approximatingtheCMI.Thisimportantresultjustifiesourchoiceofproductpartitionin theasymptoticregime. Theproofofthisresultalthoughfundamentalisnotinthemain scopeofthiswork,andconsequentlyitisreportedinAppendix3.8.8forcompleteness. In practice we have a collection of i.i.d. 
samples and hence the empirical distributions will be used to estimate Δρ(Q 1,2×3 ) in (3.23). More precisely, let {(x i 1 ,x i 2 ,x i 3 ,y i ) :i = 1,..,N} be our empirical data and Q 1,2×3 an arbitrary product measurable partition. The empirical joint distribution of the quantized observation ran- domvariable((X 1 ,X 2 ) Q 1,2 ,X Q 3 3 )andclassrandomvariableY,usingtheMLcriterion, isgivenby ˆ P N (X 1 ,X 2 ) Q 1,2 ,X Q 3 3 ,Y (A 1,2 ×A 3 ×{y}) = 1 N P N i=1 I A 1,2 ×A 3 ×{y} ((x i 1 ,x i 2 ),x i 3 ,y i ), ∀A 1,2 ∈ Q 1,2 ,∀A 3 ∈ Q 3 and∀y ∈Y. The associated marginal empirical distributions 16 I Q (X;Y) can be seen as the MI between the quantized random variable. X Q (u) = P A∈Q I A (X(u))·f(A)—f(·)beingageneralinjectivefunctionfromQto R k —andY(u). 60 are computed accordingly. Hence, we can obtain the empirical MIs using the formula below 17 , ˆ I Q 1,2×3 N (X 1 ,X 2 ,X 3 ;Y) = 1 N N X i=1 log ˆ P N (A i 1,2 ×A i 3 ×{y}) ˆ P N (A i 1,2 ×A i 3 )· ˆ P N ({y}) (3.24) ˆ I Q 3 N (X 3 ;Y) = 1 N N X i=1 log ˆ P N (A i 3 ×{y}) ˆ P N (A i 3 )· ˆ P N ({y}) (3.25) and consequently the empirical CMI by ˆ Δρ N (Q 1,2×3 ) = ˆ I Q 1,2×3 N (X 1 ,X 2 ,X 3 ;Y) − ˆ I Q 3 N (X 3 ;Y). Considering the mentioned product sufficient partition sequence for the CMI and sufficient number of samples points, from the weak law of large numbers [73, 4] it is simpletoshowthat ˆ Δρ N (Q 1,2×3 )canbearbitrarilyclosetoΔρinprobability,whichis a desired weak consistency result. However, in practice we need to deal with the non- asymptotic case of having a finite amount of training data. In this context, the problem of finding a good estimation for Δρ across a sequence of nested partitions needs to consider an approximation and estimation error tradeoff, as with any other statistical learning problem [21]. To address this issue, we follow the data-dependent partition frameworkproposedbyDarbellayetal. [18]. 3.5.2 Darbellay-VajdaData-DependentPartition The Darbellay-Vajda algorithm partitions the observation space, by iterating a splitting rule that generates a sequence of tree-indexed nested partitions [18]. To illustrate the idea, let X and Y be continuous finite alphabet random variables and let us consider the problem of estimatingI(X;Y). In addition, let{(x i ,y i ) :i = 1,..,N} denote the trainingdataand ˆ P N betheempiricalprobabilitywithdistributionfunctiondenotedby 17 Thesub-scriptindicesontheprobabilitiesareomittedtosimplifynotationin(3.24)and(3.25). 61 ˆ F X (x) = ˆ P N X ((−∞,x]) 18 . In the k-phase of this algorithm the criterion checks every atomAofthecurrentpartitionQ k byevaluatingtheempiricalMIgainobtainedbypar- titioningA with a product structure adaptively generated with the marginal distribution of the training points inA, denoted byQ(A) 19 . If this gain is above a critical threshold thealgorithmsplitstheatomtoupgradethepartitionbyQ k+1 = (Q k \{A})∪Q(A). and continuesinthisregionapplyingrecursivelytheaforementionedsplittingcriterion. But inthenegativecase,thealgorithmstopstherefinementofthisregionundertheassump- tionthatconditiontotheeventX ∈A,X andY canbeconsideredalmostindependent, i.e., ˆ I Q 2 (A) N (X;Y|X ∈A)<⇒I(X;Y|X ∈A)≈ 0. Furthermore to control estima- tionerror,weintroduceathresholdinthesplittingruletocontroltheminimumnumber oftrainingpointsassociatedwithA,forhavingagoodrepresentationofthejointdistri- butionbetweenX andY inthistargetregion. Thepseudo-codeispresentedinFig. 3.3, whichconsidersthefollowingsetofparameters: • (s,r) ∈ N 2 , s > r: used for generating product refinements, see Fig. 
3.3 for details,
• $\delta>0$: threshold for the MI gain,
• $N_c\in\mathbb{N}$: minimum number of points for probability estimation.

(In the illustration above $X$ is treated as a scalar random variable; the construction extends naturally to the finite-dimensional scenario. The marginal MI gain can be expressed as $\hat{P}^N_X(A)\cdot\hat{I}^{Q_2(A)}_N(X;Y|X\in A)$.)

Finally, in our target problem we have $X_1$, $X_2$, $X_3$ and $Y$, and we need to estimate $\Delta\rho=I(X_1,X_2;Y|X_3)$ from the empirical data $\{(x_1^i,x_2^i,x_3^i,y^i): i=1,..,N\}$. The non-parametric estimation then goes as follows:

1) Use the Darbellay-Vajda algorithm to construct the partition $Q^N_{1,2}$ for $(X_1,X_2)$ using $\{(x_1^i,x_2^i,y^i): i=1,..,N\}$.

2) Use the Darbellay-Vajda algorithm to construct the partition $Q^N_{3}$ for $X_3$ using $\{(x_3^i,y^i): i=1,..,N\}$.

3) Consider the product adaptive partition $Q^N_{1,2\times 3}$ to:

3.1) Compute the empirical joint distribution $\hat{P}^N_{X_1,X_2,X_3,Y}$ for every event in $\sigma(Q^N_{1,2\times 3})\times\mathcal{F}_Y$.

3.2) Compute the empirical MI indicators $\hat{I}^{Q^N_{1,2\times 3}}_N(X_1,X_2,X_3;Y)$ and $\hat{I}^{Q^N_3}_N(X_3;Y)$.

3.3) Finally, compute the CMI estimate $\hat{\Delta\rho}_N(Q^N_{1,2\times 3})$.

3.6 Experiments

In this section we report experiments to evaluate: the non-parametric CMI estimator across the different scale-frequency values of the WP basis family; the solutions of the minimum cost tree pruning in terms of the resulting frequency band decompositions; and the classification performance of the resulting feature descriptions in comparison with some standard feature representations.

3.6.1 Frame Level Phone Classification from Speech Signal

We consider an automatic speech recognition scenario, where filter banks have been widely used for feature representations and where, furthermore, concrete ideas for the optimal frequency band decomposition are well understood from perceptual studies of the human auditory system. The corpus used was collected in our group at USC and comprises about 1 hour 30 minutes of spontaneous conversational speech from a male English speaker, sampled at 16 kHz. A standard frame-by-frame analysis was performed on these acoustic signals: every 10 ms (frame rate) a segment of the acoustic signal of 64 ms around a time center position was extracted. Word-level transcriptions were used for generating phone-level time segmentations of the acoustic signals by automatic forced Viterbi alignment. Using the phone-level time segmentations, a collection of acoustic frame vectors of dimension K = 1024, with their corresponding phone class labels (47 classes), was created; we considered one session of the data comprising N = 14979 supervised sample points. Finally, for creating the set of feature representations, we used the Daubechies maximally flat filter (db4) for the WP basis family [19, 77] and the energy of the resulting bands. We first present some analysis of the minimum cost tree pruning in terms of the topology of its solutions (the optimal filter bank decomposition problem), and then we evaluate the performances associated with those solutions.

3.6.2 Analysis of the MI Gain and Optimal Tree Pruning

We estimated the CMI gains in (3.11) using the algorithm presented in Section 3.5. We considered s = 8, r = 2 for generating the product refinement (associated with the MI gain obtained by refining the product partition), following the general recommendations in [18]. We tried different configurations for $\delta$ and $N_c$, which strongly govern the tradeoff between approximation and estimation error; a simplified plug-in sketch of the estimator follows.
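The adaptive splitting rule of Fig. 3.3 is not reproduced here; the sketch below is a deliberately simplified stand-in that uses fixed equal-frequency product partitions (with a bin count playing a role loosely analogous to $N_c$) and evaluates the plug-in quantities of (3.24)-(3.25) and the CMI of (3.23). All names and the partition rule are assumptions for illustration, not the Darbellay-Vajda procedure itself.

```python
import numpy as np

def quantize(v, n_bins):
    """Equal-frequency scalar quantizer (a fixed, non-adaptive partition)."""
    edges = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, v)

def empirical_mi(cells, y):
    """Plug-in MI (in nats) between a discretized variable and the class label."""
    mi = 0.0
    for c in np.unique(cells):
        for cls in np.unique(y):
            p_xy = np.mean((cells == c) & (y == cls))
            if p_xy > 0:
                p_x, p_y = np.mean(cells == c), np.mean(y == cls)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def plugin_cmi(x1, x2, x3, y, n_bins=4):
    """CMI estimate of (3.23): I_hat((X1,X2,X3); Y) - I_hat(X3; Y) on a product partition."""
    q1, q2, q3 = (quantize(v, n_bins) for v in (x1, x2, x3))
    joint_cells = (q1 * n_bins + q2) * n_bins + q3   # product partition Q_{1,2} x Q_3
    return empirical_mi(joint_cells, y) - empirical_mi(q3, y)

# Toy check: X1, X2 carry class information beyond X3, so the estimate should be positive.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 4000)
x3 = rng.normal(y, 1.0)
x1, x2 = rng.normal(y, 0.5), rng.normal(-1.0 * y, 0.5)
print(plugin_cmi(x1, x2, x3, y))
```

Replacing the fixed quantizer with the data-dependent splitting of Section 3.5.2 (refining an atom only when the local MI gain exceeds $\delta$ and it contains at least $N_c$ points) gives the estimator actually used in the experiments.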
We conducted an exhaustive analysis of the CMI estimates obtained across those configurations, observing only marginal discrepancies in the relative differences of the estimated CMI values across scales and frequency bands. In this respect, it is important to point out that the relative differences among the CMI values Δρ(l,j), for l = 1,..,6 and j = 0,..,2^l - 1, fully characterize the topology of the solutions of the minimum cost tree pruning problem. This behavior can be explained by the fact that the implicit over-estimation (due to estimation error) and under-estimation (due to quantization) affect all the CMI estimates uniformly across scales and bands (the involved random variables have the same dimension and the same number of sample points is used). For this setting we chose a conservative configuration, δ = 1.0 × 10^{-200} and N_c = 200, to obtain a reasonable estimate of the class-observation distributions during the quantization process and, consequently, a bias toward under-estimation of the true CMI values.

Fig. 3.4 shows the CMI estimates (or MI gains) across scales and frequency bands for the WP decomposition. The global trend in Fig. 3.4 is the expected one: iterating on the lower frequency bands provides more phone discrimination information than iterating on the higher frequency bands, across almost all scales of the analysis. This is consistent with studies of the human auditory system showing that, overall, there is higher discrimination for the lower frequency regions than for the higher frequency regions in the auditory range of 55 Hz-15 kHz [55]. The same global trend was observed for all the other sessions of the corpus (not reported here), supporting the generality of the results obtained from the mutual information decomposition across bands of the acoustic signals. Based on this trend, the general solution of the optimal tree pruning problem follows the expected tendency: for a given number of bands, more levels of decomposition are allocated to the lower frequency components of the acoustic space. Interestingly, exact Wavelet-type filter bank solutions (the type of filter bank structure obtained from human perceptual studies, the MEL scale [81]) were obtained for the solutions associated with small dimensions. The same analysis was also conducted in a synthetic setting to evaluate the CMI trends across scale and frequency and the solutions of the optimal filter bank decomposition. The expected trends and decompositions were obtained in terms of the discrimination carried by the different frequency bands of the signals, designed during the synthesis stage. Results are not reported here for space considerations.

3.6.3 Frame-Level Phone Recognition

The solutions of the cost-fidelity problem were used as feature representations for frame-level phone classification. In particular, we evaluated solutions associated with the following dimensions: 4, 7, 10, 13, 19, 25, 31, 37, 43, 49, 55 and 61. GMMs were used for estimating the class-conditional densities in the Bayes decision setting, which is the standard parametric model adopted for this type of frame-level phone classification [9], and ten-fold cross-validation was used for performance evaluation. 32 mixture components per class were considered, and the EM algorithm was used for ML parameter estimation. As a reference, we consider the standard 13-dimensional Mel-Cepstrum (MFCCs) plus delta and acceleration coefficients, using the same frame rate (10 ms) and window length (64 ms), i.e., a 39-dimensional feature vector associated with a total window length of 100 ms, for which the correct phone classification rate (mean and standard deviation) obtained was 53.01% (1.01).
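The classifier just described can be sketched as follows. This is a minimal illustration, assuming a feature matrix X (frames by feature dimensions) and a label vector y are already available; the helper names are hypothetical, and scikit-learn's GaussianMixture is used here only for convenience, with no claim that it is the toolkit used in the original experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_bayes(X, y, n_components=32, seed=0):
    """Fit one GMM per class and record log class priors (ML training via EM)."""
    models, log_priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag',
                              reg_covar=1e-4, random_state=seed)
        models[c] = gmm.fit(Xc)                      # EM-based ML estimation
        log_priors[c] = np.log(len(Xc) / len(X))     # empirical class prior
    return models, log_priors

def predict_gmm_bayes(models, log_priors, X):
    """Plug-in Bayes decision: argmax_c [ log p(x|c) + log P(c) ] per frame."""
    classes = list(models)
    scores = np.column_stack([models[c].score_samples(X) + log_priors[c]
                              for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```

Wrapping these two calls in a stratified ten-fold cross-validation loop (for example with sklearn.model_selection.StratifiedKFold) mirrors the evaluation protocol described above.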
The performance of the minimum cost tree pruning family, using as fidelity criterion the proposed non-parametric CMI as well as the energy considered in [7], is reported in Table 3.1. Table 3.1 also reports the performance of two widely used dimensionality reduction techniques acting on the raw time-domain data: linear discriminant analysis (LDA) and non-parametric discriminant analysis (NDA).

Dimension   MCTP-MI        MCTP-Energy    LDA            NDA
            mean (std)     mean (std)     mean (std)     mean (std)
 4          28.53 (0.65)   28.04 (1.03)   12.58 (0.53)    6.73 (0.55)
 7          37.95 (1.32)   35.54 (0.98)   17.95 (1.19)    7.48 (0.99)
10          40.34 (1.31)   39.39 (1.30)   21.55 (1.02)    8.03 (0.71)
13          44.29 (1.24)   40.25 (1.58)   25.20 (0.99)    8.57 (0.67)
19          46.61 (1.10)   44.09 (0.97)   30.34 (1.32)    8.73 (0.88)
25          48.27 (1.22)   46.21 (1.55)   31.72 (1.31)    9.25 (0.73)
31          49.64 (1.03)   46.58 (1.93)   31.84 (1.04)    9.37 (0.76)
37          51.11 (0.97)   47.64 (1.51)   31.62 (1.03)    9.32 (0.56)
43          52.10 (1.43)   47.25 (0.99)   31.28 (1.16)    9.50 (0.55)
49          52.98 (1.52)   47.47 (1.72)   29.61 (1.52)    9.80 (0.74)
55          52.87 (1.23)   47.07 (1.33)   27.60 (0.93)   10.12 (0.72)
61          52.44 (1.41)   39.33 (1.67)   25.42 (1.03)   10.08 (1.12)

Table 3.1: Correct phone classification (CPC) rates (mean and standard deviation) for the minimum cost tree pruning (MCTP) solutions, using the proposed empirical mutual information (MI) and the energy as fidelity criteria. As a reference, performances are provided for linear discriminant analysis (LDA) and non-parametric discriminant analysis (NDA). Performances were obtained using 10-fold cross-validation and a GMM-based classifier.

LDA and NDA show relatively poor performance compared to the filter bank representations of the acoustic process. This can be attributed to two reasons: first, these methods are constrained to the family of linear transformations of the raw data; and second, both techniques carry an implicit Gaussianity assumption in taking the ratio of between- to within-class scatter matrices as the optimality criterion [53, 63], which is not guaranteed to be valid in this particular high-dimensional setting. When comparing the solutions of the minimum cost tree pruning using the proposed empirical MI and the energy as fidelity criteria, the former, as expected, shows consistently better performance, demonstrating the effectiveness of the empirical MI as an indicator of discrimination information. As a final corroboration of the adequacy of the filter bank WP family and of the proposed optimality criterion, our data-driven optimal tree pruning family provides performance competitive with the MEL-frequency-scale filter bank family (MFCCs) over a similar range of dimensions (31 to 43).

These experiments show the importance of having, on the one hand, a good target family of feature representations, ratifying the representation quality of the filter bank family for the analysis of pseudo-stationary stochastic phenomena, and, on the other hand, an optimality criterion that reflects the underlying estimation and approximation error tradeoff of the MPE-SR problem.

3.7 Discussion and Future Work

It is important to remind the reader that, although the presented formulation is theoretically supported by the fact that the MPE-SR needs to address a complexity-regularization problem, this optimization problem is practically intractable and requires the introduction of approximations, in particular concerning the probability of error. In this paper, the empirical MI is adopted for that purpose.
This choice has some theoretical justification in terms of information theoretic inequalities and monotonic behavior of the indicator across sequence of embedded transformation of the data [16], however tightness is not guaranteed. In that respect the presented formulation is open to consider alternative fidelity criteria. The empirical risk (ER) is a natural candidate with a strong theoretical support[21,71],howevertheoptimizationproblemrequiresanexhaustiveevaluationin ouralphabetoffeaturetransformations,whichforreasonabledimensionoftheproblem is impractical. Another attractive alternative is the family of Ali-Silvey distances mea- sures, used to evaluate the effect of vector quantization in hypothesis testing problems [54,35],orevenempiricalindicatorslikeFisherlikescatterratios[26]. Thisisaninter- esting direction for future research, where as presented in this work additivity property oftheseindicators,withrespecttostructureofWPbasis,canbestudiedtoextendalgo- rithmic solutions, or alternatively, greedy algorithms can be proposed and empirically evaluated, when the resulting basis selection problem does not admit polynomial time algorithmicsolutions. Concerningthepresentedphoneclassificationexperiments,theproposeddata-driven feature extraction offers promising results, however a systematic study of the problem stillremainstobeconductedtoexplorethefullpotentialityoftheproposedformulation. Thismayincludeacarefuldesignofthetwochannelfilterbankevaluatingitsimpactin classification performances [9], the use of other tree-structured bases families, as well asexperimentalvalidationundermoregeneralacousticconditions. 68 3.8 TechnicalDerivations 3.8.1 ProofofProposition3.1 Equation (3.7) is just a consequence of the Parseval relationship [77, 19] and the fact that by construction ifT ¯ T, thenB¯ T is a sub-space refinement of B T 20 . Concerning thesecondresult,withoutlossofgeneralityletusconsiderT 1 T 2 T full ,wherewe needtoshowthatH(m T 1 (X)|m T 2 (X)) = 0. Forcomputingtheconditionalentropy,we consider thatm T 1 (X) = X l j (l,j)∈L(T 1 ) andm T 2 (X) = X l j (l,j)∈L(T 2 ) . Before going to theactualproofwewillusethefollowingresult. LEMMA3.1 LetusconsiderT T full ,thenwehavethat H X l j (l,j)∈T =H X l j (l,j)∈L(T) . (3.26) Proof: WeusethefactthatH(X l j |X l+1 2j ,X l+1 2j+1 ) = 0,forall(l,j)∈I(T full ),from (3.7). The idea is to partition the set of nodes as a function of its depth with respect to therootv root = (0,0),andusethechainrule[16,32]. LetT ={v root }∪T 1 ∪···∪T Ko , whereT i isthecollectionofnodesinT withdepthi,andK o themaximumdepthofthe tree(seeFig. 3.5-A).Inaddition,letusdefine ˜ T i ≡T i ∩L(T)and ˆ T i ≡T i ∩I(T)the setofterminalandinternalnodesofdepthiofthetree,respectively. SeeFig. 3.5-A.Note thatT Ko = ˜ T Ko andthatL(T) = S Ko k=1 ˜ T k . Bythetreestructure,∀k∈{1,..,K o −1}, and ∀(k,j) ∈ ˆ T k , we have that (k + 1,2j) and (k + 1,2j + 1) belong to T k+1 = ˜ T k+1 ∪ ˆ T k+1 . We will use this node depth dependent partition of T for the following 20 B¯ T isasubspacerefinementofB T ,inthesensethatforanysubspaceX l j ,(l,j)∈L(T),∃ ˆ L⊂L( ¯ T) suchthatX l j = L ( ˆ l, ˆ j)∈ ˆ L X ˆ l ˆ j . 69 derivations. Inparticular,consideringthatT = ∪ Ko−1 k=1 ˆ T k ∪{v root } S ∪ Ko k=1 ˜ T i we havethat H X l j (l,j)∈T =H X l j (l,j)∈∪ Ko k=1 ˜ T k + H X l j (l,j)∈∪ Ko−1 k=1 ˆ T k ∪{vroot} | X l j (l,j)∈∪ Ko k=1 ˜ T k . (3.27) Henceforproving(3.26),weonlyneedtoshowthatthelastrighttermon(3.27)isequal tozero. 
Againusingthechainrulewehavethat: H X l j (l,j)∈∪ Ko−1 k=1 ˆ T k ∪{vroot} | X l j (l,j)∈∪ Ko k=1 ˜ T k =H X l j (l,j)∈ ˆ T Ko−1 | X l j (l,j)∈∪ Ko k=1 ˜ T k +H X l j (l,j)∈ ˆ T Ko−2 | X l j (l,j)∈∪ Ko k=1 ˜ T k , X l j (l,j)∈ ˆ T Ko−1 +H X l j (l,j)∈ ˆ T Ko−3 | X l j (l,j)∈∪ Ko k=1 ˜ T k , X l j (l,j)∈ ˆ T Ko−1 , X l j (l,j)∈ ˆ T Ko−2 +··· ···+H X l j (l,j)∈{vroot} | X l j (l,j)∈∪ Ko k=1 ˜ T k , X l j (l,j)∈∪ Ko−1 k=1 ˆ T k . (3.28) Without loss of generality let us analyze one of the generic terms of (3.28), say i ∈ {1,..,K o −2}. Bychainrulewehavethefollowinginequality H X l j (l,j)∈ ˆ T i | X l j (l,j)∈∪ Ko k=1 ˜ T k , X l j (l,j)∈∪ Ko−1 k=i+1 ˆ T k ≤H X l j (l,j)∈ ˆ T i | X l j (l,j)∈ ˜ T i+1 ∪ ˆ T i+1 , (3.29) 70 where enumerating X l j (l,j)∈ ˆ T i by the sequence X i j1 ,X i j2 ,..,X i jN i , where {j1,j2,..,jN i }⊂ {0,..,2 i −1} and considering the notation ¯ X l+1 j = X l+1 2j ,X l+1 2j+1 ∀(l,j)∈I(T full ),theupperboundpresentedin(3.29)isequivalentto H X i j1 ,X i j2 ,..,X i jN i | X l j (l,j)∈ ˜ T i+1 ∪ ˆ T i+1 ≤H X i j1 ,X i j2 ,..,X i jN i | ¯ X i+1 j1 , ¯ X i+1 j2 ,.., ¯ X i+1 jN i ≤ X j∈{j1,j2,..,jN i } H X i j | ¯ X i+1 j = 0. (3.30) The first inequality is because of the fact that {(i+1,2·j1),(i+1,2·j1+1),..,(i+1,2·jN i +1)} ⊂ ˜ T i+1 ∪ ˆ T i+1 and the chain rule, and the last equality by the hypothesis. The same derivation can be extendedforallthetermson(3.28),whichprovesthelemma. Using Lemma 3.1 the proof of the result is a basic application of the chain rule. Let usconsider H X l j (l,j)∈L(T 1 )∪L(T 2 ) =H X l j (l,j)∈L(T 2 ) +H X l j (l,j)∈L(T 1 ) | X l j (l,j)∈L(T 2 ) , (3.31) wheregiventhatL(T 1 )∪L(T 2 )⊂T 2 ,byLemma3.1wehavethat, H X l j (l,j)∈L(T 1 )∪L(T 2 ) ≤H X l j (l,j)∈T 2 =H X l j (l,j)∈L(T 2 ) . (3.32) Finally,thislastequalityprovestheresultusing(3.31). 71 3.8.2 ProofofProposition3.2 Proof: Let us start by consideringT T + lo,jo T full , whereT + lo,jo denotes the treeinducedfromT bysplittingoneofitsterminalnodes,(lo,jo)∈L(T). Bydefinition wehavethatm T (X) = X l j (l,j)∈L(T) andm T + lo,jo (X) = X l j (l,j)∈L(T + lo,jo ) withL(T + lo,jo ) ={(lo+1,2jo),(lo+1,2jo+1)}∪L(T)\{(lo,jo)}. Bymultipleapplicationofthe chainrulefortheMIitfollowsthat: I m T + lo,jo (X);Y −I(m T (X);Y) =I X l j (l,j)∈L(T + lo,jo ) ;Y −I X l j (l,j)∈L(T) ;Y =I X l j (l,j)∈L(T + lo,jo )∪{(lo,jo)} ;Y −I X lo jo ;Y| X l j (l,j)∈L(T + lo,jo ) −I X l j (l,j)∈L(T) ;Y = h I X l j (l,j)∈L(T)∪{(lo+1,2jo),(lo+1,2jo+1)} ;Y −I X l j (l,j)∈L(T) ;Y i −I X lo jo ;Y| X l j (l,j)∈L(T + lo,jo ) =I ¯ X lo+1 jo ;Y| X l j (l,j)∈L(T) −I X lo jo ;Y| X l j (l,j)∈L(T + lo,jo ) . (3.33) Finally noting that 0 ≤ I X lo jo ;Y| X l j (l,j)∈L(T + lo,jo ) ≤ H X lo jo | X l j (l,j)∈L(T + lo,jo ) ≤H X lo jo | ¯ X lo+1 jo = 0,bydefinitionoftheCMIandthechainrule,wegetthat I m T + lo,jo (X);Y −I(m T (X);Y) =I ¯ X lo+1 jo ;Y| X l j (l,j)∈L(T) , (3.34) which proves the result for this particular case. For the general case T ¯ T we can consider one of the possible sequence of internal nodes {(l1,j1),..,(ln,jn)}, which needs to be split to go from T to ¯ T. More precisely, we can consider the sequence 72 of embedded trees T = T 0 T 1 ··· T n = ¯ T, such that T i = (T i−1 ) + li,ji , ∀i∈{1,..,n}. 
Usingtelescopeseriesexpansionandtheresultsobtainedfrom(3.34), I(m Tn (X);Y)−I(m T 0 (X);Y) = n X i=1 I(m T i (X);Y)−I m T i−1 (X);Y (3.35) = n X i=1 I ¯ X li+1 ji ;Y| X l j (l,j)∈L(T i−1 ) = n X i=1 I ¯ X li+1 ji ;Y| X l j (l,j)∈T i−1 (3.36) =I ¯ X l1+1 j1 ,··· , ¯ X ln+1 jn ;Y| X l j (l,j)∈T 0 =I ¯ X l1+1 j1 ,··· , ¯ X ln+1 jn ;Y| X l j (l,j)∈L(T) (3.37) =I X l j (l,j)∈ ¯ T\T ;Y| X l j (l,j)∈L(T) =I X l j (l,j)∈ ¯ T\T ;Y|m T (X) . (3.38) The first equality in (3.37) is because of the chain rule for the MI, and the first equality in (3.38) by construction where we have that ¯ T \ T = {(l1+1,2·j1),(l1+1,2·j1+1),...,(ln+1,2·jn),(ln+1,2·jn+1)}. The equalities involving interchanging X l j (l,j)∈T by X l j (l,j)∈L(T) in (3.36) and (3.38) are adirectconsequenceofLemma3.1. 3.8.3 ProofofProposition3.3 Proof: This result is a simple consequence of the CMI definition and the Markov treeassumption. Morepreciselywehavethat: ρ(T + l,j )−ρ(T) =I X l+1 2j ,X l+1 2j+1 ;Y| X l 0 j 0 (l 0 ,j 0 )∈L(T) =H X l+1 2j ,X l+1 2j+1 | X l 0 j 0 (l 0 ,j 0 )∈L(T) −H X l+1 2j ,X l+1 2j+1 |Y, X l 0 j 0 (l 0 ,j 0 )∈L(T) =H X l+1 2j ,X l+1 2j+1 |X l j −H X l+1 2j ,X l+1 2j+1 |Y,X l j =I X l+1 2j ,X l+1 2j+1 ;Y|X l j , wherethethirdequalitymakesuseoftheconditionalindependentassumption. 73 3.8.4 AdditivePropertyoftheMutualInformationTreeFunctional ρ(·): THEOREM3.1 Proof: We have that T ¯ T. As in the proof presented in Appendix 3.8.2, we can consider a sequence of internal nodes {(l1,j1),..,(ln,jn)}, and the sequence of embedded trees T = T 0 T 1 ··· T n = ¯ T, such that T i = (T i−1 ) + li,ji , ∀i ∈ {1,..,n}. Fromthefirstequalityof(3.36)wehavethat ρ( ¯ T)−ρ(T) = n X i=1 I ¯ X li+1 ji ;Y| X l j (l,j)∈L(T i−1 ) (3.39) = n X i=1 I ¯ X li+1 ji ;Y|X li ji = n X i=1 Δρ(li,ji), (3.40) wheretheequalitiesin(3.40)istheresultoftheMarkovtreeassumption. Usingthefact that{(l1,j1),..,(ln,jn)} =I( ¯ T)\I(T),(3.40)showsthefirstresultsofthetheoremin (3.13). Forprovingthenextexpressionin(3.14),wewillstartwiththeresultpresented in Proposition 3.2, where ρ( ¯ T)−ρ(T) = I X l j (l,j)∈ ¯ T\T ;Y|m T (X) . Using that m T (X) = X l j (l,j)∈L(T) andthechainruleitdirecttoshowthat ρ( ¯ T)−ρ(T) =I X l j (l,j)∈ ¯ T\T∪L(T) ;Y|m T (X) =I X l j (l,j)∈ ¯ T\I(T) ;Y|m T (X) (3.41) wherenotingthat ¯ T \I(T) = S ( ¯ l, ¯ j)∈L(T) ¯ T ( ¯ l, ¯ j) (seeFig. 3.5-B),wegetthat, ρ( ¯ T)−ρ(T) =I X l j (l,j)∈ S ( ¯ l, ¯ j)∈L(T) ¯ T ( ¯ l, ¯ j) ;Y| X l j (l,j)∈L(T) . (3.42) 74 Finally from (3.42) by using the chain rule for CMI, and the conditional indepen- denceassumptionstatedinProposition3.3,itissimpletoshowthat: ρ( ¯ T)−ρ(T) = X ( ¯ l, ¯ j)∈L(T) I X l j (l,j)∈ ¯ T ( ¯ l, ¯ j) ;Y|X ¯ l ¯ j (3.43) = X ( ¯ l, ¯ j)∈L(T) I X l j (l,j)∈L( ¯ T ( ¯ l, ¯ j) ) ;Y|X ¯ l ¯ j = X ( ¯ l, ¯ j)∈L(T) I m¯ T ( ¯ l, ¯ j) (X);Y|X ¯ l ¯ j (3.44) which from the definition ofρ T (l,j) proves (3.14). Finally for proving the last expres- sion, in (3.15), we just need to consider the trivial tree{(0,0)} and T T full . It is clearthat{(0,0)}T andfrom(3.40), ρ(T) =ρ({(0,0)})+ X (l,j)∈I(T)\I({(0,0)}) Δρ(l,j), (3.45) wheregiventhatI({(0,0)}) =φandthatρ({(0,0)}) =I(X 0 0 ;Y),weprove(3.15). 3.8.5 ProofofProposition3.4 Proof: Forproving(3.16)bydefinition, ρ T (l,j) =I(m T (l,j) (X);Y)−I(X l j ;Y) =I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l,j) \{(l,j)} ;Y|X l j , 75 where the second equality is because of Proposition 3.2. Using the binary structure of T,itfollowsthatT (l,j) \{(l,j)} =T (l+1,2j) ∪T (l+1,2j+1) . 
Hence,consideringthenotation ¯ X l+1 j = X l+1 2j ,X l+1 2j+1 wehavethat: ρ T (l,j) =I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j) ∪T (l+1,2j+1) ;Y|X l j =I( ¯ X l j ;Y|X l j )+I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j) ;Y|X l j ,X l+1 2j ,X l+1 2j+1 +I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j+1) ;Y|X l j ,X l+1 2j ,X l+1 2j+1 , X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j) = Δρ(l,j)+I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j) ;Y|X l+1 2j +I X ¯ l ¯ j ( ¯ l, ¯ j)∈T (l+1,2j+1) ;Y|X l+1 2j+1 = Δρ(l,j)+I m T (l+1,2j) (X);Y|X l+1 2j +I m T (l+1,2j+1) (X);Y|X l+1 2j+1 . The second equality is because of the chain rule for the CMI [16], the third by the tree independent Markov assumption, and the last a direct consequence of Proposition 3.1 (seeLemma3.1inAppendix3.8.1fordetails),whichprovestheresult. 3.8.6 Dynamic Programming Solution for the Optimal Tree Prun- ingProblem: THEOREM3.2 Proof: Letusconsiderv∈I(T full ),wewanttofindthesolutionfor, T k∗ v = arg max TTv |T|=k ρ T (v), (3.46) as a function of solutions of its direct descendants — l(v) and r(v) — which are assumedtobeknown, n T k 1 ∗ l(v) ,T k 2 ∗ r(v) :k 1 = 1,.., T l(v) k 2 = 1,.., T r(v) o . Letuscon- sider the non-trivial case k > 1, and an arbitrary tree T T v such that |T| = k. Then, l(v) ∈ T and r(v) ∈ T and by Proposition 3.4 it follows that, ρ T (v) = 76 Δρ(v) + ρ T (l(v)) + ρ T (r(v)), where in addition if we denote T by v,T l(v) ,T r(v) , thenbydefinition|T| = T l(v) + T r(v) ,andT T v isequivalenttoT l(v) T l(v) and T r(v) T r(v) . Consequently,ifweconcentratetheattentionon(3.46),itfollowsthat: max TTv |T|=k ρ T (v) = Δρ(v)+ max T=[v,T l(v) ,T r(v)] T l(v) T l(v) ,T r(v) T r(v) |T l(v)|+|T r(v)|=k [ρ T (l(v))+ρ T (r(v))] (3.47) = Δρ(v)+ max k 1 ≥1,k 2 ≥1 k 1 +k 2 =k max TT l(v) |T|=k 1 [ρ T (l(v))]+ max TT r(v) |T|=k 2 [ρ T (r(v))] (3.48) = Δρ(v)+ max |T l(v)|≥k 1 ≥1,|T r(v)|≥k 2 ≥1 k 1 +k 2 =k ρ T k 1 ∗ l(v) (l(v))+ρ T k 2 ∗ r(v) (r(v)) . (3.49) Thislastequality,(3.49),isdirectfromthedefinitionoftheoptimalpruningtreeforthe descendantsofv,using(3.46). Hence,from(3.49)wehavethatT k∗ v canberepresented by h v,T ˆ k 1 ∗ l(v) ,T ˆ k 2 ∗ r(v) i ,being( ˆ k 1 , ˆ k 2 )thesolutionof ( ˆ k 1 , ˆ k 2 ) = arg max (k 1 ,k 2 )∈{1,..,|T l(v)|}×{1,..,|T r(v)|} k 1 +k 2 =k ρ T k 1 ∗ l(v) (l(v))+ρ T k 2 ∗ r(v) (r(v)). (3.50) 3.8.7 ProofofTHEOREM3.3 Proof: Let us defineρ λ (T) ≡ −ρ T (t)+λ|T|,∀T T t ( the set of binary tree rooted at t ∈ T full ), ∀λ ∈ R + , where ρ T (t) is defined as in (3.12). In addition, let us denote by T(λ) a minimum norm solution of the following complexity regularized problem: T(λ) = argmin ¯ TT ρ λ ( ¯ T), (3.51) 77 where if T 1 T is a solution of (3.51), then|T 1 | ≥ |T(λ)|. 21 Next we show that there exists a embedded family of solution for the family pruning problem associated with (3.51). Consider T T t to be a nontrivial tree, |T| ≥ 2 — this implies that t ∈ I(T full ). LetT l(t) andT r(t) be its primary branches, i.e.,T = t,T l(t) ,T r(t) , then usingthepseudo-additivepropertyof ρ T (t),(3.16)andtheadditivepropertyofthenorm ofthetree,wehavethat 22 ρ λ (T) =−Δρ(t)+ρ λ (T l(t) )+ρ λ (T r(t) ). (3.52) Thenthefollowingpropositionfollows. Proposition3.5 Let T T t be a non-trivial tree rooted at t ∈ T full , then we have that min ¯ TT ρ λ ( ¯ T) = min ρ λ ({t}),−Δρ(t)+ρ λ (T l(t) (λ))+ρ λ (T r(t) (λ)) . (3.53) Proof: LetT(λ) be a solution of (3.51). We need to consider two scenarios. 
The trivialcase,T(λ) ={t}wheregiventhatinparticular t,T l(t) (λ),T r(t) (λ) T,then ρ λ ({t})≤ρ λ ( t,T l(t) (λ),T r(t) (λ) ) =−Δρ(t)+ρ λ (T l(t) (λ))+ρ λ (T r(t) (λ)), thelastequalityby(3.52),whichprovestheresultforthetrivialcase. 21 Notethatthereisalwaysaminimumnormsolutionof(3.51),whichisnotnecessarilyunique. 22 NoteforthetrivialcaseT ={t}thatρ λ (T) =λ. 78 For the non-trivial case |T(λ)| ≥ 2, then using (3.52), we have that ρ λ (T(λ)) = −Δρ(t)+ρ λ (T(λ) l(t) )+ρ λ (T(λ) r(t) ), where it is necessary — considering thatT(λ) minimizesρ λ (·)—that: ρ λ (T(λ) l(t) ) = min ¯ TT l(t) ρ λ ( ¯ T) =ρ λ (T l(t) (λ)) ρ λ (T(λ) r(t) ) = min ¯ TT r(t) ρ λ ( ¯ T) =ρ λ (T r(t) (λ)), andthen ρ λ (T(λ)) =−Δρ(t)+ρ λ (T l(t) (λ))+ρ λ (T r(t) (λ))<ρ λ ({t}), (3.54) whichfinallyproves(3.53). Finallywewanttoshowthatifλ 2 ≥λ 1 ≥ 0,thenT(λ 2 )T(λ 1 ),whichattheend prove the result. Using induction in the number of nodes ofT, that we will denote by n, and Proposition 3.5, the result follows directly. Let us denote the root ofT byt. For n = 1, we have that∀λ∈ R + ,T(λ) =T ={t}, which satisfies the result. For the non trivialcasen> 1,weconsidertheresultistruefortreeswithstrictlylessthannnodes. Thenwehavethreescenarios: a) T(λ 1 ) ={t}istrivial,whichimpliesthatT(λ 2 )istrivialandconsequentlytheresult holds. b) T(λ 1 )isnon-trivialand T(λ 2 ) ={t}istrivial,wheretheresultholdsdirectlyagain. c) BothT(λ 1 ) andT(λ 2 ) are not trivial. In this case we know by Proposition 3.5 that T(λ 1 ) = t,T l(t) (λ 1 ),T r(t) (λ 1 ) andT(λ 2 ) = t,T l(t) (λ 2 ),T r(t) (λ 2 ) ,whereusing the inductive hypothesis, we have that T l(t) (λ 2 ) T l(t) (λ 1 ) and T r(t) (λ 2 ) T r(t) (λ 1 )provingthatT(λ 2 )T(λ 1 ). 79 3.8.8 Asymptotically Sufficient Results for the Product CMI Con- struction HereweprovidetheoreticalsupportfortheproductpartitionpresentedinSection3.5.1, for the CMI estimation problem. In particular based on the product construction, we prove a desired asymptotic sufficient result and show a weak consistency argument for ourhistogrambasedCMIestimate. It is well known that quantization reduces the magnitude of information quantities [32, 18], which is also the case for our quantized CMI construction, i.e., Δρ(Q 1,2×3 )≤ Δρ =I(X 1 ,X 2 ;Y|X 3 ). Then it is interesting to study the approximation properties of theproposedproductCMIconstruction. Inotherwords,findifthissuggestedconstruc- tioncanachieveΔρbysystematicallyincreasingtheresolutionofasequenceofproduct quantizers—anotionofasymptoticsufficientpartitionsfortheCMIestimation. Letusstartwithsomebasicdefinitionsandresults. Definition3.1 (from [18]) Let X be a continuous random variable taking values in (R k ,B(R k )) with probability measure P X . Let{Q n :n∈ N} be a nested sequence of partitions for (R k ,B(R k )) 23 . The sequence is asymptotic sufficient for X, if ∀A ∈ B(R k ),∀> 0,thereexistsn()∈ Nand ¯ A∈σ(Q n() ),suchthat P X (A4 ¯ A)<, (3.55) whereσ(Q n )⊂B(R k )denotesthesigmafieldinducedbyQ n . 24 23 {Q n :n∈ N} is nested if for any n ∈ N, Q n+1 is a refinement of Q n ., i.e., every atom in Q n is disjointunionofatomsinQ n+1 . 24 A sigma field induced by a set Q is the smallest sigma field that contains Q [34]. For the case of finiteandcountablepartitionsisthecollectionofsetthatcanbeconstructedastheunionofcellsinQ. 80 Proposition3.6 (Proposition 2 in [18]) Let us consider X and Y to be respectively the continuous and finite alphabet random variables with joint distribution P X,Y on (R k ×Y,σ(B(R k )×F Y )). 
If{Q n :n∈ N}isasymptoticsufficientforX then lim n→∞ I Q n (X;Y) =I(X;Y). (3.56) Given that the CMI can be expressed as the difference between MI quantities, from Proposition 3.6 we just need to prove that there exists a way to construct a sequence of nestedsufficientpartitions Q n 1,2,3 :n∈ N for(X 1 ,X 2 ,X 3 )withtheproposedproduct structure, i.e.,Q n 1,2,3 =Q n 1,2 ×Q n 3 ∀n∈ N, and the condition that its marginal partition sequence{Q n 3 :n∈ N}isasymptoticsufficientforX 3 . Thefollowingtheoremformal- izesthisidea. THEOREM3.4 Let {Q n 1 :n∈ N} and {Q n 2 :n∈ N} be asymptotic sufficient par- titions for X 1 and X 2 (both continuous random variables in (R k1 ,B(R k1 )) and (R k2 ,B(R k2 )), respectively). Then we have that the product partition Q n 1×2 =Q n 1 ×Q n 2 :n∈ N is asymptotic sufficient for (X 1 ,X 2 ) on (R k1 × R k2 ,σ(B(R k1 )×B(R k2 ))), where σ(B(R k1 )×B(R k2 )) is the product sigma field for R k1 ×R k2 . 25 Then,ifweconsiderΔρ(Q n 1×2 ) =I Q n 1×2 (X 1 ,X 2 ;Y)−I Q n 2 (X 2 ;Y),∀n∈ N, lim n*∞ Δρ(Q n 1×2 ) =I(X 1 ;Y|X 2 ). (3.57) Proof: We need to prove that∀ > 0,∀A ∈ σ(B(R k1 )×B(R k2 )), there exists n()∈ Nand ¯ A∈σ(Q n() )suchthat P X 1 ,X 2 (A4 ¯ A)<. (3.58) 25 σ(B(R k1 ) × B(R k2 )) is the smallest sigma field that contain the cylinders A 1 ×A 2 :A 1 ∈B(R k1 ),A 2 ∈B(R k2 ) ,[34]. 81 Given that σ(B(R k1 ) × B(R k2 )) is induced by the product cylinders B(R k1 ) × B(R k2 ) = A 1 ×A 2 :A 1 ∈B(R k1 ),A 2 ∈B(R k2 ) ,itisnaturaltothinkthatif(3.58)is valid for this subset, this condition can be extended for all the elements in the induced sigmafieldσ(B(R k1 )×B(R k2 )). Wewilluseawellknownmeasuretheoretictechnique forprovingthisresult[34,73]. Letusdefinetheset F = A∈σ(B(R k1 )×B(R k2 )) :∀> 0,∃n,∃ ¯ A∈σ(Q n ),P X 1 ,X 2 (A4 ¯ A)< (3.59) which is the collection of measurable events where {Q n :n∈ N} is asymptotically sufficient. We will prove first that F is a sigma field and second that F contains A 1 ×A 2 :A 1 ∈B(R k1 ),A 2 ∈B(R k2 ) ,whichimpliesthatσ(B(R k1 )×B(R k2 )) =F, by definition of the induced sigma field [34, 73]. Note that this last equality proves the propositionbydefinitionofthesetF. LetusstartbyprovingthatF isasigmafield. WeneedtoshowthatF iscloseunder anyfinitenumberofsetoperations. ForthisitissufficienttoprovethatF iscloseunder the union and the complement operations [34]. For the union, letA,B ∈F, and let us consider an arbitrary > 0. By definition ofF we have that there existsn1(/2) and n2(/2)associatedwith/2,and ¯ A∈σ(Q n1 )and ¯ B∈σ(Q n2 )thatsatisfy P X 1 ,X 2 (A4 ¯ A)< 2 , P X 1 ,X 2 (B4 ¯ B)< 2 .. Given that{Q n :n∈ N} is a nested sequence of partitions, per Definition 3.1, by con- sidering ¯ n() = max{n1(/2),n1(/2)}wehavethat[34] ¯ A, ¯ B∈σ(Q ¯ n() ), (3.60) 82 andbythefollowingrelationship (A∪B)4 ¯ A∪ ¯ B = (A∪B)\ ¯ A∪ ¯ B +... = A\( ¯ A∪ ¯ B) ∪ B\( ¯ A∪ ¯ B) +... ⊂ A\ ¯ A ∪ B\ ¯ B +...⊂ A4 ¯ A ∪ B4 ¯ B , we have that P X 1 ,X 2 (A∪B)4( ¯ A∪ ¯ B) ≤ P X 1 ,X 2 ( ¯ A4A) + P X 1 ,X 2 ( ¯ B4B) < . Finally noting that ¯ A∪ ¯ B ∈ σ(Q ¯ n() ) — by (3.60) and the fact that by construction σ(Q ¯ n() )isasigmafield—wehavethatA∪B∈F. ForthecomplementoperationletusagainconsiderA∈F and> 0. Thenwehave that∃n() and ¯ A ∈ σ(Q n() ), satisfying equation (3.58). But a basic set manipulation showsthat A4 ¯ A =A c 4 ¯ A c , (3.61) which implies thatP X 1 ,X 2 (A c 4 ¯ A c )<. In addition noting that ¯ A c ∈σ(Q n() ) because σ(Q n() )isasigmafield,wehavebydefinitionofF thatA c ∈F. 
We also need to prove that R k1+k2 ∈ F, which is direct from the definition ofF, (3.59),andthefactthat R k1+k2 ∈σ(Q n ),∀n∈ N. 26 Finally we need to show the sigma additive property in F [34]. Let us consider {A k :k∈ N} ⊂ F, where we need to prove that S k∈N A k ∈ F. Let > 0, then we needtofind ¯ A∈σ( S n∈N Q n ),suchthatP X 1 ,X 2 ( S k∈N A k 4 ¯ A)<. Letusconstructthe followingintegrablesequenceofnon-negativerealnumbers ( k ≡ 2 −k+1 ·) k∈N . Given that{A k :k∈ N}⊂F,then∀k∈ N,∃n( k )and ¯ A k ∈σ(Q n( k ) )suchthat P X 1 ,X 2 (A k 4 ¯ A k )< k . (3.62) 26 NotethatQ n isapartitionof R k1+k2 ,thenitsinducedsigmafieldcontains R k1+k2 . 83 IfwedefineA l ≡ S l k=0 A k and ¯ A l ≡ S l k=0 ¯ A k thenusingthat A l 4 ¯ A l ⊂ l [ k=0 (A k 4 ¯ A k ), ∀l∈ N, (3.63) andthewell-knownmonotoneandsub-addtitivepropertiesof P X 1 ,X 2 [34,73],itfollows that P X 1 ,X 2 (A l 4 ¯ A l )≤ l X k=0 P X 1 ,X 2 (A k 4 ¯ A k ). (3.64) Finallygiventhat(A l 4 ¯ A l ) l∈N ismonotonicincreasingmeasurablesetsequence,and usingthecontinuitypropertyofanyprobabilitymeasure[4]and(3.64),wehavethat P X 1 ,X 2 [ k∈N A k 4 [ k∈N ¯ A k ! = lim l→∞ P X 1 ,X 2 A l 4 ¯ A l ≤ lim l→∞ l X k=0 P X 1 ,X 2 (A k 4 ¯ A k ) = X k∈N k =, (3.65) where considering that ¯ A k ∈ σ(Q n( k ) ) ⊂ σ(∪ n∈N Q n ),∀k ∈ N, then we have that S k∈N ¯ A k ∈ σ(∪ n∈N Q n ), becauseσ(∪ n∈N Q n ) is a sigma field by definition. Therefore S k∈N A k ∈F. We have proved that F is a sigma field. The final step is to show that F = σ(B(R k1 ) × B(R k2 )). By hypothesis we have that A × R k2 ∈ F, ∀A ∈ B(R k1 ), because {Q 1 n :n∈ N} is asymptotic sufficient for X 1 , and by analogy R k1 × B ∈ F, ∀B ∈ B(R k2 ). Then by the fact that F is close under intersection A×B :A∈B(R k1 ),B∈B(R k2 ) ⊂ F, which implies that σ( A×B :A∈B(R k1 ),B∈B(R k2 ) ) = σ(B(R k1 )×B(R k2 )) ⊂ F and given that bydefinitionF ⊂σ(B(R k1 )×B(R k2 )),theresultisproved. 84 Returning to our problem, Theorem 3.4 says that we can approximate Δρ = I(X 1 ,X 2 ;Y|X 3 ) with arbitrary precision by constructing a product partition sequence frommarginalasymptoticsufficientpartitionsfor(X 1 ,X 2 )andX 3 ,respectively. Remark3.1 Important to note that existence of these marginal partition constructions follows from the actual definition of the KLD for continuous alphabet spaces, where it is defined as the supremum of finite measurable partitions. See (Chapter 5 in [32]) for anexcellentexpositionofthispoint. Remark3.2 Let consider a sequence of product sufficient partitions for the CMI, Q n 1,2×3 :n∈ N and empirical CMI ˆ Δρ N (Q n 1,2×3 ) presented in Section V.A of the manuscript, then it is simple consequence of Theorem 3.4 here and the weak law of largenumbers[73,4]that:∀> 0,∃n()∈ N,suchthat∀n>n(), lim N→∞ P ˆ Δρ N (Q n 1,2×3 )−Δρ > = 0, (3.66) where P refers to the underlying probability of the sample space. Then our quantized CMI estimate convergences in probability toΔρ as the index in the nested quantization sequence,n,andthenumberofsamplespoints,N,gotoinfinity. 85 Phase0: (Initialization: StatisticallyEquivalentBlocks) Q (1) = {(−∞,a 1 ],(a 1 ,a 2 ],...,(a r−1 ,∞)}, where a j ≡ ˆ F −1 X ( j r ), j ∈ {1,..,r−1}. 
Phase1: Stepk→k+1(Mainrecursion) Q (k+1) =∅,(initialization) foreach,A∈Q (k) if( ˆ P N X (A)> 1 Nc )(criticalnumberofsamplesperbin) -compute: ˆ F A (x) = ˆ P N X ((−∞,x]∩A) ˆ P N X (A) - construct: Q (k+1) r (A) = (−∞,a r 1 ],(a r 1 ,a r 2 ],..., a r r−1 ,∞ ∩ A and Q (k+1) s (A) = (−∞,a s 1 ],(a s 1 ,a s 2 ],..., a s s−1 ,∞ ∩ A where: a r j ≡ ˆ F −1 A ( j r ), j∈{1,..,r−1} a s j ≡ ˆ F −1 A ( j s ), j∈{1,..,s−1} -compute: ΔI(A) = ˆ P N X (A)· ˆ I Q (k+1) s (A) N (X;Y|X ∈A) if(ΔI(A)>δ)(criticalMIgain) Q (k+1) =Q (k+1) S Q (k+1) r (A) else Q (k+1) =Q (k+1) S {A} end,if end,if end,foreach Phase3: (Termination) if (Q k+1 equaltoQ k ) done, else k =k+1,gotoPhase1, end,if Figure 3.3: Darbellay-Vajda data-dependent partition algorithm for estimating the con- ditionalmutualinformation. 86 Figure 3.4: Graphical representation of the CMI magnitude by splitting the basic two channelfilterbankacrossscale(levelofdecomposition,horizontalaxes)andfrequency bands(verticalaxes)intheWPdecomposition. Figure3.5: Exampleofthenotationandtopologyofatree-indexedWPrepresentation. 87 Chapter4 DivergenceEstimationbasedon Data-DependentPartitions: Strong ConsistencyandApplications This paper presents a general histogram-based divergence estimator based on data- dependent partitions. General sufficient conditions for data-driven partition schemes, using Lugosi and Nobel’s combinatorial notions for partition families, are established to guarantee universal strong consistency of the divergence estimate. This consistency result does not put any major assumption other than the two probability measures to be absolutely continuous with respect to the Lebesgue measure. The result is particu- larized for two emblematic cases: the finite dimensional statistically equivalent blocks (Gessaman’sdata-dependentpartition)anddata-dependenttree-structuredvectorquan- tization (TSVQ), where specific design conditions are given for the induced histogram- baseddivergenceestimatestobestronglyconsistent. 1 1 IndexTerm: Universaldivergenceestimation,data-dependentpartitions,universallysufficientparti- tions, strong consistency, Vapnik-Chervonenkis inequality, statistically equivalent blocks, tree-structured partitions. 88 4.1 Introduction The divergence or Kullback-Leibler divergence (KLD) is a well known fundamental quantityininformationtheory[16,32]. Divergenceisdefinedastheaverageinformation per observation to discriminate between two probabilistic models in statistical decision theory[40]. Inlargedeviations,itappearsasafundamentalquantitytocharacterizethe ratefunction,whichreflectstheexponentialdecayofconvergenceofempiricalmeasures to their probabilities (Sanov’s theorem) [20] and the rate of decay of the probability of errorinabinaryhypothesistestingproblem(Stein’slemma)[16]. On the application side, mainly because of its role as a discriminative measure between probabilistic models [40], divergence has found wide use in statistical deci- sion problems. It has been used as a optimality criterion for parameter re-estimation, [67,38],asasimilaritymeasureformodelingclusteringandindexing[76,74,23],asan indicatortoquantifytheeffectofestimationerrorinaBayesdecisionapproach[75,63], to quantify the approximation error of vector quantization in statistical hypothesis test- ing[35,54]andasfidelityindicatorforfeatureselectionandfeatureextraction[59,52]. 
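As a point of reference for the quantity at stake in the rest of the chapter, the finite-alphabet form of the divergence (the quantized version that reappears as (4.12) below) can be computed directly. The short sketch that follows is purely illustrative: the helper name and the two distributions are made up for this example only.

```python
from math import log

def discrete_kl(p, q):
    """D(P||Q) = sum_a P(a) log(P(a)/Q(a)) on a finite alphabet.

    Requires P << Q: whenever q[a] == 0 we must also have p[a] == 0,
    otherwise the divergence is infinite. Terms with p[a] == 0 contribute 0."""
    return sum(pa * log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

# Example with two made-up distributions on a 3-symbol alphabet.
p = [0.5, 0.4, 0.1]
q = [0.4, 0.4, 0.2]
print(discrete_kl(p, q))   # approximately 0.042 nats
```
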
Usually these scenarios require estimating the divergence from a finite number of sample points: typically the distributions are not available and need to be estimated from empirical data. Consequently, the problem of divergence estimation is quite fundamental in these scenarios. Of particular interest are distribution-free estimates that converge, in some sense, to the desired theoretical values as the number of sample points tends to infinity.

Despite its theoretical and practical significance, relatively little work has been conducted on the universal estimation of the divergence in the continuous setting, where the probability space of interest is a finite-dimensional Euclidean space. One of the first works that addresses this problem of universal divergence estimation is by Wang, Kulkarni and Verdú [78]. That work proposed a histogram-based divergence estimate, considering an adaptive partition scheme that approximates empirical statistically equivalent intervals relative to the reference measure. Sufficient conditions on the statistically equivalent partitions were stipulated to guarantee strong consistency of the estimate for the scalar case (where the probability measures are defined on the real line). This construction and consistency result was also extended to the case of ergodic sources, in that case for the estimation of the relative entropy rate. Independently, Nguyen, Wainwright and Jordan [48] proposed a novel estimate based on a variational characterization of the divergence [32, 20]. Their approach reduces to estimating the likelihood ratio, or Radon-Nikodym derivative, of the probability measures involved by solving a risk minimization problem. Under some approximation assumptions and smoothness conditions on the likelihood ratio, strong consistency of the proposed estimate was obtained. Remarkably, under those conditions an optimal minimax rate of convergence for the likelihood ratio and an asymptotic rate of convergence for the proposed plug-in divergence estimate were established.

In this paper, we extend the work of Wang et al. [78] by exploring in more general terms the problem of histogram-based divergence estimation. We propose a general histogram-based divergence estimate based on data-dependent partitions. The main result characterizes general sufficient conditions on the data-dependent partition scheme that guarantee the divergence estimate to be universally strongly consistent. This consistency result does not stipulate any major assumption other than requiring the two probability measures to be absolutely continuous with respect to the Lebesgue measure. This direction was motivated by the seminal work of Lugosi and Nobel [46] and Nobel [49], where general sufficient conditions on data-dependent partition schemes, based on combinatorial notions of complexity for partition families and the extension of the Vapnik-Chervonenkis inequality [71, 70], were established to obtain strong consistency in three classical learning problems: classification, regression and density estimation.

As with any other statistical learning problem, divergence estimation based on data-dependent partitions presents two sources of error. One is the approximation error, which is mainly attributed to the partitioning of the observation space (quantization reduces the magnitude of information quantities [32, 44]). The other is the estimation error, which is a consequence of the deviation of the empirical measures from their respective probabilities. As is well known in statistical learning theory [21, 3, 51, 70], to obtain a consistent estimate the tradeoff between estimation and approximation error needs to be balanced as the number of sample points goes to infinity. Concerning the approximation error, we revisit the notion of a universally sufficient partition for divergence estimation and particularize the results to the case of data-dependent partitions. Specifically, sufficient conditions for universal sufficiency are presented based on a well-known approximation quality notion: the shrinking cell condition for data-dependent partitions [46, 5, 21]. For the estimation error, we use the Vapnik-Chervonenkis inequality for the case of partitions [46] as the main tool to control the uniform deviation of the empirical distribution from the probability in the process of estimating the divergence. Our main result shows that the required conditions are much stronger than the analogous conditions obtained for the learning problems of histogram-based density estimation and classification [46].

In the second part of this work, we explore applications of this universal consistency result by restricting it to some emblematic families of data-dependent constructions, which can be considered an extension and generalization of the original findings presented by Wang et al. [78] for the l_m-spacing partition scheme. In particular, the case
Letusconsiderafinitelengthsequencex n 1 = (x 1 ,..,x n )∈ R d·n , andtheinducedsetby{x 1 ,..,x n },thenwecandefine Δ(A,x 1 ,..,x n ) =|{{x 1 ,..,x n }∩π :π∈A}|, (4.3) with {x 1 ,..,x n } ∩ π a short hand for {{x 1 ,..,x n }∩A :A∈π}. Consequently, Δ(A,x 1 ,..,x n ) is the number of possible partitions of{x 1 ,..,x n } induced byA, and thenthegrowthfunctionofAisdefinedby[46] Δ ∗ n (A) = max x n 1 ∈R d·n Δ(A,x 1 ,..,x n ). (4.4) An-samplepartitionrule π n isamappingfrom R d·n tothespaceoffinite-measurable partitions for R d , that we denote byQ, where a partition scheme for R d is a countable 93 collection of n-sample partitions rules Π = {π 1 ,π 2 ,...}. Let Π be an arbitrary parti- tion scheme for R d , then for every partition rule π n ∈ Π we can define its associated collectionofmeasurablepartitionsby[46] A n = π n (x 1 ,..,x n ) : (x 1 ,..,x n )∈ R d·n . (4.5) In this context, for a given n-sample partition rule π n and a sequence (x 1 ,..,x n ) ∈ R d·n , π n (x|x 1 ,..,x n ) denotes the mapping from any pointx in R d to its unique cell in π n (x 1 ,..,x n ),suchthatx∈π n (x|x 1 ,..,x n ). 4.2.3 Vapnik-Chervonenkistypeofinequality LetX 1 ,X 2 ,..,X n beindependentidenticallydistributed(i.i.d.) realizationsofarandom vector with values in R d , with X ∼ P and P a probability measure on (R d ,B(R d )). Then∀A∈π n (X 1 ,X 2 ,..,X n ),wecandefinetheempiricaldistributionby P n (A) = 1 n n X i=1 I A (X i ), (4.6) aprobabilitymeasuredefinedon(R d ,σ(π n (X 1 ,..,X n ))) 2 . Thisistheabstractrepresen- tationofthedata-dependentpartitionschemeforprobabilityestimation,wherethei.i.d. samples are used twice: first for defining a sub-sigma field σ(π n (X 1 ,..,X n ))⊂B(R d ) wherewewanttofocustheestimationproblemandthenforcharacterizingtheempirical probabilitymeasureonit. Inthislearningsetting,LugosiandNobel[46]provedadistributionfreeinequalityto boundtheuniformdeviationoftheempiricaldistributionwithrespecttotheprobability 2 σ(π)denotesthesmallestsigma-fieldthatcontain π,whichforthecaseofpartitionsisthecollection ofsetsthatcanbewrittenasunionofcellsofπ. 94 in a family of partitions. Interestingly the rate of convergence of this bound is char- acterized by a combinatorial complexity notion of the mentioned partition family, and consequentlyanaturalgeneralizationoftheVapnik-Chervonenkisinequality [72,21]. LEMMA4.1 (Lugosi and Nobel [46]) LetA be a collection of measurable partitions for R d . Then∀n∈ N,∀> 0, P sup π∈A X A∈π |P n (A)−P(A)|> ! ≤ 4Δ ∗ 2n (A)2 M(A) exp − n 2 32 , (4.7) where PreferstotheprocessdistributionofX 1 ,X 2 ,···. Introducing the role of a partition scheme, the following result is a consequence of Lemma4.1andtheapplicationoftheBorel-Cantellilemma [4]. COROLLARY4.1 (Lugosi and Nobel [46]) Let us consider a sequence of partition familiesA 1 ,··· ,A n ,··· induced by a partition scheme Π. If whenn tends to infinity: n −1 M(A n )→ 0andn −1 logΔ ∗ n (A n )→ 0,then sup π∈An X A∈π |P n (A)−P(A)|→ 0 (4.8) withprobabilityonewithrespecttotheprocessdistributionofX 1 ,X 2 ,···. This important result says that uniformly in the collection of measurable partitions in A n the empirical distribution (relative frequencies) converges to the probability — in thetotalvariationaldistancesense—asntendstoinfinity P-almostsurely. 95 4.3 Data-Dependent Partition for Divergence Estima- tion: ProblemStatement LetP andQbeprobabilitymeasuresin(R d ,B(R d ))absolutelycontinuouswithrespect totheLebesguemeasure,suchthatD(P||Q)<∞. 
LetΠ ={π 1 ,π 2 ,···}beapartition scheme for R d , and let us considerX 1 ,..,X n andY 1 ,..,Y m i.i.d. realizations of random variables with values in R d and distributionsP andQ, respectively. Then the proposed candidatefortheempiricaldivergenceisgivenby ˆ D n,m (P||Q)≡ X A∈πm(Y 1 ,..,Ym) P n (A)·log P n (A) Q m (A) , (4.9) whereP n andQ m denotetheempiricaldistributionsinducedbyX 1 ,..,X n andY 1 ,..,Y m respectively by (4.6), defined in the sub-sigma field σ(π m (Y 1 ,..,Y m )) ⊂ B(R d ). Note that this data-dependent partition is a function only of the i.i.d. realizations associ- ated with the reference measureQ. Loosely speaking the choice of using only partial information can be justified by the assumption that P Q. The next section will formally support this choice by introducing the notion of universally sufficient data- dependent partition and showing that the proposed construction satisfies this desired asymptotic property. In addition, we impose in Π the condition that Q m (A) > 0, ∀A∈π m (Y 1 ,..,Y m ),thatensuresthat ˆ D n,m (P||Q)<∞. ˆ D n,m (P||Q) is a measurable function ofX 1 ,..,X n andY 1 ,..,Y m , and consequently weareinterestedinstudyingthestrong—almostsurelywithrespecttothejointdistri- butionof{X n ,n∈ N}and{Y m ,m∈ N}—universalconsistencyof ˆ D n,m (P||Q)asm andn tend to infinity and as a function of the aforementioned notions of combinatorial complexityforΠ. 96 Forthe restofthe papertheprocess distributionsofY 1 ,Y 2 ,··· andX 1 ,X 2 ,··· will be denoted by Q and P, respectively, and their probability measures restricted to finite blocks,i.e.Y m 1 ≡ (Y 1 ,..,Y m )andX n 1 ≡ (X 1 ,..,X n ),by Q m and P n ,respectively. We will start exploring results concerning general approximation property of the proposed data dependent partition scheme for the divergence estimation and then use thoseforcharacterizingthestronguniversalconsistency. Theassumptions,notationand problemformulationpresentedinthissectionwillbeusedfortherestofthepaper,even when not explicitly stipulated (in particular, without any additional considerations for the first part in Sections 4.4 and 4.5, where as for the second part, Sections 4.6 and 4.7, we include other restrictions to particularize results to the properties of specific data-dependentconstructions). 4.4 UniversallySufficientData-DependentPartitions In this section we study the approximation quality of data-dependent partition schemes Π for the divergence estimation (4.9). More precisely, we wish to study conditions where, lim m→∞ X A∈πm(Y m 1 ) P(A)·log P(A) Q(A) =D(P||Q) (4.10) almost surely with respect to the process distribution ofY 1 ,Y 2 ,···, independent of the probability measures P and Q. Note that the expression in (4.10) differs from the one in (4.9) because it does not consider the empirical distributions. Consequently this approximation condition is strictly a function of the random partition sequence π 1 (Y 1 ),π 2 (Y 1 ,Y 2 ),··· ,π m (Y m 1 ),··· andthereforeoftheuniversalapproximationprop- ertiesofΠwithrespecttotheprocessdistribution Q. 97 Let P and Q be the probability measures in (R d ,B(R d )) under the assumption of Section 4.2 and let us consider the characterization of the divergence as the supremum withrespecttofinitecodingsorpartitionsof R d [32],i.e., D(P||Q) = sup π∈Q D(P π ||Q π ), (4.11) withQ representing the set of finite measurable partitions of R d , and the finite alphabet divergencegivenby, D(P π ||Q π ) = X A∈π P(A)log P(A) Q(A) , (4.12) which is well defined as P Q [32]. 
Consequently, for any partition scheme Π, D(P πm(y 1 ,..,ym) ||Q πm(y 1 ,..,ym) )≤D(P||Q),∀(y 1 ,..,y m )∈ R d·m andthen limsup m→∞ D(P πm(Y m 1 ) ||Q πm(Y m 1 ) )≤D(P||Q), (4.13) Q−almostsurely. Definition4.1 ApartitionschemeΠisuniversallysufficientforthedivergenceestima- tion if for any arbitrary pairP,Q of probability measures, absolutely continuous with respecttotheLebesguemeasuresuchthatD(P||Q)<∞,wehavethat, lim m→∞ D(P πm(Y m 1 ) ||Q πm(Y m 1 ) ) =D(P||Q), (4.14) withprobabilityonewithrespecttotheprocessdistributionofY 1 ,Y 2 ···. Westartbystatingthefollowinggeneralapproximationresult. THEOREM4.1 A partition scheme Π is universally sufficient for the divergence esti- mationifforanypairofprobabilitymeasuresP andQon(R d ,B(R d )),∀δ> 0,andfor 98 any measurable partitionπ = {A 1 ,..,A r } ∈ Q, there existsπ ∗ m = {A m,1 ,..,A m,r } ⊂ σ(π m (Y m 1 )),asequenceoffinitemeasurablepartitions,suchthat: limsup m→∞ sup i=1,..,r |P(A i )−P(A m,i )|<δ, (4.15) limsup m→∞ sup i=1,..,r |Q(A i )−Q(A m,i )|<δ, (4.16) Q-almostsurely. TheproofispresentedinAppendix4.9.1. Based on this general approximation theorem, a more specific result based on a ”shrinking cell” condition for data-dependent partition schemes can be stated. Before presenting the result let us introduce the following concept. For any A ∈ B(R d ), we defineitsdiameterby diam(A) = sup x,y∈A ||x−y||, (4.17) where||·||referstotheEuclideannormin R d . THEOREM4.2 Under the assumptions of Theorem 4.1, a partition scheme Π = {π 1 ,π 2 ,···}isuniversallysufficientforthedivergenceestimationif,∀γ > 0, Q x∈ R d :diam(π m (x|Y 1 ,..,Y m ))>γ → 0 (4.18) Q-almostsurelyas mtendstoinfinity. TheproofreducestocheckingthesufficientconditionstatedinTheorem4.1. Details arepresentedinAppendix4.9.2. In other words, the shrinking cell condition with respect to the reference measure Q, in (4.18), is sufficient to guarantee that the approximation error of our divergence estimatevanishesasthenumberofsampledpointtendstoinfinity,(4.14). Thisshrinking 99 cell condition was proposed by Lugosi and Nobel [46] for controlling approximation errorfortheproblemofhistogram-baseddensityestimation. Remark4.1 The shrinking cell condition for the references measure implies that the same asymptotic property is satisfied for the measureP (see proof of Theorem 4.2 for details). This provides a justification for our choice of data-dependent construction in (4.9). Finally, Theorem 4.2 will be used to control the approximation error in the main consistencyresultstatedinthenextsection. 4.5 MainUniversalConsistencyResult Before presenting the main result let us introduce some basic definitions. Let (a n ) n∈N and(b n ) n∈N betwosequencesofnon-negativerealnumbers. Wesaythat (a n )dominates (b n ), denoted by (b n ) (a n ), if there existsC > 0 andk ∈ N such thatb n ≤ C·a n ∀n ≥ k. We say that (b n ) n∈N and (a n ) n∈N are asymptotically equivalent, denoted by (b n )≈ (a n ),ifthereexitsC > 0suchthatlim n→∞ an bn =C. THEOREM4.3 LetP andQ be probability measures in (R d ,B(R d )) absolutely con- tinuous with respect to the Lebesgue measure, such thatD(P||Q) <∞. LetX 1 ,..,X n and Y 1 ,..,Y m be i.i.d. realizations of P and Q, respectively, and Π = {π 1 ,π 2 ,...} a partition scheme with associated sequence of measurable partitionsA 1 ,A 2 ,···. 
If for somel∈ (0,1),wehavethat,asmtendstoinfinity, a) m −l M(A m )→ 0, b) m −l logΔ ∗ m (A m )→ 0, c) ∃(k m ) ≈ (m 0.5+l/2 ) such that, ∀m ∈ N, ∀(y 1 ,..,y m ) ∈ R d·m , inf A∈πm(y 1 ,..,ym) Q m (A)≥ km m , 100 d) ∀γ > 0, Q x∈ R d :diam(π m (x|Y 1 ,..,Y m ))>γ → 0 almost surely with respecttotheprocessdistributionofY 1 ,Y 2 ,···, then lim m→∞ lim n→∞ ˆ D m,n (P||Q) =D(P||Q) (4.19) withprobabilityone. Proof: There are two important considerations to be taken into account in the proof. First, the universal sufficient nature of the adaptive quantization framework Π, considered in d), and second, the generalization ability of the learning approach, how relative frequencies converge uniformly to their respective probabilities for the estima- tionofthedivergence,consideredbya),b)andc). LetusconsiderY m 1 =Y 1 ,...,Y m . Theproofwillbebasedonthefollowinginequal- ity: ˆ D m,n (P||Q)−D(P||Q) ≤ X A∈πm(Y m 1 ) P n (A)·log P n (A) Q m (A) − X A∈πm(Y m 1 ) P n (A)·log P n (A) Q(A) + X A∈πm(Y m 1 ) P n (A)·log P n (A) Q(A) − X A∈πm(Y m 1 ) P(A)·log P(A) Q(A) + X A∈πm(Y m 1 ) P(A)·log P(A) Q(A) −D(P||Q) . (4.20) Then it is sufficient to prove that the three terms on the right side of the inequality converge to zero almost surely asn tends to infinity and asm tends to infinity. We will provethesethreecases(indexedfromtoptobottom)separately. 101 Term1:LetusconsiderX 1 ,..,X n andY 1 ,..,Y m ,then 3 X A∈πm(Y m 1 ) P n (A)·log P n (A) Q m (A) − X A∈πm(Y m 1 ) P n (A)·log P n (A) Q(A) ≤ X A∈πm(Y m 1 ) P n (A)|logQ(A)−logQ m (A)| (4.21) ≤ sup A∈πm(Y m 1 ) |logQ(A)−logQ m (A)|. (4.22) NotethatthisupperboundisindependentofX 1 ,...,X n ,andonlyinvolvesthedistribu- tion ofY 1 ,..,Y m . By the relationship between the total variational distance and theL 1 norm [22] and the use of Corollary 4.1 (note that a) and b) underl ∈ (0,1) imply the sufficientconditionsofCorollary4.1),wehavethatasmtendstoinfinity sup π∈Am max A∈π |Q(A)−Q m (A)|→ 0, (4.23) almost surely with respect to the process distribution ofY 1 ,Y 2 ,···. However this con- dition is not sufficient to prove that (4.22) tends to zero Q-almost surely because the log(·) function is not absolutely continuous in (0,1]. The following lemma provides an importantresulttoaddressthisissue. LEMMA4.2 LetY 1 ,..,Y m be i.i.d. realizations of a random variable with probability measureQin R d andΠapartitionschemeaspresentedinTheorem4.3. Iftheconditions a),b)andc)ofTheorem4.3aresatisfiedforsomel∈ (0,1),then lim m→∞ sup A∈πm(Y m 1 ) Q(A) Q m (A) −1 = 0, (4.24) 3 ByconstructionofΠ(conditionc)),∀A∈π m (Y 1 ,..,Y m ),Q m (A)> 0. Ontheotherhand,theevent Q m (A)> 0andQ(A) = 0hasprobabilityzero,moreprecisely Q(y m 1 :Q(A)> 0,∀A∈π m (y m 1 )) = 1, consequentlytheTerm1andupperboundin(4.22)arewelldefinedwithprobabilityone. 102 almost surely with respect to the process distribution of Y 1 ,Y 2 ,···. The proof is pre- sentedinAppendix4.9.3. Then from (4.24) it is simple to prove that lim m→∞ sup A∈πm(Y m 1 ) Q(A) Qm(A) = 1 and lim m→∞ sup A∈πm(Y m 1 ) Qm(A) Q(A) = 1 almost surely, see Appendix 4.9.4 for details. On the otherhand,wehavethat∀A∈π m (Y m 1 ), Q m (A) Q(A) −1 ≤ |Q(A)−Q m (A)| Q m (A) · Q m (A) Q(A) , (4.25) then lim m→∞ sup A∈πm(Y m 1 ) Q m (A) Q(A) −1 = 0, (4.26) Q-almost surely from (4.24) and (4.25). 
Finally, considering |log(x)| ≤ max x−1, 1 x −1 ∀x> 0,itfollowsthat sup A∈πm(Y m 1 ) log Q(A) Q m (A) ≤ sup A∈πm(Y m 1 ) max Q(A) Q m (A) −1 , Q m (A) Q(A) −1 ≤ max ( sup A∈πm(Y m 1 ) Q(A) Q m (A) −1 , sup A∈πm(Y m 1 ) Q m (A) Q(A) −1 ) , where using (4.24) and (4.26), we have that lim m→∞ sup A∈πm(Y m 1 ) log Q(A) Qm(A) = 0 almostsurely,provingourtargetresultfrom(4.22). Term2:Thesecondtermof(4.20)canbeupperboundedby, X A∈πm(Y m 1 ) P n (A)·logP n (A)− X A∈πm(Y m 1 ) P(A)·logP(A) + X A∈πm(Y m 1 ) (P n (A)−P(A))·log 1 Q(A) , (4.27) 103 where P A∈πm(Y m 1 ) (P n (A)−P(A))·log 1 Q(A) ≤ 2 sup A∈πm(Y m 1 ) log Q m (A) Q(A) + X A∈πm(Y m 1 ) (P n (A)−P(A))·log 1 Q m (A) . (4.28) Let us condition on a realization ofY 1 ,..,Y m and consequently we fix the measurable partitionπ m (Y m 1 ). Then by the strong law of large numbers (SLLN) [73], we have that ∀A∈π m (Y m 1 ), lim n→∞ P n (A) =P(A) (4.29) almost surely with respect to the distribution ofX 1 ,X 2 ,···. Then given thatxlogx is acontinuousrealfunction,and|π m (Y m 1 )|<∞,thenitfollowsdirectlythat, lim n→∞ X A∈πm(Y m 1 ) P n (A)·logP n (A) = X A∈πm(Y m 1 ) P(A)·logP(A) lim n→∞ X A∈πm(Y m 1 ) P n (A)·log 1 Q m (A) = X A∈πm(Y m 1 ) P(A)·log 1 Q m (A) almostsurelywithrespecttothedistributionofX 1 ,X 2 ,...givenY m 1 =y m 1 ,∀y m 1 ∈ R d·m . This proves that the first term in (4.27) and the second term of (4.28) converges to zero as n tends to infinity for any m ∈ N and any realization of Y m 1 . Then, it is simple to show that these two terms are zero almost surely as in additionm tends to infinity. Finally, the first term of (4.28) tends to zero asm tends to infinity almost surely from Lemma4.2. Term3:(Approximationpart)Thislasttermtendstozeroalmostsurelybyadirect applicationsofTheorem4.2. 104 Remark4.2 Perhapsnotexplicitinthestatementofthetheoremistheassumptionthat X 1 ,X 2 ,... and Y 1 ,Y 2 ,... need to be mutually independent random sequences. This is used when invoking the SLLN in (4.29) which is implicitly conditioned on the random partitionπ m (Y m 1 )andconsequentlyonY 1 ,..,Y m . Note that the result presented in Theorem 4.3 can be naturally extended when X 1 ,..,X m is a stationary ergodic source [4, 73]. The following result states this exten- sion. THEOREM4.4 Let us consider the same problem setting and assumptions of Theo- rem 4.3. If we consider instead that the random sequenceX 1 ,..,X m is stationary and ergodicthen, lim m→∞ lim n→∞ ˆ D m,n (P||Q) =D(P||Q) (4.30) withprobabilityone. Proof: The same arguments for proving Theorem 4.3 can be adopted, where the proofsofTerm1andTerm3remainthesame—becausethosetermsareindependent of the process distribution ofX 1 ,X 2 ,···, and the proof of the Term 2 can be adapted byasimpleapplicationoftheErgodicTheorem[4,73]. The second part of this paper is devoted to showing how this general strong consis- tency result particularizes to two widely used data-dependent partition schemes. These particularizations can be related to similar results presented by Lugosi et al. [46] for provinghowdata-dependentpartitionschemesarestronglyconsistent—inthe L 1 sense —forthedensityestimationproblem. 105 4.6 StatisticalEquivalentData-DependentPartitions Let us first consider the real line (R,B(R)) as the target measurable space with two probability measuresP andQ satisfying the conditions of Theorem 4.3. We consider the l m -spacing partition scheme originally studied by Wang et al. for the problem of divergence estimation [78]. More precisely, letY 1 ,..,Y m be the i.i.d. 
realizations with marginaldistributionQ. TheorderstatisticsY (1) ,Y (2) ,..,Y (m) isdefinedasthepermu- tation ofY 1 ,..,Y m such thatY (1) <Y (2) <···<Y (m) — this permutation exists with probability one as Q is absolutely continuous with respect to the Lebesgue measure. Basedonthissequence,theresultingl m -spacingquantizationisgivenby π m (Y m 1 ) ={I m i :i = 1,..,T m } = (−∞,Y (lm) ],(Y (lm) ,Y (2lm) ],..,(Y ((Tm−1)lm) ,∞) , whereT m =bm/l m cassumingthenon-trivialcasewhere m>l m . Notethatunderthis construction every cell ofπ m (Y m 1 ) has at leastl m samples fromY 1 ,..,Y m . The follow- ing result presents the sufficient conditions that makes this data-dependent divergence estimatorstronglyconsistent. THEOREM4.5 LetP,Q be absolutely continuous with respect to the Lebesgue mea- sureon(R,B(R))andD(P||Q)<∞. LetX 1 ,..,X n andY 1 ,..,Y m bei.i.d. realizations ofP andQ respectively. Under thel m -spacing partition scheme, if (l m ) ≈ (m 0.5+l/2 ) forsomel∈ (1/3,1),then lim m→∞ lim n→∞ ˆ D m,n (P||Q) =D(P||Q), (4.31) withprobabilityone. 106 Proof: We just need to check that under the l m -spacing partition scheme, the conditionsa),b),c)andd)ofTheorem4.3aresatisfied. Withoutlossofgeneralityletus consideranarbitraryl∈ (1/3,1). Thetrivialcasetocheckisc),becausebyconstruction we can considerk m = l m ,∀m ∈ N, and then the hypothesis of this theorem gives the condition. Concerning a), again by construction we have thatM(A m ) ≤ m/l m + 1, thenm −l M(A m )≤ m 1−l /l m +m −l . Given that (l m )≈ (m 0.5+l/2 ) andl ∈ (1/3,1) it followsthat, lim m→∞ m −l M(A m ) = 0. (4.32) Forconditionb),Lugosietal. [46]showedthatΔ ∗ m (A m ) = Tm+m m ,whereusingthat log s t ≤s·h(t/s)[21],withh(x) =−xlog(x)−(1−x)log(1−x)forx∈ [0,1]— thebinaryentropyfunction[16],itfollowsthat, m −l log(Δ ∗ m (A m )) =m −l ·log Tm+m m ≤m −l ·(m+T m )·h m m+T m ≤ 2m 1−l ·h 1 m/T m +1 ≤ 2m 1−l ·h T m m ≤ 2m 1−l ·h 1 l m (4.33) Consequentlywehavethat,∀m∈ N, m −l log(Δ ∗ m (A m ))≤− 2m 1−l l m log(1/l m ) −2m 1−l (1−1/l m )log(1−1/l m ). (4.34) The first term on the right hand side (RHS) of (4.34) behaves likem 0.5−3/2·l · log(l m ), where as long as the exponent of the first term is negative (equivalent to l > 1/3) 107 Figure4.1: A:ExampleofGessaman’sstatisticallyequivalentpartitionforatwodimen- sional bounded space. B: Example of a tree-structured data dependent partition and its tree-indexed structure. Each internal node has a label indicating the spatial coordinate usedtosplititsassociatedrectangularset. this sequence tends to zero as m tends to infinity — considering that by construction (l m ) (m). ThesecondtermontheRHSof(4.34)behavesasymptoticallylike−m 1−l · log(1− 1/l m ) which is upper bounded by the sequence m 1−l lm · 1 1−1/lm — using that log(x)≤x−1,∀x> 0. Thisupperboundtendstozerobecause(l m )≈ (m 0.5+l/2 )and l> 1/3. Consequentlyfrom(4.34),lim m→∞ m −l log(Δ ∗ m (A m )) = 0. Finally concerning condition d), Lugosi et al. [46] (Theorem 4) proved that it is sufficienttoshowthatlim m→∞ lm m = 0,whichisthecaseconsideringthatl< 1. Remark4.3 Under the condition presented in Theorem 4.5 we have that l m → ∞ and l m /m → ∞ as m tends to infinity, which are the sufficient conditions presented by Lugosi and Nobel [46] for the l m -based histogram based density estimation to be strongly consistent in the L 1 sense. 
The fact that stronger sufficient conditions are needed to get strong consistency for the divergence estimation, can be explained by thefactthatthedivergenceisnotcontinuousfunctionofthedensitieswithrespecttothe L 1 norm[22]. 108 4.6.1 Gessaman’sstatisticallyequivalentpartition Here we extend the consistency result of Theorem 4.5 for the finite-dimensional case, X = R d , considering the particular type of statistically equivalent block proposed by Gessaman [29]. In this context, the partition rule considers T m = b(m/lm) 1/d c as the number of axis-paralled splits to be induced in any coordinate of the space. More precisely first, the i.i.d samplesY 1 ,..,Y m associated with the reference measureQ are projectedinthefirstcoordinatetocreateapartitionofT m cellswithsamenumberofpro- jected sample points using axis-parallel hyper-planes perpendicular to the first coordi- nate. Thenforanyresultingrectangularcell,itsrespectivesamplespointsareprojected inthesecondcoordinateandusedtopartitionthecellinT m statisticallyequivalentsets, in this case by hyper-planes perpendicular to the second coordinate. By iterating this processuntilthelastcoordinate,wehaveanadaptivepartitionschemeofexactly(T m ) d rectangular cells with at least l m -sample points, see Fig. 4.1 for an illustration 4 . The followingresultpresentsthesufficientconditionstomakethispartitionschemestrongly consistent for the divergenceestimation. This result is ageneralization of Theorem 4.5, but the techniques used to prove the shrinking cell condition does not extend from the approach proposed by Lugosi et al. [46] (Theorem 4) for the scalar case. The proof of thisshrinkingcellconditionispresentedinAppendix4.9.5. THEOREM4.6 The Gessaman’s partition scheme is strongly consistent for the his- togrambaseddivergenceestimationasmandntendtoinfinityif(l m )≈ (m 0.5+l/2 )for somel∈ (1/3,1). Proof: UsingagainTheorem4.3,weneedtochecktheconditionsa),b),c)andd). The arguments to check conditions a) and c) extend directly from Theorem 4.5. Con- cerningb)usingthesamecombinatorialargument,wehavethatΔ ∗ m (A m )≤ Tm+m m d . 4 Notethatthel m -spacingpartitionisaparticularcaseofGessaman’spartitionschemewhen d = 1. 109 Defining ¯ T m = bm/lmc ≥ T m and h(·) the binary entropy function, we can use the samederivationspresentedin(4.33)toshowthat, m −l log(Δ ∗ m (A m ))≤m −l d·log ¯ Tm+m m ≤ 2d·m 1−l ·h 1 l m . (4.35) This last upper bound tends to zero as m goes to infinity because (l m ) ≈ (m 0.5+l/2 ) and 1 > l > 1/3 as shown in Theorem 4.5. The most challenging part to check is the shrinkingcellcondition,presentedindetailinAppendix4.9.5. 4.7 Tree-StructuredPartitionSchemes Tree-structured partitions (TSPs) have been widely used as an adaptive non-parametric technique in statistical learning problems. The induced quantization of the space has a binary tree structure associated with the way in which the space is recursively splitted, starting with a cell containing the full space and inductively partitioning every cell by a local binary splitting rule. Their binary tree structure allows efficient data-dependent construction and posterior vector indexing. Emblematic applications include classifica- tion(classification trees)[21, 51,61,5], densityestimation [46]andregression [49]. In thissection,wepresenttheapplicabilityofTheorem4.3toprovidesufficientconditions forageneralfamilyofTSPschemestobestronglyconsistentforthedivergenceestima- tion,aproblemthathasnotbeenexploredforthistypeofdata-dependentconstruction. 
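Before developing the tree-structured case, it may help to make the statistically equivalent constructions of Sections 4.6 and 4.6.1 concrete. The following Python sketch builds Gessaman cells from the reference sample Y_1,..,Y_m (the l_m-spacing rule of Section 4.6 is the special case d = 1) and evaluates the plug-in divergence estimate of Section 4.3 on those cells. It is a minimal illustration only: the function names, the Gaussian example and the choice l_m = m^0.7 (i.e., l = 0.4 in Theorem 4.6) are assumptions of this sketch, not part of the analysis.

import numpy as np

def split_cell(lo, up, pts, j, T):
    # Split one rectangular cell along coordinate j into T children with
    # (approximately) equal numbers of the points it contains; cuts are placed
    # at order statistics of the j-th coordinate (ties have probability zero
    # for continuous data).
    n = len(pts)
    order = np.sort(pts[:, j])
    cuts = [order[(k * n) // T - 1] for k in range(1, T)]
    edges = [lo[j]] + cuts + [up[j]]
    children = []
    for k in range(T):
        clo, cup = lo.copy(), up.copy()
        clo[j], cup[j] = edges[k], edges[k + 1]
        mask = (pts[:, j] > clo[j]) & (pts[:, j] <= cup[j])
        children.append((clo, cup, pts[mask]))
    return children

def gessaman_partition(Y, l_m):
    # Axis-parallel statistically equivalent cells driven by Y (shape (m, d)):
    # T_m = floor((m/l_m)^(1/d)) equal-count splits applied coordinate by
    # coordinate, as in Section 4.6.1.  Returns half-open boxes (lower, upper).
    m, d = Y.shape
    T = max(int((m / l_m) ** (1.0 / d)), 1)
    cells = [(np.full(d, -np.inf), np.full(d, np.inf), Y)]
    for j in range(d):
        cells = [c for (lo, up, pts) in cells for c in split_cell(lo, up, pts, j, T)]
    return [(lo, up) for (lo, up, _) in cells]

def empirical_mass(cells, data):
    # Relative frequency of the rows of `data` in every half-open box.
    counts = np.array([np.sum(np.all((data > lo) & (data <= up), axis=1))
                       for lo, up in cells])
    return counts / len(data)

def divergence_estimate(X, Y, l_m):
    # Plug-in estimate: cells are built from the reference sample Y ~ Q, then
    # the relative frequencies of X ~ P and of Y are compared cell by cell
    # (cells with zero P_n-mass contribute 0 by the usual convention).
    cells = gessaman_partition(Y, l_m)
    P_n, Q_m = empirical_mass(cells, X), empirical_mass(cells, Y)
    keep = P_n > 0
    return float(np.sum(P_n[keep] * np.log(P_n[keep] / Q_m[keep])))

rng = np.random.default_rng(0)
m = n = 4000
X = rng.normal(0.5, 1.0, size=(n, 2))   # P: Gaussian shifted by (0.5, 0.5)
Y = rng.normal(0.0, 1.0, size=(m, 2))   # Q: reference standard Gaussian
print(divergence_estimate(X, Y, l_m=m ** 0.7))   # population value: D(P||Q) = 0.25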
4.7.1 Basicnotation Letusfirstintroducesometerminology. UsingBreimanetal. conventions[5],abinary tree T is a collection of nodes with only one with degree 2 (the root node), and the 110 remainingnodeswithdegree3(internalnodes)ordegree1(leaforterminalnodes). Let depth(t)denotethedepthoft∈T —thenumberofarcsthatconnecttwiththerootof T, andL(T) be the collection of terminal nodes ofT. We define the size of a treeT as the cardinality ofL(T) and denote it by|T|. If ¯ T ⊂ T and ¯ T is a binary tree by itself, we say that ¯ T is a subtree ofT and moreover if both have the same root we say that ¯ T is a pruned version ofT, denoted by ¯ T T. Finally,T r denotes the truncated version ofT,formallygivenbyT r ={t∈T :depth(t)≤r}. WewillconcentrateonthefamilyofTSPinducedbyhyperplanecuts[21],i.e. dur- ing the construction of the partition intermediate cells are dichotomized by intersecting it with closed and open halfspacesH,H c ⊂ R d of the formH = x :x † w≥α , for somew ∈ R d andα ∈ R. Examples of this collection are the axis-parallel halfspaces given by{x :x(i)≥α} withi∈{1,..,d}. Let usdenote byH the collectionof closed halfspacesandbyH 0 thecollectionofaxis-parallelhalfspaces. Closely following Nobel’s conventions [51, 50], a tree-structured partition (TSP) can be represented by a pair (T,τ(·)), withT a binary tree andτ(·) a function fromT toH. Foranyt∈T,τ(t)correspondstotheclosedhalfspacethatdichotomizesthecell associated witht, denoted byU t , in two componentsU t ∩τ(t) andU t ∩τ(t) c . These resulting cells are associated with the left and right child oft respectively, in the case whent is not a terminal node ofT. Then initializing the cell of the root nodet 0 with U t 0 = R d ,τ(·)providesdewaytocharacterizeU t ,∀t∈T. Inparticular, π(T)≡{U t :t∈L(T)}⊂B(R d ), (4.36) istheTSPinducedby (T,τ(·)). Becauseofthisconstruction, thecellassociatedwitha node of depthk in the binary tree construction is a convex polytope of at mostk faces 5 . 5 Apolytopereferstosetsinducedbyfiniteintersectionsofclosedoropenhalfspaces[21]. 111 This property will turn out to be crucial to find conditions that make the divergence estimationbasedondata-drivenTSPstronglyconsistent. If (T,τ(·)) is a TSP and ¯ T T, then there is a unique TSP associated with ¯ T by restrictingτ(·) to the domain of ¯ T. Note that if ¯ T T thenπ(T) is a refinement of π( ¯ T), that we denote consistently by π( ¯ T) π(T). Finally for the sake of simplic- ity, we will use the binary tree notation T to refers to (T,τ(·)) or the partition π(T) dependingonthecontext. 4.7.2 Tree-structureddata-dependentpartitions A n-sampled TSP rule T n is a function from the space of finite sequences R d·n to the space of TSP with halfspace splitting rules, and the resulting partition scheme is the collection of TSP rules Π = {T 1 ,T 2 ,···}. Specifically in this work we focus on the generalfamilyofTSPrulesinducedbyalocalsplittingandstoppingcriterion. LetU bethecollectionofpolytopesin R d andP bethespaceofprobabilitymeasures in (R d ,B(R d )), then a local splitting rule can be seen as a function Ψ : U ×P → H, thatforagivencellU ∈U andprobabilitymeasureP ∈P itdefinesaclosedhalfspace Ψ(U,P) ∈ H to partitionU. On the other hand, a local stopping criterion is a binary functionΦ :U×P×[0,1]→{0,1},whichforgivenU ∈U andP ∈P,indicateswhen toapplythelocalsplittingcriteriaΨ(·)onthecellU. 
Giventhetypeofconstraintsthat wewanttoimposeonthedata-dependentpartition,inparticular Theorem4.3(condition c)),weconsiderstoppingrulesoftheform, Φ(U,P,p) = I {P(U∩Ψ(U,P))>p}∩{P(U∩Ψ(U,P)) c >p} , (4.37) forsomep∈ (0,1). 112 Finally, given Y m 1 = Y 1 ,..,Y m i.i.d. realizations of the reference measure Q, the corresponding empirical distributionQ m and a non-negative sequence (k m ) ∈ N N , the m-sampledpartitionrule π(T m (Y m 1 ))isinducedbytherecursiveapplicationofthestop- pingandsplittingcriteriaasfollows: 1. Initialization:T m ={t 0 }(therootnode),U t 0 = R d ,π(T m ) ={U t 0 }andτ m (t 0 ) = Ψ(U t 0 ,Q m ) 2. Recursion: forallt∈L(T m ) if Φ(U t ,Q m ,k m /m) = 1, then considert 1 andt 2 as the left and right extensions oftandupdateasfollows: • T m =T m ∪{t 1 ,t 2 }, • U t 1 =U t ∩τ m (t),U t 2 =U t ∩τ m (t) c • τ m (t 1 ) = Ψ(U t 1 ,Q m ),τ m (t 2 ) = Ψ(U t 2 ,Q m ). • π(T m ) =π(T m )\{U t }∪{U t 1 ,U t 2 } 3. Termination: Repeat2),untilΦ(U t ,Q m ,k m /m) = 0,∀t∈L(T m ). NotethatbyconstructionQ m (U t )≥k m /m,∀t∈L(T m (Y m 1 )),whichisconsistentwith condition c) of Theorem 4.3. Under this data-driven TSP scheme the following result canbestated. THEOREM4.7 LetP,Q be probability measures as in Theorem 4.3, andX 1 ,..,X n and Y 1 ,..,Y m i.i.d. realizations of P and Q, respectively. Let Π = {T 1 ,T 2 ,···} be a TSP scheme driven by the empirical process Y 1 ,Y 2 ,··· and the local stopping rule governed by a sequence of non-negative number (k m ) m∈N . If (k m ) ≈ (m 0.5+l/2 ) for some l ∈ (1/3,1) and Π satisfies the shrinking cell condition stated in Theorem 4.2, then lim m→∞ lim n→∞ ˆ D m,n (P||Q) =D(P||Q), 113 withprobabilityone. Proof: ThisresultisasimpleconsequenceofTheorem4.3. Notethatweonlyneed to check conditionsa) andb), becausec) is obtained directly by the stopping criterion, (4.38),andd)isassumedinthetheoremstatement. Bythestoppingcriterion,|T m (y m 1 )| is uniformly upper bounded bym/k m , for ally m 1 ∈ R d·m . ThenM(A m )≤ m/k m and consequently, m −l M(A m )≤ m 1−l k m ≈m 0.5− 3 2 l , (4.38) upper bound that tends to zero as m → ∞ if l > 1/3. Concerning condition b), we use the upper bound proposed by Lugosi et al. [46], specifying that every polytope of π(T m (y m 1 )) is induced by at most M(A m ) hyperplane splits. Each binary splits can dichotomizem≥ 2pointsin R d inatmostm d ways[15]. Consequently, Δ ∗ m (A m )≤ (m d ) m/km , (4.39) then, m −l logΔ ∗ m (A m )≤ m 1−l k m dlogm, (4.40) upperboundsequencethatagaintendstozeroasm→∞aslongasl> 1/3. Next we will particularize this result for more specific tree-structured data depen- dent construction where the ”shrinking cell condition” is satisfied. In particular, the nextsectionexplorestheemblematiccasewherethesplittingruletakesvaluesinspace ofaxis-parallelhalfspaces, H 0 ,andusesacriterionbasedonstatisticalequivalentparti- tions. 114 4.7.3 Statisticalequivalentsplittingrules In this section we consider a version of what is known as balanced search tree [21](Chapter20.3). Moreprecisely,givenY 1 ,Y 2 ,..,Y m i.i.d. 
realizationswithprobabil- ityQandinducedempiricalversionQ m ,weconsideraTSPschemewherethesplitting ruleΨ(U t ,Q m )∈H 0 firstchosesadimensionofthespaceinasequentialfashion,func- tionofthedepthoft—forinstancei =mod d (depth(t))—andthentheiaxis-parallel halfspaceH i (Y m 1 )∈H 0 by Ψ(U t ,Q m ) =H i (Y m 1 ) = x∈ R d :x(i)≤ ¯ Y (d¯ m/2e) (i) , (4.41) where ¯ Y (1) (i) < ¯ Y (2) (i) <,..,< ¯ Y (¯ m) (i) denotes the order statistics of the sampling points of interest ¯ Y 1 ,.., ¯ Y ¯ m = {Y 1 ,..,Y m }∩U t projected in the target dimensioni. This type of data-dependent splitting criterion was proposed by Darbellay and Vajda [18]fortheestimationofmutualinformationbetweencontinuousrandomvariables. Afterthefirstiterationofthesplittingrule,theresultingcellshaveatmostm/2+1 andatleastm/2−1samplingpoints. Theseconditerationimpliesthecreationof4cells withatmostm/4+2andatleastm/4−2sampledpoints,andconsequentlyinductively thek-th iteration — if the stopping criterion is not violated during the splitting process — creates a balanced tree of 2 k cells with at least m/2 k −k and at most m/2 k +k sampling points. Note that at the end of the algorithm — considering our stopping criterion(4.37)andsplittingrule(4.41),T m (Y m 1 )isnotguaranteetobeabalancedtree, but the following condition follows: ∀t ∈ T m (Y m 1 ), 1 2 depth(t) + depth(t) m ≥ Q m (U t ) ≥ 1 2 depth(t) − depth(t) m . ToprovethatthisTSPschemeΠinducesastronglyconsistentdivergenceestimator, we just need to verify that Π satisfies the shrinking cell condition under the specific 115 assumptions stated in in Theorem 4.7 to control the estimation error. Before proving thisconditionletusintroducesomedefinitionsandresults. Proposition4.1 Let Π ={π 1 ,π 2 ,···} and ¯ Π ={¯ π 1 ,¯ π 2 ,···} denote two general par- tition schemes of R d driven by i.i.d. realizations Y 1 ,Y 2 ,··· with marginal probability Q. If we consider that Π is a refinement of ¯ Π in the sense that∀m ∈ N,∀y m 1 ∈ R d·m , ¯ π(y m 1 )π(y m 1 ),then lim m Q( x∈ R d :diam(¯ π m (x|Y 1 ,..,Y m ))>δ ) = 0 Q-almostsurelyimpliesthat lim m Q( x∈ R d :diam(π m (x|Y 1 ,..,Y m ))>δ ) = 0 Q-almostsurely. Proof: The proof is a straightforward consequence of the fact that,∀y m 1 ∈ R d·m , ∀x∈ R,π m (x|Y 1 ,..,Y m )⊂ ¯ π m (x|Y 1 ,..,Y m ). Definition4.2 LetT be a binary tree, we say thatT is balanced tree of depthr if∀t∈ L(T),depth(t) =r. Definition4.3 A TSP scheme Π = {T 1 ,T 2 ,···} is a uniform balanced tree-structure scheme if each partition ruleT m (y m 1 ) forms a balanced tree of depthd m , only function ofthelengthofy m 1 . LEMMA4.3 Let Π = {T 1 ,T 2 ,···} be a uniform balanced tree-structure scheme induced by the statistically equivalent splitting rule (4.41) and with depth sequence 116 (d m ). Π satisfies the shrinking cell condition of Theorem 4.2 if there exists a non- negativerealsequence(a m )≈ (m p ),forsomep> 0,suchthat m d m 2 dm − a m d m →∞andd m →∞, (4.42) asmtendstoinfinity. ThisresultcanbederivedfromtheideaspresentedbyDevroye,GyorfiandLugosi[21] (Theorem 20.2) where a weak version of our shrinking cell condition was proved for a similarbalancedtree-structuredpartitionscheme. Forsakeofcompletenessthisstronger resultispresentedinAppendix4.10. Finallywehaveallthemachinerytoproveourintendedresult. THEOREM4.8 Let Π = {T 1 ,T 2 ,···} a TSP scheme with the stopping and splitting rulepresentedin(4.37)and(4.41),respectively. 
UndertheproblemstatementofTheo- rem4.3andtheconstraintsimposedbythestoppingruleinTheorem4.7,theempirical divergence ˆ D m,n (P||Q),constructedfromΠ,isstronglyconsistent. Proof: From Theorem 4.7, we only need to verify the shrinking cell condition for Π. By the binary tree structure of Π and the stopping rule, it is simple to show that, ∀y m 1 ∈ R d·m , r(m)≡blog 2 (m)c−dlog 2 (k m )e≤ inf t∈L(Tm(y m 1 )) depth(t), (4.43) andconsequentlyT r(m) m (Y m 1 )isabalancedtree. Defining ¯ Π = n T r(1) 1 ,T r(2) 2 ,··· o from Proposition 4.1, it suffices to check the shrinking cell condition on ¯ Π because by con- struction T r(m) m (y m 1 ) T m (y m 1 ), ∀y m 1 ∈ R d·m . Moreover given that ¯ Π is a balanced tree-structurescheme,wecancheckthesufficientconditionstatedin Lemma4.3. 117 Let ¯ d m (=r(m))denotethedepthofT r(m) m . Byconstruction ¯ d m ≥log 2 (m/k m )−2 and tends to infinity asm/k m does ((k m ) ≈ (m 0.5+l/2 ) forl < 1). On the other hand, let us consider an arbitrary non-negative sequence (a m ) ≈ (m p ) for some arbitrary p∈ 0, 2 3 ,then m ¯ d m 2 ¯ dm − a m ¯ d m ≥ m log 2 (m/k m )2 log 2 (m/km) − a m log 2 (m/k m )−2 (4.44) = k m log 2 (m/k m ) − a m log 2 (m/k m )−2 →∞ (4.45) asm→∞,because(k m )≈ (m 0.5+l/2 )forl∈ (1/3,1),whichprovestheresult. 4.8 SummaryandFinalRemarks This work explores the problem of universal divergence estimation based on data- dependent partitions. General sufficient conditions were presented to make the diver- gence estimate strongly consistent extending the methodology proposed by Lugosi and Nobel for the problem of histogram-based density estimation and classification [46]. Concerning the approximation error, the notion of universally sufficiency was intro- duced for families of data-dependent partitions and some general sufficient conditions stipulated extending a well known shrinking cell condition property [21, 5]. For the estimation error, the Vapnik and Chervonenkis inequality [70] was used to control esti- mation error in the problem. In this context, the fact that the log function — an inte- gralpartofinformationtheoreticquantities—isnotabsolutelycontinuousrequiresthe introduction of stronger conditions on structural properties of the data-driven partition family,intermsofthegrowthandmaximumcellcountfunctionals,relativetotheresults presented for histogram-based density estimation [46]. In the application of this result, 118 the mentioned point is reflected on stronger conditions for the properties that a specific data-dependent construction needs to satisfy to make its histogram-based divergence estimate consistent. This suggests that the universal divergence estimation problem is morechallengingintermsofdata-drivendesignconditionscomparedwiththeproblem ofdensityestimationandclassification. The results presented in this work provide concrete design conditions to implement empirical divergence estimates. Furthermore, the problem formulation offers the possi- bilityofextendingthistypeofhistogram-basedconstructionandresultstotheestimation of other information theoretic quantities — like the mutual information or the differen- tialentropy,aswellasusingtherichmachineryofstatisticallearningtheory[70,21]to explorenewfindings. 119 4.9 TechnicalDerivations 4.9.1 ProofofTheorem4.1 Proof: Let us fix an arbitrary > 0 and a measurable partition π(/2) = {A 1 ,..,A r }suchthat, D(P π(/2) ||Q π(/2) )>D(P||Q)−/2, (4.46) bydefinitionofthedivergencein(4.11). 
Consideringthat|π(/2)|<∞andthatxlogx is continuous real function, it is not difficult to show thatD(P π(/2) ||Q π(/2) ) is a con- tinuous function with respect to the total variational distance in the product space of probability measures on (R d ,σ(π(/2))) under some additional conditions. More pre- ciselyfor/2,∃δ 1 > 0andδ 2 > 0,suchthatif, sup i=1,..,r P 1 (A i )−P 2 (A i ) <δ 1 (4.47) sup i=1,..,r Q 1 (A i )−Q 2 (A i ) <δ 2 , (4.48) andP 1 Q 1 ,P 2 Q 2 then, D(P 1 π(/2) ||Q 1 π(/2) )−D(P 2 π(/2) ||Q 2 π(/2) ) </2. (4.49) Then a direct consequence of the hypotheses of the theorem is that there exits {π ∗ m ,m∈ N}⊂Qsuchthat liminf m→∞ D(P π ∗ m ||Q π ∗ m )>D(P π(/2) ||Q π(/2) )−/2, (4.50) 120 Q-almost surely. Finally, note that D(P πm(Y 1 m ) ||Q πm(Y m 1 ) ) ≥ D(P π ∗ m ||Q π ∗ m ), because by construction π ∗ m ⊂ σ(π m (Y 1 m )) and consequently π m (Y 1 m ) is a refinement of π ∗ m , ∀m∈ N[32]. Thenitfollowsthat, liminf m→∞ D(P πm(Y 1 m ) ||Q πm(Y m 1 ) )>D(P π(/2) ||Q π(/2) )−/2 >D(P||Q)−, (4.51) with probability one. Given that can be chosen arbitrarily small, then liminf m→∞ D(P πm(Y 1 m ) ||Q πm(Y m 1 ) ) ≥ D(P π(/2) ||Q π(/2) ) Q-almost surely and in con- junctionwith(4.13)theresultisproved. 4.9.2 Proof of Theorem 4.2: Shrinking cell conditions for univer- sallysufficientdata-dependentpartition Proof: Letusfirstcheckthattheshrinkingcellconditionin(4.18)canbeextended to the measure P, considering that P Q. Using the short-hand notation Y m 1 = Y 1 ,...,Y m ,(4.18)isequivalentto lim m→∞ Q [ A∈πm(Y m 1 ) diam(A)>γ A = 0, (4.52) Q-almostsurely ∀γ > 0. Letusfixγ > 0andanadmissiblerealizationoftheprocessy 1 ,y 2 ···,i.e. arealiza- tion where (4.52) holds. Based on this sequence let us define the measurable sequence of eventsB m = S A∈πm(y m 1 ) diam(A)>γ A∈B(R d ),∀m∈ N. Then from the fact thatP Q and (4.52),f m (x)≡ ∂P ∂Q (x)· I Bm (x) tends to zero asm tends to infinity forQ-almost every 121 x∈ R d . Given thatf m (x)≤ ∂P ∂Q (x) — this RD derivativeQ-integrable, the application ofthedominatedconvergencetheorem[73,34]impliesthat lim m→∞ Z ∂P ∂Q (x)· I Bm (x)·∂Q(x) = 0⇔ lim m→∞ P(B m ) = 0. This last result holds for any admissible realization of the processY 1 ,Y 2 ,···. Conse- quentlyfrom(4.52),∀γ > 0, lim m→∞ P [ A∈πm(Y m 1 ) diam(A)>γ A = 0. (4.53) Q-almost surely. In other words, the measure ( P andQ) of cells of our random data- dependentpartitionscheme{π m (Y m 1 ) :m∈ N},withdiametergreaterthananarbitrary non-zeronumbertendstozeroalmostsurelyas mtendstoinfinity 6 . Returning to the main problem, the proof reduces to checking the sufficient condi- tions of Theorem 4.1. Let us consider an arbitrary partitionπ ={A 1 ,..,A r }∈Q. We will concentrate on proving the result for the measureQ, i.e. (4.16), however the con- structionoftherequiredpartitionsequence{π ∗ 1 ,π ∗ 2 ,···},stipulatedinTheorem4.1,and theproofargumentextendforP aswell. Let us consider an arbitrary δ > 0. Given that Q is absolutely continuous with respect to the Lebesgue measure λ, there is a bounded measurable set B, such that 6 The fact that P Q and that the shrinking cell condition for the measure Q implies the same condition under the measure P, (4.53), provide theoretical justification of the proposed data-dependent construction,(4.10),whichonlydependsonrealizationsofthereferencemeasure. 122 Q(B)> 1−δ/2. Letusdefine ¯ π = ¯ A 1 ,.., ¯ A r with ¯ A j =B∩A j ,∀j = 1,..,r asthe partitionofB inducedbyπ,wherebyconstructionofB itfollowsthat max j∈{1,..,r} Q(A j )−Q( ¯ A j ) <δ/2. 
(4.54) Wedefine{B m 1 ,..,B m r }asthecoveringofπ inducedbyπ m (Y m 1 )by B m j = [ A∈πm(Y m 1 ) A∩A j 6=∅ A, ∀j∈{1,..,r}. (4.55) Based on{B m 1 ,..,B m r }, we can induce a partitionπ ∗ m = {A m 1 ,..,A m r } ⊂ σ(π m (Y m 1 )) that approximatesπ by the following construction: A m 1 = B m 1 ,A m 2 = B m 2 \B m 1 ,···, A m r =B m r \ ∪ r−1 j=1 B m j . Fromthisconstructionwecanalsodefine ¯ π ∗ m = ¯ A m 1 ,.., ¯ A m r asthepartitionofB inducedbyπ ∗ m ,wherewehavethat, max j∈{1,..,r} Q(A m j )−Q( ¯ A m j ) <δ/2, (4.56) uniformly∀m∈ N. Inaddition,foranymeasurablesetA∈B(R d )letusdefineitsγ-opencovering by A γ+ = [ x∈A B(x,γ), (4.57) withB(x,γ) denoting the open ball centered atx and radiusγ, and theγ-interior ofA by A γ− =A\ [ x∈δ(A) B(x,γ), (4.58) 123 withδ(A) representing the boundary points ofA. Let us consider an arbitraryA i ∈ π, then |Q(A i )−Q(A m i )|< Q(A i )−Q( ¯ A i ) + Q( ¯ A i )−Q( ¯ A m i ) + Q( ¯ A m i )−Q(A m i ) <δ + Q( ¯ A i )−Q( ¯ A m i ) , ∀m∈ N. (4.59) Given that ¯ A i is bounded andλ(δ( ¯ A i )) = 0, by the continuity ofλ under monotone set sequences [34], ∀ > 0, ∃γ > 0, such that λ( ¯ A γ+ i \ ¯ A γ− i ) < , where given that Q λ, the same is true considering the measure Q. Hence, let us fix γ such that Q( ¯ A γ+ i \ ¯ A γ− i )<. andletusdefinetheeventS m γ inB(R d·m )by S m γ = y m 1 ∈ R d·m :diam(π m (y m 1 ))<γ , (4.60) 124 withdiam(π m (y m 1 )) = max A∈πm(y m 1 ) diam(A). IfinQ( ¯ A m i )wemakeexplicititsdepen- dency onY m 1 , from the fact that Q( ¯ A i )−Q( ¯ A m i (Y m 1 )) ≤ Q( ¯ A i 4 ¯ A m i (Y m 1 )) we have thefollowingsequenceofinequalities: Q( ¯ A i )−Q( ¯ A m i (Y m 1 )) ≤Q( ¯ A γ+ i \ ¯ A γ− i )· I S m γ (Y m 1 )+ [Q( ¯ A γ+ i \ ¯ A γ− i )+Q [ A∈πm(Y m 1 ) diam(A)>γ A ]· I (S m γ ) c(Y m 1 ) (4.61) =Q( ¯ A γ+ i \ ¯ A γ− i )+Q [ A∈πm(Y m 1 ) diam(A)>γ A · I (S m γ ) c(Y m 1 ) ≤Q( ¯ A γ+ i \ ¯ A γ− i )+Q [ A∈πm(Y m 1 ) diam(A)>γ A (4.62) ≤+Q [ A∈πm(Y m 1 ) diam(A)>γ A , ∀m∈ N. (4.63) The first inequality, (4.61), is obtained by construction of ¯ A m i and the fact that we con- ditionontheeventsS m γ and(S m γ ) c ,respectively 7 . Thesecondisjustbydefinitionofthe indicatorfunctionandthelastfromourchoiceofγ. Howeverbyhypothesisasmtends to infinity Q S A∈πm(Y m 1 ) diam(A)>γ A ! tends to zero almost surely with respect to the process distributionofY 1 ,Y 2 ,··· independentofγ. Then,from(4.59)and(4.63), limsup m→∞ |Q(A i )−Q(A m i )|<δ +, (4.64) 7 By construction condition on Y m 1 ∈ S m γ , it is clear that ¯ A i 4 ¯ A m i (Y m 1 ) ⊂ ¯ A γ+ i \ ¯ A γ− i , where in general ¯ A i 4 ¯ A m i (Y m 1 )⊂ S x∈δ( ¯ Ai) π(x|Y m 1 ) = S A∈πm(Y m 1 ) A∩δ( ¯ Ai)6=∅ A. 125 Q-almost surely. Note that can be chosen arbitrarily small in this last expression. FinallynotingthatthisresultisvalidforanymeasurableeventA i ∈πandthat|π|<∞, itfollowsthat limsup m→∞ sup i∈{1,..,r} |Q(A i )−Q(A m i )|<δ, (4.65) with probability one. Using the same partition sequence{π ∗ 1 ,π ∗ 2 ,···} and proof argu- ments, the result in (4.65) can be shown for the measureP, which proves the theorem. 4.9.3 ProofofLemma4.2 Proof: Letusfirstnotethatfromc),∀m∈ N,∀A∈π m (Y m 1 ), |Q m (A)−Q(A)| Q m (A) ≤ |Q m (A)−Q(A)| k m /m , (4.66) thenwewillconcentrate inprovingthat sup A∈πm(Y m 1 ) |Qm(A)−Q(A)| km/m tendstozeroalmost surely as m → ∞. From the Borel-Cantelli lemma , a sufficient condition is to prove that∀> 0, X m≥0 Q sup A∈πm(Y m 1 ) |Q m (A)−Q(A)|>·k m /m ! <∞, (4.67) 126 where Q denotes the process distribution of the empirical process Y 1 ,Y 2 ,···. Let us consideranarbitrary> 0. 
Thenwehavethat, Q sup A∈πm(Y m 1 ) |Q m (A)−Q(A)|>·k m /m ! ≤ Q sup A∈σ(πm(Y m 1 )) |Q m (A)−Q(A)|>·k m /m ! ≤ Q sup π∈Am sup A∈σ(π) |Q m (A)−Q(A)|>·k m /m ! ≤ 4Δ ∗ 2m (A m )2 M(Am) exp − (·km) 2 8·m , (4.68) where the last inequality is from the Vapnik-Chervonenkis inequality [72], by upper bounding the scatter coefficient of the collection of events S π∈Am σ(π) by 2 M(Am) Δ ∗ 2m (A m ), see proof of Lemma 1 in [46]. Finally using conditions a), b) and c)fromTheorem4.3wegetthat, lim m→∞ 1 m l ·log Q sup A∈πm(Y m 1 ) |Q m (A)−Q(A)|>·k m /m ! ≤− lim m→∞ 2 · k 2 m m 1+l =−·C, (4.69) forsomeC > 0. Consequentlythetermofthesummationin(4.67)isdominatedbythe sequence (exp −·C·m l · ) m∈N , where given that P m∈N exp −·C·m l <∞ for anyl∈ (0,1), theresultisproved. 4.9.4 Detailsonsomealmostsurelyderivations Weknowthat, lim m→∞ sup A∈πm(Y m 1 ) Q(A) Q m (A) −1 = 0, (4.70) 127 with probability one. Then if we denote the event S m () = n y m 1 ∈ R d·m : sup A∈πm(y m 1 ) Q(A) Qm(A) −1 < o , then (4.70) is equivalent to [4]: ∀ > 0, Q(∪ k∈N ∩ m≥k S m ()) = 1. It is straightforward to show that if sup A∈πm(y m 1 ) Q(A) Qm(A) −1 < then sup A∈πm(y m 1 ) Q(A) Qm(A) −1 < and inf A∈πm(y m 1 ) Q(A) Qm(A) −1 <,whichfrom(4.70)impliesthat, lim m→∞ sup A∈πm(Y m 1 ) Q(A) Q m (A) = 1, (4.71) lim m→∞ inf A∈πm(Y m 1 ) Q(A) Q m (A) = 1. (4.72) Q-almost surely. Finally, the result lim m→∞ sup A∈πm(Y m 1 ) Qm(A) Q(A) = 1 Q-almost surely comesdirectlyfrom(4.72). 4.9.5 ShrinkingCellConditionfortheGessaman’sPartition Given that the Gessaman’s partition is monotone transformation invariant [21], we can restrict to the scenario whereP andQ are defined on ([0,1] d ,B([0,1] d )), see Appendix 4.10.1fordetailsaboutthisargument. On the other hand, every cell of the partition ruleπ m is a finite dimensional rectan- gle — axis-pallalel hyperplanes are used to construct the partition, and consequently a sufficientconditiontoprovetheshrinkingcellconditionisthatasmtendstoinfinity, E Q d X i=1 length i (π m (X|Y m 1 )) ! → 0 (4.73) almost surely with respect to the process distribution of Y 1 ,Y 2 ···, where length i (A) denotes the Lebesgue measure of the projection of A on the i-coordinate — see Appendix4.10.2forfurtherexplanation. 128 Definition4.4 LetA⊂ [0,1] d beafinitedimensionalrectangle,oftheform N d i=1 [l i ,u i ) withl i <u i . Letπ j (A) be a partition ofA induced by axis parallel hyperplanes on the j coordinate. Wesaythatπ j (A)is-statisticallyequivalent withrespecttoameasureQ if, max B∈π j (A) Q(B)≤ Q(A) |π j (A)| · √ 1+. (4.74) Notethatbyconstruction, P B∈π j (A) P d i=1 length i (B)·Q(B)≤ length j (A)· Q(A) √ 1+ |π j (A)| + X i6=j length i (A)·Q(A). (4.75) Our data-dependent construction is a concatenation of the type of axis parallel par- titionpresentedinDefinition4.4. Then,thefollowingresultholds. Proposition4.2 Letπ m (Y m 1 ) be a data-dependent Gessaman’s partition of [0,1] d with T m splits per coordinate. If during the construction of π m (Y m 1 ) all its axis-paralled partitionsare–statisticallyequivalentwithrespecttothereferencemeasureQ,then, E Q d X i=1 length i (π m (X|Y m 1 )) ! ≤ d· √ 1+. T m . (4.76) Proof: By construction,π m (Y m 1 ) can be seen as the concatenation of 1 +T m + T 2 m +···T d−1 m family of axis-paralled partitions. Then the proof can be derived from a recursiveapplicationof(4.75). Thedetailsareomittedhereforspaceconsideration. 
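Since the recursive application of (4.75) behind Proposition 4.2 is only alluded to above, the following sketch of the omitted step may be useful; it assumes, as in Appendix 4.10.1, that the space has already been reduced to [0,1]^d, so length_j([0,1]^d) = 1 for every coordinate j. Fix a coordinate j and consider the stage of the construction at which coordinate j is split: every parent cell A is divided into |π_j(A)| = T_m children, and by (4.74),

Σ_{B∈π_j(A)} Q(B)·length_j(B) ≤ (max_{B∈π_j(A)} Q(B)) · Σ_{B∈π_j(A)} length_j(B) ≤ (√(1+ε)/T_m) · Q(A) · length_j(A) ≤ (√(1+ε)/T_m) · Q(A).

Summing over all parent cells at that stage gives Σ_C Q(C)·length_j(C) ≤ √(1+ε)/T_m, and the later coordinate splits only refine each cell without changing its j-extent or its total Q-mass, so the same bound is inherited by the final cells of π_m(Y^m_1). Adding over j = 1,..,d yields E_Q(Σ_{i=1}^d length_i(π_m(X|Y^m_1))) ≤ d·√(1+ε)/T_m, which is (4.76).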
If we define B m () ⊂ B(R d·m ) as the set of realizations of the empirical process whereπ m (y m 1 )isconcatenationof–statisticallyequivalentpartitions,thenfrom(4.76) E Q d X i=1 length i (π m (X|Y m 1 )) ! ≤ d· √ 1+. T m · I Bm() (Y m 1 ) +d·T m · I Bm() c(Y m 1 ), (4.77) 129 For proving (4.73), we are interested in the event A m () = n y m 1 : E Q P d i=1 length i (π m (X|y m 1 )) > o ∈ B(R d·m ). From (4.77) fixing 0 > 0, ∀ > 0 we have that eventually A m () ⊂ B m ( 0 ) c and consequently (Q m (A m ()) (Q m (B m ( 0 ) c )), where Q m denotes the probability measure on (R d·m ,B(R d·m ))inducedbyrestrictingtheprocesstothefiniteblockY m 1 . Inaddition,weusethefollowingunionbound, Q m (B m ( 0 ) c )≤ T d m −1 T m −1 Q m (B o m ( 0 )), (4.78) where B o m ( 0 ) ∈ B(R d·m ) denotes the event that one of the 1 +T m +T 2 m +···T d−1 m axis-paralledpartitionsof π m (y m 1 )isnot 0 –statisticallyequivalent. Tofindanexpressionfor Q m (B o m ( 0 )),withoutlossofgenerality,letusconsiderA = [0,1] d ,acoordinatej∈{1,..,d}andπ j (A) ={A 1 ,..,A Tm }apartitionofAbasedon ¯ m i.i.d. samples points projected in thej-coordinate, let say ¯ Y 1 (j) < ¯ Y 2 (j),··· < ¯ Y ¯ m (j). IfF(x) and ˆ F ¯ m (x) denote thej-marginal distribution function and its empirical coun- terpart respectively (associated with the reference measure), it is simple to show that if π j (A) is not 0 –statistically equivalent, then sup x∈[0,1] ˆ F ¯ m (x)−F(x) > √ 1+ 0 −1 Tm (Chapter20.3)[21]. Consequently, Q m (B o m ( 0 ))≤ Q m ( sup x∈[0,1] ˆ F ¯ m (x)−F(x) > √ 1+ 0 −1 T m )! ≤ 2·exp −2· ¯ m· √ 1+ 0 −1 T m 2 ! ≤ 2·exp −2·k m · √ 1+ 0 −1 T m 2 ! . (4.79) 130 Thesecondinequalityisobtainedfromthelargedeviationresultin[21](Theorem12.9), where the last inequality is because of the fact that in our construction∀A∈ π m (Y m 1 ), Q m (A)≥ km m andthenwecanconsider ¯ m≥k m . Consequently,from(4.78)and(4.79), Q m (B m ( 0 ) c )≤ 2· T d m −1 T m −1 ·exp −2·k m · √ 1+ 0 −1 T m 2 ! ≤ 2·T d m ·exp −2· k m b(m/k m ) 1/d c 2 · √ 1+ 0 −1 2 ≤ 2· m k m ·exp −2· k 2 m m · √ 1+ 0 −1 2 . (4.80) In these sequences of inequalities we use thatT 2 m =b(m/k m ) 1/d c 2 ≤b(m/k m ) 1/2 c 2 ≤ m/k m (consideringd≥ 2). Finally, noting that (k m )≈ (m 0.5+l/2 ), for somel∈ (0,1), then, (Q m (B m ( 0 ) c ))) (m 0.5−l/2 ·exp −2·m l ·( √ 1+ 0 −1) 2 ), (4.81) upper bound sequence that tends to zero asymptotically. Moreover from (4.81), P m Q m (B m ( 0 ) c ) <∞ and then P m Q m (A m ()) <∞∀ > 0, which from the Borel Cantellilemmaprovestheresult. 4.10 ProofofLemma4.3 Proof: We closely follow the proof technique presented by Devroye, Gyorfi and Lugosi [21] (Theorem 20.2). Subsections 4.10.1, 4.10.2 and 4.10.3 provide some pre- liminariesforthefinalresultin4.10.4. 131 4.10.1 Reducingtheproblemtoaboundedmeasurablespace Note that the partition scheme Π is monotone transformation invariant [21], in the sense that for any function F : R d → R d that can be expressed by F(x) = (f 1 (x(1)),··· ,f d (x(d))) for some collection of strictly increasing real functions {f i (·) :i = 1,..,d}, any partition rule π m = π(T m ) of Π satisfies that ∀x ∈ R d , ∀y m 1 ∈ R d·m ,π m (x|y 1 ,..y m ) =π m (F(x)|F(y 1 ),..F(y m )). We can consider f i (·) to be the distribution function of the marginal probability Q restricted to events on the i-coordinate, ∀i ∈ {1,..,d}. Without loss of generality we can restrict to the case when {f i (·) :i = 1,..,d} are strictly increasing functions. 
Consequently, the induced distributions of the transform space, denoted by ¯ Q and ¯ P respectively,havesupporton[0,1] d andsatisfiesthat[32] D(P||Q) =D( ¯ P|| ¯ Q), (4.82) because F(·) is one-to-one continuous mapping from R d to [0,1] d (more precisely F −1 (A) :A∈B([0,1] d ) = B(R d )). Moreover, by construction of the divergence estimator, Section 4.3, if we apply Π in the transform domain, i.e. we estimate the empirical distributions using the transform i.i.d. realizations F(X 1 ),..,F(X m ) and F(Y 1 ),..,F(Y m ) — denoted by ¯ P n and ¯ Q m on σ(π(F(Y 1 ),..,F(Y m ))), and estimate thedivergence ˆ D m,n ( ¯ P n , ¯ Q m )by(4.9),itissimpletocheckthat ˆ D m,n (P n ,Q m ) = ˆ D m,n ( ¯ P n , ¯ Q m ). (4.83) In other words, the empirical divergence based on our TSP Π is invariant under mono- tone coordinate transformations. Then from (4.82) and (4.83), we can reduce the prob- lem to checking the asymptotic sufficient nature of Π and consequently the shrinking 132 cell condition for the case whenQ (and consequentlyP) is defined on the measurable space([0,1] d ,B([0,1] d )withuniformcoordinatemarginaldensityfunctionsfortheref- erencemeasure. 4.10.2 Formulationofasufficientcondition Let π m (Y m 1 ) = π(T m (Y m 1 )) denote the m-sample partition rule of Π. Given that π m (Y m 1 ) is induced by axis-parallel hyperplanes, every cell U t ∈ π m (Y m 1 ) is a finite dimensional rectangle of the form⊗ d i=1 [l i ,u i ) (with the possible open and closed inter- valvariations). Inthisscenario,∀t∈T m (Y m 1 ), diam(U t )≤ d X i=1 length i (U t ), (4.84) with length i (U t ) denoting the Lebesgue measure of the projection of U t on the i- coordinate. Then from (4.84) and Markov’s inequality, for proving the shrinking cell conditionitsufficestoshowthat[21], E Q d X i=1 length i (π m (X|Y m 1 )) ! = Z [0,1] d d X i=1 length i (π m (x|Y m 1 ))∂Q(x)→ 0, (4.85) almostsurelywithrespecttotheprocessdistributionofY 1 ,Y 2 ··· asmtendstoinfinity. 4.10.3 -goodmediancuts WithoutlossofgeneralityletusconsiderU =⊗ d i=1 [l i ,u i ]tobearectangleinB([0,1] d ) with probability Q(U) = p > 0. Let n H 0 0 ,H 1 0 ,H 1 1 ,··· ,H d−1 0 ,..,H d−1 2 d−1 −1 o be a 133 sequence of axis-parallel hyperplanes used to recursively split the space in every coor- dinate and creates a partition of U with 2 d cells. More precisely, H 0 0 parallel to the 1-coordinate splits U 0 0 =U into two rectanglesU 1 0 ,U 1 1 , thenH 1 0 andH 1 1 parallel to the 2-coordinatesplit U 1 0 andU 1 1 intoU 2 0 ,U 2 1 ,andU 2 2 ,U 2 3 respectively,andinductivelyatthe endoftheprocessatreestructuredpartition U d j :j = 0,..,2 d −1 forU iscreated. Definition4.5 Let us denote the probability of every induced rectangle by p l j = Q(U l j ) in the aforementioned construction, then we say that n H 0 0 ,H 1 0 ,H 1 1 ,··· ,H d−1 0 ,..,H d−1 2 d−1 −1 o is a sequence of -good median cuts for U if,∀l∈{0,..,d−1}andj∈ 0,..,2 l −1 . max(p l+1 2j ,p l+1 2j+1 )≤ 1 2 (1+) 1/d ·p l j . (4.86) Thefollowingpropositionisimportantforprovingthemainresult. Proposition4.3 Let U be a finite dimensional rectangle in B([0,1] d ), and U d j :j = 0,..,2 d −1 a partition of U induced by sequence of -good median cuts. Then, 2 d −1 X j=0 p d j · d X i=1 length i (U d j )≤ 1+ 2 ·p· d X i=1 length i (U) (4.87) Theproofisasimpleconsequenceof(4.86)andisomittedhereforspaceconsiderations. 4.10.4 ShrinkingcellconditionforbalancedTSP LetusfocusagainonourbalancedTSPΠ ={T 1 ,T 2 ,···}ofdepthd m ,i.e.|π m (Y m 1 )| = 2 dm . 
In addition, let us consider ¯ Π = ¯ T 1 , ¯ T 2 ,··· , with TSP rule sequence given by ¯ T 1 (y m 1 ) ≡ T ¯ dm m (y m 1 ) and ¯ π m (y m 1 ) ≡ π( ¯ T m (y m 1 )) for ¯ d m = d · bd m /dc. From Proposition 4.1 it is sufficient to prove the shrinking cell condition for the pruned 134 balanced TSP ¯ Π, The reason for this reduction is that by construction the depth of ¯ T m (Y m 1 ) is power of d and then we can recursively use Proposition 4.3 to bound E Q P d i=1 length i (¯ π m (X|Y m 1 )) . More precisely, if we condition to the event B m () ∈ B(R d·m ) that all the axis- parallel hyperplanes that induce ¯ T m (Y m 1 ) are-good median cuts, then from (4.87) we havethefollowingbound, E Q d X i=1 length i (¯ π m (X|Y m 1 )) ! ≤ 1+ 2 rm ·d, (4.88) withr m =bd m /dc. Let us choose 0 > 0 sufficiently small in order that 1 + 0 < 2. Then from (4.88) as r m → ∞ (when m → ∞), the event A m () = n y m 1 ∈ R d·m : E Q P d i=1 length i (¯ π m (X|y m 1 )) > o ∈ B(R d·m ) is eventually con- tained inB m ( 0 ) c ,∀ > 0. Consequently, let us focus on the analysis of Q m (B m ( 0 ) c ). By definition B m ( 0 ) c is the event that one of the cut of ¯ T m (Y m 1 ) is not 0 -median good. By construction the number of hyperplanes splitting ¯ T m (Y m 1 ) is given by (1+2+···+2 ¯ dm−1 ),then Q m (B m ( 0 ) c )≤ 2 ¯ dm · Q m (B o m ( 0 )) (4.89) withB o m ( 0 ) denoting the event that one cut is not 0 -median good. Devroye et al. [21] (Theorem20.2)showedforthiscaseofbalancedtreesthat, Q m (B o m ( 0 ))≤ 2·exp − m 2 ¯ dm+2 ·((1+ 0 ) 1/d −1) 2 , (4.90) 135 formsufficientlylarge. Consequently,from(4.89)and(4.90),thereexistsK > 0such, Q m (B m ( 0 ) c )≤K·exp log(2)· ¯ d m − m 2 ¯ dm+2 ·((1+ 0 ) 1/d −1) 2 , (4.91) ∀m ∈ N. From definition of ¯ d m , we have thatd m −d < ¯ d m ≤ d m , and consequently fromthehypothesisin(4.42),itissimpletoshowthat,thereexists(a m )≈m p forsome p> 0,suchthat m ¯ d m 2 ¯ dm − a m ¯ d m →∞ (4.92) whichfrom(4.91)issufficienttoshowthat, Q m (B m ( 0 ) c ) exp(−m p ) → 0 (4.93) asmtendstoinfinity. Finally,wehavethatlimsup m A m ()⊂ limsup m B m ( 0 ) c ,∀> 0, then given that P m Q m (B m ( 0 ) c ) < ∞ from (4.93), and the Borel-Cantelli lemma , E Q P d i=1 length i (¯ π m (X|Y m 1 )) tends to zero with probability one with respect to the process distribution ofY 1 ,Y 2 ,···, which proves the lemma by the argument presented inAppendix4.10.2. 136 Chapter5 OnUniversallyConsistentHistogram basedEstimatesfortheMutual Information The problem of mutual information estimation based on data-dependent partition for continuous distribution is presented. A general histogram-based estimate is proposed considering non-product data-driven partition schemes. From the fundamental connec- tion with the Kullback-Leibler divergence, a general set of sufficient conditions was stipulated using the combinatorial complexity indicator for partition families, proposed by Lugosi and Nobel, and the celebrated Vapnik-Chervonenkis inequality . These statis- tical tools were used to provide distribution-free bounds for the estimation and approx- imation error and in that way to guarantee a universal strongly consistent estimate for mutual information. On the application side, two emblematic data-dependent construc- tions are derived from this result, one based on statistically equivalent blocks and the other, on a tree-structured vector quantization scheme. A range of design values were stipulatedtoguaranteestronglyconsistentestimatesforbothframeworks. 
Furthermore, experimental results under controlled settings demonstrate the superiority of these data-driven techniques in terms of a bias-variance analysis when compared to conventional product histogram-based and kernel plug-in estimates.

Index Terms: Mutual information estimation, data-dependent partitions, statistical learning theory, tree-structured vector quantization.

5.1 Introduction

Mutual information specifies the level of statistical dependency between a pair of random variables [32, 16]. This quantity is fundamental to characterizing some of the most remarkable results in information theory, among them the performance limit for the rate of reliable communication through a noisy channel and the achievable rate-distortion curve in lossy compression [32, 16]. Mutual information has also been adopted in statistical learning and decision theory contexts. It has been used as a fidelity indicator, primarily because of Fano's inequality [16], finding important applications as a tool for statistical analysis [28, 45], in feature extraction [53, 63, 65], in detection [13], in image registration and segmentation [69, 6, 39], and recently to characterize performance limits on pattern recognition [79], to mention only a brief spectrum of important applications.

Usually these applications rely on empirical data to estimate the mutual information, since prior information about the distribution is not available. Hence, the problem of universal mutual information estimation based on independent and identically distributed realizations of the involved distribution becomes crucial, as addressed in many of these works. There is an extensive literature dealing with the related problem of differential entropy estimation for distributions defined on finite-dimensional Euclidean spaces; Beirlant et al. [1] provide an excellent review. In particular, consistency is well understood for classical histogram-based [33] and kernel plug-in estimates [1]. In this context, however, the role of data-dependent vector quantization has not been studied with the same level of detail, in particular concerning consistency. This was originally pointed out by Darbellay et al. [18], who proposed the use of non-product data-driven techniques for mutual information estimation. The motivation is that a non-product data-dependent partition provides the flexibility to improve the approximation quality of classical product histogram-based constructions and, consequently, a relative improvement in the small-sample regime of the estimate. Darbellay et al. [18] propose a tree-structured partition scheme that partitions the space into statistically equivalent bins and uses a local stopping rule based on thresholding the conditional empirical mutual information obtained during an inductive partition process. While this work showed promising empirical evidence, as the authors pointed out [18], consistency is a challenging and open problem for this type of construction. On the application side, this implies that the mentioned stopping criterion has to be triggered empirically.

The work of the present paper was motivated by this general direction, but a conceptually different way to address the problem is proposed. First, we plan to study general distribution-free conditions that guarantee data-dependent histogram-based constructions to be strongly consistent for mutual information. Then, the idea is to particularize this consistency result to specific adaptive histogram-based estimates. With regard to the first goal, a non-product data-dependent construction is presented and a set of general sufficient conditions is proposed to make its induced histogram-based mutual information estimate strongly consistent.
This approach was motivated by the seminal work of Lugosi et al. [46] and Nobel [49], extending the role of their combinatorial notions of complexity for partition families and a Vapnik-Chervonenkis type of inequality [71, 70]. In particular, these tools are used to control the tradeoff between estimation and approximation error which, as explained in detail in the following sections, is implicit in this learning problem [21], and in that way to guarantee universal consistency. This research direction follows our previous work on divergence estimation [66]. However, in many respects the setting of the problem is fundamentally different and raises hitherto unexplored technical and practical challenges, as we shall point out along the rest of this paper.

We consider two important applications of the main consistency result: statistically equivalent blocks — Gessaman's data-dependent partition [29] — and tree-structured vector quantization (TSVQ) [21]. In both contexts, our general result yields a range of design values for which the induced estimates show the expected consistent behavior. Finally, we provide a systematic empirical analysis in a controlled simulation setting demonstrating the superiority of the proposed data-driven schemes when compared to conventional histogram-based and kernel plug-in estimates.

The rest of the paper is organized as follows. Section 5.2 introduces the problem and summarizes some statistical learning results that will be used in the development of the paper. Section 5.3 presents the non-product histogram-based construction. Section 5.4 introduces relevant aspects of the mutual information estimation problem and formulates the main consistency result. Section 5.5 provides details about the two data-dependent partition schemes used to estimate the mutual information. Finally, Section 5.6 reports experimental validation and Section 5.7 gives the final remarks and future work.

5.2 Preliminaries

Consider two random variables X and Y taking values on the spaces X and Y, respectively, with a joint distribution denoted by P_{X,Y}. We focus on finite-dimensional continuous spaces, i.e., X = R^p and Y = R^q, and consequently P_{X,Y} is defined on the Borel sigma field, denoted by B(R^d), for d = p + q. In this case the mutual information between X and Y can be expressed by [16]

I(X;Y) = D(P_{X,Y} || P_X × P_Y),    (5.1)

where P_X × P_Y is the probability distribution on R^d induced by the product of the marginals of X and Y (the joint probability under which X and Y are independent), and D(P||Q) denotes the Kullback-Leibler divergence [40], given by

D(P||Q) = ∫ log( ∂P/∂Q (x) ) ∂P(x).    (5.2)

(A small numerical illustration of the identity (5.1) for quantized variables is given at the end of this section.)

We are interested in the problem of estimating I(X;Y) based on independent and identically distributed (i.i.d.) realizations, denoted by (X_1,Y_1), ..., (X_n,Y_n), of the joint distribution P_{X,Y}. To simplify notation we denote by Z_i the joint vector (X_i,Y_i) on R^d and by Z_1^k the sequence of realizations (Z_1,..,Z_k) on R^{d·k}. In particular, in this work we focus on the classical histogram-based approach [21, 46], where the inference problem involves three phases: first, characterize a finite data-dependent partition π_n ⊂ B(R^d) of the space to which we restrict the learning problem; second, estimate the involved distributions P_{X,Y} and P_X × P_Y restricted to π_n; and finally, use the empirical distributions to obtain an empirical mutual information estimate. Before going into the concrete formulation of the estimate, we need to introduce some terminology for data-dependent partitions and their associated statistical learning results.
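As a quick illustration of the identity (5.1) in the finite-alphabet (quantized) case — the continuous case is the subject of the rest of the chapter — the following Python sketch computes I(X;Y) from a joint probability table as the divergence between the joint distribution and the product of its marginals. The function name and the small joint table are illustrative assumptions only.

import numpy as np

def mutual_information_table(P_xy):
    # I(X;Y) for a finite joint probability table, computed as
    # D(P_XY || P_X x P_Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ].
    P_x = P_xy.sum(axis=1, keepdims=True)    # marginal of X (rows)
    P_y = P_xy.sum(axis=0, keepdims=True)    # marginal of Y (columns)
    mask = P_xy > 0                          # 0 log 0 = 0 convention
    return float(np.sum(P_xy[mask] * np.log(P_xy[mask] / (P_x * P_y)[mask])))

P_xy = np.array([[0.30, 0.20],
                 [0.20, 0.30]])              # a mildly dependent binary pair
print(mutual_information_table(P_xy))        # strictly positive; 0 only under independence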
5.2.1 Data-DependentPartitionSchemes Let us denote byπ a finite measurable partition of R d , and by|π| its cardinality. IfA is acollectionofpartitionsfor R d ,thenitsmaximumcellcount[46]isdenotedby, M(A) = sup π∈A |π|. (5.3) In addition, notions of combinatorial complexity for collection of events was pro- posed by Vapnik and Chervonenkis [72] and extended for collection of partitions by Lugosi and Nobel [46]. More precisely, letC ⊂ B(R d ) be a collection of measurable events,andx n 1 = (x 1 ,..,x n )beasequencesofpointsin R d ,thenwecandefine, S(C,x n 1 ) =|{{x 1 ,x 2 ,..,x n }∩A :A∈C}|, (5.4) and the scatter coefficient ofC byS n (C) = sup x n 1 ∈R d ·n S(C,x n 1 ). The scatter coefficient isanindicatoroftherichnessofC todichotomizesequenceofpointsinthespace,where 141 bydefinitionS n (C)≤ 2 n . Thefirstn∈ N, whereS n (C)isstrictlylessthan 2 n iscalled the VC-dimension of the class [21], where if this condition is never satisfied then the class is set to have infinite VC-dimension. In particular, if C has finite VC-dimension V,thenthefollowingimportantconsequencefollows[72,21],∀n>V, S n (C)≤ (1+n) V , (5.5) where the scatter coefficient is bounded by a polynomial growth. This notion can be similarly extended for collection of partitions [46]. LetA be a collection of partitions andx n 1 = (x 1 ,..,x n )asequenceofpointsin R d ,then Δ(A,x n 1 ) =|{{x 1 ,x 2 ,..,x n }∩π :π∈A}|, (5.6) where {x 1 ,x 2 ,..,x n }∩π is a short-hand notation for the partition of {x 1 ,x 2 ,..,x n } inducedby{{x 1 ,x 2 ,..,x n }∩A :A∈π}. Inthesameway,thegrowthfunctionofAis givenbyΔ ∗ n (A) = sup x n 1 ∈R d·nΔ(A,x n 1 ). A n-sample partition rule π n (·) is a mapping from R d·n to the space of finite- measurable partitions for R d , that we denote by Q, where a partition scheme for R d isacountablecollectionofn-samplepartitionsrules Π ={π 1 (·),π 2 (·),...}. LetΠbean arbitrary partition scheme for R d , then for every partition ruleπ n (·)∈ Π we can define itsassociatedcollectionofmeasurablepartitionsby[46] A n = π n (x 1 ,..,x n ) : (x 1 ,..,x n )∈ R d·n . (5.7) Here, for a given n-sample partition rule π n (·) and a sequence (x 1 ,..,x n ) ∈ R d·n , π n (x|x 1 ,..,x n ) denotes the mapping from any point x in R d to its unique cell in π n (x 1 ,..,x n ),suchthatx∈π n (x|x 1 ,..,x n ). 142 5.2.2 Vapnik-ChervonenkisInequalities LetX 1 ,X 2 ,..,X n be i.i.d. realizations of a random vector with values in R d , withX ∼ P andP a probability measure on (R d ,B(R d )). Then for any measurable setA we can considertheempiricaldistributiononB(R d )by, P n (A) = 1 n n X i=1 I A (X i ), (5.8) where I A (x)istheindicatorfunctionofA∈B(R d ). A fundamental statistical learning problem is being able to bound the deviation of theempiricaldistribution(relativefrequencies)withrespecttotheprobabilityforacol- lection of measurable events. The following result provides a general answer for this problem. THEOREM5.1 (Vapnik and Chervonenkis [72]) LetC be a collection of measurable events,then∀n∈ N,∀> 0, P sup A∈C |P n (A)−P(A)|> ≤S n (C)·exp − n 2 8 , (5.9) where PreferstotheprocessdistributionofX 1 ,X 2 ,···. This is the celebrated Vapnik and Chervonenkis inequality, a distribution free bound that controls the uniform deviation of the empirical measure with respect to the prob- abilities, where the exponential rate of convergence of this bound is a function of the combinatorialcomplexityofthecollectionofmeasurableeventsconsidered. Lugosi and Nobel [46] proposed an extension of this inequality for a collection of measurable partitions. 
In this case the empirical measure is restricted to a partition 143 π⊂B(R d )(morepreciselytothemeasurablespace(R d ,σ(π)) 2 )wherethetotalvaria- tional distance is used to quantify the deviation between empirical and real distribution restricted to the measurable space (R d ,σ(π)). The following lemma states this result formally. LEMMA5.1 (Lugosi and Nobel [46]) LetA be a collection of measurable partitions for R d . Then∀n∈ N,∀> 0, P sup π∈A X A∈π |P n (A)−P(A)|> ! ≤ 4Δ ∗ 2n (A)2 M(A) exp − n 2 32 , (5.10) with P,theprocessdistributionofX 1 ,X 2 ,···. Interestingly the rate of convergence of this bound is a function of the combinatorial complexity of the partition family as in the case of the Vapnik-Chervonenkis inequality [72,21]. The following result that introduces the role of a partition scheme, the follows as a consequenceofLemma5.1andtheapplicationoftheBorel-Cantellilemma [4]. COROLLARY5.1 (Lugosi and Nobel [46]) Let us consider a sequence of partition familiesA 1 ,··· ,A n ,··· induced by a partition scheme Π. If whenn tends to infinity: n −1 M(A n )→ 0andn −1 logΔ ∗ n (A n )→ 0,then sup π∈An X A∈π |P n (A)−P(A)|→ 0, (5.11) withprobabilityonewithrespecttotheprocessdistributionofX 1 ,X 2 ,···. 2 σ(π)denotesthesmallestsigma-fieldthatcontains π,whichforthecaseofpartitionsisthecollection ofsetsthatcanbewrittenasunionofeventsinπ. 144 This important result states that uniformly in the collection of measurable partitions in A n the empirical distribution (relative frequencies) converges to the probability — in thetotalvariationaldistancesense—asntendstoinfinity P-almostsurely. 5.3 Mutual Information Estimate based on Data- DependentPartitions HerewereturntotheMIestimationproblemintroducedinthebeginningofSection5.2. For that, we consider a histogram-based approach based on data-dependent partition schemes, considering the notation presented in Section 5.2.1. Let Π ={π 1 ,π 2 ,···} be thepartitionscheme,whereweimposetheconditionthateverybininducedbythisfam- ilyofpartitionruleshasaproductform,i.e.,∀z n 1 = (z 1 ,..,z n )∈ R d·n ,everymeasurable setA∈π n (z n 1 )canbeexpressedinathefollowingproductform, A =A 1 ×A 2 (5.12) whereA 1 ∈ B(R p ) andA 2 ∈ B(R q ). Note that this condition does not imply that the partitionπ n (z n 1 ) has a product structure, i.e., can be written byQ 1 ×Q 2 , withQ 1 and Q 2 beingindividualpartitionsof R p and R q ,respectively. Remark5.1 The product bin condition in (5.12) is strictly necessary for being able to estimate P X,Y as well as the reference measure P X ×P Y just based on the i.i.d. realizationsofthejointdistributionP X,Y . Let us consider Z 1 ,..,Z n i.i.d. realizations with probability distribution P X,Y and partition scheme Π with the aforementioned product bin property. To simplify notation 145 we denote by P, the joint distribution and by P n , its empirical version given in (5.8). Then,theproposedmutualinformationestimateisgivenby, ˆ I n (X;Y) = X A∈πn(Z n 1 ) P n (A)·log P n (A) P n (A 1 × R q )·P n (R p ×A 2 ) , (5.13) whereA 1 ×A 2 denotestheproductformoftheeventA∈π n (Z n 1 ). Remark5.2 As pointed out in [18], this estimate is not formally a mutual information quantity, in other words is not the mutual information between quantized versions of the two random variables X and Y (this requires a product structure of the partition π n (Z n 1 )). As considered by Darbellay et al. [18] this empirical construction tries to estimate the KLD instead, right hand side of (5.1). 
Specifically, ˆ I n (X;Y) can be inter- preted as the KLD restricted to the sigma fieldσ(π n (Z n 1 )), between the empirical joint distribution and its empirical product counterpart (multiplication of marginals empiri- caldistributions). Notethatπ n (Z n 1 )isarandompartitiondrivenbytheempiricalprocess. Specificallythe learning problem involves two main phases, one to find a random partition function of the data, and then restricted to this measurable spaceσ(π n (Z n 1 )), finding the empirical distributions necessary to estimate the MI betweenX andY. Then a challenging prob- lem is to find general sufficient conditions on the data-dependent partition scheme Π thatguaranteeitsinducedMIestimate(5.13)tobealmostsurely(a.s.) consistent,i.e., lim n→∞ ˆ I n (X;Y) =I(X;Y) (5.14) 146 withprobabilityonewithrespecttotheprocessdistributionofZ 1 ,Z 2 ,···,foranycon- tinuous joint distribution P X,Y (universally consistent estimate). The answer to this importantquestionisformallyaddressedinthenextsection. 5.4 Strong Consistency for the Mutual Information Estimate In this section the universal strong consistency of our data-dependent construction is studied. This requires finding conditions to control the estimation and approxi- mation errors, which are presented in the majority of statistical learning problems [21, 46, 70, 22]. More precisely, the difference between the empirical MI data- dependentconstruction(5.13)andtheMIcanbeboundedbythefollowingtwoterms, ˆ I n (X;Y)−I(X;Y) ≤ X A∈πn(Z n 1 ) P X,Y (A)·log P X,Y (A) P X ×P Y (A) −I(X,Y) + ˆ I n (X;Y)− X A∈πn(Z n 1 ) P X,Y (A)·log P X,Y (A) P X ×P Y (A) . (5.15) The first term which is the approximation error (or bias of the estimate) only considers the effect of quantizing the space and not the use of empirical distributions — it is well known that quantization reduces the magnitude of information quantities [44, 32, 18]. The second term is the estimation error (or the variance term) that quantifies the devia- tion because of the use of empirical distributions. A natural direction is to find a good compromise between these sources of error as a function of the structural properties of the data-dependent partition scheme. Specifically, the objective is to make these two 147 termsvanishasymptoticallywithprobabilityonewithrespecttotheprocessdistribution ofZ 1 ,Z 2 ,···. Concerningtheestimationerror,wewillusethemachineryofstatisticallearningthe- ory, in particular the uniform bounds to control the deviation of empirical distribution withrespecttotheprobability,introducedinSection5.2.2. Todealwiththeapproxima- tion error, the next sub-section introduces some preliminary results that will be used to statethemainconsistencytheoremofthiswork. 5.4.1 ControllingtheApproximationError ForameasurableeventA∈ R d ,thediameterofthesetisgivenby, sup x,y∈A ||x−y||, (5.16) where||·||denotestheEuclidiannormin R d . Proposition5.1 Let Π be a partition scheme driven byZ 1 ,Z 2 ,··· a sequence of i.i.d. realizationsfollowingajointdistributionP X,Y thatisabsolutelycontinuouswithrespect totheLebesguemeasureλin R d . If∀> 0, lim n→∞ P X,Y z∈ R d :diam(π n (z|Z n 1 ))> → 0, (5.17) a.s. withrespectto P(theprocessdistributionofZ 1 ,Z 2 ,···),then, lim n→∞ X A∈πn(Z n 1 ) P X,Y (A)·log P X,Y (A) P X ×P Y (A) =I(X,Y), (5.18) P-almostsurely. 148 The proof is presented in Appendix 5.8.1. 
The result says that if the diameter of the random partition π n (Z n 1 ) vanishes in a probabilistic sense as the number of samples tends to infinity, as given by (5.17), we can approximate with arbitrary precision the distributions for the purpose of estimating the mutual information. This approximation condition in (5.17) is called a shrinking cell condition. Different flavors of this notion havebeenintroducedforcontrollingapproximationerrorinhistogram-basedregression, classificationanddensityestimationproblems[5,46,21,49]. Remark5.3 Asimilarshrinkingcellconditionwaspresentedin[66]forthemoregen- eralproblemofdivergenceestimation. AspartoftheproofofProposition5.1thisorigi- nalresulthasbeenimproved(fordetailsaboutthispointwereferthereadertoTheorem 5.5,statedandprovedinAppendix5.8.1). 5.4.2 TheMainConsistencyResult Before stating the result, we need to introduce some definitions. For any partition rule π n (·) ∈ Π, in addition to its associated collection of measurable partitions, A n = π n (z n 1 ) :z n 1 ∈ R d·n , we consider its product bin structure (5.12) to define the followingcollectionofmeasurableevents, C [1−q] (z n 1 ) = ξ [1−q] (A) :A∈π n (z n 1 ) (5.19) C [q+1−d] (z n 1 ) = ξ [q+1−d] (A) :A∈π n (z n 1 ) (5.20) 149 withξ [1−q] (A)denotingthesetoperatorthatreturnsthecollectionofprojectedelements ofA in the range of coordinate dimensions [1−q] 3 . Then, the following collection of measurableeventswillbeassociatedtothepartitionruleπ n (·): C [1−q],n = [ z n 1 ∈R d·n C [1−q] (z n 1 ) (5.21) C [q+1−d],n = [ z n 1 ∈R d·n C [q+1−d] (z n 1 ) (5.22) Considering sequences of non-negative real numbers, we say that (a n ) n∈N dominates (b n ) n∈N ,denotedby(b n ) (a n ),ifthereexistsC > 0andk∈ Nsuchthatb n ≤C·a n for alln≥ k. We say that (b n ) n∈N and (a n ) n∈N are asymptotically equivalent, denoted by(b n )≈ (a n ),ifthereexitsC > 0suchthatlim n→∞ an bn =C. Finallywehavealltheelementstostatethefollowingresult. THEOREM5.2 Let X and Y be random variables in R q and R p , respectively, with jointdistributionP X,Y absolutelycontinuouswithrespecttotheLebesguemeasureλin (R d ,B(R d )). In addition, let us consider a partition scheme Π = {π 1 (·),···} with the product bin structure and driven by i.i.d. realizationsZ 1 ,Z 2 ,··· drawn fromP X,Y . If thereexistsτ ∈ (0,1)forwhichthefollowingsetofconditionsaresatisfied: c.1: lim n→∞ 1 n τ logS n (C [1−p],n ) = 0, lim n→∞ 1 n τ logS n (C [p+1−d],n ) = 0, c.2: lim n→∞ 1 n τ logΔ ∗ n (A n ) = 0, 3 ByconstructionanysetA∈π n (z n 1 )canbeexpressedbyA =A 1 ×A 2 ,withA 1 ∈ R q andA 2 ∈ R p , andconsequentlyξ [1−q] (A) =A 1 andξ [q+1−d] (A) =A 2 . 150 c.3: lim n→∞ 1 n τ M(A n ) = 0, c.4: ∃(k n ) n∈N asequenceonnon-negativenumbers,with (k n )≈ (n 0.5+τ/2 ),suchthat, ∀n> 0and(z 1 ,..,z n )∈ R d·n , inf A∈π(z n 1 ) P n (A)≥ k n n , c.5: ∀> 0, lim n→∞ P X,Y z∈ R d :diam(π n (z|Z n 1 ))> → 0, P-almostsurely, then, lim n→∞ ˆ I n (X;Y) =I(X,Y), (5.23) a.s. withrespecttotheprocessdistributionofZ 1 ,Z 2 ,···. Thefirstfourconditionsarestipulatedtoasymptoticallycontroltheestimationerror quantity in (5.15). The argument used to make this error vanish is fundamentally based on the Vapnik-Chervonekis (VP) inequality and its generalization for the case of parti- tions, in Section 5.2.2. However observing the domain of values stipulated forτ, these conditions are stronger than the one used for the problem of density estimation [46]. 
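To see concretely where the rate stipulated in c.4 comes from, note (this is only a restatement of the exponent that appears in the bound (5.48) of Appendix 5.8.2) that with $(k_n) \approx (n^{0.5+\tau/2})$ the Vapnik-Chervonenkis-type deviation bound used there behaves like
\[
S_n(\mathcal{C}_{[1-p],n}) \cdot \exp\!\left(-\frac{k_n^2\,\epsilon^2}{8\,n}\right) \;\approx\; S_n(\mathcal{C}_{[1-p],n}) \cdot \exp\!\left(-\frac{n^{\tau}\,\epsilon^2}{8}\right),
\]
which is summable in n as soon as $n^{-\tau}\log S_n(\mathcal{C}_{[1-p],n}) \to 0$, i.e., under condition c.1, so that the Borel-Cantelli argument can be invoked.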
These stronger conditions are necessary to handle the complicated unbounded behav- ior of the log(·) function in the neighborhood of zero — the function is not absolutely continuous in (0,∞), which is the most critical part to guarantee a strongly consistent result for our MI estimation problem. Concerning the approximation error, condition 5 just recalls the sufficient condition presented in Proposition 5.1. The technical argu- mentsarepresentedinAppendix5.8.2. 151 Remark5.4 ItiswellknownthatifacollectionofmeasurableeventsC hasafiniteVC dimension,letsayV > 0,then∀n>V [22,21], S n (C)≤ (n+1) V , (5.24) andconsequently∀τ ∈ (0,1),lim →∞ 1 n τ logS n (C) = 0. Then for the finite VC dimensional case the growth function is dominated by a polyno- mial behavior. Considering the conditionc.1 of Theorem 5.2, it is interesting to extend this idea to a sequence of measurable events. The following proposition guaranteesc.1 ofTheorem5.2whenacollectionsofmeasurableeventshavefiniteVC-dimensions. Proposition5.2 Let{C n :n∈ N} be a collection of measurable events with finite VC dimensionsequence(V n ) n∈N . If(V n ) n∈N iso n τo log(n+1) forsomeτ o ∈ (0,1),then 4 lim n→∞ 1 n τo logS n (C n ) = 0. (5.25) Proof: From the fact that (V n ) n∈N iso n τo log(n+1) , there existsN, such that∀n > N, n > V n . Then from the definition of VC dimension [21], ∀n > N, S n (C n ) ≤ (n+1) Vn . Thenlimsup n 1 n τo logS n (C n )≤ lim n Vn log(n+1) n τo = 0bythehypothesis. Thenavalidquestiontoaskishowthisgeneralresulttranslatesintospecificdesign conditions when working with some specific data-dependent constructions. In other words, is it possible to find strongly consistent mutual information estimates based on thisresult? Thiswillbethefocusfortherestofthepaper,whereweshowhowthesegen- eralconditionsprovidespecificdesignsettingintheimplementationoftwowidelyused andemblematicpartitionschemes: tree-structuredpartitionsandstatisticallyequivalent blocks. 4 A nonnegative sequence (a n ) n∈N iso(f(n)), for some non-negative increasing function f(·) : N→ Rif,lim n→∞ an f(n) = 0. 152 5.5 Applications 5.5.1 StatisticalEquivalentData-DependentPartitions Here we consider a data-dependent partition scheme based on the notion of statisti- cally equivalent blocks [21], and more precisely the axis-parallel partition scheme pro- posed by Gessaman [29]. The idea is to use the data Z 1 ,..,Z n to partition the space in such a way to create cells with equal empirical mass. In Gessaman’s approach, this is done by sequentially splitting every coordinate of R d using axis-parallel hyper- planes. More precisely, let l n > 0 denote the number of samples points that we ideally want to have in every bin of π n (Z n 1 ), and let us choose a particular sequen- tial order for the axis-coordinates, without loss of generality the standard (1,..,d). With that, T n = b(n/l n ) 1/d c is the number of partitions to create in every coordi- nate. Then the inductive construction goes as follows: first, project the i.i.d. sam- ples Z 1 ,..,Z n into the first coordinate, which for simplicity we denote by Y 1 ,..,Y n . Compute the order statisticsY (1) ,Y (2) ,..,Y (n) or the permutation ofY 1 ,..,Y n such that Y (1) < Y (2) < ··· < Y (n) — this permutation exists with probability one asP X,Y is absolutely continuous with respect to the Lebesgue measure. Based on this, define the followingintervalstopartitionthefirstreallinecoordinate, {I i :i = 1,..,T n } = (−∞,Y (sn) ],(Y (sn) ,Y (2·sn) ],..,(Y ((Tn−1)·sm) ,∞) , (5.26) where s n = bn/T n c. 
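For concreteness, this first order-statistics step, namely the cut points defining the intervals in (5.26), amounts to the short computation below; the helper name and the array-based representation are illustrative only, and the remaining coordinates are handled by repeating the same computation inside each resulting bin, as described next.

```python
import numpy as np

def first_coordinate_cuts(Z, l_n):
    """Interior cut points of the statistically equivalent intervals in (5.26).

    Z   : (n, d) array of samples; only the first coordinate is used here.
    l_n : target number of samples per final bin.
    Returns Y_(s_n), Y_(2 s_n), ..., Y_((T_n - 1) s_n); the first interval is
    (-inf, Y_(s_n)] and the last one is open to +inf.
    """
    n, d = Z.shape
    T_n = max(int(np.floor((n / l_n) ** (1.0 / d))), 1)   # splits per coordinate
    s_n = n // T_n
    y_sorted = np.sort(Z[:, 0])      # order statistics Y_(1) <= ... <= Y_(n)
    return [y_sorted[i * s_n - 1] for i in range(1, T_n)]
```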
Then assigning the samples of Z 1 ,..,Z n to the different result- ing bins, i.e., I i × R d−1 :i = 1,..,T n , we can conduct the same process in each of thosebinsbyprojectingitsdataintothesecondcoordinate. Iteratingthisapproachuntil the last coordinate (in a sort of balance tree hierarchical order to assign data points to 153 intermediate and last product intervals) we get the Gessaman data-dependent partition π n (Z n 1 ). Note that by construction if n = l d n , then we are in the ideal scenario where every bin has been assigned withl n empirical points ofZ 1 ,..,Z n . Importantly for our consistency result, in general P n (A) ≥ ln n ,∀A ∈ π n (Z n 1 ). Consequently, we can use thisproduct-binconstructionforestimatingtheMIbasedon(5.13). Thefollowingresult appliedTheorem5.2tothisscenario. THEOREM5.3 Under the problem setting presented in Theorem 5.2 the Gessaman’s partition scheme provides a strongly consistent estimate for the mutual information if (l n )≈ (n 0.5+τ/2 )forsomeτ ∈ (1/3,1). The proof of this result reduces to checking the sufficient condition stated in Theorem 5.2. DetailsarepresentedinAppendix5.8.3. 5.5.2 Tree-StructuredPartitionSchemes In this section we consider a version of what is known as balanced search tree [21](Chapter 20.3), in particular the binary case. More precisely, given Z 1 ,Z 2 ,..,Z n i.i.d. realizationsofthejointdistribution, thisdata-dependentpartitionchosesadimen- sionofthespaceinasequentialorder,saythedimensioniforthefirststep,andthenthe iaxis-parallelhalfspaceby H i (Z n 1 ) = x∈ R d :x(i)≤Z (dn/2e) (i) , (5.27) where Z (1) (i) < Z (2) (i) <,..,< Z (n) (i) denotes the order statistics of the sampling points {Z 1 ,..,Z n } projected in the target dimension i. Using this hyper-plane, R d is divided in two statistically equivalent rectangles with respect to the coordinate dimen- sioni, denoted byU (1,0) andU (1,1) . Reallocating the sampling points in the respective 154 intermediate cells, U (1,0) andU (1,1) , we can choose a new dimension in the mentioned sequential order and continue in an inductive fashion with this splitting process. The termination criterion is based on a stopping rule that guarantees a minimum number of sample points per cell, denoted by k n > 0. This stopping rule is fundamental to obtainingourconsistencyresult. Importantly the way the data is split in terms of measurable rectangles has a binary-tree indexed structure [21]. In particular in the iteration k of the algorithm (assuming that the stopping rule is not violated) the intermediate rectanglesU (k−1,l) for l ∈ 0,..,2 k−1 −1 are partitioned in terms of their respective statistically equivalent k-axis parallel hyper-planes to create U (k,2l) ,U (k,2l+1) :l = 0,..,2 k−1 −1 . After the first iteration, the resulting cells have at most n/2 + 1 and at least n/2− 1 sampling points. The second iteration implies the creation of 4 cells with at most n/4 + 2 and at least n/4− 2 sampled points, and consequently inductively the k-th iteration — if the stopping criterion is not violated — creates a balanced tree of 2 k cells with at least n/2 k −k andatmostn/2 k +k samplingpoints. Notethatattheendoftheprocessitis not guaranteed thatπ n (Z n 1 ) has neither perfect statistically equivalent cells (rectangles withequalempiricalmass)norabalancetreestructure. THEOREM5.4 Let us consider the tree-structure partition with binary axis-parallel statistically equivalent splits and a stopping rule governed by a sequence of non- negative numbers (k n ) n∈N . 
Under the general problem setting of Theorem 5.2, if (k n ) ≈ (n 0.5+τ/2 ) for someτ ∈ (1/3,1) the empirical mutual information ˆ I n (X : Y) inducedby(5.13)isstronglyconsistent. TheproofispresentedinAppendix5.8.4. 155 Figure5.1: EmpiricalmutualInformationtrajectoriesofGessamanandTree-Structured VectorQuantization(TSVQ)schemes. Valuesarereportedasafunctionofthesampling lengthofarealizationoftheempiricalprocessandfordifferentdesignparametervalues forτ. Datasimulatedwithcorrelationcoefficientr = 0andplottedinalog-scaleofthe samplinglength. 5.6 ExperimentalSimulations Inthissectionweprovideempiricalevidencetoevaluatetheapproximationgoodnessof thedata-dependentschemespresentedintheprevioussection,bycontrastingthemwith some classical techniques used for estimating the mutual information. Here we also check the consistent nature of proposed histogram-based estimates under the sufficient conditionspresentedinSection5.5. WeconsiderasimulationscenariousingjointlyGaussiandistributionswheremutual information can be obtained in closed-form, and furthermore, where their parametric density description admits a maximum likelihood (ML) estimation approach. The ML estimate can be used as a benchmark for performance comparison as suggested in [18]. 156 We contrast the results of the proposed TSVQ and Gessaman’s data-dependent con- structionswithtwoclassicalframeworks, aproductpartitionhistogrambasedapproach [33] and a kernel based plug-in estimate. It is important to point out that these two techniques are strongly consistent for the differential entropy estimation [1] and conse- quentlyforthemutualinformation,undertheabsolutelycontinuousassumptionstudied inthiswork. For the evaluation we consider X and Y to be scalar Gaussian random variables with correlation coefficient denoted by r. We simulate i.i.d. realizations of the joint Gaussian distribution with different correlation coefficients, {0,0.3,0.5,0.8}. Using thetechniquesformutualinformationestimationdescribedinSection5.5,weevaluated theirperformanceasafunctionofthelengthoftheempiricalprocess,Z 1 ,..,Z n withnin therangeof{1,..,10 4 }. Weconductedtwotypesofanalyses,onereportingtrajectories of the estimates for a realization of the empirical process of length n, and the other, computing bias and standard deviation for 1000 realizations of the empirical process acrossthesamplinglengthrangementionedabove. 5.6.1 Performanceresultsanddependenciesondesignvariables The Gessaman’s statistically equivalent partition and the TSVQ schemes are consistent under the general conditions presented in Theorem 5.3 and 5.4. Both these techniques haveasadegreeoffreedomtheasymptoticsub-linearrateof l n andk n respectively,that is characterized with the design variableτ. Note that both results impose the condition τ ∈ (1/3,1) for consistency. Figures 5.1 and 5.2 present the evolution of the empirical estimateforthecaser = 0(randomvariablesareindependent)andr = 0.5fordifferent valuesofthedesignvariableτ intheinterval(0,1). 157 From these results as expected, consistency is observed in the range of values τ ∈ (1/3,1), however the asymptotic trends of the estimates present different behav- iors as a function of their design variables. From Figures 5.1 and 5.2, large values of τ in the target range (1/3,1) (or equivalently a more conservative approach for split- tingthespaceinstatisticallyequivalentbins)providebettercontrolofestimationerrors effects and consequently better small sample performances for scenarios whereX and Y have low correlation. 
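The qualitative behavior reported in these figures can be reproduced with a short simulation. The sketch below implements one of the two schemes, the binary tree-structured partition of Section 5.5.2 (axis-parallel splits at empirical medians, cycling over the coordinates, with the stopping rule k_n), and compares the resulting estimate against the closed-form value I(X;Y) = -(1/2) log(1 - r^2) for jointly Gaussian scalars with correlation coefficient r. It reuses the hypothetical helper empirical_mi sketched after (5.13); all names, the seed, and the specific recursion are illustrative, not the exact implementation used to generate Figures 5.1 and 5.2.

```python
import numpy as np

def tsvq_partition(Z, k_n, lower=None, upper=None, axis=0):
    """Binary tree-structured partition in the spirit of Section 5.5.2.

    Cells are split at the empirical median of the current axis, cycling
    through the coordinates; a cell becomes a leaf as soon as a further
    split would leave fewer than k_n points on one of its sides.  Each
    leaf is returned as a product bin (lower, upper).
    """
    n, d = Z.shape
    if lower is None:
        lower, upper = np.full(d, -np.inf), np.full(d, np.inf)
    if n < 2 * k_n:
        return [(lower.copy(), upper.copy())]
    cut = np.median(Z[:, axis])
    left, right = Z[Z[:, axis] <= cut], Z[Z[:, axis] > cut]
    if len(left) < k_n or len(right) < k_n:       # guards against heavy ties
        return [(lower.copy(), upper.copy())]
    upper_left, lower_right = upper.copy(), lower.copy()
    upper_left[axis], lower_right[axis] = cut, cut
    nxt = (axis + 1) % d
    return (tsvq_partition(left, k_n, lower, upper_left, nxt)
            + tsvq_partition(right, k_n, lower_right, upper, nxt))

# Toy run in the spirit of Figure 5.2 (r = 0.5, tau = 0.5).
rng = np.random.default_rng(0)
r, tau = 0.5, 0.5
true_mi = -0.5 * np.log(1.0 - r ** 2)
for n in (100, 1000, 10000):
    Z = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n)
    k_n = int(np.ceil(n ** (0.5 + tau / 2.0)))
    bins = tsvq_partition(Z, k_n)
    # empirical_mi is the plug-in sketch given after Eq. (5.13) above
    print(n, empirical_mi(Z, bins, p=1), true_mi)
```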
This is particularly clear for the scenario r = 0, Figure 5.1, where the deviation of empirical estimates from the target zero value is fully attributed to estimation error effects. However, for medium to high correlation (and consequently higher mutual information), the behavior of these family of estimates as a function of τ changes. As observed in these experiments and naturally expected, less conserva- tive settings tend to provide better small sample performances, because of the fact that approximation error effect arises and tends to dominate the performance deviation of theestimates. Interestingly, both data-dependent schemes present close performance trends con- strained to similar range of design values and correlation scenario, with TSVQ clearly better in the small sample regime [1−10 4 ]. These close trends can be attributed to the fact that both are based on the notion of statistically equivalent partition of the space. However as expected from results obtained in other applications — classification and regression [61, 51, 62], the tree inductive nature of TSVQs provides more degrees of freedom to adapt to the empirical trend of the data, justifying its better performance. Resultspresentedinthenextsectionconfirmthisobservation. Based on the trends of empirical estimates as a function ofτ, it is hard to say how tighttheproposedsufficientconditionsinTheorems5.3and5.4are,because,forvalues ofτ smallerthan1/3,theirestimatesseemtoprovideconsistenttrends. Conclusionsin 158 Figure5.2: EmpiricalmutualinformationtrajectoriesofGessamanandTree-Structured VectorQuantization(TSVQ)schemes. Valuesarereportedasafunctionofthesampling lengthofarealizationoftheempiricalprocessandfordifferentdesignparametervalues forτ. Data simulated with correlation coefficientr = 0.5 and plotted in a log-scale of thesamplinglength. this regard, however, can not be drawn fromthis empirical analysis and further theoret- icalresultsareneededtocorroborateordisclaimthispoint. Forthenextsubsection,we have chosen for both techniques a fixed value forτ in (0.4,0.6), that has demonstrated relativelygoodperformanceacrossallcorrelationscenarios. 5.6.2 BiasandStandarddeviationAnalysis Inthissectionaclassicalproducthistogram-basedestimateandakernelplug-inestimate are evaluated and contrasted with our adaptive histogram-based constructions. Both techniques offer universally consistent estimates for the mutual information [1, 33]. 159 Table 5.1: Bias for the non-parametric mutual information estimates (histogram-based using Gessaman partition scheme (GESS), tree-structured partition (TSVQ), classical product partition (PROD) and a kernel plug-in estimate (KERN)) obtained from 1000 independent realizations of the empirical process. Performance values are reported with respect to different sampling lengths {11,33,58,101,179,564,3164,5626} of the empirical process and for the different correlation coefficient scenarios, r = 0,0.3,0.5,0.8,reportedinthisrespectiveorder. 
11 33 58 101 179 564 3164 5626 TSVQ: 1.508e-03 1.280e-02 5.101e-03 8.570e-03 3.006e-03 1.722e-03 2.678e-04 8.479e-05 GESS: 1.657e-02 2.519e-02 1.206e-02 1.995e-02 7.669e-03 2.842e-03 2.436e-04 1.594e-04 KERN: 6.362e-02 3.532e-02 2.243e-02 1.534e-02 9.674e-03 3.176e-03 3.090e-04 9.035e-05 PROD: 3.273e-01 4.515e-01 1.055e-01 6.769e-02 3.946e-02 1.086e-02 5.310e-03 5.652e-03 TSVQ: 3.568e-03 8.574e-03 2.495e-03 6.139e-03 1.826e-03 1.197e-03 1.391e-04 2.359e-05 GESS: 9.661e-03 1.914e-02 8.498e-03 1.645e-02 5.515e-03 2.037e-03 1.259e-04 8.186e-05 KERN: 8.329e-02 4.499e-02 3.160e-02 2.170e-02 1.411e-02 5.268e-03 9.148e-04 4.355e-04 PROD: 3.599e-01 4.769e-01 1.185e-01 7.350e-02 4.362e-02 1.211e-02 6.248e-03 6.680e-03 TSVQ: 6.867e-03 1.951e-03 7.681e-05 2.045e-03 1.509e-04 2.848e-04 1.584e-07 5.105e-05 GESS: 7.733e-04 8.473e-03 2.645e-03 9.607e-03 1.877e-03 5.659e-04 9.096e-07 1.270e-06 KERN: 1.260e-01 7.114e-02 5.079e-02 3.693e-02 2.508e-02 1.088e-02 3.002e-03 1.854e-03 PROD: 4.187e-01 5.388e-01 1.410e-01 8.960e-02 5.488e-02 1.622e-02 8.737e-03 9.381e-03 TSVQ: 1.709e-01 3.246e-02 4.095e-02 1.619e-02 2.189e-02 7.399e-03 5.844e-03 6.679e-03 GESS: 6.469e-02 1.778e-02 2.128e-02 2.299e-03 1.421e-02 9.955e-03 5.941e-03 4.351e-03 KERN: 3.436e-01 1.996e-01 1.515e-01 1.141e-01 8.556e-02 4.575e-02 1.849e-02 1.354e-02 PROD: 7.461e-01 8.019e-01 2.617e-01 1.768e-01 1.290e-01 4.556e-02 3.021e-02 3.275e-02 Before performing comparisons, we conducted an exhaustive evaluation of both tech- niques with respect to their respective design variables (the window of the kernel, and length of the rectangle intervals generated by the product histogram), restricted to the range of those variables that make both techniques strongly consistent, see [1] and ref- erences therein for an excellent treatment of these consistency results. Then, for each techniquewehavechosenadesignvaluethatdemonstratesgoodempiricalperformance amongallthecorrelationscenariosexploredinourexperiments. Tables5.1and5.2providetheglobalpictureoftheperformancecomparisonamong the different techniques (Gessaman, TSVQ, product histogram, kernel) by evaluating performances across the sampling lengths, n ∈ {11,33,58,101,179,564,3164,5626} (uniformlyspacedinlogdomain)forallthecorrelationscenarios,r∈{0,0.3,0.5,0.8}. 160 Figure 5.3: Empirical mutual information trajectories for different estimation tech- niques. ML:maximumlikelihoodplug-inestimateunderGaussianparametricassump- tion, TSVQ: tree-structured data-depenednt partition, GESS: Gessaman’s statistically equivalent blocks, kernel: kernel plug-in estimate and PR-class : product classical histogram-based estimate. Performances are reported for one realization of the empir- ical process across sampling lengths in the log-scale. Data simulated with correlation coefficientr = 0.3. These results show that under different levels of statistical dependency betweenX and Y our data-driven non-product histogram-based constructions (Gessaman and TSVQ) present performance improvements compared to the two classical techniques, in terms ofboththebiasandstandarddeviations. Theseperformancevalueswerecomputedfrom 1000independentrealizationsoftheempiricalprocess. Inparticular,thebiasdifferences are substantial in the small sample regime [1− 10 4 ], with respect to classical product histogramapproaches,supportingtheconjecturethatmotivatedthiswork,claimingthat data-dependent partitions can improve the performance of classical product histogram based constructions in the small sample regime. 
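For reference, the classical product-partition baseline entering this comparison can be sketched in a few lines: a fixed m × m product histogram followed by the same plug-in formula. The number of bins m per axis is the design variable mentioned above; this sketch is only meant to make the contrast with the non-product, data-dependent constructions explicit, and it is not the exact implementation evaluated in the tables.

```python
import numpy as np

def product_histogram_mi(x, y, m):
    """Classical product-partition (fixed grid) MI estimate for scalar X, Y.

    The grid is the product of m equally spaced bins per axis over the data
    range, so the partition is a product partition Q_1 x Q_2, unlike the
    data-dependent schemes studied in this chapter.
    """
    counts, _, _ = np.histogram2d(x, y, bins=m)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal of X, shape (m, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal of Y, shape (1, m)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))
```

For the scalar Gaussian experiments above, product_histogram_mi(Z[:, 0], Z[:, 1], m) with a moderate m plays the role of the product-histogram (PROD) entries reported in the tables.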
These techniques also perform better than the kernel plug-in estimate, particularly clear in the sampling range [1−10 2 ] and acrossallcorrelationscenarios. Asimilarperformancetrendwasalsoobservedin[78]. Figures5.3and5.4showthecharacteristicbehavioroftheestimatesforarealizationof the empirical process, and two correlation scenariosr = 0.3 andr = 0.8 respectively. 161 Figure5.4: Empiricalmutualinformationtrajectoriesfordifferentestimationtechniques using the same setting presented in Figure 5.3. Data simulated with correlation coeffi- cientr = 0.8. Here it is possible to see with better sampling resolution the performance behavior of thedifferenttechniquesandagaintherelativeadvantageofhistogram-baseddata-driven estimates is evident. Note that we also provide the performances of the ML estimate which is based on assuming the right Gaussian distribution and consequently the esti- mation of the empirical correlation coefficient — sufficient statistics for the problem. Theseresultsshowthatourdistribution-freeestimatespresentcompetitivesmallsample behavior in comparison with the ideal ML estimate, which is at an advantage with the useofpriorknowledgeaboutthejointdistribution. 5.7 FinalDiscussionandFutureWork Thispapershowedhowdata-dependentpartitionscanbeincorporatedasinferencetools for the mutual information estimation problem. We provided a general consistency resultthatwasappliedintwoemblematicconstructions—statisticallyequivalentblocks and tree-structured vector quantization — in terms of specific range of design values 162 Table5.2: Varianceforthenon-parametricmutualinformationestimatesobtainedfrom 1000 independent realizations of the empirical process. These results are associated withthebiasvaluesreportedinTable5.2andtheyfollowthesameorganization. 11 33 58 101 179 564 3164 5626 TSVQ: 1.473e-03 1.732e-03 7.941e-04 5.878e-04 2.153e-04 6.009e-05 4.245e-06 1.534e-06 GESS: 4.037e-03 2.509e-03 1.234e-03 9.908e-04 3.824e-04 6.009e-05 4.251e-06 1.992e-06 KERN: 2.428e-02 1.165e-02 6.461e-03 4.294e-03 2.258e-03 6.009e-05 9.379e-05 4.941e-05 PROD: 5.992e-02 2.431e-02 9.190e-03 4.825e-03 2.037e-03 6.009e-05 8.629e-05 6.193e-05 TSVQ: 1.406e-03 2.563e-03 1.418e-03 1.022e-03 5.302e-04 1.926e-04 3.049e-05 1.437e-05 GESS: 5.565e-03 3.369e-03 1.729e-03 1.384e-03 6.555e-04 1.926e-04 3.006e-05 1.632e-05 KERN: 2.565e-02 1.198e-02 6.465e-03 4.127e-03 2.461e-03 1.926e-04 1.145e-04 6.314e-05 PROD: 6.055e-02 2.276e-02 9.664e-03 5.346e-03 2.454e-03 1.926e-04 1.212e-04 7.653e-05 TSVQ: 1.563e-03 3.355e-03 2.268e-03 1.697e-03 9.138e-04 3.759e-04 6.863e-05 3.585e-05 GESS: 6.514e-03 4.386e-03 2.632e-03 2.079e-03 9.792e-04 3.759e-04 7.069e-05 3.894e-05 KERN: 2.820e-02 1.322e-02 7.290e-03 4.674e-03 2.497e-03 3.759e-04 1.494e-04 7.831e-05 PROD: 6.687e-02 2.631e-02 1.327e-02 6.010e-03 2.894e-03 3.759e-04 1.769e-04 1.146e-04 TSVQ: 1.007e-03 3.621e-03 2.464e-03 2.227e-03 1.333e-03 6.586e-04 1.386e-04 7.679e-05 GESS: 6.426e-03 4.900e-03 3.601e-03 3.257e-03 1.521e-03 6.586e-04 1.418e-04 8.362e-05 KERN: 3.132e-02 1.534e-02 1.042e-02 6.225e-03 3.412e-03 6.586e-04 2.345e-04 1.244e-04 PROD: 7.025e-02 2.808e-02 1.602e-02 8.991e-03 4.766e-03 6.586e-04 4.581e-04 3.277e-04 whereuniversalstrongconsistencyisguaranteed. Inaddition,weprovidedanempirical comparisonoftheseconstructionswithrespecttoclassicalestimationtechniques. 
From the results presented here, it is evident that additional improvement can be obtained by choosing the design variable of this family of consistency estimates as a function of the data, in such a way to find a good compromise between estimation and approximation error effects [22, 21]. The idea would be to shrink the gap with respect totheidealoracleresultwhereinthedomainofconsistentdesignvalues,wechoosethe onewiththebestsmallsampleperformanceforagivenjointdistribution. Improvement can be obtained from the inductive nature of data-driven tree-structured partitions, as explored in [18], and in theory motivated from results in the context of regression and classification trees [5, 51, 62]. This is an interesting direction for further research, not only to explore pruning algorithms that could improve the small sample properties of 163 data-driven histogram based constructions, but also to characterize rate of convergence results. 164 5.8 TechnicalDerivations 5.8.1 ProofofProposition5.1 Westartbyintroducingsufficientconditionstocontrolapproximationerrorforthemore generalproblemofdivergenceestimation[66]. LEMMA5.2 (Silva et al. [66]) Let P and Q be two probability measures in (R d ,B(R d )), absolutely continuous with respect to the Lebesgue measure, such that the divergenceD(P||Q) is finite (thenP Q [40]). Let us consider a partition scheme Π ={π 1 (·),π 2 (·),···}drivenbyarandomsequenceZ 1 ,Z 2 ,···,withvaluesin R d and processdistribution P. Ifthefollowingshrinkingcellcondition[46,21]issatisfiedwith respecttotothereferencemeasure,moreprecisely,∀> 0, lim n→∞ Q z∈ R d :diam(π n (z|Z n 1 ))> → 0, (5.28) a.s. withrespectto Pthen, lim n→∞ X A∈πn(Z n 1 ) P(A)·log P(A) Q(A) =D(P||Q), (5.29) P-almostsurely. TheargumentofthisresultispresentedintheproofofTheorem1in[66]. Asthemutual informationisaparticularinstanceofthedivergence,perEq.(5.1),thislemmaprovides sufficient condition to make our data-dependent partition asymptotically sufficient for the mutual information (i.e. Π satisfies (5.18) for any joint distributionP X,Y equipped with a density function). However those are stipulated as a function of the reference measure, that for our MI context isP X ×P Y . Given that the proposed data-depended 165 construction is driven by i.i.d samples of the joint distributionP X,Y instead, the condi- tionofLemma5.2cannotbeguaranteed 5 . The next result improves Lemma 5.2 in the context of divergence estimation and as acorollaryprovesProposition5.1. THEOREM5.5 Under the problem setting of Lemma 5.2, Π = {π 1 (·),π 2 (·),···} is asymptoticsufficientforD(P||Q)(i.e.,Eq.(5.29)issatisfied),if lim n→∞ P z∈ R d :diam(π n (z|Z n 1 ))> → 0, (5.30) a.s. withrespecttotheprocessdistributionofZ 1 ,Z 2 ,···. ProofofTheorem5.5: First,thedivergencecanbecharacterizedby[32], D(P||Q) = sup π∈Q D π (P||Q) (5.31) with Q denoting the set of finite measurable partitions of R d and D π (P||Q) ≡ P A∈π log P(A) Q(A) ·P(A)thedivergenceofP withrespecttoQrestrictedtoσ(π)⊂B(R d ). Then,foranysequencez n 1 ∈ R d·n ,D πn(z n 1 ) (P||Q)≤D(P||Q)andthen, limsup n→∞ D πn(Z n 1 ) (P||Q)≤D(P||Q), (5.32) P-almost surely ( P denoting the process distribution of Z 1 ,Z 2 ,···). Consequently to finishtheproof,itissufficienttoshowthat,∀> 0, D(P||Q)< liminf n→∞ D πn(Z n 1 ) (P||Q)+, (5.33) 5 In fact, given that P X,Y P X × P Y , (5.28) implies the shrinking cell condition with respect to P X,Y in(5.17),see[66]fortheproof. 166 P-almostsurely. Let us introduce some short hand notations. 
We define the set B n (δ) ≡ S A∈πn(z n 1 ) diam(A)>δ A, as the support of the partition π n (z n 1 ) with measurable bins of diame- ter smaller than δ. In addition for any event B ∈ B(R d ), we denote by π n [B|z n 1 ] ≡ S A∈πn(z n 1 ) s.t.A∩B6=∅ A, the smallest measurable support ofπ n (z n 1 ) that fully containsB. Finally foranypartitionπ,wedefineaassociatedmeasurablefunctionfrom R d to Rby, f π (P||Q)(x)≡ X A∈π log P(A) Q(A) · I A (x), (5.34) with I A (·) denoting the indicator function. Note that f π (P||Q)(·) is P-integrable (∈ L 1 (P)),infactfromitsdefinition, R f π (P||Q)∂P(x) =D π (P||Q)≤D(P||Q)<∞. Notethatbythehypothesisofthetheorem,∀δ> 0, lim n→∞ P [ A∈πn(Z n 1 ) diam(A)>δ A = lim n→∞ P(B n (δ)) = 0 (5.35) foralmostevery(withrespecttotheprocessdistribution P)sequencez 1 ,z 2 ,···∈ R d·N . For the rest of the proof let us concentrate on one of those typical sequences and show that for any of them, (5.33) is satisfied. Without loss of generality let z 1 ,z 2 ,··· be a typical sequence, i.e., a realization of the process where (5.35) is satisfied, and an arbitrary> 0,thenfrom(5.31)thereexistsapartition ¯ π ={A 1 ,..,A L }suchthat, D(P||Q)<D ¯ π (P||Q)+/2. (5.36) 167 By the continuity of x · logx function, it is simple to show that there exists δ(/2) > 0 such that for any partition ˆ π = n ˆ A 1 ,.., ˆ A L o that satisfies both sup i=1,..,L P(A i )−P( ˆ A i ) <δ(/2)andsup i=1,..,L Q(A i )−Q( ˆ A i ) <δ(/2),then, |D ¯ π (P||Q)−D ˆ π (P||Q)|< 2 . (5.37) We will use this continuity result to approximate the events of ¯ π by set operations (unions) of our data-dependent collection {π 1 (z 1 ),π 2 (z 2 1 ),···}. First we will concen- trateinaboundeddomainin R d . NotethatP λandQλ, withλrepresentingthe Lebesgue measure in (R d ,B(R d )), and consequently there exists a bounded setB o such thatP(B c o ) < 0.5·δ(/2) andQ(B c o ) < 0.5·δ(/2) [34]. We will try to find a good approximation for the bounded set of events ¯ π/B o ≡{A 1 ∩B o ,..,A L ∩B o }. For that weneedtointroducethefollowingoracledatadependentconstruction. Definition5.1 For the bounded set B o and for all δ > 0 and n ∈ N, we denote by π δ n (z n 1 )therefinedversionofπ n (z n 1 )bythefollowingsetofinductivesteps: ∀A∈π n (z n 1 ): if A∩B o =∅,thenA∈π δ n (z n 1 ). if A∩B o 6=∅,anddiam(A)<δ,thenA∈π δ n (z n 1 ). else, Partition A into a finite collection of events, with the condition that every event intersecting B o has diameter strictly smaller than δ, and assign those sets to π δ n (z n 1 ). Note that this oracle construction is conceptually possible as B o is a bounded set (in particulartherefinementsteponthebinsintersectingB o withdiametergreaterorequal to δ). By definition π δ n (z n 1 ) is a refinement of π n (z n 1 ), constructed in such a way that all its bins on π δ n [B o |z n 1 ] have a diameter strictly less than δ, which is not guaranteed 168 for the data dependent partition π n (z n 1 ). However, π δ n (z n 1 ) should converge in some sense toπ n (z n 1 ) because of (5.35),∀δ > 0. Then, we use the following construction to approximate ¯ π/B o fromπ δ n (z n 1 ), C δ 1,n = [ A∈π δ n (z n 1 ) A∩(Bo∩A 1 )6=∅ A, C δ 2,n = [ A∈π δ n (z n 1 ) A∩(Bo∩A 2 )6=∅ A\C δ 1,n ,··· , C δ L,n = [ A∈π δ n (z n 1 ) A∩(Bo∩A L )6=∅ A\ L−1 [ k=1 C δ k,n , and we denote the approximation by ¯ π δ n ≡ C δ 1,n ,..,C δ L,n (without loss of generality weassumethat∀i∈{1,..,L},B o ∩A i 6=∅). Notethateverybinusedtoconstructthis collectioniscontainedinπ δ n [B o |z n 1 ]andconsequentlyhasadiametersmallerthanδ. 
By a continuity argument (continuity of a measure under monotone set sequence [4, 34]), it can be shown that ∀¯ choosing ∃ ¯ δ sufficiently small, such that sup i=1,..,L λ((A i ∩ B o )4C ¯ δ i,n ) < ¯ , uniformly∀n ∈ N. Then in particular, using that P λ and Q λ, we can choose ¯ δ such that sup i=1,..,L P(A i ∩B o )−P(C ¯ δ i,n ) < 0.5·δ(/2) and sup i=1,..,L Q(A i ∩B o )−Q(C ¯ δ i,n ) < 0.5·δ(/2),∀n∈ N. Usingthisresultand(5.37), thenforthis ¯ δ(/2)wehavethat, D ¯ π (P||Q)− L X i=1 log P(C ¯ δ i,n ) Q(C ¯ δ i,n ) ·P(C ¯ δ i,n ) </2, (5.38) 169 ∀n∈ N. Thenwehavethefollowingsequenceofinequalities, D ¯ π (P||Q)−D πn(z n 1 ) (P||Q) (5.39) < 2 + L X i=1 log P(C ¯ δ i,n ) Q(C ¯ δ i,n ) ·P(C ¯ δ i,n )−D πn(z n 1 ) (P||Q) ≤ 2 +D π ¯ δ n (z n 1 ) (P||Q)−D πn(z n 1 ) (P||Q) = 2 + Z Bn( ¯ δ) f π ¯ δ n (z n 1 ) (P||Q)∂P(x)− Z Bn( ¯ δ) f πn(z n 1 ) (P||Q)∂P(x), the first inequality because of (5.38), the second due to the monotonic behavior of D π (P||Q) under refined partitions [32, 44] and the fact that by construction π ¯ δ n π ¯ δ n (z n 1 ), and the last equality because by construction π ¯ δ n (z n 1 ) and π n (z n 1 ) are equiva- lent in the supportB c n ( ¯ δ). Again using the monotonic property of the divergence under sequenceofembeddedpartitions,wehavethat, log P(B n ( ¯ δ)) Q(B n ( ¯ δ)) ·P(B n ( ¯ δ))≤ Z Bn( ¯ δ) f π ¯ δ n (z n 1 ) (P||Q)∂P(x) ≤ Z Bn( ¯ δ) f(P||Q)∂P(x), (5.40) where given that D(P||Q) < ∞ and that lim n P(B n (δ)) = 0, by the dom- inated convergence theorem [34], we have that lim n R Bn(δ) f(P||Q)∂P(x) = 0, then from (5.40) limsup n→∞ R Bn( ¯ δ) f π ¯ δ n (z n 1 ) (P||Q)∂P(x) ≤ 0. On the other hand, P(B n ( ¯ δ)) · log P(Bn( ¯ δ)) Q(Bn( ¯ δ)) ≥ P(B n ( ¯ δ)) · logP(B n ( ¯ δ)) and the fact that lim x→0 x · logx = 0 implies that liminf n→∞ R Bn( ¯ δ) f π ¯ δ n (z n 1 ) (P||Q)∂P(x) ≥ 0. Consequently, lim n→∞ R Bn( ¯ δ) f π ¯ δ n (z n 1 ) (P||Q)∂P(x) = 0, and using the same argument we have that lim n→∞ R Bn( ¯ δ) f πn(z n 1 ) (P||Q)∂P(x) = 0. 170 Finally taking limits in the previous set of inequalities (5.39), D ¯ π (P||Q) < liminf n→∞ D πn(z n 1 ) (P||Q)+ 2 ,andfrom(5.36), D(P||Q)< liminf n→∞ D πn(z n 1 ) (D||P)+, ∀ > 0. Consequently, D(P||Q) = lim n→∞ D πn(z n 1 ) (D||P) for any typical sequence. Then from the definition of the set of typical sequences, we have this last equality P−almostsurely,whichprovesthetheorem. ProofofProposition5.1: ItisjustadirectconsequenceofTheorem5.5,considering themeasuresP=P X,Y andQ =P X ×P Y . 5.8.2 ProofofTheorem5.2 Proof: We consider the divergence notation for the MI in (5.1). ThenI(X;Y) = D(P||Q)withP denotingthejointdistributionP X,Y andQ =P X ×P Y . Wedenoteby P n andQ n theempiricalversionsofP andQpresentedinSection5.3basedonZ 1 ,..,Z n and the product bin structure ofπ n (·). Then our empirical MI estimate in (5.13) can be expressedbyD πn(Z n 1 ) (P n ||Q n ). Toprovetheresultweusethefollowinginequality, D πn(Z n 1 ) (P n ||Q n )−D(P||Q) ≤ D πn(Z n 1 ) (P n ||Q n )−D πn(Z n 1 ) (P||Q) + D πn(Z n 1 ) (P||Q)−D(P||Q) . (5.41) Notethatthelasttermintherighthandsideof(5.41)istheapproximationerror,which from Proposition 5.1 converges to zero P-a.s. as n tends to infinity. Then we just 171 need to focus on the estimation error term. From triangular inequality and its defini- tion D πn(Z n 1 ) (P n ||Q n )−D πn(Z n 1 ) (P||Q) ≤ X A∈πn(Z n 1 ) [P n (A)logP n (A)−P(A)logP(A)] (5.42) + X A∈πn(Z n 1 ) [P n (A)logQ n (A)−P(A)logQ(A)] . 
(5.43) Concerning the term in (5.42), it is upper bounded by P A∈πn(Z n 1 ) [P n (A)−P(A)]logP n (A) + P A∈πn(Z n 1 ) [logP n (A)−logP(A)]P(A) ≤ X A∈πn(Z n 1 ) |P n (A)−P(A)|log n k n + sup A∈πn(Z n 1 ) |logP(A)−logP n (A)|, (5.44) where this last inequality uses the fact thatP n (A) ≥ kn n ∀A ∈ π n (Z n 1 ). Using Lugosi andNobelinequality,Lemma5.1,theprobabilityofthefirsttermin(5.44)greaterthan canbeuniformly(distributionfree)boundedby, P n X A∈πn(Z n 1 ) |P n (A)−P(A)|·log n k n > ≤ P n sup π∈An X A∈π |P n (A)−P(A)|·> logn/k n ! ≤ 4Δ ∗ 2n (A n )2 M(An) ·exp − n 2 (logn/k n ) 2 ·32 , (5.45) 172 where the exponential term exp n − n 2 (logn/kn) 2 ·32 o ≤ exp n − n 2 (logn) 2 ·32 o . Note that this last sequence is uniformly, in , dominated by the sequence (exp{−m ¯ τ }) n∈N , ∀¯ τ ∈ (0,1). Consequentlyfromc.2andc.3,itissimpletoshowthat∀, limsup n→∞ 1 m τ ·log P n X A∈πn(Z n 1 ) |P n (A)−P(A)|> logn/k n ≤ C o , being C o a strictly negative constant. Finally from the fact that P n≥0 exp{C o ·m τ } < ∞ and the Borel-Cantelli Lemma, we have that lim n→∞ P A∈πn(Z n 1 ) |P n (A)−P(A)|log n kn = 0, P-a.s. Concerning the second term in(5.44),sup A∈πn(Z n 1 ) |logP(A)−logP n (A)|,weusethefollowingresult. Proposition5.3 (Silva and Narayanan [66]) If lim n→∞ sup A∈πn(Z n 1 ) P(A) Pn(A) −1 = 0, P-a.s,then, lim n→∞ sup A∈πn(Z n 1 ) |logP(A)−logP n (A)| = 0, P−a.s. Silvaetal. [66]provethatunderc.2,c.3andc.4,thesufficientconditionofProposition 5.3issatisfiedandconsequentlyfrom(5.44)thetermin(5.42)tendstozero P-a.s. Concerning the term in (5.43), we bounded it by the expression in (5.46), where considering the product bin structure of π n (·), we have that∀A ∈ π n (Z n 1 ), Q n (A) = P n (A [1−p] × R q )P n (R p ×A [p+1−d] ), with A [1−p] and A [p+1d] a short-hand notation for ξ [1−p] (A)andξ [p+1−d] (A),respectively. X A∈πn(Z n 1 ) [P n (A)logQ n (A)−P(A)logQ(A)] ≤ X A∈πn(Z n 1 ) P(A)logP(A [1−p] × R q )−P n (A)logP n (A [1−p] × R q ) + X A∈πn(Z n 1 ) P(A)logP(R p ×A [p+1−d] )−P n (A)logP n (R p ×A [p+1−d] ) (5.46) 173 Given the symmetric structure of the bound in (5.46), we focus the attention on just one of these terms since the derivation for the other use the same arguments. Using similar derivations as those used in the set of inequalities (5.44), we have that P A∈πn(Z n 1 ) P(A)logP(A [1−p] × R q )− P n (A)logP n (A [1−p] × R q ) ≤ X A∈πn(Z n 1 ) |P n (A)−P(A)|log n k n + sup A∈πn(Z n 1 ) logP(A [1−p] × R q )−logP n (A [1−p] × R q ) , (5.47) wherewehavealreadyprovedthatthefirsttermtendstozero P-a.sas ntendstoinfinity. Concerning the second term in (5.47), from Proposition 5.3 it is sufficient to prove that lim n→∞ sup A∈πn(Z n 1 ) P(A [1−p] ×R q ) Pn(A [1−p] ×R q ) −1 = 0 P-a.s. Analyzing this expression, we have that,∀> 0, P n sup A∈πn(Z n 1 ) P(A [1−p] × R q ) P n (A [1−p] × R q ) −1 > ! ≤ P n sup A∈C [1−p],n P(A [1−p] × R q ) P n (A [1−p] × R q ) −1 > ! ≤ P n sup A∈C [1−p],n P(A [1−p] × R q )P n (A [1−p] × R q ) > k n · n ! ≤S n (C [1−p],n )·exp − k 2 n · 2 n·8 , (5.48) 174 the first inequality results from the fact that π n (Z n 1 ) ⊂ C [1−p],n , the second from P n (A [1−p] × R d ) ≥ P n (A) ≥ kn n , ∀A ∈ π n (Z n 1 ), and the last one from the distribu- tionfreeversionoftheVapnik-Chervonenkisinequality,Theorem5.1. Finallyfromthe factthat(k n )≈ (n 0.5+τ/2 )andtheconditionc.1,itissimplealgebratoshowthat, limsup n→∞ 1 n τ log P n sup A∈πn(Z n 1 ) P(A [1−p] × R q ) P n (A [1−p] × R q ) −1 > ! < C() a constant function of that is strictly negative. 
Then from this and the Borel- Cantelli lemma, we have that lim n→∞ sup A∈πn(Z n 1 ) P(A [1−p] ×R q ) Pn(A [1−p] ×R q ) −1 = 0 P-a.s, which isthelastpieceofresultneededtoprovethetheorem. 5.8.3 ProofofTheorem5.3 Proof: This proof follows the same argument presented in (Theorem)[66]. For sake of completeness we present this result here. We just need to verify the sufficient conditions of Theorem 5.2. Without loss of generality let us consider an arbitraryτ ∈ (1/3,1). The trivial case to check is c.4), because by construction we can consider k n = l n , ∀n ∈ N, and then the hypothesis of the theorem gives the result. For c.1), from the construction of π n (·), C [1−p],n and C [p+1−d],n are contained in the collection of all rectangles of R p and R q , respectively, which are well known to have finite VC dimensions [22]. Hence from Proposition 5.2 we get the result. Concerningc.3), again by construction we have thatM(A n )≤ n/l n +1, thenn −l M(A n )≤ n 1−τ /l n +n −τ . Giventhat(l n )≈ (n 0.5+τ/2 )andτ ∈ (1/3,1)itfollowsthat, lim n→∞ n −τ M(A n ) = 0. (5.49) 175 Forc.2),Lugosietal. [46]showedthatΔ ∗ n (A n )≤ Tn+n n d ,whereusingthatlog s t ≤ s·h(t/s)[21],withh(x) =−xlog(x)−(1−x)log(1−x)forx∈ [0,1]—thebinary entropyfunction[16]anddefining ¯ T n ≡bn/l n c≥T n ,itfollowsthat, n −τ log(Δ ∗ n (A n ))≤n −τ d·log ¯ Tn+n n ≤dn −τ ·(n+ ¯ T n )·h n n+ ¯ T n ≤ 2dn 1−τ ·h 1 n/ ¯ T n +1 ≤ 2dn 1−τ ·h ¯ T n n ≤ 2dn 1−τ ·h 1 l n (5.50) Consequentlywehavethat,∀n∈ N, n −τ log(Δ ∗ n (A n ))≤− 2dn 1−τ l n log(1/l n ) −2dn 1−τ (1−1/l n )log(1−1/l n ). (5.51) The first term on the right hand side (RHS) of (5.51) behaves like n 0.5−3/2·τ · log(l n ), where as long as the exponent of the first term is negative (equivalent to τ > 1/3) this sequence tends to zero as m tends to infinity — considering that by construction (l n ) (n). ThesecondtermontheRHSof(5.51)behavesasymptoticallylike−n 1−τ · log(1−1/l n )whichisupperboundedbythesequence n 1−τ ln · 1 1−1/ln —usingthatlog(x)≤ x−1,∀x> 0. This upper bound tends to zero because (l n )≈ (n 0.5+τ/2 ) andτ > 1/3. Consequentlyfrom(5.51),lim n→∞ n −τ log(Δ ∗ n (A n )) = 0. Finallyconcerningc.5),Lugosietal. [46](Theorem4)provedthattogetthisshrink- ingcellconditionissufficienttoshowthatlim n→∞ ln n = 0,whichisthecaseconsidering thatτ < 1. 176 5.8.4 ProofofTheorem5.4 Proof: The proof is obtained by checking the sufficient conditions of our gen- eral result in Theorem 5.2. First c.1) is guaranteed by the same reasons stated for the Gessaman’s partition scheme in Appendix 5.8.3, where c.4) is obtained directly by the stopping criterion. Considering c.3), |π n (Z n 1 )| is uniformly upper bounded by n/k n . ThenM(A n )≤n/k n andconsequently, n −τ M(A n )≤ n 1−τ k n ≈n 0.5− 3 2 τ , (5.52) upper bound that tends to zero as n → ∞ if τ > 1/3. For the condition c.2), we use the upper bound proposed by Lugosi et al. [46], specifying that every polytope (or cell) ofπ n (Z n 1 ) is induced by at mostM(A n ) hyperplane splits. Each binary splits can dichotomizen≥ 2pointsin R d inatmostn d ways[15]. Consequently, Δ ∗ n (A n )≤ (n d ) n/kn , (5.53) then, n −τ logΔ ∗ n (A n )≤ n 1−τ k n dlogn, (5.54) upperboundsequencethatagaintendstozeroasn→∞aslongasτ > 1/3. Thefinalconditionc.5)canbederivedfromtheideaspresentedbyDevroye,Gyorfi and Lugosi [21] (Theorem 20.2) where a weak version of our shrinking cell condition was proved for a similar balanced tree-structured partition scheme. 
These arguments require the introduction of technical elements that are out of the scope of this work and becauseitisnotthecentralresult,weomitthemhere. 177 References [1] J. Beirlant, E. J. Dudewicz, L. Gy¨ orfi, E. van der Meulen, Nonparametric entropy estimation: Anoverview,Int.J.ofMath.andStat.Sci.6(1)(1997)17–39. [2] M.Bohanec,I.Bratko,Tradingaccuracyforsimplicityindecisiontrees,Machine Learning15(1994)223–250. [3] O. Bousquet, S. Boucheron, G. Lugosi, Theory of Classification: A Survey of RecentAdvances,ESAIM:ProbabilityandStatistics,URL:http:/www.emath.fr/ps, 2004. [4] L.Breiman,Probability,Addison-Wesley,1968. [5] L.Breiman,J.Friedman,R.Olshen,C.Stone,ClassificationandRegressionTrees, Belmont,CA:Wadsworth,1984. [6] T.Butz,J.-P.Thiran,Fromerrorprobabilitytoinformationtheoretic(multi-modal) signalprocessing,ElsevierSignalProcessing85(2005)875–902. [7] T. Chang, C. J. Kuo, Texture analysis and classification with tree-structured wavelettransform,IEEETransactionsonImageProcessing2(4)(1993)429–441. [8] P. Chou, T. Lookabaugh, R. Gray, Optimal pruning with applications to tree- structure source coding and modeling, IEEE Transactions on Information Theory 35(2)(1989)299–315. [9] G. F. Choueiter, J. R. Glass, An implementation of rational wavelets and filter desing for phonetic classification, IEEE Transactions on Audio, Speech, and Lan- guageProcessing15(3)(2007)939–948. [10] A.Cohen,I.Daubechies,O.Guleryuz,M.Orchard,Ontheimportanceofcombin- ing wavelet-based nonlinear approximation with coding strategies, IEEE Transac- tionsonInformationTheory48(1). 178 [11] R. Coifman, Y. Meyer, S. Quake, V. Wickerhauser, Signal processing and com- pressionwithwaveletpackets,Tech.rep.,NumericalAlgorithmsResearchGroup, NewHaven,CT,YaleUniversity(1990). [12] R.R.Coifman,M.V.Wickerhauser,Entropy-basedalgorithmforbestbasisselec- tion,IEEETransactionsonInformationTheory38(2)(1992)713–718. [13] M. L. Cooper, M. I. Miller, Information measures for object recognition accom- modating signature variability, IEEE Transactions on Information Theory 46 (5) (2000)1896–1907. [14] T.Cormen,C.Leiserson,R.L.Rivest,IntroductiontoAlgorithms,TheMITPress, Cambridge,Massachusetts,1990. [15] T.M.Cover,Geometricalandstatisticalpropertiesofsystemsoflinearinequalities with applications in pattern recognition, IEEE Trans. on Electronic Computers 14 (1965)326–334. [16] T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley Interscience, NewYork,1991. [17] M. S. Crouse, R. D. Nowak, R. G. Baraniuk, Wavelet-based statistical signal pro- cessing using hidden Markov models, IEEE Transactions on Signal Processing 46(46)(1998)886–902. [18] G. A. Darbellay, I. Vajda, Estimation of the information by an adaptive partition of the observation space, IEEE Transactions on Information Theory 45 (4) (1999) 1315–1321. [19] I.Daubechies,TenLecturesonWavelets,Philadelphia: SIAM,1992. [20] F.denHollander,LargeDeviations,AmericanMathematicalSociety,2000. [21] L. Devroye, L. Gyorfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, NewYork: Springer-Verlag,1996. [22] L. Devroye, G. Lugosi, Combinatorial Methods in Density Estimation, Springer - Verlag,NewYork,2001. [23] M. N. Do, M. Vetterli, Wavelet-based texture retrieval using generalized gaussian densities and Kullback-Leibler distance, IEEE Transactions on Image Processing 11(2)(2002)146–158. [24] D. L. Donoho, M. Vetterli, R. A. DeVore, I. Daubechies, Data compression and harmonicanalysis,IEEETransactionsonInformationTheory44(6)(1998)2435– 2476. 179 [25] R. O. Duda, P. E. 
Hart, Pattern Classification and Scene Analysis, New York: Wiley,1983. [26] K. Etemad, R. Chellapa, Separability-based multiscale basis selection and feature extraction for signal and image classification, IEEE Transactions on Image Pro- cessing7(10)(1998)1453–1465. [27] J. W. Fisher III, T. Darrel, W. Freeman, P. Viola, Learning joint statistical models for audio-visual fusion and segregation, in: Advances in Neural Information Pro- cessing System, Denver, USA, Advances in Neural Information Processing Sys- tems,2000. [28] J. W. Fisher III, M. Wainwright, E. Sudderth, A. S. Willsky, Statistical and information-theoreticmethodsforself-organizationandfusionofmultimodal,net- worked,InternationalJournalofHighPerformanceComputingApplications. [29] M. P. Gessaman, A consistent nonparametric multivariate density estimator based onstatisticallyequivalentblocks,Ann.Math.Statist.41(1970)1344–1346. [30] V. K. Goyal, M. Vetterli, N. T. Thao, Quantized overcomplete expansions in R n : Analysis, synthesis and algorithms, IEEE Transactions on Information Theory 44(7)(1998)16–31. [31] R. Gray, L. D. Davisson, Introduction to Statistical Signal Processing, Cambridge UnivPress,2004. [32] R.M.Gray,EntropyandInformationTheory,Springer-Verlag,NewYork,1990. [33] L.Gy¨ orfi,E.vanderMeulen,Density-freeconvergencepropertiesofvariousesti- matorsofentropy,ComputationalStatisticsandDataAnalysis5(1987)425–436. [34] P.R.Halmos,MeasureTheory,VanNostrand,NewYork,1950. [35] A.Jain,P.Moulin,M.I.Miller,K.Ramchandran,Information-theoreticboundson targetrecognitionperformancesbasedondegradedimagedata,IEEETransactions onPatternAnalysisandMachineIntelligence24(9)(2002)1153–1166. [36] A. K. Jain, R. P. W. Duin, J. Mao, Statistical pattern recognition: A review, IEEE TransactionsonPatternAnalysisandMachineIntelligence22(2000)4–36. [37] L. O. Jimenez, D. A. Landgrebe, Hyperspectral data analysis and supervised feature reduction via projection pursuit, IEEE Transactions on Geoscience and RemoteSensing37(6)(1999)2653–2667. [38] B. H. Juang, L. R. Rabiner, A probabilistic distance measure for hidden markov models,AT&TTechnicalJournal64(2)(1985)391–408. 180 [39] J.Kim,J.W.FisherIII,A.Yezzi,M.Cetin,A.S.Willsky,Anonparametricstatis- ticalmethodforimagesegmentationusinginformationtheoryandcurveevolution, IEEETransactionsonImageProcessing14(10)(2005)1486–1502. [40] S.Kullback,InformationtheoryandStatistics,NewYork: Wiley,1958. [41] S.Kumar,J.Ghosh,M.M.Crawford,Best-basesfeatureextractionalgorithmsfor classificationofhyperspectraldata,IEEETransactionsonGeoscienceandRemote Sensing39(7)(2001)1368–1379. [42] A. Laine, J. Fan, Texture classification by wavelet packet signatures, IEEE Trans- actionsonPatternAnalysisandMachineIntelligence15(11)(1993)1186–1191. [43] R. E. Learned, W. Karl, A. S. Willsky, Wavelet packet based transient signal clas- sification,in: Proc.IEEEConf.TimeScaleandTimeFrequencyAnalysis,1992. [44] F.Liese,D.Morales,I.Vajda,Asymptoticallysufficientpartitionandquantization, IEEETransactionsonInformationTheory52(12)(2006)5599–5606. [45] J.Liu,P.Moulin,Information-theoreticanalysisofinterscaleandintrascaledepen- denciesbetweenimagewaveletcoefficients,IEEETransactionsonImageProcess- ing10(11)(2001)1647–1658. [46] G.Lugosi,A.B.Nobel,Consistencyofdata-drivenhistogrammethodsfordensity estimationandclassification,TheAnnalsofStatistics24(2)(1996)687–706. [47] S. Mallat, A theory for multiresolution signal decomposition: the wavelet rep- resentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989)674–693. 
[48] X.Nguyen,M.Wainwright,M.Jordan,Nonparametricestimationofthelikelihood ratio and divergence functionals, in: IEEE International Symposium on Informa- tionTheory,IEEE,2007. [49] A.B.Nobel,Histogramregressionestimationusingdata-dependentpartitions,The AnnalsofStatistics24(3)(1996)1084–1105. [50] A. B. Nobel, Recursive partitioning to reduce distortion, IEEE Transactions on InformationTheory43(4)(1997)1122–1133. [51] A. B. Nobel, Analysis of a complexity-based pruning scheme for classification tree,IEEETransactionsonInformationTheory48(8)(2002)2362–2368. [52] J. Novovicova, P. Pudil, J. Kittler, Divergence based feature selection for multi- modalclassdensities,IEEETransactionsonPatternAnalysisandMachineIntelli- gence18(2)(1996)218–223. 181 [53] M. Padmanabhan, S. Dharanipragada, Maximizing information content in feature extraction,IEEETransactionsonSpeechandAudioProcessing13(4)(2005)512– 519. [54] H.V.Poor,J.B.Thomas,Applicationsofali-silveydistancemeasuresinthedesign of generalized quantizers for binary decision problems, IEEE Trans. on Commu- nicationsCOM-25(9)(1977)893–900. [55] T. F. Quatieri, Discrete-time Speech Signal Processing principles and practice, PrenticeHall,2002. [56] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speechrecognition,ProceedingsoftheIEEE77(2)(1989)257–286. [57] K.Ramchandran,M.Vetterli,C.Herley,Wavelet,subbandcoding,andbestbases, ProceedingsoftheIEEE84(4)(1996)541–560. [58] B. D. Ripley, Patten Recognition and Neural Networks, Cambridge Univ Press, 1996. [59] N.Saito,R.R.Coifman,Localdiscriminantbasis,inProc.SPIE2303,Mathemat- icalImaging: WaveletApplicationsinSignalandImageProcessing(1994)2–14. [60] N.A.Schmid,J.A.O’Sullivan,Thresholdingmethodfordimensionalityreduction in recognition system, IEEE Transactions on Information Theory 47 (7) (2001) 2903–2920. [61] C. Scott, Tree pruning with subadditive penalties, IEEE Transactions on Signal Processing53(12)(2005)4518–4525. [62] C.Scott,R.D.Nowak,Minimax-optimalclassificationwithdyadicdecisiontrees, IEEETransactionsonInformationTheory52(4)(2006)1335–1353. [63] J. Silva, S. Narayanan, Minimum probability of error signal representation,, in: IEEEInternationalWorkshoponMachineLearningforSignalProcessing,2007. [64] J. Silva, S. Narayanan, Optimal wavelet packets decomposition based on a rate- distortion optimality criterion, in: IEEE International Conference on Acoustics, Speech,andSignalProcessing,IEEE,2007. [65] J. Silva, S. Narayanan, Optimal wavelet packets decomposition based on a rate- distortion optimality criterion, in: IEEE International Conference on Acoustics, Speech,andSignalProcessingICASSP,IEEE,2007. [66] J. Silva, S. Narayanan, Universal consistency of data-driven partitions for diver- gence estimation, in: IEEE International Symposium on Information Theory, 2007. 182 [67] Y. Singer, M. Warmuth, Training algorithm for hidden markov models using entropy based distance functions, in: Advances in Neural Information Processing System9,MorganKaufmannPublishers,1996. [68] A. K. Soman, P. P. Vaidyanathan, On orthonormal wavelet and paraunitary filter banks,IEEETransactionsonSignalProcessing41(3)(1993)1170–1183. [69] P. Th´ evenaz, M. Unser, Optimization of mutual information for multiresolution image registration, IEEE Transactions on Image Processing 9 (12) (2000) 2083– 2099. [70] V.Vapnik,StatisticalLearningTheory,JohnWiley,1998. [71] V.Vapnik,TheNatureofStatisticalLearningTheory,Springer-Verlag,NewYork, 1999. 
[72] V.Vapnik,A.J.Chervonenkis,Ontheuniformconvergenceofrelativefrequencies ofeventstotheirprobabilities,TheoryofProbabilityApl.16(1971)264–280. [73] S.Varadhan,ProbabilityTheory,AmericanMathematicalSociety,2001. [74] N. Vasconcelos, Bayesian model for visual information retrieval, Ph.D. thesis, Mass.Inst.ofTechnol.(2000). [75] N.Vasconcelos,Minimumprobabilityoferrorimageretrieval,IEEETransactions onSignalProcessing52(8)(2004)2322–2336. [76] N.Vasconcelos,Ontheefficientevaluationofprobabilisticsimilarityfunctionsfor image retrieval, IEEE Transactions on Information Theory 50 (7) (2004) 1482– 1496. [77] M. Vetterli, J. Kovacevic, Wavelet and Subband Coding, Englewood Cliffs, NY: Prentice-Hall,1995. [78] Q. Wang, S. R. Kulkarni, S. Verd´ u, Divergence estimation of continuous distribu- tions based on data-dependent partitions, IEEE Transactions on Information The- ory51(9)(2005)3064–3074. [79] M. B. Westover, J. A. O’Sullivan, Achievable rates for pattern recognition, IEEE TransactionsonInformationTheory54(1)(2008)299–320. [80] A. S. Willsky, Multiresolution Markov models for signal and image processing, ProceedingsoftheIEEE90(8)(2002)1396–1458. [81] X. Yang, K. Wang, S. A. Shamma, Auditory representation of acoustic signals, IEEETransactionsonInformationTheory38(2)(1992)824–839. 183
Abstract
This work presents contributions on two important aspects of the role of signal representation in statistical learning problems, in the context of deriving new methodologies and representations for speech recognition and the estimation of information theoretic quantities.