High-Dimensional Feature Selection and Its Applications

by

Yinfei Kong

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY (BIOSTATISTICS)

August 2016

Acknowledgements

I would like to gratefully and sincerely thank my advisors for their guidance. They are Professors Daniel O. Stram, Jinchi Lv, Yingying Fan and Wendy Cozen. Lastly, I have to thank my family and friends, with whom I went through this entire process.

Contents

Acknowledgements
1 Introduction
  1.1 High-dimensional data
  1.2 Some variable selection methods
    1.2.1 Penalized empirical risk minimization
    1.2.2 Another $L_1$-type method: Dantzig selector
  1.3 Variable selection methods for genetic data
  1.4 Challenges of prediction models with interactions
2 Constrained Dantzig Selector
  2.1 Introduction
  2.2 Model setting
    2.2.1 Nonasymptotic compressed sensing properties
    2.2.2 Sampling properties
  2.3 Numerical studies for the constrained Dantzig selector
    2.3.1 Implementation
    2.3.2 Simulation studies
    2.3.3 Real data analysis
  2.4 Discussion
  2.5 Appendix
    2.5.1 Proof of Theorem 2.1
    2.5.2 Proof of Theorem 2.2
3 Applications of Variable Selection to Genetic Data
  3.1 Introduction
  3.2 Preprocessing of the JAPC data
  3.3 Prediction models
  3.4 Stably selected SNPs
  3.5 CDS on the full data set
4 Interaction Pursuit with Feature Screening and Selection
  4.1 Introduction
  4.2 Interaction screening
    4.2.1 A new interaction screening procedure
    4.2.2 Sure screening property
  4.3 Interaction selection
    4.3.1 Interaction models in reduced feature space
    4.3.2 Asymptotic properties of interaction and main effect selection
    4.3.3 Verification of Condition 4
  4.4 Numerical studies
    4.4.1 Feature screening performance
    4.4.2 Variable selection performance
    4.4.3 Real data analysis
  4.5 Discussion
5 Conclusion
Bibliography
Supplemental materials

Chapter 1
Introduction

1.1 High-dimensional data

High-dimensional data refers to data with a large number of features relative to the sample size. This kind of data has become increasingly common in areas such as image processing, molecular genetics, spatial statistics and time series. The fast development of technology has made it possible to obtain such data at a much lower cost. For example, with technological advances in genotyping and even deep sequencing, genome-wide association studies have been widely conducted and have successfully revealed important risk variants for a broad spectrum of diseases.
Epigenome-wide association studies (EWAS) have recently become another mainstream approach, aided by many affordable, high-coverage BeadChips such as the Infinium HumanMethylation450. Methods for analyzing high-dimensional data have therefore been receiving much attention in recent years. A visualization of high-dimensional data is given in Figure 1.1, where the block represents a matrix whose rows are samples and whose columns are features. Generally, high-dimensional data can have even more features than samples, which makes it difficult to select informative features efficiently and accurately.

Figure 1.1: High-dimensional data

Challenges of high-dimensional data include the high collinearity and spurious correlation induced by high dimensionality. Assume that there are $n$ samples and $p$ features or variables. Geometrically, the correlation between two features is the cosine of the angle spanned by the two $n$-dimensional vectors. Therefore, the more vectors we throw into the $n$-dimensional space, the more likely it is that some vectors form small angles with each other. In other words, high dimensionality inevitably brings in the problem of spurious correlation. This point can also be seen in the simulation below.

Randomly generate $p$ variables that are independently and identically distributed (i.i.d.) as $N(0,1)$. Calculate the maximum absolute sample correlation $\max_{j \geq 2} |\widehat{\mathrm{corr}}(X_1, X_j)|$. We considered two cases, $p = 10^3$ and $p = 10^4$. The distributions of their maximum absolute sample correlation can be found in Figure 1.2. High collinearity and spurious correlation make high-dimensional variable selection intrinsically difficult; some true or important variables can have a weaker relationship with the response than some noise variables, and some noise variables can have a strong relationship with the response. See Fan and Lv (2010) for detailed discussions.

Figure 1.2: Distributions of maximum absolute sample correlation

Variable selection methods are specifically designed for this type of data. Unlike the usual association studies, where each feature is tested separately against the phenotype, variable selection aims to utilize informative features or variables jointly to predict the phenotype. Variable selection methods have many advantages over univariate association tests. The major diseases posing the largest threats to public health (such as cancer, diabetes, and heart disease) are caused by multiple genetic and environmental factors, each contributing slightly, rather than by a couple of factors with overwhelming effects. In view of such biological complexity, it is very likely that some of the risk variants may be marginally uncorrelated but jointly correlated with the disease. In this scenario, univariate association analysis is unable to provide a complete map of the disease mechanism, thereby adding obstacles to understanding the disease. In contrast, the analysis of high-dimensional data based on variable selection methods does not have such trouble and is able to provide models with enhanced interpretability and predictability.

1.2 Some variable selection methods

The classical variable selection setting relies on data $(x_i^T, y_i)_{i=1}^n$, where $y_i$ is the $i$-th observation of the response variable and $x_i$ is its associated $p$-dimensional covariate vector. The goal is to select some of these $p$ variables to build up a prediction model. In this proposal, we consider linear models first; similar ideas can be generalized to other models.

Figure 1.3: Variable selection in the linear case

Figure 1.3 demonstrates the variable selection procedure in the linear case.
The goal of variable selection is to identify the true variables, as shown in the figure, from a large number of candidate variables. Mathematically, the ideal procedure estimates the components of the coefficient vector corresponding to noise variables as zeros and the other components as nonzeros. Therefore, in many cases a sparse model is desired, since in practice we usually believe that a relatively small number of features or variables contribute to the variation of the outcome.

When the dimensionality is not too large, two widely used variable selection methods, backward and forward selection, may be helpful. They enjoy considerable popularity in applications thanks to their easy implementation and good interpretability. However, when the dimensionality becomes large, such as the huge number of SNPs considered in genetic data, these methods are infeasible. Another important issue with them is noise accumulation. Once a certain variable or feature is contaminated by noise, its influence may be passed to all other variables in the procedure. Besides, consider the case when two important variables, both contributing to a large extent to the variation of the outcome, are somewhat correlated. It becomes difficult for one important variable to enter the model once the other important variable has already been selected. There are other problems with the traditional backward and forward selection methods, but the message, as is well known, is that they are not good enough in many situations.

1.2.1 Penalized empirical risk minimization

A group of more subtle variable selection methods is penalized empirical risk minimization. In prediction models, goodness is often defined in terms of prediction accuracy, but parsimony is another important criterion: simpler models are preferred for the sake of scientific insight into the feature-outcome relationship. However, the two goals do not point in the same direction, since the more independent variables we include, the better model fit we gain. In the extreme case, when there are exactly as many independent variables in the model as the sample size, all fitted residuals in the linear regression can be zero. Penalized empirical risk minimization tries to make a compromise between model fit and complexity. Consider the linear regression model

$$y = X\beta + \varepsilon, \qquad (1.1)$$

where $y = (y_1, \ldots, y_n)^T$ is an $n$-dimensional response vector, $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix consisting of $p$ covariate vectors $x_j$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional regression coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T \sim N(0, \sigma^2 I_n)$ for some positive constant $\sigma$ is an $n$-dimensional error vector independent of $X$. A natural way to achieve the balance between model fit and complexity is to minimize a weighted sum of the two measures, as shown in the expression below:

$$\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_0,$$

where $\|\cdot\|_0$ denotes the $L_0$-norm, the number of nonzero components of a vector. If $\lambda = 1$ it becomes the AIC model selection criterion, while $\lambda = (\log n)/2$ gives the BIC criterion. This procedure can indeed help us select a good model. However, the computation of the penalized $L_0$ problem is a combinatorial problem with NP-complexity. Therefore, a computationally affordable alternative is to minimize the following more general penalized least squares (PLS) problem:

$$\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \sum_{j=1}^p p_\lambda(|\beta_j|) \Big\},$$

where $p_\lambda(\cdot)$ is a penalty function indexed by the regularization parameter $\lambda > 0$.
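To make the penalized least-squares formulation concrete, here is a minimal Python sketch (not part of the original thesis) that evaluates the PLS objective for a candidate coefficient vector under the $L_0$ and $L_1$ penalties; the function and variable names are illustrative assumptions only.

```python
import numpy as np

def pls_objective(beta, X, y, penalty, lam):
    """Evaluate (1/(2n)) * ||y - X beta||_2^2 + sum_j p_lam(|beta_j|)."""
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2 * n)
    return loss + np.sum(penalty(np.abs(beta), lam))

# Two example penalty functions: the L0 "penalty" behind AIC/BIC-type criteria
# and the L1 (lasso) penalty discussed next.
l0_penalty = lambda t, lam: lam * (t != 0).astype(float)
l1_penalty = lambda t, lam: lam * t

# Toy usage with synthetic data and a candidate coefficient vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.r_[1.0, -0.5, np.zeros(8)]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_cand = np.r_[0.9, -0.4, np.zeros(8)]
print(pls_objective(beta_cand, X, y, l1_penalty, lam=0.1))
```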
A natural generalization of penalized $L_0$-regression is penalized $L_q$-regression with $p_\lambda(t) = \lambda |t|^q$ for $0 < q \leq 2$, which is also called bridge regression in Frank and Friedman (1993). Specifically, when $q = 2$ it becomes ridge regression, which clearly does not possess the variable selection feature. When $q = 1$ it becomes the well-known lasso estimator of Tibshirani (1996), which does possess the variable selection feature. The beauty of the $L_1$-penalty partially lies in the fact that it is computationally feasible and approximately equivalent to the $L_0$-penalty under certain conditions; see, for example, Donoho and Elad (2003), Donoho (2006) and Candes and Tao (2005, 2007).

Figure 1.4: Geometry understanding

Figure 1.4 illustrates the geometric intuition for why the $L_1$-penalty encourages sparse solutions but the $L_2$-penalty does not. The elliptical contours of the squared loss are shown by the full curves in the left panel of Figure 1.4. They are centered at the OLS estimates; the constraint region is the rotated square. The right panel of Figure 1.4 is for the $L_2$-penalty, which does not often produce zero coefficients. Besides the lasso of Tibshirani (1996), there are many more choices of penalty function that enjoy good properties as well, such as the adaptive lasso with a weighted $L_1$ penalty (Zou, 2006), the group lasso (Yuan and Lin, 2007) and the elastic net (Zou and Hastie, 2005), which both combine features of the $L_1$ and $L_2$ penalties. To save space, I do not introduce all of them in this proposal.

There are many advantages of $L_1$-type methods, but they also suffer from the problem of biasedness according to Fan and Li (2001), where a concave penalty was proposed and shown to enjoy unbiasedness, sparsity and continuity simultaneously. Besides, Zhao and Yu (2006) proved that the irrepresentable condition, a necessary condition to guarantee good properties, is more easily satisfied by concave penalties. A series of concave-penalized methods has emerged in recent years, for instance SCAD by Fan and Li (2001) and MCP by Zhang (2007). Take the SCAD estimator as an example. Its penalty function has the form

$$p_\lambda(|t|) = \lambda^2 - (|t| - \lambda)^2 I(|t| < \lambda).$$

Figure 1.5: Penalty functions

As shown in Figure 1.5, the SCAD penalty takes off at the origin and becomes flat later on. Therefore, it flattens the penalty when the argument $t$, or more precisely $|\beta_j|$, is large enough. Such a characteristic is preferred over the $L_1$-penalty, since there is no reason to increase the penalty just because the magnitude of the coefficient is larger. Intuitively, the SCAD estimator can have reduced bias thanks to this feature. Antoniadis and Fan (2001) show that the PLS estimator possesses the following properties:

1. sparsity if $\min_{t \geq 0}\{t + p'_\lambda(t)\} > 0$;
2. approximate unbiasedness if $p'_\lambda(t) = 0$ for large $t$;
3. continuity if and only if $\arg\min_{t \geq 0}\{t + p'_\lambda(t)\} = 0$,

where $p_\lambda(t)$ is nondecreasing and continuously differentiable on $[0, \infty)$, the function $-t - p'_\lambda(t)$ is strictly unimodal on $(0, \infty)$, and $p'_\lambda(t)$ means $p'_\lambda(0+)$ when $t = 0$ for notational simplicity. Concave penalized methods such as SCAD and MCP are among those that enjoy the above properties.

1.2.2 Another $L_1$-type method: Dantzig selector

In 2007, Candès and Tao proposed another famous $L_1$-regularization approach, the Dantzig selector, which minimizes the $L_1$-norm of the coefficient vector and relaxes the normal equations to allow for correlation between the residual vector and all the variables. An appealing feature of the method is its formulation as a linear program, which enjoys computational efficiency.
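Since the linear-programming formulation is central to the discussion that follows, the sketch below shows one way the Dantzig selector defined in (1.2) can be cast as a linear program and solved with an off-the-shelf solver (scipy's linprog), writing $\beta = u - v$ with $u, v \geq 0$. This is a hedged illustration of the general idea under my own toy data, not the implementation used in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||beta||_1 s.t. ||n^{-1} X^T (y - X beta)||_inf <= lam as an LP.

    The variable is z = [u; v] with beta = u - v and u, v >= 0, so the
    objective sum(u) + sum(v) equals ||beta||_1 at the optimum.
    """
    n, p = X.shape
    G = X.T @ X / n          # n^{-1} X^T X
    b = X.T @ y / n          # n^{-1} X^T y
    # |b - G(u - v)| <= lam  <=>  G(u - v) <= lam + b  and  -G(u - v) <= lam - b
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    z = res.x
    return z[:p] - z[p:]

# Toy example: a sparse signal with columns rescaled to L2-norm n^{1/2}.
rng = np.random.default_rng(1)
n, p = 50, 100
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)
beta0 = np.zeros(p); beta0[:3] = [1.0, -0.8, 0.6]
y = X @ beta0 + 0.1 * rng.standard_normal(n)
beta_hat = dantzig_selector(X, y, lam=0.1)
print(np.flatnonzero(np.abs(beta_hat) > 1e-6))   # indices of recovered signals
```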
Candès and Tao (2007) proved that, under the uniform uncertainty principle condition, the Dantzig selector can recover sparse signals as accurately as the ideal procedure that knows the true underlying sparse model, up to a logarithmic factor $\log p$. Later, Bickel et al. (2009) established more comprehensive asymptotic properties of the Dantzig selector in general nonparametric regression models under the restricted eigenvalue assumptions. Other work on the Dantzig selector includes James et al. (2009), Antoniadis et al. (2010), and Bertin et al. (2011). The Dantzig selector (DS) is defined as

$$\hat\beta_{DS} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad \|n^{-1} X^T (y - X\beta)\|_\infty \leq \lambda_1, \qquad (1.2)$$

where $\lambda_1 \geq 0$ is a regularization parameter. It can be viewed as a relaxation of the normal equations

$$n^{-1} X^T (y - X\beta) = 0, \qquad (1.3)$$

whose solutions include the minimizer of the $L_2$-loss $\|y - X\beta\|_2^2$. When the dimensionality $p$ stays below the sample size $n$, solving (1.3) directly gives us the ordinary linear regression estimator. However, when the dimensionality goes beyond the sample size, it becomes necessary to give some relaxation to the normal equations. Along with it, forcing $\|\beta\|_1$ to be small helps us find a sparse solution, in view of the intuition explained in Figure 1.4.

There is another way to understand the constraint in (1.2). Write $X = (x_1, x_2, \ldots, x_p)$ with each $x_j$ denoting the sample values of feature $j \in \{1, \ldots, p\}$. Then we can rewrite $\|n^{-1} X^T (y - X\beta)\|_\infty \leq \lambda_1$ as $|n^{-1} x_j^T (y - X\beta)| \leq \lambda_1$ for all $j \in \{1, \ldots, p\}$. As assumed before, the columns of $X$ have $L_2$-norm $n^{1/2}$. Hence, $|n^{-1} x_j^T (y - X\beta)|$ measures to some extent the magnitude of the correlation between covariate $x_j$ and the residual $y - X\beta$. The constraint in the Dantzig selector thus puts a constant bound on such correlations.

However, this constraint may not be flexible enough to differentiate important covariates from noise covariates. In other words, if a component of $\beta$ is nonzero, its information is extracted from the residual, so the corresponding covariate should have a lower correlation with the residual vector. This serves as an important motivation for my new variable selection method, the constrained Dantzig selector (CDS), which will be introduced in the next chapter.

Although the Dantzig selector enjoys the nice properties mentioned above, it may suffer from some potential issues. A common feature of its theory is that its rate of convergence under the prediction or estimation loss usually involves the logarithmic factor $\log p$. From the asymptotic point of view, such a factor may be negligible or insignificant when $p$ is not too large. It can, however, become no longer negligible, or even significant, in finite samples with large dimensionality, particularly when a relatively small sample size is considered or preferred. In such cases, the recovered signals and sparse models produced by the Dantzig selector may tend to be noisy. Another consequence of this effect is that many noise variables tend to appear together with recovered important ones, as demonstrated in Candès et al. (2008). In view of these problems, we propose an adapted version of the Dantzig selector with improved theoretical and numerical properties, the so-called constrained Dantzig selector (Kong et al, 2013), which will be introduced in detail in Chapter 2.

1.3 Variable selection methods for genetic data

Genome-wide association studies have advanced rapidly over the last decade, and many important risk variants have been found to be associated with various diseases. Yet, current GWA studies focus largely on single-variant associations. The advantage of association tests is that they are computationally inexpensive.
Besides, the adjusted p-values accounting for multiple testing in association tests have a valid statistical interpretation. However, there are still many problems with such univariate analyses. First, as is well known, the linkage and interaction among SNPs add dependence structure to the data, posing difficulty for the adjustment for multiple testing. More specifically, the usual threshold $5 \times 10^{-8}$ tends to be too conservative in the presence of dependence among SNPs. Second, due to disease complexity, each discovery alone explains only a small proportion of phenotypic variation, especially for the top-rated health threats of Americans, such as diabetes and cancers, even though these are largely attributed to genetic factors. In this situation, many researchers have turned to rare variants, hoping that the missing variation can be found by deep sequencing. Correspondingly, a series of methods dealing with rare variants has emerged recently, such as burden tests, the C-alpha test by Neale (2011), and the Sequence Kernel Association Test (SKAT) by Lin et al (2011).

Many efforts have been placed on looking for new variants that explain complex disease. However, not enough attention has been paid to employing recently developed feature selection methods to build a good predictive model using only common variants. This way we are able to accommodate several risk variants simultaneously and hopefully establish a model that accounts for a much larger proportion of phenotypic variation. Motivated by this hope, some researchers have turned to risk prediction with the help of variable selection methods; see, for example, Kooperberg et al (2010), He and Lin (2010), Guan and Stephens (2011), and Peltola et al (2012). One advantage of variable selection methods in genetic studies is that they can detect the joint effects of a group of variants whose marginal effects may not be significant. Moreover, the p-values of some risk variants are not small enough to reach genome-wide significance, and for this reason such SNPs have not drawn enough attention. However, in a predictive model estimated by variable selection methods, those SNPs may play very important roles.

The Multiethnic Cohort is a prospective cohort study that includes 215,251 men and women, the majority from 5 racial/ethnic groups in Hawaii and Los Angeles, California (African Americans, European Americans, Native Hawaiians, Japanese Americans, and Latinos); see, for example, Park et al (2009), Haiman et al (2007), and Nomura et al (2007). Between 1993 and 1996, participants entered the cohort by completing a 26-page, self-administered questionnaire that asked about diet and demographic factors, personal behaviors (e.g., physical activity), history of prior medical conditions (e.g., diabetes), and family history of common cancers. Potential cohort members were identified primarily through Department of Motor Vehicles drivers' license files and, additionally for African Americans, Health Care Financing Administration data files. Participants were between the ages of 45 and 75 years at the time of recruitment.

The JAPC data is a subset of the Multiethnic Cohort, comprising 2075 samples with 1033 prostate cancer cases and 1042 controls; see Cheng (2012). Genotyping of these samples was conducted using the Illumina.Human660W-Quad-v1 bead array at the Broad Institute. In this proposal, some preliminary results from the application of several variable selection methods, such as the constrained Dantzig selector by Kong et al (2013), the lasso by Tibshirani (1996), the adaptive lasso by Zou (2006) and SCAD by Fan and Li (2001), will be presented in Chapter 3 of this proposal.
1.4 Challenges of prediction models with interactions

In the previous sections, we introduced the prevalence of high-dimensional data and briefly discussed some popular variable selection techniques. However, the performance and feasibility of those methods can be much more limited if we further consider interactions in high dimensions. For example, even for a moderate dimensionality $p = 1000$, there are 499,500 two-way interactions among the variables. Directly incorporating this many interactions into the model and applying penalized regularization methods or the Dantzig selector can be difficult due to the resulting ultra-high dimensionality.

On a different note, there is growing evidence that effect modification exists in many prediction models. For example, gene-gene and gene-environment interactions have attracted the attention of many genetic researchers. A great many studies have been conducted to identify interactions in genetic models; see, for example, Gauderman et al. (2013), Lewinger et al. (2013) and Buil et al. (2015) among others.

Attracted by the popularity of gene-gene and gene-environment interactions, many statisticians aim to tackle the challenges of interaction identification in prediction models. Note that considering interactions in prediction models is different from detecting interactions in association studies. The former investigates all possible interactions along with main effects simultaneously in a single model, while the latter scans interactions separately. In the sense of prediction modeling with interactions, Hall and Xue (2014) proposed a two-step recursive approach to identifying interactions based on sure independence screening (Fan and Lv, 2008), where all $p$ main effects are first ranked and only the top $p^{1/2}$ ones are then retained to construct pairwise interactions of order $O(p)$ for further screening and selection of both interactions and main effects. Recently, Hao and Zhang (2014) introduced a forward selection based procedure to identify interactions in a greedy fashion. Most existing interaction identification methods, including the ones above, require at least a weak heredity assumption, meaning that an interaction can be in the model only if at least one of its corresponding main effects is also in the model. Although the heredity assumption is desirable and natural in many applications, it can also be easily violated in some situations, as documented in the literature. For example, Culverhouse et al. (2002) discussed interaction models displaying no main effects and examined the extent to which pure epistatic interactions, whose loci do not display any single-locus effects, could account for the variation of the phenotype. In the Nature review paper Cordell (2009), concerns were raised that many existing methods may miss pure interactions in the absence of main effects. Efforts have already been made on detecting pure epistatic interactions in Ritchie et al. (2001), where a real data example was presented to demonstrate the existence of such pure interactions. In these applications, methods that are released from the heredity constraint enjoy better flexibility and are more suitable for models with pure epistatic interactions.

To address the challenges of interaction identification in ultra-high dimensions and broader settings, we propose an efficient and flexible procedure, called interaction pursuit (IP), for interaction identification in ultra-high dimensions. The suggested method first reduces the number of interactions and main effects to a moderate scale by a new feature screening approach, and then selects important interactions and main effects in the reduced feature space using regularization methods.
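As a quick numerical illustration of the combinatorial blow-up noted at the start of this section (and not of the interaction pursuit procedure itself), the following hedged sketch counts and explicitly constructs all pairwise interaction columns for a small design matrix; for $p$ in the thousands this direct construction becomes impractical, which is precisely what motivates screening before selection.

```python
import numpy as np
from itertools import combinations
from math import comb

p = 1000
print(comb(p, 2))      # 499500 two-way interactions when p = 1000

def add_pairwise_interactions(X):
    """Augment a design matrix with all p(p-1)/2 pairwise interaction columns.

    Only feasible for small p; the augmented dimensionality grows quadratically.
    """
    n, p = X.shape
    pairs = list(combinations(range(p), 2))
    Z = np.empty((n, len(pairs)))
    for k, (j, l) in enumerate(pairs):
        Z[:, k] = X[:, j] * X[:, l]
    return np.hstack([X, Z])

X_small = np.random.default_rng(2).standard_normal((20, 10))
print(add_pairwise_interactions(X_small).shape)   # (20, 10 + 45)
```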
Compared to existing approaches, our method screens interactions separately from main effects and thus can be more effective in interaction screening. Under a fairly general framework, we establish that, for both interactions and main effects, the method enjoys the sure screening property in screening and oracle inequalities in selection. Our method and theoretical results are supported by several simulation and real data examples.

Chapter 2
Constrained Dantzig Selector

2.1 Introduction

In many contemporary applications, it is appealing to design procedures that can recover informative signals, among a pool of a potentially huge number of signals, to a desired level of accuracy with as small a number of observations as possible. Such procedures rely heavily on the idea of variable selection. Candès and Tao (2007) proposed a well-known method, the Dantzig selector, to achieve this goal. This method is computationally efficient, since it can be recast as a linear programming problem, and it enjoys nice sampling properties. Existing results show that it can recover sparse signals mimicking the accuracy of the ideal procedure, up to a logarithmic factor of the dimensionality.

It may suffer from some potential issues, however. A common feature of its theory is that its rate of convergence under the prediction or estimation loss usually involves the logarithmic factor $\log p$. From the asymptotic point of view, such a factor may be negligible or insignificant when $p$ is not too large. It can, however, become no longer negligible, or even significant, in finite samples with large dimensionality, particularly when a relatively small sample size is considered or preferred. In such cases, the recovered signals and sparse models produced by the Dantzig selector may tend to be noisy. Another consequence of this effect is that many noise variables tend to appear together with recovered important ones, as demonstrated in Candès et al. (2008).

Therefore, in this chapter, we present an adapted version, the constrained Dantzig selector, which mitigates the potential issues of the Dantzig selector in relatively small samples. We replace the constant Dantzig constraint on correlations between the variables and the residual vector by a more flexible one, and consider a constrained parameter space distinguishing between zero parameters and significantly nonzero parameters. The main contributions of this paper are threefold. First, compared to the Dantzig selector, the theoretical results of this paper show that the number of falsely discovered signs of our new selector, with an explicit inverse relationship to the signal strength, is controlled as a possibly asymptotically vanishing fraction of the true model size. Second, the convergence rates for the constrained Dantzig selector are shown to be within a factor of $\log n$ of the oracle rates, relative to the factor of $\log p$ for the Dantzig selector, a significant improvement in the case of ultra-high dimensionality and relatively small sample size. It is appealing that such an improvement is made with a fairly weak assumption on the signal strength. To the best of our knowledge, this assumption seems to be the weakest one in the literature of similar results; see, for example, Bickel et al. (2009) and Zheng et al. (2013). Two parallel theorems, under the uniform uncertainty principle condition and the restricted eigenvalue assumptions, are established on the properties of the constrained Dantzig selector for compressed sensing and sparse modeling, respectively. Third, an active-set based algorithm is introduced to implement the constrained Dantzig selector efficiently.
An appealing feature of this algorithm is that its convergence can be checked easily.

2.2 Model setting

To simplify the technical presentation, we adopt the model setting in Candès and Tao (2007) and present the main ideas focusing on the linear regression model

$$y = X\beta + \varepsilon, \qquad (2.1)$$

where $y = (y_1, \ldots, y_n)^T$ is an $n$-dimensional response vector, $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix consisting of $p$ covariate vectors $x_j$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional regression coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T \sim N(0, \sigma^2 I_n)$ for some positive constant $\sigma$ is an $n$-dimensional error vector independent of $X$. The normality assumption is considered for simplicity, and all the results in the paper can be extended to the cases of bounded errors or light-tailed error distributions without much difficulty. See, for example, the technical analysis in Fan and Lv (2011) and Fan and Lv (2013).

In sparse modeling, we are interested in recovering the support and nonzero components of the true regression coefficient vector $\beta_0 = (\beta_{0,1}, \ldots, \beta_{0,p})^T$, which we assume to be sparse with $s$ nonzero components, for the case when the dimensionality $p$ may greatly exceed the sample size $n$. Throughout this paper, $p$ is implicitly understood as $\max(n,p)$ and $s \leq \min(n,p)$ to ensure model identifiability. To align the scale of all covariates, we assume that each column of $X$, that is, each covariate vector $x_j$, is rescaled to have $L_2$-norm $n^{1/2}$, matching that of the constant covariate vector $1$.

The Dantzig selector (Candès and Tao, 2007) is defined as

$$\hat\beta_{DS} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad \|n^{-1} X^T (y - X\beta)\|_\infty \leq \lambda_1, \qquad (2.2)$$

where $\lambda_1 \geq 0$ is a regularization parameter. The above constant Dantzig selector constraint on correlations between all covariates and the residual vector may not be flexible enough to differentiate important covariates from noise covariates.

We illustrate this point with the help of a simple simulation. We generated data from the model $y = X\beta_0 + \varepsilon$ with $(s, n, p) = (7, 100, 1000)$ and $\varepsilon \sim N(0, 0.4^2 I)$. The nonzero components of $\beta_0$ were set to be $(1, 0.5, 0.7, 1.2, 0.9, 0.3, 0.55)^T$, lying in the first seven components. The rows of the design matrix $X$ were sampled as independent and identically distributed copies from $N(0, \Sigma_r)$, with $\Sigma_r$ a $p \times p$ matrix with diagonal elements 1 and off-diagonal elements 0.5, and then each column was rescaled to have unit $L_2$-norm. We are interested in whether there is any difference between the following two types of correlations: type I, the correlations between selected variables and the residuals, and type II, the correlations between unselected variables and the residuals. In the simulation model, we calculated the oracle estimator of the coefficient vector, denoted by $\beta_{oracle}$, and the corresponding residuals. The term oracle is used in the sense that we know in advance which components of $\beta_0$ are nonzero. Therefore, the oracle estimator $\beta_{oracle}$ is nothing but the least-squares estimator on the first seven components in this simulation. The type I correlations, $\mathrm{corr}(x_j, y - X\beta_{oracle})$ with $x_j$ the $j$-th variable for $j \in \{1, 2, \ldots, 7\}$, and the type II correlations, $\mathrm{corr}(x_j, y - X\beta_{oracle})$ for $j \in \{8, 9, \ldots, 1000\}$, were calculated. We repeated this simulation 100 times, so that 700 type I correlations and 99,300 type II correlations were obtained. Boxplots of the two types of correlations can be found in Figure 2.1.

Figure 2.1: Boxplots of the two types of (absolute) correlations, type I versus type II

Motivated by this simulation example, it seems necessary to bound the two types of correlations differently.
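The simulation just described is easy to reproduce in outline. The following sketch is illustrative only: the equicorrelated design is generated through a one-factor representation rather than an explicit covariance matrix, the coefficient magnitudes are taken as printed in the text, and correlations are computed as uncentered cosines in line with the geometric view used earlier.

```python
import numpy as np

rng = np.random.default_rng(3)
s, n, p, rho, sigma = 7, 100, 1000, 0.5, 0.4
beta0 = np.zeros(p)
beta0[:s] = [1, 0.5, 0.7, 1.2, 0.9, 0.3, 0.55]   # nonzero coefficients as printed above

type1, type2 = [], []
for _ in range(100):
    # Rows with pairwise correlation rho: X_ij = sqrt(rho)*w_i + sqrt(1-rho)*e_ij.
    w = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * w + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)               # rescale columns to unit L2-norm
    y = X @ beta0 + rng.normal(0.0, sigma, n)
    X1 = X[:, :s]                                # oracle: true support known in advance
    beta_oracle = np.linalg.lstsq(X1, y, rcond=None)[0]
    resid = y - X1 @ beta_oracle
    corrs = np.abs(X.T @ resid) / np.linalg.norm(resid)   # columns already unit-norm
    type1.extend(corrs[:s])
    type2.extend(corrs[s:])

print(np.mean(type1), np.mean(type2))   # compare the two correlation distributions
```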
With this intuition, we define the constrained Dantzig selector as

$$\hat\beta_{CDS} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad |n^{-1} x_j^T (y - X\beta)| \leq \lambda_0 1_{\{|\beta_j| \geq \gamma\}} + \lambda_1 1_{\{\beta_j = 0\}} \ \text{for } j = 1, \ldots, p, \ \text{and } \beta \in \mathcal{B}_\gamma, \qquad (2.3)$$

where $\lambda_0 \geq 0$ is a regularization parameter and $\mathcal{B}_\gamma = \{\beta \in \mathbb{R}^p : \beta_j = 0 \text{ or } |\beta_j| \geq \gamma \text{ for each } j\}$ is the constrained parameter space for some $\gamma \geq 0$. When we choose $\gamma = 0$ and $\lambda_0 = \lambda_1$, the constrained Dantzig selector becomes the Dantzig selector. Throughout this paper, we choose the regularization parameters $\lambda_0$ and $\lambda_1$ as $c_0\{(\log n)/n\}^{1/2}$ and $c_1\{(\log p)/n\}^{1/2}$, respectively, with $\lambda_0 \leq \lambda_1$ as well as $c_0$ and $c_1$ two sufficiently large positive constants, and assume that $\gamma$ is a parameter greater than $\lambda_1$. The two parameters $\lambda_0$ and $\lambda_1$ differentially bound the two types of correlations:

1. on the support of the constrained Dantzig selector, the correlations between covariates and residuals are bounded, up to a common scale, by $\lambda_0$; denoting by $S_{CDS}$ the support of the CDS, mathematically $|n^{-1} x_j^T (y - X\beta)| \leq \lambda_0$ for all $j \in S_{CDS}$;
2. on its complement, however, the correlations are bounded through $\lambda_1$; mathematically, $|n^{-1} x_j^T (y - X\beta)| \leq \lambda_1$ for all $j \notin S_{CDS}$.

In the ultra-high dimensional case, meaning $\log p = O(n^\alpha)$ for some $\alpha > 0$, the constraints involving $\lambda_0$ are tighter than those involving $\lambda_1$, in which $\lambda_1$ is a universal regularization parameter for the Dantzig selector; see Candès and Tao (2007) and Bickel et al. (2009).

We now provide more insights into the new constraints in the constrained Dantzig selector. First, it is worthwhile to notice that if $\beta_0 \in \mathcal{B}_\gamma$, then $\beta_0$ satisfies the new constraints with large probability in model setting (2.1); see the proof of Theorem 1. With the tighter constraints, the feasible set of the constrained Dantzig selector problem is a subset of that of the Dantzig selector problem, resulting in a search for the solution in a reduced space. Second, it is appealing to extract more information from important covariates, leading to lower correlations between these variables and the residual vector. In this spirit, the constrained Dantzig selector puts tighter constraints on the correlations between selected variables and residuals. Third, the constrained Dantzig selector is defined on the constrained parameter space $\mathcal{B}_\gamma$, which was introduced in Fan and Lv (2013). Such a space also shares some similarity with the union of coordinate subspaces considered in Fan and Lv (2011) for characterizing the restricted global optimality of nonconcave penalized likelihood estimators. The threshold $\gamma$ in $\mathcal{B}_\gamma$ distinguishes important covariates with strong effects from noise covariates with weak effects. As shown in Fan and Lv (2013), this feature can lead to improved sparsity and effectively prevent overfitting by making it harder for noise covariates to enter the model. A geometric visualization of these relationships can be found below.

Figure 2.2: Different constraints

2.2.1 Nonasymptotic compressed sensing properties

Since the Dantzig selector was introduced partly for applications in compressed sensing, we first study the nonasymptotic compressed sensing properties of the constrained Dantzig selector by adopting the theoretical framework in Candès and Tao (2007). They introduced the uniform uncertainty principle condition defined as follows. Denote by $X_T$ a submatrix of $X$ consisting of columns with indices in a set $T \subset \{1, \ldots, p\}$. For the true model size $s$, define the $s$-restricted isometry constant of $X$ as the smallest constant $\delta_s$ such that

$$(1 - \delta_s)\|h\|_2^2 \leq n^{-1} \|X_T h\|_2^2 \leq (1 + \delta_s)\|h\|_2^2$$

for any set $T$ of size at most $s$ and any vector $h$. This condition requires that each submatrix of $X$ with at most $s$ columns behaves similarly to an orthonormal system.
In other words, we require the eigenvalues of the submatrix $X_T$ to deviate from the value 1 by only a small amount ($\delta_s$) in magnitude. Another constant, the $s$-restricted orthogonality constant, is defined as the smallest quantity $\theta_{s,2s}$ such that

$$n^{-1} |\langle X_T h, X_{T'} h' \rangle| \leq \theta_{s,2s} \|h\|_2 \|h'\|_2$$

for all pairs of disjoint sets $T, T' \subset \{1, \ldots, p\}$ with $|T| \leq s$ and $|T'| \leq 2s$ and any vectors $h, h'$. A small value of $\theta_{s,2s}$ generally requires the angle between the two spaces spanned by $X_T$ and $X_{T'}$ to be large.

The uniform uncertainty principle condition (UUP) is simply stated as

$$\delta_s + \theta_{s,2s} < 1. \qquad (2.4)$$

For notational simplicity, we drop the subscripts and denote these two constants by $\delta$ and $\theta$, respectively. The UUP condition actually puts some requirements on the collinearity structure of the design matrix. If each column of $X$ is independent of, or weakly correlated with, the others, the UUP condition can be easily satisfied. Besides, this condition has been widely used and more discussions of it can be found in, for example, Candès and Tao (2005), Candès and Tao (2006), and Candès and Tao (2007).

Without loss of generality, assume that $\mathrm{supp}(\beta_0) = \{1 \leq j \leq p : \beta_{0,j} \neq 0\} = \{1, \ldots, s\}$ hereafter. To evaluate the sparse recovery accuracy, we consider the number of falsely discovered signs defined as $FS(\hat\beta) = |\{j = 1, \ldots, p : \mathrm{sgn}(\hat\beta_j) \neq \mathrm{sgn}(\beta_{0,j})\}|$ for an estimator $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^T$. Now we are ready to present the nonasymptotic compressed sensing properties of the constrained Dantzig selector.

Theorem 2.1. Assume that the uniform uncertainty principle condition (2.4) holds and $\beta_0 \in \mathcal{B}_\gamma$ with $\gamma \geq C^{1/2}(1 + \lambda_1/\lambda_0)\lambda_1$ for some positive constant $C$. Then with probability at least $1 - O(n^{-c})$ for $c = (c_0^2 \wedge c_1^2)/(2\sigma^2) - 1$, the constrained Dantzig selector $\hat\beta$ satisfies

$$\|\hat\beta - \beta_0\|_1 \leq 2\sqrt{5}(1 - \delta - \theta)^{-1} s \lambda_0, \quad \|\hat\beta - \beta_0\|_2 \leq (1 - \delta - \theta)^{-1}(5s)^{1/2}\lambda_0, \quad FS(\hat\beta) \leq Cs(\lambda_1/\gamma)^2.$$

If in addition $\gamma > (1 - \delta - \theta)^{-1}(5s)^{1/2}\lambda_0$, then with the same probability it also holds that $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta_0)$ and $\|\hat\beta - \beta_0\|_\infty \leq 2\|(n^{-1}X_1^T X_1)^{-1}\|_\infty \lambda_0$, where $X_1$ is an $n \times s$ submatrix of $X$ corresponding to the $s$ nonzero $\beta_{0,j}$'s.

The constant $c$ in the above probability bound can be sufficiently large since both constants $c_0$ and $c_1$ are assumed to be large, while the constant $C$ comes from Theorem 1.1 in Candès and Tao (2007); see the proof in the Appendix for details. In the above bound on the $L_\infty$-estimation loss, it holds that $\|(n^{-1}X_1^T X_1)^{-1}\|_\infty \leq s^{1/2}\|(n^{-1}X_1^T X_1)^{-1}\|_2 \leq (1 - \delta)^{-1}s^{1/2}$. See Section 2.2.2 for more discussion of this quantity.

From Theorem 1, we see improvements of the constrained Dantzig selector over the Dantzig selector, which has a convergence rate, in terms of the $L_2$-estimation loss, up to a factor $\log p$ of that for the ideal procedure. However, the sparsity property of the Dantzig selector was not investigated in Candès and Tao (2007). In contrast, the constrained Dantzig selector is shown to have an inverse quadratic relationship between the number of falsely discovered signs and the threshold $\gamma$, revealing that its model selection accuracy increases with the signal strength. The number of falsely discovered signs can be controlled below or as an asymptotically vanishing fraction of the true model size, since $FS(\hat\beta) \leq Cs(\lambda_1/\gamma)^2 \leq (1 + \lambda_1/\lambda_0)^{-2}s \leq s$ under the assumption $\gamma \geq C^{1/2}(1 + \lambda_1/\lambda_0)\lambda_1$.

Another advantage of the constrained Dantzig selector lies in its convergence rates. In the case of ultra-high dimensionality, which is typical in compressed sensing applications, its prediction and estimation losses can be reduced from the logarithmic factor $\log p$ to $\log n$ with overwhelming probability.
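As a rough numerical illustration of the gap between the two logarithmic factors (ignoring constants and all other terms in the bounds), one can compare $\{(\log p)/n\}^{1/2}$ with $\{(\log n)/n\}^{1/2}$ for a small sample size:

```python
import math

n = 100
for p in (1_000, 5_000, 100_000):
    factor_logp = math.sqrt(math.log(p) / n)   # factor in Dantzig selector bounds
    factor_logn = math.sqrt(math.log(n) / n)   # factor in the CDS bounds
    print(p, round(factor_logp, 3), round(factor_logn, 3))
```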
In particular, only a fairly weak assumption on the signal strength is imposed to attain such improved convergence rates. Our convergence rate provides an upper bound for the minimax rate of convergence. We conjecture that this convergence rate with the logarithmic factor $\log n$ would be the minimax rate in our model setting.

There exist other methods that have been shown to enjoy convergence rates of the same order as well, for example, in Zheng et al. (2013) for high-dimensional thresholded regression. However, these results usually rely on a stronger condition on the signal strength, such as the minimum signal strength being at least of order $\{s(\log p)/n\}^{1/2}$. In another paper, Fan and Lv (2011) showed that the nonconcave penalized estimator can have a consistency rate of $O_p(s^{1/2} n^{-\gamma} \log n)$ for some $\gamma \in (0, 1/2]$ under the $L_2$-estimation loss, which can be slower than our rate of convergence. A main implication of our improved convergence rates is that a smaller number of observations will be needed for the constrained Dantzig selector to attain the same level of accuracy in compressed sensing as the Dantzig selector, which is demonstrated in Section 2.3.2.

2.2.2 Sampling properties

The properties of the Dantzig selector have also been extensively investigated in Bickel et al. (2009). They introduced the restricted eigenvalue assumption, under which oracle inequalities for various prediction and estimation losses were derived. We adopt their theoretical framework and study the sampling properties of the constrained Dantzig selector under the restricted eigenvalue assumption stated below.

Condition 1. For some positive integer $m$ of the order of $s$, there exists some positive constant $\kappa$ such that $\|n^{-1/2} X\delta\|_2 \geq \kappa \max(\|\delta_1\|_2, \|\delta'_1\|_2)$ for all $\delta \in \mathbb{R}^p$ satisfying $\|\delta_2\|_1 \leq \|\delta_1\|_1$, where $\delta = (\delta_1^T, \delta_2^T)^T$, $\delta_1$ is a subvector of $\delta$ consisting of the first $s$ components, and $\delta'_1$ is a subvector of $\delta_2$ consisting of its $\max\{m, C_m s(\lambda_1/\gamma)^2\}$ largest components in magnitude, with $C_m$ some positive constant.

Condition 1 is a basic assumption on the design matrix $X$ for deriving the oracle inequalities of the Dantzig selector. The reason to call it restricted is as follows. We know

$$\min_{\delta \in \mathbb{R}^p} \frac{\|n^{-1/2} X\delta\|_2}{\|\delta\|_2} = 0,$$

since the smallest singular value of $X$ is zero when $p > n$. However, for the CDS we assume

$$\min_{\delta} \frac{\|n^{-1/2} X\delta\|_2}{\|\delta_{sub}\|_2} > 0,$$

over the restricted set of $\delta$ and with some subvector $\delta_{sub}$ in the denominator instead. In van de Geer and Bühlmann (2009), there is a comprehensive discussion of the relationships among a collection of conditions such as mutual coherence and restricted eigenvalue assumptions. They argued rigorously that the restricted eigenvalue conditions allow for a fairly general class of design matrices. Some easily checkable conditions, such as mutual coherence, can be used as sufficient conditions. The use of the UUP and restricted eigenvalue conditions is to provide a deeper theoretical understanding of these $L_1$-regularization methods. See (2.10) in the Appendix for insights into the basic inequality $\|\delta_2\|_1 \leq \|\delta_1\|_1$, and Bickel et al. (2009) for more detailed discussions of this assumption.

Theorem 2.2. Assume that Condition 1 holds and $\beta_0 \in \mathcal{B}_\gamma$ with $\gamma \geq C_m^{1/2}(1 + \lambda_1/\lambda_0)\lambda_1$. Then the constrained Dantzig selector $\hat\beta$ satisfies, with the same probability as in Theorem 2.1, that

$$n^{-1/2}\|X(\hat\beta - \beta_0)\|_2 = O(\kappa^{-1} s \lambda_0), \quad \|\hat\beta - \beta_0\|_1 = O(\kappa^{-2} s \lambda_0), \quad \|\hat\beta - \beta_0\|_2 = O(\kappa^{-2} s^{1/2} \lambda_0), \quad FS(\hat\beta) \leq C_m s(\lambda_1/\gamma)^2.$$

If in addition $\gamma > 2\sqrt{5}\kappa^{-2} s^{1/2} \lambda_0$, then with the same probability it also holds that $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta_0)$ and $\|\hat\beta - \beta_0\|_\infty = O\{\|(n^{-1}X_1^T X_1)^{-1}\|_\infty \lambda_0\}$.
Theorem 2.2 establishes asymptotic results on the sparsity and oracle inequalities for the constrained Dantzig selector under the restricted eigenvalue assumption. This assumption, which is an alternative to the uniform uncertainty principle condition, has also been widely employed in high-dimensional settings. In Bickel et al. (2009), an approximate equivalence of the lasso estimator (Tibshirani, 1996) and the Dantzig selector was proved under this assumption, and the lasso estimator was shown to be sparse with size $O(\phi_{\max} s)$, where $\phi_{\max}$ is the largest eigenvalue of the Gram matrix $n^{-1}X^T X$. In contrast, the constrained Dantzig selector gives a sparser model under the restricted eigenvalue assumption, since its number of falsely discovered signs $FS(\hat\beta) \leq C_m s(\lambda_1/\gamma)^2 = o(s)$ when $\lambda_1 = o(\gamma)$. Similarly to Theorem 1, the constrained Dantzig selector improves over both the lasso and the Dantzig selector in terms of convergence rates, a reduction of the $\log p$ factor to $\log n$.

One can see that the results in Theorems 1 and 2 are approximately equivalent, while the latter presents an additional oracle inequality on the prediction loss. An interesting phenomenon is that if we adopt a simpler version, that is, the Dantzig selector equipped with the thresholding constraint only, we can also obtain results similar to Theorems 1 and 2, but under a stronger condition on the signal strength such as $\gamma \geq s^{1/2}\lambda_1$. In this sense, the constrained Dantzig selector is also an extension of the Dantzig selector equipped with the thresholding constraint only, but it enjoys better properties. Some comprehensive results on the prediction and variable selection properties have also been established in Fan and Lv (2013) for various regularization methods, revealing their asymptotic equivalence in the thresholded parameter space. However, as mentioned in Section 2.2.1, improved rates as in Theorem 2 commonly require a stronger assumption on the signal strength, namely $\beta_0 \in \mathcal{B}_\gamma$ with $\gamma \geq s^{1/2}\lambda_1$; see, for example, Theorem 2 of Fan and Lv (2013).

For the quantity $\|(n^{-1}X_1^T X_1)^{-1}\|_\infty$ in the above bound on the $L_\infty$-estimation loss, if $n^{-1}X_1^T X_1$ takes the form of a common correlation matrix $(1-\rho)I_s + \rho 1_s 1_s^T$ for some $\rho \in [0,1)$, it is easy to check that $\|(n^{-1}X_1^T X_1)^{-1}\|_\infty = (1-\rho)^{-1}\{1 + (2s-3)\rho\}/\{1 + (s-1)\rho\}$, which is bounded regardless of $s$.

2.3 Numerical studies for the constrained Dantzig selector

2.3.1 Implementation

The constrained Dantzig selector defined in (2.3) depends on the tuning parameters $\lambda_0$, $\lambda_1$ and $\gamma$. For any fixed $\lambda_0$ and $\gamma$, we exploit the idea of sequential linear programming to produce the solution path of the constrained Dantzig selector as $\lambda_1$ varies. Choose a grid of values for the tuning parameter $\lambda_1$ in decreasing order, with the first one being $\|n^{-1}X^T y\|_\infty$. It is easy to check that $\beta = 0$ satisfies all the constraints in (2.3) for $\lambda_1 = \|n^{-1}X^T y\|_\infty$, and thus the solution is $\hat\beta_{CDS} = 0$ in this case. The choices of $\lambda_0$ and $\gamma$ will be discussed later. For each $\lambda_1$ in the grid, we use the solution from the previous one in the grid as an initial value to speed up convergence. For a given $\lambda_1$, we define an active set, iteratively update this set, and solve the constrained Dantzig selector problem. We name this algorithm the CDS algorithm; it is detailed below.

1. For a fixed $\lambda_1$ in the grid, denote by $\hat\beta^{(0)}_{\lambda_1}$ the initial value. Let $\hat\beta^{(0)}_{\lambda_1}$ be zero when $\lambda_1 = \|n^{-1}X^T y\|_\infty$, and the estimate from the previous $\lambda_1$ in the grid otherwise.

2. Denote by $\hat\beta^{(k)}_{\lambda_1}$ the estimate from the $k$th iteration. Define the active set $A$ as the support of $\hat\beta^{(k)}_{\lambda_1}$ and $A^c$ its complement.
Let $b$ be a vector with constant components $\lambda_0$ on $A$ and $\lambda_1$ on $A^c$. For the $(k+1)$th iteration, update $A$ as $A \cup \{j \in A^c : |n^{-1}x_j^T(y - X\hat\beta_A)| > \lambda_1\}$, where the subscript $A$ indicates a subvector restricted to $A$. Solve the following linear program on the new $A$:

$$\hat\beta_A = \arg\min \|\beta_A\|_1 \quad \text{subject to} \quad |n^{-1}X_A^T(y - X_A\beta_A)| \leq b_A, \qquad (2.5)$$

where $\leq$ is understood componentwise and the subscript $A$ also indicates a submatrix with columns corresponding to $A$. For the solution obtained in (2.5), set all its components smaller than $\gamma$ in magnitude to zero.

3. Update the active set $A$ as the support of $\hat\beta_A$. Solve the Dantzig selector problem on this active set with $\lambda_0$ as the regularization parameter:

$$\hat\beta_A = \arg\min \|\beta_A\|_1 \quad \text{subject to} \quad \|n^{-1}X_A^T(y - X_A\beta_A)\|_\infty \leq \lambda_0. \qquad (2.6)$$

Let $\hat\beta^{(k+1)}_A = \hat\beta_A$ and $\hat\beta^{(k+1)}_{A^c} = 0$, which gives the solution for the $(k+1)$th iteration.

4. Repeat steps 2 and 3 until convergence for the fixed $\lambda_1$ and record the estimate from the last iteration as $\hat\beta_{\lambda_1}$. Jump to the next $\lambda_1$ if $\hat\beta_{\lambda_1} \in \mathcal{B}_\gamma$, and stop the algorithm otherwise.

With the solution path produced, we use cross-validation to select the tuning parameter $\lambda_1$. One can also tune $\lambda_0$ and $\gamma$ similarly to $\lambda_1$, but in our numerical studies a few fixed values for them suffice to obtain satisfactory results, as elaborated in each study later.

The rationale of the constrained Dantzig selector algorithm is as follows. Step 1 defines the initial value (0th iteration) for each $\lambda_1$ in the grid. In step 2, starting with a smaller active set, we add variables that violate the constrained Dantzig selector constraints to eliminate such conflicts. As a consequence, some components of $b_A$ have value $\lambda_1$ instead of $\lambda_0$. Therefore, we need to further solve (2.6) in step 3, noting that, restricted to its support, the constrained Dantzig selector should be a solution to the Dantzig selector problem with parameter $\lambda_0$. An early stopping of the solution path is imposed in step 4 to make the algorithm computationally more efficient.

An appealing feature of this algorithm is that its convergence can be checked easily. Once there are no more variables violating the constrained Dantzig selector constraints, that is, $\{j \in A^c : |n^{-1}x_j^T(y - X\hat\beta_A)| > \lambda_1\} = \emptyset$, the iteration stops and the algorithm converges. In other words, the convergence of the algorithm is equivalent to that of the active set, which can be checked directly. When the algorithm converges, the solution lies in the feasible set of the constrained Dantzig selector problem and is a global minimizer restricted to the active set, and is therefore a local minimizer.

In simulation study 2 of Section 2.3.2, we tracked the convergence of the algorithm on 100 data sets for $p = 1000$ and $5000$, respectively. In both cases, we observed that the algorithm always converged over all 100 simulations, indicating considerable stability. Another advantage of the algorithm is that it is built upon the Dantzig selector in lower dimensions, so it inherits its computational efficiency.

2.3.2 Simulation studies

To better illustrate the performance of the constrained Dantzig selector, we also consider the thresholded Dantzig selector, which simply sets components of the Dantzig selector estimate to zero if they are smaller than a threshold in magnitude. We evaluated the performance of the constrained Dantzig selector in comparison with the Dantzig selector, thresholded Dantzig selector, lasso, elastic net (Zou and Hastie, 2005), and adaptive lasso (Zou, 2006). Two simulation studies were considered, the first investigating sparse recovery for compressed sensing and the second examining sparse modeling.
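Before turning to the two simulation studies, the following is a schematic Python sketch of the active-set iteration of Section 2.3.1 for a single fixed $\lambda_1$. It is a simplified illustration under several assumptions (a generic LP solver for the Dantzig-type subproblems, no solution path over $\lambda_1$, no cross-validation, and no early-stopping check that $\hat\beta_{\lambda_1} \in \mathcal{B}_\gamma$), not the implementation used for the numerical results below.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_ds(XA, y, bounds_vec, n):
    """min ||beta_A||_1 s.t. |n^{-1} X_A^T (y - X_A beta_A)| <= bounds_vec, componentwise."""
    pA = XA.shape[1]
    G = XA.T @ XA / n
    b = XA.T @ y / n
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([bounds_vec + b, bounds_vec - b])
    res = linprog(np.ones(2 * pA), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * pA), method="highs")
    z = res.x
    return z[:pA] - z[pA:]

def cds_single_lambda(X, y, lam0, lam1, gamma, beta_init=None, max_iter=50):
    """Schematic active-set iteration for the CDS at one fixed lam1 (illustrative only)."""
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    for _ in range(max_iter):
        active = np.flatnonzero(beta)
        resid_corr = np.abs(X.T @ (y - X @ beta)) / n
        violations = np.setdiff1d(np.flatnonzero(resid_corr > lam1), active)
        if violations.size == 0:          # convergence: no constraint violations remain
            break
        new_active = np.union1d(active, violations)
        # Step 2: weighted DS on the enlarged active set (lam0 on old A, lam1 on new entries),
        # then threshold at gamma.
        bvec = np.where(np.isin(new_active, active), lam0, lam1)
        beta_A = weighted_ds(X[:, new_active], y, bvec, n)
        beta_A[np.abs(beta_A) < gamma] = 0.0
        # Step 3: plain DS with parameter lam0 on the support of the thresholded solution.
        support = new_active[np.abs(beta_A) > 0]
        beta = np.zeros(p)
        if support.size > 0:
            beta[support] = weighted_ds(X[:, support], y, np.full(support.size, lam0), n)
    return beta

# Toy usage, purely illustrative:
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 200)); X /= np.linalg.norm(X, axis=0) / np.sqrt(60)
beta0 = np.zeros(200); beta0[:3] = [1.0, -0.8, 0.6]
y = X @ beta0 + 0.1 * rng.standard_normal(60)
print(np.flatnonzero(cds_single_lambda(X, y, lam0=0.02, lam1=0.1, gamma=0.2)))
```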
The setting of the first simulation study is similar to that of the sparse recovery example in Lv and Fan (2009). We generated 100 data sets from model (2.1) without noise, that is, the linear equation $y = X\beta_0$ with $(s, p) = (7, 1000)$. The nonzero components of $\beta_0$ were set to be $(1, 0.5, 0.7, 1.2, 0.9, 0.3, 0.55)^T$, lying in the first seven components, and $n$ was chosen to be even integers between 30 and 80. The rows of the design matrix $X$ were sampled as independent and identically distributed copies from $N(0, \Sigma_r)$, with $\Sigma_r$ a $p \times p$ matrix with diagonal elements 1 and off-diagonal elements $r$, and then each column was rescaled to have unit $L_2$-norm. Three levels of population collinearity, $r = 0$, $0.2$, and $0.5$, were considered. We let $\lambda_0$ and $\gamma$ range over two small grids, $\{0.001, 0.005, 0.01, 0.05, 0.1\}$ and $\{0.05, 0.1, 0.15, 0.2\}$, respectively. The grid of values for $\lambda_1$ was set as described in Section 2.3.1. If any of the solutions in the path had exactly the same support as $\beta_0$, it was counted as a successful recovery. This criterion was applied to all other methods in this example for a fair comparison. Figure 2.3 presents the probabilities of exact recovery of the sparse $\beta_0$ based on 100 simulations by all methods. We see that all methods performed well in relatively large samples and had a lower probability of successful sparse recovery when the sample size became smaller. The constrained Dantzig selector performed better than the other methods over different sample sizes and the three levels of population collinearity. In particular, the thresholded Dantzig selector performed similarly to the Dantzig selector, revealing that simple thresholding alone, instead of the flexible constraints in the constrained Dantzig selector, does not help much with signal recovery in this case.

Figure 2.3: Probabilities of exact recovery for the Dantzig selector (DS), thresholded Dantzig selector (TDS), lasso, elastic net (Enet), adaptive lasso (Alasso), and constrained Dantzig selector (CDS); the three panels plot recovery probability against sample sizes from 30 to 80 for $r = 0$, $0.2$, and $0.5$.

The second simulation study adopts a setting similar to that in Zheng et al. (2013). We generated 100 data sets from the linear regression model (2.1) with Gaussian error $\varepsilon \sim N(0, \sigma^2 I_n)$. The coefficient vector was $\beta_0 = (v^T, \ldots, v^T, 0^T)^T$ with the pattern $v = (\beta_{strong}^T, \beta_{weak}^T)^T$ repeated three times, where $\beta_{strong} = (0.6, 0, 0, 0.6, 0, 0)^T$ and $\beta_{weak} = (0.05, 0, 0, 0.05, 0, 0)^T$. The coefficient subvectors $\beta_{strong}$ and $\beta_{weak}$ stand for the strong signals and weak signals in $\beta_0$, respectively. The sample size and noise level were chosen as $(n, \sigma) = (100, 0.4)$, while the dimension $p$ was set to 1000 and 5000. The rows of the $n \times p$ design matrix $X$ were sampled as independent and identically distributed copies from a multivariate normal distribution $N(0, \Sigma)$ with $\Sigma = (0.5^{|i-j|})_{1 \leq i,j \leq p}$. We applied all methods as in simulation study 1 to produce a sequence of sparse models and set $\lambda_0 = 0.01$ and $\gamma = 0.2$ for simplicity. The ideal procedure, which knows the true underlying sparse model in advance, was also used as a benchmark.

To compare these methods, we considered seven performance measures: the prediction error, the $L_q$-estimation loss with $q = 1, 2, \infty$, the number of false positives, and the numbers of false negatives for strong and weak signals.
The prediction error is defined as $E(Y - x^T\hat\beta)^2$, with $\hat\beta$ an estimate and $(x^T, Y)$ an independent observation; the expectation was calculated using an independent test sample of size 10,000. A false positive means a falsely selected noise covariate in the model, while a false negative means a missed true covariate. Table 2.1 summarizes the comparison results for all methods.

Table 2.1: Means and standard errors (in parentheses) of different performance measures by all methods in simulation study 2

p = 1000
Measure        DS           TDS          Lasso        Enet         ALasso      CDS         Oracle
PE (x10^2)     30.8 (0.6)   28.5 (0.5)   30.3 (0.5)   32.9 (0.7)   19.1 (0.1)  18.5 (0.1)  18.2 (0.1)
L1 (x10^2)     201.9 (5.4)  137.2 (3.4)  186.1 (5.0)  211.1 (6.0)  58.0 (1.1)  51.3 (0.6)  41.5 (0.9)
L2 (x10^2)     40.1 (0.7)   37.2 (0.7)   39.8 (0.7)   43.0 (0.8)   18.3 (0.3)  16.3 (0.2)  14.7 (0.3)
Linf (x10^2)   19.0 (0.5)   18.6 (0.4)   19.3 (0.5)   20.7 (0.5)   9.1 (0.3)   7.5 (0.2)   8.4 (0.2)
FP             44.4 (1.8)   5.5 (3.8)    36.3 (1.6)   44.1 (1.9)   0.5 (0.1)   0 (0)       0 (0)
FN.strong      0 (0)        0 (0)        0 (0)        0 (0)        0 (0)       0 (0)       0 (0)
FN.weak        5.3 (0.1)    5.9 (0.4)    5.4 (0.1)    5.3 (0.1)    6.0 (0.0)   6.0 (0.0)   0 (0)

p = 5000
Measure        DS           TDS          Lasso        Enet         ALasso      CDS         Oracle
PE (x10^2)     45.1 (1.1)   39.3 (1.1)   44.8 (1.1)   44.9 (1.1)   21.3 (0.6)  18.4 (0.1)  18.3 (0.1)
L1 (x10^2)     289.3 (6.4)  184.6 (4.5)  270.8 (6.8)  273.2 (6.9)  71.2 (2.1)  50.4 (0.7)  41.7 (1.1)
L2 (x10^2)     56.3 (1.1)   50.6 (1.1)   56.1 (1.1)   56.2 (1.1)   22.9 (0.9)  16.0 (0.2)  14.9 (0.4)
Linf (x10^2)   27.4 (0.7)   25.1 (0.6)   27.8 (0.7)   27.8 (0.7)   12.7 (0.7)  7.3 (0.2)   8.8 (0.3)
FP             60.6 (1.7)   7.0 (3.5)    53.5 (2.3)   53.8 (2.1)   1.0 (0.1)   0 (0)       0 (0)
FN.strong      0 (0)        0 (0)        0 (0)        0 (0)        0 (0)       0 (0)       0 (0)
FN.weak        5.2 (0.1)    6.0 (0.1)    5.4 (0.1)    5.3 (0.1)    5.9 (0.0)   6.0 (0.0)   0 (0)

We observe that most weak covariates were missed by all methods. This is reasonable since the weak signals are around the noise level, making it difficult to distinguish them from the noise covariates. However, the constrained Dantzig selector outperformed the other methods in terms of the other prediction and estimation measures, and followed the ideal procedure very closely in both the $p = 1000$ and $p = 5000$ cases. In particular, the $L_\infty$-estimation loss for the constrained Dantzig selector was similar to that for the oracle procedure, confirming the tight bounds on this loss in Theorems 1 and 2. As the dimension grew from 1000 to 5000, the constrained Dantzig selector performed similarly, while the other methods suffered from the higher dimensionality. In particular, the thresholded Dantzig selector has been shown to improve over the Dantzig selector, but it was still outperformed by the adaptive lasso and the constrained Dantzig selector in this study, revealing the necessity of introducing more flexible constraints instead of simple thresholding.

2.3.3 Real data analysis

We applied the same methods as in Section 2.3.2 to the diabetes data set studied in Efron et al. (2004). This data set consists of 442 diabetes patients with a quantitative measure of disease progression one year after baseline as the response variable, together with ten baseline variables: sex, age, body mass index, average blood pressure, and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). Efron et al. (2004) considered the quadratic model with interactions, adding the squares of all baseline variables except the dummy variable sex, and all interactions between each pair of the ten baseline variables. Along with the intercept, this results in a linear regression model with 65 predictors.

We randomly split the full data set 100 times into a training set of 400 samples and a validation set of 42 samples. For each splitting of the data set, we applied all methods to the training
set, and calculated the prediction error, as defined in Section 2.3.2, on the validation set. We chose $\lambda_0 = 0.001$ and $\gamma = 0.02$ for the constrained Dantzig selector and used tenfold cross-validation to select $\lambda_1$. Minimizing the prediction error gave the best model for each method. The tuning parameters for the other methods were also selected by cross-validation. The means and standard deviations of their prediction errors over 100 random splittings are presented in Table 2.2. We see that the constrained Dantzig selector outperformed the other methods. The relatively large standard deviations indicate the difficulty of prediction for this data set.

Table 2.2: Means and standard deviations of prediction errors in the diabetes data analysis

PE      DS       TDS      lasso    Enet     Alasso   CDS
mean    2988.8   2900.8   2948.4   2964.5   2888.5   2881.5
sd      622.0    545.8    527.1    538.9    535.6    534.0

We also calculated the median model size for each method: 10 for the Dantzig selector, 10 for the thresholded Dantzig selector, 16 for the lasso, 16 for the elastic net, 10 for the adaptive lasso, and 9 for the constrained Dantzig selector. The constrained Dantzig selector produced the sparsest model with improved prediction accuracy. The percentage of times each predictor was selected and the corresponding t-statistics over the 100 random splittings were investigated as well. The set of most frequently selected predictors (percentage > 90%) for the constrained Dantzig selector is a subset of those for the other methods. Besides, the t-statistics for this set of predictors are all of magnitude greater than 2. It is interesting that the interaction term between sex and age is significant, while the main effect age is insignificant for all methods.

2.4 Discussion

We have shown that the suggested constrained Dantzig selector can achieve convergence rates within a logarithmic factor of the sample size of the oracle rates in high dimensions, under a fairly weak assumption on the signal strength. Our work provides a partial answer to the open question of whether convergence rates involving a logarithmic factor of the dimensionality are optimal for regularization methods. It would be interesting to investigate such a phenomenon for more general regularization methods.

Our formulation of the constrained Dantzig selector uses the $L_1$-norm of the parameter vector. A natural extension of the method is to exploit a weighted $L_1$-norm of the parameter to allow for different regularization on different covariates. It would be interesting to investigate the behavior of these methods in more general model settings including generalized linear models and survival analysis. These problems are beyond the scope of the current paper and will be interesting topics for future research.

2.5 Appendix

2.5.1 Proof of Theorem 2.1

All the results in Theorems 2.1 and 2.2 will be shown to hold on a key event

$$E = \{\|n^{-1}X_1^T\varepsilon\|_\infty \leq \lambda_0 \ \text{and} \ \|n^{-1}X_2^T\varepsilon\|_\infty \leq \lambda_1\}, \qquad (2.7)$$

where $X_1$ is a submatrix of $X$ consisting of columns corresponding to $\mathrm{supp}(\beta_0)$ and $X_2$ consists of the remaining columns. Thus we will have the same probability bound in both theorems. The probability bound on the event $E$ in (2.7) can be easily calculated, using the classical Gaussian tail probability bound (see, for example, Dudley, 1999) and the Bonferroni inequality, as

$$\mathrm{pr}(E) \geq 1 - \{\mathrm{pr}(\|n^{-1}X_1^T\varepsilon\|_\infty > \lambda_0) + \mathrm{pr}(\|n^{-1}X_2^T\varepsilon\|_\infty > \lambda_1)\}$$
$$\geq 1 - \{s(2/\pi)^{1/2}\sigma\lambda_0^{-1}n^{-1/2}e^{-\lambda_0^2 n/(2\sigma^2)} + (p-s)(2/\pi)^{1/2}\sigma\lambda_1^{-1}n^{-1/2}e^{-\lambda_1^2 n/(2\sigma^2)}\}$$
$$= 1 - O\{sn^{-c_0^2/(2\sigma^2)}(\log n)^{-1/2} + (p-s)p^{-c_1^2/(2\sigma^2)}(\log p)^{-1/2}\}. \qquad (2.8)$$

Let $c = (c_0^2 \wedge c_1^2)/(2\sigma^2) - 1$, which is a sufficiently large positive constant since the two positive constants $c_0$ and $c_1$ are chosen large enough.
Recallthatpisunderstoodimplicitlyasmax(n,p) throughoutthepaper. Thusitfollowsfrom(2.8),s n,andn pthat pr(E)=1O n c . (2.9) Fromnowon,wederivealltheboundsontheeventE. Inparticular,inlightof(2.7)and 0 2B it is easy to verify that conditional on E, the true regression coefficient vector 0 satisfies the constrainedDantzigselectorconstraintsorinotherwords 0 liesinthefeasiblesetin(2.3). We first make a simple observation on the constrained Dantzig selector b =( b 1 ,..., b p ) T . Recallthatwithoutlossofgenerality,weassumesupp( 0 )={1,...,s}. Let 0 =( T 1 ,0 T ) T with each component of 1 being nonzero, and b =( b T 1 , b T 2 ) T with b 1 a subvector of b consisting of its first s components. Denote by =( T 1 , T 2 ) T = b 0 the estimation error, where 1 = b 1 1 and 2 = b 2 . It follows from the global optimality of b that k b 1 k 1 +k b 2 k 1 =k b k 1 k 0 k 1 =k 1 k 1 ,whichentailsthat k 2 k 1 =k b 2 k 1 k 1 k 1 k b 1 k 1 k b 1 1 k 1 =k 1 k 1 . (2.10) Wewillseethatthisbasicinequalityk 2 k 1 k 1 k 1 playsakeyroleinthetechnicalderivations of both Theorem 1 and 2. Equipped with this inequality and conditional onE, we are now able tostartthederivationofallresultsinTheorem1asfollows. Chapter2. Constrained Dantzig selector 27 The main idea is to prove its sparsity property first which would be presented in the next para- graph. Then, with the control on the number of false positive and false negative, we derive an upper bound for theL 2 -estimation loss using the conclusion in Lemma 3.1 of Cand` es and Tao (2007). Resultsonothertypesoflossesfollowaccordingly. (1)Sparsity. RecallthatundertheassumptionofTheorem1, 0 liesinitsfeasiblesetconditional onE. Therefore,bythesameargumentsoftheproofforTheorem1.1inCand` esandTao(2007), we can establish the same bound for L 2 -estimation loss, (Cs) 1/2 1 with C some constant ac- cording to (1.10) of Cand` es and Tao (2007), holding with probability exceeding 1O(n c ). ThenmakinguseofthethresholdingfeatureoftheconstrainedDantzigselector,onecanobtain thatitsnumberoffalselydiscoveredsignisboundedbyCs( 1 / ) 2 . Observingtheassumption C 1/2 (1+ 1 / 0 ) 1 ,weknowCs( 1 / ) 2 s(1+ 1 / 0 ) 2 <s. (2) L 2 -estimation loss. We exploit the technical tool of Lemma 3.1 in Cand` es and Tao (2007) to analyze the behavior of the estimation error = b 0 . Let 0 1 be a subvector of 2 consisting of the s largest components in magnitude, 3 =( T 1 ,( 0 1 ) T ) T , and X 3 a submatrix of X consisting of columns corresponding to 3 . We emphasize that 3 covers all nonzero componentsof sincethenumberoffalselydiscoveredsignissmallerthans,asshowedinthe previousparagraph. Therefore,k 3 k q =k k q forallq> 0. Inviewoftheuniformuncertainty principlecondition(2.4),anapplicationofLemma3.1inCand` esandTao(2007)gives k 3 k 2 (1 ) 1 kn 1 X T 3 X k 2 +✓ (1 ) 1 s 1/2 k 2 k 1 . (2.11) Ontheotherhand,fromthebasicinequality(2.10)itiseasytosees 1/2 k 2 k 1 s 1/2 k 1 k 1 k 3 k 2 . Substitutingitinto(2.11)andsimplifyingthatleadto k k 2 =k 3 k 2 (1 ✓ ) 1 kn 1 X T 3 X k 2 (2.12) Hence, to develop a bound for the L 2 -estimation loss it suffices to find an upper bound for kn 1 X T 3 X k 2 . Denote byA 1 ,A 2 andA 3 the index sets of correctly selected variables, missed true variables and falsely selected variables respectively. Let A 23 = A 2 [ A 3 . Then, due to the thresholding feature of the constrained Dantzig selector, one can obtain from constrained Dantzig selector constraints that |n 1 X T A1 (y X )| 0 , |n 1 X T A2 (y X )| 1 and |n 1 X T A3 (yX )| 0 where is understood as componentwise smaller than. 
Conditional onE,substitutingybyX 0 +"andapplyingthetriangularinequalitygivesus kn 1 X T A1 X k 1 2 0 andkn 1 X T A23 X k 1 0 + 1 . (2.13) We now make use of the technical result on sparsity. Since A 23 denotes the index set of false positives and false negatives, the number of its component is also bounded byCs( 1 / ) 2 with probability exceeding 1O(n c ). Therefore, according to (2.13), we havekn 1 X T 3 X k 2 2 Chapter2. Constrained Dantzig selector 28 kn 1 X T A1 X k 2 2 +kn 1 X T A23 X k 2 2 skn 1 X T A1 X k 2 1 +Cs( 1 / ) 2 kn 1 X T A23 X k 2 1 4s 2 0 + Cs( 1 / ) 2 ( 0 + 1 ) 2 . Substitutingthisinequalityinto(2.12)yields k k 2 (1 ✓ ) 1 4s 2 0 +Cs( 1 / ) 2 ( 0 + 1 ) 2 1/2 . Sinceweassume C 1/2 (1+ 1 / 0 ) 1 ,itsuggeststhatC( 1 / ) 2 ( 0 + 1 ) 2 2 0 . Therefore, wecanconcludethatk k 2 (1 ✓ ) 1 (5s 2 0 ) 1/2 . (3)Otherlosses. Applyingthebasicinequality(2.10),weestablishboundfortheL 1 -estimation lossk k 1 =k 1 k 1 +k 2 k 1 2k 1 k 1 2s 1/2 k 1 k 2 2s 1/2 k k 2 < 2(1 ✓ ) 1 s(5 2 0 ) 1/2 . Regarding theL 1 loss, we additionally assume that> (1 ✓ ) 1 (5s 2 0 ) which can lead to sign consistence, sgn( b ) = sgn( 0 ), in view of the L 2 inequality above. Therefore, by the constrained Dantzig selector constraints we havekn 1 X T 1 (yX 1 b 1 )k 1 0 henceforth kn 1 X T 1 (" X 1 1 )k 1 0 . Then conditional on E, it follows from the triangular inequal- ity thatkn 1 X T 1 X 1 1 k 1 2 0 . Hence, k k 1 = k 1 k 1 2k(n 1 X T 1 X 1 ) 1 k 1 0 , which completestheproof. 2.5.2 ProofofTheorem2.2 We continue to use the technical setup and notation introduced in the proof of Theorem 2.1. ResultsareparalleltothatinTheorem1butpresentedintheasymptoticmanner. Thesimilarity lies in the rationale of the proof as well. The key element is to derive the sparsity and then construct an inequality for L 2 -estimation loss through the bridge n 1 kX k 2 2 . Inequalities for othertypesoflossarebuiltuponthebasisofitafterwards. (1)Sparsity. UnderCondition1ofourpaper,therestrictiveeigenvalueassumptionRE(s,m,1) inBickeletal. (2009)isalsosatisfied. Thus,adoptingthetechnicalframeworkoftheprooffor Theorem 7.1 in Bickel et al. (2009), we can obtain the same L 2 oracle inequality k k 2 (C m s) 1/2 1 with C m some positive constant dependent on m. Additionally, thanks to the thresholding feature of the constrained Dantzig selector, it could be shown that the number of falsely discovered sign is bounded by C m s( 1 / ) 2 and it is smaller than s(1 + 1 / 0 ) 2 since C 1/2 m (1 + 1 / 0 ) 1 . Now we go through the proof of Theorem 7.1 in Bickel et al. (2009)inamorecautiousmannerandmakesomeimprovementsinsomestepswiththehelpof theboundonthenumberoffalsepositivesandfalsenegatives. (2) L 2 -estimation loss. According to Condition 1, we have a lower bound for n 1 kX k 2 2 . It Chapter2. Constrained Dantzig selector 29 is very natural to derive an upper bound for it and build up an inequality related to the L 2 - estimationloss. By(2.13)wehave n 1 kX k 2 2 k n 1 X T A1 X k 1 k A1 k 1 +kn 1 X T A23 X k 1 k A23 k 1 2 0 k A1 k 1 +( 0 + 1 )k A23 k 1 . (2.14) Since the number of component in A 23 is bounded by C m s( 1 / ) 2 , applying the Cauchy- Schwarzinequalityto(2.14)yieldsn 1 kX k 2 2 2 0 k A1 k 1 +( 0 + 1 )k A23 k 1 2 0 s 1/2 k A1 k 2 + C 1/2 m ( 0 + 1 )s 1/2 1 / k A23 k 2 . This gives an upper bound for n 1 kX k 2 2 . 
Combining this withCondition1,whichgivesalowerboundonit,leadsto 2 1 2 (k A1 k 2 2 +k A23 k 2 2 ) 2 0 s 1/2 k A1 k 2 +C 1/2 m ( 0 + 1 )s 1/2 1 / k A23 k 2 (2.15) Consider (2.15) as a two dimensional space with respect to k A1 k 2 and k A23 k 2 . Then the quadratic inequality (2.15) quantifies a circular area centering at 2 2 0 s 1/2 , 2 C 1/2 m ( 0 + 1 )s 1/2 1 / . The termk A1 k 2 2 +k A23 k 2 2 is nothing but the squared distance between the point in this circular area and the origin. One can easily identify the largest squared distance whichisalsotheupperboundfortheL 2 -estimationloss k k 2 2 =k A1 k 2 2 +k A23 k 2 2 4 4 ⇥ 2 0 s 1/2 2 + C 1/2 m ( 0 + 1 )s 1/2 1 / 2 ⇤ . Withtheassumption C 1/2 m (1+ 1 / 0 ) 1 ,wecanshowthat C 1/2 m ( 0 + 1 )s 1/2 1 / 2 s 2 0 henceforth havingk k 2 = O( 2 s 1/2 0 ). This bound has significance improvement from logptolognintheultra-highdimensioncase. (3)Otherlosses. WiththeL 2 oracleinequalityinhand,onecanderivetheL 1 inequalitywithout anydifficulty:k k 1 s 1/2 k k 2 =O( 2 s 0 ). Forthepredictionloss,by(2.14)wehave n 1 kX k 2 2 2 0 s 1/2 k A1 k 2 +C 1/2 m ( 0 + 1 )s 1/2 1 / k A23 k 2 . Consider the problem minz=2 0 s 1/2 k A1 k 2 +C 1/2 m ( 0 + 1 )s 1/2 1 / k A23 k 2 subject to (2.15). It is simply a two-dimensional linear optimization problem with a circular area as its feasible space. One can easily find the minimum of z henceforth obtaining n 1 kX k 2 2 2 2 ⇥ 2 0 s 1/2 2 + C 1/2 m ( 0 + 1 )s 1/2 1 / 2 ⇤ =O( 2 s 2 0 ). RegardingtheL 1 inequality, itfollowsfromsimilarargumentsintheproofofTheorem1thatk k 1 2k(n 1 X T 1 X 1 ) 1 k 1 0 ork k 1 =O{k(n 1 X T 1 X 1 ) 1 k 1 0 },whichconcludestheproof. Chapter3 ApplicationsofVariableSelectionto GeneticData 3.1 Introduction Genome-wide association studies (GWAS) usually involves millions of single nucleotide poly- morphisms (SNPs). It can provide us a large number of genome-wide significant risk variants sometimes but this also presents us difficulty interpreting them. A possible solution to such problem is to build up a prediction model with SNPs and some other control variables, such as ageandsex. UnlikeGWAstudies,thepredictionmodelconsidersalltheSNPsoralargesubset ofthemsimultaneously. OneadvantageofdoingsoisthatitcandetectSNPsthataremarginally unrelatedtothediseaseorquantitativetraitbutjointlyrelated. Atypicalexamplefollows. Consider a quantitative trait so that the outcome is continuous for simplicity. Denote the trait byy andweassumeitisdeterminedbythreeSNPswhoarecorrelatedwitheachothertosome extent. LetX 1 ,X 2 andX 3 bethenumberofminorallelesforeachofthreeSNPs,respectively. ButinthismotivationexampleweassumethatX 1 ,X 2 andX 3 followstandardGaussiandistri- butionforeasypresentation. Thepairwisecorrelationcorr(X i ,X j )=0.5forall1 i6=j 3. Then,weassumethisquantitativetraitisdeterminedbythethreeSNPsinthefollowingway. y =X 1 +X 2 X 3 +" where"isarandomerrorindependentofX 1 ,X 2 andX 3 . Itiseasytocalculatethatcov(y,X 1 )= cov(y,X 2 )=1 but cov(y,X 3 )=0. Therefore, we see that third SNP is jointly related to trait but marginally unrelated to it. The classic marginal association test is not able to detect such SNPsatall. 30 Chapter3. Application of variable selection in genetic data 31 FIGURE 3.1: Geneticdata Therefore, in this chapter, we propose to establish prediction models with variable selection methods. Actually,weknowgeneticdataisexactlyhigh-dimensionalaswediscussedinChapter 1. The design matrix for the genetic prediction model can be visualized as in Figure 3.1. 
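Returning to the motivating example above, the following short simulation is a minimal sketch of the point being made. The sign on X3, which is lost in the displayed equation, is taken here to be negative, since that is the only choice consistent with the stated covariances cov(y, X1) = cov(y, X2) = 1 and cov(y, X3) = 0 under pairwise correlation 0.5; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pairwise correlation 0.5 among the three standardized "SNP" variables
Sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
eps = rng.normal(size=n)

# Trait generated as y = X1 + X2 - X3 + eps (negative sign on X3 inferred
# from the stated covariances in the text)
y = X[:, 0] + X[:, 1] - X[:, 2] + eps

# Marginal (single-SNP) correlations: X3 looks unrelated to the trait
print(np.round([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)], 3))

# Joint model: ordinary least squares recovers all three effects, including X3
beta_hat, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
print(np.round(beta_hat, 3))   # approximately [0, 1, 1, -1]
```

The first print shows that the marginal correlation between X3 and the trait is essentially zero, so a single-SNP association test has no power for X3, while the joint fit recovers all three coefficients.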
There is a large literature on variable selection methods, but in this proposal we apply some of the most widely used ones, such as the Lasso of Tibshirani (1996), the adaptive Lasso of Zou (2006), SCAD of Fan and Li (2001), and the constrained Dantzig selector (CDS) of Kong et al. (2014+). Among all the aforementioned methods, the CDS performs best, as shown in the next few sections, so we focus on its stably selected SNPs. We refit a linear regression model on these few SNPs and find that it produces a relatively sparse model with small training error. Further investigation, such as the biological interpretation of these SNPs, will be one of the future works in this proposal. A flowchart of the application of variable selection methods to the JAPC data is given in Figure 3.2.

FIGURE 3.2: Flowchart of application of CDS on JAPC data

3.2 Preprocessing of the JAPC data

For the data cleaning and genomic control steps, we strictly follow the procedure in Cheng (2012), where the original data set was analyzed.

Data source

Our Japanese American Prostate Cancer (JAPC) data come from the Multiethnic Cohort (MEC). The MEC is a large population-based cohort study of more than 215,000 individuals from Hawaii and California. The prostate cancer cases were identified by cohort linkage to the Surveillance, Epidemiology, and End Results cancer registries covering Hawaii and California. Controls had no diagnosis of prostate cancer, were randomly selected from the random control pool of participants, and provided blood specimens for genetic analysis. Controls were frequency matched to cases by age (5-year categories) and ethnicity. Through January 1, 2008, there were 1,033 cases and 1,042 controls recruited.

Genotyping

Genotyping was conducted using the Illumina Human660W_Quad_v1 bead array at the Broad Institute (Cambridge, MA). Samples with DNA concentrations less than 18.8 ng/mL were not scanned (53 subjects). Samples were removed on the basis of the following exclusion criteria: (i) call rates less than 95% (5 subjects); (ii) ancestry outliers (21 subjects); and (iii) related samples (88 subjects). Single-nucleotide polymorphisms (SNPs) with minor allele frequencies less than 1% were removed. The final analysis included 528,023 SNPs evaluated in 2,075 samples.

Marginal association tests

Ancestry estimation, relatedness inference, and imputation are exactly the same as in Cheng (2012). Unconditional logistic regression was conducted for each SNP, adjusting for age and the first 10 principal components. The Manhattan plot and QQ plot are presented in Figure 3.3 and Figure 3.4, respectively. The genomic inflation factor is 1.02. The top 1,000 SNPs with the smallest p-values enter the second-step variable selection. The reason for keeping only the top 1,000 hits is to relieve the computational burden of variable selection. Indeed, such a screening procedure is widely used in other prediction modeling for genetic data; see, for example, Kooperberg et al. (2010) and He and Lin (2011). Fan and Lv (2008) first provided theoretical support for such a procedure and called it sure independence screening. They showed that, in the linear case, keeping a certain number of variables ranked highest by marginal correlation includes all true covariates associated with the response under some sensible conditions.
Later on, such variable screening approaches were extended to many other types of models, such as generalized linear models (Fan et al., 2009), nonparametric additive models (Fan et al., 2011), multi-index models (Zhu et al., 2011), and single-index hazard rate models (Gorst-Rasmussen and Scheike, 2013). Therefore, it is safe for us to use the top 1,000 hits for variable selection.

FIGURE 3.3: Manhattan plot

FIGURE 3.4: QQ plot

3.3 Prediction models

Denote by y_i the binary indicator for cases (coded as 1) and controls (coded as 2), i = 1, ..., 2075. The design matrix

  X = [intercept, age, 10 PCs, 1000 SNPs]    (3.1)

is of dimension 2075 × 1012, and the outcome, comprised of 1s (cases) and 2s (controls), is treated as continuous for simplicity even though it is categorical. To evaluate the performance of each method, we randomly split the data into training and test sets, with the former used to obtain the coefficient estimates; predicted outcomes for the corresponding test set are then calculated based on these estimates. Since the outcome is treated as continuous, we further categorize the fitted values into 1 and 2 with 1.5 as the cutoff. The classification error is counted as the number of mismatches between these categorized predicted outcomes and the true outcomes in the test set. We repeat this random splitting 100 times; each split has 900 cases and 900 controls in the training set, while the remaining 275 samples form the test set. The mean and standard deviation of the classification error for each method are presented in Table 3.1.

TABLE 3.1: Table of prediction results

                               CDS     Lasso   Alasso  SCAD
  mean classification error    28.4    32.7    39.9    44.3
  SD classification error      4.6     5.0     9.8     8.1
  mean # SNPs selected         477.1   529.0   387.1   388.2

We see from Table 3.1 that, among all methods, the CDS performs the best in terms of misclassification error. More interestingly, it utilizes the least number of SNPs to obtain such results. Recall that in Table 3.1 we randomly split the data set into training and test sets 100 times, and at each time different variables are selected. We define the stably selected SNPs as those selected more than 80 out of the 100 times in the analysis. We are mostly interested in these SNPs since they stably play a role in predicting the outcome. There are 4 stably selected principal components and 269 stably selected SNPs in our JAPC data, as listed in the table of the supplemental materials. The SNP rs16901966 on the well-known risk region 8q24 for prostate cancer is also included in this list.

3.4 Stably selected SNPs

As mentioned in the previous section, there are 269 stably selected SNPs and 4 stably selected principal components. We further fit an ordinary least squares (OLS) regression on the full data and find that the misclassification error can be controlled at 5.6%. Further investigation of these SNPs, such as validation in other prostate cancer data and biological interpretation, is the next step of my dissertation.

We then annotate those 269 SNPs with HaploReg. HaploReg is a tool for exploring annotations of the noncoding genome at variants on haplotype blocks, such as candidate regulatory SNPs at disease-associated loci. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state, their sequence conservation across mammals, and their effect on regulatory motifs.
For our JAPC data, we choose the LD threshold r² = 0.8 on the basis of the population group ASN of the 1000 Genomes Project and list all SNPs in those LD blocks. The chosen options and the DNase enrichment analysis results are presented in Figure 3.5 and Table 3.2, respectively. Further investigation of those results and of these 269 SNPs will be conducted in the future.

FIGURE 3.5: HaploReg v2 options

TABLE 3.2: DNase enrichment analysis results

  Cell type ID  Description                                              DNase Obs  Exp   Fold  p
  GM12878       B-lymphocyte, lymphoblastoid                             8          2.8   2.8   0.007891
  Jurkat        T lymphoblastoid derived from an acute T cell leukemia   7          2.6   2.7   0.015902
  GM18507       B-lymphocyte, lymphoblastoid                             5          1.8   2.8   0.03611

3.5 CDS on the full data set

We also apply the CDS method to the full data set instead of the randomly split subsets. As discussed in Chapter 2, the CDS estimator depends on a regularization parameter λ1, and we usually adopt cross-validation or a model selection criterion such as AIC or BIC to choose the best value of this parameter. In the application of CDS to the full data set, we do not choose an optimal value for it but instead plot the classification error against the model size, which is tuned by this parameter, in Figure 3.6. As shown in that figure, the more SNPs we use, or in other words the more information we have in hand, the more accurate our prediction model can be.

FIGURE 3.6: CDS applied on the full data (error rate versus model size)

Chapter 4

Interaction Pursuit with Feature Screening and Selection

4.1 Introduction

Chapter 2 introduces a new variable selection method called CDS, and its application to genetic data is presented in Chapter 3. However, neither the method itself nor its application considers interactions between variables, which are commonly believed to account for a non-negligible proportion of variation for many complex traits. In this chapter, we introduce a flexible interaction detection method, called interaction pursuit (IP), for interaction identification in ultra-high dimensions. The suggested method first reduces the number of interactions and main effects to a moderate scale by a new feature screening approach, and then selects important interactions and main effects in the reduced feature space using regularization methods. Compared with existing approaches, our method screens interactions separately from main effects and thus can be more effective in interaction screening. Moreover, the new method IP does not rely on the strong (weak) heredity assumption, that is, the assumption that an interaction between two variables should be included in the model only if both (either one) of the variables are (is) in the model, as existing interaction detection methods do. We present our ideas by focusing on the linear interaction model

  Y = β0 + ∑_{j=1}^{p} βj Xj + ∑_{k=1}^{p−1} ∑_{ℓ=k+1}^{p} γkℓ Xk Xℓ + ε,    (4.1)

where Y is the response variable, x = (X1, ..., Xp)^T is a p-vector of covariates Xj, β0 is the intercept, the βj and γkℓ are regression coefficients for main effects and interactions, respectively, and ε is the random error, independent of the Xj, with mean zero and finite variance. Denote by β0 = (β0,j)_{1≤j≤p} and γ0 = (γ0,kℓ)_{1≤k<ℓ≤p} the true regression coefficient vectors for main effects and interactions, respectively.
To ease the presentation, throughout the paper Xk Xℓ is referred to as an important interaction if its regression coefficient γ0,kℓ is nonzero, and Xk is called an active interaction variable if there exists some 1 ≤ ℓ ≠ k ≤ p such that Xk Xℓ is an important interaction.

Under the above model setting, we suggest a new approach, called interaction pursuit (IP), for interaction identification using the ideas of feature screening and selection. IP is a two-step procedure that first reduces the number of interactions and main effects to a moderate scale by a new feature screening approach, and then identifies important interactions and main effects in the reduced feature space, with interactions reconstructed based on the retained interaction variables, using regularization methods. A key innovation of IP is to screen interaction variables instead of interactions directly, so that the computational cost can be reduced substantially, from a factor of O(p²) to O(p). Our interaction screening step shares a similar spirit with the SIRI method proposed in Jiang and Liu (2014), in the sense of detecting interactions by screening interaction variables. An important difference, however, lies in that SIRI was proposed under the sliced inverse index model and its theory relies heavily on the normality assumption.

The main contributions of this paper are threefold. First, the proposed procedure is computationally efficient thanks to the idea of interaction variable screening. Second, we provide theoretical justifications of the proposed procedure under mild, interpretable conditions. Third, our procedure can deal with more general model settings without requiring the heredity or normality assumption, which provides more flexibility in applications. In particular, two key messages that we try to deliver in this paper are that a separate screening step designed for interactions can significantly improve the screening performance if one aims at finding important interactions, and that screening interaction variables can be more effective and efficient than screening interactions directly, due to noise accumulation.

The rest of this chapter is organized as follows. Section 4.2 introduces a new feature screening procedure for interaction models and investigates its theoretical properties. We exploit regularization methods to further select important interactions and main effects, and study the theoretical properties of variable selection, in Section 4.3. Section 4.4 demonstrates the advantages of our proposed approach through simulation studies and a real data example. We discuss some implications and extensions of our method in Section 4.5. All technical proofs and some additional simulation studies are provided in the Supplementary Material.

4.2 Interaction screening

We begin by considering the problem of feature screening in interaction models with ultra-high dimensions. Define three sets of indices

  I = {(k, ℓ): 1 ≤ k < ℓ ≤ p with γ0,kℓ ≠ 0},
  A = {1 ≤ k ≤ p: (k, ℓ) ∈ I or (ℓ, k) ∈ I for some ℓ},    (4.2)
  B = {1 ≤ j ≤ p: β0,j ≠ 0}.

The set I contains all important interactions and the set A consists of all active interaction variables, while the set B is comprised of all important main effects. We combine the sets A and B and define the set of important features as M = A ∪ B. As demonstrated in Section B of the Supplementary Material, the sets A, I, and M are invariant under affine transformations Xj_new = bj(Xj − aj) with aj ∈ R and bj ∈ R \ {0} for 1 ≤ j ≤ p. We aim at recovering the interactions in I and the variables in M, and thus there is no issue of identifiability.¹

¹ We would like to thank a referee for helpful comments on the issue of invariance.
4.2.1 A new interaction screening procedure

Without loss of generality, assume that EXj = 0 and EXj² = 1 for each random covariate Xj. To ensure model identifiability and interpretability, we impose the sparsity assumption that only a small portion of the interactions and main effects are important, with nonzero regression coefficients γkℓ and βj, in interaction model (4.1). Our goal is to effectively identify all important interactions I and important features M, and to efficiently estimate the regression coefficients in (4.1) and predict the future response. Clearly, I is a subset of all pairwise interactions constructed from the variables in A. Thus, as mentioned before, to recover the set of important interactions I we first aim at screening the interaction variables while retaining the active ones in set A.

Let us develop some insights into the problem of interaction screening by considering the following specific case of interaction model (4.1):

  Y = X1 X2 + ε,    (4.3)

where x is further assumed to be N(0, Σ) with covariance matrix Σ having diagonal entries 1 and off-diagonal entries −1 < ρ < 1. Simple calculations show that corr(Xj, Y) = 0 for each j. This entails that screening the main effects based on their marginal correlations with the response can easily miss the active interaction variable X1. An interesting observation, however, is that taking the squares of all variables leads to cov(X1², Y²) = 2 + 10ρ² and
More specifically, suppose we are given a sample (x i ,y i ) n i=1 of n independent and identically distributed (i.i.d.) observations from (x,Y) in interaction model (4.1). Observethatcorr(X 2 k ,Y 2 )= ! k /{var(Y 2 )} 1/2 ,where ! k =cov(X 2 k ,Y 2 )/ var(X 2 k ) 1/2 . (4.5) Denotebyb ! k theempiricalversionofthepopulationquantity! k byplugginginthecorrespond- ingsamplestatistics,basedonthesample(x i ,y i ) n i=1 . ThenthescreeningstepofIPisequivalent tothresholdingtheabsolutevaluesofb ! k ’s;thatis,weestimatethesetofactiveinteractionvari- ablesAas b A ={1 k p :|b ! k | ⌧ } (4.6) Chapter3. Application of variable selection in genetic data 42 forsomethreshold⌧> 0. Hereweusethesamenotation⌧ forsimplicity. Basedontheretained interactionvariablesin b A,wecanconstructallpairwiseinteractionsas b I = n (k,`):k,`2 b Aandk<` o . (4.7) Itisworthmentioningthat b I generallyprovidesanoverestimateofthesetofimportantinterac- tions I, in the sense that some interactions in the constructed set b I may be unimportant ones. Thisis,however,notanissueforthepurposeofinteractionscreeningandwillbeaddressedlater intheselectionstepofIP. For completeness, we also briefly describe our procedure for main effect screening. We adopt the SIS approach in Fan and Lv (2008) to screen unimportant main effects outside the set B; that is, we first calculate the marginal correlations between the original covariates X j and the responseY andthenkeeptheoneswithmagnitudeatorabovesomepositivethresholde ⌧ . Since we have assumed EX j =0 and EX 2 j =1 for each covariate X j , thresholding the marginal correlationbetweenX j andY isequivalenttothresholding! ⇤ j =E(X j Y). Weestimatetheset ofimportantmaineffectsB by b B = 1 j p :|b ! ⇤ j |e ⌧ , (4.8) where b ! ⇤ j is the sample version of the population quantity ! ⇤ j ande ⌧> 0 is some threshold. Finally the set of important features M can then be estimated as c M = b A[ b B. Although our approach to main effect screening is the same as SIS, the theoretical developments on the screening property for main effects are distinct from those in Fan and Lv (2008) due to the presenceofinteractionsinourmodel. 4.2.2 Surescreeningproperty We now turn our attention to the theoretical properties of the proposed screening procedure in IP. It is desirable for a feature screening procedure to possess the sure screening property Fan and Lv (2008), which means that all important variables are retained after screening with probability tending to one. We aim at establishing such a property for IP in terms of screening of both interactions and main effects. To this end, we need to impose some mild regularity conditions. Assumption 1. There exist constants 0 ⇠ 1 ,⇠ 2 < 1 such that s 1 = |I| = O(n ⇠ 1 ) and s 2 = |B| =O(n ⇠ 2 ),and| 0 |,k 0 k 1 ,k 0 k 1 =O(1)withk·k 1 denotingthevectorL 1 -norm. Assumption 2. There exist constants ↵ 1 ,↵ 2 ,c 1 > 0 such that for anyt> 0, P(|X j |>t) c 1 exp(c 1 1 t ↵ 1 ) for each 1 j p and P(|"|>t) c 1 exp(c 1 1 t ↵ 2 ), and var(X 2 j ) are uniformlyboundedawayfromzero. Chapter3. Application of variable selection in genetic data 43 Assumption3. Thereexistsomeconstants0 1 , 2 < 1/2andc 2 > 0suchthatmin k2A |! k | 2c 2 n 1 andmin j2B |! ⇤ j | 2c 2 n 2 . Condition 1 allows the numbers of important interactions and important main effects to grow with the sample sizen, and imposes an upper bound on the magnitude of true regression coef- ficients. See, for example, Cho and Fryzlewicz (2012) and Hao and Zhang (2014) for similar assumptions. 
Clearly, Condition 1 entails that the number of active interaction variables is at most2s 1 ,thatis,|A| 2s 1 . ThefirstpartofCondition2isanusualassumptiontocontrolthetailbehaviorofthecovariates anderror,whichisimportantforensuringthesurescreeningpropertyofourprocedure. Similar assumptions have been made in such work as Fan et al. (2011), Barut et al. (2012), and Chang et al. (2013). The scenario of ↵ 1 = ↵ 2 =2 corresponds to the case of sub-Gaussian covariates and error. The class of sub-Gaussian distributions includes distributions with bounded support or light-tailedness. The second part of Condition 2 is a mild requirement on the variances of squaredcovariates. Condition 3 puts some reasonable constraints on the minimum marginal correlations, through different forms, for active interaction variables and important main effects, respectively. It is analogous to Condition 3 in Fan and Lv (2008), and can be understood as an assumption on the minimum signal strength in the feature screening setting. Smaller constants 1 and 2 cor- respond to stronger marginal signals. This condition is crucial for ensuring that the marginal utilities carry enough information about the active interaction variables and important main ef- fects. Under these conditions, the following theorem shows that the sample estimates of the marginal utilities are sufficiently close to the population ones with significant probability, and establishesthesurescreeningpropertyforbothinteractionandmaineffectscreening. Theorem 4.2. (a) Under Conditions 1–2, if 0 2 1 +4⇠ 1 < 1, 0 2 1 +4⇠ 2 < 1, and E(Y 4 )= O(1), then for anyC> 0, there exists some constantC 1 > 0 depending onC such that P( max 1 k p |b ! k ! k |Cn 1 )=o(n C1 ) (4.9) forlogp =o(n ↵ 1⌘ ) with⌘ =min{(12 1 4⇠ 2 )/(8+↵ 1 ), (12 1 4⇠ 1 )/(12+↵ 1 )}. (b)UnderConditions1–2,if0 2 2 +2⇠ 1 < 1,0 2 2 +2⇠ 2 < 1,andE(Y 2 )=O(1),then for anyC> 0, there exists some constantC 2 > 0 depending onC such that P(max 1 j p |b ! ⇤ j ! ⇤ j |Cn 2 )=o(n C2 ) (4.10) forlogp =o(n ↵ 1⌘ 0 ) with⌘ 0 =min{(12 2 2⇠ 2 )/(4+↵ 1 ),(12 2 2⇠ 1 )/(6+↵ 1 )}. Chapter3. Application of variable selection in genetic data 44 (c) Under Conditions 1–3 and the choices of ⌧ = c 2 n 1 ande ⌧ = c 2 n 2 , if 0 ⇠ 1 ,⇠ 2 < min{1/4 1 /2,1/2 2 } andE(Y 4 )=O(1), then we have P ⇣ I⇢ b I and M⇢ c M ⌘ =1o ⇣ n min{C1,C2} ⌘ (4.11) forlogp =o(n ↵ 1 min{⌘,⌘ 0 } )withconstantsC 1 andC 2 givenin(4.9)and (4.10),respectively. In addition, it holds that P ⇣ | b I| O{n 4 1 2 max (⌃ ⇤ )} and| c M| O{n 2 1 max (⌃ ⇤ )+n 2 2 max (⌃ )} ⌘ =1o ⇣ n min{C1,C2} ⌘ , (4.12) where max (·) denotes the largest eigenvalue, ⌃ =cov(x), and ⌃ ⇤ =cov(x ⇤ ) for x ⇤ = (X ⇤ 1 ,···,X ⇤ p ) T withX ⇤ k =(X 2 k EX 2 k )/{var(X 2 k )} 1/2 . Comparing the results from the first two parts of Theorem 4.2 on interactions and main effects, respectively, we see that interaction screening generally requires more restrictive assumption on dimensionality p. This reflects that the task of interaction screening is intrinsically more challenging than that of main effect screening. In particular, when ↵ 1 =2, IP can handle ultra-highdimensionalityupto logp =o ⇣ n min{(1 2 1 4⇠ 2)/5,(1 2 1 4⇠ 1)/7,(1 2 2 2⇠ 2)/3,(1 2 2 2⇠ 1)/4} ⌘ . (4.13) ItisworthmentioningthatbothconstantsC 1 andC 2 intheprobabilitybounds(4.9)–(4.10)can be chosen arbitrarily large without affecting the order of p and ranges of constants 1 and 2 . 
Wealsoobservethatstrongermarginalsignalstrengthforinteractionvariablesandmaineffects, intermsofsmallervaluesof 1 and 2 ,canenableustotacklehigherdimensionality. ThethirdpartofTheorem4.2showsthatIPenjoysthesurescreeningpropertyforbothinterac- tion and main effect screening, and admits an explicit bound on the size of the reduced model afterscreening. Morespecifically,anupperboundofthereducedmodelsizeiscontrolledbythe choicesofboththresholds⌧ ande ⌧ ,andthelargesteigenvaluesofthetwopopulationcovariance matrices⌃ ⇤ and⌃ . If we assume max (⌃ ⇤ )= O(n ⇠ 3 ) and max (⌃ )= O(n ⇠ 4 ) for some con- stants⇠ 3 ,⇠ 4 0,thenwithoverwhelmingprobabilitythetotalnumberofinteractionsandmain effects in the reduced model is at most of a polynomial order of sample sizen. In practice, one may choose the number of retained variables in a screening procedure asn 1 or [n/(logn)] depending on the available sample sizen, following the suggestion in Fan and Lv (2008). It is worth pointing out that our result is weaker than that in Fan and Lv (2008) in terms of growth ofdimensionality,whereonecanallowlogp =o(n 1 2 2 ). Thisismainlybecausetheyconsid- ered linear models without interactions, indicating the intrinsic challenges of feature screening Chapter3. Application of variable selection in genetic data 45 in presence of interactions. Moreover, our assumptions on the distributions for covariates and errorsaremoreflexible. TheresultsinTheorem4.2canbeimprovedinthecasewhenthecovariatesX j ’sandresponseY areuniformlybounded. Anapplicationofproofsfor(4.9)–(4.10)inSectionDofSupplementary Materialyields P ✓ max 1 k p |b ! k ! k |c 2 n 1 ◆ pC 3 exp(C 1 3 n 1 2 1 ), P ✓ max 1 j p |b ! ⇤ j ! ⇤ j |c 2 n 2 ◆ pC 3 exp(C 1 3 n 1 2 2 ), whereC 3 issomepositiveconstant. Inthiscase,IPcanhandleultra-highdimensionalitylogp = o(n ⇠ )with⇠ =min{12 1 ,12 2 }. 4.3 Interactionselection 4.3.1 Interactionmodelsinreducedfeaturespace We now focus on the problem of interaction and main effect selection in the reduced feature space identified by the screening step of IP. To ease the presentation, we rewrite interaction model(4.1)inthematrixform y = 0 1+ e X✓ +", (4.14) where y =(y 1 ,···,y n ) T is the response vector, ✓ =(✓ 1 ,···,✓ e p ) T is a parameter vector consisting ofe p = p(p+1)/2 regression coefficients j and k` , e X is the correspondingn⇥ e p augmented design matrix incorporating the covariate vectors for X j ’s and their interactions in columns, and " is the error vector. Hereafter, for simplicity we assume that the response and each column of e X are de-meaned and thus 0 =0. Denote by b A = {k 1 ,···,k p1 } and b B = {j 1 ,···,j p2 } the sets of retained interaction variables and main effects, respectively, andH a subsetof{1,··· ,e p}givenbythefeaturesin c M = b A[ b Bandconstructedinteractionsin b I based on b A as defined in (4.7). To estimate the true value ✓ 0 =(✓ 0,1 ,···,✓ 0,e p ) T of the parameter vector ✓ , we can consider the reduced feature space spanned by the q=2 1 p 1 (p 1 1) +p 3 columnsoftheaugmenteddesignmatrix e XinHwithp 3 thecardinalityof c M,thankstothesure screeningpropertyofIPshowninTheorem4.2. When the model dimensionality is reduced to a moderate scale q, one can apply any favorite variable selection procedure for effective selection of important interactions and main effects and efficient estimation of their effects. There is a large literature on the developments of vari- ous variable selection methods. Among all approaches, two classes of regularization methods, Chapter3. 
Application of variable selection in genetic data 46 the convex ones (e.g., Tibshirani (2996), Zou (2006), Candes and Tao(2007)) and the concave ones (e.g., Fan and Li (2001), Zhang (2010)), have been extensively investigated. To combine the strengths of both classes, Fan and Lv (2014) introduced the combinedL 1 and concave reg- ularization method. Such an approach can be understood as a coordinated intrinsic two-scale learning, in the sense that the Lasso component plays the screening role, in terms of reducing thecomplexityofintrinsicparameterspace,whereastheconcavecomponentplaystheselection role,intermsofrefinedestimation. Following Fan and Lv (2014), we consider the following combinedL 1 and concave regulariza- tionproblem min ✓ 2 R e p ,✓ H c=0 n (2n) 1 ky e X✓ k 2 2 + 0 k✓ ⇤ k 1 +kp (✓ ⇤ )k 1 o , (4.15) where ✓ H c denotes a subvector of ✓ given by components in the complement H c of the re- duced setH, 0 0 is the regularization parameter for the L 1 -penalty, p (✓ ⇤ )= p (|✓ ⇤ |)= (p (|✓ ⇤ 1 |),...,p (|✓ ⇤ e p |)) T with✓ ⇤ =(✓ ⇤ 1 ,...,✓ ⇤ e p ) T , andp (t) is an increasing concave penalty functionon[0,1)indexedbyregularizationparameter 0. Here,✓ ⇤ = D✓ =n 1/2 (ke x 1 k 2 ✓ 1 , ··· ,ke x e p k 2 ✓ e p ) T is the coefficient vector corresponding to the design matrix with each column rescaled to have L 2 -norm n 1/2 , where e X=(e x 1 ,··· ,e x e p ) and D = diag{D 11 ,··· ,D e pe p } with D mm =n 1/2 ke x m k 2 ,m=1,··· ,e p,isthescalematrix. Thecomputationalcostofsolvingthe regularizationproblem(4.15)inq dimensionsafterscreeningfromultra-highscaletomoderate scale is substantially reduced compared to that of solving the same problem in e p dimensions withoutscreening. Moreover,importanttheoreticalchallengesariseininvestigatingtheasymp- toticpropertiesoftheresultingregularizedestimatorforIP.FanandLv(2014)consideredlinear models with deterministic design matrix and no interactions, whereas we need to study the in- teraction model with random design matrix. The presence of both interactions and additional randomnessrequiresmoredelicateanalyses. Weremarkthat(4.15)doesnotautomaticallyenforcetheheredityconstraint. Ifsuchconstraint is desired, one can always employ existing methods, such as those in Yuan et al. (2009) and Bienetal. (2013),intheselectionstepofIPtoachievethisgoal. 4.3.2 Asymptoticpropertiesofinteractionandmaineffectselection Before presenting the results on interaction and main effect selection, we state some mild reg- ularity conditions that are needed in our analysis. Without loss of generality, assume that the first s = k✓ 0 k 0 components of the true regression coefficient vector ✓ 0 in (4.14) are nonzero. Throughout the paper, the regularization parameter for the L 1 component is fixed to be 0 = e c 0 {(logp)/n ↵ 1↵ 2/(↵ 1+2↵ 2) } 1/2 withe c 0 somepositiveconstant. Someinsightsintothischoiceof Chapter3. Application of variable selection in genetic data 47 0 willbeprovidedlater. Denotebyp H, (t)=2 1 { 2 ( t) 2 + },t 0,thehard-thresholding penalty,where(·) + denotesthepositivepartofanumber. Assumption 4. There exist some constants 0 ,,L 1 ,L 2 > 0 such that with probability 1a n satisfyinga n =o(1),itholdsthatmin k k2=1,k k0<2s n 1/2 k e X k 2 0 , min 6=0,k 2k1 7k 1k1 n n 1/2 k e X k 2 /(k 1 k 2 _k e 2 k 2 ) o for =( T 1 , T 2 ) T 2 R e p with 1 2 R s and e 2 a subvector of 2 consisting of the s largest componentsinmagnitude,andD mm ’sareboundedbetweenL 1 L 2 . Assumption 5. 
The concave penalty satisfies that p (t) p H, (t) on [0, ], p 0 {(1c 3 ) } min{ 0 /4,c 3 } for some constant c 3 2 [0,1), and p 00 (t) is decreasing on [0,(1 c 3 ) ]. Moreover,min 1 j s |✓ 0,j |>L 1 1 max{(1c 3 ), 2L 2 1 0 p 1/2 (1)}withp (1)= lim t!1 p (t). Condition 4 is similar to Condition 1 in Fan and Lv (2014) for the case of deterministic design matrix, except that the design matrix is now random in our setting and also augmented with interactions. WeprovideinSection4.3.3somesufficientconditionsensuringthatCondition4holds. Condi- tion5putssomebasicconstraintsontheconcavepenaltyp (t)asinFanandLv(2014). Under these regularity conditions, the following theorem presents the selection properties of the IP estimator b ✓ =( b ✓ 1 ,··· , b ✓ e p ) T including an explicit bound on the number of falsely discovered signs FS( b ✓ )= |{1 m e p : sgn( b ✓ m )6= sgn(✓ 0,m )}|, which provides a stronger measure on variableselectionthanthetotalnumberoffalsepositivesandfalsenegatives. Theorem 4.3. Assume that the conditions of part c) of Theorem 4.2 and Conditions 4–5 hold, logp =o{n ↵ 1↵ 2/(↵ 1+2↵ 2) }with↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) 1,andp (t)iscontinuouslydifferentiable. Then the global minimizer b ✓ of (4.15) has the hard-thresholding property that each component is either zero or of magnitude larger than (1c 3 ) , and with probability at least 1a n o(n min{C1,C2} +p c4 ), it satisfies simultaneously that n 1/2 e X( b ✓ ✓ 0 ) 2 =O( 1 0 s 1/2 ), b ✓ ✓ 0 d =O( 2 0 s 1/d ),d2 [1,2], FS( b ✓ )=O 4 ( 0 / ) 2 s , and furthermore sgn( b ✓ ) = sgn(✓ 0 ) if 56(1c 3 ) 1 2 0 s 1/2 , wherec 4 is some positive constant. Moreover, the same results hold with probability at least 1a n o(p c4 ) for the regularized estimator b ✓ without prescreening, that is, without the constraint✓ H c = 0 in (4.15). The results in Theorem 4.3 also apply to the regularized estimator with p 1 = p 2 = p and q = e p = p(p + 1)/2, that is, without any screening of variables. Theorem 4.3 shows that Chapter3. Application of variable selection in genetic data 48 if the tuning parameter satisfies 0 / ! 0, then the number of falsely discovered signs FS( b ✓ ) is of order o(s) and thus the false sign rate FS( b ✓ )/s is asymptotically vanishing with probabilitytendingtoone. Wealsoobservethattheboundsforpredictionandestimationlosses areindependentofthetuningparameter fortheconcavepenalty. AsshowninTheorem4.3,theregularizationparameterfortheL 1 component 0 =e c 0 {(logp)/n ↵ 1↵ 2/(↵ 1+2↵ 2) } 1/2 plays a crucial role in characterizing the rates of convergence for the regularized estimator b ✓ . Such a parameter basically measures the maximum noise level in interaction models. In par- ticular, the exponent ↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) is a key parameter that reflects the level of difficulty in the problem of interaction selection. This quantity is determined by three sources of heavy- tailedness: covariatesthemselves,theirinteractions,andtheerror. Tosimplifythetechnicalpre- sentation,inthispaperwehavefocusedonthemorechallengingcaseof↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) 1. Such a scenario includes two specific cases: 1) sub-Gaussian covariates and sub-Gaussian er- ror, that is, ↵ 1 = ↵ 2 =2 and 2) sub-Gaussian covariates and sub-exponential error, that is, ↵ 1 =2,↵ 2 =1. We remark that in the lighter-tailed case of ↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) > 1, one can simply set 0 =e c 0 {(logp)/n} 1/2 and the results in Theorem 4.3 can still hold for this choice of 0 byresortingtoLemma5.6andsimilarargumentsintheproofofTheorem4.3. 
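To make the two-step procedure concrete before turning to the verification of the conditions, the following is a minimal end-to-end sketch of IP under simplifying assumptions: covariates are taken to be standardized, the top [n/log n] variables are kept in each screening set as suggested in Section 4.2.2, and a plain cross-validated Lasso from scikit-learn is used as a stand-in for the combined L1 and concave regularization in (4.15). The function name ip_fit and all intermediate names are illustrative rather than part of the original implementation.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def ip_fit(X, y, d=None):
    """Two-step interaction pursuit sketch: interaction/main-effect screening, then selection."""
    n, p = X.shape
    d = d or int(n / np.log(n))
    Xs, ys = X ** 2, y ** 2

    # Screening step: omega_k compares squared covariates with the squared response (4.5),
    # while omega*_j is the usual marginal utility for main effects.
    omega = np.array([np.cov(Xs[:, k], ys)[0, 1] / np.std(Xs[:, k], ddof=1) for k in range(p)])
    omega_star = X.T @ y / n
    A_hat = set(np.argsort(-np.abs(omega))[:d])        # retained interaction variables
    B_hat = set(np.argsort(-np.abs(omega_star))[:d])   # retained main effects
    M_hat = sorted(A_hat | B_hat)
    I_hat = list(combinations(sorted(A_hat), 2))       # reconstructed candidate interactions (4.7)

    # Selection step on the reduced feature space: main effects in M_hat plus
    # interactions built from A_hat; Lasso is used here in place of L1 + concave.
    Z = np.column_stack([X[:, M_hat]] + [X[:, k] * X[:, l] for k, l in I_hat])
    fit = LassoCV(cv=5).fit(Z, y)
    coefs = fit.coef_
    main_sel = [M_hat[i] for i in range(len(M_hat)) if coefs[i] != 0]
    inter_sel = [I_hat[i] for i in range(len(I_hat)) if coefs[len(M_hat) + i] != 0]
    return main_sel, inter_sel
```

In practice the selection step would solve (4.15) with a concave penalty such as SICA; the Lasso surrogate above only conveys the structure of the reduced-space fit.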
4.3.3 VerificationofCondition4 SinceCondition4isakeyassumptionforprovingTheorem4.3,weprovidesomesufficientcon- ditionsthatensuresthisassumptionontheaugmentedrandomdesignmatrix e X=(e x 1 ,··· ,e x e p ). Denote by e ⌃ the population covariance matrix of the augmented covariate vector consisting of pmaineffectsX j ’sandp(p1)/2interactionsX k X ` ’s. Assumption6. ThereexistssomeconstantK> 0suchthatfor =( T 1 , T 2 ) T 2 R e p , min k k2=1,k k0<2s T e ⌃ K and min 6=0,k 2k1 7k 1k1 T e ⌃ / ⇣ k 1 k 2 _k e 2 k 2 ⌘ K, where 1 2 R s and e 2 isasubvectorof 2 consistingoftheslargestcomponentsinmagnitude. Condition 6 is satisfied if the smallest eigenvalue of e ⌃ is assumed to be bounded away from zero. Such a condition is in fact much weaker than the minimum eigenvalue assumption, since itisthepopulationversionofamildsparseeigenvalueassumptionandtherestrictedeigenvalue assumption. Thefollowingtheoremshowsthatundersomemildassumptions,Condition4holds forthefullaugmenteddesignmatrix e Xandthusholdsnaturallyforanyn⇥ q sub-designmatrix withq e p. Chapter3. Application of variable selection in genetic data 49 Theorem 4.4. Assume that Condition 6 holds, there exist some constants ↵ 1 ,c 1 > 0 such that for anyt> 0, P(|X j |>t) c 1 exp(c 1 1 t ↵ 1 ) for each j, s = O(n ⇠ 0 ), and logp = o(n min{↵ 1/4,1} 2⇠ 0 ) with constant 0 ⇠ 0 < min{↵ 1 /8,1/2}. Then Condition 4 holds with n min{↵ 1/4,1} 2⇠ 0 =O(loga n ). 4.4 Numericalstudies Inthissection,wedesigntwosimulationexamplestoverifythetheoreticalresultsandexamine the finite-sample performance of the suggested approach IP. We also present an analysis of a prostatecancerdataset. 4.4.1 Featurescreeningperformance We start with comparing the IP with several recent feature screening procedures: the SIS, DC- SISLietal. (2012),andSIRI.TheSIRIisaniterativeprocedurethatisdesignedspecificallyfor detecting interactions, while both SIS and DC-SIS are non-iterative procedures that are not di- rectly for interaction screening. In particular, the original SIS is only for main effect screening. AlthoughDC-SIScandetectmaineffectsandinteractionvariables,itdoesnotdistinguishthese two types of variables. For a fair comparison, we thus construct a set of all possible pairwise interactions using the variables recruited by SIS or DC-SIS. By doing this, the strong heredity assumption is enforced. We name the resulting procedures as SIS2 and DC-SIS2 to distinguish them from the original SIS and DC-SIS. To align with all other screening procedures, we also implementSIRIinanon-iterativefashionbyrankingvariablesaccordingtotheirmarginalutil- ities of SIRI and then keeping the top ones. This strategy is equivalent to the initial screening stepdescribedinSection2.3ofJiangandLiu(2014). As mentioned in Section 4.2.2, we retain the top [n/(logn)] variables in each of sets b A and b B defined in (4.6) and (4.8), respectively. The features in the union set c M = b A[ b B are included asmaineffectsinthereducedinteractionmodelafterthescreeningstepofIP,whilevariablesin set b A are used to build interactions in the selection step of IP. To ensure a fair comparison, the numbers of variables kept in the screening procedures of SIS2 and DC-SIS2 are both equal to thecardinalityof c M,whichisupto2[n/(logn)]. Example 1 (Gaussian distribution). We consider the following four interaction models linking thecovariatesX j ’stotheresponseY: • M1(strongheredity): Y =2X 1 +2X 5 +3X 1 X 5 +" 1 , • M2(weakheredity): Y =2X 1 +2X 10 +3X 1 X 5 +" 2 , Chapter3. 
Application of variable selection in genetic data 50 TABLE 4.1: The percentages of retaining each important interaction or main effect, and all importantones(All)byallthescreeningmethodsoverdifferentmodelsandsettingsinExample 1. Method M1 M2 M3 M4 X 1 X 5 X 1 X 5 All X 1 X 10 X 1 X 5 All X 10 X 15 X 1 X 5 All X 1 X 5 X 10 X 15 All Setting1: (n,p,⇢ ) = (200,2000,0) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.09 0.09 1.00 1.00 0.02 0.02 0.02 0.02 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.88 0.88 1.00 1.00 0.04 0.04 0.15 0.16 0.03 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.67 0.67 1.00 1.00 0.13 0.13 0.34 0.29 0.13 IP 1.00 1.00 0.97 0.97 1.00 1.00 0.88 0.88 1.00 1.00 0.93 0.93 0.80 0.79 0.59 Setting2: (n,p,⇢ ) = (200,2000,0.5) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.15 0.15 1.00 1.00 0.01 0.01 0.01 0.04 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.85 0.85 1.00 1.00 0.03 0.03 0.14 0.11 0.03 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.62 0.62 1.00 1.00 0.09 0.09 0.36 0.31 0.11 IP 1.00 1.00 0.96 0.96 1.00 1.00 0.85 0.85 1.00 1.00 0.84 0.84 0.75 0.84 0.59 Setting3: (n,p,⇢ ) = (300,5000,0) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.07 0.07 1.00 1.00 0.01 0.01 0.00 0.00 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.93 0.93 1.00 1.00 0.03 0.03 0.14 0.16 0.01 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.72 0.72 1.00 1.00 0.15 0.15 0.40 0.43 0.16 IP 1.00 1.00 0.97 0.97 1.00 1.00 0.90 0.90 1.00 1.00 0.96 0.96 0.83 0.82 0.65 Setting4: (n,p,⇢ ) = (300,5000,0.5) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.17 0.17 1.00 1.00 0.04 0.04 0.02 0.00 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.95 1.00 1.00 0.07 0.07 0.13 0.18 0.02 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.83 0.83 1.00 1.00 0.18 0.18 0.46 0.47 0.18 IP 1.00 1.00 0.99 0.99 1.00 1.00 0.90 0.90 1.00 1.00 0.94 0.94 0.79 0.85 0.64 • M3(anti-heredity): Y =2X 10 +2X 15 +3X 1 X 5 +" 3 , • M4(interactionsonly): Y =3X 1 X 5 +3X 10 X 15 +" 4 , where the covariate vector x=(X 1 ,···,X p ) T ⇠ N(0,⌃ ) with⌃ =(⇢ |j k| ) 1 j,k p and the errors " 1 ⇠ N(0,2.5 2 ), " 2 ⇠ N(0,2 2 ), " 3 ⇠ N(0,2 2 ), and " 4 ⇠ N(0,1.5 2 ) are independent of x. The first two models M1 and M2 satisfy the heredity assumption (either strong or weak), while the last two M3 and M4 do not obey such an assumption. Different levels of error vari- ance are considered since the difficulty of feature screening varies across the four models. A sampleofni.i.d. observationswasgeneratedfromeachofthefourmodels. Wefurtherconsid- ered four different settings of (n,p,⇢ ) = (200,2000,0), (200,2000,0.5), (300,5000,0), and (300,5000,0.5),andrepeatedeachexperiment100times. Table 4.1 lists the comparison results for all screening methods in recovering each important interactionormaineffect,andretainingallimportantones. FormodelM1satisfyingthestrong heredity assumption, all procedures performed rather similarly and all retaining percentages wereeitherequalorcloseto100%. ThesurescreeningprobabilitiesofIPinM1areslightlyless than one or those for other methods since we keep only [n/(logn)] variables in b A to construct interactionswhileothermethodsretainupto 2[n/(logn)]variables. BothDC-SIS2andIPper- formed similarly and improved over SIS2 and SIRI in model M2 in which the weak heredity assumptionholds. InmodelsM3andM4,IPsignificantlyoutperformedallothermethodsinde- tectinginteractionsacrossallfoursettings,showingitsadvantagewhentheheredityassumption Chapter3. 
Application of variable selection in genetic data 51 TABLE 4.2: The percentages of retaining each important interaction or main effect, and all importantones(All)byallthescreeningmethodsoverdifferentmodelsandsettingsinExample 2. Method M1 M2 M3 M4 X 1 X 5 X 1 X 5 All X 1 X 10 X 1 X 5 All X 10 X 15 X 1 X 5 All X 1 X 5 X 10 X 15 All Setting1: (n,p,⇢ ) = (200,2000,0) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.13 0.13 1.00 1.00 0.02 0.02 0.00 0.01 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.91 0.91 1.00 1.00 0.06 0.06 0.17 0.20 0.01 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.75 0.75 1.00 1.00 0.18 0.18 0.36 0.40 0.11 IP 1.00 1.00 0.96 0.96 1.00 1.00 0.95 0.95 1.00 1.00 0.97 0.97 0.80 0.83 0.63 Setting2: (n,p,⇢ ) = (200,2000,0.5) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.18 0.18 1.00 1.00 0.01 0.01 0.00 0.00 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.95 1.00 1.00 0.10 0.10 0.14 0.14 0.02 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.85 0.85 1.00 1.00 0.16 0.16 0.41 0.43 0.18 IP 1.00 1.00 0.95 0.95 1.00 1.00 0.97 0.97 1.00 1.00 0.96 0.96 0.80 0.81 0.61 Setting3: (n,p,⇢ ) = (300,5000,0) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.14 0.14 1.00 1.00 0.02 0.02 0.00 0.01 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.10 0.10 0.18 0.20 0.01 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.93 0.93 1.00 1.00 0.32 0.32 0.63 0.65 0.43 IP 1.00 1.00 0.98 0.98 1.00 1.00 0.98 0.98 1.00 1.00 0.97 0.97 0.86 0.85 0.71 Setting4: (n,p,⇢ ) = (300,5000,0.5) SIS2 1.00 1.00 1.00 1.00 1.00 1.00 0.09 0.09 1.00 1.00 0.02 0.02 0.00 0.01 0.00 DC-SIS2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.16 0.16 0.32 0.25 0.05 SIRI 1.00 1.00 1.00 1.00 1.00 1.00 0.92 0.92 1.00 1.00 0.34 0.34 0.70 0.57 0.36 IP 1.00 1.00 0.99 0.99 1.00 1.00 0.94 0.94 1.00 1.00 0.97 0.97 0.81 0.89 0.70 is not satisfied. We also observe that SIS2 failed to detect interactions, whereas SIRI improved over DC-SIS2 in these two models. These results suggest that a separate screening step should bedesignedspecificallyforinteractionstoimprovethescreeningaccuracy,whichisindeedone ofthemaininnovationsofIP. Example 2 (Non-Gaussian distribution). The second example adopts the same four models as in Example 1, but with different distributions for the covariates X j ’s and error ". We added an independently generated random variableU j to each covariateX j as given in Example 1 to obtain new covariates, whereU j ’s are i.i.d. and follow the uniform distribution on [0.5,0.5]. Theerrors" 1 ⇠ t(3)," 2 ⇠ t(4)," 3 ⇠ t(4),and" 4 ⇠ t(8)areindependentofx. The screening results of all the methods are summarized in Table 4.2. Similarly as in Example 1, IPoutperformedSIS2ininteractionscreening. Whentheheredityassumptionissatisfied, IP performed comparably to DC-SIS2. In particular, both approaches were better than SIS2 and SIRI when the weak heredity assumption is satisfied. The improvement of IP over all other methodsindetectinginteractionsbecamesubstantialwhentheheredityassumptionisviolated. Togainsomeinsightsintothesignalstrengthinthemodelsinvestigated, wecalculatetheover- all signal-to-noise ratio (SNR) for each model, which is defined as var(e x T ✓ )/var(") withe x the augmented covariate vector consisting of p main effects X j ’s and p(p 1)/2 interactions X k X ` ’s, " the error term, and ✓ given in model (4.14). In the high-dimensional setting, it is equally important to consider the individual SNRs for important interactions and main effects Chapter3. 
Application of variable selection in genetic data 52 TABLE 4.3: The overall and individual signal-to-noise ratios (SNRs) of each model in Exam- ples1and2. Example1 Example2 Settings1,3 Settings2,4 Settings1,3 Settings2,4 M1 X 1 0.64 0.64 1.44 1.44 X 5 0.64 0.64 1.44 1.44 X 1 X 5 1.44 1.45 3.52 3.53 Overall 2.72 2.81 6.41 6.59 M2 X 1 1.00 1.00 2.17 2.17 X 10 1.00 1.00 2.17 2.17 X 1 X 5 2.25 2.26 5.28 5.30 Overall 4.25 4.26 9.61 9.64 M3 X 10 1.00 1.00 2.17 2.17 X 15 1.00 1.00 2.17 2.17 X 1 X 5 2.25 2.26 5.28 5.30 Overall 4.25 4.32 9.61 9.76 M4 X 1 X 5 4.00 4.02 7.92 7.95 X 10 X 15 4.00 4.00 7.92 7.93 Overall 8.00 8.02 15.84 15.88 inthemodel,byreplacing var(e x T ✓ )withthevarianceofeachindividualterm. Theoveralland individual SNRs for the models considered in both Examples 1 and 2 are listed in Table 4.3. In particular, we see that although the overall SNRs are at decent levels, the individual ones are weaker,reflectingthegeneraldifficultyofretainingallimportantfeaturesforscreening. 4.4.2 Variableselectionperformance WefurtherassessthevariableselectionperformanceofIP.ForeachdatasetgeneratedinExam- ples1and2,wecanemployregularizationmethodssuchastheLassoandthecombinedL 1 and concave method to select important interactions and main effects after applying each screening procedure to reduce the dimensionality to a moderate scale. As shown in Fan and Lv (2014), differentchoicesoftheconcavepenaltygaverisetosimilarperformance. Wethusimplemented thecombinedL 1 andSICA(L 1 +SICA)forsimplicity. TheapproachofSIS2followedbyLasso isreferredtoasSIS2-Lassoforshort. Allothercombinationsofscreeningandselectionmethods aredefinedsimilarly. WealsopairedupthehierNetBienetal. (2014)withtheIPforinteraction identification. Theoracleprocedurebasedonthetrueunderlyinginteractionmodelwasalsoin- cludedasareferencepointforcomparisons. Thecross-validation(CV)wasusedtoselecttuning parametersforallthemethods,exceptthattheBICwasappliedtoL 1 +SICArelatedprocedures forcomputationalefficiencysincetworegularizationparametersareinvolved. Chapter3. Application of variable selection in genetic data 53 To evaluate the variable selection performance of each method, we employ three performance measures. Thefirstoneisthepredictionerror(PE),whichwascalculatedusinganindependent test sample of size 10,000. The second and third measures are the numbers of false positives (FP)andfalsenegatives(FN),whicharedefinedasthenumbersofincludednoisevariablesand missedimportantvariablesinthefinalmodel,respectively. Table4.4presentsthemediansandrobuststandarddeviations(RSD)ofthesemeasuresbasedon 100simulationsfordifferentmodelsinExample1. TheRSDisdefinedastheinterquartilerange (IQR) divided by 1.34. We used the median and RSD instead of the mean and standard devia- tion since these robust measures are better suited to summarize the results due to the existence of outliers. When the strong heredity assumption holds (model M1), both DC-SIS2-L 1 +SICA andIP-L 1 +SICAfollowedcloselytheoracleprocedure,andoutperformedtheothermethodsin terms of PE, FP, and FN across all four settings. In model M2 with the weak heredity assump- tion,variableselectionmethodsbasedonbothDC-SISandIPperformedfairlywell. Inthecases when the heredity assumption does not hold (models M3 and M4), the IP-L 1 +SICA still mim- ickedtheoracleprocedureanduniformlyoutperformedtheothermethodsoverallsettings. The inflated robust standard deviations, relative to medians, in model M4 were due to the relatively low sure screening probabilities (see Tables 4.1 and 4.2). 
When the sure screening probability is low, a nonnegligible number of replications can have nonzero false negatives, which inflated thecorrespondingpredictionerrors. ThecomparisonresultsofvariableselectionforExample2 aresummarizedinTable4.5. TheconclusionsaresimilartothoseforExample1. 4.4.3 Realdataanalysis In addition, we illustrate our procedure IP through an analysis of the prostate cancer data stud- ied originally in Singh et al. (2002) and analyzed also in Fan and Fan (2008) and Hall and Xue (2014). Thisdataset,whichisavailableathttp://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi, contains136sampleswith77fromthetumorgroupand59fromthenormalgroup,eachofwhich records the expression levels measured for 12600 genes. Hall and Xue (2014) applied a four- step procedure to preprocess the data. Their procedure includes the truncation of intensities to make them positive, the removal of genes having little variation in intensity, the transformation of intensities to base 10 logarithms, and the standardization of each data vector to have zero meanandunitvariance. Anapplicationofthefour-stepprocedureresultsinatotalofp = 3239 genes. Wetreatedthediseasestatusastheresponseandtheresulting3239genesascovariates. Thedata setwasrandomlysplitintoatrainingsetandatestset. Eachtrainingsetconsistsof 69samples from the tumor group and 53 samples from the normal group, while each test set includes 8 samples from the former group and 6 samples from the latter group. For each split, we applied Chapter3. Application of variable selection in genetic data 54 TABLE 4.4: Variable selection results for all the selection methods in terms of medians and robust standard deviations (in parentheses) of various performance measuresinExample1. Method M1 M2 M3 M4 PE FP FN PE FP FN PE FP FN PE FP FN Setting1: (n,p,⇢ ) = (200,2000,0) SIS2-Lasso 8.002(0.560) 81(10.8) 0(0) 15.928(1.002) 88.5(8.6) 1(0) 15.877(0.981) 90(8.6) 1(0) 22.673(0.672) 96(7.1) 2(0) SIS2-L 1 +SICA 9.705(2.388) 12.5(10.4) 0(0) 21.013(3.354) 15(10.4) 1(0) 21.320(3.220) 20.5(11.2) 1(0) 32.605(3.190) 16(9.7) 2(0) DC-SIS2-Lasso 7.957(0.475) 82(9.7) 0(0) 5.151(0.328) 82(10.1) 0(0) 15.859(0.969) 90(7.1) 1(0) 22.440(6.808) 94(8.2) 2(0.7) DC-SIS2-L 1 +SICA 9.043(2.174) 10.5(9.7) 0(0) 6.332(1.721) 11.5(8.2) 0(0) 20.693(3.704) 15.5(10.4) 1(0) 31.637(9.652) 17(11.2) 2(0.7) IP-hierNet 8.525(0.836) 55.5(34.7) 0(0) 6.557(0.853) 82(28.5) 0(0) 7.181(0.912) 95(27.2) 0(0) 6.149(8.393) 115(21.3) 0(0.7) IP-Lasso 8.428(0.807) 79.5(16.8) 0(0) 5.302(0.455) 75(14.6) 0(0) 5.386(0.422) 74.5(13.1) 0(0) 3.135(7.909) 79(11.2) 0(0.7) IP-L 1 +SICA 7.391(1.509) 3(5.2) 0(0) 4.358(0.964) 1(4.1) 0(0) 4.640(0.890) 2(5.2) 0(0) 3.108(8.847) 3(6.0) 0(0.7) Oracle 6.34(0.110) 0(0) 0(0) 4.051(0.080) 0(0) 0(0) 4.058(0.081) 0(0) 0(0) 2.269(0.037) 0(0) 0(0) Setting2: (n,p,⇢ ) = (200,2000,0.5) SIS2-Lasso 7.992(0.472) 83(10.4) 0(0) 15.625(1.358) 89(10.4) 1(0) 15.997(0.965) 89(8.6) 1(0) 22.973(0.724) 94(5.6) 2(0) SIS2-L 1 +SICA 9.355(2.791) 12(11.6) 0(0) 20.120(4.690) 17.5(11.6) 1(0) 21.274(2.612) 18(9.0) 1(0) 33.225(3.600) 17(10.4) 2(0) DC-SIS2-Lasso 7.907(0.518) 82(9.3) 0(0) 5.154(0.417) 83(12.7) 0(0) 15.943(1.115) 89(9.0) 1(0) 22.804(1.492) 92(6.7) 2(0) DC-SIS2-L 1 +SICA 8.661(2.526) 9(9.7) 0(0) 6.046(1.995) 10(8.2) 0(0) 20.477(3.891) 15.5(11.9) 1(0) 31.691(8.200) 16(10.4) 2(0) IP-hierNet 8.31(0.705) 37(32.5) 0(0) 6.441(1.004) 71(22.4) 0(0) 6.891(1.179) 85.5(22.6) 0(0) 5.467(8.654) 109(21.1) 0(0.7) IP-Lasso 8.487(0.698) 73.5(16.4) 0(0) 5.423(0.482) 76(13.4) 0(0) 5.375(0.577) 71.5(16.8) 0(0) 3.053(8.036) 
77(16.0) 0(0.7) IP-L 1 +SICA 7.343(1.603) 3(6.0) 0(0) 4.373(0.970) 1(3.7) 0(0) 4.561(1.380) 2(5.6) 0(0) 2.826(9.292) 3.5(6.7) 0(0.7) Oracle 6.335(0.115) 0(0) 0(0) 4.057(0.073) 0(0) 0(0) 4.06(0.082) 0(0) 0(0) 2.270(0.041) 0(0) 0(0) Setting3: (n,p,⇢ ) = (300,5000,0) SIS2-Lasso 7.686(0.307) 123(16.4) 0(0) 15.364(0.798) 138(8.6) 1(0) 15.491(0.613) 139(10.4) 1(0) 22.582(0.622) 145(6.3) 2(0) SIS2-L 1 +SICA 10.047(1.277) 19(5.6) 0(0) 21.666(2.105) 29(7.8) 1(0) 21.748(2.205) 29.5(8.6) 1(0) 32.559(3.175) 32.5(14.9) 2(0) DC-SIS2-Lasso 7.660(0.293) 129(16.8) 0(0) 4.919(0.221) 127(14.9) 0(0) 15.419(0.608) 136(9.7) 1(0) 22.383(7.327) 139.5(10.4) 2(0.7) DC-SIS2-L 1 +SICA 10.248(1.705) 20.5(8.2) 0(0) 6.098(0.986) 14(5.6) 0(0) 21.694(2.437) 30(8.2) 1(0) 31.714(10.652) 30(12.3) 2(0.7) IP-hierNet 8.435(0.903) 103(44.8) 0(0) 5.879(0.608) 105(37.7) 0(0) 6.241(0.660) 122.5(33.8) 0(0) 4.735(8.426) 156.5(29.9) 0(0.7) IP-Lasso 8.105(0.474) 115.5(17.5) 0(0) 5.140(0.371) 109.5(27.6) 0(0) 5.117(0.371) 105(25.7) 0(0) 2.903(8.092) 118(17.2) 0(0.7) IP-L 1 +SICA 6.986(1.403) 3(7.8) 0(0) 4.624(1.325) 4(9.7) 0(0) 4.652(1.243) 4(9.7) 0(0) 2.859(9.151) 7(9.3) 0(0.7) Oracle 6.307(0.093) 0(0) 0(0) 4.036(0.054) 0(0) 0(0) 4.034(0.056) 0(0) 0(0) 2.261(0.033) 0(0) 0(0) Setting4: (n,p,⇢ ) = (300,5000,0.5) SIS2-Lasso 7.733(0.383) 123(14.6) 0(0) 15.25(1.363) 133(13.8) 1(0) 15.519(0.562) 137(11.2) 1(0) 22.717(0.638) 143(6.0) 2(0) SIS2-L 1 +SICA 9.999(1.706) 19(8.2) 0(0) 20.051(3.688) 23.5(11.2) 1(0) 21.494(3.002) 29(10.4) 1(0) 33.411(2.945) 34(11.2) 2(0) DC-SIS2-Lasso 7.633(0.376) 123(18.7) 0(0) 4.851(0.238) 127.5(13.8) 0(0) 15.486(0.601) 134(12.7) 1(0) 22.349(7.051) 138.5(10.8) 2(0.7) DC-SIS2-L 1 +SICA 10.012(1.510) 19(6.7) 0(0) 6.082(0.834) 15(6.3) 0(0) 21.585(3.663) 27(10.4) 1(0) 31.448(10.427) 32(9.3) 2(0.7) IP-hierNet 8.097(0.47) 76(41.0) 0(0) 5.776(0.536) 96(25.6) 0(0) 5.978(0.538) 105(29.9) 0(0) 4.647(8.700) 151(27.6) 0(0.7) IP-Lasso 7.974(0.465) 112(19.0) 0(0) 5.115(0.417) 109.5(30.6) 0(0) 5.095(0.285) 106(28.0) 0(0) 2.860(7.827) 113.5(19.4) 0(0.7) IP-L 1 +SICA 6.753(1.271) 1(6.7) 0(0) 4.47(1.182) 2.5(9.0) 0(0) 4.450(0.801) 3(7.5) 0(0) 3.121(9.126) 7.5(9.0) 0(0.7) Oracle 6.305(0.091) 0(0) 0(0) 4.039(0.060) 0(0) 0(0) 4.033(0.067) 0(0) 0(0) 2.261(0.032) 0(0) 0(0) Chapter3. Application of variable selection in genetic data 55 TABLE 4.5: Variable selection results for all the selection methods in terms of medians and robust standard deviations (in parentheses) of various performance measuresinExample2. 
Method M1 M2 M3 M4 PE FP FN PE FP FN PE FP FN PE FP FN Setting1: (n,p,⇢ ) = (200,2000,0) SIS2-Lasso 3.652(0.422) 73.5(15.3) 0(0) 15.092(1.077) 88(12.7) 1(0) 15.317(0.779) 88.5(9.0) 1(0) 25.252(0.812) 93(6.3) 2(0) SIS2-L 1 +SICA 3.081(0.584) 0(3.0) 0(0) 19.573(3.202) 13.5(11.9) 1(0) 20.181(3.209) 14(10.4) 1(0) 35.674(3.862) 14(10.1) 2(0) DC-SIS2-Lasso 3.678(0.392) 75.5(13.1) 0(0) 2.470(0.248) 73(17.5) 0(0) 15.395(1.028) 87(9.7) 1(0) 24.959(8.620) 93(8.6) 2(0.7) DC-SIS2-L 1 +SICA 3.092(0.800) 0(4.5) 0(0) 2.089(0.532) 0(4.9) 0(0) 20.506(3.690) 15.5(11.9) 1(0) 32.404(12.479) 15.5(9.0) 2(0.7) IP-hierNet 4.487(0.624) 46.5(38.4) 0(0) 3.108(0.545) 57(24.3) 0(0) 3.399(0.583) 72(21.3) 0(0) 3.557(10.816) 110(21.1) 0(0.7) IP-Lasso 3.777(0.438) 76.5(14.9) 0(0) 2.595(0.275) 71(19.4) 0(0) 2.609(0.292) 72(13.8) 0(0) 1.719(9.417) 64(19.0) 0(0.7) IP-L 1 +SICA 3.061(0.579) 0(1.9) 0(0) 2.076(0.342) 0(2.2) 0(0) 2.058(0.399) 0(3.4) 0(0) 1.543(10.135) 1.5(4.5) 0(0.7) Oracle 2.929(0.237) 0(0) 0(0) 2.002(0.068) 0(0) 0(0) 2.017(0.065) 0(0) 0(0) 1.339(0.035) 0(0) 0(0) Setting2: (n,p,⇢ ) = (200,2000,0.5) SIS2-Lasso 3.643(0.430) 76.5(14.2) 0(0) 15.314(1.863) 87(10.8) 1(0) 15.285(0.958) 88(7.5) 1(0) 25.151(0.881) 95(7.1) 2(0) SIS2-L 1 +SICA 3.152(0.887) 0(5.2) 0(0) 19.485(3.944) 13(11.9) 1(0) 20.731(3.277) 19(9.7) 1(0) 36.600(2.672) 16.5(10.4) 2(0) DC-SIS2-Lasso 3.668(0.422) 78.5(15.7) 0(0) 2.481(0.192) 74.5(21.6) 0(0) 15.163(1.150) 87(12.7) 1(0) 24.997(7.947) 90.5(10.8) 2(0.7) DC-SIS2-L 1 +SICA 3.183(0.717) 0(3.7) 0(0) 2.226(0.560) 1(4.9) 0(0) 18.800(5.117) 12(12.3) 1(0) 34.958(10.563) 19(10.1) 2(0.7) IP-hierNet 4.306(0.560) 28(29.5) 0(0) 2.881(0.497) 52(21.5) 0(0) 3.286(0.390) 64(24.3) 0(0) 3.442(10.716) 107(22.9) 0(0.7) IP-Lasso 3.829(0.406) 71(19.4) 0(0) 2.516(0.266) 68.5(15.7) 0(0) 2.543(0.247) 71(14.5) 0(0) 1.726(9.245) 60(17.2) 0(0.7) IP-L 1 +SICA 3.028(0.549) 0(3.7) 0(0) 2.079(0.220) 0(2.3) 0(0) 2.056(0.302) 0(3.0) 0(0) 1.492(9.988) 1(4.5) 0(0.7) Oracle 2.941(0.238) 0(0) 0(0) 2.021(0.072) 0(0) 0(0) 2.007(0.061) 0(0) 0(0) 1.345(0.033) 0(0) 0(0) Setting3: (n,p,⇢ ) = (300,5000,0) SIS2-Lasso 3.481(0.361) 115(24.6) 0(0) 14.708(0.678) 133(14.2) 1(0) 14.861(0.639) 132(11.2) 1(0) 24.988(0.751) 143.5(6.7) 2(0) SIS2-L 1 +SICA 3.146(0.792) 0(5.2) 0(0) 19.613(3.332) 24(14.6) 1(0) 20.765(1.990) 28.5(5.2) 1(0) 36.296(3.241) 33(14.9) 2(0) DC-SIS2-Lasso 3.475(0.345) 109.5(26.5) 0(0) 2.396(0.135) 126(26.5) 0(0) 14.724(0.703) 131(15.3) 1(0) 24.634(8.786) 140(12.7) 2(0.7) DC-SIS2-L 1 +SICA 3.092(0.670) 0(4.5) 0(0) 2.136(0.327) 1(4.1) 0(0) 19.986(3.224) 26(13.1) 1(0) 34.206(13.297) 27(12.7) 2(0.7) IP-hierNet 3.753(0.534) 42(27.6) 0(0) 2.725(0.253) 61.5(26.9) 0(0) 2.955(0.366) 79(34.5) 0(0) 2.587(10.023) 138(36.0) 0(0.7) IP-Lasso 3.620(0.400) 112.5(24.6) 0(0) 2.441(0.192) 96(21.3) 0(0) 2.445(0.185) 98.5(19.4) 0(0) 1.574(8.975) 78(35.4) 0(0.7) IP-L 1 +SICA 3.117(0.850) 0(6.7) 0(0) 2.071(0.171) 0(2.2) 0(0) 2.074(0.228) 0(3.0) 0(0) 1.377(9.704) 0(3.4) 0(0.7) Oracle 2.924(0.251) 0(0) 0(0) 2.006(0.055) 0(0) 0(0) 2.007(0.064) 0(0) 0(0) 1.347(0.031) 0(0) 0(0) Setting4: (n,p,⇢ ) = (300,5000,0.5) SIS2-Lasso 3.457(0.329) 109.5(25.4) 0(0) 14.505(1.194) 133(14.9) 1(0) 14.947(0.596) 136(9.7) 1(0) 25.174(0.702) 144(7.1) 2(0) SIS2-L 1 +SICA 3.095(0.823) 0(4.5) 0(0) 19.14(3.153) 26(9.0) 1(0) 20.824(2.960) 28.5(10.4) 1(0) 36.423(3.332) 31.5(14.6) 2(0) DC-SIS2-Lasso 3.492(0.358) 112(26.1) 0(0) 2.384(0.134) 118.5(35.1) 0(0) 14.928(0.784) 135(10.4) 1(0) 14.477(8.994) 140(14.9) 1(0.7) DC-SIS2-L 1 +SICA 3.204(0.963) 0.5(7.8) 0(0) 2.074(0.274) 0(3.4) 
0(0) 19.962(3.679) 26(13.4) 1(0) 21.558(13.215) 27(13.4) 1(0.7) IP-hierNet 3.727(0.484) 38(30.6) 0(0) 2.762(0.284) 58(23.1) 0(0) 2.863(0.357) 69.5(31.0) 0(0) 2.483(10.133) 130.5(28.0) 0(0.7) IP-Lasso 3.590(0.319) 109(20.5) 0(0) 2.491(0.220) 100(20.5) 0(0) 2.433(0.203) 97(18.3) 0(0) 1.555(9.057) 71(39.2) 0(0.7) IP-L 1 +SICA 3.108(0.769) 0(4.5) 0(0) 2.061(0.255) 0(3.0) 0(0) 2.072(0.173) 0(2.2) 0(0) 1.381(9.257) 0(4.5) 0(0.7) Oracle 2.921(0.245) 0(0) 0(0) 2.009(0.064) 0(0) 0(0) 2.009(0.070) 0(0) 0(0) 1.342(0.028) 0(0) 0(0) Chapter3. Application of variable selection in genetic data 56 TABLE4.6: Themeansandstandarderrors(inparentheses)ofclassificationerrorsandmedian modelsizesinprostatecancerdataanalysis. Method Classificationerror Medianmodelsize SIS2-Lasso 0.900(0.084) 26 SIS2-L 1 +SICA 1.060(0.086) 32.5 DC-SIS2-Lasso 0.990(0.086) 28 DC-SIS2-L 1 +SICA 1.040(0.089) 35.5 IP-hierNet 0.750(0.083) 20 IP-Lasso 0.800(0.079) 19 IP-L 1 +SICA 0.820(0.081) 24 thescreeningmethodIPtothetrainingdataandretainedthetop[n/(logn)] = 25genesineach of sets b A and b B. Thus we kept up to 2[n/(logn)] = 50 genes as main effects and used 25 genesidentifiedin b AtoconstructinteractionsintheselectionstepofIP.ForSIS2andDC-SIS2, we retained the top | c M| variables in the screening step. We employed the same methods as in Section 4.4.2 to select important interactions and main effects. The classification error was calculatedusingthetestdata,andthetuningparameterswereselectedsimilarlyasinsimulation studies. Werepeatedtherandomsplit100times. Table4.6summarizestheclassificationresultsandmedianmodelsizesforeachmethod. Weob- servethattheapproachesofIP-hierNet,IP-Lasso,andIP-L 1 +SICAyieldedlowerclassification errors with smaller model sizes. To take a further look at the features selected by each method, wepresenttheinteractionsandmaineffectsthatwereselectedatleast60outof100repetitions inTable4.7. WeseefromTable4.7thatasetofgenes,suchasHPNandS100A4,wereselectedbyallmeth- ods as main effects, revealing that those genes may play a significant role in the etiology of prostate cancer; see, for example, Saleem et al. (2006) and Holt (2010) for discussions of the association between those genes and prostate cancer. In particular, there are a wide range of studies investigating the effect of TARP, TCR alternative reading frame protein, on prostate cancer(Wolfgangetal. (2000)andHillerdaletal. (2012)). Therearealsostudiescharacterizing a regulatory sequence involving TARP that may be used to restrict expression of therapeutic genes to prostate cancer cells Cheng (2003). In other words, TARP may possibly interact with othergenesontheetiologyofprostatecancer. Suchafindingisconsistentwiththeresultsofour approach IP that the interaction TARP⇥ PRKDC is found to be associated with the phenotype 81 out of 100 times for IP-hierNet. Moreover, we examined the association of two interac- tions TARP⇥ PRKDC and AFFX-CreX-3⇥ LRRC75A-AS1 with the phenotype by conducting the two-sample t-test between the prostate cancer case and control groups. The p-values for thesetwointeractionswere0.020and0.034,respectively,whichrevealthatthemostfrequently Chapter3. Application of variable selection in genetic data 57 TABLE 4.7: Genesandtheirinteractionsselectedatleast60%of100randomsplits,aswellastheirselectionpercentages,inprostatecancerdataanalysis. 
SIS2-Lasso SIS2-L 1 +SICA DCSIS2-Lasso DCSIS2-L 1 +SICA IP-hierNet IP-Lasso IP-L 1 +SICA HPN 1.00 1.00 1.00 1.00 1.00 1.00 1.00 S100A4 1.00 0.98 0.98 0.98 0.62 0.62 0.62 SERINC5 1.00 1.00 1.00 1.00 0.98 0.94 0.96 HSPD1 0.99 0.99 0.98 0.99 0.99 1.00 0.99 LMO3 0.99 0.98 0.99 0.98 1.00 1.00 1.00 TARP 0.98 0.97 0.99 1.00 1.00 0.98 0.93 MAF 0.89 0.89 ATP2C1 0.77 0.77 PLA2G7 0.69 0.70 0.87 0.85 ANGPT1 0.68 0.67 0.60 0.68 0.87 ARL2BP 0.68 0.67 0.65 0.62 PDLIM5 0.61 0.87 NELL2 0.64 ERG 0.85 1.00 1.00 RBP1 0.67 CALM1 0.70 0.64 TMSB15A⇥ EPCAM 0.61 0.65 0.82 0.82 RARRES2⇥ KLK3 0.66 0.64 TARP⇥ PRKDC 0.81 AFFX-CreX-3⇥ LRRC75A-AS1 0.80 0.77 Chapter3. Application of variable selection in genetic data 58 selectedinteractionsbyourmethodareindeedmarginallysignificantlyassociatedwiththephe- notype. 4.5 Discussion We have considered in this paper the problem of interaction identification in ultra-high dimen- sions. The proposed method IP based on a new interaction screening procedure and post- screeningvariableselectioniscomputationallyefficient,andcapableofreducingdimensionality fromalargescaletoamoderateoneandrecoveringimportantinteractionsandmaineffects. To simplify the technical presentation, our analysis has been focused on the linear pairwise inter- actionmodels. Screening for maineffectsin more general modelsettingshas been exploredby many researchers; see, for example, Fan and Song (2010) and Fan et al. (2011). It would be interesting to extend the interaction screening idea of IP to these and other more general model frameworks such as the generalized linear models, nonparametric models, and survival models withinteractions. The key idea of IP is to use different marginal utilities to screen interactions and main effects separately. As such, it can suffer from the same potential issues as the SIS. First, some noise interactions or main effects that are highly correlated with the important ones can have higher marginal utilities and thus priority to be selected than other important ones that are relatively weakly related to the response. Second, some important interactions or main effects that are jointly correlated but marginally uncorrelated with the response can be missed after screening. To address these issues, we next briefly discuss two extensions of IP that enable us to exploit morefullythejointinformationamongthecovariates. Our first extension of IP, the iterative IP (IIP), is motivated by the idea of two-scale learning with the iterative SIS (ISIS) in Fan and Lv (2008) and Fan et al. (2009). The IIP works as follows by applying large-scale screening and moderate-scale selection in an iterative fashion. First, apply IP to the original sample (x i ,y i ) n i=1 to obtain two sets b I 1 of interactions and b B 1 of main effects, and construct a set b A 1 of interaction variables based on b I 1 as in (4.2). Second, updatethesetsofcandidateinteractionvariablesas{1,···,p}\ b A 1 andcandidatemaineffects as{1,···,p}\ b B 1 ,treattheresidualvectorfromthepreviousiterationasthenewresponse,and apply IP to the updated sample to obtain new sets b I 2 , b B 2 , and b A 2 defined similarly as before. Third, iteratively update the feature space for candidate interaction variables and main effects andtheresponse,andapplyIPtotheupdatedsampletosimilarlyobtainsequencesofsets( b I k ), ( b B k ), and ( b A k ), until thetotalnumberofselectedinteractionsandmain effects insets b I k ’sand b B k ’s reaches a prespecified threshold. 
Fourth, finally, select important interactions and main effects using a regularization method in the reduced feature space given by the union of the Î_k's and B̂_k's.

The second extension of IP, the conditional IP (CIP), exploits the idea of the conditional SIS (CSIS) in Barut et al. (2012), which replaces the simple marginal correlation with the conditional marginal correlation to assess the importance of covariates when some variables are known in advance to be important. Suppose we have some prior knowledge that two given sets A_0, B_0 ⊂ {1, ..., p} contain some active interaction variables and important main effects, respectively. For interaction screening, CIP regresses the squared response Y^2 on each squared covariate X_k^2 with k outside A_0, conditioning on (X_ℓ^2) for ℓ ∈ A_0, and retains the variables with the top conditional marginal utilities as interaction variables. Similarly, in main effect screening it employs marginal regression of the response Y on each covariate X_k with k outside B_0, conditional on (X_ℓ) for ℓ ∈ B_0. After screening, CIP further selects important interactions and main effects using a variable selection procedure in the reduced feature space. The approach of CIP can also be incorporated into IIP by conditioning on the variables selected in previous steps when calculating the marginal utilities along the course of the iteration.

The investigation of these extensions is beyond the scope of the current paper, and they will be interesting topics for future research.

Chapter 5

Conclusion

In my dissertation, I proposed a variable selection method called the constrained Dantzig selector (CDS) for moderately high dimensions. When interactions also need to be considered in prediction models, a screening step is required, and for this purpose a two-step method called interaction pursuit (IP) was proposed. The former method, CDS, has been applied to a GWAS data set, and 269 SNPs were selected. The latter method, IP, can possibly be applied to the same GWAS data to detect gene-gene interactions.

Bibliography

[1] ALFONSO, B., BROWN, A. A., LAPPALAINEN, T., VINUELA, A., DAVIES, M. N., ZHENG, H. F., RICHARDS, J. B., GLASS, K. S., SMALL, K. S., DURBIN, R., SPECTOR, T. D. and DERMITZAKIS, E. T. (2015). Gene-gene and gene-environment interactions detected by transcriptome sequence analysis in twins. Nat. Genet. 47, 88-91.

[2] ANTONIADIS, A., FRYZLEWICZ, P. and LETUE, F. (2010). The Dantzig selector in Cox's proportional hazards model. Scand. J. Statist. 37, 531-552.

[3] BARUT, E., FAN, J. and VERHASSELT, A. (2012). Conditional sure independence screening. Preprint, arXiv:1206.1024.

[4] BERTIN, K., LE PENNEC, E. and RIVOIRARD, V. (2011). Adaptive Dantzig density estimation. Ann. Inst. H. Poincaré Probab. Statist. 47, 43-74.

[5] BICKEL, P. J., RITOV, Y. and TSYBAKOV, A. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37, 1705-1732.

[6] BIEN, J., TAYLOR, J. and TIBSHIRANI, R. (2013). A lasso for hierarchical interactions. Ann. Statist. 41, 1111-1141.

[7] CANDÈS, E. J., ROMBERG, J. and TAO, T. (2006). Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Info. Theory 52, 489-509.

[8] CANDÈS, E. J. and TAO, T. (2005). Decoding by linear programming. IEEE Trans. Info. Theory 51, 4203-4215.

[9] CANDÈS, E. J. and TAO, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann. Statist. 35, 2313-2351.

[10] CANDÈS, E. J., WAKIN, M. B. and BOYD, S. (2008). Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl. 14, 877-905.
[11] Chang, J., Tang, C. Y. and Wu, Y. (2013), “Marginal empirical likelihood and sure inde- pendencefeaturescreening,” The Annals of Statistics,41,2123–2148. 61 Chapter3. Application of variable selection in genetic data 62 [12] CHENG,I.,GARY,K.,NAKAGAWA, H.etal(2012). EvaluatingGeneticRiskforProstate Cancer among Japanese and Latinos. Cancer Epidemiol. Biomarkers Prev. 21(11):2048– 2058. [13] Cheng, W. S., Giandomenico, V., Pastan, I. and Essand, M. (2003), “Characterization of theandrogen-regulatedprostate-specificTcellreceptorgamma-chainalternatereadingframe protein(TARP)promoter,” Endocrinology,144,3433–3440. [14] Cho, H. and Fryzlewicz, P. (2012). High dimensional variable selection via tilting. J. R. Statist. Soc.B,74,593–622. [15] CULVERHOUSE,R.,SUAREZ,B.K.,LIN, J. and REICH, T. (2002). A perspective on epistasis: limitsofmodelsdisplayingnomaineffect. Ame. J. Hum. Genet.70,461–471. [16] CORDELL, H. J. (2009). Detecting gene?gene interactions that underlie human diseases. Nat. Rev. Genet.10,392–402. [17] DONOHO, D. L. (2006). For most large underdetermined systems of linear equations the minimalL 1 -norm solution is also the sparsest solution. Commun. on Pure and Appl. Math.. 59,797-–829. [18] DONOHO, D. L. and ELAD, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ` 1 minimization. Proc. Natl. Acad. Sci. USA 100, 2197– 2202. [19] DUDLEY, R. M. (1999). Uniform Central Limit Theorems. CambridgeUniversityPress. [20] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist.32,407–499. [21] Fan, J. and Fan, Y. (2008), “High dimensional classification using features annealed inde- pendencerules,” The Annals of Statistics,36,2605–2637. [22] Fan, J., Feng, Y. and Song, R. (2011), “Nonparametric independence screening in sparse ultra-high-dimensional additive models,” Journal of the American Statistical Association, 106,544–557. [23] Fan, J. and Li, R. (2001), “Variable selection via nonconcave penalized likelihood and its oracleproperties,” Journal of the American Statistical Association,96,1348–1360. [24] FAN, J. and LV, J. (2010). A selective overview of variable selection in highdimensional featurespace(invitedreviewarticle). Statist. Sinica20,101—148. [25] FAN,J.,FENG, Y., and SONG, R. (2011). Nonparametric independence screening in sparseultra-highdimensionaladditivemodels. J. Amer. Statist. Assoc.106,544—557. Chapter3. Application of variable selection in genetic data 63 [26] FAN, J. and LV, J. (2008). Sureindependencescreeningforultrahighdimensionalfeature space. J. R. Statist. Soc. Ser.B70,849–911. [27] FAN, J. and LV, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Info. Theory57,5467–5484. [28] FAN, Y. and LV, J. (2013). Asymptotic equivalence of regularization methods in thresh- oldedparameterspace. J. Am. Statist. Assoc.,forthcoming. [29] Fan, Y. and Lv, J. (2014), “Asymptotic properties for combined L 1 and concave regular- ization,” Biometrika,101,57–70. [30] FAN,J.,SAMWORTH, R. and WU, Y. (2009). Ultrahigh dimensional feature selection: beyondthelinearmodel. J. Mach. Learn. Res.10,2013—2038. [31] Fan, J. and Song, R. (2010), “Sure independence screening in generalized linear models withNP-dimensionality,” The Annals of Statistics,38,3567–3604. [32] FRANK, I.E. and FRIEDMAN, J.H. (1993). A Statistical View of Some Chemometrics RegressionTools. Technometrics35,109—148. [33] GAUDERMAN,W.J.,ZHANG,P.,MORRISON, J. L. and LEWINGER, J. P. (2013). 
Find- ing novel genes by testing G*E interactions in a genome-wide association study. Genet. Epidemiol.37,603-613. [34] GORST-RASMUSSEN, A. and SCHEIKE, T. (2013). Independent screening for single- index hazard rate models with ultrahigh dimensional features. J. R. Statist. Soc. Ser. B 75, 217-245. [35] GUAN, Y.andSTEPHENS, M.(2011). Bayesianvariableselectionregressionforgenome- wideassociationstudiesandotherlarge-scaleproblems. Ann. Appl. Stat.5,1780–1815. [36] HAIMAN,C.A.,LE MARCHAND,L.YAMAMATO,J.,STRAM,D.O.,SHENG,X., KOLONEL,L.N.,WU,A.H.,REICH, D. and HENDERSON, B. E. (2011). A common geneticriskfactorforcolorectalandprostatecancer. Nat. Genet.39(8):954–956. [37] HALL, P. and MILLER, H. (2009). Using generalized correlation to effect variable selec- tioninveryhighdimensionalproblems. J. Comput. Graph. Statist.,18,533–550. [38] HALL, P.andXUE, J.-H.(2014). Onselectinginteractingfeaturesfromhigh-dimensional data. Comp. Stat. Data Ana.71,694–708. [39] HALL,P.,TITTERINGTON, D. and XUE, J.-H. (2009). Tilting methods for assessing the influenceofcomponentsinaclassifier. J. R. Statist. Soc.B,71,783–803. Chapter3. Application of variable selection in genetic data 64 [40] HAO, N.andZHANG, H.(2014). InteractionScreeningforUltra-HighDimensionalData. J. Amer. Statist. Assoc.,109,1285–1301. [41] HE, Q. and LIN, D. (2010). A variable selection method for genome-wide association studies.. Bioinformatics.,27(1): 1—8. [42] Hillerdal, V., Nilsson, B., Carlsson, B., Eriksson, F. and Essand, M. (2012), “T cells en- gineeredwithaTcellreceptoragainsttheprostateantigenTARPspecificallykillHLA-A2+ prostate and breast cancer cells,” Proceedings of the National Academy of Sciences, 109, 15877–15881. [43] Holt, S. K., Kwon, E. M., Lin, D. W., Ostrander, E. A. and Stanford, J. L. (2010), “Asso- ciation of hepsin gene variants with prostate cancer risk and prognosis,” Prostate, 70, 1012– 1019. [44] HUNTER, D. J. (2005). Gene–environment interactions in human diseases. Nat. Rev. Genet.,6,287–298. [45] JAMES,G.M.,RADCHENKO, P. and LV, J. (2009). DASSO: connections between the Dantzigselectorandlasso. J. R. Statist. Soc. Ser.B71,127–142. [46] KONG,Y.,LV, J.andZHENG, Z.(2013+). ConstrainedDantzigselectionandcompressed sensing,JournalofMachineLearningResearch,inpress. [47] KOOPERBERG,C.,LEBLANC, M. and OBENCHAIN, V. (2010). Risk Prediction Using Genome-WideAssociationStudies. Genetic Epidemiol.34,643—652. [48] LEWINGER,J.P.,MORRISON,J.L.,THOMAS,D.C.,MURCRAY,C.E.,CONTI,D.V., LI, D. and GAUDERMAN, W. J. (2013). Efficienttwo-steptestingofgene-geneinteractions ingenome-wideassociationstudies. Genet. Epidemiol.37,440-451. [49] LI,G.,PENG,H.,ZHANG, J. and ZHU, L. (2012a). Robust rank correlation based screening. Ann. Statist.,40,1846–1877. [50] Li,R.,Zhong,W.andZhu,L.(2012b). Featurescreeningviadistancecorrelationlearning. J. Amer. Statist. Assoc.,107,1129–1139. [51] LV, J. and FAN, Y. (2009). A unified approach to model selection and sparse recovery usingregularizedleastsquares. Ann. Statist.37,3498–3528. [52] Mai, Q. and Zou, H. (2013). The kolmogorov filter for variable screening in high- dimensionalbinaryclassification. Biometrika,100,229–234. [53] NEALE,B.M.,RIVAS,M.A.,VOIGHT,B.F.,ALTSHULER,D.,DEVLIN,B.,ORHO- MELANDER,M.,KATHIRESAN,S.,PURCELL,S.M.,ROEDER, K., and DALY,M.J. (2011). Testingforunusualdistributionofrarevariants. PLoS Genet.7:e1001322. Chapter3. 
Application of variable selection in genetic data 65 [54] NOMURA AM,HANKIN JH,HENDERSON BE,WILKENS LR,MURPHY SP,PIKE MC, LE MARCHAND L, STRAM DO, MONROE KR, and DKOLONEL LN (2007). Dietary fiber andcolorectalcancerrisk: themultiethniccohortstudy. Cancer Causes Control.18(7):753- 764. [55] JIANG, B. and LIU, J. S. (2014), “Variable selection for general index models via sliced inverseregression,” Ann. Stat.,42,1751–1786. [56] PARK SY, WILKENS LR, FRANKE AA, LE MARCHAND L, KAKAZU KK, GOODMAN MT, MURPHY SP, HENDERSON BE and KOLONEL LN (2009). Urinary phytoestrogen excretionandprostatecancerrisk: anestedcase-controlstudyintheMultiethnicCohort. Br J Cancer 101(1): 185-191. [57] PELTOLA,T.,MARTTINEN,P.,JULA, A. et al (2012). Bayesian Variable Selection in SearchingforAdditiveandDominantEffectsinGenome-WideData. PLoSOne7(1):e29115. [58] RITCHIE,M.D.,HAHN,L.W.,ROODI,N.,BAILEY,R.,DUPONT,W.D.,PARL,F.F. andMOORE, J. H.(2001),“Multifactor-dimensionalityreductionrevealshigh-orderinterac- tionsamongestrogen-metabolismgenesinsporadicbreastcancer,”TheAme.J.Hum.Genet., 69,138–147. [59] Saleem, M., Kweon, M. H., Johnson, J. J., Adhami, V. M., Elcheva, I., Khan, N., Bin Hafeez, B., Bhat, K. M., Sarfaraz, S., Reagan-Shaw, S., Spiegelman, V. S., Setaluri, V. and Mukhtar,H.(2006),“S100A4acceleratestumorigenesisandinvasionofhumanprostatecan- cer through the transcriptional regulation of matrix metalloproteinase 9,” Proceedings of the National Academy of Sciences,103,14825–14830. [60] Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub,T.R.andSellers,W.R.(2002),“Geneexpressioncorrelatesofclinicalprostatecancer behavior,” Cancer Cell,1,203–209. [61] TIBSHIRANI, R. J. (1996). Regressionshrinkageandselectionviathelasso. J. R. Statist. Soc. Ser.B58,267–288. [62] VAN DE GEER, S. A. and BUHLMANN, P. (2009). Ontheconditionsusedtoproveoracle resultsfortheLasso. Electron. J. Statist.3,1360–1392. [63] Wolfgang,C.D.,Essand,M.,Vincent,J.J.,Lee,B.andPastan,I.(2002),“TARP:anuclear proteinexpressedinprostateandbreast cancercellsderived froman alternatereading frame oftheTcellreceptorgammachainlocus,”ProceedingsoftheNationalAcademyofSciences, 97,9437–9442. Chapter3. Application of variable selection in genetic data 66 [64] WU,M.C.,LEE,S.,CAI,T.,LI,Y.,BOEHNKE, M. and LIN, X. (2011). Rare-variant associationtestingforsequencingdatawiththesequencekernelassociationtest. Am.J.Hum. Genet.89,82–93. [65] XUE, L. and ZOU, H. (2011). Sure independence screening and compressed random sensing. Biometrika98,371–380. [66] Yuan,M.,Joseph,V.R.andZou,H.(2009),“Structuredvariableselectionandestimation,” Annals of Applied Statistics,3,1738–1757. [67] Zhang,C.-H.(2010),“Nearlyunbiasedvariableselectionunderminimaxconcavepenalty,” The Annals of Statistics,38,894–942. [68] ZHENG,Z.,FAN, Y. and LV, J. (2013). High-dimensional thresholded regression and shrinkageeffect. J. R. Statist. Soc. Ser.B,forthcoming. [69] ZHU,L.-P.,LI,L.,LI, R. et al (2011). Model-free feature screening for ultrahigh- dimensionaldata. J. Amer. Statist. Assoc.106,1464—1475. [70] ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 101, 1418–1429. [71] ZOU, H. and HASTIE, T. J. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. Ser.B67,301–320. Supplementalmaterials This supplementary material consists of five parts. Section A presents some additional simula- tion studies. 
We establish the invariance of the three sets A, I, and M under affine transfor- mationsinSectionB.SectionCillustratesthatinthepresenceofcorrelationamongcovariates, using corr(X 2 j ,Y 2 ) as the marginal utility still has differentiation power between interaction variables(i.e.,variablescontributingtointeractions)andnoisevariables(variablescontributing toneitherinteractionsnormaineffects). WeprovidetheproofsofProposition4.1andTheorems 4.2–4.4inSectionD.SectionEcontainssometechnicallemmasandtheirproofs. Hereafterwe use e C i withi=1,2,··· todenotesomegenericpositiveornonnegativeconstantswhosevalues mayvaryfromlinetoline. ForanysetD,denoteby|D|itscardinality. AppendixA:Additionalsimulationstudies A.1. Lowersignal-to-noiseratiosinExample1 In Section 4.4.1, we investigated the screening performance of each approach at certain noise levels. However, it is still interesting to test the robustness of those methods when the signal- to-noise ratio (SNR) is smaller. Therefore, keeping all the settings in Example 1 the same as before,wenowconsiderthreemoresetsofnoiselevels: Case1: " 1 ⇠ N(0,3 2 )," 2 ⇠ N(0,2.5 2 )," 3 ⇠ N(0,2.5 2 )," 4 ⇠ N(0,2 2 ); Case2: " 1 ⇠ N(0,3.5 2 )," 2 ⇠ N(0,3 2 )," 3 ⇠ N(0,3 2 )," 4 ⇠ N(0,2.5 2 ); Case3: " 1 ⇠ N(0,4 2 )," 2 ⇠ N(0,3.5 2 )," 3 ⇠ N(0,3.5 2 )," 4 ⇠ N(0,3 2 ). FollowingthesamedefinitionofSNRasinSection4.1,theSNRsinthesettingsabovearelisted inTable5.1andfarlowerthanbefore. Forexample,theSNRsinthethirdsetofnoiselevelsfor modelsM1–M4are0.39,0.33,0.33,and0.25timesaslargeasbefore,respectively. 67 Chapter3. Application of variable selection in genetic data 68 [Table5.1abouthere.] The corresponding screening results for those three sets of noise levels are summarized in Ta- ble 5.2. It is seen that our approach IP performed better than all others across three settings in models M2–M4, where the strong heredity assumption is not satisfied. In model M1, the IP did not perform as well as other methods, since it keeps only [n/(logn)] variables to con- struct interactions while the other methods can keep up to 2[n/(logn)] interaction variables in screening. [Table5.2abouthere.] A.2. Computationtime Todemonstratetheeffectofinteractionscreeningonthecomputationalcost,weconsidermodel M2 in Example 1 withn = 200, ⇢ =0.5 andp = 200,300, and 500, and calculate the average computationtimeofhierNetandIP-hierNet. Theonlydifferencebetweenthesetwomethodsis that IP-hierNet has the screening step whereas hierNet does not. Table 5.3 reports the average computation time of hierNet and IP-hierNet based on 100 replications. We see from Table 5.3 that when the dimensionality gets higher, the ratio of average computation time of hierNet overIP-hierNetbecomeslarger. Inparticular,theaveragecomputationtimeforhierNetreaches 292.770 minutes for a single repetition when p = 500, while that for IP-hierNet is only 6.047 minutes. Asexpected,ourproposedprocedureIPiscomputationallymuchmoreefficientthanks totheadditionalscreeningstep. [Table5.3abouthere.] AppendixB:InvarianceofsetsA,I,andM Considerthelinearinteractionmodel Y = 0 + p X j=1 j X j + p 1 X k=1 p X `=k+1 k` X k X ` +" givenin(4.1). Foranyk,`2{1,···,p},define ⇤ k` = k` /2fork<`, ⇤ k` =0fork = `,and ⇤ k` = `k /2fork>`. Then ⇤ k` = ⇤ `k andourmodelcanberewrittenas Y = 0 + p X j=1 j X j + p X k,`=1 ⇤ k` X k X ` +". Chapter3. 
Application of variable selection in genetic data 69 Under affine transformations X new j = b j (X j a j ) with a j 2 R and b j 2 R\{0} for j = 1,···,p,ourmodelbecomes Y = 0 + p X j=1 j (b 1 j X new j +a j )+ p X k,`=1 ⇤ k` (b 1 k X new k +a k )(b 1 ` X new ` +a ` )+" =( 0 + p X j=1 j a j + p X k,`=1 ⇤ k` a k a ` )+ p X j=1 ( j + p X `=1 ⇤ j` a ` + p X k=1 ⇤ kj a k )b 1 j X new j + p X k,`=1 ⇤ k` b 1 k b 1 ` X new k X new ` +" = e 0 + p X j=1 e j X new j + p 1 X k=1 p X `=k+1 e k` X new k X new ` +", where e 0 = 0 + p X j=1 j a j + p X k,`=1 ⇤ k` a k a ` = 0 + p X j=1 j a j + X 1 k<` p k` a k a ` , (B.1) e j =( j + p X `=1 ⇤ j` a ` + p X k=1 ⇤ kj a k )b 1 j =( j + X 1 k<j kj a k + X j<k p jk a k )b 1 j , (B.2) e k` = k` b 1 k b 1 ` . (B.3) SimilartothedefinitionsofsetsI,A,B,andMin(4.2),wedefineindexsets e I ={(k,`):1 k<` pwithe k` 6=0}, e A ={1 k p:(k,`)or(`,k)2I forsome`}, e B = n 1 j p : e j 6=0 o . Thenfrom(B.3),wehave e I =I andthus e A =A. Next we show f M = M. It is equivalent to show that e A c \ e B c = A c \B c . To this end, we first prove A c \B c ⇢ e A c \ e B c . For any j2A c \B c , we have j =0 and jk =0 for all 1 k 6= j p. In view of (B.2) and (B.3), we have e j =0 ande jk =0, which means j2 e A c \ e B c . ThusA c \B c ⇢ e A c \ e B c holds. Similarly,wecanalsoshowthat e A c \ e B c ⇢A c \B c . Combiningtheseresultsyields e A c \ e B c =A c \B c andthus f M =M. Therefore, the three sets A, I, and M are invariant under affine transformations X new j = b j (X j a j )witha j 2 Randb j 2 R\{0}forj=1,···,p. Chapter3. Application of variable selection in genetic data 70 AppendixC: cov(X 2 j ,Y 2 )underspecificmodels Without loss of generality, we assume that 0 =0 and the s true main effects concentrate at the first s coordinates, that is, B = {1,···,s}. Here we slightly abuse the notation s for simplicity. Due to the existence of O(p 2 ) interaction terms, it is generally too complicated to calculate cov(X 2 j ,Y 2 ) explicitly. Since our purpose is to illustrate that in the presence of cor- relation among covariates, using corr(X 2 j ,Y 2 ) as the marginal utility still has differentiation power between interaction variables (i.e., variables contributing to interactions) and noise vari- ables (variables contributing to neither interactions nor main effects), we consider the specific case when there is only one interaction and x=(X 1 ,···,X p ) T ⇠ N(0,⌃ ) with⌃ =( k` ) beingtridiagonal,thatis, k` =1fork = `, k` = ⇢ 2 [1,1]for|k`|=1,and k` =0for |k`|> 1. Inaddition,assumethatallnonzeromaineffectcoefficientstakethesamevalue , thatis, 0,1 =··· = 0,s = 6=0. We consider the following three different settings according to whether or not the heredity as- sumptionholds: Case1: A ={1,2}–strongheredityifs 2, Case2: A ={1,s+1}–weakheredity, Case3: A ={s+1,s+2}–anti-heredity. Here,ineachcase,thesetofactiveinteractionvariablesAischosenwithoutlossofgenerality. Fortheeaseofpresentation,denotebyJ 1 = P s j=1 0,j X j andJ 2 =X k X ` withk,`2A and k6= `. Then,Y =J 1 +J 2 +"and cov(X 2 j ,Y 2 )=cov(X 2 j ,J 2 1 )+cov(X 2 j ,J 2 2 ). Directcalculationsyield cov(X 2 j ,J 2 1 )= 8 > > < > > : 2 2 ,j=1 2 2 ⇢ 2 ,j=2 0,j 3 when s=1, cov(X 2 j ,J 2 1 )= 8 > > < > > : 2 2 (1+⇢ ) 2 ,j=1,2 2 2 ⇢ 2 ,j=3 0,j 4 when s=2, Chapter3. Application of variable selection in genetic data 71 cov(X 2 j ,J 2 1 )= 8 > > > > > < > > > > > : 2 2 (1+⇢ ) 2 ,j=1ors 2 2 (1+2⇢ ) 2 , 2 j s1 2 2 ⇢ 2 ,j =s+1 0,js+2 when s 3. Next,wedealwithcov(X 2 j ,J 2 2 ). 
ByIsserlis’Theorem,wehave E(X 2 j X k X ` X k 0X ` 0)= jj k` k 0 ` 0 + jj kk 0 `` 0 + jj k` 0 `k 0 + jk j` k 0 ` 0 + jk jk 0 `` 0 + jk j` 0 `k 0 + j` jk k 0 ` 0 + j` jk 0 k` 0 + j` j` 0 kk 0 + jk 0 jk `` 0 + jk 0 j` k` 0 + jk 0 j` 0 k` + j` 0 jk `k 0 + j` 0 j` kk 0 + j` 0 jk 0 k` and E(X k X ` X k 0X ` 0)= k` k 0 ` 0 + kk 0 `` 0 + k` 0 `k 0. Combining these two results above gives cov(X 2 j ,X k X ` X k 0X ` 0) =E(X 2 j X k X ` X k 0X ` 0)E(X 2 j )E(X k X ` X k 0X ` 0) =2( jk j` k 0 ` 0 + jk jk 0 `` 0 + jk j` 0 `k 0 + j` jk 0 k` 0 + j` j` 0 kk 0 + jk 0 j` 0 k` ). Next, we calculate the value of cov(X 2 j ,J 2 2 ) according to the three different model settings discussedabove. Case1: A ={1,2}. Then J 2 = X 1 X 2 and cov(X 2 j ,J 2 2 )=2 2 ( 2 j1 22 +4 j1 j2 12 + 2 j2 11 ). Thus cov(X 2 j ,J 2 2 )= 8 > > < > > : 2 2 (1+5⇢ 2 ),j=1 or 2, 2 2 ⇢ 2 ,j=3, 0,j 4. Insummary,cov(X 2 1 ,Y 2 )> 0andcov(X 2 2 ,Y 2 )> 0forall1 ⇢ 1,whilecov(X 2 j ,Y 2 )= 0forj max{s+2,4}. Case2: A ={1,s+1}. ThenJ 2 =X 1 X s+1 andcov(X 2 j ,J 2 2 )=2 2 ( 2 j1 s+1,s+1 +4 j1 j,s+1 1,s+1 + 2 j,s+1 11 ). Thus cov(X 2 j ,J 2 2 )=2 2 ( 2 j1 +4 j1 j2 ⇢ + 2 j2 )= 8 > > < > > : 2 2 (1+5⇢ 2 ),j=1 or 2 2 2 ⇢ 2 ,j=3 0,j 4 when s=1, Chapter3. Application of variable selection in genetic data 72 cov(X 2 j ,J 2 2 )=2 2 ( 2 j1 + 2 j3 )= 8 > > > > > < > > > > > : 2 2 ,j=1 or 3 4 2 ⇢ 2 ,j=2 2 2 ⇢ 2 ,j=4 0,j 5 when s=2, cov(X 2 j ,J 2 2 )=2 2 ( 2 j1 + 2 j,s+1 )= 8 > > < > > : 2 2 ,j=1 or s+1 2 2 ⇢ 2 ,j=2 or s or s+2 0, 3 j s1 or js+3 when s 3. Soitholdsthatcov(X 2 j ,Y 2 )=0foralljs+3,andcov(X 2 j ,Y 2 )> 0forj2A ={1,s+1}. Case3: A ={s+1,s+2}. Then J 2 = X s X s+1 and cov(X 2 j ,J 2 2 )=2 2 ( 2 js s+1,s+1 + 4 js j,s+1 s,s+1 + 2 j,s+1 ss ). Thus cov(X 2 j ,J 2 2 )= 8 > > < > > : 2 2 (1+5⇢ 2 ),j =s or s+1, 2 2 ⇢ 2 ,j =s1 or s+2, 0, otherwise. So we have that cov(X 2 j ,Y 2 )=0 for all j s+3, and cov(X 2 j ,Y 2 ) > 0 for j2A = {s+1,s+2}. Therefore,cov(X 2 j ,Y 2 )> 0forallj2A,whereascov(X 2 j ,Y 2 )=0forallj max{s+2,4} forCase1,andcov(X 2 j ,Y 2 )=0foralljs+3forCases2and3. Notethatcorr(X 2 j ,Y 2 )= cov(X 2 j ,Y 2 )/ q var(X 2 j )var(Y 2 ). This ensures that the correlations between X 2 j and Y 2 are nonzeroforthoseactiveinteractionvariables. Inotherwords,usingcorr(X 2 j ,Y 2 )asthemarginal utilitycanstillsingleoutactiveinteractionvariables. AppendixD:ProofsofProposition4.1andTheorems4.2–4.4 D.1. ProofofProposition4.1 LetJ 1 = P p j=1 j X j andJ 2 = P p 1 k=1 P p `=k+1 k` X k X ` . Thenourinteractionmodel(4.1)can bewrittenasY = 0 +J 1 +J 2 +". Foreachj2{1,···,p},thecovariancebetweenX 2 j and Y 2 canbeexpressedas cov(X 2 j ,Y 2 )=cov(X 2 j ,J 2 1 )+cov(X 2 j ,J 2 2 )+cov(X 2 j ," 2 )+2 0 cov(X 2 j ,J 1 ) +2 0 cov(X 2 j ,J 2 )+2 0 cov(X 2 j ,")+2cov(X 2 j ,J 1 J 2 ) +2cov(X 2 j ,J 1 ")+2cov(X 2 j ,J 2 "). (D.1) Chapter3. Application of variable selection in genetic data 73 Recall that " is independent of X j . Thus cov(X 2 j ," 2 )=0 and cov(X 2 j ,")=0. With the assumptionofE(")=0,wehave cov(X 2 j ,J 1 ")=E(X 2 j J 1 ")E(X 2 j )E(J 1 ")=E(X 2 j J 1 )E(")E(X 2 j )E(J 1 )E(")=0. Similarly, cov(X 2 j ,J 2 ")=0. Note that cov(X 2 j ,J 1 J 2 )= E(X 2 j J 1 J 2 ) E(X 2 j )E(J 1 J 2 ). Since X 1 ,···,X p are i.i.d. N(0,1), direct calculation yields E(X 2 j J 1 J 2 )= E(J 1 J 2 )=0, whichleadstocov(X 2 j ,J 1 J 2 )=0. Similarly,cov(X 2 j ,J 1 )=0. Thus,(D.1)reducesto cov(X 2 j ,Y 2 )=cov(X 2 j ,J 2 1 )+cov(X 2 j ,J 2 2 )+2 0 cov(X 2 j ,J 2 ). (D.2) Itremainstocalculatethethreetermsontherighthandsideof(D.2). Wefirstconsidercov(X 2 j ,J 2 1 ). 
Foreachfixedj=1,···,p,denotebyJ 3 = P k6=j k X k . Then J 1 = j X j +J 3 and cov(X 2 j ,J 2 1 )=cov(X 2 j , 2 j X 2 j )+cov(X 2 j ,2 j X j J 3 )+cov(X 2 j ,J 2 3 ). SinceX j is independent ofJ 3 , it follows that cov(X 2 j ,J 2 3 )=0. Note that cov(X 2 j , 2 j X 2 j )= 2 j var(X 2 j )=2 2 j and cov(X 2 j ,2 j X j J 3 )=2 j [E(X 3 j J 3 )E(X 2 j )E(X j J 3 )] =2 j [E(X 3 j )E(J 3 )E(X 2 j )E(X j )E(J 3 )] = 0. Therefore,weobtain cov(X 2 j ,J 2 1 )=2 2 j . (D.3) Next,wedealwithcov(X 2 j ,J 2 2 ). Forafixedj=1,···,p,letJ 4 = P j 1 k=1 kj X k + P p `=j+1 j` X ` andJ 5 = P p 1 k=1,k6=j P p `=k+1,`6=j k` X k X ` . ThenJ 2 =J 4 X j +J 5 . SinceX j isindependentof J 4 andJ 5 ,wehavecov(X 2 j ,J 2 5 )=0and cov(X 2 j ,J 2 2 )=cov(X 2 j ,J 2 4 X 2 j )+cov(X 2 j ,2J 4 X j J 5 ). (D.4) Thefirsttermontherighthandsideof(D.4)canbefurthercalculatedas cov(X 2 j ,J 2 4 X 2 j )=E(X 4 j J 2 4 )E(X 2 j )E(J 2 4 X 2 j )=E(X 4 j )E(J 2 4 )E(X 2 j )E(J 2 4 )E(X 2 j ) =2E(J 2 4 ) = 2var(J 4 ) = 2( j 1 X k=1 2 kj + p X `=k+1 2 j` ). Thesecondtermontherighthandsideof(D.4)is cov(X 2 j ,2J 4 X j J 5 )=2E(X 3 j J 4 J 5 )2E(X 2 j )E(J 4 X j J 5 ) =2E(X 3 j )E(J 4 J 5 )2E(X 2 j )E(J 4 J 5 )E(X j )=0, Chapter3. Application of variable selection in genetic data 74 sinceE(X 3 j )=E(X j )=0. Therefore,itholdsthat cov(X 2 j ,J 2 2 ) = 2( j 1 X k=1 2 kj + p X `=k+1 2 j` ). (D.5) Finally, we handle cov(X 2 j ,J 2 ). Recall thatJ 2 = J 4 X j +J 5 andX j is independent ofJ 4 and J 5 ,wehave cov(X 2 j ,J 2 )=cov(X 2 j ,J 4 X j )+cov(X 2 j ,J 5 )=E(X 3 j J 4 )E(X 2 j )E(J 4 X j ) =E(X 3 j )E(J 4 )E(X 2 j )E(J 4 )E(X j )=0, whichtogetherwith(D.2),(D.3),and(D.5)completestheproofofProposition4.1. D.2. Proofofparta)ofTheorem4.2 LetS k1 =n 1 n P i=1 X 2 ik Y 2 i ,S k2 =n 1 n P i=1 X 2 ik ,S k3 =n 1 n P i=1 X 4 ik ,andS 4 =n 1 n P i=1 Y 2 i . Then ! k andb ! k canbewrittenas ! k = E(S k1 )E(S k2 )E(S 4 ) p E(S k3 )E 2 (S k2 ) and b ! k = S k1 S k2 S 4 q S k3 S 2 k2 . Toprove(4.9),thekeystepistoshowthatforanypositiveconstantC,thereexistsomeconstants e C 1 ,··· , e C 4 > 0suchthatthefollowingprobabilitybounds P( max 1 k p |S k1 E(S k1 )|Cn 1 ) p e C 1 exp ⇣ e C 2 n ↵ 1⌘ 1 ⌘ + e C 3 exp ⇣ e C 4 n ↵ 2⌘ 1 ⌘ , (D.6) P( max 1 k p |S k2 E(S k2 )|Cn 1 ) p e C 1 exp[ e C 2 n ↵ 1(1 2 1)/(4+↵ 1) ], (D.7) P( max 1 k p |S k3 E(S k3 )|Cn 1 ) p e C 1 exp[ e C 2 n ↵ 1(1 2 1)/(8+↵ 1) ], (D.8) P(|S 4 E(S 4 )|Cn 1 ) e C 1 exp ⇣ e C 2 n ↵ 1⇣ 1 ⌘ + e C 3 exp ⇣ e C 4 n ↵ 2⇣ 0 2 ⌘ , (D.9) hold for all n sufficiently large when 0 2 1 +4⇠ 1 < 1 and 0 2 1 +4⇠ 2 < 1, where ⌘ 1 =min{(12 1 4⇠ 2 )/(8+↵ 1 ), (12 1 4⇠ 1 )/(12+↵ 1 )},⇣ 1 =min{(12 1 4⇠ 2 )/(4+ ↵ 1 ), (12 1 4⇠ 1 )/(8+↵ 1 )},⇣ 2 =min{(12 1 2⇠ 2 )/(4+↵ 1 ), (12 1 2⇠ 1 )/(6+↵ 1 )}, and⇣ 0 2 =min{⇣ 2 ,(12 1 )/(4+↵ 2 )}. Define⌘ =min{⌘ 1 ,(12 1 )/(4+↵ 1 ),(12 1 )/(8+ ↵ 1 ),⇣ 1 } and ⇣ =min{⌘ 1 ,⇣ 0 2 }. Then ⌘ = ⌘ 1 and ⇣ =min{⌘ 1 ,(12 1 )/(4+↵ 2 )}. Thus, by Lemmas5.8–5.12,wehave P( max 1 k p |b ! k ! k |Cn 1 ) p e C 1 exp( e C 2 n ↵ 1⌘ )+ e C 3 exp( e C 4 n ↵ 2⇣ ). (D.10) Chapter3. Application of variable selection in genetic data 75 Thus,iflogp =o{n ↵ 1⌘ },theresultofthepart(a)inTheorem4.2followsimmediately. It thus remains to prove the probability bounds (D.6)–(D.9). Since the proofs of (D.6)–(D.9) aresimilar,herewefocuson(D.6)tosavespace. Throughouttheproof,thesamenotation e C is used to denote a generic positive constant without loss of generality, which may take different valuesateachappearance. 
Recall that Y i = 0 + x T i 0 + z T i 0 + " i = 0 + x T i,B 0,B + z T i,I 0,I + " i , where x i = (X i1 ,···,X ip ) T ,z i =(X i1 X i2 ,···,X i,p 1 X i,p ) T ,x i,B =(X ij ,j2B) T ,z i,I =(X ik X i` ,(k,`)2 I) T , 0,B =( 0,j 2B) T , and 0,I =( 0,k` ,(k,`)2I) T . To simplify the presentation, we assumethattheintercept 0 iszerowithoutlossofgenerality. Thus S k1 =n 1 n X i=1 X 2 ik Y 2 i =n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I +" i ) 2 =n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I ) 2 +2n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I )" i +n 1 n X i=1 X 2 ik " 2 i ,S k1,1 +2S k1,2 +S k1,3 . Similarly, E(S k1 ) can be written as E(S k1 )= E(S k1,1 )+2E(S k1,2 )+E(S k1,3 ). So S k1 E(S k1 )canbeexpressedasS k1 E(S k1 )=[S k1,1 E(S k1,1 )]+2[S k1,2 E(S k1,2 )]+[S k1,3 E(S k1,3 )]. Bythetriangleinequalityandtheunionboundwehave P( max 1 k p |S k1 E(S k1 )|Cn 1 ) P( 3 [ j=1 { max 1 k p |S k1,j E(S k1,j )|Cn 1 /4}) 3 X j=1 P( max 1 k p |S k1,j E(S k1,j )|Cn 1 /4). (D.11) In what follows, we will provide details on deriving an exponential tail probability bound for each term on the right hand side above. To enhance readability, we split the proof into three steps. Step 1. We start with the first term max 1 k p |S k1,1 E(S k1,1 )|. Define the event ⌦ i = {|X ij | M 1 forallj2M[{ k}} with M = A[B and M 1 a large positive number that will be specified later. Let T k1 = n 1 n P i=1 X 2 ik (x T i,B 0,B + z T i,I 0,I ) 2 I ⌦ i and T k2 = n 1 n P i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I ) 2 I ⌦ c i , where I(·) is the indicator function and⌦ c i is the com- plementoftheset⌦ i . Then S k1,1 E(S k1,1 )=[T k1 E(T k1 )]+T k2 E(T k2 ). (D.12) Chapter3. Application of variable selection in genetic data 76 NotethatE(T k2 )= E[X 2 1k (x T 1,B 0,B +z T 1,I 0,I ) 2 I ⌦ c 1 ]. Bythefact (a+b) 2 2(a 2 +b 2 ) for tworealnumbersaandb,theCauchy-Schwarzinequality,andCondition1,wehave (x T 1,B 0,B +z T 1,I 0,I ) 2 2[(x T 1,B 0,B ) 2 +(z T 1,I 0,I ) 2 ] 2C 2 0 (s 2 kx 1,B k 2 +s 1 kz 1,I k 2 ), (D.13) where C 0 is some positive constant andk·k denotes the Euclidean norm. This ensures that E(T k2 ) is bounded by 2C 2 0 [s 2 E(X 2 1k kx 1,B k 2 I ⌦ c 1 )+ s 1 E(X 2 1k kz 1,I k 2 I ⌦ c 1 )]. By the Cauchy- Schwarz inequality, the union bound, and the inequality (a + b) 2 2(a 2 + b 2 ), we obtain that E(X 2 1k kx 1,B k 2 I ⌦ c 1 ) ⇥ E(X 4 1k kx 1,B k 4 )P(⌦ c 1 ) ⇤ 1/2 8 < : 2 4 s 2 X j2B E(X 4 1k X 4 1j ) 3 5 P(⌦ c 1 ) 9 = ; 1/2 8 < : 2 1 s 2 X j2B [E(X 8 1k )+E(X 8 1j )] 9 = ; 1/2 2 4 X j2M[{ k} P(|X ij |>M 1 ) 3 5 1/2 e Cs 2 (1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 1 /(2c 1 )] for some positive constant e C, where the last inequality follows from Condition 2 and Lemma 5.2. Similarly, we haveE(X 2 1k kz 1,I k 2 I ⌦ c 1 ) e Cs 1 (1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 1 /(2c 1 )]. This togetherwiththeaboveinequalitiesentailsthat 0 E(T k2 ) 2C 2 0 e C(s 2 1 +s 2 2 )(1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 1 /(2c 1 )]. If we chooseM 1 = n ⌘ 1 with ⌘ 1 > 0, then by Condition 1, for any positive constantC, whenn issufficientlylarge, |E(T k2 )| 2C 2 0 e C(n 2⇠ 1 +n 2⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 1 /(2c 1 )]<Cn 1 /12(D.14) holdsuniformlyforall1 k p. Theaboveinequalitytogetherwith(D.12)ensuresthat P( max 1 k p |S k1,1 E(S k1,1 )|Cn 1 /4) P( max 1 k p |T k1 E(T k1 )|Cn 1 /12)+P( max 1 k p |T k2 |Cn 1 /12) (D.15) for alln sufficiently large. Thus we only need to establish the probability bound for each term ontherighthandsideof(D.15). Firstconsidermax 1 k p |T k1 E(T k1 )|. 
Usingsimilarargumentsforproving(D.13),wehave (x T i,B 0,B +z T i,I 0,I ) 2 2C 2 0 (s 2 kx i,B k 2 +s 1 kz i,I k 2 )andthus 0 X 2 ik (x T i,B 0,B +z T i,I 0,I ) 2 I ⌦ i 2C 2 0 X 2 ik (s 2 kx i,B k 2 +s 1 kz i,I k 2 )I ⌦ i 2C 2 0 M 4 1 (s 2 2 +s 2 1 M 2 1 ). Chapter3. Application of variable selection in genetic data 77 Forany> 0,byHoeffding’sinequality[? ],weobtain P(|T k1 E(T k1 )| ) 2exp n 2 2C 4 0 M 8 1 (s 2 2 +s 2 1 M 2 1 ) 2 2exp n 2 4C 4 0 M 8 1 (s 4 2 +s 4 1 M 4 1 ) 2exp ✓ n 2 8C 4 0 M 8 1 s 4 2 ◆ +2exp ✓ n 2 8C 4 0 M 12 1 s 4 1 ◆ , where we have used the fact that (a + b) 2 2(a 2 + b 2 ) for any real numbers a and b, and exp[c/(a+b)] exp[c/(2a)]+exp[c/(2b)] for anya,b,c > 0. Recall thatM 1 = n ⌘ 1 . UnderCondition1,taking =Cn 1 /12givesthat P( max 1 k p |T k1 E(T k1 )|Cn 1 /12) p X k=1 P(|T k1 E(T k1 )|Cn 1 /12) 2pexp ⇣ e Cn 1 2 1 8⌘ 1 4⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 1 12⌘ 1 4⇠ 1 ⌘ . (D.16) Next, consider max 1 k p |T k2 |. RecallthatT k2 =n 1 n P i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I ) 2 I ⌦ c i 0. By Markov’s inequality, for any> 0, we haveP(|T k2 | ) 1 E(|T k2 |)= 1 E(T k2 ). Inviewofthefirstinequalityin(D.14),taking =Cn 1 /12leadsto P(|T k2 |Cn 1 /12) 24C 1 C 2 0 e Cn 1 (n 2⇠ 1 +n 2⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 1 /(2c 1 )] forall1 k p. Therefore, P( max 1 k p |T k2 |Cn 1 /12) p X k=1 P(|T k2 |Cn 1 /12) 24pC 1 C 2 0 e Cn 1 (n 2⇠ 1 +n 2⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 1 /(2c 1 )]. (D.17) Combining(D.15),(D.16),and(D.17)yieldsthatforsufficientlylargen, P( max 1 k p |S k1,1 E(S k1,1 )|Cn 1 /4) 2pexp ⇣ e Cn 1 2 1 8⌘ 1 4⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 1 12⌘ 1 4⇠ 1 ⌘ +24pC 1 C 2 0 e Cn 1 (n 2⇠ 1 +n 2⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 1 /(2c 1 )]. (D.18) To balance the three terms on the right hand side of (D.18), we choose ⌘ 1 =min{(12 1 4⇠ 2 )/(8+↵ 1 ), (12 1 4⇠ 1 )/(12+↵ 1 )}> 0andtheprobabilitybound(D.18)becomes P( max 1 k p |S k1,1 E(S k1,1 )|Cn 1 /4) p e C 5 exp ⇣ e C 6 n ↵ 1⌘ 1 ⌘ (D.19) forallnsufficientlylarge,where e C 5 and e C 6 aretwopositiveconstants. Chapter3. Application of variable selection in genetic data 78 Step2. We establish the probability bound for max 1 k p |S k1,2 E(S k1,2 )|. Define the event i ={|X ij | M 2 forallj2M[{ k}}withM =A[B andlet T k3 =n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I )" i I i I(|" i | M 3 ), T k4 =n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I )" i I i I(|" i |>M 3 ), T k5 =n 1 n X i=1 X 2 ik (x T i,B 0,B +z T i,I 0,I )" i I c i , whereM 2 andM 3 are two large positive numbers which will be specified later. ThenS k1,2 = T k3 + T k4 + T k5 . Similarly, E(S k1,2 ) can be written as E(S k1,2 )= E(T k3 )+ E(T k4 )+ E(T k5 ). Since " 1 has mean zero and is independent of X 1,1 ,···,X 1,p , we have E(T k5 )= E[X 2 1k (x T 1,B 0,B + z T 1,I 0,I )" 1 I c 1 ]= E[X 2 1k (x T 1,B 0,B + z T 1,I 0,I )I c 1 ]E(" 1 )= 0. Thus S k1,2 E(S k1,2 )canbeexpressedas S k1,2 E(S k1,2 )=[T k3 E(T k3 )]+T k4 +T k5 E(T k4 ). (D.20) NotethatE(T k4 )=E[X 2 1k (x T 1,B 0,B +z T 1,I 0,I )" 1 I 1 I(|" 1 |>M 3 )]. Thus |E(T k4 )| E[X 2 1k |x T 1,B 0,B +z T 1,I 0,I |I 1 |" 1 |I(|" 1 |>M 3 )]. ItfollowsfromthetriangleinequalityandCondition1that X 2 1k |x T 1,B 0,B +z T 1,I 0,I |I 1 X 2 1k (|x T 1,B 0,B |+|z T 1,I 0,I |)I 1 C 0 M 3 2 (s 2 +s 1 M 2 ) (D.21) forall1 k pandsomepositiveconstantC 0 . BytheCauchy-Schwarzinequality,Condition 2,andLemma5.2,wehave E[|" 1 |I(|" 1 |>M 3 )] [E(" 2 1 )P(|" 1 |>M 3 )] 1/2 e Cexp[M ↵ 2 3 /(2c 1 )]. 
(D.22) Thistogetherwiththeaboveinequalitiesentailsthat |E(T k4 )| C 0 M 3 2 (s 2 +s 1 M 2 )E[|" 1 |I(|" 1 |>M 3 )] C 0 e CM 3 2 (s 2 +s 1 M 2 )exp[M ↵ 2 3 /(2c 1 )]. If we choose M 2 = n ⌘ 2 and M 3 = n ⌘ 3 with ⌘ 2 > 0 and ⌘ 3 > 0, then under Condition 1, for anypositiveconstantC,whennissufficientlylarge, |E(T k4 )| C 0 e Cn 3⌘ 2 (n ⇠ 2 +n ⇠ 1+⌘ 2 )exp[n ↵ 2⌘ 3 /(2c 1 )] Cn 1 /16 Chapter3. Application of variable selection in genetic data 79 holdsuniformlyforall1 k p. Thistogetherwith(D.20)ensuresthat P( max 1 k p |S k1,2 E(S k1,2 )|Cn 1 /4) P( max 1 k p |T k3 E(T k3 )|Cn 1 /16) +P( max 1 k p |T k4 |Cn 1 /16)+P( max 1 k p |T k5 |Cn 1 /16) (D.23) forallnsufficientlylarge. Inwhatfollows,wewillprovidedetailsonestablishingtheprobabil- ityboundforeachtermontherighthandsideof(D.23). Firstconsidermax 1 k p |T k3 E(T k3 )|. Inviewof(D.21),wehave|X 2 ik (x T i,B 0,B +z T i,I 0,I )" i I i I(|" i | M 3 )| C 0 M 3 2 M 3 (s 2 +s 1 M 2 ). Forany> 0,byHoeffding’sinequality[? ],itholdsthat P(|T k3 E(T k3 )| ) 2exp n 2 2C 2 0 M 6 2 M 2 3 (s 2 +s 1 M 2 ) 2 2exp n 2 4C 2 0 M 6 2 M 2 3 (s 2 2 +s 2 1 M 2 2 ) 2exp ✓ n 2 8C 2 0 M 6 2 M 2 3 s 2 2 ◆ +2exp ✓ n 2 8C 2 0 M 8 2 M 2 3 s 2 1 ◆ , where we have used the fact that exp[c/(a +b)] exp[c/(2a)] + exp[c/(2b)] for any a,b,c> 0. RecallthatM 2 =n ⌘ 2 andM 3 =n ⌘ 3 . Thus,taking =Cn 1 /16gives P( max 1 k p |T k3 E(T k3 )|Cn 1 /16) p X k=1 P(|T k3 E(T k3 )|Cn 1 /16) 2pexp ⇣ e Cn 1 2 1 6⌘ 2 2⌘ 3 2⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 1 8⌘ 2 2⌘ 3 2⇠ 1 ⌘ . (D.24) Next we handle max 1 k p |T k4 |. Using similar arguments as for proving (D.21), we have X 2 ik |x T i,B 0,B + z T i,I 0,I |I i C 0 M 3 2 (s 2 +s 1 M 2 ) for all 1 i n and 1 k p and thus max 1 k p |T k4 | C 0 M 3 2 (s 2 +s 1 M 2 )n 1 n X i=1 |" i |I(|" i |>M 3 ). ItfollowsfromMarkov’sinequalityand(D.22)that P( max 1 k p |T k4 | ) P ( C 0 M 3 2 (s 2 +s 1 M 2 )n 1 n X i=1 |" i |I(|" i |>M 3 ) ) 1 E " C 0 M 3 2 (s 2 +s 1 M 2 )n 1 n X i=1 |" i |I(|" i |>M 3 ) # = 1 C 0 M 3 2 (s 2 +s 1 M 2 )E[|" 1 |I(|" 1 |>M 3 )] 1 C 0 e CM 3 2 (s 2 +s 1 M 2 )exp[M ↵ 2 3 /(2c 1 )]. Chapter3. Application of variable selection in genetic data 80 RecallthatM 2 =n ⌘ 2 andM 3 =n ⌘ 3 . Thus,taking =Cn 1 /16resultsin P( max 1 k p |T k4 |Cn 1 /16) 16C 1 C 0 e Cn 3⌘ 2+ 1 (n ⇠ 2 +n ⇠ 1+⌘ 2 )exp[n ↵ 2⌘ 3 /(2c 1 )]. (D.25) We next consider max 1 k p |T k5 |. Since|T k5 | n 1 n P i=1 X 2 ik |(x T i,B 0,B +z T i,I 0,I )" i |I c i , by Markov’sinequalitywehave P(|T k5 | ) P ( n 1 n X i=1 X 2 ik |(x T i,B 0,B +z T i,I 0,I )" i |I c i ) 1 E " n 1 n X i=1 X 2 ik |(x T i,B 0,B +z T i,I 0,I )" i |I c i # = 1 E[X 2 1k |(x T 1,B 0,B +z T 1,I 0,I )" 1 |I c 1 ]. ItfollowsfromtheCauchy-Schwarzinequalityand(D.13)that E[X 2 1k |(x T 1,B 0,B +z T 1,I 0,I )" 1 |I c i ]{ E[X 4 1k (x T 1,B 0,B +z T 1,I 0,I ) 2 " 2 1 ]P( c 1 )} 1/2 { 2C 2 0 ⇥ s 2 E(X 4 1k kx 1,B k 2 " 2 1 )+s 1 E(X 4 1k kz 1,I k 2 " 2 1 ) ⇤ P( c 1 )} 1/2 . ApplyingtheCauchy-Schwarzinequalityagaingives E(X 4 1k kx 1,B k 2 " 2 1 ) ⇥ E(X 8 1k kx 1,B k 4 )E(" 4 1 ) ⇤ 1/2 2 4 s 2 X j2B E(X 8 1k X 4 1j ) 3 5 1/2 ⇥ E(" 4 1 ) ⇤ 1/2 8 < : 2 1 s 2 X j2B [E(X 16 1k )+E(X 8 1j )] 9 = ; 1/2 ⇥ E(" 4 1 ) ⇤ 1/2 e Cs 2 , where the last inequality follows from Condition 2 and Lemma 5.2. Similarly, we can show that E(X 4 1k kz 1,I k 2 " 2 1 ) e Cs 1 . By Condition 2 and the union bound, we deduce P( c 1 )= P(|X ij |>M 2 forsomej2M[{ k}) (1+2s 1 +s 2 )c 1 exp(M ↵ 1 2 /c 1 ). Thistogetherwith theaboveinequalitiesentailsthat P(|T k5 | ) 1 {2C 2 0 e C(s 2 1 +s 2 2 )(1+2s 1 +s 2 )c 1 exp(M ↵ 1 2 /c 1 )} 1/2 . RecallthatM 2 =n ⌘ 2 . 
UnderCondition1,taking =Cn 1 /16yields P( max 1 k p |T k5 |Cn 1 /16) p X k=1 P(|T k5 |Cn 1 /16) 16pC 1 n 1 {2C 2 0 e Cc 1 (n 2⇠ 1 +n 2⇠ 2 )(1+2n ⇠ 1 +n ⇠ 2 )} 1/2 exp[n ↵ 1⌘ 2 /(2c 1 )]. (D.26) Chapter3. Application of variable selection in genetic data 81 Combining(D.23),(D.24),(D.25),and(D.26)yieldsthatforsufficientlylargen, P( max 1 k p |S k1,2 E(S k1,2 )|Cn 1 /4) 2pexp ⇣ e Cn 1 2 1 6⌘ 2 2⌘ 3 2⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 1 8⌘ 2 2⌘ 3 2⇠ 1 ⌘ +16pC 1 n 1 {2C 2 0 e Cc 1 (n 2⇠ 1 +n 2⇠ 2 )(1+2n ⇠ 1 +n ⇠ 2 )} 1/2 exp[n ↵ 1⌘ 2 /(2c 1 )] +16C 1 C 0 e Cn 3⌘ 2+ 1 (n ⇠ 2 +n ⇠ 1+⌘ 2 )exp[n ↵ 2⌘ 3 /(2c 1 )]. (D.27) Let ⌘ 2 = ⌘ 3 =min{(1 2 1 2⇠ 2 )/(8 + ↵ 1 ),(1 2 1 2⇠ 1 )/(10 + ↵ 1 )}. Then (D.27) becomes P( max 1 k p |S k1,2 E(S k1,2 )|Cn 1 /4) p e C 7 exp ⇣ e C 8 n ↵ 1⌘ 2 ⌘ + e C 9 exp[ e C 10 n ↵ 2⌘ 2 ]. (D.28) forallnsufficientlylarge,where e C 7 , e C 8 , e C 9 ,and e C 10 aresomepositiveconstants. Step3. Weestablishtheprobabilityboundformax 1 k p |S k1,3 E(S k1,3 )|. Define T k6 =n 1 n X i=1 X 2 ik " 2 i I(|X ik | M 4 )I(|" i | M 5 ), T k7 =n 1 n X i=1 X 2 ik " 2 i I(|X ik | M 4 )I(|" i |>M 5 ), T k8 =n 1 n X i=1 X 2 ik " 2 i I(|X ik |>M 4 ), where M 4 and M 5 are two large positive numbers whose values will be specified later. Then S k1,3 =T k6 +T k7 +T k8 . Similarly,E(S k1,3 )canbewrittenasE(S k1,3 )=E(T k6 )+E(T k7 )+ E(T k8 ) withE(T k6 )= E[X 2 1k " 2 1 I(|X 1k | M 4 )I(|" 1 | M 5 )],E(T k7 )= E[X 2 1k " 2 1 I(|X 1k | M 4 )I(|" 1 |>M 5 )], and E(T k8 )= E[X 2 1k " 2 1 I(|X 1k |>M 4 )]. Thus S k1,3 E(S k1,3 ) can be expressedas S k1,3 E(S k1,3 )=[T k6 E(T k6 )]+T k7 +T k8 [E(T k7 )+E(T k8 )]. (D.29) First consider the last two terms E(T k7 ) and E(T k8 ). It follows from 0 X 2 1k " 2 1 I(|X 1k | M 4 )I(|" 1 |>M 5 ) M 2 4 " 2 1 I(|" 1 |>M 5 )that 0 E(T k7 ) M 2 4 E[" 2 1 I(|" 1 |>M 5 )]. (D.30) Chapter3. Application of variable selection in genetic data 82 AnapplicationoftheCauchy-SchwarzinequalityleadstoE[" 2 1 I(|" 1 |>M 5 )] [E(" 4 1 )P(|" 1 |> M 5 )] 1/2 . ByCondition2andLemma5.2,wehave E[" 2 1 I(|" 1 |>M 5 )]{ E(" 4 1 )c 1 } 1/2 exp(c 1 1 M ↵ 2 5 /2) e Cexp[M ↵ 2 5 /(2c 1 )] (D.31) Combining(D.30)with(D.31)yields |E(T k7 )| e CM 2 4 exp[M ↵ 2 5 /(2c 1 )]. (D.32) Similarly,bytheCauchy-SchwarzinequalityandLemma5.2weobtain |E(T k8 )| =E[X 2 1k " 2 1 I(|X 1k |>M 4 )]{ E(X 4 1k " 4 1 )P(|X 1k |>M 4 )]} 1/2 c 1 2 [E(X 8 1k )+E(" 8 1 )] 1/2 exp[M ↵ 1 4 /(2c 1 )] e Cexp[M ↵ 1 4 /(2c 1 )]. (D.33) Combining(D.32)and(D.33)resultsin |E(T k7 )+E(T k8 )| e CM 2 4 exp[M ↵ 2 5 /(2c 1 )]+ e Cexp[M ↵ 1 4 /(2c 1 )]. If we chooseM 4 = n ⌘ 4 andM 5 = n ⌘ 5 with ⌘ 4 > 0 and ⌘ 5 > 0, then for any positive constant C,whennissufficientlylarge, |E(T k7 )+E(T k8 )| e Cn 2⌘ 4 exp[n ↵ 2⌘ 5 /(2c 1 )]+ e Cexp[n ↵ 1⌘ 4 /(2c 1 )]<Cn 1 /16 holdsuniformlyforall1 k p. Theaboveinequalitytogetherwith(D.29)ensuresthat P( max 1 k p |S k1,3 E(S k1,3 )|Cn 1 /4) P( max 1 k p |T k6 E(T k6 )|Cn 1 /16)+P( max 1 k p |T k7 |Cn 1 /16) +P( max 1 k p |T k8 |Cn 1 /16) (D.34) forallnsufficientlylarge. In what follows, we will provide details on establishing the probability bound for each term on therighthandsideof(D.34). Firstconsidermax 1 k p |T k6 E(T k6 )|. Since0 X 2 ik " 2 i I(|X ik | M 4 )I(|" i | M 5 ) M 2 4 M 2 5 ,byHoeffding’sinequality[? ] wehaveforany> 0that P(|T k6 E(T k6 )| ) 2exp ✓ 2n 2 M 4 4 M 4 5 ◆ =2exp 2n 1 4⌘ 4 4⌘ 5 2 , bynotingthatM 4 =n ⌘ 4 andM 5 =n ⌘ 5 . Thus,taking =Cn 1 /16gives P( max 1 k p |T k6 E(T k6 )|Cn 1 /16) p X k=1 P(|T k6 E(T k6 )|Cn 1 /16) Chapter3. 
Application of variable selection in genetic data 83 2pexp ⇣ e Cn 1 2 1 4⌘ 4 4⌘ 5 ⌘ . (D.35) Next we handle max 1 k p |T k7 |. Since max 1 k p |T k7 | n 1 M 2 4 P n i=1 " 2 i I(|" i |>M 5 ), it followsfromMarkov’sinequalityand(D.31)thatforany> 0, P( max 1 k p |T k7 | ) P{n 1 M 2 4 n X i=1 " 2 i I(|" i |>M 5 ) } 1 E[n 1 M 2 4 n X i=1 " 2 i I(|" i |>M 5 )] = 1 M 2 4 E[" 2 1 I(|" 1 |>M 5 )] e C 1 M 2 4 exp[M ↵ 2 5 /(2c 1 )]. RecallthatM 4 =n ⌘ 4 andM 5 =n ⌘ 5 . Setting =Cn 1 /16intheaboveinequalityentails P(max 1 j p |T k7 |Cn 1 /16) 16C 1 e Cn 2⌘ 4+ 1 exp[n ↵ 2⌘ 5 /(2c 1 )]. (D.36) Wethenconsidermax 1 k p |T k8 |. ByMarkov’sinequalityand(D.33),forany> 0, P(|T k8 | ) 1 E[n 1 n X i=1 X 2 ik " 2 i I(|X ik |>M 4 )] = 1 E[X 2 1k " 2 1 I(|X 1k |>M 4 )] 1 e Cexp[M ↵ 1 4 /(2c 1 )]. (D.37) RecallthatM 4 =n ⌘ 1 . Inviewof(D.37),taking =Cn 1 /16leadsto P( max 1 k p |T k8 |Cn 1 /16) p X k=1 P(|T k8 |Cn 1 /16) 16pC 1 e Cn 1 exp[n ↵ 1⌘ 4 /(2c 1 )]. (D.38) Combining(D.34),(D.35),(D.36)with(D.38)yieldsthatforsufficientlylargen, P( max 1 k p |S k1,3 E(S k1,3 )|Cn 1 /4) 2pexp ⇣ e Cn 1 2 1 4⌘ 4 4⌘ 5 ⌘ +16pC 1 e Cn 1 exp[n ↵ 1⌘ 4 /(2c 1 )]+16C 1 e Cn 2⌘ 4+ 1 exp[n ↵ 2⌘ 5 /(2c 1 )]. (D.39) Let⌘ 4 = ⌘ 5 =(12 1 )/(8+↵ 1 ). Then(D.39)becomes P( max 1 k p |S k1,3 E(S k1,3 )|Cn 1 /4) p e C 11 exp[ e C 12 n ↵ 1⌘ 4 ]+ e C 13 exp[ e C 14 n ↵ 2⌘ 4 ] (D.40) forallnsufficientlylarge,where e C 11 , e C 12 , e C 13 ,and e C 14 aresomepositiveconstants. Chapter3. Application of variable selection in genetic data 84 Since 0<⌘ 1 <⌘ 2 = ⌘ 3 and ⌘ 1 ⌘ 4 , it follows from (D.11), (D.19), (D.28), and (D.40) that thereexistsomepositiveconstants e C 1 ,··· , e C 4 suchthat P( max 1 k p |S k1 E(S k1 )|Cn 1 ) p e C 1 exp ⇣ e C 2 n ↵ 1⌘ 1 ⌘ + e C 3 exp ⇣ e C 4 n ↵ 2⌘ 1 ⌘ forallnsufficientlylarge. Thisconcludestheproofofparta)ofTheorem4.2. D.3. Proofofpartb)ofTheorem4.2 Werecallthat! ⇤ j =E(X j Y)andb ! ⇤ j =n 1 n P i=1 X ij Y i . NotethatY i = 0 +x T i 0 +z T i 0 +" i = 0 +x T i,B 0,B +z T i,I 0,I +" i ,wherex i =(X i1 ,···,X ip ) T ,z i =(X i1 X i2 ,···,X i,p 1 X i,p ) T , x i,B =(X i` ,`2B) T , z i,I =(X ik X i` ,(k,`)2I) T , 0,B =( 0 ` ,`2B) T , and 0,I = ( k` ,(k,`)2I) T . To simplify the proof, we assume that the intercept 0 is zero without loss ofgenerality. Thus b ! ⇤ j =n 1 n X i=1 X ij Y i =n 1 n X i=1 X ij (x T i,B 0,B +z T i,I 0,I )+n 1 n X i=1 X ij " i ,S j1 +S j2 . Similarly,! ⇤ j canbewrittenas! ⇤ j =E(X j Y)=E(S j1 )+E(S j2 ). Sob ! ⇤ j ! ⇤ j canbeexpressed asb ! ⇤ j ! ⇤ j =[S j1 E(S j1 )]+[S j2 E(S j2 )]. Bythetriangleinequalityandtheunionbound, itholdsthat P(max 1 j p |b ! ⇤ j ! ⇤ j |Cn 2 ) P(max 1 j p |S j1 E(S j1 )|Cn 2 /2)+P(max 1 j p |S j2 E(S j2 )|Cn 2 /2). (D.41) In what follows, we will provide details on deriving an exponential tail probability bound for eachtermontherighthandsideabove. Toenhancereadability,wesplittheproofintotwosteps. Step1. Westartwiththefirstterm max 1 k p |S j1 E(S j1 )|. Definetheevent i ={|X i` | M 6 forall`2M[{ j}}withM =A[B andM 6 alargepositivenumberthatwillbespecified later. Let T j1 = n 1 n P i=1 X ij (x T i,B 0,B + z T i,I 0,I )I i and T j2 = n 1 n P i=1 X ij (x T i,B 0,B + z T i,I 0,I )I c i , whereI(·) is the indicator function and c i is the complement of the set i . Then anapplicationofthetriangleinequalityyields |S j1 E(S j1 )| =|[T j1 E(T j1 )]+T j2 E(T j2 )|| T j1 E(T j1 )|+|T j2 |+|E(T j2 )| | T j1 E(T j1 )|+|T j2 |+E(|T j2 |). (D.42) Chapter3. 
Application of variable selection in genetic data 85 Notethat|T j2 | n 1 n P i=1 |X ij (x T i,B 0,B +z T i,I 0,I )|I c i andthusE(|T j2 |) E[|X 1j (x T 1,B 0,B + z T 1,I 0,I )|I c 1 ]. BythetriangleinequalityandCondition1,wehave |X 1j (x T 1,B 0,B +z T 1,I 0,I )| C 0 (|X 1j |kx 1,B k 1 +|X 1j |kz 1,I k 1 ), (D.43) which ensures that E(|T j2 |) is bounded by C 0 [E(|X 1j |kx 1,B k 1 I ⌦ c 1 )+ E(|X 1j |kz 1,I k 1 I ⌦ c 1 )]. Herek·k 1 istheL 1 norm. BytheCauchy-Schwarzinequalityandthetriangularinequality,we deduce E(|X 1j |kx 1,B k 1 I c 1 ) ⇥ E(X 2 1j kx 1,B k 2 1 )P( c 1 ) ⇤ 1/2 (" s 2 X `2B E(X 2 1j X 2 1` ) # P( c 1 ) ) 1/2 ( 2 1 s 2 X `2B [E(X 4 1j )+E(X 4 1` )] ) 1/2 2 4 X `2M[{ j} P(|X i` |>M 6 ) 3 5 1/2 e Cs 2 (1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 6 /(2c 1 )] for some positive constant e C, where the last inequality follows from Condition 2 and Lemma 5.2. Similarly,wehaveE(|X 1j |kz 1,I k 1 I c 1 ) e Cs 1 (1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 6 /(2c 1 )]. This togetherwiththeaboveinequalitiesentailsthat E(|T j2 |) C 0 e C(s 1 +s 2 )(1+s 2 +2s 1 ) 1/2 exp[M ↵ 1 6 /(2c 1 )]. If we chooseM 6 = n ⌘ 6 with ⌘ 6 > 0, then by Condition 1, for any positive constantC, whenn issufficientlylarge, E(|T j2 |) C 0 e C(n ⇠ 1 +n ⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 6 /(2c 1 )]<Cn 2 /6 (D.44) holdsuniformlyforall1 j p. Theaboveinequalitytogetherwith(D.42)ensuresthat P(max 1 j p |S j1 E(S j1 )|Cn 2 /2) P(max 1 j p |T j1 E(T j1 )|Cn 2 /6)+P(max 1 j p |T j2 |Cn 2 /6) (D.45) forallnissufficientlylarge. Thusweonlyneedtoestablishtheprobabilityboundforeachterm ontherighthandsideof(D.45). First consider max 1 j p |T j1 E(T j1 )|. Using similar arguments as for proving (D.43), we have |X ij (x T i,B 0,B +z T i,I 0,I )I i | C 0 (|X ij |kx i,B k 1 +|X ij |kz i,I k 1 )I i C 0 (s 2 M 2 6 +s 1 M 3 6 ). Chapter3. Application of variable selection in genetic data 86 Forany> 0,anapplicationofHoeffding’sinequality[? ] gives P(|T j1 E(T j1 )| ) 2exp n 2 2C 2 0 M 4 6 (s 2 +s 1 M 6 ) 2 2exp n 2 4C 2 0 M 4 6 (s 2 2 +s 2 1 M 2 6 ) 2exp ✓ n 2 8C 2 0 M 4 6 s 2 2 ◆ +2exp ✓ n 2 8C 2 0 M 6 6 s 2 1 ◆ , where we have used the fact that (a + b) 2 2(a 2 + b 2 ) for any real numbers a and b, and exp[c/(a+b)] exp[c/(2a)]+exp[c/(2b)] for anya,b,c > 0. Recall thatM 6 = n ⌘ 6 . UnderCondition1,taking =Cn 2 /6resultsin P(max 1 j p |T j1 E(T j1 )|Cn 2 /6) p X j=1 P(|T j1 E(T j1 )|Cn 2 /6) 2pexp ⇣ e Cn 1 2 2 4⌘ 6 2⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 2 6⌘ 6 2⇠ 1 ⌘ . (D.46) Next, consider max 1 j p |T j2 |. By Markov’s inequality, for any> 0, we have P(|T j2 | ) 1 E(|T j2 |). Inviewofthefirstinequalityin(D.44),taking =Cn 2 /6givesthat P(|T j2 |Cn 2 /6) 6C 1 C 0 e Cn 2 (n ⇠ 1 +n ⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 6 /(2c 1 )] forall1 j p. Therefore, P(max 1 j p |T j2 |Cn 2 /6) p X j=1 P(|T j2 |Cn 2 /6) 6pC 1 C 0 e Cn 2 (n ⇠ 1 +n ⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 6 /(2c 1 )]. (D.47) Combining(D.45),(D.46),and(D.47)yieldsthatforsufficientlylargen, P(max 1 j p |S j1 E(S j1 )|Cn 2 /2) 2pexp ⇣ e Cn 1 2 2 4⌘ 6 2⇠ 2 ⌘ +2pexp ⇣ e Cn 1 2 2 6⌘ 6 2⇠ 1 ⌘ +6pC 1 C 0 e Cn 2 (n ⇠ 1 +n ⇠ 2 )(1+n ⇠ 2 +2n ⇠ 1 ) 1/2 exp[n ↵ 1⌘ 6 /(2c 1 )]. (D.48) To balance the three terms on the right hand side of (D.48), we choose ⌘ 6 =min{(12 2 2⇠ 2 )/(4+↵ 1 ), (12 2 2⇠ 1 )/(6+↵ 1 )}> 0andtheprobabilitybound(D.48)thenbecomes P(max 1 j p |S j1 E(S j1 )|Cn 2 /2) p e C 1 exp ⇣ e C 2 n ↵ 1⌘ 6 ⌘ (D.49) forallnsufficientlylarge,where e C 1 and e C 2 aretwopositiveconstants. Chapter3. Application of variable selection in genetic data 87 Step2. 
Weestablishtheprobabilityboundformax 1 j p |S j2 E(S j2 )|. Define T j3 =n 1 n X i=1 X ij " i I(|X ij | M 7 )I(|" i | M 8 ), T j4 =n 1 n X i=1 X ij " i I(|X ij | M 7 )I(|" i |>M 8 ), T j5 =n 1 n X i=1 X ij " i I(|X ij |>M 7 ), where M 7 and M 8 are two large positive numbers whose values will be specified later. Then S j2 = T j3 +T j4 +T j5 . Similarly, E(S j2 ) can be written as E(S j2 )= E(T j3 )+E(T j4 )+ E(T j5 ). Since " 1 has mean zero and is independent of X 1,1 ,···,X 1,p , we have E(T j5 )= E[X 1j " 1 I(|X 1j |>M 7 )] = E[X 1j I(|X 1j |>M 7 )]E(" 1 )=0. Thus S j2 E(S j2 ) can be expressed as S j2 E(S j2 )=[T j3 E(T j3 )] +T j4 +T j5 E(T j4 ). An application of the triangleinequalityyields |S j2 E(S j2 )|| T j3 E(T j3 )|+|T j4 |+|T j5 |+|E(T j4 )| | T j3 E(T j3 )|+|T j4 |+|T j5 |+E(|T j4 |). (D.50) FirstconsiderthelasttermE(|T j4 |). Notethat|T j4 | n 1 P n i=1 |X ij " i |I(|X ij | M 7 )I(|" i |> M 8 )andthus E(|T j4 |) E[|X 1j " 1 |I(|X 1j | M 7 )I(|" 1 |>M 8 )] M 7 E[|" 1 |I(|" 1 |>M 8 )]. (D.51) AnapplicationoftheCauchy-SchwarzinequalitygivesE[|" 1 |I(|" 1 |>M 8 )] [E(" 2 1 )P(|" 1 |> M 8 )] 1/2 . ByCondition2andLemma5.2,wehave E[|" 1 |I(|" 1 |>M 8 )]{ E(" 2 1 )c 1 } 1/2 exp(c 1 1 M ↵ 2 8 /2) e Cexp[M ↵ 2 8 /(2c 1 )] (D.52) Combining(D.51)with(D.52)yields E(|T j4 |) e CM 7 exp[M ↵ 2 8 /(2c 1 )]. (D.53) If we chooseM 7 = n ⌘ 7 andM 8 = n ⌘ 8 with ⌘ 7 > 0 and ⌘ 8 > 0, then for any positive constant C,whennissufficientlylarge, E(|T j4 |) e Cn ⌘ 7 exp[n ↵ 2⌘ 8 /(2c 1 )]<Cn 2 /8 holdsuniformlyforall1 j p. Theaboveinequalitytogetherwith(D.50)ensuresthat P(max 1 j p |S j2 E(S j2 )|Cn 2 /2) Chapter3. Application of variable selection in genetic data 88 P(max 1 j p |T j3 E(T j3 )|Cn 2 /8)+P(max 1 j p |T j4 |Cn 2 /8) +P(max 1 j p |T j5 |Cn 2 /8) (D.54) forallnsufficientlylarge. In what follows, we will provide details on establishing the probability bound for each term on the right hand side of (D.54). First consider max 1 j p |T j3 E(T j3 )|. Since|X ij " i I(|X ij | M 7 )I(|" i | M 8 )| M 7 M 8 ,forany> 0,byHoeffding’sinequality[? ] weobtain P(|T j3 E(T j3 )| ) 2exp ✓ n 2 2M 2 7 M 2 8 ◆ =2exp 2 1 n 1 2⌘ 7 2⌘ 8 2 , bynotingthatM 7 =n ⌘ 7 andM 8 =n ⌘ 8 . Thus,taking =Cn 2 /8gives P(max 1 j p |T j3 E(T j3 )|Cn 2 /8) p X j=1 P(|T j3 E(T j3 )|Cn 2 /8) 2pexp ⇣ e Cn 1 2 2 2⌘ 7 2⌘ 8 ⌘ . (D.55) Next we handle max 1 j p |T j4 |. Since max 1 j p |T j4 | n 1 M 7 P n i=1 |" i |I(|" i |>M 8 ), it followsfromMarkov’sinequalityand(D.52)thatforany> 0, P(max 1 j p |T j4 | ) P{n 1 M 7 n X i=1 |" i |I(|" i |>M 8 ) } 1 E[n 1 M 7 n X i=1 |" i |I(|" i |>M 8 )] = 1 M 7 E[|" 1 |I(|" 1 |>M 8 )] e C 1 M 7 exp[M ↵ 2 8 /(2c 1 )]. RecallthatM 7 =n ⌘ 7 andM 8 =n ⌘ 8 . Setting =Cn 2 /8intheaboveinequalityentails P(max 1 j p |T j4 |Cn 2 /8) 16C 1 e Cn ⌘ 7+ 2 exp[n ↵ 2⌘ 8 /(2c 1 )]. (D.56) We now consider max 1 j p |T j5 |. By the Cauchy-Schwarz inequality and Lemma 5.2 we de- ducethat E|T j5 | =E|X 1j " 1 I(|X 1j |>M 7 )|{ E(X 2 1j " 2 1 )P(|X 1j |>M 7 )]} 1/2 c 1 2 [E(X 4 1k )+E(" 4 1 )] 1/2 exp[M ↵ 1 7 /(2c 1 )] e Cexp[M ↵ 1 7 /(2c 1 )]. AnapplicationofMarkov’sinequalityyields P(|T j5 | ) 1 E|T j5 | 1 e Cexp[M ↵ 1 7 /(2c 1 )] (D.57) Chapter3. Application of variable selection in genetic data 89 forany> 0. RecallthatM 7 =n ⌘ 7 . Inviewof(D.57),taking =Cn 2 /8givesthat P(max 1 j p |T j5 |Cn 2 /8) p X j=1 P(|T j5 |Cn 2 /8) 8pC 1 e Cn 2 exp[n ↵ 1⌘ 7 /(2c 1 )]. 
(D.58) Combining(D.54),(D.55),(D.56),and(D.58)yieldsthatforsufficientlylargen, P(max 1 j p |S j2 E(S j2 )|Cn 2 /2) 2pexp ⇣ e Cn 1 2 2 2⌘ 7 2⌘ 8 ⌘ +8pC 1 e Cn 2 exp[n ↵ 1⌘ 7 /(2c 1 )]+16C 1 e Cn ⌘ 7+ 2 exp[n ↵ 2⌘ 8 /(2c 1 )]. (D.59) Let⌘ 7 = ⌘ 8 =(12 2 )/(4+↵ 1 ). Then(D.59)becomes P(max 1 j p |S j2 E(S j2 )|Cn 1 /2) p e C 3 exp[ e C 4 n ↵ 1⌘ 7 ]+ e C 5 exp[ e C 6 n ↵ 2⌘ 7 ] (D.60) forallnsufficientlylarge,where e C 3 , e C 4 , e C 5 ,and e C 6 aresomepositiveconstants. Since0<⌘ 6 <⌘ 7 ,itfollowsfrom(D.41),(D.49),and(D.60)that P(max 1 j p |b ! ⇤ j ! ⇤ j |Cn 2 ) p e C 1 exp ⇣ e C 2 n ↵ 1⌘ 6 ⌘ +p e C 3 exp[ e C 4 n ↵ 1⌘ 7 ]+ e C 5 exp[ e C 6 n ↵ 2⌘ 7 ] p e C 7 exp ⇣ e C 8 n ↵ 1⌘ 6 ⌘ + e C 5 exp[ e C 6 n ↵ 2⌘ 6 ] with e C 7 = e C 1 + e C 3 and e C 8 =min{ e C 2 , e C 4 }forallnsufficientlylarge. Iflogp =o(n ↵ 1⌘ 0 )with ⌘ 0 =min{(1 2 2 2⇠ 2 )/(4 + ↵ 1 ),(1 2 2 2⇠ 1 )/(6 + ↵ 1 )} > 0, then for any positive constantC,thereexistssomearbitrarilylargepositiveconstantC 2 suchthat P(max 1 j p |b ! ⇤ j ! ⇤ j |Cn 2 ) o(n C2 ) forallnsufficientlylarge,whichcompletestheproofofpartb)ofTheorem4.2. D.4. Proofofpartc)ofTheorem4.2 Themainideaoftheproofistofindprobabilityboundsforthetwoevents{I⇢ b I}and{M⇢ c M},respectively. Firstnotethatconditionalontheevent{A⇢ b A},wehave{I ⇢ b I}. Thusit holdsthat P(I⇢ b I)P(A⇢ b A). (D.61) Chapter3. Application of variable selection in genetic data 90 Define the eventE 1 = {max k2A |ˆ ! k ! k | < 2 1 c 2 n 1 }. Then, with ⌧ = c 2 n 1 , the event E 1 ensuresthatA⇢ b A. Thus, P(A⇢ b A)P(E 1 )=1P(E c 1 )=1P(max k2A |ˆ ! k ! k | 2 1 c 2 n 1 ). Following similar arguments as for proving (D.10), it can be shown that there exist some con- stants e C 1 > 0and e C 2 > 0suchthatforallnsufficientlylarge, P(max k2A |ˆ ! k ! k | 2 1 c 2 n 1 ) 2s 1 e C 1 exp[ e C 2 n min{↵ 1,↵ 2}r1 ]. (D.62) Note that the right hand side of (D.62) can be bounded by o(n C1 ) for some arbitrarily large positiveconstantC 1 . Thisgives P(A⇢ b A) 1o(n C1 ). (D.63) Thuscombining(D.61)and(D.63)yields P(I⇢ b I) 1o(n C1 ). (D.64) Using similar arguments as for proving part b) of Theorem 4.2 and (D.63), we can show that thereexistsomepositiveconstants e C 1 , e C 2 ,andC 2 suchthatforallnsufficientlylarge, P(B⇢ b B)P(max j2B |ˆ ! ⇤ j ! ⇤ j |< 2 1 c 2 n 2 ) 1s 2 e C 1 exp( e C 2 n ↵ 1r2 ) 1o(n C2 ), (D.65) Combining(D.63)and(D.65)leadsto P(M⇢ c M)P(A⇢ b A and B⇢ b B)P(A⇢ b A)+P(B⇢ b B)1 1o(n min{C1,C2} ). (D.66) Inviewof(D.64)and(D.66),weobtain P(I⇢ b I and M⇢ c M)P(I⇢ b I)+P(M⇢ c M)1 1o(n min{C1,C2} ) forallnsufficientlylarge. ThiscompletestheproofforthefirstpartofTheorem4.2c). Weproceedtoprovethesecondpartofpartc)ofTheorem4.2. Themainideaistoestablishthe probability bounds for two events{| b A| = O[n 2 1 max (⌃ ⇤ )]} and{| b B| = O[n 2 2 max (⌃ )]}, Chapter3. Application of variable selection in genetic data 91 respectively. Ifwecanshowthat P n | b A| =O[n 2 1 max (⌃ ⇤ )] o 1o(n C1 ), (D.67) P n | b B| =O[n 2 2 max (⌃ )] o 1o(n C2 ) (D.68) withC 1 andC 2 definedin(4.9)and(4.10),respectively,thenitholdsthat P n | b I| =O ⇥ n 4 1 2 max (⌃ ⇤ ) ⇤ o P n | b A| =O ⇥ n 2 1 max (⌃ ⇤ ) ⇤ o 1o(n C1 ) and P n | c M| =O ⇥ n 2 1 max (⌃ ⇤ )+n 2 2 max (⌃ ) ⇤ o P n | b A| =O[n 2 1 max (⌃ ⇤ )]and| b B| =O[n 2 2 max (⌃ )] o 1o(n min{C1,C2} ). Combiningthesetworesultsyields P ⇣ | b I| =O{n 4 1 2 max (⌃ ⇤ )}and| c M| =O{n 2 1 max (⌃ ⇤ )+n 2 2 max (⌃ )} ⌘ =1o ⇣ n min{C1,C2} ⌘ . It thus remains to prove (D.67) and (D.68). We begin with showing (D.68). The key step is to showthat p X j=1 (! 
⇤ j ) 2 =kE(xY)k 2 2 e C 3 max (⌃ ) (D.69) forsomeconstant e C 3 > 0. Ifso,conditionalontheeventE 2 = ⇢ max 1 j p |b ! ⇤ j ! ⇤ j | 2 1 c 2 n 2 , the number of variables in b B = {j : |b ! ⇤ j |>c 2 n 2 } cannot exceed the number of variables in{j : |! ⇤ j | > 2 1 c 2 n 2 }, which is bounded by 4 e C 3 c 2 2 n 2 2 max (⌃ ). Thus it follows from (4.10)thatforallnsufficientlylarge, P n | b B| 4 e C 3 c 2 2 n 2 2 max (⌃ ) o P(E 2 )=1P(E c 2 ) 1o(n C2 ). (D.70) Nowwefurtherprove(D.69). Letu 0 = argmin u E Y x T u 2 . Thenthefirstorderequation E[x(Y x T u 0 )] = 0givesE(xY)=[E(xx T )]u 0 =⌃ u 0 . Thus kE(xY)k 2 2 =u T 0 ⌃ 2 u 0 max (⌃ )u 0 T ⌃ u 0 = max (⌃ )var x T u 0 . (D.71) Itfollowsfromtheorthogonaldecompositionthat var(Y)= var x T u 0 +var Y x T u 0 var x T u 0 . Chapter3. Application of variable selection in genetic data 92 Since E 2 (Y 2 ) E(Y 4 )= O(1), we have var(Y) E(Y 2 )= O(1). Then the above in- equality ensures that var x T u 0 e C 3 for some constant e C 3 > 0. This together with (D.71) completestheproofof(D.69). Wenextprove(D.67). RecallthatY ⇤ =Y 2 andX ⇤ k =[X 2 k E(X 2 k )]/ q var(X 2 k ). Thenfrom (4.5),thedefinitionof! k ,wehave! k =E(X ⇤ k Y ⇤ ). Followingsimilarargumentsasforproving (D.69),itcanbeshownthat p X k=1 ! 2 k = p X k=1 E 2 (X ⇤ k Y ⇤ )=kE(x ⇤ Y ⇤ )k 2 2 e C 4 max (⌃ ⇤ ), (D.72) where e C 4 is some positive constant, x ⇤ =(X ⇤ 1 ,···,X ⇤ p ) T , and⌃ ⇤ =cov(x ⇤ ). Then, on the event E 3 = max 1 k p |b ! k ! k | 2 1 c 2 n 1 , the cardinality of {k : |b ! k |>c 2 n 1 } cannot exceed that of {k : |! k | > 2 1 c 2 n 1 }, which is bounded by 4 e C 4 c 2 2 n 2 1 max (⌃ ⇤ ). Thus,wehave P n | b A| 4 e C 4 c 2 2 n 2 1 max (⌃ ⇤ ) o P(E 3 )=1P(E c 3 ) 1o(n C1 ), where the last equality follows from (4.9). This concludes the proof of part c) of Theorem 4.2 andthusTheorem4.2isproved. D.5. ProofofTheorem4.3 Recallthat e X=(e x 1 ,··· ,e x e p )isthecorrespondingn⇥ e paugmenteddesignmatrixincorporating the covariate vectors forX j ’s and their interactions in columns, wheree x j =(X 1j ,···,X nj ) T for 1 j p is the jth covariate vector ande x j for p+1 j e p = p(p + 1)/2 ise x k e x ` with some 1 k<` p anddenoting the Hadamard (componentwise) product. We rescale the design matrix e X such that each column has L 2 -norm n 1/2 , and denote by e Z = e XD 1 the resultingmatrix,whereD = diag{D 11 ,··· ,D e pe p }withD mm =n 1/2 ke x m k 2 isadiagonalscale matrix. Define the eventE 4 ={L 1 min 1 j e p |D jj | max 1 j e p |D jj | L 2 }, whereL 1 andL 2 are twopositiveconstantsdefinedinCondition4. ThenbytheassumptioninCondition4,eventE 4 holdswithprobabilityatleast1a n . Inwhatfollows,wewillconditionontheeventE 4 . NotethatconditionalonE 4 ,wehave k e X k 2 ⇠k e Z k 2 , (D.73) Chapter3. Application of variable selection in genetic data 93 where the notation f n ⇠ g n means that the ratio f n /g n is bounded between two positive con- stants. Thus,conditionalonE 4 ,Condition4holdswithmatrix e Xreplacedwith e Z. Morespecif- ically,withprobabilityatleast1a n ,itholdsthat min k k2=1,k k0<2s n 1/2 k e Z k 2 e 0 , min 6=0,k 2k1 7k 1k1 n n 1/2 k e Z k 2 /(k 1 k 2 _k e 2 k 2 ) o e , wheree 0 ande are two positive constants depending only on , 0 , L 1 , and L 2 . In addition, conditional on E 4 , the desired results in Theorem 4.3 are equivalent to those with e X and ✓ replacedby e Zand✓ ⇤ = D✓ ,respectively. Thus,weonlyneedtoworkwiththedesignmatrix e Z andreparameterizedparametervector✓ ⇤ . 
ByexaminingtheproofofTheorem1inFanandLv[29],inordertoproveTheorem4.3inour paper,itsufficestoshowthatthefollowinginequality kn 1 e Z T "k 1 > 0 /2 (D.74) holds with probability at most a n +o(p c4 ), where 0 = e c 0 {(logp)/n ↵ 1↵ 2/(↵ 1+2↵ 2) } 1/2 for some constante c 0 > 0 andc 4 is some arbitrarily large positive constant depending one c 0 . Then with(D.74),followingtheproofofTheorem1inFanandLv[29],wecanobtainthatallresults inTheorem4.3holdwithprobabilityatleast1a n o(p c4 ). It remains to prove (D.74). We first show that kn 1 e X T "k 1 >L 1 0 /2 holds with an over- whelmingprobability. Tothisend,notethatanapplicationoftheBonferroniinequalitygives P(kn 1 e X T "k 1 >L 1 0 /2) e p X j=1 P(kn 1 e x T j "k 1 >L 1 0 /2) (D.75) for any 0 > 0. The key idea is to construct an upper bound for P(kn 1 e x T j "k 1 >L 1 0 /2). We claim that such an upper bound is e C 1 exp{ e C 2 n ↵ 1↵ 2/(↵ 1+2↵ 2) 2 0 } for any 0<L 1 0 < 2, where e C 1 and e C 2 are some positive constants. To prove this, we consider the following two cases. Case 1: 1 j p. In this case,e x j =(X 1j ,···,X nj ) T . Thus n 1 e x T j " = n 1 P n i=1 X ij " i . ByLemma5.1,wehaveP(|X ij " i |>t) 2c 1 exp{c 1 1 t ↵ 1↵ 2/(↵ 1+↵ 2) }forall 1 i nand 1 j p. Note that E(X ij " i )=0. Thus it follows from Lemma 5.6 that there exist some positiveconstants e C 3 and e C 4 suchthat P(|n 1 e x T j "|>L 1 0 /2) e C 3 exp{ e C 4 n min{↵ 1↵ 2/(↵ 1+↵ 2),1} 2 0 } forall0<L 1 0 < 2. Chapter3. Application of variable selection in genetic data 94 Case 2: p+1 j e p. In this case,e x j =(X 1k X 1` ,···,X nk X n` ) T . Thus n 1 e x T j " = n 1 P n i=1 X ik X i` " i with some 1 k<` p if p+1 j e p. By Lemma 5.1, we have P(|X ik X i` " i |>t) 4c 1 exp{c 1 1 t ↵ 1↵ 2/(↵ 1+2↵ 2) } for all 1 i n and 1 k<j p. Note thatE(X ik X i` " i )=0. Thus it follows from Lemma 5.6 and ↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) 1 that thereexistsomepositiveconstants e C 5 and e C 6 suchthat P(|n 1 e x T j "|>L 1 0 /2) e C 5 exp{ e C 6 n ↵ 1↵ 2/(↵ 1+2↵ 2) 2 0 } forall0<L 1 0 < 2. Undertheassumptionthat↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) 1,wehave↵ 1 ↵ 2 /(↵ 1 +2↵ 2 ) min{↵ 1 ↵ 2 /(↵ 1 + ↵ 2 ),1}. ThuscombiningCases1and2abovealongwith(D.75)leadsto P(kn 1 e X T "k 1 >L 1 0 /2) e p X j=1 P(|n 1 e x T j "|>L 1 0 /2) e C 1 p 2 exp{ e C 2 n ↵ 1↵ 2/(↵ 1+2↵ 2) 2 0 } for all 0<L 1 0 < 2, where e C 1 = max{ e C 3 , e C 5 } and e C 2 =min{ e C 4 , e C 6 }. Here we have used thefactthate p =p(p+1)/2 p 2 . Set 0 =e c 0 {logp/n ↵ 1↵ 2/(↵ 1+2↵ 2) } 1/2 withe c 0 somepositive constant. Then 0<L 1 0 < 2 for alln sufficiently large. Thus, with the above choice of 0 , it holdsthat P(kn 1 e X T "k 1 >L 1 0 /2) o(p c4 ), where c 4 is some positive constant. Note that P(A) P(A|B)+ P(B c ) and P(A|B) P(A)/P(B)foranyeventsAandB withP(B)> 0. Thus, P(kn 1 e Z T "k 1 > 0 /2) P(kn 1 e Z T "k 1 > 0 /2|E 4 )+P(E c 4 ) P(kn 1 e X T "k 1 >L 1 0 /2|E 4 )+P(E c 4 ) P(kn 1 e X T "k 1 >L 1 0 /2)/P(E 4 )+P(E c 4 ) o(p c4 )+a n , whichcompletestheproofofTheorem4.3. D.6. ProofofTheorem4.4 We first prove that the diagonal entries D mm ’s of the scale matrix D are bounded between two positiveconstantsL 1 L 2 withsignificantprobability. SinceP(|X ij |>t) c 1 exp(c 1 1 t ↵ 1 ) foranyt> 0andall1 i nand1 j p,byLemma5.7andnotingthatEX 2 ij =1,there Chapter3. Application of variable selection in genetic data 95 existsomepositiveconstants e C 1 and e C 2 suchthat P(1/2 n 1/2 ke x j k 2 p 7/2) =P{3/4 n 1 n X i=1 [EX 2 ij X 2 ij ] 3/4} =1P{|n 1 n X i=1 [X 2 ij EX 2 ij ]|> 3/4} 1 e C 1 exp( e C 2 n min{↵ 1/2,1} ) (D.76) forall1 j p. 
Since var(X ik X i` ) is a diagonal entry of the population covariance matrix e ⌃ , it follows from Condition 6 that var(X ik X i` ) K> 0 for all 1 k<` p. Thus, there exists a constant 0<K 0 1 such that E(X 2 ik X 2 i` ) var(X ik X i` ) K>K 0 for all 1 k<` p. Meanwhile, it follows fromX 2 ik X 2 i` (X 4 ik +X 4 i` )/2 and Lemma 5.2 thatE(X 2 ik X 2 i` ) e C 3 , where e C 3 K 0 is some positive constant. Note thatP(|X ij |>t) c 1 exp(c 1 1 t ↵ 1 ) for any t> 0 and all 1 i n and 1 j p. Thus it follows from Lemma 5.7 that there exist some positiveconstants e C 4 and e C 5 suchthatforall1 k<` p, P ⇣ p K 0 /2 n 1/2 ke x k e x ` k 2 q 7 e C 3 /2 ⌘ P ⇣ p K 0 /2 n 1/2 ke x k e x ` k 2 q 3K 0 /4+ e C 3 ⌘ 1P n n 1 n X i=1 [X 2 ik X 2 i` E(X 2 ik X 2 i` )] > 3K 0 /4 o 1 e C 4 exp( e C 5 n min{↵ 1/4,1} ). (D.77) Let L 1 =2 1 min{1,K 1/2 0 } = p K 0 /2 and L 2 = p 7/2max{1, e C 1/2 3 }. Then combin- ing (D.76) with (D.77) yields that with probability at least 1 e C 1 pexp( e C 2 n min{↵ 1/2,1} ) e C 4 p 2 exp( e C 5 n min{↵ 1/4,1} ),itholdsthat Ł 1 min 1 j e p |D jj | max 1 j e p |D jj | L 2 , (D.78) whichshowsthatD mm ’sareboundedawayfromzeroandinfinitywithlargeprobability. We proceed to show that the first two parts of Theorem 4.4 hold with significant probability. For any 0<✏< 1, define an event E 5 = {kn 1 e X T e X e ⌃ k 1 ✏}, wherek·k 1 stands for the entrywise matrix infinity norm and e X and e ⌃ are defined in Section 4.3.3. Recall that e p = p(p+1)/2. SinceP(|X ij |>t) c 1 exp(c 1 1 t ↵ 1 ) for anyt> 0 and all 1 i n and 1 j p, it follows from Lemma 5.7 that there exist some positive constants e C 6 and e C 7 such that P(E 5 )=1P(|(n 1 e X T e X e ⌃ ) jk |>✏ forsome (j,k) with 1 j,k e p) Chapter3. Application of variable selection in genetic data 96 1 e p X j=1 e p X k=1 P(|(n 1 e X T e X e ⌃ ) jk |>✏) 1 e C 6 e p 2 exp( e C 7 n min{↵ 1/4,1} ✏ 2 ) (D.79) forany0<✏< 1,whereA jk denotesthe(j,k)-entryofamatrixA. Next, we show that conditional on the event E 5 , the desired inequalities in Theorem 4.4 hold. Fromnowon,weconditionontheeventE 5 . Notethat(n 1/2 k e X k 2 ) 2 = T (n 1 e X T e X e ⌃ ) + T e ⌃ . Let J be the subvector of formed by putting all nonzero components of together. Forany satisfyingk k 2 =1andk k 0 < 2s,bytheCauchy-Schwarzinequalitywehave | T (n 1 e X T e X e ⌃ ) | ✏k k 2 1 = ✏k J k 2 1 ✏k J k 0 k J k 2 2 = ✏k k 0 k k 2 2 < 2s✏. (D.80) It follows that (n 1/2 k e X k 2 ) 2 > T e ⌃ 2s✏ for any satisfyingk k 2 =1 andk k 0 < 2s. Thuswederive min k k2=1,k k0<2s (n 1/2 k e X k 2 ) 2 min k k2=1,k k0<2s ( T e ⌃ )2s✏K2s✏, (D.81) wherethelastinequalityfollowsfromCondition6. Meanwhile,forany 6= 0wehave n 1/2 k e X k 2 k 1 k 2 _k e 2 k 2 ! 2 = T (n 1 e X T e X e ⌃ ) k 1 k 2 2 _k e 2 k 2 2 + T e ⌃ k 1 k 2 2 _k e 2 k 2 2 T (n 1 e X T e X e ⌃ ) k 1 k 2 2 _k e 2 k 2 2 + T e ⌃ k k 2 2 . Undertheadditionalconditionk 2 k 1 7k 1 k 1 ,bythefirstinequalityof(D.80)itholdsthat T (n 1 e X T e X e ⌃ ) k 1 k 2 2 _k e 2 k 2 2 ✏k k 2 1 k 1 k 2 2 = ✏(k 1 k 1 +k 2 k 1 ) 2 k 1 k 2 2 64✏k 1 k 2 1 k 1 k 2 2 64s✏, where the last inequalityfollowsfrom theCauchy-Schwarz inequality. This entails that for any 6=0withk 2 k 1 7k 1 k 1 , n 1/2 k e X k 2 k 1 k 2 _k e 2 k 2 ! 2 = n 1 T e X T e X k 1 k 2 2 _k e 2 k 2 2 T e ⌃ k k 2 2 64s✏. Thus,byCondition6wehave min 6=0,k 2k1 7k 1k1 n 1/2 k e X k 2 k 1 k 2 _k e 2 k 2 ! 2 min 6=0,k 2k1 7k 1k1 T e ⌃ k k 2 2 64s✏ Chapter3. Application of variable selection in genetic data 97 K64s✏. 
(D.82) Recall that s = O(n ⇠ 0 ) with 0 ⇠ 0 < min{↵ 1 /8,1/2} by assumption and thus s e C 8 n ⇠ 0 for some positive constant e C 8 . Take ✏ = Kn ⇠ 0 / e C 9 with e C 9 some sufficiently large positive constantsuchthat ✏2 (0,1)andK64s✏> 0. Inviewof(D.78), (D.79), (D.81), and(D.82), sincelogp =o(n min{↵ 1/4,1} 2⇠ 0 )byassumption,weobtainthat a n = e C 1 pexp( e C 2 n min{↵ 1/2,1} )+ e C 4 p 2 exp( e C 5 n min{↵ 1/4,1} ) + e C 6 e p 2 exp( e C 7 K 2 e C 2 9 n min{↵ 1/4,1} 2⇠ 0 )=o(1) with the above choice of ✏, and that with probability at least 1a n , the desired results in the theoremholdwith 0 =K(12 e C 8 / e C 9 )and =K(164 e C 8 / e C 9 ). Thisconcludestheproof ofTheorem4.4. AppendixE:Sometechnicallemmasandtheirproofs Lemma5.1. LetW 1 andW 2 betworandomvariablessuchthatP(|W 1 |>t) e C 1 exp( e C 2 t ↵ 1 ) and P(|W 2 |>t) e C 3 exp( e C 4 t ↵ 2 ) for allt> 0, where ↵ 1 , ↵ 2 , and e C i ’s are some pos- itive constants. Then P(|W 1 W 2 |>t) e C 5 exp( e C 6 t ↵ 1↵ 2/(↵ 1+↵ 2) ) for allt> 0, with e C 5 = e C 1 + e C 3 and e C 6 =min{ e C 2 , e C 4 }. Proof. Foranyt> 0,wehave P(|W 1 W 2 |>t) P(|W 1 |>t ↵ 2/(↵ 1+↵ 2) )+P(|W 2 |>t ↵ 1/(↵ 1+↵ 2) ) e C 1 exp( e C 2 t ↵ 1↵ 2/(↵ 1+↵ 2) )+ e C 3 exp( e C 4 t ↵ 1↵ 2/(↵ 1+↵ 2) ) e C 5 exp( e C 6 t ↵ 1↵ 2/(↵ 1+↵ 2) ) bysetting e C 5 = e C 1 + e C 3 and e C 6 =min{ e C 2 , e C 4 }. Lemma5.2. LetW beanonnegativerandomvariablesuchthatP(W>t) e C 1 exp( e C 2 t ↵ ) forallt> 0,where↵ and e C i ’saresomepositiveconstants. ThenitholdsthatE(e e C3W ↵ ) e C 4 , E(W ↵m ) e C m 3 e C 4 m! for any integer m 0 with e C 3 = e C 2 /2 and e C 4 =1+ e C 1 , and E(W k ) e C 5 for any integerk 1, where constant e C 5 depends onk and↵ . Proof. LetF(t)bethecumulativedistributionfunctionofW. Thenforallt> 0,1F(W)= P(W>t) e C 1 exp( e C 2 t ↵ ). Recall thatW is a nonnegative random variable. Thus, for any 0<T< e C 2 ,byintegrationbypartswehave E(e TW ↵ )= Z 1 0 e Tt ↵ d[1F(t)] = 1+ Z 1 0 T↵t ↵ 1 e Tt ↵ [1F(t)]dt Chapter3. Application of variable selection in genetic data 98 1+ Z 1 0 T↵t ↵ 1 · e C 1 e ( e C2 T)t ↵ dt=1+ T e C 1 e C 2 T . Then,taking e C 3 =T = e C 2 /2and e C 4 =1+ e C 1 provesthefirstdesiredresult. Notethat e C m 3 E(W ↵m )/m! P 1 k=0 e C k 3 E(W ↵k )/k!=E(e e C3W ↵ )foranynonnegativeinteger m. ThusE(W ↵m ) e C m 3 e C 4 m!,whichprovestheseconddesiredresult. Foranyintegerk 1,thereexistsanintegerm 1suchthatk<↵m . ThenapplyingH¨ older’s inequalitygives E(W k ) n E[(W k ) ↵m/k ] o k/(↵m ) n E[1 ↵m/ (↵m k) ] o (↵m k)/(↵m ) ={E(W ↵m )} k/(↵m ) ⇣ e C m 3 e C 4 m! ⌘ k/(↵m ) . ThusthekthmomentofW isboundedbyaconstant e C 5 ,whichdependsonkand↵ . Thisproves thethirddesiredresult. Lemma 5.3. Let W be a nonnegative random variable with tail probability P(W>t) e C 1 exp( e C 2 t ↵ ) for allt> 0,where↵ and e C i ’s aresome positiveconstants. Ifconstant↵ 1, thenE(e e C3W ) e C 4 andE(W m ) e C m 3 e C 4 m! for any integerm 0 with e C 3 = e C 2 /2 and e C 4 =e e C2/2 + e C 1 e e C2/2 . Proof. Let F(t) be the cumulative distribution function of nonnegative random variable W. Then 1 F(t)= P(W>t) e C 1 exp( e C 2 t ↵ ) for all t 1. If ↵ 1, then t t ↵ for all t 1 and thus 1 F(t) e C 1 exp( e C 2 t) for all t 1. Define e C 3 = e C 2 /2 and e C 4 =e e C2/2 + e C 1 e e C2/2 . Byintegrationbyparts,wededuce E(e e C3W )= Z 1 0 e e C3t d[1F(t)] = 1+ Z 1 0 e C 3 e e C3t [1F(t)]dt =1+ Z 1 0 e C 3 e e C3t [1F(t)]dt+ Z 1 1 e C 3 e e C3t [1F(t)]dt 1+ Z 1 0 e C 3 e e C3t dt+ Z 1 1 e C 1 e C 3 e ( e C3 e C2)t dt =e e C2/2 + e C 1 e e C2/2 = e C 4 , whichprovesthefirstdesiredresult. 
Notethat e C m 3 E(W m )/m! P 1 k=0 e C k 3 E(W k )/k!= E(e e C3W ) for any nonnegative integerm. ThusE(W m ) e C m 3 e C 4 m!,whichprovestheseconddesiredresult. Lemma5.4. Foranyrealnumbersb 1 ,b 2 0and↵> 0,itholdsthat(b 1 +b 2 ) ↵ C ↵ (b ↵ 1 +b ↵ 2 ) withC ↵ =1 if0<↵ 1 and2 ↵ 1 if↵> 1. Proof. We first consider the case of 0<↵ 1. It is trivial ifb 1 =0 orb 2 =0. Assume that bothb 1 andb 2 arepositive. Since0<b 1 /(b 1 +b 2 )< 1,wehave[b 1 /(b 1 +b 2 )] ↵ b 1 /(b 1 +b 2 ). Chapter3. Application of variable selection in genetic data 99 Similarly,itholdsthat[b 2 /(b 1 +b 2 )] ↵ b 2 /(b 1 +b 2 ). Combiningthesetworesultsyields ✓ b 1 b 1 +b 2 ◆ ↵ + ✓ b 2 b 1 +b 2 ◆ ↵ b 1 b 1 +b 2 + b 2 b 1 +b 2 =1, whichimpliesthat(b 1 +b 2 ) ↵ b ↵ 1 +b ↵ 2 . Next, we deal with the case of↵> 1. Since x ↵ is a convex function on [0,1) for a given ↵> 1,wehave[(b 1 +b 2 )/2] ↵ (b ↵ 1 +b ↵ 2 )/2,whichensuresthat(b 1 +b 2 ) ↵ 2 ↵ 1 (b ↵ 1 +b ↵ 2 ). Combiningthetwocasesaboveleadstothedesiredresult. Lemma 5.5 (Lemma B.4 in ? ]). Let W 1 ,···,W n be independent random variables with EW i =0 and Ee T|W i | ↵ A for some constants T,A > 0 and 0< ↵ 1. Then for 0<✏ 1,P(|n 1 P n i=1 W i |>✏) e C 1 exp( e C 2 n ↵ ✏ 2 ) with e C 1 , e C 2 > 0 some constants. Lemma5.6. LetW 1 ,···,W n beindependentrandomvariableswithtailprobabilityP(|W i |> t) e C 1 exp( e C 2 t ↵ ) for allt> 0, where ↵ and e C i ’s are some positive constants. Then there exist some positive constants e C 3 and e C 4 such that P{|n 1 n X i=1 (W i EW i )|>✏} e C 3 exp( e C 4 n min{↵, 1} ✏ 2 ) (E.1) for0<✏ 1. Proof. Define f W i =W i EW i . Thenbythetriangleinequalityandthepropertyofexpectation, wehave | f W i | =|W i EW i || W i |+|EW i || W i |+E|W i |. (E.2) Next,weconsidertwocases. Case 1: 0<↵ 1. It follows from Lemma 5.2 thatE(e T|W i | ↵ ) 1+ e C 1 and E|W i | C 0 for all 1 i n, whereT = e C 2 /2 andC 0 is some positive constant. In view of (E.2) and by Lemma5.4,wehave| f W i | ↵ (|W i |+E|W i |) ↵ | W i | ↵ +(E|W i |) ↵ . Thisensures E(e T| f W i | ↵ ) e T(E|W i |) ↵ E(e T|W i | ↵ ) e TC ↵ 0 (1+ e C 1 ). Thus,byLemma5.5,thereexistsomepositiveconstants e C 5 and e C 6 suchthat P(|n 1 n X i=1 [W i EW i ]|>✏)=P(|n 1 n X i=1 f W i |>✏) e C 5 exp ⇣ e C 6 n ↵ ✏ 2 ⌘ (E.3) forany0<✏ 1. Chapter3. Application of variable selection in genetic data 100 Case 2:↵> 1. In view of (E.2), it follows from Lemma 5.4 and Jensen’s inequality that for eachintegerm 2, E(| f W i | m ) E[(|W i |+E|W i |) m ] 2 m 1 E[|W i | m +(E|W i |) m ] =2 m 1 [E(|W i | m )+(E|W i |) m ] 2 m 1 [E(|W i | m )+E(|W i | m )] = 2 m E(|W i | m ). (E.4) RecallthatP(|W i |>t) e C 1 exp( e C 2 t ↵ )forallt> 0and↵> 1. ByLemma5.3,thereexist some positive constants e C 7 and e C 8 such thatE(|W i | m ) m! e C m 7 e C 8 . This together with (E.4) gives E(| f W i | m ) m!(2 e C 7 ) m 2 (8 e C 2 7 e C 8 )/2 forallm 2. ThusanapplicationofBernstein’sinequality(Lemma2.2.11in? ])yields P{|n 1 n X i=1 (W i EW i )|>✏} =P(|n 1 n X i=1 f W i |>✏) 2exp n✏ 2 16 e C 2 7 e C 8 +4 e C 7 ✏ ! 2exp n✏ 2 16 e C 2 7 e C 8 +4 e C 7 ! (E.5) forany0<✏< 1. Let e C 3 = max{ e C 5 ,2}and e C 4 =min{ e C 6 ,(16 e C 2 7 e C 8 +4 e C 7 ) 1 }. Combining (E.3)and(E.5)completestheproofofLemma5.6. Lemma 5.7. Assume that for each 1 j p, X 1j ,···,X nj are n i.i.d. random variables satisfying P(|X 1j |>t) e C 1 exp( e C 2 t ↵ 1 ) for anyt> 0, where e C 1 , e C 2 and ↵ 1 are some positive constants. 
Then for any0<✏< 1, we have P ( n 1 n X i=1 [X ij X ik E(X ij X ik )] >✏ ) e C 3 exp( e C 4 n min{↵ 1/2,1} ✏ 2 ), (E.6) P ( n 1 n X i=1 [X ij X ik X i` E(X ij X ik X i` )] >✏ ) e C 5 exp( e C 6 n min{↵ 1/3,1} ✏ 2 ), (E.7) P ( n 1 n X i=1 [X ik X i` X ik 0X i` 0E(X ik X i` X ik 0X i` 0)] >✏ ) e C 7 exp( e C 8 n min{↵ 1/4,1} ✏ 2 ), (E.8) where1 j,k,`,k 0 ,` 0 p and e C i ’s are some positive constants. Proof. The proofs for inequalities (E.6)–(E.8) are similar. To save space, we only show the inequality (E.8) here. Since P(|X ij |>t) e C 1 exp( e C 2 t ↵ 1 ) for allt> 0 and all i and j, it follows from Lemma 5.1 thatX ik X i` X ik 0X i` 0 admits tail probabilityP(|X ik X i` X ik 0X i` 0| > t) 4 e C 1 exp( e C 2 t ↵ 1/4 ). By Lemma 5.6, there exist some positive constants e C 3 and e C 4 such Chapter3. Application of variable selection in genetic data 101 that P(|n 1 n X i=1 [X ik X i` X ik 0X i` 0E(X ik X i` X ik 0X i` 0)]|>✏) e C 3 exp ⇣ e C 4 n min{↵ 1/4,1} ✏ 2 ⌘ forany0<✏< 1,whichconcludestheproofof(E.8). Lemma 5.8. LetA j ’s withj2D⇢{ 1,···,p} satisfy max j2D |A j | L 3 for some constant L 3 > 0, and b A j be an estimate ofA j based on a sample of sizen for eachj2D. Assume that for any constantC> 0, there exist constants e C 1 , e C 2 > 0 such that P ✓ max j2D | b A j A j |Cn 1 ◆ |D| e C 1 exp n e C 2 n f( 1) o withf( 1 ) some function of 1 . Then for any constantC> 0, there exist constants e C 3 , e C 4 > 0 such that P ✓ max j2D | b A 2 j A 2 j |Cn 1 ◆ |D| e C 3 exp n e C 4 n f( 1) o . Proof. Note that max j2D | b A 2 j A 2 j | max j2D | b A j ( b A j A j )| + max j2D |( b A j A j )A j |. Therefore,foranypositiveconstantC, P(max j2D | b A 2 j A 2 j |Cn 1 ) P(max j2D | b A j ( b A j A j )|Cn 1 /2) +P(max j2D |( b A j A j )A j |Cn 1 /2). (E.9) Wefirstdealwiththesecondtermontherighthandsideof(E.9). Sincemax j2D |A j | L 3 ,we have P(max j2D |( b A j A j )A j |Cn 1 /2) P(max j2D | b A j A j |L 3 Cn 1 /2) =P{max j2D | b A j A j | (2L 3 ) 1 Cn 1 }|D| e C 1 exp n e C 2 n f( 1) o , (E.10) where e C 1 and e C 2 aretwopositiveconstants. Next,weconsiderthefirsttermontherighthandsideof(E.9). Notethat P(max j2D | b A j ( b A j A j )|Cn 1 /2) P(max j2D | b A j ( b A j A j )|Cn 1 /2,max j2D | b A j |L 3 +Cn 1 /2) +P(max j2D | b A j ( b A j A j )|Cn 1 /2,max j2D | b A j |<L 3 +Cn 1 /2) P(max j2D | b A j |L 3 +Cn 1 /2)+P(max j2D | b A j ( b A j A j )|Cn 1 /2,max j2D | b A j |<L 3 +C) Chapter3. Application of variable selection in genetic data 102 P(max j2D | b A j |L 3 +Cn 1 /2)+P(max j2D |(L 3 +C)( b A j A j )|Cn 1 /2). (E.11) Let us bound the two terms on the right hand side of (E.11) one by one. Since max j2D |A j | L 3 ,wehave P(max j2D | b A j |L 3 +Cn 1 /2) P(max j2D | b A j A j |+max j2D |A j |L 3 +Cn 1 /2) P(max j2D | b A j A j | 2 1 Cn 1 )|D| e C 5 exp n e C 6 n f( 1) o , (E.12) where e C 5 and e C 6 aretwopositiveconstants. Italsoholdsthat P(max j2D |(L 3 +C)( b A j A j )|Cn 1 /2) =P{max j2D | b A j A j | (2L 3 +2C) 1 Cn 1 } |D| e C 7 exp n e C 8 n f( 1) o , where e C 7 and e C 8 aretwopositiveconstants. This,togetherwith(E.9)–(E.12),entails P(max j2D | b A 2 j A 2 j |Cn 1 )|D| e C 1 exp n e C 2 n f( 1) o +|D| e C 5 exp n e C 6 n f( 1) o +|D| e C 7 exp n e C 8 n f( 1) o |D| e C 3 exp n e C 4 n f( 1) o , where e C 3 = e C 1 + e C 5 + e C 7 > 0and e C 4 =min{ e C 2 , e C 6 , e C 8 }> 0. Lemma5.9. Let b A j and b B j be estimates ofA j andB j , respectively, based on a sample of size n for each j2D⇢{ 1,···,p}. 
Assume that for any constantC> 0, there exist constants e C 1 ,··· , e C 8 > 0 except e C 3 , e C 7 0 such that P ✓ max j2D | b A j A j |Cn 1 ◆ |D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o , P ✓ max j2D | b B j B j |Cn 1 ◆ |D| e C 5 exp n e C 6 n f( 1) o + e C 7 exp n e C 8 n g( 1) o withf( 1 ) andg( 1 ) some functions of 1 . Then for any constantC> 0, there exist constants e C 9 ,··· , e C 12 > 0 except e C 11 0 such that P ⇢ max j2D |( b A j b B j )(A j B j )|Cn 1 |D| e C 9 exp n e C 10 n f( 1) o + e C 11 exp n e C 12 n g( 1) o . Proof. Notethatmax j2D |( b A j b B j )(A j B j )| max j2D | b A j A j |+max j2D | b B j B j |. Thus,foranypositiveconstantC, P(max j2D |( b A j b B j )(A j B j )|Cn 1 ) Chapter3. Application of variable selection in genetic data 103 P(max j2D | b A j A j |Cn 1 /2)+P(max j2D | b B j B j |Cn 1 /2) |D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o +|D| e C 5 exp n e C 6 n f( 1) o + e C 7 exp n e C 8 n g( 1) o |D| e C 9 exp n e C 10 n f( 1) o + e C 11 exp n e C 12 n g( 1) o , where e C 9 = e C 1 + e C 5 > 0, e C 10 =min{ e C 2 , e C 6 } > 0, e C 11 = e C 3 + e C 7 0, and e C 12 = min{ e C 4 , e C 8 }> 0. Lemma 5.10. Let B j ’s with j2D⇢{ 1,···,p} satisfy min j2D B j L 4 for some constant L 4 > 0, and b B j be an estimate ofB j based on a sample of sizen for eachj2D. Assume that for any constantC> 0, there exist constants e C 1 , e C 2 > 0 such that P ✓ max j2D | b B j B j |Cn 1 ◆ |D| e C 1 exp n e C 2 n f( 1) o . Then for any constantC> 0, there exist constants e C 3 , e C 4 > 0 such that P ✓ max j2D q b B j p B j Cn 1 ◆ |D| e C 3 exp n e C 4 n f( 1) o . Proof. Since min j2D B j L 4 > 0, there exists some constant L 0 such that 0<L 0 <L 4 . Notethat,foranypositiveconstantC, P(max j2D | q b B j p B j |Cn 1 ) P(max j2D | q b B j p B j |Cn 1 ,min j2D | b B j | L 4 L 0 n 1 ) +P(max j2D | q b B j p B j |Cn 1 ,min j2D | b B j |>L 4 L 0 n 1 ) P(min j2D | b B j | L 4 L 0 n 1 ) +P(max j2D | b B j B j | | q b B j + p B j | Cn 1 ,min j2D | b B j |>L 4 L 0 ). (E.13) Considerthefirsttermontherighthandsideof(E.13). ForanypositiveconstantC,wehave P(min j2D | b B j | L 4 L 0 n 1 ) P(min j2D |B j |max j2D | b B j B j | L 4 L 0 n 1 ) P(max j2D | b B j B j |L 0 n 1 )|D| e C 1 exp n e C 2 n f( 1) o , (E.14) bynoticingthatmin j2D B j L 4 ,where e C 1 and e C 2 aresomepositiveconstants. Chapter3. Application of variable selection in genetic data 104 Next consider the second term on the right hand side of (E.13). Recall that min j2D B j L 4 . Then,foranypositiveconstantC, P(max j2D | b B j B j | | q b B j + p B j | Cn 1 ,min j2D | b B j |>L 4 L 0 ) P{max j2D | b B j B j |C( p L 4 L 0 + p L 4 )n 1 }|D| e C 5 exp n e C 6 n f( 1) o , (E.15) where e C 5 and e C 6 aresomepositiveconstants. Combining(E.13),(E.14),and(E.15)gives P(max j2D | q b B j p B j |Cn 1 )|D| e C 3 exp n e C 4 n f( 1) o , (E.16) where e C 3 = e C 1 + e C 5 and e C 4 =min{ e C 2 , e C 6 }. Lemma 5.11. Let A j ’s with j2D⇢{ 1,···,p} and B satisfy max j2D |A j | L 5 and |B| L 6 forsomeconstantsL 5 ,L 6 > 0,and b A j and b B beestimatesofA j andB,respectively, based on a sample of size n for each j2D. Assume that for any constantC> 0, there exist constants e C 1 ,··· , e C 8 > 0 except e C 3 0 such that P ✓ max j2D | b A j A j |Cn 1 ◆ | D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o , P ⇣ | b BB|Cn 1 ⌘ e C 5 exp n e C 6 n f( 1) o + e C 7 exp n e C 8 n g( 1) o withf( 1 ) andg( 1 ) some functions of 1 . 
Then for any constantC> 0, there exist constants e C 9 ,··· , e C 12 > 0 such that P ✓ max j2D | b A j b BA j B|Cn 1 ◆ | D| e C 9 exp n e C 10 n f( 1) o + e C 11 exp n e C 12 n g( 1) o . Proof. Note that max j2D | b A j b BA j B| max j2D | b A j ( b BB)| + max j2D |( b A j A j )B|. Therefore,foranypositiveconstantC, P(max j2D | b A j b BA j B|Cn 1 ) P(max j2D | b A j ( b BB)|Cn 1 /2) +P(max j2D |( b A j A j )B|Cn 1 /2). (E.17) Wefirstdealwiththesecondtermontherighthandsideof(E.17). Since|B| L 6 ,wehave P(max j2D |( b A j A j )B|Cn 1 /2) P(max j2D | b A j A j |L 6 Cn 1 /2) =P{max j2D | b A j A j | (2L 6 ) 1 Cn 1 } |D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o (E.18) Chapter3. Application of variable selection in genetic data 105 withconstants e C 1 , e C 2 , e C 4 > 0and e C 3 0. Next,weconsiderthefirsttermontherighthandsideof(E.17). Notethat P(max j2D | b A j ( b BB)|Cn 1 /2) P(max j2D | b A j ( b BB)|Cn 1 /2,max j2D | b A j |L 5 +Cn 1 /2) +P(max j2D | b A j ( b BB)|Cn 1 /2,max j2D | b A j |<L 5 +Cn 1 /2) P(max j2D | b A j |L 5 +Cn 1 /2) +P(max j2D | b A j ( b BB)|Cn 1 /2,max j2D | b A j |<L 5 +C) P(max j2D | b A j |L 5 +Cn 1 /2)+P{(L 5 +C)| b BB|Cn 1 /2}. (E.19) We will bound the two terms on the right hand side of (E.19) separately. Since max j2D |A j | L 5 ,itholdsthat P(max j2D | b A j |L 5 +Cn 1 /2) P(max j2D | b A j A j |+max j2D |A j |L 5 +Cn 1 /2) P{max j2D | b A j A j | 2 1 Cn 1 } |D| e C 5 exp n e C 6 n f( 1) o + e C 7 exp n e C 8 n g( 1) o , (E.20) where e C 5 , e C 6 , e C 8 > 0and e C 7 0aresomeconstants. Wealsohavethat P((L 5 +C)| b BB|Cn 1 /2) =P{| b BB| (2L 5 +2C) 1 Cn 1 } e C 13 exp n e C 14 n f( 1) o + e C 15 exp n e C 16 n g( 1) o , where e C 13 ,··· , e C 16 aresomepositiveconstants. This,togetherwith(E.17)–(E.20),entailsthat P(max j2D | b A j b BA j B|Cn 1 ) |D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o +|D| e C 5 exp n e C 6 n f( 1) o + e C 7 exp n e C 8 n g( 1) o + e C 13 exp n e C 14 n f( 1) o + e C 15 exp n e C 16 n g( 1) o | D| e C 9 exp n e C 10 n f( 1) o + e C 11 exp n e C 12 n g( 1) o , where e C 9 = e C 1 + e C 5 + e C 13 > 0, e C 10 =min{ e C 2 , e C 6 , e C 14 } > 0, e C 11 = e C 3 + e C 7 + e C 15 > 0, and e C 12 =min{ e C 4 , e C 8 , e C 16 }> 0. Lemma 5.12. Let A j ’s and B j ’s with j2D⇢{ 1,···,p} satisfy max j2D |A j | L 7 and min j2D |B j | L 8 for some constantsL 7 ,L 8 > 0, and b A j and b B j be estimates ofA j andB j , respectively, based on a sample of sizen for eachj2D. Assume that for any constantC> 0, Chapter3. Application of variable selection in genetic data 106 there exist constants e C 1 ,··· , e C 6 > 0 such that P ✓ max j2D | b A j A j |Cn 1 ◆ |D| e C 1 exp n e C 2 n f( 1) o + e C 3 exp n e C 4 n g( 1) o , P ✓ max j2D | b B j B j |Cn 1 ◆ | D| e C 5 exp n e C 6 n f( 1) o withf( 1 ) andg( 1 ) some functions of 1 . Then for any constantC> 0, there exist constants e C 7 ,··· , e C 10 > 0 such that P ✓ max j2D b A j / b B j A j /B j Cn 1 ◆ |D| e C 7 exp n e C 8 n f( 1) o + e C 9 exp n e C 10 n g( 1) o . Proof. Since min j2D B j L 8 > 0, there exists some constant L 0 such that 0<L 0 <L 8 . Notethat,foranypositiveconstantC, P(max j2D | b A j b B j A j B j |Cn 1 ) P(max j2D | b A j b B j A j B j |Cn 1 ,min j2D | b B j | L 8 L 0 n 1 ) +P(max j2D | b A j b B j A j B j |Cn 1 ,min j2D | b B j |>L 8 L 0 n 1 ) P(min j2D | b B j | L 8 L 0 n 1 )+P(max j2D | b A j b B j A j B j |Cn 1 ,min j2D | b B j |>L 8 L 0 ). (E.21) Let us consider the first term on the right hand side of (E.21). 
Since min j2D B j L 8 , it holds thatforanypositiveconstantC, P(min j2D | b B j | L 8 L 0 n k ) P(min j2D |B j |max j2D | b B j B j | L 8 L 0 n 1 ) P(max j2D | b B j B j |L 0 n 1 )|D| e C 1 exp n e C 2 n f( 1) o , (E.22) where e C 1 and e C 2 aresomepositiveconstants. Thesecondtermontherighthandsideof(E.21)canbeboundedas P(max j2D | b A j b B j A j B j |Cn 1 , min j2D | b B j |>L 8 L 0 ) P(max j2D | b A j b B j A j b B j |Cn 1 /2, min j2D | b B j |>L 8 L 0 ) Chapter3. Application of variable selection in genetic data 107 +P(max j2D | A j b B j A j B j |Cn 1 /2, min j2D | b B j |>L 8 L 0 ) P{max j2D | b A j A j | 2 1 (L 8 L 0 )Cn 1 } +P{max j2D | b B j B j | (2L 7 ) 1 (L 8 L 0 )L 8 Cn 1 } |D| e C 3 exp n e C 4 n f( 1) o + e C 5 exp n e C 6 n g( 1) o +|D| e C 11 exp n e C 12 n f( 1) o , (E.23) where e C 3 ,··· , e C 6 , e C 11 ,and e C 12 aresomepositiveconstants. Combining(E.21)–(E.23)results in P(max j2D | b A j / b B j A j /B j |Cn 1 )|D| e C 7 exp n e C 8 n f( 1) o + e C 9 exp n e C 10 n g( 1) o , where e C 7 = e C 1 + e C 3 + e C 11 > 0, e C 8 =min{ e C 2 , e C 4 , e C 12 } > 0, e C 9 = e C 5 > 0, and e C 10 = e C 6 > 0. ThiscompletestheproofofLemma5.12. Chapter3. Application of variable selection in genetic data 108 TABLE 5.1: The overall and individual signal-to-noise ratios (SNRs) for simulation example in Section A.1 of Supplementary Material. Case 1: " 1 ⇠ N(0,3 2 ), " 2 ⇠ N(0,2.5 2 ), " 3 ⇠ N(0,2.5 2 ), " 4 ⇠ N(0,2 2 ); Case 2: " 1 ⇠ N(0,3.5 2 ), " 2 ⇠ N(0,3 2 ), " 3 ⇠ N(0,3 2 ), " 4 ⇠ N(0,2.5 2 );Case3: " 1 ⇠ N(0,4 2 )," 2 ⇠ N(0,3.5 2 )," 3 ⇠ N(0,3.5 2 )," 4 ⇠ N(0,3 2 ). Case1 Case2 Case3 Settings1,3 Settings2,4 Settings1,3 Settings2,4 Settings1,3 Settings2,4 M1 X 1 0.44 0.44 0.33 0.33 0.25 0.25 X 5 0.44 0.44 0.33 0.33 0.25 0.25 X 1 X 5 1 1.00 0.73 0.74 0.56 0.56 Overall 1.89 1.95 1.39 1.43 1.06 1.10 M2 X 1 0.64 0.64 0.44 0.44 0.33 0.33 X 10 0.64 0.64 0.44 0.44 0.33 0.33 X 1 X 5 1.44 1.45 1 1.00 0.73 0.74 Overall 2.72 2.73 1.89 1.89 1.39 1.39 M3 X 10 0.64 0.64 0.44 0.44 0.33 0.33 X 15 0.64 0.64 0.44 0.44 0.33 0.33 X 1 X 5 1.44 1.45 1 1.00 0.73 0.74 Overall 2.72 2.77 1.89 1.92 1.39 1.41 M4 X 1 X 5 2.25 2.26 1.44 1.45 1 1.00 X 10 X 15 2.25 2.25 1.44 1.44 1 1.00 Overall 4.5 4.51 2.88 2.89 2 2.00 TABLE 5.2: The percentages of retaining each important interaction or main effect, and all important ones (All) by all the screening methods over different models and settings for simu- lationexampleinSectionA.1ofSupplementaryMaterial. 
Method                 M1                        M2                        M3                        M4
          X1    X5   X1X5   All  |  X1   X10  X1X5   All  |  X10   X15  X1X5   All  |  X1X5  X10X15   All
Case 1: ε1 ~ N(0, 3^2), ε2 ~ N(0, 2.5^2), ε3 ~ N(0, 2.5^2), ε4 ~ N(0, 2^2)
SIS2     1.00  1.00  1.00  1.00  | 1.00  1.00  0.15  0.15  | 1.00  1.00  0.00  0.00  | 0.01   0.04   0.00
DC-SIS2  1.00  1.00  1.00  1.00  | 1.00  1.00  0.76  0.76  | 1.00  1.00  0.02  0.02  | 0.08   0.07   0.01
SIRI     1.00  1.00  1.00  1.00  | 1.00  0.99  0.60  0.59  | 0.99  0.99  0.07  0.07  | 0.26   0.23   0.07
IP       1.00  1.00  0.95  0.95  | 1.00  1.00  0.83  0.83  | 1.00  1.00  0.78  0.78  | 0.72   0.80   0.52
Case 2: ε1 ~ N(0, 3.5^2), ε2 ~ N(0, 3^2), ε3 ~ N(0, 3^2), ε4 ~ N(0, 2.5^2)
SIS2     1.00  1.00  1.00  1.00  | 1.00  1.00  0.14  0.14  | 1.00  1.00  0.00  0.00  | 0.01   0.04   0.00
DC-SIS2  1.00  1.00  1.00  1.00  | 1.00  1.00  0.58  0.58  | 1.00  1.00  0.00  0.00  | 0.06   0.03   0.00
SIRI     1.00  1.00  1.00  1.00  | 0.99  0.99  0.40  0.39  | 0.99  0.98  0.03  0.03  | 0.15   0.17   0.02
IP       1.00  1.00  0.91  0.91  | 1.00  1.00  0.77  0.77  | 1.00  1.00  0.68  0.68  | 0.67   0.73   0.41
Case 3: ε1 ~ N(0, 4^2), ε2 ~ N(0, 3.5^2), ε3 ~ N(0, 3.5^2), ε4 ~ N(0, 3^2)
SIS2     1.00  1.00  1.00  1.00  | 1.00  0.99  0.13  0.13  | 1.00  1.00  0.00  0.00  | 0.00   0.04   0.00
DC-SIS2  1.00  1.00  1.00  1.00  | 1.00  0.99  0.44  0.43  | 1.00  1.00  0.00  0.00  | 0.04   0.02   0.00
SIRI     0.99  1.00  0.99  0.99  | 0.98  0.97  0.37  0.36  | 0.98  0.96  0.01  0.01  | 0.11   0.09   0.00
IP       1.00  1.00  0.77  0.77  | 1.00  0.99  0.71  0.70  | 0.99  0.99  0.64  0.62  | 0.58   0.62   0.29

TABLE 5.3: The means and standard errors (in parentheses) of computation time in minutes of each method based on 100 replications for the simulation example in Section A.2 of Supplementary Material.

Method           p = 200           p = 300            p = 500
hierNet          46.060 (0.685)    103.889 (1.203)    292.770 (2.471)
IP-hierNet       5.443 (0.135)     5.691 (0.110)      6.047 (0.136)
Ratio of means   8.46              18.25              48.42
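The screening step of IP evaluated in Table 5.2 relies on the marginal statistics analyzed in the appendix proofs above: the main-effect statistic omega*_j = E(X_j Y), estimated by n^{-1} sum_i X_ij Y_i, and the interaction-variable statistic omega_k = E(X*_k Y*), where X*_k is the standardized square of X_k and Y* = Y^2. The sketch below shows how these sample versions might be computed; it is only an illustration, not the author's implementation, and the function name ip_screen, the arguments d_inter and d_main, and the default retained-set size n/log(n) are assumptions made for this sketch.

import numpy as np

def ip_screen(X, y, d_inter=None, d_main=None):
    """Minimal sketch of the IP screening statistics (illustrative only).

    X: (n, p) design matrix; y: (n,) response.
    Returns A_hat (variables flagged as interaction-active) and
    B_hat (main-effect candidates). d_inter and d_main are tuning
    choices; the theory instead uses thresholds of order c * n^{-kappa}.
    """
    n, p = X.shape
    # Standardized squared variables X*_k and squared response Y*.
    X2 = X ** 2
    Xstar = (X2 - X2.mean(axis=0)) / X2.std(axis=0)
    ystar = y ** 2
    # Sample versions of omega_k = E(X*_k Y*) and omega*_j = E(X_j Y).
    omega = np.abs(Xstar.T @ ystar) / n
    omega_main = np.abs(X.T @ y) / n
    # Keep the top-ranked variables (default size n/log(n) is an assumption).
    d_inter = d_inter or int(np.ceil(n / np.log(n)))
    d_main = d_main or int(np.ceil(n / np.log(n)))
    A_hat = np.argsort(omega)[::-1][:d_inter]
    B_hat = np.argsort(omega_main)[::-1][:d_main]
    return A_hat, B_hat

In the second step, all pairwise products of the variables in A_hat, together with the main-effect candidates in B_hat, would form the reduced feature space on which a regularized interaction model (such as the IP-hierNet combination timed in Table 5.3) is fit.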
Abstract
In this dissertation, I propose a variable selection method called the Constrained Dantzig Selector (CDS) for moderately high dimensions. When interactions also need to be considered in prediction models, a screening step is required, and for this setting a two-step method called Interaction Pursuit (IP) is proposed. The former method, CDS, has been applied to a GWAS data set, where 269 SNPs were selected. The latter method, IP, can potentially be applied to the same GWAS data to detect gene-gene interactions.
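For orientation, the CDS mentioned above builds on the Dantzig selector. Its baseline formulation, stated here only as background (the constrained variant and its additional coefficient constraints are developed in Chapter 2), is the linear program

    minimize ||beta||_1   subject to   ||X^T (y - X beta)||_inf <= lambda,

where y is the response vector, X the design matrix, and lambda >= 0 a regularization parameter bounding the correlation between each predictor and the residual (up to scaling conventions).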