Markovian Models For Discrete Data With Repeated Patterns
MARKOVIAN MODELS FOR DISCRETE DATA WITH REPEATED PATTERNS

by

Xinrong Zhang

A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
(Statistics)

August 1995

UMI Number: 1378438. UMI Microform 1378438. Copyright 1996, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This thesis, written by ......, under the direction of ... Thesis Committee, and approved by all its members, has been presented to and accepted by the Dean of The Graduate School, in partial fulfillment of the requirements for the degree of ...

Dean
THESIS COMMITTEE
Chairman

Dedication

This thesis is dedicated to my parents and my husband.

Acknowledgment

The thesis was completed under the direction of my advisor S. Tavare, who gave me this very interesting problem and also gave me several new ideas for the models developed here. His patience and encouragement during both my graduate studies and the writing of my thesis are greatly appreciated.

Contents

Dedication
Acknowledgment
List Of Tables
Abstract
1 Introduction
2 Statistical Models
  2.1 One Pattern: the Original Model
  2.2 One Pattern: the Revised Model
  2.3 Multiple Patterns: the Original Model
  2.4 Multiple Patterns: A Revised Model
  2.5 Model Comparison
3 Applications to the Song of Wood Pewee
4 Applications to Molecular Biology
5 Discussions and Further Research

List Of Tables

3.1 Here k denotes the number of parameters, MC(k) denotes the full model of order k.
The order 0 model refers to an independent sequence, the order 1 model refers to an underlying Markov chain of first order, the order -1 refers to an independent and equiprobable sequence.

4.1 Here k denotes the number of parameters, MC(k) denotes the full model of order k. The order 0 model refers to an independent sequence, the order 1 model refers to an underlying Markov chain of first order, the order -1 refers to an independent and equiprobable sequence.

Abstract

Repeated patterns occur frequently in behavioral sequences and DNA sequences. It is important to characterize the structure of such sequences. The usual Markov process models do not fit very well if there are repeated patterns in the sequence. We propose several different models using a composition of an underlying Markov process and the repeated patterns. Model comparisons are based on the BIC (Bayesian Information Criterion) value for each model. We use our models to fit the data of the song of Wood Pewee and a DNA sequence with repeated patterns in it. It turns out that our models fit much better than the usual Markov process models.

Chapter 1
Introduction

Repeated patterns occur frequently in sequential behaviors such as DNA sequences [Benson and Waterman 1995, Warren and Nelson 1994], song of Wood Pewee [Craig 1943, Chatfield and Lemon 1970] and other phenomena. Although functions for many repeated DNA sequences are not known, some repeated patterns have been related to the catalytic properties of the sequence. DNA repeated patterns have also been used in evolutionary studies [Jin and Chakraborty 1994, Snow et al. 1994, Stewart and Baker 1994].
It has also been found that some repeated patterns are related to human diseases, such as Fragile-X syndrome, in which the pattern CGG is repeated many times [Feng et al. 1995, Warren and Ashley 1995, Warren and Nelson 1994]. Several other diseases, including spinal and bulbar muscular atrophy (SBMA), myotonic dystrophy (MD) and Huntington's disease (HD), are also related to repeated patterns.

When repeated patterns exist, the process usually cannot be modeled by a single Markov process. Given a sequence with repeated patterns in it, we would like to find a suitable model to explain the pattern. By this we mean that we would like to use as few parameters as possible to explain the data.

Chatfield and Lemon (1970) studied the problem of deciding if successive behavior patterns can be modeled by a Markov chain and, if so, deciding the order of the Markov chain. They also studied two situations in which the Markov model had to be modified: first, when behavior patterns are never immediately repeated, so that two successive patterns always differ; and second, when behavior patterns are customarily repeated many times. They used their methods to study the data for the song of Wood Pewee.

Raftery and Tavare (1994) reanalyzed the data of the song of the Wood Pewee using the idea of the Mixture Transition Distribution (MTD) model. Their objective is to model the sequence behavior when there is a main repeated pattern in the sequence. Their idea is that when the previous states are in the pattern, the next state continues this pattern with probability $\alpha$, and with probability $1 - \alpha$ the next state evolves according to an underlying Markov process. When there are multiple patterns, they also propose a method to model the behavior of the sequences. In this thesis, we would like to model sequence behaviors using the ideas of Raftery and Tavare (1994).
The organization of the thesis is as follows. In Chapter 2, we propose several models to describe repeated patterns in a sequence; we generalize the models of Raftery and Tavare (1994) and give modifications of them. We also write out explicitly the likelihood function for each of our models, derive the corresponding likelihood equations, and propose a new model for multiple patterns. In Chapter 3, we apply our models to the data of the song of Wood Pewee, which are dominated by patterns that are repeated many times. In Chapter 4, we apply our models to a segment of DNA sequence. In Chapter 5, we discuss some further research about repeated patterns. We hope the method we use here can be applied to other problems where complex repeated patterns are embedded in noise, such as speech recognition.

Chapter 2
Statistical Models

2.1 One Pattern: the Original Model

In this section we model the repeated patterns in a sequence. Suppose that each position of the sequence can take one of $L$ states. We use the idea of Raftery and Tavare (1994) to study repeated patterns in a sequence.

First let us consider just one pattern. The basic idea is that, if the past and current states belong to the pattern, the next state either continues the pattern (with probability $\alpha$) or else is randomly generated from the other states according to a Markov chain of order less than that of the pattern (with probability $1 - \alpha$). If the previous states do not belong to the pattern, the next observation is generated randomly from the same Markov chain. We call this Markov chain the underlying Markov chain.

For a fixed pattern $i_0 i_1 \cdots i_{k-1}$ of length $k$ we construct a corresponding group of patterns of $l + 1$ consecutive states, which we call the $l$-group of the pattern:
\[ A = \{\, i_0 i_1 \cdots i_{l-1} i_l,\; i_1 i_2 \cdots i_l i_{l+1},\; \ldots \,\} \]
and
\[ B = \{\, i_0 i_1 \cdots i_{l-1} \bar{i}_l,\; i_1 i_2 \cdots i_l \bar{i}_{l+1},\; \ldots \,\}, \]
where $\bar{i}_j$ denotes the set of letters that are different from $i_j$.
We give the following definition of $l$th order patterns.

Definition: For a given pattern $I = i_0 i_1 \cdots i_{k-1}$, define an infinite sequence by repeating the pattern $I$, that is, $i_0 i_1 \cdots i_{k-1} i_k i_{k+1} \cdots$, where $i_c = i_{c'}$ if $c \equiv c' \pmod{k}$. Then the pattern $I$ is called an $l$th order pattern if whenever $i_m i_{m+1} \cdots i_{m+l-1} = i_{m'} i_{m'+1} \cdots i_{m'+l-1}$, we must have $i_{m+l} = i_{m'+l}$. That is, the first $l$ terms completely determine the $(l+1)$th term. Here $l < k$. Notice that if a pattern is an $l$th order pattern, then it is also an $s$th order pattern for any $s > l$.

Example: Considering the pattern 1312, we define an infinite sequence by repeating the pattern 1312, i.e. $13121312\cdots$. It is not a first order pattern, as 1 can be followed by 3 or 2. It is a second order pattern, as 13 can only be followed by 1; 31 by 2; 12 by 1; 21 by 3. That is, the first two terms completely determine the third term. Similarly we can show that the pattern 13132 is a third order pattern.

Suppose now the pattern we are interested in is an $l$th order pattern, and define $A$ and $B$ as above. According to the model described above, if the underlying Markov chain is of order 0, i.e. the underlying process is a sequence of iid random variables with $P(X_t = j) = \pi_j$, then we have the following transition distribution for a pattern $i = i_0 i_1 \cdots i_l$:
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha & \text{if } i \in A, \\[4pt]
\dfrac{(1-\alpha)\,\pi_{i_l}}{\sum_{j:\; i_0 \cdots i_{l-1} j \notin A} \pi_j} & \text{if } i \in B, \\[10pt]
\pi_{i_l} & \text{otherwise.}
\end{cases}
\]
Therefore, given a sequence $i = i_0 i_1 \cdots i_n$, with $n > l$, the probability of getting this sequence under our model can be calculated recursively using the product rule. It is usual to ignore the contribution of $P(i_0 i_1 \cdots i_{l-1})$, especially when we use the maximum likelihood method to estimate parameters.
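The definition of an $l$th order pattern and the construction of the sets $A$ and $B$ can be checked mechanically. The following is a minimal Python sketch, not part of the original thesis; the function names `cyclic_windows`, `pattern_order` and `l_group` are hypothetical, and patterns are assumed to be given as strings over a small alphabet.

```python
def cyclic_windows(pattern, width):
    """Windows of `width` consecutive letters, one per starting position,
    taken from the infinite periodic repetition of `pattern`."""
    k = len(pattern)
    return ["".join(pattern[(m + c) % k] for c in range(width))
            for m in range(k)]


def pattern_order(pattern):
    """Smallest l < k such that any l consecutive letters of the repeated
    pattern determine the next letter; None if no such l exists."""
    k = len(pattern)
    for l in range(1, k):
        successor = {}
        consistent = True
        for w in cyclic_windows(pattern, l + 1):
            prefix, nxt = w[:l], w[l]
            # setdefault records the first successor seen for this prefix
            if successor.setdefault(prefix, nxt) != nxt:
                consistent = False
                break
        if consistent:
            return l
    return None


def l_group(pattern, l, alphabet):
    """Sets A and B of (l+1)-letter strings for an l-th order pattern.
    A holds the windows of the repeated pattern; B holds the strings that
    share the first l letters with a member of A but end differently."""
    A = set(cyclic_windows(pattern, l + 1))
    B = {w[:l] + x for w in A for x in alphabet if w[:l] + x not in A}
    return A, B
```

For the pattern 1312 over the alphabet {1, 2, 3}, `pattern_order("1312")` gives 2 and `l_group("1312", 2, "123")` reproduces the four $A$ patterns 131, 312, 121, 213 together with the eight $B$ patterns obtained by changing the last letter.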
Therefore, the modified likelihood is
\[
P(i) = P(i_l \mid i_0 i_1 \cdots i_{l-1}) \cdots P(i_n \mid i_{n-l} \cdots i_{n-1})
     = \alpha^{n_A} (1-\alpha)^{n_B} \prod_{i=1}^{L} \frac{\pi_i^{N(i)}}{(1-\pi_i)^{N_B(i)}}, \tag{2.1}
\]
where

$n_A$ = the number of $A$ patterns in sequence $i$,
$n_B$ = the number of $B$ patterns in sequence $i$,
$N(i)$ = the number of patterns that do not belong to $A$ and whose last index is $i$,
$N_B(i)$ = the number of $B$ patterns whose last index is not $i$, and whose previous $l$ terms together with $i$ form an $A$ pattern.

To find the maximum likelihood estimates of $\alpha$ and $\{\pi_i\}$, we maximize the function in equation (2.1). Because $\pi_1, \pi_2, \ldots, \pi_L$ do not depend on $\alpha$, it is easy to see that the maximum likelihood estimate of $\alpha$ is
\[ \hat{\alpha} = \frac{n_A}{n_A + n_B}. \]
To obtain the maximum likelihood estimate of $\pi_1, \pi_2, \ldots, \pi_L$ we take the logarithm of equation (2.1). Note that $\alpha$ does not depend on $\pi_1, \pi_2, \ldots, \pi_L$. Therefore the MLE of $\pi_1, \pi_2, \ldots, \pi_L$ should be the maximum point of
\[ \sum_{i=1}^{L} \bigl( N(i) \log \pi_i - N_B(i) \log(1 - \pi_i) \bigr), \]
with the restriction
\[ \pi_1 + \pi_2 + \cdots + \pi_L = 1. \]
If the MLE of $\pi_1, \pi_2, \ldots, \pi_L$ is an interior point of $[0,1]^L$, we can use the Lagrange multipliers method. We construct the function
\[ \mathcal{L} = \sum_{i=1}^{L} \bigl( N(i) \log \pi_i - N_B(i) \log(1 - \pi_i) \bigr) - \lambda (\pi_1 + \cdots + \pi_L - 1). \]
Then
\[ \frac{\partial \mathcal{L}}{\partial \pi_i} = \frac{N(i)}{\pi_i} + \frac{N_B(i)}{1 - \pi_i} - \lambda = 0, \qquad i = 1, 2, \ldots, L, \]
and
\[ \pi_1 + \pi_2 + \cdots + \pi_L = 1. \]
Therefore
\[ \frac{N(i)}{\pi_i} + \frac{N_B(i)}{1 - \pi_i} = \frac{N(L)}{\pi_L} + \frac{N_B(L)}{1 - \pi_L}, \qquad i = 1, 2, \ldots, L-1. \]
There is no explicit analytical solution for $\pi_1, \pi_2, \ldots, \pi_L$. For real data we have to use numerical methods to get the maximum likelihood estimates of $\pi_1, \pi_2, \ldots, \pi_L$.

It should be noted that the maximum likelihood estimator given on page 196 of Raftery and Tavare (1994) is incorrect, as the following example shows.

Example 1: Let us look at the example of the song of the Wood Pewee studied in (Craig 1943, Chatfield and Lemon 1970, Raftery and Tavare 1994). The dominant pattern is 1312. It is a second order pattern and the corresponding $A$ and $B$ are
\[ A = \{131, 312, 121, 213\} \]
and
\[ B = \{132, 133, 311, 313, 122, 123, 211, 212\}. \]
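The counts entering the modified likelihood (2.1) can be tallied mechanically from the sets $A$ and $B$. The following is a minimal Python sketch under stated assumptions, not the thesis's own code: the names `original_model_counts` and `original_log_lik` are hypothetical, the sequence is a string, $A$ and $B$ are sets of $(l+1)$-letter strings, and `pi` is a dictionary from letters to probabilities.

```python
import math
from collections import Counter


def original_model_counts(seq, A, B, l):
    """Tally n_A, n_B, N(i) and N_B(i) over the (l+1)-letter windows of seq."""
    # Letter that continues the pattern after each l-letter prefix of A.
    continuation = {w[:l]: w[l] for w in A}
    n_A = n_B = 0
    N, N_B = Counter(), Counter()
    for t in range(len(seq) - l):
        w = seq[t:t + l + 1]
        if w in A:
            n_A += 1
            continue
        N[w[l]] += 1                      # window not in A, last letter w[l]
        if w in B:
            n_B += 1
            N_B[continuation[w[:l]]] += 1  # letter that would have continued
    return n_A, n_B, N, N_B


def original_log_lik(seq, A, B, l, alpha, pi):
    """Logarithm of the modified likelihood (2.1), order-0 underlying chain.
    Assumes 0 < alpha < 1 whenever both A and B windows occur, and that
    every pi value needed lies strictly between 0 and 1."""
    n_A, n_B, N, N_B = original_model_counts(seq, A, B, l)
    ll = 0.0
    if n_A:
        ll += n_A * math.log(alpha)
    if n_B:
        ll += n_B * math.log(1.0 - alpha)
    for i, count in N.items():
        ll += count * math.log(pi[i])
    for i, count in N_B.items():
        ll -= count * math.log(1.0 - pi[i])
    return ll


def alpha_hat(n_A, n_B):
    """Closed-form estimate n_A / (n_A + n_B) of alpha for the original model."""
    return n_A / (n_A + n_B)
```

Because the estimate of $\alpha$ is available in closed form, only the part of the log-likelihood that depends on `pi` needs to be handed to a numerical optimizer.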
Now suppose we have a sequence $i = 12231$. According to the model, the probability of having this sequence is
\[ P(12231) = P(2 \mid 12)\, P(3 \mid 22)\, P(1 \mid 23)\, P(12) = \frac{(1-\alpha)\pi_2}{\pi_2 + \pi_3}\, \pi_3 \pi_1\, c, \]
where $c = P(12)$. For this sequence, we have three second order patterns 122, 223, and 231; $122 \in B$, while 223 and 231 do not belong to $A$ or $B$. Therefore $n = 3$, $n_A = 0$, $n_B = 1$. According to the formula on page 196 of Raftery and Tavare (1994), the maximum likelihood estimator of $\alpha, \pi_1, \pi_2, \pi_3$ would be
\[ \hat{\alpha} = 0, \qquad \hat{\pi}_1 = \tfrac{1}{3}, \quad \hat{\pi}_2 = \tfrac{1}{3}, \quad \hat{\pi}_3 = \tfrac{1}{3}. \]
Under these parameters, the probability of having the sequence 12231 is
\[ P(12231) = \frac{c}{18}. \]
But, substituting $\pi_1 = 1 - \pi_2 - \pi_3$,
\[ \log P = \log \pi_3 + \log \pi_2 + \log(1 - \pi_2 - \pi_3) + \log(1 - \alpha) - \log(\pi_2 + \pi_3) + \log c. \]
Differentiating the above equation with respect to $\pi_2$ and $\pi_3$, we obtain
\[ \frac{1}{\pi_2} - \frac{1}{1 - \pi_2 - \pi_3} - \frac{1}{\pi_2 + \pi_3} = 0, \]
\[ \frac{1}{\pi_3} - \frac{1}{1 - \pi_2 - \pi_3} - \frac{1}{\pi_2 + \pi_3} = 0. \]
Solving the two equations we get $\pi_1 = \tfrac{1}{2}$ and $\pi_2 = \pi_3 = \tfrac{1}{4}$, which gives the probability of the sequence 12231 as
\[ P(12231) = \frac{c}{16} > \frac{c}{18}. \qquad \Box \]

There are several cases for the maximum likelihood estimate of $\pi_1, \pi_2, \ldots, \pi_L$. Usually they exist in the interior of the unit cube $[0,1]^L$ (Example 1). Sometimes the maximum likelihood estimates are on the boundary (Example 2) and sometimes they do not exist (Example 3).

Example 2: (MLE are on the boundary.) We still consider the same problem as in Example 1. Suppose we have the sequence 13123. It is composed of three second order patterns $\{131, 312, 123\}$, of which 131 and $312 \in A$, and $123 \in B$.
\[ P(13123) = P(1 \mid 13)\, P(2 \mid 31)\, P(3 \mid 12)\, P(13) = \alpha^2\, \frac{(1-\alpha)\pi_3}{\pi_2 + \pi_3}\, c. \]
The MLE of $\alpha$ and $\pi_1, \pi_2, \pi_3$ are $\hat{\alpha} = \tfrac{2}{3}$, $\hat{\pi}_1 = \hat{\pi}_2 = 0$, $\hat{\pi}_3 = 1$. The MLE of the $\pi$'s lie on the boundary of $[0,1]^L$. $\Box$

Example 3: (MLE do not exist.) Suppose we have a sequence 13212. Here 132 and $212 \in B$, and 321 does not belong to $A$ or $B$.
\[
P(13212) = P(2 \mid 13)\, P(1 \mid 32)\, P(2 \mid 21)\, P(13)
= (1-\alpha)^2\, \frac{\pi_2}{1 - \pi_1}\, \pi_1\, \frac{\pi_2}{1 - \pi_3}\, c
= (1-\alpha)^2\, \frac{\pi_1 \pi_2^2}{(1 - \pi_1)(\pi_1 + \pi_2)}\, c.
\]
In order for this probability not to be zero, we need $\pi_1 \pi_2 \neq 0$ and $0 < \pi_1, \pi_2 < 1$. Then
\[ \log P = 2\log(1 - \alpha) + \log \pi_1 + 2 \log \pi_2 - \log(1 - \pi_1) - \log(\pi_1 + \pi_2) + \log c. \]
Differentiating the above equation with respect to $\pi_1$ and $\pi_2$ we have the equations
\[ \frac{1}{\pi_1} + \frac{1}{1 - \pi_1} - \frac{1}{\pi_1 + \pi_2} = 0, \]
\[ \frac{2}{\pi_2} - \frac{1}{\pi_1 + \pi_2} = 0. \]
There are no solutions to the above equations. Therefore the MLE of $\pi_1$, $\pi_2$, and $\pi_3$ do not exist. $\Box$

If the underlying Markov process is of first order with transition matrix $\pi_{ij} = P(X_{t+1} = j \mid X_t = i)$, $i, j = 1, 2, \ldots, L$, then the transition distribution is
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha & \text{if } i \in A, \\[4pt]
\dfrac{(1-\alpha)\,\pi_{i_{l-1} i_l}}{\sum_{j:\; i_0 \cdots i_{l-1} j \notin A} \pi_{i_{l-1} j}} & \text{if } i \in B, \\[10pt]
\pi_{i_{l-1} i_l} & \text{otherwise.}
\end{cases}
\]
Under these assumptions, the modified likelihood of the sequence $i = i_0 i_1 \cdots i_n$ is
\[ P(i) = \alpha^{n_A} (1-\alpha)^{n_B} \prod_{i,j=1}^{L} \frac{\pi_{ij}^{N(ij)}}{(1 - \pi_{ij})^{N_B(ij)}}, \]
where $n_A$ and $n_B$ are defined as above, and

$N(ij)$ = the number of patterns that do not belong to $A$ and whose last two letters are $i$ and $j$,
$N_B(ij)$ = the number of $B$ patterns whose second last letter is $i$, whose last letter is not $j$, and whose first $l$ letters together with $j$ form one of the $A$ patterns.

Similarly we can get the maximum likelihood estimator of $\alpha$:
\[ \hat{\alpha} = \frac{n_A}{n_A + n_B}. \]
There are no explicit analytical solutions for the maximum likelihood estimate of $\pi_{ij}$. Once more, we have to use numerical methods to get the solution.

2.2 One Pattern: the Revised Model

Until now, we have used the same model as proposed in Raftery and Tavare (1994). We wrote out the likelihood function explicitly and pointed out the mistake in the original derivation. Next we propose a modification of the above model. Note that in the above model, if the past and current states belong to pattern $A$, the next state continues this pattern with probability $\alpha$.
With probability $1 - \alpha$, the next state does not continue the pattern; that is, the next state assumes a letter that is different from the letter that continues the pattern. In our revised model, we suppose the pattern is continued with probability $\alpha$ and the next state evolves according to the underlying Markov process with probability $1 - \alpha$. Note that in the latter case, the next state can be any letter, whether it continues the pattern or not. The final process is a composition of these two processes. When the underlying Markov process consists of independent trials, the transition distribution of a pattern is given as follows:
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha + (1-\alpha)\pi_{i_l} & \text{if } i \in A, \\
(1-\alpha)\pi_{i_l} & \text{if } i \in B, \\
\pi_{i_l} & \text{otherwise.}
\end{cases}
\]
Denote the sequence generated by the main pattern by $p_1$. For example, if the pattern is 1312, then
\[ p_1 = 131213121312\cdots. \]
Denote the underlying Markov process by $M_0$. We call the process defined above the composition of $p_1$ and $M_0$, and we denote this composition by $\alpha p_1 + (1 - \alpha) M_0$. Note that if $\alpha = 0$, the process reduces to the underlying Markov process $M_0$. We can use this to test the significance of the pattern.

Under the new model a sequence $i = i_0 i_1 \cdots i_n$ has modified likelihood
\[
P(i) = P(i_l \mid i_0 i_1 \cdots i_{l-1}) \cdots P(i_n \mid i_{n-l} \cdots i_{n-1})
     = (1-\alpha)^{n_B} \prod_{i=1}^{L} \bigl(\alpha + (1-\alpha)\pi_i\bigr)^{N_A(i)}\, \pi_i^{N(i)},
\]
where $n_B$ and $N(i)$ are the same as above and

$N_A(i)$ = the number of $A$ patterns whose last index is $i$.

The log likelihood function is
\[ \log P(i) = n_B \log(1 - \alpha) + \sum_{i=1}^{L} N_A(i) \log\bigl(\alpha + (1-\alpha)\pi_i\bigr) + \sum_{i=1}^{L} N(i) \log \pi_i. \]
Because $\pi_1 + \pi_2 + \cdots + \pi_L = 1$, we have
\[ \pi_L = 1 - \pi_1 - \pi_2 - \cdots - \pi_{L-1}. \]
By the Lagrange multipliers method as above, we can show that the MLE should satisfy
\[ \frac{n_B}{1 - \alpha} = \sum_{i=1}^{L} \frac{(1 - \pi_i) N_A(i)}{\alpha + (1-\alpha)\pi_i}, \]
\[ \frac{(1-\alpha) N_A(i)}{\alpha + (1-\alpha)\pi_i} + \frac{N(i)}{\pi_i} = \frac{(1-\alpha) N_A(L)}{\alpha + (1-\alpha)\pi_L} + \frac{N(L)}{\pi_L}, \qquad i = 1, 2, \ldots, L-1. \]
Then we can use numerical methods to get the solution for $\alpha, \pi_1, \ldots, \pi_L$.
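The revised model's log likelihood can be evaluated window by window, which gives a simple starting point for the numerical methods mentioned above. The following Python sketch is illustrative only: `revised_log_lik` and `grid_search_alpha` are hypothetical names, the crude grid search over $\alpha$ with `pi` held fixed is an assumption made here for brevity, and it is not the numerical method used in the thesis.

```python
import math


def revised_log_lik(seq, A, B, l, alpha, pi):
    """Modified log likelihood of `seq` under the revised one-pattern model
    with an order-0 underlying chain (independent trials).
    `pi` maps each letter to its probability; requires 0 <= alpha < 1."""
    total = 0.0
    for t in range(len(seq) - l):
        window = seq[t:t + l + 1]
        p = pi[window[-1]]
        if window in A:
            total += math.log(alpha + (1.0 - alpha) * p)
        elif window in B:
            total += math.log((1.0 - alpha) * p)
        else:
            total += math.log(p)
    return total


def grid_search_alpha(seq, A, B, l, pi, steps=100):
    """Value of alpha on the grid 0, 1/steps, ..., (steps-1)/steps that
    maximizes the revised log likelihood for a fixed `pi`."""
    grid = [k / steps for k in range(steps)]
    return max(grid, key=lambda a: revised_log_lik(seq, A, B, l, a, pi))
```

Setting `alpha = 0.0` recovers the log likelihood of the underlying independent sequence, which mirrors the remark that the process reduces to $M_0$ when $\alpha = 0$. A full fit would maximize over `pi` as well, for example by alternating between the two sets of parameters.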
If the underlying Markov process is of first order with transition matrix $\pi_{ij} = P(X_{t+1} = j \mid X_t = i)$, $i, j = 1, 2, \ldots, L$, then the transition distribution is
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha + (1-\alpha)\pi_{i_{l-1} i_l} & \text{if } i \in A, \\
(1-\alpha)\pi_{i_{l-1} i_l} & \text{if } i \in B, \\
\pi_{i_{l-1} i_l} & \text{otherwise.}
\end{cases}
\]
Under this model, the modified likelihood of the sequence $i = i_0 i_1 \cdots i_n$ is
\[ P(i) = (1-\alpha)^{n_B} \prod_{i,j=1}^{L} \bigl(\alpha + (1-\alpha)\pi_{ij}\bigr)^{N_A(ij)}\, \pi_{ij}^{N(ij)}, \]
where $n_B$ and $N(ij)$ are the same as above and

$N_A(ij)$ = the number of $A$ patterns whose last two letters are $i$ and $j$ respectively.

2.3 Multiple Patterns: the Original Model

Sometimes in a data set, apart from the main pattern $P_0$, there are several less dominant patterns $P_1, P_2, \ldots, P_s$. That can make the model described above less effective. We propose the following approach. Without loss of generality, we can assume that the patterns are already sorted by their dominance, i.e. $|P_0| \ge |P_1| \ge |P_2| \ge \cdots \ge |P_s|$, where $|P_i|$ denotes the number of times $P_i$ occurs. Let us assume that all these patterns are $l$th order patterns. For each pattern $P_i$, we construct the corresponding $l$-group $A_i$:
\[ A_i = \{ P_{i1}, P_{i2}, \ldots, P_{i m_i} \}, \qquad i = 0, 1, 2, \ldots, s. \]
We make the following simplifications:
1. If a common pattern appears in several different groups, we only retain the pattern in the most dominant group.
2. If there are patterns whose first $l$ letters are the same and whose last letters are different, and these different patterns belong to different groups, then we remove these patterns from the corresponding groups. The first $l$ letters of these patterns are denoted by $p_i$.

Let the new pattern groups be $A_0, A_1, \ldots, A_s$ and suppose there are $s'$ removed $l$-letter patterns $p_1, p_2, \ldots, p_{s'}$. We consider the following model:
1. If the previous and current states belong to a pattern group $A_h$, the next state either continues the pattern with probability $\alpha_h$ or else is randomly generated from the other states according to a Markov chain of order less than that of the pattern.
2.
If the previous states are one of $p_1, p_2, \ldots, p_{s'}$, then the last letter has a specific distribution that depends on the pattern.
3. Otherwise, the next observation is generated randomly from the same Markov chain.

As in the one pattern case, if the underlying Markov chain is of order 0, the transition mechanism is determined by
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha_h & \text{if } i \in A_h,\ 1 \le h \le s, \\[4pt]
\dfrac{(1-\alpha_h)\,\pi_{i_l}}{\sum_{j:\; i_0 \cdots i_{l-1} j \notin A_h} \pi_j} & \text{if } i \in B_h,\ 1 \le h \le s, \\[10pt]
\gamma_{i_l}(h) & \text{if the previous } l \text{ letters coincide with pattern } p_h \text{ and } 1 \le h \le s', \\
\pi_{i_l} & \text{otherwise,}
\end{cases}
\]
where $B_h = \{\, i \notin A_h : \exists\, j \in A_h$ such that the first $l$ letters of $i$ and $j$ are the same $\}$.

As in the one pattern case, we can get the modified likelihood of a specific sequence $i = i_0 i_1 \cdots i_n$ in the following way:
\[
P(i) = \prod_{h=1}^{s} \alpha_h^{n_{A_h}} (1-\alpha_h)^{n_{B_h}}
\prod_{i=1}^{L} \frac{\pi_i^{N(i)}}{(1 - \pi_i)^{N_B(i)}}
\prod_{h=1}^{s'} \prod_{i=1}^{L} \gamma_i(h)^{n_i(h)},
\]
where

$n_{A_h}$ = the number of patterns in $A_h$, $h = 1, 2, \ldots, s$,
$n_{B_h}$ = the number of patterns in $B_h$, $h = 1, 2, \ldots, s$,
$N(i)$ = the number of patterns that are not $A$ patterns, whose first $l$ elements are not any of $p_1, p_2, \ldots, p_{s'}$, and whose last letter is $i$,
$N_B(i)$ = the number of $B$ patterns whose last letter is not $i$, and whose first $l$ elements together with $i$ form an $A$ pattern,
$n_i(h)$ = the number of patterns whose first $l$ letters are $p_h$ and whose last letter is $i$.

It is easy to see that the maximum likelihood estimates of $\alpha_h$ and $\gamma_i(h)$ are respectively
\[ \hat{\alpha}_h = \frac{n_{A_h}}{n_{A_h} + n_{B_h}}, \qquad 1 \le h \le s, \]
\[ \hat{\gamma}_i(h) = \frac{n_i(h)}{\sum_{j=1}^{L} n_j(h)}, \qquad 1 \le h \le s'. \]
As in the one pattern case, it is impossible to get the analytical form for the MLE of the $\pi_i$. The idea and calculation of the maximum likelihood estimators of $\pi_1, \pi_2, \ldots, \pi_L$ are the same as in the one pattern case and we omit the details here.

If the underlying Markov chain is of first order, with transition matrix
\[ \pi_{ij} = P\{X_{t+1} = j \mid X_t = i\}, \qquad 1 \le i, j \le L, \]
the transition mechanism is
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha_h & \text{if } i \in A_h,\ 1 \le h \le s, \\[4pt]
\dfrac{(1-\alpha_h)\,\pi_{i_{l-1} i_l}}{\sum_{j:\; i_0 \cdots i_{l-1} j \notin A_h} \pi_{i_{l-1} j}} & \text{if } i \in B_h,\ 1 \le h \le s, \\[10pt]
\gamma_{i_l}(h) & \text{if the previous } l \text{ letters coincide with pattern } p_h \text{ and } 1 \le h \le s', \\
\pi_{i_{l-1} i_l} & \text{otherwise.}
\end{cases}
\]
The modified likelihood of the sequence $i = i_0 i_1 \cdots i_n$ is
\[
P(i) = \prod_{h=1}^{s} \alpha_h^{n_{A_h}} (1-\alpha_h)^{n_{B_h}}
\prod_{i,j=1}^{L} \frac{\pi_{ij}^{N(ij)}}{(1 - \pi_{ij})^{N_B(ij)}}
\prod_{h=1}^{s'} \prod_{i=1}^{L} \gamma_i(h)^{n_i(h)}.
\]
The notations are the same as above except that

$N(ij)$ = the number of patterns that are not $A$ patterns, whose first $l$ letters are not any of $p_1, p_2, \ldots, p_{s'}$, and whose last two letters are $i$ and $j$,
$N_B(ij)$ = the number of $B$ patterns whose second last letter is $i$, whose last letter is not $j$, and whose first $l$ letters together with $j$ form one of the $A$ patterns,

and
\[ A = \bigcup_{h=1}^{s} A_h, \qquad B = \bigcup_{h=1}^{s} B_h. \]
The maximum likelihood estimates of $\alpha_h$ and $\gamma_i(h)$ are the same as above. We can use the same idea to obtain the maximum likelihood estimator of $\pi_{ij}$.

2.4 Multiple Patterns: A Revised Model

As in the one pattern case, we consider the revised model in which a pattern in $A_h$ is continued with probability $\alpha_h$ and the next state evolves according to the underlying Markov process with probability $1 - \alpha_h$. Note that the next state can be any letter, whether it continues the pattern or not. The transition distribution is given as
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha_h + (1-\alpha_h)\pi_{i_l} & \text{if } i \in A_h,\ 1 \le h \le s, \\
(1-\alpha_h)\pi_{i_l} & \text{if } i \in B_h,\ 1 \le h \le s, \\
\gamma_{i_l}(h) & \text{if the previous } l \text{ letters coincide with pattern } p_h \text{ and } 1 \le h \le s', \\
\pi_{i_l} & \text{otherwise.}
\end{cases}
\]
The modified likelihood of a given sequence $i = i_0 i_1 \cdots i_{n-1} i_n$ is
\[
P(i) = P(i_l \mid i_0 \cdots i_{l-1}) \cdots P(i_n \mid i_{n-l} \cdots i_{n-1})
= \prod_{h=1}^{s} (1-\alpha_h)^{n_{B_h}}
\prod_{i=1}^{L} \pi_i^{N(i)}
\prod_{h=1}^{s} \prod_{i=1}^{L} \bigl(\alpha_h + (1-\alpha_h)\pi_i\bigr)^{N_{A_h}(i)}
\prod_{h=1}^{s'} \prod_{i=1}^{L} \gamma_i(h)^{n_i(h)}.
\]
The notation is the same as above except that

$N_{A_h}(i)$ = the number of $A_h$ patterns whose last index is $i$.

We can use the same method as before to get the equations for the maximum likelihood estimate.

If the underlying Markov process is of first order with transition matrix $\pi_{ij} = P(X_{t+1} = j \mid X_t = i)$, $i, j = 1, 2, \ldots, L$, then the transition distribution of a pattern is
\[
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha_h + (1-\alpha_h)\pi_{i_{l-1} i_l} & \text{if } i \in A_h,\ 1 \le h \le s, \\
(1-\alpha_h)\pi_{i_{l-1} i_l} & \text{if } i \in B_h,\ 1 \le h \le s, \\
\gamma_{i_l}(h) & \text{if the previous } l \text{ letters coincide with pattern } p_h \text{ and } 1 \le h \le s', \\
\pi_{i_{l-1} i_l} & \text{otherwise.}
\end{cases}
\]
The modified likelihood of a given sequence $i = i_0 i_1 \cdots i_{n-1} i_n$ is
\[
P(i) = P(i_l \mid i_0 \cdots i_{l-1}) \cdots P(i_n \mid i_{n-l} \cdots i_{n-1})
= \prod_{h=1}^{s} (1-\alpha_h)^{n_{B_h}}
\prod_{i,j=1}^{L} \pi_{ij}^{N(ij)}
\prod_{h=1}^{s} \prod_{i,j=1}^{L} \bigl(\alpha_h + (1-\alpha_h)\pi_{ij}\bigr)^{N_{A_h}(ij)}
\prod_{h=1}^{s'} \prod_{i=1}^{L} \gamma_i(h)^{n_i(h)}.
\]
The notation is the same as above except that

$N_{A_h}(ij)$ = the number of $A_h$ patterns whose last two letters are $i$ and $j$ respectively.

We can use the same method as before to get the equations for the maximum likelihood estimates.

2.5 Model Comparison

Our approach to fitting these models to data uses the Bayesian Information Criterion (BIC), as in Raftery and Tavare (1994). This allows us to compare models that are not nested. The BIC of a model is defined by
\[ \mathrm{BIC} = -2 \log L + k \log m, \]
where $L$ is the maximized likelihood, $k$ is the number of independent parameters in the model, and $m$ is the number of observations used to compute the likelihood. (For a sequence $i_0 i_1 \cdots i_n$, with $n > l$, this is typically $m = n - l$.) We choose the model with the smallest BIC. As was pointed out in Raftery and Tavare (1994), such a comparison should not be regarded as decisively favoring a larger model over a smaller model unless the difference in BIC values is at least about 10.

A classical approach to testing the importance of a pattern in a sequence can be based on a hypothesis test of $\alpha = 0$. (Recall that in our revised model, if $\alpha$ is about 0, then the pattern is not significant; if $\alpha$ is large, then the pattern is significant.) Therefore, we propose the following hypothesis test:
\[ H_0 : \alpha = 0. \]
The alternative hypothesis is $H_1 : \alpha \neq 0$, and we can then perform a likelihood ratio test.

Chapter 3
Applications to the Song of Wood Pewee

In this chapter, we study the repeated patterns of the song of the Wood Pewee described and analyzed by Craig (1943), Chatfield and Lemon (1970), Bishop (1975), and Raftery and Tavare (1994). The song of the Wood Pewee consists of three different states, which we denote by 1, 2, and 3.
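The model comparisons in this chapter and the next rest on the BIC of Section 2.5. The following is a minimal Python sketch of that criterion; the function names `bic` and `select_model` are hypothetical, and the rule of flagging a choice as decisive when the gap to the runner-up is at least 10 is a simplification assumed here of the guideline quoted from Raftery and Tavare (1994).

```python
import math


def bic(max_log_lik, k, m):
    """BIC = -2 log L + k log m, where max_log_lik is the maximized
    log likelihood, k the number of independent parameters and
    m the number of observations used to compute the likelihood."""
    return -2.0 * max_log_lik + k * math.log(m)


def select_model(bics, threshold=10.0):
    """Pick the model with the smallest BIC from a dict {name: BIC}.
    Also report whether the runner-up is at least `threshold` worse,
    which is treated here (as an assumption) as a decisive difference."""
    ranked = sorted(bics.items(), key=lambda item: item[1])
    (best_name, best_value), (_, second_value) = ranked[0], ranked[1]
    return best_name, (second_value - best_value) >= threshold
```

For the independent and equiprobable model of order -1 there are no free parameters, so with $L$ states its BIC reduces to `bic(-m * math.log(L), 0, m)`, that is, $2m \log L$.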
We use the models described in Chapter 2 to reanalyze the data given in Raftery and Tavare (1994). The data are dominated by the pattern 1312, which is a second order pattern according to our definition. The corresponding $l$-group is
\[ A = \{131, 312, 121, 213\} \]
and
\[ B = \{132, 133, 311, 313, 122, 123, 211, 212\}. \]
Table 3.1 gives the number of parameters and the BIC for each model. From this table, we see that the revised model is better than the original model, especially when we consider just one pattern. The BICs for the revised model are much smaller than those for the original model. However, when we consider the two patterns 1312 and 112, although the BICs for the revised model are a little smaller than those for the original model, the improvement is not significant. The main reason for this may be that there are many $1 \le i, j \le 3$ such that $N(ij) = N_B(ij) = 0$.

Another thing that needs to be noted is that for the equiprobable model MC(-1), the original model and the revised model are the same. To see this, let us look at the one pattern case. The original model gives the transition probability of a pattern $i = i_0 i_1 \cdots i_l$ as
\[
P_0(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha & \text{if } i \in A, \\
\tfrac{1}{2}(1 - \alpha) & \text{if } i \in B, \\
\tfrac{1}{3} & \text{otherwise,}
\end{cases}
\]
and the revised model gives the probability
\[
P_1(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha + \tfrac{1}{3}(1 - \alpha) & \text{if } i \in A, \\
\tfrac{1}{3}(1 - \alpha) & \text{if } i \in B, \\
\tfrac{1}{3} & \text{otherwise.}
\end{cases}
\]
If we let $\alpha^* = \alpha + \tfrac{1}{3}(1 - \alpha)$, then $\tfrac{1}{2}(1 - \alpha^*) = \tfrac{1}{3}(1 - \alpha)$, so $P_1$ is equivalent to $P_0$ except that $\alpha^*$ is used in place of $\alpha$. Therefore the two models are the same. The same can be said about the two pattern case.

From the BIC point of view, Models 12 and 14 and their equiprobable specialization give significantly better BIC values than the other models. The two pattern model with equal probability has the smallest BIC. For this model, the corresponding maximum likelihood estimates of the parameters are
\[ \hat{\alpha}_1 = 0.983, \quad \hat{\alpha}_2 = 0.776, \]
\[ \hat{\gamma}_1 = 0.156, \quad \hat{\gamma}_2 = 0.084, \quad \hat{\gamma}_3 = 0.760. \]
The estimate $\hat{\alpha}_1$ is close to 1, indicating the strong dependence on the 1312 pattern. The 112 pattern is also significant, as indicated by $\hat{\alpha}_2 = 0.776$.
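Our conclusions coincide with the result of Raftery and Tavare (1994).

For Model 7, the parameter estimate is $\hat{\alpha} = 0.921$. Using our numerical method, we get
\[ \hat{\pi}_1 = 0.365, \quad \hat{\pi}_2 = 0.483, \quad \hat{\pi}_3 = 0.152. \]
For Model 8, the parameter estimate is $\hat{\alpha} = 0.921$. The numerical algorithm gives
\[ \hat{\pi}_{11} = 0.360, \quad \hat{\pi}_{12} = 0.493, \quad \hat{\pi}_{13} = 0.147, \]
\[ \hat{\pi}_{21} = 0.250, \quad \hat{\pi}_{22} = 0.500, \quad \hat{\pi}_{23} = 0.250, \]
\[ \hat{\pi}_{31} = 1.000, \quad \hat{\pi}_{32} = 0, \quad \hat{\pi}_{33} = 0. \]
For Model 9, the parameter estimate is $\hat{\alpha} = 0.870$. The numerical method provides
\[ \hat{\pi}_1 = 0.480, \quad \hat{\pi}_2 = 0.440, \quad \hat{\pi}_3 = 0.080. \]
For Model 10, the parameter estimate is $\hat{\alpha} = 0.789$, and the MLEs are
\[ \hat{\pi}_{11} = 0.330, \quad \hat{\pi}_{12} = 0.574, \quad \hat{\pi}_{13} = 0.096, \]
\[ \hat{\pi}_{21} = 0.710, \quad \hat{\pi}_{22} = 0.193, \quad \hat{\pi}_{23} = 0.097, \]
\[ \hat{\pi}_{31} = 0.961, \quad \hat{\pi}_{32} = 0.039, \quad \hat{\pi}_{33} = 0. \]
For Model 12, we can think of the data as generated by a mixture of two patterns and a low order Markov chain. Here $A_1 = \{312, 121, 131\}$, $A_2 = \{112\}$, $p_1 = 21$, and the parameter estimates are
\[ \hat{\alpha}_1 = 0.983, \quad \hat{\alpha}_2 = 0.776, \]
\[ \hat{\gamma}_1 = 0.156, \quad \hat{\gamma}_2 = 0.084, \quad \hat{\gamma}_3 = 0.760, \]
and
\[ \hat{\pi}_1 = 0.325, \quad \hat{\pi}_2 = 0.414, \quad \hat{\pi}_3 = 0.261. \]
For Model 13, the parameter estimates are
\[ \hat{\alpha}_1 = 0.983, \quad \hat{\alpha}_2 = 0.776, \]
\[ \hat{\gamma}_1 = 0.156, \quad \hat{\gamma}_2 = 0.084, \quad \hat{\gamma}_3 = 0.760, \]
\[ \hat{\pi}_{11} = 0, \quad \hat{\pi}_{12} = 1, \quad \hat{\pi}_{13} = 0, \]
\[ \hat{\pi}_{21} = 0.250, \quad \hat{\pi}_{22} = 0.500, \quad \hat{\pi}_{23} = 0.250, \]
\[ \hat{\pi}_{31} = 1, \quad \hat{\pi}_{32} = 0, \quad \hat{\pi}_{33} = 0. \]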
For Model 14, the parameter estimates are
\[ \hat{\alpha}_1 = 0.972, \quad \hat{\alpha}_2 = 0.706, \]
\[ \hat{\gamma}_1 = 0.156, \quad \hat{\gamma}_2 = 0.084, \quad \hat{\gamma}_3 = 0.760, \]
and
\[ \hat{\pi}_1 = 0.471, \quad \hat{\pi}_2 = 0.241, \quad \hat{\pi}_3 = 0.288. \]
For Model 15, the parameter estimates are
\[ \hat{\alpha}_1 = 0.921, \quad \hat{\alpha}_2 = 0, \]
\[ \hat{\gamma}_1 = 0.156, \quad \hat{\gamma}_2 = 0.084, \quad \hat{\gamma}_3 = 0.760, \]
\[ \hat{\pi}_{11} = 0.138, \quad \hat{\pi}_{12} = 0.724, \quad \hat{\pi}_{13} = 0.138, \]
\[ \hat{\pi}_{21} = 0.710, \quad \hat{\pi}_{22} = 0.193, \quad \hat{\pi}_{23} = 0.097, \]
\[ \hat{\pi}_{31} = 0.961, \quad \hat{\pi}_{32} = 0.039, \quad \hat{\pi}_{33} = 0. \]

     Model                                               k      BIC
 1   full model, order -1                                0   2906.9
 2   full model, order 0                                 2   2713.3
 3   full model, order 1                                 6   1431.4
 4   full model, order 2                                18    866.6
 5   full model, order 3                                54   1096.1
 6   1312 pattern + MC(-1)                               1   1001.1
 7   1312 pattern + MC(0)                                3    996.2
 8   1312 pattern + MC(1)                                7   1017.6
 9   Revised model (1312 pattern + MC(0))                3    946.5
10   Revised model (1312 pattern + MC(1))                7    894.3
11   1312 and 112 patterns + MC(-1)                      4    807.1
12   1312 and 112 patterns + MC(0)                       6    820.5
13   1312 and 112 patterns + MC(1)                      10    841.2
14   Revised model (1312 and 112 patterns + MC(0))       6    818.9
15   Revised model (1312 and 112 patterns + MC(1))      10    844.5

Table 3.1: Here k denotes the number of parameters, MC(k) denotes the full model of order k. The order 0 model refers to an independent sequence, the order 1 model refers to an underlying Markov chain of first order, the order -1 refers to an independent and equiprobable sequence.

Chapter 4
Applications to Molecular Biology

In this chapter we study the repeated patterns in DNA. Such repeated patterns occur frequently in DNA sequences. Some of the repeats are associated with the regulation of genes. Some of them are related to diseases such as fragile-X syndrome, myotonic dystrophy (MD) and Huntington's disease (HD). Benson and Waterman (1994) gave an algorithm that can rapidly find repeats in a DNA sequence. They used their algorithm to find several new repeats that had not been discovered before. In this chapter, we analyze one DNA sequence given in Example 1 in Benson and Waterman (1994). This region comes from the GENBANK primate sequence database.
It occurs in an intron of the Human int-2 proto-oncogene. Other types of DNA sequences with repeats can be analyzed similarly.

In this example the main pattern is ACCCATCC. As pointed out in Benson and Waterman (1994), it occurs 18 times in this region. There are no G nucleotides in the region we are interested in. Therefore we only need to consider 3 letters, A, C, and T. If we let A ↦ 1, C ↦ 2, T ↦ 3, then the main pattern is 12221322. This is a fourth order pattern according to our definition in Chapter 2. The corresponding set $A$ is

$A = \{12221, 22213, 22132, 21322, 13221, 32212, 22122, 21222\}$.

It is not a third order pattern, as 221 can be followed either by 3 or by 2. The corresponding set $B$ is given by

$B = \{12222, 12223, 22211, 22212, 22131, 22133, 21321, 21323, 13222, 13223, 32211, 32213, 22121, 22123, 21221, 21223\}$.

| Model | Description | k | BIC |
|---|---|---|---|
| 1 | full model, order -1 | 0 | 1104.1 |
| 2 | full model, order 0 | 2 | 704.6 |
| 3 | full model, order 1 | 6 | 663.5 |
| 4 | full model, order 2 | 18 | 1010.3 |
| 5 | full model, order 3 | 54 | 3150.5 |
| 6 | 12221322 pattern + MC(-1) | 1 | 445.7 |
| 7 | 12221322 pattern + MC(0) | 3 | 424.9 |
| 8 | 12221322 pattern + MC(1) | 7 | 417.8 |
| 9 | Revised model (12221322 pattern + MC(0)) | 3 | 440.2 |
| 10 | Revised model (12221322 pattern + MC(1)) | 7 | 423.3 |
| 11 | 1322 and 1222 patterns + MC(-1) | 4 | 384.5 |
| 12 | 1322 and 1222 patterns + MC(0) | 6 | 390.1 |
| 13 | 1322 and 1222 patterns + MC(1) | 10 | 398.6 |
| 14 | Revised model (1322 and 1222 patterns + MC(0)) | 6 | 393.6 |
| 15 | Revised model (1322 and 1222 patterns + MC(1)) | 10 | 400.5 |

Table 4.1: Here k denotes the number of parameters, and MC(k) denotes the full model of order k. The order 0 model refers to an independent sequence, the order 1 model refers to an underlying Markov chain of first order, and the order -1 model refers to an independent and equiprobable sequence.

In Table 4.1, rows 7 to 10, we give the BIC values for the one pattern models in Chapter 2. From the BIC point of view, the original model with an underlying Markov process of order 1 seems to fit the data better. For this model, the main pattern is significant, as $\alpha = 0.855$.
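The construction of the sets $A$ and $B$ for the pattern ACCCATCC can be checked mechanically. The sketch below is only an illustration and rests on an assumed reading of the Chapter 2 definitions: $A$ is taken to be the set of cyclic windows of length $l + 1$ of the pattern, $B$ the strings that share their first $l$ letters with a member of $A$ but end in a different letter, and the order of a pattern the smallest $l$ for which every $l$ consecutive letters determine the next one. The function names are hypothetical. Under this reading the code reproduces the sets listed above.

```python
def encode(seq, mapping):
    """Translate a nucleotide string into the numeric alphabet."""
    return "".join(mapping[ch] for ch in seq)


def cyclic_windows(pattern, width):
    """All windows of the given width, reading the pattern cyclically."""
    n = len(pattern)
    repeated = pattern * (width // n + 2)  # long enough for every start position
    return [repeated[s:s + width] for s in range(n)]


def pattern_order(pattern):
    """Smallest l such that any l consecutive letters fix the next letter
    (assumed reading of the thesis's notion of pattern order)."""
    for l in range(1, len(pattern) + 1):
        successors = {}
        for w in cyclic_windows(pattern, l + 1):
            successors.setdefault(w[:l], set()).add(w[l])
        if all(len(s) == 1 for s in successors.values()):
            return l


def pattern_sets(pattern, l, alphabet):
    """Assumed construction: A holds the (l+1)-letter cyclic windows,
    B the strings that start like a member of A but end differently."""
    A = set(cyclic_windows(pattern, l + 1))
    B = {a[:l] + x for a in A for x in alphabet if a[:l] + x not in A}
    return A, B


pattern = encode("ACCCATCC", {"A": "1", "C": "2", "T": "3"})
order = pattern_order(pattern)
A, B = pattern_sets(pattern, order, "123")
print(pattern, order)
print(sorted(A))
print(sorted(B))
```

The same functions applied to the shorter patterns 1322 (ATCC) and 1222 (ACCC) give their orders under the same assumed definition.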
For Model 7, the parameter estimate is $\alpha = 0.855$. Using our numerical method, we get $\pi_1 = 0.119$, $\pi_2 = 0.578$, $\pi_3 = 0.303$.

For Model 8, the parameter estimate is $\alpha = 0.855$. The numerical algorithm gives
$\pi_{11} = 0$, $\pi_{12} = 0.444$, $\pi_{13} = 0.556$,
$\pi_{21} = 0.171$, $\pi_{22} = 0.611$, $\pi_{23} = 0.218$,
$\pi_{31} = 0.236$, $\pi_{32} = 0.607$, $\pi_{33} = 0.157$.

For Model 9, the parameter estimate is $\alpha = 0.770$. The numerical method produces $\pi_1 = 0.170$, $\pi_2 = 0.470$, $\pi_3 = 0.360$.

For Model 10, the parameter estimate is $\alpha = 0.741$, and the MLEs are
$\pi_{11} = 0$, $\pi_{12} = 0.312$, $\pi_{13} = 0.688$,
$\pi_{21} = 0.278$, $\pi_{22} = 0.496$, $\pi_{23} = 0.226$,
$\pi_{31} = 0.235$, $\pi_{32} = 0.609$, $\pi_{33} = 0.156$.

Note that in Example 1 in Benson and Waterman (1994), the pattern ACCCATCC occurs 18 times, while ATCCATCC occurs 8 times. Both of these patterns contain ATCC. So we can think of the data as generated by two patterns (ATCC and ACCC) and a low order Markov chain. We fit the data according to the multiple pattern models given in section 2.3 in Chapter 2. The pattern ATCC is a second order pattern and ACCC is a third order pattern. Therefore we can fit the DNA sequence data regarding both ATCC and ACCC as third order patterns. Under these assumptions and the notation given in Chapter 2, we have $A_1 = \{1322, 3221, 2132\}$, $A_2 = \{1222, 2221, 2122\}$ and $p_1 = 221$.

For Model 12, the parameter estimates are $\alpha_1 = 0.962$, $\alpha_2 = 0.859$, $\gamma_1 = 0$, $\gamma_2 = 0.429$, $\gamma_3 = 0.571$, $\pi_1 = 0.216$, $\pi_2 = 0.471$, $\pi_3 = 0.313$.

For Model 13, the estimates of $\alpha_1$, $\alpha_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the same as for Model 12. The estimates of $\pi$ are
$\pi_{11} = 0$, $\pi_{12} = 0.50$, $\pi_{13} = 0.50$,
$\pi_{21} = 0.273$, $\pi_{22} = 0.348$, $\pi_{23} = 0.379$,
$\pi_{31} = 0.264$, $\pi_{32} = 0.560$, $\pi_{33} = 0.176$.

We would also like to fit the data according to the revised model for multiple patterns (given in section 2.4 in Chapter 2).
For Model 14, the parameter estimates are $\alpha_1 = 0.941$, $\alpha_2 = 0.779$, $\gamma_1 = 0$, $\gamma_2 = 0.429$, $\gamma_3 = 0.571$, $\pi_1 = 0.254$, $\pi_2 = 0.413$, $\pi_3 = 0.333$.

For Model 15, the estimates of $\alpha_1$, $\alpha_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the same as for Model 14. The estimates of $\pi$ are
$\pi_{11} = 0$, $\pi_{12} = 0.50$, $\pi_{13} = 0.50$,
$\pi_{21} = 0.350$, $\pi_{22} = 0.268$, $\pi_{23} = 0.382$,
$\pi_{31} = 0.298$, $\pi_{32} = 0.503$, $\pi_{33} = 0.199$.

From the BIC point of view, we can see that Models 11, 12, and 13 fit the data better than the others. The two pattern model with equal probability has the smallest BIC. With only four parameters, we can explain the DNA sequence data much better than with other, more complicated models. The estimate $\alpha_1 = 0.962$ is close to 1, indicating strong dependence on the pattern 1322. The estimate $\alpha_2 = 0.859$ means that the pattern 1222 is also significant.

In this chapter, we fit the DNA sequence data using the usual Markov process models and the models given in Chapter 2. From our analysis, we find that the usual Markov models give much higher BIC values than our models, which take the repeated patterns into consideration (whether through one pattern models or multiple pattern models). Among the usual Markov models, the independent model and the Markov process of order 1 fit much better than Markov process models of other orders. Taking ACCCATCC as the repeated pattern significantly improves the BIC values. Among these models, the original model with an underlying Markov process of order 1 gives the best BIC value, 417.8. Taking the two patterns ATCC and ACCC as the repeated patterns further improves the BIC value. Among these models, the two pattern model with an underlying independent and identically distributed sequence gives the smallest BIC, 384.5. From this analysis, we see that the two pattern model with independent and identically distributed sequences fits the DNA sequence data the best. This analysis agrees with our analysis of the song of the Wood Pewee.
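The BIC comparison described above can be retraced directly from Table 4.1. The following sketch simply copies the (k, BIC) pairs from that table and picks the smallest BIC within each family of models; the grouping labels and the helper name `best_model` are illustrative choices, not part of the thesis.

```python
# (k, BIC) for each model number, copied from Table 4.1
TABLE_4_1 = {
    1: (0, 1104.1), 2: (2, 704.6), 3: (6, 663.5), 4: (18, 1010.3), 5: (54, 3150.5),
    6: (1, 445.7), 7: (3, 424.9), 8: (7, 417.8), 9: (3, 440.2), 10: (7, 423.3),
    11: (4, 384.5), 12: (6, 390.1), 13: (10, 398.6), 14: (6, 393.6), 15: (10, 400.5),
}

# Model families as they are grouped in the discussion (labels are illustrative)
GROUPS = {
    "usual Markov models": range(1, 6),
    "one pattern models": range(6, 11),
    "two pattern models": range(11, 16),
}


def best_model(models):
    """Return the model number with the smallest BIC among the given models."""
    return min(models, key=lambda m: TABLE_4_1[m][1])


for name, models in GROUPS.items():
    m = best_model(models)
    k, bic = TABLE_4_1[m]
    print(f"{name}: Model {m} (k = {k}, BIC = {bic})")

overall = best_model(TABLE_4_1)
print(f"overall: Model {overall} (BIC = {TABLE_4_1[overall][1]})")
```

With these values, the best model in each family is Model 3, Model 8 and Model 11 respectively, and Model 11 has the smallest BIC overall.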
In both of these examples, the data can best be modeled as two repeated patterns plus random noise.

Chapter 5

Discussions and Further Research

In this thesis, we have developed several different models for explaining behavioral or DNA sequences that contain repeated patterns. The basic idea behind these models comes from Raftery and Tavare (1994). We generalized their original models and gave several modifications of them. Sometimes our modifications gave better BIC values than their original models, as shown in the analysis of the song of the Wood Pewee. Sometimes the modified models gave worse BIC values than their original models, as shown in the analysis of the DNA sequence data given by Benson and Waterman (1994). The reason for this is not clear. From our analysis we also see that these models can give much better BIC values than the usual Markov process models of any order. That means our models are much better suited to modeling sequences that contain repeated patterns. Another interesting feature of our analysis is that both examples can best be modeled by multiple pattern models with independent identically distributed underlying sequences. It is our guess that this phenomenon holds for many sequences with repeated patterns.

The modeling of behavioral sequences is an important area of research, and our analysis can only be regarded as a first step toward understanding such complicated phenomena. A lot of work needs to be done to understand repeated patterns, especially when there are multiple patterns in the sequence. Another important area of research concerns repeats that differ from a basic pattern at some fixed number of positions. This is a difficult problem and we have not been able to model such repeats very well. In the following we propose another model that may be used to model sequences with multiple repeated patterns in them. At the present stage, we are not sure how this model performs compared to other models.
In the following, we construct a new model for multiple patterns. The basic idea behind the new model can be described as follows. For simplicity, suppose that we have two main patterns $P_1$ and $P_2$ and that they are both $l$th order patterns. According to the method described in previous sections, we can write their corresponding $l$-groups $A_1$ and $A_2$ respectively. Unlike in sections 2.3 and 2.4, we do not modify $A_1$ and $A_2$. Instead we consider the following model.

(i) If the previous $l$ letters coincide with those of a member of $A_1$ and do not coincide with those of any member of $A_2$, then with probability $\alpha_1$ the process continues pattern $P_1$, and with probability $1 - \alpha_1$ the process continues according to the underlying Markov process.

(ii) Similarly, if the previous $l$ letters coincide with those of a member of $A_2$ and do not coincide with those of any member of $A_1$, then with probability $\alpha_2$ the process continues pattern $P_2$, and with probability $1 - \alpha_2$ the process continues according to the underlying process.

(iii) If the previous $l$ letters coincide with those of a member of both $A_1$ and $A_2$, there are two cases:
(a) The pattern in $A_1$ that continues the previous $l$ letters is the same as that in $A_2$.
(b) Otherwise.
Then with probability $\alpha_i$ the process continues $P_i$, $i = 1, 2$, and with probability $1 - \alpha_1 - \alpha_2$ the process continues according to the underlying Markov process.

(iv) Otherwise the process continues according to the underlying Markov process.

As before in our original models, we assume that when the process continues according to the underlying Markov process, the next state can only take letters that are different from the letters that continue patterns $P_1$ or $P_2$. In cases (i), (ii) and (iiia), the transition distribution for any pattern $\mathbf{i} = i_0 i_1 i_2 \cdots i_l$ is given as in section 2.1 with $\alpha$ being changed to $\alpha_1$, $\alpha_2$, and $\alpha_1 + \alpha_2$ respectively. In case (iiib), the transition distribution is given by

$$
P(i_l \mid i_0 i_1 \cdots i_{l-1}) =
\begin{cases}
\alpha_k & \text{if } \mathbf{i} \in A_k,\ k = 1, 2, \\[4pt]
\dfrac{(1 - \alpha_1 - \alpha_2)\,\pi_{i_l}}{1 - \pi_{j_1} - \pi_{j_2}} & \text{if } \mathbf{i} \notin A_1 \cup A_2,
\end{cases}
$$

where $j_1$ and $j_2$ denote the letters that continue $P_1$ and $P_2$. Case (iv) is given by the underlying Markov process. Under this model, the modified probability of obtaining a sequence $\mathbf{i} = i_0 i_1 \cdots i_r$ is given by

$$
p(\mathbf{i}) = \alpha_1^{n_1} (1 - \alpha_1)^{m_1}\, \alpha_2^{n_2} (1 - \alpha_2)^{m_2}\, (\alpha_1 + \alpha_2)^{n_{12}} (1 - \alpha_1 - \alpha_2)^{m_{12}}
\times \frac{\prod_{i} \pi_i^{N(i)}}{\prod_{i} (1 - \pi_i)^{N_p(i)} \prod_{i < j} (1 - \pi_i - \pi_j)^{N_p(i,j)}},
$$

where

$n_1$ = the number of patterns that belong to $A_1$ but not $A_2$;

$m_1$ = the number of patterns whose first $l$ letters coincide with those of a member of $A_1$ but not $A_2$, and the pattern is not in $A_1$;

$n_{12}$ = the number of patterns that belong to both $A_1$ and $A_2$;

$m_{12}$ = the number of patterns whose first $l$ letters coincide with those of a member of $A_1$ and $A_2$, and the pattern is not in $A_1$ or $A_2$;

$N(i)$ = the number of patterns that do not belong to $A_1 \cup A_2$ and whose last letter is $i$;

$N_p(i)$ = the number of patterns whose last letter is not $i$, whose first $l$ letters fall under one of cases (i), (ii) and (iiia), and whose first $l$ letters together with $i$ belong to $A_1 \cup A_2$;

$N_p(i,j)$ = the number of patterns whose last letter is neither $i$ nor $j$, whose first $l$ letters fall under case (iiib), and whose first $l$ letters together with $i$ or $j$ belong to $A_1 \cup A_2$.

$n_2$ and $m_2$ can be defined similarly to $n_1$ and $m_1$.

Carrying this idea over to our revised model is possible, but the likelihood function becomes very complicated. Detailed analysis of this model has not been done and is an area for further research.

Reference List

[1] Benson, G. and Waterman, M. S. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Research, 22, 4828-4836.
[2] Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis. Cambridge: Massachusetts Institute of Technology Press.
[3] Chatfield, C. and Lemon, R. E. (1970) Analysing sequences of behaviour events. Journal of Theoretical Biology, 29, 427-445.
[4] Craig, W. (1943) The song of the Wood Pewee. Albany: University of the State of New York.
[5] Feng, Y., Zhang, F. P., Lokey, L. K., Chastain, J. L., Lakkis, L., Eberhart, D.
and Warren, S. T. (1995) Translational suppression by trinucleotide repeat expansion at FMR1. Science, 268, 731-734.
[6] Jin, L. and Chakraborty, R. (1994) Estimation of genetic-distance and coefficient of gene diversity from single-probe multilocus DNA-fingerprint data. Molecular Biology and Evolution, 11, 120-127.
[7] Raftery, A. and Tavare, S. (1994) Estimation and modeling repeated patterns in high order Markov chains with the mixture transition distribution model. Applied Statistics, 43, 179-199.
[8] Stewart, D. T. and Baker, A. J. (1994) Patterns of sequence variation in the mitochondrial D-loop region of shrews. Molecular Biology and Evolution, 11, 9-21.
[9] Snow, K., Tester, D. J., Kruckeberg, K. E., Schaid, D. J. and Thibodeau, S. N. (1994) Sequence-Analysis of the fragile trinucleotide repeat-implications for the origin of the Fragile-X mutation. Human Molecular Genetics, 3, 1543-1551.
[10] Warren, S. T. and Ashley, C. T. (1995) Triplet repeat expansion mutations-the example of fragile-X syndrome. Annual Review of Neuroscience, 18, 77-99.
[11] Warren, S. T. and Nelson, D. L. (1994) Advances in molecular analysis of fragile-X syndrome. Journal of the American Medical Association, 271, 536-542.
Asset Metadata
Creator
Zhang, Xinrong (author)
Core Title
Markovian Models For Discrete Data With Repeated Patterns
Degree
Master of Science
Degree Program
Statistics
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
OAI-PMH Harvest,statistics
Language
English
Contributor
Digitized by ProQuest (provenance)
Advisor
Tavare, Simon (committee chair), Arratia, Richard (committee member), Waterman, Michael (committee member)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c18-10100
Unique identifier
UC11356792
Identifier
1378438.pdf (filename),usctheses-c18-10100 (legacy record id)
Legacy Identifier
1378438-0.pdf
Dmrecord
10100
Document Type
Thesis
Rights
Zhang, Xinrong
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA