De novo peptide sequencing and spectral alignment algorithm via tandem mass spectrometry
DE NOVO PEPTIDE SEQUENCING AND SPECTRAL ALIGNMENT ALGORITHM VIA TANDEM MASS SPECTROMETRY

by
Lijuan Mo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

December 2009

Copyright 2009 Lijuan Mo

Dedication

This thesis is dedicated to my family, for their endless love and support.

Acknowledgments

I would like to take this chance to express my gratitude to the many people who gave me help and support during my PhD study here at USC.

I thank my advisor and mentor, Professor Ting Chen, for the great understanding, patience, help and support he constantly provided to me.

I would like to thank my previous and current thesis committee members, Prof. Fengzhu Sun, Prof. David Kempe, Prof. Frank Alber, Prof. Lei Li and Prof. Ebrahim Zandi, for their comments, suggestions, and help.

I would like to thank Prof. Michael Waterman and Prof. Jasmine Zhou for giving me a great deal of advice on both academic work and teaching while I was their TA.

I would like to thank my colleagues in the USC computational biology group for their help and discussions.

I would like to thank our program secretaries Linda Bazilian, Christina Tasulis, Eleni Yokas, Joe Ungco and others for their help and assistance throughout the years.

I also want to thank my family for their endless love, encouragement and support. I could never stop loving them.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Preface
Chapter 1: Introduction
  1.1 Introduction to MS and MS/MS
    1.1.1 Mass Spectrometry
    1.1.2 Tandem Mass Spectrometry
    1.1.3 MS/MS Fragmentation Patterns
  1.2 Peptide Identification by MS/MS
    1.2.1 Database Search
    1.2.2 De Novo Sequencing
  1.3 Post-Translational Modifications
    1.3.1 Introduction to Post-Translational Modifications
    1.3.2 PTMs Identification by Tandem Mass Spectrometry
Chapter 2: MSNovo: De Novo Peptide Sequencing Algorithm
  2.1 Introduction
  2.2 Methods
    2.2.1 Preprocessing Spectra
    2.2.2 Scoring Function
    2.2.3 Dynamic Programming
    2.2.4 Triply charged Spectra
    2.2.5 LTQ-FT Spectra
  2.3 Results
    2.3.1 Data sets
    2.3.2 Effect of different peptide lengths
    2.3.3 Charge +1 and +2 spectra
    2.3.4 Triply charged spectra
    2.3.5 Adding neutral loss ions
    2.3.6 Parameter tuning
    2.3.7 P-values for de novo sequencing results
  2.4 Discussion
Chapter 3: MSPEP: Spectral Alignment Algorithm
  3.1 Methods
    3.1.1 Spectrum-Peptide Alignment
    3.1.2 Spectrum-Spectrum Alignment
  3.2 Results
    3.2.1 Data Sets
    3.2.2 Spectrum-Peptide Alignment
    3.2.3 Significance of the results
    3.2.4 Spectrum-Spectrum Alignment
  3.3 Discussion
Chapter 4: Future Research
  4.1 Spectral Network
  4.2 Whole Genome Annotation using MS/MS data
    4.2.1 Methods
    4.2.2 Results
    4.2.3 Discussion
References
Appendices
  A Residue Masses of Amino Acids
  B Abbreviations used in the thesis
  C De Novo Peptide Sequencing Programs
  D Database Search Programs
  E Post Translational Modifications Database

List of Tables

1.1 Ionization and fragmentation of peptide R1-R2-R3-R4
2.1 Parameters used in MSNovo
2.2 Means and standard deviations of precursor mass errors and fragment mass errors of each dataset
2.3 Percentage of correctly predicted sequence tags of length at least x. Dataset used is OPD280
2.4 Comparison of MSNovo with PepNovo and NovoHMM on various datasets
2.5 Effect of adding neutral loss ions into MSNovo
3.1 Comparison of MSPEP and InSpect on three datasets: BSA, Alpha-casein and IKKb
3.2 False positive rates and true positive rates for different Z-score thresholds
3.3 False positive rates and true positive rates for different Z-score thresholds
4.1 Virtual peptide database created from human ORFs
A.1 Residue masses of the amino acids. The residue masses of the 20 common amino acids and selected modified amino acids. The data in this table are for amino acid residues. To calculate the mass of a neutral peptide or protein, sum the residue masses plus the masses of the terminating groups (e.g. H at the N-terminus and OH at the C-terminus).
B.1 Abbreviations used in the thesis

List of Figures

1.1 A sample mass spectrum of ovalbumin
1.2 A sample tandem mass spectrum of peptide AAEPSWNGQYLVTLSANAK
1.3 The process of protein and peptide sequencing via tandem MS/MS
1.4 Fragmentation patterns of a peptide containing n residues
2.1 The match tolerance of b-H2O ions follows a normal distribution.
2.2 The normalized ranking of intensities of b-H2O ions follows an exponential distribution.
2.3 Comparison of the accuracy of MSNovo against that of PepNovo and NovoHMM as a function of peptide length
2.4 Comparison of MSNovo with PepNovo and NovoHMM using a spectrum of peptide FAAYLER
2.5 MSNovo accuracy, precision and recall at different values of mass resolution
2.6 MSNovo accuracy, precision and recall at different values of P(noise)
2.7 The distribution of scores of 1,000 random peptides for an LTQ spectrum with parent ion mass 606.37 Da
3.1 An example of spectrum-peptide alignment. A spectrum is aligned with peptide sequence SEAGGKRQ. The alignment path is shown as a red line. A horizontal line in the alignment path denotes a deletion (or PTMs with negative mass) and a vertical line in the alignment path denotes an insertion (or PTMs with positive mass). The black circles along the alignment path denote matched pairs of prefix masses and spectrum peaks.
3.2 (a) shows that the two peptides from spectra S1 and S2 are overlapping. The two alignment paths represent the b-ion and y-ion paths respectively. (b) shows that one of the peptides contains 2 modifications, one with a positive mass shift and the other with a negative mass shift. (c) shows that the two peptides are overlapping and at the same time contain modifications. The red line in each figure is the alignment path. A vertical line means a deletion or a PTM with negative mass. A horizontal line means an insertion or a PTM with positive mass.
3.3 Distribution of the Z-scores of the positive set and the negative sets
3.4 ROC curve of the Z-score for the BSA dataset
3.5 ROC curve of the Z-score for the A-casein dataset
3.6 Optimal alignment path of two tandem mass spectra for peptides VLTSSAR and EKVLTSSAR. Each dot on the path represents an aligned peak pair in the two spectra. The size of the dot represents the statistical score of the matched pair. The higher the score, the more likely it is that these two peaks are signal peaks. The black line is the alignment path. From this path we know that these two peptides overlap and that the mass difference is 257 Da. The mass difference occurs at the N-terminus of the peptides.
3.7 Distribution of the alignment scores for related spectra and unrelated spectra
3.8 ROC curve of the spectrum-spectrum alignment scores
4.1 Six reading frames of a chromosome

Abstract

Tandem mass spectrometry (MS/MS) has become an important experimental method for high-throughput, proteomics-based biological discovery. The most common use of MS/MS in biological applications is peptide sequencing. In this thesis, we focus on algorithms for MS/MS peptide identification and spectral alignment. We carry out two studies: (1) We have developed a de novo sequencing algorithm called MSNovo that integrates a new probabilistic scoring function with a mass-array-based dynamic programming algorithm. MSNovo works on various MS data generated from both LCQ and LTQ mass spectrometers and interprets singly, doubly and triply charged ions. In tests on several datasets, MSNovo performed better than previous algorithms. (2) We have developed spectrum-peptide and spectrum-spectrum alignment algorithms, collectively called MSPEP. MSPEP identifies post-translational modifications through the spectrum-peptide alignment algorithm and reveals the relationships among unknown peptides through the spectrum-spectrum alignment algorithm.

Preface

This thesis consists of two parts, reflecting two different research areas in which I had an opportunity to be involved during my graduate studies at USC.

The first part describes a de novo sequencing algorithm for tandem mass spectrum annotation called MSNovo. MSNovo works on various types of tandem mass spectral data and on singly, doubly and triply charged ions. It integrates a new probabilistic scoring function with a mass-array-based dynamic programming algorithm.

The second part describes a spectral alignment algorithm called MSPEP, which detects post-translational modifications, mutations and spectral overlaps in tandem mass spectra. MSPEP is designed to align an MS/MS spectrum with peptides, proteins or another spectrum, to detect known and unknown post-translational modifications or mutations, or to detect overlapping peptides.

Lijuan Mo
Los Angeles, California
Sep 2009
Chapter 1
Introduction

1.1 Introduction to MS and MS/MS

The completion of the Human Genome Project (HGP) is a milestone of the genomics era. With the DNA sequence information available, the next step is to learn about proteins, the products encoded by that DNA. The study of the expression and functions of proteins has become routine thanks to the fast development of proteomics. Many questions that cannot be answered by genomics studies alone can be addressed by proteomics techniques.

Proteomics approaches include sequence analysis, structure determination and functional exploration of proteins, as well as the construction and analysis of protein-protein interaction networks. Sequence analysis is the first and indispensable step in studying a novel protein and provides information for structural and functional proteomics studies. In the past, the Edman degradation method was used: chemical reagents remove one amino acid at a time from the protein terminus, and the removed amino acids are identified in turn to determine the peptide sequence. The efficiency and speed of the chemical reaction are the greatest limitations of this method in the modern proteomics era, when large numbers of proteins need to be sequenced and further analyzed in a short time. Mass spectrometry has now almost replaced the classic Edman degradation method in protein and peptide sequencing because of its high throughput, accuracy and sensitivity. Moreover, mass spectrometry gives scientists a way to analyze protein mixtures directly and to study post-translational modifications of proteins.

1.1.1 Mass Spectrometry

Mass spectrometry is an analytical technique for measuring the mass-to-charge ratios (m/z) of ions. It can therefore be used to measure the molecular weights of chemical compounds and to study molecular structures by generating a mass spectrum, which consists of a set of peaks representing the mass-to-charge ratios of different ions.
A mass spectrometer is the device used to measure the mass-to-charge ratios of ions. Mass spectrometry came into analytical use in the 1940s, mainly for quantitative analysis of common volatiles such as light hydrocarbons [58]. However, its requirement that the analyzed samples be vaporized limited its applications for a long time. With the development of new ionization techniques in the late 20th century, such as electrospray ionization (ESI) and matrix-assisted laser desorption ionization (MALDI), mass spectrometry became widely used in biological research, especially in high-throughput identification of peptides and proteins.

In a mass spectrometer, the sample is first vaporized and ionized in the ion source. Since the ionized sample particles carry positive charges, they move in the electric field of the mass analyzer. Because different particles in the sample have different masses and carry different charges, they move through the electric field at different speeds. Likewise, when the stream of charged ions passes through a magnetic field, it is deflected, and lighter ions are deflected further than heavier ions. The last part of the mass spectrometer is the detector, which records the ions as they pass by or hit its surface. From this measurement, the mass-to-charge ratios (m/z) of the ions can be calculated and recorded, together with their intensities, as peaks in the mass spectrum. Using this information it is possible to determine the chemical composition of the original sample with a high level of certainty.

The set of measured peaks is called a mass spectrum. A mass spectrum can be represented by a graph in which the x-axis represents the mass-to-charge ratios (m/z) of the ions and the y-axis represents the intensities of the ions. The m/z ratio can be regarded as approximately the molecular weight when the charge state is +1. The intensity level represents the abundance of each ion.
Figure 1.1 is a sample ESI mass spectrum generated from a tryptic peptide fragment of the protein ovalbumin.

Figure 1.1: A sample mass spectrum of ovalbumin

Mass spectrometry is mainly used in protein identification. Identifying a protein using only one mass spectrometer is called protein fingerprinting or peptide mass fingerprinting. The protein or protein mixture of interest is first digested by a protease such as trypsin. The resulting peptide fragments are then charged and measured by the mass spectrometer to generate a mass spectrum. Because of the specificity of the enzyme digestion, it is often possible to identify the protein from this information alone. This machine-generated mass spectrum is called the experimental spectrum. A theoretical mass spectrum can be generated for each protein in a database by a virtual digestion of the protein using the same proteolytic enzyme. Identification is accomplished by matching the observed peptide masses in the experimental mass spectrum with the theoretical masses in the virtual mass spectra derived from a sequence database. If this information is not enough to identify the protein of interest, its peptides can be analyzed further by tandem mass spectrometry (MS/MS).

1.1.2 Tandem Mass Spectrometry

Tandem mass spectrometry enables specific peptides to be detected in complex mixtures on account of their specific and characteristic fragmentation patterns, and it is now the dominant tool for peptide identification and sequencing. Tandem mass spectrometry is a technique that uses two mass spectrometers in series to gain more detailed information about the protein of interest. The two mass spectrometers are connected by a chamber, where peptides are fragmented into small charged ions by Collision Induced Dissociation (CID).
The first mass spectrometer functions as a normal mass spectrometer, in which the enzyme-digested peptide fragments of a protein or protein mixture are measured for their mass-to-charge ratios. As the peptides pass into the connecting chamber, only those within a small range of m/z ratios are selected for CID. The selected ions collide with inert gas atoms in the chamber, and the resulting fragment ions then enter the second mass spectrometer, where their m/z ratios are measured to generate a tandem mass spectrum (MS2 or MS/MS). Figure 1.2 shows a sample tandem mass spectrum of peptide AAEPSWNGQYLVTLSANAK. The x-axis is the mass-to-charge ratio and the y-axis is the intensity level. The intensity level represents the abundance of the charged fragments.

Fragmentation of a peptide results in a series of fragment ions. In tandem mass spectrometry, C-terminal fragments are called y-ions and N-terminal fragments are called b-ions. For example, for peptide AAEPSWNGQYLVTLSANAK, the b6 ion represents fragment AAEPSW, the b7 ion represents fragment AAEPSWN, the b9 ion represents fragment AAEPSWNGQ, the b18 ion represents fragment AAEPSWNGQYLVTLSANA, the y3 ion represents fragment NAK, and the y16 ion represents fragment PSWNGQYLVTLSANAK, as shown in Figure 1.2. In the tandem mass spectrum we can see that most of these b- and y-ions appear. The b1, b2, y1 and y2 ions are usually missing in MS/MS spectra because of their low molecular weights. For singly charged ions of the same series, the distance between two consecutive ions equals the mass of the amino acid residue by which they differ. A series of b-ions or y-ions therefore looks like a ladder that contains the sequence information, and we can use either the b-ion or the y-ion ladder to reconstruct the peptide sequence from the spectrum. Peaks representing b-ions and y-ions are the most important signal peaks in MS/MS spectra, and they typically have the highest intensities.
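The ladder idea can be made concrete with a short Python sketch that lists the theoretical b- and y-ion m/z values of a peptide. This is only an illustration, not code from the thesis: the function name `fragment_ions` is hypothetical, the residue masses are standard monoisotopic values rounded to five decimals, and the sketch assumes unmodified residues and a single fragment charge state.

```python
# Monoisotopic residue masses (Da), rounded to five decimals.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to C-terminal (y) fragments
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide, charge=1):
    """Return two dicts, {i: m/z of b_i} and {j: m/z of y_j}.

    Hypothetical helper: assumes unmodified residues and that every
    fragment carries `charge` protons.
    """
    masses = [MONO[aa] for aa in peptide]
    total = sum(masses)
    n = len(masses)
    b_ions, y_ions = {}, {}
    prefix = 0.0
    for i in range(1, n):              # break the i-th peptide bond
        prefix += masses[i - 1]
        suffix = total - prefix
        b_ions[i] = (prefix + charge * PROTON) / charge
        y_ions[n - i] = (suffix + WATER + charge * PROTON) / charge
    return b_ions, y_ions


b, y = fragment_ions("AAEPSWNGQYLVTLSANAK")
# Consecutive b-ions differ by one residue mass: b7 - b6 is asparagine (N).
print(round(b[7] - b[6], 5))
```

With charge 1, the difference between neighbouring entries of either dictionary is a residue mass, which is the property that sequencing algorithms exploit when they read a peptide off the ladder.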
Other ions are also generated in the CID process, such as the fragmentation products a-ions, c-ions, x-ions and z-ions shown in Figure 1.4, and neutral loss ions. CID fragmentation products can further lose one or two water or ammonia groups, resulting in neutral loss ions such as b-H2O ions and y-NH3 ions. There are also many noise peaks in a tandem mass spectrum resulting from chemical contaminants, machine errors and so on. The sequencing problem of tandem mass spectrometry is complicated by the existence of all these peaks.

Figure 1.2: A sample tandem mass spectrum of peptide AAEPSWNGQYLVTLSANAK

An MS/MS spectrum can be used to identify or sequence peptides and small DNA/RNA molecules, to identify post-translational modifications (PTMs) and to determine some structural features. Figure 1.3 illustrates the experimental procedure of MS and MS/MS. Biological samples such as a protein or protein mixture are first digested by a protease such as trypsin or chymotrypsin, resulting in a series of peptides. The peptides are then charged (ionized) and separated according to their mass-to-charge (m/z) ratios in the first mass spectrometer (1st MS). A selected peptide is then further fragmented by the CID process into fragment ions, and the intensities and m/z values of these fragment ions are measured in the 2nd MS to form a tandem mass spectrum.

Figure 1.3: The process of protein and peptide sequencing via tandem MS/MS

MS/MS is mainly used in peptide identification. As in protein fingerprinting, we can theoretically digest the proteins in a protein database, compare the virtual spectra to the experimental spectrum and identify the peptides; this is called the database search method. Alternatively, we can determine the sequence of the peptide directly from the spectrum without a protein database; this is called the de novo peptide sequencing method.
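The virtual (in-silico) digestion step mentioned above can be sketched in a few lines of Python. The sketch assumes the commonly used trypsin rule, cleavage after lysine (K) or arginine (R) unless the next residue is proline (P), and it ignores missed cleavages; the function name `tryptic_digest` and the example sequence are hypothetical and are not taken from the thesis.

```python
def tryptic_digest(protein):
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule: cleave after K or R, except when the next residue is P.
    Missed cleavages are not modelled in this sketch.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):           # trailing peptide without K/R
        peptides.append(protein[start:])
    return peptides


# Hypothetical toy sequence: the R followed by P is not a cleavage site.
print(tryptic_digest("AKRPGKC"))
```

A database search program would apply such a digestion to every protein in the database and then compute theoretical masses and fragment spectra for the resulting peptides.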
1.1.3 MS/MS Fragmentation Patterns

In the process of CID, one peptide bond of each peptide molecule is broken, and the peptide is fragmented into two ions, typically an N-terminal ion called a b-ion and a C-terminal ion called a y-ion. Besides b- and y-ions, many other types of ions appear in the MS/MS spectrum for various reasons:

(1) Neutral loss ions: neutral loss ions form when b- and y-ions lose certain chemical groups such as H2O and NH3;
(2) Internal fragments: internal fragments form when more than one peptide bond is broken;
(3) Doubly or multiply charged b- and y-ions: multiply charged b- and y-ions form when more than one proton is present on the fragment;
(4) Noise in the spectrum: fragments of "intruder" peptides due to imperfect HPLC separation, chemical contaminants, and possible random physical effects [22].

The way in which these ions are generated is called the fragmentation pattern of a peptide. The major peptide fragmentation pattern is shown in Figure 1.4. These ions display a characteristic pattern in the tandem mass spectrum (MS/MS), which allows computer identification of the selected peptide. However, actual MS/MS spectra are much more complicated because of unknown ion types, unknown charges, missing ions, noise, isotopic ions, and machine errors. The identification of peptide sequences from tandem mass spectra therefore remains a challenging task.

Figure 1.4: Fragmentation patterns of a peptide containing n residues

When the $i$-th peptide bond is broken, it generates the ions $b_i$ and $y_{n-i}$. They are called a complementary ion pair because together they make up the complete peptide sequence. The mass-to-charge ratios of these b- and y-ions are then measured by the mass spectrometer and shown as a spectrum. The peaks in the tandem mass spectrum correspond to the b- and y-ions and to all the other kinds of ions and noise described above.
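As a worked relation for complementary pairs, and under the assumptions (not stated in the text) that both fragments are singly charged, that masses are monoisotopic, and that the peptide $R_1 R_2 \cdots R_n$ is unmodified, the two fragment masses can be written with $m_p \approx 1.00728$ Da denoting the proton mass:

```latex
\begin{align}
  m(b_i)     &= \sum_{k=1}^{i} m(R_k) + m_p, \\
  m(y_{n-i}) &= \sum_{k=i+1}^{n} m(R_k) + m(\mathrm{H_2O}) + m_p, \\
  m(b_i) + m(y_{n-i}) &= M + 2\,m_p,
  \qquad M = \sum_{k=1}^{n} m(R_k) + m(\mathrm{H_2O}).
\end{align}
```

Here $M$ is the neutral peptide mass. Because the sum does not depend on $i$, an observed peak at mass $m$ can be paired with a candidate complementary peak at $M + 2m_p - m$, which is one way to check whether a peak belongs to a b/y pair.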
An ideal fragmentation generates n-1 b-ions and n-1 y-ions from a peptide that is n amino acids long. These b- and y-ions form complete b- and y-ion ladders, in which two adjacent sequences differ by only one amino acid. Table 1.1 is a simple example of the b- and y-ion ladders, showing the b- and y-ions generated from a peptide R1-R2-R3-R4. If the tandem mass spectrum is noise-free and contains only b- and y-ions, we can easily find the complete b- or y-ion ladder and reconstruct the peptide.

Table 1.1: Ionization and fragmentation of peptide R1-R2-R3-R4

Ions   b-ion sequences   Ions   y-ion sequences
b1     (R1)+             y3     (R2-R3-R4)+
b2     (R1-R2)+          y2     (R3-R4)+
b3     (R1-R2-R3)+       y1     (R4)+

1.2 Peptide Identification by MS/MS

There are two major kinds of methods for peptide identification using MS/MS. One is database search, and the other is de novo sequencing. Database search algorithms extract candidate peptides from a protein database using the parent ion mass and score these candidates by matching experimental peaks with theoretical peaks from virtual fragments of the peptides. Popular algorithms include SEQUEST [17] and Mascot [47]. A database search algorithm fails when the peptide is not in the database, or when there are unknown mutations or post-translational modifications in the peptide. De novo algorithms search the space of all possible peptides in an attempt to find the one that best matches the peaks of the mass spectrum. They do not rely on protein database information but reconstruct the peptide sequences directly from the spectra. Many de novo sequencing algorithms have been developed since the late 1990s, including Sherenga [15], Lutefisk [64], Chen's dynamic programming method [12] and suboptimal method [35], PEAKS [37], DACSIM [70], EigenMS [8], PepNovo [18, 20] and NovoHMM [10].
Besides database search and de novo sequencing, there is another kind of algorithm called sequence tagging [19, 38, 42, 59, 60], which finds sub-sequences of the peptide, typically 3-5 amino acids long, instead of the whole peptide sequence. Sequence tagging can be considered a combination of de novo sequencing and database search: usually, de novo methods are used to find the sequence tag (subsequence), and this tag is then used as the query sequence for a database search.

1.2.1 Database Search

In database search methods, the proteins in the database are first virtually digested by a predefined protease such as trypsin. The resulting peptides are then indexed by their molecular weights. Only those peptides whose molecular weights are similar to that of the precursor ion of the MS/MS spectrum are considered candidates, and the hypothetical spectra of those candidates are simulated. A scoring function is then applied to score the match between each hypothetical spectrum and the experimental spectrum. The candidate with the highest match score is predicted to be the peptide that generated the experimental spectrum.

Database search algorithms depend on the protein database, so the size of the database affects both the search speed and the prediction results. A larger database is more likely to contain the correct peptide but requires a longer search time, whereas a smaller database can be searched faster but sometimes does not contain the correct peptide. Usually, MSDB or the NCBI non-redundant database is used when the protein that generated the spectrum is unknown. If the protein is known to come from a certain species or from a known protein mixture, we can build our own database to shorten the search time.

The most popular database search programs include SEQUEST (Thermo Electron) [17], Mascot (Matrix Science) [47], PepHMM [66], ProbID [68], SONAR [21], OMSSA [26] and InSpect [61]. Other available programs can be found in [4, 14, 23, 52].
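The candidate-selection step described above (index peptides by mass, then keep only those close to the precursor mass) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the programs listed: the class name `PeptideIndex`, the tolerance parameter `tol` and the toy masses are hypothetical, and peptide masses are assumed to have been computed beforehand.

```python
import bisect


class PeptideIndex:
    """Peptides sorted by mass so a mass window can be found by binary search."""

    def __init__(self, peptides_with_mass):
        # peptides_with_mass: iterable of (mass_in_Da, peptide_string)
        entries = sorted(peptides_with_mass)
        self.masses = [mass for mass, _ in entries]
        self.peptides = [pep for _, pep in entries]

    def candidates(self, precursor_mass, tol):
        """Return peptides whose mass lies in [precursor - tol, precursor + tol]."""
        lo = bisect.bisect_left(self.masses, precursor_mass - tol)
        hi = bisect.bisect_right(self.masses, precursor_mass + tol)
        return self.peptides[lo:hi]


# Toy index with made-up masses and placeholder peptide names.
index = PeptideIndex([
    (1500.0, "P4"), (1000.2, "P2"), (500.0, "P1"), (1000.9, "P3"),
])
print(index.candidates(1000.5, tol=0.5))
```

Sorting once and using binary search keeps each query logarithmic in the number of indexed peptides, which matters because the scoring function is then applied only to the small set of returned candidates.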
In a database search algorithm, the first step is to pre-process a protein database by virtually digesting each protein into smaller peptides in silico and indexing these peptides by their masses. Then, for a query spectrum with precursor ion mass $M$, these algorithms extract, or filter, a set of peptides $S_M$ that contains all the peptides whose masses are within a certain range of $M$: $[M - \delta, M + \delta]$. All the extracted peptides are then compared with, or scored against, the query spectrum using a scoring function such as the cross-correlation used in SEQUEST or the probabilistic model used in Mascot. The peptides are ranked, and those with the best scores are reported.

Although database search is the method of choice for most current discovery projects, one major limitation is that it cannot identify novel proteins whose sequences are not already included in sequence databases. Database search algorithms fail when the peptide is not in the database, or when there are unknown mutations or polymorphisms in the peptide.

Database search programs have also been used to predict modified peptides that result from mutations of the database sequence or from post-translational modifications (PTMs) [31, 34, 36, 40, 45, 53, 61]. For cases in which the types of modifications are unknown, unrestrictive search algorithms have been developed to predict the modified peptides [61, 63].

Speed is the bottleneck of database search methods, especially when searching against a very large protein database. Recently, much work has been devoted to speeding up database searches. Many algorithms filter the database by using de novo methods to predict short sequence tags and then score only the peptides that match these tags [9, 13, 19, 34, 38, 42, 57, 60, 61].

Database methods fail when protein samples come from unknown or unsequenced organisms, or even from organisms with poorly annotated genes. In these cases, we need to identify peptides by matching mass spectra to a homologous proteome.
Programs such as MS-Blast [54, 55] use the popular similarity search algorithm Blast [2] to simultaneously align multiple de novo predictions with database sequences. MultiTag [59] also performs a similarity search but uses shorter peptide sequence tags. OpenSea [56] and Spider [30] align individual de novo sequences to a database one at a time, treating unaligned portions as de novo sequencing errors, mutations, or post-translational modifications.

Below we describe the algorithms of SEQUEST, Mascot and PepHMM in detail.

SEQUEST

The first database search algorithm for MS/MS peptide identification was SEQUEST [17], developed by Eng et al. in 1994. First, a hypothetical digestion of the proteins in a protein database is performed, and those linear amino acid sequences whose molecular weights are within 1 u of the precursor ion are selected for further analysis. The hypothetical spectra of the selected peptides are generated, and a cross-correlation function is used to measure the similarity between the experimental tandem mass spectrum and the mass-to-charge ratios of the fragment ions predicted from the selected peptides. All the candidate peptides are then sorted according to their normalized cross-correlation scores, Xcorr. The delta value used by SEQUEST, $\Delta C_n$, is the difference between the normalized cross-correlation scores of the first- and second-ranked search results. A $\Delta C_n$ larger than 0.1 indicates a successful match between the MS/MS spectrum and the sequence in the database. Two scores are used in SEQUEST: the preliminary score and the correlation score (Xcorr).

Preliminary Score

$$S_p = \frac{\left(\sum i_m\right) n_i \,(1+\beta)(1+\rho)}{n_t} \qquad (1.1)$$

where $n_i$ is the number of predicted fragment ions that match ions observed in the spectrum and $\sum i_m$ is the sum of the intensities of the matched ions. The continuity index $\beta$ is added if consecutive ions are matched.
If an immonium ion for the amino acid His, Tyr, Trp, Met or Phe is present in the spectrum along with the associated amino acid, $\rho$ is added; $n_t$ is the total number of predicted sequence ions.

Correlation Score (Xcorr)

The hypothetical spectrum $x[i]$ and the experimental spectrum $y[i]$ are used in calculating the cross-correlation:

$$R_\tau = \sum_{i=0}^{n-1} x[i]\, y[i+\tau] \qquad (1.2)$$

The final score attributed to each candidate peptide sequence is the value of the function at $\tau = 0$ minus the mean of the cross-correlation function over the range $-75 < \tau < 75$; the scores are then normalized to 1.

Mascot

Mascot [47] is another widely used database search algorithm for MS/MS peptide identification. It was developed by Perkins et al. in 1999. Their approach calculates the probability that the observed match between the experimental data set and each sequence database entry is a chance event. The lower the probability, the better the match. The Mascot score is $-10\log_{10}(P)$, where $P$ is this probability, so the best match is the peptide with the highest score.

PepHMM

PepHMM [66] was developed by Yunhu Wan, a former Ph.D. student in our group. PepHMM incorporates an accurate scoring function for database search into the algorithm. The scoring function combines information on machine accuracy, peak intensity and the correlation among ions. The mass tolerances of MS/MS were found to be normally distributed, and the intensities of mass peaks were found to be exponentially distributed. An HMM score is calculated based on these two distributions, and the statistical significance of the HMM scores is also calculated to give the final ranking of the peptide candidates. In tests on the ISB ion trap datasets, PepHMM performed better than SEQUEST and Mascot.

1.2.2 De Novo Sequencing

De novo sequencing algorithms aim to reconstruct the peptide sequence directly from an MS/MS spectrum without the aid of protein sequence databases. In principle, they can therefore overcome some of the limitations of database search algorithms.
Many de novo sequencing algorithms have been developed since the late 1990s, including Sherenga [15], Lutefisk [64], the dynamic programming method [12] and a suboptimal method [35], PEAKS [37], DACSIM [70], EigenMS [8], PepNovo [18, 20] and NovoHMM [10]. The fundamental problem addressed by these de novo sequencing algorithms is to find a complete sequence, or ladder, of b- or y-ions such that the distance between consecutive peaks in the ladder is equal to the mass of an amino acid. However, the existence of isotopes, incomplete fragmentations, multiple fragmentations, unknown fragmentations and random noise severely limits the efficacy of these de novo algorithms in practice, and frequently leads to false positive predictions. Thus, as of today, de novo sequencing tools are not as widely used as database search methods.

Mathematically, the de novo sequencing problem can be formulated as follows: given a parent mass $M$, an error range $\delta$, an experimental spectrum $S$ and a scoring function $f()$, find a peptide $P$ with mass $m$ such that $|M - m| < \delta$ and the hypothetical spectrum $H$ of the peptide $P$ has the best match with $S$, i.e. $f(H, S)$ is maximized.

There are two major steps in developing a reliable de novo sequencing algorithm. The first is to generate a pool of candidates, one of which is the correct peptide. The second is to design a good scoring function to select the best candidate from the pool. For the first task, most de novo algorithms [1, 12, 15, 35, 37] create a spectrum graph and then apply a dynamic programming algorithm to find paths in the graph, which represent peptide candidates. Some algorithms [70] use a divide-and-conquer heuristic combined with more sophisticated scoring functions to generate peptide candidates. The choice of scoring functions is quite limited because the dynamic programming algorithm works only if the scoring function is additive: the score of a peptide equals the sum or the product of the scores of its ions (or fragmentations).
Nevertheless, many scoring functions have been proposed. Dancik et al. (1999) considered the probability of different ion types in their scoring function. Later, Frank and Pevzner (2005) developed a de novo sequencing program called PepNovo that uses a probabilistic network with hypothesis testing. At the same time, Fischer et al. (2005) proposed an HMM model called NovoHMM. Recently, Frank and Pevzner released a new version of PepNovo that deals with LTQ-FT data [20]. The new version of PepNovo generates sequence tags up to 8 amino acids long, which are then used to query the database for peptide identification. The most popular de novo sequencing algorithms for peptide identification include PEAKS, PepNovo, Lutefisk and Sherenga.

PEAKS

PEAKS is one of the most frequently used programs for de novo sequencing using tandem mass spectrometry. It was developed by Ma et al. [37]. They first preprocess the raw MS/MS data by filtering noise, centering peaks and deconvolving doubly and triply charged peaks into singly charged peaks. A reward or penalty is given based on how closely a peak matches a mass in the hypothetical spectrum. The aim is to find a sequence whose b- and y-ions maximize the rewards at their mass values. A dynamic programming algorithm is used to compute the 10,000 sequences with the highest scores. The scores are then refined by a more stringent scoring scheme that uses a stricter mass error tolerance, and the best candidates under the new scoring scheme are output. A recalibration method is used to correct minor deviations in the MS/MS data. Finally, PEAKS computes a confidence score for each top-scoring peptide sequence.

PepNovo

The PepNovo program was developed by Frank and Pevzner [18, 20] and is based on a scoring method that uses a probabilistic network model.
They defined a probabilistic network with three different types of dependencies: correlations between fragment ions, dependencies due to the relative position of the cleavage site in the peptide, and the influence of the amino acids flanking the cleavage site. They integrated a hypothesis test into their scoring function. The CID hypothesis assumes that mass m was caused by a genuine cleavage in the peptide, and the random hypothesis assumes that mass m was caused by a random process. They defined the score given to a mass m and spectrum S to be the logarithm of the likelihood ratio of these two hypotheses. Their scoring function is as follows:

$$\mathrm{score}(m, S) = \log \frac{P_{CID}(I \mid m, S)}{P_{random}(I \mid m, S)} \tag{1.3}$$

$$P_{CID}(I \mid m, S) = \prod_{v \in V} P_{CID}(I \mid \pi(v), m, S) \tag{1.4}$$

$$P_{random}(I = t \mid n_1, n_2, \ldots, n_d) = \left(1 - \alpha^{n_t}\right) \alpha^{\sum_{i=t+1}^{d} n_i} \tag{1.5}$$

The probabilistic network was then trained using a dataset from the same type of source (in their experiments, they tested the program using an ISB ion-trap dataset). A spectrum graph was constructed using only the top-ranking peaks in a spectrum, selected by a sliding window method, and the parent mass was recalibrated using a combinatorial parent mass correction procedure. A dynamic programming algorithm similar to that of Chen et al. was applied to find the highest-scoring antisymmetric path in the spectrum graph. The distinguishing feature of their algorithm is its scoring function.

Lutefisk

The Lutefisk algorithm was developed by Taylor and Johnson [64]. It uses a de novo sequencing algorithm to derive a short list of possible sequence candidates, which are used as query sequences in a subsequent homology-based database search. Lutefisk first identifies significant ions, using local maxima and sliding window methods. It then determines the N- and C-terminal evidence lists. The N-terminal ions considered include b, b-17, b-18, a, a-17 and a-18. The C-terminal ions considered include y, y-17 and y-18.
They assign an approximate probability to each ion type: the probabilities of a, a-17, a-18, b-17, b-18, y-17 and y-18 ions are half of those assigned to b and y ions. The next step is to determine the sequence spectrum, in which the x-axis is the nominal m/z value of the b-ions and the y-axis is the sum of the various ion probabilities suggesting cleavage at each site. After establishing the sequence spectrum, the program proceeds by tracing out sequences starting from the N-terminus. It searches for b-ion values that differ from the N-terminal value by one or two amino acids. Once the completed sequences are obtained, they are scored according to an intensity-based score.

Sherenga

Sherenga was developed by Dancik et al. [15]. It was the first algorithm to introduce the concepts of the spectrum graph and of training data. The authors used a set of training data to learn the ion types of the peaks in the spectrum without any prior information about fragmentation patterns. They then used the ion type information to build a spectrum graph. The algorithm defines a set of k different ion types, and each peak in the spectrum generates k vertices. Two vertices u and v are connected by a directed edge if and only if their mass difference equals the mass of one amino acid. The peptide sequencing problem is thereby transformed into finding the longest antisymmetric path in the directed acyclic spectrum graph. The reason for requiring an antisymmetric path is to avoid multiple uses of vertices corresponding to the same experimental spectral peak.

Dynamic Programming Algorithm

The dynamic programming algorithm was developed by Chen et al. [12]. They first observed that the forbidden pairs in the spectrum graph are non-interleaving and then designed a dynamic programming method to find the longest antisymmetric path in the spectrum graph, which gave the first polynomial-time algorithm for de novo sequencing. They simplified Dancik's idea by assuming each peak to be either a b-ion or a y-ion.
Each peak then generates two vertices instead of k vertices. Two vertices are connected if their mass difference equals the mass of an amino acid. All the vertices are then placed on the real line of the spectrum, and dynamic programming is used to obtain the optimal result. Dynamic programming is a common technique for solving optimization problems. However, the optimal solution may not be the sequence that produced the experimental spectrum, which has led researchers to look for suboptimal solutions to the de novo sequencing problem.

Suboptimal Methods

The suboptimal concept was developed by Lu and Chen [35] as an extension of the previous dynamic programming method. They transformed the spectrum into a two-dimensional matrix spectrum graph and found the suboptimal solutions. A hypothetical spectrum is generated for each candidate peptide and scored using a simple scoring function. Let S1 be the sum of the abundance levels of all of the ions in the hypothetical spectrum and S2 be the sum of the abundance levels of the ions that match the experimental spectrum. The ratio S2/S1 is used as the score to rank the results. This is also a polynomial-time algorithm: it takes $O(p|E|)$ time to find all suboptimal solutions, where $p$ is the number of solutions and $|E|$ is the number of edges in the matrix spectrum graph.

1.3 Post-Translational Modifications

1.3.1 Introduction to Post-Translational Modifications

Post-translational modification (PTM) [39] is the chemical modification of a protein after its translation. In this thesis, PTM specifically refers to a chemical, covalent modification of an amino acid. Post-translational modifications play an important role in protein function and are therefore receiving more and more attention in biological studies. There are two kinds of PTMs: fixed modifications and differential modifications. Fixed modifications, also called static modifications, are those in which the amino acid mass is permanently changed by a fixed value, such as the modifications of cysteine.
In a peptide, usually all Cys residues are modified by carboxymethylation (+58) or carbamidomethylation (+57). In differential modifications, some amino acids are modified while others stay the same. Introducing differential modifications into peptide identification algorithms, whether database search or de novo sequencing, greatly slows down the programs. For the sake of efficiency, peptide identification programs usually include only a limited number of PTMs. There are several online databases of PTMs, such as UNIMOD (http://www.unimod.org/) and RESID [24, 25] (http://www.ebi.ac.uk/RESID/).

1.3.2 PTMs Identification by Tandem Mass Spectrometry

PTMs have become an important topic in tandem mass spectrometry research. They can be detected by searching for m/z shifts in the tandem mass spectrum. As mass spectrometry equipment has been greatly improved to generate more accurate and precise spectra, designing an efficient and accurate algorithm to interpret large numbers of spectra and annotate the PTMs in them has become an urgent task in computational biology. As described before, traditional protein or peptide identification relies on two major methods: database search and de novo sequencing. Database search methods obtain higher accuracy than de novo sequencing. However, database search methods fail when the peptides come from unknown proteins that are not in the database. De novo sequencing can detect unknown peptides but cannot achieve the same accuracy. De novo sequencing also deals with a much more limited set of PTMs than database search. For example, PepNovo [18, 20] lists only 5 known PTMs. Although users can add other known PTMs to the list themselves, adding more PTMs decreases the accuracy and increases the running time. As a result, identification of PTMs in proteins still requires manual verification.
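The idea of detecting a modification through a mass shift can be illustrated with a minimal sketch. The snippet compares an observed precursor mass with the mass of an unmodified candidate peptide and reports which modification could explain the difference. The function name `explain_shift`, the tolerance, and the short table of monoisotopic mass deltas are assumptions made for this illustration and do not come from any program discussed in this thesis; a real search would take its deltas from a resource such as UNIMOD and would need high-resolution precursor masses.

```python
# Illustrative monoisotopic mass shifts (Da) of a few common modifications.
# This short table is an assumption made for the example only.
PTM_DELTAS = {
    "methylation": 14.0157,
    "oxidation": 15.9949,
    "acetylation": 42.0106,
    "phosphorylation": 79.9663,
}


def explain_shift(unmodified_mass, observed_mass, tol=0.02):
    """Return the names of modifications whose mass delta matches the shift.

    tol is a hypothetical mass tolerance in Da.
    """
    shift = observed_mass - unmodified_mass
    return sorted(
        name for name, delta in PTM_DELTAS.items() if abs(shift - delta) <= tol
    )


# A peptide observed about 79.97 Da heavier than its unmodified form.
print(explain_shift(1000.00, 1079.97))
```

Every additional entry in the table adds candidates that must be tested for every peptide, which is the efficiency cost of differential modifications noted above.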
The availability of high-resolution mass spectra and tandem mass spectra, such as FTICR and Orbitrap data, allows more accurate identification of PTMs. In addition to searching one spectrum against a database or sequencing it de novo, we are now able to compare two spectra by aligning them to find peptide overlaps or PTMs. The unrestrictive database search method was developed by Tanner et al. [65, 63], who developed a spectral alignment method to identify spectral pairs from overlapping peptides or from modified and unmodified versions of the same peptide. A spectral network is then constructed, and consensus peaks are found. With the spectral network, we can reduce noise peaks and separate the b- and y-ion mass ladders, and are therefore able to reconstruct the protein sequences.

Chapter 2

MSNovo: De Novo Peptide Sequencing Algorithm

In this chapter, we present a new approach to de novo peptide sequencing, called MSNovo, which has the following advanced features. (1) It works on data generated by both LCQ and LTQ mass spectrometers and interprets singly, doubly and triply charged ions. (2) It integrates a new probabilistic scoring function with a mass array-based dynamic programming algorithm. The simplicity of the scoring function, with only 6-10 parameters to be trained, avoids the problem of overfitting and allows MSNovo to be adapted easily to other machines and data sets. The mass array data structure explicitly encodes all possible peptides and allows the dynamic programming algorithm to find the best peptide. (3) Compared to existing programs, MSNovo predicts peptides as well as sequence tags with higher accuracy, which is important for applications that search protein databases using de novo sequencing results. More specifically, we show that MSNovo outperforms other programs on various ESI ion trap data. We also show that the performance of MSNovo improves significantly on high-resolution data.
Supplementary materials, executable files and datasets can be found at http://msms.cmb.usc.edu/supplementary/msnovo.

2.1 Introduction

In this chapter, we present a new de novo sequencing algorithm that outperforms the existing ones and is novel in the following ways.

(1) We use a novel scoring function based on the probabilistic distributions of the match tolerance, defined as the distance between a mass peak and a hypothetical ion, and the normalized ranking of peak intensities, defined as the ratio of the rank of the intensity in descending order over the total number of peaks. We have shown that the match tolerance follows a normal distribution centered at around 0, and that the normalized ranking of peak intensities follows an exponential distribution. Our contribution is to use these two distributions to calculate the score for the match of a peak with an ion: the likelihood ratio of the probability that the peak corresponds to this ion over the probability that the peak is a noise peak. The concept of likelihood ratio scores has been used in different ways in other peptide identification programs [15, 23, 16].

(2) We do not use the traditional spectrum graph in our model. Instead, we construct a mass array starting from 0 with a predetermined resolution. The idea of the "mass array" was first used in [12] to determine whether a given mass corresponds to the mass of some peptide or b/y ion in the construction of the spectrum graph. The idea was later applied in [66] to sample random peptides that have the same mass. Our mass array data structure differs from the spectrum graph in that the mass array explicitly encodes all possible peptides with the same mass M, while the spectrum graph encodes only peptides that fit the given spectrum. We explain the differences in detail in later sections.
Empirical results show that our program, MSNovo, performs better than existing de novo tools on multiple data sets, owing to the improved scoring function and the novel mass array-based dynamic programming algorithm.

2.2 Methods

2.2.1 Preprocessing Spectra

The Achilles heel of tandem mass spectra analysis is the amount of noise in the mass spectra. We call the non-interpretable peaks, around 80% of all peaks, 'noise' peaks, and the peaks matched to b-ions and y-ions 'signal' peaks. A large number of noise peaks causes false positive predictions. Removing noise peaks improves the signal-to-noise ratio and therefore increases the prediction accuracy. As a first step, we remove noise peaks using a sliding window-based noise removal method. We choose a window of size 100 u and select the top 6 peaks from each window. Another important preprocessing step is to resolve the isotopic peaks in the spectrum. The presence of isotopic peaks leads to ambiguity in distinguishing amino acids that are one Dalton apart, such as I/L (113), N (114) and D (115).

2.2.2 Scoring Function

The core of any peptide identification method is the scoring function. In database search, a scoring function calculates the similarity between an experimental spectrum and a hypothetical spectrum generated by the in silico digestion of protein sequences from a database. In de novo sequencing, an appropriate scoring function is used (1) to decide whether a pair of peaks is indeed part of a sequence ladder and (2) to calculate the similarity between the experimental spectrum and the hypothetical spectrum generated from the (partial) de novo peptide sequence, i.e. the output.

Distributions for match tolerance and peak intensity

In our model, there are two main parameters for calculating a good scoring function: the match tolerance and the peak intensity.
Given a mass peak $p_i = (m_i, I_i)$ from an experimental spectrum and a peak (or an ion) $q_j = (m'_j, I'_j)$ from another spectrum (or a hypothetical spectrum), we need to determine whether there is a match between the two peaks $p_i$ and $q_j$. Intuitively, a match occurs when the difference in mass-to-charge ratio between the two peaks, i.e. the match tolerance, is small and when the intensity ranks of the two peaks are similar. The match tolerance and the normalized ranking of peak intensities are the two major novel features used in our scoring function, as they determine whether or not a peak in the spectrum matches a hypothetical ion of a peptide.

We use ISB769 as the training data set to obtain the probabilistic distributions of the match tolerance and the peak intensities of b- and y-ions. We observe that the match tolerance follows a normal distribution and that the normalized ranking of the peak intensities follows an exponential distribution. The distributions of b- and y-ion intensities are different. For the LTQ data, we also observe a large fraction (about 60%) of the neutral loss ions, i.e. the b-H2O and b-NH3 ions. Similarly, we found that the match tolerance and the normalized ranking of the peak intensities of the neutral loss ions also follow the normal distribution and the exponential distribution respectively. Figure 2.1 and Figure 2.2 show the distribution of the match tolerance and that of the peak intensities of the b-H2O ions in the LTQ spectra. Other ions in the LTQ spectra follow similar distributions.

Scoring a match

We use the above distributions to create a new and effective scoring function for our de novo sequencing program. For each ion in the hypothetical spectrum, we first determine a matching peak in the experimental spectrum. If there is no matching peak, we say the hypothetical ion is missing. If there is a match, then the peak could be a signal peak or a noise peak.
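The two features can be computed as in the following minimal sketch. For a hypothetical ion it finds the closest experimental peak and returns the match tolerance together with the normalized intensity rank (rank in descending order of intensity divided by the number of peaks). The function name `match_features`, the signed convention for the tolerance (peak minus ion), and the 0.5 Da cut-off beyond which the ion is treated as missing are assumptions for illustration; the text does not specify them.

```python
def match_features(ion_mz, peaks, max_tol=0.5):
    """Return (T, I) for the peak closest to ion_mz, or None if the ion is missing.

    peaks   -- list of (m/z, intensity) pairs from the experimental spectrum
    max_tol -- hypothetical cut-off in Da beyond which no match is reported
    T       -- signed match tolerance: peak m/z minus ion m/z
    I       -- normalized intensity rank in (0, 1]; the strongest peak has rank 1
    """
    if not peaks:
        return None
    # Rank peak indices by descending intensity (rank 1 = most intense).
    order = sorted(range(len(peaks)), key=lambda k: peaks[k][1], reverse=True)
    rank_of = {k: r for r, k in enumerate(order, start=1)}
    # The matching peak is the one closest in m/z to the hypothetical ion.
    best = min(range(len(peaks)), key=lambda k: abs(peaks[k][0] - ion_mz))
    tolerance = peaks[best][0] - ion_mz
    if abs(tolerance) > max_tol:
        return None
    return tolerance, rank_of[best] / len(peaks)


spectrum = [(100.0, 10.0), (200.1, 50.0), (300.0, 30.0), (400.0, 5.0)]
print(match_features(200.0, spectrum))  # strongest peak, about 0.1 Da away
print(match_features(250.0, spectrum))  # no peak within the cut-off
```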
For now, we consider two kinds of signals in our algorithm: b-ions and y-ions. The algorithm can easily be extended to include other ions such as a-ions and neutral loss ions. We denote the match tolerance and the normalized ranking of peak intensity by $T$ and $I$ respectively. Note that the match tolerance is the difference between the masses of a hypothetical ion and a mass peak, while the normalized ranking of peak intensity is determined by the intensity of the peak that matches the hypothetical ion. We denote the probability that the matching peak is a signal given the match tolerance $T$ and the normalized intensity rank $I$ by $P(\mathrm{signal} \mid T, I)$, and the probability that the matching peak is a noise peak given $T$ and $I$ by $P(\mathrm{noise} \mid T, I)$. Our score is then defined to be the likelihood ratio of these two probabilities:

$$\mathrm{matchScore} = \frac{P(\mathrm{signal} \mid T, I)}{P(\mathrm{noise} \mid T, I)} = \frac{P(T, I \mid \mathrm{signal})}{P(T, I \mid \mathrm{noise})} \cdot \frac{P(\mathrm{signal})}{P(\mathrm{noise})} \propto \frac{P(T, I \mid \mathrm{signal})}{P(T, I \mid \mathrm{noise})} \tag{2.1}$$

Figure 2.1: The match tolerance of b-H2O ions follows a normal distribution.

Figure 2.2: The normalized ranking of intensities of b-H2O ions follows an exponential distribution.

For the simplest case, we consider only two kinds of signals (b- and y-ions) in our algorithm, so that $P(T, I \mid \mathrm{signal})$ can be further divided into $P(T, I \mid b)$ and $P(T, I \mid y)$ according to the types of the ions. Assuming $T$ and $I$ are independent, we have

$$\mathrm{score}_b = \frac{P(T, I \mid b)}{P(T, I \mid \mathrm{noise})} = \frac{P(T \mid b)\, P(I \mid b)}{P(T, I \mid \mathrm{noise})} \tag{2.2}$$

$$\mathrm{score}_y = \frac{P(T, I \mid y)}{P(T, I \mid \mathrm{noise})} = \frac{P(T \mid y)\, P(I \mid y)}{P(T, I \mid \mathrm{noise})} \tag{2.3}$$

where $P(T \mid b) \sim N(\mu, \sigma)$, $P(T \mid y) \sim N(\mu, \sigma)$, $P(I \mid b) \sim \exp(\lambda_b)$ and $P(I \mid y) \sim \exp(\lambda_y)$. Here, $\mu = 0$ if the machine is calibrated, $\sigma$ is the standard deviation of the tolerance, and $\lambda_b$ and $\lambda_y$ are the parameters of the exponential distributions of the normalized ranking of intensities of b- and y-ions respectively. All these parameters are determined from the training dataset. Table 2.1 lists all the parameters for singly-charged, doubly-charged and triply-charged spectra.
We also simplify $P(T, I \mid \mathrm{noise}) = P(\mathrm{noise})$, a constant that can be determined by the average of the lowest scores of signal peaks among all of the spectra in the training data set. We will discuss how to determine its approximate value later in the parameter tuning section.

Table 2.1: Parameters used in MSNovo

Parameter            +1 Spectra   +2 Spectra   +3 Spectra
$P(b)$               0.77         0.61         0.43
$P(y)$               0.70         0.63         0.44
$P(b^{2+})$          -            -            0.46
$P(y^{2+})$          -            -            0.50
$\mu$                0            0            0
$\sigma$             0.10         0.10         0.10
$\lambda_b$          5.11         3.61         2.87
$\lambda_y$          5.38         5.03         3.39
$\lambda_{b^{2+}}$   -            -            2.93
$\lambda_{y^{2+}}$   -            -            3.14
$P(\mathrm{noise})$  0.002        0.002        0.003
resolution           0.1 m/z

Likelihood ratio score

We denote the b- and y-ions in the hypothetical spectrum by $b_1, \ldots, b_n$ and $y_1, \ldots, y_n$ respectively, where $n$ is the number of b- or y-ions in the hypothetical spectrum. Usually, $n$ is equal to the number of peptide bonds and $n + 1$ is equal to the length of the peptide. For ion $b_i$, we define the probability of an observation $(T_i, I_i)$ to be

$$SB_i = P(T_i, I_i \mid b_i) \tag{2.4}$$

where $T_i$ and $I_i$ represent the match tolerance and the normalized ranking of intensity of a mass peak from the spectrum $S$ that matches the $i$-th b-ion. Similarly, the probability of observing $T_i$ and $I_i$ for an ion $y_i$ is

$$SY_i = P(T_i, I_i \mid y_i) \tag{2.5}$$

When the peak corresponding to ion $b_i$ or ion $y_i$ is missing, we define $SB_i = P(\mathrm{missing}) = 1 - P(b_i)$ or $SY_i = P(\mathrm{missing}) = 1 - P(y_i)$. The log likelihood ratios of $SB_i$ and $SY_i$ are defined as follows:

$$LSB_i = \begin{cases} \log \dfrac{SB_i}{P(\mathrm{noise})}, & \text{when the ion is present} \\ \log\bigl(1 - P(b_i)\bigr), & \text{when the ion is absent} \end{cases} \tag{2.6}$$

$$LSY_i = \begin{cases} \log \dfrac{SY_i}{P(\mathrm{noise})}, & \text{when the ion is present} \\ \log\bigl(1 - P(y_i)\bigr), & \text{when the ion is absent} \end{cases} \tag{2.7}$$

In the case that both the b-ion and the y-ion are missing, the score is $\log(1 - P(b)) + \log(1 - P(y))$.
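As a minimal sketch of Eqs. 2.2, 2.4 and 2.6, the following snippet computes the log likelihood ratio score of a single b-ion using the +2 parameters from Table 2.1. It rests on several assumptions that are not stated in the text: the logarithm is natural, $\lambda_b$ is treated as the rate of the exponential density, and the function names are hypothetical.

```python
import math

# Parameters for doubly charged spectra, taken from Table 2.1.
MU, SIGMA = 0.0, 0.10
P_NOISE = 0.002
P_B, LAMBDA_B = 0.61, 3.61


def normal_pdf(t, mu=MU, sigma=SIGMA):
    """Density of the match tolerance T under a normal distribution."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(rank, lam):
    """Density of the normalized rank I; lam is assumed to be a rate."""
    return lam * math.exp(-lam * rank)


def lsb(match):
    """Log likelihood ratio score of one b-ion (Eq. 2.6).

    match -- (T, I) for the matched peak, or None when the ion is missing
    """
    if match is None:
        return math.log(1.0 - P_B)
    tolerance, rank = match
    # T and I are assumed independent, as in Eq. 2.2.
    sb = normal_pdf(tolerance) * exponential_pdf(rank, LAMBDA_B)
    return math.log(sb / P_NOISE)


print(lsb((0.0, 0.05)))  # close, intense peak: large positive score
print(lsb((0.3, 0.90)))  # distant, weak peak: much smaller score
print(lsb(None))         # missing ion: negative score
```

The y-ion score follows the same pattern with $P(y)$ and $\lambda_y$ substituted.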
The probability of observing a spectrum $S$ is the probability of all peaks in this spectrum, defined as follows:

$$\Pr(S \mid b_1, \ldots, b_n, y_1, \ldots, y_n) = \prod_{i=1}^{n} SB_i\, SY_i \cdot P(\mathrm{noise})^{m - m_1} \tag{2.8}$$

where $m$ is the number of peaks in spectrum $S$ and $m_1$ is the number of matched peaks. We consider the other $m - m_1$ peaks that have no matches as noise peaks, which gives the term $P(\mathrm{noise})^{m - m_1}$ in Equation 2.8. Our goal is to maximize the log likelihood ratio of the probability that $S$ is generated from peptide $P$ over the probability that $S$ is generated by random noise (i.e. all $m$ peaks in the spectrum are noise peaks, so the probability of this situation is $P(\mathrm{noise})^{m}$). Hence, we need to maximize the final score:

$$\begin{aligned} \log \frac{\Pr(S \mid P)}{\Pr(S \mid \mathrm{noise})} &= \log \frac{\Pr(S \mid b_1, \ldots, b_n, y_1, \ldots, y_n)}{P(\mathrm{noise})^{m}} \\ &= \log \frac{\prod_{i=1}^{n} SB_i\, SY_i \cdot P(\mathrm{noise})^{m - m_1}}{P(\mathrm{noise})^{m_1}\, P(\mathrm{noise})^{m - m_1}} \\ &= \log \frac{\prod_{i=1}^{n} SB_i\, SY_i}{P(\mathrm{noise})^{m_1}} \\ &= \sum_{i=1}^{n} \left( LSB_i + LSY_i \right) \end{aligned} \tag{2.9}$$

2.2.3 Dynamic Programming

One of the unique features of our algorithm is that we use a mass array-based dynamic programming algorithm instead of using mass peaks directly in a spectrum graph, as previous dynamic programming algorithms do. The mass array differs in principle from the spectrum graph in that the mass array data structure explicitly encodes all possible peptides that have the given mass $M$ and uses the spectrum $S$ as the observation to find the best peptide, while the spectrum graph is constructed directly from the peaks in $S$ and finds the peptide that best fits $S$.

Construction of the mass array

Given $M$ and a pre-defined resolution $e$, the mass array can be constructed in the following two steps. Note that $M$ is defined as the singly charged b-ion mass of the whole peptide, about 19 Daltons (the mass of a water molecule plus one proton) less than the doubly charged precursor ion mass. First, we construct an array with $M/e + 1$ indices, running from 0 to $M/e$.
One can view our mass array data structure as a graph whose vertices are the elements of the mass array, in the same order as they occur in the array. For example, given $M = 1{,}000$ Dalton and $e = 0.1$ Dalton, we can construct a mass array with $1{,}000/0.1 + 1 = 10{,}001$ elements, running from 0.0 to 1,000.0. From the graph point of view, there are 10,001 vertices in this graph. However, for clarity of presentation, we still refer to an index or a vertex by its actual mass throughout this chapter. Second, for each pair of vertices $i$ and $j$, $i < j$, we define a directed edge $(i, j)$ if and only if (1) $i$ is equal to the mass of some b-ion or $i = 0$, and (2) $j - i$ is equal to the mass of one of the 20 amino acids. These two conditions immediately restrict edges to indices whose masses are equal to the masses of some b-ions. Note that a b-ion mass is equal to the sum of the residue masses of all amino acids in the sequence plus a proton (1 Dalton). Therefore, we can prove that there is a one-to-one mapping between a peptide whose mass is equal to $M$ and a path from 1 to $M$. De novo sequencing is then equivalent to finding the path whose corresponding peptide has the best score as defined in Eq. 2.9. Note that we do not need to explicitly construct the edges in the mass array; they are constructed implicitly in the dynamic programming algorithm.

The mass array data structure differs from both the spectrum graph-based methods and the Markov chain method (used in NovoHMM) in that (1) the mass array explicitly encodes all possible peptides with the same mass $M$, while the others encode only peptides that fit the given spectrum, and (2) the mass array defines both vertices and edges using exact (or theoretical) masses, while the others define vertices and edges using observed masses in the spectrum.
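The construction can be sketched in a few lines. The snippet below builds the array for the example above ($M = 1{,}000$, $e = 0.1$) and marks which indices can be reached from the proton mass by adding residue masses, i.e. which vertices can be b-ions. The names `to_index` and `build_mass_array`, the choice of monoisotopic residue masses, and the rounding of each residue mass to the nearest array index are assumptions made for illustration; the edges are never stored, in keeping with the implicit construction described above.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def to_index(mass, e):
    """Map a mass in Da to the nearest index of an array with resolution e."""
    return round(mass / e)


def build_mass_array(M, e):
    """Return a boolean list with M/e + 1 entries.

    Entry j is True when index j can be reached from the proton mass (1 Da)
    by adding residue masses, i.e. when it can be the mass of some b-ion.
    """
    size = to_index(M, e) + 1
    steps = sorted({to_index(m, e) for m in RESIDUE_MASS.values()})
    is_b_ion = [False] * size
    start = to_index(1.0, e)  # a b-ion is the residue sum plus one proton
    if start < size:
        is_b_ion[start] = True
    for j in range(start + 1, size):
        # Implicit edge (i, j): j - i equals an amino acid mass, i is a b-ion.
        is_b_ion[j] = any(j - s >= start and is_b_ion[j - s] for s in steps)
    return is_b_ion


array = build_mass_array(1000.0, 0.1)
print(len(array))  # 10001
```

Rounding every residue to a 0.1 Da grid is a simplification of this sketch; it is not meant to reproduce the mass accuracy of the actual program.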
Using exact masses avoids a serious problem that occurs frequently in other methods: a small mass error, typically up to 0.5 Dalton for a vertex and 1.0 Dalton for an edge in a spectrum graph, accumulates and usually leads to a much larger mass error for a path corresponding to a peptide. This kind of error causes false predictions because the masses of several amino acids are within 1 Dalton of each other.

Scoring a path in the mass array

Let $LSB[m]$ be the likelihood ratio score defined in Eq. 2.6 for a b-ion at mass $m$ in the mass array and let $LSY[m]$ be the likelihood ratio score defined in Eq. 2.7 for a y-ion at mass $m$. For a mass $m$, or a position $m$ in the mass array, we can calculate the score $LSB[m]$ using the normal and exponential distributions described before, where the match tolerance and the normalized ranking of the matching peak are derived from the mass peak that is closest to $m$ in spectrum $S$. Similarly, we can calculate the score $LSY[m]$ by finding the mass peak that is closest to $m$ in spectrum $S$. The goal of the algorithm is to identify a path $P = \{P_0 = 1, P_1, P_2, \ldots, P_n, P_{n+1} = M\}$ in the mass array that maximizes

$$\sum_{i=1}^{n} \left( LSB[P_i] + LSY[M - P_i + 19] \right),$$

where 19 accounts for an extra water molecule plus an extra proton in the y-ion.

The dynamic programming algorithm

Define $\mathrm{Score}(i)$ to be the maximum score among all paths from 1 to $i$. The recursion of the function $\mathrm{Score}()$ is

$$\mathrm{Score}(1) = 0 \tag{2.10}$$

$$\mathrm{Score}(j) = \max_{j - i = aa} \left( \mathrm{Score}(i) + LSB[j] + LSY[M - j + 19] \right) \tag{2.11}$$

where $aa$ stands for the mass of one of the 20 amino acids. For each mass $m$ in the mass array, its complementary mass $M - m + 19$ should be forbidden from appearing in the same path, because the path would then interpret $m$ as both a b-ion and a y-ion, a very rare event in actual peptide sequences. After we find the best path along the mass array, we check whether the path contains forbidden pairs. If there is no forbidden pair in the path, we trace back to reconstruct the best peptide.
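The recursion in Eqs. 2.10 and 2.11 can be sketched on a toy example. To keep the code short it uses a resolution of 1 Da, three amino acids with nominal masses, and a made-up score table; these choices, the function names, and the decision to give the terminal vertex $M$ no ion score (so that the total matches the objective summed over $P_1, \ldots, P_n$) are assumptions for illustration, not the actual MSNovo implementation.

```python
NEG_INF = float("-inf")


def best_peptide(M, residues, lsb, lsy):
    """Return (peptide, score) for the best path from mass 1 to mass M.

    M        -- integer b-ion mass of the whole peptide (toy 1 Da resolution)
    residues -- dict mapping an amino acid letter to its integer residue mass
    lsb, lsy -- functions mapping a mass to a log likelihood ratio score
    """
    score = [NEG_INF] * (M + 1)
    back = [None] * (M + 1)  # (previous mass, amino acid letter)
    score[1] = 0.0           # Eq. 2.10
    for j in range(2, M + 1):
        # Assumption: the terminal vertex M contributes no ion score.
        gain = 0.0 if j == M else lsb(j) + lsy(M - j + 19)
        for letter, mass in residues.items():
            i = j - mass
            if i >= 1 and score[i] > NEG_INF and score[i] + gain > score[j]:
                score[j] = score[i] + gain  # Eq. 2.11
                back[j] = (i, letter)
    if score[M] == NEG_INF:
        return None, NEG_INF
    letters, j = [], M
    while j != 1:  # trace back from M to the starting vertex
        j, letter = back[j]
        letters.append(letter)
    return "".join(reversed(letters)), score[M]


def forbidden_pairs(peptide, residues, M):
    """List b-ion masses m on the path whose complement M - m + 19 is also on it."""
    masses, m = set(), 1
    for letter in peptide[:-1]:
        m += residues[letter]
        masses.add(m)
    return sorted(m for m in masses if m <= M - m + 19 and (M - m + 19) in masses)


# Toy data: peptide GAS, whose b-ions are 58 and 129 and y-ions 106 and 177.
toy = {"G": 57, "A": 71, "S": 87}
b_peaks, y_peaks = {58, 129}, {106, 177}
toy_lsb = lambda m: 5.0 if m in b_peaks else -1.0  # hypothetical scores
toy_lsy = lambda m: 5.0 if m in y_peaks else -1.0

print(best_peptide(216, toy, toy_lsb, toy_lsy))  # ('GAS', 20.0)
print(forbidden_pairs("GAS", toy, 216))          # []
```

Because the edges are generated on the fly from the residue table, the running time is proportional to the array length times the number of amino acids.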
If the best path does contain forbidden pairs, we need to run the dynamic programming again, prohibiting each of these forbidden pairs. For example, if the best path contains a forbidden pair $m$ and $M - m + 19$, then we need to run the dynamic programming twice, excluding $m$ from the path in one run and $M - m + 19$ in the other. The number of runs therefore grows as $2^k$ if we encounter $k$ forbidden pairs in the best path. We have observed that usually $k = 0$ in practice, so this process does not usually affect the speed of our program.

Dealing with parent ion mass errors

The accuracy of the parent ion mass is critical to the mass array data structure and the dynamic programming algorithm because it is used to calculate the masses of the y-ions. One problem in the spectrum graph-based methods is that the precursor mass error tends to propagate through the steps of the dynamic programming algorithm, leading to wrong solutions. In our algorithm, we allow a flexible range ($\pm 2.0$ u) for the precursor ion mass in LCQ data. For this correction interval $[M - 2.0, M + 2.0]$ and with a precision of 0.1 u, we execute 41 runs of the dynamic programming algorithm, each with one parent ion mass from $M - 2.0, M - 1.9, \ldots, M + 2.0$. After each run, we trace back from the precursor mass used in that run to identify the b-ion ladder and reconstruct the best peptide sequence. We then sort the 41 candidates to find the best peptide.

2.2.4 Triply Charged Spectra

For charge +3 spectra, the $b^{2+}$ and $y^{2+}$ ions become dominant. On average, 46% of the $b^{2+}$ ions and 51% of the $y^{2+}$ ions are present, which is higher than the corresponding fractions of b-ions (42%) and y-ions (44%). We need to consider these ions in the scoring function as well as in the dynamic programming algorithm. Let $LSB2[m]$ be the likelihood ratio score for a $b^{2+}$-ion with mass $m$ in the mass array and $LSY2[m]$ be the likelihood ratio score of a $y^{2+}$-ion with mass $m$.
Using the parameters shown in Table 2.1, we can calculate the score for the $b_i^{2+}$-ion and the score for the $y_i^{2+}$-ion using the following equations:

$$LSB2_i = \log \frac{P(T_i, I_i \mid b_i^{2+})}{P(\mathrm{noise})} \tag{2.12}$$

$$LSY2_i = \log \frac{P(T_i, I_i \mid y_i^{2+})}{P(\mathrm{noise})} \tag{2.13}$$

If the mass of a b-ion is $m$, its corresponding $b^{2+}$-ion mass is $\frac{m+1}{2}$ and its corresponding $y^{2+}$-ion mass is $\frac{M - m + 20}{2}$. We then integrate them into the dynamic programming algorithm using the following recursion:

$$\mathrm{Score}[j] = \max_{j - i = aa} \left( \mathrm{Score}[i] + LSB[j] + LSY[M - j + 19] + LSB2\!\left[\tfrac{j + 1}{2}\right] + LSY2\!\left[\tfrac{M - j + 20}{2}\right] \right) \tag{2.14}$$

2.2.5 LTQ-FT Spectra

The mass resolution of the LTQ-FT (linear ion trap-Fourier transform mass spectrometer) is close to one part per million (ppm). The LTQ data we used in this chapter has a resolution better than 0.1 Dalton in measuring the parent ion mass. Thus, we construct a mass array with a high resolution of 0.01 Da.

There are two modes of acquiring LTQ-FT spectra: (1) the profile mode and (2) the centroid mode. In the profile mode, the mass analyzer scans at a resolution of 0.1 Da and reads an intensity value. In the centroid mode, the mass analyzer reports only the centroid of the profile data. The LTQ-FT data [27] we used was captured in the profile mode, which cannot be used directly by de novo sequencing programs. We converted the profile mode spectra into centroid data using a simple weighted averaging scheme: for each bin of width 1 Da, we calculated the weighted average of the m/z values, using the intensity of each profile peak as the weight. This preprocessing step generates an MS/MS spectrum similar to that from ordinary ion trap machines. We then run MSNovo on this weighted data.

2.3 Results

2.3.1 Data sets

We obtained MS/MS spectra data from four sources: the ISB dataset [32], OPD (Open Proteomics Database) [51], PeptideAtlas [44] and a recent LTQ dataset of Godoy et al. [27].
We use five datasets, ISB769, ISB646 (charge +3), OPD280, HUPO513 and LTQ600, from the above sources to compare the performance of MSNovo with other de novo sequencing programs. The mean and the standard deviation of the precursor mass errors and fragment mass errors for each dataset are listed in Table 2.2. For ISB646 (triply charged spectra) we consider the major fragments b-, y-, $b^{2+}$- and $y^{2+}$-ions. For the other datasets, we consider only the b- and y-ions.

ISB769
We obtained a tandem mass spectra data set from ISB (the Institute for Systems Biology) [32]. The spectra came from twenty-two LC/MS/MS runs of two protein mixtures consisting of 18 purified proteins of different physicochemical properties with different relative molar amounts and modifications. The data set was analyzed by SEQUEST followed by manual validation. We selected 769 [M+2H]$^{2+}$ tryptic spectra, referred to as ISB769, which have a SEQUEST Xcorr score of 2.5 or more. In this dataset, on average 47.7% of the b-ions and 51.2% of the y-ions are present, and we use these numbers as indicators of the quality of the spectra.

ISB646 (charge +3)
There are 646 triply charged tryptic spectra identified by SEQUEST in the ISB dataset. We refer to these spectra as the ISB646 dataset. In this dataset, on average only around 30.1% of the b-ions and 34.9% of the y-ions are present.

OPD280
From the Open Proteomics Database (OPD) [51], we obtained 280 doubly charged spectra with Xcorr > 2.5 that were used by PepNovo. This dataset is referred to as OPD280. In this dataset, on average around 47.8% of the b-ions and 53.1% of the y-ions are present.

HUPO513
From PeptideAtlas, we obtained the HUPO PPP (Plasma Proteome Project) protein datasets, contributed by HUPO lab 28 (Pacific Northwest National Lab, USA) [44]. We selected 513 doubly charged tryptic spectra with Xcorr > 2.5 from those generated from human serum. We restrict the length of peptides to be 16 residues or less.
In this dataset, on average about 49% of the b-ions and 62.1% of the y-ions are present.

LTQ600
From an LTQ dataset generated by Matthias Mann's group [27], we selected 600 charge +1/+2 tandem mass spectra and named them LTQ600. The accuracy of the parent ion mass is 2 ppm and the accuracy of the MS/MS spectra is roughly 0.1 Da. We used Mascot's annotations to evaluate the accuracy of our de novo tool. We ranked all of these spectra according to Mascot scores in descending order and chose the first 600 as our dataset. In this dataset, on average about 80.4% of the b-ions and 81.3% of the y-ions are present, and the percentages for b-H2O, b-NH3, y-H2O and y-NH3 ions are 63.7%, 63.7%, 53.7% and 54.4% respectively.

LTQ2448
Usually one run of an LTQ experiment generates around 8000 tandem mass spectra. We took two full runs of LTQ spectra data [27] and used the Mascot annotation (p-value < 0.05) to obtain a set of 2,448 doubly charged LTQ spectra. In this dataset, on average, 61.9% of the b-ions and 71.3% of the y-ions are present. In addition, the percentages of b-H2O, b-NH3, y-H2O and y-NH3 ions are 49.5%, 50.2%, 47% and 47% respectively.

Table 2.2: Means and standard deviations of precursor mass errors and fragment mass errors of each dataset.

| Dataset | Precursor mass error mean | Precursor mass error standard deviation | Fragment mass error mean | Fragment mass error standard deviation |
| ISB769 | -0.7717 | 0.3974 | -0.02048 | 0.1296 |
| ISB646 | -1.384 | 0.3915 | 0.03882 | 0.1862 |
| OPD280 | -0.5701 | 0.3222 | -0.03210 | 0.1377 |
| HUPO513 | -0.7949 | 0.6647 | -0.07674 | 0.1578 |
| LTQ600 | -0.02843 | 0.01275 | -0.01317 | 0.1129 |
| LTQ2448 | -0.1517 | 0.4723 | -0.05700 | 0.1562 |

The parameters for MSNovo are listed in Table 2.1. For each spectrum, MSNovo reports the top 20 peptide candidates from the results of the multiple runs of the dynamic programming algorithm using different precursor ion masses. Because the correct peptide may not always be the top-ranked one, we also seek useful information from the rest of the candidates.
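The precursor mass sweep and the top 20 report can be pictured with the sketch below. It is a minimal sketch under stated assumptions: `run_dp` is a hypothetical callback standing in for one run of the dynamic programming algorithm at a given parent ion mass, and keeping only the best score per peptide across runs is an illustrative choice, not something the text specifies.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[float, str]  # (score, peptide)


def precursor_grid(
    measured_mass: float, half_width: float = 2.0, step: float = 0.1
) -> List[float]:
    """Parent ion masses M - 2.0, M - 1.9, ..., M + 2.0 (41 values by default)."""
    n = round(half_width / step)
    return [round(measured_mass + i * step, 6) for i in range(-n, n + 1)]


def top_candidates(
    measured_mass: float,
    run_dp: Callable[[float], List[Candidate]],  # hypothetical: one DP run
    keep: int = 20,
) -> List[Candidate]:
    """Pool candidates from every run and return the `keep` best-scoring ones."""
    best: Dict[str, float] = {}
    for mass in precursor_grid(measured_mass):
        for score, peptide in run_dp(mass):
            # Assumption: a peptide found in several runs keeps its best score.
            if score > best.get(peptide, float("-inf")):
                best[peptide] = score
    return heapq.nlargest(keep, ((s, p) for p, s in best.items()))


if __name__ == "__main__":
    def toy_dp(mass: float) -> List[Candidate]:
        # Toy scoring: "AAA" scores best when the trial mass is near 1000.3.
        return [(10.0 - abs(mass - 1000.3), "AAA"), (5.0, "BBB")]

    print(len(precursor_grid(1000.0)))
    print(top_candidates(1000.0, toy_dp, keep=2))
```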
2.3.2 Effect of different peptide lengths

First we compare the performance of MSNovo with PepNovo v1.01 and NovoHMM on spectra generated from peptides of different lengths. We define a prediction to be correct when every predicted residue is correct and occupies the same position as in the true peptide. That is, the prediction need not contain all the residues of the original peptide, but what it does contain must be accurate. The accuracy (pep%) is defined as the percentage of correctly predicted peptides. We ordered all 1,639 SEQUEST-annotated spectra from the ISB dataset according to the lengths of the peptides, from 7 to 29, and compared the accuracy of the three programs in each group. The results are shown in Figure 2.3.

We observe from Figure 2.3 that the accuracy of MSNovo is the highest for peptides shorter than 19 amino acids, and that as the length of the peptides increases beyond 23, PepNovo begins to perform better. NovoHMM does not perform well for long peptides because it tries to predict the whole length of the peptide. In general, all programs perform substantially worse on spectra of long peptides because the quality of these spectra is not as good as that of spectra of shorter peptides. For example, for the spectrum from the peptide YGDFGTAAQQPDGLAVVGVFLK, MSNovo predicts [711.1]AKKPDNGAVVRIFK, of which 9 amino acids are correct. PepNovo predicts QPD, starting at 910.52, which is considered a correct prediction by the above definition. NovoHMM predicts EEFRWYKRFKYGRFIK, of which 6 amino acids are correct.

Table 2.3 shows the frequency of correctly predicted subsequences (or sequence tags) of length at least $x$, the same criterion as used by PepNovo and NovoHMM. MSNovo is the best in every category among all programs.

Figure 2.3: Comparison of the accuracy of MSNovo against that of PepNovo and NovoHMM as a function of peptide length

Table 2.3: Percentage of correctly predicted sequence tags of length at least x.
Dataset used is OPD280.

| Algorithm | x=3 | x=4 | x=5 | x=6 | x=7 | x=8 | x=9 | x=10 |
| MSNovo | 0.957 | 0.918 | 0.829 | 0.735 | 0.668 | 0.564 | 0.442 | 0.314 |
| NovoHMM | 0.893 | 0.796 | 0.711 | 0.589 | 0.486 | 0.404 | 0.293 | 0.193 |
| PepNovo | 0.946 | 0.871 | 0.800 | 0.654 | 0.525 | 0.411 | 0.271 | 0.193 |
| Sherenga | 0.821 | 0.711 | 0.564 | 0.364 | 0.279 | 0.207 | 0.121 | 0.071 |
| PEAKS | 0.889 | 0.814 | 0.689 | 0.575 | 0.482 | 0.371 | 0.275 | 0.179 |
| Lutefisk | 0.661 | 0.521 | 0.425 | 0.339 | 0.268 | 0.200 | 0.104 | 0.057 |

2.3.3 Charge +1 and +2 spectra

We compared the precision and the recall of the three programs, MSNovo, PepNovo and NovoHMM, on charge +1/+2 MS/MS spectra from four datasets: ISB769, OPD280, HUPO513 and LTQ600. The results are shown in Table 2.4. The precision is defined as the ratio of correctly predicted residues to the total number of predicted residues, and the recall is defined as the ratio of correctly predicted residues to the total number of residues in the true peptides. For a fair comparison, we also include the average predicted length of the peptides.

For the three ion trap datasets (ISB, OPD and PeptideAtlas), the average predicted length of MSNovo is similar to that of the other two programs, but the precision and the recall of MSNovo are better. For LTQ data, the $b_1$ and $b_2$ ions are usually missing. In such a case we cannot predict the first two amino acids of the peptide, so we leave a mass gap at the beginning of the peptide. Hence the average predicted length for the LTQ600 data is 2 amino acids less than the average length of the peptides. The prediction accuracy for the LTQ data is substantially higher than that for the ion trap data. Note that our comparison with PepNovo and NovoHMM on this data is preliminary, as these tools were probably not designed to exploit the low errors in precursor ion masses. NovoHMM works poorly on the LTQ data because it always tries to predict the whole peptide. The results clearly show that MSNovo performs better for data with better quality.
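The precision and recall defined above can be computed as in the following sketch. It is a simplified illustration, not the evaluation code used for Table 2.4: it assumes a predicted residue is correct when the same letter starts at the same prefix mass in the true peptide (within a tolerance `tol`), it uses standard monoisotopic residue masses, and it makes no special allowance for isobaric residues such as I and L. The names `placed_residues` and `precision_recall` are hypothetical.

```python
from typing import Dict, List, Tuple

# Standard monoisotopic residue masses in Da.
MONO: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def placed_residues(sequence: str, offset: float = 0.0) -> List[Tuple[str, float]]:
    """Pair each residue with the prefix mass at which it starts."""
    placed, mass = [], offset
    for aa in sequence:
        placed.append((aa, mass))
        mass += MONO[aa]
    return placed


def precision_recall(
    true_seq: str, pred_seq: str, pred_offset: float = 0.0, tol: float = 0.5
) -> Tuple[float, float]:
    """Return (precision, recall) over residues for one prediction.

    `pred_offset` models a leading mass gap such as [711.1] in the text.
    """
    truth = placed_residues(true_seq)
    pred = placed_residues(pred_seq, pred_offset)
    correct = sum(
        1
        for aa, mass in pred
        if any(aa == t_aa and abs(mass - t_mass) <= tol for t_aa, t_mass in truth)
    )
    return correct / len(pred), correct / len(truth)


if __name__ == "__main__":
    print(precision_recall("FAAYLER", "FAAYLRK"))
    print(precision_recall("YGDFGTAAQQPDGLAVVGVFLK", "QPD", pred_offset=910.52))
```

Under these assumptions, the prediction FAAYLRK for the peptide FAAYLER has five correctly placed residues out of seven, and the short tag QPD starting at 910.52 has full precision but low recall.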
Here we give an example comparing MSNovo with PepNovo and NovoHMM on an MS/MS spectrum of FAAYLER, which MSNovo predicted correctly. The length of this peptide is 7, so there are 6 b-ions and 6 y-ions. We use the number of matched b- and y-ion peaks to represent the prediction accuracy of the three programs. The result is shown in Figure 2.4. The prediction given by PepNovo is FAAYLRK. The b-ion series is correct except for the $b_6$ peak, but the y-ion series is shifted by one Dalton because the molecular weight of E is 129 Da, one Dalton more than the molecular weight of K (128 Da). Peptide FAAYLRK has only one y-ion match with the spectrum, plus a random match with a noise peak. The prediction given by NovoHMM is AFAYKAAK, of which only the $b_2$, $b_3$ and $b_4$ ions are correctly predicted, while the other b-ions, $b_5$, $b_6$ and $b_7$, match the isotopic peaks of the actual $y_4$, $y_5$ and $y_6$ ions respectively.

Figure 2.4: Comparison of MSNovo with PepNovo and NovoHMM using a spectrum of peptide FAAYLER.

2.3.4 Triply charged spectra

We compare the performance of MSNovo and NovoHMM on charge +3 spectra using ISB646. The results are shown in Table 2.4. MSNovo predicts 76 correct peptides compared to 0 by NovoHMM. If we consider the top 20 predicted peptides for each spectrum, MSNovo had 99 correct predictions. Table 2.4 clearly shows that MSNovo outperforms NovoHMM in almost every category.
Table 2.4: Comparison of MSNovo with PepNovo & NovoHMM on various datasets

| Dataset | Program | Num.Pep (a) | Avg.Len (b) | Correct Pep (c) | Accuracy (d) | Avg.Pre.Len (e) | Precision (f) | Recall (g) |
| ISB769 | MSNovo | 769 | 11.7 | 373 | 48.50% | 10.4 | 86.85% | 77.36% |
| ISB769 | MSNovo* | 769 | 11.7 | 503 | 65.41% | 10.2 | 91.96% | 80.59% |
| ISB769 | PepNovo | 769 | 11.7 | 252 | 32.77% | 10.6 | 82.67% | 75.36% |
| ISB769 | NovoHMM | 769 | 11.7 | 269 | 34.98% | 11.3 | 79.21% | 77.13% |
| OPD280 | MSNovo | 280 | 10.5 | 141 | 50.36% | 9.7 | 85.13% | 78.98% |
| OPD280 | MSNovo* | 280 | 10.5 | 181 | 50% | 9.8 | 87.63% | 82.31% |
| OPD280 | PepNovo | 280 | 10.5 | 99 | 35.36% | 9.7 | 81.66% | 75.84% |
| OPD280 | NovoHMM | 280 | 10.5 | 102 | 36.43% | 10.4 | 79.52% | 78.40% |
| HUPO513 | MSNovo | 513 | 14.1 | 106 | 20.66% | 12.1 | 77.61% | 66.58% |
| HUPO513 | MSNovo* | 513 | 14.1 | 192 | 37.43% | 12.0 | 83.58% | 71.34% |
| HUPO513 | PepNovo | 513 | 14.1 | 50 | 9.75% | 12.1 | 70.96% | 61.24% |
| HUPO513 | NovoHMM | 513 | 14.1 | 80 | 15.59% | 13.6 | 73.92% | 71.78% |
| LTQ600 | MSNovo | 600 | 12.3 | 254 | 42.33% | 9.6 | 81.16% | 68.77% |
| LTQ600 | MSNovo* | 600 | 12.3 | 298 | 49.67% | 9.6 | 86.89% | 73.83% |
| ISB646 (charge +3 spectra) | MSNovo | 646 | 17.5 | 79 | 12.23% | 12.3 | 66.85% | 47.08% |
| ISB646 (charge +3 spectra) | MSNovo* | 646 | 17.5 | 99 | 20.28% | 12.7 | 78.27% | 56.87% |
| ISB646 (charge +3 spectra) | NovoHMM | 646 | 17.5 | 0 | 0% | 14.7 | 50.39% | 40.43% |

* Search for top 20 ranking predicted candidates.
(a) Num.Pep is the total number of peptides in the dataset.
(b) Avg.Len is the average length of the peptides.
(c) Correct Pep is the number of correctly predicted peptides.
(d) Accuracy is the percentage of correctly predicted peptides.
(e) Avg.Pre.Len is the average predicted length of the peptides.
(f) Precision is defined as the ratio of correct aa to predicted aa.
(g) Recall is defined as the ratio of correct aa to total aa.

2.3.5 Adding neutral loss ions

For high-quality LTQ MS/MS spectra, neutral loss ions appear with a high probability. We added the information of neutral loss ions to our algorithm in order to improve the prediction accuracy. We ran MSNovo on the LTQ2448 dataset, adding different combinations of neutral loss ions in the dynamic programming algorithm. This LTQ dataset has a total of 2,448 peptides and 30,472 amino acids. Table 2.5 shows the results of adding neutral loss ions.
For each case, we kept the top ranking peptides with p-value < 0.001. We analyzed the tradeoff between the number of correctly predicted peptides and the number of correctly predicted amino acids; adding both a-ions and H2O-loss ions yields the best performance. In fact, considering more ion types adds to the power of selecting high-confidence ions (or peaks). For example, a b-ion, corresponding to an element in the mass array, will likely receive a higher score if it is supported by multiple pieces of evidence: matches with mass peaks found for its b-ion, b-H2O and b-NH3. In this sense, it has the same effect as trimming off low-scoring amino acids. An important reason to focus on high-confidence amino acids rather than the whole length of the peptide is that many applications in fact use these high-confidence amino acids (tags) for database searches.

| Ion combinations | Correct peptides (fraction) | Correct amino acids (fraction) | Predicted amino acids |
| b, y ions | 674 (27.5%) | 16932 (55.6%) | 23917 |
| b, y ions; b, y-H2O ions | 824 (33.7%) | 15183 (49.8%) | 21029 |
| b, y ions; b, y-NH3 ions | 823 (33.6%) | 14170 (46.5%) | 19991 |
| b, y ions; a ions | 889 (36.3%) | 13889 (45.6%) | 19471 |
| b, y ions; b, y-H2O and b, y-NH3 ions | 1028 (42%) | 12506 (41%) | 17443 |
| b, y ions; a ions and b, y-H2O ions | 1037 (42.4%) | 12537 (41.1%) | 17324 |
| b, y ions; a ions and b, y-NH3 ions | 1049 (42.9%) | 10930 (35.9%) | 15499 |
| b, y ions; a ions, b, y-H2O ions and b, y-NH3 ions | 1100 (44.9%) | 10776 (35.4%) | 15594 |

Table 2.5: Effect of adding neutral loss ions into MSNovo

2.3.6 Parameter tuning

In this section, we present the results of tuning the two important parameters used in MSNovo: the match tolerance and the noise density of the mass spectra.

The mass resolution of tandem mass spectra in most ion trap mass spectrometers is about 0.5 u. Hence in MSNovo, we used 0.5 u as the default cutoff threshold for ion trap data. In this study, we ran MSNovo with ten different values of the mass resolution, from 0.1 to 1.0 Da, and calculated the prediction accuracy.
The results shown in Figure 2.5 suggest that setting the cutoff threshold to 0.5 u gave the best performance.

Figure 2.5: MSNovo accuracy, precision and recall at different values of mass resolution

The noise density P(noise) is in fact an empirical parameter in MSNovo because it varies from one experiment to another. Currently, we set the default value to 0.002, the average value of the lowest scores of signal peaks in each spectrum within the training dataset. Basically, we treat it as a constant because the results do not change much within a certain range of P(noise). To explore the best value of P(noise), we ran MSNovo under different values of the noise density; the results are shown in Figure 2.6.

Figure 2.6: MSNovo accuracy, precision and recall at different values of P(noise)

2.3.7 P-values for de novo sequencing results

For a given spectrum, MSNovo outputs the top 20 peptide candidates, each with a likelihood ratio score and a p-value. The p-value of the score is calculated in the following way. We simulated 1,000 random peptides with the same parent ion mass as indicated in the spectrum, and calculated the score for each of them. The scores of these simulated peptides follow a normal distribution. Figure 2.7 shows such a distribution for an LTQ spectrum with parent ion mass 606.37 Da. We used the normal distribution to calculate the p-value for each score.

Figure 2.7: The distribution of scores of 1,000 random peptides for an LTQ spectrum with parent ion mass 606.37 Da.

2.4 Discussion

There are three main advantages of our algorithm. The first is that very few parameters need to be determined in our scoring function, and this makes our program simple yet flexible in dealing with data from different types of mass spectrometers. The second is that all possible peptides are encoded in the mass array data structure, and this allows us to explore the best solution without bias.
The third is that the combination of the dynamic programming algorithm with the powerful scoring function is very efficient and produces accurate solutions.

Our program is quite sensitive to the accuracy of the parent ion mass because we use it to construct the mass array. For ion trap data, we need to re-calibrate the parent ion mass because it is not accurate. However, MSNovo is well suited to high-resolution, high-quality LTQ data that have very accurate measurements of parent ion masses.

In general, MSNovo and other programs work better on singly and doubly charged spectra than on triply charged spectra. The main reasons are the larger number of ion types in triply charged spectra and their more complex fragmentation patterns. Therefore, understanding the complicated collision-induced dissociation (CID) process is critical for de novo sequencing programs. Further studies are needed to integrate current knowledge of the CID process with de novo sequencing algorithms.

Chapter 3

MSPEP: Spectral Alignment Algorithm

With the availability of more and more tandem mass spectra data with post-translational modifications (PTMs) and high-resolution spectra such as OrbiTrap data, it is possible to detect post-translational modifications or mutations by comparing a spectrum with peptide sequences or with other spectra. The traditional de novo sequencing method only detects pre-defined PTMs and its prediction accuracy is not very high. Database search methods have higher accuracy, but their bottleneck is speed. MSPEP is designed to align an MS/MS spectrum with peptides, proteins or another spectrum, to detect known and unknown post-translational modifications or mutations, or to detect overlapping peptides. The software is available online at http://msms.cmb.usc.edu/MSPEP/.
In this chapter we demonstrate a spectral alignment method using our scoring function, which is used in our database search method PepHMM [66] and our de novo sequencing method MSNovo [41] and has been shown to have higher prediction accuracy than existing methods.

3.1 Methods

MSPEP can align an MS/MS spectrum with a peptide as well as two MS/MS spectra with each other. MSPEP is especially useful for biological applications that target the identification of PTMs in a small set of known proteins. In general, a list of post-translational modifications (PTMs) is provided to MSPEP. The program then outputs a peptide sequence with the modified sites and the types of PTMs. The set of proteins, if not known, can be identified through standard database search methods excluding PTMs. The algorithm is described as follows.

If only spectra data are available, MSPEP identifies related spectra through spectrum-spectrum alignment. The input is two spectra and a list of known PTMs, and the output is an optimal alignment, which tells whether the two spectra come from overlapping peptides or from the modified and unmodified versions of the same peptide. All the signal peaks appearing on the optimal alignment path can be used to reconstruct the peptide sequence using a de novo sequencing algorithm.

3.1.1 Spectrum-Peptide Alignment

The spectrum-peptide alignment algorithm aligns a single tandem mass spectrum with a single peptide sequence, with a list of peptide sequences, or more generally with a set of protein sequences. In aligning with a protein sequence, we first virtually digest the protein into peptides, and then align the spectrum with each of these peptides. Generally speaking, we focus on aligning an MS/MS spectrum with a single peptide sequence. The purpose of spectrum-peptide alignment is to find the peptide which generates the MS/MS spectrum.
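The virtual digestion step mentioned above can be sketched as follows for trypsin. This is a minimal sketch, not MSPEP's implementation: it assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P, and the `missed_cleavages` parameter and the function name `tryptic_digest` are hypothetical additions for illustration.

```python
import re
from typing import List


def tryptic_digest(protein: str, missed_cleavages: int = 0) -> List[str]:
    """Split a protein sequence into tryptic peptides.

    Cleavage rule assumed here: cut after K or R, but not before P.
    With missed_cleavages > 0, neighbouring fragments are also joined.
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(pieces)):
        last = min(start + missed_cleavages + 1, len(pieces))
        for end in range(start + 1, last + 1):
            peptides.append("".join(pieces[start:end]))
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("MKRPAAKG"))
    print(tryptic_digest("MKRPAAKG", missed_cleavages=1))
```

Each resulting peptide would then be aligned with the spectrum in turn, and the best-scoring peptide kept as the annotation.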
When we align the spectrum with a list of peptides or protein sequences, the peptide with the highest alignment score is considered the most likely annotation for this spectrum. Figure 3.1 shows an example of spectrum-peptide alignment.

Figure 3.1: An example of spectrum-peptide alignment. A spectrum is aligned with peptide sequence SEAGGKRQ. The alignment path is shown as a red line. A horizontal line in the alignment path denotes a deletion (or PTMs with negative mass) and a vertical line in the alignment path denotes an insertion (or PTMs with positive mass). The black circles along the alignment path denote matched pairs of prefix masses and spectrum peaks.

The input to the spectrum-peptide alignment algorithm includes a tandem mass spectrum, a peptide sequence, a list of possible modifications, and a parameter $k$ which defines the maximum number of modifications allowed in the alignment. We represent a peptide of $n$ residues as $aa_1, \ldots, aa_n$, where $aa_i$ is the $i$-th residue in the peptide sequence. A list of $d$ possible PTMs is also provided as $md_1, \ldots, md_d$. The peptide contains up to $k$ modifications. We denote by $S(i, j, m)$ the optimal score of aligning spectrum mass peaks in the range $[0, m]$ with residues $aa_1, \ldots, aa_j$ with up to $i$ modifications. We use dynamic programming to solve this problem. The recursion of the spectrum-peptide alignment is:

$$S(i, j, m) = \max \left\{ \begin{array}{l} S(i, j-1, m - aa_j) + Score(m) \\ \max_{1 \le l \le d} S(i, j-1, m - aa_j - md_l) + Score(m) \\ arrayscore(m) \end{array} \right. \qquad (3.1)$$

where $Score(m)$ is the score of mass element $m$, and $arrayscore(m) = \max S(i, j, m)$ is the maximum score for mass $m$ containing up to $i$ modifications and $j$ amino acids. $S(i, j, m)$ is not updated if $arrayscore(m)$ is the highest score. The initial condition is $S(0, 0, 0) = 0$.
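A simplified version of this recursion can be sketched in a few lines. The sketch makes several assumptions that go beyond the text: masses are integers, at most one modification is placed on a residue, the modified branch is read as consuming one unit of the modification budget (it extends a state with $i - 1$ modifications), the $arrayscore(m)$ term is omitted, and `peak_score` is a hypothetical lookup playing the role of $Score(m)$. It returns only the best score and the number of modifications, not the traceback.

```python
from typing import Dict, List, Optional, Tuple


def align_spectrum_peptide(
    residue_masses: List[int],
    peak_score: Dict[int, float],  # hypothetical stand-in for Score(m)
    ptm_masses: List[int],
    k: int,
    parent_mass: int,
) -> Optional[Tuple[float, int]]:
    """Return (best score, number of modifications), or None if no path fits."""
    n = len(residue_masses)
    # table[i][j] maps a prefix mass m to the best score of reaching m
    # with the first j residues and exactly i modifications.
    table = [[{} for _ in range(n + 1)] for _ in range(k + 1)]
    table[0][0][0] = 0.0  # initial condition S(0, 0, 0) = 0

    def relax(cell: Dict[int, float], m: int, score: float) -> None:
        if 0 < m <= parent_mass:
            total = score + peak_score.get(m, 0.0)
            if total > cell.get(m, float("-inf")):
                cell[m] = total

    for j in range(1, n + 1):
        aa = residue_masses[j - 1]
        for i in range(k + 1):
            cell = table[i][j]
            # Unmodified residue: same modification count.
            for m_prev, s in table[i][j - 1].items():
                relax(cell, m_prev + aa, s)
            # Modified residue: one more modification than the predecessor.
            if i > 0:
                for m_prev, s in table[i - 1][j - 1].items():
                    for md in ptm_masses:
                        relax(cell, m_prev + aa + md, s)

    # Maximum over S(0, n, M), ..., S(k, n, M); ties favour fewer modifications.
    finals = [
        (table[i][n][parent_mass], i)
        for i in range(k + 1)
        if parent_mass in table[i][n]
    ]
    return max(finals, key=lambda t: t[0]) if finals else None


if __name__ == "__main__":
    # Nominal masses for A, S, G; a +80 shift stands for phosphorylation.
    scores = {71: 1.0, 238: 2.0}
    print(align_spectrum_peptide([71, 87, 57], scores, [80], k=1, parent_mass=295))
```

In the toy call, the parent mass 295 can only be reached with one +80 modification, and placing it on the second residue explains both scored peaks.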
$$Pre(i, j, m) = \begin{cases} m - aa_j & \text{if } S(i, j-1, m - aa_j) > S(i, j-1, m - aa_j - md_l) \\ m - aa_j - md_l & \text{if } S(i, j-1, m - aa_j - md_l) > S(i, j-1, m - aa_j) \end{cases} \qquad (3.2)$$

Assume that the parent mass of the tandem mass spectrum is $M$. Our goal is to find the maximum value among $S(0, n, M), \ldots, S(k, n, M)$:

$$\max \left\{ \begin{array}{l} S(k, n, M) \\ \vdots \\ S(0, n, M) \end{array} \right. \qquad (3.3)$$

We then find the number of modifications which maximizes the score $S$ and trace back to obtain the path. The time complexity of this algorithm is $O(nkl)$.

Although this algorithm requires a list of pre-defined PTMs, it can easily be extended to the case of unknown modifications. In this case, the recursion is:

$$S(i, j, m) = \max \left\{ \begin{array}{l} S(i, j-1, m - aa_j) + Score(m) \\ \max_{1 \le l \le d} S(i, j-1, m - \Delta) + Score(m) \\ arrayscore(m) \end{array} \right. \qquad (3.4)$$

and

$$Pre(i, j, m) = \begin{cases} m - aa_j & \text{if } S(i, j-1, m - aa_j) > S(i, j-1, m - \Delta) \\ m - aa_j - md_l & \text{if } S(i, j-1, m - \Delta) > S(i, j-1, m - aa_j) \end{cases} \qquad (3.5)$$

where $\Delta$ denotes a mass shift that is not restricted to the predefined list. The time complexity of this algorithm is $O(nkM)$, where $M$ is the parent mass of the tandem mass spectrum.

The score may vary with the length of the peptides, the density of the spectrum and the distribution of peak intensities. A more general score is the Z-score, which reflects the significance of the alignment score and has been used in PepHMM [66]. We briefly describe our method here. Suppose that the parent mass of an MS/MS spectrum is $m$ and the machine accuracy is $\delta$. We consider all peptides with a mass in $[m - \delta, m + \delta]$ and denote this set of peptides by $Q$. We can calculate the original scores for all the peptides in $Q$. We assume that $Q$ is infinite and that the scores of the peptides in $Q$ follow a normal distribution, so we can calculate the mean and standard deviation of these scores to compute the significance of the score of the predicted peptide. Three steps are needed to calculate the Z-score for a predicted peptide.

First we build a mass array $A$ where $A[i] = 1$ means that there exists some combination of amino acids whose masses sum to $i$.
$$A[i] = \max_{aa} A[i - mass(aa)], \qquad A[0] = 1 \qquad (3.6)$$

where $aa$ is one of the 20 amino acids and $mass(aa)$ is the mass of that amino acid. In our algorithm, we use 0.01 Da as the unit of the mass array $A$. $A$ can be calculated in linear time. Then the program randomly samples 500 peptides within the mass range $[m - \delta, m + \delta]$, where $\delta$ is the machine accuracy. The last step is to calculate the scores of the simulated peptides and then the mean and standard deviation of the distribution of these scores. We use this normal distribution for each spectrum to calculate the Z-score for each score given by the alignment algorithm.

3.1.2 Spectrum-Spectrum Alignment

The spectrum-spectrum alignment aligns two tandem mass spectra. The difference between the spectrum-peptide alignment and the spectrum-spectrum alignment is that here we do not know the exact sequence of the peptide. We need to find a path in the alignment graph which contains most of the signal peaks. Of course, noise peaks or unexplained peaks which appear in both spectra are also included in the path, while some signal peaks may be missing. So in most cases, we cannot reconstruct the peptide sequence simply from the alignment path; we need to use de novo sequencing tools to reconstruct it. However, our spectrum-spectrum alignment algorithm can show whether the two spectra are related to each other based on the alignment score.

There are three possible cases in the spectrum-spectrum alignment:
(1) The two peptides overlap, as shown in Figure 3.2 (a);
(2) The two sequences are the modified and the unmodified version of the same peptide, as shown in Figure 3.2 (b);
(3) The two sequences overlap and contain modifications at the same time, as shown in Figure 3.2 (c).

Figure 3.2: (a) shows that the two peptides from spectra S1 and S2 are overlapping. The two alignment paths represent the b-ion and y-ion paths respectively. (b) shows that one of the peptides contains 2 modifications, one with a positive mass shift and the other with a negative mass shift.
(c) shows that the two peptides are overlapping and at the same time contain modifications. The red line in each figure is the alignment path. A vertical line means a deletion or a PTM with negative mass. A horizontal line means an insertion or a PTM with positive mass.

A dynamic programming algorithm is used to find the optimal alignment path. Assume that the first MS/MS spectrum $S1$ contains $m$ peaks, and denote the spectrum as a list of peaks $A_1, A_2, \ldots, A_m$. Assume that the second MS/MS spectrum $S2$ contains $n$ peaks, and denote it as a list of peaks $B_1, B_2, \ldots, B_n$. A list of $d$ possible PTMs is also provided as $md_1, \ldots, md_d$. The peptide contains up to $k$ modifications. We define $D(i, j, f)$ as the optimal score to a point in the alignment path which corresponds to the $i$-th peak in spectrum $S1$ and the $j$-th peak in spectrum $S2$ and contains $f$ modifications. The dynamic programming recursion is as follows:

$$D(i, j, f) = \max \left\{ \begin{array}{l} D(diag(i, j), f) + S(i, j) \\ M(i-1, j-1, f-1) + S(i, j) \end{array} \right. \qquad (3.7)$$

where $diag(i, j)$ is the closest co-diagonal point of point $(i, j)$ in the alignment graph. Let $i_1, i_2$ be two peaks in $S1$ and $j_1, j_2$ be two peaks in $S2$. If $i_2 - i_1 = j_2 - j_1$ and there are no points on the line between $(i_1, j_1)$ and $(i_2, j_2)$, we set $diag(i_2, j_2) = (i_1, j_1)$. Here we define $M(i, j, f) = \max_{i' < i,\, j' < j} D(i', j', f)$. $M(i, j, f)$ can also be calculated using a recursion:

$$M(i, j, f) = \max \left\{ \begin{array}{l} D(i, j, f) \\ M(i-1, j, f) \\ M(i, j-1, f) \end{array} \right. \qquad (3.8)$$

A regular brute-force search calculates the co-diagonal table in $O(n^2 m^2)$ time. Here we use the spectrum convolution to calculate the co-diagonal table, which is an $O(nm)$ algorithm. Pevzner et al. [48, 49] first used this spectrum convolution method to calculate the co-diagonal points in their spectral alignment paper.

$S(i, j)$ is defined as the sum of the statistical scores of the $i$-th peak in $S1$ and the $j$-th peak in $S2$. The statistical score has been used in our database search program PepHMM [66] and the de novo program MSNovo [41].
We briefly describe the statistical score here. Two distributions are used in this scoring function. The relative ranking of a peak is defined as the ratio of its rank in descending order of intensity to the total number of peaks in the MS/MS spectrum. The matching tolerance of a peak is defined as the mass difference between a theoretical peak and its corresponding experimental peak. We found that the relative ranking follows an exponential distribution, while the matching tolerance follows a normal distribution. For simplicity, we treat the two distributions as independent. We then define the statistical score of a peak to be the likelihood ratio of the probability that this peak is a signal peak to the probability that this peak is a noise peak. The statistical score of peak $i$ is defined as:

$$S(i) = \frac{P(I_i \mid \mathrm{signal})\, P(T_i \mid \mathrm{signal})}{P(I_i \mid \mathrm{noise})\, P(T_i \mid \mathrm{noise})} \qquad (3.9)$$

So in our spectrum-spectrum alignment program,

$$S(i, j) = S_1(i) + S_2(j) \qquad (3.10)$$

where $S_1(i)$ is the statistical score of the $i$-th peak in spectrum $S1$ and $S_2(j)$ is the statistical score of the $j$-th peak in spectrum $S2$.

After the dynamic programming table $D(i, j, f)$ is filled, we trace back from the point $\max_{f = 1, \ldots, k} D(m, n, f)$ to get the optimal alignment path. The value of $f$ which maximizes $D(m, n, f)$ tells us the number of modifications in the alignment. The optimal alignment path tells us the relationship between the two peptides: whether they overlap or whether they are the modified and the unmodified version of the same peptide. The path also shows where the overlap begins. With the help of further biological information, such as the protein sequence from which the peptide comes, or with the help of a de novo sequencing program, we can find a sub-path on the optimal alignment path which can be used to reconstruct the peptide sequence. Consecutive points in the sub-path must be one or more residues apart, and the sub-path is the highest-scoring one in the optimal alignment path.
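The spectral convolution used in Section 3.1.2 to find co-diagonal points can be illustrated with the sketch below. It is a minimal sketch of the general idea rather than MSPEP's implementation: peak masses are plain floats, mass differences are binned with a hypothetical `bin_width` of 0.1 Da, and the function names are illustrative. Pairs of peaks that share the same binned difference lie on the same diagonal of the alignment graph.

```python
from collections import Counter
from typing import List, Tuple


def spectral_convolution(
    peaks_a: List[float], peaks_b: List[float], bin_width: float = 0.1
) -> Counter:
    """Count how often each binned mass difference b - a occurs.

    Runs in O(n * m) time for spectra with n and m peaks.
    """
    counts: Counter = Counter()
    for a in peaks_a:
        for b in peaks_b:
            counts[round((b - a) / bin_width)] += 1
    return counts


def co_diagonal_pairs(
    peaks_a: List[float],
    peaks_b: List[float],
    shift_bin: int,
    bin_width: float = 0.1,
) -> List[Tuple[int, int]]:
    """Index pairs (i, j) whose mass difference falls into the given bin."""
    return [
        (i, j)
        for i, a in enumerate(peaks_a)
        for j, b in enumerate(peaks_b)
        if round((b - a) / bin_width) == shift_bin
    ]


if __name__ == "__main__":
    s1 = [100.0, 200.0, 300.0]
    s2 = [100.0, 200.0, 380.0]  # last peak shifted by +80
    print(spectral_convolution(s1, s2).most_common(3))
    print(co_diagonal_pairs(s1, s2, shift_bin=0))
```

A frequently occurring difference of zero suggests shared unmodified fragments, while a frequent non-zero difference points to a mass shift such as a modification or an overlap offset.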
3.2 Results

3.2.1 Data Sets

We obtained two OrbiTrap MS/MS datasets, BSA202 and Acasein844, from Professor Austin Yang's lab at the University of Maryland (unpublished data). We then compare the performance of MSPEP with InSpect on these datasets. InSpect can run in unrestrictive mode to find unknown modifications, while the current version of MSPEP can only run in restrictive mode (searching within a given list of PTMs). We compare the performance of the two programs in restrictive mode.

BSA OrbiTrap Dataset
This dataset contains 202 OrbiTrap tandem mass spectra which are confidently annotated by SEQUEST with an Xcorr score larger than 2.5. The data was generated from pure BSA (bovine serum albumin) protein. There is one fixed modification, C +58.01 (carboxymethyl), and one differential modification, M +15.99 (oxidation), in this dataset. The parent ion mass tolerance of these spectra is set to 0.01 Da. The annotations are obtained by SEQUEST.

Alpha-casein OrbiTrap Dataset
This dataset contains 844 OrbiTrap tandem mass spectra. The data was generated from trypsin-digested bovine alpha-casein. There is one optional modification, STY +80 (phosphorylation). The parent mass tolerance is 0.2 Da. The annotations are obtained by running SEQUEST against the alpha-casein protein sequence.

IKKb Dataset
This dataset contains 1119 spectra from a trypsin digestion of IKKb (inhibitor of nuclear factor kappa B kinase beta) from Professor Ebrahim Zandi's laboratory at USC. The original dataset contains 45,500 spectra and was annotated by SEQUEST against the SwissProt database. We chose the spectra annotated as IKKB_HUMAN to make sure they have correct identifications, which gives a total of 1119 spectra. In these 1119 spectra, the cysteine protecting group is known to be carbamidomethylation (+57). This dataset has been widely used in recent studies [6, 61, 65].
BSA and α-casein sample preparation

Lyophilized bovine serum albumin digest standard (Michrom Bioresources) was reconstituted in LC Solvent A: 2% acetonitrile (Burdick & Jackson) / 0.1% formic acid (Fluka). Serial dilutions were made in Solvent A to achieve final concentrations of 1 pmol/µL immediately prior to LC-MS/MS analysis. A 2.0 mg/mL solution of α-casein (Sigma) was prepared in 100 mM ammonium bicarbonate, pH 7.8. An aliquot containing 200 µg α-casein was digested with trypsin (2% w/w) (Promega) overnight at 37 °C. The digest was stopped by the addition of formic acid to 1% final volume and stored at -80 °C until MS analysis.

MS and LC-MS/MS analysis

All MS analyses were performed using an LTQ-OrbiTrap (Thermo Electron) mass spectrometer equipped with a nanospray ionization source containing an uncoated 10 µm i.d. SilicaTip™ PicoTip™ nanospray emitter (New Objective). The spray voltage was 1.8 kV and the heated capillary temperature was 200 °C. BSA peptides were injected directly into a 5 µL sample loop and the samples were analyzed in order from the lowest to the highest concentration. A 40 min gradient method from 5 to 40% solvent B (95% acetonitrile, 0.1% formic acid) was used. MS1 and MS/MS data were acquired in the linear ion trap using a top 5 data-dependent acquisition method with dynamic exclusion enabled (repeat count 2, 30 sec exclusion duration). Digested α-casein was diluted to 5 pmol/µL in LC Solvent A immediately prior to infusion for MS analysis. MS1 data were acquired in the OrbiTrap mass analyzer (60,000 resolution, 2 scans, 100 ms max injection time) and MS/MS data were acquired in the linear ion trap using a top 20 data-dependent method. For methods with dynamic exclusion enabled, the repeat count was 2 with a 60 sec exclusion duration. Digested BSA was analyzed by LC-MS/MS using an Xtreme Simple nano LC system (Micro-Tech Scientific) equipped with a 150 mm x 75 µm C-18 reversed-phase column (5 µm particles with 300 Å pores).
3.2.2 Spectrum-Peptide Alignment

The spectrum-peptide alignment algorithm is implemented as a C++ program named MSPEP. The executables and source code are available from http://msms.cmb.usc.edu/MSPEP/. The program aligns an MS/MS spectrum with a single peptide sequence, with a list of peptide sequences, or with a protein sequence in FASTA format. The input of MSPEP includes an MS/MS spectrum, sequence information and a list of known PTMs. Users can also specify which protease was used to digest the protein; the current version supports trypsin, chymotrypsin, Lys-C and non-specific digestion. MSPEP outputs the peptide sequence for the MS/MS spectrum, whether or not it contains PTMs. If the peptide contains PTMs, the result shows the sites and types of the modifications. It also outputs the original score and the Z-score of each result. The Z-score is calculated from 500 randomly sampled peptides that have the same molecular mass. Details of the Z-score calculation are given in Section 3.1.2.

We compare the performance of MSPEP and InSpect on three datasets: the BSA OrbiTrap dataset, the Alpha-casein OrbiTrap dataset and the IKKb dataset. The comparison results are listed in Table 3.1.

BSA dataset

In the BSA dataset, there is one fixed modification, C +58.01 (carboxymethylation), and one differential modification, M +15.99 (oxidation). The spectra are annotated by SEQUEST. We run the alignment algorithm against the BSA protein sequence. A list of 15 common PTMs is provided, including the two present in the sample as well as acetylation (42 Da), phosphorylation (80 Da), methylation (14 Da), etc. OrbiTrap data has high parent mass precision (around 0.001 Da), so we use 0.01 Da as the parent mass precision and 0.1 Da as the peak precision in our algorithm. We then compared our alignment results with the SEQUEST results.
MSPEP predicted 200 peptide sequences and all of them are the same as the SEQUEST annotations. MSPEP also runs much faster than database search methods and InSpect: it takes less than 0.3 seconds per spectrum, while SEQUEST takes about 9 seconds per spectrum and the web-based InSpect (http://bix.ucsd.edu/MassSpec/) takes about 5-10 seconds per spectrum.

We then compare the performance of MSPEP with InSpect on this BSA MS/MS dataset. InSpect was developed by Tanner et al. [61] and also identifies PTMs from MS/MS spectra. We run InSpect against a protein database containing only the BSA protein, and specify the cysteine protecting group to be carboxymethylation (+58). InSpect predicted 191 peptides, of which 183 are the same as the SEQUEST annotations.

Alpha-casein dataset

In the Alpha-casein dataset, there is only one modification, STY +80 (phosphorylation). The spectra are annotated by running SEQUEST against the bovine alpha-casein protein. We then run MSPEP and InSpect on this dataset and compare the results. When running MSPEP, we use a list of 15 common PTMs and 0.1 Da for both the parent mass precision and the peak precision. In this dataset, MSPEP predicted 628 peptides and all of them are the same as the SEQUEST annotations, while InSpect predicted only 273 peptides, of which 224 are correct.

IKKb dataset

In the IKKb dataset, cysteine is protected by carbamidomethylation (+57). The spectra are annotated by running SEQUEST against the SwissProt database. Only the 1119 spectra that are predicted to be IKKB_HUMAN peptides are selected. MSPEP predicted peptides for all 1119 spectra and 1118 of them are the same as the SEQUEST predictions, while InSpect predicted peptides for 1119 spectra and 1109 of them are correct.
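
As an illustration of the Z-score that MSPEP reports with each match, the following minimal Python sketch standardizes a match score against the scores of randomly sampled peptides of the same mass. It assumes the conventional definition (the score minus the mean of the random scores, divided by their standard deviation). The function name z_score and the example numbers are hypothetical; the procedure actually used by MSPEP is the one described in Section 3.1.2.

```python
import statistics

def z_score(score, random_scores):
    """Standardize `score` against a background of random-peptide scores.

    Assumption: the population standard deviation is used; the thesis
    does not say which estimator MSPEP applies.
    """
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (score - mean) / sd

# Hypothetical background of 8 scores (MSPEP samples 500 peptides).
background = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(10, background)
```

In the setting described above, random_scores would hold the alignment scores of the 500 sampled peptides that share the molecular mass of the candidate peptide.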
Table 3.1: Comparison of MSPEP and InSpect on three datasets: BSA, Alpha-casein and IKKb. Entries for each program are correct/predicted.

Dataset    # Spectra   MSPEP       InSpect
BSA        202         200/200     183/191
Acasein    844         628/628     224/273
IKKb       1119        1118/1119   1109/1119

3.2.3 Significance of the results

To assess the significance of our Z-score, we search the 202 OrbiTrap spectra against a non-BSA protein database, which contains 127 human proteins. Since the BSA protein is not included in this database, the search results are necessarily incorrect. We define the search results against the non-BSA database as the negative set and the search results against the BSA protein as the positive set. We calculate the Z-scores for both sets and compare their distributions, as shown in Figure 3.3. The Z-scores of the correct and incorrect predictions are well separated. We can then calculate the false positive rate using the distributions of the Z-scores from the negative and the positive sets. The results are shown in Table 3.2. Table 3.2 shows that with a threshold of 3 we obtain a relatively high true positive rate while keeping the false positive rate small. The ROC curve of the Z-score for the BSA dataset is shown in Figure 3.4. Similarly, we searched the spectra of the A-casein peptides against a database that does not contain A-casein proteins. The ROC curve of the Z-score for the A-casein dataset is shown in Figure 3.5.

Figure 3.3: Distribution of the Z-scores of the positive set and the negative sets.

Figure 3.4: ROC Curve of the Z-Score for the BSA dataset

Figure 3.5: ROC Curve of the Z-Score for the A-casein dataset

Table 3.2: False positive rates and true positive rates for different Z-score thresholds.
Z-score   False positive   True positive
0         97.44%           100%
1         64.74%           98.51%
2         22.44%           90.05%
3         3.85%            77.61%
4         0.64%            58.71%
5         0.64%            37.31%
6         0%               19.40%

3.2.4 Spectrum-Spectrum Alignment

The spectrum-spectrum alignment algorithm is implemented as a C++ program named MSMS_Align, which is available online at http://msms.cmb.usc.edu/MSPEP/MSMS_Align.htm. The input of MSMS_Align includes two MS/MS spectra and a list of known PTMs. The output is an optimal alignment path in the alignment graph. The path shows the relationship between the two spectra: whether they overlap or contain PTMs. A de novo sequencing algorithm may be applied to the points on the optimal alignment path to find the signal peaks that can be used to reconstruct the peptide sequence. However, at this stage, our program only reports the optimal alignment path and the alignment score.

Figure 3.6 shows an optimal alignment path found by MSMS_Align for two peptides, VLTSSAR and EKVLTSSAR. S1 is the spectrum of VLTSSAR and S2 is the spectrum of EKVLTSSAR. The alignment path contains 170 matched peak pairs; only 7 of them are b/y-ion signal peaks. The MSMS_Align algorithm greatly increases the signal-to-noise ratio of the two spectra and helps to find the peptides using de novo sequencing algorithms. Signal peaks usually have high statistical scores. We can then use a de novo sequencing algorithm to reconstruct the peptide sequences from the points on the optimal alignment path. In this example, our de novo sequencing algorithm picks the following matching peak pairs as b/y-ion peaks: (98.95, 356.23), (212.09, 469.30), (313.16, 570.30), (400.12, 657.33), (487.07, 744.33), (558.27, 815.44), (714.42, 971.56). From the mass differences between consecutive matching pairs, we can reconstruct the peptides as VLTSSAR and [257Da]VLTSSAR. With the knowledge of amino acid masses, we know that 257 Da is the mass sum of residue E and residue K.
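
The arithmetic in this example can be checked with a short Python sketch. It uses the seven peak pairs listed above and the monoisotopic residue masses from Appendix A. The helper names and the 0.2 Da tolerance are assumptions chosen for illustration; they are not taken from our alignment program.

```python
# Monoisotopic residue masses (Da) from Appendix A. Isoleucine is omitted
# because it has the same mass as leucine, so "L" stands for either.
RESIDUE_MASS = {
    "G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
    "V": 99.06842, "T": 101.04768, "C": 103.00919, "L": 113.08407,
    "N": 114.04293, "D": 115.02695, "Q": 128.05858, "K": 128.09497,
    "E": 129.04260, "M": 131.04049, "H": 137.05891, "F": 147.06842,
    "R": 156.10112, "Y": 163.06333, "W": 186.07932,
}

# Matched (S1, S2) peak pairs reported in the text.
PEAK_PAIRS = [
    (98.95, 356.23), (212.09, 469.30), (313.16, 570.30), (400.12, 657.33),
    (487.07, 744.33), (558.27, 815.44), (714.42, 971.56),
]

def closest_residue(mass, tol=0.2):
    """Return the residue whose mass is nearest to `mass`, or '?' if none is within tol."""
    name, ref = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - mass))
    return name if abs(ref - mass) <= tol else "?"

def read_ladder(peaks, tol=0.2):
    """Read residues from the mass differences between consecutive peaks."""
    return "".join(closest_residue(b - a, tol) for a, b in zip(peaks, peaks[1:]))

def residue_pairs(mass, tol=0.2):
    """List unordered residue pairs whose summed mass is within tol of `mass`."""
    names = sorted(RESIDUE_MASS)
    return [(x, y) for i, x in enumerate(names) for y in names[i:]
            if abs(RESIDUE_MASS[x] + RESIDUE_MASS[y] - mass) <= tol]

s1_ladder = read_ladder([a for a, _ in PEAK_PAIRS])
s2_ladder = read_ladder([b for _, b in PEAK_PAIRS])
offset = sum(b - a for a, b in PEAK_PAIRS) / len(PEAK_PAIRS)  # mean S2 - S1 shift
candidates = residue_pairs(offset)
```

Both ladders read LTSSAR after the first peak, and the mean shift between the two spectra is about 257.2 Da. More than one residue pair can fall within the tolerance (E+Q, for example, differs from E+K by less than 0.04 Da), so the mass offset alone does not pin down the two extra residues.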
Using further peak information from spectrum S1, we can finally determine that the peptide for S2 is EKVLTSSAR. Other signal peaks, such as neutral loss ions and a-ions, are also useful in reconstructing peptide sequences. The statistical scores of neutral loss ions can also be used in the de novo sequencing algorithm to increase prediction accuracy.

We run a pairwise spectrum-spectrum alignment on the BSA OrbiTrap Dataset, resulting in a total of 12309 alignments. We call two spectra related if: (1) the two peptides are identical; (2) the two peptides overlap; (3) the two peptides are modified and unmodified versions of the same sequence; or (4) the two peptides overlap and contain PTMs at the same time. Otherwise, we call the two spectra unrelated. Of our 12309 alignments, 559 are from related spectra and the remaining 11750 are from unrelated spectra. Figure 3.7 shows the distributions of the alignment scores of the related and unrelated spectra alignments. We can see two separate, approximately normal distributions for related and unrelated spectra alignments, which means that we can find a threshold to decide whether an alignment is from related or unrelated spectra at an acceptable false positive rate. Given a threshold, we can calculate the true positive and false positive rates of the alignment scores. The results are shown in Table 3.3. Figure 3.8 shows the ROC curve for the alignment scores. From the results, we suggest using a threshold of 20. With this threshold, 91% of the related spectrum pairs are detected (true positive rate), while the false positive rate is 4.9%.

Figure 3.6: Optimal alignment path of two tandem mass spectra for peptides VLTSSAR and EKVLTSSAR. Each dot on the path represents an aligned peak pair in the two spectra. The size of a dot represents the statistical score of the matched pair. The higher the score, the more likely it is that these two peaks are signal peaks. The black line is the alignment path.
From this path we know that these two peptides overlap, and the mass difference is 257 Da. The mass difference occurs at the N-terminus of the peptides.

3.3 Discussion

SEQUEST annotations are first generated against all bovine proteins, and we consider the SEQUEST annotations to be the true peptides. When there were discrepancies between MSPEP annotations and SEQUEST annotations, we re-ran SEQUEST against only the Alpha-casein proteins and found that under this condition the SEQUEST prediction is the same as the MSPEP prediction, which indicates that the MSPEP predictions are correct. Among the 628 A-casein spectra, SEQUEST cannot predict the correct peptide for 73 spectra when searching against all bovine proteins, while MSPEP predicts all of them correctly. 13 of these 73 predictions have a Z-score above our threshold of 3.

Figure 3.7: Distribution of the alignment scores for related spectra and unrelated spectra

InSpect can run in unrestrictive mode to find unknown PTMs [62, 63]. In this paper, we only compare the performance of MSPEP with InSpect under restrictive mode (web version: http://proteomics.ucsd.edu/LiveSearch/). There are two reasons for this: (1) The results of an unrestrictive PTM search require manual verification to confirm that the modifications exist, and a restrictive search given a list of possible PTMs still has much higher accuracy than an unrestrictive search. (2) We have not finished developing the unrestrictive search of MSPEP.

The MSMS_Align program currently only generates a list of peaks on the optimal alignment path. Most of the signal peaks are located on this path. Right now our program has high accuracy in predicting whether two spectra are from related or unrelated peptides. We are still working on combining our de novo program MSNovo [41] with MSMS_Align to predict peptides directly from spectrum-spectrum alignment. Since a list of peaks containing most signal peaks is generated, the results can also be imported into other de novo programs such as PepNovo [18, 20] and PEAKS [37] to get the annotations.

Table 3.3: False positive rates and true positive rates for different alignment score thresholds.

Alignment Score   False positive   True positive
10                47.23%           96.60%
12                33.42%           96.24%
14                22.59%           94.99%
16                13.40%           93.74%
18                7.76%            93.38%
20                4.91%            91.06%
22                3.00%            88.55%
24                1.48%            83.01%
26                0.59%            77.10%
28                0.20%            69.76%
30                0.09%            59.93%

Figure 3.8: ROC curve of the spectrum-spectrum alignment scores

Chapter 4

Future Research

In this chapter, we discuss related work that is in progress. These are possible future research topics for me in the mass spectrometry field.

4.1 Spectral Network

In Chapter 3, we discussed using spectral alignment algorithms to discover post-translational modifications. However, our program MSPEP currently aligns only one peptide with one tandem mass spectrum at a time. Using the spectrum-spectrum alignment algorithm, we will be able to build a spectral network for PTM identification. Our spectrum-spectrum alignment program MSMS_Align now gives an alignment path as well as an alignment score for each alignment. The scores and path indicate the relationship between the two aligned spectra: whether they are identical, overlapping, containing PTMs or unrelated. A spectral network will then be built based on this information, containing the relationships between the spectra and the alignment scores. The spectral network will be used for PTM identification, high-accuracy de novo sequencing and shotgun genome annotation, which will be covered in the next section. This idea has been implemented by Nuno Bandeira from UCSD [6], especially in the application of PTM identification.
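
A minimal Python sketch of how such a network could be assembled from pairwise alignment scores is given below. It assumes that two spectra are linked when their alignment score reaches the threshold of 20 suggested in Section 3.2.4, and that groups of related spectra are then read off as connected components. The function names, spectrum identifiers and scores are hypothetical, and this is not the construction used in [6].

```python
def build_network(scored_pairs, threshold=20.0):
    """Build an undirected graph: spectrum id -> {neighbor id: alignment score}.

    Every spectrum becomes a node; an edge is added only when the
    alignment score is at or above `threshold`.
    """
    graph = {}
    for a, b, score in scored_pairs:
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if score >= threshold:
            graph[a][b] = score
            graph[b][a] = score
    return graph

def connected_components(graph):
    """Group spectra that are linked directly or through other spectra."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph[node] if n not in component)
        seen |= component
        components.append(component)
    return components

# Hypothetical alignment scores between four spectra.
pairs = [("s1", "s2", 35.0), ("s2", "s3", 22.5), ("s3", "s4", 8.0), ("s1", "s4", 12.0)]
network = build_network(pairs)
groups = connected_components(network)
```

Here s1, s2 and s3 end up in one group because s1-s2 and s2-s3 both score above the threshold, even though s1 and s3 were never aligned directly; s4 stays on its own.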
There is still much room for work in this field, including improving the accuracy of PTM identification and shotgun proteome sequencing using MS/MS data.

4.2 Whole Genome Annotation using MS/MS data

As more and more MS/MS data become available, it becomes possible to use tandem mass spectrometry data to annotate the whole genome. The Seattle Proteome Center has built an online mass spectrometry database called PeptideAtlas, which aims at full annotation of eukaryotic genomes through a thorough validation of expressed proteins. The data in PeptideAtlas include MS/MS data collected for human, mouse, yeast, and several other organisms. The data were searched using the latest search engines and protein sequences to help protein identification in a whole-proteome framework. However, we can also use de novo sequencing methods to annotate these MS/MS data to discover possible new proteins or ORFs.

4.2.1 Methods

Our goal is to use tandem mass spectrometry de novo sequencing methods to annotate peptides and find possible new ORFs in the human genome. Although database search methods are the most reliable identification tools for tandem mass spectrometry, they have several limitations. First, because they have to search a large protein database for matching peptides, they are very slow. Second, they only identify proteins that already exist in the database and cannot find new proteins. So in this project, we use de novo sequencing methods to screen tandem mass spectra generated from human proteins and try to find possible new ORFs. The results can be submitted for experimental verification. The detailed methods are described below.

1. Processing Human Genome. The latest human genome data was downloaded from NCBI at ftp://ftp.ncbi.nih.gov/genomes/H_sapiens. We then find ORFs (Open Reading Frames) longer than 200 aa in the whole human genome.
Although there are short proteins such as insulin (51 aa), most human proteins are longer than 200 aa. To reduce the number of candidate ORFs and improve search efficiency, we set the minimum number of residues in an ORF to 200. For each chromosome, we searched all 6 reading frames, and then combined all the ORFs into a human ORF database.

Table 4.1: Virtual peptide database created from human ORFs.

mass     peptide sequence   ORF access no.
484.2    AGGGAGGG           4_3737672_F2
498.21   GGGGGVGG           16_387047_F2
527.23   GGGGGGGK           12_105165190_R1
                            2_7185502_R1
                            X_1813925_R2

2. Building peptide database. ORFs in the human ORF database are virtually digested into possible peptides. We are interested in peptides between 8 and 30 residues long and build a peptide database from these qualified peptides. The tandem mass spectra are then searched against this virtual peptide database for matches. For each theoretical peptide, we store its chromosome id, the location of the ORF on the chromosome, the length of the ORF, the reading frame and the location of the peptide on the ORF. If we find a possible hit between a peptide and an MS/MS spectrum, we can use this information to locate the ORF. The peptide database is indexed by peptide mass, as shown in Table 4.1. The last column, ORF access no., is the access number of the ORF in which the peptide is located. The ORF access number has the form chrid_chrloc_ReadingFrame. For example, access number 4_3737672_F2 means chromosome 4, with the ORF beginning at position 3737672 of the chromosome, in reading frame F2. There are six reading frames, F1, F2, F3, R1, R2 and R3, which are shown in Figure 4.1.

Figure 4.1: Six reading frames of chromosome.

3. Downloading PeptideAtlas data. PeptideAtlas [46] is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. Mass spectrometer output files are collected for human, mouse, yeast, and several other organisms, and searched using the latest search engines and protein sequences. In this project, we downloaded human tandem mass spectra from PeptideAtlas in mzXML format. The mzXML files are then converted to dta files.

4. Search for matched ORFs. Each dta file is then searched against every peptide in the virtual peptide database. A match score is calculated for each spectrum-peptide pair. We keep all peptides with a match score of 20 or higher. A peptide may have multiple spectrum hits; the more hits a peptide has, the more likely it is to come from a real protein. We sorted all ORFs according to their number of spectrum hits. Then we submitted those with more than 500 spectrum hits to the BLASTp program against the nr database. We expect that most of the ORFs with a high number of spectral hits are existing proteins in the nr database. A few of them may be novel proteins, which require further experimental verification. Our project provides a list of potential novel proteins which might have important biological roles.

4.2.2 Results

There are 55 million amino acids in our ORF database, in 180935 ORFs of length 200 aa or longer. The virtual peptide database we built contains 3436549 peptides. The tandem mass spectra we used are from 65 PeptideAtlas datasets, 3876 mzXML files in total. We found that 171,862 ORFs have spectral hits; 95,312 ORFs have more than 200 spectral hits and 57,789 ORFs have more than 500 spectral hits. We submitted those with more than 500 hits to the BLASTp program against the nr database. Most of them receive a BLAST score, indicating that they may be existing human proteins. Some of them received no BLAST matches and may be potential novel proteins. In total there are 2142 ORFs that do not match any sequence in the nr database; these are our most interesting targets. We also need to look at the ORFs with low BLAST scores.
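
Steps 1 and 2 of the methods in Section 4.2.1 can be sketched in Python as follows. The sketch assumes that an ORF is any stretch between two stop codons in one of the six reading frames (the text does not state whether a start codon is required) and that the virtual digestion follows a simple trypsin rule (cleave after K or R). Both are assumptions, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_orfs(dna, min_len=200):
    """Return (frame, protein) for every stop-free stretch of at least min_len residues."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    orfs = []
    for strand, seq in (("F", dna), ("R", reverse)):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((f"{strand}{offset + 1}", segment))
    return orfs

def tryptic_peptides(protein, min_res=8, max_res=30):
    """Cleave after K or R and keep peptides of min_res..max_res residues."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" or i == len(protein) - 1:
            peptide = protein[start:i + 1]
            if min_res <= len(peptide) <= max_res:
                peptides.append(peptide)
            start = i + 1
    return peptides

# Toy input; a real run would use min_len=200 on a chromosome sequence.
toy_orfs = six_frame_orfs("ATGGCTGGTAAATAA", min_len=4)
toy_peptides = tryptic_peptides("GGGGGGGKAAR")
```

Each retained peptide would then be stored with its chromosome, ORF location and reading frame so that a spectrum hit can be traced back to the ORF.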
Such low-scoring ORFs may contain PTMs or may have undergone alternative splicing after transcription.

4.2.3 Discussion

So far we only have preliminary results for this project. None of the possible new ORFs found by this project has been validated experimentally. Related work by others [3, 11, 28, 29, 33] also shows that it is possible to use tandem mass spectrometry data for whole-proteome shotgun sequencing and novel protein discovery. Future directions may include: (1) experimental validation of the ORFs that have no protein hits, to see if they are new proteins; (2) homology search in other organisms for the ORFs that have no protein hits or low BLAST scores; (3) shotgun whole-proteome sequencing of small eukaryotic organisms. These future directions require skills and expertise beyond mass spectrometry; I may continue in these directions if collaborations become available.

References

[1] Alves G, Yu YK. Robust Accurate Identification of Peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics. Bioinformatics. 2005 Aug 16.
[2] Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman (1990, Oct). Basic local alignment search tool. J. Mol. Biol. 215 (3), 403-410.
[3] Jean Armengaud, A perfect genome annotation is within reach with the proteomics and genomics alliance. Current Opinion in Microbiology 2009, 12:19
[4] Bafna, V., Edwards, N. SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database, Bioinformatics, Vol 17, Suppl 1 (2001), S13-21.
[5] Bafna V. and Edwards N. On de novo interpretation of tandem mass spectra for peptide identification. Proceedings of the seventh annual international conference on Computational molecular biology, 2003.
[6] Nuno Bandeira et al., Protein identification by spectral networks analysis, Proc Natl Acad Sci. 2007 Apr 10;104(15):6140-5.
[7] Bandeira N, Tang H, Bafna V, Pevzner P. Shotgun protein sequencing by tandem mass spectra assembly. Anal Chem.
2004 Dec 15;76(24):7221-33.
[8] Bern M, Goldberg D. EigenMS: de novo analysis of peptide tandem mass spectra by spectral graph partitioning. RECOMB 2005.
[9] Bern, M., Y. Cai, and D. Goldberg (2007). Lookup peaks: A hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Analytical Chemistry 79, 1393-1400.
[10] Bernd Fischer et al., NovoHMM: A Hidden Markov Model for de Novo Peptide Sequencing, Anal. Chem. 2005, 77, 7265-7273
[11] Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A 2008 Dec 30 105(52):21034-8
[12] Chen, T., Kao, M.Y., Tepel, M., Rush, J., and Church, G.M. 2001a. A dynamic programming approach for de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology, 8(3): 325-337.
[13] Clauser, K. R., Baker, P. R., and Burlingame, A. L. 1999. Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Analytical Chemistry 71(14): 2871-9.
[14] Colinge J, Masselot A, Giron M, Dessingy T, Magnin J. OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics. 2003 Aug;3(8):1454-63.
[15] Dancik, V., et al. De novo peptide sequencing via tandem mass spectrometry, J Comput Biol, Vol 6, 3-4 (1999), 327-42.
[16] Elias JE, Gibbons FD, King OD, Roth FP, Gygi SP. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat Biotechnol. 2004 Feb;22(2):214-9. Epub 2004 Jan 18.
[17] Eng, J. K., et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, Journal of the American Society for Mass Spectrometry, Vol 5, 11 (1994), 976-989.
[18] Frank A, Pevzner P. PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling. Anal. Chem. 2005, 77, 964-973.
[19] Frank, A., S. Tanner, V. Bafna, and P.
Pevzner (2005). Peptide sequence tags for fast database search in mass-spectrometry. J. of Proteome Research 4, 1287-1295.
[20] Frank A et al. De Novo Peptide Sequencing and Identification with Precision Mass Spectrometry. J. Proteome Res. 6:114-123, 2007.
[21] Field HI, Fenyo D, Beavis RC. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics. 2002 Jan;2(1):36-47.
[22] Tema Fridman, Robert Day, Jane Razumovskaya, Dong Xu, Andrey Gorin, Probability Profiles - Novel Approach in Tandem Mass Spectrometry De Novo Sequencing, csb, pp.415, IEEE Computer Society Bioinformatics Conference (CSB'03), 2003
[23] Havilio, M., Haddad, Y., Smilansky, Z. Intensity-based statistical scorer for tandem mass spectrometry, Anal Chem, Vol 75, 3 (2003), 435-44.
[24] Garavelli J.S. 2003. The RESID Database of Protein Modifications: 2003 developments. Nucleic Acids Research 31: 499-501.
[25] Garavelli, J.S. 2004. The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics 4: 1527-1533.
[26] Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH. Open mass spectrometry search algorithm. J Proteome Res. 2004 Sep-Oct;3(5):958-64.
[27] Lyris MF de Godoy, Jesper V Olsen, Gustavo A de Souza, Guoqing Li, Peter Mortensen, Matthias Mann, Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biology 2006, 7:R50.
[28] Nitin Gupta, Stephen Tanner, Navdeep Jaitly, Joshua N. Adkins, Mary Lipton, Robert Edwards, Margaret Romine, Andrei Osterman, Vineet Bafna, Richard D. Smith and Pavel A. Pevzner. Whole proteome analysis of post-translational modifications: Applications of mass-spectrometry for proteogenomic annotation. Genome Res, 17:1362-1377, 2007.
[29] N. Gupta, J. Benhamida, V. Bhargava, D. Goodman, E. Kain, I. Kerman, N. Nguyen, N. Ollikainen, J. Rodriguez, J.
Wang, M.S. Lipton, M. Romine, A. Osterman, V. Bafna, R.D. Smith and P.A. Pevzner. Comparative Proteogenomics: Combining Mass Spectrometry and Comparative Genomics to Analyze Multiple Genomes. Genome Res. 2008. 18: 1133-1142.
[30] Han, Y., B. Ma, and K. Zhang (2005). SPIDER: software for protein identification from sequence tags with de novo sequencing error. J Bioinform. Comput. Biol. 3, 697-716.
[31] Havilio, M. and A. Wool (2007). Large-scale unrestricted identification of post-translation modifications using tandem mass spectrometry. Analytical Chemistry 79, 1362-1368.
[32] Keller, A., et al. Experimental protein mixture for validating tandem mass spectral analysis, Omics, Vol 6, 2 (2002), 207-12.
[33] Kubota K, Kosaka T, Ichikawa K. Shotgun Protein Analysis by Liquid Chromatography-Tandem Mass Spectrometry. Methods Mol Biol. 2009;519:483-94.
[34] Liu, C., B. Yan, Y. Song, Y. Xu, and L. Cai (2006). Peptide sequence tag-based blind identification of post-translational modifications with point process model. Bioinformatics 22, e307-313.
[35] Lu B, Chen T. A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry, J Comput Biol, Vol 10, 1 (2003), 1-12.
[36] Lu B and Chen T. A suffix tree approach to protein identification via mass spectrometry: applications to peptides of non-specific digestions and amino acid modifications. Bioinformatics Suppl. 2 (ECCB), Page 113-121.
[37] Ma B., Doherty-Kirby A., Lajoie G.: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry, Rapid Commun Mass Spectrom, Vol 17, 20 (2003), 2337-42.
[38] Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994 Dec 15;66(24):4390-9.
[39] M. Mann and O. N. Jensen. Proteomic analysis of post-translational modifications. Nat Biotechnol, 21(3):255-61, Mar 2003.
[40] Matthiesen, R., M. Trelle, P. Hojrup, J. Bunkenborg, and O. Jensen (2005).
VEMS 3.0: Algorithms and computational tools for tandem mass spectrometry based identification of post-translational modifications in proteins. Journal of Proteome Research 4, 2338-2347.
[41] Mo L, Dutta D, Wan Y and Chen T. MSNovo: A new dynamic programming algorithm for de novo peptide sequencing. Analytical Chemistry 2007 Jul 1;79(13):4870-8.
[42] Mortz E, O'Connor PB, Roepstorff P, Kelleher NL, Wood TD, McLafferty FW, Mann M. Sequence tag identification of intact proteins by matching tandem mass spectral data against sequence data bases. Proc Natl Acad Sci. 1996 Aug 6;93(16):8264-7.
[43] Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003 Sep 1;75(17):4646-58.
[44] Omenn GS., et al. Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database, Proteomics, 2005 Aug; 5(13):3226-45.
[45] Payne, S., M. Yau, M. Smolka, S. Tanner, H. Zhou, and V. Bafna (2008). Phosphorylation specific ms/ms scoring for rapid and accurate phospho-proteome analysis. submitted.
[46] http://www.peptideatlas.org/
[47] Perkins, D. N., et al. Probability-based protein identification by searching sequence databases using mass spectrometry data, Electrophoresis, Vol 20, 18 (1999), 3551-67.
[48] Pevzner, P. A., et al. Mutation-tolerant protein identification by mass spectrometry, J Comput Biol, Vol 7, 6 (2000), 777-87.
[49] Pevzner, P. A., et al. Efficiency of database search for identification of mutated and modified proteins via mass spectrometry, Genome Res, Vol 11, 2 (2001), 290-9.
[50] Pevzner, P. A., et al. Mutation-tolerant protein identification by mass spectrometry, J Comput Biol, Vol 7, 6 (2002), 777-87.
[51] Prince, J.T., Carlson, M.W., Want, R., Lu, P., Marcotte, E.M., The need for a public proteomics repository.
Nature Biotechnology 22 (2004), 471-474.
[52] Sadygov RG, Yates JR 3rd. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem. 2003 Aug 1;75(15):3792-8.
[53] Savitski, M., M. Nielsen, and R. Zubarev (2006). ModifiComb, a New Proteomic Tool for Mapping Substoichiometric Post-translational Modifications, Finding Novel Types of Modifications, and Fingerprinting Complex Protein Mixtures. Mol Cell Proteomics 5, 935-948.
[54] Shevchenko, A., A. Loboda, S. Sunyaev, A. Shevchenko, P. Bork, W. Ens, and K. Standing. (2001). Charting the proteomes of organisms with unsequenced genomes by MALDI-Quadrupole Time-of Flight Mass Spectrometry and BLAST homology searching. Anal. Chem. 73, 1917-1926.
[55] Shevchenko, A., S. Sunyaev, A. Liska, P. Bork, and A. Shevchenko (2003). Nanoelectrospray tandem mass spectrometry and sequence similarity searching for identification of proteins from organisms with unknown genomes. Methods Mol. Biol. 211, 221-234.
[56] Searle BC, Dasari S, Turner M, Reddy AP, Choi D, Wilmarth PA, McCormack AL, David LL, Nagalla SR. High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. Anal Chem. 2004 Apr 15;76(8):2220-30.
[57] Shilov, I., S. Seymour, A. Patel, A. Loboda, W. Tang, S. Keating, C. Hunter, L. Nuwaysir, and D. Schaeffer (2007). The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra. Mol. Cell. Proteomics 6, 1638-1655.
[58] Snyder A. Peter. Interpreting Protein Mass Spectra: A Comprehensive Resource. Oxford University Press 2000.
[59] Sunyaev S, Liska AJ, Golod A, Shevchenko A, Shevchenko A. MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Anal Chem.
2003 Mar 15;75(6):1307-15.
[60] Tabb, D.L., Saraf, A., Yates, J. R. 3rd. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem. Vol 75, 23 (2003), 6415-21.
[61] Tanner, S., H. Shu, A. Frank, M. Mumby, P. Pevzner, and V. Bafna (2005). Inspect: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal. Chem. 77, 4626-4639.
[62] Tanner S et al. Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nature Protocols 1, 67-72, 2006
[63] Stephen Tanner et al., Accurate annotation of peptide modifications through unrestrictive database search. Journal of Proteome Research 2008, 7, 170-181.
[64] Taylor JA, Johnson RS. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom. 1997;11(9):1067-75.
[65] Tsur D, Tanner S, Zandi E, Bafna V, Pevzner P. Identification of post-translational modifications by blind search of mass spectra. Nature Biotechnology. 2005 Dec 23(12):1562-67.
[66] Wan Y, Yang A., Chen T. PepHMM: a hidden Markov model based scoring function for mass spectrometry database search. Anal Chem. 2006 Jan 15;78(2):432-7.
[67] Yates JR 3rd, Eng JK, McCormack AL, Schieltz D. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database, Anal Chem, 1995, 67, 8, 1426-36.
[68] Zhang N, Aebersold R, Schwikowski B. ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data, Proteomics, Vol 2, 10 (2002), 1406-12.
[69] Zhang, W., Chait, B. T. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information, Anal Chem, Vol 72, 11 (2000), 2482-9.
[70] Zhang Z. De Novo Peptide Sequencing Based on a Divide-and-Conquer Algorithm and Peptide Tandem Spectrum Simulation, Anal Chem, Vol 76, 2004, 6374-83.
Appendix A
Residue Masses of Amino Acids

Amino Acid                    Three-letter  Single-letter  Monoisotopic  Average
                              Code          Code           Mass (Da)     Mass (Da)
Glycine                       Gly           G              57.02147      57.052
Alanine                       Ala           A              71.03712      71.079
Serine                        Ser           S              87.03203      87.078
Proline                       Pro           P              97.05277      97.117
Valine                        Val           V              99.06842      99.133
Threonine                     Thr           T              101.04768     101.105
Cysteine                      Cys           C              103.00919     103.144
Isoleucine                    Ile           I              113.08407     113.160
Leucine                       Leu           L              113.08407     113.160
Asparagine                    Asn           N              114.04293     114.104
Aspartic Acid                 Asp           D              115.02695     115.089
Glutamine                     Gln           Q              128.05858     128.131
Lysine                        Lys           K              128.09497     128.174
Glutamic Acid                 Glu           E              129.04260     129.116
Methionine                    Met           M              131.04049     131.198
Histidine                     His           H              137.05891     137.142
Phenylalanine                 Phe           F              147.06842     147.177
Arginine                      Arg           R              156.10112     156.188
Tyrosine                      Tyr           Y              163.06333     163.170
Tryptophan                    Trp           W              186.07932     186.21
Carboxyamidomethyl Cysteine                                160.03065     160.197
Carboxymethylcysteine                                      161.01466     161.181

Table A.1: Residue masses of the amino acids. The table lists the residue masses of the 20 common amino acids and of selected modified amino acids. All values are for amino acid residues, not free amino acids. To calculate the mass of a neutral peptide or protein, sum the residue masses and add the masses of the terminating groups (e.g. H at the N-terminus and OH at the C-terminus).
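The rule in the caption of Table A.1 can be sketched in a few lines of Python. This is only an illustration, not code from the thesis: the function name `neutral_peptide_mass` is hypothetical, and the atomic masses of hydrogen and oxygen used for the terminating groups are standard monoisotopic values assumed here, since the table itself does not list them.

```python
# Monoisotopic residue masses (Da), copied from Table A.1.
RESIDUE_MASS = {
    "G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
    "V": 99.06842, "T": 101.04768, "C": 103.00919, "I": 113.08407,
    "L": 113.08407, "N": 114.04293, "D": 115.02695, "Q": 128.05858,
    "K": 128.09497, "E": 129.04260, "M": 131.04049, "H": 137.05891,
    "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932,
}

# Assumption: standard monoisotopic atomic masses (not given in Table A.1).
H_MASS = 1.007825
O_MASS = 15.994915


def neutral_peptide_mass(sequence: str) -> float:
    """Sum the residue masses, then add H (N-terminus) and OH (C-terminus)."""
    total = 0.0
    for aa in sequence:
        if aa not in RESIDUE_MASS:
            raise ValueError(f"unknown residue: {aa!r}")
        total += RESIDUE_MASS[aa]
    # Terminating groups: H at the N-terminus plus OH at the C-terminus.
    return total + H_MASS + (O_MASS + H_MASS)


if __name__ == "__main__":
    # Dipeptide Gly-Ala: two residue masses plus one water-equivalent.
    print(neutral_peptide_mass("GA"))
```

Because isoleucine and leucine have identical residue masses in the table, a function like this returns the same value for any two peptides that differ only by I/L substitutions, so mass alone cannot tell those two residues apart.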
Appendix B
Abbreviations used in the thesis

Abbreviation  Meaning
AA            Amino Acid
BSA           Bovine Serum Albumin
CID           Collision Induced Dissociation
DNA           Deoxyribonucleic Acid
ESI           Electrospray Ionization
HPLC          High Performance Liquid Chromatography
HUPO          Human Proteome Organisation
ISB           Institute for Systems Biology
LC            Liquid Chromatography
LTQ           Linear Trap Quadrupole
MALDI         Matrix-Assisted Laser Desorption/Ionization
MS            Mass Spectrometry
MS2           = MS/MS (Tandem Mass Spectrometry)
MS/MS         Tandem Mass Spectrometry
m/z           mass/charge ratio
OPD           Open Proteomics Database
PTM           Post-Translational Modification
ROC           Receiver Operating Characteristic
TOF           Time-of-Flight

Table B.1: Abbreviations used in the thesis

Appendix C
De Novo Peptide Sequencing Programs

DeNovoX     http://www.thermo.com/
Lutefisk    http://www.hairyfatguy.com/lutefisk/
MSNovo      http://msms.cmb.usc.edu/supplementary/msnovo/
NovoHMM     http://people.inf.ethz.ch/befische/proteomics/
PEAKS       http://www.bioinfor.com/products/peaks/index.php
PepNovo     http://proteomics.bioprojects.org/Software/PepNovo.html
Sub-Denovo  http://msms.cmb.usc.edu/sub/

Appendix D
Database Search Programs

InsPecT                         http://proteomics.bioprojects.org/Software/Inspect.html
Mascot                          http://www.matrixscience.com/search_form_select.html
OMSSA                           http://pubchem.ncbi.nlm.nih.gov/omssa/
PepHMM                          http://msms.cmb.usc.edu/PepHMM/PepHMM.htm
PeptideSearch                   http://www.unb.br/cbsp/paginiciais/peptsrcfingerprint.htm
ProbID                          http://tools.proteomecenter.org/wiki/index.php?title=Software:ProbID
ProFound                        http://prowl.rockefeller.edu/prowl-cgi/profound.exe
ProLuCID                        http://fields.scripps.edu/prolucid/index.html
ProteinProspector               http://prospector.ucsf.edu/
SEQUEST                         Commercial software, implemented in Bioworks software
Trans-Proteomic Pipeline (TPP)  http://tools.proteomecenter.org/wiki/index.php?title=Main_Page
Xproteo                         http://xproteo.com:2698/
X!Tandem                        Source code available through http://www.thegpm.org/TANDEM/index.html
X!Hunter                        http://gpm.rockefeller.edu/tandem/thegpm_hunter.html

Appendix E
Post-Translational Modifications Database

dbPTM       http://dbptm.mbc.nctu.edu.tw/
Delta Mass  http://www.abrf.org/index.cfm/dm.home
RESID       http://www.ebi.ac.uk/RESID/
UNIMOD      http://www.unimod.org/
Abstract
Tandem mass spectrometry (MS/MS) has become an important experimental method for high-throughput, proteomics-based biological discovery. The most common use of MS/MS in biological applications is peptide sequencing. In this thesis, we focus on algorithms for MS/MS peptide identification and spectral alignment. We carry out two studies: (1) We have developed a de novo sequencing algorithm called MSNovo that integrates a new probabilistic scoring function with a mass-array-based dynamic programming algorithm. MSNovo works on MS data generated by both LCQ and LTQ mass spectrometers and interprets singly, doubly and triply charged ions. In tests on several datasets, MSNovo performed better than previous algorithms. (2) We have developed spectrum-peptide and spectrum-spectrum alignment algorithms, collectively called MSPEP. MSPEP identifies post-translational modifications through the spectrum-peptide alignment algorithm and reveals the relationships among unknown peptides through the spectrum-spectrum alignment algorithm.
Asset Metadata
Creator
Mo, Lijuan (author)
Core Title
De novo peptide sequencing and spectral alignment algorithm via tandem mass spectrometry
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
10/06/2009
Defense Date
06/02/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
de novo sequencing,dynamic programming,mass spectrometry,OAI-PMH Harvest,peptide sequencing,spectral alignment,tandem mass spectrometry
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Chen, Ting (committee chair), Alber, Frank (committee member), Zandi, Ebrahim (committee member)
Creator Email
lijuanmo@gmail.com,lijuanmo@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2651
Unique identifier
UC1465910
Identifier
etd-Mo-2918 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-266475 (legacy record id),usctheses-m2651 (legacy record id)
Legacy Identifier
etd-Mo-2918.pdf
Dmrecord
266475
Document Type
Dissertation
Rights
Mo, Lijuan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu