Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Solvation as a driving force for peptide docking to the major histocompatibility complex (MHC) class II molecules
(USC Thesis Other)
Solvation as a driving force for peptide docking to the major histocompatibility complex (MHC) class II molecules
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
1
Solvation as a Driving Force for Peptide Docking to the Major Histocompatibility Complex
(MHC) Class II Molecules
By
Jason B. Giles
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(MOLECULAR PHARMACOLOGY AND TOXICOLOGY)
August 2018
2
Table of Contents
ABSTRACT .............................................................................................................................................. 3
Chapter 1: Introduction ........................................................................................................................... 5
Chapter 2: Review of Current Peptide Binding Prediction Programs.................................................. 9
2.1 Background ...................................................................................................................................... 9
2.2 Programs ........................................................................................................................................ 13
2.2.1 NetMHCpan-4.0
10
................................................................................................................... 13
2.2.2 SVM/SVR
16
............................................................................................................................. 16
2.2.3 NetMHCIIpan-3.2
13
................................................................................................................ 20
2.2.4 DynaPred
15
.............................................................................................................................. 22
2.2.5 GradDock
11
............................................................................................................................. 28
2.2.6 DINC
12
..................................................................................................................................... 32
2.3 Discussion ...................................................................................................................................... 35
Chapter 3: Influence of Solvation on Peptide Docking to MHC class II ............................................ 36
3.1 Background .................................................................................................................................... 36
3.2 Methods .......................................................................................................................................... 39
3.3 Results ............................................................................................................................................ 40
3.3.1 Replication of Water Network .................................................................................................. 40
3.3.2 Completing Unresolved Crystal Structure Water Network ....................................................... 43
3.3.3 Water Replacement .................................................................................................................. 44
3.4 Discussion ...................................................................................................................................... 47
Chapter 4: Docking of Peptides to MHC class II ................................................................................. 50
4.1 Introduction ................................................................................................................................... 50
4.2 Methods .......................................................................................................................................... 50
4.3 Results ............................................................................................................................................ 54
4.3.1 Peptide Growth in EXSAN ....................................................................................................... 54
4.3.2 Validation of EXSAN ................................................................................................................ 55
4.4 Discussion ...................................................................................................................................... 59
4.4.1 Prediction of Crystal Structures ............................................................................................... 59
References................................................................................................................................................ 63
3
ABSTRACT
Major histocompatibility complex (MHC) class I and II molecules play a vital role in the
cellular immune system. MHC molecules present peptide fragments (antigens) derived from
pathogens and display them at the cell surface for T-cell recognition. MHC genes are highly
polymorphic and polygenic, and there are a large number of peptide binders for any single MHC
allele, which makes experimental determination of all binders intractable. Computational
prediction methods have been used to address this problem. Structure-based algorithms calculate
the conformation of a peptide based upon interactions between the MHC and the bound peptide.
Limited work has been conducted on sequence-based MHC class II prediction methods.
The investigation of algorithms for the prediction of peptides leads to the conclusion that there
is a lack of developed programs for sequence-based prediction of binders to MHC class II
molecules. Additionally, those programs do not accurately replicate in vivo conditions as they fail
to include a water network at the protein-peptide interface. Interfacial water between protein-
peptide complexes play an active role in the binding affinity of peptides. These findings lead to
the development of a sequence-based model for the docking of peptides to MHC class II. Here, we
present the development and validation of a structure-based program, EXSAN (EXplicit Solvent
ANchored Docking), which allows explicit inclusion of individual water molecules in the docking
of protein-peptide complexes and demonstrates the importance of solvation in the prediction of
peptide binding at these interfaces. The program shows strong prediction power for class II
molecules. The ability to predict binders to MHC molecules has far reaching implications such as
aiding in vaccine development.
Utilization of EXSAN for MHC class II peptide binding predictions resulted in a number of
insights, including the influence of water. The lack of solvation dramatically reduced the prediction
4
ability of EXSAN. Solvated docking leads to improved prediction, as EXSAN is able to reproduce
the solvation pattern at the interface accurately, as well as calculate interactions such as water-
mediated hydrogen bonding, displacement of water, and water-mediated salt bridges present in
protein-peptide interfaces.
Additionally, we were able to show what is believed to be a novel phenomenon. EXSAN
showed that peptides preferentially bind in a conformation where an oxygen of a backbone
carbonyl replaces a water molecule present in the unbound protein. That is, during peptide binding
an oxygen will be oriented in such a way as to be in near vicinity of displaced water molecule, so
as to maintain the hydrogen bond network present prior to the ligand binding. MHC class II
molecules exhibited this phenomenon.
5
Chapter 1: Introduction
The major histocompatibility complex (MHC) plays a vital role in the cellular immune system.
MHC molecules present peptide fragments (antigens) derived from exogenous pathogens and
endogenous sources, and display them at the cell surface for T-cell recognition. There are two
types of MHC molecules, class I and class II, which share common features, as well as exhibit
unique traits.
MHC class I molecules are present on nearly all somatic cells
1
. Class I molecules bind peptide
fragments and present them for recognition by CD8 T-cells
2
. These peptides are derived from
proteins degraded in the cytosol. The proteins include those from viruses and bacteria, as well as
endogenous proteins
3
. The degraded proteins are transported into the endoplasmic reticulum and
directly loaded onto MHC class I molecules. The loaded MHC I molecules are transported to the
cell surface for T-cell presentation. In contrast, MHC class II molecules are expressed on the
surface of antigen-presenting cells (APCs), which include B-cells, macrophages and dendritic
cells. Pathogenic proteins are endocytosed and degraded to peptides in acidic endosomes, where
they fuse with vesicles containing MHC class II molecules. Once binding of the antigen occurs,
the MHC class II molecule is transported to the cell surface
4
. At the surface of APCs, the peptide-
bound MHC class II molecule is recognized by CD4 T-cells. Interaction between the MHC-peptide
complex and the T-cell stimulates the regulatory function of CD4 cells
5
.
The ability to recognize and bind a diverse set of peptides is vital for initiating an immune
response. This “promiscuous” binding is key for combating the large number of possible pathogens
the immune system may encounter. In addition to the binding features of individual alleles, MHC
genes are highly polymorphic and polygenic
3
. The high polymorphism ensures diversity in gene
expression within the population while high polygeny ensures individuals produce different MHC
6
molecules. These two traits aid in the ability to bind and present all the possible antigens an MHC
molecule may encounter. However, the promiscuous peptide binding and polymorphic nature of
MHC class I and II create a challenge for determining which peptide sequences will bind to a
specific MHC allele.
The subunits of MHC class I and II molecules with bound peptides are shown in Figure 1. A
defining feature of the MHC class I binding site are clusters of conserved residues that bind the N-
and C-termini of the peptide. A cluster of tyrosine residues common to all MHC class I molecules
forms hydrogen bonds to the N-terminus of the bound peptide, while a second cluster of residues
forms hydrogen bonds and ionic interactions with the C terminus
6
. This causes the peptide termini
to be buried deep in the class I binding groove (Figure 1B), and limits the length of class I-binding
peptides to 8-10 amino acids. This structural limitation has contributed to some successes in
computational prediction for MHC class I
7
. In contrast, MHC class II features an open-ended
binding groove (Figure 1D) that allows the C- and N-termini to extend beyond the binding groove
8
.
This allows for variation in the length of the peptide antigen bound to MHC class II and potentially
leads to multiple binding motifs for a single antigen. Due to this promiscuity in both length and
sequence of bound antigens, MHC class II proteins are a particularly challenging target for
sequence and structure prediction.
Current computational methods can be divided into two categories: sequence-based methods,
which use patterns in peptide sequences with known binding affinities to a particular MHC
allotype to predict sequences of binding peptides; and structure-based approaches, such as protein
threading, rigid docking, and flexible docking, which consider the interactions between the protein
and peptide, and result in a predicted peptide conformation. The sequence-based method is rapid,
but has the limitation of being unable to make predictions for allotypes that are significantly
7
different from experimentally characterized peptide-MHC complexes. Structure-based algorithms
utilize the MHC protein as a target and calculate the conformation of a bound peptide based upon
parameterization of the interactions between the MHC and peptide.
Figure 1. Structural representation of (A, B) MHC class I (PDB ID 5TEZ) and (C, D) MHC class
II (PDB ID 5LAX) proteins with a bound peptide (red). (A, C) View looking along the peptide
binding groove. (B, D). Rotation of 90° to show the peptide with termini buried in the binding
groove of MHC class I (B) and extending from an open-ended binding groove of MHC class II
(D). The protein chains forming the MHC molecules are shown in blue (α chain) and yellow (β
chain; β2-microglobulin in MHC class I).
8
In this thesis, current programs which predict peptide binders to MHC molecules are discussed
in Chapter 2. Looking at both sequence- and structure-based methods for prediction of peptide
binders to MHC class I and class II. This chapter, recently published in Computer Modelling: New
Technologies, Applications and Techniques
9
provides overview of various methods and algorithms
that attempt to predict peptides binder to MHC molecules. The review examines at least one
structure-based and one sequence-based program for class I and well as one structure-based and
one sequence-based method for class II. This is not an exhaustive review as there are more
programs that have been developed for this purpose. But all remaining programs fall under one of
the four categories mentioned. The analysis of sequence- and structure-based methods for
predicting peptide binders to MHC class I and II molecules showed a lack of structure-based
prediction methods for MHC class II molecules. This conclusion lead to the development of
EXSAN.
Water networks and reproducibility of experiment water networks via the EXSAN protocol is
reported in Chapter 3. The ability to replicate in vivo water networks is the foundation of the
EXSAN protocol. The elucidation of the importance of water in the docking of peptides to MHC
class II and ability to accurately predict the binding conformation of peptides with proper
weighting of solvation is discussed. The discovery of a phenomenon involving water molecules is
presented in this chapter as well. Accurate water networks helps resolve the binding conformation
of peptides docked to MHC class II proteins.
A brief description of EXSAN as well as the validation of the program is presented in Chapter
4. To quantify the prediction ability the predicted peptide conformation was compared to the X-
ray using root-mean-squared-deviation (RMSD). The average RMSD was 2.9 Å . Given the
complicated binding nature of MHC class II molecules, this demonstrates the strong prediction
9
ability of EXSAN. Additional predictions show a decreased accuracy when explicit water
molecules are not included in energy calculations of peptide binders. Accurate prediction can aid
in vaccine development, the understanding of autoimmune diseases and therapeutic
peptidomimetic drug development.
Chapter 2: Review of Current Peptide Binding Prediction Programs
2.1 Background
The goal of this chapter is to provide a detailed description of current programs for prediction
of peptide binding to MHC molecules. The programs discussed in the chapter (Table 1) were
chosen so as to include at least one of each type of prediction: a sequence-based method for MHC
class I (NetMHCpan
10
), two structure-based class I methods (GradDock
11
, DINC
12
), a sequence-
based method for MHC class II (NetMHCIIpan
13
), and a structure-based class II method
(EXSAN
14
). In addition, two hybrid approaches that incorporate both sequence- and structure-
based methods for peptide binding to MHC class I are discussed (DynaPred
15
, SVM/SVR
16
). We
note that there are several other programs that fall into these categories (Fig. 2). For the seven
programs described in the chapter, the key features of each are shown in Table 1, and results of
validation of the programs are summarized in Table 2.
10
Table 1. Summary of programs for sequence- and structure-based prediction of peptides binding
to MHC class I and class II
Programs MHC
I/II
Alleles Sequence-
Based
Structure-
Based
Solvation Output
a
Computational
Method
NetMHCpan-4.0 I Pan-
specific
Yes No No Binary /
Affinity
Machine learning
SVM/SVR I Alleles
with
sufficient
binding
data
Yes Rigid No Binary /
Affinity
Machine learning,
support vector
machine, support
vector regression
NetMHCIIpan-3.2
II Pan-
specific
Yes No No Binary Machine learning
DynaPred I HLA-
A*02:01
Yes Semi-rigid Explicit
SPS water
MD
Binary /
Affinity
Machine learning,
docking,
GROMACS3.2,
OPLSAA/L
GradDock I Pan-
specific
No Fragment-
based
anchored
docking
No Affinity Autodock-like
grid-box,
minimization,
Monte Carlo,
Gromos54a7,
Rosetta
DINC I Pan-
specific
No Fragment-
based
anchored
docking
Implicit
Solvation
Affinity Autodock
parameters
EXSAN I, II Pan-
specific
No Fragment-
based
anchored
docking
Explicit
water
WATGEN
Affinity Docking with
empirical
parameters
a
Binary = binder vs. non-binder; Affinity = prediction of binding affinity
11
Fig 2. Graphical representation of programs for prediction of peptide binding to MHC, including
sequence-based methods for MHC class I (NetMHCpan
10
, MHCcons
17
, NetMHC
18
,
SYFPEITHI
19
), structure-based methods for MHC class I (GradDock
11
, DINC
12
, Bordner
20
,
PePSSI
7
), sequence-based methods for MHC class II (NetMHCIIpan
13
, NN-align
21
, IEDB
Consensus
22
, MHC2SKpan
23
, MHC2MIL
24
), structure-based methods for MHC class II
(EpiDock
25
, MHC-THREAD
26
), structure-based methods for MHC class I and II (pDOCK
27
,
EXSAN
14
), and hybrid approaches using both sequence- and structure-based methods for MHC
class I (DynaPred
15
, SVM/SVR
16
). Image created on Creately (www.creately.com).
12
Table 2. Validations of algorithms for sequence- and structure-based prediction of peptides binding
to MHC class I and class II
Programs
MHC
I/II
Test Set Sequence Outcome Structure Outcome
NetMHCpan-4.0 I 85,217 peptides
(binders and
non-binders)
90% of test cases
predict true peptide
binder within top
0.5% of source
protein peptides
Not applicable
SVM & SVR I 15 alleles with
sufficient
binding data
Accuracy
a
(TB +
TN)/(TB + FB + FN
+ TN) = 88.17%
Correlation coefficient
between predicted and
experimental IC 50 = 0.78
NetMHCIIpan-3.2 II 36 HLA-DR,
27 HLA-DP,
9 HLA-DQ,
8 murine MHC
alleles
Area under the
receiver operating
characteristics curve
(AUC) 0.858
Not applicable
DynaPred I HLA-A*02:01 Area under the
receiver operating
characteristics curve
(AUC) >0.85
Accuracy >70%
Average backbone
RMSD = 1.53 Å
GradDock I 107 complexes;
Pan-specific
Not applicable Average backbone
RMSD = 1.18 Å (self-
docking): 1.2 Å (cross-
docking)
DINC I 25 complexes;
Pan-specific
Not applicable Average all-atom
RMSD = 1.92 Å
EXSAN II 8 complexes;
Pan-specific
Not applicable Average backbone
RMSD = 2.1 Å
a
TB = True binder; TN = True non-binder; FB = False binder; FN = False non-binder
13
2.2 Programs
2.2.1 NetMHCpan-4.0
10
2.2.1.1 Introduction
NetMHCpan-4.0 is a sequence-based method for prediction of peptide binders to MHC class I.
It is a pan-specific prediction model, which leverages information from MHC alleles with
sufficient experimental binding data to predict binders to other alleles that lack experimental
information. Although this program was validated on MHC class I, it could be used for MHC class
II and other protein domains.
2.2.1.2 Protocol
NetMHCpan is a well-known and continually developed
10,28-31
sequence-based model for
prediction of binders to MHC class I that is currently in its fourth iteration: NetMHCpan-4.0. A
requirement of sequence-based prediction models is a database of experimentally determined
peptide binders that contains a sufficient number of unique peptides for training of the algorithm.
As the name implies, NetMHCpan is a pan-specific sequence-based prediction method, in which
information from alleles with sufficient peptide binding data is abstracted and used for prediction
of peptide binding to alleles with insufficient binding information. This ability to make predictions
across alleles is a vital part of the program.
The Immune Epitope Database (IEDB)
32
contains 764 MHC alleles, and is the largest public
database of its type. However, only 55 MHC class I alleles from human or mouse contain at least
50 peptide binding entries, which is the self-imposed minimum number of entries for training in
NetMHCpan. Therefore, the ability to use binding information across alleles is important. The
initial training set used for NetMHCpan-1.0 contained 37,384 peptides covering 24 human
14
leukocyte antigen (HLA)-A alleles and 18 HLA-B alleles
31
. The original dataset was randomly
split into five subgroups, with five individual neural networks each trained using 4/5 of the data to
update the network weights and 1/5 used to decide when to terminate the training (i.e. five-fold
cross-validation).
The third version of NetMHCpan implemented the ability to predict peptide binders that were
not solely 9 amino acids in length. Certain alleles have preference for 8-residue peptides
33
, and for
HLA-A*01:01 >35% of experimentally determined binders are longer than 9 amino acids
34
. The
dataset used for training consisted of 186,684 peptide-MHC binding affinities covering 172 MHC
molecules from human, mouse, primates, cattle, and swine
30
. Networks were trained with five-fold
cross-validation. For comparison, a network ensemble was trained using only the subset of 9-
residue peptides from the five data partitions described above. This integration of binding data for
peptides of 8, 10, or 11 amino acids significantly improved the prediction accuracy, based on
measurement of the area under a ROC curve (AUC-ROC)
30
.
NetMHCpan-4.0 has an expanded knowledge base of the neural network through inclusion of
mass spectrometry (MS) peptidomics data
10
. Peptidomics is an emerging field of qualitative and
quantitative analysis of all peptides in a biological sample
35
. In an immunological context,
peptidomics provides insights into which peptides may potentially be presented by an MHC
molecule by determining the pool of peptide fragments present. Although available data are still
limited, with 55 MHC class I molecules being characterized by MS peptidome data at the time of
the study
10
, this information can be used in conjunction with binding affinity data. This
combination resulted in 85,217 entries on which to train the neural networks. The neural network
ensemble was trained as described for NetMHCpan-3.0
30
, using a 5-fold nested cross-validation.
2.2.1.3 Results
15
The performance of NetMHCpan-4.0 was assessed on two independent datasets. The source
protein sequence of the peptide binder (epitope) was identified. All overlapping 8-14 amino acid
sequences were annotated as negative, except for the epitope, and a Frank value was calculated, as
the ratio of the number of peptide sequences with a predicted higher score than the epitope over
the number of peptide sequences in the source protein. This value was used to construct sensitivity
curves. The results demonstrated the increased predictive power of integrating peptidomic data
into the training data of NetMHCpan. Across all thresholds, the combined dataset showed a higher
sensitivity for discerning true peptide binders. For instance, in NetMHCpan-4.0 there was a 15%
increase in sensitivity at a Frank value of 0.5%. These numbers indicate that an epitope will have
a prediction score within the top 0.5% of its source protein peptides in 90% of cases using the
combined (peptidomic and binding affinity) model, compared with only 75% using NetMHCpan-
3.0. This shows that NetMHCpan-4.0 has improved performance and that binding peptides are
predicted with high specificity.
2.2.1.4 Advantages
A major problem of many sequence-based prediction methods is lack of universality. The
training of neural networks requires substantial experimental binding data for accurate prediction
for a specific allele. In most cases, if there are insufficient data for a particular allele, binding
predictions are not possible for that allele. NetMHCpan overcomes this barrier by applying binding
data from common MHC alleles to similar but lesser studied alleles. NetMHCpan-4.0 is also the
first program of its kind to integrate MS data into the prediction method. The more information
given to a machine learning algorithm, the more accurate the predictions become. Inclusion of this
type of data has increased the precision and accuracy of NetMHCpan-4.0.
Continual validation of a program, while often overlooked, is a clear advantage for any
16
program. The first version of NetMHCpan was published in 2007. Over a decade of refinement,
improvement, additional binding affinity entries, and algorithm implementation sets this program
apart from others in the field. A final advantage of NetMHCpan is its universality, extending to
MHC class I and II and other protein domains. All that is needed is sufficient peptide binding
dataset for the target. There is no aspect of NetMHCpan that makes it specific to MHC class I.
2.2.1.5 Limitations
Sequence-based prediction methods do not use physicochemical properties or molecular
interactions in the prediction model. Hydrogen bonds, hydrophobic interactions, solvation, van der
Waals, and the formation of salt bridges all play a pivotal role in binding of peptides to proteins.
If the 3-D structure of the peptide is not predicted, the calculations cannot identify these
interactions. Sequence-based methods do not attempt to make binding affinity predictions,
although this can be done by comparing differences in two peptide sequences, and identifying the
affinity change when a single substitution is made. However, the resulting change in affinity is
often a combination of the substitution at that residue and the additional influence the substitution
has on neighboring residues and the overall peptide conformation. That is, a substitution at residue
N can affect the interaction at residue N+1. This complexity is not taken into account in sequence-
based prediction methods.
2.2.2 SVM/SVR
16
2.2.2.1 Introduction
A sequence-based prediction algorithm that utilizes two machine learning algorithms has been
developed for prediction of peptide binding to MHC class I
15
. The program uses a support vector
machine (SVM) algorithm for classification of peptides as binders or non-binders. Then, using
17
support vector regression (SVR), those which are predicted to be binders are assigned a binding
affinity calculated using the divided physicochemical property score (DPPS)
36
.
2.2.2.2 Protocol
All sequence-based prediction models require a training set of known binders and non-binders
to train the program. The dataset used in SVM/SVR was obtained from the IEDB, with a number
of criteria to filter out entries from the database. The training set only included 9-amino acid
peptide binders. Additionally, alleles with insufficient information for model building, peptides
with no qualitative or qualitative information, peptides with the same sequence bound to the same
allele had entries in the database as a binder and non-binder, and peptides with non-natural amino
acids were all excluded. This gave a list of viable alleles to build models. The top 15 alleles are
presented in terms of peptide entries
15
, but there is no mention of how many alleles met the criteria
for model building, and the program website is not currently available to establish this information.
For binary training (non-binder or binder), peptides need to be represented in a reliable
classification method. The frequency of amino acids at appropriate positions in a peptide was
calculated to extract features of occurrence of amino acids in binders and non-binders, which
enables easier classification. Instead of a standard calculation of amino acid frequency by position
in a peptide, a modified frequency calculation was used. Peptides were represented as a vector
containing information on substitution of amino acids at particular positions by other amino acids,
using BLOSUM62 encoding
16
. The BLOSUM62 score contains prior knowledge on which amino
acids are similar or dissimilar to each other in distantly related proteins.
For training and testing of the prediction algorithm, the entire dataset of binders and non-binders
were split: 70% of the dataset was used for training, and 30% was the test set. This division is
typical in algorithms that need to be trained on binders versus non-binders. For SVR, the algorithm
18
that discerns binding affinity of peptides classified as binders, implementation of physicochemical
properties was necessary. In Tian et al., a total of 119 such properties were analyzed
36
, and the 10
most important were selected for the SVR algorithm, based on those with the highest Kappa
statistic
16
. These 10 properties were used to calculate a binding affinity, expressed as the half-
maximal inhibitory concentration (IC 50). Values of IC 50 ranging from sub-micromolar (taken as 0
) to 50,000 were scaled as 1-logIC50 to give an affinity range of 0 (weak binder) to 1 (strong
binder).
2.2.2.3 Results
Two results are obtained from the SVM/SVR prediction. First, the prediction of whether the
peptide sequence is a binder. Three models were used in the validation process. A model used to
compare against existing prediction programs is described here. For the 15 MHC class I alleles,
the accuracy of prediction ranged from 80.65-92.11%. On average, the program was able to
correctly discriminate between binders and non-binders 88.17% of the time for all alleles. The
second benchmark is for the regression model. Pearson correlation coefficients were used to
compare predicted and experimental binding data for different alleles. The minimum and
maximum correlation coefficients were 0.68 and 0.94, respectively, and the overall correlation
coefficient was 0.78 for all predicted and experimental binding affinities.
The SVM/SVR method performed similarly to two other MHC class I sequence-based
prediction models: NetMHCpan
28
and SMMPMBEC
37
, and outperformed one of the two
algorithms for 9 of the 15 alleles tested. However, SVM/SVR had a better correlation coefficient
than both programs for only one allele. The average correlation coefficient was greater than that
for SMMPMBEC, but less than the value of 0.81 for NetMHCpan. In a final analysis, predicted
binding data for a total of eight programs were examined for two MHC class I alleles, HLA-
19
B*07:02 and HLA-A*02:01. The programs included NetMHCpan
28
, SMMPMBEC
37
,
NetMHCcons
17
and IEDB Consensus
38
. Most of the programs performed similarly for prediction
of IC 50, and the SVM/SVR method was consistent with the other prediction models.
2.2.2.4 Advantages
Sequence-based models for prediction of peptide binding to proteins have inherent limitations
when compared to structure-based models. One such limitation is that these models typically
classify peptides as binders or non-binders. After binary classification, the SVM/SVR method can
calculate a relative binding affinity based on 10 physicochemical properties of the peptide. Without
inclusion of these 10 properties, the predictions were not as accurate. The development of the
program to include two machine learning methods is a novel technique that can improve some
allele prediction results compared to other sequence-based prediction models.
2.2.2.5 Limitations
The development of sequence-based prediction models requires a sufficiently large dataset of
peptide binding information to train the program to discern binders versus non-binders. Due to the
need for experimentally binding data, many alleles lack enough information for training purposes.
Only 15 alleles for MHC class I are included in SVM/SVR, which limits the prediction power of
the program. The dataset used for training was further reduced to peptides that are only 9 amino
acids in length, due to insufficient information on peptides of other lengths. However, the binding
site of MHC class I can accommodate peptides that have eight or ten residues. In fact, for HLA-
A*01:01, nearly one-third of all peptides presented are longer than nine amino acids
39
. Thus, the
limitation to 9 amino acids reduces the scope of the SVM/SVR method.
The incorporation of physicochemical properties in binding affinity predictions is an advantage
for sequence-based models. Use of the top ten physiochemical properties chosen from an original
20
119 properties in an earlier publication
36
for predictions of binding affinity is valuable for these
types of models. However, the calculation of the ten most important properties varies from allele
to allele, since different alleles favor different properties (that is, bind to peptides with different
sequences). This limits the program to use for only alleles with sufficient experimental binding
data. The lack of explicit solvation is a further limitation, since water plays a vital role in binding
of peptides to protein interfaces. A solvation parameter may be a good addition to the
physicochemical parameters to improve the prediction accuracy.
2.2.3 NetMHCIIpan-3.2
13
2.2.3.1 Introduction
NetMHCIIpan is a sequence-based artificial neural network for prediction of peptide binders to
MHC class II. The program integrates a classification model (binder/non-binders) with an
additional binding ranking step.
2.2.3.2 Protocol
NetMHCIIpan is a pan-specific prediction method capable of predicting peptides that bind to
all MHC class II alleles. The method is based on an artificial neural network trained on 134,281
quantitative peptide-binding measurements from human MHC alleles (HLA) covering 36 HLA-
DR, 27 HLA-DP, 9 HLA-DQ and 8 murine MHC alleles. The dataset was obtained from the IEDB
and randomly split into five subgroups for five-fold cross-validation, as described above for
NetMHCpan. The most recent version of the program is NetMHCIIpan-3.2. The training method
follows previously published versions and is described in detail by Andreatta et al
40
. The pan-
specific method can make predictions for any MHC molecule with a known protein sequence,
based on information on MHC alleles with sufficient experimental binding data. NetMHCIIpan-
21
3.2 obtains a sequence of residues in the protein that are considered important for peptide binding,
and then uses these sequences to align to alleles that do not have a satisfactory amount of binding
data, based on the best match sequence from well-studied MHC alleles.
2.2.3.3 Results
The accuracy of the program was evaluated using the AUC-ROC. This analysis used a protein
sequence containing a fragment (peptide) known to bind to an MHC molecule, with division of
the protein into overlapping fragments of length equal to the peptide antigen. NetMHCIIpan-3.2
then calculates the rank of the true peptide binder in terms of binding affinity relative to all other
peptide fragments that could hypothetically be presented. The average AUC for all MHC class II
alleles tested was 0.858, where a value of 1 is perfect prediction; that is the true peptide antigen is
the top ranked for the allele of interest, and an AUC of 0.5 indicates no prediction ability.
2.2.3.4 Advantages
The biggest problem with sequence-based prediction methods is the lack of universality. The
training of neural networks requires a substantial number of experimentally determined peptide
binders, but a large portion of MHC class II alleles do not have enough peptide binding data for in
silico predictions. NetMHCIIpan program circumvents this barrier by applying binding data from
MHC alleles with sufficient data to those alleles lacking data. The ongoing development of
NetMHCIIpan is also an advantage. The first version was published in 2008,
41
and has been
improved over 10 years with added binding peptide datasets and refinement of the algorithm.
2.2.3.5 Limitations
The limitations of NetMHCIIpan are similar to those described for NetMHCpan in terms of
predicting the actual interactions of the peptide with the MHC molecule. An additional limitation
22
of sequence-based prediction methods is output. These programs often classify the prediction
power of the algorithm by taking the entire protein sequence, dividing it into lengths equivalent to
those for peptides that bind to MHC, and then ranking all sequences based upon the binding
likelihood. A strong prediction program will have the true peptide binder ranked in the top 20%
across all alleles. Thus, there are many peptide sequences from the source protein that the program
predicts will bind to the MHC molecule; there are dozens, if not hundreds of false positives before
the true peptide binder. Therefore, the result is not a single binding peptide, but a narrowing of
options for experimental determination. This is true for NetMHCIIpan-3.2, where the average
AUC is 0.858. This indicates that the true peptide binder is ranked higher than 85.8% of all
potential peptides from a source protein sequence. Conversely, there are still 14.2% of peptide
sequences which, according to the program, are more likely to be the true peptide binder.
2.2.4 DynaPred
15
2.2.4.1 Introduction
DynaPred is a structure- and sequence-based method for prediction of MHC class I-binding
peptides that was trained on human-specific MHC molecules (human leukocyte antigens: HLA).
The method was validated on the HLA-A*02:01 allele. DynaPred utilizes a combination of
sequence-based machine learning and structure prediction via molecular docking, based on
molecular dynamics (MD) simulations performed using GROMACS3.2
42
and the OPLSAA/L
force field with explicit SPC water.
2.2.4.2 Protocol
A two-step protocol is used, which combines structure- and sequence-based predictions. The
strategy is to approximate the binding energy of all 20 amino acids in each of the nine binding
23
pockets of an MHC class I binding groove and extract energetic information obtained from MD
simulations. This information is used for training of a sequence-based prediction model. The
structural information from the simulations is used for constructing peptide-protein complexes of
predicted binders. The algorithm is based on a single main assumption: the total binding affinity
of a peptide can be approximated as the sum of the binding affinities of its individual amino acids,
neglecting the effect of neighboring residues
15
. This assumption allows the program to simulate
each amino acid individually in each binding pocket. Using this protocol reduces the total
computational time.
Initial conformations of the individual residues bound to the MHC protein are constructed from
crystal structures. To stabilize the peptide conformations, the program extends the single residues
to peptide dimers and trimers by adding a glycine residue on both sides (for terminating residues
only on the non-terminating side). For side chains for which no bound conformation is available,
existing residues are mutated to the corresponding amino acid. MD simulations are performed on
the bound complexes and on the individual molecules in solution. Energy terms reflecting the
binding properties of the amino acids are calculated from the simulation results and subsequently
used for construction of a binding-free-energy-based scoring matrix (BFESM). This matrix
contains energy terms for each residue in each binding pocket and forms the basis for the
construction of two prediction models.
The DynaPred method can be summarized as follows: (1) For a given MHC allele, the
compatibility of each amino acid in each of the nine binding pockets is examined by MD
simulations. (2) A BFESM is produced by extracting energy terms important for binding from the
simulations. (3) The position-based bound conformations are extracted from the simulations for
each amino acid type and saved in a database. (4) Experimental binding data together with the
24
BFESM is used in the training process to generate the prediction models. (5) Final prediction is a
two-step process: first, the query sequence is classified as a binder or non-binder; and then the
bound conformation of a predicted binder is generated.
To calculate the energy contributions of all residues in each of the 9 specified pockets, structures
are generated that are referred to as pseudo-peptides. The amino acid to be investigated (called the
pivot residue) is embedded inside a short pseudo-peptide, either a di- or tripeptide. A dipeptide is
used when the pivot residue is the C- or N-termini residue; otherwise a tripeptide is generated.
Residues that are not the pivot are generated as glycines. The initial backbone conformations
(torsional angles) on the pseudo-peptide were obtained from PDB files: 1AKJ, 1DUZ, 1HHG, and
1QRN. Amino acids with no experimental data are mutated using the program SCWRL3.0
43
. To
approximate the constraining force of the remaining fragments of the 9 amino acid peptide on the
pivot residue, the program applies position restraints to certain atoms of the peptide during the
simulations. The restraints were chosen such that the pseudo-peptide backbone is still able to move
within a few Angströms (Å) to span the space occupied by the different backbones in the above
PDB files and to allow free rotation of the pivot residue.
Binding affinity is calculated using the BFESM, a matrix of dimension 20 x (number of pockets)
x (number of features). Each entry represents one feature of a particular amino acid in a particular
binding pocket. These features include electrostatic interactions, hydrophobic contributions,
torsional deviation of side chain atoms in the bound state compared to the unbound conformation,
and loss in conformational energy, among other parameters from the linear energy model
44
. The
BFESM is used to generate feature vectors for each sequence in the training set; all vectors together
produce a feature matrix for model generation and prediction.
Two feature matrices were constructed from the BFESM: a local feature matrix, which uses all
25
the residue and binding pocket positional information from the scoring matrix and is thus called
‘the position-dependent feature set’ (DynaPred
POS
), and a global feature matrix, for which
information from the BFESM is reduced, assuming that the positional information can be
neglected and that the same feature can be summed over all residues to give one value for each
feature for each peptide. This model, the ‘position-independent feature set’ (DynaPred), can best
be compared to the global features used in Doytchinova et al
45
and Bordner and Abagyan
46
.
As the emphasis of this program is the combination of a sequence and structure-based prediction
model, a training set is necessary for sequence-based predictions. In order for the program to
discern binders from non-binders, it must be trained on a sufficiently large set of experimentally
determined binding peptides. Two publicly available datasets were used: MHCPEP and
SYFPEITHI. MHCPEP
47
is a static database of MHC-binding peptide sequences. SYFPEITHI
48
is an online database with over 4500 sequences and 250 motifs from naturally processed peptides
and T-cell epitopes. Since the focus was binary classification, all 9-mer binding sequences are
considered as binders, regardless of their binding specificity. Duplicated or contradictory entries
were removed. Since SYFPEITHI contains only binders, the non-binding sequences from
MHCPEP were included for prediction. The training of the two models was performed on three
dataset combinations: MHCPEP (binder + non-binder), SYFPEITHI (binder) + MHCPEP (non-
binder), and MHCPEP (binder + non-binder) + SYFPEITHI (binder).
For evaluation using an independent data set, the HIV genome was chosen, with prediction
models trained on the combined MHCPEP/SYFPEITHI dataset. For prediction, the complete HIV
genome of the HXB2 strain (GenBank accession number K03455) was used. The 3,150 residues
were divided into MHC binding and non-binding regions according to the HIV-epitope map
49
.
Binding sequences were extracted as indicated on the map, while non-binding sequences were
26
generated by chopping 9-mer sequences (with eight overlapping positions) from non-binding
regions and deleting the duplicated entries. Only peptides for which all 9 residues were located in
either region (epitope or non-epitope) were included in the performance evaluation.
2.2.4.3 Results
Two types of results for DynaPred are presented here. First is the sequence prediction accuracy;
that is, discerning whether the sequence will or will not be a binder for a particular MHC allele.
Second is the accuracy of the peptide conformation for predicted binding sequences. A 10-fold
cross validation or Leave-One-Out (LOO) technique and calculation of the AUC-ROC were
performed. As mentioned above, the program was trained with 3 different datasets, and all methods
performed well in prediction of binders, with accuracy >70% and AUC-ROC >0.85. In prediction
for the HIV genome, the accuracies were 69.68% for DynaPred and 85.45% for DynaPred
POS
. The
sequences predicted to be binders were constructed in the binding groove of the MHC molecule.
The average backbone root-mean-squared deviation (RMSD) between the crystal structure and the
predicted structure for a set of 20 known structures was 1.53 Å.
2.2.4.4 Advantages
The hybrid approach of sequence-based predictions to classify binders and non-binders and
structure-based modeling is a novel technique that overcomes the processor-intensive and time-
consuming problems of molecular docking. Additionally, a major drawback of sequence-based
programs is the binary output that discriminates only non-binders from binders. DynaPred
POS
is
able to circumvent this with incorporation of molecular docking. The strong positive correlation
between the potential energy of crystal structures and predicted structures provides evidence that
the components in the BFESM were properly weighted. Therefore, interactions between the
protein and peptide were accurately accounted for numerically.
27
The requirement for sequence-based prediction models is a sufficient dataset of known peptide
binders to a specific MHC allele. This is the training criteria that algorithms require to make
assumptions on unknown binders. DynaPred
POS
was able to make predictions with a >80%
accuracy on a set of 100 binders and 100 non-binders. This should be considered a strength of the
program, since other algorithms require larger data sets to achieve similar accuracy.
2.2.4.5 Limitations
A limitation to the sequence-based portion of the model in DynaPred
POS
, as with most sequence-
based models, is the lack of cross-allele predictions. If there is not a large enough dataset of known
binder and non-binders, these programs cannot make accurate predictions for novel peptide
sequences. As this is the first step in DynaPred
POS
, this is an inherent disadvantage. A further
limitation is that if a sequence is deemed to not be a binder in the first step, the program does not
attempt to dock the peptide. This is in contrast to strictly structure-based models, which dock the
peptide sequence without pre-judgment of its binding affinity. This should be a considered a
disadvantage in the DynaPred
POS
protocol as it limits the universality of the program. Overall, this
method can only be utilized for highly studied MHC alleles.
The peptide docking procedure itself is an additional limitation. As mentioned above, the
energy calculation of each amino acid is performed by docking a residue in each of 9 designated
pockets of an MHC class I binding groove, with flanking glycine residues. MD is then used to
rotate and position the side chain via energy minimization, with adjustment of the backbone
position within a defined Å box
15
. This is performed independently of all other residues. The
positions of all side chains and backbones in a given sequence are finally stitched together to
resolve the 9-mer peptide. The major difficulty in this process is attempting to fit the best
conformation for each residue. The binding mechanism of MHC class I peptides is driven by strong
28
interactions of the C- and N-termini, and peptides exhibit a bulge in the center. This bowing out
of the central residues often results in conformations of side chains that are not necessarily
energetically optimal. Piecing together residues of a peptide is a type of fragment-based docking,
but interactions in the center of the peptide may be difficult to predict correctly.
Solvation is a driving force in the docking of peptides to MHC class I. DynaPred
POS
uses an
explicit solvation network in conjunction with docking of the pseudo-peptides to the various
pockets in the binding site. Inclusion of any solvation network is a potential advantage. However,
explicit water networks produced by MD also have concerns, since the run time of MD may not
be long enough for sufficient water molecules to migrate into pockets between the peptide and
protein and produce a complete water network.
2.2.5 GradDock
11
2.2.5.1 Introduction
GradDock is a structure-based prediction method for docking of peptides to MHC class I. The
program uses a regression model via MD using the Gromos54a7 force field
44
with an Autodock
like grid-box for lowering peptides into the binding site. Energy minimization occurs during
simulated annealing without solvation, while side-chain orientation in the bound position is
calculated using Monte Carlo simulated annealing. The binding affinity is calculated via a
weighted version of Rosetta 2015.19
50
.
2.2.5.2 Protocol
GradDock has three main steps: 1) generation of peptide candidates from the sequence, 2)
insertion of the peptides into the MHC class I molecule, and 3) ranking of the peptides. The
program input is an MHC class I structure and the sequence of the binding peptide. The peptide is
29
generated in an unbound state in GradDock. However, the protocol can be considered to be
anchored fragment-based docking. Due to the highly conserved hydrogen bond network that drives
binding of C- and N-termini residues, the program generates unbound peptides by growing and
joining half-peptides from two fixed termini. The termini torsional angles were obtained from the
PDB structure 1DUZ. The half-peptides are generated only with a backbone, with φ and ψ torsional
angles randomly generated using overall cumulative dihedral angle probabilities
11
calculated using
fine-grained Ramachandran maps
51
. The half-peptides are then randomly paired, and the peptide
then undergoes simulated binding.
The simulated binding has a three-step protocol: steered insertion, energy minimization, and
topological correction. A total of 2,048 bound peptides are generated in the simulation process.
Steered insertion is directed via a gradient descent algorithm, where the gradient is iteratively
calculated and added to the coordinates. The non-bonded interaction parameters are from the
Gromos54a7 force field
44
. The MHC molecule is fixed during the binding simulation and
represented as a force field via the AutoDock
52
protocol. The peptide is generated 20 Å from the
MHC class I binding site, and then directed into the binding groove along a “binding axis”. This
is a hypothetical direction based on a pathway through which a peptide associates and dissociates
with minimal friction. In steered insertion, side chains of terminal residues are optimized via a
Monte Carlo approach. All torsional angles are randomized and refined until no energy
improvement is observed. Side chains are optimized to prevent steric clashes.
The second stage of the binding simulation in GradDock can be described as gradient descent
energy minimization via MD, with further refinement of the torsional angles. Rare torsional angles
(probability <0.001) are amended to nearby probable angles using pseudo-gradient rules
11
. The
final step is the energy ranking of peptides. Energy calculations are performed using 83 weighted
30
scoring terms from Rosetta 2015.19
50
. The weighted equation was obtained by defining the crystal
structure as the native conformation with the global minimum energy. All other conformations
have a higher energy relative to the native state. Constraints implemented in GradDock result in a
weighted optimal ranking function. The output is a single conformation of the peptide with the
lowest energy using the weighted Rosetta equation.
2.2.5.3 Results
Validation of GradDock was conducted on 107 non-redundant MHC class I-peptide crystal
structures. All structures were truncated at the peptide-binding domain. Structures with missing
backbone atoms or with peptides with non-natural amino acids were omitted. Cross-docking with
a binding prediction conducted using MHC class I structures bound to a different peptide in the
crystal structure was used for validation. This is a particularly useful approach because it tests the
sensitivity of the docking results to the specific conformation of the MHC molecule. The
GradDock prediction protocol with a weighted Rosetta energy gave a prediction accuracy of 1.18
Å for the RMSD of all atoms for self-docking, and an RMSD of 1.2 Å for cross-docking. Based
on generation of 2,048 peptide conformations and the flexible nature of side chains atoms (which
are typically omitted from RMSD calculations), this is a strong prediction program.
2.2.5.4 Advantages
The Rosetta scoring criteria are validated for energy calculations for peptide-protein complexes,
and use of an existing scoring function improves the docking method. The RMSD of the predicted
structure relative to the X-ray structure shows that GradDock is ubiquitous for all MHC class I
variants. Additionally, the program is five times faster than the original Rosetta simulation
method:
53
GradDock took 107.79 s and the original Rosetta simulation method required 544.84 s
to produce one candidate peptide under identical settings.
31
2.2.5.5 Limitations
Non-terminal side chains in GradDock are generated prior to steered insertion of the peptide,
and the side chains are optimized to prevent steric clashes. At this point, optimization only takes
into account intramolecular clashes. Adding side chains prior to docking may create side chains
that interact poorly with the protein or have no favorable interactions after docking. This is a
limitation because side chain interactions contribute to peptide-MHC binding. In a worst-case
scenario, generated side chain conformations might hinder binding as the peptide is docked to the
MHC molecule. Steered docking uses an energy gradient descent model, and an intermolecular
clash will increase the energy and affect the gradient model for the peptide.
An additional limitation is the potential lack of diversity in peptide torsional angle generation.
GradDock generates 2,048 peptide conformations for an 8-10 amino acid peptide. For comparison,
if the torsional angles of a single amino acid are rotated at 30 for both φ and ψ backbone dihedral
angles, this results in 144 possible conformations. If these increments are used over an 8-amino
acid peptide, the total number of conformation is 144
8
, which is far larger than the number of
conformations generated in GradDock. Filtering conformations using the overall cumulative
dihedral angle probability, as described above, helps to identify likely conformations, but 2,048
conformations is still an underrepresentation of the number of possible conformations.
A further limitation in GradDock is the omission of water molecules, which play a key role in
the binding of peptides to protein interfaces. This may affect the scoring function, since
displacement of water molecules upon peptide binding, water-mediated hydrogen bonding, and
water-mediated salt bridges are not explicitly included in calculation of the binding energy.
32
2.2.6 DINC
12
2.2.6.1 Introduction
The Docking INCrementally (DINC) algorithm is a structure-based regression model for
docking of peptides to MHC class I. DINC uses a fragment-based anchored docking approach with
implicit solvation and energy calculations via Autodock
52
.
2.2.6.2 Protocol
DINC docks only a small fragment of the peptide into the binding site of an HLA molecule.
The size of the fragment generated in the initial step is restricted by generation of fragments with
no greater than 6 internal degrees of freedom (DoF). Internal DoFs are the rotatable bonds of a
peptide: the φ and ψ torsional angles. The position, 3D coordinates, and orientation of the fragment
are not incorporated into the 6 DoFs. The same DoF heuristic is applied to all protocol steps, where
additional fragments can have no more than 6 DoFs. Favorable conformations of the initial peptide
fragment are evaluated via a scoring function,
52
which relies heavily on optimal hydrogen bonding
geometry. DINC then performs iterative rounds of fragment expansion, selecting which atoms to
include in the subsequent expansion so as to not exceed 6 DoFs. Scoring of the conformations of
the added fragment and selection of the top ten most favorable conformations for the next round
of expansion are performed until the entire peptide is reconstructed and docked. The DINC
protocol is highly customizable, in that it allows for use of different combinations of parameters
(i.e., number of DoFs in each round), heuristics (i.e., fragment expansion method), and protocols
based on changes in parameters.
2.2.6.3 Results
DINC was validated on a diverse dataset of 25 MHC-peptide complexes. The complexes were
33
all from human MHC allotypes (HLA). All three types of class I HLA molecules (HLA-DR, -DQ,
-DP) were represented in the dataset. The 25 complexes were pre-processed to correct for any
discrepancies, such as duplicate side chains; remove redundant chains; and remove X-ray water
molecules. The resulting structure was run through three steps of energy minimization via
GROMACS
54
, using the GROMOS96 (53a6) force field with the SPC water model. The resulting
conformation of the peptide was compared to the DINC-predicted conformation. A default
protocol for DINC was used for re-docking of the 25 interfaces in the validation process. The
output conformation with the lowest binding energy for each complex was compared to the
corresponding reference structure. If the all-atom RMSD between the output structure and
reference structure was lower than 2 Å , it was considered to be a good reproduction of this
complex. If not, the program was executed using the next alternative protocol. This process was
repeated until a good reproduction was obtained, or the maximum of 5 protocols were used.
The re-docking experiment show good reproductions of the 25 peptide-MHC complexes. The
average all-atom RMSD was 1.92 ± 0.41 Å , and the RMSD was <2.2 Å for 68% of the complexes.
The highest all-atom RMSD was 2.61 Å . One result of interest was for the interface of an HLA-
B*44:03 allotype (PDB ID: 3I6L). DINC was able to generate a predicted peptide with an RMSD
of 2.5 Å . However, a key hydrogen bond in the experimental structure at the 3
rd
residue was not
identified in the predicted structure, whereas a second hydrogen bond was predicted by DINC.
Despite the absence of the key hydrogen bond, the DINC protocol was able to predict the
conformation of the peptide with a satisfactory accuracy.
2.2.6.4 Advantages
The flexibility of amino acid side chains, particularly longer side chains such as arginine, lysine,
glutamine, and glutamic acid, is difficult to handle in a prediction algorithm, since the side chain
34
atoms can occupy a large conformational space. Hydrophobic side chains that face toward bulk
water (as opposed to being buried in the protein) present a further problem for prediction models.
This situation may occur despite being energetically unfavorable, if other residues in the peptide
are interacting with the protein in an energetically favorable position that overcomes the
unfavorable bulk-facing residue. DINC was able to accurately model these phenomena on the
reported dataset, which is a significant achievement.
For an HLA molecule, using a fragment-based anchored docking approach is appropriate
because the binding site is defined, and fragment-based docking allows for generation of a wide
distribution of peptide conformations without becoming computationally intractable. DINC
implements this approach effectively as a logical choice for the target interface.
2.2.6.5 Limitations
DINC should be viewed as a proof of concept, rather than a true prediction algorithm. In the re-
docking validation, peptides were docked using five separate protocols, with the best output
conformation of the peptide from the five protocols chosen as the representative conformation for
a given interface. However, the goal of any protein-peptide docking program is to predict
previously unknown binders. The use of five separate docking protocols requires further
investigation, since it is currently unclear which of the five protocols should be used for a given
peptide and MHC molecule. Having multiple separate protocols reduces the utility of the program
for the average user. DINC also does not use water molecules in the docking process, which
eliminates interactions of explicit water molecules in the scoring protocol. Water-mediated
hydrogen bonds, water displacement, and water-mediated salt bridges are not taken into account
in the scoring step, and this could lead to erroneous predicted conformations.
Making predictions in a reasonable amount of time is a central obstacle in molecular docking.
35
DINC can calculate binding of peptides to HLA interfaces in a remarkably short time, but only
with use of high-performance computing and parallel processing. The program was able to
complete a protocol, on average, in just 30 minutes, using 80 computer nodes with 16 cores per
node; effectively a computer with 1,280 cores. A typical workstation in 2018 has 8-18 cores, and
the average consumer desktop has roughly six. Thus, DINC on a computer with 1% of 1,280 cores
would take a markedly longer time to run, and this may limit its general use.
2.3 Discussion
Experimental information on peptide-MHC binding specificities is limited due to the highly
polymorphic and polygenic nature of MHC, as well as the promiscuous binding of peptides to
MHC molecules. As shown in this chapter, computational methods can assist by predicting
peptide-MHC binding. Such prediction methods broadly fall into two categories, sequence-based
and structure-based, and each has advantages and limitations. Sequence-based methods are fast
but require experimental binding data for the MHC type of interest. Recent programs have been
able to overcome this restriction by implementing pan-specific models with varying degrees of
success. With increasing computer performance, structure-based methods may become the best
option for prediction of the conformation of binding peptides. These methods can be applied to
any MHC allele. This has implications for therapeutic applications such as vaccine development
and understanding of autoimmunity. Key problems that remain to be solved include efficient
algorithms for handling peptide flexibility, predicting protein conformational change in response
to peptide binding, and rapid incorporation of water molecules at the peptide-MHC interface.
36
Chapter 3: Influence of Solvation on Peptide Docking to MHC class II
3.1 Background
Water may be the most important molecule in any biological system. The entropic and enthalpic
characteristics of water makes it unique from other liquids, and it serves as the solvent for virtually
all biochemical processes
55
. The influence of solvation is critical as water takes on multiple roles
in any biochemical process. Water determines the structure of biopolymers
56-59
and their
interactions with other molecules
60-62
. Solvation dictates the folding, structure and stability of
proteins
63
, and removal of water-mediated hydrogen bonds often lead to substantial destabilization
of complexes
64,65
. The role of hydrophobicity in the function of proteins, membranes and lipid
aggregation are also the consequence of solvation and desolvation
66
. Solvation has also been
shown to play a key role in protein-DNA interfaces
67
. Solvation is a significant factor that
contributes to the free-energy of biological interactions.
Interfacial water in protein-peptide complexes plays an active role in the binding affinity of
peptides. Understanding the influence and energy contribution of water molecules within protein
binding sites is vital for a complete understanding of the binding mechanism
68
. Solvated water
networks are implicated in peptide affinity and specificity within the peptide-protein interface
68
.
The binding of a peptide results in one of two outcomes for water molecules present at the binding
site, which depends on the Gibbs free energy, one of the most important thermodynamic quantities
for characterization of the driving forces in solvated water networks
69
. If the entropic (-ΔTS) gain
does not overcome the enthalpic barrier (ΔH) in Gibbs Free Energy, water molecules are spatially
relocated within the interface. This spatial rearrangement effects the binding free energy
68
. For
water molecules where the entropy (-ΔTS) is sufficient, resulting in an overall energetically
favorable outcome (-ΔG), water molecules are displaced. This would be the case for water
37
molecules that are in a hydrophobic environment, where hydrogen bonding is not possible
68
. This
release of water increases the binding affinity of the peptide-protein interface. Expulsion of water
is possible when the peptide has a greater affinity for the environment the water molecule occupied.
Given the importance of solvation in binding affinity, water is disproportionately neglected in
computational modeling of protein-peptide docking. The development of EXSAN allows explicit
inclusion of individual water molecules in the docking of protein-peptide complexes and
demonstrates the importance of solvation in the prediction of peptide binding at these interfaces.
The use of molecular docking (in silico modeling) has become an ever-expanding field for
understanding peptide-protein interactions. EXSAN utilizes structure-based predictions; where
binding affinity is based on universal physical principles of intermolecular and intramolecular
interactions in order to find regions in the proximity of the protein that are energetically favorable
to the binding of a specific ligand. Despite the importance of peptide binding predictions, most
docking algorithms do not have reflect in vivo conditions as they lack explicit solvation in the
interface of the protein-peptide binding site
70-78
.
Computational docking programs solvate interfaces in one of two ways; using an implicit-
solvation or an explicit-solvation model. Implicit-solvent models simulate the behavior of water
in an abstract manner, with water commonly treated as a continuum electrostatic equation to
describe polar solvation (continuous homogenous polar liquid) with dielectrics largely based on
the Poisson-Boltzman equation. The Poisson-Boltzman Equation is often used to calculate ligand
solvation and estimate ligand binding free energies by calculating the total energy of the protein–
ligand complex and then subtracting the solvated energy of the protein and ligand separately to
give the binding free energy estimates
68
. The key weakness of implicit solvation models is their
poor description of water-mediated interactions, especially the directionality of water hydrogen
38
bonds; additionally, they approximate the nonpolar contribution to solvation
68
. The implicit-
solvent approach is often augmented by the inclusion of a small number of water molecules known
to be especially important in ligand binding
79
. Explicit-solvent models place individual water
molecules throughout the protein. The benefit of explicit models is a deeper understanding of
solvation dynamics. Explicit models are able to calculate the free energy of displaced waters and
whether the displacement is energetically favorable. An in-house explicit solvent algorithm,
WATGEN
80
, is the foundation from which EXSAN calculates the behavior of all waters in an
interface.
WATGEN is an algorithm that is capable of accurately predicting water molecules present in a
peptide-protein interface, using hydrogen-bonding in the interface to predict the location of
specific, explicit water molecules
80
. It has already been used successfully in knowledge-based
prediction of major histocompatibility complex (MHC) class I binding
81,82
. WATGEN identifies
potential locations for water molecules by first identifying hydrogen-bond donor and acceptor
atoms in both ligand and peptide, then permutates distance, angle, and torsional angle around that
atom as potential locations for a water molecule. Favorably-scored potential waters are retained
and optimized. This process is far more computationally intensive than implicit-solvent methods,
but improvements in computational power has dramatically decreased calculation time, which
facilitates the practicality of using the WATGEN explicit solvent model in analyzing the large
numbers of peptide conformations seen in docking calculations. The ability to reproduce the
solvation network at an interface is key for accurate binding prediction. The solvation network in
MHC class I has been shown to be important for prediction of peptide docking
82
. Here, we show
using the EXSAN program that the solvation network in MHC class II is vital for peptide docking
to MHC molecules.
39
3.2 Methods
The development of EXSAN for the docking of peptides to MHC class II molecules is reliant on
an accurate water network. Prior to peptide binding predictions, the WATGEN algorithm was
verified for MHC class II. Crystal structures were obtained from the PDB bank. A total of 30
crystal structures were solvated via the WATGEN algorithm.
An alignment algorithm was used to calculate how close a water molecule generated by
WATGEN was positioned relative to a water molecule seen in the crystal structure. The
experimental water molecules are matched to a corresponding predicted water molecule by the
sum of the distances between all matched waters from the two water networks. The smallest total
distance between all matched waters is used for comparison of the water networks for validation
of the WATGEN algorithm. In addition to water network alignment calculations, an analysis of
MHC class II peptides was conducted for the development of peptide docking.
The flexibility of peptides is particularly challenging for computational methods as the torsional
angles of the peptide’s backbone can occupy any conformation in a 360° space. Accurately
replication this flexibility is one of the greatest challenges for in silico predictions. All the possible
conformations of a peptide are generated by permutating rotatable bonds. For instance, if the
backbone φ (phi) and ψ (psi) torsional angles are rotated at 30° for an entire revolution the
maximum number of poses generated is 144, given 12 φ conformations by 12 ψ conformations
(360/30). To reduce the number of conformations, knowledge-based design parameters were
implemented. The φ and ψ torsional angles for a set of 30 X-ray peptides bound to MHC class II
molecules were calculated. The ψ angle ranged between 100° and 200°, and the φ angle between
180° and 320°. Based upon this analysis, peptide conformations were only generated for the range
of torsional angles observed.
40
3.3 Results
3.3.1 Replication of Water Network
The reproducibility of water networks is crucial for accurate replication of peptide binding. The
analysis to quantify how accurately the WATGEN algorithm was conducted by solvating the
protein-peptide interface, which yielded strong reproducibility. As shown in Figure 1, WATGEN
creates a water network where the position of explicit water molecules are placed in the equivalent
location as the water molecules seen in the crystal structure. The solvation network for peptide
binding is crucial for prediction of protein-peptide complexes. WATGEN placement of water
molecules is a key step in the prediction process.
Crystal structures often lack a fully solvated protein-peptide interface. The difficulty in resolving
the location of individual water molecules creates an incomplete water network. The WATGEN
algorithm must place water molecules in the location of experimental water molecules as well as
include additional water molecules to create a fully solvated protein-peptide complex.
Fig 1a. Comparison of experimental water network (green) and predicted water network (blue) at protein-
peptide interface (PDB ID: 2NNA) (N-terminus left).
41
Fig 1b. Comparison of experimental water network (green) and predicted water network (blue) at protein-
peptide interface (PDB ID: 4X5W) (N-terminus left).
An alignment algorithm was used to see how often WATGEN places a water molecule in the
same vicinity as an experiment water molecule to ensure accurate reproduction of the water
network. All experimental water molecules within 7.5 Å of any atom on the peptide were used for
the alignment routine. This distance is used as it allows for the ability to capture double water-
water mediate hydrogen bonding between the protein and the peptide. On average there were 28
water molecules per interface in the 7.5 Å bubble. One interface (PBD ID: 4IS6) had zero water
molecules in the bounds created. Whereas, the highest interface (PDB ID: 4X5W) has 81 water
molecules in the 7.5 Å bounds. The median number of water molecules was 14.
An experimental water was considered to be replicated by WATGEN if the program placed a
water molecule within 2.2 Å from a water molecule seen in the crystal structure. WATGEN
positioned a water molecule in the vicinity of an experimental water molecule at a rate of 65.4%.
The average alignment distance was 1.25 Å. For interfaces with < 20 experimental water
molecules, the remaining water molecules were often hydrogen bonded between two atoms on the
42
protein. These water molecules were within the 7.5 Å cutoff, but were not interacting with the
peptide. Only water molecules which interact with the peptide are generated by WATGEN. Other
water molecules that are bulk facing, but captured by experimental determination, are also not
generated by WATGEN.
When water molecules which either are bulk facing, or bound between two atoms on the protein,
were removed from calculation, the alignment accuracy of WATGEN dramatically improves. As
shown in Table 1, when calculating only experimental water molecules which are at the protein-
peptide interface, WATGEN replicated the water network seen in vivo. Water molecules, which
are facing into bulk are exchanged with other water molecules at a rate that would limit their ability
to have an energetic influence on peptide binding. This dynamic is the justification for removal of
bulk facing water in alignment calculations. Water molecules, which are buried in the target
protein, have no direct effect on peptide binding and are removed from calculation as well. Figure
2 shows a detailed view of water molecules that are predicted by WATGEN.
Table 1. Comparison of X-ray and WATGEN-predicted water molecules in MHC class
II/peptide interfaces.
Interface Experimental
water molecules
a
WATGEN
Alignment
Non-reproducible
water molecules
c
Predicted water
molecules
1T5W 14 10 4 100%
2NNA 33 25 5 90.90%
2Q6W 14 9 5 100%
4I5B 13 7 5 92.30%
4IS6 0 - - -
4X5W 81 53 20 90.10%
5KSV 23 14 4 78.20%
5LAX 18 12 6 100%
a
Number of experimental water molecules within 7.5 Å of the peptide.
b
Water molecules replicated by WATGEN within 2.2 Å
c
Water molecules that could not be reproduced by WATGEN based on the limitation of the
algorithm.
43
Figure 2. Detailed view of water molecule reproduction. Experimental water molecule (green) is
exhibiting a water-mediated hydrogen bond between peptide (top) and protein (bottom).
WATGEN predicted water molecule in exact location (blue), which maintains hydrogen
bonding. (PDB ID:2Q6W).
3.3.2 Completing Unresolved Crystal Structure Water Network
The WATGEN water network also showed the ability to position water molecules in the protein-
peptide interface where there were no experimental water molecules resolved in crystal structures.
An example of this ability, the WATGEN program was able to place a water molecule in the exact
position of an experimental water, then place a second water molecule between the first water and
a polar atom on the peptide (Figure 3). This resulted in a water-water mediated hydrogen bond
network between the MHC protein and the peptide. Despite the crystal structure missing the second
water molecule, in vivo a water molecule would be present in that position to bridge the gap
between the MHC molecule and the peptide. WATGEN was able to fill in missing water molecules
which are unable to be resolved in experimentally determined structures. This is a very powerful
component of the EXSAN program. Replicating and completing the water network at a protein-
peptide interface is critical for accurate prediction of peptide binding.
44
Figure 3. The ability of WATGEN to complete water network by additional water molecules not
resolved in the crystal structure. Double water-mediated hydrogen bond between peptide (left:
0
1
) and protein (right: 0
2
). Experimental water (green) is reproduced by WATGEN water
molecule (blue: W
2
), with additional water molecule (blue: W
1
) placed in position to mediate
double water hydrogen bond. Replicating in vivo interaction which crystal structure was unable
to resolve (PDB ID: 2NNA).
3.3.3 Water Replacement
The solvation network for peptide binding is crucial for prediction of protein-peptide complexes.
Water molecules are important for many interactions at the interface of the complex, including the
formation of hydrogen bonds. MHC class II exhibited a phenomenon where the oxygen of the
backbone carbonyl was located in the same vicinity as a displaced water. An analysis of 30 X-ray
structures showed a correlation between the IC50 of the bound peptide and the number of carbonyl
oxygens within 1.25 Å of a displaced water (r
2
=.23). The more backbone carbonyl oxygen atoms
that were in the vicinity of a displaced water, the stronger the binding affinity of the peptide (P <
0.05, t-test). The correlation between oxygen carbonyl atoms replacing water molecules and
correlation with IC50 data is shown in Table 2.
O
1
O
2
W
1
W
2
45
Table 2. Water Replacement analysis and correlation to IC 50
PDB ID IC50
a
WATER
REPLACEMENT
(1.0 Å)
WATER
REPLACEMENT
(1.25 Å)
2IPK 0.6 5 7
5JLZ 1.5 1 4
1T5W 1.8 3 4
3S4S 1.8 5 5
3S5L 1.8 4 6
1KG0 1.8 4 6
1HXY(SA_X)
b
1.8 4 4
1LO5(SA_X) 1.8 2 4
1DLH 1.8 3 5
1JWM (SA_X) 1.8 5 6
1R5I (SA_X) 1.8 3 6
1T5X (SA_X) 1.8 2 3
2G9H (SA_X) 1.8 6 7
2OEJ (SA_X) 1.8 0 3
5LAX 4 4 5
5KSV 6 0 3
1KLG (SA_X) 13 6 6
1PYW 20 4 5
1KLU (SA_X) 25 3 5
1SJH (SA_X) 28 3 3
1SJE (SA_X) 29 3 4
40V5 41 3 4
1D5M 67.47 1 2
1D6E 78.78 1 3
5KSU 82.5 2 2
3L6F 102 4 4
4IS6 136 2 5
a
IC 50 values are in μM
b
SA_X indicates the presence of a superantigen in the crystal structure, which was removed prior
to analysis
The analysis was done on a set of 79 MHC class I X-ray structures and the same correlation was
not seen. We refer to this observation as water replacement. Figure 4 show an MHC class II peptide
(PDB ID:2NNA) with 5 water molecules, green, which were calculated to be displaced. Upon
docking of the peptide, the oxygen of the carbonyl is within 1.25 Å of the displaced waters. The
peptide shown is a 9-mer peptide, where five of the nine backbone carbonyls are “replacing” water
46
molecules. Figure 5 demonstrated the preservation of the hydrogen bond network that was present
prior to docking of the peptide. The water molecule, shown in orange, exhibited hydrogen bonding
to the protein and a second water molecule in the binding site prior to docking of the peptide. Once
the peptide was docked, the water molecule demonstrates water-mediated hydrogen bonding
between the PHE1 of the peptide and the backbone nitrogen of GLU55 in the protein.
Figure 4. The replacement of water molecules (green) by carbonyl oxygen atoms upon peptide
binding. Oxygen atoms are positioned within 1.25 Å of a displaced water molecule, maintain
interactions present prior to binding (PDB ID: 2NNA).
47
Figure 5. Water Replacement & Maintaining of Hydrogen Bonding
The green and red spheres are predicted water molecules; orange sphere is water molecule
present in crystal structure. Water-mediated hydrogen bond between GLU55 on the protein
(bottom) and PHE1 of the peptide (top) is mediated by experimental water molecule (orange);
same interaction is exhibited by predicted water (red). Water molecule (green) is calculated to be
displaced, and replaced, by oxygen of carbonyl on PHE1 of peptide. The carbonyl maintains the
same hydrogen bonds exhibited by water network prior to binding of peptide.
3.4 Discussion
The ability to replicate water networks in the docking of peptides to any protein complex is
critical for accurate binding predictions. Solvation plays a role in the docking of peptides to MHC
molecules; the displacement of water molecules during binding, water-mediated hydrogen
bonding, the capture of water at the protein-peptide interface, and water-mediated salt bridges all
contribute to the binding affinity of a peptide. Without a proper water network, these interactions
would not be modeled in a prediction program. We have shown that the EXSAN program is able
to reproduce the water network seen in vivo. Allowing the calculation of water molecules’
influence during peptide binding. The proper water network also brings insights into water
molecule dynamics during peptide binding.
48
EXSAN has not only been proven to create an accurate water network for the binding of peptides
to MHC molecules but has shown to fill in missing information not resolved in crystal structures.
This is a powerful ability of the program. The use of EXSAN to supplement experimental
determination of peptide binding is only useful if the program can make accurate predictions. The
integration of explicit water molecules is key for computation methods in order to aid experimental
binding research. The use of explicit water molecules has elucidated insights into the binding of
peptides to MHC class II molecules.
The most important insight obtained from solvation of these complexes was the water
replacement phenomenon. Maintaining the environment around a water molecule, especially one
that exhibits hydrogen bonds, would result in no + ΔG due to enthalpy. Only the – ΔG due to the
entropic gain of displacing a water molecule into bulk solvent. Having a hydrogen bonding capable
atom translocate to the position of the displaced water molecule would maintain the same micro-
environment seen prior to binding of a peptide. MHC class II exhibited this phenomenon where
the oxygen of the backbone carbonyl would be located in the same micro-environment as a
displaced water. Theoretically maintain the interactions the water molecule exhibited prior to
displacement. This process of replacing a water molecule with a carbonyl oxygen would not be
elucidate without computational methods. The experimentally determined crystal structures all
show the peptide bound to an MHC protein. The water network prior to binding is not resolved.
This is a strong attribute of computational solvation of protein-peptide complexes via EXSAN.
The observation of this phenomenon in only Class II interfaces can be attributed for two reasons.
In MHC Class I the binding site of protein is an enclosed groove. The termini of docked peptides
are flanked by multiple tyrosine residues in the protein. This moiety is a driving force is the binding
of peptides, as both the c-terminus and n-terminus of the peptide are confined to this area around
49
the hydroxyl of the tyrosine residues
83
. Effectively locking in the ends of the peptide. Contrast that
with MHC Class II, where the binding site is open ended and does not have the tyrosine moieties
seen in class I. Due to this open-ended binding site, docking of a peptide is more reliant on the
energetics of water molecules. The ability to maintain a hydrogen bond network pre- and post-
docking of the peptide would decrease the enthalpic energy barrier, effectively increasing the
binding affinity of a peptide.
50
Chapter 4: Docking of Peptides to MHC class II
4.1 Introduction
EXSAN
14
is a structure-based anchored docking program, which predicts the conformation of a
peptide sequence bound to a protein (MHC) with a known structure. Developed in-house (by
Dabrick Brill), and later adapted for the docking of peptide to MHC class II molecules. EXSAN
utilizes explicit water molecules in the peptide-protein interface and is able to calculate interactions
that would not be possible without a water network. As no prior knowledge of binding is necessary,
EXSAN is able to dock any peptide to any protein interface.
4.2 Methods
The EXSAN algorithm uses information about a protein of known structure to calculate the
3D structure of a peptide when bound to the protein, integrating the location of explicit predicted
water molecules in structure prediction. The crystallographic structures of MHC class II interfaces
in complex with a peptide were selected from the Protein Data Bank (rcsb.org). A detail algorithm
of the EXSAN program has previously been published
14
but a short description follows.
The docking algorithm EXSAN employs an anchor-and-grow method, that is, when the program
starts it only considers an anchor point (one residue). Additional residues are sequentially added
to the peptide. Residue addition occurs in two phases, first the backbone is extended, and then the
side chain is added. Many different possible conformations of each residue, or ‘poses’, are created
by variation of rotatable bond. The backbone φ (phi) and ψ (psi) torsional angles and the side chain
rotatable bonds are varied to generate these poses. The poses are then scored, with the most
energetically favorable poses being extended by an additional residue. This cycle of pose
generation and retention of favorable poses proceeds until the full ligand has been considered. A
51
diagram of the peptide docking routine is shown in figure 6. This anchor-and-grow approach can
be considered a type of fragment-based docking.
The initiation of the program anchors a residue known to interact with the target protein. This
initiation step places a single atom, in the first residue, a specific distance from a relevant atom on
the target protein. This target atom along with two of its neighboring atoms are reference points.
The distance, angle, and torsion of the anchor atom relative to the reference points of the target
protein are permutated, the angle and torsion of the alpha carbon is permutated relative to the
anchor and reference points, and finally the torsion of the last backbone atom of the residue is
permutated. This yields every possible configuration of the backbone of the residue within a sphere
from the reference point. To keep the anchor residue within the vicinity of the binding pocket, an
additional parameter is often used to require that a specific peptide atom is within a certain distance
of another atom on the target protein, limiting the space the peptide can occupy to the intersection
of two spheres.
Following the initiation step, the side chain of the anchor residue is added. All rotatable bonds
are permutated, and a set of poses are generated. After all the poses have been generated, the
position of each water molecule is calculated by WATGEN
80
. The water networks of the protein
before and after the generation of the residue are compared. This allows the calculation of free
energy change of water molecules that were either released from binding of the peptide, or those
that were captured from bulk solvent during binding.
52
Figure 6. Flow chart of program operation. The phases of the program are color-coded as initiation
(red), growth (orange), solvation (green), scoring (blue), culling (purple), and consensus (black).
Following solvation, generated poses are given a “score” based on favorable interactions
(hydrogen bonding, hydrophobic interactions, displacement of water molecules, formation of salt
bridges, etc.) and a “penalty” for unfavorable interactions such as hydrophobic-hydrophilic
contacts, hydrophobic atoms facing bulk water, and the displacement of strongly bound water
molecules. Poses are then retained or rejected (referred to as culling) based upon an energy
threshold. This process is repeated for the entire peptide. A diagram of peptide rejection can be
seen in figure 7.
Poses can additionally be culled by spatially determined parameters. Poses that result in Van der
Waals clashes are rejected. Poses can also be rejected if any atom, on the most recently added
residue, is not within the standard distance cutoff of 4 Å from the target protein. These poses are
defined to be out of the binding groove and are subsequently rejected.
53
Figure 7. Anchored fragment-based solvated docking using EXSAN. The N-terminus of the
peptide is on the right. The green peptide is terminated after the 1
st
amino acid due to a poor
docking score. The blue, pink and orange peptides continue to the 2
nd
amino acid, at which point
the blue peptide is terminated. Similarly, the pink peptide is terminated after the 4
th
amino acid
due to a score below the energy threshold. × indicates termination of a peptide chain. Indicates
water molecules.
The EXSAN anchor and grow approach leads to many conformations of the peptide after
all residues have been added. Upon the addition of the final residue, poses were calculated, and
the top 200 energetically favorable poses were grouped by similarity. The group that possesses the
highest number of poses will be the group where the representative pose is chosen from, referred
to as the consensus pose. This consensus pose EXSAN believes to be the true conformation of the
peptide in vivo. The pose within the group that is the most energetically favorable becomes the
consensus pose.
54
In rare circumstances there will be more than one output consensus pose. The protocol dictates
that any group with < 20% of top 200 poses will output a consensus pose. The pose which is the
most energetically favorable from the largest group is output as consensus pose 1, the remaining
groups output in descending size, i.e. the second largest group outputs the most energetically
favorable pose from that group as consensus pose 2, and so on and so forth. This procedure is
repeated and can continue up to a total of 5 consensus poses. The consensus poses are all
conformationally distinct from one another, and often only one consensus pose is produced. This
approach also acts as a safeguard against a pose which was highly scored for spurious reasons, as
several similar poses must have high scores to lead to a consensus pose
EXSAN has been shown to have strong predicting ability in generating this representative pose
that resembles the X-ray structure. A root-mean-square deviation (RMSD) was calculated to give
a quantitative score to how closely the predicted structure matches the experimental structure. The
RMSD was calculated only for the backbone atoms of the peptide (carbonyl carbon, alpha carbon,
and amide nitrogen).
4.3 Results
4.3.1 Peptide Growth in EXSAN
An illustration of an EXSAN peptide docking run is shown in Figure 8. A random subset of
poses is generated at each step in the growth of the peptide from one residue to its full length,
followed by selection of a consensus pose. The initially varied and relatively unrestrained
trajectories of the peptide become narrower with addition of subsequent residues and as more
information on the interaction with the target protein is established.
55
Figure 8. Anchor-and-grow process
Panels A-F show the progression of the peptide as residues are added and unfavorable poses are
subsequently rejected. (A) show the 1
st
residue (GLU), (B) shows the tripeptide (GLU-GLY-SER),
(C) the pentapeptide (GLU-GLY-SER-PHE-GLN), (D) the 6-mer peptide (GLU-GLY-SER-PHE-
GLN-PRO), (E) 7-mer (GLU-GLY-SER-PHE-GLN-PRO-SER) and (F) the complete peptide
docked (GLU-GLY-SER-PHE-GLN-PRO-SER-GLU-GLN).
4.3.2 Validation of EXSAN
EXSAN was validated on peptides binding to 8 MHC class II proteins. The RMSD between the
consensus pose in EXSAN and the corresponding X-ray structure is within 2.7 Å for 6 interfaces.
The remaining interfaces have an RMSD under 5.0 Å. One of the interfaces (PDB ID: 2Q6W)
produced a consensus > 2.5 Å, and a second consensus pose has an RMSD of 2.19 Å. Table 3
shows the binding prediction for the interfaces. To evaluate the agreement between the predicted
and X-ray conformations in more detail, the parameters and total energy were calculated for each
X-ray structure. These are compared with the data for the consensus pose in Table 4. We also show
that the solvation component in EXSAN is a key element in these results. Without inclusion of
solvation, the prediction of peptide binding shows reduced accuracy.
A) E B) EGS C) EGSFQ
D) EGSFQP E) EGSFQPS
F) EGSFQPSQE
56
Table 3. EXSAN prediction of peptide binding to MHC class II proteins
PDB ID MHC ALLELE
BOUND
SEQUENCE
A
RMSD
TIER 1
RMSD
TIER 2
RMSD
TIER 3
1T5W HLA-DR1 YSDQATPLL 1.78 6.07
2NNA HLA-DQ8 EGSFQPSQE 1.07
2Q6W HLA-DRA/DRB3*0101 WRSDEALPL 4.93 2.19 2.94
4I5B HLA-DR1 VVKQNCLKL 2.16
4IS6 HLA-DR4 LYPEWTEAQ 3.42
4X5W HLA-DR1 WRMATPLLM 2.70
5KSV
HLA-DQ2.5
(DQA1*05/DQB1*02)
PLLMQALPM 2.48
5LAX HLA-DRB1*04:01 FRAAVPSGA 0.85
A.
Residues which extend beyond binding side (omitted from table) were not used in RMSD
calculation.
For all the interfaces where solvation was not used in the binding calculation, the prediction
of peptide binding was affected. Interfaces either output a consensus pose that had a higher RMSD,
or the program terminates prematurely due to an intractable number of peptide conformations
(poses). If at any step in the protocol, the number of unique peptide conformations exceeded 1.5
million, the program terminated. Five interfaces exhibited such behavior, whereas with the
inclusion of water, no interface reached the threshold. The protein-peptide interfaces were solvated
in either docking simulation however in the second execution, water was not used in the calculation
of binding affinity by scoring any interaction which involved water molecules in those interactions.
Table 5 shows the comparison between docking simulations where water molecules were included
in the scoring protocol and when water molecules were not used in calculation of interactions
between the protein and the peptide. The time for calculation was markedly longer for the
simulations which did not use water molecules in the scoring of interactions. This was due to the
higher number of poses retained at the various steps in the protocol. The run times can be seen in
Table 5.
57
Table 4. Comparison of EXSAN scores and parameters between the consensus docking pose
(C) and the X-ray (X) peptide conformation
Interface EXSAN Total
Score
Displaced
Waters
Displaced
Waters Score
H-
Bonds
a
Hydro
Phobic
b
Wrong
Pocket
c
Bulk
Facing
d
Salt
Bridge
X C X C X C X C X C X C X C X C
1T5W
-7866
-8251 53 53 -1575 -2800 11 7 18 15 4 4 2 2 0 0
2NNA -8116 -10277 50 60 -1945 -3070 9 9 10 8 2 1 2 1 2 3
2Q6W -8800 -10578 55 58 -1625 -2615 18 13 5 1 3 6 3 3 0 0
4I5B -7387 -8898 64 64 -1925 -3170 9 8 14 15 7 1 2 2 0 2
4IS6 -5721 -7930 55 57 -1335 -2910 8 5 7 16 3 2 1 1 2 2
4X5W -8786 -8032 60 61 -2140 -2545 10 7 22 20 6 7 2 0 0 0
5KSV -7689 -7203 67 57 -1370 -1675 8 4 23 23 3 0 1 1 0 0
5LAX -5451 -7293 49 50 -1110 -2575 9 9 14 14 7 6 1 1 0 1
a
Number of hydrogen bonds between peptide backbone atom and protein backbone or sidechain
atom.
b
Hydrophobic peptide sidechain with a carbon or sulfur in proximity (2.3Å-4.5Å) to a carbon or
sulfur in a hydrophobic sidechain on the protein. Any unique pair of residues is only scored once.
c
Hydrophilic sidechain (SC) in a hydrophobic pocket in the protein, wherein the most polar
atom of the SC is within 4.5Å of a carbon or sulfur atom in a hydrophobic SC on the protein.
d
Score for a bulk-facing hydrophilic sidechain (SC), in which the most polar atom of the SC
faces away from the center of the protein into bulk.
A third set of simulations were conducted to probe the validity of the Water Replacement
parameter. These runs were identical to the default protocol except water molecules which were
“replaced” by a backbone carbonyl oxygen were not given the additional score. This score is minor
in comparison to other interactions exhibited by the peptide, but we believe is still an important
observed phenomenon which should be appropriately scored. During this protocol, any water
molecules which were displaced and then replaced still scored interactions such as the water
molecule displacement energy, or hydrogen bond formation of the incoming amino acid carbonyl
oxygen. Just the additional score of Water Replacement was omitted.
The removal of this parameter affected the results of the program in a variable way depending
58
on the interface. Results for the no Water Replacement protocol can be seen in table 5. Three
interfaces failed to complete the peptide as the number of conformations of the peptide (poses)
became intractable. All but one of the interfaces which did complete resulted in worse RMSD.
One interface (PDB ID: 2Q6W) resulted in an improved RMSD, from 4.92 Å to 4.09 Å, with the
removal of the Water Replacement score. It should be noted in the default protocol a second
consensus pose was generated which had an RMSD of 2.19 Å.
Table 5. EXSAN protocols, peptide binding predictions & runtimes
PDB ID PROTOCOL RUNTIME
e
RMSD
TIER 1
RMSD
TIER 2
RMSD
TIER 3
1T5W Default
b
16:07:54 1.78 6.07 -
No Wat Rep
c
Intractable @ YSDQATPL - - -
No Wat
d
Intractable @ YSDQ - - -
2NNA Default 7:59:42 1.07 - -
No Wat Rep 18:15:02 1.12 - -
No Wat Intractable @ EGSFQPSQ - - -
2Q6W Default 15:28:45 4.93 2.19 2.94
No Wat Rep 16:41:55 4.09 2.93 -
No Wat Intractable @
WRSDEALPL
- - -
4I5B Default 7:31:25 2.16 - -
No Wat Rep Intractable @ VVKQNCL - - -
No Wat Intractable @ VVK - - -
4IS6 Default 5:19:51 3.42 - -
No Wat Rep 5:38:55 3.80 4.76 4.21
No Wat 60:44:16 13.56 - -
4X5W Default 10:17:58 2.70 - -
No Wat Rep 17:18:34
3.65
a
23.18
a
12.20
a
No Wat 7:11:40 2.63 - -
5KSV Default 2:04:19 2.48 - -
No Wat Rep Intractable @ PLLM - - -
No Wat Intractable @ PLLM - - -
5LAX Default 4:20:59 0.85 - -
No Wat Rep 11:26:19 4.43 - -
No Wat Intractable @ FRAAVPS - - -
a
All 3 Tiers were obtained from a group containing just 1 pose. All Tiers can be viewed as
equivalent.
b
Default: includes water molecules in energy calculations.
c
No Wat Rep: removes the score given for Water Replacement parameter.
59
d
No Wat: all parameters involving water molecules are not used in binding prediction. (All other
parameters: direct hydrogen bonds, salt bridges, hydrophobic interactions, Van der Waals clash,
hydrophilic, Etc. are static in all protocols).
e
All protocols which reach a threshold of 1.5 million peptide conformations (poses) were halted
(noted as Intractable and sequence completed before reaching threshold).
4.4 Discussion
4.4.1 Prediction of Crystal Structures
The anchored fragment-based docking approach with inclusion of explicit solvent in EXSAN
was effective for prediction of peptides binding to MHC class II molecules. The strengths of
EXSAN are the ability to make prediction without prior knowledge, except the first “anchor”
residue, and the prediction of interface water molecules in the binding pockets help drive the proper
conformation of the docked peptide. The residue growth approach overcomes the major difficulty
in peptide docking, which is the flexibility of the peptide. Using this approach, all rotatable bonds
in the peptide are flexible while dramatically reducing the number of independent geometric
variables that must be considered by the program at any moment. For all these reasons, EXSAN
is a powerful tool for de novo prediction of peptide binding to a protein target.
The inclusion of a water network in the protocol showed strong prediction power resulting in
low RMSD relative to the experimentally determined structure. Interactions between the protein
and the peptide which did not involve water (i.e. direct hydrogen bonds, hydrophobic interactions,
electrostatic forces) were also accurately identified in the predicted structures. As shown in table
2, the parameters used for prediction showed that crystal structures and predicted structures
exhibited an equivalent number of hydrogen bonds, hydrophobic contacts, salt bridges and a
number of other interactions the program incorporated for prediction. Not only are the predicted
60
structures similar in binding but they are exhibiting the same interactions observed in the
experimental structures.
The removal of the Water Replacement parameter in the protocol resulted in reduced accuracy
of predicted peptides. The influence of this parameter was greater than expected. The increase in
RMSD was substantial in the majority of the interfaces. One reason for this change is the grain at
which backbone torsional angles are permutated. The torsional angles were permutated every 10
degrees. A finer grain would be too computationally demanding. If 2 poses generate carbonyl
oxygens 10° apart but one pose replaces a water molecule while the second does not, the pose
which replaces a water molecule is more likely to be retained as it is more energetically favorable.
It was given the additional Water Replacement score. If that score is removed, both poses are
equally favorable. However, the conformation which replaces a water molecule is more likely to
be the experimentally determined structure conformation based on the analysis previously
described. This 10° change would be a minor difference in 3D space at that point in the peptide.
But the difference becomes greater as the peptide extends. The small change propagates along the
peptide. Resulting in a greater deviation from the crystal structure. This water replacement score
helped discern and align the predicted peptide as sequential residues are generated. Resulting in
more accurate prediction of the docked peptide.
The removal of energy calculation involving water molecules all together was shown to have
a negative effect on the prediction of peptide binding. The protocol in which the protein-peptide
interface was solvated, but water molecules interactions were not scored, resulted in worse
predictions. This confirms the vital role water molecules play in the docking of peptides to MHC
class II molecules. Without the inclusion of water molecules, prediction models cannot replicate
in vivo conditions. In a number of interfaces, the program failed to complete due to an intractable
61
number of peptide conformations.
A cutoff for the number of peptide conformations (poses), was set at 1.5 million. At any point
in the protocol if the number of poses reached that threshold the program was terminated. All
interfaces where water molecules were included in energy calculations never reached this
threshold. However, six of the eight interfaces which omitted water molecule energetics failed to
complete as the reach the cutoff value. The reason for the difference in the number of poses is a
result of the EXSAN protocol. The energies of all the poses is used to cull unfavorable
conformations of the peptide. Poses with low scores (defined as a score < [65% * (max – min) +
min]) are culled. The range (max – min) is affected by the interactions exhibited in each pose. If
the set of all poses (i.e. N poses) are very similar in score, the range is small, and the number of
poses retained will increase. These poses then continue with residue addition, as described in
methods. The higher the number of poses kept at any one step, the higher the number of poses
generated when adding the subsequent amino acids in the peptide. Leading to potentially an
intractable number of poses prior to the completion of the peptide.
Conversely, if the range of scores for N poses is large, the number of poses retained actually
decreases. The highly scored poses are binding in a conformation which is favorable and believed
to be the experimental binding conformation. There is no reason, therefore, to keep poses which
deviate from favorable trajectories. This is an important attribute of EXSAN. If there are no
interactions which are highly favorable, the program needs to retain more options, as the binding
conformation is not clear. However, if there are favorable interactions, the program is able to
discern what poses should be culled based upon energy differences.
Water molecule energetics help discriminate between favorable and unfavorable poses. The
scoring of water molecules helps dictate what conformations of the peptide are favorably scored.
62
If a pose displaces a water molecule that is interacting with the protein via a hydrogen bond, it is
unfavorable for that water molecule to be released. The peptide conformation would be less
favorable in vivo as it has to overcome the enthalpic barrier of breaking a hydrogen bonded water
molecule. But, a water molecule which is trapped in a hydrophobic environment has a high
propensity for release into bulk solvent. A peptide conformation which can displace that water
molecule would be more favorable due to the entropic gain of displacing the water molecule. This
distinction of water molecules across an entire peptide leads to a wider range of scores for the
various conformations of the peptide. Which allows EXSAN to cull more poses based upon the
formula used for energy discretion. This shows the importance of individual water molecules in
the binding of peptides, and the effects on prediction without solvation. Water energetics reduces
the number of poses generated at proceeding steps and stopping calculations from becoming
intractable.
Accurately predicting peptide binders has vast implications for a number of biological
applications such as vaccine development and autoimmune diseases. EXSAN has been shown to
predict peptides binders to MHC class II molecules. An ability which can help reduce the time and
cost required in experimental determination of peptide binding aiding in research across multiple
disciplines.
63
References
1. Monaco JJ. A molecular model of MHC class-I-restricted antigen processing. Immunology today.
1992;13(5):173-179.
2. Yewdell JW, Bennink JR. Immunodominance in major histocompatibility complex class I-
restricted T lymphocyte responses. Annu Rev Immunol. 1999;17:51-88.
3. Murphy K, Weaver C. Janeway's immunobiology. 9th edition. ed. New York, NY: Garland
Science/Taylor & Francis Group, LLC; 2016.
4. Reith W, LeibundGut-Landmann S, Waldburger JM. Regulation of MHC class II gene expression
by the class II transactivator. Nat Rev Immunol. 2005;5(10):793-806.
5. Lafuente EM, Reche PA. Prediction of MHC-peptide binding: a systematic and comprehensive
overview. Curr Pharm Des. 2009;15(28):3209-3220.
6. Wieczorek M, Abualrous ET, Sticht J, et al. Major Histocompatibility Complex (MHC) Class I and
MHC Class II Proteins: Conformational Plasticity in Antigen Presentation. Front Immunol.
2017;8:292.
7. Bui HH, Schiewe AJ, von Grafenstein H, Haworth IS. Structural prediction of peptides binding to
MHC class I molecules. PROTEINS: Structure, Function, and Bioinformatics. 2006;63(1):43-52.
8. Chicz RM, Urban RG, Lane WS, et al. Predominant naturally processed peptides bound to HLA-
DR1 are derived from MHC-related molecules and are heterogeneous in size. Nature.
1992;358(6389):764-768.
9. Giles JB, Brill DA, Chavoya A, Haworth IS. Algorithms for prediction of peptide binding to major
histocompatibility complex (MHC) molecules. In: Computer Modelling: New Technologies,
Applications and Techniques. Nova Science Publishers; 2018 (In Press).
10. Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: Improved
Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding
Affinity Data. J Immunol. 2017;199(9):3360-3368.
11. Kyeong HH, Choi Y, Kim HS. GradDock: rapid simulation and tailored ranking functions for
peptide-MHC Class I docking. Bioinformatics. 2018;34(3):469-476.
12. Antunes DA, Devaurs D, Moll M, Lizee G, Kavraki LE. General Prediction of Peptide-MHC Binding
Modes Using Incremental Docking: A Proof of Concept. Sci Rep. 2018;8(1):4327.
13. Jensen KK, Andreatta M, Marcatili P, et al. Improved methods for predicting peptide binding
affinity to MHC class II molecules. Immunology. 2018.
14. Brill DA, Giles JB, Haworth IS. EXSAN: Explicit-Solvent Anchored Fragment-Based Docking.
Journal of Chemical Information amd Modeling. 2018.
15. Antes I, Siu SW, Lengauer T. DynaPred: a structure and sequence based method for the
prediction of MHC class I binding peptide sequences and conformations. Bioinformatics.
2006;22(14):e16-24.
16. Jandrlic DR. SVM and SVR-based MHC-binding prediction using a mathematical presentation of
peptide sequences. Comput Biol Chem. 2016;65:117-127.
17. Karosiene E, Lundegaard C, Lund O, Nielsen M. NetMHCcons: a consensus method for the major
histocompatibility complex class I predictions. Immunogenetics. 2012;64(3):177-186.
18. Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, Nielsen M. NetMHC-3.0: accurate web
accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length
8–11. Nucleic acids research. 2008;36(suppl_2):W509-W512.
19. Schuler MM, Nastke M-D, Stevanović S. SYFPEITHI. In: Immunoinformatics. Springer; 2007:75-93.
20. Bordner AJ. Towards universal structure-based prediction of class II MHC epitopes for diverse
allotypes. PLoS One. 2010;5(12):e14383.
64
21. Nielsen M, Lund O. NN-align. An artificial neural network-based alignment algorithm for MHC
class II peptide binding prediction. BMC Bioinformatics. 2009;10(1):296.
22. Kim Y, Ponomarenko J, Zhu Z, et al. Immune epitope database analysis resource. Nucleic acids
research. 2012;40(W1):W525-W530.
23. Guo L, Luo C, Zhu S. MHC2SKpan: a novel kernel based approach for pan-specific MHC class II
peptide binding prediction. BMC Genomics. 2013;14 Suppl 5:S11.
24. Xu Y, Luo C, Qian M, Huang X, Zhu S. MHC2MIL: a novel multiple instance learning based method
for MHC-II peptide binding prediction by considering peptide flanking region and residue
positions. BMC Genomics. 2014;15 Suppl 9:S9.
25. Atanasova M, Patronov A, Dimitrov I, Flower DR, Doytchinova I. EpiDOCK: a molecular docking-
based tool for MHC class II binding prediction. Protein Eng Des Sel. 2013;26(10):631-634.
26. Swain MT, Brooks AJ, Kemp GJ. An automated approach to modelling class II MHC alleles and
predicting peptide binding. Paper presented at: Bioinformatics and Bioengineering Conference,
2001. Proceedings of the IEEE 2nd International Symposium on2001.
27. Khan JM, Ranganathan S. pDOCK: a new technique for rapid and accurate docking of peptide
ligands to Major Histocompatibility Complexes. Immunome Res. 2010;6 Suppl 1:S2.
28. Zhang H, Lundegaard C, Nielsen M. Pan-specific MHC class I predictors: a benchmark of HLA
class I pan-specific prediction methods. Bioinformatics. 2009;25(1):83-89.
29. Hoof I, Peters B, Sidney J, et al. NetMHCpan, a method for MHC class I binding prediction
beyond humans. Immunogenetics. 2009;61(1):1-13.
30. Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC class I
molecules integrating information from multiple receptor and peptide length datasets. Genome
Med. 2016;8(1):33.
31. Nielsen M, Lundegaard C, Blicher T, et al. NetMHCpan, a method for quantitative predictions of
peptide binding to any HLA-A and -B locus protein of known sequence. PLoS One.
2007;2(8):e796.
32. Vita R, Overton JA, Greenbaum JA, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids
Res. 2015;43(Database issue):D405-412.
33. Deres K, Schumacher TN, Wiesmuller KH, et al. Preferred size of peptides that bind to H-2 Kb is
sequence dependent. Eur J Immunol. 1992;22(6):1603-1608.
34. Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S. SYFPEITHI: database for
MHC ligands and peptide motifs. Immunogenetics. 1999;50.
35. Dallas DC, Guerrero A, Parker EA, et al. Current peptidomics: applications, purification,
identification, quantification, and functional analysis. Proteomics. 2015;15(5-6):1026-1038.
36. Tian F, Yang L, Lv F, Yang Q, Zhou P. In silico quantitative prediction of peptides binding affinity
to human MHC molecule: an intuitive quantitative structure-activity relationship approach.
Amino Acids. 2009;36(3):535-554.
37. Kim Y, Sidney J, Pinilla C, Sette A, Peters B. Derivation of an amino acid similarity matrix for
peptide: MHC binding and its application as a Bayesian prior. BMC Bioinformatics. 2009;10:394.
38. Wang P, Sidney J, Dow C, Mothe B, Sette A, Peters B. A systematic assessment of MHC class II
peptide binding predictions and evaluation of a consensus approach. PLoS Comput Biol.
2008;4(4):e1000048.
39. Trolle T, McMurtrey CP, Sidney J, et al. The Length Distribution of Class I-Restricted T Cell
Epitopes Is Determined by Both Peptide Supply and MHC Allele-Specific Binding Preference. J
Immunol. 2016;196(4):1480-1487.
40. Andreatta M, Karosiene E, Rasmussen M, Stryhn A, Buus S, Nielsen M. Accurate pan-specific
prediction of peptide-MHC class II binding affinity with improved binding core identification.
Immunogenetics. 2015;67(11-12):641-650.
65
41. Nielsen M, Lundegaard C, Blicher T, et al. Quantitative predictions of peptide binding to any
HLA-DR molecule of known sequence: NetMHCIIpan. PLoS Comput Biol. 2008;4.
42. Lindahl E, Hess B, van der Spoel D. GROMACS 3.0: a package for molecular simulation and
trajectory analysis. Molecular modeling annual. 2001;7(8):306-317.
43. Canutescu AA, Shelenkov AA, Dunbrack RL, Jr. A graph-theory algorithm for rapid protein side-
chain prediction. Protein Sci. 2003;12(9):2001-2014.
44. Schmid N, Eichenberger AP, Choutko A, et al. Definition and testing of the GROMOS force-field
versions 54A7 and 54B7. Eur Biophys J. 2011;40(7):843-856.
45. Doytchinova IA, Walshe V, Borrow P, Flower DR. Towards the chemometric dissection of
peptide--HLA-A*0201 binding affinity: comparison of local and global QSAR models. J Comput
Aided Mol Des. 2005;19(3):203-212.
46. Bordner AJ, Abagyan R. Ab initio prediction of peptide-MHC binding geometry for diverse class I
MHC allotypes. Proteins. 2006;63(3):512-526.
47. Brusic V, Rudy G, Harrison LC. MHCPEP, a database of MHC-binding peptides: update 1997.
Nucleic Acids Res. 1998;26(1):368-371.
48. Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S. SYFPEITHI: database for
MHC ligands and peptide motifs. Immunogenetics. 1999;50(3-4):213-219.
49. Korber B, Brander C, Haynes B, et al. Los Alamos National Laboratory. Theoretical Biology and
Biophysics, Los Alamos, NM. 2005.
50. Baker D. Rosetta Commons. 2015; https://www.rosettacommons.org. Accessed 4/1/2018, 2018.
51. Ting D, Wang G, Shapovalov M, Mitra R, Jordan MI, Dunbrack RL, Jr. Neighbor-dependent
Ramachandran probability distributions of amino acids developed from a hierarchical Dirichlet
process model. PLoS Comput Biol. 2010;6(4):e1000763.
52. Morris GM, Huey R, Lindstrom W, et al. AutoDock4 and AutoDockTools4: Automated docking
with selective receptor flexibility. J Comput Chem. 2009;30(16):2785-2791.
53. Yanover C, Bradley P. Large-scale characterization of peptide-MHC binding landscapes with
structural simulations. Proc Natl Acad Sci U S A. 2011;108(17):6981-6986.
54. Pronk S, Pall S, Schulz R, et al. GROMACS 4.5: a high-throughput and highly parallel open source
molecular simulation toolkit. Bioinformatics. 2013;29(7):845-854.
55. Reichardt C, Welton T. Solvents and solvent effects in organic chemistry. John Wiley & Sons;
2011.
56. Kellis JT, Nyberg K, Fersht AR. Contribution of hydrophobic interactions to protein stability.
Nature. 1988;333(6175):784-786.
57. Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29(31):7133-7155.
58. Nicholls A, Sharp KA, Honig B. Protein folding and association: insights from the interfacial and
thermodynamic properties of hydrocarbons. Proteins: Structure, Function, and Bioinformatics.
1991;11(4):281-296.
59. Subramanian P, Beveridge D. A Monte Carlo simulation study of the aqueous hydration of d
(CGCGCG) in Z form. Theoretica chimica acta. 1993;85(1-3):3-15.
60. Nemethy G, Peer W, Scheraga H. Effect of protein-solvent interactions on protein conformation.
Annual review of biophysics and bioengineering. 1981;10(1):459-497.
61. Fersht A. Structure and mechanism in protein science: a guide to enzyme catalysis and protein
folding. Macmillan; 1999.
62. Wolfenden R, Kati WM. Testing the limits of protein-ligand binding discrimination with
transition-state analogue inhibitors. Accounts of chemical research. 1991;24(7):209-215.
63. Ball P. Water as an active constituent in cell biology. Chemical reviews. 2008;108(1):74-108.
66
64. Schreiber G, Fersht AR. Energetics of protein-protein interactions: Analysis ofthe Barnase-
Barstar interface by single mutations and double mutant cycles. Journal of molecular biology.
1995;248(2):478-486.
65. Burnett JC, Kellogg GE, Abraham DJ. Computational Methodology for Estimating Changes in Free
Energies of Biomolecular Association upon Mutation. The Importance of Bound Water in Dimer−
Tetramer Assembly for β37 Mutant Hemoglobins. Biochemistry. 2000;39(7):1622-1633.
66. Hajari T, van der Vegt NF. Solvation thermodynamics of amino acid side chains on a short
peptide backbone. The Journal of chemical physics. 2015;142(14):04B607_601.
67. Janin J. Wet and dry interfaces: the role of solvent in protein–protein and protein–DNA
recognition. Structure. 1999;7(12):R277-R279.
68. Klon AE. Fragment-based Methods in Drug Discovery. Springer; 2015.
69. Du X, Li Y, Xia Y-L, et al. Insights into protein–ligand interactions: Mechanisms, models, and
methods. International journal of molecular sciences. 2016;17(2):144.
70. Lee H, Heo L, Lee MS, Seok C. GalaxyPepDock: a protein-peptide docking tool based on
interaction similarity and energy optimization. Nucleic Acids Res. 2015;43(W1):W431-435.
71. London N, Raveh B, Cohen E, Fathi G, Schueler-Furman O. Rosetta FlexPepDock web server--high
resolution modeling of peptide-protein interactions. Nucleic Acids Res. 2011;39(Web Server
issue):W249-253.
72. Antes I. DynaDock: A new molecular dynamics-based algorithm for protein-peptide docking
including receptor flexibility. Proteins. 2010;78(5):1084-1104.
73. Schindler CE, de Vries SJ, Zacharias M. Fully Blind Peptide-Protein Docking with pepATTRACT.
Structure. 2015;23(8):1507-1515.
74. Blaszczyk M, Kurcinski M, Kouza M, et al. Modeling of protein-peptide interactions using the
CABS-dock web server for binding site search and flexible docking. Methods. 2016;93:72-83.
75. Kurcinski M, Jamroz M, Blaszczyk M, Kolinski A, Kmiecik S. CABS-dock web server for the flexible
docking of peptides to proteins without prior knowledge of the binding site. Nucleic Acids Res.
2015;43(W1):W419-424.
76. Ben-Shimon A, Niv MY. AnchorDock: Blind and Flexible Anchor-Driven Peptide Docking.
Structure. 2015;23(5):929-940.
77. Trellet M, Melquiond AS, Bonvin AM. A unified conformational selection and induced fit
approach to protein-peptide docking. PLoS One. 2013;8(3):e58769.
78. Allen WJ, Balius TE, Mukherjee S, et al. DOCK 6: Impact of new features and current docking
performance. J Comput Chem. 2015;36(15):1132-1156.
79. Verdonk ML, Chessari G, Cole JC, et al. Modeling water molecules in protein-ligand docking using
GOLD. J Med Chem. 2005;48(20):6504-6515.
80. Bui HH, Schiewe AJ, Haworth IS. WATGEN: an algorithm for modeling water networks at protein-
protein interfaces. J Comput Chem. 2007;28(14):2241-2251.
81. Bui H-H, Schiewe AJ, von Grafenstein H, Haworth IS. Structural prediction of peptides binding to
MHC class I molecules. Proteins: Structure, Function, and Bioinformatics. 2006;63(1):43-52.
82. Schiewe AJ, Haworth IS. Structure-based prediction of MHC–peptide association: Algorithm
comparison and application to cancer vaccine design. Journal of Molecular Graphics and
Modelling. 2007;26(3):667-675.
83. Matsumura M, Fremont DH, Peterson PA, Wilson IA. Emerging principles for the recognition of
peptide antigens by MHC class I molecules. Science. 1992;257(5072):927-934.
Abstract (if available)
Abstract
Major histocompatibility complex (MHC) class I and II molecules play a vital role in the cellular immune system. MHC molecules present peptide fragments (antigens) derived from pathogens and display them at the cell surface for T-cell recognition. MHC genes are highly polymorphic and polygenic, and there are a large number of peptide binders for any single MHC allele, which makes experimental determination of all binders intractable. Computational prediction methods have been used to address this problem. Structure-based algorithms calculate the conformation of a peptide based upon interactions between the MHC and the bound peptide. Limited work has been conducted on sequence-based MHC class II prediction methods. ❧ The investigation of algorithms for the prediction of peptides leads to the conclusion that there is a lack of developed programs for sequence-based prediction of binders to MHC class II molecules. Additionally, those programs do not accurately replicate in vivo conditions as they fail to include a water network at the protein-peptide interface. Interfacial water between protein-peptide complexes play an active role in the binding affinity of peptides. These findings lead to the development of a sequence-based model for the docking of peptides to MHC class II. Here, we present the development and validation of a structure-based program, EXSAN (EXplicit Solvent ANchored Docking), which allows explicit inclusion of individual water molecules in the docking of protein-peptide complexes and demonstrates the importance of solvation in the prediction of peptide binding at these interfaces. The program shows strong prediction power for class II molecules. The ability to predict binders to MHC molecules has far reaching implications such as aiding in vaccine development. ❧ Utilization of EXSAN for MHC class II peptide binding predictions resulted in a number of insights, including the influence of water. The lack of solvation dramatically reduced the prediction ability of EXSAN. Solvated docking leads to improved prediction, as EXSAN is able to reproduce the solvation pattern at the interface accurately, as well as calculate interactions such as water-mediated hydrogen bonding, displacement of water, and water-mediated salt bridges present in protein-peptide interfaces. ❧ Additionally, we were able to show what is believed to be a novel phenomenon. EXSAN showed that peptides preferentially bind in a conformation where an oxygen of a backbone carbonyl replaces a water molecule present in the unbound protein. That is, during peptide binding an oxygen will be oriented in such a way as to be in near vicinity of displaced water molecule, so as to maintain the hydrogen bond network present prior to the ligand binding. MHC class II molecules exhibited this phenomenon.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Computational modeling of solvation and docking of peptide-MHC class I
PDF
Structural prediction of MHC-peptide-TCR interactions: potential for vaccine design
PDF
Prediction of peptides in formation of MHC class I - peptide - TCR complexes using molecular models and artificial intelligence
PDF
Computer modeling of protein-peptide interface solvation
PDF
Structure-based computational analysis and prediction of TCR CDR3 loops in the TCR-peptide-MHC complex using solvation parameters and peptide molecular dynamics.
Asset Metadata
Creator
Giles, Jason Britt (author)
Core Title
Solvation as a driving force for peptide docking to the major histocompatibility complex (MHC) class II molecules
School
School of Pharmacy
Degree
Master of Science
Degree Program
Molecular Pharmacology and Toxicology
Publication Date
08/08/2020
Defense Date
08/08/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
antigen,computation,fragment docking,HLA,human leukocyte antigen,immunology,in silico,ligand,major histocompatibility complex,MHC,MHC class II,OAI-PMH Harvest,peptide,prediction,solvation,Water
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Haworth, Ian (
committee chair
), Alachkar, Houda (
committee member
), Xie, Jianming (
committee member
)
Creator Email
jasonbgiles@email.arizona.edu,jasongil@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-64125
Unique identifier
UC11668923
Identifier
etd-GilesJason-6710.pdf (filename),usctheses-c89-64125 (legacy record id)
Legacy Identifier
etd-GilesJason-6710.pdf
Dmrecord
64125
Document Type
Thesis
Format
application/pdf (imt)
Rights
Giles, Jason Britt
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
antigen
computation
fragment docking
HLA
human leukocyte antigen
immunology
in silico
ligand
major histocompatibility complex
MHC
MHC class II
peptide
prediction
solvation