Zhaohui Wang
Modeling of Water Molecules in Protein-Ligand Binding
By
Zhaohui Wang
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCES (Pharmaceutical Sciences)
December 2019
Copyright 2019 Zhaohui Wang
Contents
ABSTRACT
Chapter 1. Introduction
1.1. Background
1.2. Role of Water in Protein-Ligand Binding
1.3. Methods for Evaluation and Prediction of Explicit Water
1.3.1. WATGEN
1.3.2. WaterScore
1.3.3. WaterMap
1.3.4. SZMAP
1.3.5. Other Methods
Chapter 2. WATGEN Analysis
2.1. Cases of Simple Analysis
2.1.1. Method
2.1.2. Results
2.2. Global Data Analysis
2.2.1. Method
2.2.2. Results
Chapter 3. Discussion
Chapter 4. Conclusion
References
ABSTRACT
WATGEN is an efficient tool with easy access for quickly adding water molecules based on their
interactions with the protein-ligand interface. WATGEN not only adds hydration layers to protein
complexes but also provides detailed information on all possible interactions of each water
molecule with either the complex or the surrounding water. In this study, the accuracy of
WATGEN for adding water molecules to 3,485 protein-ligand complexes was examined by
comparing the WATGEN water with existing water in X-ray structures. Several kinds of data for
the ligand and its surrounding waters were collected from WATGEN and other tools. The
rationality of WATGEN results was inspected for seven important protein-drug complexes by
looking into the specific interactions of each water that is essential to the ligand binding. A few
highly conserved water molecules reported previously were predicted, which further increases the
credibility of WATGEN. In a global analysis, statistical methods were used to find correlations
among the data collected. Even though these data were highly scattered and no significant
relationship was found except for those already known, a non-standard analysis showed several
general trends that comply with presumptions based on concrete evidence. An
attempt to transform the scattered data into a prediction model failed because the
feature parameters were insufficient.
Chapter 1. Introduction
1.1. Background
Water is one of the most essential elements to life both macroscopically and microscopically. It
nourishes life and provides the nurturing environment for cell reproduction and function. Life is
based on cells and organic molecules, which play the roles of agents of most physiological
functions. When an organism is in a well-balanced state, every component works properly and in
an orderly manner without deviating too much from the homeostasis no matter whether the
function is known. For humans, once the balance has been broken, various kinds of diseases may
develop. Traditionally, from long historical practice and without knowing the origin of these
diseases, people learned to use agents to change an unhealthy physiological state, such as
penicillin for treating bacterial infections, or to offset adverse effects directly, such as
calcium carbonate for treating acid reflux. By definition, a drug is any chemical substance other than a food or device
that affects the function of a living organism. With deeper knowledge on biology, biochemistry,
and physiology, studies of how a drug affects body function and how it is processed within the
body have led to the development of pharmacokinetics and pharmacodynamics. A consensus has
been reached that most drugs exert their effects by interacting with various functional proteins,
and studies of these interactions at a molecular level have started to gather more and more attention.
Early practice to identify and select therapeutically relevant targets came directly from known
drugs. In vitro experiments were done by extracting contents from animal tissues or
physiologically relevant cell lines treated with the drug to determine which protein the drug was
binding; then the equilibrium constant (K) for formation of a complex (if not covalently bonded),
generally referred to as binding affinity, was tested for each drug that binds with a specific protein.
This is usually expressed as the half maximal inhibitory concentration (IC50), inhibition constant
(Ki), or dissociation constant (Kd), depending on the situation. As might be expected, the binding
affinity of a drug to a functional protein is closely related to the drug's potency and efficacy in
most cases, and is related to the standard Gibbs free energy by ΔG° = -RT ln K. In reality,
researchers have long been using physicochemical terms from the perspective of energy to explain
and describe the function of drugs and other endogenous molecules, but due to the complex energy
terms coming from numerous enthalpic and entropic contributions, a vast amount of computation
is needed, which was hard to realize just a few decades ago (Bissantz et al., 2010).
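As a worked illustration of the relation ΔG° = -RT ln K, the sketch below converts a pKd value into a standard binding free energy. It assumes that K is the association constant (1/Kd) referenced to a 1 M standard state and that the temperature is 298.15 K; the function name and the choice of kcal/mol are illustrative and do not come from the thesis.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(p_kd: float, temperature: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from pKd = -log10(Kd / 1 M)."""
    kd = 10.0 ** (-p_kd)  # dissociation constant in mol/L
    # K in dG = -RT ln K is the association constant 1/Kd, so dG = RT ln Kd.
    return R_KCAL * temperature * math.log(kd)


if __name__ == "__main__":
    # pKd 8.29 is the affinity listed for progesterone (1A28) in Table 1.
    print(f"{binding_free_energy(8.29):.2f} kcal/mol")
```

Under these assumptions a pKd of 8.29 corresponds to roughly -11.3 kcal/mol, and each pKd unit is worth about 1.36 kcal/mol at room temperature.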
With continuous progress in computational power and constant upgrades to algorithms for more
and more complex applications, the field of in silico drug design has exploded in the past two
decades and is now firmly established as an essential component of early drug lead discovery (Jorgensen,
2009). General energy terms have been integrated into what is called a force field: a specialized
functional form with a set of optimized parameters, derived from classic experiments
in physics or chemistry or from quantum mechanics (QM) computations, used to calculate the potential
energy of a physicochemical system. The best-known force fields in computational chemistry
are AMBER (Cornell et al., 1995), CHARMM (MacKerell, 1998), and GROMOS (van Gunsteren
and Berendsen, 1987).
Computational chemists have been working with in silico simulation software to predict
binding affinity from known or predicted physicochemical properties and to determine whether a
newly found chemical is druggable for the next round of selection. Virtual screening and
optimization are two critical processes for increasing the speed and efficiency and decreasing the
cost of drug discovery. Nevertheless, challenges remain in developing methods with more rational
theoretical bases, not only for general situations but also for particular cases
with unique binding patterns, such as covalent binding, which standard docking software handles
poorly (Kumalo et al., 2015). Currently, the two major challenges faced are how to
consider protein flexibility and how to treat water molecules as part of the interacting system, since
both factors play essential roles in protein-ligand binding in a dynamic way (Cavasotto et al., 2005;
Verdonk et al., 2005). Theoretically, structures can be represented in a simplified way as 3-D
graphics, as most simulation and docking software does, since every binding system has a state with
the lowest overall energy. However, graphics are not enough to interpret
the dynamic aspect of binding, to which more effort should be devoted (Antunes et al., 2015).
Traditionally, water molecules in protein-ligand binding are treated implicitly, for example with
MM-PBSA (Molecular Mechanics - Poisson-Boltzmann Surface Area), developed under
the classical AMBER force field, and its counterpart MM-GBSA (Molecular Mechanics -
Generalized Born Surface Area), in which the free energy is determined by adding up three terms:
the change in gas-phase energy, the change in solvation free energy (ΔGsol), and the change in the
configurational entropy. ΔGsol is the term that represents the water within the system.
MM-PB(GB)SA has the advantage of reducing the computational cost while tending to ignore
some interactions between water, ligand, and receptors (Genheden and Ryde, 2015; Hou et al.,
2011; Sun et al., 2014). An alternative implicit way to consider water is to use QM-based strategies,
which are computationally more expensive and complicated to apply.
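The three-term sum described above can be written out as follows. This is the commonly quoted end-point form, given here as a sketch; the subscripts and the split of the solvation term into polar and nonpolar parts are notation assumed for illustration rather than taken from the thesis.

```latex
\Delta G_{\mathrm{bind}} \approx \Delta E_{\mathrm{gas}} + \Delta G_{\mathrm{sol}} - T\,\Delta S_{\mathrm{conf}},
\qquad
\Delta X = X_{\mathrm{complex}} - X_{\mathrm{receptor}} - X_{\mathrm{ligand}}

\Delta G_{\mathrm{sol}} = \Delta G_{\mathrm{polar}}^{\mathrm{PB\ or\ GB}} + \Delta G_{\mathrm{nonpolar}}^{\mathrm{SA}}
```

In this form the water enters only through the continuum term ΔGsol, which is why individual water-mediated interactions are not resolved.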
Several other models specifically designed for water have been developed along with different
types of force fields, and they can be adapted to one another. The most widely used water models are SPC
(Simple Point Charge) and SPC/E (extended), and TIP3P (Transferable Intermolecular Potential with 3 Points)
and its successors TIP4P and TIP5P. Parameters associated with properties such as melting
temperature, density at 298 K, and heat of vaporization can be adjusted depending on the
model and situation. All the water models, including the non-specific MM-PB(GB)SA, can be
used to calculate the potential energy of water, whether or not water molecules are explicitly present
in the structures, but they are generally not user-friendly and are highly specialized for
specific fields in which considerable training is needed. Ready-to-use
simulation/docking software has been developed based on force fields with integrated water
models, but the water force fields only consider the effect of water implicitly in the form of
potential energy contributions, with no information on the exact water interactions. When
considering binding patterns of proteins, a detailed inspection of the structure is needed to
determine the important interactions, especially when one or more key water molecules are present.
Information about hydration sites can be obtained through high-resolution X-ray crystallography,
which is generally trustworthy but lacks a full map of water distribution inside the structure, and
sometimes information from protein X-ray crystallography is incomplete for technical reasons.
Identifying key water molecules for protein-ligand interaction is widely applied in
drug discovery and drug design: an explicit water layer is computationally generated,
and key water molecules are then taken into account for virtual screening, docking, and de novo
design (Garcia-Sosa et al., 2011). It is generally acknowledged that water molecules can be
regarded as part of pharmacophores (Lloyd et al., 2004; Mikol, 1995); essential roles of water
molecules have been identified within some ligand-receptor complexes, such as poly-ADP ribose
polymerase (PARP) (Garcia-Sosa et al., 2005) and cyclin-dependent kinase 2 (CDK2) (Garcia-Sosa
and Mancera, 2006). Several docking programs have taken explicit water molecules into
consideration, such as GRID (Goodford, 1985) and FITTED (Corbeil and Moitessier, 2009), but
how to determine which water molecules should be included and how the key water molecules
should be used are still under debate (Hu et al., 2018).
1.2. Role of Water in Protein-Ligand Binding
In brief, the role of water molecules in protein-ligand binding can be categorized into four types.
The first is waters tightly bound to the protein, which might be near the binding pocket but
are not affected by introducing a ligand since their binding to the protein is relatively strong;
however, these molecules can act as media connecting the protein and the ligand by forming
single water bridges (SWBs) and double water bridges (DWBs) through hydrogen bonds. Tightly
bound waters can be energetically favorable if they mediate ligand binding, but if the ligand
disrupts this water structure, it is highly enthalpically unfavorable. The second type is loosely
bound waters, which form hydrogen bonds either to atoms with low polarity or to other tightly
bound waters, and can be easily displaced by introducing a ligand; this type of water might also
form water bridges after binding. Displacing these two types of waters brings an
entropy gain and an enthalpy loss. The third type is bulk waters, which can be considered free
waters that simply fill the void space within the protein. These waters have no direct
contact with the protein or ligand and are usually considered not to differ much
from solvent water, with a high degree of freedom; bulk waters are easily displaced by any shape
that clashes with them, without much energy variation. The last type is buried waters,
which are trapped in a small space that has only hydrophobic contacts with the surrounding area;
the water has to enter to prevent a vacuum. This type of water is highly energetically
unfavorable, and there is both an enthalpy and an entropy gain if it is displaced. More and more
water molecules have been found to play an important role in the target binding of extensively
studied drugs, and deliberately engineering water molecules into drug binding sites has become part
of the common considerations for drug design (Hu et al., 2018; Ladbury, 1996).
1.3. Methods for Evaluation and Prediction of Explicit Water
1.3.1. WATGEN
WATGEN is an algorithm developed in the Haworth laboratory, originally for adding explicit
water networks at protein-protein interfaces. In a previous study, it accurately predicted 72% and
88% of water hydration sites within 1.5 Å and 2.0 Å, respectively, and the number of hydration
sites predicted at the interface was generally much higher than the number observed in the X-ray
structures (Bui et al., 2007). WATGEN can explicitly add water molecule sites to most types of
protein complex structures in PDB format, placing water molecules mainly by their interactions
and empirical positioning rather than thermodynamically. WATGEN adds a water layer in four
steps: first, distribution of water oxygen sites around hydrogen centers; second,
categorization of ligand-based or receptor-based hydration sites and elimination of
noninteracting water sites; third, selection of ‘best water sites’ by an empirical scoring system,
excluding VDW clashes; finally, optimization of hydration sites by defining the position of the
hydrogens of each water molecule. The WATGEN algorithm is a simple but reasonable method
that is easy to run and free to use. A single WATGEN run finishes within 1 minute for a
typical 200 kDa complex and does not need supercomputer power, so it is affordable for anyone.
The WATGEN algorithm does not aim to determine the exact position of every water
molecule, but to provide information on potential sites that are highly likely to be hydrated
within a protein structure, especially at the interface between a protein and its interacting ligand.
To date, the algorithm has been successfully applied to MHC I-ligand interfaces and also to
β-cyclodextrin complexes.
Structures with water layers generated by WATGEN can be prepared for further energy evaluation
and docking; in this study, however, the analysis is based on another program developed in the
Haworth laboratory: WaterAnalysis. This program is sometimes referred to collectively as part of
WATGEN. WaterAnalysis does not consider energy the way force field methods do
but in an empirical and quantitative way. It gives information on all possible water interactions at
an interface, such as direct hydrogen bonds, SWBs, DWBs, and hydrophobic interactions. Even
though WATGEN and WaterAnalysis are relatively simple, they provide an informative method
for getting a general idea of how water molecules are distributed in a biochemical macromolecule,
how water molecules interact with the ligand and residues of the protein, and how
water helps with the interaction between the ligand and the protein.
1.3.2. WaterScore
WaterScore (García-Sosa et al., 2003) is a statistical method originally designed for predicting
tightly bound water molecules and loosely bound ones, which are displaceable at the binding sites.
Using multivariate logistic regression analysis, tightly bound water molecules were characterized
by a large number of contacts, a low crystallographic B-factor, and a low solvent-accessible
contact surface area. The main advantage of the WaterScore method is the ability to
rapidly score water molecules using functions that are interpreted through observable
physicochemical properties. It gives the probability that an explicit water molecule in the binding
site is classified as tightly bound, which can also indicate how tightly the water binds to
the structure. Different criteria can be chosen to determine the key waters in a binding complex.
WaterScore was used intensively by García-Sosa for studying the role of tightly bound water
molecules in drug binding (Garcia-Sosa, 2013; Lloyd et al., 2004). The disadvantage of
WaterScore is that it depends heavily on experimental data that already include water
molecules, and it concerns only key water molecules, without giving enough information on other
molecules that are not present in the X-ray structure.
1.3.3. WaterMap
Another noteworthy tool that shows the hydration structure by adding explicit water is WaterMap,
developed by Schrödinger, LLC and used commercially with the docking software Glide. WaterMap
uses inhomogeneous solvation theory (IST or IFST) (Lazaridis, 1998; Lazaridis and Paulaitis,
1992). It starts with a molecular dynamics (MD) simulation of water, records the positions of every
molecule into a density profile, then maps them onto gridded positions and modifies them by energy.
However, a WaterMap calculation might fail to take some deeply buried regions
around the binding site into account.
WaterMap has a very user-friendly GUI and can add highly probable hydration sites and display
them as a continuous water map. As part of an advanced MD suite, it can choose the most suitable
water model and be followed directly by druggability, activity, and selectivity analyses. However, even as
mature commercial software, a full WaterMap calculation takes around one day on a computer with
8-16 CPUs and costs more than $5,000 a year even for academic use, which means it is both
time-consuming and costly and is not realistic for general use.
1.3.4. SZMAP
SZMAP is another commercial application, from OpenEye, for modeling and understanding the
role of water in molecular interactions, which provides insight into essential features of a binding
site. SZMAP, similar to the classical GRID method, adopts a robust semi-continuum solvent
approach combined with classical statistical mechanics (Grant et al., 2001) to calculate the
binding free energy and the thermodynamic components for water, such as hydration entropy. Instead
of identifying important hydration sites directly, SZMAP distinguishes regions of significantly
favorable solvent thermodynamics around the target binding sites from unfavorable ones, shows
different orientations for each water molecule, and then predicts where and how the surrounding
waters would function upon binding with the ligand. Remarkably, it can propose
ligand modification hypotheses designed to better exploit the space of the binding sites, which is
extremely useful for in silico drug design (Bayden et al., 2015).
1.3.5. Other Methods
WaterScore is an efficient method for finding out whether a water molecule is important, and it provides
an online calculator for simple applications. Both WaterMap and SZMAP are developed
commercially and support detailed energy analysis, which makes them excellent choices
for advanced practice. Several other methods have been proposed and applied in this field, including
grid inhomogeneous solvation theory (GIST), implemented in the AmberTools cpptraj package
and then integrated in AutoDock4 (Ramsey et al., 2016); Just Add Water Molecules (JAWS or
JAWM), implemented in a modified version of MCPRO (Michel et al., 2009); WATsite,
implemented in a PyMOL plugin for identifying important water molecules (Hu and Lill, 2014);
and WaterDock, implemented in Dowser++ and also as a PyMOL plugin (Sridhar et al., 2017). Most of
them focus more on the energy and positions of water molecules than on the exact interactions of
water molecules near binding sites.
As computational power continually grows with the booming technological development, there is
a tendency to create more and more advanced and sophisticated methods with the increasing
feasibility of complicated algorithms for application in real tasks to screen and design potential
drugs. However, there is still a place for quick and brief water analysis both for educational
purposes and for early-stage drug selection.
Chapter 2. WATGEN Analysis
2.1. Cases of Simple Analysis
2.1.1. Method
The analysis started with WATGEN v.5.0, an update for running WATGEN on small-molecule
ligands. The PDB file was cleaned before running WATGEN. Biopython, a Python package
widely used in computational molecular biology, was used to extract the ligand and
exclude protein chains that do not have any atom within 10.0 Å of the ligand, to prevent running
errors due to data overflow. Then a Python script was written to format the ligand file into a
Z-matrix stored in a JSON file for WATGEN to recognize. The JSON file also stored information
on ligands such as hydrogen donors and acceptors.
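The chain-filtering step can be sketched as follows. This is a minimal, library-free illustration and not the thesis script: it assumes the coordinates have already been read into plain tuples, and the function name and data layout are hypothetical. In a Biopython workflow, the NeighborSearch class in Bio.PDB is one way to run the same kind of distance query on parsed structures.

```python
import math


def chains_near_ligand(chain_atoms, ligand_atoms, cutoff=10.0):
    """Return IDs of chains with at least one atom within `cutoff` Å of the ligand.

    chain_atoms  : dict mapping chain ID -> list of (x, y, z) atom coordinates
    ligand_atoms : list of (x, y, z) coordinates of the ligand atoms
    """
    keep = []
    for chain_id, atoms in chain_atoms.items():
        # A chain is kept as soon as any of its atoms is close to any ligand atom.
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            keep.append(chain_id)
    return keep


if __name__ == "__main__":
    chains = {"A": [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], "B": [(50.0, 0.0, 0.0)]}
    ligand = [(5.0, 0.0, 0.0)]
    print(chains_near_ligand(chains, ligand))  # chain B is too far away
```

The brute-force double loop is adequate for a single complex; a spatial index would only matter for very large structures or batch runs.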
There are two modes of WATGEN: one is used only to add the interfacial water, while the other
is used to add water to the whole structure. In the first mode, WATGEN searches for all the
possible water-mediated interactions within the binding area, in which case all the water molecules
added could be considered to be important. In the second mode, WATGEN also searches for water-
mediated interactions among all the protein residues. However, besides the water molecules that
directly interact with the protein and ligand, there are some pockets inside the protein that keep
waters simply because there are spaces to fill, and a vacuum state is definitely energetically
unfavorable. These bulk waters interact only with the surrounding waters and are generally missed
in a standard run of WATGEN. Nonetheless, some analyses require that all the explicit water
molecules be considered.
A program named WaterBulk was written in the Haworth laboratory to solve the problem of bulk
water. However, it simply used simulation without considering actual interactions, and the
results from WaterBulk were not ideal. An alternative way to add bulk water was
to run WATGEN several times in mode I: after the first layer of interfacial water was added,
it was treated together with the ligand as a single unit in the next run. Then the
second layer of waters was added based on interactions with the waters from the previous layer
and any other possible interactions. Python scripts were written to implement this process, and the total
number of waters added was limited to a two-digit number.
Several common drugs that are top prescribed on the market were selected to demonstrate the
analysis from WATGEN; the crystallographic structures were obtained from the RCSB PDB
database: 1A28 (Progesterone), 4P6X (Hydrocortisone), 1IE9 (Calcitriol), 1O86 (Lisinopril), 2P16
(Apixaban), 2W26 (Rivaroxaban), and 3HKU (Topiramate).
The file was first run in mode II, then water molecules were trimmed to those within 7.0 Å, so that
not only the interfacial waters but also some protein-interacting waters remained. The reason for
not just using the interfacial waters was that their shape was usually highly
irregular, and adding more layers of water might result in an unwanted shape. Then two more
rounds of WATGEN runs were done to fill the void space, and a last trim to waters within 10.0 Å
of the ligand and 5.0 Å of the ligand and protein taken as a whole was done to save computation in
the later analysis, since waters within 10.0 Å were enough to consider the binding status. Then
WaterAnalysis was used to output information on the interactions of all water molecules, which
was saved as a text file. A Python script was used to obtain information on the waters of interest.
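The final trimming step can be illustrated with a short sketch. It assumes one reading of the criterion, namely that a water is kept only if its oxygen lies within 10.0 Å of some ligand atom and also within 5.0 Å of some atom of the ligand-protein complex; the function names and the tuple-based coordinates are hypothetical and not taken from the thesis scripts.

```python
import math


def min_distance(point, atoms):
    """Shortest distance (Å) from `point` to any coordinate in `atoms`."""
    return min(math.dist(point, a) for a in atoms)


def trim_waters(waters, ligand_atoms, protein_atoms,
                ligand_cutoff=10.0, complex_cutoff=5.0):
    """Keep water oxygens close to the ligand and in contact range of the complex.

    Assumed criterion: within `ligand_cutoff` of the ligand AND within
    `complex_cutoff` of the ligand and protein taken together.
    """
    complex_atoms = list(ligand_atoms) + list(protein_atoms)
    return [
        w for w in waters
        if min_distance(w, ligand_atoms) <= ligand_cutoff
        and min_distance(w, complex_atoms) <= complex_cutoff
    ]


if __name__ == "__main__":
    ligand = [(0.0, 0.0, 0.0)]
    protein = [(8.0, 0.0, 0.0)]
    waters = [(3.0, 0.0, 0.0), (9.0, 0.0, 0.0), (0.0, 7.0, 0.0), (12.0, 0.0, 0.0)]
    print(trim_waters(waters, ligand, protein))
```

In the toy example the third water is dropped because it is more than 5.0 Å from every atom of the complex, and the fourth because it is more than 10.0 Å from the ligand.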
Water molecules generated by WATGEN were compared with waters present in the original X-ray
files using a distance-based matching process implemented with a KD-tree and a graph algorithm. Waters
in one file were chosen as the reference, and waters in the other file were paired to the reference
waters. If a water was within a certain distance of a reference water of interest, it was
considered paired to that water; if more than one water was within the distance, the
closest one was chosen. A water paired to one water could not be paired to another. The
pairing procedure gave the number of paired waters found in the
reference file and the distance between each pair of waters. The locations of the water molecules
referred only to oxygen, and the distance chosen as the matching criterion was 1.5 Å.
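The pairing rules described above can be sketched as a greedy, closest-first, one-to-one matching. This is a simplified stand-in and not the thesis implementation: it uses a brute-force distance search instead of the KD-tree and graph algorithm, and the function name and return format are assumptions. A KD-tree (for example scipy.spatial.cKDTree) could replace the nested loop when many waters are compared.

```python
import math


def pair_waters(reference, candidates, cutoff=1.5):
    """Pair candidate water oxygens to reference oxygens, one-to-one.

    Returns a list of (reference_index, candidate_index, distance) tuples,
    accepted closest-first, so each water is used in at most one pair.
    """
    close = []
    for i, r in enumerate(reference):
        for j, c in enumerate(candidates):
            d = math.dist(r, c)
            if d <= cutoff:
                close.append((d, i, j))
    close.sort()  # shortest distances are considered first

    used_ref, used_cand, pairs = set(), set(), []
    for d, i, j in close:
        if i not in used_ref and j not in used_cand:
            used_ref.add(i)
            used_cand.add(j)
            pairs.append((i, j, d))
    return pairs


if __name__ == "__main__":
    xray = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    watgen = [(0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.2, 0.0, 0.0), (20.0, 0.0, 0.0)]
    pairs = pair_waters(xray, watgen)
    print(f"{len(pairs)}/{len(xray)} reference waters paired")
```

The ratio printed at the end has the same form as the "X-ray Paired" column of Table 1, with the number of paired reference waters over the total number of reference waters.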
To finish the whole process efficiently, a Python environment needs to be established. The PDB
file should be put in the same directory as the main Python script, and all other programs and helper
scripts, such as WATGEN and OpenBabel, and common files such as helper parameter files should be
put into their designated folders. All parameters are set to default values, and ligands are usually
identified automatically; however, entering a ligand ID, or providing a
list/dictionary of ligand IDs in a batch run, is recommended for a more accurate result because some
drug-like molecules in the structure might interfere with ligand identification.
In the case studies below, only SWBs are discussed since they are the most important interactions
of water contributing to the energy of the entire system, besides the entropy gain of the displaced
water. SWBs secondary to other interactions and DWBs were omitted since they are less important
and have less certainty. Hydrophobic interactions are another essential binding factor to consider
but are not discussed here.
WaterAnalysis was run on the given oxygen positions, and interactions were predicted
by searching for possible interactions with surrounding atoms/groups within a certain range. However, in the
case studies below, waters are displayed with hydrogens from the previous WATGEN result for a
more explicit view, as shown in Figure 1; these do not necessarily represent the positions of
the hydrogens in the interactions. The actual hydrogens should point in the directions
most favorable for the interactions. An important water molecule can function as more
than one water bridge, either as SWB or DWB, while some might be more dominant than others.
2.1.2. Results
Table 1. List of drugs used in case studies and WATGEN data

PDB ID  Generic Name    Ligand ID  Target Protein                 Affinity (pKd/pKi)  SWB  DWB  DW  HPI*  X-ray Paired
1A28    Progesterone    STR        Progesterone receptor          8.29                6    6    18  6     13/16
4P6X    Hydrocortisone  HCY        Glucocorticoid receptor        7.04                9    3    18  4     9/12
1IE9    Calcitriol      VDX        Vitamin D3 receptor            10.19               6    6    22  13    27/34
1O86    Lisinopril      LPR        Angiotensin-converting enzyme  9.57                5    14   21  8     62/68
2P16    Apixaban        GG2        Coagulation factor X           10.1                11   8    15  1     12/17
2W26    Rivaroxaban     RIV        Coagulation factor X           9.4                 13   5    16  2     13/16
3HKU    Topiramate      TOR        Carbonic anhydrase 2           8.3                 5    3    14  4     23/30

*HPI: the number of hydrophobic interactions of the ligand

Figure 1. WATGEN generated water molecules and files prepared for WaterAnalysis
without hydrogen (PDB ID: 1A28)

Structures were displayed by ViewerPro with an overall view and separate detailed views for
insight into water-mediated interactions. The cyan asterisks stand for water molecules from the
original X-ray structure that had paired water molecules from WATGEN, while the blue asterisks
stand for those that did not have paired water molecules. In the detailed views, the blue balls stand
for X-ray water molecules whose paired WATGEN water forms an SWB and was therefore regarded as
important. The ligand is shown in orange, while the protein residues are shown in grey.
Progesterone (PDB ID: 1A28)
Progesterone is an endogenous steroid sex hormone that plays an essential role in pregnancy,
embryogenesis, and other biological processes within or beyond the reproductive system. The
molecule has only two carbonyl groups that can form hydrogen bonds with surrounding
residues and water molecules, so hydrophobic interactions and entropy gain should be more
significant.
Figure 1-2-1. Overall view of the binding of Progesterone (PDB ID: 1A28).
Five interfacial waters generated by WATGEN were regarded as important by WaterAnalysis and
worked as SWBs, as shown in Figure 1-2-2. WAT1 connects O01 on the progesterone with the
carbonyl group on the backbone of MET759, WAT14 connects O01 with the carbonyl group on
the backbone of PHE778, and WAT15 connects O02 with both the amine on the backbone of PHE895
and the carbonyl group on the backbone of CYS891, which means that WAT15 had stronger
interactions and was more likely to be present in that position upon binding. An X-ray water molecule
was also present in the position of WAT1, so it can be assumed that WAT1 and WAT15 are the
most important within this complex.
Figure 1-2-2. Detailed view of SWBs of progesterone (PDB ID: 1A28).
Hydrocortisone (PDB ID: 4P6X)
Hydrocortisone is another steroid hormone that is widely used as an anti-inflammatory agent both
for external and internal use. Compared with Progesterone, it has more polar groups and is more
soluble, and it displaced the same number of water molecules upon binding according to the
WATGEN analysis.
Figure 1-3-1. Overall view of the binding of Hydrocortisone (PDB ID: 4P6X).
There were also five interfacial water molecules forming SWBs: WAT5, WAT13, WAT19, WAT23,
and WAT25, as shown in Figure 1-3-2. They were generally more tightly bound, as WAT19
Figure 1-3-2. Detailed view of SWBs of Hydrocortisone (PDB ID:4P6X).
took part in three SWBs, and WAT23 and WAT25 each took part in two SWBs. A water
molecule in the X-ray structure was paired with WAT5.
Since the progesterone receptor and the hydrocortisone receptor both belong to nuclear receptor
subfamily 3, group C, it is meaningful to compare the two structures. A rough
equivalence between the two structures can be found: O01-WAT1-MET759 and O01-WAT14-PHE778
in 1A28 (left) correspond to O01-WAT5-MET604 and O01-WAT13-PHE623 in 4P6X
(right), as shown in Figure 1-3-3.
WAT5 and WAT13 in the complex of 4P6X could correspond to WAT1 and WAT14 in the
complex of 1A28, which form SWBs with MET and PHE, respectively. It has been found that in
some steroid receptors, key water molecules are required to form a network of hydrogen bonds
that holds the steroid molecule in position: because C1 and C2 are connected by a single bond,
the A-ring would otherwise oscillate between the above and below
Figure 1-3-3. Comparison between the A-rings of the steroid structure from progesterone
(left) and hydrocortisone (right).
conformations (He et al., 2014). WAT5 and WAT13 in 4P6X and WAT1 and WAT14 in 1A28 are exactly
the key waters that hold the conformation of the A-ring by binding to O01. Meanwhile, X-ray water
molecules were also found at equivalent positions in the two structures: WAT5 in
4P6X and WAT1 in 1A28.
Calcitriol (PDB ID: 1IE9)
Calcitriol is the active form of vitamin D that binds to the vitamin D receptor (VDR). It has three
hydroxyl groups that can function as both hydrogen bond donors and acceptors. Besides the hydrogen
bonding of the three hydroxyl groups, it has a large lipophilic surface area, which makes
both van der Waals and hydrophobic contacts with aliphatic side chains that contribute substantially to the
binding process (Tocchini-Valentini et al., 2001). The shape of calcitriol fits well into the
binding pocket, and SER237, ARG274, and HIS397 of VDR have been widely reported to anchor
the ligand by forming hydrogen bonds with the hydroxyl groups on the A-ring (Rochel et al.,
2007). In this structure, water is less important compared with the massive hydrophobic
Figure 1-4-1. Overview of the binding of Calcitriol. (PDB ID: 1IE9)
interactions and direct hydrogen bonding.
The key water molecules found by the analysis were WAT20, WAT21, WAT27, and WAT28, as
shown in Figure 1-4-2. WAT21 was the most important one because it mediated the contacts
of calcitriol with three amino acid side chains: SER275, GLU277, and SER278.
Figure 1-4-2. Detailed view of SWBs of Calcitriol. (PDB ID: 1IE9)
Lisinopril (PDB ID: 1O86)
Lisinopril is an ACE inhibitor that is used as a first-line drug to treat cardiovascular diseases
such as high blood pressure. It has two carboxyl groups, which would be ionized at physiological
pH and exist in salt form. It has been found that chloride and zinc play a critical role in
ACE activity, and that water also facilitates the interaction (Liu et al., 2001; Natesh et al., 2003). In
this structure, one zinc ion and two chloride ions were present. The chloride ions
were relatively far from the ligand and guide the conformational change of the protein,
so they might not directly affect the simulation around the ligand, but the zinc ion is in direct contact
with O3, which belongs to a negatively charged carboxyl group. However, WATGEN does not
consider ionic bonds, since ionic binding is not part of generally considered water interactions,
and any other solute molecule left in the original structure is automatically ignored, which, in
cases where that molecule is part of the interaction, can result in poor prediction of interfacial
Figure 1-5-1. Overall view of the binding of Lisinopril. (PDB ID: 1O86)
hydration sites. Five important SWB water molecules were generated by WATGEN, as shown in
Figure 1-5-2. Three of these water molecules were paired with X-ray water, and two of them
formed hydrogen bonds with one of the carboxyl groups. Therefore, even though a
zinc ion was bound to the carboxyl group, the zinc ion did not appear to affect the
prediction of important water in this result.
Figure 1-5-2. Detailed view of SWBs of Lisinopril. (PDB ID: 1O86)
Apixaban (PDB ID: 2P16)
Apixaban is an anticoagulant that could directly inhibit factor Xa within the coagulation cascade
system. It is highly selective due to its unique structure with several specific functional groups
matching with the binding pocket; therefore, interactions between the residues in the binding
pocket of factor Xa and its ligand have been intensively studied. The binding pocket of factor Xa
has been divided into four small pockets, namely S1, S2, S3, and S4, and each pocket has its own
role in binding (Agrawal, 2012).
In the analysis, 9 important waters were identified, including WAT4, WAT5, WAT7, WAT13,
WAT31, WAT38, WAT50, and WAT138, as shown in Figure 1-6-2. Among these, WAT138 was the
most important because it formed three separate SWBs with SER214, TRP215, and ILE227, and
it was located within sub-pocket S4, in which a highly conserved hydration site has been
reported. In addition, there was an X-ray water paired with WAT7, which bridged O02 and SER195
Figure 1-6-1. Overall view of the binding of Apixaban. (PDB ID: 2P16)
within sub-pocket S1, in which a conserved water has also been widely reported (Salonen et al., 2012).
Figure 1-6-2. Detailed view of SWBs of Apixaban. (PDB ID: 2P16)
Rivaroxaban (PDB ID: 2W26)
Rivaroxaban is also an anticoagulant that directly inhibits factor Xa, and it was approved
for marketing earlier than apixaban. In the WATGEN analysis, rivaroxaban had ten essential water
Figure 1-7-1. Overall view of the binding of Rivaroxaban. (PDB ID: 2W26)
Figure 1-7-2. Detailed view of SWBs of Rivaroxaban. (PDB ID: 2W26).
molecules that formed SWBs: WAT1, WAT3, WAT5, WAT11, WAT12, WAT17,
WAT28, WAT51, WAT53, and WAT69, as shown in Figure 1-7-2 and Figure 1-7-3.
However, only three of these waters were comparable to those in 2P16. WAT17, at a
hydration site in sub-pocket S1, was in contact with SER179, comparable to WAT7 in 2P16; WAT53,
at a hydration site in sub-pocket S4, was in contact with TRP215, comparable to WAT138 in 2P16;
and WAT3 was in contact with LYS96, as was WAT4 in 2P16.
However, no X-ray water was found at the site of WAT17 in this structure. The
Figure 1-7-2. Detailed view of SWBs of Rivaroxaban. (PDB ID: 2W26).
difference between WAT53 and WAT138 in 2P16, which had contacts with three residues at the
same time, was that rivaroxaban has an extra oxygen on the lactam ring compared with
apixaban, which could bind more water efficiently and was superior to the contacts that
WAT138 had.
Topiramate (PDB ID: 3HKU)
Topiramate is usually used for epilepsy treatment and migraine prevention by binding to several
possibly relevant targets. It has a unique structure with a sulfamate modified diacetonide fructose,
and is highly soluble due to the presence of multiple polar groups. Inhibition of topiramate against
carbonic anhydrase (CA) was assumed to cause weight loss of epilepsy patients and was suggested
to have some other therapeutic potential (Supuran and Scozzafava, 2000).
Five water molecules forming SWBs were found in the analysis, including WAT1, WAT2, WAT9, and
Figure 1-8-1. Overall view of the binding of Topiramate. (PDB ID: 3HKU)
WAT20, as shown in Figure 1-8-2. Only WAT20 did not have a counterpart in the X-ray structure.
However, WAT20 was shown to function as the linking water of two SWBs, which indicates that
it is an important water molecule that was missed by the X-ray crystallography. The number of
water bridges was counterintuitively low for a ligand with many hydrogen bond donors and acceptors;
this was because ten direct interactions were found between the ligand and this CA isozyme.
Figure 1-8-2. Detailed view of SWBs of Topiramate. (PDB ID: 3HKU).
2.2. Global Data Analysis
2.2.1. Method
PDBbind is a comprehensive database of experimentally measured binding affinity data for
biomolecular complexes of various types. It is updated yearly, and the most current version
contains data on 16,126 complexes in total, along with a refined set of 4,463 complexes in
which the resolution of every PDB file is better than 2.5 Å. Therefore, it is an ideal database to
use for a more general analysis by WATGEN.
In this study, the refined set was chosen to obtain more reliable results, which reduced the
workload as well. Only complexes whose ligands were small molecules rather than peptides were
retained. One consideration was how well WATGEN-predicted water reflects the actual
presence of X-ray water, for which a comparison with the X-ray water was made. Another consideration
was whether comparing the water binding between the holo protein structures (with ligand bound) and
the apo structures (without ligand bound) would yield any useful results. In the study, the
WATGEN results for the holo protein (referred to as PL, protein-ligand) were compared to the
original X-ray structure. At the same time, a comparison was made with the WATGEN result for
the apo protein (referred to as PO, protein only), obtained by deleting the ligand directly from the
original structure.
To hydrate the PO protein with ligand deleted from the PDB file, a glycine was put into the space
near the protein and acted as a ligand to be recognized by WATGEN without affecting the water
added to the original ligand cavity. Then the water was also trimmed based on a virtual presence
of the ligand. Water in the original X-ray structure was also trimmed accordingly, as a comparison
group.
Data on binding waters from the WATGEN analysis and on properties of the ligand were collected
as comprehensively as possible. The water molecules were categorized by a simple standard: if waters
existed in a binding pocket in the PO run and could not be paired to waters in the PL run that
were close to the ligand (within 1.5 Å), they were considered to be waters that were
displaced upon binding of the ligand. The interactions of waters in the PL run were also compared
with those of their paired waters in the PO run: if the interactions became more favorable, they were
considered to be energy optimized waters (potential energy decreased) upon binding, and if the interactions
became less favorable, they were considered to be energy impaired waters (potential energy
increased).
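The categorization rule above can be sketched in a few lines of Python. This is only an illustration, not the WATGEN or WaterAnalysis implementation: the function names, the `score` field (a hypothetical interaction score where lower means more favorable), the tolerance `tol`, and the reading that a PO water counts as displaced when it lies within 1.5 Å of a ligand atom and has no PL partner within 1.5 Å are all assumptions.

```python
import math

def nearest_distance(point, others):
    """Distance in Å from point to the closest entry of others (inf if empty)."""
    return min((math.dist(point, o) for o in others), default=math.inf)

def classify_po_waters(po_waters, pl_waters, ligand_atoms, cutoff=1.5, tol=0.1):
    """Label each PO water. 'score' is a hypothetical interaction score (lower = more favorable)."""
    labels = []
    for water in po_waters:
        # Take the closest PL water as the pair if it lies within the cutoff.
        partner = min(pl_waters, key=lambda w: math.dist(water["xyz"], w["xyz"]), default=None)
        paired = partner is not None and math.dist(water["xyz"], partner["xyz"]) <= cutoff
        if paired:
            change = partner["score"] - water["score"]
            if change < -tol:
                labels.append("optimized")   # interactions became more favorable
            elif change > tol:
                labels.append("impaired")    # interactions became less favorable
            else:
                labels.append("unchanged")   # paired, little change
        elif nearest_distance(water["xyz"], ligand_atoms) <= cutoff:
            labels.append("displaced")       # unpaired and overlapping the ligand position
        else:
            labels.append("unmatched")       # unpaired and away from the ligand
    return labels

# Made-up coordinates (Å) and scores, for illustration only.
po = [{"xyz": (0, 0, 0), "score": -1.0}, {"xyz": (5, 0, 0), "score": -1.0},
      {"xyz": (10, 0, 0), "score": -2.0}, {"xyz": (20, 0, 0), "score": -1.0}]
pl = [{"xyz": (5.2, 0, 0), "score": -2.0}, {"xyz": (10, 0.5, 0), "score": -1.0}]
ligand = [(0.5, 0, 0)]
labels = classify_po_waters(po, pl, ligand)
print(labels)
```

Pairing each PO water with its single nearest PL water keeps the sketch short; a one-to-one assignment would be needed if two PO waters could claim the same PL water.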
PyMol views for comparison were generated for all pairs of PO and PL groups: waters that did not
find a match in the counterpart file were colored white, displaced waters were colored light purple,
the energy optimized waters were colored green, and the energy impaired waters were
colored red; waters that had matches but little energy change were colored yellow.
Data for the number of direct protein-ligand interactions and the number of waters that had
hydrophobic interactions with the ligand were extracted from the results of WaterAnalysis;
properties of ligands such as the molecular weight, van der Waals volume, numbers of total atoms,
hydrogen bond donors, hydrogen bond acceptors, rotatable bonds, ring structures, were generated
from XLogP3 v.3.2.2 (Cheng et al., 2007) and Chemicalize Online from ChemAxon Ltd, and the
affinity was from the original data in the PDBBind database.
2.2.2. Results
2.2.2.1. Data Overview
Table 2. Summary of information about the protein-ligand complex in the refined set.
All data collected were stored in a CSV file, and the first 25 lines are shown in Table 2. The left
part of the data is from WaterAnalysis, and the right part is the ligand information. Several types of
information were gathered for each complex and ordered by PDB ID. The explanation
for each parameter is listed as follows: a) # DisplacedW: the number of water molecules within
the 1.5 Å ligand-binding area in the PO run, which are considered as displaced; b) # NonContactW:
the number of displaced waters with no direct contact to the ligand or protein; c) #
Green/Red/Yellow: the number of waters in the PL run with lower/higher/similar energy compared
with their paired waters in the PO run; d) # P-L DirectInteraction: the number of direct hydrogen bond
interactions between the protein and ligand; e) # HydrophobicInteractionW: the number of waters
that have only hydrophobic interactions with the ligand; f) # XrayAligned: the number of waters that
PDB ID #DisplacedW #NonContactW #Green #Red #Yellow #P-L DirectInteraction #HydrophobicInteractionW #XrayAligned MW VdWVolume #TotalAtoms RuleOf5 #HBD #HBA #RotB #N&O #Ring XlogP3 Kd/Ki pKd/pKi
10gs 20 4 18 8 1 11 5 31 473 418.34 33 FALSE 3 6 13 9 2 0.82 Ki=0.4uM 6.4
184l 9 0 3 7 0 0 6 6 134 149.98 10 TRUE 0 0 2 0 1 3.51 Kd=19uM 4.72
185l 3 0 2 6 0 0 3 6 117 108.77 9 TRUE 1 0 0 1 2 2.05 Kd=290uM 3.54
186l 7 0 0 4 0 0 2 11 134 149.81 10 TRUE 0 0 3 0 1 3.8 Kd=14uM 4.85
187l 6 0 2 4 1 0 2 5 106 115.7 8 TRUE 0 0 0 0 1 2.65 Kd=422uM 3.37
188l 4 0 1 4 2 0 6 5 106 115.84 8 TRUE 0 0 0 0 1 2.65 Kd=470uM 3.33
1a28 18 3 3 12 3 3 6 13 315 321.11 23 TRUE 0 2 1 2 4 4 Ki=5.1nM 8.29
1a4k 14 4 17 3 4 2 2 12 428 120.9 31 TRUE 2 6 7 10 4 1.26 Kd=0.01uM 8
1a4r 19 2 20 1 0 13 1 0 440 313 28 FALSE 5 8 8 16 3 -5.89 Kd=0.22uM 6.66
1a4w 27 5 16 9 4 12 8 22 615 553.05 42 FALSE 1 5 14 11 4 5.04 Ki=1.2uM 5.92
1a69 12 0 11 3 1 13 0 26 268 211.77 19 TRUE 5 5 5 9 3 -2.47 Ki=5uM 5.3
1a94 39 7 37 13 2 17 9 56 847 332 73 TRUE 8 9 33 19 1 1.97 Ki=14nM 7.85
1a99 6 0 6 0 0 4 0 23 90.2 107.81 6 FALSE 2 0 3 2 0 -0.94 Kd=2uM 5.7
1a9m 28 4 26 11 2 13 5 9 598 582.52 42 FALSE 6 6 23 14 1 1.89 Ki=119nM 6.92
1a9q 4 1 6 1 0 5 0 11 136 103.77 10 TRUE 2 2 0 5 2 -0.49 Kd=0.68uM 6.17
1aaq 30 5 21 12 5 17 10 1 579 563.52 41 FALSE 6 6 22 12 1 1.79 Ki=4.0nM 8.4
1add 9 1 8 0 1 11 0 21 266 220.28 19 TRUE 4 5 5 8 3 -1.59 Ki=0.18uM 6.74
1adl 16 5 14 14 3 3 11 17 304 329.36 22 FALSE 0 2 13 2 0 7.73 Kd=4.4uM 5.36
1afk 14 2 21 0 0 16 0 17 502 340.18 31 FALSE 2 12 7 18 3 -6.15 Ki=240nM 6.62
1afl 13 0 18 1 1 18 0 20 502 340.12 31 FALSE 2 12 7 18 3 -6.15 Ki=520nM 6.28
1ai4 5 0 5 5 0 3 4 21 167 141.76 12 FALSE 2 4 3 4 1 2.12 Ki=3.13mM 2.5
1ai5 5 1 6 4 2 3 2 28 180 146.83 13 TRUE 0 4 1 5 1 2.66 Ki=0.189mM 3.72
1ai7 2 0 1 7 0 1 3 15 94.1 90.52 7 TRUE 1 1 1 1 1 1.57 Ki=0.082mM 4.09
1aj7 15 3 9 5 1 6 2 0 301 241.18 32 FALSE 0 6 6 8 1 2.15 Kd=135uM 3.87
1ajn 7 0 8 4 0 2 3 27 180 146.89 13 TRUE 0 4 1 5 1 2.66 Ki=2.32mM 2.63
……
WATGEN Info. Ligand Info. Affinity
has been paired to water in the X-ray structure; g) MW: molecular weight; h) VdWVolume: van
der Waals volume calculated by Chemicalize; i) # TotalAtoms: the number of total ligand atoms; j) #
HBD & # HBA: the numbers of hydrogen bond donors and hydrogen bond acceptors; k) # RotB:
the number of rotatable bonds; l) XLogP3: LogP value calculated by XLogP3; m) Ki & Kd: the
binding affinity data collected directly from the PDBBind database.
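The pKd/pKi column in Table 2 is consistent with the usual definition pK = -log10(K), with K expressed in mol/L. The short sketch below reproduces a few table entries from their Kd/Ki labels; the function name `p_affinity` and the label format it parses are assumptions made for illustration.

```python
import math
import re

# Conversion factors from the unit suffix to mol/L.
UNIT_TO_MOLAR = {"mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12, "fM": 1e-15}

def p_affinity(label):
    """Convert a label such as 'Ki=0.4uM' to pKi (or pKd), rounded to two decimals."""
    match = re.fullmatch(r"(Kd|Ki)=([0-9.]+)(mM|uM|nM|pM|fM)", label)
    if match is None:
        raise ValueError(f"unrecognized affinity label: {label!r}")
    molar = float(match.group(2)) * UNIT_TO_MOLAR[match.group(3)]
    return round(-math.log10(molar), 2)

# Labels taken from the rows for 10gs, 184l, 1a28, and 1ai4 in Table 2.
for label in ["Ki=0.4uM", "Kd=19uM", "Ki=5.1nM", "Ki=3.13mM"]:
    print(label, p_affinity(label))
```

Working on the logarithmic scale puts millimolar and nanomolar binders on one comparable axis, which is why the later plots use pKd/pKi rather than the raw constants.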
Among the 4,463 complexes in the PDBBind refined set, 236 have peptides as ligands and
were excluded from the analysis because the procedure for running peptide ligands differs from
that for small-molecule ligands. Among the 4,227 complexes with small molecules
as ligands, 723 complexes failed to run WATGEN successfully for unknown reasons, and 29
complexes failed to return data from XLogP3 or Chemicalize; all of these were discarded
for data completeness.
Table 3. Statistics of data for protein-ligand complexes in the refined set from PDBBind database.
Table 3 shows the statistics for all data collected. The number of displaced waters ranges from 0
to 45, with an average of 14.24; the XLogP3 value ranges from -12.87 to 12.8, with an average of
1.30; and the pKd/pKi of the ligand in the complexes ranges from 2 to 11.85, with an average of 6.37.
Some of the values are far beyond the range usually considered for a potential drug.
The average percentage of water paired with the corresponding X-ray structure is 76.87%, which is
better than the test of the old version of WATGEN on peptide ligands, which predicted 72% of hydration
sites by the standard of 1.5 Å (Bui et al., 2007); the old verification was only done on interfacial
water, while this study verified water within a 10 Å range around the ligand. In studies of other
water-generating methods, the X-ray water prediction accuracy ranged from 50% to 100%, with
different verification standards (Hu et al., 2018).
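As a rough illustration of how such a recovery percentage can be computed, the sketch below counts the X-ray waters near the ligand that have a predicted water within the match cutoff. It assumes that the denominator is the number of X-ray waters within 10 Å of any ligand atom and that a match means any predicted water within 1.5 Å; the function name and the coordinates are hypothetical, and the actual WaterAnalysis procedure may differ.

```python
import math

def fraction_recovered(xray_waters, predicted_waters, ligand_atoms,
                       match_cutoff=1.5, ligand_radius=10.0):
    """Fraction of X-ray waters near the ligand with a predicted water within match_cutoff Å."""
    # Keep only X-ray waters within ligand_radius of at least one ligand atom.
    near = [x for x in xray_waters
            if any(math.dist(x, a) <= ligand_radius for a in ligand_atoms)]
    if not near:
        return 0.0
    # An X-ray water counts as recovered if any predicted water is close enough.
    hits = sum(1 for x in near
               if any(math.dist(x, p) <= match_cutoff for p in predicted_waters))
    return hits / len(near)

# Made-up coordinates in Å, for illustration only.
xray = [(0, 0, 0), (3, 0, 0), (10, 0, 0), (50, 0, 0)]
predicted = [(0.5, 0, 0), (3, 1.4, 0), (12, 0, 0)]
ligand = [(1, 0, 0)]
recovered = fraction_recovered(xray, predicted, ligand)
print(f"{100 * recovered:.1f}% of nearby X-ray waters recovered")
```

Because the result depends on both cutoffs, percentages obtained under different verification standards are not directly comparable.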
Figure 2-1-1. PO group (PDB ID: 1O86). WATGEN and analysis viewer from PyMol. The light blue molecule is the virtual
ligand in the PO group, waters colored light pink or light purple are near the ligand and with no paired water found in the PL
group, which are considered to be displaced by the binding of the ligand.
Figure 2-1-2. PL group (PDB ID: 1O86). The waters around the ligand-binding area in the PO structure are mostly displaced by
the introduction of the ligand. There are more green waters in the binding area than in the PO group; these are more
energetically stable and indicate that, in general, the binding of the ligand is a favorable process in terms of hydration energy, in addition to
the entropic gain from water replacement.
In the PyMol views of all the comparisons, a general pattern was observed: there tend to be
more energetically favorable water molecules in the binding area for the PL runs (Figure 2-1-2)
than for the PO runs (Figure 2-1-1), which means that the water molecules around the ligand
become more energetically favorable if they remain upon binding. This pattern conforms with the general
understanding of ligand binding and indicates that the aqueous
environment always changes upon binding. Some water molecules in areas away from the
binding site also differ between the PL and PO runs, both energetically and in location,
and their interaction types might have changed as well. This can be explained by the whole system,
especially the water, being in a dynamic state, so that a slight change at one place can affect another.
From the results of the case studies on marketed drugs, the X-ray water prediction accuracy, and
the comparison between PL and PO runs, it is reasonable to say that the results from WATGEN are
valid, and that the data generated by WATGEN can be applied to a global analysis aimed at finding
general trends.
2.2.2.2. Relationship between Displaced Water with LogP
The first data analyzed were the computationally predicted LogP values, since the original presumption in the
global analysis was that the more hydrophobic the ligand is, the more water molecules would be
displaced by it, because a hydrophobic surface is less favorable for retaining water. The LogP range
of the data set was far beyond that considered for common drugs, which is usually
from -2 to 5 according to Lipinski’s rule, so some data far outside the normal range were
ignored. In the scatter plot of the number of displaced waters vs. XLogP3 (Figure 2-2-1), no
clear pattern was observed, since many other factors influence water replacement at the
same time.
Figure 2-2-1. Scatter plot of the number of displaced waters vs. XLogP
Figure 2-2-2. Scatter plot of the normalized number of displaced waters vs. XLogP
[Plot: #DisplacedWater vs. XLogP3; x-axis: XLogP3, y-axis: #DisplacedWater]
[Plot: Mean DisplacedWater vs. LogP; x-axis: LogP Range, y-axis: Mean Displaced Water]
Figure 2-2-3. The linearity between displaced water (DW) and LogP range improved after normalizing the
displaced water by molecular weight (MW); a clearer trend of displaced water increasing with the LogP
range was observed.
Then the number of displaced waters was averaged for each LogP range, but the relationship was
still not clear (Figure 2-2-2). Considering the general fact that the larger the ligand is, the more
water molecules it is able to displace, the number of displaced waters was normalized by
molecular weight to eliminate the interference from molecular size. A linear relationship was then
found between the normalized average number of displaced waters and the LogP, with a correlation
coefficient (R) of 0.916, corresponding to R² = 0.8395 (Figure 2-2-3). When the displaced waters
were normalized by the van der Waals volume, which better represents the molecular size, no such
clear relationship was found.
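The normalization and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: a bin width of 1 LogP unit, bins labeled by their lower edge, and an ordinary least-squares line. The function names are hypothetical, and the sample records are only a handful of rows from Table 2, far too few to reproduce the reported fit.

```python
import math
from collections import defaultdict

def binned_means(records, width=1.0):
    """Average #DisplacedW / MW within LogP bins; returns sorted (bin lower edge, mean) pairs."""
    bins = defaultdict(list)
    for logp, displaced, mw in records:
        bins[math.floor(logp / width) * width].append(displaced / mw)
    return sorted((low, sum(values) / len(values)) for low, values in bins.items())

def linear_fit(points):
    """Ordinary least squares y = a*x + b; returns (a, b, r_squared)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared

# (XlogP3, #DisplacedW, MW) for 10gs, 184l, 185l, 186l, 187l, 1a28, and 1a4k in Table 2.
records = [(0.82, 20, 473), (3.51, 9, 134), (2.05, 3, 117), (3.8, 7, 134),
           (2.65, 6, 106), (4, 18, 315), (1.26, 14, 428)]
points = binned_means(records)
slope, intercept, r_squared = linear_fit(points)
print(points)
print(slope, intercept, r_squared)
```

Averaging within bins before fitting removes most of the within-bin scatter, so an R² obtained this way describes the trend of the bin means, not the predictability of individual complexes.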
[Plot: Mean DisplacedWater/MW vs. LogP; x-axis: LogP Range, y-axis: Mean DisplacedWater/MW; linear fit: y = 0.0006x + 0.0381, R² = 0.8395]
2.2.2.3. Relationship between Displaced Water and Affinity
Figure 2-2-4. Scatterplot of Affinity vs. #DisplacedWater normalized
It is natural to guess that more water replacement would increase the binding affinity, since the entropy
gain from displacing water is the major contribution of water to the binding process (Ahmad et al.,
2014). However, since many other important factors influence the binding affinity,
and some water replacement is not energetically favorable, it is not surprising that the plot of affinity vs.
normalized displaced waters is scattered; nevertheless, the trend line still has a positive
slope (Figure 2-2-4). It might be argued that the weak linearity could
be a random result because the slope is close to zero; therefore, the data were grouped into
LogP ranges from -3 to 6, in steps of 1, to check whether there is any difference among
these groups (Figure 2-2-5). The result shows that the positive linearity still exists in each group,
[Plot: pKd/pKi vs. #DisplacedWater/MW*1000; x-axis: #DisplacedWater/MW*1000, y-axis: pKd/pKi; linear fit: y = 0.0456x + 4.6146, R² = 0.0508]
so it could be concluded that the data comply with the expectations.
Figure 2-2-6. pKd/pKi vs. #DisplacedWater grouped by LogP range from -3 to 6.
2.2.2.4. Relationship between Affinity and LogP
After studying the relationship of water replacement with LogP and with affinity separately, there is
a need to check whether there is also a correlation between affinity and LogP in these data. Evidence
from studies of specific groups of protein-ligand complexes has shown that lipophilicity
correlates directly with affinity in situations where the pharmacophore group needs to be
hydrophobic (Parker et al., 2008; Wang et al., 2003). In general, the trend showed a positive linearity
that was better than that of displaced water vs. affinity, with slightly lower variation.
Figure 2-2-7. Scatter plot of affinity against lipophilicity.
The result indicates that, in general, a ligand with higher hydrophobicity might be a stronger binder
to its target protein, which is consistent with the fact that proteins tend to fold in such a way that the inner
parts are relatively lipophilic compared with the surface area that is directly in contact with the
aqueous environment (Spolar et al., 1989). In most cases, some lipophilicity is needed to help
the ligand leave the polar aqueous environment and enter inner binding pockets.
[Plot: pKd/pKi vs. XLogP3; x-axis: XLogP3, y-axis: pKd/pKi; linear fit: y = 0.4837x - 1.7867, R² = 0.0804]
Figure 2-2-8. Correlations among lipophilicity, affinity and displaced water.
Figure 2-2-8 shows a simplification of the relationship among affinity, lipophilicity, and displaced
water. The x-axis is the LogP range, from -3 to 7 in steps of 1; the y-axis is the affinity,
averaged separately for complexes whose number of displaced waters is larger or smaller than the average
number of displaced waters in that LogP range.
These results do not provide practical guidance for drug screening or drug design, since everything
depends on specific situations in which the properties of the target protein vary. They do, however, give a picture
of the general trends, which can be explained by theory based on long-term study and practice;
hence, some simple assumptions can be made when there is no access to research data.
[Plot: Average pKi/pKd vs. LogP (MW); x-axis: LogP, y-axis: Average pKi/pKd; series: Average pKi/pKd (> Mean DisplacedWater/MW), Average pKi/pKd (< Mean DisplacedWater/MW), Average pKi/pKd]
2.2.2.5. Global Correlation Analysis
Figure 2-3-1. Heatmap of all the collected data
To gain insight into the collected data, a correlation heatmap (Figure 2-3-1) was built
with the Seaborn library in Python to check whether there are any other significant relationships among the data. It is
evident that the number of displaced waters has a strong relationship with all terms relevant to
ligand size, such as molecular weight, van der Waals volume, and the number of total atoms,
with correlation coefficients above 0.95. The number of rotatable bonds, the number of ring
structures, and XLogP3, as mentioned earlier, also show a good relationship with the displaced
water, with correlation coefficients above 0.65.
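The numbers behind such a heatmap are pairwise correlation coefficients. The sketch below computes them in plain Python for three columns, using only the first five rows of Table 2; it assumes the Pearson coefficient, which the text does not state explicitly, and the function names are hypothetical. With pandas and Seaborn available, `DataFrame.corr()` followed by `seaborn.heatmap` would be the usual route.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Values for 10gs, 184l, 185l, 186l, and 187l from Table 2 (illustration only).
columns = {
    "#DisplacedW": [20, 9, 3, 7, 6],
    "MW": [473, 134, 117, 134, 106],
    "XlogP3": [0.82, 3.51, 2.05, 3.8, 2.65],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Five rows are far too few to be meaningful; the sketch only shows the calculation that underlies each cell of the heatmap.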
As for affinity, the most relevant factors listed here are the number of direct protein-ligand
interactions, with a correlation coefficient of 0.62, and the number of hydrogen bond acceptors, with
a correlation coefficient of 0.69, while the number of displaced waters and XLogP3 do not stand
out. In fact, the data collected are highly scattered and hard to explain by simple linear
regression, but since some overall trends could be observed, it is of interest to see whether these could
play a predictive role.
Two machine learning models in Scikit-Learn, the Linear Regression model and the Random Forest Regression
model, were chosen to test the performance, with '#DisplacedW',
'#NonContactW', '#Green', '#Red', 'P-L DirectInteraction', '#HydrophobicInteractionW', 'MW',
'#HBD', '#HBA', '#RotB', '#N&O', '#Ring', 'XlogP3' as feature parameters, and 'pKd/pKi' as the value
to predict. Linear regression is a classical statistical method that relates multiple
parameters to a value of interest by multivariate regression; Random Forest is a more advanced
method for complicated data that is based on decision trees. Data were randomly split into training
data to build the prediction model and testing data to test whether the model could be used on unseen
data. Cross-validation is a method of testing performance that reduces data
splitting bias. In the cross-validation results for the two models (Figure 2-3-2), the scores on the
training data and the test data from the linear regression model converged as the data size
increased, but the validation score was low, which means that this model has a high bias between
the predicted values and the real values. The Random Forest model performed well on the training
data, but it failed to predict the test data no matter how much data was used for training, which means that this
model overfit the training data and did not capture the features that really matter for predicting
affinity.
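A cross-validated learning curve of this kind can be produced with Scikit-Learn's `learning_curve`. The sketch below is not the thesis code: it uses synthetic data (13 random features standing in for the descriptors, with a weak signal in one of them), and the fold count, training sizes, and number of trees are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data: 300 samples, 13 features, a weak linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
y = 6.4 + 0.3 * X[:, 0] + rng.normal(scale=1.5, size=300)

def mean_learning_curve(model, X, y):
    """Return training sizes with mean training and validation R² across 5 folds."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="r2",
        train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=0)
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
curves = {name: mean_learning_curve(model, X, y) for name, model in models.items()}
for name, (sizes, train_r2, val_r2) in curves.items():
    print(name, sizes[-1], round(train_r2[-1], 2), round(val_r2[-1], 2))
```

A persistent gap between a high training score and a low validation score is the signature of overfitting, while two low scores that converge point to features that carry too little information about the target.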
To conclude, since the test scores of both models were low, the result does not mean that the
chosen models were too simple or too complicated for the data set; a better model would be unlikely
to be found, however sophisticated it was. The poor performance suggests that more informative
feature parameters are needed, as many of the parameters used had already been shown not to have a
decent correlation with affinity. Affinity is determined by various factors relating to the protein,
the ligand, the solvent, and other physicochemical factors, while in this data set only factors
about water and ligands were provided. Therefore, the data are not sufficient to build a machine learning
model for the prediction of affinity.
Figure 2-3-2. Cross-validation learning curve of Linear Regression Model and Random Forest Model.
Chapter 3. Discussion
In this study, some structures in the original database failed to run at a specific stage. 723 out of
4,198 structures failed to run WATGEN for unknown reasons; the error message shown by
WATGEN was ‘INDEX out of RANGE’, which looks like a data overflow due to large protein
size. However, since the original structures had already been cleaned to keep only chains that had
residues near the ligand, this should not be the main reason for the failure of so many runs; the failures
might also be caused by unrecognizable molecules with uncommon structures. The
previous version of WATGEN is hard to debug without advanced programming skills; thus, there
is a need to build on the basic idea of the WATGEN algorithm and develop a more robust version that
handles various kinds of situations and better identifies unknown errors. Meanwhile, there is a
need for WATGEN to recognize and handle inorganic species, such as the zinc ion in 1O86, and
to handle interactions other than hydrogen bonds and hydrophobic contacts, such as ionic
bonds.
In the analysis of the relationship between displaced water and LogP, the molecular weight was
used to normalize the number of displaced waters, which improved the linearity of the data.
However, the linearity was much worse if the VdW volume was used to normalize the data.
The reason might be that the VdW volume from Chemicalize does not represent the actual space
occupied inside the pocket, because it was calculated from a SMILES string, which does not
include information on the actual conformation; in addition, the VdW volume data are more
scattered than the molecular weight data.
Among all the generated data, a few complexes had zero water displaced; however, common
sense says that within a fixed space, if something goes in, something should come out. The reason for
this conflict might be that some ligands, such as oxalate and butyric acid, were especially
small and hydrophilic and did not displace water molecules under the criteria we set, and there
might also be some error caused by the cutoff. In reality, it may also be true that some ligands
do not displace any water, since the protein undergoes a conformational change upon binding and
the size of the binding pocket changes too.
In the beginning, we tried to base the analysis only on data for water involvement in
ligand binding from WATGEN and WaterAnalysis, which consider only binding terms related to
water, and the analysis was more of a validation of the WATGEN results against known facts.
Even though the result was informative, it was also quite rough; some important parameters were
absent, and some were intercorrelated.
Since binding cannot easily be explained by any factor that stands alone, it is inevitable that a more
comprehensive analysis is required. In fact, the properties of the protein binding pocket should
always be the first thing to consider, followed by the ligand. However, it is much harder to get the
same kind of parameters about the binding pocket, since the total information about a protein is
much greater than either the water or ligand, and how to make the boundaries to select the data is
a more onerous task.
One parameter in the analysis did consider protein binding: the number of
direct protein-ligand interactions. The interactions were quantified by how many hydrogen bonds
formed between the ligand and the protein, without counting hydrophobic contacts, and no energy
terms were used. WaterAnalysis provides detailed information about the interactions, such as the
atoms, distances, and binding conformations, but no more efficient way was found to use these data.
The interactions among residues of the protein are also provided, but there was no use for them
except for structure inspection.
Beyond the incomplete data, another question is whether the algorithm is rational enough to
support an analysis based on these discrete data. The availability of different force fields has
enabled the development of several efficient water models that provide hydration information in
terms of energy and perform well for a whole binding system. On the other hand, the simplicity of
discrete data is what sets WATGEN apart: many methods and tools for adding and analyzing explicit
water already exist, and the underlying physicochemical theories are too complicated for people
outside these specific fields who may nevertheless be interested in exploring the microscopic
world. WATGEN provides an understandable way to look at water interactions and to explain the
important roles of these interactions.
Chapter 4. Conclusion
In this study, WATGEN was shown to predict the locations of 76.87% of the X-ray water molecules
in the structures of the PDBBind refined set to within 1.5 Å. Detailed inspection of the water
molecules in 7 cases of top-prescribed drugs showed that it could predict most of the important
bridging waters that might not be present in X-ray structures, and that it could also give
information on the detailed interactions of each water molecule, which could help provide insight
into the essential role of water in ligand binding.
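A recovery percentage of this kind can be computed as in the following minimal sketch. It assumes that an X-ray water counts as recovered when any predicted water oxygen lies within 1.5 Å of it; the function name and the toy coordinates are hypothetical, and the exact matching procedure used in this study may differ.

```python
import math

def fraction_recovered(xray_waters, predicted_waters, cutoff=1.5):
    """Fraction of X-ray water oxygens with a predicted water within `cutoff` angstroms."""
    if not xray_waters:
        return 0.0
    hits = 0
    for xray in xray_waters:
        # A crystallographic water counts as recovered if any predicted
        # water oxygen lies within the cutoff distance of it.
        if any(math.dist(xray, pred) <= cutoff for pred in predicted_waters):
            hits += 1
    return hits / len(xray_waters)

# Toy coordinates in angstroms, invented for this example.
xray = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 8.0, 0.0), (3.0, 3.0, 3.0)]
predicted = [(0.5, 0.5, 0.0), (5.0, 1.0, 0.0), (0.0, 8.0, 2.0)]

recovered = fraction_recovered(xray, predicted)
print(f"{recovered:.2%}")
```

In this toy case two of the four X-ray waters have a predicted water within the cutoff. Note that this simple rule lets one predicted water match several X-ray waters; a stricter one-to-one assignment would be a different design choice.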
In a global analysis of WATGEN data on the refined set, WATGEN showed a general relationship
among the number of waters displaced by the ligand, the lipophilicity of the ligand, and the
affinity of the ligand for the protein; these are positively correlated with one another in a
pattern that can be explained by existing knowledge.
The correlation heatmap of all parameters showed that, besides the terms for molecular size, the
number of rotatable bonds, the number of ring structures, and the XLogP3 value correlate more
strongly with the number of displaced waters. These three parameters, along with the size terms,
gave another general picture of what may affect the number of displaced waters. An attempt to
build an ML model to predict the binding affinity failed, indicating that the WATGEN data
currently provided are not suitable for advanced purposes. Remaining tasks are to find a way to
describe the energy so that the interaction data can be categorized further, to consider small
particles and ion binding, and, most importantly, to fix bugs. Although aspects of WATGEN remain
to be improved, the results of the analysis in this study make sense and are easy to understand.
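The values behind a correlation heatmap can be sketched as a pairwise correlation matrix. The example below assumes Pearson correlation, which may not be the coefficient used in this study; the column names and numbers are invented for illustration and are not data from the refined set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients, one row per parameter."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values, invented for this example.
data = {
    "displaced_waters": [2, 4, 6, 8, 10],
    "rotatable_bonds": [1, 2, 3, 4, 5],
    "xlogp3": [3.0, 1.0, 2.5, 0.5, 2.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each row of the resulting matrix corresponds to one row of a heatmap, so the parameters most strongly associated with the number of displaced waters can be read from the `displaced_waters` row.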
References
Agrawal, R.J., Pratima; N. Dikshit, S. (2012). Apixaban: A New Player in the Anticoagulant Class. Current
Drug Targets 13, 863-875(813).
Ahmad, M., Kalinina, O., and Lengauer, T. (2014). Entropy gain due to water release upon ligand binding.
Journal of Cheminformatics 6.
Antunes, D.A., Devaurs, D., and Kavraki, L.E. (2015). Understanding the challenges of protein flexibility in
drug design. Expert Opinion on Drug Discovery 10, 1301-1313.
Bayden, A.S., Moustakas, D.T., Joseph-McCarthy, D., and Lamb, M.L. (2015). Evaluating Free Energies of
Binding and Conservation of Crystallographic Waters Using SZMAP. J Chem Inf Model 55, 1552-1565.
Bissantz, C., Kuhn, B., and Stahl, M. (2010). A Medicinal Chemist’s Guide to Molecular Interactions.
Journal of Medicinal Chemistry 53, 5061-5084.
Bui, H.H., Schiewe, A.J., and Haworth, I.S. (2007). WATGEN: an algorithm for modeling water networks at
protein-protein interfaces. J Comput Chem 28, 2241-2251.
Cavasotto, C., Orry, A., and Abagyan, R. (2005). The Challenge of Considering Receptor Flexibility in
Ligand Docking and Virtual Screening. 1, 423-440.
Cheng, T., Zhao, Y., Li, X., Lin, F., Xu, Y., Zhang, X., Li, Y., Wang, R., and Lai, L. (2007). Computation of
octanol-water partition coefficients by guiding an additive model with knowledge. J Chem Inf Model 47,
2140-2148.
Corbeil, C.R., and Moitessier, N. (2009). Docking Ligands into Flexible and Solvated Macromolecules. 3.
Impact of Input Ligand Conformation, Protein Flexibility, and Water Molecules on the Accuracy of
Docking Programs. 49, 997-1009.
Garcia-Sosa, A.T. (2013). Hydration Properties of Ligands and Drugs in Protein Binding Sites: Tightly-
Bound, Bridging Water Molecules and Their Effects and Consequences on Molecular Design Strategies.
Journal of Chemical Information and Modeling 53, 1388-1405.
Garcia-Sosa, A.T., Firth-Clark, S., and Mancera, R.L. (2005). Including tightly-bound water molecules in de
novo drug design. Exemplification through the in silico generation of poly (ADP-ribose)polymerase
ligands. Journal of Chemical Information and Modeling 45, 624-633.
Garcia-Sosa, A.T., and Mancera, R.L. (2006). The effect of a tightly bound water molecule on scaffold
diversity in the computer-aided de novo ligand design of CDK2 inhibitors. J Mol Model 12, 422-431.
García-Sosa, A.T., Mancera, R.L., and Dean, P.M. (2003). WaterScore: a novel method for distinguishing
between bound and displaceable water molecules in the crystal structure of the binding site of protein-
ligand complexes. J Mol Model 9, 172-182.
Garcia-Sosa, A.T., Sild, S., Takkis, K., and Maran, U. (2011). Combined Approach Using Ligand Efficiency,
Cross-Docking, and Antitarget Hits for Wild-Type and Drug-Resistant Y181C HIV-1 Reverse Transcriptase.
Journal of Chemical Information and Modeling 51, 2595-2611.
Genheden, S., and Ryde, U. (2015). The MM/PBSA and MM/GBSA methods to estimate ligand-binding
affinities. Expert Opinion on Drug Discovery 10, 449-461.
Goodford, P.J. (1985). A computational procedure for determining energetically favorable binding sites
on biologically important macromolecules. Journal of Medicinal Chemistry 28, 849-857.
Grant, J.A., Pickup, B.T., and Nicholls, A. (2001). A smooth permittivity function for Poisson-Boltzmann
solvation methods. Journal of Computational Chemistry 22, 608-640.
He, Y., Yi, W., Suino-Powell, K., Zhou, X.E., Tolbert, W.D., Tang, X., Yang, J., Yang, H., Shi, J., Hou, L., et al.
(2014). Structures and mechanism for the design of highly potent glucocorticoids. Cell Research 24, 713-
726.
Hou, T., Wang, J., Li, Y., and Wang, W. (2011). Assessing the Performance of the MM/PBSA and
MM/GBSA Methods. 1. The Accuracy of Binding Free Energy Calculations Based on Molecular Dynamics
Simulations. Journal of Chemical Information and Modeling 51, 69-82.
Hu, B., and Lill, M.A. (2014). WATsite: Hydration site prediction program with PyMOL interface. Journal
of Computational Chemistry 35, 1255-1260.
Hu, X., Maffucci, I., and Contini, A. (2018). Advances in the treatment of explicit water molecules in
docking and binding free energy calculations. Curr Med Chem.
Jorgensen, W.L. (2009). Efficient Drug Lead Discovery and Optimization. 42, 724-733.
Kumalo, H., Bhakat, S., and Soliman, M. (2015). Theory and Applications of Covalent Docking in Drug
Discovery: Merits and Pitfalls. Molecules 20, 1984-2000.
Ladbury, J.E. (1996). Just add water! The effect of water on the specificity of protein-ligand binding sites
and its potential application to drug design. 3, 973-980.
Lazaridis, T. (1998). Inhomogeneous Fluid Approach to Solvation Thermodynamics. 1. Theory. The
Journal of Physical Chemistry B 102, 3531-3541.
Lazaridis, T., and Paulaitis, M.E. (1992). Entropy of hydrophobic hydration: a new statistical mechanical
formulation. 96, 3847-3855.
Liu, X., Fernandez, M., Wouters, M.A., Heyberger, S., and Husain, A. (2001). Arg1098 Is Critical for the
Chloride Dependence of Human Angiotensin I-converting Enzyme C-domain Catalytic Activity. 276,
33518-33525.
Lloyd, D.G., Garcia-Sosa, A.T., Alberts, I.L., Todorov, N.P., and Mancera, R.L. (2004). The effect of tightly
bound water molecules on the structural interpretation of ligand-derived pharmacophore models. J
Comput Aid Mol Des 18, 89-100.
Michel, J., Tirado-Rives, J., and Jorgensen, W.L. (2009). Prediction of the water content in protein
binding sites. J Phys Chem B 113, 13337-13346.
Mikol, C.P.X.B.V. (1995). The Role of Water Molecules in the Structure-Based Design of (5-
Hydroxynorvaline)-2-cyclosporin: Synthesis, Biological Activity, and Crystallographic Analysis with
Cyclophilin A. J Med Chem 38, 3361-3367.
Natesh, R., Schwager, S.L.U., Sturrock, E.D., and Acharya, K.R. (2003). Crystal structure of the human
angiotensin-converting enzyme–lisinopril complex. 421, 551-554.
Parker, M.A., Kurrasch, D.M., and Nichols, D.E. (2008). The role of lipophilicity in determining binding
affinity and functional activity for 5-HT2A receptor ligands. Bioorg Med Chem 16, 4661-4669.
Ramsey, S., Nguyen, C., Salomon-Ferrer, R., Walker, R.C., Gilson, M.K., and Kurtzman, T. (2016).
Solvation thermodynamic mapping of molecular surfaces in AmberTools: GIST. Journal of Computational
Chemistry 37, 2029-2037.
Rochel, N., Hourai, S., Pérez-García, X., Rumbo, A., Mourino, A., and Moras, D. (2007). Crystal structure
of the vitamin D nuclear receptor ligand binding domain in complex with a locked side chain analog of
calcitriol. Archives of Biochemistry and Biophysics 460, 172-176.
Salonen, L.M., Holland, M.C., Kaib, P.S.J., Haap, W., Benz, J., Mary, J.-L., Kuster, O., Schweizer, W.B.,
Banner, D.W., and Diederich, F. (2012). Molecular Recognition at the Active Site of Factor Xa: Cation-π
Interactions, Stacking on Planar Peptide Surfaces, and Replacement of Structural Water. Chemistry - A
European Journal 18, 213-222.
Spolar, R.S., Ha, J.H., and Record, M.T. (1989). Hydrophobic effect in protein folding and other
noncovalent processes involving proteins. Proceedings of the National Academy of Sciences 86, 8382-
8385.
Sridhar, A., Ross, G.A., and Biggin, P.C. (2017). Waterdock 2.0: Water placement prediction for Holo-
structures with a pymol plugin. PLOS ONE 12, e0172743.
Sun, H., Li, Y., Tian, S., Xu, L., and Hou, T. (2014). Assessing the performance of MM/PBSA and MM/GBSA
methods. 4. Accuracies of MM/PBSA and MM/GBSA methodologies evaluated by various simulation
protocols using PDBbind data set. 16, 16719.
Supuran, C.T., and Scozzafava, A. (2000). Carbonic anhydrase inhibitors and their therapeutic potential.
10, 575-600.
Tocchini-Valentini, G., Rochel, N., Wurtz, J.M., Mitschler, A., and Moras, D. (2001). Crystal structures of
the vitamin D receptor complexed to superagonist 20-epi ligands. 98, 5491-5496.
Verdonk, M.L., Chessari, G., Cole, J.C., Hartshorn, M.J., Murray, C.W., Nissink, J.W.M., Taylor, R.D., and
Taylor, R. (2005). Modeling Water Molecules in Protein−Ligand Docking Using GOLD. Journal of
Medicinal Chemistry 48, 6504-6515.
Wang, Y., Mathis, C.A., Huang, G.-F., Debnath, M.L., Holt, D.P., Shao, L., and Klunk, W.E. (2003). Effects
of Lipophilicity on the Affinity and Nonspecific Binding of Iodinated Benzothiazole Derivatives. Journal of
Molecular Neuroscience 20, 255-260.
Abstract
WATGEN is an efficient, easily accessible tool for quickly adding water molecules based on their interactions with the protein-ligand interface. WATGEN not only adds hydration layers to protein complexes but also provides detailed information on all possible interactions of each water molecule with either the complex or the surrounding water. In this study, the accuracy of WATGEN for adding water molecules to 3,485 protein-ligand complexes was examined by comparing the WATGEN water with existing water in X-ray structures. Several kinds of data for the ligand and its surrounding waters were collected from WATGEN and other tools. The rationality of the WATGEN results was inspected for seven important protein-drug complexes by looking into the specific interactions of each water that is essential to ligand binding. A few highly conserved water molecules reported previously were predicted, which further increases the credibility of WATGEN. In a global analysis, statistical methods were used to find correlations among the data collected. Even though these data were highly scattered and no significant relationships were found other than those already known, a non-standard analysis revealed several general trends that agree with presumptions based on concrete evidence. An attempt to transform the scattered data into a prediction model failed because of deficiencies in the feature parameters.
Asset Metadata
Creator
Wang, Zhaohui (author)
Core Title
Modeling of water molecules in protein-ligand binding
School
School of Pharmacy
Degree
Master of Science
Degree Program
Pharmaceutical Sciences
Publication Date
08/16/2019
Defense Date
08/16/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computational simulation,explicit water,OAI-PMH Harvest,water modeling,WATGEN
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Haworth, Ian (committee chair), Mackay, Andrew (committee member), Romero, Rebecca (committee member)
Creator Email
595103545@qq.com,wangzhao@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-217747
Unique identifier
UC11673368
Identifier
etd-WangZhaohu-7806.pdf (filename),usctheses-c89-217747 (legacy record id)
Legacy Identifier
etd-WangZhaohu-7806.pdf
Dmrecord
217747
Document Type
Thesis
Rights
Wang, Zhaohui
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA