Stability and folding rate of proteins and identification of their inhibitors
Stability and Folding Rate of Proteins and Identification of Their Inhibitors
by Congyue Wang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (CHEMICAL ENGINEERING)
May 2020
Copyright 2020 Congyue Wang

Acknowledgements

Foremost, I would like to express my sincere gratitude to my advisor, Prof. Muhammad Sahimi, for his continuous support of my Ph.D. study and research. His guidance helped me throughout the time that I was carrying out the research and writing my thesis. I could not have imagined having a better advisor and mentor for my Ph.D. study.

Besides my advisor, I would like to thank the rest of my thesis committee, Professors Katherine Shing and Aiichiro Nakano, for their encouragement, insightful comments, and hard questions.

My sincere thanks also go to Dr. Leili Javidpour for offering insightful advice and great help while working on my diverse and exciting research projects.

Above all, nobody has been more important to me in the pursuit of my studies than the members of my family. I would like to thank my parents, whose love and guidance are with me in whatever I pursue. They are the ultimate role models.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Protein structure
1.2 Protein folding
1.3 Diseases linked with protein misfolding
1.4 Protein folding in confined environments
1.4.1 Theoretical study
1.4.2 Experimental study
1.5 Goals of the Research
2 Effect of the geometry of confining media on the stability and folding rate of α-helix proteins
2.1 Introduction
2.2 Models of proteins
2.3 Molecular Dynamics Simulation
2.4 Results and Discussion
2.5 Implications for Interpreting Experimental Data
2.6 Summary
3 Computer-Aided Discovery of Protein Inhibitors
3.1 Introduction
3.1.1 Structure-Based CADD Methods
3.1.2 Ligand-Based CADD Methods
3.2 Accuracy of CADD
3.3 Neural Network in Drug Discovery
3.3.1 ANNs with Deep Learning
3.4 Discovering γ-Secretase Inhibitors
3.5 Related Works
3.6 Methods
3.6.1 Data Selection
3.6.2 Data Featurization
3.6.3 Structure of the Neural Network
3.6.4 Training of the ANN
3.7 Result
3.7.1 Cross Validation
3.7.2 Discovering Potential Inhibitors
3.7.3 Similarity
3.7.4 Diversity
3.8 Summary
4 Top-Leads of Inhibitors for the Proteins Contributing to Alzheimer’s Disease Identified by Neural Networks: Docking Study
4.1 Introduction
4.2 Methods
4.2.1 Protein Preparation
4.2.2 Ligand Preparation
4.2.3 Generation of the receptor grid
4.2.4 Binding Site
4.2.5 Glide Standard Precision for Ligand Docking
4.3 Result
4.3.1 Binding energy
4.3.2 Top-score inhibitors of ASP 257
4.3.3 Top-score inhibitors of ASP 385
4.4 Summary and Conclusions
4.5 Perspective and Future Studies
4.6 Supporting Information
Bibliography

List of Tables

4.1 Binding energy range of the original and new potential inhibitors.
4.2 List of the important binding energies between the top inhibitors and the residue Asp257.
4.3 Chemical properties of the top ligands.
4.4 List of interactions of residue Asp257 and top inhibitors
4.5 List of the important binding energies between the top inhibitors and the residue Asp385.
4.6 Chemical properties of the top ligands.
4.7 List of interactions of residue Asp385 and top inhibitors
4.8 List of the important binding energies between the top inhibitors and the residue Ala246
4.9 Chemical properties of top ligands
4.10 List of interactions of residue Ala246 and top inhibitors
4.11 List of the important binding energies between the top inhibitors and the residue Gly382
4.12 Chemical properties of top ligands
4.13 List of interactions of residue Gly382 and the top inhibitors
4.14 List of the important binding energies between the top inhibitors and the residue Leu150
4.15 Chemical properties of top ligands
4.16 List of the interactions of the residue Leu150 with the top inhibitors

List of Figures

1.1 Structure of amino acid
1.2 Depiction of the motifs of secondary structure: α-helix and β-strand.
2.1 Folded and misfolded proteins in slit pores of size $D = 1.5$ nm. The walls are repulsive and $T = 0.125$. (a) A protein that is nearly folded. (b) Misfolded protein. (c) Misfolded protein with a β-strand-like structure. (d) Similar to (c) but with the strand partially folded.
2.2 $T_f$, as defined by Eq. (6), versus the size difference $D - D_0$ for proteins of length $\ell$ in slit pores. The walls are repulsive, $w = 0$.
2.3 Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ in a slit pore for proteins of length (a) $\ell = 9$; (b) $\ell = 16$, and (c) $\ell = 23$. $w$ is the strength of the wall potential.
2.4 Dependence of $T_f$, as defined by Eq. (6), on the strength $w$ of the wall potential in a slit pore. The protein’s length is $\ell = 9$.
2.5 Dependence of the free energy on $n$, the average number of α-helical hydrogen bonds, for a protein length of $\ell = 16$ at $T = T_{f,b}$ in a slit pore of size (a) $D = 1.28$ nm, and (b) $D = 1.6$ nm.
2.6 Determination of the minimum size $D_0$ for cylindrical pores: dependence of $n$, the average number of α-helical hydrogen bonds, on the size $D$ of the pore at low temperature. $D_0$ is that value of $D$ for which $n$ has achieved a constant value.
2.7 Temperature dependence of $n$, the average number of α-helical hydrogen bonds, in cylindrical pores of size $D$, as well as under the bulk conditions. Arrows indicate the location of the folding temperature. The pore’s wall is repulsive, $w = 0$, and the protein’s length is $\ell = 16$.
2.8 $T_f$, as defined by Eq. (6), versus the size difference $D - D_0$ for proteins of length $\ell$ in cylindrical pores. The walls are repulsive, $w = 0$.
2.9 Folded and misfolded proteins in a cylindrical pore of size $D = 1.5$ nm with repulsive wall, $w = 0$, at temperature $T = 0.125$. (a) Partially folded protein; (b) c-like, and (c) u-like misfolded structures. Due to the geometry of cylindrical pores, c- and u-like misfolded structures are more likely to form. (d) Similar to (b) and (c) and showing another misfolded state.
2.10 Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a cylindrical pore for proteins of length $\ell = 9$.
2.11 Dependence of $T_f$, as defined by Eq. (6), on the strength $w$ of the wall potential in a cylindrical pore of size $D$. The protein’s length is (a) $\ell = 9$, and (b) $\ell = 16$.
2.12 Dependence of the free energy on $n$, the average number of α-helical hydrogen bonds, for a protein length of $\ell = 16$ at $T = T_{f,b}$ in a cylindrical pore of size $D_0 = 2.7$ nm, and various strengths of the wall potential, $w$.
2.13 Unfolding of a de novo protein of $\ell = 16$ in a spherical cavity of size $D_0 = 2.85$ nm at temperatures (a) $T = 0.105$; (b) 0.11; (c) 0.12; (d) 0.13; (e) 0.14, and (f) 0.15.
2.14 Determination of the minimum size $D_0$ for spherical cavities: dependence of $n$, the average number of α-helical hydrogen bonds, on the size $D$ of the cavity at a low temperature of $T = 0.08$. $D_0$ is that value of $D$ for which $n$ has achieved a constant value. The protein’s length is (a) $\ell = 9$ and (b) $\ell = 16$.
2.15 Temperature dependence of $n$, the average number of α-helical hydrogen bonds, in spherical cavities of size $D$, as well as under the bulk conditions. Arrows indicate the location of the folding temperature. The cavity’s wall is repulsive, $w = 0$, and the protein’s length is $\ell = 16$.
2.16 Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a spherical cavity for proteins of length $\ell$. The wall is repulsive.
2.17 Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a spherical cavity for a protein of length $\ell = 9$. $w = 0$ represents a cavity with repulsive walls.
2.18 Dependence of $T_f$, as defined by Eq. (6), on the strength $w$ of the wall potential in a spherical cavity of size $D$. The protein’s length is $\ell = 16$.
2.19 Comparison of the dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of the three confining media for proteins of length (a) $\ell = 9$ and (b) $\ell = 16$. The walls are repulsive.
2.20 Comparison of the dependence of the free energies on $n$, the average number of α-helical hydrogen bonds, in the three confining media with size $d = 2$ nm. The profiles in (a) correspond to room temperature, and the protein’s length is (a) $\ell = 9$ and (b) $\ell = 16$. The walls are repulsive.
3.1 AUC curve after training model
3.2 Histogram of chemical properties.
3.3 t-SNE projection of 7 physicochemical descriptors of active molecules and molecules generated with the ANN, to two unitless dimensions.
3.4 Nearest-neighbor similarity distribution of active training set and newly discovered molecules.
3.5 Violin plot of similarity distribution of random decoy, newly discovered molecules, and original training set
4.1 Distribution of both the original and new inhibitors.
4.2 Chemical structure and properties of top five potential inhibitors
4.3 Interaction of residue Asp257 and top inhibitor
4.4 The chemical structure and properties of the top inhibitors.
4.5 Interaction of residue Asp385 and top inhibitors
4.6 Chemical structure and properties of top inhibitors
4.7 Interaction of the residue Ala246 with the top inhibitors
4.8 Chemical structure and properties of top inhibitors
4.9 Interactions of residue Gly382 and the top inhibitors
4.10 Chemical structure and properties of the top inhibitors
4.11 Interaction of the residue Leu150 with the top inhibitors

Abstract

The dissertation is composed of two parts. The first part focuses on protein stability in confined structures. We use discontinuous molecular dynamics (DMD) simulation to study the folding and stability of alpha-helix proteins in cylindrical nanopores. Using the PRIME model of proteins, we study the effect on their folding and stability of the pore size, the type of pore wall in terms of its interaction energy (repulsive or attractive), and the nature of the interaction between the pore wall and the proteins.
In the second part we use the concepts of feature locality and hierarchical composition in order to study and model bioactivity and chemical interactions. By training a neural network with existing inhibitors for misfolding of proteins and decoys of the gamma-secretase protein, we attempt to discover more potential inhibitors of the proteins, which would help predict potential drugs for Alzheimer’s disease.

Chapter 1 Introduction

1.1 Protein structure

Proteins play crucial and diverse biological roles. All living organisms, from viruses to human beings, produce proteins to maintain their life characteristics and survive. In fact, proteins constitute more than one half of the dry weight of human cells.

Proteins are composed of amino acids. The general structure of an amino acid is depicted schematically in Figure 1.1. An amino acid is composed of the amino group NH$_2$, which is electrochemically basic, an acidic carboxyl group COOH, an α-carbon atom linking the two groups, a hydrogen atom H, and a side chain R connected to the α-carbon that carries the individual characteristic of each amino acid. The side group R determines the individual amino acid, which is why a large number of protein sequences and structures exist. There are at least 20 different natural amino acids, which are categorized into two groups: those that interact more strongly with water molecules are called polar or hydrophilic, and those that do not are called nonpolar or hydrophobic. A string of amino acids is formed by adding a new monomer to the chain and removing a water molecule at the same time, which is called a condensation process.

FIGURE 1.1: Structure of amino acid

The bond length between two residues, the bond angle, and the dihedral angle pair $(\phi, \psi)$ are used to describe the configuration of amino acids. The large number of conformations of a protein is largely due to the flexibility of the $(\phi, \psi)$ angle pair.
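
As a minimal sketch of how a backbone dihedral angle can be computed from atomic coordinates, the function below returns the signed angle defined by four points. In the standard convention the angle φ of residue i is defined by the atoms C(i-1), N, Cα, C, and ψ by N, Cα, C, N(i+1). The function name `dihedral_deg`, the helper functions, and the coordinates in the usage lines are illustrative assumptions and are not taken from the dissertation.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points p0-p1-p2-p3.

    The angle is measured about the central bond p1-p2
    (hypothetical helper, written for illustration only).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Illustrative coordinates: the outer bonds are perpendicular when
# viewed along the central bond, so the magnitude of the angle is 90.
angle = dihedral_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                     (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(round(abs(angle), 6))
```

Applying such a function to consecutive backbone atoms along a chain yields one $(\phi, \psi)$ pair per residue, which is the representation used in the discussion of protein conformations.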

Most protein structures are unknown, and extensive experimental and theoretical research has been carried out to reveal the structures, as well as their mechanics. It has been found that some common secondary structures emerge from the arrangement of local amino acids. The two major kinds are the α-helix and β-strand motifs. The first α-helix secondary structures were proposed by Pauling et al. in their theoretical studies of the stabilizing hydrogen bonding energy [1–7]. Years later, Perutz and Kendrew used x-ray crystallography to verify such structures [8]. A protein needs its unique three-dimensional (3D) structure in order to function properly. Examples of the secondary structures of proteins, the α-helix and the β-strand, are shown in Figure 1.2.

FIGURE 1.2: Depiction of the motifs of secondary structure: α-helix and β-strand.

1.2 Protein folding

In the 1960s and 1970s Christian Anfinsen [9–11] proposed and verified that, “The native conformation [of protein] is determined by the totality of interatomic interactions and, hence, by the amino acid sequence, in a given environment” [12]. It is currently widely believed that the primary structure of a protein, its nascent chain of amino acids, determines the structure into which it must fold. But how the sequence of a protein determines and governs the folding process is still unknown. In 1968, Levinthal raised a paradox in studying the pathways of protein folding [13]. In Levinthal’s reasoning, the conformations of a protein are mutually independent, and the search for the native conformation can be mapped onto a simple random walk in a vast phase space with an extremely small probability of hitting one single specified point, the native conformation of the protein.

Ramachandran et al. [14, 15] systematically computed the dihedral angle pair $(\phi, \psi)$ for small polypeptides by computing the energetically favorable conformations.
The contour diagram that locates the dihedral angle pairs of stable conformations is referred to as the Ramachandran plot. The plot demonstrates that dipeptides occupy only limited and distinct regions of the $(\phi, \psi)$ plane. Rose et al. [16] reexamined the Ramachandran plot by performing a statistical analysis of exhaustive conformation sampling of short and simple polypeptides, and verified that a protein cannot be considered a chain of mutually independent amino acids, due to steric hindrance between the protein backbone monomers. The conformation space of a protein is, thus, much smaller than what Levinthal had proposed. It was also shown that a protein does not have to sample most of the conformation states during its folding process. In other words, protein folding is actually a cooperative process [17, 18]. In this theory, it is believed that during the folding process the preceding conformation of a protein can direct and improve the successive folding steps and, thus, narrow down the conformation space to help the protein fold rapidly.

Wolynes and Onuchic [19, 20] proposed a theory that uses a funnel-shaped energy landscape to explain the protein folding process. Chan and Dill [21] proposed the energy landscape to be funnel-like with small energetic frustrations and small nonnative local minima. The higher-energy part of the funnel is wide and stands for a large conformation space in which the protein takes extended (unfolded) structures. The funnel becomes narrower as the free energy is lowered. In protein folding, the free energy funnel directs the protein towards a well-defined global minimum in a non-smooth way, due to the energetic frustration along the funnel surface. Because of the hydrophobic force between water and hydrophobic amino acids, an amino acid chain first collapses into a globular structure, regardless of the starting conformation.
Hydrophobic monomers of the protein are buried inside and form a core, while polar monomers tend to stay on the protein surface, forming a shield that partly separates the hydrophobic monomers from water. The globular structure is then geared to form the necessary hydrogen bonds between residues. During folding, the enthalpy gain competes with the entropy loss [22] and, as a result, there will be a folded potential well, unfolded potential wells, and transition state barriers that separate them.

1.3 Diseases linked with protein misfolding

Protein misfolding happens when proteins do not fold into their correct 3D structures. Some human and animal diseases are believed to be related to improper protein folding. They result from the fact that some responsible proteins, particularly the neuronal membrane proteins, misfold. Misfolded proteins can often aggregate and produce insoluble amorphous or fibrillar structures, called amyloid fibrils. These fibrils can deposit and form sponge-like structures in a variety of tissues in the brain. Alzheimer’s disease is an example. Another type of disease that may be linked with protein misfolding is prion disease [23], which is usually termed spongiform encephalopathy and is induced by the so-called prion protein that infects responsible proteins and drives them to misfold. The most famous example is Mad Cow disease.

The common feature of these diseases is protein misfolding and aggregation, triggered either by prion protein or simply by aging. As a result, such diseases damage the cognitive and motor abilities of animals or humans, leading eventually to death. It was pointed out that fibrillar aggregates are a generic form of polypeptide structure under some specific circumstances [24]. A great deal of work has been done to reveal the misfolding and aggregation mechanisms, so as to develop the ability to prevent the transition from normal proteins to misfolded ones.
There are still many open questions that are vital to the understanding of the diseases and to their treatment. This explains why protein folding, misfolding, and aggregation are important research topics with great practical significance. Our studies of the effects of confinement and crowding on protein stability and dynamics are stimulated by this perspective.

1.4 Protein folding in confined environments

Since protein folding or misfolding in living organisms takes place in crowded cellular environments, their study in structures that mimic the crowded environments has attracted wide attention. What follows is a brief discussion of the past work.

1.4.1 Theoretical study

Besides experimental and theoretical studies, an ensemble of computer simulation approaches plays an increasingly important role in protein folding research. Various computational techniques have been implemented, depending on the characteristics or aspects of the problem under study. At the top level of the in silico simulations of protein folding are the all-atom models using molecular dynamics (MD) simulation. There exist several versions of all-atom MD models. In the first version, all atoms in each protein monomer are included and the interatomic interaction potentials are represented, but no solvent effect is included. In the second version, the protein is represented by an all-atom approach, and a continuum model of the solvent is added in order to include the effects of the solvent (water). The third version uses an all-atom method to represent both the protein and the solvent molecules, usually leading to a large number of atoms, on the order of $10^5$. Such all-atom models need accurate descriptions of the interaction potentials between different types of atoms, which can only be obtained through experiments and high-accuracy quantum mechanical computations.
The computational time step of such a simulation is on the order of $10^{-15}$ s, whereas the typical time scale of the folding process is usually at least $10^{-6}$ s. Following a single folding event therefore requires on the order of $10^{9}$ or more time steps, which, given the limited power of current computers, makes tracking a complete protein folding process very difficult, especially with a larger number of atoms. Current studies often focus on the folding of short polypeptides or proteins for shorter durations, using all-atom models. Another direction of all-atom MD models is to track the unfolding process and establish the free-energy landscape [25–27]. The advantage of all-atom models is their accurate description of the interaction potentials for the study of protein folding and unfolding processes. In order to determine how the amino acid sequence determines the protein’s 3D structure and to study the process in detail, all-atom protein models with explicit solvent are the generally accepted tools. Some insightful results have been obtained for short polypeptides [28, 29]. With increasing computer power, it will become possible to track the complete protein folding process. Pande et al. at Stanford University developed a distributed system termed Folding@home [30] that can integrate contributions of computers worldwide to study the protein folding process. In addition, Blue Gene supercomputers at IBM have recently been used to produce interesting results for protein folding using ensembles of all-atom MD trajectories.

A protein may adopt, in addition to its native and random-coil-like denatured states, a number of collapsed globular states and sometimes misfolded states [31, 32]. Only in its native state is the protein biologically active. Previous studies suggest that confinement affects the mechanism of structural transitions by altering the pathways that lead a protein from a random-coil state to various collapsed globular states, or to the native state [33, 34].
The changes in the thermodynamics and kinetics of protein folding are primarily due to the restriction of the configurational space for denatured states [35, 36]. The stability of a protein is often measured via the folding temperature $T_f$, which was shown to vary with the radius $R_c$ of the confining cage as $(T_f - T_f^{\mathrm{bulk}})/T_f^{\mathrm{bulk}} \propto R_c^{-\gamma}$, where $\gamma$ was found to be 3.25. Previous studies of protein folding under spherical confinement [37] and in a crowded solution [38] have reported an exponent close to 2, the value expected for an ideal chain [39]. Another study of protein folding in spherical confinement found that the scaling exponent varied with the protein and the repulsive confinement model [40]. For an excluded-volume chain confined in a spherical cavity, the free energy of confinement is expected to scale with $\gamma = 3.75$ [41]. It is, therefore, not clear whether protein-folding thermodynamics follows a polymer-law scaling behavior under confinement and, if so, with what value of the exponent. Although folding kinetics is known to be affected by confinement, no quantitative explanation for the observed increase in the rates exists. The essential features of protein-folding kinetics in bulk solution can be captured by diffusion along a well-defined reaction coordinate [42–44]. Although much progress has been made regarding the effect of confinement on the folding and collapse of denatured proteins, it remains unclear how the specific protein-surface interactions, and in particular the surface energy, affect the kinetics and thermodynamics of structural transitions for confined proteins.

Moreover, protein folding inside the body is much more complicated. Inside a cell a protein binds and interacts with other molecules, and modifies its conformation via folding and unfolding in order to function well in a crowded and/or confined environment.
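
To make the practical difference between the scaling exponents quoted earlier in this section concrete, the short sketch below compares how strongly the relative shift in folding temperature would grow when the cage radius is halved, assuming a pure power law in which the shift is proportional to the cage radius raised to the power of minus the exponent. The function name and the choice of halving the radius are illustrative assumptions; only the exponent values 2, 3.25, and 3.75 come from the text.

```python
def shift_ratio(r_small, r_large, gamma):
    """Ratio of relative folding-temperature shifts in two cages.

    Assumes the relative shift scales as R_c ** (-gamma), so the
    prefactor cancels and only the ratio of radii matters
    (hypothetical helper, written for illustration only).
    """
    return (r_large / r_small) ** gamma


# Halving the cage radius (arbitrary units) for the three exponents
# mentioned in the text: ideal chain, simulation fit, excluded volume.
for gamma in (2.0, 3.25, 3.75):
    print(gamma, shift_ratio(1.0, 2.0, gamma))
```

Under this assumption the predicted enhancement on halving the radius differs by more than a factor of three between the ideal-chain and excluded-volume exponents, which illustrates why the value of the exponent matters for interpreting stabilization in small cages.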
Experimental and theoretical studies have shown that when a protein is encapsulated in a small space, its thermal stability, chemical reactivity, and folding dynamics are affected. It is believed that these effects arise from excluded-volume interactions, due either to macromolecular crowding [45–55] or to confinement of the protein in a small volume [56–65]. Confinement of a folding protein inside the body is usually caused by the ribosome tunnel [66] or a chaperonin cavity [67–69], and can indeed have significant effects on folding. The free volume available to the functional protein is limited, either by the dense surrounding biomolecules or by small confinements. As a result, the limited space and the reduced free volume affect the protein stability and folding dynamics in a nontrivial way. Using circular dichroism in the UV region, Eggers and Valentine [70] showed that the thermal stability of the protein α-lactalbumin is enhanced when it is encapsulated in a silica matrix.

Chaperones (chaperonin folding machines), a class of proteins found in all organisms, can help nascent proteins fold into their correct native conformations [71]. It has been shown that protein folding in vivo is assisted by molecular chaperones and chaperonins that interact with and stabilize newly synthesized polypeptides [72–75]. A relatively well-understood example is protein folding in the cavity of the GroEL-GroES complex, a barrel-shaped bacterial chaperonin in which a nascent polypeptide can be encapsulated and undergo productive structural transitions [76]. Although the precise mechanism by which a chaperonin assists protein folding remains to be uncovered, two mechanisms have been proposed for understanding chaperonin action: the Anfinsen cage model and the iterative annealing model.
In distinguishing the two working mechanisms, Brinker [77] found that the folding of a denatured protein in a narrow space is accelerated compared to that in free solution, supporting the Anfinsen cage model.

Moreover, recent theoretical investigations have revealed that a chaperonin-like cavity favors the compact structure of the native protein, thereby accelerating the folding rate [35, 36, 78–82]. Both the stability and the folding kinetics of an encapsulated protein are strongly correlated with the geometry and degree of confinement [36, 80]. It has been shown that in an inert or a hydrophilic cage, as provided by a chaperonin, a protein becomes most stable, and the rate of folding is also maximized, when the cavity size is 1.6 times the gyration radius of the native protein [36]. The confinement has little effect if the protein is too small. Conversely, it may prohibit folding if the encaged protein is much larger than the cage [33, 34]. A recent lattice simulation revealed that the kinetics of protein folding in a chaperonin-like cage depends critically on the hydrophobicity of the confining surface, in addition to the accessible volume [83]. More recent off-lattice simulations suggest that a weakly hydrophobic environment accelerates protein folding via transient binding of the intermediate states to the cage surface [84]. However, to our knowledge no experimental validation of the simulation results has been reported, and little is yet known about the interplay between the confinement and the surface energy, and about their specific effects on the collapse and folding of confined proteins.

Cheung, Klimov, and Thirumalai [38] showed that in the limit in which the crowding particles are much larger and heavier than the protein, the macromolecular crowding effects can be approximated by confinement, and the shape and dimensions of the cavity will depend on the crowders’ concentration. This approach was shown to be applicable for a range of conditions.
Specifically, they showed that the effect of crowding on protein-folding kinetics can be mimicked by confining the protein within a spherical cavity. The effect of crowding at low concentrations will be different from encapsulation in a spherical cavity, and may be better represented by the weaker confining environment within a cylinder or between two planes. Because all of these confinement configurations may also appear naturally (e.g., cylindrical confinement for protein passage through a ribosome tunnel and planar confinement for proteins at interfaces or near surfaces), it is instructive to study the effects of varying the reduced dimensions due to confinement on protein stability and kinetics. Considerations from polymer physics suggest that the free energy of confinement within repulsive boundaries should have a simple power-law dependence [39, 41, 85] on the size of the cavity. Takagi et al. [86] reported a scaling law, based on a simple coarse-grained model, for the shift in the protein-folding temperature with respect to the bulk in a chaperonin-like cage (a cylindrical cage with length $L$ twice the radius $R$).

Some studies also suggested that the solvent plays a critical role in protein folding, as most of the free energy for folding comes from maximizing the solvent entropy (because of the molecular nature of hydrophobicity). Polymer models for confined folding do not consider the effect of confinement on the solvent and its subsequent effects on protein stability. Although explicit solvent complicates analytical models and makes simulation more computationally demanding, including explicit solvent allows one to account for solvent-mediated effects in the folding mechanism. For example, it has been shown for simulations of small proteins that implicit and explicit solvent models can yield similar folding rates, but different folding mechanisms [87, 88].
Because "confinement" includes confined solvent and confined protein, and nanoscopic water has been shown (by experiment and simulation) to behave differently from bulk water both thermodynamically and kinetically [89–91], we expect that treating water explicitly is crucial to properly describing the dynamics of protein folding in confined spaces.

Recent folding simulations of purely polymeric models and models that treat solvent explicitly have shown drastically different results. Specifically, Ziv et al. [92] have suggested that for a small helical peptide, helix formation is stabilized upon confinement to a cylindrical cavity. These results were explained in terms of polymer entropy arguments, as described above. On the other hand, Sorin and Pande [93] showed that for an α-helical peptide confined to a single-walled carbon nanotube with explicit solvent, the opposite effect is observed; the unfolded state is stabilized and the helix unfolds. This observation was explained in terms of solvent entropy. In bulk, protein folding maximizes solvent entropy, but in a confined system solvent entropy is already limited and protein-protein interactions experience a reduced entropic stabilization relative to protein-water interactions.

1.4.2 Experimental study

Interest in mimicking the effects of cellular encapsulation, coupled with theoretical predictions of significant stabilization, has spurred experimental studies of protein folding stability in artificially confined environments. Two kinds of encapsulation are now widely used. The first is formed by nanoporous silica gels or glasses [94–97] or polyacrylamide gels [36]; the second is formed by sodium bis(2-ethylhexyl) sulfosuccinate (AOT) reverse micelles [98–102]. In both cases, the cage sizes can be controlled. The inner diameters of AOT reverse micelles are easily varied by changing $W_0$, the molar ratio of water to AOT, also known as water loading.
With isooctane as cosolvent, the cage size follows a linear relation with water loading: $R_c = 1.5W_0 + 4.5$ (in Å) [103]. These studies have confirmed that significant stabilization can arise from protein encapsulation. For example, Ravindra et al. [94] found that, upon being confined within the pores of a silica glass, with sizes averaging 25 Å, the melting temperature of ribonuclease A is raised by 30 °C. The measured increase in melting temperature for the 124-residue protein is in quantitative agreement with predicted stabilization by pores of such sizes. Mukherjee et al. [102] observed increased helix formation of alanine-rich peptides in AOT reverse micelles, but attributed the increased stability to a different reason.

Perhaps the most dramatic demonstration of the stabilizing effect of confinement is provided by the study of Peterson et al. [99]. They extensively mutated a designed three-helix bundle protein, α3W, so that it became unfolded in a dilute solution, as indicated by an unresolved $^{15}$N heteronuclear single quantum correlation (HSQC) spectrum. When the α3W variant is encapsulated in AOT reverse micelles, the $^{15}$N HSQC peaks are sharpened. For a low water loading, corresponding to a small cage, the $^{15}$N HSQC spectrum resembles that of the original α3W, indicating the formation of a three-helix bundle structure.

The exit tunnel of the ribosome is 100 Å long with a diameter of 10 Å in the narrowest central part and 20 Å toward the ends. Lu and Deutsch [104] found that helix formation is promoted at the wider portions of the exit tunnel. Experimental and theoretical studies aimed at untangling possible contributions from entropy reduction by confinement, surface interaction, and modulation of water activity will shed light on the functional roles of the ribosomal exit tunnel in the folding of nascent proteins.

Tang et al.
[105] designed mutations of the E. coli chaperonin GroEL to test the theoretical prediction that the Anfinsen cage facilitates protein folding by favoring compact intermediates. They reduced or enlarged the cage size by adding or removing sequences at the C-termini of the GroEL subunits, which protrude into the central cavity, and found that the folding rate can be modulated by the cage size. There is an optimal cage size for each substrate protein, and the optimum shifts to larger sizes for larger proteins. These results thus "are remarkably consistent with prediction." Indeed, Hayer-Hartl and Minton [106] showed that the results can be quantitatively rationalized by a theoretical model, with the transition state (instead of the folded state) modeled as a sphere. However, Tang et al. also presented data demonstrating that more than just entropy reduction from confinement is at play in GroEL/ES-assisted folding. The C-terminal sequences and a number of conserved negative charges lining the cavity wall are critical in facilitating the folding of some proteins. Importantly, Tang et al. also found that the ability of overexpressed GroEL/ES mutants to suppress aggregation of overexpressed substrate proteins in E. coli correlates with the folding rates of the substrate proteins within the respective GroEL/ES mutants.

α-Synuclein is a naturally disordered protein, which, in fibrillar form, is the primary component of Lewy bodies found in Parkinson's disease patients. Overexpressed α-synuclein is found in the periplasm of E. coli [107]. Using in-cell NMR spectroscopy, McNulty et al. [108] investigated the conformations of α-synuclein in the periplasm of E. coli. They found that α-synuclein adopts a more compact form than is observed in a dilute solution. Interestingly, the more compact form can be induced by adding 300 g/l of bovine serum albumin as a crowding agent, in line with theoretical predictions for conformational compaction by crowding.
These studies demonstrate that, in concert with theoretical modeling and parallel experiments in simulated conditions, much of the complexity of protein folding in cellular environments can be elucidated.

1.5 Goals of The Research

Given the above discussions, the goals of this research are twofold. One is to study the stability and folding rates of α-helical proteins in very tight confining media. We will show not only that the folding rates and stability of the proteins are affected very strongly by very tight confining media, but also that the results may provide a plausible explanation for the recently reported anomalously low rates of folding seen in experiments. Given the significance of misfolding of proteins to their aggregation, which in turn is believed to be important to many currently untreatable illnesses, such as Alzheimer's disease, the second goal is to try to identify molecular structures that inhibit folding or misfolding. Toward this goal, we will utilize artificial neural networks that have very recently found applications in biological systems.

Chapter 2

Effect of the geometry of confining media on the stability and folding rate of α-helix proteins

2.1 Introduction

Globular proteins fold into compact, biologically active molecules. The mechanisms of folding and factors that contribute to it, as well as the environment that can give rise to the folding transition, have been studied extensively.[109][110] But the problem is not purely theoretical, as industrial production of enzymes and therapeutic proteins based on recombinant DNA is linked directly to protein folding.[111]

Among factors that affect protein folding is confinement.
Numerous experiments have indicated that folding of denatured proteins is accelerated in the cage model, which is usually referred to as the Anfinsen cage.[112] Protein folding in confined media is important to biocatalysis,[113] biosensors,[114] enzyme immobilization in porous materials,[115][116] and protein purification by membranes.[117] As for biological systems, macromolecules in living cells create a confined environment for natural proteins and, thus, their folding in vivo occurs in their presence.[118] Moreover, living cells contribute to the folding by entrapping them in chaperonins.[119] It is also widely believed that formation of misfolded proteins and fibrils, a major contributing factor in the development of Alzheimer's, Parkinson's, and Huntington's diseases,[120] occurs in the molecularly crowded environment of living cells. It is due to such important applications that the study of protein folding in confined media has attracted great attention over the past two decades.[121]

But what is the effect of confinement on the stability of proteins and associated structures? The answer to the question depends, among other factors, on the nature of the confined medium's walls, namely, repulsive versus attractive walls, as well as the size of the confined medium relative to that of the proteins. The current general consensus is that (i) molecular crowding may stabilize compact native states of proteins at moderate, but not high, concentrations;[122] (ii) trapping of proteins in tight pores with purely repulsive interaction stabilizes their native state and increases the folding temperature $T_f$, and (iii) as the confined medium becomes smaller, $T_f$ increases continuously.[123] In fact, a standard strategy for stabilizing enzymes and proteins is immobilizing them in a confined medium, such as a porous material that has an inert pore surface.[124]

But stabilizing proteins and enzymes in tight pores does have its limit.
If the pore size is very close to a critical size, which is equal to the smallest dimension of a protein's folded state, stabilization stops and the protein may not fold at all,[123] with the critical size being larger for larger proteins.[124] On the other hand, some researchers have argued[123] that the folding rate is optimal in a pore with a size larger than the size of the protein's folded state. But if the pore size decreases further, the folding rate will decrease due to destabilization of the transition states in the folding pathway. In a confined medium with attractive walls, weak interactions can stabilize proteins to an extent less than that in a medium with repulsive walls. Strong attractive interactions, on the other hand, may lead to adsorption of the protein onto the walls, either partially or completely, in which case the protein will lose its native structure.[125][126]

Javidpour and Sahimi[127] studied folding of α-helix proteins of various lengths and demonstrated for the first time, to our knowledge, that nanopores with purely repulsive walls destabilize smaller α-helices, with folding temperatures that are smaller than those in bulk solution. The size of the nanopores was such that it did not affect the folded state's structure. In addition, they demonstrated that the destabilization is accompanied unexpectedly by entropic stabilization of the misfolded β-strandlike states, and described the possible implications of the results for several sets of experimental data, including anomalously low folding rates, that had been reported but had remained unexplained (see below), and for other important problems in which protein folding plays an important role. Since the results that we reported were unexpected, one important question that one should address is, to what extent did the results depend on the geometry of the confined medium and the nature of the interactions of its walls with the proteins?
If the results that Javidpour and Sahimi had previously reported for slit pores are independent of the geometry of confining media, then our results in this Chapter will have direct implications for the aforementioned experimental data. The present Chapter aims to address this question by studying the problem in three types of small confining media, namely, a slit pore between two parallel flat surfaces, a cylindrical pore, and a spherical cavity, and comparing the results.

2.2 Models of proteins

We use the de novo-designed α-family of proteins[128] that consist of four types of amino acids in their 16-residue sequence. The model has been simplified[129][130] to a sequence of hydrophobic (H) and polar (P) residues, {PPHPPHHPPHPPHHPP}. Periodicity in the H-P sequence of the 16-residue peptide α1B was used, so as to make three other sequences PP(HPPHHPP)$_n$, with $n = 1, 2, 3$, and 4, corresponding to protein lengths of $\ell = 9$, 16, 23 and 30. They all have similar native structures, and they all fold onto an α-helix with $(\ell - 4)$ hydrogen bonds (HBs).[126]

The PRIME model[131][132] has been successful in reproducing several important experimental features of proteins under bulk conditions, particularly the formation of fibrils.[132][133] As explained below, we have modified the model for use in a confined medium.[126][127] PRIME, an intermediate-resolution model of proteins, represents every amino acid by four united atom (UA) beads, with a nitrogen UA representing the amide N and hydrogen of an amino acid; a C$_\alpha$ UA modeling the α-C and its hydrogen, and a carbon UA representing the carbonyl carbon and oxygen. The side chains are represented by the bead R, all of which are assumed to have the same diameter as that of alanine, CH$_3$. All the backbone bonds' lengths and bond angles are fixed at their experimental values, as is the distance between consecutive C$_\alpha$ UAs.
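The sequence family described in this section can be generated mechanically. The short Python sketch below is an illustration only and is not part of the thesis; the function names `hp_sequence` and `max_alpha_hbonds` are hypothetical. It builds PP(HPPHHPP)$_n$ for $n = 1, \ldots, 4$ and checks the quoted lengths and the $(\ell - 4)$ count of helical hydrogen bonds.

```python
def hp_sequence(n: int) -> str:
    """Return the H/P pattern PP(HPPHHPP)_n as a string (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return "PP" + "HPPHHPP" * n


def max_alpha_hbonds(length: int) -> int:
    """Number of helical HBs in a fully folded helix of the given length, (l - 4)."""
    return max(length - 4, 0)


if __name__ == "__main__":
    for n in range(1, 5):
        seq = hp_sequence(n)
        # Columns: n, chain length, maximum number of helical HBs, sequence
        print(n, len(seq), max_alpha_hbonds(len(seq)), seq)
```

For $n = 2$ the pattern reproduces the 16-residue sequence quoted in the text, and for $n = 4$ the helix has 26 helical HBs.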
Other proteins do have more complex molecular structures, but because the three-dimensional structure of native proteins is controlled mainly by their amino acid sequences, which the model represents accurately, PRIME is adequate for the problem that we study.

2.3 Molecular Dynamics Simulation

We use discontinuous molecular dynamics (DMD) simulation.[134] In the DMD simulations, the interaction potentials are simplified in order to speed up the computations; see below. The speedup makes it possible to make long simulation runs, on the order of microseconds. The PRIME model of proteins was specifically developed for use with the DMD simulation.

Our procedure for estimating the folding temperature $T_f$ under various conditions is similar to what we used in our previous papers.[126][127] $T_f$ is defined as the temperature at which the heat capacity $C_V$ is maximum. Note that we compute the folding temperatures only for those cases in which the folded state is stable and does not unfold at low temperatures. This enables us to compare the stability of proteins in the various geometries that we study. In practical terms, in what follows when we report a $T_f$ for any protein with specific length $\ell$, pore size $D$ and the type of protein-wall interaction potential $U_w$, we mean that at low (dimensionless) temperatures $T \lesssim 0.09$ the folded state (the α-helix) could maintain its structure and would not unfold over about 100 ns of the simulations. Otherwise, we do not report $T_f$ for such cases.

Four types of forces are included in the DMD simulations, namely, the hard-core repulsion and the attraction between the bonded and pseudobonded beads, and between pairs of the backbone beads during the HB formation, and between hydrophobic (HP) side chains. Nearest-neighbor beads along the chain backbone, as well as the C$_\alpha$ and the R beads, are covalently bonded.
The pseudobonds are between (i) next-nearest neighbor beads along the backbone to keep their angles fixed; (ii) neighboring pairs of C$_\alpha$ beads to maintain their distances close to the experimental data, and (iii) the side chains and backbone N and C united atom beads to keep the side-chain beads fixed relative to the backbone. As a result, the interpeptide group is kept in the trans configuration and all the model residues as L-isomers, as required. The HB interaction may occur between the N and C beads with at least three intervening residues. Each bead may not, however, contribute to more than one HB at any time, with the range of the interaction being about 4.2 Å and the strength being $\epsilon_{HB}$. An α-HB is one between C$_i$ and N$_{i+4}$, while other HBs are referred to as non-α-HBs. Since the NH group of an amino acid forms a HB with the C=O group of the amino acid four residues earlier, a fully folded sequence should have $(\ell - 4)$ α-HBs.

The potential between a pair $ij$ of the bonded beads, separated by a distance $r_{ij}$, is given by:

$$
U_{ij} = \begin{cases}
\infty, & r_{ij} \le l_b(1-\delta) \\
\infty, & r_{ij} \ge l_b(1+\delta) \\
0, & l_b(1-\delta) < r_{ij} < l_b(1+\delta)
\end{cases}
\tag{2.1}
$$

Here, $l_b$ is the ideal bond length and $\delta = 0.02375$ is the tolerance in the bond's length (as it fluctuates). There are also HP interactions between the side chains of the H residues in the sequence, if there are at least 3 intervening residues between them. Then, the interaction is given by

$$
U_{HP} = \begin{cases}
\infty, & r_{ij} \le \sigma_{HP} \\
-\epsilon_{HP}, & \sigma_{HP} < r_{ij} \le 1.5\,\sigma_{HP} \\
0, & r_{ij} > 1.5\,\sigma_{HP}
\end{cases}
\tag{2.2}
$$

where $\sigma_{HP}$ is the HP side-chains' diameter and $\epsilon_{HP}$ is the strength of the HP attraction. The shape of the HB potential is similar to that of the HP potential.

The HBs are stable when the angles in N-H-O and C-O-H are almost 180°. Such angles are controlled by a repulsive interaction between each of the N and C beads with the neighboring beads of the other one. Thus, if a HB is
formed between beads N$_i$ and C$_j$, a repulsive interaction between the neighbor beads of N$_i$, namely, C$_{i-1}$ and C$_{\alpha,i}$, with C$_j$ is assumed, with a similar assumption for the neighbor beads of C$_j$, namely, N$_{j+1}$ and C$_{\alpha,j}$, with the N$_i$ bead. If one of the N or C beads is at one end of the protein, it will have only one neighbor bead in the backbone (instead of two) and, hence, controlling the HB angles will be limited, causing the HBs with one of their terminal constituents to be less restricted and, thus, more stable than the other HBs. This may lead to the formation of the next non-α-helical HBs in a part of the protein between the N and C beads and of semi-stable structures that influence the simulation results. Thus, we modify the PRIME and proceed as follows.

Assume that the N-terminal bead, N$_1$, has a HB with C$_j$. For $i = 1$, the bead C$_{i-1}$ does not exist to have a repulsive interaction with C$_j$ and help control the HB angles. Therefore, we use C$_{\alpha,1}$. Not only can we consider the repulsion between this bead and C$_j$, but also define an upper limit for their distance, so as to control freedom of motion of N$_1$ and C$_j$ that constitute the beads in the HB. The potential $U_{kl}$ of such interactions is given by

$$
U_{kl} = \begin{cases}
\infty, & r_{kl} \le \tfrac{1}{2}(\sigma_k + \sigma_l) \\
\epsilon_{HB}, & \tfrac{1}{2}(\sigma_k + \sigma_l) < r_{kl} \le d_1 \\
0, & d_1 < r_{kl} \le d_2 \\
\infty, & r_{kl} > d_2
\end{cases}
\tag{2.3}
$$

Two H atoms have chemical bonds with the nitrogen in the protein's N-terminal, can rotate around the N$_1$–C$_{\alpha,1}$ bond, and satisfy the constraints on the angles between the chemical bonds of N$_1$. Thus, if a HB is formed, one of the two H atoms lies in a plane formed by N, O, and C, such that the angles in N-H-O and C-O-H are as close to 180° as possible. Therefore, we force the maximum distance between C$_{\alpha,1}$ and C$_j$ to be the same as the maximum distance $d_2$ between C$_{\alpha,i}$ and C$_j$ in the usual HBs, which makes it possible to control the angles in a HB that contains N$_1$.
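Because DMD relies on stepwise potentials, the bonded and hydrophobic interactions of Eqs. (2.1) and (2.2) reduce to simple range checks. The following Python sketch is an illustration under stated assumptions rather than the simulation code used in this work: the function names are hypothetical, the boundary conventions at the step edges are a choice made here, and the well value $-\epsilon_{HP}$ assumes that the hydrophobic interaction is attractive, as stated in the text.

```python
import math

DELTA = 0.02375  # bond-length tolerance quoted in the text


def bonded_potential(r: float, l_b: float, delta: float = DELTA) -> float:
    """Eq. (2.1): zero inside l_b(1 - delta) < r < l_b(1 + delta), infinite outside."""
    if l_b * (1.0 - delta) < r < l_b * (1.0 + delta):
        return 0.0
    return math.inf


def hp_potential(r: float, sigma_hp: float, eps_hp: float) -> float:
    """Eq. (2.2): hard core up to sigma_hp, square well out to 1.5 * sigma_hp.

    eps_hp is the (assumed positive) well depth, so the well contributes -eps_hp.
    """
    if r <= sigma_hp:
        return math.inf
    if r <= 1.5 * sigma_hp:
        return -eps_hp
    return 0.0


if __name__ == "__main__":
    # Distances are in arbitrary units; only ratios to l_b and sigma_hp matter.
    print(bonded_potential(1.00, l_b=1.0), bonded_potential(1.03, l_b=1.0))
    print(hp_potential(1.2, sigma_hp=1.0, eps_hp=0.5))
```

In an event-driven DMD code these functions would not be evaluated at every time step; only the distances at which the potential jumps matter, since they define the collision events.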
A similar approach is used when the C-terminal C$_\ell$ forms a HB with N$_i$. The distance $d_2$ is temperature-dependent, with a functional form that was obtained from separate simulations.[126]

The interaction between the walls of a cylindrical pore of radius $R$ and the protein beads is

$$
U_w = \begin{cases}
\infty, & d > R - d_{3X} \\
-\epsilon_w, & R - d_{3X} - d_{4X} \le d \le R - d_{3X} \\
0, & d < R - d_{3X} - d_{4X}
\end{cases}
\tag{2.4}
$$

if the wall is attractive, and

$$
U_w = \begin{cases}
\infty, & d > R - d_{3X} \\
0, & d \le R - d_{3X}
\end{cases}
\tag{2.5}
$$

if the wall is repulsive, where $d$ is the distance from the center of the cylinder. Here, X refers to the four beads. Using separate simulations and assuming that the walls are made of carbon, the interaction potential $U_{CX}$ between different beads was estimated.[126] The distances at which $U_{CX}$ and its second derivative were zero were taken as $d_{3X}$ and $d_{3X} + d_{4X}$. The results (all in Å) are $d_{3X} \approx 2.85$, 3.02, 3.14, and 3.31, and $d_{4X} \approx 0.96$, 1.01, 0.98, and 1.12, for X = N, C$_\alpha$, C, and R, respectively. Up to 200 μs of real time were simulated for the case of proteins in cylindrical pores.

In the spherical cavity the interatomic potentials for HP, HB, and hard-core repulsion within the protein remain the same as above. The interaction potentials between the cavity's wall and the protein beads are also the same as Eqs. (2.4) and (2.5), except that the distance $d$ should be replaced by the magnitude of the radial distance of the center of a bead X from the origin of the coordinates. The same aforementioned values of $d_{3X}$ and $d_{4X}$ are also used in the spherical cavity.

Unlike a slit or cylindrical nanopore, there is no uniform minimum spherical diameter that can contain all the proteins of various lengths that we consider. Therefore, we determined the minimum diameter in which a fully-folded protein can exist for each protein's length. This was done by determining the average number of bonds and of the HBs throughout the protein at (dimensionless) temperature of 0.08.
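The wall potentials of Eqs. (2.4) and (2.5) can be written as a single function of the distance $d$ of a bead from the pore axis (or from the cavity center). The Python sketch below is illustrative only: the function name, the dictionary layout, and the key `"CA"` for the C$_\alpha$ bead are assumptions made here, and the pairing of the $d_{3X}$ and $d_{4X}$ values with the bead types simply follows the order in which they are listed in the text.

```python
import math

# (d_3X, d_4X) in angstroms for each bead type, in the order given in the text.
# "CA" is a hypothetical key used here for the C-alpha bead.
WALL_PARAMS = {
    "N": (2.85, 0.96),
    "CA": (3.02, 1.01),
    "C": (3.14, 0.98),
    "R": (3.31, 1.12),
}


def wall_potential(d: float, radius: float, bead: str, eps_w: float = 0.0) -> float:
    """Bead-wall energy for a bead at distance d from the axis of a pore of given radius.

    eps_w = 0 gives the purely repulsive wall of Eq. (2.5); eps_w > 0 adds the
    attractive shell of Eq. (2.4), of width d_4X next to the excluded region.
    """
    d3, d4 = WALL_PARAMS[bead]
    if d > radius - d3:
        return math.inf          # bead overlaps the wall
    if eps_w > 0.0 and d >= radius - d3 - d4:
        return -eps_w            # bead sits in the attractive shell
    return 0.0                   # bead is far from the wall


if __name__ == "__main__":
    # A nitrogen bead in a pore of radius 10 angstroms with eps_w = 0.125
    for d in (6.0, 7.0, 7.5):
        print(d, wall_potential(d, radius=10.0, bead="N", eps_w=0.125))
```

For the spherical cavity the same function applies with `d` taken as the radial distance of the bead from the origin, as described above.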
As pointed out earlier, our previous simulations[126] indicated that the respective number of such interactions is $(\ell - 4)$ for a fully-folded protein of length $\ell$. Thus, for proteins of lengths $\ell = 9$, 16, 23, and 30 the minimum cavity diameters $D$ were determined to be 1.91, 2.85, 3.91, and 4.93 nm, respectively. In all but one case the potential $U_w$ was attractive, varying from $\epsilon_w = 0.0625$ to 0.375. Note that a larger $\epsilon_w > 0$ implies a more strongly attractive pore. For each cavity diameter and strength of the wall potential, the proteins with lengths $\ell = 9$ and 16 were simulated for 1.54 μs and 1.21 μs, respectively. The corresponding simulated time for proteins of lengths $\ell = 23$ and 30 was 1.60 μs.

The interaction between amino acids in proteins is not, of course, constrained to hydrogen bonding and repulsive/attractive interactions, as they may have electric charges, and disulfide bonds may form between some special amino acids. In addition, other types of proteins have more complex structures that include β-strands, sheets, etc. Despite the simplifying features of our model protein, however, we expect our results to be quite general, because many fundamental aspects of protein folding are captured by even such simplistic models as two-dimensional lattice HP models.[135]

In the simulations and in discussion of the results in the next section, energy is expressed in units of $\epsilon_{HB}$, the strength of the HB interaction. Temperature is held constant by using the Andersen thermostat, and is measured in units of a temperature $T_u$ at which $k_B T_u = \epsilon_{HB}$, where $k_B$ is the Boltzmann's constant, with $\epsilon_{HB}$ estimated previously[131] to be about 6 kcal/mol, hence yielding $T_u \approx 3000$ K. Other calculations indicated[136] that $\epsilon_{HB}$ is dependent upon many parameters, such as the types of the neighboring atoms and the chemical structure of the molecule. For example, in polyalanine $\epsilon_{HB}$ varies[137] between 3.5 and 8.6 kcal/mol.
Full-atom MD simulations indicated[138] that in α-helices $\epsilon_{HB}$ varies between 1.9 and 5.6 kcal/mol, depending on whether water is present or absent. Thus, if we assume that the true value of $\epsilon_{HB}$ is between 3 and 6 kcal/mol, $T_u$ turns out to be between 1500 K and 3000 K.

The effect of the solvent is important. In our previous paper[126] we developed a new algorithm for simulating the effect of the solvent, which was based on a coupling between the DMD and the Langevin equation. We use the same approach in this paper in order to implicitly account for the effect of the solvent. It would, of course, be more realistic[139] to explicitly consider the solvent molecules. But, given that we utilize the DMD to simulate the behavior of the system, our goal is to gain a qualitative, rather than quantitative, understanding of the effect of the geometry of confined media on folding and stability of proteins. Thus, the implicit account of the solvent effect should suffice for our purpose.

To analyze the data we use a modification of the weighted histogram analysis method[140] (WHAM). The modification was proposed by us[126] in order to calculate the density of states (DOS) $N$ in nanopores. The individual entropies $S$ are computed based on the DOS by taking the average of $N$ over all the states in any given energy basin that has nonzero DOS. Then, $S = k_B \ln\langle N \rangle$. This makes it possible to compute the thermodynamic quantities of interest. Beginning from different initial states of the proteins (folded versus unfolded), a large number of DMD runs were made in order to explore all regions of the free energy surface of the proteins. To ensure that the simulation time $t_s$ is long enough for sampling various regions of the proteins' free energy surface, and that the results are independent of their initial states, the averages of the potential energy and of $n_\alpha$, the number of α-helical HBs, computed over the time period $t_s/2 < t \le t_s$ were compared.
The results computed at temperatures that resulted in unequal average values were discarded from all the entropy calculations. Multiple temperatures were simulated for each protein length $\ell$ in order to capture the protein dynamics at temperatures that are sufficiently above and below the folding temperature $T_{f,b}$ in the bulk.

Since the change in the folding temperature of the proteins in the pores relative to their bulk values can be relatively small, computing the possible errors must be done carefully. We carried out multiple series of simulations for each pore size and in the bulk, some of which began from the folded states, while others commenced from the unfolded ones. The results for all the series were then divided into several separate sets and the (modified) WHAM was used to compute the folding temperature for each of the sets. The results were then used to compute the average and standard deviations of $T_f$ and the estimated errors.

2.4 Results and Discussion

We have carried out extensive DMD simulations of folding of proteins of various lengths in the three types of confined media described earlier. In what follows we present and discuss the results, and compare the three cases.

A. Slit pore

Javidpour and Sahimi already reported[127] extensive results for the problem in slit pores. In this section we briefly describe the key features of protein folding and stability in such pores, so as to have a basis for comparison with cylindrical pores and spherical cavities. Figure 1 presents samples of folded and misfolded structures of the proteins in a slit pore of size $D = 1.5$ nm at temperature $T = 0.125$. The walls are repulsive. They show a protein structure that is almost folded [Fig. 1(a)]; a misfolded structure [Fig. 1(b)]; a misfolded protein with a β-strandlike structure [Fig. 1(c)], and one with a strand structure that is partially folded [Fig. 1(d)]. Thus, a variety of structures can form in small pores.
FIGURE 2.1: Folded and misfolded proteins in slit pores of size $D = 1.5$ nm. The walls are repulsive and $T = 0.125$. (a) A protein that is nearly folded. (b) Misfolded protein. (c) Misfolded protein with a β-strandlike structure. (d) Similar to (c) but with strand partially folded.

Next, we define the smallest pore size $D_0$ for folding of the proteins of different length $\ell$ as the size in which, at very low $T$ and in the presence of purely repulsive walls, the average number of α-helical HBs is at least 99 percent of the maximum possible value of $(\ell - 4)$. To determine the minimum pore size, a protein of length $\ell = 30$ was simulated in very small pores, beginning from its folded state at $T = 0.08$, well below $T_{f,b}$, the folding temperature under bulk conditions. We then computed $n_\alpha$, the average number of α-helical HBs, after 100 ns of simulation. For a pore size of $D = 1.35$ nm, $n_\alpha$ is equal to its value in the bulk, 26. The value $D_0 = 1.35$ nm is much smaller than $2R_g$, where $R_g \approx 2.62$ nm is the radius of gyration of the protein. However, although the α-helix of length $\ell = 30$ is in the form of a long rodlike molecule, it can still lie in the pore, hence explaining why its native structure is not squeezed. Even entropic destabilization of the dangling ends, which loosens their α-helical HBs, may stabilize the native state of the protein.

Most of the following discussions are completely general and, as we show below, hold for cylindrical and spherical geometries as well. Figure 2 presents the dependence of

$$
\Delta T_f = \frac{T_f - T_{f,b}}{T_{f,b}}
\tag{2.6}
$$

on the size difference $D - D_0$ of the slit pore. $T_{f,b}$ was estimated to be $T_{f,b} \approx 0.1198$, 0.1364, 0.1412, and 0.1434 for proteins of lengths $\ell = 9$, 16, 23, and 30. The walls are repulsive. Note that the maximum value of $T_f - T_{f,b}$ is $0.0069 \pm 0.0012$, which the protein of length $\ell = 30$ and a pore of size $D = 1.40$ nm produce.
Given the aforementioned range of values of the basic temperature unit $T_u$, the increase in dimensionless $T_f$ corresponds to physical values of about 10−21 °C. The experimental value for, for example, α-lactalbumin confined in sol-gel matrices is 25−30 °C, reasonably close to what we have computed, especially given the simplicity of the interaction potentials in the DMD simulations.

FIGURE 2.2: $\Delta T_f$, as defined by Eq. (2.6), versus the size difference $D - D_0$ for proteins of length $\ell$ in slit pores. The walls are repulsive, $\epsilon_w = 0$.

The important and surprising feature of Fig. 2 is that it indicates that the optimal pore size for stabilizing the proteins in the slit pore is larger than $D_0$, and that in very small slit pores the protein is destabilized. Destabilization can, of course, be the result of the folded state itself, but it should be pointed out that such an effect would be stronger for larger proteins with larger native states, implying that one should expect larger proteins to be destabilized in larger pores. But, as Fig. 2 indicates, the smallest proteins are destabilized in larger pores and, accordingly, the unexpected occurrence of a maximum $T_f$ of the smaller proteins in larger pores. Note also that the difference between the folding temperature in the pore and $T_{f,b}$ is approximately 5%, which one might argue is not significant enough, considering the error margins of the simulations. A difference of about 10% or smaller is, however, common in typical experimental and computational data for protein folding. For example, the aforementioned 25−30 °C enhancement of $T_f$ for α-lactalbumin is less than 10%. Similarly, in the computations of Takagi, Koga, and Takada[124] and of Klimov, Newfield, and Thirumalai in highly confined media, the folding temperature $T_f$ was reported to be $T_f \approx 1.1\,T_{f,b}$, a 10% difference.
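The conversion from the dimensionless temperature shift to physical units quoted above follows from $k_B T_u = \epsilon_{HB}$. The Python sketch below reproduces that arithmetic for the range of $\epsilon_{HB}$ between 3 and 6 kcal/mol assumed in the text; the function names are hypothetical, and the molar gas constant is used in place of $k_B$ because $\epsilon_{HB}$ is given per mole.

```python
R_GAS = 1.987204e-3  # molar gas constant in kcal/(mol K)


def temperature_unit(eps_hb_kcal_per_mol: float) -> float:
    """Temperature unit T_u in kelvin, defined by k_B * T_u = eps_HB."""
    return eps_hb_kcal_per_mol / R_GAS


def to_kelvin(delta_t_reduced: float, eps_hb_kcal_per_mol: float) -> float:
    """Convert a dimensionless temperature difference to kelvin."""
    return delta_t_reduced * temperature_unit(eps_hb_kcal_per_mol)


if __name__ == "__main__":
    shift = 0.0069  # maximum T_f - T_f,b in reduced units, from the text
    low = to_kelvin(shift, 3.0)
    high = to_kelvin(shift, 6.0)
    print(f"T_u: {temperature_unit(3.0):.0f} K to {temperature_unit(6.0):.0f} K")
    print(f"shift: {low:.1f} K to {high:.1f} K")
```

With these inputs the shift comes out at roughly 10 K to 21 K, in line with the range stated in the text.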
We already described the method by which we computed the possible computational errors, which we also used for estimating the possible errors of $\Delta T_f$. Figure 2 indicates that the error bars are reliable, at least for peak values and pore size $D = 1.28$ nm. This implies that by decreasing the pore size $D$ the folding temperature $T_f$ attains a maximum and then falls off.

Figure 3(a) presents $\Delta T_f$ for a protein of length $\ell = 9$ in the slit pore with attractive walls. Increasing $\epsilon_w$, i.e., making the walls more strongly attractive, indicates that for small pore sizes $D$ the protein is more stable, which is in contrast with the case in which the walls are repulsive. The general behavior changes, however, for larger pores, namely, the peptide becomes less stable for more strongly attractive walls when compared with the repulsive walls. Note also that for small $\epsilon_w$ it is difficult to discern the behavior and, thus, Figs. 3(b) and 3(c) present the data for $\epsilon_w = 0$ and 0.125 for proteins of length $\ell = 16$ and 23.

FIGURE 2.3: Dependence of $\Delta T_f$, as defined by Eq. (2.6), on the size difference $D - D_0$ in a slit pore for proteins of length (a) $\ell = 9$; (b) $\ell = 16$, and (c) $\ell = 23$. $\epsilon_w$ is the strength of the wall potential.

FIGURE 2.4: Dependence of $\Delta T_f$, as defined by Eq. (2.6), on the strength $\epsilon_w$ of the wall potential in a slit pore. The protein's length is $\ell = 9$.

Figure 4 presents the dependence of $\Delta T_f$ on the pore size and the strength $\epsilon_w$ of the attractive interaction between a protein of length $\ell = 9$ and the walls, indicating that increasing $\epsilon_w$ in small pores stabilizes the protein, whereas increasing $\epsilon_w$ in a medium-size pore first stabilizes the peptide and then destabilizes it. On the other hand, increasing $\epsilon_w$ in the largest pore completely destabilizes the peptide. Thus, overall, one has a complex picture of the interplay between the strength $\epsilon_w$ of the wall interaction and the pore size. Similar to the results shown in Fig.
3, if $\epsilon_w$ is large enough, the protein will fold in all the pores of various sizes. As we show below, similar behavior is observed in cylindrical pores and spherical cavities as well.

Another feature of Fig. 4 is that for small $\epsilon_w$, which corresponds to weakly attractive slit pores, the peptide can be even more stable than when the walls are repulsive ($\epsilon_w = 0$), but increasing $\epsilon_w$, i.e., making the walls more strongly attractive, causes the proteins to be less stable than when $\epsilon_w = 0$. In fact, for large enough $\epsilon_w$ (e.g., $\epsilon_w = 0.25$) the protein is completely adsorbed onto the pore's walls, hence disturbing folding noticeably. The reason for the instability is that the interaction with the pore walls becomes important and competes with the potential energy of the folded state and, hence, destabilizes the protein, even for large values of $\epsilon_w$, because when the protein is completely adsorbed onto the pore walls, it cannot fold. But, for a weaker attractive potential, the protein becomes more stable because the walls, while not attractive enough to adsorb the proteins, bring different parts of the protein closer to each other, so that it becomes roughly two-dimensional rather than retaining its initial three-dimensional structure. As a consequence, the amino acids can form helical HBs more easily, making the protein more stable. The same behavior of $T_f$ versus the strength of the wall's attractive potential has been reported elsewhere.[141]

The discussion can be made quantitative by considering the energetics. The potential energy of the folded state due to the HBs is $\epsilon(\ell - 4)$ (with energies measured in units of $\epsilon$), whereas if the protein is adsorbed completely onto the pores' walls, it is nearly equal to $4\ell\epsilon_w$. The latter estimate is based on the fact that each amino acid has 4 united atoms (UAs) and, thus, we have a total of $4\ell$ UAs, each of which lowers the potential energy by one unit of $\epsilon_w$.
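As a rough numerical illustration of the two energy estimates above, the sketch below equates the hydrogen-bond energy of the folded state, proportional to $\ell - 4$, with the adsorption energy, $4\ell\epsilon_w$, and solves for the wall strength at which the two balance. It assumes that all energies are measured in units of the hydrogen-bond energy; the function name `adsorption_threshold` is hypothetical, and the numbers are back-of-the-envelope estimates rather than simulation results.

```python
def adsorption_threshold(length: int, atoms_per_residue: int = 4) -> float:
    """Wall strength at which adsorption energy equals the folded-state HB energy.

    Folded state:   (length - 4) hydrogen bonds, one energy unit each.
    Adsorbed state: atoms_per_residue * length united atoms, eps_w each.
    Both counts follow the estimates in the text; units are an assumption.
    """
    if length <= 4:
        raise ValueError("the estimate needs more than 4 residues")
    return (length - 4) / (atoms_per_residue * length)


if __name__ == "__main__":
    for length in (9, 16, 23, 30):
        print(f"l = {length:2d}: eps_max ~ {adsorption_threshold(length):.4f}")
```

For long chains the estimate approaches 1/4 from below, and it is noticeably smaller for short chains, because $\ell - 4$ is then appreciably smaller than $\ell$. The simulations, not this estimate, determine the actual threshold.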
If $\ell$ is large enough, the maximum value of $\epsilon_w$ for stability of the folded state, in contrast to the adsorbed state, is obtained roughly from the relation $\epsilon\ell = 4\ell\epsilon_{\max}$, with $\epsilon_{\max}$ being the maximum value of $\epsilon_w$, hence yielding $\epsilon_{\max} \approx 0.25$. Indeed, Fig. 4 (and the results for other geometries presented below) indicates that for $\epsilon_w = 0.25$ the protein is completely adsorbed onto the pores' walls. The above discussion concerns proteins with large lengths $\ell$. If, however, $\ell$ is comparable to 4, approximating $(\ell - 4)$ with $\ell$ causes large errors. Thus, for small $\ell$ the potential energy becomes weaker and, hence, even for $\epsilon_w < 0.25$ the protein may adsorb onto the pore walls. In fact, in slit pores, as well as in the cylindrical ones described below, for $\ell = 9$ and $\epsilon_w = 0.1875$ the protein cannot fold even at low $T$, whereas for $\ell = 16$ the protein can still fold to its helical native state, even with $\epsilon_w = 0.21875$.

FIGURE 2.5: Dependence of the free energy on $n$, the average number of $\alpha$-helical hydrogen bonds, for a protein of length $\ell = 16$ at $T = T_{f,b}$ in a slit pore of size (a) $D = 1.28$ nm, and (b) $D = 1.6$ nm.

Figure 5(a) presents the free energy profiles under the bulk condition and in the slit nanopores with size $D = 1.28$ nm, at the bulk folding temperature $T_{f,b}$. We have shifted the free energy value of the unfolded states to 0, so that we can better compare the stability of the folded structures for various cases. When the walls are repulsive, i.e., when $\epsilon_w = 0$, the folded state is less stable than under the bulk condition, but the same protein in the same pore with attractive walls and $\epsilon_w = 0.125$ is more stable. Figure 5(b) presents the free energy for a pore of size $D = 1.6$ nm, which indicates that the protein is more stable than under the bulk condition when the walls are repulsive, but is less stable than under the bulk condition when the walls are attractive with $\epsilon_w = 0.125$. These are all consistent with Fig. 3.
Note that, due to confinement, the free energy profile in the nanopores is rougher, and especially so in those with attractive walls.

B. Cylindrical pore

As discussed earlier, the minimum pore size is determined from the dependence of $n$, the average number of helical HBs, on the diameter $D$ of the pore and is taken to be the size for which $n$ is a constant. Figure 6 presents the dependence of $n$ on the size $D$ of the pore. The minimum diameters $D_0$ of a cylindrical pore for folding of the proteins with lengths $\ell = 9$, 16, 23 and 30 are, respectively, 1.34, 1.32, 1.32 and 1.32 nm. These are slightly larger than the corresponding minimum slit pore size for stability of the folded state of the same proteins. The reason is that in a cylindrical pore the $\alpha$-helix is confined from all directions around its cross section, whereas in slit pores it is confined only in one direction, but can expand slightly in the other two.

FIGURE 2.6: Determination of the minimum size $D_0$ for cylindrical pores: dependence of $n$, the average number of $\alpha$-helical hydrogen bonds, on the size $D$ of the pore, at low temperature. $D_0$ is that value of $D$ for which $n$ has achieved a constant value.

Figure 7 presents the dependence on temperature of $n$ that we computed directly from the DMD simulations. The length of the protein was $\ell = 23$, and the results were computed for several pore sizes. Also shown are the results for the bulk conditions. As pointed out earlier, current theories stipulate that proteins should stabilize in smaller pores. That is, the folding temperature $T_f$ should increase with decreasing pore size, which would increase $n$ at all temperatures $T$. The reason is that $T_f$ represents the temperature at which a transition occurs between high and low values of $n$ (arrows in Figure 7 indicate the $T_f$ values). But, Fig. 7 indicates that this is not the case for the $\alpha$-helices that we have simulated in very small pores.
In contrast to the bulk conditions, the protein is destabilized in small pores, but then is stabilized in larger pores, which is completely similar to the behavior in the slit pore described earlier and in our previous paper.[127]

Figure 8 presents the dependence of $T_f$ on the size difference $D - D_0$ for the four proteins that we simulated. The trends are similar to those for the slit pores shown in Fig. 2, namely, that the smallest proteins achieve maximum $T_f$ in larger pores. This is in complete contradiction with the "expected" behavior reported by Zhou[122] and by Thirumalai, Klimov, and Lorimer,[124] and is a result of the higher entropy of the misfolded structures, including $\beta$-like structures, in small pores.

FIGURE 2.7: Temperature-dependence of $n$, the average number of $\alpha$-helical hydrogen bonds, in cylindrical pores of size $D$, as well as under the bulk conditions. Arrows indicate the location of the folding temperature. The pore's wall is repulsive, $\epsilon_w = 0$, and the protein's length is $\ell = 16$.

To better visualize the structure of proteins in a cylindrical pore, we present in Fig. 9 examples of their configurations. This is important because different pore geometries give rise to distinct kinds of unfolded and misfolded states. In cylindrical pores, for example, stretched-out straight shapes with no bending, as well as c- or u-like configurations, are frequent. For such configurations, non-$\alpha$ HBs between different parts of a protein, which come close to each other as dictated by geometry, may occur. Such misfolded states may become stabilized in the pores and, because of their entropy, in addition to their rather low potential energy, compete with the correctly folded state that also has lower potential energy and smaller entropy.

Figure 10 presents the dependence of $T_f$ on the wall potential $\epsilon_w$ and the size difference $D - D_0$ for a protein of length $\ell = 9$. Similar to Fig.
3(a) for a slit pore, when $\epsilon_w$ is increased from 0 (a pore with repulsive walls) to 0.125 (a pore with strongly attractive walls), the general behavior is that for small sizes $D$ the protein is more stable, and that the behavior reverses itself in larger pores, namely, that it will be less stable in pores with more strongly attractive walls.

FIGURE 2.8: $T_f$, as defined by Eq. (6), versus the size difference $D - D_0$ for proteins of length $\ell$ in cylindrical pores. The walls are repulsive, $\epsilon_w = 0$.

Figure 11 presents the dependence of $T_f$ on the strength $\epsilon_w$ of the wall potential $U_w$ for proteins of length $\ell = 9$ and $\ell = 16$. As the figure indicates, in small cylindrical pores increasing $\epsilon_w$ stabilizes the protein, whereas the same protein in a larger pore is destabilized by increasing $\epsilon_w$, which is in agreement with Fig. 4 for the slit nanopores.

Figure 12 presents the dependence of the free energy profiles on the strength $\epsilon_w$ of the wall potential, and contrasts it with that under the bulk conditions. We find that if, similar to Fig. 5, we compute the profiles at the same $T$ in both the pore and under bulk conditions, the implications for the stability of the protein relative to the bulk would be similar to Fig. 5, and are also consistent with Fig. 11(b). Thus, we computed the profiles in Fig. 12 at the corresponding $T_f$ of each case in order to better compare the roughness of the free energy. Figure 12 indicates that the free energy profile in the pores becomes rougher with increasing $\epsilon_w$, i.e., as the walls become more strongly attractive, for the same reason that the profiles shown in Fig. 5 become rougher, namely, stronger attractive interaction between the proteins and the walls competes with the internal interactions of the protein molecules.

FIGURE 2.9: Folded and misfolded proteins in a cylindrical pore of size $D = 1.5$ nm with repulsive wall, $\epsilon_w = 0$, at temperature $T = 0.125$. (a) Partially folded protein; (b) c-like, and (c) u-like misfolded structures. Due to the geometry of cylindrical pores, c- and u-like misfolded structures are more likely to form. (d) Similar to (b) and (c) and showing another misfolded state.

C. Spherical cavity

Footnote 1 (to the heading above): The results of this section were obtained in a joint work with Nariman Piroozan, and will be published as a joint paper.

Figure 13 presents examples of unfolding proteins in a spherical cavity of its minimum size $D_0$ (see below). The temperature varies between 0.105 and 0.15 and, once again, one obtains a wide variety of configurations in the same spherical cavity and for the same protein length $\ell = 16$. There is a qualitative difference between a spherical cavity and slit and cylindrical pores. To understand the difference we estimated the minimum diameter $D_0$ for various protein lengths $\ell$.

Figure 14 presents sample results for the average number $n$ of the $\alpha$-helical HBs for proteins of lengths $\ell = 9$ and 16 at temperature $T = 0.08$. As before, $D_0$ is the cavity size at which $n$ becomes more or less independent of the size. We estimated that in a spherical cavity $D_0 = 1.91$, 2.85, 3.91, and 4.93 nm for proteins of length $\ell = 9$, 16, 23 and 30, respectively. These are quite different from the values of $D_0$ for the slit and cylindrical pores, because in a spherical cavity the protein is squeezed symmetrically from all orientations. A spherical cavity should be large enough to hold the $\alpha$-helix protein of length $\ell$ in its roughly cylindrical form. The size of such a sphere is roughly proportional to $\ell$. Thus, in contrast to the slit and cylindrical pores, the minimum size $D_0$ of a spherical cavity depends on $\ell$, but is independent of the size of the cross section of $\alpha$-helices.

FIGURE 2.10: Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a cylindrical pore for proteins of length $\ell = 9$.

Figure 15 presents the temperature-dependence of $n$ for a protein of length $\ell = 16$ in spherical cavities of various sizes with repulsive wall.
Once again, in contrast to the bulk state, the protein is destabilized in small cavities, but then it is stabilized again in larger ones, completely similar to the behavior of the same protein in slit and cylindrical nanopores.

Figure 16 depicts the dependence of $T_f$ on the diameter size difference $D - D_0$ for the four proteins that we have simulated. The wall is repulsive. Interestingly, Fig. 16 indicates a different behavior in spherical cavities than what was described for the slit and cylindrical pores. The diameter $D$ (and, hence, the difference $D - D_0$) for maximum $T_f$ increases with protein length $\ell$ (compare, for example, $\ell = 9$ and 16), which is in accordance with the expected behavior.[121] The maximum $T_f$ exhibits, however, unexpected behavior in that it is larger for smaller proteins. In addition, we can hardly claim that confinement has affected $T_f$ for a protein of length $\ell = 30$, even in the smallest cavity. The reason is that in studies of proteins one usually assumes that the native state is roughly a spherical molecule, the so-called globular proteins. Thus, the size of the smallest spherical cavity is proportional to $\ell^{1/3}$. On the other hand, the volume of the smallest cavity (with diameter $D_0$) is very close to that of the folded structure. In addition, unfolded and misfolded structures have an available volume proportional to $\ell$. But, in the case of $\alpha$-helical proteins the smallest cavity size $D_0$ is proportional to $\ell$. This is because the end-to-end distance of the folded state is proportional to $\ell$ and, therefore, for a cavity with diameter $D_0$ the available volume for the unfolded structures is proportional to $\ell^3$, a large enough volume for larger $\ell$ to eliminate completely the effect of confinement. The results for $\ell = 30$ confirm this.

FIGURE 2.11: Dependence of $T_f$, as defined by Eq. (6), on the strength $\epsilon_w$ of the wall potential in a cylindrical pore of size $D$. The protein's length is (a) $\ell = 9$, and (b) $\ell = 16$.
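The proportionality between $D_0$ and $\ell$ can be checked against the four spherical-cavity values quoted earlier in this section (1.91, 2.85, 3.91, and 4.93 nm for $\ell = 9$, 16, 23, and 30). The following is a minimal sketch of an ordinary least-squares line through those points; the helper name `fit_line` is hypothetical, and the fit is only an illustration, not part of the original analysis.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Minimum spherical-cavity diameters D_0 (nm) quoted in the text.
lengths = [9, 16, 23, 30]
d0_sphere = [1.91, 2.85, 3.91, 4.93]

slope, intercept = fit_line(lengths, d0_sphere)

if __name__ == "__main__":
    print(f"D_0 ~ {slope:.3f} * l + {intercept:.2f} nm")
    # Ratio of minimum-cavity volumes for l = 30 and l = 9 (volume ~ D_0**3).
    print(f"volume ratio (l=30 / l=9): {(d0_sphere[-1] / d0_sphere[0]) ** 3:.1f}")
```

The fitted slope is about 0.14 nm per residue, comparable to the roughly 0.15 nm rise per residue of an ideal $\alpha$-helix, which is what one would expect if $D_0$ is set by the end-to-end length of the folded helix.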
Another important difference between spherical cavities, on the one hand, and slit and cylindrical pores, on the other hand, that Fig. 16 indicates is that proteins of lengths $\ell = 16$ and 23 are again destabilized in small cavities. This is unexpected because the value of the minimum diameter $D_0$ of spherical cavities is proportional to $\ell$, which is much larger than the corresponding diameters of the slit and cylindrical pores, for which $D_0 = 1.28$ and 1.32 nm, respectively, whereas for the spherical cavities and proteins of lengths $\ell = 16$ and 23 one has $D_0 = 2.85$ and 3.91 nm, correspondingly. This implies that the destabilization in very small cavities may be a consequence of entropic destabilization of the folded state, which can hardly move, if at all, in the radial direction in small cavities, whereas the unfolded and misfolded states are much less destabilized entropically because of their freedom in taking on different structures and making maximum use of their freedom of movement in small pores.

FIGURE 2.12: Dependence of the free energy on $n$, the average number of $\alpha$-helical hydrogen bonds, for a protein of length $\ell = 16$ at $T = T_{f,b}$ in a cylindrical pore of size $D_0 = 2.7$ nm, and various strengths of the wall potential, $\epsilon_w$.

Figure 17 shows the dependence of the quantity $T_f$ for a protein of size $\ell = 9$ on the size difference $D - D_0$. Similar to Fig. 3(a) for the slit pores and Fig. 10 for the cylindrical pores, when $\epsilon_w$ is increased from 0 to a positive value, which is representative of a cavity with a more strongly attractive wall, the general behavior is that for small $D$ (hence small $D - D_0$) the protein is more stable, but the behavior is reversed in larger cavities with the same type of wall, as the protein is less stable.

FIGURE 2.13: Unfolding of a de novo protein of $\ell = 16$ in a spherical cavity of size $D_0 = 2.85$ nm at temperatures (a) $T = 0.105$; (b) 0.11; (c) 0.12; (d) 0.13; (e) 0.14, and (f) 0.15.
Figure 18 presents the dependence of $T_f$ on the strength $\epsilon_w$ of the wall potential in the spherical cavities whose sizes correspond to the smallest cavity size $D_0$ for proteins of length $\ell = 16$. Increasing $\epsilon_w$ in the smaller cavity first stabilizes and then destabilizes the protein, whereas in the larger cavity one has only destabilization by increasing $\epsilon_w$. These features are in agreement with those for the slit and cylindrical pores.

D. Comparison of the three geometries

Given the results described and discussed so far, it is instructive to make a direct comparison between the three geometries. In Fig. 19(a) we compare the dependence of $T_f$ on the size difference $D - D_0$ for a protein of length $\ell = 9$. The comparison indicates that the protein is more stable in the spherical cavity than in the cylindrical pore, followed by the slit pores. This is, of course, due to the fact that spherical confinement acts in three dimensions, whereas confinement in a cylindrical pore is restricted to two directions, and a slit pore represents an essentially 1D confined medium. Thus, entropic destabilization of the unfolded and misfolded proteins in the available volume should be more pronounced in the spherical cavity than in the other two geometries.

FIGURE 2.14: Determination of the minimum size $D_0$ for spherical cavities: dependence of $n$, the average number of $\alpha$-helical hydrogen bonds, on the size $D$ of the cavity, at the low temperature of $T = 0.08$. $D_0$ is that value of $D$ for which $n$ has achieved a constant value. The protein's length is (a) $\ell = 9$ and (b) $\ell = 16$.

But, in the case of a protein of length $\ell = 16$, it is the cylindrical pore that has the most stabilizing geometry, followed by the slit pore and then the spherical cavity; see Fig. 19(b). The reason, as we already described when analyzing Fig.
16, is that for larger proteins the minimum size $D_0$ of the three-dimensional spherical cavity is proportional to $\ell$, but the volume available to the unfolded and misfolded proteins is proportional to $\ell^3$. Therefore, the proteins' entropy decreases by smaller amounts for larger $\ell$. In addition, in contrast to the protein of length $\ell = 16$, the folded protein of length $\ell = 9$ is more accurately approximated as a spherical molecule, which explains why it is better stabilized in a spherical cavity than in the other two geometries. But, a protein of length $\ell = 16$ is roughly cylindrical, rather than spherical, hence explaining why it is better stabilized in the cylindrical pore. Thus, both the shape and size of proteins, as well as those of the confined media, are important to their stabilization.

Note also that in both cases the cylindrical pores exhibit a more pronounced effect of stabilization or destabilization of the proteins, below and above a critical $D$, which is in contrast to the slit nanopore. As pointed out earlier, one reason for this may be that the geometry of an $\alpha$-helix is similar to the inner space of a cylindrical pore and, hence, such a protein is much more stable. On the other hand, the unfolded states in a cylindrical pore (Fig. 9) may take on a u-like shape, and not necessarily a hairpin. In contrast to the slit pore, it is very difficult for such a protein to unfold, starting from its misfolded state, and then fold again.

FIGURE 2.15: Temperature-dependence of $n$, the average number of $\alpha$-helical hydrogen bonds, in spherical cavities of size $D$, as well as under the bulk conditions. Arrows indicate the location of the folding temperature. The cavity's wall is repulsive, $\epsilon_w = 0$, and the protein's length is $\ell = 16$.

Figure 20(a) presents the results for $\ell = 9$ at temperature $T = 0.09$ (around room temperature). All the confined media have a size $D = 2$ nm.
We have shifted the free energies of the completely unfolded states with $n = 0$ to 5, so that we can easily compare the free energy barriers for folding, as well as the free energy difference of the folded and unfolded states. In this way the stability of the protein, the folding rates, and the roughness of the free energy profiles can be compared. Note that for the given $\ell$ and $D$, we have $D - D_0 = 0.09$, 0.66, and 0.72 nm for, respectively, the spherical cavity, the cylindrical geometry, and the slit pore. We see, consistent with Fig. 17(a), that the folded state is most stable (it has lower free energy compared with that of the unfolded states) in the spherical cavity, followed by the cylindrical and slit pores and, finally, the bulk condition. Note that, as Fig. 19(a) indicates, $T_f$ in the slit pore with $D = 2$ nm is very close to $T_{f,b}$, the bulk folding temperature, and Fig. 20(a) shows also that the free energy differences of the folded and unfolded states are not much different, and are actually negligible in the slit pore and under the bulk condition.

FIGURE 2.16: Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a spherical cavity for proteins of length $\ell$. The wall is repulsive.

Figure 20(b) shows the results for $\ell = 16$ and $D = 2.9$ nm in the spherical cavity (for which $D - D_0 = 0.05$ nm) and the cylindrical pore (for which $D - D_0 = 1.58$ nm), as well as under bulk conditions. Consistent with Fig. 19(b), Fig. 20(b) indicates that, relative to the bulk conditions, the protein in the spherical cavity is the least stable, and is most stable in the cylindrical pore. To see this, consider the difference between the free energy of the folded state with $n = 12$ and that of the completely unfolded states with $n = 0$ at a fixed $T$, which provides a measure of the stability of the protein.

FIGURE 2.17: Dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of a spherical cavity for a protein of length $\ell = 9$. $\epsilon_w = 0$ represents a cavity with repulsive walls.
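The size differences quoted in this comparison follow directly from the minimum sizes $D_0$ reported for each geometry. A minimal sketch of that bookkeeping is given below; the dictionary layout and the function name `size_difference` are hypothetical, and the slit-pore entry for $\ell = 9$ is an assumption (taken equal to the 1.28 nm slit value quoted for the longer chains).

```python
# Minimum sizes D_0 (nm), keyed by (geometry, chain length).
D0_NM = {
    ("sphere", 9): 1.91,
    ("cylinder", 9): 1.34,
    ("slit", 9): 1.28,      # assumption: same as the slit value quoted for longer chains
    ("sphere", 16): 2.85,
    ("cylinder", 16): 1.32,
}


def size_difference(geometry: str, length: int, size_nm: float) -> float:
    """Return D - D_0 in nm, rounded to two decimals as in the text."""
    return round(size_nm - D0_NM[(geometry, length)], 2)


if __name__ == "__main__":
    for geometry in ("sphere", "cylinder", "slit"):
        print(geometry, 9, size_difference(geometry, 9, 2.0))
    for geometry in ("sphere", "cylinder"):
        print(geometry, 16, size_difference(geometry, 16, 2.9))
```

With these inputs the differences agree with the values quoted in the text for both chain lengths.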
Note that different confining geometries give rise to distinct kinds of unfolded and misfolded states. For example, proteins in cylindrical pores have straight shapes with no bending, and also take on c- or u-like configurations quite frequently. But, in spherical cavities with small $D$, s-like shapes are common. As discussed earlier, in the case of spherical cavities, the diameter $D_0$ of the minimum size is proportional to $\ell$, implying that its total volume is proportional to $\ell^3$, so that for $\ell = 9$ we have the most severe confinement for the unfolded states. In such severe confinement, unfolded states with no HB are very infrequent, even at temperatures as high as $T = 0.15$. Instead, we have approximately a state with one non-native HB formed between parts of the protein that, due to confinement, have come close to each other.

FIGURE 2.18: Dependence of $T_f$, as defined by Eq. (6), on the strength $\epsilon_w$ of the wall potential in a spherical cavity of size $D$. The protein's length is $\ell = 16$.

2.5 Implications for Interpreting Experimental Data

As mentioned briefly in the Introduction, there are several sets of experimental data for the folding rates of proteins that have either remained unexplained, or that can be interpreted by equally plausible, but completely distinct, arguments. Horst et al.[142] reported much smaller folding rates in the chaperonin GroEL/GroES and its mutants, which cannot be predicted by the current theories described throughout this paper. Farr et al.[143] reported the absence of the expected folding rate enhancement for malate dehydrogenase (MDH) in single- and double-ring GroEL and its tail-multiplied variants with smaller cages.
This is in contrast with the folding rates for other proteins of the same size, such as the DM-MBP, which increase in GroEL.[144] But, it must be recognized that a combination of in-vivo experiments, the caging mechanism of GroEL/GroES, and the interaction between the proteins and the inner GroEL cage and other molecules gives rise to such a complex system that it is difficult to claim definitively that it is protein entrapment in the chaperonin cavity that plays the leading role in the phenomenon.

FIGURE 2.19: Comparison of the dependence of $T_f$, as defined by Eq. (6), on the size difference $D - D_0$ of the three confining media for a protein of length (a) $\ell = 9$ and (b) $\ell = 16$. The walls are repulsive.

Further related discussions of this issue were presented by Hoffman et al.[Refer] who, using the single-molecule Förster resonance energy transfer technique, reported that the folding of the C-terminal domain of rhodanese decelerates in the GroEL/GroES cavity, whereas that of its N-terminal domain remains unchanged. They discussed and examined in detail the possible reasons for their data, and concluded that the reason for the phenomenon that they had reported is higher intramolecular diffusivity of the polypeptide. The present theories indicate, however, that the sole effect of confinement should be the same for both the C- and N-terminal domains, which contradicts Hoffman et al.'s conclusion. They also reported[145] a surprising decrease in the enthalpic barrier of the C-terminal domain folding, which was not, however, detected for the N-terminal domain.

FIGURE 2.20: Comparison of the dependence of free energies on $n$, the average number of $\alpha$-helical hydrogen bonds, in the three confining media with size $D = 2$ nm. The profiles in (a) correspond to room temperature, and the protein's length is (a) $\ell = 9$ and (b) $\ell = 16$. The walls are repulsive.
Based on such observations and data, they concluded that it is improbable to have a universal chaperonin mechanism at work.

All such arguments are, however, based on a fundamental assumption, namely, that caging inside a confined medium with purely repulsive walls affects in a similar way the folding rates of proteins with nearly the same size. We propose, however, another possible interpretation of the observed behavior. As our results indicate, due to severe confinement and depending on the shape of the proteins, their transition state structures in confinement, and the misfolded structures, some states with a potential energy between that of the unfolded states and the average potential energy of the transition state may be destabilized. Thus, a new barrier very close to the basin of the unfolded states in the folding pathway may emerge, hence reducing the enthalpic barrier for folding; see Figs. 5, 12 and 20. It is also plausible that some misfolded states or off-pathway intermediates with a non-negligible number of nonnative contacts are entropically stabilized, which causes free energy roughening. But, because of the severity of confinement, such states require much longer times to break the nonnative contacts, and because folding onto the native states requires escaping from such states, the folding rate decreases dramatically. Current theories for the effect of destabilization of the transition state cannot predict such a reduction of the folding rates, which is considerably more than their prediction.
Thus, in view of the "universality" of our results in the sense of being independent of the geometry of the confining media, the effect of stabilized misfolded states and destabilized states between the unfolded and transition states under severe confinement is a plausible scenario for describing the observed lower folding rate and lower enthalpy barrier for folding of the C-terminal domain of rhodanese in GroEL.[145]

Jewett and Shea[117] pointed out that, although computer simulations and theoretical studies assume that the inner cage of the GroEL cavity is hydrophilic and has negligible interaction with the protein atoms,[124] the cavity's interaction with proteins may be more complex. Thus, for the case in which we assume the confining walls to be purely repulsive, the effect of other physical interactions on the observed enthalpic change in the folding barrier of the C-terminal domain cannot be excluded.

The main purpose of studying protein folding/misfolding in confined media is its relevance to the cause of debilitating illnesses, such as Alzheimer's and Parkinson's diseases. Experimental data have been reported that indicate that the nucleation step of the misfolding of amyloidogenic proteins, such as human prion protein, may be significantly accelerated in very crowded environments, modeled by a confined medium. Although in some experiments the crowders may have strong interactions, such as electrostatic or hydrophobic, with proteins, in the aforementioned experiments inert crowders that did not interact considerably with proteins were used. Thus, their effect on proteins' stability may be considered as solely entropic, hence providing experimental evidence for entropic effects of crowding on protein misfolding and destabilization. As discussed earlier in this paper, our results provide evidence for possible entropic stabilization of the structures in the voids between the crowders.
We believe that this can be very important to understanding the fundamentals of the nucleation step of amyloid formation in Alzheimer's and Parkinson's diseases. Proteins that have known misfolded states, such as the amyloidogenic proteins in such diseases, appear to be remarkably sensitive to confinement, as well as to their size and geometrical shape. This, as we discussed earlier, is because stabilization of their misfolded states can significantly affect their folding dynamics and folding temperature.

Finally, in some cases the probability of forming misfolded structures can be higher, resulting in higher concentrations of such structures. In that case, because the surface of the misfolded structures usually has exposed hydrophobic parts that can cause their aggregation due to hydrophobic attraction, the formation of oligomers and fibrils in such diseases will also be more likely.[133]

2.6 Summary

Using extensive discontinuous molecular dynamics simulations, we studied the effect of the geometry of confined media on protein stability and folding. Our study was motivated by the results that Javidpour and Sahimi had previously reported,[127] namely, that contrary to the current theoretical understanding, the maximum folding temperature occurs in larger pores for smaller $\alpha$-helices. Moreover, in very tight pores the free energy surface becomes rough, and a new barrier for protein folding emerges close to the unfolded state. In contrast with the unbounded domains, protein states in small pores that contain the $\beta$ structures are entropically stabilized, implying that folding rates decrease notably and the free energy surface becomes rougher.
In view of the possible significance of the results to the interpretation of many recent experimental observations that could not be explained by the current theories, and the importance of entropic effects on proteins' misfolded states in highly confined environments, we addressed in the present paper the following question: To what extent does the geometry of a confined medium affect the stability and folding rates of proteins? We used three distinct geometries in the simulations, namely, a spherical cavity and slit and cylindrical pores, and also studied the effect of the strength and nature (repulsive as opposed to attractive) of the interactions between the confined media's walls and the proteins. We found that the general qualitative trends of the results that we had reported previously for the slit pores remain the same in cylindrical pores and spherical cavities. Moreover, we found that with purely repulsive walls, the smallest pore size in which the proteins can fold is essentially the same for cylindrical and slit nanopores, and is nearly independent of $\ell$, the length of the proteins. In contrast, in spherical cavities with the same type of walls, the smallest size for folding depends on the end-to-end distance of the proteins in their native state, namely, $\alpha$-helices.

Most importantly, we find that the maximum folding temperature $T_f$ occurs in larger media for larger proteins, which is what the previous studies had claimed, but only in the spherical geometry, whereas the opposite is true in the two other geometries that we studied. Thus, the results indicate the great importance of the geometry of a confining medium, as well as that of the protein itself, to the stability of the protein.

On the other hand, in slit pores with strongly attractive walls the proteins are adsorbed onto the walls and, thus, one cannot define a folding temperature $T_f$ for the proteins.
But, if the interaction potential $\epsilon_w$ between the proteins and the walls is only weakly or moderately attractive, the proteins exhibit a complex behavior that depends on the size of the pore. If the pore is very small, the proteins are destabilized. If $\epsilon_w$ increases to a moderate value, but smaller than the strength of the potential that causes complete adsorption of the proteins onto the walls, the difference between the folding temperatures in the pore and under the bulk conditions first increases, and then for large $\epsilon_w$ falls again. Similar behavior emerges in cylindrical pores and spherical cavities.

Chapter 3

Computer-Aided Discovery of Protein Inhibitors

3.1 Introduction

Over the past several decades, computer-aided discovery and design (CADD) methods have played a crucial role in the development of small therapeutic molecules. Compared with traditional high-throughput screening and combinatorial chemistry, CADD methods achieve a much higher "hit" rate for identifying potential novel drug compounds, because they use a much more targeted search. They aim not only to explain the molecular basis of therapeutic activity, but also to predict possible derivatives that would improve the drugs' activity.

CADD methods are usually used for three main purposes. The first is to narrow down large compound sets into smaller sets of potentially active compounds that can be tested experimentally. The second is to optimize the leading candidate compounds, either by increasing their affinity, or by optimizing their pharmacokinetic (PK) properties, including absorption, distribution, excretion and so on. The third is to design novel compounds, either by "growing" some initial molecules one functional group at a time, or by combining fragments into novel chemical types.

CADD methods may be classified into two general categories, namely, structure-based and ligand-based techniques.
Structure-based CADD methods are generally preferred where high-resolution structural data of the target protein are available, which is typically the case for soluble proteins that are readily crystallized. Ligand-based CADD methods are utilized when no, or very little, structural information is available, which is often the case for membrane protein targets.

3.1.1 Structure-Based CADD Methods

Structure-based CADD (SB-CADD) methods rely on the analysis of 3D structures of biological molecules. Their core hypothesis is that by favorably interacting with a particular binding site on a protein, small molecules are able to exert biological effects on the proteins. Molecules that share such favorable interactions will exert similar biological effects. Therefore, novel compounds can be identified through careful analysis of a protein’s binding site. The 3D structural information about the target is a prerequisite for any SB-CADD project.

Scientists have been using a target protein’s structure to aid in drug discovery since the early 1980s. Since then, SB-CADD methods have become commonly used drug discovery techniques. Biophysical techniques, such as X-ray crystallography and NMR spectroscopy, have led to the elucidation of a number of 3D structures of human and pathogenic proteins. For example, the Protein Data Bank (PDB) has over 81,000 protein structures, whereas databases such as PDBBind [146] and the protein-ligand database house (as of 2014), respectively, basic information on about 13,000 complex structures formed between proteins and small-molecule ligands, proteins and proteins, proteins and nucleic acids, and nucleic acids and small-molecule ligands, and binding affinity data and structural information for a total of 12,995 biomolecular complexes, including protein-ligand (10,656), nucleic acid-ligand (87), protein-nucleic acid (660), and protein-protein complexes (1,592), which is the largest collection of this kind so far.
Drug discovery campaigns that leverage target structures and information have sped up the discovery process, and have led to the development of several clinical drugs.

A prerequisite for the drug-discovery process is the ability to rapidly determine potential binders to the target of biological interest. Computational methods in drug discovery allow rapid screening of a large compound library, as well as determination of potential binders through modeling or simulation and visualization techniques.

3.1.2 Ligand-Based CADD Methods

The ligand-based CADD (LB-CADD) approaches are based on ligands that have already been proven to be effective for a target of interest. Such methods analyze the structures of collected compounds known to interact with the target of interest. The overall goal is to represent such compounds in such a way that the physical and chemical properties most important for their desired interactions are retained, whereas other types of information not relevant to the interactions are discarded. LB-CADD approaches may be viewed as indirect methods of drug discovery, since knowledge of the structure of the target is not required.

The two fundamental classes of LB-CADD methods are the selection of compounds that are chemically similar to known actives, using some similarity measure, and the construction of a quantitative structure-activity relationship (QSAR) model that predicts biological activity from the chemical structure. The difference between the two classes is that the latter weighs the features of the chemical structure according to their influence on the biological activity of interest, whereas the former does not. LB-CADD methods are based on the Similar Property Principle, first proposed by Johnson et al. [147], which states that molecules that are structurally similar are likely to have similar properties.
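As a minimal sketch of the similarity-based selection described above, the following ranks a small library against one known active. It assumes the Tanimoto coefficient on binary 2D fingerprints as the similarity measure, which the text does not specify; the fingerprints, compound names, and function names are hypothetical.

```python
from __future__ import annotations


def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query: set[int], library: dict[str, set[int]]) -> list[tuple[str, float]]:
    """Return (name, similarity) pairs, most similar to the query first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: each set lists the bits that are set in a fingerprint.
known_active = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9, 12},
}
ranking = rank_by_similarity(known_active, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that this kind of search treats every fingerprint bit equally, whereas a QSAR model would weigh the features by their influence on activity.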
In contrast to the SB-CADD approaches, one advantage of LB-CADD methods is that they can also be applied when the structure of the biological target is unknown. Moreover, active compounds identified by ligand-based virtual high-throughput screening (LB-vHTS) methods are often more potent than those identified by the SB-vHTS [148].

3.2 Accuracy of CADD

The fact that many computational tools are currently being used in drug discovery suggests that there are actually no fundamentally superior techniques. Their performance varies greatly with the target protein, the available data, and the resources. For example, Kruger and Evers [149] completed a performance benchmark between SB and LB vHTS tools across four different targets, namely, angiotensin-converting enzyme (ACE), cyclooxygenase-2 (COX-2), thrombin, and HIV-1 protease. The so-called docking methods, including Glide, GOLD, Surflex, and FlexX, were used to dock ligands into rigid crystal structures of the targets obtained from the PDB. Glide, grid-based ligand docking with energetics, approximates a complete systematic search of the conformational, orientational, and positional space of the docked ligand. It uses an initial rough positioning and scoring phase that dramatically narrows the search space, and follows it by torsionally flexible energy optimization on an OPLS-AA nonbonded potential grid for a few hundred surviving candidate poses. Surflex is a fully automatic flexible molecular docking algorithm that combines the scoring function from the Hammerhead docking system with a search engine that relies on a surface-based molecular similarity method as a means to rapidly generate suitable putative poses for molecular fragments. Finally, FlexX is a software package for predicting protein-ligand interactions.
A single ligand was used as a reference for LB similarity search strategies, such as 2D (fingerprints and feature trees) and 3D [rapid overlay of chemical structures, or ROCS, provided by OpenEye Scientific Software, Santa Fe, NM], a similarity algorithm that calculates the maximum overlap volume of two 3D structures [150]. In general, Kruger and Evers found that docking methods performed poorly for HIV-1 protease and thrombin, which is attributable to the flexible nature of the targets and the fact that the known ligands for such proteins have large molecular weight and peptidomimetic character. Enrichments based on 3D similarity searches were also poor for the HIV-1 protease and thrombin datasets, when compared with ACE, which is likely due to the higher level of diversity in the HIV-1 protease and thrombin ligand datasets.

Similarity scoring algorithms, such as ShapeTanimoto, ColorScore, and ComboScore, were also compared with the performance of ROCS [151]. It was found that even within the scoring, the algorithm performance varied across targets. For example, ColorScore performed best for ACE and HIV-1 protease, ShapeTanimoto performed best for COX-2, and ComboScore was the method of choice for thrombin. All vHTS tools performed comparatively well for ACE, but the LB 2D fingerprint approach generally outperformed docking methods. The authors [151] also made an important observation in that, especially for HIV-1 protease, the SB and LB approaches yielded complementary “hit” lists. Therefore, performance metrics are not the only benchmark to consider when comparing CADD techniques. In some cases, discovery of novel chemotypes is more important than high hit rates or high activity. Reference [151] found that ROCS and feature trees were more successful in retrieving compounds with novel scaffolds, when compared with other fingerprints.

Warren et al.
[152] reported on an assessment of the capabilities and shortcomings of docking programs and their scoring techniques against eight proteins of seven evolutionarily diverse target types. They reported that, in general, although the molecular conformation was less precise across docking programs, docking methods were fairly accurate in terms of the ligand’s overall positioning. They also found that docking programs were adept at generating poses that included ones similar to those found in complex crystal structures. Their findings also agree with those of others who had reported that docking programs lack reliable scoring algorithms. Thus, on the one hand, docking methods are able to predict a set of poses of the ligand that includes the crystal structure pose but, on the other hand, the preference for the crystal structure pose was not necessarily reflected in the scoring. Warren et al. also found that enrichment of the hits can be increased by applying previous knowledge regarding the target. However, there was little statistically significant correlation between docking scores and ligand affinity across the targets. Their study concluded that a docking program’s ability to reproduce accurate binding poses did not necessarily mean that the program could accurately predict binding affinities. Their analysis underscored the necessity not only of re-ranking the top hits from a docking-based vHTS using computationally expensive tools, but also of continuing to evaluate novel scoring functions that can efficiently and accurately predict binding affinities.

In order to improve the scoring functions, consensus-scoring methods and free energy scoring with docking techniques have been used. Consensus-scoring methods have been shown to improve enrichments and prediction of bound conformations and poses by balancing out errors of individual scoring functions.
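The idea of balancing out the errors of individual scoring functions can be sketched as follows. This is only an illustration under stated assumptions: rank averaging is one common way to form a consensus and is not necessarily the scheme used in the cited studies, lower scores are taken to mean better predicted binding, and the ligand names and score values are hypothetical.

```python
from __future__ import annotations


def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank ligands under one scoring function (1 = best, lower score is better)."""
    ordered = sorted(scores, key=scores.get)
    return {ligand: position for position, ligand in enumerate(ordered, start=1)}


def consensus_rank(score_tables: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Average each ligand's rank over all scoring functions, best average first."""
    per_function = [ranks(table) for table in score_tables]
    ligands = per_function[0].keys()
    averaged = {
        ligand: sum(r[ligand] for r in per_function) / len(per_function)
        for ligand in ligands
    }
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical scores from three different scoring functions for the same ligands.
score_a = {"lig1": -9.1, "lig2": -7.4, "lig3": -8.0}
score_b = {"lig1": -6.5, "lig2": -6.9, "lig3": -7.8}
score_c = {"lig1": -9.0, "lig2": -8.8, "lig3": -9.9}
consensus = consensus_rank([score_a, score_b, score_c])
for ligand, mean_rank in consensus:
    print(f"{ligand}: mean rank {mean_rank:.2f}")
```

Ranks are averaged rather than raw scores because different scoring functions are generally not on a common scale.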
Enyedy and Egan [153] compared docking scores of ligands with known IC50 [the concentration of an inhibitor where the response (or binding) is reduced by half], and found that docking scores were incapable of correctly ranking compounds, and were sometimes unable to differentiate active from inactive compounds. They concluded that individual scoring methods can be used successfully to enrich a dataset with an increased population of actives, but they are insufficient for distinguishing actives from inactives. Page and Bates [154] reported that although binding energy calculations, such as MM-PBSA (molecular mechanics, combined with Poisson-Boltzmann surface area) methods, are among the more successful methods of estimating the free energy of complexes, they are more applicable to providing insights into the nature of the interactions, rather than to prediction or screening. Consensus scoring functions, where free energy scores of various algorithms have been combined or averaged, have been shown to substantially improve the performance [155] [156] [157] [158].

In their review Ripphausen et al. [159] reported that SB virtual screening was used much more frequently than LB virtual screening, 322 against 107 studies. Despite the preference for SB methods, LB methods on average yield hits with higher potency than SB methods. Most LB hits had activities better than 1 μM, whereas the SB hits fell frequently in the range of 1-100 μM. In fact, scoring algorithms in docking functions have been found to be biased toward known protein-ligand complexes. For example, more potent hits against protein kinase targets have been discovered when compared with other target classes [160].

Considering the advantages and disadvantages of both methods, an approach that combines SB and LB computational techniques has gained increasingly more attention [161].
For example, a method based on combining the GRID force field with GOLPE variable selection methods docks a set of ligands at a common binding site using GRID, and then calculates descriptors for the binding interactions by probing the docking poses with GOLPE [162]. Multivariate regression is then used to create a statistical model that can explain the biological activity of the ligands. Structure-based interactions between a ligand and target can also be used in similarity-based searches to identify compounds that are similar only in the regions that participate in binding, rather than across the entire ligand. LigandScout, a software package that allows creating 3D pharmacophore models from structural data of macromolecule-ligand complexes, or from training and test sets of organic molecules, uses such a technique to define a pharmacophore based on hydrogen bonding and charge-transfer interactions between a ligand and its target.

Another technique, known as the pseudo-receptor technique [163], uses pharmacophore mapping-like overlaying techniques for a collection of ligands that bind to the same binding site to establish a virtual representation of the binding site’s structure, which is then used as a template for docking and other structure-based vHTS. This approach has been used by VirtualToxLab [164] for the creation of nuclear receptor and cytochrome P450 binding site models in ADMET (absorption, distribution, metabolism, excretion, and toxicity) prediction tools, and by Tanrikulu [165] in the modeling of the H4 receptor binding site, subsequently used to identify novel active scaffolds [165].

In a recent review Wilson and Lill [166] grouped these methods into a major class of combined techniques called interaction-based methods. A second major class involves the use of QSAR and similarity methods to enrich a library of virtual compounds prior to a molecular docking project.
This can increase the efficiency of the project by reducing the number of compounds to be docked. This is similar to the application of CADD to enrich libraries prior to traditional HTS (high throughput screening) projects. The review also presented comprehensive descriptions of software packages using a combination of LB and SB techniques, as well as several case studies that tested the performance of such tools. As discussed earlier, these methods are often used in series where LB methods are first used to enrich libraries that will subsequently be used in SB vHTS. The most common application is at the ligand library creation stage through the use of QSAR techniques to filter out compounds with low similarity to a query compound, or no predicted activity, based on a statistical model. QSAR has also been used as a means to refine the docking scores of an SB virtual screening. 2D and 3D QSAR can also be used to track docking errors, and have been used by Novartis [167], who built a QSAR model from docking scores, rather than the observed activities. The model was then applied to the same set in order to provide additional score weights for each compound.

Although CADD methods have been extensively used in drug discovery, such targets as protein-protein and protein-DNA interactions are still formidable problems, mainly because of the massive size of the interaction sites [168].

3.3 Neural Network in Drug Discovery

As described in Chapter 2, misfolding of proteins leads to their aggregation and formation of supramolecular structures. If aggregation occurs in a biological system, deposition of the large aggregates onto the internal surface of the cellular materials can contribute to the development of important and debilitating diseases, such as Alzheimer’s. So, the question is whether one can identify molecular structures that can inhibit folding or misfolding of important natural proteins, such as amyloids. This is not an easy task.
Since identifying inhibitors through experiments is highly expensive and time consuming, an approach based on advanced computational methods may be able to identify some potential inhibitors, narrow down their range and variability and, hence, make it possible to carry out a much more restricted set of experiments. But, what is the right computational approach?

The idea that we have been working on to address this problem is based on machine-learning methods that have a long and rich history. Known as artificial neural networks (ANNs), their first application to drug discovery goes back to the early 1970s when Hiller et al. [169] published a study using the Rosenblatt perceptron to classify substituted 1,3-dioxanes as physiologically active or inactive. Building on the earlier works of the 1940s, Rosenblatt, a psychologist, developed the first algorithmically described ANN in 1958. A perceptron is the simplest form of an ANN used for classification of patterns that are linearly separable, that is, patterns that lie on opposite sides of a hyperplane. It consists of a single neuron with adjustable synaptic weights and bias. In the work of Hiller et al., the trained NN manifested good recognition on both training and testing data sets. The next stage of development was in 1990 when Aoyama et al. [170] used ANNs in QSAR studies. Over the last 25 years, the ANN approach to modeling structure-activity relationships has matured into a well-established scientific field with numerous theoretical approaches and successful practical applications. The field now encompasses the use of ANNs for predicting not only different types of biological activity, but also physicochemical, ADMET, biodegradability and spectroscopic properties, as well as reactivity [171] [172] [173] [174].

Before continuing, let us describe the essence of ANNs.
An ANN is a collection of connected units or nodes called artificial neurons, which act analogously to biological neurons in an animal or human brain. Each connection between artificial neurons, analogous to a synapse (a structure that permits a neuron or nerve cell to pass an electrical or chemical signal to another neuron or to the target efferent cell), can transmit a signal from one neuron to another. The artificial neuron that receives the signal can process it and then signal the artificial neurons connected to it. In common ANN applications, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is calculated by a nonlinear function of the sum of its inputs. Both the artificial neurons and the connections typically have weights that are adjusted as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is sent only if the aggregate signal crosses that threshold. Typically, artificial neurons are organized in (hidden) layers. Different layers may perform different types of transformations on their inputs. Signals travel from the first (input) to the last (output) layer, possibly after traversing the layers multiple times.

The way an ANN is developed for any particular application is that, after deciding the structure of the net, known data are selected in order to train the ANN. In practice, the ANN correlates certain variables through a normally nonlinear regression. Once the regression is developed, the ANN is used with another set of known data that, however, had not been used in the training, in order to test whether the model is valid. If the validation and other measures of accuracy are passed by the ANN, it is then used to make predictions.
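The train-then-test workflow described above can be sketched with the simplest possible network, a single threshold neuron (a perceptron) with adjustable weights and bias. This is an illustrative sketch only: the two-feature data, the labelling rule (label 1 when the two features sum to more than 1), the learning rate, and the function names are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Fire (return 1) only if the weighted sum of the inputs crosses the threshold."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, epochs=50, rate=0.1):
    """Perceptron rule: adjust weights and bias after every misclassified sample."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # 0 when the sample is correct
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


# Hypothetical, linearly separable training data.
train_x = [(0.0, 0.0), (0.2, 0.3), (0.4, 0.1), (1.0, 1.0), (0.9, 0.8), (0.7, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]

# Known data held out from training, used only to test the trained model.
test_x = [(0.1, 0.1), (0.3, 0.2), (0.8, 0.9), (1.0, 0.9)]
test_y = [0, 0, 1, 1]

weights, bias = train_perceptron(train_x, train_y)
test_predictions = [predict(weights, bias, x) for x in test_x]
accuracy = sum(p == y for p, y in zip(test_predictions, test_y)) / len(test_y)
print("held-out accuracy:", accuracy)
```

A multilayer ANN replaces the hard threshold with a smooth nonlinear function and stacks many such units, but the sequence of training on one data set and validating on another is the same.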
3.3.1 ANNs with Deep Learning

At a more advanced level, deep learning refers to training multilayer ANNs with up to thousands of nodes and more than one hidden layer [175] [176]. Before the concept was developed in the mid-2000s, standard machine-learning methods were shallow: they could be described by at most three layers of processing units [175]. Although an ANN can be constructed with any number of hidden layers, their training using back propagation optimization algorithms usually fails if the number of hidden layers exceeds three or four [175]. This is due to the increased risk of overfitting with a larger number of weights than necessary. It could also be due to the characteristics of the back propagation algorithm itself. During the back propagation process, the values of the error derivatives are propagated from the output layer back to the input one, and vanish rapidly with the distance from the output layer. This is because of the multiplication of several small partial derivatives, as required by the chain rule of differentiation. As a result, only the layers closest to the output layer can actually be trained, whereas all weight parameters in the remaining hidden layers stay almost unchanged during the training. Since all adjustable weights of multilayer ANNs are usually initialized with small random numbers, during the training the network tries to approximate the “functional dependence” of the output values on the random numbers formed in the hidden units near the input layer and, thus, not surprisingly, fails.

An efficient solution to this problem was found in 2006 by Hinton and Salakhutdinov [177], who suggested splitting the learning process into two stages: (1) representation learning [178], and (2) training the network using the learned representation. In the first successful implementation of their methodology, a cascade of restricted Boltzmann machines (RBMs) was used to learn a hierarchy of internal data representations [177].
An RBM is a generative stochastic ANN that can learn a probability distribution over its set of inputs. Then, the weight parameters learned by the RBMs were used to initialize the weights of the deep multilayer ANNs, which were subsequently readjusted during the training using the standard backpropagation algorithm. In this way, multilayer ANNs with virtually any number of hidden layers can be trained efficiently.

Since the pioneering work of Hinton and Salakhutdinov, deep learning has been augmented in several important ways. First, the sigmoidal transfer function was replaced by the rectified linear (rectifier) function, usually producing stronger models [179]. Second, a new, powerful regularization technique, called weight dropout [180], was introduced. To implement weight dropout, nodes are randomly switched off during training. The regularizing effect of the dropout technique, in conjunction with the use of a rectifier transfer function, means that it becomes possible to train very large ANNs with a huge number of hidden layer nodes and their interconnections without overtraining or overfitting [176]. Moreover, with sufficiently large data sets, it is not necessary to pretrain ANNs using cascades of RBMs or other autoencoders in order to learn the data representation, and the weights, except those between the final hidden and the output layers, can be set randomly once.

Another important technique that was successfully integrated with deep learning is convolutional architecture [181]. Convolutional ANNs have roots in the neocognitron [182] architecture, specifically designed to mimic information processing in the visual cortex. Distinct from the standard multiple-layer ANNs working with fixed-size data vectors, convolutional ANNs are designed to work with data in the form of multiple arrays with variable size, such as 2D pixel matrices for images, while providing the necessary invariance to irrelevant data transformations, such as shifts or distortions of images.
Convolutional ANNs consist of two types of layers: convolutional and pooling layers. Each unit in a convolutional layer takes signals from a small patch of units from the previous layer through a set of weights shared by all units in the layer. Each unit in a pooling layer computes the maximum of the signals coming from a patch of units in the previous layer. Stacks of several convolution and pooling units allow extraction of complex relevant features from images. In deep ANNs, convolution and pooling layers are typically placed at the input side of the network.

An important factor in the recent success of ANNs with deep learning is the use of fast graphics processing units (GPUs) that significantly accelerate the training due to parallelization. Currently, a deep-learning ANN composed of millions of units with hundreds of millions of adjustable weights organized in several dozen layers can be trained with huge data sets of hundreds of millions of examples. Such networks have already achieved human-level or higher performance in solving such tasks as image and speech recognition.

Note that deep learning is not just a new “fancy” term to denote the state-of-the-art of ANNs, because it cannot be reduced to a simple application of additional techniques, such as dropout and rectifier units, or a simple augmentation of the number of hidden layers in multilayer ANNs. Nor can it be reduced to a mere application of deep-learning software to solve old problems using traditional approaches. It actually represents a new philosophy of predictive modeling. The success of the application of standard shallow machine-learning methods is greatly influenced by how well the features representing the data have been chosen using experience and domain knowledge. With very well-designed features, even the simplest linear or nearest-neighbors machine-learning methods can be applied to build predictive models.
The great promise of deep learning is the possibility of extracting the necessary features with the required invariance properties automatically from raw data via representation learning [176]. ANNs with deep learning form multiple levels of representation in their hidden layers, with each subsequent layer forming a representation of a higher, more complex and abstract, level than the previous one. With multiple (up to several dozen) hidden layers of nonlinear units, such ANNs learn extremely complex functions of their inputs with all the necessary invariance properties, which is not always possible using standard machine-learning methods with manually tailored features. Due to the process of representation learning, deep learning can easily profit from related data sets with multiple labels via multitask and transfer learning [183] [184], as well as from data without labels via semi-supervised and transductive learning [185]. So, deep learning can be considered an important step towards artificial intelligence [175]. On the negative side, however, ANNs with deep learning cannot as easily perform sparse feature selection, which is important for optimizing predictions of new data and for simple interpretation of models. Such methods as multiple linear regression with expectation maximization [186] can achieve efficient sparse feature selection, so that they can be complementary to deep-learning methods. On the positive side, although they perform as well on average as state-of-the-art shallow NNs, like Bayesian-regularized NNs, they may be faster to train, can handle large data sets on large cluster or GPU hardware, and may even be easier to code algorithmically.

Representation learning provided by deep multilayer ANNs is playing an increasingly important role in computational drug discovery [187] [188]. The question of which molecular descriptors should be used to capture the important properties of molecules is, however, a relatively poorly addressed one.
Despite the large number and variety of molecular descriptors, none can be guaranteed to have universal applicability and provide optimal solutions to all problems arising in drug discovery. Deep-learning ANNs may alleviate this issue somewhat by generating novel and useful complex representations that may be more suited to solving specific tasks, albeit at the expense of generating models whose interpretation is even more difficult. However, the discovery of more suitable and chemically interpretable molecular descriptors is still an important, poorly solved problem in QSAR. One can also expect that the ability to integrate a large amount of related data using deep multilayer ANNs with multiple outputs will be very useful for drug discovery, as it allows reuse of previously accumulated data and knowledge to meet new challenges in drug discovery.

Although the first papers on the use of deep learning in drug discovery appeared very recently [189] [190], some of the key ideas underlying its concepts have already been used for building QSAR models. In 1997, for example, the first multilayer ANN with convolutional layers containing shared weights (receptors) and pooling layers (collectors), capable of extracting molecular features from raw data, was reported [191]. Similar to deep learning, convolutional ANNs were inspired by the neocognitron [182] architecture for image recognition. The analysis of pixels in images was replaced by analysis of atoms and bonds in molecules. The resulting neural device for searching direct correlations between structures and properties of organic compounds allowed construction of QSAR models using raw molecular data without preliminary computation of molecular descriptors [191].

Another idea applied to QSAR modeling and discussed above is the use of ANNs with several outputs to predict several properties using the multitask learning framework [192].
It was shown recently that massively multitask ANNs, trained with deep learning, significantly outperform single-task methods, and their predictive performance improves as additional tasks (targets) and data are added.

Two massive, multitask ANNs for drug discovery have recently been reported. One of them was trained on a data set of nearly 40 million protein-ligand measurements across 259 biological targets [193]. Another was trained on 2 million data points for 1280 biological targets [194]. The improvement is significantly influenced by both the amount of data and the number of tasks (targets). It has also been demonstrated for toxicity prediction that, by combining reactive centers, such networks can learn complex internal representations that resemble well-established toxicophores [195].

As far as the application of ANNs to drug discovery is concerned, although they originally suffered from overfitting, overtraining, and incorrect model validation, such problems have been essentially addressed. Such methods as early stopping [196], bias correction as used in associative NNs [197], Bayesian regularization [198], and training with dropout techniques [199] allow development of highly predictive, robust models. Hence, the application of traditional ANNs to drug design, as well as to such other fields as materials, has matured.

NNs are sometimes criticized as a black-box approach. This is, however, as much due to the use of poorly interpretable descriptors as to a problem with the NN method itself. Increasingly sophisticated methods for analyzing the significance of NN weights [200], and such general-purpose methods as predicted matched molecular pairs [201], allow more facile interpretation of models. Additionally, NNs can be better interpreted by analysis of the distribution of the partial derivatives of their outputs with respect to the inputs, or by calculating their sensitivities, as discussed above.
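The interpretation of a trained network through the partial derivatives of its output with respect to its inputs can be illustrated with a small sketch. It assumes a hypothetical, hand-weighted network with two inputs (standing in for two molecular descriptors) and estimates the derivatives by central finite differences; the weights, the evaluation point, and the function names are made up for the example.

```python
import math


def tiny_network(x):
    """Hypothetical two-input network: one tanh hidden unit and a linear output."""
    hidden = math.tanh(2.0 * x[0] + 0.1 * x[1])
    return 1.5 * hidden


def sensitivities(model, x, step=1e-5):
    """Central-difference estimate of d(output)/d(input_i) at the point x."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += step
        down[i] -= step
        grads.append((model(up) - model(down)) / (2.0 * step))
    return grads


point = [0.1, 0.4]  # hypothetical descriptor values for one compound
grads = sensitivities(tiny_network, point)
for index, value in enumerate(grads):
    print(f"d(output)/d(x{index}) = {value:.4f}")
```

Repeating the calculation over every compound in a data set gives, for each descriptor, a distribution of such derivatives; in this toy network the output is far more sensitive to the first input than to the second.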
NNs, and in particular those with deep learning, will continue to be used actively in drug discovery in the future. They will be particularly useful for the analysis of large data sets that are increasingly generated by automated high-throughput technologies and, thus, they are well suited to the challenges of the so-called Big Data [202]. NNs will also be increasingly used for other complex tasks, such as force field parameterization, optimization of drug delivery systems, ADMET prediction and drug classification, prediction of synthesis difficulty, and especially for multitask learning and simultaneous prediction of multiple biological activities or properties.

3.4 Discovering γ-Secretase Inhibitors

Having learned about the main concepts and ideas of ANNs, we now turn our attention to their application to the problem that we outlined in Chapter 1. It is well known that, worldwide, Alzheimer’s disease (AD) is the most common cause of dementia, which is caused by damage to the nerve cells and leads to loss of memory or other cognitive impairments. AD is characterized by aggregation of β-amyloid peptides (Aβ) and formation of neurofibrillary tangles [203]. During the past decades, genetic studies have identified a strong link between Aβ and the pathogenesis of AD [204]. Aβ is believed to be generated from amyloid precursor protein (APP) by two sequential proteolytic reactions: β-secretase (BACE) cleaves APP at its N-terminus, producing a membrane-bound fragment, C99. The second enzyme, γ-secretase, is a multi-subunit protease complex, itself an integral membrane protein, that cleaves C99 and produces Aβ37 to Aβ42 amyloids. Although constituting only 5% to 10% of the total Aβ, Aβ42 plays a key role in initiating plaque deposition. Blocking Aβ production with BACE and γ-secretase inhibitors has, thus, become a major approach to disease-modifying therapy. γ-secretase is a large complex of four membrane proteins: presenilin, nicastrin, aph-1, and pen-2 [205].
Although all four components of the γ-secretase complex are essential for enzymatic activity, studies suggest that presenilin (PS) harbors the catalytic site, as well as the allosteric inhibitor binding sites [206]. Since γ-secretase is a highly tractable therapeutic target, many γ-secretase inhibitors (GSIs) that effectively inhibit γ-secretase cleavage in humans have been developed, including numerous orally bioavailable, brain-penetrant GSIs [207]. In AD the efficacy of GSIs has been tied to inhibition of Aβ, and GSIs have therefore been conceptualized as Aβ production inhibitors [207]. GSIs can decrease Aβ production in human and mouse brain. Their chronic administration decreases Aβ deposition in APP mouse models [208].

Early γ-secretase inhibitors were aspartyl protease transition-state analogues based on the APP substrate cleavage site, and most were peptidic mimetics [209]. Although such analogues are valuable tools in purifying γ-secretase and elucidating its mechanism and function [210], they were deemed less feasible for in-vivo studies and for further development as orally available drugs. During the past few years, a number of low-molecular-weight, more druglike small molecules with high potency have been reported in the scientific and patent literature [211]. Most of the small-molecule inhibitors can be classified into either the sulfonamide or the benzodiazepine series [212]. Recent data suggest that all the non-transition-state inhibitors investigated so far target the same binding site, which is distinct from the transition-state analogue binding site [213]. In addition to GSIs, compounds referred to as γ-secretase modulators (GSMs), which modulate the processivity of γ-secretase, have been identified and remain in development as potentially inherently safer ways to selectively target Aβ42 in AD.

3.5 Related Works

Discovering new treatments for human diseases is an immensely complex challenge.
Although computational methods have been applied to drug discovery for more than thirty years, computer tools remain inaccurate for routine binding prediction. Successful prediction of new binding molecules would greatly reduce the time needed for discovering new treatments and, thus, shed light on medicinal development [214].

In addition to the traditional computational methods used in drug discovery, machine learning has made inroads into the field in recent years. Early-stage work focused mainly on featurizing molecules in order to predict drug activity [215]. Later on, more refined models were derived. For example, the influence-relevance voting method combines low-complexity ANNs and k-nearest neighbors [216]. Other related works extract features from the connectivity graphs of small molecules, and then use NNs to predict aqueous solubility [217]. A notable application of deep learning is the massively multitask neural architectures for drug discovery, which were mentioned earlier. Overall, more than 40 million measurements across more than 200 biological targets have been assembled, and massively multitask networks significantly outperform single-task methods; their predictive power improves as additional tasks and data are added [218].

Several data mining methods, such as QSAR, have been applied to identify interactions between a ligand and an unknown target. Since traditional QSAR is a ligand-based (LB) model, it can be applied only when the target protein is unspecific, or to a series of ligands against a single target [219]. Recently, Vina et al. developed a multitarget-QSAR (mt-QSAR) classification method with an accuracy of 72% for the training set and 72% in cross-validation [220]. Fang et al. [221] applied the mt-QSAR method to predict the chemical-protein interactions for 25 key targets related to AD.
The predictions were then confirmed by reported data and experimental validation, which indicated the potential of this strategy for predicting the targets of compounds and for MTDL discovery. Computational chemogenomic methods aim at exploiting information not only on the small molecules, but also on the drug targets that interact with the molecules [222, 223]. The major advantage of a chemogenomic model is that it can predict chemical-protein interaction (CPI) with a single binary model. Wang et al. constructed a model for predicting CPI based only on the primary sequence of the proteins and the structural features of the small molecules, and used it to identify novel ligands for four targets. The results were then validated by experimental assays [222]. Cheng et al. compared the mt-QSAR and computational chemogenomic methods for CPI, and found that the performance of the mt-QSAR method was better than that of the chemogenomic method for the external validation set [223].

In this Chapter we use an ANN to identify potential inhibitors of the γ-secretase protein. To do so, we first train the NN model with inhibitors that have been proven experimentally to be active. We then use the trained model to select potentially active inhibitors for the target protein. Further analysis of the newly discovered inhibitors, regarding their chemical properties, similarity, diversity, and so on, then follows.

3.6 Methods

As already described, when using any ANN, one must first select the data with which the network is trained, and then tested and validated. Thus, we first describe the selection of the data.

3.6.1 Data Selection

The inhibitors of γ-secretase were taken from the BindingDB website (footnote 1), which is a public database of small, druglike molecules that interact with proteins considered to be drug targets. More than 600 inhibitors, which have been proven experimentally to be effective against the target protein, are listed there, and were selected as the active compounds for our study.
Six hundred decoys were generated using the DUD-E database (footnote 2) and used as the inactive compounds.

3.6.2 Data Featurization

Extended connectivity fingerprints (ECFP4), which are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling [224], were generated with RDKit (footnote 3) [225] to featurize each molecule. In ECFP4 each molecule is decomposed into a set of fragments, with each fragment centered at a non-hydrogen atom and extending radially along the bonds to the neighboring atoms. Each fragment is assigned a unique identifier, so that each molecule becomes a collection of identifiers. The collection of identifiers is hashed into a fixed-length bit vector, which is referred to as the "fingerprint." In this study each compound is hashed into a bit vector of length 2048. In general, fingerprints, such as ECFP4, ECFP6 and MACCS (Molecular ACCess System) keys, are commonly used to describe molecules and to measure the similarity between compounds [226].

Footnotes: 1 https://www.bindingdb.org/bind/index.jsp; 2 http://dude.docking.org/; 3 http://www.rdkit.org/docs/

3.6.3 Structure of the Neural Network

The ANN that we employ is composed of an input layer with 2048 nodes, and two hidden layers with 1024 nodes each. In the hidden layers the hyperbolic tangent function is used as the activation function, and the activation function of the jth node in a hidden layer is written as

$$B_j(\mathbf{x};\mathbf{w}_j) = \frac{1-\exp(-2\mathbf{w}_j^{T}\mathbf{x})}{1+\exp(-2\mathbf{w}_j^{T}\mathbf{x})}, \tag{3.1}$$

where w_j is the vector of weights of the jth node and x is the vector of inputs to the node. Adding dropout of 0.25 to our pyramidal networks improves performance. Dropout is a regularization technique for reducing overfitting in ANNs by preventing complex co-adaptations on the training data, and is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
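As a minimal sketch of the hidden-layer activation of Eq. (3.1) and of dropout, the following pure-Python snippet evaluates the activation in the exponential form given above, which is algebraically tanh(w_j^T x), and applies a random dropout mask. The function names, the toy four-bit input, the weights, and the "inverted" rescaling of the surviving units by 1/(1 - rate) are illustrative assumptions, not the implementation used in this work.

```python
import math
import random


def hidden_activation(weights, inputs):
    """Eq. (3.1): (1 - exp(-2 w^T x)) / (1 + exp(-2 w^T x)), i.e. tanh(w^T x)."""
    z = sum(w * x for w, x in zip(weights, inputs))
    if z < -300.0:
        # exp(-2z) would overflow; the activation has saturated at -1.
        return -1.0
    e = math.exp(-2.0 * z)
    return (1.0 - e) / (1.0 + e)


def dropout(values, rate, rng):
    """Zero each unit with probability `rate` during training.

    Survivors are divided by (1 - rate) so that the expected activation is
    unchanged (inverted dropout; this rescaling is an assumption here).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


if __name__ == "__main__":
    x = [1.0, 0.0, 1.0, 1.0]        # toy 4-bit fingerprint (the real input has 2048 bits)
    w = [0.3, -0.8, 0.1, -0.2]      # hypothetical weights of one hidden node
    print(hidden_activation(w, x))  # agrees with math.tanh(0.2)
    print(dropout([0.5, -0.1, 0.9, 0.2], rate=0.25, rng=random.Random(0)))
```

At prediction time dropout is switched off, which in this sketch corresponds to calling `dropout` with `rate=0.0`.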
The output layer is a single node that yields a binary number, indicating the activity of the inhibitor.

[Figure: schematic of the network, showing the input, hidden, and output layers.]

3.6.4 Training of the ANN

In this study, 1200 samples with nearly equal numbers of active and inactive inhibitors are used as the training set. We employed the error back-propagation algorithm [227] with a dropout ratio of 0.2 to regularize the ANN. The training takes about 20000 steps, but after about 1000 steps the model begins to converge. Note that back propagation is a method used in ANNs to calculate the error contribution of each neuron after a batch of data has been processed. In the context of learning (or training the ANN), back propagation is commonly used by the gradient descent optimization algorithm to adjust the weights of the neurons by calculating the gradient of the loss function.

3.7 Result

In what follows we present, describe and discuss the results.

3.7.1 Cross Validation

The ANN model was validated by k-fold cross validation, i.e., the data set is divided into k subsets. Each time, one of the k subsets is used as the test set and the rest of the subsets are put together to form a training set. We used k = 5 and, thus, divided the entire data set into five equal cross-validation splits. The model was trained on four of the cross-validation splits together, and the fifth was used as an internal validation set, or the test set. Repeating the k-fold cross validation for multiple runs provides a better and more accurate statistical estimate. The hope was that the estimates have low bias and low variance. Leave-one-out is a special case of k-fold cross validation, in which the number of folds equals the number of available data points. This method is almost unbiased, but has high variance, leading to unreliable estimates. When choosing the number of folds, we would like to trade off bias for low variance.
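The splitting scheme described above can be sketched in a few lines of pure Python. The function name, the fixed random seed, and the round-robin assignment of shuffled indices to folds are assumptions made for illustration; only the sample count (1200) and k = 5 come from the text.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment: fold sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_splits(1200, 5), start=1):
        print(f"fold {fold}: {len(train)} training samples, {len(test)} test samples")
```

Setting `k` equal to `n_samples` gives the leave-one-out scheme mentioned above, with a single sample in every test set.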
All the developed ANNs were evaluated by the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and by the sensitivity (SE), specificity (SP), overall prediction accuracy (Q), and Matthews correlation coefficient (MCC), which are given by

$$\mathrm{SE} = \frac{TP}{TP+FN}, \tag{3.2}$$

$$\mathrm{SP} = \frac{TN}{TN+FP}, \tag{3.3}$$

$$Q = \frac{TP+TN}{TP+TN+FP+FN}, \tag{3.4}$$

$$\mathrm{MCC} = \frac{TP\times TN - FN\times FP}{\sqrt{(TP+FN)(TP+FP)(TN+FN)(TN+FP)}}. \tag{3.5}$$

The value of MCC falls in the range -1 ≤ MCC ≤ +1. A perfect classification gives an MCC value of 1. In addition, the receiver operating characteristic (ROC) curve was plotted in order to present graphically the behavior of the model [228]. Recall that the ROC curve for a binary classifier is the plot of the true positive rate (TPR) versus the false positive rate (FPR), as the discrimination threshold is varied. The performance of the ANN was also measured by the area under the ROC curve (AUC), which is a global measure of classification performance for an individual data set. Note that AUC must lie in the range [0, 1]. More generally, for a collection of N data sets, we consider the mean and median of the k-fold-average AUC:

$$\mathrm{AUC} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{AUC}_i. \tag{3.6}$$

In this study the MCC value after training is 0.942. Figure 3.1 indicates that the AUC of our training is 0.972. The median and mean among the different data sets shifted by 0.01. Both represent excellent results.

FIGURE 3.1: AUC curve after training model

3.7.2 Discovering Potential Inhibitors

In order to discover potential inhibitors, 5078 inhibitors, which had been proven to be active against peptidase AD family proteins, were selected from the BindingDB website. After running this data set through the trained ANN, 522 new inhibitors were identified as potentially active against the target protein.
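Returning to the evaluation measures of Section 3.7.1, the following sketch computes SE, SP, Q, and MCC from the four confusion-matrix counts according to Eqs. (3.2)-(3.5), and the AUC through its rank interpretation, i.e., the probability that a randomly chosen active compound is scored above a randomly chosen inactive one. The function names, the convention of returning 0 when a denominator vanishes, and the counts in the example are hypothetical; they are not results of this study.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 if the denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC, as in Eqs. (3.2)-(3.5)."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "SE": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "Q": _ratio(tp + tn, tp + tn + fp + fn),
        "MCC": _ratio(tp * tn - fn * fp, denom),
    }


def roc_auc(labels, scores):
    """Area under the ROC curve; labels are 1 (active) or 0 (inactive).

    Computed as the fraction of (active, inactive) pairs in which the active
    compound has the higher score, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    print(classification_metrics(tp=95, tn=90, fp=10, fn=5))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise AUC is quadratic in the number of compounds, which is acceptable for data sets of the size used here; a sort-based computation would be preferable for much larger ones.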
3.7.3 Similarity

To determine whether the properties of the generated molecules match the properties of the training data, we computed several molecular properties, namely, the molecular weight, BertzCT, the number of H-donors, the number of H-acceptors, the number of rotatable bonds, log(P) (where P is the partition coefficient), and the total polar surface area for both the generated molecules and the active training data set. In particular, BertzCT is perhaps the most popular complexity index, and takes into account both the variety of the types of bond connectivities and the atom types. It is defined by I_CPX = I_CPB + I_CPA, where I_CPB and I_CPA are, respectively, the information contents related to the bond connectivity and the atom type diversity. Figure 3.2 presents the histograms of these properties. The blue bars represent the original active data set, while the red bars represent the newly produced data set. The figure indicates that all the properties are in the same range, hence providing evidence that there is a high probability that the two sets share similar structures and properties.

FIGURE 3.2: Histogram of chemical properties.

Next, we performed dimensionality reduction to 2D with the t-distributed stochastic neighbor embedding (t-SNE). The results are shown in Figure 3.3. Blue dots represent the generated molecules, while red dots indicate the active training data set. Figure 3.3 indicates that the two sets overlap significantly, which means that the generated molecules reproduce very well the properties of the training molecules.

FIGURE 3.3: t-SNE projection of 7 physicochemical descriptors of active molecules and molecules generated with the ANN, to two unitless dimensions.

3.7.4 Diversity

In order to assess the novelty of the newly discovered molecules, the Tanimoto index [229], a commonly used similarity measure based on 2D fingerprints, was used to analyze the similarity of each inhibitor with its nearest neighbor from the active training set.
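The nearest-neighbor analysis can be sketched as follows, with each fingerprint represented as the set of indices of its "on" bits, so that the Tanimoto index is the size of the intersection divided by the size of the union. The function names, the convention of returning 0 for two empty fingerprints, and the toy fingerprints are assumptions for illustration; actual ECFP4 fingerprints have 2048 bits.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints are assigned similarity 0.0 (a convention).
    return len(a & b) / union if union else 0.0


def nearest_neighbor_similarity(query, reference_set):
    """Highest Tanimoto index between `query` and any fingerprint in `reference_set`."""
    if not reference_set:
        raise ValueError("reference_set must not be empty")
    return max(tanimoto(query, ref) for ref in reference_set)


if __name__ == "__main__":
    # Toy fingerprints with hypothetical on-bit indices.
    training_actives = [{1, 5, 9, 12}, {1, 5, 9, 40}, {7, 8, 30}]
    new_molecule = {1, 5, 9, 77}
    print(nearest_neighbor_similarity(new_molecule, training_actives))
```

When the similarity of a training molecule to its nearest neighbor within the same set is computed, the molecule itself must be left out of `reference_set`; otherwise every similarity would equal one.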
Figure 3.4(a) shows the similarity distribution of the inhibitors from the active training set with their nearest neighbors in the same set, while Figure 3.4(b) shows the similarity distribution of the inhibitors from the newly produced set with their nearest neighbors in the original active training set. Figure 3.4(a) indicates that the similarity index is concentrated in the area around 0.8, and that there is a high frequency of similarity close to one. On the other hand, the similarity distribution in Figure 3.4(b) is close to a normal distribution with a median close to 0.6. The results, thus, show that the model outputs molecules that are similar to the target-specific training set. Notably, not only are highly similar molecules produced, but also molecules that cover the whole range of similarity, indicating that our method could deliver not only close analogs, but also new chemotypes or scaffold ideas to a drug discovery project.

FIGURE 3.4: Nearest-neighbor similarity distribution of active training set and newly discovered molecules.

In addition, the results in Figure 3.4 show that, while the new molecules often appear to differ from the known compounds by small replacements, in many cases they involve more complex modifications, or are completely different molecules. This is also supported by the violin plot of the distribution of the nearest-neighbor fingerprint similarities of the training and rediscovered molecules, shown in Figure 3.5. A violin plot is a method of plotting numerical data that is similar to a box plot, with a rotated kernel density plot on each side. The better-known box plot, used in descriptive statistics, is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers), indicating variability outside the upper and lower quartiles.
In Figure 3.5 the leftmost plot is the similarity distribution of the molecules from the rediscovered set to the random decoys, which have been proven to be inactive against the target protein. The middle and right plots, the same as in Figure 3.4, are, respectively, the similarity distribution of the rediscovered molecules to the original active training set, and that of the molecules in the original training set to their nearest neighbors in the same set. The low similarity between the rediscovered molecules and the random decoys indicates that the ANN can distinguish well the active from the inactive molecules. The median of the middle plot is 0.6, while it is 0.8 for the rightmost plot. The lower similarity may again indicate the discovery of new molecules with the desired activity.

FIGURE 3.5: Violin plot of similarity distribution of random decoy, newly discovered molecules, and original training set

3.8 Summary

In this chapter we showed that ANNs can be successfully utilized to learn a statistical chemical language model. The model can generate novel molecules with physicochemical properties similar to those of the training molecules. The advantage of our model is that it is not only straightforward to apply, but it can also discover new molecules that cannot be identified with the traditional method.

To extend our work, further study may be called for to analyze the structure and chemical properties of the newly discovered molecules using other computational approaches, such as docking or molecular dynamics simulation. We hope to be able to report on the latter one.

Chapter 4
Top-Leads of Inhibitors for the Proteins Contributing to Alzheimer's Disease Identified by Neural Networks: Docking Study

4.1 Introduction

In modern drug discovery, docking plays an important role in the study of the orientation of the ligands that are bound to protein receptors. The interactions are approximated by a docking score, which is measured as the potential energy of binding.
The first docking method was proposed by Fischer, who assumed that both the ligand and the receptor can be treated as rigid compounds, and that the binding affinity is directly related to their geometric match [230], whereas Koshland suggested that, during the docking process, both the ligand and the receptor should be treated as flexible compounds [231]. Since the movement of the protein backbone affects multiple side chains, the number of degrees of freedom in docking with a fully flexible receptor and ligand is much larger than in docking with a rigid receptor. As a result, such flexible docking algorithms possess much higher accuracy than rigid-body algorithms, not only in predicting the binding mode of a molecule, but also in predicting the binding affinity [232]. Moreover, local minimization of the molecular energy improves the docking result [233]. It has also been found that the location of the binding site is important to improving the docking efficiency. Several algorithms and programs, such as GRID [234], POCKET [235], SURFNET [236], and PASS [237], detect the binding site or the cavity site of proteins.

In fact, over the last several decades, various kinds of docking tools and programs have been developed for both academic and commercial use. The most popular ones are DOCK [238], AutoDock [239], GOLD [240], Glide [241], LigandFit [242], MCDock, MOE-Dock [243], AutoDock Vina [244], and so on. Although such docking tools and algorithms differ from one another, they can be broadly categorized as incremental construction approaches, such as FlexX [245]; shape-based algorithms, e.g., DOCK; genetic algorithms, GOLD [240], for example; and systematic search techniques, of which Glide is an example. For the sake of efficiency, most docking algorithms treat the receptor as rigid. These programs have been developed to produce the optimal binding mode of the ligands to the binding targets, so as to identify compounds with top binding scores through a virtual screening process.
Among these programs, GOLD and LeDock were able to identify the correct ligand binding poses. Both Glide (in its XP mode; see below) and GOLD predict the poses consistently with high accuracy [246]. It was also reported that the Glide, GOLD and Flex docking algorithms are able to reproduce the experimental poses [247]. Based on their scores, AutoDock Vina and GOLD also predict the top docking poses. However, flexible receptor docking remains a major challenge for most docking methods. In our research we use Glide as the main tool for docking and further study.

The Glide algorithm uses a series of hierarchical filters [248] to search for the optimal positions and orientations of the ligands in the receptors. Prior to docking, a grid is generated to represent the properties and the binding site of the receptor. The rectangular grid also confines the translations of the center of mass of the ligands to ensure more accurate scoring of the ligand pose. During the docking process, a set of initial ligand conformations is generated and, then, the search begins with a rough positioning and scoring phase that significantly narrows down the search space and reduces the number of poses to be further considered to a few hundred. In the next stage, the energy of the selected poses is minimized with the OPLS force field for the receptor. After the minimization, fewer than 10 poses with the lowest energy are obtained, and the orientation of the ligands is refined. The minimized poses are then re-scored using the GlideScore function, which also contains additional terms that account for solvation and repulsive interactions. The choice of the best pose is based on a model energy score (Emodel) that combines the energy grid score, GlideScore, and the internal strain of the ligand.

We investigated both the Standard Precision (SP) mode and the Extra Precision (XP) mode of Glide in a comparative study. The XP scoring is a combination of ChemScore and physical effects that are missing from ChemScore.
Additional terms are involved in the special treatment of the salt bridges, pi-cation interactions, and various other specialized medicinal chemistry motifs. The algorithm, including the evaluation of such new terms, is discussed further in [249].

4.2 Methods

We now describe the various steps of the docking process that we implemented in our study.

4.2.1 Protein Preparation

γ-Secretase is a high-molecular-weight multicomponent complex containing presenilin, nicastrin, aph-1, and pen-2. Presenilin is the catalytic subunit, which contains separate substrate-binding and catalytic sites. The three-dimensional structure of the protein was downloaded in pdb format from the Protein Data Bank [250], with the protein ID being 5FN2. The original structure was then prepared and refined. The refinement included charge assignment and bond-order assignment. In addition, hydrogen atoms were added to the heavy atoms. Selenomethionines were converted to methionines, and all the water molecules were deleted. The minimization of the energy of the protein was carried out using the OPLS force field. All the computations were performed using the Schrödinger-Maestro program [248].

4.2.2 Ligand Preparation

The compounds were retrieved from our neural network computation described in the previous chapter. More than 400 inhibitors were identified as the potentially active inhibitors for the target protein. Their 3D structures were generated and energy-minimized with the Schrödinger Suite using the OPLS force field. Ionization states were generated at a pH of 7.0 using Epik.

4.2.3 Generation of the receptor grid

The receptor grids were generated for the prepared protein such that the various ligand poses bind within the predicted active site during docking. Using Glide, the grids were generated by keeping the default parameters, namely, a van der Waals scaling factor of 1.0 and a charge cutoff of 0.25, in the OPLS force field.
A cubic box of specific dimensions, centered on the centroid of the active site residues, was generated for the receptor. The bounding box was set to be 20 Å × 20 Å × 20 Å for the docking experiments.

4.2.4 Binding Site

The reaction is inhibited either through a physical mechanism, in which the inhibitor binds in the region between the substrate-binding and the catalytic sites and forms a physical blockage to the substrate movement, or through a conformational mechanism, by which the structure between the substrate-binding and the catalytic sites is distorted (for example, stretched), such that the substrate movement into the catalytic site is no longer possible. The catalytic sites of γ-secretase are ALA 246, ASP 257, ASP 385, GLY 382, LEU 150, and so on. If the inhibitor interacts with the catalytic site of the target, it will reduce its activity and change the protein conformation.

4.2.5 Glide Standard Precision for Ligand Docking

Around each binding site, SP flexible ligand docking was carried out using Glide, within which penalties were applied to amide bonds. As mentioned before, the van der Waals scaling factor and the partial charge cutoff were selected to be 1.0 and 0.25, respectively, for the ligand's atoms. The final scoring was performed on the energy-minimized poses and displayed as the Glide score. The best docked pose, with the lowest Glide score, was recorded for each ligand.

4.3 Result

In what follows we present our results and discuss their implications.

4.3.1 Binding energy

Docking to all the selected binding sites was performed for both the original and the potential inhibitors. The binding scores for each of the binding sites of the receptor are shown in Table 4.1. The first column refers to the amino acid that acts as the binding site, while the second and third columns give the range of the binding energy of the original and the new inhibitors, respectively. The distributions of the binding scores of the original and new inhibitors are shown in Figure 4.1.
The blue histograms refer to the original inhibitors, while the red histograms represent the new potential ones. The binding scores of both kinds of inhibitors are in the same range and have similar distributions, indicating that the two kinds of inhibitors have similar binding affinities to the residues.

Binding site | Original inhibitors | New inhibitors
ALA 246 | -2.2 ≥ E_bind ≥ -5.6 | -3.5 ≥ E_bind ≥ -5.9
ASP 257 | -3.5 ≥ E_bind ≥ -8 | -4.9 ≥ E_bind ≥ -8.4
ASP 385 | -4.6 ≥ E_bind ≥ -9.9 | -5.9 ≥ E_bind ≥ -8.8
GLY 382 | -4.1 ≥ E_bind ≥ -10 | -5.8 ≥ E_bind ≥ -9.2
LEU 150 | -3.2 ≥ E_bind ≥ -8.3 | -4 ≥ E_bind ≥ -8.4

TABLE 4.1: Binding energy range of the original and new potential inhibitors.

FIGURE 4.1: Distribution of both the original and new inhibitors.

4.3.2 Top-score inhibitors of ASP 257

Molecular docking was performed with all the potential inhibitors around the ASP 257 residue. The docking interactions were analyzed by the Glide score and the binding free energy. The binding affinity between the protein and the ligand was evaluated by several polar and non-polar interactions, such as the H-bonds and the electrostatic, van der Waals, and hydrophobic interactions, for each ligand. Ligand C21H29F4N5O3 (ChemSpider ID 34950164) turned out to be the best docking ligand, with the lowest docking score, -8.397 kcal/mol. The top five docking ligands with the lowest docking scores are C21H29F4N5O3 (ChemSpider ID 34950164), C23H23F3N4O (ChemSpider ID 23260671), C21H21F3N4O (ChemSpider ID 28516002), C27H30N2O2 (ChemSpider ID 23264436), and C22H26F2N3O2 (ChemSpider ID 34450518), with docking scores, respectively, of -8.397, -8.129, -7.594, -7.561, and -7.396, and binding energies of -39.852, -50.408, -45.330, -44.685, and -49.299, as shown in Table 4.2. The negative and low values of the free energy of binding demonstrate strong, favorable binding between the ligands and the binding site in the most favorable conformations.
Weak interaction energies, such as the van der Waals and electrostatic energies, are also recorded here. The existence of the electrostatic interaction between a charged ligand and a receptor's binding site helps provide the thermodynamic driving force for the formation of protein-ligand complexes. The van der Waals interaction energy arises from fluctuations in the electron cloud of an atom, which produce a transient dipole moment that affects the electron clouds of the nearby atoms.

Ligand | Docking score | Glide energy | VDW energy | Electrostatic energy
C21H29F4N5O3 | -8.397 | -39.852 | -34.365 | -5.487
C23H23F3N4O | -8.129 | -50.408 | -44.089 | -6.318
C21H21F3N4O | -7.594 | -45.330 | -40.957 | -4.373
C27H30N2O2 | -7.561 | -44.685 | -43.443 | -1.242
C22H26F2N3O2 | -7.396 | -49.299 | -42.665 | -6.634

TABLE 4.2: List of the important binding energies between top inhibitors and the residue ASP257.

Figure 4.2 and Table 4.3 show the 2D structure and the absorption, distribution, metabolism and excretion (ADME) properties of each ligand. In-silico prediction of the ADME properties has become important in drug selection and in determining the rate of success of a drug for human therapeutic use. Therefore, these physicochemical descriptors were calculated so as to determine the ADME properties of the drugs. Molecular weight (MW) is an important property in small-molecule drug discovery, which impacts various molecular events, such as absorption, the bile elimination rate, blood-brain barrier penetration, interactions with targets, and so on. Lipophilicity, characterized here by the computed log(P) and log(D) values, plays a crucial role in determining several ADME parameters, as well as potency. Here P is the partition coefficient, which describes the propensity of a neutral (uncharged) compound to dissolve in an immiscible biphasic system of lipid and water. A negative value of log(P) means that the compound has a higher affinity for the aqueous phase, i.e., it is more hydrophilic.
If log(P) = 0, the compound is equally partitioned between the lipid and aqueous phases, while log(P) > 0 denotes a higher concentration in the lipid phase. For ionizable solutes, the compound may exist as a variety of different species in each phase at any given pH. The distribution coefficient, D, is the appropriate descriptor for ionizable compounds, since it is a measure of the pH-dependent differential solubility of all the species in the octanol/water system, and is typically used in its logarithmic form, log(D). For example, solubility and metabolism are more likely to be compromised at high lipophilicity values, whereas permeability may decrease when lipophilicity is too low. The numbers of hydrogen bond (HB) acceptors and HB donors are the other important parameters related to the compounds' polarity and permeability. It was found that the HB-donor count may be more crucial than the HB-acceptor count for drug development, and may perhaps be related to the efforts for enhancing membrane permeability. Lipinski's rule of five is based on the observation that orally administered drugs tend to have a molecular weight of 500 or less, no more than 5 HB donors, no more than 10 HB acceptors, and log(P) no greater than 5. Molecules that violate more than one of the four rules may have problems with bioavailability, and may be poor at absorption or permeation as oral drugs. As shown in Table 4.3, none of the inhibitors violated Lipinski's rule of five.

The polar surface area (PSA), or the related topological polar surface area (TPSA), is another commonly investigated descriptor related to hydrogen bonding that is important for estimating permeability and oral bioavailability. It has been found that permeability and oral bioavailability decrease when the TPSA increases, and ideal orally administered drugs usually have a TPSA below 140 Å². Molecular complexity is another property known to influence such events as solubility, permeability, and promiscuity.
This measure accounts for the number of rings and aromatic rings, the fraction of carbons that are sp3 hybridized (Fsp3), or the number of stereocenters. More than three aromatic rings in a molecule may result in an increased risk of toxicity. The average Fsp3 value has been shown to correlate positively with success in drug development.

Property | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4 | Ligand 5
Formula | C21H29F4N5O3 | C23H23F3N4O | C21H21F3N4O | C27H30N2O2 | C22H26F2N3O2
MW | 475.48 | 428.45 | 402.41 | 414.54 | 401.45
log(P) | 0.71 | 2.54 | 2.35 | 4.33 | 2.43
log(D) | 1.01 | 1.61 | 1.43 | 3.81 | 1.68
TPSA | 114.27 | 65.77 | 74.98 | 57.18 | 75.43
Rotatable bonds | 6 | 5 | 5 | 8 | 6
HB donors | 3 | 2 | 2 | 2 | 3
HB acceptors | 8 | 5 | 5 | 4 | 5
Rings | 2 | 3 | 3 | 4 | 3
Total charge | 1 | 1 | 1 | 1 | 1
Fsp3 | 0.71 | 0.3 | 0.3 | 0.3 | 0.36
Solubility (mg/l) | 25879 | 7655 | 10323 | 3316 | 10662
Oral bioavailability | Good | Good | Good | Good | Good

TABLE 4.3: Chemical properties of the top ligands.

[Figure 4.2 panels: (A) C21H29F4N5O3, ChemSpider ID 34950164; (B) C23H23F3N4O, ChemSpider ID 23260671; (C) C21H21F3N4O, ChemSpider ID 28516002; (D) C27H30N2O2, ChemSpider ID 23264436; (E) C22H26F2N3O2, ChemSpider ID 34450518]

FIGURE 4.2: Chemical structure and properties of top five potential inhibitors

All the protein-ligand interactions are illustrated and analyzed in Figure 4.3 and Table 4.4. The 2D maps of the docking position and the protein-ligand interactions for the top five molecules are depicted in Figure 4.3. The docking scores, ligand interfaces, and bonding interactions indicate that the top five ligand leads are more specific inhibitors at the active residue Asp257 of the protein.

The cavity involving the specific Asp257 site is potentially involved in Alzheimer's disease in brains.
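Returning to the ADME screen, the Lipinski criteria can be checked directly against the values listed in Table 4.3. The sketch below is only an illustration: the function name and data layout are hypothetical, and the thresholds are taken in their commonly quoted inclusive form (MW ≤ 500, log(P) ≤ 5, at most 5 HB donors, at most 10 HB acceptors), which is an assumption about how the rule was applied in this work.

```python
def lipinski_violations(mw, log_p, hb_donors, hb_acceptors):
    """Return the names of the rule-of-five criteria that a compound violates."""
    checks = {
        "MW <= 500": mw <= 500,
        "log(P) <= 5": log_p <= 5,
        "HB donors <= 5": hb_donors <= 5,
        "HB acceptors <= 10": hb_acceptors <= 10,
    }
    return [name for name, passed in checks.items() if not passed]


# (MW, log(P), HB donors, HB acceptors) for the five ligands of Table 4.3.
TOP_LIGANDS = {
    "C21H29F4N5O3": (475.48, 0.71, 3, 8),
    "C23H23F3N4O": (428.45, 2.54, 2, 5),
    "C21H21F3N4O": (402.41, 2.35, 2, 5),
    "C27H30N2O2": (414.54, 4.33, 2, 4),
    "C22H26F2N3O2": (401.45, 2.43, 3, 5),
}

if __name__ == "__main__":
    for formula, properties in TOP_LIGANDS.items():
        violations = lipinski_violations(*properties)
        print(formula, "violations:", violations if violations else "none")
```

With these thresholds the five ligands pass all four criteria, in line with the statement above that none of them violates the rule of five.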
Previous studies have already indicated that presenilin (PS1) is responsible for γ-Secretase activity and endoproteolytic cleavage, because mutagenesis of the two putative catalytic residues of PS1, Asp257 and Asp385, abolished both. The docking results show that these ligands target the specific binding site. The intermolecular interactions formed between each compound and the specific residues, together with their distances, are presented in Table 4.4. The results indicate that the majority of the HB donors come from the protein residues, and that the corresponding acceptors were derived from the ligands. C21H29F4N5O3 formed hydrogen bonds with Asp257 and Thr147, and a salt bridge with Asp257. Similarly, C23H23F3N4O formed a hydrogen bond and a salt bridge with Asp257, and π-π stacking was observed with Phe283 and Phe388. This indicates the importance of the binding, since Phe283 is another active residue inside the cavity, and mutation of Phe283 elevates Aβ42. The interactions with these residues made a strong contribution to the binding affinity of this compound. All the rest of the inhibitors formed a hydrogen bond and a salt bridge with Asp257. According to the docking results, the interactions observed for the compounds are very important and could ensure a good degree of inhibition, indicating that they are promising inhibitors for the treatment of AD.
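The drug-likeness screening described in this section (Lipinski’s rule of five together with the TPSA cutoff of 140 Å²) can be sketched in a few lines of Python. This is only a minimal illustration and not the software workflow used in the study: the names `Ligand`, `lipinski_violations`, and `passes_oral_filter` are hypothetical, the thresholds are the conventional rule-of-five cutoffs (a violation when MW > 500, HB donors > 5, HB acceptors > 10, or log(P) > 5), and the descriptor values are copied from Table 4.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ligand:
    """Descriptors needed for the rule-of-five and TPSA checks (hypothetical container)."""
    formula: str
    mw: float          # molecular weight
    log_p: float       # octanol/water partition coefficient, log(P)
    hb_donors: int
    hb_acceptors: int
    tpsa: float        # topological polar surface area, in A^2


def lipinski_violations(lig: Ligand) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        lig.mw > 500,
        lig.hb_donors > 5,
        lig.hb_acceptors > 10,
        lig.log_p > 5,
    ])


def passes_oral_filter(lig: Ligand, max_violations: int = 1,
                       max_tpsa: float = 140.0) -> bool:
    """At most one rule-of-five violation and TPSA not above the cutoff (assumed filter)."""
    return lipinski_violations(lig) <= max_violations and lig.tpsa <= max_tpsa


# Values copied from Table 4.3 (top ligands for the Asp257 site).
TABLE_4_3 = [
    Ligand("C21H29F4N5O3", 475.48, 0.71, 3, 8, 114.27),
    Ligand("C23H23F3N4O", 428.45, 2.54, 2, 5, 65.77),
    Ligand("C21H21F3N4O", 402.41, 2.35, 2, 5, 74.98),
    Ligand("C27H30N2O2", 414.54, 4.33, 2, 4, 57.18),
    Ligand("C22H26F2N3O2", 401.45, 2.43, 3, 5, 75.43),
]

if __name__ == "__main__":
    for lig in TABLE_4_3:
        print(lig.formula, lipinski_violations(lig), passes_oral_filter(lig))
```

With the Table 4.3 values, every ligand has zero violations and a TPSA below 140 Å², consistent with the statement that none of the inhibitors violates the rule of five.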
FIGURE 4.3: Interaction of residue Asp257 and the top inhibitors. (A) C21H29F4N5O3; (B) C23H23F3N4O; (C) C21H21F3N4O; (D) C27H30N2O2; (E) C22H26F2N3O2.

| Ligand | Residue | Interaction | Distance (Å) |
|---|---|---|---|
| C21H29F4N5O3 | Asp257 | H-bond | 1.65 |
| | Asp257 | Salt bridge | 2.65 |
| | Thr147 | H-bond | 1.92 |
| C23H23F3N4O | Asp257 | H-bond | 1.69 |
| | Asp257 | Salt bridge | 2.71 |
| | Phe283 | π-π stacking | 4.10 |
| | Phe388 | π-π stacking | 5.81 |
| C21H21F3N4O | Asp257 | H-bond | 2.00 |
| | Asp257 | Salt bridge | 3.24 |
| C27H30N2O2 | Asp257 | H-bond | 1.81 |
| | Asp257 | Salt bridge | 3.01 |
| C22H26F2N3O2 | Asp257 | H-bond | 1.88 |
| | Asp257 | Salt bridge | 2.85 |

TABLE 4.4: List of interactions of residue Asp257 and top inhibitors

4.3.3 Top score inhibitors of Asp385

As mentioned earlier, two separate residues, Asp257 and Asp385, constitute the core of the catalytic sites. Thus, docking around the Asp385 residue was also investigated. Ligand C26H24F4N4O2 (ChemspiderID 28511635) had the lowest docking score, -8.873 kcal/mol. The top five ligands for Asp385 were C26H24F4N4O2 (ChemspiderID 28511635), C21H22F3N3O3S (ChemspiderID 26337947), C20H21F2N5O2 (ChemspiderID 34239324), C23H23F3N4O (ChemspiderID 25036606), and C22H21F3N4O4 (ChemspiderID 25036616), with docking scores of -8.873, -8.758, -8.665, -8.489, and -8.291 and binding energies of -49.205, -51.602, -50.556, -42.906, and -52.387, respectively. The binding energies, along with the van der Waals and electrostatic energies, are presented in Table 4.5. In order to evaluate the druggability of each ligand, the ADME properties were assessed to confirm the efficacy of the candidate molecules. The results are shown in Figure 4.4 and Table 4.6. The physically important descriptors and pharmaceutically relevant ADMET properties were evaluated using the QikProp4 module.
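Selecting the top-ranked ligands by docking score, as described above, amounts to sorting the candidates so that the most negative score comes first. The following Python sketch illustrates this with the five docking scores quoted in the text for the Asp385 site. The function name `rank_by_docking_score` is hypothetical, and the full candidate set screened in the study is not reproduced here.

```python
from typing import Dict, List, Tuple

# Docking scores for the Asp385 site, as quoted in the text
# (deliberately listed out of order to show the ranking step).
DOCKING_SCORES: Dict[str, float] = {
    "C23H23F3N4O": -8.489,
    "C26H24F4N4O2": -8.873,
    "C22H21F3N4O4": -8.291,
    "C21H22F3N3O3S": -8.758,
    "C20H21F2N5O2": -8.665,
}


def rank_by_docking_score(scores: Dict[str, float],
                          top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n ligands, most negative (most favorable) score first."""
    return sorted(scores.items(), key=lambda item: item[1])[:top_n]


if __name__ == "__main__":
    ranked = rank_by_docking_score(DOCKING_SCORES)
    for rank, (formula, score) in enumerate(ranked, start=1):
        print(f"{rank}. {formula}: {score:.3f}")
```

Sorting on the score alone is the simplest possible choice; a study may also weigh the binding energy or the interaction pattern when choosing leads.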
We found that all the ligands follow Lipinski’s rule and had reliable polarity for better permeation and absorption, as revealed by the HB donors and HB acceptors. Moreover, high solubility, as shown in Table 4.6, would enhance the absorption. The topological polar surface area (TPSA) is the surface area that belongs to the polar atoms in the compound. An increased TPSA is associated with diminished membrane permeability, so that a lower TPSA is favorable for drug-like properties.

| Ligands | Docking Score | Glide Energy | VDW Energy | Electrostatic Energy |
|---|---|---|---|---|
| C26H24F4N4O2 | -8.873 | -49.205 | -41.723 | -7.481 |
| C21H22F3N3O3S | -8.758 | -51.602 | -40.093 | -13.509 |
| C20H21F2N5O2 | -8.665 | -50.556 | -43.690 | -6.866 |
| C23H23F3N4O | -8.489 | -42.906 | -39.320 | -3.586 |
| C22H21F3N4O4 | -8.291 | -52.387 | -49.292 | -9.095 |

TABLE 4.5: List of the important binding energies between the top inhibitors and the residue Asp385.

| Property | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4 | Ligand 5 |
|---|---|---|---|---|---|
| Formula | C26H24F4N4O2 | C21H22F3N3O3S | C20H21F2N5O2 | C23H23F3N4O | C22H21F3N4O4 |
| MW | 500.40 | 453.48 | 400.40 | 428.45 | 462.42 |
| log(P) | 3.00 | 2.08 | 1.67 | 2.97 | -1.36 |
| log(D) | 2.11 | 0.86 | -0.53 | 1.94 | -3.01 |
| TPSA (Å²) | 81.15 | 122.57 | 98.11 | 65.77 | 129.54 |
| Rotatable bonds | 6 | 7 | 3 | 5 | 8 |
| HB donors | 2 | 4 | 3 | 2 | 4 |
| HB acceptors | 6 | 6 | 7 | 5 | 8 |
| Rings | 4 | 3 | 3 | 3 | 3 |
| Total charge | 1 | 1 | 1 | 1 | 0 |
| Fsp3 | 0.27 | 0.3 | 0.3 | 0.3 | 0.27 |
| Solubility (mg/l) | 4742 | 11581 | 13121 | 5987 | 10465 |
| Oral bio-availability | Good | Good | Good | Good | Good |

TABLE 4.6: Chemical properties of the top ligands.

FIGURE 4.4: The chemical structure and properties of the top inhibitors. (A) C26H24F4N4O2, ChemspiderID 28511635; (B) C21H22F3N3O3S, ChemspiderID 26337947; (C) C20H21F2N5O2, ChemspiderID 34239324; (D) C23H23F3N4O, ChemspiderID 25036606; (E) C22H21F3N4O4, ChemspiderID 25036616.

All the protein-ligand interactions near Asp385 are illustrated and analyzed in Figure 4.5 and Table 4.7.
The 2D maps of the docking positions and the protein-ligand interactions for the top five molecules are depicted in Figure 4.5. The results indicate that the ligands are bound within the active region of the target protein by forming hydrogen bonds. Favorable hydrogen-bond interactions between the ligands and the protein residues were encountered near the binding site. All the ligands formed hydrogen bonds and a salt bridge with Asp257. C20H21F2N5O2 and C22H21F3N4O4 formed a π-cation interaction with the residue Lys265, which is an attractive, noncovalent interaction between an aromatic ring and a cationic group, and plays an important role in stabilizing the inhibitor at the active site. Although the ligands did not exhibit direct bonding interactions with the active residue Asp385, the NH group within all the inhibitors formed an H-bond with the catalytic residue Asp257 in the binding pocket. Such strong H-bond interactions can well compensate for the loss of the key interaction with the other residue. Moreover, all the ligands occupy the active cavity near the two active residues, and the benzene rings in these ligands stretched into the hydrophobic pocket in the binding site. This suggests that the hydrophobic pocket formed by the other residues near Asp257 and Asp385 is very important for the binding affinity of the ligands.

4.4 Summary and Conclusions

To summarize this Chapter, the continued improvement of machine-learning methods in chemistry, which compete with standard approaches or expert skill, is poised to become a force for change in modern computational medicinal chemistry. In the present study, a deep-learning neural network was used to reveal novel inhibitors for the γ-Secretase protein. With the help of our docking studies, the inhibitors were docked to the active sites of the protein and the top five inhibitors were selected for further studies.
Bond interactions, such as H-bonds and π-π stacking, were successfully formed with the active sites. Pharmacological characteristics, such as H-donors, H-acceptors, toxicity, and metabolism, were analysed for the top-lead inhibitors. However, our theoretical predictions are only advisory and have to be carefully verified by clinical experiments.

FIGURE 4.5: Interaction of residue Asp385 and top inhibitors. (A) C26H24F4N4O2; (B) C21H22F3N3O3S; (C) C20H21F2N5O2; (D) C23H23F3N4O; (E) C22H21F3N4O4.

| Ligands | Residue | Interaction | Distance (Å) |
|---|---|---|---|
| C26H24F4N4O2 | Asp257 | H-bond | 1.88 |
| | Asp257 | Salt bridge | 2.71 |
| C21H22F3N3O3S | Asp257 | H-bond | 1.59 |
| | Asp257 | Salt bridge | 2.78 |
| C20H21F2N5O2 | Asp257 | H-bond | 1.73 |
| | Lys265 | π-cation | 5.00 |
| | Asp257 | Salt bridge | 2.71 |
| C23H23F3N4O | Asp257 | H-bond | 2.29 |
| | Asp257 | Salt bridge | 3.09 |
| C22H21F3N4O4 | Asp257 | H-bond | 1.66 |
| | Lys265 | π-cation | 5.46 |
| | Asp257 | Salt bridge | 2.68 |

TABLE 4.7: List of interactions of residue Asp385 and top inhibitors

4.5 Perspective and Future Studies

On the one hand, machine-learning techniques have been applied in medicinal chemistry from the initial stages of the drug design process, such as building 3D models of the target and predicting the binding site, all the way to the final ones, such as the selection of putative binder compounds to be submitted to biological assays. Nowadays, the trend is to combine the traditional scoring functions with machine-learning techniques in order to improve the performance of predicting biological activities, binding sites, and docking solutions for unknown compounds. Machine-learning techniques have also been used to predict interacting regions at protein-protein interfaces. Moreover, the design of inhibitors for such interfaces is considered a new paradigm in drug discovery.

On the other hand, our knowledge of γ-Secretase and its role in Alzheimer’s disease has increased dramatically in the past decades.
Moreover, the development of clinically useful γ-Secretase inhibitors is of critical importance for this protease in the Notch signaling pathway. However, most of the positive results in the AD animal models have not been recapitulated in clinical trials, and rodent models might have a greater ability to recover brain cells than the human brain.

4.6 Supporting Information

Here, we list the top inhibitors for other active sites, Ala246, Gly382, and Leu150.

| Ligands | ChemspiderID | Docking Score | Glide Energy | VDW Energy | Electrostatic Energy |
|---|---|---|---|---|---|
| C22H24F3N5O3S | 34950169 | -5.919 | -51.947 | -35.878 | -16.068 |
| C30H30N8O2 | 8657651 | -5.477 | -52.879 | -46.513 | -6.366 |
| C21H19Cl2N3O2 | 28501533 | -5.385 | -40.189 | -25.027 | -15.163 |
| C22H27F2N3O3S | 26388788 | -5.161 | -34.324 | -24.134 | -10.187 |
| C20H29N3O3 | 10672068 | -5.141 | -31.156 | -21.893 | -9.263 |

TABLE 4.8: List of the important binding energies between the top inhibitors and residue Ala246

| Property | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4 | Ligand 5 |
|---|---|---|---|---|---|
| Formula | C22H24F3N5O3S | C30H30N8O2 | C21H19Cl2N3O2 | C22H27F2N3O3S | C20H29N3O3 |
| MW | 405.52 | 534.61 | 416.3 | 397.53 | 359.46 |
| log(P) | 0.77 | 3.18 | 3.92 | 3.12 | 1.09 |
| log(D) | 1.15 | 2.25 | 3.18 | 1.14 | 1.03 |
| TPSA (Å²) | 139.35 | 118.48 | 73.2 | 102.3 | 69.65 |
| Rotatable bonds | 6 | 5 | 5 | 4 | 3 |
| HB donors | 2 | 2 | 1 | 3 | 2 |
| HB acceptors | 8 | 10 | 5 | 5 | 6 |
| Rings | 3 | 4 | 3 | 3 | 2 |
| Total charge | 1 | 1 | 0 | 1 | 1 |
| Fsp3 | 0.5 | 0.3 | 0.2 | 0.45 | 0.65 |
| Solubility (mg/l) | 21686 | 3284 | 3603 | 6324 | 25700 |
| Oral bio-availability | Good | Good | Good | Good | Good |

TABLE 4.9: Chemical properties of top ligands

| Ligands | Residue | Interaction | Distance (Å) |
|---|---|---|---|
| C22H24F3N5O3S | Arg108 | H-bond | 2.01 |
| | Thr107 | H-bond | 2.32 |
| | Trp244 | π-π stacking | 4.11 |
| C30H30N8O2 | Glu243 | H-bond | 2.23 |
| | Trp244 | H-bond | 2.67 |
| | Trp244 | π-cation | 4.15 |
| | Arg108 | π-cation | 3.35 |
| | Lys395 | π-cation | 3.58 |
| C21H19Cl2N3O2 | Glu243 | H-bond | 2.08 |
| | Arg108 | H-bond | 1.74 |
| | Lys395 | H-bond | 2.17 |
| C22H27F2N3O3S | Trp244 | π-cation | 4.85 |
| | Asp450 | H-bond | 1.69 |
| C20H29N3O3 | Asp110 | H-bond | 1.88 |
| | Asp110 | Salt bridge | 2.87 |
| | Arg108 | H-bond | 1.74 |

TABLE 4.10: List of interactions of residue Ala246 and top inhibitors

FIGURE 4.6: Chemical structure and properties of top inhibitors. (A) C20H29N3O3, ChemspiderID 10672068; (B) C21H19Cl2N3O2, ChemspiderID 28501533; (C) C22H24F3N5O3S, ChemspiderID 34950169; (D) C22H27F2N3O3S, ChemspiderID 26388788; (E) C30H30N8O2, ChemspiderID 8657651.

FIGURE 4.7: Interaction of the residue Ala246 with the top inhibitors. (A) C20H29N3O3; (B) C21H19Cl2N3O2; (C) C22H24F3N5O3S; (D) C30H30N8O2; (E) C22H27F2N3O3S.

| Ligands | ChemspiderID | Docking Score | Glide Energy | VDW Energy | Electrostatic Energy |
|---|---|---|---|---|---|
| C32H32O6 | 23310545 | -9.213 | -54.637 | -42.843 | -11.794 |
| C27H30N2O2 | 23264436 | -8.457 | -47.719 | -44.921 | -2.798 |
| C27H31F3N4O5 | 25037466 | -8.111 | -55.707 | -53.056 | -2.650 |
| C20H19F3N6O2 | 24631662 | -7.965 | -49.449 | -44.705 | -4.745 |
| C21H22F3N3O3S | 26337947 | -7.902 | -49.807 | -45.076 | -4.731 |

TABLE 4.11: List of the important binding energies between the top inhibitors and residue Gly382

| Property | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4 | Ligand 5 |
|---|---|---|---|---|---|
| Formula | C32H32O6 | C27H30N2O2 | C27H31F3N4O5 | C20H19F3N6O2 | C21H22F3N3O3S |
| MW | 512.59 | 414.54 | 548.55 | 432.4 | 453.48 |
| log(P) | 6.94 | 4.33 | 3.08 | 1.29 | 2.08 |
| log(D) | 7.62 | 3.81 | 2.03 | 1.27 | 0.86 |
| TPSA (Å²) | 99.38 | 57.81 | 124.54 | 112.65 | 122.58 |
| Rotatable bonds | 7 | 8 | 12 | 6 | 7 |
| HB donors | 4 | 2 | 3 | 2 | 4 |
| HB acceptors | 6 | 4 | 9 | 8 | 6 |
| Rings | 4 | 4 | 3 | 4 | 3 |
| Total charge | 0 | 1 | 1 | 1 | 1 |
| Fsp3 | 0.25 | 0.3 | 0.41 | 0.35 | 0.33 |
| Solubility (mg/l) | 389.65 | 3316 | 5995 | 17817 | 15581 |
| Oral bio-availability | Good | Good | Good | Good | Good |

TABLE 4.12: Chemical properties of top ligands

FIGURE 4.8: Chemical structure and properties of top inhibitors. (A) C32H32O6, ChemspiderID 23310545; (B) C27H30N2O2, ChemspiderID 23264436; (C) C27H31F3N4O5, ChemspiderID 25037466; (D) C20H19F3N6O2, ChemspiderID 24631662; (E) C21H22F3N3O3S, ChemspiderID 26337947.

FIGURE 4.9: Interactions of residue Gly382 and the top inhibitors. (A) C32H32O6; (B) C27H30N2O2; (C) C27H31F3N4O5; (D) C20H19F3N6O2; (E) C21H22F3N3O3S.

| Ligands | Residue | Interaction | Distance (Å) |
|---|---|---|---|
| C32H32O6 | Asp257 | H-bond | 1.85 |
| | Thr256 | H-bond | 1.82 |
| C27H30N2O2 | Asp257 | H-bond | 2.49 |
| | Asp257 | Salt bridge | 2.89 |
| C27H31F3N4O5 | Lys265 | π-cation | 4.85 |
| C20H19F3N6O2 | Lys265 | π-cation | 4.43 |
| | Phe283 | π-π stacking | 5.41 |
| C21H22F3N3O3S | Asp257 | H-bond | 2.02 |

TABLE 4.13: List of interactions of residue Gly382 and the top inhibitors

| Ligands | ChemspiderID | Docking Score | Glide Energy | VDW Energy | Electrostatic Energy |
|---|---|---|---|---|---|
| C21H29F4N5O3 | 34950164 | -8.404 | -39.852 | -34.365 | -5.487 |
| C23H23F3N4O | 23260671 | -8.129 | -50.408 | -44.089 | -6.318 |
| C27H30N2O2 | 23264436 | -7.716 | -44.685 | -43.443 | -1.242 |
| C17H22FN3O3S | 8311335 | -7.123 | -44.167 | -38.477 | -5.690 |
| C20H19F3N6O2 | 24631662 | -7.032 | -48.757 | -43.48 | -5.277 |

TABLE 4.14: List of the important binding energies between the top inhibitors and the residue Leu150

FIGURE 4.10: Chemical structure and properties of the top inhibitors. (A) C21H29F4N5O3, ChemspiderID 34950164; (B) C23H23F3N4O, ChemspiderID 23260671; (C) C27H30N2O2, ChemspiderID 23264436; (D) C17H22FN3O3S, ChemspiderID 8311335; (E) C20H19F3N6O2, ChemspiderID 24631662.

| Property | Ligand 1 | Ligand 2 | Ligand 3 | Ligand 4 | Ligand 5 |
|---|---|---|---|---|---|
| Formula | C21H29F4N5O3 | C23H23F3N4O | C27H30N2O2 | C17H22FN3O3S | C20H19F3N6O2 |
| MW | 475.40 | 428.45 | 414.54 | 367.44 | 432.4 |
| log(P) | 0.71 | 2.58 | 4.33 | 0.7 | 1.29 |
| log(D) | 1.01 | 1.64 | 3.81 | 0.06 | 1.27 |
| TPSA (Å²) | 114.27 | 65.77 | 57.81 | 112.64 | 122.65 |
| Rotatable bonds | 6 | 5 | 8 | 5 | 6 |
| HB donors | 3 | 2 | 2 | 2 | 2 |
| HB acceptors | 8 | 5 | 4 | 6 | 8 |
| Rings | 2 | 3 | 4 | 2 | 4 |
| Total charge | 1 | 1 | 1 | 0 | 1 |
| Fsp3 | 0.71 | 0.3 | 0.3 | 0.53 | 0.35 |
| Solubility (mg/l) | 25879 | 7655 | 3316 | 35980 | 17817 |
| Oral bio-availability | Good | Good | Good | Good | Good |

TABLE 4.15: Chemical properties of top ligands

| Ligands | Residue | Interaction | Distance (Å) |
|---|---|---|---|
| C21H29F4N5O3 | Asp257 | H-bond | 1.65 |
| | Thr147 | H-bond | 1.92 |
| | Asp257 | Salt bridge | 2.65 |
| C23H23F3N4O | Asp257 | H-bond | 1.69 |
| | Phe283 | π-π stacking | 4.10 |
| | Phe388 | π-π stacking | 5.18 |
| | Asp257 | Salt bridge | 2.71 |
| C27H30N2O2 | Asp257 | H-bond | 1.81 |
| | Asp257 | Salt bridge | 5.01 |
| C17H22FN3O3S | Gly384 | H-bond | 2.26 |
| C20H19F3N6O2 | Phe283 | π-π stacking | 4.00 |

TABLE 4.16: List of the interactions of the residue Leu150 with the top inhibitors

FIGURE 4.11: Interaction of the residue Leu150 with the top inhibitors. (A) C21H29F4N5O3; (B) C23H23F3N4O; (C) C27H30N2O2; (D) C17H22FN3O3S; (E) C20H19F3N6O2.

Bibliography

[1] L. Pauling and R. B. Corey. “Atomic Coordinates and Structure Factors for Two Helical Configurations of Polypeptide Chains”. In: PNAS 37 (1951), pp. 235–240.
[2] L. Pauling and R. B. Corey. “The Structure of Synthetic Polypeptides”. In: PNAS 37 (1951), pp. 241–250.
[3] L. Pauling and R. B. Corey. “The Pleated Sheet, A New Layer Configuration of Polypeptide Chains”. In: PNAS 37 (1951), pp. 251–256.
[4] L. Pauling and R. B. Corey. “The Structure of Feather Rachis Keratin”. In: PNAS 37 (1951), pp. 257–261.
[5] L. Pauling and R. B. Corey. “The Structure of Hair, Muscle, and Related Proteins”. In: PNAS 37 (1951), pp. 261–271.
[6] L. Pauling and R. B. Corey. “The Structure of Fibrous Proteins of the Collagen-Gelatin Group”. In: PNAS 37 (1951), pp. 272–281.
[7] L. Pauling and R. B. Corey. “Polypeptide-Chain Configuration in Hemoglobin and Other Globular Proteins”. In: PNAS 37 (1951), pp. 282–285.
[8] L. Pauling and R. B. Corey. “A Three-dimensional Model of the Myoglobin Molecule Obtained by X-ray Analysis”. In: Nature 181 (1958), pp. 662–666.
[9] C. B. Anfinsen and E. Haber. “Studies on the Reduction and Re-formation of Protein Disulfide Bonds”. In: J. Biol. Chem 236 (1961), pp. 1361–1363.
[10] M. Sela C. B. Anfinsen E. Haber and F. H. White Jr.
“The Kinetics of Formation of Native Ribonuclease During Oxidation of the Reduced Polypeptide Chain”. In: Proc. Natl. Acad. Sci 47 (1961), pp. 1309–1314.
[11] C. B. Anfinsen. “Principles that Govern the Folding of Protein Chains”. In: Science 181 (1973), pp. 223–230.
[12] C. B. Anfinsen. “Studies on the Principles That Govern the Folding of Protein Chains”. In: Les Prix Nobel en 1972, Nobel Foundation 181 (1973), pp. 103–119.
[13] C. Levinthal. “Are There Pathways for Protein Folding?” In: J. Chem. Phys 65 (1968), pp. 44–45.
[14] C. Ramakrishnan G. N. Ramachandran and V. Sasisekharan. “Perspective on "Stereochemistry of Polypeptide Chain Conformations"”. In: J Mol Biol. 7 (1963), pp. 95–99.
[15] G. N. Ramachandran and V. Sasisekharan. “Conformation of Polypeptides and Proteins”. In: Adv. Protein Chem 235 (1968), pp. 283–438.
[16] Nicholas C. Fitzkee and George D. Rose. “Reassessing Random-Coil Statistics in Unfolded Proteins”. In: Proc. Nat. Acad. Sci. 101 (2004), pp. 12497–12502.
[17] A. D. Miranker and C. M. Dobson. “Collapse and Cooperativity in Protein Folding”. In: Curr. Opin. Struc. Biol. 6 (1996), pp. 31–42.
[18] H. A. Scheraga M. H. Hao. “How optimization of potential functions affects protein folding”. In: Proc. Natl. Acad. Sci. 93 (1996), pp. 4984–4989.
[19] N. D. Socci J. D. Bryngelson J. N. Onuchic. “Funnels, Pathways, and the Energy Landscape of Protein Folding: A Synthesis”. In: Proteins: Structure, Function, and Genetics 21 (1995), pp. 167–195.
[20] J. N. Onuchic P. G. Wolynes and D. Thirumalai. “Navigating the Folding Routes”. In: Science 267 (1995), pp. 1619–1620.
[21] K. A. Dill and H. S. Chan. “From Levinthal to Pathways to Funnels”. In: Nature Struct. Biol 4 (1997), pp. 10–19.
[22] B. Honig and A. S. Yang. “Free Energy Balance in Protein Folding”. In: Advances in Protein Chemistry 46 (1995), pp. 27–58.
[23] S. Jana C. Chakraborty S. Nandi. “Prion Disease: A Deadly Disease for Protein Misfolding”. In: Current Pharmaceutical Biotechnology 6 (2005), pp. 167–177.
[24] C. M. Dobson. “Protein Folding and Misfolding”. In: Nature 426 (2003), pp. 884–890.
[25] E. M. Boczko and C. L. Brooks. “First-Principles Calculation of the Folding Free-Energy of a 3-Helix Bundle Proteins”. In: Science 269 (1995), pp. 393–396.
[26] F. B. Sheinerman and C. L. Brooks. “Calculations on Folding of Segment B1 of Streptococcal Protein G”. In: J. Mol. Biol. 278 (1998), pp. 439–456.
[27] B. D. Bursulaya and C. L. Brooks. “Folding Free Energy Surface of a Three-Stranded Beta-Sheet Protein”. In: J. Am. Chem. Soc. 121 (1999), pp. 9947–9951.
[28] J. E. Mertz D. J. Tobias and C. L. Brooks. “Nanosecond Time Scale Folding Dynamics of a Pentapeptide in Water”. In: Biochemistry 30 (1991), pp. 6054–6058.
[29] D. Seebach X. Daura B. Jaun. “Reversible Peptide Folding in Solution by Molecular Dynamics Study”. In: J. Mol. Biol. 280 (1998), pp. 925–932.
[30] URL: http://folding.stanford.edu/.
[31] J. N. Onuchic Shea J. E. “Energetic frustration and the nature of the transition state in protein folding”. In: J. Chem. Phys 113 (2000), pp. 7663–7671.
[32] A. E. Garcia Nymeyer H. “Folding funnels and frustration in off-lattice minimalist protein landscapes”. In: Proc. Natl. Acad. Sci. USA 95 (1998), pp. 5921–5928.
[33] A. Jewett Baumketner A. “Effects of confinement in chaperonin assisted protein folding: rate enhancement by decreasing the roughness of the folding energy landscape”. In: J. Mol. Biol 332 (2003), pp. 701–713.
[34] D. J. Sheeler Friedel M. “Effects of confinement and crowding on the thermodynamics and kinetics of folding of a minimalist beta-barrel protein”. In: J. Chem. Phys 118 (2003), pp. 8106–8113.
[35] D. K. Klimov Thirumalai D. “Caging helps proteins fold”. In: Proc. Natl. Acad. Sci. USA 100 (2003), pp. 11195–11197.
[36] D. Newfield Klimov D. K. “Simulations of beta-hairpin folding confined to spherical pores using distributed computing”. In: Proc. Natl. Acad. Sci. USA 99 (2002), pp. 8019–8024.
[37] Pande VS Lucent D Vishal V.
“Protein folding under confinement: A role for solvent”. In: Proc Natl Acad Sci USA 104 (2007), pp. 10430–10434.
[38] Thirumalai D Cheung MS Klimov D. “Molecular crowding enhances native state stability and refolding rates of globular proteins”. In: Proc Natl Acad Sci USA 102 (2005), pp. 4753–4758.
[39] Zhou HX. “Protein folding in confined and crowded environments”. In: Arch Biochem Biophys 469 (2008), pp. 76–82.
[40] de Pablo JJ Rathore N Knotts TA. “Confinement effects on the thermodynamics of protein folding: Monte Carlo simulations”. In: Biophys J 90 (2006), pp. 1767–1773.
[41] Raphael E Sakaue T. “Polymer chains in confined spaces and flow-injection problems: Some remarks”. In: Macromolecules 39 (2006), pp. 2621–2628.
[42] Wolynes PG Socci ND Onuchic JN. “Diffusive dynamics of the reaction coordinate for protein folding funnels”. In: J Chem Phys 104 (1996), pp. 5860–5868.
[43] Wolynes PG Oliveberg M. “The experimental survey of protein-folding energy landscapes”. In: Q Rev Biophys 38 (2005), pp. 245–288.
[44] Hummer G Best RB. “Diffusive model of protein folding dynamics with Kramers turnover in rate”. In: Phys Rev Lett 96 (2006), pp. 228104–228107.
[45] Dobson CM van den Berg B Ellis RJ. “Effects of macromolecular crowding on protein folding and aggregation”. In: EMBO J 18 (1999), pp. 6927–6933.
[46] Minton AP. “Implications of macromolecular crowding for protein assembly”. In: Curr Opin Struct Biol 10 (2000), pp. 34–39.
[47] Ellis RJ. “Macromolecular crowding: Obvious but underappreciated”. In: Trends Biochem Sci 26 (2001), pp. 597–604.
[48] Minton AP. “Effects of a concentrated “inert” macromolecular cosolute on the stability of a globular protein with respect to denaturation by heat and by chaotropes: A statistical-thermodynamic model”. In: Biophys J 78 (2001), pp. 101–109.
[49] Valentine JS Eggers DK. “Crowding and hydration effects on protein conformation: A study with sol-gel encapsulated proteins”. In: J Mol Biol 314 (2001), pp. 911–922.
[50] Minton AP Ellis RJ. “Cell biology: Join the crowd”. In: Nature 425 (2003), pp. 27–28.
[51] Shea JE Friedel M Sheeler DJ. “Effects of confinement and crowding on the thermodynamics and kinetics of folding of a minimalist β-barrel protein.” In: J Chem Phys 118 (2003), pp. 8106–8113.
[52] Zhou HX. “Protein folding and binding in confined spaces and in crowded solutions”. In: J Mol Recognit 17 (2004), pp. 368–375.
[53] Thirumalai D Cheung MS Klimov D. “Molecular crowding enhances native state stability and refolding rates of globular proteins”. In: Proc Natl Acad Sci USA 102 (2005), pp. 4753–4758.
[54] Truskett TM Cheung J. “Coarse-grained strategy for modeling protein stability in concentrated solutions”. In: Biophys J 89 (2005), pp. 2372–2384.
[55] Cheung MS Stagg L Zhang SQ. “Molecular crowding enhances native structure and stability of α/β protein flavodoxin”. In: Proc Natl Acad Sci USA 104 (2007), pp. 18976–18981.
[56] Thirumalai D Betancourt MR. “Exploring the kinetic requirements for enhancement of protein folding rates in the GroEL cavity”. In: J Mol Biol 287 (1999), pp. 627–644.
[57] Thirumalai D Betancourt MR. “Stabilization of proteins in confined spaces”. In: Biochemistry 40 (2001), pp. 11289–11293.
[58] Valentine JS Eggers DK. “Molecular confinement influences protein structure and enhances thermal protein stability”. In: Protein Sci 10 (2001), pp. 250–261.
[59] Thirumalai D Klimov DK Newfield D. “Simulations of β-hairpin folding confined to spherical pores using distributed computing”. In: Proc Natl Acad Sci USA 99 (2002), pp. 8019–8024.
[60] Takada S Takagi F Koga N. “How protein thermodynamics and folding mechanisms are altered by the chaperonin cage: Molecular simulations”. In: Proc Natl Acad Sci USA 100 (2003), pp. 11367–11372.
[61] Lorimer G Thirumalai D Klimov D. “Caging helps proteins fold”. In: Proc Natl Acad Sci USA 100 (2003), pp. 11195–11197.
[62] Shea JE Baumketner A Jewett A.
“Effects of confinement in chaperonin assisted protein folding: Rate enhancement by decreasing the roughness of the folding energy landscape”. In: J Mol Biol 332 (2003), pp. 701–713.
[63] Kelly G Bolis D Politou AS. “Protein stability in nanocages: A novel approach for influencing protein stability by molecular confinement”. In: J Mol Biol 336 (2004), pp. 203–212.
[64] Shea JE Jewett AI Baumketner A. “Accelerated folding in the weak hydrophobic environment of a chaperonin cavity: Creation of an alternate fast folding pathway”. In: Proc Natl Acad Sci USA 101 (2004), pp. 13192–13197.
[65] Thirumalai D Ziv G Haran G. “Ribosome exit tunnel can entropically stabilize α-helices”. In: Proc Natl Acad Sci USA 102 (2005), pp. 18956–18961.
[66] Ban N Nissen P Hansen J. “The structural basis of ribosome activity in peptide bond synthesis”. In: Science 289 (2000), pp. 920–930.
[67] Dill KA Chan HS. “A simple model of chaperonin-mediated protein folding”. In: Proteins Struct Funct Biol 24 (1998), pp. 345–351.
[68] Lorimer G Thirumalai D. “Chaperonin mediated protein folding”. In: Annu Rev Biophys Biomol Struct 30 (2001), pp. 245–269.
[69] Hayer-Hartl M Hartl FU. “Molecular chaperones in the cytosol: From nascent chain to folded protein”. In: Science 295 (2002), pp. 1852–1858.
[70] D. K. Eggers and J. S. Valentine. “Molecular Confinement Influences Protein Structure and Enhances Thermal Protein Stability”. In: Protein Sci 10 (2001), pp. 250–261.
[71] R. Lewontin A. Griffiths S. Wessler. “W. H. Freeman and Company”. In: Introduction to Genetic Analysis (2005).
[72] J. Frydman and F. U. Hartl. “Principles of chaperone-assisted protein folding: differences between in vitro and in vivo mechanisms”. In: Science 272 (1996), pp. 1497–1502.
[73] Y. Kashi Fenton W. A. “Residues in chaperonin GroEL required for polypeptide binding and release”. In: Nature 371 (2006), pp. 614–619.
[74] M. Gross Robinson C. V. “Conformation of GroEL-bound alpha-lactalbumin probed by mass-spectrometry”.
In: Nature 372 (1994), pp. 646–651.
[75] A. M. Roseman Chen S. “Location of a folding protein and shape changes in GroEL-GroES complexes imaged by cryoelectron microscopy”. In: Nature 371 (1994), pp. 261–264.
[76] Z. Otwinowski Braig K. “The crystal-structure of the bacterial chaperonin GroEL at 2.8 Å”. In: Nature 371 (1994), pp. 578–586.
[77] M. J. Kerner A. Brinker G. Pfeifer. “Dual Function of Protein Confinement in Chaperonin-Assisted Protein Folding”. In: Cell 107 (2001), pp. 223–233.
[78] H. X Zhou. “Loops, linkages, rings, catenanes, cages, and crowders: entropy-based strategies for stabilizing proteins”. In: Acc. Chem. Res 37 (2004), pp. 123–130.
[79] H. X Zhou. “Protein folding and binding in confined spaces and in crowded solutions”. In: J. Mol. Recognit 17 (2004), pp. 368–375.
[80] N. Koga Takagi F. “How protein thermodynamics and folding mechanisms are altered by the chaperonin cage: molecular simulations”. In: Proc. Natl. Acad. Sci. USA 100 (2003), pp. 11367–11372.
[81] J. M. Yuan Ping G. “Effects of confinement on protein folding and protein stability”. In: J. Chem. Phys 118 (2003), pp. 8042–8048.
[82] R. J Ellis. “Protein folding: importance of the Anfinsen cage”. In: Curr. Biol 13 (2003), R881–R883.
[83] M. R. Betancourt and D. Thirumalai. “Exploring the kinetic requirements for enhancement of protein folding rates in the GroEL cavity”. In: J. Mol. Biol 287 (1999), pp. 627–644.
[84] A. Baumketner Jewett A. I. “Accelerated folding in the weak hydrophobic environment of a chaperonin cavity: creation of an alternate fast folding pathway”. In: Proc. Natl. Acad. Sci. USA 101 (2004), pp. 13192–13197.
[85] de Gennes PG. “Scaling Concepts in Polymer Physics”. (1979).
[86] Takada S Takagi F Koga N. “How protein thermodynamics and folding mechanisms are altered by the chaperonin cage: Molecular simulations”. In: Proc Natl Acad Sci USA 100 (2003), pp. 11367–11372.
[87] Jayachandran G Rhee YM Sorin EJ. In: Proc Natl Acad Sci USA 101 (2004), pp. 6456–6461.
[88] Pande VS Rhee YM. In: J Chem Phys 323 (2006), pp. 66–77.
[89] Pande VS Rhee YM. In: Science 298 (2002), pp. 1722–1723.
[90] Aluru NR Mashl RJ Joseph S. In: Nano Lett 3 (2003), pp. 589–592.
[91] Fayer MD Piletic IR Tan HS. In: J Phys Chem B Condens Matter Mater Surf Interfaces Biophys 109 (2005), pp. 21273–21284.
[92] Thirumalai D Ziv G Haran G. In: Proc Natl Acad Sci USA 102 (2005), pp. 18956–18961.
[93] Pande VS Sorin EJ. In: J Am Chem Soc 128 (2006), pp. 6316–6317.
[94] Gies H Ravindra R Zhao S Winter R. “Protein encapsulation in mesoporous silicate: the effects of confinement on protein stability, hydration, and volumetric properties”. In: J Am Chem Soc 126 (2004), pp. 12224–12225.
[95] Cannone F Campanini B Bologna S. “Unfolding of Green Fluorescent Protein mut2 in wet nanoporous silica gels”. In: Protein Sci 14 (2005), pp. 1125–1133.
[96] Hoffman BM Wheeler KE Nocek JM. “NMR spectroscopy can characterize proteins encapsulated in a sol-gel matrix”. In: J Am Chem Soc 128 (2008), pp. 14782–14783.
[97] Winter R Zhao S Gies H. “Stability of proteins confined in MCM-48 mesoporous molecular sieves - The effects of pH, temperature and co-solvents”. In: Z Phys Chem-Int J Res Phys Chem Chem Phys 221 (2007), pp. 139–154.
[98] Wand AJ Babu CR Hilser VJ. “Direct access to the cooperative substructure of proteins and the protein ensemble via cold denaturation”. In: Nat Struct Mol Biol 11 (2004), pp. 352–357.
[99] Tommos C Peterson RW Anbalagan K. “Forced folding and structural analysis of metastable proteins”. In: J Am Chem Soc 126 (2004), pp. 9498–9499.
[100] Wand AJ Shi ZS Peterson RW. “New reverse micelle surfactant systems optimized for high-resolution NMR Spectroscopy of encapsulated proteins”. In: Langmuir 21 (2005), pp. 10632–10637.
[101] Flynn PF Simorellis AK. “Fast local backbone dynamics of encapsulated ubiquitin”. In: J Am Chem Soc 128 (2006), pp. 9580–9581.
[102] Gai F Mukherjee S Chowdhury P.
“Tuning the cooperativity of the helix-coil transition by aqueous reverse micelles”. In: J Phys Chem B 110 (2006), pp. 11615–11619.
[103] Maitra A. “Determination of size parameters of water-aerosol OT-oil reverse micelles from their nuclear magnetic resonance data”. In: J Phys Chem 88 (1984), pp. 5122–5125.
[104] Deutsch C Lu JL. “Folding zones inside the ribosomal exit tunnel”. In: Nat Struct Mol Biol 12 (2005), pp. 1123–1129.
[105] Roeben A Tang YC Chang HC. “Structural features of the GroEL-GroES nano-cage required for rapid folding of encapsulated protein”. In: Cell 125 (2006), pp. 903–914.
[106] Minton AP Hayer-Hartl M. “A simple semiempirical model for the effect of molecular confinement upon the rate of protein folding”. In: Biochemistry 45 (2006), pp. 13356–13360.
[107] Zhou H Huang C Ren G. “A new method for purification of recombinant human α-synuclein in Escherichia coli”. In: Protein Expr Purif 42 (2005), pp. 173–177.
[108] Pielak GJ McNulty BC Young GB. “Macromolecular crowding in the Escherichia coli periplasm maintains alpha-synuclein disorder”. In: J Mol Biol 355 (2006), pp. 893–897.
[109] J. Tooze C. Branden. In: Introduction to Protein Structure, 2nd ed (1998).
[110] C. B. Anfinsen. In: Science 181 (1973), p. 223.
[111] J. Microbiol. A. P. J. Midelberg. In: Biotechnol 6 (1996), p. 225.
[112] J. S. Valentine D. K. Eggers. In: Protein Sci. 10 (2001), p. 250.
[113] M. Sahimi M. Dadvar. In: Chem. Eng. Sci. 1 (2000), p. 56.
[114] N. A. Chaniotakis S. Sotiropoulou V. Vamvakaki. In: Chem. Eng. Sci. 1674 (2005), p. 20.
[115] T. Coradin. In: J. Am. Chem. Soc. 13 (2001), p. 673.
[116] J. Ackerman C. Lei J. Liu. In: J. Am. Chem. Soc. 124 (2002), p. 11242.
[117] M. Wessling M.E. Avramescu. In: Biotechnol. Bioeng. 84 (2003), p. 564.
[118] L. M. Gierasch B. Krishnan J. Hong. In: Biopolymers 88 (2007), p. 157.
[119] R. Ellis. In: Curr. Biol. 13 (2003), p. 564.
[120] C. M. Dobson. In: Ann. Rev. Biochem. 75 (2006), p. 333.
[121] Jewett. In: Cell. Mol. Life Sci. 67 (2010), p. 255.
[122] H.
Zhou. In:Biochem.Biophys. 130 (2008), p. 76. [123] R. B. Best J. Mittala. In:Proc.Nat.Acad.Sci.U.S.A. 105 (2008), p. 20233. [124] H.X. Zhou. In:Acc.Chem.Res. 37 (2004), p. 123. [125] D. Thirumalai M. S. Cheung. In:J.Mol.Biol. 357 (2006), p. 632. [126] M. R. Rahimi Tabar L. Javidpour and M. Sahimi. In:J.Chem.Phys. 128 (2008), p. 115105. [127] L. Javidpour and M. Sahimi. In:J.Chem.Phys. 135 (2011), p. 125101. [128] Z. Guo and D. Thirumalai. In:J.Mol.Biol. 263 (1996), p. 323. [129] S. Sun. In:ProteinSci. 2 (1993), p. 762. [130] A. V . Smith and C. K. Hall. In:Proteins 44 (2001), p. 344. [131] D. Hall and A. P . Minton. In:Biochim.Biophys.Acta 1649 (2003), p. 127. [132] M. Sahimi L. Javidpour K.S. Shing. In:J.Chem.Phys. 145 (2016), p. 134306. [133] D. Rapaport. In:J.Chem.Phys. 71 (1979), p. 3299. [134] M. Vallieres J. Yuan. In:J.Chem.Phys. 118 (2003), p. 8042. [135] V . S. Pande E. J. Sorin. In:J.Am.Chem.Soc. 128 (2006), p. 6316. [136] M. Scheffler J. Neugebauer. In:J.Phys.Chem 107 (2003), p. 1432. 124 [137] H. L. Selzle D.Y. Yang. In:J.Phys.Chem 100 (2003), p. 12683. [138] R.B. Best A. Sirur M. Knott. In:J.Phys.Chem 14 (2014). [139] R. H. Swendsen A. M. Ferrenberg. In:Phys.Rev.Lett. 63 (1989), p. 1195. [140] Y. Okamoto. In:J.Phys.Chem 99 (1995), p. 11276. [141] S. W. Englander R. Horst W. A. Fenton. In: Proc. Natl. Acad. Sci. 104 (2007), p. 20788. [142] W. A. Fenton G. W. Farr. In:Proc.Natl.Acad.Sci. 104 (2007), p. 5342. [143] N. Wischnewski A. Roeben D. Wischnewski. In:Cell 125 (2006), p. 903. [144] S. H. Pfeilb H. Hofmanna F. Hillgera. In: Proc. Natl. Acad. Sci. U.S.A. 107 (2010), p. 11793. [145] W. Kabsch and C. Sander. “Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features.” In:Biopolymers 22 (Dec. 1983), pp. 2577–2637. [146] Wang S Fang XL Lu Y. “The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures”. In:JMedChem 47 (2004), pp. 2977–2980. 
[147] Johnson MA and Maggiora GM. “Concepts and Applications of Molec- ular Similarity”. In:Wiley,NewYork (1990). [148] Ripphausen P Stumpfe D and Bajorath J. “Virtual compound screen- ing in drug discovery”. In:FutureMedChem 4 (2012), pp. 593–602. [149] Krüger DM and Evers A. “Comparison of structure- and ligand-based virtual screening protocols considering hit list complementarity and enrichment factors”. In:ChemMedChem 5 (2010), pp. 148–158. [150] Mosyak L Grant JA and Nicholls A. “A shape-based 3-D scaffold hop- ping method and its application to a bacterial protein-protein interac- tion”. In:JMedChem 48 (2005), pp. 1489–1495. 125 [151] Krüger DM and Evers A. “Comparison of structure- and ligand-based virtual screening protocols considering hit list complementarity and enrichment factors”. In:ChemMedChem 5 (2010), pp. 148–158. [152] Semus SF Nevins N and Senger S. “A critical assessment of docking programs and scoring functions”. In: J Med Chem 49 (2006), pp. 5912– 1531. [153] Enyedy IJ and Egan WJ. “Can we use docking and scoring for hit-to- lead optimization?” In:JComputAidedMolDes 22 (2008), pp. 161–168. [154] Page CS and Bates PA. “Can MM-PBSA calculations predict the speci- ficities of protein kinase inhibitors?” In:JComputChem 27 (2006), pp. 1990– 2007. [155] Takada T Fukunishi H and Shimada J. “Bootstrap-based consensus scoring method for protein-ligand docking”. In: J Chem Inf Model 48 (2008), pp. 988–996. [156] Teramoto R and Fukunishi H. “Consensus scoring with feature se- lection for structure-based virtual screening”. In: J Chem Inf Model 48 (2008), pp. 288–295. [157] Marantz Y. “SeleXCS: a new consensus scoring algorithm for hit dis- covery and lead optimization”. In:JChemInfModel 49 (2009), pp. 623– 633. [158] von Grotthuss M Plewczynski D Łazniewski M. “VoteDock: consen- sus docking method for prediction of protein-ligand interactions”. In: JComputChem 32 (2011), pp. 568–581. [159] Peltason L Nisius B and Bajorath J. 
“Quo vadis, virtual screening? A comprehensive survey of prospective applications”. In:JMedChem 53 (2010), pp. 8461–8467. 126 [160] Ripphausen P Stumpfe D and Bajorath J. “Virtual compound screen- ing in drug discovery”. In:FutureMedChem 4 (2012), pp. 593–602. [161] Leonetti F Miscioscia TF Carotti A. “An integrated approach to ligand- and structure-based drug design: development and application to a series of serine protease inhibitors”. In: J Chem Inf Model 48 (2008), pp. 1211–1226. [162] Riganelli D Costantino G Cruciani G. “Generating optimal linear PLS estimations (GOLPE)–an advanced chemometric tool for handling 3d- QSAR problems. Quantitative”. In: Struct-Activity Relation 12 (1993), pp. 9–20. [163] Tanrikulu Y and Schneider G. “Pseudoreceptor models in drug de- sign: bridging ligand- and receptor-based virtual screening”. In: Nat RevDrugDiscov 7 (2008), pp. 667–677. [164] Peristera O Dobler M and SmieskoM. “VirtualToxLab - in silico pre- diction of the toxic potential of drugs and environmental chemicals: evaluation status and internet access protocol”. In: ALTEX 24 (2007), pp. 153–161. [165] Kottke T Klenner A and Seifert R. “Homology model adjustment and ligand screening with a pseudoreceptor of the human histamine H4 receptor”. In:ChemMedChem 4 (2009), pp. 820–827. [166] Wilson GL and Lill MA. “Integrating structure-based and ligand-based approaches for computational drug design”. In: Future Med Chem 3 (2011), pp. 735–750. [167] Acklin P Glick M and Davies JW. “Finding more needles in the haystack: A simple and efficient method for improving high-throughput dock- ing results”. In:JMedChem 47 (2004), pp. 2743–2749. 127 [168] Van Drie JH. “Computer-aided drug design: the next 20 years”. In: J ComputAidedMolDes 21 (2007), pp. 594–601. [169] Rosenblit AB Hiller SA Golender VE. “Cybernetic methods of drug design. I. Statement of the problem – the perceptron approach”. In: ComputBiomedRes 6(5) (1973), pp. 411–421. [170] Ichikawa H Aoyama T Suzuki Y. 
“Neural networks applied to structure- activity relationships.” In:JMedChem 33(3) (1990), pp. 905–908. [171] Burden FR Winkler DA. “Application of neural networks to large dataset QSAR, virtual screening, and library design.” In:MethodsMolBiol 201 (2002), pp. 325–367. [172] Palyulin VA Baskin II. “Neural networks as a method for elucidating structure-property relationships for organic compounds.” In: Russian ChemRev 72(7) (2003), pp. 629–649. [173] Zefirov NS Baskin II Palyulin VA. “Neural networks in building QSAR models.” In:MethodsMolBiol 458 (2008), pp. 137–158. [174] Rowe PH Dearden JC. “Use of artificial neural networks in the QSAR prediction of physicochemical properties and toxicities for REACH legislation.” In:MethodsMolBiol 1260 (2015), pp. 65–88. [175] Bengio Y. “Learning deep architectures for AI.” In:FoundationsTrends MachineLearning 2(1) (2009), pp. 1–127. [176] Hinton G LeCun Y Bengio Y. “Deep learning.” In: Nature 521(7553) (2015), pp. 436–444. [177] Salakhutdinov RR Hinton GE. “Reducing the dimensionality of data with neural networks.” In:Science 313(5786) (2006), pp. 504–507. [178] Vincent P Bengio Y Courville A. “Representation learning: a review and new perspectives.” In: IEEE Trans Pattern Anal Mach Intell 35(8) (2013), pp. 1798–1828. 128 [179] Bengio Y Glorot X Bordes A. “Deep sparse rectifier neural networks.” In:IntConfArtifIntelligenceStat 2011 (2011), pp. 315–323. [180] Krizhevsky A Srivastava N Hinton G. “Dropout: a simple way to pre- vent neural networks from overfitting.” In: J Machine Learn Res 15(1) (2014), pp. 1929–1958. [181] Bengio Y LeCun Y Bottou L. “Gradient-based learning applied to doc- ument recognition.” In:ProcIEEE 86(11) (1998). [182] Fukushima K. “Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in posi- tion.” In:BiolCybernetics 36 (1980), pp. 193–202. [183] Caruana R. “Multitask learning.” In: Mach Learn 28(1) (1997), pp. 41– 75. [184] Yang Q Pan SJ. 
“A survey on transfer learning.” In: IEEE Trans Knowl DataEng 22(10) (2010), pp. 1345–1359. [185] Zien A Chapelle O Schoelkopf B. “Semi-supervised learning.” In:Cam- bridge(MA):TheMITPress (2006). [186] Winkler DA Burden FR. “Optimal sparse descriptor selection for QSAR using Bayesian methods.” In:QSARCombSci 28 (2009), pp. 645–653. [187] Baskin II Varnek A. “Chemoinformatics as a theoretical chemistry dis- cipline.” In:MolInform 30(1) (2011), pp. 20–32. [188] Fourches D Cherkasov A Muratov EN. “QSAR modeling: where have you been? Where are you going to?” In: J Med Chem 57(12) (2015), pp. 4977–5010. [189] Baldi P Lusci A Pollastri G. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” In:JChemInfModel 53(7) (2013), pp. 1563–1575. 129 [190] Schneider G Gawehn E Hiss JA. “Deep learning in drug discovery.” In:MolInf 35(1) (2016), pp. 3–14. [191] Zefirov NS Baskin II Palyulin VA. “A neural device for searching di- rect correlations between structures and properties of chemical com- pounds.” In:JChemInfComputSci 37(4) (1997), pp. 715–721. [192] Marcou G Varnek A Gaudin C. “Inductive transfer of knowledge: ap- plication of multi-task learning and feature net approaches to model tissue-air partition coefficients.” In:JChemInfModel 49(1) (2009), pp. 133– 144. [193] Riley P Ramsundar B Kearnes S. “Massively multitask networks for drug discovery.” In:arXivpreprint (2015). [194] Ünter Klambauer G Unterthiner T Mayr A. “Deep learning as an op- portunity in virtual screening.” In: Deep Learning and Representation LearningWorkshop (2014). [195] Klambauer G Unterthiner T Mayr A. “Toxicity prediction using deep learning.” In:arXivpreprint (2015). [196] Luik AI Tetko IV Livingstone DJ. “Neural network studies. 1. Com- parison of overfitting and overtraining.” In: J Chem Inf Comput Sci 35(5) (1995), pp. 826–833. [197] Sushko Y Novotarskyi S Abdelaziz A. 
“ToxCast EPA in vitro to in vivo challenge: insight into the Rank-i model.” In: Chem Res Toxicol 29(5) (2016), pp. 768–775. [198] Winkler D Burden F. “Bayesian regularization of neural networks.” In:MethodsMolBiol 458 (2008). [199] Krizhevsky A Srivastava N Hinton G. “Dropout: a simple way to pre- vent neural networks from overfitting.” In: J Machine Learn Res 15(1) (2014). 130 [200] Livingstone DJ Tetko IV Villa AE. “Neural network studies. 2. Vari- able selection.” In:JChemInfComputSci 36(4) (1996), pp. 794–803. [201] Korner R Sushko Y Novotarskyi S. “Prediction-driven matched molec- ular pairs to interpret QSARs and aid the molecular optimization pro- cess.” In:JCheminform 6(1) (2014). [202] Koch U Tetko IV Engkvist O. “BIGCHEM: challenges and opportuni- ties for Big Data analysis in chemistry.” In:MolInf (2016). [203] Klug A Crowther RA Walker JE. “Cloning and sequencing of the cDNA encoding a core protein of the paired helical filament of Alzheimer disease.” In:ProcNatlAcadSci 85 (1988). [204] M. R. Farlow. “A critical analysis of new molecular targets and strate- gies for drug developments in Alzheimer’s disease.” In: Curr. Drug Targets 4 (2003), pp. 97–112. [205] De Strooper B. “Aph-1, Pen-2, and nicastrin with presenilin generate an active -secretase complex.” In:Neuron 38 (2003), pp. 9–12. [206] F. Cervantes. “Erythropoietin treatment of the anaemia of myelofibro- sis with myeloid metaplasia: results in 20 patients and review of the literature.” In:BritishJournalofHaematology 127 (2004), pp. 399–403. [207] J. Lewis T.E. Golde L. Petrucelli. “Targeting Abeta and tau in Alzheimer’s disease, an early interim report.” In: Exp. Neurol. 223 (2010), pp. 252– 266. [208] K.R. Browning E.R. Siemers G. Wen. “A gamma-secretase inhibitor decreases amyloid-beta production in the central nervous system.” In:Ann.Neurol. 66 (2009), pp. 48–54. [209] Beher D. Churcher I. “Gamma-secretase as a therapeutic target for the treatment of Alzheimer’s disease.” In: Curr Pharm Des. 
11(26) (2005), pp. 3363–82. 131 [210] W.P . Esler. “Activation barriers to structural transition determine de- position rates of Alzheimer’s disease.” In: J. Struct. Biol. 130 (2000), pp. 174–183. [211] et al. Schmidt G. A. “Present day atmospheric simulations using GISS ModelE: Comparison to in-situ, satellite and reanalysis data.” In: J. Clim. 19 (2006), pp. 153–192. [212] J. Wakabayashi S. Harrison. “Does the age of exposure of serpentine explain variation in endemic plant diversity in California?” In: Inter- nationalGeologyReview 46 (2004), pp. 235–242. [213] J. Naslund J. Lundkvist. “Gamma-secretase: a complex target for Alzheimer’s disease.” In:CurrOpinPharmacol 7 (2007), pp. 112–118. [214] J. R. Furr D. B. Kitchen H. Decornez. “Docking and scoring in virtual screening for drug discovery: methods and applications.” In:Nat.Rev. DrugDiscov 3(11) (2004), pp. 935–949. [215] K. Brian M. Michael J. John. “Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.” In:Jour- nalofmedicinalchemistry 55(14) (2012), pp. 6582–6594. [216] T. Lin H. Gramajo S. Joshua. “Influence relevance voting: an accu- rate and interpretable virtual high throughput screening method.” In: Journalofchemicalinformationandmodeling 49(4) (2009), pp. 756–766. [217] Pierre Baldi Alessandro Lusci Gianluca Pollastri. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous sol- ubility for drug-like molecules.” In: Journal of chemical information and modeling 53(7) (2013), pp. 1563–1575. [218] Patrick Riley David Konerding Vijay Pande. “Massively multitask net- works for drug discovery.” In:arXiv:1502.02072 (2015). 132 [219] J. Fang; H. Ge; D. Huang; H. B. Luo; J. Xu; W. Zhao. “A New Proto- col for Predicting Novel GSK-3 ATP Competitive Inhibitors.” In: J. Chem.Inf.Model. 51 (2011), pp. 1431–1438. [220] D. Vina; E. Uriarte; F. Orallo; H. Gonzalez-Diaz. 
“Alignment- Free Pre- diction of a Drug Target Complex Network Based on Parameters of Drug Connectivity and Protein Sequence of Receptors.” In:Mol.Phar- maceutics 6 (2009), pp. 825–835. [221] G. Du; J. Fang; Y. He; Y. Li; X. Pang; R. Yang. “Discovery of Multitar- get Directed Ligands against Alzheimer’s Disease through Systematic Prediction of Chemical Protein Interactions.” In:J.Chem.Inf.Model 55 (2015), pp. 149–164. [222] X.; J. Zhang; H. Jiang F. Wang; D. Liu; H. Wang; C. Luo; M. Zheng; H. Liu; W. Zhu; Luo. “Computational Screening for active Compounds Targeting Protein Sequences: Methodology and Experimental Valida- tion.” In:J.Chem.Inf.Model 51 (2011), pp. 2821–2828. [223] F. Cheng; Y. Zhou; J. Li; W. Li; G. Liu; Y. Tang. “Prediction of Chemical Protein Interactions: Multitarget QSAR Versus Computational Chemoge- nomic Methods.” In:Mol.Biosyst. 8 (2012), pp. 2373–2384. [224] David Rogers and Mathew Hahn. “Extended-connectivity fingerprints.” In:Journalofchemicalinformationandmodeling 50 (2010), pp. 742–754. [225] Greg Landrum. “RDKit: Open-source cheminformatics”. In: (). URL: http://www.rdkit.org. [226] John M Barnard and Geoffrey M Downs. “Chemical similarity search- ing.” In:Journalofchemicalinformationandcomputersciences 38(6) (1998), pp. 983–996. 133 [227] G. E. Hinton D. E. Rumelhart and R. J. Williams. ““Neurocomput- ing: Foundations of research,” ch. Learning Representations by Back- propagating Errors.” In: (), pp. 696–699. [228] H. Chauvin Y.; Nielsen. “Assessing the Accuracy of Prediction Algo- rithms for Classification: An Overview.” In: Bioinformatics 16 (2000), pp. 412–424. [229] P . Chevillard F.; Kolb. In:J.Chem.Inf.Mod. 55 (2015), pp. 1824–1835. [230] Mezei M. “A new method for mapping macromolecular topography.” In:JMolGraphModel 21 (2003), pp. 463–472. [231] Abagyan R Totrov M. “Soft protein–protein docking in internal coor- dinates.” In:ProteinSci. 11 (2002), pp. 280–291. [232] Gehlhaar DK Verkhivker GM Bouzida D. 
“Deciphering common fail- ures in molecular docking of ligand–protein complexes.” In:JComput AidedMolDes 14 (2000), pp. 731–751. [233] MacKerell AD Brooks BR Brooks CL. “CHARMM: the biomolecular simulation program.” In:JComputChem 30 (2009), pp. 1545–1614. [234] Goodford PJ. “A computational procedure for determining energeti- cally favorable binding sites on biologically important macromolecules.” In:JMedChem 28 (1985), pp. 849–857. [235] Lasters I Desmet J Hazes B. “The dead-end elimination theorem and its use in protein side-chain positioning.” In:Nature 356 (1992), pp. 539– 542. [236] Laskowski RA. “SURFNET: a program for visualizing molecular sur- faces, cavities, and intermolecular interactions.” In: J Mol Graph 13 (1995), pp. 323–330. 134 [237] Stouten PF Brady GP Jr. “Fast prediction and visualization of protein binding pockets with PASS.” In: J Comput Aided Mol Des 14 (2000), pp. 383–401. [238] Waldman M Jiang X Oldfield T. “LigandFit: a novel method for the shape-directed rapid docking of ligands to protein active sites.” In: J MolGraphModel 21 (2003), pp. 289–307. [239] Olson AJ Morris GM Sanner MF. “Automated docking to multiple tar- get structures: incorporation of protein mobility and structural water heterogeneity in AutoDock.” In: J Mol Graph Model 46 (2002), pp. 34– 40. [240] Taylor R Willett P Leach AR. “Development and validation of a ge- netic algorithm for flexible docking.” In:JMolBiol 267 (1997), pp. 727– 748. [241] Murphy RB Friesner RA Banks JL. “Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy.” In:JMedChem 47 (2004), pp. 1739–1749. [242] Waldman M Jiang X Oldfield T. “LigandFit: a novel method for the shape-directed rapid docking of ligands to protein active sites.” In: J MolGraphModel 21 (2003), pp. 289–307. [243] Labute P Corbeil CR Williams CI. “Variability in docking success rates due to dataset preparation.” In: J Comput Aided Mol Des 26 (2012), pp. 775–786. 
[244] Olson AJ Trott O. “AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading.” In:JComputChem 31 (2010), pp. 455–461. [245] Klebe G Lengauer T. “A fast flexible docking method using an incre- mental construction algorithm.” In:JMolBiol 261 (1996), pp. 470–489. 135 [246] Yao X Wang Z Sun H. “Comprehensive evaluation of ten docking pro- grams on a diverse set of protein–ligand complexes: the prediction accuracy of sampling power and scoring power.” In:PhysChemChem Phys 18 (2016), pp. 12964–12975. [247] Rognan D Bissantz C Folkers G. “Protein-based virtual screening of chemical databases. 1. Evaluation of different docking/scoring com- binations.” In:PhysChemChemPhys 43 (2000), pp. 4759–4767. [248] Klicic JJ Murphy RB Halgren TA. “Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy.” In:JournalofMedicinalChemistry 47 (2004), pp. 1739–1749. [249] Greenwood JR Murphy RB Repasky MP. “Extra precision glide: Dock- ing and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes.” In:JournalofMedicinalChemistry 49 (2006), pp. 6177–6196. [250] Gilliland G Westbrook J Feng Z. “The protein data bank.” In: Nucleic acidsresearch 28 (2000), pp. 235–242.
Abstract
The dissertation is composed of two parts. The first part focuses on protein stability in confined structures. We use discontinuous molecular dynamics (DMD) simulation to study the folding and stability of alpha-helix proteins in cylindrical nanopores. Using the PRIME model of proteins, we study how the pore size, the type of pore wall in terms of its interaction energy (repulsive or attractive), and the nature of the interaction between the pore wall and the proteins affect their folding and stability. In the second part we use the concepts of feature locality and hierarchical composition to study and model bioactivity and chemical interactions. By training a neural network with existing inhibitors for misfolding of proteins and decoys of the gamma-secretase protein, we attempt to discover more potential inhibitors of the proteins, which would help predict potential drugs for Alzheimer’s disease.
Asset Metadata
Creator
Wang, Congyue (author)
Core Title
Stability and folding rate of proteins and identification of their inhibitors
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Chemical Engineering
Publication Date
05/15/2020
Defense Date
06/28/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
inhibitors, machine learning, OAI-PMH Harvest, protein, simulation
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sahimi, Muhammad (committee chair), Nakano, Aiichiro (committee member), Shing, Katherine (committee member)
Creator Email
congyuew@usc.edu,jessiewangwcy@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-309905
Unique identifier
UC11664011
Identifier
etd-WangCongyu-8537.pdf (filename), usctheses-c89-309905 (legacy record id)
Legacy Identifier
etd-WangCongyu-8537.pdf
Dmrecord
309905
Document Type
Dissertation
Rights
Wang, Congyue
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA