Artificial Intelligence in Medicinal Chemistry and Drug Discovery
By Ruchira Vishwanath Joshi
A Thesis Presented to the FACULTY OF THE USC ALFRED E. MANN SCHOOL OF PHARMACY AND PHARMACEUTICAL SCIENCES, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfilment of the Requirements for the Degree MASTER OF SCIENCE (MOLECULAR PHARMACOLOGY AND TOXICOLOGY)
August 2024
Acknowledgements
I wish to express my sincere appreciation to my mentor, Dr. Ian Haworth, whose unwavering support and guidance have been instrumental in making this project successful. I also thank Simulations Plus for providing a license for ADMET Predictor® through the University+ program. I extend my gratitude to the esteemed members of the committee, Dr. Serghei Mangul and Dr. Paul Seidler, whose innovative insights and critique have played an important role in refining and enhancing the quality of my work. I am deeply grateful for the mentorship, encouragement, and collaborative spirit demonstrated by Drs. Haworth, Mangul, and Seidler throughout this journey. Their collective contributions have not only shaped this thesis but have also contributed to my development as a student. Finally, I would like to express my love towards my family and friends, who have wholeheartedly supported me and been my pillars of strength during tough times.
- 2 - Table of Contents Acknowledgements 1 List of Tables 7 List of Figures 8 Abstract 13 Chapter 1: Introduction 14 1.1 KNIME 16 1.2 Python 18 1.3 ADMET Predictor® 18 1.4 Thesis overview 19 Chapter 2: Building an AI model using KNIME 20 2.1 Background 20 2.1.1 Calculation of molecular predictors 20 2.2 Theory 20 2.2.1 Data processing 20 2.2.2 Decision tree analysis 21 2.2.3 Machine learning model 22 2.2.4 Genetic algorithm 22 2.2.5 Linear correlation 23 2.2.6 Variable threshold and activity prediction 24 2.3 Results 25 2.3.1 Data processing 25 2.3.2 Decision tree analysis 25 2.3.3 Machine learning model 26 2.3.4 Genetic algorithm 26 2.3.5 Linear correlation 27 2.3.6 Variable threshold and activity prediction 28 2.4 Discussion 31 - 3 - 2.5 Data sharing 33 Chapter 3: Investigating effects of solvation using SOLVATE 34 3.1 Background 34 3.1.1 WALE (Water analysis for ligand evolution) 34 3.1.2 Ligand-protein complexes used to evaluate WALE 35 3.1.3 Processing through SOLVATE 36 3.1.4 Categories of water 36 3.2 Theory 37 3.2.1 Python script to identify and split multiple entries 37 3.2.2 Python script to classify interactions between empty protein and solvated water 37 3.2.2.1 Absolute displacement, contact displaced bulk and contact displaced HF 37 3.2.2.2 Absolute displacement, contact displaced bulk and contact displaced HF using “ProtHB” and “Closest Prot Atom” 38 3.2.2.3 Contact SWB 38 3.2.2.4 Matched and ghost match 39 3.2.2.5 Total count of waters in each category 42 3.2.2.6 Addition of an identifier 42 3.2.2.7 Combining the CSV files 42 3.3 Results 42 3.3.1 Development of a Python code to identify and split multiple entries 43 3.3.2 Development of a Python code to classify interactions between empty protein and solvated water 43 3.3.2.1 Absolute displacement, contact displaced bulk and contact displaced HF 43 3.3.2.2 Absolute displacement, contact displaced bulk and contact displaced HF using “ProtHB” and “Closest Prot Atom” 44 3.3.2.3 Contact SWB 45 3.3.2.4 
Matched and ghost match 46 3.3.2.5 Total count of waters in each category 47 3.3.2.6 Addition of an identifier 47 3.3.2.7 Combining the CSV files 48 - 4 - 3.3.2.7.1 Alogliptin 50 3.3.2.7.2 Sitagliptin 51 3.3.2.7.3 Vildagliptin 51 3.3.2.7.4 Alprazolam 52 3.3.2.7.5 Diazepam 53 3.3.2.7.6 Atorvastatin 53 3.3.2.7.7 Fluvastatin 54 3.3.2.7.8 Rosuvastatin 55 3.3.2.7.9 Glibenclamide 55 3.3.2.7.10 Ibuprofen 56 3.4 Discussion 57 3.5 Data sharing 59 Chapter 4: Developing water constellations using WALE and comparing similarities using 3D similarity screening 60 4.1 Background 60 4.1.1 Processing for 3D similarity screening 60 4.1.1.1 Processing PDB files 60 4.1.1.2 Creating 3D conformer database 60 4.1.1.3 Computing 3D similarity 61 4.2 Theory 61 4.2.1 Python script to isolate protein solvated waters from the rest of the output CSV file from SOLVATE 61 4.2.2 Python script to identify interested categories of solvated waters and convert them to carbons by default 61 4.2.3 Classification of the water-converted-carbons as either hydrophilic or hydrophobic – Creation of water constellations 62 4.2.3.1 Ligand based water constellations 62 4.2.3.2 Protein-interaction based water constellations 62 4.2.4 Fortran script for establishment of bonds between the atoms 63 - 5 - 4.2.5 Validation of water constellations by computing 3D similarity scores against reference molecules 63 4.3 Results 64 4.3.1 Python script to isolate protein solvated waters from the rest of the output CSV file from SOLVATE 64 4.3.2 Python script to identify interested categories of solvated waters and convert them to carbons by default 65 4.3.3 Classification of the water-converted-carbons as either hydrophilic or hydrophobic – Creation of water constellations 65 4.3.3.4 Ligand-based water constellations 65 4.3.3.5 Protein-interaction based water constellations 66 4.3.4 Fortran script for establishment of bonds between the atoms 68 4.3.5 Validation of water constellations by computing 3D similarity scores against 
reference molecules 68 4.4 Discussion 71 4.5 Data sharing 72 Chapter 5: Ligand evolution using AIDD 73 5.1 Background 73 5.1.1 Processing for AIDD 73 5.2 Theory 73 5.2.1 Using water constellations to evolve ligands with AIDD 75 5.2.1.1 Run 1: Seed molecule – Manually modified sitagliptin, Reference molecule – Sitagliptin protein-interaction based water constellation 75 5.2.1.2 Run 2: Seed molecule - Sitagliptin protein-interaction based water constellation, Reference molecule –Sitagliptin 75 5.2.1.3 Run 3: Seed molecule - Sitagliptin protein-interaction based water constellation (all carbons by default), Reference molecule – Sitagliptin protein-interaction based water - 6 - constellation 76 5.3 Results 76 5.3.1 Using water constellations to evolve ligands with AIDD 76 5.3.1.1 Run 1: Seed molecule – Manually modified sitagliptin, Reference molecule – Sitagliptin protein-interaction based water constellation 76 5.3.1.2 Run 2: Seed molecule - Sitagliptin protein-interaction based water constellation, Reference molecule – Sitagliptin 80 5.3.1.3 Run 3: Seed molecule - Sitagliptin protein-interaction based water constellation (all carbons by default), Reference molecule – Sitagliptin protein-interaction based water constellation 83 5.4 Discussion 86 Chapter 6: Conclusion 89 References 90 Appendices 97 - 7 - List of Tables Table 1. Details of 10 ligand-protein complexes selected for evaluation in WALE 35 Table 2. Categories of water as generated by SOLVATE 36 Table 3. Detailed description of water categories generated by each component of WALE 39 Table 4. Representation of the total count of waters in each category along with identifier for Alogliptin obtained after running WALE 48 Table 5. A complete summary of all the water categories generated by WALE for the ten drugs 49 Table 6. Logic used for generation of protein-interaction based water constellations 52 Table 7. 
Top 10 candidate molecules generated by AIDD using modified sitagliptin as the seed molecule and its protein-interaction based water constellation as the reference molecule for calculating 3D similarity. Molecules are listed based on decreasing order of 3D similarity score and synthetic difficulty 76 Table 8. Top 10 candidate molecules generated by AIDD using sitagliptin’s protein-interaction based water constellation as the seed molecule and sitagliptin as the reference molecule for calculating 3D similarity. Molecules are listed based on decreasing order of 3D similarity score and synthetic difficulty 80 Table 9. Top 10 candidate molecules generated by AIDD using sitagliptin’s protein-interaction based water constellation (all carbons by default) as the seed molecule and the normal protein-interaction based water constellation as the reference molecule for calculating 3D similarity. Molecules are listed based on decreasing order of 3D similarity score 83
List of Figures
Figure 1. Applications of AI in drug development 16 Figure 2. Example of a KNIME workflow consisting of nodes 16 Figure 3. A. KNIME workflow for data processing (Box B), decision tree analysis, and an initial machine learning model, giving the output shown in Figure 6. B. Expanded Box B, showing the procedure for data processing and manipulation of pKa values (Box C). C. Routine for extracting the most acidic pKa for compounds with multiple acidic groups 22 Figure 4. A. KNIME workflow for data input (as obtained from Figure 3), generation of a list of features (Box B), application of a genetic algorithm (Box C), counting of the resulting features (Box D), and insertion of a linear correlation routine (Box E) prior to the genetic algorithm, giving the output shown in Figure 5. B. Expanded Box B. C. Expanded Box C showing the feature selection loop to use the genetic algorithm. D. Expanded Box D. E. 
Expanded Box E, showing nodes for feature elimination based on autocorrelation, with the connections to Boxes B and C 23 Figure 5. A KNIME workflow incorporating variation of activity thresholds with the output from Figure 4 for use in a machine learning model, giving the output shown in Figure 9 25 Figure 6. Results of a decision tree analysis of 928 compounds at puncta threshold of 20,000 26 Figure 7. The number of times the top 10 molecular features were chosen by the genetic algorithm without a prior linear correlation piece (given in blue) and with a prior correlation piece (given in red) to eliminate multicollinearity among the features. The features are arranged from left to right in order of their importance with prior linear correlation analysis 28 Figure 8. Scatter plot of predicted vs. experimental puncta scores demonstrating the extent of tau fibril formation in the presence of 278 potential tau inhibitors 30 Figure 9. Prediction of tau inhibition using a machine learning model with a variable puncta threshold which helps in defining inhibition. The top of the figure shows calculations being performed at an interval of 1000 puncta counts. The order of compounds in the left column follows: 10 strong inhibitors with experimental puncta score less than 15,000, 10 moderate inhibitors with puncta count close to the median puncta score (ranging from 24,621-25,675), and 10 weak inhibitors with experimental puncta score of greater than 40,000. The green region indicates positive prediction of an inhibitor whereas the pink region indicates a negative prediction of an inhibitor at the puncta score 31 Figure 10. 
WALE (Water Analysis for Ligand Evolution) consisting of 12 components written in Python and Fortran languages for applications in novel drug design and incumbent ligand evolution 35 Figure 11. Representation of the initial output obtained from ADMET Predictor® for Alogliptin converted by WALE to make it more appropriate for counting and categorizing solvation waters. The figure shows only those columns of the output as used by the code 43 Figure 12. Representation of the initial output obtained from ADMET Predictor® for Glibenclamide categorized by WALE into detailed water categories using the absolute displacement, contact displaced bulk and contact displaced HF waters. Red: Identification by Python script of a “TRUE” value in the “ProtHB” column, a value other than “O” or “N” in the “ProtHB Atoms” column, and a value of “O” in the “Closest Lig Atom” column → classified as “HB to sidechain of protein displaced with HB by ligand”. Blue: Identification of a “TRUE” value in the “ProtHF” column, a value of “PHE” in the “ProtHF Atoms” column, and a value other than “O” or “N” in the “Closest Lig Atom” column → classified as “HF to aromatic residue of protein displaced with HF by ligand”. Pink: “No interaction to protein displaced with HF by ligand”, Purple: “HF to aliphatic residue of protein displaced with HF by ligand”, Orange: “HB to backbone of protein displaced with HF by ligand” 44 Figure 13. Representation of the initial output obtained from SOLVATE for Vildagliptin categorized by WALE into detailed water categories using the absolute displacement, contact displaced bulk and contact displaced HF waters and the “ProtHB” and “Closest Prot Atom” columns 45 Figure 14. Representation of the initial output obtained from SOLVATE for Alogliptin categorized by WALE into detailed water categories using the contact SWB waters and the “Closest Lig Atom” column 46 Figure 15. 
Representation of the initial output obtained from SOLVATE for Ibuprofen categorized by WALE into detailed water categories using the ghost match and matched waters and the “Closest Lig Atom” column 47 Figure 16. Representation of the initial output obtained from SOLVATE for Sitagliptin, containing information of solvated waters for empty protein and protein-ligand complex, separated by WALE into a separate CSV file 64 Figure 17. Obtaining the PDB file consisting of the interested water categories using WALE. A. Solvated structure of the target protein of Sitagliptin (DPP IV) with all waters (represented as red spheres). B. Solvated waters of the empty protein. C. Water categories filtered by the Python code represented as “O” atoms (in red). D. Water categories converted to “C” atoms (in grey) by the Python script 65 Figure 18. Obtaining the ligand-based water constellation for Sitagliptin using WALE. A. Default “C”-based water constellation of the ligand. B. Super-imposition of the ligand structure (N = blue, O = red, F = cyan, C = grey, and dotted lines represent aromaticity) with its “C”-based water constellation. Closest ligand atoms to waters 14, 19 and 202 are carbon, nitrogen, and oxygen respectively. C. Ligand-based water constellation derived from the closest atom in the ligand to each “C” in the “C”-based water constellation generated by the Python code. Water 14 is classified as carbon, water 19 classified as nitrogen and water 202 classified as oxygen 66 Figure 19. Obtaining the protein-interaction based water constellation for Sitagliptin. A. The protein (DPP IV) with the interested categories of solvated waters converted to carbons by default. Two amino acid residues GLU 180 and TYR 621 are highlighted as examples for further investigation. B. Representation of the interactions each water has with the protein. Waters 14 and 19 interact with different oxygens on GLU 180 while water 202 interacts with the aromatic carbon of TYR 621. C. 
Protein-interaction based water constellation classified as per the interactions present between the protein and the waters. Waters 14 and 19 get classified as oxygens while water 202 stays a carbon 67 Figure 20. A. Ligand-based water constellation before Fortran script, a. Ligand-based water constellation after Fortran script. B. Protein-interaction based water constellation before Fortran script, b. Protein-interaction based water constellation after Fortran script 68 Figure 21. Heatmap of similarity scores of ligand-based water constellations and ligand structures (Green highlighted box = Hits) 70 Figure 22. Heatmap of similarity scores of protein-interaction based water constellations and ligand structures (Green highlighted box = Hits, Red highlighted box = Misses as compared to Figure 21) 70 Figure 23. Heatmap of the diagonal differences between similarity scores of ligand-based and protein-interaction based water constellations 71 Figure 24. Rules for deciding on substitutions in the original ligand structure to produce the manually modified ligand 75 Figure 25. Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores 78 Figure 26. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores 79 Figure 27. Superimposition of candidate molecule: Compound 3 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores 79 Figure 28. 
Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores 82 Figure 29. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores 82 Figure 30. Superimposition of candidate molecule: Compound 3 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores 83 Figure 31. Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD to compute 3D similarity scores 85 Figure 32. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD to compute 3D similarity scores 86 Figure 33. Superimposition of candidate molecule: Compound 3 (pink) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (red: oxygen, grey: carbon) in AIDD to compute 3D similarity scores 86
Abstract
The objective of using computational tools in medicinal chemistry is to design compounds with better physicochemical properties while minimizing the investment of resources such as time, chemicals, and human effort. 
Artificial intelligence and machine learning help increase the chances of success of a molecule in the drug pipeline by identifying the best performing molecules in the earlier stages of discovery. Medicinal chemists require training and a holistic understanding of the computational tools to make optimum use of them. KNIME presents a solution through its comprehensive toolsets and user-friendly platform, enabling the utilization of machine learning for SAR data analysis. In this thesis, downloadable workflows for predicting the pharmacological activity of compounds and investigating ligand-protein interactions are made available to assist scientists who seek to incorporate AI tools into their research projects using KNIME. Python, as a programming language, presents an approach to navigating folder structures, handling large datasets, and efficiently generating and organizing results. ADMET Predictor, along with its AIDD (Artificial Intelligence-driven Drug Design) and 3D similarity screening modules, facilitates the prediction of physicochemical properties of investigational compounds and the design of novel drugs using artificial intelligence and similarity methods. This thesis highlights the utility of the abovementioned computational tools and methods to manipulate them to help the laboratory medicinal chemist enhance the drug discovery process.
Keywords: KNIME, Python, AIDD, 3D similarity, workflows, computational tools, medicinal chemistry
Chapter 1: Introduction
An efficient drug discovery process is highly sought after, particularly in the pharmaceutical industry, which faces cost pressures [1]. The drug development pipeline consists of stages from hit to lead and lead optimization, typically including 2000 compounds which undergo synthesis and testing to identify a potential clinical candidate with the desired activity [2]. The rate of failure of a candidate molecule in a clinical trial is >85%. 
This is mainly due to inefficiencies in the preceding traditional drug discovery process, which involves designing, making, testing and analysing the molecules [3]. Testing involves screening compound libraries containing millions of samples against relevant disease models [4]. Drug development is a prolonged and expensive process that can be improved with the help of advanced technologies such as artificial intelligence (AI) [5]. Over the years, there has been an increase in the digitalization of data in the pharmaceutical industry. These digital data can be scrutinized and analysed to produce knowledge that will improve various aspects of the medicinal chemistry and drug discovery process. AI can handle a huge amount of data in a short amount of time, which is not feasible for laboratory medicinal chemists [6], mainly because of a lack of the expertise needed to use constantly evolving AI software and technology.
AI and its paradigms, such as machine learning, are technologies that can learn and recognize patterns from input data to make automated decisions for specified objectives [7]. One of the simplest applications of machine learning is the decision tree algorithm, which divides the input data into two or more categories depending on the presence or absence of a feature that the model identifies as important. In commonly used decision tree models, the user can specify the number of branches in the tree, thus limiting how far the data are split [8].
Applications of AI in drug discovery deal with the drug design and drug screening phases [9] (Figure 1). Virtual drug screening considers physicochemical and pharmacokinetic parameters to enable the medicinal chemist to identify molecules with better profiles and eliminate less desirable molecules in a much shorter timeframe and at reduced cost [10]. 
The basis of this approach is machine learning models trained on large data sets of compounds and properties such as solubility, pKa and log P to predict the properties of the test data set [11]. Virtual drug design uses ligand-based modelling approaches like similarity and fingerprint recognition to predict the optimum chemical structure [12]. Neural networks use molecular descriptors like 3D coordinates and SMILES strings to design novel molecules [13]. Fast QSAR models using AI and machine learning have been developed using random forest learners and decision trees [14], which make use of the chemical features of ligands that are responsible for interacting with the biological target [15].
AI can also be applied to predict the toxicity of compounds in databases. DeepTox, a machine learning software trained on 2500 predefined toxic features, successfully predicted the toxicity of the test set compounds [16]. The deep neural network-based AI tool AlphaFold considers the distances and angles between amino acids in the training data set to learn how proteins fold, and predicts the structure of a target protein once the user supplies its amino acid sequence as the input query [17]. Investigating ligand-target interactions through AI and machine learning approaches like support vector machines, regression models and neural networks can lead to the development of novel molecules that maintain significant interactions with the protein, to drug repurposing, and to prevention of the effects of polypharmacology [18]. A support vector machine model developed by Wang et al. was able to predict nine new molecules along with their target interactions after being trained on 15,000 protein-drug complex interactions [19]. DeepDTnet, a cellular network-based deep learning algorithm, successfully predicted the potential of topotecan, a conventional topoisomerase inhibitor, to be used in the treatment of multiple sclerosis [20]. 
KinomeX is an AI-based platform that uses deep neural networks trained with more than 300 kinases and their bioactivities for analysing the general selectivity of molecules towards the kinase enzyme family [21]. AI can be used in synthetic medicinal chemistry to determine the optimum route of synthesis of new molecules. Synthia, developed by Grzybowski et al., is an AI algorithm in which rules can be encoded to help it propose the optimum synthetic route for specific targets [22]. Scaled automation has resulted in substantial time savings by reducing the need to synthesize and test the investigational compounds [23].
Figure 1. Applications of AI in drug development
1.1 KNIME
KNIME is an open-source platform that offers user-friendly tools for data processing and analysis. It is specifically designed to manage large volumes of diverse data types [24]. Since its inception in 2006, KNIME has been utilized by professionals across various industries and academic institutions. The tools within the software are referred to as nodes, which serve as the fundamental processing units. These nodes can be assembled to create workflows, a key feature of the platform, which can be exported and readily shared with others [25] (Figure 2). This enables users to reproduce the results or utilize the workflows as a foundation for their own analyses.
Figure 2. Example of a KNIME workflow consisting of nodes.
Data input in KNIME is controlled by input nodes which read various formats of data, including Excel, URL, SDF and Mol2 files. The OpenBabel node in KNIME is a useful tool for interconverting molecular formats that can then be used in the workflow [26]. After input, data cleaning is needed to make the data suitable for analysis. KNIME offers a wide range of nodes that help in manipulating initial data, including concatenation, row and column filtering, and data transformation [27]. 
Analysis in KNIME is performed by AI-based nodes which use decision tree models, neural networks, regression models and clustering software to run everything ranging from simple mathematical functions to complex algorithms. Data validation can also be done to enhance the performance of existing models [27]. Output in KNIME can be generated in many ways, including report generation in Excel or Word format. Publication-ready figures can also be generated with the help of an image to report node [27].
KNIME has been used extensively in the scientific community in various phases of drug development. Lemcke and Kruggel demonstrated a KNIME workflow that could score in silico homology models against the enzyme GSK-3 in Plasmodium [28]. Scoring is often performed for docked poses of ligands using various scoring functions. However, it is difficult to analyse the strength of any one scoring function, which makes it difficult to build a consensus model. The data mining function in KNIME based on the clustering algorithm was used by Korb et al. to analyse the strength of numerous scoring algorithms [29]. A common task in ligand-based in silico screening is relating molecular structure, which can be described using molecular fingerprints, to clinical endpoints. Sala et al. developed a workflow to compute molecular fingerprints of hIKK-2 inhibitors using cluster analysis for result generation [29]. Libraries like DrugBank and ChEMBL available in KNIME store databases of compounds with their biological activities. These libraries were used by Steri et al. to train and implement a self-organizing map algorithm to virtually screen a QSAR model of farnesoid X receptor modulators for predicting the potency of novel compounds [30]. PAINS (pan-assay interference compounds) carry chemical features that frequently result in false positive predictions in numerous assays [31]. 
A variety of KNIME workflows that identify and eliminate compounds having these features have been developed to address this issue and minimize their interference in the hit identification process [32].
This thesis highlights the development of a KNIME workflow that takes as input the major physicochemical and structural properties (molecular features) of ~930 polyphenol inhibitors of tau protein, which play a role in Alzheimer's therapy, performs screening on the features, eliminates those that are highly correlated with each other to reduce the problem of overfitting, and runs a genetic algorithm-based machine learning model to determine the relative importance of the features. The identification of such important features gives us an understanding of their structural significance and plays a role in determining the activity of the test data set of molecules based on SAR data of the library of polyphenol inhibitors (training set) [33].
1.2 Python
Python is a programming language that has found increasing popularity in the drug development process. It is versatile and user-friendly and offers many packages of tools and libraries. It has applications in molecular dynamics and modelling, docking, in silico screening, QSAR modelling, visualization and other data analysis [34]. Polyply, a Python-based open-source algorithm developed by Grunewald et al., enables users to simulate nanoparticles and macromolecules. It allows users to specify the parameters needed to run molecular dynamics simulations, such as force fields, coordinates and formulation type, and it undergoes rigorous testing through continuous integration and semantic versioning [35]. DOCKSTRING was developed as a Python package for molecular docking; it provides virtual docking scores, libraries of docking poses for standardization and benchmarking, and a set of predefined tasks to analyse models [36]. 
VSFlow (virtual screening workflow) was developed in Python as a command-line tool containing both 2D (fingerprint and chemical feature based) and 3D (shape based) screening models. It also contains tools for managing and processing drug databases for in silico screening [37]. An open-source QSAR model was recently developed using Python to predict the pharmacological activity of thiazole compounds demonstrating anticancer activity. The correlation model between structure and the IC50 value was developed using multiple linear regression and support vector machine approaches [38]. Python can help summarize big data analytics in the form of graphs and publication-ready figures using libraries like Matplotlib and Seaborn. Its applications include exploring input data, identifying patterns and anomalies and communicating scientific findings effectively [39]. This project uses Python in a variety of ways: to manipulate and handle large CSV files, to direct output into specific directories and, most importantly, in computational chemistry to identify and categorize the waters that surround an empty protein as well as a protein-ligand complex.
1.3 ADMET Predictor®
ADMET Predictor® is software developed by Simulations Plus, Inc. to estimate important physicochemical and pharmacokinetic properties of a compound database, with applications in computational screening during early drug development [40]. It comes with various modules, including AIDD™ (AI-Driven Drug Design), a de novo drug design algorithm that takes compound structures as seed molecules and evolves new analogs following predefined transformation rules. AIDD further checks the fitness of the candidate molecule structures against initially set parameters and structural objective functions and allows the most successful analogs to undergo another round of evolution [41]. 
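The seed, transform, score and select cycle described for AIDD can be pictured with a small toy loop. The sketch below is not AIDD's algorithm or API: molecules are reduced to short strings of atom letters, the "transformation rule" is a random single-letter substitution, and the fitness is simple position matching against a hypothetical reference pattern. All names (`TARGET`, `ALPHABET`, `fitness`, `mutate`, `evolve`) are assumptions made for illustration only.

```python
import random

TARGET = "CCNOC"   # hypothetical reference pattern standing in for a reference molecule
ALPHABET = "CNO"   # toy atom types


def fitness(candidate):
    """Fraction of positions that match the reference pattern (0.0 to 1.0)."""
    return sum(a == b for a, b in zip(candidate, TARGET)) / len(TARGET)


def mutate(candidate, rng):
    """Toy 'transformation rule': replace one randomly chosen position."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(seeds, generations=5, offspring_per_parent=4, keep=3, seed=0):
    """Evolve the seeds and return the `keep` fittest candidates."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = list(seeds)
    for _ in range(generations):
        # Parents stay eligible, so the best score can never decrease.
        candidates = set(population)
        for parent in population:
            for _ in range(offspring_per_parent):
                candidates.add(mutate(parent, rng))
        # Rank by fitness (highest first); ties are broken alphabetically.
        ranked = sorted(candidates, key=lambda c: (-fitness(c), c))
        population = ranked[:keep]  # only the fittest go to the next round
    return population


survivors = evolve(["CCCCC"])
```

Keeping the parents in the candidate pool is a design choice of this sketch: it guarantees that a round of evolution never loses the best structure found so far.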
3D similarity screening is another module that predicts Tanimoto scores based on atomic volume (shape) and chemical feature overlap of 3D structures42. Both 3D similarity screening and AIDD have been used in this project for analysing the arrangement of waters around the empty protein and comparing it against the ligand structure, as well as for evolution and optimization of the ligand based on its similarity score.

1.4 Thesis Overview

The computational tools highlighted in this thesis help reduce the workload for chemists by identifying lead compounds with enhanced physicochemical and pharmacokinetic properties using artificial intelligence, targeted functional group modifications and similarity. The KNIME workflow uses a known dataset of successful compounds as a training set and employs machine learning algorithms that can identify the presence of characteristic functional groups in the compounds, compare these against SAR data and predict the pharmacological activity of novel compounds. SOLVATE provides a detailed analysis of protein-ligand interactions involving water. These interactions were further investigated using a program called WALE (consisting of 12 components written in Python and Fortran, Section 3.1.1), which led to some interesting findings about the way ligand binding replaces protein-water interactions with several new bonds to the protein. 3D similarity screening helps to highlight the need for ligand structure optimization by comparing ideal protein-water interactions against existing protein-ligand interactions. This idea of ligand structure optimization was then explored using AIDD to generate novel compounds with higher similarity to the ideal protein-water interactions. Because these novel compounds carry the right functional groups at the right positions, they may interact more strongly with the protein, resulting in better binding efficacy.
Chapter 2: Building an AI model using KNIME

2.1 Background

This chapter explains the KNIME workflow developed to analyse the molecular features and activity data of ~930 potential polyphenol inhibitors of tau protein and to predict their pharmacological activity using AI approaches such as decision tree and machine learning methods. The workflow combines AI and QSAR models with molecular features derived from ADMET Predictor® and makes it possible for laboratory medicinal chemists to conduct a medicinal chemistry analysis computationally. The work in this chapter has been published33 and was performed in collaboration with Zipeng Zheng, as noted in several sections below.

2.1.1 Calculation of molecular predictors

Canonical SMILES strings of the 930 compounds were used as input in ADMET Predictor® 10.0 (Simulations Plus, Lancaster, CA) to determine the molecular features. One compound, tannic acid, could not be processed because its number of ionizable groups (25) exceeded the limit of 20 groups in ADMET Predictor®. This compound was not included in further analysis. The calculated molecular features were obtained from ADMET Predictor® as an Excel spreadsheet, which was then used as input for AI, data analysis and computational chemistry in KNIME Analytics ver. 4.7.2.

2.2 Theory

An initial KNIME workflow was developed to read in the Excel file of molecular features and inhibition data, in the form of puncta count, through the “Excel Reader” node. Preprocessing and data cleaning were performed, followed by a simple analysis of the data using a decision tree.

2.2.1 Data processing

The data cleaning workflow is shown in Figure 3A. After removing unnecessary columns and filtering rows using the “Column Filter” and “Row Filter” nodes, the data underwent processing to include the most acidic and basic pKa for each compound (Figure 3B and Figure 3C).
The resulting data set used in the decision tree analysis contained 928 compounds with 65 molecular features each, after filtering out lysipressin and tannic acid due to insufficient data. A detailed description of the workflow and the nodes involved is given in Zipeng Zheng’s thesis33,43.

Figure 3. A. KNIME workflow for data processing (Box B), decision tree analysis, and an initial machine learning model, giving the output shown in Figure 4. B. Expanded Box B, showing the procedure for data processing and manipulation of pKa values (Box C). C. Routine for extracting the most acidic pKa for compounds with multiple acidic groups.

2.2.2 Decision tree analysis

A simple decision tree analysis was introduced into the workflow (Figure 3A) by including three additional nodes after data cleaning, namely the “Rule Engine”, “Column Filter” and “Decision Tree Learner” nodes. Additional details about this part are given in Zipeng Zheng’s thesis43.

2.2.3 Machine learning model

The resulting data from Section 2.2.1 were also subjected to a machine learning algorithm using the “Gradient Boosted Tree Learner” node (Figure 3A). The model was trained on 70% of the total data (as specified in the “Partitioning” node) and the remaining 30% was used as the test data set, which served as input into the “Gradient Boosted Trees Predictor” node. The “Scorer” node was used to obtain output in the form of a confusion matrix. This is discussed in detail in Zipeng Zheng’s thesis43.

2.2.4 Genetic algorithm

To improve the accuracy of the machine learning model generated above, the workflow in Figure 4 was developed. After initial data cleaning and further processing, the workflow progresses to a metanode (box B, Figure 4A) which generates a single list of the 65 molecular features (Figure 4B).
The “String Manipulation” and “Rule-based Row Filter (Dictionary)” nodes work together to take the 65 features one by one (as controlled by the “Chunk Loop”) and compare each to the features selected by the genetic algorithm (box C, Figure 4A). The genetic algorithm expanded in Figure 4C contains a “Counting Loop Start” node, which determines the number of times (50) that the “Feature Selection Loop” will run. The “Feature Selection Loop” is based on a machine learning algorithm like the one described in Section 2.2.3, except that the “X-Partitioner” node is used to partition the data multiple times and increase the number of validations (5 for each prediction), with the data being collated by the “X-Aggregator” node. The “X-Partitioner” node increases the robustness of the model predictions. The number of validations (n=5) also fixes the ratio of training to test data as 80:20. Random seed and random sampling were specified to generate reproducible results. The accuracy of each iteration was determined using the “Scorer” node. The final result compilation takes place in the “Feature Selection Loop End” node in the form of a list of models: N generated models (where N is up to 220) in the feature selection loop for each iteration of the counting loop. Each model consists of a set of features (maximum 20) along with the accuracy of the prediction of the puncta count. The “Row Filter” node was used to eliminate models with an accuracy lower than 0.6. This filtered list of models is directed back to the main workflow (Figure 4A) and a count of each feature across all the models is performed (box D, Figure 4A; expanded in Figure 4D). The resulting data are then processed using various nodes, as shown in Figure 4A, to obtain the final output.

Figure 4. A.
KNIME workflow for data input (as obtained from Figure 3), generation of a list of features (Box B), application of a genetic algorithm (Box C), counting of the resulting features (Box D), and insertion of a linear correlation routine (Box E) prior to the genetic algorithm, giving the output shown in Figure 5. B. Expanded Box B. C. Expanded Box C showing the feature selection loop to use the genetic algorithm. D. Expanded Box D. E. Expanded Box E, showing nodes for feature elimination based on autocorrelation, with the connections to Boxes B and C.

2.2.5 Linear correlation

ADMET Predictor® describes each molecule in extensive detail in the form of molecular features. In most cases, a single property may be described by many molecular features. Hence, it is possible to replace multiple features with a single representative feature, thereby reducing the negative influence of correlation on the genetic algorithm. A linear correlation workflow piece (box E, Figure 4A; expanded in Figure 4E) was developed and inserted before the main workflow branches out to perform the feature selection loop (box C, Figure 4A) or the listing of features (box B, Figure 4A). The objective of this piece is to calculate the pairwise linear correlation coefficients for all 65 features using the “Linear Correlation” node, after which a “Correlation Filter” excludes all columns which have a coefficient >0.8. The routine was performed on a subset of 50 compounds chosen linearly from the dataset. The output obtained from the workflow is in the form of a bar chart (shown in Figure 4A) which lists the molecular features from most to least important, as generated by the machine learning model with the linear correlation piece, and compares it against the list obtained without the correlation piece using the “CSV Writer” node.
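The correlation routine of Section 2.2.5 can be illustrated outside KNIME with a short pandas sketch. This is not the KNIME node's implementation: the greedy keep-first rule, the use of absolute Pearson coefficients, the function name `correlation_filter` and the toy feature names are all assumptions made for illustration; only the 0.8 cutoff and the subset size of 50 come from the text.

```python
import numpy as np
import pandas as pd


def correlation_filter(features: pd.DataFrame, threshold: float = 0.8) -> list[str]:
    """Return the columns kept after dropping one member of each highly correlated pair.

    Assumption: a column is kept only if its absolute Pearson correlation with
    every previously kept column is at or below the threshold (greedy rule).
    """
    corr = features.corr(method="pearson").abs()
    kept: list[str] = []
    for column in corr.columns:
        if all(corr.loc[column, other] <= threshold for other in kept):
            kept.append(column)
    return kept


# Toy data with hypothetical feature names: "b" is almost a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.01, size=50), "c": rng.normal(size=50)}
)
kept_columns = correlation_filter(toy)
```

In this toy case the near-duplicate column is dropped and the independent one is retained. Which member of a correlated pair survives depends on column order here, which may differ from the selection made by the KNIME “Correlation Filter” node.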
2.2.6 Variable threshold and activity prediction

The workflow discussed in Section 2.2.4 generated a final list of molecular features, which was then used as input in a new machine learning model (Figure 5). This model is distinguished by a method to vary the puncta threshold using the “Empty Table Creator” and the “Counter Generation” nodes, which allowed each molecule in the test set to be evaluated against the varying predicted threshold. At each threshold, the “Rule Engine” node within the “Chunk Loop” was used to categorize each molecule as either an inhibitor or a non-inhibitor. The development of the rest of the machine learning model was similar to Figure 3A. The “Loop End (Column Append)” node was used to collect the output and the “CSV Writer” was used to write out an output CSV file. This part is discussed in detail in Zipeng Zheng’s thesis43.

Figure 5. A KNIME workflow incorporating variation of activity thresholds with the output from Figure 4 for use in a machine learning model, giving the output shown in Figure 7.

2.3 Results

2.3.1 Data processing

An important part of this workflow was to extract the most acidic or basic values for each compound from the multiple pKa values obtained from ADMET Predictor® (Figure 3C). The final output of this workflow consisted of a table of 928 compounds with 65 molecular features each, which was then used as input for various workflows. The results of this section are discussed in detail in Zipeng Zheng’s thesis43.

2.3.2 Decision tree analysis

The objective of the decision tree analysis (Figure 3A) was to determine the molecular features that categorized molecules as inhibitors or non-inhibitors of tau fibrils at a static puncta threshold of 20,000 (molecules with a puncta count of <20,000 were classed as inhibitors).
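The labelling rule used at a fixed threshold here, and repeated at each step of the variable threshold in Section 2.2.6, can be sketched in a few lines of Python. This only illustrates the labelling step, not the model training that follows it in the workflow; the function names, the toy compounds and the sweep range are hypothetical, while the rule (puncta count below the threshold means inhibitor) and the 1000-count interval come from the text.

```python
def label_by_threshold(puncta_count: float, threshold: float) -> str:
    """Label a compound as an inhibitor when its puncta count is below the threshold."""
    return "inhibitor" if puncta_count < threshold else "non-inhibitor"


def sweep_thresholds(
    puncta_counts: dict[str, float], thresholds: range
) -> dict[int, dict[str, str]]:
    """Re-label every compound at each threshold in the sweep."""
    return {
        t: {name: label_by_threshold(count, t) for name, count in puncta_counts.items()}
        for t in thresholds
    }


# Hypothetical compounds and puncta counts, for illustration only.
toy_counts = {"compound_A": 12_500.0, "compound_B": 24_958.38, "compound_C": 41_000.0}
# Assumed sweep range; only the step of 1000 is taken from the text.
labels = sweep_thresholds(toy_counts, range(10_000, 45_001, 1_000))
```

A compound whose count equals the threshold is treated as a non-inhibitor in this sketch, which follows the strict "<" in the definition above.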
The number of aromatic hydroxyl groups (ArHdrxl_-OH) present in a molecule was determined to be the best indicator of an inhibitor (Figure 6) across multiple puncta thresholds (see Appendix 1). This helped to establish the probable importance of ArHdrxl_-OH in a predictive machine learning model of tau fibril inhibition. The results of this section are discussed in detail in Zipeng Zheng’s thesis43.

Figure 6. Results of a decision tree analysis of 928 compounds at puncta threshold of 20,000.

2.3.3 Machine learning model

At a puncta threshold of 20,000, the model predicted with an accuracy of 70%, with good specificity but low sensitivity (see Appendix 2). The result is discussed in detail in Zipeng Zheng’s thesis43. Improvements made to this model are discussed in the following sections.

2.3.4 Genetic algorithm

The machine learning algorithm cannot by itself identify the key molecular features of a molecule. The genetic algorithm workflow was developed to address this issue and determine the most important features for predictive modelling (Figure 4A). In the part of the workflow dealing with the data input, the “Rule Engine” node was used to categorize the molecules into two groups based on their median puncta count, which was found to be 24,958.38. The workflow was performed on the dataset of 928 compounds with 65 molecular features each (as listed in Appendix 2). The output consisted of the best features over 1417 models (represented as rows in KNIME) obtained through 50 iterations (“Counting Loop”, Figure 4C). Each iteration consisted of approximately 30 generations in the genetic algorithm. The average number of features in the 1417 models was about 10. If each of the 65 features were equally important, each feature would appear 218 times in the analysis ([1417 × 10] / 65). However, only 40 features appeared more than 218 times in the analysis, whereas the remaining 25 appeared fewer than 218 times (Appendix 2, blue bars).
The number of hydrogen bond donors (HBD) was found to be the most selected feature (appearing 983 times), while the number of aromatic hydroxyl groups (ArHdrxl_-OH) was 10th on the most selected feature list (appearing 377 times) (Appendix 3). This result indicates that although ArHdrxl_-OH was selected more often than expected from random selection (218 times), it was not the most important delineating feature, as it had been in the decision tree (Figure 6). This discrepancy suggested scope for improving the workflow to give more accurate and robust results.

2.3.5 Linear correlation

The genetic algorithm workflow was improved by incorporating a linear correlation piece prior to the genetic algorithm to eliminate the autocorrelated features (Figure 4E). The results obtained from this refined workflow differed markedly from the original genetic algorithm results. Only three of the top ten features selected without prior correlation analysis were present after use of the linear correlation routine (Appendix 3). The routine also eliminated 24 correlated features from the initial 65, leaving a total of 41 non-correlated features with r values less than 0.8. These 41 features were then used as input to the genetic algorithm, leading to the generation of 889 models across 50 iterations. There were about 18 generations in each iteration and each model had about 10 features on average. Hence, if each of the 41 features were equally important, each should appear 216 times in the analysis ([889 × 10] / 41). The most significant result of this refined workflow was the re-appearance of ArHdrxl_-OH as the most dominant molecular feature, appearing in 794 models out of a maximum of 889 (Appendix 2). This agrees with the results obtained from the decision tree, indicating that adopting the correlation routine addressed the issue of multicollinearity among the features.
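The expected selection counts quoted for the two runs follow from simple arithmetic, sketched below. The function name is hypothetical, and the use of floor division is an assumption made so that the results match the whole numbers given in the text.

```python
def expected_selections(n_models: int, mean_features_per_model: int, n_features: int) -> int:
    """Whole-number count expected per feature if all features were selected equally often."""
    return (n_models * mean_features_per_model) // n_features


# Without the correlation routine: 1417 models, about 10 features each, 65 features.
expected_without = expected_selections(1417, 10, 65)
# With the correlation routine: 889 models, about 10 features each, 41 features.
expected_with = expected_selections(889, 10, 41)

# Observed-to-expected ratio for ArHdrxl_-OH in each run (377 and 794 selections).
arhdrxl_ratio_without = 377 / expected_without
arhdrxl_ratio_with = 794 / expected_with
```

Expressing the observed count as a ratio to the expected count makes the two runs comparable even though the number of models and features differs between them.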
The top 10 features in this analysis were selected more than 300 times (Figure 7) and the result shows that the ranking changed substantially from the original genetic algorithm workflow (Appendix 4).

Figure 7. The number of times the top 10 molecular features were chosen by the genetic algorithm without a prior linear correlation piece (given in blue) and with a prior correlation piece (given in red) to eliminate multicollinearity among the features. The features are arranged from left to right in order of their importance with prior linear correlation analysis.

2.3.6 Variable threshold and activity prediction

The shortlisted features (41) obtained from the workflow discussed in Section 2.3.5 were used as input into a new machine learning workflow which incorporated a method to automate variation of the puncta threshold (Figure 5). Figure 8 shows the predicted results obtained after running the workflow at a puncta interval of 1000 for 279 compounds included in the test set (30% of the 928 compounds) and their comparison with experimental data. At a puncta threshold of 20,000, the model gives 19 true positives and 21 false positives. This is a considerable advantage over conventional methods of identifying lead compounds. If one were to randomly choose and test 40 compounds, it would result in the selection of about 11 active inhibitors, whereas the model is able to identify 19 potential inhibitors. A similar comparison at a puncta threshold of 15,000 results in 9 true positives and 15 false positives, while a random choice of 24 compounds would result in the selection of 7 active inhibitors. Figure 9 shows a detailed prediction output for 3 categories of compounds: 10 strong inhibitors with an experimental puncta score of less than 15,000, 10 moderate inhibitors with a puncta score close to the median puncta count of 24,958 (ranging from 24,621 to 25,675) and 10 weak inhibitors with a puncta score greater than 40,000.
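The comparison with random selection at the two thresholds above can be restated as a precision and an enrichment factor. The sketch below uses only the counts quoted in the text; the function names and the term "enrichment factor" are introduced here for illustration and are not taken from the workflow.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predicted inhibitors that are experimentally active."""
    return true_positives / (true_positives + false_positives)


def enrichment_factor(true_positives: int, expected_random_hits: float) -> float:
    """Ratio of actives found by the model to actives expected from a random pick of the same size."""
    return true_positives / expected_random_hits


# Threshold of 20,000: 19 true positives, 21 false positives, about 11 expected at random.
precision_20k = precision(19, 21)
enrichment_20k = enrichment_factor(19, 11)

# Threshold of 15,000: 9 true positives, 15 false positives, 7 expected at random.
precision_15k = precision(9, 15)
enrichment_15k = enrichment_factor(9, 7)
```

Both enrichment factors are above 1, which is the sense in which the model improves on random selection; the larger value is obtained at the 20,000 threshold.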
The top six compounds in Figure 9, which include (-)-epigallocatechin gallate (EGCG), a well-known tau inhibitor44, were counted in the true positive prediction category as the green region starts at a puncta score of less than 15,000. The following 4 compounds were false negatives, as the green region for these compounds does not start until >15,000. All the moderate inhibitors were classified as true negatives as they are experimentally proven to be non-inhibitors, except for 7-hydroxy-4H-chromen-4-one, which showed a blotchy result with the green area at less than 15,000 puncta score and numerous pink areas at >15,000 puncta score. Although this compound was officially designated as a false positive prediction, a closer look at Figure 9 shows that it can easily be identified as a negative prediction due to the blotchy result. All the weak inhibitors were also classified as true negatives, with their green regions starting from >20,000 puncta score.

Figure 8. Scatter plot of predicted vs. experimental puncta scores demonstrating the extent of tau fibril formation in the presence of 278 potential tau inhibitors.

Figure 9. Prediction of tau inhibition using a machine learning model with a variable puncta threshold which helps in defining inhibition. The top of the figure shows calculations being performed at an interval of 1000 puncta counts. The order of compounds in the left column follows: 10 strong inhibitors with experimental puncta score less than 15,000, 10 moderate inhibitors with puncta count close to the median puncta score (ranging from 24,621-25,675), and 10 weak inhibitors with experimental puncta score of greater than 40,000. The green region indicates positive prediction of an inhibitor whereas the pink region indicates a negative prediction of an inhibitor at the puncta score.

2.4 Discussion

The features of a molecule need to be considered carefully before using them in machine learning algorithms to predict molecular events.
The importance of not only the presence but also the number of aromatic hydroxyl groups in the SAR of polyphenol inhibitors of tau was demonstrated with an initial decision tree model. However, translating this result into something that could be used by a machine learning algorithm to make a robust predictive model required numerous steps, including the identification of the most significant molecular features. Firstly, highly correlated features had to be identified and eliminated to solve the issue of multicollinearity. Secondly, a genetic algorithm was employed to determine the most descriptive features from the dataset. The analysis using the genetic algorithm was successful in revealing several features that contributed positively to the predictive model while confirming the result obtained from the decision tree regarding the significance of the number of aromatic hydroxyls. This was a successful cross-check of the validity of the genetic algorithm. This approach of identifying crucial features using a preliminary machine learning algorithm can be used as a general method to deal with large amounts of activity data and focus only on the key features, thus making the interpretation of SARs simpler45. The genetic algorithm acts as a strategy to select features based on evolutionary changes46. Furthermore, this approach provides a solution to the inability of common AI models to identify crucial features contributing to a molecular effect in chemistry, thus providing an explanation for the basis of prediction by the model47,48. This method allows development of drug molecules through quick and methodical interpretations of SARs that can be easily performed by chemists49.
Thus, using machine learning in an iterative manner to identify features that can be subsequently used in further machine learning, termed the reinforced machine learning model, can potentially be used to identify lead compounds across several disease areas and improve the drug discovery process. Another limitation of applying machine learning and AI in chemistry is the identification of a platform that is appropriate for use by chemists50. KNIME provides a way of making machine learning accessible to medicinal chemists without the need for extensive coding51–57. It also includes several medicinal chemistry nodes, including the similarity score generator58 that we have used to calculate similarities of the 930 molecules discussed above with the EGCG conformation bound to tau protein59. KNIME also contains visualization nodes that permit viewing of the molecule in 3D space with the help of a 3D viewer60. The accuracy of the current model can be enhanced further by generating detailed and more relevant molecular data from KNIME and other modelling software based on chemical features and the 3D shape of the molecules61. The predictive model described in this chapter can enrich the number of probable inhibitors in a data set of predicted inhibitors as compared to a random selection method. A deeper analysis of the results for false positive predictions (21 out of 40 at a puncta threshold of 20,000) reveals that we can classify 11 of these false positives as non-inhibitors of tau due to blotchy results around this threshold. Furthermore, the molecular features used in this analysis are general and not directly relevant to tau inhibitors. A more in-depth examination
of the results, combined with the use of features specific to tau inhibitors, can enhance this model’s predictive ability.

2.5 Data Sharing

The KNIME workflows discussed in this chapter are available at https://github.com/ruch555/KNIME-Workflows-for-Applications-in-Medicinal-andComputational-Chemistry-.git

Chapter 3: Investigating effects of solvation using SOLVATE

3.1 Background

To analyse the interactions between a protein and ligand, it is essential to understand the role of water because an empty protein is surrounded by water in a biological environment62. Upon binding of a ligand, some waters are displaced and stabilizing interactions between the protein and water are compromised. These broken protein-water bonds are compensated by strong interactions with the ligand. Ideally, the ligand should interact in a hydrophilic manner at all those positions at which it has displaced waters in the binding site. SOLVATE takes the PDB files of a protein and a ligand (including H atoms) as input, runs the WATGEN algorithm as described by Morningstar-Kywi et al.63 and solvates the empty protein as well as the complex by adding waters in layers. The program generates solvated PDB files and CSV files with detailed information about the solvation process. This approach gives us an opportunity to investigate the interactions in the protein-ligand as well as the protein-water complex.

3.1.1 WALE (Water Analysis for Ligand Evolution)

To investigate the interactions between waters and the protein, and apply them to novel drug design, a program that we refer to as WALE (Water Analysis for Ligand Evolution) was developed. WALE consists of 12 Python and Fortran components (Figure 10). The environment used for Python coding was PyCharm Community Edition 2023.1.2, while Fortran was written in a Visual Studio 2022 environment and compiled using a Fortran compiler. The components must be executed in the order shown in the figure. Each component is responsible for a specific function given by its name and is described in detail in the following sections.
The general workflow involves manipulating the initial data obtained from SOLVATE to make it suitable for further processing, classification of solvated waters into more refined categories, generation of a total count of such waters in each category, and combination of this information for all ligands, followed by the creation of water constellations and establishment of bonds within the constellations. WALE takes the data on empty protein solvated waters in the form of CSV files (obtained from SOLVATE) as input, processes it in multiple steps, and finally generates an output SDF file of the water constellations that can then be used for ligand evolution using AIDD. In essence, WALE takes the solvated protein complex, processes the information associated with it, and produces output that helps in the evolution of original ligands into compounds with structures having optimal interactions with the target protein.

Figure 10. WALE (Water Analysis for Ligand Evolution) consisting of 12 components written in Python and Fortran languages for applications in novel drug design and incumbent ligand evolution.

3.1.2 Ligand-protein complexes used to evaluate WALE

In this chapter, 10 ligands across various disease domains were selected and the PDB files for the drug-protein complexes were downloaded from the RCSB PDB database (Table 1). The complexes included the ligand bound to its pharmacological target protein, except for alprazolam and diazepam, which were bound to the first bromodomain of human BRD4 and human serum albumin, respectively. The complexes of sitagliptin and ibuprofen included the target protein, but from a different species (bacterial and ovine, respectively).

Table 1.
Details of 10 ligand-protein complexes selected for evaluation in WALE

PDB ID | Ligand | Protein | Protein chain
3G0B | Alogliptin | Human dipeptidyl peptidase (DPP) IV | A
7Y4G | Sitagliptin | Bacterial dipeptidyl peptidase (DPP) IV | A
6B1E | Vildagliptin | Human dipeptidyl peptidase (DPP) IV | A
3U5J | Alprazolam | First bromodomain of human BRD4 | A
2BXF | Diazepam | Human serum albumin | A
1HWK | Atorvastatin | Catalytic portion of human HMG-CoA reductase | A, B
1HWI | Fluvastatin | Catalytic portion of human HMG-CoA reductase | A, B
1HWL | Rosuvastatin | Catalytic portion of human HMG-CoA reductase | A, B
5YW7 | Glibenclamide | Human pancreatic ATP-sensitive potassium channel | A
1EQG | Ibuprofen | Ovine COX-1 | A

3.1.3 Processing through SOLVATE

The protein-ligand complexes were separated into protein and ligand PDB files. The protein chains were selected according to the binding site of the ligand. Hydrogens were added to the ligand using a 3D viewing and editing software, ViewerPro64. The protein and ligand PDB files were loaded into SOLVATE and the drug-protein complex and “empty” protein were solvated. The output CSV file giving a detailed description of the added waters and a summary of their properties was used for further analysis.

3.1.4 Categories of water

SOLVATE categorizes each water in the solvated empty protein depending on whether it was displaced upon ligand binding. The various categories of water that were used for this project are shown in Table 2.

Table 2.
Categories of water as generated by SOLVATE

Water category | Definition | Inference
Absolute displacement | Water displaced by ligand | Bonds made by ligand to protein must mimic the bonds made by waters to protein
Contact displaced bulk | Water with a high probability of displacement by ligand away from the binding site | Waters not significantly interacting with the protein as they form the bulk
Contact displaced HF | Water with a high probability of displacement by ligand interacting with a hydrophobic (HF) part of the protein | Beneficial water displacement
Contact SWB | Water with a high probability of forming a single water bridge (SWB) between the ligand and the protein | For a SWB to be beneficial in binding, both the parts in the ligand and the protein need to be hydrophilic
Matched | Water not displaced by the ligand | Indicates points of evolving the ligand structure based on identification of beneficial interactions with the protein
Ghost match | Water with a high probability of not being displaced by the ligand | Indicates probable points of evolving the ligand structure based on identification of beneficial interactions with the protein

3.2 Theory

3.2.1 Python script to identify and split multiple entries

During analysis of the CSV file containing details of the empty protein solvation, it was found that one water occasionally interacted, either hydrophilically or hydrophobically, with more than one atom in the empty protein, as reflected in the “ProtHB Atoms” or “ProtHF Atoms” column. To resolve this issue, a Python script (Appendix 5, component 1 in Figure 10) was developed to split the multiple entries in this column, so that all the interactions made by a single water with the protein are considered.
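The splitting step of Section 3.2.1 can be sketched with pandas as follows. This is not the script in Appendix 5: the function name, the ";" separator, the “Water” identifier column and the atom labels in the toy table are assumptions for illustration, while the “ProtHB Atoms” column name comes from the text.

```python
import pandas as pd


def split_multiple_entries(table: pd.DataFrame, column: str, sep: str = ";") -> pd.DataFrame:
    """Give each entry of a multi-valued column its own row, repeating the other columns."""
    out = table.copy()
    # Turn each cell into a list of entries; empty cells become a single empty entry.
    out[column] = out[column].fillna("").astype(str).str.split(sep)
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out


# Toy table with hypothetical atom labels: water 1 interacts with two protein atoms.
waters = pd.DataFrame(
    {"Water": [1, 2], "ProtHB Atoms": ["ASP45:OD1; GLY46:N", "SER10:OG"]}
)
split = split_multiple_entries(waters, "ProtHB Atoms")
```

After the split, water 1 occupies two rows, one per protein atom, so that later steps see each interaction separately. The same function could be applied to “ProtHF Atoms” by passing that column name.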
3.2.2 Python script to classify interactions between empty protein and solvated water 3.2.2.1 Absolute displacement, contact displaced bulk and contact displaced HF - 38 - This script (Appendix 6, component 2 in Figure 10) uses the “ProtHB” and “ProtHF” columns to classify the interactions between the water and the protein as either hydrophilic or hydrophobic. Going into further detail, the hydrophilic interactions were classified into two types, either interacting hydrophilically with the protein backbone or sidechain, using the “ProtHB Atoms” column, which gives information on what protein atoms were involved in the hydrogen bond. Similarly, the hydrophobic interactions were classified into either interacting with an aromatic residue or an aliphatic residue of the protein using the “ProtHF Atoms” column which gives information on what protein atoms were involved in the hydrophobic interaction. If the water did not interact with the protein, it was simply classified as “No interaction with the protein”. Furthermore, the “Closest Lig Atom” column was used to gain information on the ligand atom closest to the water after binding. If the closest ligand atom was either “O” or “N” then the water was categorized as “displaced with HB by ligand”. If it was any other atom, then it was categorized as “displaced with HF by ligand”. The output was obtained in the form of a CSV file with the water categories described above (Table 3). 3.2.2.2 Absolute displacement, contact displaced bulk and contact displaced HF using “ProtHB” and “Closest Prot Atom” During analysis of solvation, it was discovered that although certain waters form a hydrogen bond with the protein, the closest atom in the protein to the water was hydrophobic. To investigate this interesting phenomenon, a script (Appendix 7, component 3 in Figure 10) was developed to give a summary of waters which undergo such interactions. Initially a filter was set to use waters only in the interested categories. 
The script checked whether a water forms a hydrogen bond with the protein using the “ProtHB” column values. If so, the code checked whether the value in the “Closest Prot Atom” column was either “O” or “N”. If it was, the water was classified as “Forms HB with protein and closest atom is HB in protein”. If the closest atom in the protein to the water was anything else, it was classified as “Forms HB with protein but closest atom is HF in protein”. The newly generated water categories (Table 3) were written into the CSV files generated in the previous step as output.

3.2.2.3 Contact SWB

Even though a water belonging to the contact SWB category makes a single water bridge between the ligand and the protein, the closest atom in the ligand to such a water is not always hydrophilic. This forms another interesting category to investigate, as it highlights how a hydrogen bond forming water can also be close to large hydrophobic groups. In this script (Appendix 8, component 4 in Figure 10), all waters belonging to the Contact SWB category were investigated. If the value in the “Closest Lig Atom” column was either “O” or “N”, the water was classified as “Contact SWB and closest atom in ligand is HB”. If the value was anything else, it was classified as “Contact SWB but closest atom in ligand is HF”. The newly generated water categories (Table 3) were written into the CSV files generated in Section 3.2.2.1 as output.

3.2.2.4 Matched and ghost match

Although both matched and ghost match waters are non-displaced, they provide an interesting opportunity to evolve the ligand based on their interactions with the protein. This can be done with the objective of maximising ligand-protein interactions, thereby possibly leading to tighter binding. For the development of this script (Appendix 9, component 5 in Figure 10), all waters belonging to the categories of interest and less than 5 Å away from the ligand were considered.
The classification of hydrophilic interactions (protein backbone or sidechain) and hydrophobic interactions (aromatic or aliphatic residue), and the classification based on whether the atom in the “Closest Lig Atom” column is hydrophilic (“O” or “N”) or hydrophobic, follow the logic given in Section 3.2.2.1. The waters were classified into ten categories (Categories 15 to 24 in Table 3). The newly generated water categories were added to the CSV files generated in Section 3.2.2.1.

Table 3. Detailed description of water categories generated by each component of WALE

WALE component | Category | Description
Component 2 | Category 1 | HB to backbone of protein displaced with HB by ligand
Component 2 | Category 2 | HB to backbone of protein displaced with HF by ligand
Component 2 | Category 3 | HB to sidechain of protein displaced with HB by ligand
Component 2 | Category 4 | HB to sidechain of protein displaced with HF by ligand
Component 2 | Category 5 | HF to aliphatic residue of protein displaced with HB by ligand
Component 2 | Category 6 | HF to aliphatic residue of protein displaced with HF by ligand
Component 2 | Category 7 | HF to aromatic residue of protein displaced with HB by ligand
Component 2 | Category 8 | HF to aromatic residue of protein displaced with HF by ligand
Component 2 | Category 9 | No interaction to protein displaced with HB by ligand
Component 2 | Category 10 | No interaction to protein displaced with HF by ligand
Component 3 | Category 11 | Forms HB with protein and closest atom is HB in protein
Component 3 | Category 12 | Forms HB with protein but closest atom is HF in protein
Component 4 | Category 13 | Contact SWB and closest atom in ligand is HB
Component 4 | Category 14 | Contact SWB but closest atom in ligand is HF
Component 5 | Category 15 | Point of evolution based on HB backbone interaction to protein and HB interaction to ligand
Component 5 | Category 16 | Point of evolution based on HB backbone interaction to protein and HF interaction to ligand
Component 5 | Category 17 | Point of evolution based on HB sidechain interaction to protein and HB interaction to ligand
Component 5 | Category 18 | Point of evolution based on HB sidechain interaction to protein and HF interaction to ligand
Component 5 | Category 19 | Point of evolution based on HF aliphatic interaction to protein and HB interaction to ligand
Component 5 | Category 20 | Point of evolution based on HF aliphatic interaction to protein and HF interaction to ligand
Component 5 | Category 21 | Point of evolution based on HF aromatic interaction to protein and HB interaction to ligand
Component 5 | Category 22 | Point of evolution based on HF aromatic interaction to protein and HF interaction to ligand
Component 5 | Category 23 | Point of evolution based on water in bulk and HB interaction to ligand
Component 5 | Category 24 | Point of evolution based on water in bulk and HF interaction to ligand

3.2.2.5 Total count of waters in each category

This script (Appendix 10, component 6 in Figure 10) takes the unique water categories generated by the preceding scripts and counts the number of waters in each category. The objective was to make it easier to view the categories and their water counts for each compound. The output is a CSV file for each compound, written to a different folder.

3.2.2.6 Addition of an identifier

Although the script above tallies the waters in each category, it does not record the name of the compound for which the categories were generated. This is necessary before compiling the information for all compounds. This script (Appendix 11, component 7 in Figure 10) takes the name of each CSV file, which is the compound name, and writes it to a “Name” column in the file. The output is written back to the CSV files generated by the preceding script.

3.2.2.7 Combining the CSV files

A final script (Appendix 12, component 8 in Figure 10) combines the CSV files of all compounds into one. The files were combined on the “Name” column. The output was a new CSV file called “Combined.csv” in a separate folder.
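The tallying, labelling and combining steps of Sections 3.2.2.5 to 3.2.2.7 can be illustrated with a minimal Python sketch. This is not the code in Appendices 10 to 12: the column name "Water Category", the function names and the toy input are assumptions made for illustration, and an in-memory buffer stands in for “Combined.csv”. Only the standard library is used.

```python
import csv
import io
from collections import Counter

CATEGORY_COLUMN = "Water Category"  # hypothetical column name, not taken from WALE


def tally_categories(rows):
    """Count the waters (rows) that fall into each category for one compound."""
    return Counter(row[CATEGORY_COLUMN] for row in rows)


def add_identifier(tally, compound_name):
    """Attach the compound name, mirroring the role of the "Name" column."""
    return {"Name": compound_name, "Counts": dict(tally)}


def combine(tallies):
    """Merge per-compound tallies; a category absent for a compound becomes 0."""
    names = [t["Name"] for t in tallies]
    categories = sorted({cat for t in tallies for cat in t["Counts"]})
    table = {cat: [t["Counts"].get(cat, 0) for t in tallies] for cat in categories}
    return names, table


def write_combined(names, table, handle):
    """Write one row per category and one column per compound."""
    writer = csv.writer(handle)
    writer.writerow(["Name"] + names)
    for category, counts in table.items():
        writer.writerow([category] + counts)


# Toy input standing in for two per-compound CSV files (not real SOLVATE output).
swb_hb = "Contact SWB and closest atom in ligand is HB"
swb_hf = "Contact SWB but closest atom in ligand is HF"
no_int_hf = "No interaction to protein displaced with HF by ligand"
toy_inputs = {
    "Alogliptin": [{CATEGORY_COLUMN: swb_hb}, {CATEGORY_COLUMN: swb_hf}, {CATEGORY_COLUMN: swb_hf}],
    "Ibuprofen": [{CATEGORY_COLUMN: no_int_hf}],
}

tallies = [add_identifier(tally_categories(rows), name) for name, rows in toy_inputs.items()]
names, table = combine(tallies)
buffer = io.StringIO()  # stands in for "Combined.csv"
write_combined(names, table, buffer)
print(buffer.getvalue())
```

Zero-filling at the combining step keeps every compound on the same set of categories, which is what allows the ligands to be compared side by side.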
3.3 Results

3.3.1 Development of a Python script to identify and split multiple entries

The initial data in the CSV files obtained directly from ADMET Predictor® contained multiple entries in the “ProtHB Atoms” and “ProtHF Atoms” columns. This is a problem for counting because a water that interacts, hydrophobically or hydrophilically, with several protein atoms occupies a single row in the CSV file and is therefore counted as making a single interaction. The script given in Appendix 5 (component 1 in Figure 10) addresses this by splitting each entry in these columns and writing each one as a separate row (Figure 11).

Figure 11. Representation of the initial output obtained from ADMET Predictor® for Alogliptin converted by WALE to make it more appropriate for counting and categorizing solvation waters. The figure shows only the columns of the output that are used by the code.

3.3.2 Development of a Python script to classify interactions between empty protein and solvated water

3.3.2.1 Absolute displacement, contact displaced bulk and contact displaced HF

Following the logic of the script described in Section 3.2.2.1, the three categories of interest were used to classify the interactions of waters with the protein in detail (Figure 12). The script can treat the three categories together because all three represent waters displaced by the ligand. Analysing the interactions of these waters and comparing them with the ligand atoms that replaced them offers a useful perspective for determining the ideal ligand-protein bonds. The final output shown in Figure 12 illustrates how a single water (231) can be classified twice in the same category, so that when the final code is run, that water is counted twice in that category.

Figure 12.
Representation of the initial output obtained from ADMET Predictor® for Glibenclamide categorized by WALE into detailed water categories using the absolute displacement, contact displaced bulk and contact displaced HF waters. Red: identification by the Python script of a “TRUE” value in the “ProtHB” column, a value other than “O” or “N” in the “ProtHB Atoms” column, and a value of “O” in the “Closest Lig Atom” column → classified as “HB to sidechain of protein displaced with HB by ligand”. Blue: identification of a “TRUE” value in the “ProtHF” column, a value of “PHE” in the “ProtHF Atoms” column, and a value other than “O” or “N” in the “Closest Lig Atom” column → classified as “HF to aromatic residue of protein displaced with HF by ligand”. Pink: “No interaction to protein displaced with HF by ligand”. Purple: “HF to aliphatic residue of protein displaced with HF by ligand”. Orange: “HB to backbone of protein displaced with HF by ligand”.

3.3.2.2 Absolute displacement, contact displaced bulk and contact displaced HF using “ProtHB” and “Closest Prot Atom”

A deeper analysis of the SOLVATE output gives an interesting perspective on the formation of hydrogen bonds between the waters and the protein. Although a water may form a hydrogen bond with the protein, the protein atom closest to the water may not always be hydrophilic (Figure 13). This observation challenges the common assumption that a hydrogen bond is surrounded by other hydrophilic contacts. It raises the question of whether displacing a hydrogen-bond-forming water with a hydrophilic ligand atom benefits the ligand-protein interaction when that water is in fact surrounded by many hydrophobic contacts. Figure 13 shows water 109 forming a hydrogen bond with the NH2 group of the arginine 87 residue in DPP IV.
However, the protein atom closest to water 109 is OE2 of the glutamine 167 residue. This shows that hydrogen bond formation depends not only on the proximity of the hydrogen bond donor and acceptor atoms, but also on their geometry and orientation towards each other.

Figure 13. Representation of the initial output obtained from SOLVATE for Vildagliptin categorized by WALE into detailed water categories using the absolute displacement, contact displaced bulk and contact displaced HF waters and the “ProtHB” and “Closest Prot Atom” columns.

3.3.2.3 Contact SWB

A contact single water bridge between a water and the protein may not always be replaced by a water bridge between the ligand and the protein. Figure 14 demonstrates the occurrence of this phenomenon and gives an insight into the optimisation of ligand structure.

Figure 14. Representation of the initial output obtained from SOLVATE for Alogliptin categorized by WALE into detailed water categories using the contact SWB waters and the “Closest Lig Atom” column.

3.3.2.4 Matched and ghost match

These non-displaced waters, at less than 5 Å from the ligand, offer a starting point for growing the ligand structure so that it better mimics the natural protein-water interactions. Figure 15 shows the various points of evolution according to the kind of interaction each water has with the protein and the ligand atom to which it is currently closest. This gives a representation of ‘what is’ versus ‘what should be’.

Figure 15. Representation of the initial output obtained from SOLVATE for Ibuprofen categorized by WALE into detailed water categories using the ghost match and matched waters and the “Closest Lig Atom” column.
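The rule behind the point-of-evolution categories (Section 3.2.2.4) can be sketched in Python as follows. This is a minimal illustration, not the script in Appendix 9. The columns “ProtHB”, “ProtHB Atoms”, “ProtHF”, “ProtHF Atoms” and “Closest Lig Atom” are those named in the text; the “Category” and “Lig Dist” column names, the category labels "matched" and "ghost match", the set of aromatic residues, the assumption that each cell holds a single atom or residue name, and the precedence of a hydrogen bond over a hydrophobic contact are all assumptions made for the example.

```python
HYDROPHILIC_ATOMS = {"O", "N"}
AROMATIC_RESIDUES = {"PHE", "TYR", "TRP", "HIS"}  # assumed set; the text only mentions PHE
NON_DISPLACED = {"matched", "ghost match"}  # hypothetical labels for the categories of interest
MAX_DISTANCE = 5.0  # cutoff in Å, as stated in Section 3.2.2.4


def protein_interaction(row):
    """Describe how the water interacts with the protein."""
    if row["ProtHB"] == "TRUE":
        # Backbone O or N versus any other (sidechain) atom name.
        site = "backbone" if row["ProtHB Atoms"] in HYDROPHILIC_ATOMS else "sidechain"
        return f"HB {site} interaction to protein"
    if row["ProtHF"] == "TRUE":
        kind = "aromatic" if row["ProtHF Atoms"] in AROMATIC_RESIDUES else "aliphatic"
        return f"HF {kind} interaction to protein"
    return "water in bulk"


def ligand_contact(row):
    """Return HB if the closest ligand atom is O or N, otherwise HF."""
    return "HB" if row["Closest Lig Atom"] in HYDROPHILIC_ATOMS else "HF"


def point_of_evolution(row):
    """Return the category label, or None if the water is not a candidate."""
    if row["Category"] not in NON_DISPLACED:
        return None
    if float(row["Lig Dist"]) >= MAX_DISTANCE:
        return None
    return (f"Point of evolution based on {protein_interaction(row)} "
            f"and {ligand_contact(row)} interaction to ligand")


# Toy row (not real SOLVATE output): sidechain hydrogen bond, carbon closest in ligand.
example = {
    "Category": "matched", "Lig Dist": "3.2",
    "ProtHB": "TRUE", "ProtHB Atoms": "OG",
    "ProtHF": "FALSE", "ProtHF Atoms": "",
    "Closest Lig Atom": "C",
}
label = point_of_evolution(example)
print(label)
```

The same two helper functions would also cover the displaced-water categories of Section 3.2.2.1, with only the wording of the final label changed.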
3.3.2.5 Total count of waters in each category

The final step in this analysis is to count the waters in each category for a drug (Table 4), giving the summary needed to examine the interactions that exist between the protein and waters and how they are replaced upon ligand binding. Table 4 shows the output only for Alogliptin and the categories its waters fall into. Note that these are not all the categories that exist.

3.3.2.6 Addition of an identifier

Although the output of Section 3.3.2.5 gives the water tally in the categories for each drug, it does not specify the name of the compound for which the tally was made. The output given in Table 4 includes the “Name” column along with the water tally so that the compound can be identified easily.

Table 4. Representation of the total count of waters in each category along with identifier for Alogliptin obtained after running WALE

Alogliptin | Name
22 | Point of evolution based on HB backbone interaction to protein and HF interaction to ligand
20 | Point of evolution based on HB sidechain interaction to protein and HF interaction to ligand
14 | Point of evolution based on HB backbone interaction to protein and HB interaction to ligand
14 | Point of evolution based on HF aromatic interaction to protein and HF interaction to ligand
7 | Point of evolution based on water in bulk and HB interaction to ligand
7 | Point of evolution based on water in bulk and HB interaction to ligand
7 | Point of evolution based on HF aromatic interaction to protein and HB interaction to ligand
5 | Contact SWB and closest atom in ligand is HB
3 | Contact SWB but closest atom in ligand is HF
2 | No interaction to protein displaced with HB by ligand
2 | No interaction to protein displaced with HF by ligand
2 | HB to sidechain of protein displaced with HF by ligand
2 | HF to aromatic residue of protein displaced with HF by ligand
1 | Forms HB with protein but closest atom is HF in protein

3.3.2.7 Combining the CSV files

The water
tally was performed for each drug individually. To compare different ligands at once, the CSV files generated had to be combined. Table 5 shows a representation of the output of this code. This table contains all the water categories generated by the code. For ligands with no waters in a given category, a value of zero was entered. The details of the analysis are given for each ligand-protein complex below.

Table 5. A complete summary of all the water categories generated by WALE for the ten ligands

Name | Alogliptin - DPP IV | Sitagliptin - DPP IV | Vildagliptin - DPP IV | Alprazolam - BRD4 | Diazepam - Albumin | Atorvastatin - HMG-CoA reductase | Fluvastatin - HMG-CoA reductase | Rosuvastatin - HMG-CoA reductase | Glibenclamide - Potassium channel | Ibuprofen - COX 1
Category 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Category 2 | 0 | 1 | 0 | 1 | 0 | 4 | 4 | 1 | 1 | 0
Category 3 | 0 | 4 | 4 | 1 | 0 | 4 | 4 | 3 | 3 | 4
Category 4 | 2 | 2 | 3 | 0 | 2 | 2 | 2 | 5 | 3 | 0
Category 5 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0
Category 6 | 0 | 4 | 0 | 13 | 1 | 8 | 2 | 1 | 1 | 6
Category 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Category 8 | 2 | 5 | 7 | 6 | 0 | 0 | 0 | 0 | 7 | 3
Category 9 | 3 | 3 | 0 | 1 | 0 | 3 | 1 | 2 | 2 | 0
Category 10 | 1 | 5 | 5 | 4 | 2 | 9 | 8 | 5 | 7 | 4
Category 11 | 0 | 7 | 4 | 2 | 0 | 7 | 5 | 9 | 7 | 3
Category 12 | 2 | 0 | 4 | 0 | 2 | 3 | 5 | 0 | 0 | 1
Category 13 | 7 | 6 | 13 | 2 | 3 | 11 | 6 | 12 | 6 | 0
Category 14 | 14 | 9 | 9 | 2 | 3 | 9 | 6 | 16 | 3 | 0
Category 15 | 0 | 4 | 1 | 5 | 6 | 15 | 13 | 20 | 5 | 1
Category 16 | 22 | 24 | 14 | 17 | 27 | 13 | 26 | 23 | 25 | 19
Category 17 | 7 | 16 | 25 | 5 | 10 | 25 | 28 | 31 | 17 | 2
Category 18 | 20 | 52 | 20 | 8 | 12 | 31 | 37 | 29 | 44 | 11
Category 19 | 0 | 0 | 0 | 0 | 3 | 0 | 7 | 0 | 4 | 0
Category 20 | 0 | 6 | 0 | 19 | 7 | 6 | 2 | 10 | 25 | 18
Category 21 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 0
Category 22 | 14 | 21 | 15 | 8 | 0 | 0 | 0 | 0 | 8 | 13
Category 23 | 7 | 4 | 3 | 2 | 0 | 0 | 0 | 2 | 5 | 0
Category 24 | 5 | 13 | 5 | 9 | 0 | 13 | 5 | 13 | 8 | 0

3.3.2.7.1 Alogliptin

Alogliptin replaced 14 waters involved in single water bridge formation with the protein with hydrophobic atoms, compared with 7 SWB waters replaced with hydrophilic atoms in the ligand. There were 2 waters that formed hydrogen bonds with the protein but were closest to hydrophobic atoms in the protein.
Furthermore, there were 2 more waters involved in hydrogen bonds with the protein sidechain that were displaced by hydrophobic atoms in the ligand. The 2 waters forming hydrophobic interactions with aromatic protein residues were favorably replaced by hydrophobic atoms in the ligand. Of the 4 waters not interacting with the protein, 3 were replaced by hydrophilic atoms and 1 was replaced by a hydrophobic atom in the ligand. There were 22 waters that were not displaced by the ligand but could be used to optimise the ligand structure; these interacted hydrophilically with the protein backbone but were close to hydrophobic atoms in the ligand. 27 waters interacted hydrophilically with the protein sidechain, 7 of which were close to hydrophilic atoms in the ligand and the rest were close to hydrophobic atoms. 16 waters interacted hydrophobically with aromatic protein residues, 2 of which were close to hydrophilic atoms while the rest were close to hydrophobic atoms in the ligand. Waters in bulk (away from the active site of the protein) were also used as points of ligand structure evolution, probably to achieve other objectives such as optimisation of the pharmacokinetic properties of the drug. There were 12 such waters, 7 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand.

3.3.2.7.2 Sitagliptin

Sitagliptin replaced 6 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 9 SWB waters replaced with hydrophobic atoms in the ligand. 7 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms. A strong hydrogen bond interaction with the protein backbone was found to be displaced with a hydrophobic atom by the ligand, while there were 4 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic atoms.
There were 2 waters that were involved in hydrogen bonds with the protein sidechain but were displaced by hydrophobic atoms. There were 4 waters interacting with aliphatic hydrophobic protein residues which were displaced by hydrophobic atoms in the ligand. 6 waters were found to interact hydrophobically with aromatic protein residues, 1 of which was displaced by a hydrophilic atom while the rest were displaced by hydrophobic ligand atoms. 8 waters were found not interacting with the protein, 3 of which were replaced by hydrophilic atoms and the rest by hydrophobic ligand atoms. 28 waters were classified as points of evolution based on hydrophilic backbone interactions, 4 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 68 waters were classified as points of evolution based on hydrophilic sidechain interactions, 16 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 6 waters were points of evolution based on hydrophobic aliphatic protein interactions and close to hydrophobic atoms in the ligand. 21 waters were classified as points of evolution based on hydrophobic aromatic protein interactions and close to hydrophobic atoms in the ligand. 17 waters in bulk were classified as points of evolution, 4 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand.

3.3.2.7.3 Vildagliptin

Vildagliptin replaced 13 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 9 SWB waters replaced with hydrophobic atoms in the ligand. 4 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms, while 4 other waters interacted through hydrogen bonds but were close to hydrophobic atoms in the protein. A strong hydrogen bond interaction with the protein backbone was found to be replaced with a hydrophilic atom by the ligand.
Additionally, there were 4 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 3 waters that were involved in hydrogen bonds with the protein sidechain but were displaced by hydrophobic atoms. 7 waters were found to interact with hydrophobic aromatic protein residues and were displaced by hydrophobic atoms in the ligand. 5 waters not interacting with the protein were replaced by hydrophobic ligand atoms. 15 waters were classified as points of evolution based on hydrophilic backbone interactions, 1 of which was close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 45 waters were classified as points of evolution based on hydrophilic sidechain interactions, 25 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 17 waters were classified as points of evolution based on hydrophobic aromatic protein interactions, 2 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 8 waters in bulk were classified as points of evolution, 3 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand.

3.3.2.7.4 Alprazolam

Alprazolam replaced 2 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 2 other SWB waters replaced with hydrophobic atoms in the ligand. 2 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms. A strong hydrogen bond interaction with the protein backbone was found to be replaced with a hydrophobic atom by the ligand. Additionally, there was 1 more water interacting hydrophilically with the protein sidechain substituted by a hydrophilic ligand atom. 2 waters were found to interact with hydrophobic aliphatic protein residues and were displaced by hydrophilic atoms in the ligand, while 13 such waters were displaced by hydrophobic ligand atoms.
6 waters interacting hydrophobically with aromatic protein residues were displaced by hydrophobic atoms in the ligand. 5 waters not interacting with the protein were replaced with hydrophilic atoms (1 water) and hydrophobic atoms (4 waters) in the ligand. 22 waters were classified as points of evolution based on hydrophilic backbone interactions, 5 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 13 waters were classified as points of evolution based on hydrophilic sidechain interactions, 5 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 19 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions and were found to be close to hydrophobic atoms in the ligand. 8 waters were classified as points of evolution based on hydrophobic aromatic protein interactions and were close to hydrophobic atoms in the ligand. 11 waters in bulk were classified as points of evolution, 2 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand.

3.3.2.7.5 Diazepam

Diazepam replaced 3 waters involved in single water bridge formation with the protein with hydrophilic atoms and 3 other SWB waters with hydrophobic atoms in the ligand. 2 waters were found to interact via hydrogen bonds with the protein but were close to hydrophobic atoms in the protein. There were 2 waters interacting hydrophilically with the protein sidechain substituted by hydrophobic ligand atoms. 1 water was found to interact with hydrophobic aliphatic protein residues and was displaced by a hydrophobic ligand atom. 2 waters were found not interacting with the protein and were displaced by hydrophobic ligand atoms.
33 waters were classified as points of evolution based on hydrophilic backbone interactions, 6 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 22 waters were classified as points of evolution based on hydrophilic sidechain interactions, 10 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 10 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions, 3 of which were found to be close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand.

3.3.2.7.6 Atorvastatin

Atorvastatin replaced 11 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 9 SWB waters replaced with hydrophobic atoms in the ligand. 7 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms, while 3 other waters interacted through hydrogen bonds but were close to hydrophobic atoms in the protein. 4 strong hydrogen bond interactions with the protein backbone were found to be replaced with hydrophobic atoms in the ligand. Additionally, there were 4 other waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 2 waters that were involved in hydrogen bonds with the protein sidechain but were displaced by hydrophobic atoms. 8 waters interacted hydrophobically with aliphatic protein residues and were displaced by hydrophobic atoms in the ligand. 3 waters were found not interacting with the protein and were replaced by hydrophilic ligand atoms, while 9 such waters were replaced by hydrophobic atoms in the ligand. 28 waters were classified as points of evolution based on hydrophilic backbone interactions, 15 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand.
56 waters were classified as points of evolution based on hydrophilic sidechain interactions, 25 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 6 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions and were close to hydrophobic atoms in the ligand. 13 waters in bulk were classified as points of evolution close to hydrophobic atoms in the ligand.

3.3.2.7.7 Fluvastatin

Fluvastatin replaced 6 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 6 other SWB waters replaced with hydrophobic atoms. 5 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms, while 5 other waters interacted through hydrogen bonds but were close to hydrophobic atoms in the protein. 4 strong hydrogen bond interactions with the protein backbone were replaced with hydrophobic atoms by the ligand. Additionally, there were 4 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 2 waters that were involved in hydrogen bonds with the protein sidechain but were displaced by hydrophobic atoms. 2 waters were found to interact with hydrophobic aliphatic protein residues and were displaced by hydrophobic atoms in the ligand. 9 waters found not interacting with the protein were replaced by hydrophilic (1 water) and hydrophobic (8 waters) ligand atoms. 39 waters were classified as points of evolution based on hydrophilic backbone interactions, 13 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 65 waters were classified as points of evolution based on hydrophilic sidechain interactions, 28 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand.
9 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions, 7 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 5 waters in bulk were classified as points of evolution close to hydrophobic atoms in the ligand.

3.3.2.7.8 Rosuvastatin

Rosuvastatin replaced 12 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 16 SWB waters replaced with hydrophobic atoms in the ligand. 9 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms. A strong hydrogen bond interaction with the protein backbone was found to be replaced with a hydrophobic atom by the ligand. Additionally, there were 3 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 5 waters involved in hydrogen bonds with the protein sidechain that were displaced by hydrophobic atoms. 1 water was found to interact with hydrophobic aliphatic protein residues and was displaced by a hydrophilic atom in the ligand, while 1 other was displaced by a hydrophobic atom. 7 waters found not interacting with the protein were replaced by hydrophobic ligand atoms (5 waters) and hydrophilic atoms (2 waters). 43 waters were classified as points of evolution based on hydrophilic backbone interactions, 20 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 60 waters were classified as points of evolution based on hydrophilic sidechain interactions, 31 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 10 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions close to hydrophobic atoms in the ligand. 15 waters in bulk were classified as points of evolution, 2 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand (Table 5).
3.3.2.7.9 Glibenclamide

Glibenclamide replaced 6 waters involved in single water bridge formation with the protein with hydrophilic atoms, compared with 3 SWB waters replaced with hydrophobic atoms in the ligand. 7 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms. A strong hydrogen bond interaction with the protein backbone was found to be replaced with a hydrophobic atom by the ligand. Additionally, there were 3 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 3 waters that were involved in hydrogen bonds with the protein sidechain but were displaced by hydrophobic atoms. 1 water was found to interact with a hydrophobic aliphatic protein residue and was displaced by a hydrophobic atom in the ligand. 7 waters interacting with hydrophobic aromatic protein residues were substituted by hydrophobic atoms in the ligand. 9 waters were found not interacting with the protein, 7 of which were replaced by hydrophobic ligand atoms and the rest by hydrophilic atoms. 30 waters were classified as points of evolution based on hydrophilic backbone interactions, 5 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 61 waters were classified as points of evolution based on hydrophilic sidechain interactions, 17 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 29 waters were classified as points of evolution based on hydrophobic aliphatic protein interactions, 4 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 12 waters were classified as points of evolution based on hydrophobic aromatic protein interactions, 4 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 13 waters in bulk were classified as points of evolution, 5 of which were close to hydrophilic atoms and the rest were close to hydrophobic atoms in the ligand.
3.3.2.7.10 Ibuprofen

Ibuprofen does not interact with any waters forming contact single water bridges. 3 waters were found to interact via hydrogen bonds with the protein and were favorably close to hydrophilic atoms, while 1 other water interacted through a hydrogen bond but was close to hydrophobic atoms in the protein. There were 4 waters interacting hydrophilically with the protein sidechain substituted by hydrophilic ligand atoms. There were 6 waters interacting with hydrophobic aliphatic and 3 with aromatic protein residues that were displaced by hydrophobic atoms in the ligand. 4 waters were found not interacting with the protein and were replaced by hydrophobic ligand atoms. 20 waters were classified as points of evolution based on hydrophilic backbone interactions, 1 of which was close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 13 waters were classified as points of evolution based on hydrophilic sidechain interactions, 2 of which were close to hydrophilic atoms and the rest to hydrophobic atoms in the ligand. 18 waters classified as points of evolution based on hydrophobic aliphatic and 13 based on hydrophobic aromatic protein interactions were found to be close to hydrophobic atoms in the ligand. There were no points of evolution based on waters in bulk.

3.4 Discussion

The program developed in this chapter (WALE) helps to identify intricate water-protein-ligand interactions in detail. Although it currently exists as twelve separate scripts, the program has the potential to be unified into a single script that takes the CSV file from SOLVATE as input and creates the SDF file of the water constellations as output. The water categories were carefully developed to demonstrate the juxtaposition of the kinds of interactions that can occur between two atoms.
During the analysis of the results, it was found in many instances that although a water and part of a protein residue may interact through a hydrogen bond, the water itself may be surrounded by many hydrophobic contacts. During drug design, the bonds between a biological protein and its surrounding waters are studied, and functional groups are loaded onto the pharmacophore to try to mimic these bonds [65]. In such cases, hydrogen bonds are of great importance since they are strong and contribute to the exceptional binding ability of the ligand [66]. Categories of water such as “Forms HB with protein and closest atom is HB in protein” and “Forms HB with protein but closest atom is HF in protein” distinguish waters that participate in hydrogen bonding but exist in a hydrophobic environment, which raises a question in drug design regarding the functional group that should replace such waters. This adds a layer of investigation and careful consideration to the design, and the answer ultimately depends on the objectives of the medicinal chemist developing the drug. Fluvastatin and atorvastatin (having 5 and 3 waters, respectively, forming hydrogen bonds with the protein but close to hydrophobic atoms, Table 5) are examples of such cases. Studying these categories also allows investigation into ligand structure optimization. Ideally, the ligand should participate in all the interactions that the waters had with the protein before they were displaced [67]. However, due to pharmacokinetic and synthetic difficulties, this ideal ligand structure is often not realized. This analysis offers a brief overview of how much the structure can be optimized. Categories like “Contact SWB and closest atom in ligand is HB” and “Contact SWB but closest atom in ligand is HF” describe the existence of single water bridges in a biological protein and their replacement by the ligand [68].
Single water bridges are essential in maintaining the protein structure and conformational stability, and they can be formed only when the atoms on either side of the water are hydrophilic and in an appropriate geometrical orientation [69,70]. When the ligand replaces a water that was involved in a water bridge with a hydrophobic atom, this contact is lost, leading to an energy penalty [71]. Rosuvastatin and alogliptin (16 and 14 contact SWB waters having hydrophobic ligand contact, Table 5) are examples of a compromise on the formation of water bridges. Ligand structures can be optimized by overcoming this phenomenon without compromising on other drug-like properties of the ligand.

Hydrogen bond interactions of water with the protein backbone are stronger than with the protein sidechain [72]. Categories like “HB to backbone of protein displaced with HB by ligand”, “HB to backbone of protein displaced with HF by ligand”, “HB to sidechain of protein displaced with HB by ligand” and “HB to sidechain of protein displaced with HF by ligand” examine the waters forming hydrogen bonds to either the protein backbone or sidechain that were displaced by either hydrophobic or hydrophilic atoms in the ligand. By calculating the energies involved in bond loss or gain derived from the ligand structure and comparing them against this analysis, the medicinal chemist can even estimate the binding energy of the ligand. Similarly, the presence of water close to hydrophobic aromatic protein residues results in weaker binding as compared to the presence of water close to aliphatic residues [73].
Hence, replacement of such waters by a hydrophobic atom in the ligand, especially proximal to aromatic residues, results in better binding, as captured by the “HF to aromatic residue of protein displaced with HF by ligand”, “HF to aromatic residue of protein displaced with HB by ligand”, “HF to aliphatic residue of protein displaced with HF by ligand” and “HF to aliphatic residue of protein displaced with HB by ligand” categories. Both atorvastatin and fluvastatin displace 4 waters forming hydrophilic backbone interactions to the protein with less favorable hydrophobic atoms (Table 5). Sitagliptin undergoes only 1 such displacement of water interacting hydrophobically with an aromatic protein residue. These statistics might also hint at the relative importance of hydrophilic versus hydrophobic bonds as well as their compensation by contacts with other waters or hydrophobic residues. To give a deeper insight into water displacement by the ligand, the categories “No interaction to protein displaced with HB by ligand” and “No interaction to protein displaced with HF by ligand” are included; these cover the waters that do not interact either hydrophilically or hydrophobically with the protein.

Points of evolution are 3D coordinates of positions at which a ligand structure can be considered for extension by addition of functional groups or replacement of non-favorable groups with favorable ones. All points of evolution water categories given in Table 5 include waters that are currently not displaced by the ligand but could be displaced if a medicinal chemist wanted to objectively optimize the ligand structure. They identify the existing interactions of the water with the protein and the ligand and suggest the changes (either hydrophilic or hydrophobic) that could be made in the ligand structure.
Additionally, the points of evolution based on water in bulk categories were specifically developed for structure evolution away from the active site of the protein to counter issues other than binding affinity, such as synthetic difficulties and optimization of pharmacokinetic properties.

A quick look at Table 5 suggests that the ten investigated ligands have several hydrophobic groups in proximity to waters that interact through hydrogen bonds with the protein. If a medicinal chemist were to switch these groups into hydrophilic groups, it would result in the formation of a water bridge network, leading to favorable binding. Furthermore, only alogliptin, vildagliptin and glibenclamide have hydrophilic groups close to hydrophobic aromatic protein-water interactions, with counts of 2, 2 and 4 waters respectively. This leads to the general idea that ligand structures tend to conserve hydrophobic contacts with the protein rather than establish hydrophilic contacts. This may be due to many reasons, including conventional methods of drug design, the importance of hydrophobic contacts with the protein, and the determination of optimal pharmacokinetic properties. The idea needs to be explored further before it can be stated as a conclusion.

This chapter offers a detailed analysis and discussion of the types of protein-water and protein-ligand interactions and the various factors that determine which interaction, either hydrophilic or hydrophobic, will overpower the other. Studying the count of waters in various categories for ligands across several disease domains gives an insight into the similarities of ligands in different pharmacological categories. It also gives an idea about the manner of binding with the protein as well as details in the active site that could give rise to a similar water count. The concept of similarity using the surrounding waters of a protein is explored in the next chapter.
3.5 Data sharing

The input CSV files, Python scripts for processing of data and the output CSV files discussed in this chapter are available at https://github.com/ruch555/Investigating-effects-of-solvationusing-SOLVATE.git.

Chapter 4: Developing water constellations using WALE and comparing similarities using 3D similarity screening

4.1 Background

The previous chapter highlighted how solvated waters interact with the empty biological protein and can be used to predict ligand-protein binding interactions. Using the same waters but studying water-ligand interactions offers a unique perspective on the process of ligand-protein binding. This chapter provides a way to use the solvated waters of an empty protein, study their interactions with the ligand, develop water constellations, and compute similarity scores, ultimately leading to the design of a new drug molecule. For this chapter, the same ten ligand-protein complexes as given in Table 1 are used as input data. SOLVATE processing of PDB files took place as mentioned in Section 3.1.2. Python was the language used for processing of the initial data. In ADMET Predictor®, the 3D similarity screening module was used to compute similarity scores.

4.1.1 Processing for 3D similarity screening

4.1.1.1 Processing PDB files

The PDB files of all ligands mentioned in Table 1 were downloaded. The protein was separated from the ligand using ViewerPro and only the ligand PDB files were used in further analysis.

4.1.1.2 Creating 3D conformer database

The only way by which a 3D conformer database can be prepared for multiple ligands at once is by combining all drug structures in an SDF format. Each ligand PDB file was converted into an SDF file using ViewerPro. The SDF files for each ligand were opened in Notepad and the structural information for each ligand was pasted one after the other, separated by the “M END $$$$” insertion.
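The manual concatenation described above can be sketched in Python. This is only a minimal illustration and not the code that the thesis mentions as being under development: the function name `combine_sdf_records` and the two toy records are hypothetical, and the only format fact relied on is that each record in a multi-molecule SDF file ends with an "M  END" line followed by a "$$$$" delimiter line.

```python
def combine_sdf_records(records):
    """Join single-molecule SDF/MOL texts into one multi-record SDF string."""
    combined = []
    for text in records:
        # Trim trailing whitespace only: the first header line of a record
        # may legitimately be blank, so leading lines must be left alone.
        body = text.rstrip()
        # Add the record delimiter if the exporting tool did not write it.
        if not body.endswith("$$$$"):
            body += "\n$$$$"
        combined.append(body)
    return "\n".join(combined) + "\n"


# Two toy records (not real structures) that only exercise the delimiter logic.
rec_a = "ligand_a\n  toy\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
rec_b = "ligand_b\n  toy\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
combined_sdf = combine_sdf_records([rec_a, rec_b])
```

In practice the record texts would be read from the per-ligand SDF files exported from ViewerPro, and the combined string written to a single file for loading into ADMET Predictor®.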
This “M END $$$$” insertion is crucial for ADMET Predictor® to recognize that multiple SDF structures are included in a single SDF file. Code to generate this file is under development, but was not used here. This combined SDF file served as input to the 3D similarity screening module in ADMET Predictor®. To ensure that only the conformations specified in the SDF file are used, the user needs to check the “Use existing 3D conformers instead of generating them” box. The conformer database is created in the form of a bin file.

4.1.1.3 Computing 3D similarity

The combined SDF was loaded into ADMET Predictor® as the input query molecules to calculate an n x n similarity matrix with the database molecules. The bin file served as the database consisting of the reference molecules against which similarity is calculated. Unchecking the box “Treat all atoms as carbons (faster)” is important to ensure that the atomic volume of all atoms in the different ligand structures (Table 1) is not uniform and each atom is assigned its own atomic volume. The relative weight for chemical features as compared to shape while calculating similarity was set to 3, as this was found to be the optimum distribution of weight. The minimum threshold for the Tanimoto score was 0.0. The output is generated in the form of SDF files which contain the 3D alignment of each test and reference molecule. Additionally, a TXT file consisting of the similarity scores of all the query and reference molecules is generated.

4.2 Theory

4.2.1 Python script to isolate protein solvated waters from the rest of the output CSV file from SOLVATE

The output from SOLVATE consists of solvated waters for the empty protein as well as for the protein-ligand complex. For this chapter, only the solvated waters for the empty protein were required.
Hence, a Python script (Appendix 13, component 9 in Figure 10) was developed to isolate the solvated waters for the empty protein from the rest of the information and write out CSV files for each drug in a separate folder. The logic used was that when the code found an empty row, it would collect all the information above the row and write it out in a separate CSV.

4.2.2 Python script to identify interested categories of solvated waters and convert them to carbons by default

A script (Appendix 14, component 10 in Figure 10) was developed to put a filter on the “Fate” column of the CSV file to select water categories like “Absolute Displacement”, “Contact Displaced Bulk”, “Contact Displaced HF”, “Contact SWB”, “Contact SWH Lig HB” and “Contact SWH Prot HB”. The water numbers resulting after filtering were stored in the form of a list. In the corresponding solvated PDB file of the drug, the code identified the waters that matched the numbers in the list, copied the entire row containing their positional 3D coordinates and pasted this information into a separate PDB file. In the original PDB, the waters were represented as “O”. In the modified PDB file, these were converted to “C”.

4.2.3 Classification of the water-converted-carbons as either hydrophilic or hydrophobic – Creation of water constellations

4.2.3.1 Ligand-based water constellations

The carbons in the modified PDB files need to be classified as either hydrophilic or hydrophobic. For this script (Appendix 15, component 11 in Figure 10), the logic used to classify the carbons is based on what the “Closest Lig Atom” is to the water-converted-carbon. This is useful for generating a water constellation that is as similar to the ligand as possible for calculating 3D similarity scores.

4.2.3.2 Protein-interaction based water constellations

For this script (Appendix 16, component 11 in Figure 10), the logic used to classify the carbons is given in Table 6.
This is essential to identify the interactions the waters have with the protein so that the medicinal chemist can evolve the ligand structure with the objective of maximizing protein interactions, possibly leading to stronger binding.

Table 6. Logic used for generation of protein-interaction based water constellations

Water categories | Column used for checking the logic | Logic used for value checking | Logic used for classification
Absolute Displacement | “ProtHB” | If column has a value = “TRUE” | Classify as “O”
Absolute Displacement | “ProtHF” | If column has a value = “TRUE” | Classify as “C”
Absolute Displacement | “ProtHB”, “ProtHF” and “Distance” | If both “FALSE”, then check “Distance”. If “Distance” has values <5 Å | Classify as “C”
Absolute Displacement | “ProtHB”, “ProtHF” and “Distance” | If both “FALSE”, then check “Distance”. If “Distance” has values >5 Å | Classify as “O”
Contact SWB | “ProtHB” | If value = “True” | Classify as “O”
Contact SWH Prot HB | “ProtHB” | If value = “True” | Classify as “O”
Contact SWH Lig HB | “ProtHB” | If value = “True” | Classify as “C”
Contact Displaced HF | “ProtHB” | If value = “True” | Classify as “C”
Contact Displaced Bulk | “ProtHB” | If column has a value = “TRUE” | Classify as “O”
Contact Displaced Bulk | “ProtHF” | If column has a value = “TRUE” | Classify as “C”
Contact Displaced Bulk | “ProtHB” and “ProtHF” | If both “FALSE” | Classify as “O”

4.2.4 Fortran script for establishment of bonds between the atoms

The water constellations generated in Section 4.2.3 consist of individual atoms. To establish bonds between them, a Fortran code (Appendix 17, component 12 in Figure 10) was developed that calculates the distances between all pairs of atoms, identifies the farthest atom, and then starts establishing bonds to the nearest atom. The currently developed Fortran code considers a single PDB file at a time and generates a connected water constellation in the SDF format. This was done to make it easier to compile multiple SDFs into one and load it into ADMET Predictor®.
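The classification rules of Table 6 can be written as a small function. The sketch below is an illustration in Python and not the script from Appendix 16. The function name and argument names are hypothetical, and three choices are assumptions because Table 6 does not specify them: “ProtHB” is checked before “ProtHF”, a “Distance” of exactly 5 Å is treated as “O”, and `None` is returned for combinations that the table does not cover.

```python
def classify_protein_based(fate, prot_hb, prot_hf, distance=None):
    """Return "O" (hydrophilic) or "C" (hydrophobic) for one water, per Table 6.

    Returns None when Table 6 gives no rule for the combination (assumption).
    """
    if fate in ("Absolute Displacement", "Contact Displaced Bulk"):
        if prot_hb:          # assumption: "ProtHB" takes precedence over "ProtHF"
            return "O"
        if prot_hf:
            return "C"
        if fate == "Contact Displaced Bulk":
            return "O"       # both flags FALSE
        # Absolute Displacement with both flags FALSE: decide on "Distance".
        if distance is None:
            return None
        return "C" if distance < 5.0 else "O"   # exactly 5 Å assumed to be "O"
    if fate in ("Contact SWB", "Contact SWH Prot HB"):
        return "O" if prot_hb else None
    if fate in ("Contact SWH Lig HB", "Contact Displaced HF"):
        return "C" if prot_hb else None
    return None


# Example calls with made-up inputs.
near = classify_protein_based("Absolute Displacement", False, False, 3.2)
far = classify_protein_based("Absolute Displacement", False, False, 7.0)
bridge = classify_protein_based("Contact SWB", True, False)
```

Keeping the rules in one function makes it straightforward to apply them row by row to the filtered CSV file and to see which combinations of values fall outside the table.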
4.2.5 Validation of water constellations by computing 3D similarity scores against reference molecules

Using the water constellations to grow the ligand required a validation check of how similar the constellations were to the 3D ligand structures. An SDF file consisting of the 10 ligand structures (Table 1) and their corresponding water constellations was compiled for the 3D Similarity Screening module in ADMET Predictor®. An n x n matrix similarity calculation was performed to obtain the Tanimoto scores. All other details remain the same as given in Section 4.1.1. Scores were obtained for similarity calculations of ligands versus ligand-based water constellations as well as protein-interaction based water constellations.

4.3 Results

4.3.1 Python script to isolate protein solvated waters from the rest of the output CSV file from SOLVATE

During analysis of the output files obtained from SOLVATE, it was observed that the CSV file contained information on waters solvated around the empty protein as well as the protein-ligand complex. Furthermore, the empty protein solvation information was separated from the protein-ligand complex solvation information by an empty row. The Python script used this empty row as an identifier, collecting all information given above it and writing it to a separate CSV file. For sitagliptin, there were 356 waters (as referenced in Figure 17) which were added during solvation of the empty protein.

Figure 16. Representation of the initial output obtained from SOLVATE for Sitagliptin, containing information of solvated waters for empty protein and protein-ligand complex, separated by WALE into a separate CSV file.
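The empty-row logic described in Section 4.3.1 can be sketched with the Python standard library. This is a minimal illustration rather than the script in Appendix 13: the function name, the “Water” column and the sample rows are hypothetical (only the “Fate” column name comes from the text), and a row is assumed to count as empty when every cell is blank.

```python
import csv
import io


def rows_before_first_empty(csv_text):
    """Return the CSV rows that precede the first empty row."""
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        # A blank line parses as [] and a line of bare commas as ['', ''];
        # both are treated as the separator between the two blocks.
        if not any(cell.strip() for cell in row):
            break
        kept.append(row)
    return kept


# Made-up sample: an empty-protein block, an empty row, then a complex block.
sample = (
    "Water,Fate\n"
    "1,Contact SWB\n"
    "2,Absolute Displacement\n"
    ",\n"
    "1,Complex water\n"
)
empty_protein_rows = rows_before_first_empty(sample)
```

The returned rows could then be written out with `csv.writer` to one file per drug, which corresponds to the separate CSV file described above.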
4.3.2 Python script to identify interested categories of solvated waters and convert them to carbons by default

The waters of interest were selected by the Python script and the output obtained was in the form of a PDB file consisting of waters (represented by “O”) at various positions in 3D space (Box B, Figure 18). To obtain a clean working slate, every “O” was converted into a “C”, which was then used in further analysis (Box D, Figure 18). Out of the total 356 empty protein solvated waters (Box B, Figure 18), 26 waters were selected to form the water constellation (Boxes C and D, Figure 18).

Figure 17. Obtaining the PDB file consisting of the interested water categories using WALE. A. Solvated structure of the target protein of Sitagliptin (DPP IV) with all waters (represented as red spheres). B. Solvated waters of the empty protein. C. Water categories filtered by the Python code represented as “O” atoms (in red). D. Water categories converted to “C” atoms (in grey) by the Python script.

4.3.3 Classification of the water-converted-carbons as either hydrophilic or hydrophobic – Creation of water constellations

4.3.3.4 Ligand-based water constellations

The Python script works through the CSV file to identify the ligand atoms closest to the carbons in the water constellation (represented as the superimposition of the ligand structure and water constellation in Box B, Figure 19) and converts the atoms in the PDB file to either hydrophilic atoms, differentiated as “O” (represented in red in Box C, Figure 19) or “N” (represented in blue in Box C, Figure 19), or hydrophobic atoms as “C” (represented in grey in Box C, Figure 19). The output generated is a classified PDB file containing the water constellation. As shown in Box B, Figure 19, water 14 is close to a carbon atom, water 19 is close to a nitrogen atom and water 202 is close to an oxygen atom in the ligand.
According to the logic of the code, water 14 remains a carbon atom, water 19 converts into a nitrogen atom and water 202 converts into an oxygen atom (as shown in Box C, Figure 19).

Figure 18. Obtaining the ligand-based water constellation for Sitagliptin using WALE. A. Default “C”-based water constellation of the ligand. B. Superimposition of the ligand structure (N = blue, O = red, F = cyan, C = grey, and dotted lines represent aromaticity) with its “C”-based water constellation. Closest ligand atoms to waters 14, 19 and 202 are carbon, nitrogen, and oxygen respectively. C. Ligand-based water constellation derived from the closest atom in the ligand to each “C” in the “C”-based water constellation generated by the Python code. Water 14 is classified as carbon, water 19 as nitrogen and water 202 as oxygen.

4.3.3.5 Protein-interaction based water constellations

Depending on the presence of a hydrogen bond, the script classifies the waters as either hydrophilic (“O” atoms) or hydrophobic (“C” atoms). In Box A, Figure 20, two amino acid residues, GLU 180 and TYR 621, are considered. Waters 14 and 19 interact with the “O” (backbone interaction) and “OE2” (sidechain interaction) atoms of GLU 180 respectively, while water 202 interacts with the “CE2” atom of TYR 621 (Box B, Figure 20). Both waters 14 and 19 are classified as hydrophilic “O” atoms due to their hydrophilic protein interactions, while water 202 is classified as a hydrophobic “C” atom.

This result helps to illustrate the contrasting classification logic used in preparing ligand-based and protein-interaction based water constellations. The same three waters (14, 19 and 202) are classified in different ways depending on whether the code looks for similarity with the ligand or interactions with the protein.
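The contrast between the two classification schemes for waters 14, 19 and 202 can be reproduced with a short sketch. It is a simplified illustration, not the scripts from Appendices 15 and 16: the function names and dictionary keys are hypothetical, the protein-interaction rule is reduced to the presence or absence of a protein hydrogen bond as stated in the paragraph above, and ligand atoms other than N and O (for example F) are assumed to map to “C”.

```python
def ligand_based_label(closest_lig_element):
    """Label a water by the element of its closest ligand atom."""
    # N and O are kept as hydrophilic labels; any other element is
    # assumed here to be hydrophobic and labelled "C".
    return closest_lig_element if closest_lig_element in ("N", "O") else "C"


def protein_based_label(forms_protein_hbond):
    """Label a water by whether it hydrogen bonds to the protein."""
    return "O" if forms_protein_hbond else "C"


# Values for the three sitagliptin waters as described in the text:
# closest ligand atom, and whether the protein contact is a hydrogen bond.
waters = {
    14: {"closest_lig": "C", "prot_hb": True},    # backbone O of GLU 180
    19: {"closest_lig": "N", "prot_hb": True},    # sidechain OE2 of GLU 180
    202: {"closest_lig": "O", "prot_hb": False},  # aromatic CE2 of TYR 621
}

labels = {
    number: (ligand_based_label(w["closest_lig"]), protein_based_label(w["prot_hb"]))
    for number, w in waters.items()
}
```

Each entry of `labels` holds the ligand-based label first and the protein-interaction based label second, so the three waters come out as C/O, N/O and O/C, matching the classifications described for Figures 19 and 20.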
This result also highlights the extent of optimization of the ligand structure by comparing the ligand-based water constellation to the ideal protein interactions that could potentially take place by modifying the ligand structure and converting or adding the right functional groups at the right place.

Figure 19. Obtaining the protein-interaction based water constellation for Sitagliptin. A. The protein (DPP IV) with the interested categories of solvated waters converted to carbons by default. Two amino acid residues GLU 180 and TYR 621 are highlighted as examples for further investigation. B. Representation of the interactions each water has with the protein. Waters 14 and 19 interact with different oxygens on GLU 180 while water 202 interacts with the aromatic carbon of TYR 621. C. Protein-interaction based water constellation classified as per the interactions present between the protein and the waters. Waters 14 and 19 are classified as oxygens while water 202 stays a carbon.

4.3.4 Fortran script for establishment of bonds between the atoms

Figure 21 shows the establishment of bonds between the separated waters in the constellation. The script runs through the PDB file by finding the next nearest atom and connecting it to the previous atom. Branching takes place when two atoms are equidistant to the previous atom. Parts A and a of Figure 21 show the ligand-based water constellation, whereas parts B and b show the protein-interaction based water constellation of sitagliptin.

Figure 20. A. Ligand-based water constellation before Fortran script, a. Ligand-based water constellation after Fortran script. B. Protein-interaction based water constellation before Fortran script, b. Protein-interaction based water constellation after Fortran script.
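The bond-building idea can be illustrated with a short Python sketch. This is not the Fortran code from Appendix 17 but one possible reading of its description: the chain starts at one end of the most widely separated pair of atoms and each atom is then bonded to its nearest unvisited neighbour. The function name and the toy coordinates are hypothetical, and the branching on equidistant atoms mentioned above is not reproduced here.

```python
import math


def chain_bonds(coords):
    """Return bonds (index pairs) from a greedy nearest-neighbour walk."""
    n = len(coords)
    if n < 2:
        return []

    def dist(i, j):
        return math.dist(coords[i], coords[j])

    # Assumed reading of "identify the farthest atom": take one end of the
    # pair of atoms that are farthest apart as the starting point.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    start, _ = max(pairs, key=lambda pair: dist(*pair))

    bonds = []
    current = start
    unvisited = set(range(n)) - {start}
    while unvisited:
        # Bond the current atom to the closest atom not yet in the chain.
        nearest = min(unvisited, key=lambda j: dist(current, j))
        bonds.append((current, nearest))
        unvisited.remove(nearest)
        current = nearest
    return bonds


# Toy coordinates in Å (made up, roughly collinear).
toy = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.2, 0.0), (4.6, 0.0, 0.0)]
toy_bonds = chain_bonds(toy)
```

A walk of this kind always yields a single connected chain with one bond fewer than the number of atoms, which is sufficient for writing a connected SDF record even though the resulting bonds carry no chemical meaning.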
4.3.5 Validation of water constellations by computing 3D similarity scores against reference molecules

Similarity checks using the 3D similarity screening module in ADMET Predictor® were performed twice, once using the ligand-based water constellations and ligand structures and a second time using the protein-interaction based water constellations and ligand structures. Figure 22 shows the heatmap of the similarity scores calculation where the 10 ligands (Table 1) are query molecules while their ligand-based water constellations are the database molecules. Using the constellations as the molecules against which each drug’s similarity is calculated helps to understand whether a given drug’s water constellation is more similar to its own drug than to others. In Figure 22, the water constellations of 7 ligands out of 10 (highlighted by green boxes) were found to be more similar to their own ligands (considered as hits), suggesting that ligand-based water constellations can be understood as a representation of the drug structures even in the absence of the ligand.

Figure 23 shows the heatmap of the similarity scores calculation where the 10 ligands (Table 1) are query molecules while their protein-interaction based water constellations are the database molecules. When comparing the hits of the same ligands as in Figure 22, only one drug (ibuprofen) was a hit in Figure 23. In total, only 2 ligands out of 10 matched with their own water constellations (ibuprofen and alprazolam). As previously discussed, if the ligand-based water constellations are considered as a representation of the ligand structures, then it can be concluded that the ligand structures do not reach the complete potential of their interactions with the protein. Furthermore, Figure 24 highlights the difference in the similarity scores of the ligand-based and protein-interaction based water constellations obtained for each ligand.
The positive numbers across the diagonal indicate that the ligand structure matches the ligand-based constellation more closely than the protein-interaction based constellation. Ligand structures do not accurately mimic the ideal, naturally existing protein-water interactions, leading to the idea that ligand structures have the potential to be optimized for better interactions and stronger binding.

Figure 21. Heatmap of similarity scores of ligand-based water constellations and ligand structures (Green highlighted box = Hits).

Figure 22. Heatmap of similarity scores of protein-interaction based water constellations and ligand structures (Green highlighted box = Hits, Red highlighted box = Misses as compared to Figure 22).

Figure 23. Heatmap of the diagonal differences between similarity scores of ligand-based and protein-interaction based water constellations

4.4 Discussion

The objective of creating ligand-based and protein-interaction based water constellations was to explore whether there were any differences between the two. The ligand-based water constellation, which is completely based on the closest atoms in the ligand to the water, can be considered as a direct representation of the ligand’s structural constituents (Figure 19). The heatmap detailing the similarity scores of the ligand-based water constellation against the ligand indicates that most of the water constellations (7 out of 10) matched best with the ligands they were derived from. The heatmap analysis served as a form of method validation for further understanding and for drawing conclusions. The heatmap detailing the similarity scores of the protein-interaction based water constellation against the ligand indicates that most of the water constellations (8 out of 10) do not match with the ligands they were derived from.
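The hit counting and the diagonal differences discussed above can be sketched as follows. The 3 x 3 matrices are made-up numbers for illustration only and are not scores from the thesis. The function names are hypothetical, and a “hit” is assumed to mean that a constellation’s score with its own ligand is the highest in its column; the text does not state whether rows or columns were compared.

```python
def self_hits(scores):
    """Indices j whose own ligand gives the highest score for constellation j.

    scores[i][j] is the similarity of query ligand i to database
    constellation j (assumed layout).
    """
    n = len(scores)
    hits = []
    for j in range(n):
        column = [scores[i][j] for i in range(n)]
        if column[j] == max(column):
            hits.append(j)
    return hits


def diagonal_difference(ligand_based, protein_based):
    """Ligand-based minus protein-interaction based score for each ligand."""
    return [
        round(ligand_based[k][k] - protein_based[k][k], 3)
        for k in range(len(ligand_based))
    ]


# Illustrative matrices only (not thesis data).
lig = [[0.60, 0.30, 0.25], [0.35, 0.55, 0.50], [0.20, 0.45, 0.42]]
prot = [[0.40, 0.45, 0.30], [0.38, 0.36, 0.41], [0.22, 0.30, 0.44]]

lig_hits = self_hits(lig)
prot_hits = self_hits(prot)
diffs = diagonal_difference(lig, prot)
```

A positive entry in `diffs` corresponds to a ligand that resembles its ligand-based constellation more than its protein-interaction based one, which is the pattern described for the diagonal of Figure 24.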
The protein-interaction based water constellations generally underperform when compared with their ligand-based counterparts when similarity against the original ligands is calculated (Figure 24). This is another way of saying that the ligand structures have not been fully optimized to interact with the protein. This could be due to various reasons, such as adjusting for synthetic difficulty or improving the pharmacokinetic properties [74,75]. This chapter provides a way to visualize the extent as well as the direction of optimization the ligand structures can undergo. It is exemplified in this chapter with the help of sitagliptin.

4.5 Data sharing

The input CSV files, Python codes for processing of data and the output CSV files discussed in this chapter are available at https://github.com/ruch555/Application-of-the-effects-ofsolvation.git.

Chapter 5: Ligand evolution using AIDD

5.1 Background

The previous chapter established the need for ligand structure optimization using ideal protein interaction-based water constellations. This chapter uses the AIDD module in ADMET Predictor® to evolve ligands using artificial intelligence. Three different types of runs were performed in AIDD to understand its potential in ligand evolution and generation of candidate molecules. The overall objective of this chapter was to generate structures of candidate molecules with more similarity to the protein interaction-based water constellation than the pharmacological ligand. This would achieve the goal of ligand structure optimization established in the previous chapter.

5.1.1 Processing for AIDD

For this chapter, an SDF file of the drug of interest was loaded into ADMET Predictor® as the seed molecule. In the AIDD module, the box for using 3D models was checked. In the 3D options, it was ensured that the option for generating 3D conformers of the seed molecule was switched off.
The objective functions used in this chapter were either “<Synthetic_Difficulty+>” or 3D similarity. 3D similarity is not available as an objective function by default in AIDD; hence, it must be loaded by checking the “Compute objectives using external application” box. The TXT file that outlines the instructions for computing Tanimoto scores, such as the feature weight, the minimum score and the reference molecule against which 3D similarity is computed, can be modified by the user. The scaffold query defines the pharmacophoric feature which can be derived from the seed molecule and must be present in all the candidate molecules generated by AIDD. The transform rules used to generate new molecules, as well as the product filter criteria which every candidate must fail in order to proceed to the next level of optimization in AIDD, can be modified by the user. As the output, several SMI files are created containing details of the molecules produced in each generation and the objective similarity scores computed against the reference molecule. A summary SDF file giving 2D structural information along with a complete list of candidate molecules and similarity scores is also generated.

5.2 Theory

The ligand-based water constellation was used as a check of similarity with the ligand, as the constellation was completely based on the ligand structure. The protein-interaction based water constellation was used to predict the optimum ligand structure with possibly the highest beneficial interactions with the protein. Three types of AIDD runs were performed, varying in the selection of the seed molecule and the reference molecule, where the reference molecule refers to the molecule against which similarity is calculated. Each run was performed for 100 generations. The objective functions used for all runs were to minimize synthetic difficulties and maximize 3D similarity with the reference molecule.
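The trade-off between the two objectives can be illustrated with a simple non-dominated (Pareto) filter. This sketch is not a description of how AIDD selects molecules internally, which the text does not specify. It only shows which candidates are not beaten on both objectives at once, using the 3D similarity and synthetic difficulty scores later reported in Table 7; the function name `pareto_front` is hypothetical.

```python
def pareto_front(candidates):
    """Names of candidates not dominated on (higher similarity, lower difficulty).

    candidates: list of (name, similarity, synthetic_difficulty) tuples.
    """
    front = []
    for name, sim, diff in candidates:
        # A candidate is dominated if another one is at least as good on
        # both objectives and strictly better on at least one of them.
        dominated = any(
            s >= sim and d <= diff and (s > sim or d < diff)
            for _, s, d in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Scores as listed in Table 7 (Run 1).
table7 = [
    ("Compound 1", 0.481, 5.523),
    ("Compound 2", 0.480, 8.984),
    ("Compound 3", 0.475, 7.207),
    ("Compound 4", 0.474, 5.343),
    ("Compound 5", 0.470, 5.184),
    ("Compound 6", 0.470, 5.294),
    ("Compound 7", 0.470, 7.307),
    ("Compound 8", 0.468, 5.454),
    ("Compound 9", 0.468, 8.623),
    ("Compound 10", 0.464, 5.307),
]
run1_front = pareto_front(table7)
```

For these ten molecules the filter keeps Compounds 1, 4 and 5; each of the remaining compounds is matched or beaten on both similarity and synthetic difficulty by at least one of these three.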
Ligand structures were manually modified and then fed into AIDD based on a set of rules as given in Figure 25. The objective of this manual modification was to incorporate the ideally interacting water-converted-atoms into the original ligand structure simply through the establishment of bonds. These bonds were established without regard for organic chemistry, and the modified structure often violated conventional chemical structure rules like valence and bond length. AIDD was used to read this unrealistic chemical moiety, understand the nature of substitutions and type of atoms, and convert it into real candidate chemical compounds. A simple part of the ligand structure was selected as the scaffold query ([#7]1:[#6]:[#7]:[#7]:[#6]:1 (Encoded query from sketch tool used in AIDD to specify scaffold queries)) to aid AIDD in generating molecules similar to the ligand. For filtering the waters of interest, the CSV file output obtained from SOLVATE was used. For reference purposes as well as establishment of bonds, the water constellation SDF files were used. For this chapter, the analysis of sitagliptin is used for reference and given in detail.

Figure 24. Rules for deciding on substitutions in the original ligand structure to produce the manually modified ligand.

5.2.1 Using water constellations to evolve ligands with AIDD

The runs in AIDD were performed in sequential order, and parameters were refined in subsequent runs depending on the success of the previous runs. The first two runs were performed to understand the capabilities of AIDD with respect to the initial molecules it can handle and the kind of candidate molecules it would generate. The runs offered insights into the algorithm behind AIDD and the varying weight it gives to the seed, scaffold and 3D similarity references depending on several factors in each run. Run 3, however, was an ambitious run wherein the problem statement posed in the last chapter was directly addressed.
The objective of Run 3 was to generate drug-like candidate molecules with more similarity to the protein-interaction based water constellation. However, in optimizing the similarity parameter, it was hoped that AIDD would not deviate too much from the ligand structure, thus preserving the possibility that the novel candidate molecules fit in the active site of the target protein.

5.2.1.1 Run 1: Seed molecule – Manually modified sitagliptin, Reference molecule – Sitagliptin protein-interaction based water constellation

A manually modified molecule with unrealistic chemistry was fed into AIDD, which was asked to generate real candidate molecules with maximal similarity to the drug’s protein-interaction based water constellation. The objective of this run was to explore the abilities of AIDD and check whether it could understand the unrealistic chemical moiety and generate drug-like molecules from it.

5.2.1.2 Run 2: Seed molecule - Sitagliptin protein-interaction based water constellation, Reference molecule – Sitagliptin

The objective of this run was to test whether real candidate molecules can be generated based purely on a water constellation. This was significant as it could be applied to the design of novel molecules across several disease domains based on water constellations derived from solvated waters in the empty protein, using a standard molecule in the pharmacological class as reference. A negative control with the protein-interaction based water constellation of glibenclamide as the seed molecule was performed while keeping the rest of the parameters the same. This was done to verify whether the run produced different results if the seed water constellation did not belong to sitagliptin.

5.2.1.3 Run 3: Seed molecule - Sitagliptin protein-interaction based water constellation (all carbons by default), Reference molecule – Sitagliptin protein-interaction based water constellation

This run was highly exploratory in nature.
Using only the water constellations as the seed and the 3D reference, the objective of this run was to see whether AIDD could come up with drug-like candidate molecules with more similarity to the protein-interaction based water constellation. Essentially, the success of this run would determine whether optimized candidate molecules with structural features mimicking ideal protein-water interactions can be produced for a given pharmacological class of ligands.

5.3 Results

5.3.1 Using water constellations to evolve ligands with AIDD

5.3.1.1 Run 1: Seed molecule – Manually modified sitagliptin, Reference molecule – Sitagliptin protein-interaction based water constellation

The results in Table 7 indicate that AIDD was able to make sense of the unrealistic modified sitagliptin moiety and produce realistic candidate molecules with sensible chemical structures similar to its protein-interaction based water constellation. The highest similarity score was 0.481 of compound 1 when calculated against the water constellation. The synthetic difficulty scores range from 1-10, with 10 being the highest score of difficulty for synthesis. The highest synthetic difficulty score for the candidate molecules was found to be 5.523. Each candidate molecule is superimposed against the reference molecules to generate 3D similarity scores as shown in Figures 25, 26 and 27.

Table 7. Top 10 candidate molecules generated by AIDD using modified sitagliptin as the seed molecule and its protein-interaction based water constellation as the reference molecule for calculating 3D similarity. Molecules are listed based on decreasing order of 3D similarity score.
Seed molecule: Manually modified sitagliptin
3D reference molecule: Sitagliptin protein-interaction based water constellation
[Structure drawings not reproduced in this transcript.]

Identifier     3D similarity score    Synthetic difficulty score
Compound 1     0.481                  5.523
Compound 2     0.48                   8.984
Compound 3     0.475                  7.207
Compound 4     0.474                  5.343
Compound 5     0.47                   5.184
Compound 6     0.47                   5.294
Compound 7     0.47                   7.307
Compound 8     0.468                  5.454
Compound 9     0.468                  8.623
Compound 10    0.464                  5.307

Figure 25. Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores.

Figure 26. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores.

Figure 27. Superimposition of candidate molecule: Compound 3 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD (with seed molecule: Manually modified sitagliptin) to compute 3D similarity scores.

5.3.1.2 Run 2: Seed molecule – Sitagliptin protein-interaction based water constellation, Reference molecule – Sitagliptin

The results in Table 8 indicate that AIDD was able to generate realistic molecules based only on the drug’s protein-interaction based water constellation, using the drug sitagliptin as the similarity reference. The highest similarity score, calculated against sitagliptin, was 0.821 for compound 1. The highest synthetic difficulty score for the candidate molecules was found to be 4.522, lower than the score obtained in Run 1, indicating greater ease of chemical synthesis.
The similarity score was the highest and the synthetic difficulty score the lowest among the three runs, indicating this run’s success. Each candidate molecule was superimposed on the reference molecule to generate its 3D similarity score, as shown in Figures 28, 29 and 30.

Table 8. Top 10 candidate molecules generated by AIDD using a protein-interaction based water constellation as the seed molecule and sitagliptin as the reference molecule for calculating 3D similarity. Molecules are listed in decreasing order of 3D similarity score.

Seed molecule: Sitagliptin protein-interaction based water constellation
3D reference molecule: Sitagliptin
[Structure drawings not reproduced in this transcript.]

Identifier     3D similarity score    Synthetic difficulty score
Compound 1     0.822                  5.407
Compound 2     0.821                  4.522
Compound 3     0.821                  5.462
Compound 4     0.819                  5.473
Compound 5     0.814                  4.11
Compound 6     0.812                  4.152
Compound 7     0.812                  4.242
Compound 8     0.812                  6.694
Compound 9     0.812                  6.861
Compound 10    0.811                  4.198

Figure 28. Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores.

Figure 29. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores.

Figure 30. Superimposition of candidate molecule: Compound 3 (stick figure) on the 3D similarity reference molecule: Sitagliptin (scaled ball and stick figure) in AIDD (with seed molecule: Sitagliptin protein-interaction based water constellation) to compute 3D similarity scores.
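Because Run 2 scored candidates on two objectives (3D similarity and synthetic difficulty), one way to read Table 8 is to pick out the candidates that no other candidate beats on both objectives at once. The following is a minimal Python sketch of that idea, assuming that higher similarity and lower synthetic difficulty are both preferred. The function name `pareto_front` is hypothetical, the values are transcribed from Table 8, and the sketch is an illustration only, not AIDD’s own implementation.

```python
# Values transcribed from Table 8 (Run 2):
# (identifier, 3D similarity score, synthetic difficulty score).
TABLE_8 = [
    ("Compound 1", 0.822, 5.407),
    ("Compound 2", 0.821, 4.522),
    ("Compound 3", 0.821, 5.462),
    ("Compound 4", 0.819, 5.473),
    ("Compound 5", 0.814, 4.11),
    ("Compound 6", 0.812, 4.152),
    ("Compound 7", 0.812, 4.242),
    ("Compound 8", 0.812, 6.694),
    ("Compound 9", 0.812, 6.861),
    ("Compound 10", 0.811, 4.198),
]


def pareto_front(candidates):
    """Return the identifiers of candidates that are not dominated.

    Assumption for this sketch: a candidate dominates another when its
    similarity is at least as high and its difficulty at least as low,
    with a strict improvement in at least one of the two objectives.
    """
    front = []
    for name, sim, diff in candidates:
        dominated = any(
            o_sim >= sim and o_diff <= diff and (o_sim > sim or o_diff < diff)
            for _, o_sim, o_diff in candidates
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(TABLE_8)
print(front)
```

Under these assumptions, the non-dominated set contains Compounds 1, 2 and 5, which are the candidates that trade a slightly lower similarity for an easier predicted synthesis.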
5.3.1.3 Run 3: Seed molecule – Sitagliptin protein-interaction based water constellation (all carbons by default), Reference molecule – Sitagliptin protein-interaction based water constellation

The results in Table 9 indicate that AIDD was able to interpret both the default “C” protein-interaction based water constellation and the normal protein-interaction based water constellation, and was able to produce molecules somewhat similar to sitagliptin solely because of the influence of the scaffold query. The highest similarity score, calculated against the normal water constellation, was 0.46 for compound 1, which was the lowest among the three runs. The highest synthetic difficulty score for the candidate molecules was found to be 5.869, the highest among the three runs. Each candidate molecule was superimposed on the reference molecule to generate its 3D similarity score, as shown in Figures 31, 32 and 33.

Table 9. Top 10 candidate molecules generated by AIDD using sitagliptin’s protein-interaction based water constellation (all carbons by default) as the seed molecule and the normal protein-interaction based water constellation as the reference molecule for calculating 3D similarity. Molecules are listed in decreasing order of 3D similarity score.

Seed molecule: Sitagliptin protein-interaction based water constellation (all carbons by default)
3D reference molecule: Sitagliptin protein-interaction based water constellation
[Structure drawings not reproduced in this transcript.]

Identifier     3D similarity score    Synthetic difficulty score
Compound 1     0.457                  8.035
Compound 2     0.449                  8.024
Compound 3     0.45                   6.407
Compound 4     0.448                  6.343
Compound 5     0.448                  6.299
Compound 6     0.453                  6.201
Compound 7     0.453                  6.184
Compound 8     0.453                  6.141
Compound 9     0.447                  6.064
Compound 10    0.446                  6.042

Figure 31.
Superimposition of candidate molecule: Compound 1 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD to compute 3D similarity scores.

Figure 32. Superimposition of candidate molecule: Compound 2 (stick figure) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (scaled ball and stick figure) in AIDD to compute 3D similarity scores.

Figure 33. Superimposition of candidate molecule: Compound 3 (pink) on the 3D similarity reference molecule: Sitagliptin protein-interaction based water constellation (red: oxygen, grey: carbon) in AIDD to compute 3D similarity scores.

5.4 Discussion

To test the extent of AIDD’s abilities, an initial run (Run 1) was performed with a manually modified sitagliptin as the seed molecule and its protein-interaction based water constellation as the 3D similarity reference. The scaffold ([#7]1:[#6]:[#7]:[#7]:[#6]:1, the encoded query from the sketch tool used in AIDD to specify scaffold queries) was intentionally held constant in all the runs to gently nudge AIDD towards generating suitable candidate molecules without biasing it too strongly towards exact replicas of sitagliptin. The manual modification of sitagliptin, following the rules given in Figure 24, was required so that the seed molecule reflected the modifications desired in the original ligand structure to maximize its protein interactions. However, these modifications were not made with any real chemistry in mind. The manually modified chemical entities are not real molecules and do not possess sensible chemistry. When they are fed into AIDD, multiple warnings are generated about chemical violations such as extreme bond lengths and bad valence states. Nevertheless, these manually modified entities do possess the right functional groups at the right places for maximal protein interaction.
AIDD can then take these imperfect chemical entities and convert them into compounds with drug-like characteristics and chemistry (Table 7).

Run 2 was interesting because AIDD managed to produce candidates with 0.821 similarity to sitagliptin (Table 8) that contain optimal protein interactions, as given by the protein-interaction based water constellation used as the seed molecule. This result has significant implications for drug design and development: using only the solvated waters of the empty protein and their interactions as the seed, the medicinal chemist can generate novel molecules similar to a standard drug in that therapeutic category. Furthermore, the run included synthetic difficulty as one of the objective functions, which makes the chemistry of the results more realistic.

Run 3 used only protein-interaction based water constellations as the seed (the water constellation with all atoms set to “C”) and as the reference molecule (the normal classified water constellation). Here, the scaffold query played a very important role in directing the generation of candidate molecules towards compounds that not only were drug-like (unlike the long-chain water constellations used as seed and reference) but also contained at least one pharmacophoric feature of sitagliptin. A good result in this run would be the generation of molecules with more similarity to the protein-interaction based water constellation than sitagliptin itself. Run 3 demonstrated the generation of candidate molecules with more optimal interactions with the protein (Table 9).

Thus, this chapter provides novel ideas for drug design and development using AIDD. This method differs from conventional methods of design [76] in that it is quick and easy to use while giving the medicinal chemist better control over the entire process, by letting them specify the seed, reference and scaffold molecules at the click of a button.
Furthermore, the chemist can specify that certain transformations take place while others are avoided, in order to arrive at candidate molecules that meet specific objectives. Additionally, the chemist may add more objective functions (up to five) to make better use of the Pareto multi-layer optimization algorithm on which AIDD is based [41]. This work focuses on the optimization of existing drug structures; however, its results may find applications in the initiation and design of novel chemical entities.

Chapter 6: Conclusion

Artificial intelligence has become increasingly prominent in the field of pharmaceutical sciences. It has found applications in various areas of virtual drug screening, drug design, synthesis and development [77–79]. This thesis offers laboratory medicinal chemists a way to harness this power of AI and apply it to their workbench projects. The work done using KNIME highlights a simpler, quicker and more successful way of screening and predicting the pharmacological activity of lead compounds using SAR analysis. WALE was developed to analyse the waters surrounding proteins and how these waters interact with the protein. Each code was created step by step to counter issues as they arose, thereby offering a unique perspective on the entire coding process. The 3D similarity screening and AIDD modules offered by ADMET Predictor® helped to demonstrate the applications of WALE in the process of ligand evolution, as well as showing potential for novel drug design. As an illustration of the method, this work focused on 10 randomly chosen ligands across various pharmacological domains. The logic and code in WALE have been developed so that they can be applied to our established database of around 10,000 solvated proteins and ligand-protein complexes [63]. Using a larger database, as well as deepening the thought and logic that went into WALE, is a step for the future.
While working on this project, the scope of both Python and Fortran was thoroughly studied and appropriately applied to the input data. AIDD has shown tremendous potential for the development of candidate molecules, and the direction of this evolution could be readily steered using the scaffold, seed, and reference molecule inputs. There remains the possibility of exploring AIDD through different types of runs that vary the selection of the scaffold, seed, and reference molecules, as well as performing further control runs. GitHub offered a platform for uploading the input data, the AI programs developed, and the output data, making them more accessible to chemists who want to use the algorithms for their own datasets of compounds. We hope that the medicinal chemistry community can utilize the novel AI-based algorithms developed here, thereby saving time and other resources [80].

References

1. Bateman, T. J. Drug discovery. in Atkinson’s Principles of Clinical Pharmacology 563–572 (Elsevier, 2022). doi:10.1016/B978-0-12-819869-8.00019-7.
2. Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18, 463–477 (2019).
3. Struble, T. J. et al. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J Med Chem 63, 8667–8682 (2020).
4. Schadt, E. E., Friend, S. H. & Shaywitz, D. A. A network view of disease and compound screening. Nat Rev Drug Discov 8, 286–295 (2009).
5. Mishra, V. Artificial Intelligence: The Beginning of a New Era in Pharmacy Profession. Asian J Pharm 12, (2018).
6. Ramesh, A., Kambhampati, C., Monson, J. & Drew, P. Artificial intelligence in medicine. Ann R Coll Surg Engl 86, 334–338 (2004).
7. Wirtz, B. W., Weyerer, J. C. & Geyer, C. Artificial Intelligence and the Public Sector—Applications and Challenges. International Journal of Public Administration 42, 596–615 (2019).
8. Blower, P. & Cross, K.
Decision Tree Methods in Pharmaceutical Research. Curr Top Med Chem 6, 31–39 (2006).
9. Paul, D. et al. Artificial intelligence in drug discovery and development. Drug Discov Today 26, 80–93 (2021).
10. Mak, K.-K. & Pichika, M. R. Artificial intelligence in drug development: present status and future prospects. Drug Discov Today 24, 773–780 (2019).
11. Yang, X., Wang, Y., Byrne, R., Schneider, G. & Yang, S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 119, 10520–10594 (2019).
12. Brown, N. In Silico Medicinal Chemistry: Computational Methods to Support Drug Design. (Royal Society of Chemistry, Cambridge, 2015). doi:10.1039/9781782622604.
13. Hessler, G. & Baringhaus, K.-H. Artificial Intelligence in Drug Design. Molecules 23, 2520 (2018).
14. Zhang, L., Tan, J., Han, D. & Zhu, H. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today 22, 1680–1685 (2017).
15. Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).
16. Mayr, A., Klambauer, G., Unterthiner, T. & Hochreiter, S. DeepTox: Toxicity Prediction using Deep Learning. Front Environ Sci 3, (2016).
17. Hutson, M. AI protein-folding algorithms solve structures faster than ever. Nature (2019) doi:10.1038/d41586-019-01357-6.
18. Wan, F. & Zeng, J. Deep Learning with Feature Embedding for Compound-Protein Interaction Prediction. (2016).
19. Wang, F. et al. Computational Screening for Active Compounds Targeting Protein Sequences: Methodology and Experimental Validation. J Chem Inf Model 51, 2821–2828 (2011).
20. Zeng, X. et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci 11, 1775–1797 (2020).
21. Li, Z. et al. KinomeX: a web application for predicting kinome-wide polypharmacology effect of small molecules. Bioinformatics 35, 5354–5356 (2019).
22. Grzybowski, B. A. et al.
Chematica: A Story of Computer Code That Started to Think like a Chemist. Chem 4, 390–398 (2018).
23. Nichols, P. L. Automated and enabling technologies for medicinal chemistry. in 191–272 (2021). doi:10.1016/bs.pmch.2021.01.003.
24. Dietz, C. & Berthold, M. R. KNIME for Open-Source Bioimage Analysis: A Tutorial. Adv Anat Embryol Cell Biol 179–197 (2016) doi:10.1007/978-3-319-28549-8_7.
25. Berthold, M. R. et al. KNIME: The Konstanz Information Miner. Data Analysis, Machine Learning and Applications 319–326 (2008) doi:10.1007/978-3-540-78246-9_38.
26. O’Boyle, N. M. et al. Open Babel: An open chemical toolbox. J Cheminform 3, 33 (2011).
27. P. Mazanetz, M., J. Marmon, R., B. T. Reisser, C. & Morao, I. Drug Discovery Applications for KNIME: An Open Source Data Mining Platform. Curr Top Med Chem 12, 1965–1979 (2012).
28. Kruggel, S. & Lemcke, T. Generation and Evaluation of a Homology Model of Pf GSK‐3. Arch Pharm (Weinheim) 342, 327–332 (2009).
29. Korb, O., ten Brink, T., Victor Paul Raj, F. R. D., Keil, M. & Exner, T. E. Are predefined decoy sets of ligand poses able to quantify scoring function accuracy? J Comput Aided Mol Des 26, 185–197 (2012).
30. Steri, R., Achenbach, J., Steinhilber, D., Schubert-Zsilavecz, M. & Proschak, E. Investigation of imatinib and other approved drugs as starting points for antidiabetic drug discovery with FXR modulating activity. Biochem Pharmacol 83, 1674–1681 (2012).
31. Baell, J. B. & Holloway, G. A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J Med Chem 53, 2719–2740 (2010).
32. Saubern, S., Guha, R. & Baell, J. B. KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries. Mol Inform 30, 847–850 (2011).
33. Joshi, R. et al. KNIME workflows for applications in medicinal and computational chemistry. Artificial Intelligence Chemistry 2, 100063 (2024).
34. Geldenhuys, W.
J., Gaasch, K. E., Watson, M., Allen, D. D. & Van der Schyf, C. J. Optimizing the use of open-source software applications in drug discovery. Drug Discov Today 11, 127–132 (2006).
35. Grünewald, F. et al. Polyply; a python suite for facilitating simulations of macromolecules and nanomaterials. Nat Commun 13, 68 (2022).
36. García-Ortegón, M. et al. DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. J Chem Inf Model 62, 3486–3502 (2022).
37. Jung, S., Vatheuer, H. & Czodrowski, P. VSFlow: an open-source ligand-based virtual screening tool. J Cheminform 15, 40 (2023).
38. Kurdekar, V. & Jadhav, H. R. A new open source data analysis python script for QSAR study and its validation. Medicinal Chemistry Research 24, 1617–1625 (2015).
39. Han, S. & Kwak, I.-Y. Mastering data visualization with Python: practical tips for researchers. Journal of Minimally Invasive Surgery 26, 167–175 (2023).
40. Ghosh, J., Lawless, M. S., Waldman, M., Gombar, V. & Fraczkiewicz, R. Modeling ADMET. Methods Mol Biol 63–83 (2016) doi:10.1007/978-1-4939-3609-0_4.
41. Jones, J., Clark, R. D., Lawless, M. S., Miller, D. W. & Waldman, M. The AI-driven Drug Design (AIDD) platform: an interactive multi-parameter optimization system integrating molecular evolution with physiologically based pharmacokinetic simulations. J Comput Aided Mol Des 38, 14 (2024).
42. Ayoub, A. T., Klobukowski, M. & Tuszynski, J. Similarity-based virtual screening for microtubule stabilizers reveals novel antimitotic scaffold. J Mol Graph Model 44, 188–196 (2013).
43. Zheng, Z. M.S Thesis. (University of Southern California, 2024).
44. Guéroux, M., Fleau, C., Slozeck, M., Laguerre, M. & Pianet, I. Epigallocatechin 3-Gallate as an Inhibitor of Tau Phosphorylation and Aggregation: A Molecular and Structural Insight. J Prev Alzheimers Dis 4, 218–225 (2017).
45. Mao, J. et al. Comprehensive strategies of machine-learning-based quantitative structure-activity relationship models.
iScience 24, 103052 (2021).
46. Katoch, S., Chauhan, S. S. & Kumar, V. A review on genetic algorithm: past, present, and future. Multimed Tools Appl 80, 8091–8126 (2021).
47. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
48. Xuan, P. et al. Genetic algorithm-based efficient feature selection for classification of pre-miRNAs. Genetics and Molecular Research 10, 588–603 (2011).
49. Guha, R. On Exploring Structure–Activity Relationships. Methods Mol Biol. 81–94 (2013) doi:10.1007/978-1-62703-342-8_6.
50. Rodrigues, T. The good, the bad, and the ugly in chemical and biological data for machine learning. Drug Discov Today Technol 32–33, 3–8 (2019).
51. Gally, J., Bourg, S., Do, Q., Aci‐Sèche, S. & Bonnet, P. VSPrep: A General KNIME Workflow for the Preparation of Molecules for Virtual Screening. Mol Inform 36, (2017).
52. Nicola, G., Berthold, M. R., Hedrick, M. P. & Gilson, M. K. Connecting proteins with drug-like compounds: Open source drug discovery workflows with BindingDB and KNIME. Database 2015, bav087 (2015).
53. P. Mazanetz, M., J. Marmon, R., B. T. Reisser, C. & Morao, I. Drug Discovery Applications for KNIME: An Open Source Data Mining Platform. Curr Top Med Chem 12, 1965–1979 (2012).
54. Hemmerich, J., Gurinova, J. & Digles, D. Accessing Public Compound Databases with KNIME. Curr Med Chem 27, 6444–6457 (2020).
55. Kralj, S., Jukič, M. & Bren, U. Comparative Analyses of Medicinal Chemistry and Cheminformatics Filters with Accessible Implementation in Konstanz Information Miner (KNIME). Int J Mol Sci 23, 5727 (2022).
56. Caballero Alfonso, A. Y., Chayawan, C., Gadaleta, D., Roncaglioni, A. & Benfenati, E. A KNIME Workflow to Assist the Analogue Identification for Read-Across, Applied to Aromatase Activity. Molecules 28, 1832 (2023).
57. Roughley, S. D. Five Years of the KNIME Vernalis Cheminformatics Community Contribution.
Curr Med Chem 27, 6495–6522 (2020).
58. Caballero Alfonso, A. Y., Chayawan, C., Gadaleta, D., Roncaglioni, A. & Benfenati, E. A KNIME Workflow to Assist the Analogue Identification for Read-Across, Applied to Aromatase Activity. Molecules 28, 1832 (2023).
59. Seidler, P. M. et al. Structure-based discovery of small molecules that disaggregate Alzheimer’s disease tissue derived tau fibrils in vitro. Nat Commun 13, 5451 (2022).
60. Kooistra, A. J. et al. 3D‐e‐Chem: Structural Cheminformatics Workflows for Computer‐Aided Drug Discovery. ChemMedChem 13, 614–626 (2018).
61. Comesana, A. E., Huntington, T. T., Scown, C. D., Niemeyer, K. E. & Rapp, V. H. A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. Fuel 321, 123836 (2022).
62. Bellissent-Funel, M.-C. et al. Water Determines the Structure and Dynamics of Proteins. Chem Rev 116, 7673–7697 (2016).
63. Morningstar-Kywi, N. et al. Prediction of Water Distributions and Displacement at Protein–Ligand Interfaces. J Chem Inf Model 62, 1489–1497 (2022).
64. BIOVIA (previously Accelrys Inc.)., ViewerPro 42, San Diego, CA.
65. Mahmoud, A. H., Masters, M. R., Yang, Y. & Lill, M. A. Elucidating the multiple roles of hydration for accurate protein-ligand binding prediction via deep learning. Commun Chem 3, 19 (2020).
66. Chen, D. et al. Regulation of protein-ligand binding affinity by hydrogen bond pairing. Sci Adv 2, (2016).
67. Lukac, I., Wyatt, P. G., Gilbert, I. H. & Zuccotto, F. Ligand binding: evaluating the contribution of the water molecules network using the Fragment Molecular Orbital method. J Comput Aided Mol Des 35, 1025–1036 (2021).
68. Bagchi, B. Untangling complex dynamics of biological water at protein–water interface. Proceedings of the National Academy of Sciences 113, 8355–8357 (2016).
69. Petukhov, M., Cregut, D., Soares, C. M. & Serrano, L. Local water bridges and protein conformational stability.
Protein Science 8, 1982–1989 (1999).
70. Bellissent-Funel, M.-C. et al. Water Determines the Structure and Dynamics of Proteins. Chem Rev 116, 7673–7697 (2016).
71. Samways, M. L., Bruce Macdonald, H. E., Taylor, R. D. & Essex, J. W. Water Networks in Complexes between Proteins and FDA-Approved Drugs. J Chem Inf Model 63, 387–396 (2023).
72. Zhao, J. et al. Chasing weakly-bound biological water in aqueous environment near the peptide backbone by ultrafast 2D infrared spectroscopy. Commun Chem 7, 82 (2024).
73. Levy, Y. & Onuchic, J. N. Water and proteins: A love–hate relationship. Proceedings of the National Academy of Sciences 101, 3325–3326 (2004).
74. Mahapatra, M. K. & Karuppasamy, M. Fundamental considerations in drug design. in Computer Aided Drug Design (CADD): From Ligand-Based Methods to Structure-Based Approaches 17–55 (Elsevier, 2022). doi:10.1016/B978-0-323-90608-1.00005-8.
75. Leipold, D. & Prabhu, S. Pharmacokinetic and Pharmacodynamic Considerations in the Design of Therapeutic Antibodies. Clin Transl Sci 12, 130–139 (2019).
76. Doytchinova, I. Drug Design—Past, Present, Future. Molecules 27, 1496 (2022).
77. Johansson, S. et al. AI-assisted synthesis prediction. Drug Discov Today Technol 32–33, 65–72 (2019).
78. Arnold, C. Inside the nascent industry of AI-designed drugs. Nat Med 29, 1292–1295 (2023).
79. Gentile, F. et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nat Protoc 17, 672–697 (2022).
80. Han, R., Yoon, H., Kim, G., Lee, H. & Lee, Y. Revolutionizing Medicinal Chemistry: The Application of Artificial Intelligence (AI) in Early Drug Discovery. Pharmaceuticals 16, 1259 (2023).

Appendices

Appendix 1. Identification of key features and feature value for enrichment of inhibitors in the decision tree analysis for different puncta counts.
Puncta count   Key feature    Feature value   % Inhibitors in whole population   % Inhibitors among molecules with feature
10,000         ArHdrxl_-OH    >4.5            28 (3.0%)                          21 (37.5%)
12,500         ArHdrxl_-OH    >4.5            62 (6.7%)                          27 (48.2%)
15,000         ArHdrxl_-OH    >4.5            116 (12.5%)                        31 (55.4%)
17,500         ArHdrxl_-OH    >4.5            188 (20.3%)                        39 (69.6%)
20,000         ArHdrxl_-OH    >4.5            262 (28.2%)                        44 (78.6%)
22,500         ArHdrxl_-OH    >4.5            372 (40.1%)                        49 (87.5%)
25,000         N_IoAcAt       >3.5            465 (50.1%)                        121 (73.8%)
27,500         ArHdrxl_-OH    >3.5            567 (61.1%)                        126 (86.3%)
30,000         ArHdrxl_-OH    >3.5            673 (72.5%)                        134 (91.8%)

Appendix 2. Impact of increasing puncta count on classification and prediction values using linear sampling.

Puncta count   True Negative (TN)   False Positive (FP)   False Negative (FN)   True Positive (TP)   Accuracy   Sensitivity (a)   Specificity (b)
10,000         265                  7                     4                     3                    0.961      0.429             0.974
15,000         231                  15                    25                    8                    0.857      0.242             0.939
20,000         179                  23                    59                    18                   0.706      0.234             0.886
25,000         84                   57                    64                    74                   0.566      0.536             0.596
30,000         8                    71                    22                    178                  0.667      0.890             0.101

(a) Sensitivity = TP/(TP+FN)
(b) Specificity = TN/(TN+FP)

Appendix 3. Ranking of importance of molecular features in the machine learning model with selection using a genetic algorithm without and with prior elimination using linear correlation analysis. Features are ranked using the genetic algorithm without linear correlation.

                 Without linear correlation    With linear correlation
Features         Rank    Count                 Rank          Count
HBD              1       983                   Eliminated    Eliminated
EqualChi         2       642                   2             406
F_DbleB          3       620                   8             334
HBA              4       573                   Eliminated    Eliminated
N_IoAcAt         5       507                   Eliminated    Eliminated
HBDo             6       462                   Eliminated    Eliminated
EqualEta         7       443                   Eliminated    Eliminated
PriAmAli_-NH2    8       441                   Eliminated    Eliminated
HBDH             9       405                   Eliminated    Eliminated
ArHdrxl_-OH      10      377                   1             794

Appendix 4. Ranking of importance of molecular features in the machine learning model with selection using a genetic algorithm without and with prior elimination using linear correlation analysis. Features are ranked using the genetic algorithm with prior linear correlation.
                 With linear correlation    Without linear correlation
Features         Rank    Count              Rank    Count
ArHdrxl_-OH      1       794                10      377
EqualChi         2       406                2       642
N_Carbon         3       398                24      287
N_Pisyms         4       370                19      324
F_SgleB          5       344                22      293
IHB              6       342                59      144
FormalQ          7       335                44      215
F_DbleB          8       334                3       620
N_Kekule         9       321                15      341
Carbonyl_C=O     10      311                16      339

Appendix 5. Python code to deal with multiple entries in a single column (“ProtHB Atoms” or “ProtHF Atoms” in a CSV file).

Appendix 6. Development of a Python code to classify interactions between empty protein and solvated water using the categories: Absolute displacement, contact displaced bulk and contact displaced HF.

Appendix 7. Development of a Python code to classify interactions between empty protein and solvated water using the categories: Absolute displacement, contact displaced bulk and contact displaced HF and the columns: “ProtHB” and “Closest Prot Atom”.

Appendix 8. Development of a Python code to classify interactions between empty protein and solvated water using the category: Contact SWB.

Appendix 9. Development of a Python code to classify interactions between empty protein and solvated water using the categories: Matched and Ghost Match.

Appendix 10. Development of a Python code to tally the total number of waters in each unique category.

Appendix 11. Development of a Python code to add an identifier in the CSV files.

Appendix 12. Development of a Python code to combine all the CSV files.

Appendix 13. Development of a Python code to isolate protein solvated waters from the rest of the output CSV file from SOLVATE.

Appendix 14. Development of a Python code to identify the categories of solvated waters of interest and convert them to carbons by default.

Appendix 15.
Classification of the water-converted carbons as either hydrophilic or hydrophobic – Creation of water constellations: Ligand based water constellations.

Appendix 16. Classification of the water-converted carbons as either hydrophilic or hydrophobic – Creation of water constellations: Protein-interaction based water constellations.

Appendix 17. Establishment of bonds between the atoms using a Fortran code.
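The accuracy, sensitivity and specificity values reported in Appendix 2 follow directly from the confusion-matrix counts in the same table. The following is a minimal Python sketch that recomputes them, using the footnote definitions Sensitivity = TP/(TP+FN) and Specificity = TN/(TN+FP). The accuracy formula (TP+TN)/total is the standard definition and is an assumption here, since the appendix does not state it; the function and variable names are hypothetical and the counts are transcribed from Appendix 2.

```python
# Confusion-matrix counts transcribed from Appendix 2:
# puncta count -> (TN, FP, FN, TP).
APPENDIX_2 = {
    10_000: (265, 7, 4, 3),
    15_000: (231, 15, 25, 8),
    20_000: (179, 23, 59, 18),
    25_000: (84, 57, 64, 74),
    30_000: (8, 71, 22, 178),
}


def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, sensitivity and specificity from the four counts."""
    total = tn + fp + fn + tp
    return {
        # Assumed standard definition; not stated explicitly in the appendix.
        "accuracy": (tp + tn) / total,
        # Footnote (a): fraction of true inhibitors that are recovered.
        "sensitivity": tp / (tp + fn),
        # Footnote (b): fraction of non-inhibitors correctly rejected.
        "specificity": tn / (tn + fp),
    }


metrics = {count: classification_metrics(*cells) for count, cells in APPENDIX_2.items()}

for count, values in metrics.items():
    summary = ", ".join(f"{name}={value:.3f}" for name, value in values.items())
    print(f"{count:>6}: {summary}")
```

Rounded to three decimal places, these values agree with the tabulated ones, which illustrates the trade-off in the table: sensitivity rises and specificity falls as the puncta count threshold increases.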
Abstract
The objective of using computational tools in medicinal chemistry is to design compounds with better physicochemical properties with minimal investment of resources such as time, chemicals, and human workload. Artificial intelligence and machine learning help increase the chances of success of a molecule in the drug pipeline by identifying the best performing molecules in the earlier stages of discovery. Medicinal chemists require training and a holistic understanding of the computational tools to make optimum use of them. KNIME presents a solution through its comprehensive toolsets and user-friendly platform, enabling the utilization of machine learning for SAR data analysis. In this thesis, downloadable workflows for predicting the pharmacological activity of compounds and investigating ligand-protein interactions are made available to assist scientists who seek to incorporate AI tools into their research projects using KNIME. Python, as a programming language, presents an approach to navigating folder structures, handling large datasets, and efficiently generating and organizing results. ADMET Predictor, along with its AIDD (Artificial Intelligence-driven Drug Design) and 3D similarity screening modules, facilitates the prediction of physicochemical properties of investigational compounds and the design of novel drugs using artificial intelligence and similarity methods. This thesis highlights the utility of the above-mentioned computational tools and describes ways of adapting them to help the laboratory medicinal chemist enhance the drug discovery process.
Asset Metadata
Creator
Joshi, Ruchira Vishwanath (author)
Core Title
Artificial intelligence in medicinal chemistry and drug discovery
School
School of Pharmacy
Degree
Master of Science
Degree Program
Molecular Pharmacology and Toxicology
Degree Conferral Date
2024-08
Publication Date
07/17/2024
Defense Date
07/17/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
3D similarity screening, AIDD, artificial intelligence, computational tools, KNIME, medicinal chemistry, OAI-PMH Harvest, python, WALE
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Haworth, Ian (committee chair), Mangul, Serghei (committee member), Seidler, Paul (committee member)
Creator Email
ruchirajoshi99@gmail.com,ruchirav@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113997X98
Unique identifier
UC113997X98
Identifier
etd-JoshiRuchi-13257.pdf (filename)
Legacy Identifier
etd-JoshiRuchi-13257
Document Type
Thesis
Rights
Joshi, Ruchira Vishwanath
Internet Media Type
application/pdf
Type
texts
Source
20240718-usctheses-batch-1185 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu