Computational analysis of the spatial and temporal organization of the cellular environment

by

Zachary C Frazier

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements of the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

December 2015

Copyright 2015 Zachary C Frazier

Acknowledgements

This work would not have been possible without the support of my family, good friends, and many talented colleagues.

My first semesters would have been impossible without the talented group of students comprising my class: Mehmet Cetin, Fang Fang, Dazhe Meng, Tade Souaiaia, Tittu Thomas, and Song Qian. Their help with classes, screening exam preparation, and research projects was invaluable. I also had the pleasure of having fantastic professors, who provided insight into difficult areas and presented complex topics in engaging ways, including Peter Baxendale, Aiichiro Nakano, Fengzu Sun, and Michael Waterman.

My advisor, Frank Alber, provided a great work environment, filling the lab with talented people, and providing insight into fascinating and challenging research areas. His vision of large scale problems was inspiring, and always helped to connect our research to the larger movement in science. Other members of the Alber Lab were great friends and resources, including Ke Gong, Long Pei, Harianto Tjong, Min Xu, Shemgli Hao, Chao Dai, and Shihua Zheng. Min Xu deserves special thanks for his help with the cryoET portion of this work. Everything I present builds directly upon his research, and is only possible because of the extensive groundwork he has laid, along with his considerable guidance in this new research area.

I would also like to thank my dissertation committee members for their time and guidance. My committee was composed of Frank Alber, Peter Baxendale, Lin Chen, Aiichiro Nakano, and Jasmine Zhou.

Outside of the classroom and the lab, there were several people who made Los Angeles a memorable place. Tade Souaiaia, Reza Ardekani, and Ke Gong are great friends, both inside and outside of the department.

My family has been incredibly supportive during my entire scholastic career, and I have always appreciated their encouragement, and everything they have done to provide for my education. My sister Rachel has been uniquely understanding and has been a wonderful friend since our childhood.

I owe a special thanks to Tyler Roberts, who has been a great friend through my undergraduate and graduate degree programs. He has always provided a unique perspective, not only on my own research, but on the entire realm of academic education.

Finally, all of this would have been impossible without the love and support of Jamie McKnight, who was an incredibly patient and supportive partner during my entire graduate school career.

Table of Contents

1 Introduction
2 Increasing time scales in Brownian dynamics simulations
    2.1 Introduction
    2.2 Review of Brownian dynamics, and reaction systems
        2.2.1 Brownian motion
        2.2.2 Reaction systems
        2.2.3 Computational dynamics
        2.2.4 Reaction-diffusion algorithms
    2.3 Methodology
        2.3.1 Diffusion simulation
        2.3.2 Reaction systems
        2.3.3 Reaction before Move reaction system
        2.3.4 Reaction Before Move algorithm
    2.4 Results
        2.4.1 Collision probability
        2.4.2 RBM accuracy assessment
        2.4.3 Annihilation simulation
        2.4.4 Crowded diffusion
    2.5 Conclusions
3 TomoMiner: A platform for large-scale subtomogram structural analysis
    3.1 Introduction
    3.2 Methodology
        3.2.1 Software Implementation
        3.2.2 Cloud computing setup
        3.2.3 Fast rotational alignment and fast subtomogram matching
        3.2.4 Experimental Procedures
    3.3 TomoMiner Analysis programs
        3.3.1 Reference-free Classification
        3.3.2 Reference-based classification
        3.3.3 Subtomogram alignment by fast rotational matching
        3.3.4 Template-matching
    3.4 Results
        3.4.1 Data scalability, worker scalability, and efficiency
        3.4.2 Performance of reference-free subtomogram classification
        3.4.3 Accuracy increases with larger data sets
        3.4.4 Reference-free classification of GroEL and GroEL/GroES subtomograms
        3.4.5 Cost analysis of cloud computing
    3.5 Discussion
4 Conclusion
Bibliography

List of Figures

2.1 Flowchart of RBM method
2.2 $P_{col}(r)$ comparison between algorithms
2.3 Differences between algorithms in $N(\Delta t)$ and $P_{acc}(\Delta t)$
2.4 Radial distribution function and survival probabilities
2.5 Survival error for different $\Delta t$ in annihilation simulation
2.6 Observed diffusion rates for different $\Delta t$
3.1 Parallel processing architecture
3.2 Reference-free classification
3.3 Efficiency and Scalability
3.4 Reference-free classification of 100,000 subtomograms
3.5 Accuracy of reference-free alignment and averaging
3.6 Reference-free classification of GroEL/GroES

Abstract

Cellular processes take place over relatively large temporal and spatial scales. Molecular interactions mediate structural changes in proteins and their complexes, which modulate their activities and regulate complex processes, including cellular transport of components across membranes, gene expression, and genome replication. Gaining knowledge of these systems requires experimental techniques that span the temporal and spatial scales of the environment, ranging from the atomic description of the individual components to knowledge about the detailed locations of all of its components at the cell level. The cellular environment is complex, teeming with thousands of molecular structures in a dynamic equilibrium.

Cryo-electron tomography (cryoET) is an emerging technology which provides some unique advantages in the study of the structural organization of entire cells. The samples are lightly processed in a close to native state, essentially providing a snapshot of the interior composition of cells. In principle, samples used in cryoET provide information about the spatial organization of the cell, allowing researchers to detect the structures of objects the size of protein complexes, as well as their relative orientation, abundance, and interactions. With a dynamic environment, these values provide insight into the functionality of a cell's components. However, extracting this information is challenging due to the relatively low resolution, low signal-to-noise ratio, and distortions in the tomograms due to missing data.

CryoET provides a 3-dimensional density of the volume under study. Segmentation of the volume based on regions of high density allows the identification of individual protein complexes, as well as larger structures. These can be compared to known structures, or against each other, to classify them into groups whose density maps can be averaged to increase their resolution. This large scale analysis is computationally intensive, demanding expansive resources. Here we describe the TomoMiner package, a parallel computing framework for these calculations. The software runs on computer clusters with hundreds of processors and can handle the alignment, clustering, and averaging of hundreds of thousands of subtomograms.

After recovering detailed data regarding the interior of cells, one approach to understanding the dynamics of the system is modeling the reaction-diffusion behavior of all its components. At the scale of entire cells containing an abundance of protein complexes, atomic modeling is too detailed and would require too many steps in order to model large scale processes. Brownian dynamics (BD) allows the abstraction of the smallest molecules in the simulation, and allows for the simulation of larger temporal steps than molecular dynamics. Brownian dynamics simulations still have restrictions based on reaction and diffusion rates, which limit the scope of the time steps available in the simulations. We have developed an approach to accelerate Brownian dynamics simulations specifically tailored for simulating the reaction-diffusion process of crowded mixtures of proteins.
The approach removes constraints due to fast reaction rates that require small temporal steps for diffusion limited reactions. The only remaining constraints on BD simulations are those required by physical constraints and limitations. The new algorithm for BD simulations accurately reproduces reactions and dynamics of cellular scale systems with temporal step increases over existing approaches, allowing for simulations at cellular time scales. These two contributions, a package for discovering features at the cellular scale of real environments, and the ability to accurately simulate the environment more efficiently, provide researchers with techniques for accelerated discovery at the cellular scale.

Chapter 1

Introduction

Biological processes, such as the selective nucleo-cytoplasmic protein transport or gene expression regulation, typically involve the intricate relationship of hundreds of cellular components. Predicting the systems level behavior of reaction networks requires the absolute abundance of all of the reaction components, and their spatial distributions in the cell. The interiors of cells are inhomogeneous environments, where the distinct spatial distributions of macromolecular complexes facilitate and regulate cellular functions. For instance, spatial gradients, irregularities and discontinuities of macromolecular distributions in a cell are known to play an active role in biological processes such as cell division, genome segregation, gene regulation, cell morphogenesis, and shape maintenance. Many biological processes related to cell signaling, the cell cycle, and protein transport are also modulated by the precise spatial organization of molecules in space and time [Scott and Pawson, 2009, Bork and Serrano, 2005]. The cellular environment is highly crowded, with molecular diffusion impacted by the high level of crowding. The high level of crowding influences reaction rates as well as diffusion rates, and modeling the effects of crowding is an important part of modeling cellular systems.

Due to advances in experimental technology, quantitative information has been made increasingly available about the molecular organization in living cells from cryo-electron microscopy as well as fluorescence imaging; the abundance of reaction partners from quantitative mass spectrometry; and information about in vivo kinetics in reaction networks and the diffusion dynamics of cellular components from fluorescence imaging. There is a pressing need to integrate these varied data into spatially explicit, predictive, in vivo models of biological processes such as signal transduction, genome separation and mitosis.

Mathematical and computational modeling has been critical for predicting the systems level behavior of reaction networks in cellular environments. Typically, mathematical modeling treats the entire system, or parts of it, as a spatially homogeneous environment, which can be modeled by ordinary or stochastic differential equations. However, the cellular environment is highly inhomogeneous due to spatial gradients in the distribution of biomolecules, crowding effects, and cellular compartmentalization. Many cellular processes, such as cell division and nucleo-cytoplasmic transport, are either spatially constrained or segregated [Terry et al., 2007].

Moreover, when biomolecules are present in relatively low copy numbers, their local concentrations can fluctuate widely, which can cause stochastic effects in reaction processes.
The effective behavior of such molecules may be very different from their behavior under a constant distribution, as often assumed in mathematical modeling. Such effects, due to low copy numbers, have been shown to significantly influence gene regulation [Cai et al., 2006, Raser and O'Shea, 2005], signal transduction [Kollmann et al., 2005], and many other processes [Ermak and McCammon, 1978].

Particle based simulations naturally incorporate the concepts of space, crowding, and stochasticity. Those methods treat proteins or other reactants explicitly as simple geometric objects, and the time-evolution of particle positions is sampled at discrete time intervals by Brownian dynamics (BD) simulations [Ermak and McCammon, 1978, Northrup and Erickson, 1992]. In BD, the net force experienced by a particle contains a random element in addition to contributions from interactions with other particles. The random element is an explicit approximation to the statistical properties of Brownian forces, due to the effects of collisions with solvent molecules, which are not explicitly modeled. More specifically, the particles are displaced from their position at each time interval by a random vector whose norm is chosen from a probability distribution function that is a solution to the Einstein diffusion equation. As a consequence, spatial gradients of particles naturally occur. To incorporate reactions within a BD framework, reactions occur upon collisions of particles according to specific probabilities, which are chosen to reproduce the correct experimentally determined reaction kinetics.

However, simulating reaction processes in cellular environments is challenging due to several problems that must be addressed. First, the disadvantage of particle methods is that they often require relatively small time steps in order to accurately simulate the dynamics of diffusion and reaction kinetics. The reason lies partly in the approximations used to derive reaction event probabilities and the incomplete detection of particle collisions, which prohibit the use of larger time steps. The use of small time steps often prevents reaching simulation times that are relevant for biological processes. Biological processes can occur on a wide range of temporal scales. Some proteins may encounter each other in fractions of a millisecond, while others take hours. For instance, DNA transcription and translation are completed in a few minutes, but these processes are made up of thousands of small diffusion-limited reactions, which take fractions of a nanosecond. It is an ongoing challenge to develop particle-based simulations that can cover a wide range of temporal scales while accurately reproducing the properties of diffusion and reaction networks.

Secondly, for accurate simulations the organization of the intra-cellular space must be known, which includes the ultra-structure of the cell given by the intricate membrane systems that delineate the cellular compartments, and also includes the knowledge about the particle types and their abundances as well as the initial particle distributions in the cell.

Cryo-electron tomography (cryoET) captures the electron density distribution of macromolecular complexes in near-native states from large samples of the cellular environment. The density volume can be segmented into individual particles; these particles can be aligned to known structures for classification, or aligned to each other to produce a template of an unknown structure.
In the case that particle species are accurately identified, the original tomogram provides us with the distribution of molecular complexes in the cell.

Cryo-electron microscopy uses cold temperatures to stabilize biological samples while imaging is performed by an electron microscope. If several different projections are captured, cryoET can be used to construct a three dimensional density map as a model of the original structure. These three dimensional density maps provided by cryoET are challenging to work with because of high noise levels and missing data. CryoET can be applied to whole cells, or a subcellular region, and is capable of resolving details with a resolution of approximately 5 nm. At this scale, most individual proteins are not distinguishable from background noise, but larger collections of proteins, in complexes, and other cellular structures can be differentiated. Identification of complexes is hampered by the low resolution and the limited information on complexes inside of cells. Template matching can be used to identify complexes where an atomic structure has already been determined by other techniques. In the absence of a structure, pairwise alignment, clustering, and averaging can be used to identify common structures throughout the imaged volume. This knowledge of the localization of molecular complexes provides essential data required to model the dynamics of molecular processes within cells using Brownian dynamics or other simulation techniques.

Here we provide solutions for both a longer time scale Brownian dynamics simulator, and also a set of tools for processing of subtomogram images from cryoET experiments.

Chapter 2

Increasing time scales in Brownian dynamics simulations

2.1 Introduction

Cellular systems are active on a wide variety of temporal and spatial scales. Small molecules and ions can move large distances in nanoseconds. The internal dynamics and motions of proteins and other molecules take place in femtoseconds, as groups of bonds can stretch and relax on these temporal scales [Schlick, 2002]. A large scale rearrangement of a protein, or the time required for two proteins to align for an interaction, are on much larger temporal scales, taking nanoseconds. Neglecting the smallest motions internal to the proteins, the time scale for these particles to diffuse and form an interaction can be on the order of nanoseconds to microseconds [Gabdoulline and Wade, 1997, 2002].

Describing the motion of a protein in a cellular environment at a molecular level requires a detailed representation of the region, and all of the molecules nearby, including solvent particles. Even in a concentrated solution of proteins, for molecular modelling the smallest components will dictate the temporal scales which can be used. Handling water molecules explicitly means that time steps will be orders of magnitude smaller than if the same system were treated as if in a vacuum without water present. Most of the forces and movement considered will consist of solvent-solvent contacts, or solvent-protein contacts. If we are interested in protein-protein interactions, we will have to reduce the time step used, and may not be able to simulate reactions, since many occur orders of magnitude slower than the processes we are modelling.

As an abstraction of this highly detailed description of a cellular environment, Brownian dynamics removes all solvent molecules, and instead replaces them with forces meant to represent their net effect over a longer period of time.
Such an abstraction is very powerful, and here it is extended algorithmically by increasing the range of time steps for which the technique can be used. Designing a method which allows for larger time steps in Brownian dynamics simulations allows us to simulate larger scales of time, and to model more of the cellular functions, especially those which, because of complexity, take longer times to complete. We achieve this with more accurate accounting of collision and reaction events during BD time steps.

2.2 Review of Brownian dynamics, and reaction systems

2.2.1 Brownian motion

Brownian motion is so called because of observations made by botanist Robert Brown. While looking at pollen under a microscope, Brown observed small particles shed by the pollen moving erratically. The reason for the random motion of small particles was proposed independently by Einstein and Smoluchowski [Einstein, 1905, von Smoluchowski, 1906]. Using different approaches, both arrive at an expression for the motion of a particle undergoing Brownian motion, as well as explanations of the observations using the kinetic theory of gases.

Particles undergoing Brownian motion move according to a diffusion equation,

$$\frac{\partial}{\partial t} P(r, t \mid r_0, t_0) = D \nabla^2 P(r, t \mid r_0, t_0) \qquad (2.1)$$

where $P(r, t \mid r_0, t_0)$ is the probability that the particle will be at position $r$ at time $t$ given the particle was at position $r_0$ at time $t_0$. The rate of diffusion is given by the parameter $D$, with dimensions length$^2$/time.

Solving the above differential equation requires boundary conditions. For an isolated particle, with the introduction of

$$P(r, t_0 \mid r_0, t_0) = \delta(r - r_0), \qquad \lim_{r \to \infty} P(r, t \mid r_0, t_0) = 0,$$

the equation can be solved, and the solution is the Gaussian distribution

$$P(r, t + \Delta t \mid r_0, t) = (4 \pi D \Delta t)^{-3/2} \exp\left(-\frac{|r - r_0|^2}{4 D \Delta t}\right). \qquad (2.2)$$

This solution does not account for other particles and the resulting forces, including direct interactions, hydrodynamic effects mediated by the solvent, or even excluded volume effects. We have also made the assumption, by representing $D$ as a scalar, that the particle is spherical, and diffusion is not biased due to shape. A thin rod, for example, is not rotationally symmetric, and will behave differently than a sphere of the same mass.

To model the motion of many particles, we need to solve the many body version of the diffusion equation. For most constraints, however, there is no closed form solution. In some particular cases there are positive results, such as the two body problem [Kim and Shin, 1999]. Approximations that take into account the hydrodynamic forces and intermolecular forces have also been developed [Ermak and McCammon, 1978, Northrup and Erickson, 1992]. Depending on the scale of interactions investigated, the temporal scales used begin to approach those of molecular level simulations.
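To make the sampling rule concrete, the following is a minimal sketch (not from the thesis; it assumes NumPy, and the values of $D$ and $\Delta t$ are illustrative) of drawing free-diffusion displacements from the Gaussian solution (2.2) and checking that the mean squared displacement matches the expected $6 D \Delta t$ in three dimensions:

```python
import numpy as np

def sample_displacements(n, D, dt, rng):
    # Each Cartesian component of a free-diffusion step is normal with
    # mean 0 and variance 2*D*dt, per the Gaussian solution (2.2).
    return rng.normal(0.0, np.sqrt(2.0 * D * dt), size=(n, 3))

rng = np.random.default_rng(0)
D, dt = 0.5, 1.0  # nm^2/ns and ns; illustrative values, not from the text
steps = sample_displacements(100_000, D, dt, rng)

# In 3D the mean squared displacement of an isolated particle is 6*D*dt.
msd = np.mean(np.sum(steps ** 2, axis=1))
print(msd, 6.0 * D * dt)  # the two numbers should agree closely
```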
2.2.2 Reaction systems

Another component of the biological systems we would like to model is interactions between particles. We are interested in very coarse representations of interactions. Brownian dynamics simulations exist which model reactions between particles at the atomic level [Gabdoulline and Wade, 1997, 2002]. These take into account the structure of the molecule, and account for hydrodynamic interactions and the rotational diffusion of the particle. Inter-molecular forces are considered and play a role in steering and orienting the particles in order for bonds to form. On a coarser level, all of the orientation and rotational dynamics can be ignored and incorporated into a probability used to account for observed reaction rates [Gabdoulline and Wade, 1997, 2002].

The reaction system $A + B \rightleftharpoons C$ describes a reaction where two particles of types $A$ and $B$ come into contact and form a complex $C$. The associated reaction rates and parameters define the behavior of the reaction in bulk solutions:

$$A + B \underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}} C \qquad (2.3)$$

$k_{\mathrm{on}}$ and $k_{\mathrm{off}}$ are the macroscopic reaction rates, which are related to the equilibrium constant ($K_{eq}$) and the concentrations of the species by

$$K_{eq} = \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}} = \frac{[C]}{[A][B]} \qquad (2.4)$$

where $[A]$ is the concentration of the species $A$.

2.2.3 Computational dynamics

Simple molecular systems were some of the first applications of the new field of computational science. The paper introducing Monte Carlo methods [Metropolis et al., 1953] focuses on the derivation of the equation of state for simple hard sphere systems using the MANIAC computer system at Los Alamos. In retrospect, with the role that Monte Carlo algorithms have come to play in the computational sciences, the paper's conclusion seems to undersell the potential of MC algorithms: "The method of Monte Carlo integrations over configuration space seems to be a feasible approach to statistical mechanical problems which are as yet not analytically soluble." [Metropolis et al., 1953] A follow up paper applied MC to a system of particles experiencing a Lennard-Jones potential, a more realistic simulation of molecular forces [Rosenbluth and Rosenbluth, 1954]. Later researchers integrated the equations of motion for non-equilibrium collections of atoms to computationally determine equilibrium properties [Alder and Wainwright, 1957, Wood and Jacobson, 1957]. This of course was only the beginning of computational models of molecular systems. Now computational methods are considered fundamental to the study of molecular systems, and many algorithms have been developed which cover BD and a wide variety of other techniques.

2.2.4 Reaction-diffusion algorithms

Brownian dynamics is a particle based method which simulates each particle as it moves through the modeled environment. Alternative methods treat the numbers of particles as variables in stochastic differential equations [Gillespie, 1977]. While methods like these can simulate population averages, they can not reflect spatial organization and inhomogeneities [Beck et al., 2011]. Extensions which subdivide space and allow systems of differential equations to interact allow more accuracy in spatial modelling, but still fail to take into account crowding and other effects of individual particles and their interactions [Kim and Yethiraj, 2010, 2011, Ando and Skolnick, 2010]. Particle based simulations are able to simulate spatial inhomogeneities, crowding effects, and the inherent stochasticity of small numbers of reactants [Długosz and Trylska, 2011, Dobrzynski et al., 2007]. Other methods [Ermak and McCammon, 1978, Northrup and Erickson, 1992] account for the temporal evolution of the system by sampling particle positions at discrete time intervals.

Methods have been developed that allow for larger time steps than traditional Brownian algorithms, by exploiting features of the simulation. If the simulation is not crowded, Green's Function Reaction Dynamics (GFRD) can decompose the simulation into one and two body problems [Takahashi et al., 2005, van Zon and ten Wolde, 2005].
These sub-problems can then be solved exactly for large time steps, with some accounting necessary when the small subsystems interact. Assuming that there is enough space between these separate systems, we can take large time steps for large portions of the simulation, falling back on traditional BD whenever the environment requires it due to crowding of particles into larger groups [Takahashi et al., 2005, van Zon and ten Wolde, 2005].

There are several constraints on the time steps that can be used in Brownian dynamics. First, the system must have a time step small enough that the force approximation at the original point is valid over the entire range of motion of the particle. Secondly, the time step should be chosen so that particles cannot pass through each other during a step with the unphysical situation going undetected. When dealing with overlaps, some simulations allow small overlaps and use a repulsive potential to push apart any overlapping particles from the previous step; others reject these moves. The force incorporation adds another force whose scale needs to be considered in the time step determination. Thirdly, we have assumed that $k_d \Delta t \ll 1$.

There are many different ways to handle reactions as well. In many simulations, reaction events are evaluated from the subset of particles found to be overlapping at the end of the time step, with probabilities chosen to reproduce the desired kinetic rates [Morelli and ten Wolde, 2008]. Other methods have taken into account the probability that a pair of particles did come into contact during the elapsed time, based on the beginning and ending positions of the particles [Barenbrug et al., 2002].

Many Brownian dynamics algorithms trace back to Ermak and McCammon, who derived, from the high-friction limit of the Langevin equation, the expression of motion governing Brownian particles, and from there a simple integrator for simulating their motion. Their formulation is found in the core of almost all simulations.

2.3 Methodology

2.3.1 Diffusion simulation

Our simulation of diffusion is a straightforward and traditional adaptation of the equation for diffusion of a single isolated particle,

$$P(r, t + \Delta t \mid r_0, t) = (4 \pi D \Delta t)^{-3/2} \exp\left(-\frac{|r - r_0|^2}{4 D \Delta t}\right). \qquad (2.5)$$

We simply sample the next position at the end of $\Delta t$ according to this Gaussian distribution:

$$X(t + \Delta t) = X(t) + N(0, 2 D \Delta t) \qquad (2.6)$$

where $N(0, 2 D \Delta t)$ is the 3-dimensional normal distribution centered on the origin, with variance $2 D \Delta t$ in each coordinate and no covariance entries. Each particle is updated independently, and for the moment we have neglected any forces between particles. We do, however, enforce excluded volume constraints. Any move which results in particle overlap is rejected, and another position is drawn from the probability distribution representing the new location $X(t + \Delta t)$.
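As an illustration of this move-and-reject scheme, here is a minimal sketch (assuming NumPy; the brute-force overlap check and the retry limit are simplifications for clarity, not the thesis implementation):

```python
import numpy as np

def bd_step(positions, radii, D, dt, rng, max_retries=10):
    """One excluded-volume Brownian dynamics step (Section 2.3.1):
    displace each particle by a Gaussian step of variance 2*D_i*dt per
    coordinate, rejecting and redrawing moves that create an overlap."""
    n = len(positions)
    for i in range(n):
        for _ in range(max_retries):
            trial = positions[i] + rng.normal(0.0, np.sqrt(2.0 * D[i] * dt), 3)
            # brute-force overlap test against all other particles
            others = np.delete(np.arange(n), i)
            dists = np.linalg.norm(positions[others] - trial, axis=1)
            if np.all(dists >= radii[i] + radii[others]):
                positions[i] = trial
                break
        # if every redraw overlaps, the particle keeps its position this step
    return positions
```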
2.3.2 Reaction systems

For the purpose of BD simulation, the macroscopic reaction

$$A + B \underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}} C \qquad (2.7)$$

is decomposed into two events [Agmon and Szabo, 1990, Morelli and ten Wolde, 2008]:

$$A + B \underset{k_{-D}}{\overset{k_D}{\rightleftharpoons}} AB \underset{k_d}{\overset{k_a}{\rightleftharpoons}} C \qquad (2.8)$$

Here we have separated the diffusion from the reaction event. The intermediate complex $AB$ is an encounter complex formed by two particles that have come into contact but which have yet to react. The rates of diffusion limited encounters can be calculated separately from the reaction probabilities.

The first reaction, the formation of the encounter complex, occurs according to the diffusion limited rate $k_D$. For our simple spherical particles, where there are no forces, this is the Smoluchowski rate $k_D = 4 \pi R D$, where $R$ and $D$ are the sums of the particle radii and diffusion rates [Rice, 1985]. The second reaction, where the encounter complex, with the particles in contact, undergoes the reaction to form the actual complex, occurs at the intrinsic rate $k_a$. This rate is where we tune reaction systems according to their observed rates. In a coarse simulation, where we have neglected intermolecular forces and their steering effects, and the size and shape of the active sites of the particles, this rate will reflect the probability of the particles aligning and actually reacting. It is also possible from the encounter complex for the particles to separate and diffuse away from each other in space. The complex $C$ will disassociate according to the rate $k_d$. This rate is a function of the forces holding the particles together, and the solvent the complex exists in.

With the parameters chosen for $k_a$ and $k_d$, we are almost able to reconstruct the macroscopic state using BD simulation. The parameters still left to be defined are how to define an encounter complex, and how to place particles when the encounter complex disassociates. A correct definition of these parameters will satisfy detailed balance [Morelli and ten Wolde, 2008]. First we define the probabilities for each event:

$P_{acc}$: the probability of the reaction progressing from the encounter complex to the reaction product $C$.

$P_{col}(r)$: the probability of a collision for two particles $A$ and $B$ separated by distance $r$.

$P_{sep}(r)$: the probability of the particles of an encounter complex being found at distance $r$ when they separate.

$P_{dis}$: the probability of the product $C$ disassociating.

For any complex, the probability of disassociation does not have a temporal component, since it is a memoryless process. $P_{dis}$ can therefore be modelled as a Poisson distribution of waiting times, and the probability of disassociation for each time step is then defined as $P_{dis} = 1 - \exp(-k_d \Delta t) \approx k_d \Delta t$ when $k_d \Delta t \ll 1$ [Morelli and ten Wolde, 2008].

To omit systematic errors, the system must obey detailed balance requirements. One requirement is that the particle placement after a disassociation does not influence the probability of a reaction with the same pair more than should be expected. This is accomplished by ensuring that the distributions $P_{col}(r)$ and $P_{sep}(r)$ are the same distribution, differing only by scale. The normalization factor used to scale $P_{sep}(r)$ is $N$; this scaling ensures that the distribution integrates to 1 [Morelli and ten Wolde, 2008]. This is not required of $P_{col}(r)$, since it only describes particles which result in a reaction. The normalization $N$ is the integral of $P_{col}(r)$ over all locations a particle could start at and still interact with the particle located $r$ away:

$$P_{sep}(r) = \frac{P_{col}(r)}{\int_{|r| > R} P_{col}(r)\, dr} = \frac{P_{col}(r)}{4 \pi \int_R^\infty P_{col}(r) r^2\, dr} \qquad (2.9)$$

A second requirement of detailed balance, which will also allow us to relate the probabilities to the known reaction rates, is

$$K_{eq} = \frac{k_{\mathrm{on}}}{k_{\mathrm{off}}} = \frac{k_a}{k_d} = \frac{P_{col}(r) P_{acc}}{P_{sep}(r) P_{dis}} \qquad (2.10)$$

where each probability is a rate for a defined event in the above expanded reaction. The relationship of the forward and reverse probabilities can be defined in terms of the probabilities of Brownian dynamics moves over the course of a time step of length $\Delta t$.

Overlaying the above parameters on the previous reaction system, we can view our chemical equations as a state diagram with probabilities of transitions:
$$A + B \underset{P_{sep}(r)}{\overset{P_{col}(r)}{\rightleftharpoons}} AB \underset{P_{dis}}{\overset{P_{acc}}{\rightleftharpoons}} C \qquad (2.11)$$

Again, all of these probabilities are defined only for a short time step $\Delta t$. With these event probabilities defined, we identify some constraints on the time step dictated by the model. The probability for accepting the reaction can then be related to the others:

$$P_{acc} = \frac{k_a}{k_d} \cdot \frac{P_{sep}(r) P_{dis}}{P_{col}(r)} \qquad (2.12)$$
$$= \frac{k_a \Delta t}{4 \pi \int_R^\infty r^2 P_{col}(r)\, dr} \qquad (2.13)$$
$$= \frac{k_a \Delta t}{N} \qquad (2.14)$$

where $k_a$ is the microscopic association rate, $r$ is the distance between two particles $A$ and $B$, and $R$ is the sum of the radii of the two particles.

In the paper introducing the concept of relating detailed balance to the microscopic rates through the above system, Morelli and ten Wolde [Morelli and ten Wolde, 2008] give an example calculation for a simple Brownian dynamics algorithm. In their system, reaction events occur when, at the end of a time step in which particles are all randomly displaced, two particles are found to be overlapping. This is the formation of the encounter complex in the system. Given this definition, the probability $P_{col}(r)$ is calculated by integration over all possible locations that result in overlap, for two particles diffusing around each other. This method corresponds to how most BD algorithms traditionally handle reactions. The method we introduce uses the same balance rules but a different interaction system to achieve better limits on the time step.

2.3.3 Reaction before Move reaction system

By defining the encounter complex as the result of overlapping particles after a Brownian displacement, the maximum time step is naturally limited. In order to generate enough collisions to reproduce the correct dynamics, the time step must be kept small in order to not miss the motion of the particle during the elapsed time. Since the positions at the end of the time steps are all that are sampled, and all of the diffusive motion between those two points has been ignored, it is necessary to maintain small time steps, so as to not neglect any activity, and mistakenly ignore potential reactions or encounter events that might be seen if the time step were smaller.

To this end, we instead account for the position of the particles during the entire time step. This is accomplished by calculating, before the move, whether or not the encounter complex will be formed. For two diffusing particles, the probability of coming into contact during the time step can be written in closed form [Kim and Shin, 1999, Rice, 1985, Agmon and Szabo, 1990]:

$$P_{col}(r) = \frac{R}{r} \operatorname{erfc}\left(\frac{r - R}{\sqrt{4 D \Delta t}}\right) \qquad (2.16)$$

Here $R$ is the sum of the radii, $D$ is the sum of the two diffusion rates, $r$ is the distance between the particles, and $\Delta t$ is the time step.

We can next calculate the normalization factor which we use to scale the $P_{sep}(r)$ distribution. This integral over $P_{col}(r)$ is used to scale the $P_{sep}(r)$ values so they integrate to one:

$$N = 4 \pi \int_R^\infty P_{col}(r) r^2\, dr = 4 \pi \left(R D \Delta t + 2 R^2 \sqrt{\frac{D \Delta t}{\pi}}\right) \qquad (2.17)$$

The acceptance probability $P_{acc}$ is determined as

$$P_{acc} = \frac{k_a \Delta t}{N} = \frac{k_a \Delta t}{4 \pi \left(R D \Delta t + 2 R^2 \sqrt{\frac{D \Delta t}{\pi}}\right)} \qquad (2.18)$$

All of our probabilities are now determined:

$$P_{col}(r) = \frac{R}{r} \operatorname{erfc}\left(\frac{r - R}{\sqrt{4 D \Delta t}}\right), \quad P_{acc} = \frac{k_a \Delta t}{N}, \quad P_{sep}(r) = \frac{R}{r N} \operatorname{erfc}\left(\frac{r - R}{\sqrt{4 D \Delta t}}\right), \quad P_{dis} = k_d \Delta t$$
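The closed forms above are simple to evaluate. The following sketch (standard library only; the numerical values are illustrative, chosen in the spirit of the later examples) computes $P_{col}$, $N$, and $P_{acc}$:

```python
import math

def p_col(r, R, D, dt):
    # Eq. (2.16): collision probability during dt at separation r
    return (R / r) * math.erfc((r - R) / math.sqrt(4.0 * D * dt))

def norm_N(R, D, dt):
    # Eq. (2.17): N = 4*pi*(R*D*dt + 2*R^2*sqrt(D*dt/pi))
    return 4.0 * math.pi * (R * D * dt + 2.0 * R ** 2 * math.sqrt(D * dt / math.pi))

def p_acc(k_a, R, D, dt):
    # Eq. (2.18): acceptance probability per detected collision
    return k_a * dt / norm_N(R, D, dt)

# Two particles of radius 0.5 nm (R = 1 nm summed) diffusing at
# 0.5 nm^2/ns each (D = 1 nm^2/ns summed); k_a = 0.16*k_D as in Figure 2.4.
R, D, dt = 1.0, 1.0, 0.1
k_a = 0.16 * 4.0 * math.pi * R * D
print(p_col(R, R, D, dt))    # 1.0: particles already in contact must collide
print(p_acc(k_a, R, D, dt))  # well below the advised 0.1 ceiling
```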
2.3.4 Reaction Before Move algorithm

The diffusion and reaction scheme are diagrammed in Figure 2.1 and the listing of Algorithm 1.

1. Detect potential reactions for this time step.

   (a) For all pairs of particles which can potentially interact, located within a cutoff distance: draw a random number and compare it to the probability of the reaction taking place (formation of the encounter complex and the forward reaction, $P_{col}(r) P_{acc}$). If the reaction occurs, add the reaction to REACTION-LIST.

   (b) For all individual particles or complexes which can undergo a disassociation ($A \to B + C$) or a transformation ($A \to B$), draw a random number, and add the reaction to REACTION-LIST if the value is lower than the reaction probability.

   (c) Filter REACTION-LIST: randomly walk over REACTION-LIST. For every reaction, if either particle is already involved in a reaction previously seen, remove the reaction from REACTION-LIST.

2. Carry out reactions. For every reaction in REACTION-LIST:

   (a) If the reaction is an association ($A + B \to C$): replace the particles with the product particle located at the location of $B$. If this introduces an overlap with a particle other than $A$ or $B$, reject the reaction, leave the particles disassociated, and mark them as not being in a reaction this step.

   (b) If the reaction is a disassociation ($C \to A + B$): replace the particle $C$ with particle $A$, and place particle $B$ at distance $r$ away from $A$, in a random direction, according to the distribution of separation distances $P_{sep}(r)$. If this results in an overlap, retry the placement. If this fails multiple times, reject the disassociation move, restore particle $C$, and mark the particle as not involved in a reaction this step.

3. Move particles. For every particle which has not participated in a reaction: displace the particle according to its diffusion rate. If this results in an overlap with another particle, retry the move. If this fails multiple times, the particle does not move during the time step.

Algorithm 1: Reaction before move Brownian dynamics (RBM-BD) algorithm.
Precondition: $\vec{x}_1, \ldots, \vec{x}_n$ are the positions of the particles, $R_i$ are radii, $D_i$ are diffusion rates, $\Delta t$ is the time step.

    function RBM-BD-Single-Step($\vec{x}$, $\vec{R}$, $\vec{D}$, $\Delta t$)
        for all pairs of particles $i$ and $j$ do
            if $\|\vec{x}_i - \vec{x}_j\|_2 - (R_i + R_j) > 6\sqrt{(D_i + D_j)\Delta t}$ then
                continue                               # particles will not contact during the time step
            end if
            if RAND() $< P_{col}(\|\vec{x}_i - \vec{x}_j\|_2) \cdot P_{acc}$ then
                $L \leftarrow L \cup \{(i, j)\}$       # add particle pair to reaction list
            end if
        end for
        for all particles $i$ which are a complex do
            if RAND() $< P_{dis}$ then                 # check if the complex disassociates
                $L \leftarrow L \cup \{(i)\}$          # add particle to reaction list
            end if
        end for
        for each reaction in $L$ do
            if forward reaction $A + B \to C$ then
                replace particle $A$ with particle $C$; remove particle $B$ from the simulation;
                if an overlap occurs, reject the reaction
            end if
            if reverse reaction $C \to A + B$ then
                place $A$ at $C$ and place $B$ at distance $r$ sampled from $P_{sep}(r)$;
                if an overlap occurs, reject the reaction
            end if
        end for
        for all particles $i$ not in $L$ do
            $\vec{x}_i \leftarrow \vec{x}_i + N(0, 2 D_i \Delta t)$;
            if an overlap occurs, reject the move
        end for
    end function

Figure 2.1: Flowchart of the RBM method.
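One sub-step of the listing that deserves elaboration is drawing the separation distance in step 2(b). Below is a minimal sketch (standard library only; the cutoff and grid-based envelope are illustrative choices, not specified in the text) of rejection sampling $r$ from the density proportional to $r^2 P_{col}(r)$, i.e. from $P_{sep}(r)$ with the volume element included:

```python
import math
import random

def sample_separation(R, D, dt, rng=random):
    """Draw a separation r for a dissociating pair from the density
    proportional to r^2 * P_col(r) (Eq. 2.9 with the volume element)."""
    sigma = math.sqrt(4.0 * D * dt)
    r_max = R + 8.0 * sigma  # P_col is negligible beyond a few sigma
    density = lambda r: r * R * math.erfc((r - R) / sigma)  # r^2 * (R/r) * erfc(...)
    # crude flat envelope: the maximum of the unnormalized density on a grid
    grid = [R + k * (r_max - R) / 1000.0 for k in range(1001)]
    bound = max(density(r) for r in grid)
    while True:  # standard rejection sampling
        r = R + rng.random() * (r_max - R)
        if rng.random() * bound <= density(r):
            return r
```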
2.4 Results

We now compare our method with the traditional BD method, which treats overlaps at the end of time steps as potential reaction events. Here we will demonstrate that larger time steps are possible with the RBM method. The examples and simulations reflect conditions that arise in molecular systems, focusing on parameters that reflect biological conditions for proteins in cellular environments.

The benefits from the method are twofold. First, an increase in the time step that can be employed without missing potential reactions. Second, because of the first, the time step can be increased even further without the probability of reaction per collision ($P_{acc}$) growing larger.

2.4.1 Collision probability

A shortcoming of detecting potential reaction events from particle overlap at the end of time steps is that collisions which occur during the time step are not accounted for. Although the particles may not overlap before or after the time step, there is a chance that the diffusive path taken by the particles led to an interaction during the elapsed time. In our method, where we account for these missed collisions, our time steps can be larger without missing collisions, and therefore the probability of our reaction given an overlap can be held low.

There is a simple proof that our $P_{col}(r)$ value is always at least twice as large as when reactions are drawn from overlapping particles. Suppose instead of another particle, a particle we are tracking interacts with a wall occupying the simulation volume. Consider the set of moves for which the particle comes into contact with the wall during the time step. The RBM method accounts for all of these events by calculating the probability of a collision during the time step. The traditional method, however, will only account for those cases where the particle remains in contact, or overlaps, at the end of the time step. For every random walk that the particle can take from the point of first contact on, there is another walk which is equally likely, defined as the mirror image, where the wall's surface is the mirror. One of these walks ends in an overlap of the wall, and one does not. Both are accounted for by the RBM method, but only the move which ends in an overlap is accounted for by the overlap detection method. The diffusion process is memoryless, and for the remainder of the time step after initial contact the particle is equally likely to end on either side of the wall we have described. So in this case, our method will detect twice as many collision events. However, the particles we are interacting with occupy much less space than an infinite wall; our rate is therefore at minimum twice as large, since the wall is the limiting case as the interacting particle's radius increases to infinity.
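The factor of two can be checked numerically. The sketch below (assuming NumPy; the parameters are illustrative) approximates continuous one-dimensional Brownian paths near a wall with many fine sub-steps, and compares the fraction of paths that touch the wall at any point during the step, which RBM-style accounting counts, against the fraction that end overlapping the wall, which endpoint-overlap detection counts:

```python
import numpy as np

rng = np.random.default_rng(1)
D, t, d = 0.5, 1.0, 0.5       # diffusion rate, time step, starting distance to wall
n_sub, n_traj = 500, 20_000   # fine sub-steps approximate the continuous path

# 1D Brownian paths starting at distance d from a wall at position 0
steps = rng.normal(0.0, np.sqrt(2.0 * D * t / n_sub), size=(n_traj, n_sub))
paths = d + np.cumsum(steps, axis=1)

touched = np.mean(paths.min(axis=1) <= 0.0)  # contacted the wall at some point
end_overlap = np.mean(paths[:, -1] <= 0.0)   # overlapping the wall at step's end

# Reflection principle: P(touch) = 2 * P(end past the wall)
print(touched, 2.0 * end_overlap)
```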
The dierence in P acc between the two methods is also dramatically dierent as can be seen in Figure ??B. 21 Since it is advised to keep P acc < 0:1 for numerical accuracy of the approximation equations [Morelli and ten Wolde, 2008], this dierence radically eects the size of time steps available to the two algorithms, with several orders of magnitude dierence for the reaction system described in Figure ??. Figure 2.3: (A) The normalization factorN(t) as a function of the simulation's time step t. (Green line) N as observed in the RBM method and (blue line) N as observed in the traditional BD scheme. (B) The probability of accepting a reaction upon the collision between two particles P acc (t) as a function of the time step t used in the simulation. The functionP acc (t) diers dramatically between the traditional BD approach and our RBM method. Most importantly, the values ofP acc remain at relatively low values over a wide range of time steps, while P acc in the traditional BD scheme increases dramatically with increasing time step length. 2.4.2 RBM accuracy assessment For two diusing particles which can react, there is a closed form expression for their separation distance, and reaction state [Kim and Shin, 1999] as a function of initial condition and elapsed time. We can use this closed form solution to evaluate the accuracy of our method, across dierent simulation schemes and time scales. 22 One measurable quantity is the radial distribution function. This distribution is the probability of nding the particles separated by a given distance after elapsed time. The integral of the radial distribution function is the fraction of particles which have survived, and have not reacted. This value is the survival rate. We have carried out 30,000 simulations for 6 dierent time steps. We begin with two particles, initially at contact, but not reacting. The error that does exist in the method is related to the placement of particles. Even though we have accounted for the collisions that have occurred, we place the particle as if it were diusing in an isolated environment, without accounting for depletion eects around reaction partners to account for any reactions that have occurred. This is the reason our accuracy improves with time step. The source of error in the radial distribution function, and the survival rates are due to the error in placement for the moves. There should be a slight bias in placement away from the other particle whenever we do not react. This is handled in other methods by taking moves, and then carrying out reactions according to the probability that a reaction took place conditioned on the start and end position [Barenbrug et al., 2002]. This method does not have the same bias, but the calculations required are more expensive. 2.4.3 Annihilation simulation Next we consider the annihilation reaction A +A!;. In this simulation, whenever two particles of typeA come into contact, they annihilate each other. The reaction takes place at the diusion limited rate, which corresponds to P acc = 1, independent of time step. We place 10 5 particles of radius 1 nm, with diusion rate D = 1 nm 2 /ns randomly in a box, with sides of 216:55 nm which gives the initial concentration of 1% occupied volume. Assuming that the particles have been placed according to the steady state that exists in the box, the number of particles remaining as a function of elapsed simulation time can be calculated [Barenbrug et al., 2002]. 
We compare our method with a traditional algorithm which takes reactions from the pairs of particles overlapping at the end of a time step.

Figure 2.4: The simulation is carried out with two particles of radius 0.5 nm, with diffusion rates of 0.5 nm$^2$/ns, and an intrinsic reaction rate $k_a = 0.16 k_D$. The reaction rate is chosen close to the diffusion rate to amplify errors in reaction probability. The total simulated time is 20 ns. (A) The radial distribution for two particles, with the analytical solution. Across the 6 orders of magnitude of time step, there is very little difference in the accuracy of the method. (B) The survival probability as a function of elapsed time during the simulation. These are the integrals underneath the curves in (A), but for different time steps; the integrals for the actual values in (A) would correspond to the rightmost values at an elapsed time of 20 ns. There is increasing accuracy with time step, since with fewer, larger steps the placement errors have less chance to accumulate than at the smaller time steps. (C) Relative error with respect to the analytical solution across 6 orders of magnitude.

The total error between the RBM method and the analytical solution is similar for different time steps. In contrast, the traditional BD scheme performs poorly for increasing time steps: the error quickly grows an order of magnitude larger as the time step increases modestly (Figure 2.5).

Figure 2.5: Annihilation reaction of 10,000 particles which destroy each other on collision, at the diffusion limited reaction rate. The simulation is carried out for the RBM method and the traditional BD approach.

2.4.4 Crowded diffusion

When particles diffuse in a crowded environment, they are slowed by interactions with neighboring particles. Here we measure the effect of time step size on the apparent diffusion rate of a particle in a crowded environment at different concentrations of crowding particles. Because of the coarse method used to place particles, there will be problems with many overlaps in placement, which will lead to many rejections of moves, which in turn will cause issues with the observed diffusion rate. The observed rate will differ from the expected decrease in diffusion rate caused by a crowded neighborhood.

We consider particles of radius 1 nm, which diffuse at the rate 0.25 nm$^2$/ns. For 6 different crowding levels, we measure the time required for particles to travel 10 nm, over 5000 trials (Figure 2.6).

The time to escape increases as a function of crowding, which is expected [Kim and Yethiraj, 2010, 2011]. There are errors, however, for increasing crowding at large time steps. For up to 15% occupied volume, there are few problems increasing the time step across a few orders of magnitude. For larger time steps, or larger crowding fractions, the error is significant. For these cases, the time step must be chosen to limit the error in this diffusion rate.

Figure 2.6: The effect of crowding and time step on the time required for a small particle to escape a field of identically sized particles. For modest crowding and time steps, the method does not lose accuracy. As the crowding approaches 20%, the time step needs to be constrained to avoid creating unphysical artifacts in the crowding.

2.5 Conclusions

The introduced RBM method for particle-based reaction-diffusion simulation allows for the use of larger time steps in simulations of BD systems.
By accounting for potential reactions that can occur during the next time step, we capture reactions that are missed by other systems, including the popular option of selecting reactions from the subset of particles that are overlapping at the end of the time step. A variety of tests confirm the accuracy of the method and its ability to reproduce analytic results across different biologically relevant conditions. The achieved increase in the time steps of BD simulations allows for the modelling of more complex processes which require longer biologically relevant temporal scales.

Chapter 3

TomoMiner: A platform for large-scale subtomogram structural analysis

3.1 Introduction

Cryo-electron tomography (cryoET) captures the density distributions of macromolecular complexes and pleomorphic objects at nanometer resolution [Nicastro et al., 2006, Kühner et al., 2009, Medalia et al., 2002, Komeili et al., 2006, Murphy and Jensen, 2007]. CryoET has provided important insights into the ultra-structures of entire bacterial cells, and revealed the structures of numerous macromolecular complexes, particularly membrane-bound complexes [Nicastro et al., 2006, Kühner et al., 2009, Medalia et al., 2002, Komeili et al., 2006, Beck et al., 2007, Lučić et al., 2005, Ben-Harush et al., 2010, Kuybeda et al., 2013].

Several factors complicate the analysis of cryo-electron tomograms to determine structures of macromolecular complexes: relatively high noise levels, relatively low and non-isotropic resolution, and distortions due to electron optical effects and missing data [Frangakis and Förster, 2004, Lučić et al., 2005, Förster et al., 2005, Briggs et al., 2009]. For example, unavoidable systematic distortions are caused by variations in the Contrast Transfer Function (CTF) in individual electron micrographs [Förster et al., 2008, Lučić et al., 2005, Förster et al., 2005, Briggs et al., 2009]. Orientation-specific distortions can result from the missing-wedge effect, which arises from the restricted range of tilt-angles when collecting the micrographs (typically between −70 and +70 degrees). This limitation in data coverage means that Fourier space structure factors are missing from a wedge-shaped region, causing anisotropic resolution and other image artifacts, which depend on the orientation and shape of the object relative to the tilt axis [Bartesaghi et al., 2008, Förster et al., 2008, Xu et al., 2012].

The nominal resolution of tomography images can be increased by aligning and averaging multiple subtomograms containing the same structure [Förster and Hegerl, 2007, Hrabe and Förster, 2011, Schmid, 2011, Beck et al., 2007, Briggs, 2013, Chen et al., 2013]. Typically, for a given complex of interest, sub-volumes (i.e., the subtomograms) are extracted from a tomogram containing distinct examples of the complex; these are aligned and their signals averaged to generate a density map with increased nominal resolution. But if the subtomograms represent a heterogeneous sample (a mixture of different complexes, or multiple conformational or compositional states of the target complex), it is necessary to first group them into homogeneous sets in an unbiased manner, using reference-free classification methods. This classification or clustering step is a common subtask in subtomogram analysis and requires fast and accurate subtomogram alignment. We recently introduced an efficient alignment algorithm designed for use with reference-free subtomogram classification [Xu et al., 2012].
The method relies on fast rotational alignment and uses the Fourier space equivalent form of a constrained correlation measure [Förster et al., 2008] that accounts for missing wedge corrections and density variances in the subtomograms. The fast rotational search is based on 3D volumetric matching [Kovacs and Wriggers, 2002]. We have also proposed an algorithm for gradient-based alignment refinement [Xu and Alber, 2012], and another for fast real space alignment [Xu and Alber, 2013], which is a variant of the method described in Xu et al. [2012], in order to increase the alignment accuracy.

Having a larger number of subtomograms increases the accuracy of classification, which in turn improves the resolution of the resulting averaged structures [Beck et al., 2009, Brandt et al., 2010, Ortiz et al., 2006]. With the rapid advance of cryoET acquisition technologies, it has become easy to acquire a large number (> 10,000) of instances of macromolecular complexes. Offsetting the clear advantage in accuracy is the high computational cost of 3D image processing. The field therefore needs efficient high-throughput computational methods for processing large numbers of subtomograms to take advantage of the available data. Currently, to the best of our knowledge, only a few alignment algorithms have the scalability to process large subtomogram data sets (e.g., Bartesaghi et al., 2008, Chen et al., 2013, Xu et al., 2012, Xu and Alber, 2013). TomoMiner was developed with a particular focus on scalability, and therefore the ability to process a large number of subtomograms (> 100,000).

Here we describe the Python/C++ software package TomoMiner, which includes a high-performance implementation of several of our previously developed methods, including reference-free subtomogram classification [Xu et al., 2012], template matching, and both Fourier space [Xu et al., 2012] and real space [Xu and Alber, 2013] fast subtomogram alignment. All these methods are implemented in a parallel computing framework designed to be highly scalable, efficient, robust, and flexible. The software can run on a single personal computer or in parallel on a computer cluster, in order to quickly process large numbers (> 100,000) of subtomograms. Additionally, TomoMiner provides an open source platform for users to implement their own tomographic structural analysis algorithms within the high-performance environment of the TomoMiner framework.

Although many methods have been proposed for the structural analysis of macromolecular complexes from cryoET subtomograms, only a few software packages are currently available to the research community. These include, but are not limited to, the TOM Toolbox [Nickell et al., 2005], PyTOM [Hrabe et al., 2012, Hrabe, 2015], AV3 [Förster et al., 2005, Nickell et al., 2005], Dynamo [Castaño-Díez et al., 2012], EMAN2 [Tang et al., 2007, Galaz-Montoya et al., 2015], PEET [Nicastro et al., 2006], Bsoft [Heymann et al., 2008], and RELION [Scheres, 2012, Bharat et al., 2015]. TomoMiner complements existing software solutions because it focuses on large-scale data processing, implementing proven algorithms and tools in parallel form, so that researchers can process tens or even hundreds of thousands of subtomograms.

TomoMiner has been designed to run on computer clusters, and scales to hundreds of processors. Some components, such as the data storage interface, have been abstracted, so they are easily replaced with different implementations on different cluster computing platform architectures.
We have also provided a cloud computing version of TomoMiner on Amazon's Elastic Compute Cloud (EC2) [Amazon, 2006]. Research labs without access to substantial computational capacity, or without the ability to adapt, install, and maintain TomoMiner on existing computer clusters, can use the cloud computing version immediately, paying for resources as they go.

3.2 Methodology

3.2.1 Software Implementation

The TomoMiner package contains a suite of programs covering a variety of important tasks in subtomogram analysis, including among others (i) fast and accurate subtomogram alignment which accounts for the missing wedge effect; (ii) large-scale, reference-free subtomogram averaging and classification; (iii) reference-based subtomogram classification; and (iv) template matching for detecting complexes in large tomograms.

TomoMiner is optimized for processing large numbers (~100,000) of subtomograms. It is designed to be scalable, robust, computationally efficient and flexible. This is accomplished through a modular design and a parallel computing architecture. The programs function by breaking computations into smaller independent tasks, which can be computed in parallel by individual CPU cores on a computer cluster.

Software design and modular architecture for parallel processing. TomoMiner's parallel processing system consists of three major components (Figure 3.1): 1) the analysis programs, such as the subtomogram classification, averaging and template matching; 2) the TomoMiner server, which manages the execution of tasks generated by the analysis programs and passes the results of each task back to its requesting program; and 3) the workers, which process the tasks.

Figure 3.1: Parallel processing architecture of the TomoMiner package. Multiple users' analysis programs (e.g., classification, averaging) connect to the TomoMiner server, which distributes tasks to workers running on cluster nodes with access to shared data storage. Each analysis program breaks its computations down into small independent tasks, which are submitted to the TomoMiner server. The server distributes the tasks to workers for execution, monitors worker processes, and passes results from the workers back to the analysis program.

Workers and analysis programs communicate with the server over a network, allowing all of these components to run on separate computers. Since workers are single-threaded, we usually run one worker per CPU core. The workers connect to the server and request tasks for execution. When a task is finished, the worker sends the result to the server and requests a new task. The results of all completed tasks are collected by the analysis program.

Importantly, several independent analysis programs from different users can submit their tasks to the same TomoMiner server and worker pool at the same time. This design allows for maximum utilization of cluster resources; for example, an analysis program may stall as it awaits the completion of a non-parallelizable task or a set of parallel tasks. Typically, some parallel tasks finish earlier than others, so if only one analysis program is using the server, then many workers will remain idle while waiting for slower tasks to finish. We can decrease the idle time on the available cluster nodes by running two or more analysis programs communicating with the same TomoMiner server. Idle nodes can then receive tasks from a second program, and total node utilization will be higher. This design is particularly useful when programs are submitted in a shared cluster environment that limits the number of submitted jobs and the assigned cluster time per user.
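The task-distribution pattern can be sketched in a few lines of Python. The following is a minimal, single-process illustration, not the actual TomoMiner API: the names (Task, worker) are invented for the example, and in-process queues stand in for the network connections between the server, the workers, and the analysis programs.

import queue

class Task:
    """A small message describing one unit of work."""
    def __init__(self, task_id, func, data_path):
        self.task_id = task_id      # identifier used by the server for tracking
        self.func = func            # the computation to perform
        self.data_path = data_path  # path on shared storage, not the data itself

def worker(tasks, results):
    # Request tasks, execute them, and send results back; a None task is a
    # sentinel meaning no more work is available.
    while True:
        task = tasks.get()
        if task is None:
            break
        results.put((task.task_id, task.func(task.data_path)))

tasks, results = queue.Queue(), queue.Queue()
for i in range(4):
    tasks.put(Task(i, lambda p: "aligned " + p, "/shared/subtomograms/%d.mrc" % i))
tasks.put(None)
worker(tasks, results)
while not results.empty():
    print(results.get())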
The TomoMiner software system can run the analysis and server programs, as well as the workers, on a single desktop computer, or run each component independently on separate computers within a cluster. For example, we frequently use TomoMiner with 256 workers running on different machines.

To reduce the communication load on the server, both the tasks and the results that pass through the server are limited to small messages. Large results or inputs, such as the subtomograms themselves, are kept on shared data storage (Figure 3.1), where they can be easily retrieved by workers and analysis programs as needed. A task passed to a worker only needs the path to the data, not the data object itself.

Software robustness. Component failures are inevitable when using distributed computer systems; these must be handled without causing the failure of other system components and without terminating the analysis process. TomoMiner components are designed to be robust to intermittent network and remote failures. When a task is sent to a worker, the TomoMiner server monitors its progress. If the task takes longer than expected, or the connection to the worker is lost, the server re-assigns the task to another worker. If a task fails, the worker passes the failure notification to the server. The server can then take an action, such as rescheduling, or pass the notification back to the analysis program for handling. All tasks are carefully tracked, and an uncompleted task can be attempted by multiple workers when computational resources are available. As soon as one of these attempts succeeds, the server can cancel the remaining instances of the same task, so that the freed workers can request new tasks.

A worker processes each task by launching an independent subprocess, so that the worker program cannot be crashed by bugs in the analysis code. If the subprocess crashes, the worker notifies the server of the failure, but remains on-line. Each task can also be assigned properties to control how it is executed. For example, one can specify the maximum run time, after which the task will be considered lost and the server will send the task to another worker. One can also set an upper limit on the number of times a task can be re-assigned to a new worker after loss or failure. All these features provide the foundation for a robust parallel processing system.
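The subprocess isolation described above can be illustrated with Python's standard multiprocessing module. This is a sketch of the general technique, not TomoMiner's actual worker code; the function names and the one-hour timeout are assumptions made for the example.

import multiprocessing as mp

def run_task_isolated(func, args, timeout=3600.0):
    # Run func(*args) in a child process so that a crash or hang in the
    # analysis code cannot take the worker itself down.
    with mp.Pool(processes=1) as pool:
        pending = pool.apply_async(func, args)
        try:
            return ("ok", pending.get(timeout=timeout))
        except mp.TimeoutError:
            return ("timeout", None)     # server may re-assign the task
        except Exception as exc:
            return ("failed", str(exc))  # notify the server; stay on-line

def fragile_analysis(path):
    if path.endswith("bad.mrc"):
        raise ValueError("corrupt subtomogram")
    return "processed " + path

if __name__ == "__main__":
    print(run_task_isolated(fragile_analysis, ("/shared/a.mrc",)))
    print(run_task_isolated(fragile_analysis, ("/shared/bad.mrc",)))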
Software flexibility. TomoMiner is designed to run multiple analysis programs connected to the same TomoMiner server with the same pool of workers (Figure 3.1). Multiple users can run multiple analysis programs concurrently; the server will manage tasks for multiple programs on the same pool of workers, even for multiple users. Developing new subtomogram analysis programs does not require knowledge of the internal parallel worker implementation, only a way to match the parallel interface to the functions processed by the tasks.

Software components and dependencies. The TomoMiner code consists of several components. The core is a library of basic functions dealing with (i) data input and output, (ii) subtomogram processing, such as fast rotational and translational alignment of subtomograms and averaging, and (iii) calculations of subtomogram correlations. This core is written in C++ to maximize computational efficiency, and has been wrapped into a Python module. All TomoMiner top-level programs are implemented as Python programs. These include analysis programs such as the reference-free subtomogram classification routine, parallel processing programs such as the TomoMiner server, and utility programs such as an FSC (Fourier Shell Correlation) calculator. The choice of languages allows for fast prototyping of new algorithms and interoperability with other software, and Python is more accessible for novice programmers. TomoMiner provides the advantages of developing software in a high-level language without sacrificing performance, because all numerically intensive calculations are carried out by the wrapped C++ functions.

The C++ code is built on top of several existing libraries. The open source Armadillo library [Sanderson, 2010] is used to represent volumes, masks, matrices, and vectors. Fast Fourier Transforms are provided by FFTW [Frigo and Johnson, 2005]. The C++ core is wrapped using Cython [Behnel et al., 2011], which enables a user to call core functions written in C++ directly from Python programs, using Python data structures as arguments. A number of auxiliary routines from SciPy [Oliphant, 2007] and scikit-learn [Pedregosa et al., 2011] are also used by the classification code.

3.2.2 Cloud computing setup

Due to the computationally intensive nature of 3D image processing of large numbers of subtomograms, it is necessary that analysis software scales well and supports parallel computing environments to achieve high performance. TomoMiner was designed to meet these criteria, and can be installed on computer clusters. However, many biology research labs do not have access to a computer cluster with sufficient computational resources. Also, for those labs that do have access to a computer cluster, the hardware and software architectures can vary substantially. The installation and configuration of any specialized software platform on different computer clusters is often non-trivial, and may introduce conflicts with previously installed software and libraries. Therefore, it may be impractical for labs whose main focus is not the structural classification of cryoET data, and who may only occasionally need to perform subtomogram analysis tasks, to invest money and/or labor in setting up and maintaining the required software and hardware.

To provide a solution for labs that only occasionally need to perform subtomogram analysis, we provide a pre-installed and pre-configured TomoMiner system (TomoMinerCloud) in the form of a cloud computing service. TomoMinerCloud is a system image that can be used on publicly available cloud computing platforms, such as Amazon's Elastic Compute Cloud (EC2) [Amazon, 2006]. Cloud platforms allow computational capacity to be purchased as a service, where users are charged based on the amount of computational resources used [Cianfrocco and Leschziner, 2015]. They provide the flexibility to run large computations or analyses using a pool of Virtual Machines (VMs), without the burdens of owning and maintaining hardware or installing cluster management software.

We have built a publicly available virtual machine image with our software system installed to provide cloud services. The service allows users to immediately use the TomoMiner software for large-scale subtomogram analysis, at an affordable cost and with very little configuration or maintenance burden. The amount of computational resources dedicated to the analysis project can be determined dynamically as a function of the data size and budget. Using TomoMinerCloud can also release a laboratory from the sometimes time-consuming burden of installing new parallel software on its own computing clusters. Currently, TomoMinerCloud is available on Amazon Web Services (AWS).

TomoMinerCloud is designed so that researchers can set up a high-performance parallel data analysis environment with little informatics expertise. Inside a Virtual Private Cloud (VPC), the VM used to run the analysis program can be started and accessed from the user's own computer. The same VM can also host the server layer and shared data storage. A large number of workers (hundreds or thousands) can be executed in the cloud, each running on its own VM. Therefore, an end-user only needs a computer with Internet access, a web browser, and a secure shell (SSH) client. No specialized software is required. TomoMinerCloud can be instantiated using the web console of Amazon EC2, and SSH can be used to transfer data and launch the jobs on the VMs.

An additional advantage of TomoMinerCloud is that snapshots can be taken to record the current status of the VM, the TomoMiner program, and the data. The snapshot mechanism can be used to verify the reproducibility of computational experiments, record exact parameter settings and configuration details, measure the effect of bug fixes or algorithmic changes, and share analyses between collaborators. Detailed procedures for using TomoMinerCloud are described in the documentation available from the main TomoMiner website.

3.2.3 Fast rotational alignment and fast subtomogram matching

Subtomograms are 3D volumes defined as 3D arrays of real numbers representing the intensity values at each voxel position. The voxel intensities are the result of a discretization of the density function $f : \mathbb{R}^3 \to \mathbb{R}$. A tomogram is subject to orientation-specific distortions as a result of the missing-wedge effect. This effect is a consequence of the data collection being limited to restricted tilt angle ranges when collecting individual micrographs (with a maximum tilt range of ±70°). As a result, structure factors are missing from a characteristic wedge-shaped region of Fourier space. This missing data leads to anisotropic resolution and distortion artifacts that depend on the structure of the object and its orientation with respect to the tilt-axis.

To accurately calculate the similarity between two subtomograms, we have recently introduced a Fourier space equivalent form [Xu et al., 2012] of a popular constrained correlation score [Förster et al., 2008] that accounts for missing wedge effects. It is based on a subtomogram transform that eliminates the Fourier coefficients located in the missing wedge regions of either of the two subtomograms. For each subtomogram $f$, a missing wedge mask function $M : \mathbb{R}^3 \to \{0, 1\}$ defines valid and missing Fourier coefficients in Fourier space. To allow for missing wedge corrections in our analysis procedures, a series of missing wedge masks can be given as input information together with the subtomograms.

The search for the optimal subtomogram alignments is performed through rigid transformations with rotational and translational components. A transformed subtomogram can be represented as:

$$\Lambda_a \Lambda_R f(x) = f\left(R^{-1}(x - a)\right) \tag{3.1}$$

where $f$ is a subtomogram, $\Lambda_R$ is a transformation operator which applies the rotation given by rotation matrix $R$, and $\Lambda_a$ is a transformation operator applying a shift by vector $a \in \mathbb{R}^3$.

The previously developed correlation score [Xu and Alber, 2012] between subtomograms $f$ and $g$, where $g$ has been rotated by $\Lambda_R$ and shifted by $\Lambda_a$, is defined as

$$c = \mathrm{Re}\left( \frac{\int_{\Omega} (\mathcal{F}f)^{*}\,\mathcal{F}\Lambda_a\Lambda_R g}{\sqrt{\int_{\Omega} (\mathcal{F}f)^{*}\,\mathcal{F}f}\,\sqrt{\int_{\Omega} (\mathcal{F}\Lambda_a\Lambda_R g)^{*}\,\mathcal{F}\Lambda_a\Lambda_R g}} \right) \tag{3.2}$$

Here $\mathcal{F}$ is the Fourier transform and $\Omega := M_f \cap \Lambda_R M_g$ is the region where both missing wedge masks are valid. The optimal rotational alignment $R$ and translation $a$ are found by maximizing this correlation.

The above formulation allowed us to design a fast alignment procedure [Xu et al., 2012]. Summarizing, we first form a translation-invariant approximation score, defined by keeping only the magnitudes of the Fourier coefficients of the subtomograms. This score can be decomposed into three rotational correlation functions. After representing the magnitude values in a spherical harmonics expansion, these rotational correlation functions can be efficiently and simultaneously calculated over all rotation angles using the FFT [Kovacs and Wriggers, 2002]. A small number of local maxima of the approximation score can then be collected, representing a set of candidate rotation angles. Given each candidate rotation, a fast translation search can be performed to obtain the optimal translation $a$, and the overall optimal alignment can then be obtained. This procedure is detailed in Equations 7-9 of Xu et al. [2012].
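A minimal NumPy sketch of the score in Equation 3.2 is given below, under simplifying assumptions: $g$ is taken as already rotated and shifted, the binary wedge masks are supplied directly in Fourier space, and the mean and variance corrections of the full constrained score are omitted. This is an illustration of the formula, not the optimized C++ implementation used by TomoMiner.

import numpy as np

def constrained_correlation(f, g, mask_f, mask_g):
    # Correlate only over Omega, the Fourier coefficients valid in both
    # missing wedge masks (Eq. 3.2, without mean/variance corrections).
    F, G = np.fft.fftn(f), np.fft.fftn(g)
    omega = (mask_f * mask_g).astype(bool)
    Ff, Gg = F[omega], G[omega]
    num = np.sum(np.conj(Ff) * Gg)
    den = np.sqrt(np.sum(np.abs(Ff) ** 2)) * np.sqrt(np.sum(np.abs(Gg) ** 2))
    return float(np.real(num / den))

rng = np.random.default_rng(0)
vol = rng.normal(size=(32, 32, 32))
mask = np.ones((32, 32, 32))  # trivial masks: no missing wedge
print(constrained_correlation(vol, vol, mask, mask))  # ~1.0 for identical volumes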
3.2.4 Experimental Procedures

Generating a benchmark set of cryo-electron tomograms. We tested the performance of TomoMiner with realistically simulated subtomograms as ground truth. This benchmark set of tomograms contains five known protein complexes (Table 3.2). For a reliable assessment of the software, the subtomograms must be generated by simulating the actual tomographic image reconstruction process, including the application of noise, distortions due to the missing wedge effect, and electron optical factors, such as the Contrast Transfer Function (CTF) and Modulation Transfer Function (MTF). We follow a previously described methodology for realistic simulation of the tomographic image reconstruction process [Beck et al., 2009, Förster et al., 2008, Nickell et al., 2005, Xu et al., 2011].

Macromolecular complexes have an electron optical density proportional to the electrostatic potential. The PDB2VOL program from the Situs package [Wriggers et al., 1999] has been used to generate volumes at 4 nm resolution, with a voxel length of 1 nm. The volumes are cubes, whose side length can vary depending on the experiment. The density maps are used to simulate electron micrograph images through a set of tilt-angles. The angles are chosen to represent the experimental conditions of cryo-electron tomography, and to have a missing wedge angle similar to experimental data. For this paper we use a typical tilt-angle range of ±60°. Noise is added to achieve the desired SNR value [Förster et al., 2008]. Next, the images are convolved with the CTF and MTF to simulate optical artifacts [Frank, 1996, Nickell et al., 2005]. The acquisition parameters used are typical of those found in experimental tomograms [Beck et al., 2009]: a voxel grid length of 1 nm, spherical aberration of 2 × 10^-3 m, defocus of -4 × 10^-6 m, and voltage of 300 kV. The MTF is defined as sinc(ω/2), where ω is the fraction of the Nyquist frequency, corresponding to a realistic detector [McMullan et al., 2009]. Finally, a backprojection algorithm [Nickell et al., 2005] is used to produce subtomograms from the tilt series.
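As an illustration of one step of this pipeline, the sketch below applies the detector MTF to a simulated 2D micrograph in Fourier space. It assumes the normalized sinc convention (numpy's sinc(x) = sin(πx)/(πx)) and covers the MTF step only, not the CTF, noise, or backprojection stages.

import numpy as np

def apply_mtf(image):
    # Damp the image's Fourier amplitudes by MTF = sinc(omega / 2), where
    # omega is the spatial frequency as a fraction of Nyquist.
    ny, nx = image.shape
    fy = np.fft.fftfreq(ny) * 2.0  # fftfreq gives cycles/sample; x2 -> Nyquist fraction
    fx = np.fft.fftfreq(nx) * 2.0
    omega = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mtf = np.sinc(omega / 2.0)     # np.sinc(x) = sin(pi x) / (pi x)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mtf))

rng = np.random.default_rng(1)
micrograph = rng.normal(size=(64, 64))   # stand-in for a noisy projection
blurred = apply_mtf(micrograph)
print(blurred.std() < micrograph.std())  # True: high frequencies are damped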
Assessment of classification accuracy. We assess the performance of the reference-free subtomogram classification method on simulation data, comparing the generated results to the ground truth. The accuracy of a classification is measured as the number of true positives. To compare the computed class labels with the ground truth, we constructed a confusion matrix where each row corresponds to a known class and each column to a predicted class label; each matrix element is the number of subtomograms of a known class that received a given class label. A maximum-weight matching [Munkres, 1957] is computed to determine the best correspondence between ground truth classes and detected clusters. That is, we determine the assignment of ground truth classes to class labels which maximizes the number of true positives in the confusion matrix. In the event that we have more generated clusters than true classes, we do not require a one-to-one matching, and allow multiple clusters to map to the same ground truth class.

The accuracy of the generated subtomogram cluster averages is determined by comparison with templates of the ground truth protein complexes. The Pearson correlation score between the two structures is used to quantify the similarity.
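The matching step can be reproduced with the Hungarian-algorithm implementation in SciPy. The sketch below covers the one-to-one case; as noted above, when there are more clusters than classes the assessment relaxes this to a many-to-one mapping. The toy labels are illustrative only.

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_matching_accuracy(true_labels, cluster_labels):
    # Confusion matrix: rows are known classes, columns are cluster labels.
    n_true = true_labels.max() + 1
    n_clus = cluster_labels.max() + 1
    confusion = np.zeros((n_true, n_clus), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        confusion[t, c] += 1
    # Maximum-weight matching: maximize true positives by minimizing the
    # negated confusion matrix.
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(true_labels)

true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
clusters = np.array([2, 2, 0, 0, 0, 0, 1, 1])
print(best_matching_accuracy(true_labels, clusters))  # 7/8 = 0.875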
3.3 TomoMiner Analysis programs

TomoMiner includes high-performance, parallel software implementations of several of our previously described methods, in addition to some new methods. The methods include: 1) fast subtomogram alignment; 2) reference-free and reference-based large-scale subtomogram averaging and classification; and 3) template matching applications. The included executables are summarized in Table 3.1. In the next section we describe the reference-free classification program.

Table 3.1: The executables included in the TomoMiner software package.

Parallel processing programs:
- tm server: Run a server
- tm worker: Run a worker which will process subproblems
- tm watch: Report progress and statistics on the server

Utility programs:
- tm align: Calculate the optimal alignment between two subtomograms using fast rotational matching
- tm fsc: Calculate the Fourier Shell Correlation (FSC) between two aligned structures
- tm corr: Calculate the correlation score of the best alignment between two subtomograms

Analysis programs:
- tm classify: Reference-free or reference-based subtomogram classification
- tm average: Reference-free or reference-based subtomogram alignment and global averaging
- tm match: Template matching

3.3.1 Reference-free Classification

TomoMiner contains a program for large-scale reference-free subtomogram classification. The software is based on a previously published method [Xu et al., 2012], and includes modifications for processing large data sets. The program does not rely on template structures; the only input is a large set of subtomograms that are randomly oriented at the beginning of the iterative process. The outputs are a classification of the subtomograms into individual complexes, a rigid transformation for each subtomogram, and a density map generated by averaging all the aligned subtomograms within each class. In comparison with our previously published method [Xu et al., 2012], which is a variant of the alignment-through-classification method [Bartesaghi et al., 2008], this software implementation has several adaptations to parallelize the algorithm and improve efficiency and scalability.

The reference-free classification is an iterative process. Each iteration consists of the following steps (a sketch of the first two steps follows the list):

Step 1: Dimension reduction. The similarity between subtomograms is measured in a reduced-dimension space in order to focus on the features most relevant for discrimination. For each voxel and its neighbors, this step calculates the average covariance of the voxel intensities across all subtomograms, in a similar way to Xu et al. [2012]. The voxels with the largest covariance are selected as the most informative features, and each subtomogram is represented by a high-dimensional feature vector (see Xu et al. [2012] for details). To account for missing wedge effects, the covariances and feature vectors are calculated on missing-wedge-masked difference maps [Heumann et al., 2011]. Then principal component analysis (PCA) is used to project the high-dimensional feature vectors into a low-dimensional space to form low-dimensional feature vectors. In practice, EM-PCA [Bailey, 2012] is used for its scalability and speed when only a small number of principal components is extracted.

Step 2: Clustering. K-means clustering is performed based on the Euclidean distance between the low-dimensional feature vectors generated in step 1. The value of the K parameter is specified by the user, and should be chosen to over-partition the data set. This is because clusters leading to similar averaged tomograms are easily identified, and the corresponding subtomograms can be merged into one cluster later in the analysis. In our previous method we used hierarchical clustering [Xu et al., 2012], but K-means clustering results in a more efficient and scalable algorithm. Finally, the class labels of all subtomograms are assigned according to the clustering.

Step 3: Generate cluster averages. The subtomograms within each cluster are averaged to generate density maps, which are used as cluster representatives.

Step 4: Alignment of cluster averages. All the averaged density maps resulting from step 3 are grouped using hierarchical clustering, based on the pairwise optimal alignment scores of the cluster averages [Xu et al., 2012]. A silhouette score [Rousseeuw, 1987] determines the optimal cutoff to cluster all averaged density maps into classes. Within each hierarchical class, the map that was generated from the largest number of subtomograms is chosen as a reference, and all other maps in the hierarchical class are aligned relative to this reference.

Step 5: Alignment of subtomograms. All of the original subtomograms are aligned to each of the cluster averages generated in step 4. To allow high-throughput processing of this step, we implemented a computationally efficient alignment algorithm based on fast rotational matching [Xu et al., 2012]. For each subtomogram, the rigid transform with the highest scoring alignment is used as input for the next iteration.

The iterative process (steps 1 to 5) can either be executed for a fixed number of iterations, or terminated when the change in subtomogram class labels or in the cluster averages between two iterations is small.
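Steps 1 and 2 map directly onto standard scikit-learn components, which the classification code already depends on. The sketch below substitutes ordinary PCA for the EM-PCA variant mentioned above, and random vectors for the wedge-masked covariance features, so it illustrates only the shape of the computation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# One high-dimensional feature vector per subtomogram (random stand-ins here).
features = rng.normal(size=(1000, 512))

# Step 1: project the feature vectors into a low-dimensional space.
low_dim = PCA(n_components=10).fit_transform(features)

# Step 2: over-partition with K-means; K is chosen larger than the expected
# number of distinct complexes.
labels = KMeans(n_clusters=10, n_init=10).fit_predict(low_dim)

print(low_dim.shape, np.bincount(labels))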
3.3.2 Reference-based classification

If template structures are provided as a reference, the classification process can use these alongside the averaged density maps of each cluster as cluster representatives.

Figure 3.2: Workflow of the reference-free classification procedure: (1) initialization; (2) projection of the subtomograms into a lower-dimensional space; (3) K-means clustering of the subtomograms in projection space; (4) generation of cluster averages; (5) hierarchical clustering of the averages and alignment of similar averages to a common reference frame; (6) alignment of all subtomograms to each cluster average, saving the best transformation; (8) a convergence test, returning to step (2) if not converged; (7) output of the classification results and cluster averages; (9) the classified and averaged subtomograms.

3.3.3 Subtomogram alignment by fast rotational matching

TomoMiner contains a program for fast rotational alignment [Xu et al., 2012]. This method increases the computational efficiency of subtomogram alignments by at least three orders of magnitude [Xu et al., 2012] compared to exhaustive search methods [Förster et al., 2008], while at the same time accounting for missing wedge effects when calculating the correlations between the tomograms. The missing-wedge-constrained fast alignment method is implemented as a C++ library. This new C++ implementation has been thoroughly tested, and is at least six times faster than our previous MATLAB prototype [MATLAB, 2013] used in Xu et al. [2012]. In addition, TomoMiner implements our previously proposed real space fast subtomogram alignment method [Xu and Alber, 2013].

3.3.4 Template matching

TomoMiner also provides an efficient template matching protocol. Given a set of templates with known structures, and a set of candidate subtomograms with unknown structures extracted through template-free particle picking (e.g. Voss et al., 2009, Langlois et al., 2011), TomoMiner can perform fast alignment [Xu et al., 2012] to compute which structures are most similar to the unknown subtomograms in terms of the alignment score.

3.4 Results

3.4.1 Data scalability, worker scalability, and efficiency

Scalability is an important measure of performance for parallel software. We evaluate it using two measures: data scalability and strong scalability. Data scalability measures the performance of TomoMiner when the number of subtomograms increases while the number of workers is held constant. Strong scalability measures performance when the number of workers increases for a data set of fixed size.

Whether we change the number of processors or the number of subtomograms, we are most interested in the time required to process a single subtomogram. This is captured by the relative efficiency, defined as the ratio of the expected linear rate to the observed rate (total time / subtomogram number). As a reference point for both performance measures (data scalability and strong scalability), we use the highest observed rate among all the calculations as the expected linear rate, representing the ideal scenario. A relative efficiency of 100% corresponds to perfect linear scaling, while a relative efficiency of 50% indicates that the program took twice as long as the ideal scenario.
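A worked example of this efficiency measure, with made-up timing numbers:

# Relative efficiency: the best observed per-subtomogram rate divided by the
# observed rate of each run. Timings here are invented for illustration.
runs = {5000: 380.0, 20000: 1500.0, 100000: 7000.0}  # subtomograms -> seconds

rates = {n: t / n for n, t in runs.items()}  # seconds per subtomogram
ideal = min(rates.values())                  # fastest rate = ideal scenario

for n, rate in sorted(rates.items()):
    print("%7d subtomograms: %5.1f%% of ideal" % (n, 100.0 * ideal / rate))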
Data scalability and strong scalability are assessed for a single iteration of the reference-free subtomogram alignment and averaging process: averaging all subtomograms and aligning them against a single average. The subtomograms are cubes (46^3 voxels) containing a single randomly oriented complex (PDB ID: 2AWB). They were generated following the simulation procedure described in Section 3.2.4, using a signal-to-noise ratio of 0.01 and a tilt angle range of ±60°.

Data scalability. TomoMiner makes effective use of computational resources. When using a constant 256 workers, the computational time increases nearly linearly with increasing numbers of subtomograms (5,000 to 100,000 subtomograms, see Figure 3.3A). The software aligns and averages 100,000 subtomograms in under 2 hours using 256 workers (Figure 3.3A). For all data sets with more than 5,000 subtomograms, the efficiency remains above 80% (Figure 3.3B). TomoMiner scales very well with increasing data and is an efficient platform for data analysis across a wide range of problem sizes.

Strong scalability. When increasing the number of workers for a fixed number of subtomograms, the computing time decreases (Figure 3.3C). For example, for a data set containing 16,000 subtomograms, the computing time dramatically decreases when increasing the number of workers from 32 to 128 (Figure 3.3C,D). Increasing the number of workers further results in less pronounced gains, because the worker pool is not fully utilized. When using 256 workers for 8,000 subtomograms, for example, many of the workers are idle at any given moment, so the processing rate is lower than the expected linear rate, leading to decreased efficiency (Figure 3.3D). Interestingly, we find optimal performance at about 100 subtomograms per worker.

In summary, we can demonstrate that TomoMiner makes effective use of computational resources and is able to process very large numbers of subtomograms in an effective manner.

Figure 3.3: (A) The time required for a single round of alignment and averaging as a function of subtomogram number, for a constant 256 workers. The curve is close to linear across the entire range of data. (B) The relative efficiency of data scalability when additional data is added, for a constant 256 workers. The rate of processing is very stable across several orders of magnitude. (C) The time required for a single round of alignment and averaging for two different data sets, with 8,000 and 16,000 subtomograms; the number of workers varies from 32 to 256. For a relatively small number of subtomograms, there are not enough subproblems generated to occupy 256 workers, so some are idle, creating the plateau seen in the graphs. (D) The relative efficiency of strong scalability. For these problem sizes, TomoMiner scales well, with very little overhead for the increased communication and coordination load of additional workers. There is a clear loss of efficiency when using too many workers for a given problem size, but this demonstrates that even for medium-sized data sets (10,000+ subtomograms) TomoMiner is far from reaching its computational limits.

3.4.2 Performance of reference-free subtomogram classification

We previously presented a reference-free subtomogram classification method [Xu et al., 2012]. We implemented this pipeline in TomoMiner and adapted it to increase its scalability. To test the performance of this program, we classified 100,000 subtomograms, divided into five groups of 20,000, each group depicting a different complex (Figure 3.4A). Each subtomogram is a cube with sides of 41 voxels. The complexes are generated with an SNR of 0.01, and tilt angles in the range of ±60°. The complexes were randomly rotated, and given a random offset from the tomogram center of up to 7 voxels in each dimension. The classification program requires a user-defined number of clusters, which should be chosen to over-partition the data as described earlier. In our example, the initial number of clusters was set to 10 to demonstrate the performance with the expected over-partition of the data.

Table 3.2: To assess the methods we used five different macromolecular complexes selected from the PDB [Berman et al., 2000] and used previously as a test set.

- 1BXR: Carbamoyl phosphate synthetase complexed with the ATP analog AMPPNP
- 1KP8: GroEL-KMgATP(14)
- 1W6T: Octameric enolase from S. pneumoniae
- 1YG6: ClpP
- 2AWB: 50S subunit of the E. coli ribosome

After 10 iterations, the reference-free classification process converged and all the subtomograms were successfully classified. Because we had access to the true subtomograms that were used to generate the data, we could compare the cluster averages to the corresponding true structures for validation. The resulting cluster averages were accurate reconstructions of the true complexes, with Pearson correlation values between cluster averages and the ground truth > 0.9 (Figure 3.4A,C). The over-partition led to several clusters containing identical complexes, which could easily be identified based on the high correlation score between the aligned cluster averages (Figure 3.4B). Subtomograms within a cluster overwhelmingly depicted only a single complex: the fraction of subtomograms from the same complex ranged between 89.4% and 99.9% (Figure 3.4C) for the ten clusters. When using 256 workers, TomoMiner required an average of 207 minutes per iteration to classify the hundred thousand subtomograms without a reference structure.

Figure 3.4: 20,000 subtomograms were generated for each of five different structures, using the procedure defined in Experimental Procedures. The subtomograms were simulated using a signal-to-noise ratio of 0.01 and a tilt angle range of ±60°. TomoMiner converged after 10 iterations of reference-free subtomogram classification, using a cluster number of 10. (A) The averaged subtomograms in each cluster converged to structures close to the ground truth; since there were more clusters than structures, some clusters converged to the same structure. (B) Pairwise correlations between the averaged density maps of all ten clusters. Clusters corresponding to the same complex are easily identified by their high correlation values. (C) The number of subtomograms in each cluster (top). Each cluster was dominated by a single complex. The percentages of subtomograms generated from the dominant complex were 96.2%, 97.8%, 99.9%, 100%, 95.6%, 98.5%, 97.7%, 90.7%, 89.4%, and 99.9% for clusters 1 to 10, respectively. Cluster IDs are shown on the horizontal axis; since the numbers are arbitrary labels, they have been arranged so that similar clusters are adjacent. The correlations between the true structures and the averaged density maps (bottom) demonstrate that the clustering is accurate.

3.4.3 Accuracy increases with larger data sets

Next, we demonstrate the benefit of very large data sets for reference-free subtomogram averaging. We generated 100,000 subtomograms of the 50S subunit of the E. coli ribosome (PDB ID: 2AWB) with an SNR of 0.005 and a tilt angle range of ±60°. Each subtomogram is a cube with a side length of 33 voxels. Our reference-free iterative alignment and averaging pipeline is able to recover the underlying structure. TomoMiner requires an average of 37 minutes per iteration for alignment and averaging using 256 workers. Figure 3.5 shows the correlation score after 20 iterations, when different numbers of subtomograms are given as the input dataset. Using a very large number of subtomograms increases the accuracy of the generated model, demonstrating the advantage of using high-performance parallel analysis software.

Figure 3.5: Accuracy of the averaged density maps generated from reference-free alignment and averaging, measured as the Pearson correlation between the generated averages and the template of the true structure (horizontal axis: number of subtomograms, 100 to 100,000; vertical axis: correlation). Correlations are shown for averages generated with an increasing number of subtomograms in the dataset. We generated 100,000 subtomograms of a randomly oriented complex (PDB ID: 2AWB) using the procedure described in Experimental Procedures, with a signal-to-noise ratio of 0.005 and a tilt angle range of ±60°. The computed average was more accurate when using more subtomograms. TomoMiner's ability to efficiently handle large numbers of subtomograms therefore allows for accurate reconstructions and classifications of structures from noisy data, given sufficiently large datasets.

3.4.4 Reference-free classification of GroEL and GroEL/GroES subtomograms

Next, we demonstrate the use of our reference-free classification method on a set of publicly available experimental subtomograms of purified GroEL and GroEL/GroES complexes, previously published in Förster et al. [2008] and frequently used for testing subtomogram classification methods [Xu et al., 2012]. The dataset consists of two sets of subtomograms: 214 obtained from 13 cryo-electron tomograms of purified GroEL complexes, and 572 obtained from 11 cryo-electron tomograms containing GroEL/GroES complexes. The differences between the subtomograms of the two different complexes are subtle, so their classification is a challenging test case.

Because the total number of subtomograms is small (786), the classification can easily be performed on a single computer with multiple CPU cores. The test was carried out on a workstation with 8 CPU cores and 12 GB of memory. When using 8 parallel workers, and setting the number of classes to K=2, 5 iterations of classification took only 140 minutes. Both major classes of structures, GroEL and GroEL/GroES, were well recovered (Figure 3.6).

Figure 3.6: Reference-free classification of 786 experimental subtomograms containing the GroEL and GroEL/GroES complexes taken from Förster et al. [2008]. Convergence was reached after 5 iterative rounds of reference-free classification. (a) Slices through the resulting cluster averages; the scale bar indicates 5 nm. (b) Cluster averages depicted by isosurface rendering. The atomic structure of the GroEL/GroES complex was fitted into both cluster averages for comparison.

3.4.5 Cost analysis of cloud computing

We have implemented and made TomoMinerCloud publicly available on the Amazon EC2 cloud. The Amazon EC2 cloud infrastructure can be accessed worldwide, with data centers in many regions of the world. Researchers without access to local computing clusters are now able to leverage Amazon's cloud computing infrastructure to perform large-scale data analysis at low cost. Current prices for renting an analysis-program and server VM with 2 cores and 15 GB memory (instance type r3.large) start from $0.175 (USD) per hour, based on Amazon EC2 pricing [Amazon, 2015]. Renting a worker VM with 36 cores and 60 GB memory (instance type c4.8xlarge) can cost as little as $1.763 (USD) per hour. Each such VM can host 36 workers, so the cost per worker per hour is $0.049 (USD). The design of our task distribution also conveniently enables one to rent spot instances, which use unused EC2 capacity at a significantly lower price. Renting solid-state storage costs $0.10 (USD) per GB per month. Uploading data is free of charge. Downloading analysis results is nearly free of charge, because the generated results consist of only a small amount of data, namely the rigid transformations of each subtomogram and the class averages.
Intercommunication among VMs inside the VPC is also free of charge. Given such pricing, the total cost for the reference-free classification example of 100,000 subtomograms depicted in Figure 3.4 is estimated to be below $500. Therefore, TomoMinerCloud is an affordable and efficient solution for high-performance subtomogram analysis for tomography laboratories that do not maintain a large computer cluster, or that need additional computing resources to perform the calculations.
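A back-of-the-envelope version of this estimate, using the prices quoted above and the runtime reported in Section 3.4.2 (256 workers, roughly 207 minutes per iteration, 10 iterations). The instance count is an assumption derived from packing 256 single-core workers onto 36-core VMs; all figures are approximate.

import math

workers = 256
cores_per_vm = 36
worker_vm_price = 1.763    # c4.8xlarge, USD per hour
server_vm_price = 0.175    # r3.large, USD per hour
hours = 207.0 / 60.0 * 10  # ten iterations at ~207 minutes each (~34.5 h)

vms = math.ceil(workers / cores_per_vm)  # 8 worker VMs
total = hours * (vms * worker_vm_price + server_vm_price)
print("%d worker VMs for %.1f hours: ~$%.0f" % (vms, hours, total))  # ~$493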
3.5 Discussion

With current developments in cryo-electron tomography, it is possible to acquire cryoET 3D images of large numbers of particles. Processing large numbers of subtomograms has been a bottleneck in structural analysis, so high-performance subtomogram analysis software is an increasingly important part of the toolkit used for the structural analysis of macromolecular complexes.

TomoMiner is a software package for high-performance, parallelized cryoET structural analysis. It is able to handle very large numbers of subtomograms, which is necessary to increase the quality and resolution of macromolecular complex structures from cryoET applications. TomoMiner provides a scalable architecture with respect to computational resources and can handle huge numbers of subtomograms. The platform provides both reference-based and reference-free subtomogram classification methods, and performs averaging and template matching based on its subtomogram alignment methods.

We intend to turn TomoMiner into a community-centered, collaborative development project, with the publication of the initial source code and programs as the first step. Our framework will be available through a distributed source code repository, which makes it easy for developers to participate in the project, modify TomoMiner to suit their own needs, and build their own tools on the platform. Additionally, the TomoMiner core library can be easily integrated into other programs written in Python or C++.

In TomoMiner, various components such as the data storage interface have been abstracted, allowing for fast adaptation to novel computing environments. Furthermore, different implementations of these components can be used on different high-performance computing clusters.

TomoMinerCloud provides an efficient solution for users who do not have access to, or do not want to maintain, a high-performance computing cluster. Virtual Machines (VMs) running on cloud computing platforms are a useful alternative to local infrastructure, requiring minimal set-up and no up-front hardware costs. Renting VMs allows smaller research laboratories to avoid the costs of hardware and of maintaining a data center, while still benefiting from large-scale computational methods. Currently, the cloud computing solution only runs on Amazon AWS. We expect future releases to support other cloud computing providers, such as Google Cloud [Google, 2014] and Rackspace [Rackspace, 1998].

In summary, TomoMiner provides several high-performance, scalable solutions for large-scale subtomogram analysis. We believe that TomoMiner will be an important and efficient tool for the cryoET community, complementing the community's existing tools.

Chapter 4

Conclusion

To gain insight into biological processes requires detailed knowledge of cellular machinery and of the interactions and processes taking place at sub-cellular regimes. Interactions and reactions taking place in the uniquely structured and crowded environment require new technologies to capture the behavior and structure of the cellular system. Emerging technologies, such as cryo-electron tomography, allow researchers to capture snapshots of the cellular environment, and to study the abundance, orientation, and interactions of molecular complexes in cells. This exciting new field of visual proteomics is currently limited by the computational tools required for data processing. With collection systems that can produce tens of thousands to hundreds of thousands of subtomogram structures from a single experiment, faster algorithms and software are required to analyze this data. Here we have presented the TomoMiner software package, usable by researchers needing to analyze hundreds of thousands of particles. As technology improves, and larger eukaryotic cells are processed by these techniques, these methods will be a necessity.

After the localization of molecular complexes has been determined, a dynamic view of the cellular environment is still required. The cryoET data provides snapshots of the cellular system, but destroys the sample, removing any temporal dimension from the analysis of the molecular systems observed. Particle-based simulation of these systems is an important tool with many desirable features. It can account for the abundance of molecular species, and the effects of crowding on diffusion in the inhomogeneous cellular environment. With these features, the stochastic nature of cellular reactions can be accurately simulated. However, modeling a crowded reaction environment can be computationally challenging because of the temporal scales involved. We have presented a novel Brownian dynamics algorithm that extends the time step used in the simulation to the physical limit, removing any dependence on the reaction system. By pre-computing collisions and reactions, instead of resolving them after the step, small time steps are no longer required to resolve collisions.

TomoMiner and our React Before Move algorithms provide the research community with tools required for the future of visual proteomics, where data acquisition will necessitate algorithms and software capable of handling large amounts of data.

Bibliography

Noam Agmon and Attila Szabo. Theory of reversible diffusion-influenced reactions. The Journal of Chemical Physics, 92(9):5270, 1990. ISSN 0021-9606. doi: 10.1063/1.458533.

B. J. Alder and T. E. Wainwright. Phase transition for a hard sphere system. The Journal of Chemical Physics, 27(5):1208-1209, 1957. doi: 10.1063/1.1743957.

Amazon. Amazon Web Services, 2006. URL http://aws.amazon.com/.

Amazon. Amazon Web Services Pricing, 2015. URL http://aws.amazon.com/ec2/pricing.

Tadashi Ando and Jeffrey Skolnick. Crowding and hydrodynamic interactions likely dominate in vivo macromolecular motion. Proc. Natl. Acad. Sci. U. S. A., 2010. doi: 10.1073/pnas.1011354107.

Stephen Bailey. Principal Component Analysis with Noisy and/or Missing Data. Publ. Astron. Soc. Pacific, 124(919):1015-1023, September 2012. ISSN 0004-6280. doi: 10.1086/668105.

Theo M. A. O. M. Barenbrug, E. A. J. F. (Frank) Peters, and Jay D. Schieber. Accurate method for the Brownian dynamics simulation of spherical particles with hard-body interactions. The Journal of Chemical Physics, 117(20):9202-9214, 2002. doi: 10.1063/1.1515775.

Alberto Bartesaghi, P. Sprechmann, J. Liu, G. Randall, G. Sapiro, and S. Subramaniam. Classification and 3D averaging with missing wedge correction in biological electron tomography. J. Struct. Biol., 162(3):436-450, June 2008. ISSN 1047-8477. doi: 10.1016/j.jsb.2008.02.008.

Martin Beck, Vladan Lučić, Friedrich Förster, Wolfgang Baumeister, and Ohad Medalia. Snapshots of nuclear pore complexes in action captured by cryo-electron tomography. Nature, 449(7162):611-615, 2007. ISSN 0028-0836. doi: 10.1038/nature06170.

Martin Beck, Johan A. Malmström, Vinzenz Lange, Alexander Schmidt, Eric W. Deutsch, and Ruedi Aebersold. Visual proteomics of the human pathogen Leptospira interrogans. Nat. Methods, 6(11):817-823, November 2009. ISSN 1548-7091. doi: 10.1038/nmeth.1390.

Martin Beck, Maya Topf, Zachary Frazier, Harianto Tjong, Min Xu, Shihua Zhang, and Frank Alber. Exploring the spatial and temporal organization of a cell's proteome. Journal of Structural Biology, 173(3):483-496, 2011. ISSN 1047-8477. doi: 10.1016/j.jsb.2010.11.011.

Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Comput. Sci. Eng., 13(2):31-39, 2011. ISSN 1521-9615. doi: 10.1109/MCSE.2010.118.

Kfir Ben-Harush, Tal Maimon, Israel Patla, Elizabeth Villa, and Ohad Medalia. Visualizing cellular processes at the molecular level by cryo-electron tomography. J. Cell Sci., 123(Pt 1):7-12, 2010. ISSN 0021-9533. doi: 10.1242/jcs.060111.

Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Res., 28(1):235-242, 2000. ISSN 0305-1048. doi: 10.1093/nar/28.1.235.

Tanmay A. M. Bharat, Christopher J. Russo, Jan Löwe, Lori A. Passmore, and Sjors H. W. Scheres. Advances in single-particle electron cryomicroscopy structure determination applied to sub-tomogram averaging. Structure, 23(9):1743-1753, 2015.

Peer Bork and Luis Serrano. Towards cellular systems in 4D. Cell, 121(4):507-509, 2005. ISSN 0092-8674. doi: 10.1016/j.cell.2005.05.001.

Florian Brandt, Lars-Anders Carlson, F. Ulrich Hartl, Wolfgang Baumeister, and Kay Grünewald. The Three-Dimensional Organization of Polyribosomes in Intact Human Cells. Mol. Cell, 39(4):560-569, 2010. ISSN 1097-2765. doi: 10.1016/j.molcel.2010.08.003.

J. A. G. Briggs. Structural biology in situ - the potential of subtomogram averaging. Curr. Opin. Struct. Biol., 23(2):261-267, April 2013. ISSN 1879-033X. doi: 10.1016/j.sbi.2013.02.003.

J. A. G. Briggs, J. D. Riches, B. Glass, V. Bartonova, G. Zanetti, and H.-G. Kräusslich. Structure and assembly of immature HIV. Proc. Natl. Acad. Sci. U. S. A., 106(27):11090-11095, 2009. ISSN 0027-8424. doi: 10.1073/pnas.0903535106.

Long Cai, Nir Friedman, and X. Sunney Xie. Stochastic protein expression in individual cells at the single molecule level. Nature, 440(7082):358-362, 2006. doi: 10.1038/nature04599.

Daniel Castaño-Díez, Mikhail Kudryashev, Marcel Arheit, and Henning Stahlberg. Dynamo: A flexible, user-friendly development tool for subtomogram averaging of cryo-EM data in high-performance computing environments. J. Struct. Biol., 178(2):139-151, May 2012. ISSN 1047-8477. doi: 10.1016/j.jsb.2011.12.017.

Yuxiang Chen, Stefan Pfeffer, Thomas Hrabe, Jan Michael Schuller, and Friedrich Förster. Fast and accurate reference-free alignment of subtomograms. J. Struct. Biol., 182(3):235-245, June 2013. ISSN 1047-8477. doi: 10.1016/j.jsb.2013.03.002.

Michael A. Cianfrocco and Andres E. Leschziner. Low cost, high performance processing of single particle cryo-electron microscopy data in the cloud. eLife, 4:e06664, 2015. doi: 10.7554/eLife.06664.

Maciej Dobrzynski, Jordi Vidal Rodríguez, Jaap A. Kaandorp, and Joke G. Blom. Computational methods for diffusion-influenced biochemical reactions. Bioinformatics, 23(15):1969-1977, August 2007. ISSN 1367-4811. doi: 10.1093/bioinformatics/btm278.

Maciej Długosz and Joanna Trylska. Diffusion in crowded biological environments: applications of Brownian dynamics. BMC Biophysics, 4(1):3, January 2011. ISSN 2046-1682. doi: 10.1186/2046-1682-4-3.

A. Einstein. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der Physik, 322(8):549-560, 1905. ISSN 1521-3889. doi: 10.1002/andp.19053220806.

Donald L. Ermak and J. A. McCammon. Brownian dynamics with hydrodynamic interactions. The Journal of Chemical Physics, 69(4):1352-1360, 1978. doi: 10.1063/1.436761.

Friedrich Förster and Reiner Hegerl. Structure Determination In Situ by Averaging of Tomograms. In J. Richard McIntosh, editor, Methods in Cell Biology, pages 741-767. Academic Press, 2007. ISBN 0123706475.

Friedrich Förster, Ohad Medalia, Nathan Zauberman, Wolfgang Baumeister, and Deborah Fass. Retrovirus envelope protein complex structure in situ studied by cryo-electron tomography. Proc. Natl. Acad. Sci. U. S. A., 102(13):4729-4734, 2005. ISSN 0027-8424. doi: 10.1073/pnas.0409178102.

Friedrich Förster, Sabine Pruggnaller, Anja Seybert, and Achilleas S. Frangakis. Classification of cryo-electron sub-tomograms using constrained correlation. J. Struct. Biol., 161(3):276-286, March 2008. ISSN 1095-8657. doi: 10.1016/j.jsb.2007.07.006.

Achilleas S. Frangakis and Friedrich Förster. Computational exploration of structural information from cryo-electron tomograms. Curr. Opin. Struct. Biol., 14(3):325-331, 2004. ISSN 0959-440X. doi: 10.1016/j.sbi.2004.04.003.

Joachim Frank. Electron Microscopy of Macromolecular Assemblies. Oxford University Press, New York, 1996. ISBN 9780122650406. doi: 10.1016/B978-012265040-6/50003-5.

Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW3. Proc. IEEE, 93(2):216-231, 2005.

R. R. Gabdoulline and R. C. Wade. Simulation of the diffusional association of barnase and barstar. Biophys. J., 72(5):1917-1929, 1997.

Razif R. Gabdoulline and Rebecca C. Wade. Biomolecular diffusional association. Current Opinion in Structural Biology, 12(2):204-213, April 2002.

Jesús G. Galaz-Montoya, John Flanagan, Michael F. Schmid, and Steven J. Ludtke. Single particle tomography in EMAN2. J. Struct. Biol., 190(3):279-290, 2015. doi: 10.1016/j.jsb.2015.04.016.

Daniel T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340-2361, 1977. doi: 10.1021/j100540a008.

Google. Google Cloud, 2014. URL https://cloud.google.com/.

John M. Heumann, Andreas Hoenger, and David N. Mastronarde. Clustering and variance maps for cryo-electron tomography using wedge-masked differences. J. Struct. Biol., 175(3):288-299, 2011. ISSN 1047-8477. doi: 10.1016/j.jsb.2011.05.011.

J. B. Heymann, Giovanni Cardone, Dennis C. Winkler, and Alasdair C. Steven. Computational resources for cryo-electron tomography in Bsoft. J. Struct. Biol., 161(3):232-242, 2008. ISSN 1047-8477. doi: 10.1016/j.jsb.2007.08.002.

Thomas Hrabe. Localize.pytom: a modern webserver for cryo-electron tomography. Nucleic Acids Res., 43:W231-W236, 2015.

Thomas Hrabe and Friedrich Förster. Structure Determination by Single Particle Cryo-electron Tomography. John Wiley & Sons, Ltd, 2011. ISBN 9780470015902. doi: 10.1002/9780470015902.a0023175.

Thomas Hrabe, Yuxiang Chen, Stefan Pfeffer, Luis Kuhn Cuellar, Ann-Victoria Mangold, and Friedrich Förster. PyTom: a python-based toolbox for localization of macromolecules in cryo-electron tomograms and subtomogram analysis. J. Struct. Biol., 178(2):177-188, May 2012. ISSN 1047-8477. doi: 10.1016/j.jsb.2011.12.003.

Hyojoon Kim and Kook Joe Shin. Exact solution of the reversible diffusion-influenced reaction for an isolated pair in three dimensions. Phys. Rev. Lett., 82(7):1578-1581, February 1999. doi: 10.1103/PhysRevLett.82.1578.

Jun Soo Kim and Arun Yethiraj. Crowding effects on association reactions at membranes. Biophysical Journal, 98(6):951-958, March 2010. ISSN 1542-0086. doi: 10.1016/j.bpj.2009.11.022.

Jun Soo Kim and Arun Yethiraj. Crowding effects on protein association: effect of interactions between crowding agents. The Journal of Physical Chemistry B, 115(2):347-353, January 2011. ISSN 1520-5207. doi: 10.1021/jp107123y.

Markus Kollmann, Linda Løvdok, Kilian Bartholomé, Jens Timmer, and Victor Sourjik. Design principles of a bacterial signalling network. Nature, 438(7067):504-507, 2005. doi: 10.1038/nature04228.

Arash Komeili, Zhuo Li, Dianne K. Newman, and Grant J. Jensen. Magnetosomes are cell membrane invaginations organized by the actin-like protein MamK. Science, 311(5758):242-245, 2006. ISSN 0036-8075. doi: 10.1126/science.1123231.

Julio A. Kovacs and Willy Wriggers. Fast rotational matching. Acta Crystallogr. D Biol. Crystallogr., 58(Pt 8):1282-1286, August 2002. ISSN 0907-4449.

Sebastian Kühner, Vera van Noort, Matthew J. Betts, Alejandra Leo-Macias, Claire Batisse, Michaela Rode, Takuji Yamada, Tobias Maier, Samuel Bader, Pedro Beltran-Alvarez, Daniel Castaño-Díez, Wei-Hua Chen, Damien Devos, Marc Güell, Tomas Norambuena, Ines Racke, Vladimir Rybin, Alexander Schmidt, Eva Yus, Ruedi Aebersold, Richard Herrmann, Bettina Böttcher, Achilleas S. Frangakis, Robert B. Russell, Luis Serrano, Peer Bork, and Anne-Claude Gavin. Proteome organization in a genome-reduced bacterium. Science, 326(5957):1235-1240, 2009. ISSN 0036-8075. doi: 10.1126/science.1176343.

Oleg Kuybeda, Gabriel A. Frank, Alberto Bartesaghi, Mario Borgnia, Sriram Subramaniam, and Guillermo Sapiro. A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryo-electron tomography. J. Struct. Biol., 181(2):116-127, 2013.

Robert Langlois, Jesper Pallesen, and Joachim Frank. Reference-free particle selection enhanced with semi-supervised machine learning for cryo-electron microscopy. J. Struct. Biol., 175(3):353-361, 2011. ISSN 1047-8477. doi: 10.1016/j.jsb.2011.06.004.

Vladan Lučić, Friedrich Förster, and Wolfgang Baumeister. Structural studies by electron tomography: From Cells to Molecules. Annu. Rev. Biochem., 74(1):833-865, 2005. ISSN 0066-4154. doi: 10.1146/annurev.biochem.73.011303.074112.

MATLAB. version 8.1 (R2013a). The MathWorks Inc., Natick, Massachusetts, 2013.

G. McMullan, S. Chen, R. Henderson, and A. R. Faruqi. Detective quantum efficiency of electron area detectors in electron microscopy. Ultramicroscopy, 109(9):1126-1143, 2009. ISSN 0304-3991. doi: 10.1016/j.ultramic.2009.04.002.

Ohad Medalia, Igor Weber, Achilleas S. Frangakis, Daniela Nicastro, Gunther Gerisch, and Wolfgang Baumeister. Macromolecular architecture in eukaryotic cells visualized by cryoelectron tomography. Science, 298(5596):1209-1213, 2002. ISSN 0036-8075. doi: 10.1126/science.1076184.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953. doi: 10.1063/1.1699114.

Marco J. Morelli and Pieter Rein ten Wolde. Reaction Brownian dynamics and the effect of spatial fluctuations on the gain of a push-pull network. The Journal of Chemical Physics, 129(5):054112, 2008. doi: 10.1063/1.2958287.

James Munkres. Algorithms for the Assignment and Transportation Problems. J. Soc. Ind. Appl. Math., 5(1):32-38, 1957. ISSN 0368-4245. doi: 10.1137/0105003.

Gavin E. Murphy and Grant J. Jensen. Electron cryotomography. Biotechniques, 43(4):413-423, 2007. ISSN 0736-6205. doi: 10.2144/000112568.

Daniela Nicastro, Cindi Schwartz, Jason Pierson, Richard Gaudette, Mary E. Porter, and J. Richard McIntosh. The molecular architecture of axonemes revealed by cryoelectron tomography. Science, 313(5789):944-948, 2006. ISSN 0036-8075. doi: 10.1126/science.1128618.

Stephan Nickell, Friedrich Förster, Alexandros Linaroudis, William Del Net, Florian Beck, Reiner Hegerl, Wolfgang Baumeister, and Jürgen M. Plitzko. TOM software toolbox: Acquisition and analysis for electron tomography. J. Struct. Biol., 149(3):227-234, 2005. ISSN 1047-8477. doi: 10.1016/j.jsb.2004.10.006.

S. H. Northrup and H. P. Erickson. Kinetics of protein-protein association explained by Brownian dynamics computer simulation. Proceedings of the National Academy of Sciences, 89(8):3338-3342, 1992.

Travis E. Oliphant. SciPy: Open source scientific tools for Python, 2007. ISSN 1521-9615.

Julio O. Ortiz, Friedrich Förster, Julia Kürner, Alexandros Linaroudis, and Wolfgang Baumeister. Mapping 70S ribosomes in intact cells by cryoelectron tomography and pattern recognition. J. Struct. Biol., 156(2):334-341, 2006. ISSN 1047-8477. doi: 10.1016/j.jsb.2006.04.014.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res., 12:2825-2830, 2011.

Rackspace. Rackspace, 1998. URL http://www.rackspace.com/.

Jonathan M. Raser and Erin K. O'Shea. Noise in gene expression: Origins, consequences, and control. Science, 309(5743):2010-2013, 2005. doi: 10.1126/science.1105891.

Stephen A. Rice. Diffusion-limited reactions, volume 25. Elsevier, 1985.

Marshall N. Rosenbluth and Arianna W. Rosenbluth. Further results on Monte Carlo equations of state. The Journal of Chemical Physics, 22(5):881-884, 1954. doi: 10.1063/1.1740207.

Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20:53-65, 1987. ISSN 0377-0427. doi: 10.1016/0377-0427(87)90125-7.

Conrad Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, Australia, October 2010.

Sjors H. W. Scheres. RELION: Implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology, 180(3):519-530, 2012.

Tamar Schlick. Molecular Modeling and Simulation. Springer, 1st edition, August 2002. ISBN 038795404X.

Michael F. Schmid. Chapter 2 - Single-particle electron cryotomography (cryoET). In Steven J. Ludtke and B. V. Venkataram Prasad, editors, Recent Advances in Electron Cryomicroscopy, Part B, volume 82 of Advances in Protein Chemistry and Structural Biology, pages 37-65. Academic Press, 2011. doi: 10.1016/B978-0-12-386507-6.00002-6.

John D. Scott and Tony Pawson. Cell signaling in space and time: Where proteins come together and when they're apart. Science, 326(5957):1220-1224, 2009. doi: 10.1126/science.1175668.

Kouichi Takahashi, Satya Nanda Vel Arjunan, and Masaru Tomita. Space in systems biology of signaling pathways - towards intracellular molecular crowding in silico. FEBS Letters, 579(8):1783-1788, March 2005.

Guang Tang, Liwei Peng, Philip R. Baldwin, Deepinder S. Mann, Wen Jiang, Ian Rees, and Steven J. Ludtke. EMAN2: An extensible image processing suite for electron microscopy. J. Struct. Biol., 157(1):38-46, January 2007. ISSN 1047-8477. doi: 10.1016/j.jsb.2006.05.009.

Laura J. Terry, Eric B. Shows, and Susan R. Wente. Crossing the nuclear envelope: Hierarchical regulation of nucleocytoplasmic transport. Science, 318(5855):1412-1416, 2007. doi: 10.1126/science.1142204.

Jeroen S. van Zon and Pieter Rein ten Wolde. Green's-function reaction dynamics: A particle-based approach for simulating biochemical networks in time and space. The Journal of Chemical Physics, 123(23):234910, 2005. doi: 10.1063/1.2137716.

M. von Smoluchowski. Zur kinetischen Theorie der Brownschen Molekularbewegung und der Suspensionen. Ann. Phys., 326(14):756-780, 1906. doi: 10.1002/andp.19063261405.

N. R. Voss, C. K. Yoshioka, M. Radermacher, C. S. Potter, and B. Carragher. DoG Picker and TiltPicker: Software tools to facilitate particle selection in single particle electron microscopy. J. Struct. Biol., 166(2):205-213, 2009. ISSN 1047-8477. doi: 10.1016/j.jsb.2009.01.004.

W. W. Wood and J. D. Jacobson. Preliminary results from a recalculation of the Monte Carlo equation of state of hard spheres. The Journal of Chemical Physics, 27(5):1207-1208, 1957. doi: 10.1063/1.1743956.

W. Wriggers, R. A. Milligan, and J. A. McCammon. Situs: A package for docking crystal structures into low-resolution maps from electron microscopy. J. Struct. Biol., 125(2-3):185-195, 1999. ISSN 1047-8477. doi: 10.1006/jsbi.1998.4080.

Min Xu and Frank Alber. High precision alignment of cryo-electron subtomograms through gradient-based parallel optimization. BMC Syst. Biol., 6(Suppl 1):S18, July 2012. ISSN 1752-0509. doi: 10.1186/1752-0509-6-S1-S18.

Min Xu and Frank Alber. Automated target segmentation and real space fast alignment methods for high-throughput classification and averaging of crowded cryo-electron subtomograms. Bioinformatics, 29(13):i274-i282, July 2013. ISSN 1367-4811. doi: 10.1093/bioinformatics/btt225.

Min Xu, Martin Beck, and Frank Alber. Template-free detection of macromolecular complexes in cryo electron tomograms. Bioinformatics, 27(13):i69-i76, 2011. ISSN 1367-4803. doi: 10.1093/bioinformatics/btr207.

Min Xu, Martin Beck, and Frank Alber. High-throughput subtomogram alignment and classification by Fourier space constrained fast volumetric matching. J. Struct. Biol., 178(2):152-164, May 2012. ISSN 1095-8657. doi: 10.1016/j.jsb.2012.02.014.
Abstract
Cellular processes take place over large temporal and spatial scales. Molecular interactions mediate structural changes in proteins and their complexes, which modulate their activities and regulate complex processes, including the transport of components across membranes, gene expression, and genome replication. Understanding these systems requires experimental techniques that span the temporal and spatial scales of the environment, from atomic descriptions of individual components to detailed knowledge of the locations of all components at the level of the whole cell. The cellular environment is complex, teeming with thousands of molecular structures in dynamic equilibrium.
Cryo-electron tomography (cryoET) is an emerging technology that offers unique advantages for studying the structural organization of entire cells. Samples are only lightly processed and imaged in a close-to-native state, essentially providing a snapshot of the interior composition of the cell. In principle, cryoET samples capture the spatial organization of the cell, allowing researchers to detect structures the size of protein complexes, along with their relative orientations, abundances, and interactions. Because the cellular environment is dynamic, these measurements provide insight into the function of a cell's components. Extracting this information is challenging, however, owing to the relatively low resolution, low signal-to-noise ratio, and missing-data distortions of the tomograms.
CryoET produces a three-dimensional density map of the volume under study. Segmenting the volume by regions of high density identifies individual protein complexes as well as larger structures. These can be compared against known structures, or against each other to classify them into groups whose density maps are then averaged to increase resolution. This large-scale analysis is computationally intensive and demands extensive resources. Here we describe the TomoMiner package, a parallel computing framework for these calculations. The software runs on computer clusters with hundreds of processors and can handle the alignment, clustering, and averaging of hundreds of thousands of subtomograms.
After recovering detailed data about the interior of cells, one approach to understanding the system's dynamics is to model the reaction-diffusion behavior of all its components. At the scale of entire cells containing an abundance of protein complexes, atomic modeling is too detailed and would require too many steps to capture large-scale processes. Brownian dynamics (BD) abstracts away the smallest molecules in the simulation and permits larger time steps than molecular dynamics. BD simulations are nevertheless still restricted by reaction and diffusion rates, which limit the time steps available. We have developed an approach to accelerate BD simulations specifically tailored to the reaction-diffusion behavior of crowded protein mixtures. The approach removes the constraint that fast, diffusion-limited reactions impose on the time step; the only remaining constraints are those required by the underlying physics.
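To make the time-step constraint concrete, the following is a sketch in conventional BD notation; the symbols below are standard assumptions for this illustration, not equations quoted from the thesis. The first equation is the widely used Ermak-McCammon position update for a particle i with diffusion coefficient D_i; the second is the restriction that a diffusion-limited reacting pair A, B with reaction radius sigma places on the step size in conventional schemes.
% Conventional BD update; \mathbf{F}_i is the systematic force and
% \boldsymbol{\xi}_i a vector of independent standard normal variates.
\[
  \mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t)
    + \frac{D_i\,\Delta t}{k_B T}\,\mathbf{F}_i(t)
    + \sqrt{2 D_i\,\Delta t}\;\boldsymbol{\xi}_i,
  \qquad \boldsymbol{\xi}_i \sim \mathcal{N}(\mathbf{0},\mathbf{I}).
\]
% Conventional schemes must keep the RMS relative displacement per step
% well below the reaction radius, which forces a small \Delta t:
\[
  \sqrt{2\,(D_A + D_B)\,\Delta t} \ll \sigma.
\]
It is the second condition that the approach summarized here removes, leaving only the physical limits on the step size.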
The new BD algorithm accurately reproduces the reactions and dynamics of cellular-scale systems while taking substantially larger time steps than existing approaches, enabling simulations at cellular time scales. Together, these two contributions, a package for discovering structural features in real cellular environments and the ability to simulate those environments accurately and more efficiently, give researchers techniques for accelerated discovery at the cellular scale.
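For illustration only, here is a minimal, self-contained sketch of the kind of iterative subtomogram alignment-and-averaging loop described above. It handles translational alignment only, via FFT cross-correlation; the function names and simplifications are assumptions made for this sketch, not TomoMiner's actual API, which also performs rotational alignment, classification, and cluster-scale parallelization.
# Illustrative sketch: translational subtomogram averaging via FFT
# cross-correlation. Names and the restriction to translations are
# assumptions for this example, not the dissertation's implementation.
import numpy as np

def translational_shift(volume, reference):
    """Return the integer shift that best aligns `volume` to `reference`,
    found as the peak of their FFT-based cross-correlation."""
    cc = np.fft.ifftn(np.fft.fftn(volume) * np.conj(np.fft.fftn(reference))).real
    peak = np.unravel_index(np.argmax(cc), cc.shape)
    # Map peaks past the halfway point to negative (wrap-around) shifts.
    return tuple(p - n if p > n // 2 else p for p, n in zip(peak, cc.shape))

def iterative_average(subtomograms, n_iterations=5):
    """Alternate between aligning every subtomogram to the current
    average and recomputing the average from the aligned copies."""
    reference = np.mean(subtomograms, axis=0)
    for _ in range(n_iterations):
        aligned = [np.roll(v, [-s for s in translational_shift(v, reference)],
                           axis=(0, 1, 2))
                   for v in subtomograms]
        reference = np.mean(aligned, axis=0)
    return reference

# Example: average 100 noisy, randomly shifted copies of a 32^3 volume.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = np.zeros((32, 32, 32))
    template[12:20, 12:20, 12:20] = 1.0
    noisy = [np.roll(template, rng.integers(-4, 5, 3), axis=(0, 1, 2))
             + rng.normal(0, 0.5, template.shape) for _ in range(100)]
    average = iterative_average(np.array(noisy))
In a production setting the per-subtomogram alignment step is embarrassingly parallel, which is what makes a cluster-scale framework like the one described above effective for hundreds of thousands of subtomograms.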
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Understanding the 3D genome organization in topological domain level
Genome-wide studies reveal the function and evolution of DNA shape
Molecular dynamics studies of protein aggregation in unbounded and confined media
Structure – dynamics – function analysis of class A GPCRs and exploration of chemical space using integrative computational approaches
3D modeling of eukaryotic genomes
Exploring the application and usage of whole genome chromosome conformation capture
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
Investigation of mechanisms of complex catalytic reactions from obtaining and analyzing experimental data to mechanistic modeling
A million-plus neuron model of the hippocampal dentate gyrus: role of topography, inhibitory interneurons, and excitatory associational circuitry in determining spatio-temporal dynamics of granul...
Theoretical studies of lipid bilayer electroporation using molecular dynamics simulations
Disentangling the network: understanding the interplay of topology and dynamics in network analysis
Controlling electronic properties of two-dimensional quantum materials: simulation at the nexus of the classical and quantum computing eras
Exploring the molecular and cellular underpinnings of organ polarization using feather as the model system
Numerical and experimental investigations of ionic electrospray thruster plume
Physical principles of membrane mechanics, membrane domain formation, and cellular signal transduction
Novel computational methods of disease gene and variant discovery, parallelization and applications
Automatic tracking of flies and the analysis of fly behavior
Molecular-scale studies of mechanical phenomena at the interface between two solid surfaces: from high performance friction to superlubricity and flash heating
From single molecules to bacterial nanowires: functional and dynamic imaging of the extracellular electron transfer network in Shewanella oneidensis MR-1
Tracking and evolution of compressible turbulent flow structures
Asset Metadata
Creator
Frazier, Zachary C. (author)
Core Title
Computational analysis of the spatial and temporal organization of the cellular environment
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/12/2015
Defense Date
08/03/2015
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Brownian dynamics, chemical kinetics, Computational Biology, cryo-electron tomography, OAI-PMH Harvest, reaction-diffusion, tomography
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Alber, Frank (committee chair), Nakano, Aiichiro (committee member), Zhou, Xianghong Jasmine (committee member)
Creator Email
zach.frazier@gmail.com, zfrazier@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-198976
Unique identifier
UC11279168
Identifier
etd-FrazierZac-4034.pdf (filename), usctheses-c40-198976 (legacy record id)
Legacy Identifier
etd-FrazierZac-4034.pdf
Dmrecord
198976
Document Type
Dissertation
Rights
Frazier, Zachary C.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA