Representation Problems in Brain Imaging

by Daniel Moyer

Submitted to the Faculty of the USC Graduate School at the University of Southern California, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Computer Science)

Committee Members: Prof. Greg Ver Steeg (Chairperson), Prof. Aram Galstyan, Prof. Paul Thompson, Prof. Aiichiro Nakano

December, 2019
Copyright 2019 Daniel Moyer

Dedication

To God, who, at sundry times and in divers manners, had spoken, and who has, in these last days, spoken to us through His Son, for his grace, mercy, and revelation.

To my mother and father.

and

To my wife and light Sophie.

This work is also dedicated to the National Science Foundation of the United States of America, and the Rose-Hills Fellowship at the University of Southern California, who provided funding for five years of this work. Any person funding either organization through taxes, donations, or service has in part funded this work, and may contact me for a short summary, to the best of my ability.

Author's Note

This manuscript is divided into two distinct parts. The first concerns the intersection of graph theory/network science and a sub-field of imaging neuroscience called connectomics. The second concerns the problem of purposefully ignoring specific information in representation learning, a problem which has immediate application in imaging sciences, both in clinical research as well as more fundamental sciences. These parts are presented independently, and each can be read without the other. They are unified by their focus on neuroimaging, but reflect the rapidly changing culture of methodological research in quantitative imaging, and my own trajectory of moving from neuroscience and analysis towards medical image engineering and lower level processing.

I have provided topical reviews of diffusion weighted MRI and brain mapping in Appendix A and Appendix B. These are short and meant for the technical but non-neuroimaging reader.

Table of Contents

Dedication
Author's Note
Abstract

Part I: Continuous Models of Connectivity

Chapter 1: Overview
  1.1 Review of Connectomics
  1.2 Doubt Concerning Network Representations
  1.3 Motivation and Summary

Chapter 2: A Continuous Connectivity Model
  2.1 Model Definition
  2.2 Reproducibility and Parcellation Selection Results

Chapter 3: Searching for Optimal Parcellations from Continuous Connectivity
  3.1 Model
  3.2 Procedure and Results

Chapter 4: Tractography Distributions
  4.1 Introduction
  4.2 Framework and Methods
  4.3 Empirical Assessment
  4.4 Discussion

Part II: Invariance Representations, with Applications

Chapter 5: Invariance Methods Overview
  5.1 Review of Invariance Methods
Chapter 6: Invariant Representations without Adversarial Training
  6.1 Constructing Codes with Limited Mutual Information
  6.2 Computation and Empirical Evaluation
  6.3 Results
  6.4 Discussion

Chapter 7: Scanner Invariant Representations
  7.1 Relevant Prior Work on Diffusion MRI Harmonization
  7.2 Theory
  7.3 Methods
  7.4 Results
  7.5 Discussion

Appendix A: A neuro-anatomy primer/review for computer scientists
  A.1 Brain Mapping

Appendix B: A short MRI primer for computer scientists

References

Abstract

Diffusion Magnetic Resonance Imaging (dMRI) is an imaging modality that, conditional on direction, queries the propensity for water diffusion in living tissue. These images are desirable for studying tissues with anisotropic diffusion profiles, e.g. the white matter of the human brain. In this manuscript I describe two problems at the interface between diffusion imaging and computational science and our proposed solutions for those problems, reconciling differences between current methodology, existing theory, and practical constraints.

The first part of this manuscript describes methodology from the intersection of network science and neuroscience. Previous literature has worked to estimate white matter axon bundles (fasciculi) from dMRI. Combined with segmentations of the grey matter, this leads intuitively to the construction of macro-scale brain networks, where nodes are cortical regions and edges are estimates of axonal connections. The work I present here deconstructs this progression, reconciling the inherently spatially continuous cortex with discrete network theory. The resulting model is a generalization of discrete graphs, using point process theory to create continuous interaction densities (continuum adjacency functions).

The second part of this manuscript describes a method for learning representations of data that are invariant under changes in specified outside factors. These can be applied to diffusion imaging data, which in particular suffers from instrument/observer biases ("site/scanner" biases). Due to the relative complexity of the domain, modelling the particular effects of each instrument may be difficult; moreover, such an approach does not generalize to unseen instruments. Instead, the described method can learn representations that are minimally informed of the imaging site. Subsequent derived data will then be at least as uninformed of the site variable.

Part I: Continuous Models of Connectivity

Chapter 1: Overview

The representation chosen for data often strongly affects our understanding of those data.
While many representations may be equivalent from an information standpoint, and while the data processing inequality may tell us that the maximal amount of information is contained in the "raw" representation without processing [34], we often use compressive representations (w.r.t. information) to provide context, interpretation, and to enforce desirable properties.

In brain connectivity we have chosen to use networks and graph theory to represent our data; networks fit our consensus intuitions, as evidenced by decades of brain connectivity literature (predating e.g. the brain network field's name of connectomics). In the network construction phase, however, we find an irreversible choice of representation: node selection (parcellation selection). This choice has lasting consequences for many graph measures of interest [158, 155, 165, 129, 93, 24]. In the first part of this work we establish a more general model of connectivity that does not require a choice of nodes. From this more general spatial model we can then establish node choice criteria or attempt to optimize these choices.

This more general model carries with it assumptions (constraints), one of which is particularly strong: edge evidence independence. This is the assumption that our observations of the evidence of edges between two regions are independent; it is usually unverifiable, but also assumed not to be true. In this case however, by deconstruction of our data generating process, we find that it is true for a large family of tractography algorithms. Moreover, we find that the edge generating process is a simulation; this may be obvious in hindsight, but should change our treatment of tractography based graphs and their statistics.

1.1 Review of Connectomics

Connections in the nervous system and brain have existed as intuition since antiquity [26]. With modern anatomic knowledge and the machinery of graph theory, the quantification of connectivity has become possible. Modern imaging methods have provided the ability to observe these systems in vivo, and it is thus desirable to quantify connectivity in images. Connectomics, the study of brain connectivity, has attempted to provide such quantification and analysis in functional and diffusion imaging, particularly through methods transplanted from network science.

The study of macro-scale brain connectivity is built upon three components, each of which we will describe individually before discussing their intersection in the context of the field:

1. The spatial segmentation (parcellation) of cognitive computational resources, either in subdivisions of the cerebral or cerebellar cortex, identification and further subdivision of sub-cortical structures, or the abstraction of non-CNS structures to discrete sources/sinks.

2. The observation of either physical or functional (correlational) connection between pairs of such subdivisions (e.g. streamlines from diffusion simulations, correlations between BOLD signal, tracer observations).

3. Graph theory and network science, which are respectively the mathematical description of pairwise relations between discrete sets of objects and the resulting field of science studying networks in the natural world.

Parcellation in this context is the division of the cerebral or cerebellar cortex or any of the sub-cortical structures into (usually) spatially contiguous subsets, referred to as parcels [7].
This division may be based on atlas priors from anatomy (usually from post-mortem dissection), from outside imaging atlases or clusters, or in some cases at random. No matter the method or derivation, parcels serve to provide coarse-grained correspondence to the structure in question. In the case of atlases this is usually accompanied by some added notion of interpretability and/or provenance of results; data aggregated by specific atlases can have results listed by brain region, and brain regions named in atlases can share in the qualitative results ascribed to those regions [8]. Regions sometimes are further assumed to be relatively homogeneous with respect to function at the scale of the atlas [36]. Computationally, regions are label sets on either the image domain or on recovered surfaces (vertices or faces of the triangle meshes).

Physical connections are often inferred from evidence of axonal connections. Axons themselves are actual physical connections between neurons, transporting electrical signals as established by cable theory, yet due to the composition and fragility of the white matter fibers, post-mortem examinations often are unable to accurately trace their trajectories [26]. In non-humans, chemical or viral tracers are used to observe a subset of axonal trajectories [26, 146]; unfortunately the transport phenomena being leveraged with tracers are reliant on living tissue, and thus cannot be used for human subjects.

Instead, diffusion MRI is used to estimate a local axonal orientation, from which whole trajectories can be simulated. Sets of these simulated trajectories are called tractographies or tractograms, in reference to the term "tracts" from anatomy, which refers to physical bundles of axons (some of which are named). Beginning with Basser et al. 2000 [13], the estimation of these tractograms has been based on streamline propagation similar to that used for differential equation simulation, augmented by anatomically based heuristics and priors. While the accuracy of these methods is a contentious subject [99, 146], their use is standard practice in structural network construction [138]. Tracts have usually been represented computationally as finite sequences of points in physical space, e.g. a sequence $t = \{t_0, \dots, t_L\}$ of points $t_\ell = (t^\ell_x, t^\ell_y, t^\ell_z)$ in the image's spatial domain.

Other forms of connective evidence are often taken from functional (timeseries) data. Networks based on correlation matrices and variations thereof have become quite popular within the functional imaging community [39]. In the present work we will only consider structural connectivity; however, the interest and influence of functional connectivity in connectomics should not be understated, and many methods and concepts have been transported between the two sub-fields.

Central to connectomic analysis are graph theory and network science. Graphs (or networks) are mathematical objects describing discrete sets (nodes) and their pairwise interactions (edges). While many variants and generalizations of graphs exist, most connectome analyses take place on undirected weighted graphs, where the pairs of interactions are not ordered (so $(a, b)$ has the same interaction as $(b, a)$), and where each interaction has some scalar "strength" value. The set of interactions can be written as a square matrix.
Graphs naturally describe the structure of discrete topologies, and the matrix of interactions is thus often referred to as an adjacency matrix, where topological structure (adjacency) is described by the matrix entries. Network science is the application and extension of graph theory in pursuit of solutions to other problems [12] (in contrast to the study of network properties for their own sake). Expanding out of both social networks and complex systems research (which, in turn, originated in sub-fields of physics), network science has become a catch-all for graph theory-based analyses in the sciences. Scale-free structures and parameter estimation [40] (i.e. power law scaling degree distributions), centrality and path distance-styled measures [130], and ball-and-stick visualizations with superimposed cortical surfaces [161] are all hallmarks of the intersection of network science and neuroimaging.

The clear intuition from connectomics is to use regions as nodes and tractography as edges, constructing graphs upon which a network science-styled analysis may take place. Formalizing this, let $\Omega$ be the computational substrate (cortex, sub-cortical structures, etc.) with partitions $\{E_i\}_{i=1}^K$, $E_i \subseteq \Omega$. Consider a set of tracts $\{t\}$, with $t$ said to be "in" a region $E_i$ if at least one endpoint of the tract is incident on $E_i$. We can define a weighted adjacency matrix $A$ (and thus a graph) by counting the intersections of tracts with each pair of regions:

$$A_{ij} = \mathrm{Count}(E_i, E_j) = \sum_t \mathbb{1}(t \in E_i)\, \mathbb{1}(t \in E_j) \tag{1.1}$$

where $\mathbb{1}$ is the indicator function. Usually $\bigcup E_i = \Omega$, and $\mathrm{Vol}(E_i \cap E_j) = 0$ for all $(i, j)$ pairs, i.e. $\{E_i\}_{i=1}^K$ is a minimal cover of $\Omega$. While often described less formally, we will use this notation in a more general frame in Chapter 2.

Largely taking inspiration from Sporns, Tononi, and Kötter in 2005 [140], which coined the word "connectome", and the subsequent tutorial and review in 2010 [130], the archetypal study in connectomics has been to recover parcels, recover edge measures (structural or functional), construct networks, and then to transplant network science methods onto brain networks. Particularly popular are the menagerie of network measures [130], as well as core-periphery analyses providing "network-hub" style results [22, 151], eventually coined as the "rich-club" regions of the cortex. Work echoing the design of "Parcels, Edges, Network Measures" has been successful across neuroscience [139, 68, 22, 151, 14, 103], particularly in clinical research [35, 152, 70, 102, 92, 23, 100, 141, 48]. The largest human imaging projects today are descendants of the "Human Connectome Project" [153], which, as its name suggests, was largely oriented toward connectomics.
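To make Eq. 1.1 concrete, the following is a minimal sketch of the counting construction (illustrative only, not code from this thesis); it assumes each tract has already been reduced to a pair of region labels $(i, j)$ for its two endpoints, a preprocessing step not shown here.

```python
import numpy as np

def count_adjacency(endpoint_labels, n_regions):
    """Weighted adjacency matrix of Eq. 1.1: A[i, j] counts the tracts
    with one endpoint in region E_i and the other in region E_j."""
    A = np.zeros((n_regions, n_regions))
    for i, j in endpoint_labels:
        if i == j:
            A[i, i] += 1
        else:
            A[i, j] += 1
            A[j, i] += 1  # undirected: (a, b) is the same edge as (b, a)
    return A

# Three tracts among regions 0, 1, 2:
A = count_adjacency([(0, 1), (0, 1), (1, 2)], n_regions=3)
```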
1.2 Doubt Concerning Network Representations

Connectomics is often criticized for problems in parcellation (node definitions), in tractography (edge definitions), and the intersection of those problems with network science. While this is by no means an exhaustive list of open issues in connectomics, the direct transport of network science concepts to brain networks has often had to contend with problems at these particular hurdles. In particular, most popular network measures are not invariant to parcellation choice [158, 155, 165, 129, 94, 24]. Studies of the effect of parcellation choice on network measures have spanned both functional ([158, 24]) and structural imaging ([155, 165, 129, 94]), and their results, while alarming, are perhaps not unexpected: by changing the domain of the graph (the node set) we find different summary measures. One might then ask: which of the parcellations is best [24, 9]? This is unfortunately task dependent; while there may be a best parcellation for classification of specific diseases using specific modalities, it is unlikely that this result generalizes to other tasks using other modalities or parameters [9]. Moreover, it has been suggested that a "true parcellation" is unlikely [36], even conditioned on a particular physical length-scale.

Even if there were an agreed upon parcellation for normatively healthy human populations, it is probably not the case that covariate effects of interest (disease, treatment, or otherwise) would be homogeneous within each region. While effects probably do have maps with spatial patterning (i.e. regions of effect), it is unlikely that these effects would follow standard parcellation boundaries, especially for disease effects.

Somewhat independent of this, it has been suggested that the edges in these networks, built from tractography, are unreliable [99, 146]. The particular type of edge noise observed in brain connectivity is not easily modelled from within the network framework, due to its origins in spatial geometry (which is abstracted away under the network paradigm).

1.3 Motivation and Summary

The present work and thesis proposal is concerned with these particular problems in connectomics. We present in Chapter 2 a parcellation independent representation of connectivity, relying instead on the underlying continuous spatial domain from which nodes are drawn. From this model we frame parcellation choice as model selection, then present an optimization method for parcel generation. We then provide a method for direct analysis of the continuous connectivity objects (without parcels). Finally, in Chapter 4 we step back to the edge generation step, providing an analysis and discussion of convergence, random and systemic error, and the nature of our observations.

The model proposed in Chapter 2 is not innovative in its mechanism; it is, in short, an inhomogeneous Poisson process defined over the product space of the original domain with itself (the space of possible connections). However, because of the special context of the domain $\Omega$, we can leverage the Poisson rate to provide insight into the selection of partitions of $\Omega$ (Chapter 3).

Chapter 2: A Continuous Connectivity Model

Connectomics as a field is synonymous with the network-theory based representation of the brain. Critical to this representation of connectivity is the delineation of brain regions, anatomic parcellation, yet multiple studies have shown that the choice of parcellation influences the graph statistics of both structural and functional networks [133, 158, 165].

It is thus useful to construct a more general framework for cortical connectivity, one in which any particular parcellation of the cortex may be expressed and its connectivity matrix derived, and one in which the variability of connectivity measures can be modeled and assessed statistically. It is also important that this framework allow comparisons between parcellations, and representations in this framework must be both analytically and computationally tractable.
Since several brain parcellations at the macroscopic scale are possible, a representation of connectivity that is independent of parcellation is particularly appealing. In this chapter, we develop such a general framework for a parcellation independent connectivity representation, building on the work of [67]. We describe a continuous point process model for the generation of observed tract¹ (streamline) intersections with the cortical surface, from which we may recover a distribution of edge strengths for any pair of cortical regions, as measured by the inter-region tract count. Our model is an intensity function over the product space of the cortical surface with itself, assigning to every pair of points on the surface a point connectivity. We describe an efficient method to estimate the parameter of the model, as well as a method to recover the region-to-region edge strength. We then demonstrate the estimation of the model on a Test-Retest dataset. We provide reproducibility estimates for our method and the standard direct count methods [74] for comparison. We also compare the representational power of common cortical parcellations with respect to a variety of measures. This chapter appeared as a standalone publication [111, 114], and has associated minor explorations not included here [113, 115].

¹ It is critical to distinguish between white matter fibers (fascicles) and observed "tracts." Here, "tracts" denotes the 3D curves recovered from Diffusion Weighted Imaging via tractography algorithms.

The key theoretical component of our work is the use of point process theory to describe estimated cortical tract projections. A point process is a random process where any realization consists of a collection of discrete points on a measurable space. The most basic of these processes is the Poisson process, in which events occur independently at a specific asymptotic intensity (rate) $\lambda$ over the chosen domain [108]. $\lambda$ completely characterizes each particular process, and is often defined as a function $\lambda: \mathrm{Domain} \to \mathbb{R}^+$, which allows the process to vary in intensity by location. The expected count of any sub-region (subset) of the domain is its total intensity, the integral of $\lambda$ over the sub-region. In our current context, the domain is the connectivity space of the cortex, i.e. the set of all pairs of points on the surface, and the events are estimated tract intersections with the cortical surface.

In the point process literature the domain usually has some implicit time-scale. Further, the independence of these events is assumed but can never be ensured. Instead, in this context we have a static (diffusion) image from which events are simulated; we discuss the implications of this in Chapter 4, but in short, for many tracking methods we can certify the independence of the events and normalize the "observation period" during which they're "collected".

2.1 Model Definition

Let $\Omega$ be the union of two disjoint subspaces, each diffeomorphic to the 2-sphere, representing the white matter boundaries in each hemisphere. Further consider the space $\Omega \times \Omega$, which here represents all possible endpoint pairs for tracts that reach the white matter boundary. We treat the observation of such tracts as events generated by an inhomogeneous (symmetric) Poisson process on $\Omega \times \Omega$; in our case, for every event $(x, y)$ we have a symmetric event $(y, x)$. Assuming that each event is independent of all other events except for its symmetric event (i.e. independent tracts), we model connectivity as an intensity function
$\lambda : \Omega \times \Omega \to \mathbb{R}^+$ such that, for any regions $E_1, E_2 \subseteq \Omega$, the number of events is Poisson distributed with parameter

$$C(E_1, E_2) = \iint_{E_1, E_2} \lambda(x, y)\, dx\, dy. \tag{2.1}$$

Due to properties of the Poisson distribution, the expected number of tracts is exactly $C(E_1, E_2)$. For any collection of regions $\{E_i\}_{i=1}^N = P$, we can compute a weighted graph $G(P, \lambda)$ by computing each $C(E_i, E_j)$ for pairs $(E_i, E_j) \in P \times P$. Each node in this graph represents a region, and each weighted edge represents the pairwise connectivity of the pair of nodes (regions) it connects. We call $P$ a parcellation of $\Omega$ if $\bigcup_i E_i = \Omega$ and $\bigcap_i E_i$ has measure 0 ($\{E_i\}$ is almost disjoint).

The intensity function $\lambda$ is exactly the continuous connectivity object we wanted to construct. We could also conceptualize $\lambda$ as an (unnormalized) measure, a "connective density", for which the edge counting measure is a random instance (linked through the Poisson process). For any fixed set of regions we essentially are using the edge counting measure to construct step-wise approximations $G$ to $\lambda$. In this context all tract-based connectomes are in the exponential random graph family, specifically Poisson random graphs.

2.1.1 Recovery of the Intensity Function

A sufficient statistic for Poisson process models is the intensity function $\lambda(x, y)$. Estimation of this function is non-trivial, and has been the subject of much study in the spatial statistics community [38]. We choose to use a non-parametric Kernel Density Estimation (KDE) approach due to an efficient closed form for bandwidth estimation described below. This process is self-tuning up to a choice of desiderata for the bandwidth parameter.

We first inflate each surface to a sphere and register them using a spherical registration (see Section 2.2.1); however, this entire method can be undertaken without group registration. We treat each hemisphere as disjoint from the other, allowing us to treat $\Omega \times \Omega$ as the product of spheres $(S_1 \cup S_2) \times (S_1 \cup S_2)$. Throughout the rest of the paper, $D$ denotes a dataset containing observed tract endpoints $(x, y)_i$, and $\hat\lambda$ denotes our estimate of $\lambda$.

The unit normalized spherical heat kernel is a natural choice of kernel for $S^2$. We use its truncated spherical harmonic representation [29], defined as follows for any two unit vectors $p$ and $q$ on the 2-sphere:

$$K_\sigma(p, q) = \sum_h^H \frac{2h + 1}{4\pi} \exp\{-h(h+1)\sigma\}\, P_h^0(p \cdot q)$$

Here, $P_h^0$ is the $h$-th degree associated Legendre polynomial of order 0. Note that the non-zero order polynomials have coefficient zero due to the radial symmetry of the spherical heat kernel [29]. However, since we are estimating a function on $\Omega \times \Omega$, we use the product of two heat kernels as our KDE kernel. For any two points $p$ and $q$, the kernel value associated to an endpoint pair $(x, y)$ is $\kappa((p, q) \mid (x, y)) = K_\sigma(x, p)\, K_\sigma(y, q)$. It is easy to show that $\iint K_\sigma(x, p)\, K_\sigma(y, q)\, dp\, dq = 1$.

The spherical heat kernel has a single shape parameter $\sigma$, which corresponds to its bandwidth. While in general tuning this parameter requires the re-estimation of $\hat\lambda$ at every iteration, by rewriting our kernel we can memoize part of the computation so that we only need to store the sum of the outer products of the harmonics. Writing out $\hat\lambda((p, q) \mid D) = \sum_{(x_i, y_i) \in D} K_\sigma(x_i, p)\, K_\sigma(y_i, q)$, we have the following:

$$\hat\lambda((p, q) \mid D) = \sum_h^H \sum_k^H \underbrace{\frac{2h+1}{4\pi} \frac{2k+1}{4\pi} \exp\{-(h^2 + h + k^2 + k)\sigma\}}_{\text{independent of } D;\ \text{evaluated every iteration}} \underbrace{\sum_{(x_i, y_i) \in D} P_h^0(x_i \cdot p)\, P_k^0(y_i \cdot q)}_{\text{independent of } \sigma;\ \text{evaluated once}}$$

Thus, evaluations of the kernel at any point $(p, q)$ can be done quickly for sequences of values of $\sigma$.
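The memoization above translates directly into code. The sketch below is a minimal illustration (assuming numpy/scipy; not the original implementation): for a fixed evaluation point $(p, q)$ the data-dependent Legendre sums are computed once, after which $\hat\lambda$ can be evaluated cheaply for many bandwidths $\sigma$.

```python
import numpy as np
from scipy.special import eval_legendre

def legendre_sums(p, q, X, Y, H):
    """Data-dependent factor S[h, k] = sum_i P_h(x_i . p) P_k(y_i . q).
    Computed once per evaluation point (p, q); independent of sigma."""
    h = np.arange(H + 1)
    Px = eval_legendre(h[:, None], X @ p)  # shape (H+1, N)
    Py = eval_legendre(h[:, None], Y @ q)
    return Px @ Py.T                       # shape (H+1, H+1)

def intensity(sigma, S):
    """lambda_hat((p, q) | D) for bandwidth sigma, up to the 1/|D| factor."""
    h = np.arange(S.shape[0])
    c = (2 * h + 1) / (4 * np.pi) * np.exp(-h * (h + 1) * sigma)
    return c @ S @ c  # sum over h, k of c_h * c_k * S[h, k]

# X, Y: (N, 3) arrays of unit-vector tract endpoints; p, q: unit vectors.
# S = legendre_sums(p, q, X, Y, H=32)
# densities = [intensity(s, S) for s in np.logspace(-3, 0, 20)]
```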
We are then left with the choice of loss function. Denoting the true intensity function $\lambda$, the estimated intensity $\hat\lambda$, and the leave-one-out estimate $\hat\lambda_{-i}$ (leaving out observation $i$), the Integrated Squared Error (ISE) is defined:

$$\mathrm{ISE}(\sigma \mid D) = \int \big(\hat\lambda(x, y \mid \sigma) - \lambda(x, y)\big)^2\, dx\, dy \approx \int \hat\lambda(x, y \mid \sigma)^2\, dx\, dy - \frac{2}{|D|} \sum_{(x_i, y_i) \in D} \hat\lambda_{-i}(x_i, y_i) + \mathrm{Constant}.$$

Hall and Marron [69] suggest tuning bandwidth parameters using ISE. In practice, we find that replacing each leave-one-out estimate with its logarithm $\log \hat\lambda_{-i}(x_i, y_i)$ yields more consistent and stable results.

2.1.2 Selecting a parcellation

Given an estimated intensity $\hat\lambda$ and two or more parcellations $P_1, P_2, \dots$, we would like to know which parcellation and associated graph $G(P, \hat\lambda)$ best represents the underlying connectivity function. There are at least two perspectives to consider.

Approximation Error: Because each $P_i$ covers $\Omega$ (and $P_i \times P_i$ covers $\Omega \times \Omega$), each $G(P_i, \hat\lambda)$ can be viewed as a piece-wise function $g : \Omega \times \Omega \to \mathbb{R}^+$, where $g(x, y) = \frac{1}{|E_i||E_j|} C(E_i, E_j)$ such that $x \in E_i$ and $y \in E_j$. In other words, $g$ is the constant approximation to $\hat\lambda$ over every pair of regions. A natural measure of error is another form of Integrated Squared Error:

$$\mathrm{Err}(\hat\lambda, G(P, \hat\lambda)) = \iint \big(g(x, y) - \hat\lambda(x, y)\big)^2\, dx\, dy. \tag{2.2}$$

This is analogous to squared loss ($\ell_2$-loss).

Likelihood: An alternative viewpoint leverages the point process model to measure a likelihood:

$$\log L(P) = \sum_{E_i, E_j \in P} \log \mathrm{Poisson}\big(|\{(x, y)_i \in D : x \in E_i, y \in E_j\}|\,;\ C(E_i, E_j)\big). \tag{2.3}$$

Here, the independence assumption plays a critical role, allowing pairs of regions to be evaluated separately. Unfortunately this is biased toward parcellations with more, smaller regions, as the Poisson distribution has tied variance and mean in one parameter. A popular likelihood-based option that somewhat counterbalances this is Akaike's Information Criterion (AIC),

$$\mathrm{AIC}(P) = -2 \log L(P) + |P|(|P| - 1). \tag{2.4}$$

AIC balances accuracy with parsimony, penalizing overly parameterized models, in our case parcellations with too many regions.

2.2 Reproducibility and Parcellation Selection Results

We demonstrate the use of our framework on a test-retest dataset. We measure connectome reproducibility using Intraclass Correlation (ICC) [126], and compare three parcellations using multiple criteria (see Equations 2.2, 2.3, and 2.4).

2.2.1 Procedure, Connectome Generation, and Evaluation

Our data comprise 29 subjects from the Institute of Psychology, Chinese Academy of Sciences sub-dataset of the larger Consortium for Reliability and Reproducibility (CoRR) dataset [175]. T1-weighted (T1w) and diffusion weighted (DWI) images were obtained on a 3T Siemens TrioTim using an 8-channel head coil and 60 directions. Each subject was scanned twice, roughly two weeks apart. T1w images were processed with FreeSurfer's [45] recon-all pipeline to obtain a triangle mesh of the grey-white matter boundary registered to a shared spherical space [46], as well as corresponding vertex labels per subject for three atlas-based cortical parcellations: the Destrieux atlas [47], the Desikan-Killiany (DK) atlas [37], and the Desikan-Killiany-Tourville (DKT31) atlas [87].

Probabilistic streamline tractography was conducted using the DWI in 2mm isotropic MNI 152 space, using Dipy's [56] implementation of constrained spherical deconvolution (CSD) [150] with a harmonic order of 6. As per Dipy's ACT, we retained only tracts longer than 5mm with endpoints in likely grey matter. We provide the mean ICC score computed both with and without entries that are zero for all subjects.
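For concreteness, the likelihood-based criteria of Section 2.1.2 (Eqs. 2.3 and 2.4) reduce to a few lines once the per-pair endpoint counts and the integrated intensities $C(E_i, E_j)$ are in hand; the sketch below assumes both are given as $K \times K$ matrices and is illustrative rather than the implementation used here.

```python
import numpy as np
from scipy.stats import poisson

def parcellation_scores(counts, rates):
    """Log-likelihood (Eq. 2.3) and AIC (Eq. 2.4) of a K-region parcellation.
    counts[i, j]: observed endpoint-pair count for (E_i, E_j);
    rates[i, j]:  integrated intensity C(E_i, E_j) under lambda_hat."""
    K = counts.shape[0]
    iu = np.triu_indices(K)  # each unordered region pair enters once
    log_lik = poisson.logpmf(counts[iu], rates[iu]).sum()
    aic = -2.0 * log_lik + K * (K - 1)
    return log_lik, aic
```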
When estimating $\hat\lambda$, the kernels are divided by the number of tracks, and we use a sphere with unit surface area instead of unit radius for ease of computation. We threshold each of the kernel integrated connectomes at $10^{-5}$, which is approximately one half of one unit track density. We then compute three measures of parcellation representation accuracy, namely ISE, Negative Log Likelihood, and AIC scores.

Table 2.1: Mean ICC scores for each connectome generation method. The count method (the standard approach) defines edge strength by the fiber endpoint count. The integrated intensity method is our proposed method; in general it returns a dense matrix. However, many of the values are extremely low, and so we include results both with and without thresholding.

  Atlas                          DK        Destrieux   DKT31
  Number of Regions              68        148         62
  Count ICC                      0.2093    0.1722      0.2266
  Intensity ICC (Full)           0.4868    0.4535      0.4388
  Intensity ICC (w/ Threshold)   0.5613    0.6481      0.4645

Table 2.2: Means over all subjects of three measures of parcellation "goodness". The retest versions are the mean of the measure using the parcellation's regional connectivity matrix (or the count matrix) from one scan and the estimated intensity function from the other scan.

  Measure                  DK            Destrieux     DKT31
  ISE                      1.8526e-05    2.1005e-05    2.1258e-05
  Negative LogLik          85062.5       355769.4      88444.5
  AIC Score                174680.95     733294.8      185253.5
  Retest ISE               1.0517e-05    1.0257e-05    1.1262e-05
  Retest Negative LogLik   85256.0       357292.9      88434.9
  Retest AIC Score         175068.1      736341.9      185234.3

Table 2.1 shows surprisingly low mean ICC scores for regular count matrices. This may be because ICC normalizes each measure by its $s^2$ statistic, meaning that entries in the adjacency matrices that should be zero but that are subject to a small amount of noise (a few erroneous tracks) have very low ICC. Our method in effect smooths tract endpoints into a density; endpoints near the region boundaries are in effect shared with the adjacent regions. Thus, even without thresholding we dampen noise effects as measured by ICC. With thresholding, our method's performance is further improved, handily beating the counting method with respect to ICC score.

It is important to note that for many graph statistics, changing graph topology can greatly affect the measured value [165]. While it is important to have consistent non-zero measurements, the difference between zero and small but non-zero in the graph context is also non-trivial. The consistency of zero-valued measurements is thus very important in connectomics.

[Figure 2.1: A visualization of the ICC scores for connectivity to Brodmann Area 45 (Destrieux region 14) for the Count connectomes (left) and the proposed Integrated Intensity connectomes (right). Blue denotes a higher score.]

Table 2.2 suggests that all three measures, while clearly different, are consistent in their selection, at least with respect to these three parcellations. It is somewhat surprising that the Destrieux atlas has quite low likelihood criteria, but this may be due to the (quadratically) larger number of region pairs. Both likelihood based retest statistics also choose the DK parcellation, while ISE chooses the Destrieux parcellation by a small margin. It should be noted that these results must be conditioned on the use of a probabilistic CSD tractography model. Different models may lead to different intensity functions and resulting matrices. The biases and merits of the different models and methods (e.g.
gray matter dilation for fiber counting vs. streamline projection) remain important open questions.

[Figure 2.2: A visualization of the marginal connectivity $M(x) = \int_{E_i} \hat\lambda(x, y)\, dy$ for the Left Postcentral Gyrus region of the DK atlas (Region 57). The region is shown in blue on the inset. Red denotes regions with higher connectivity to the blue region.]

Chapter 3: Searching for Optimal Parcellations from Continuous Connectivity

A natural follow-up to the parcellation selection method of the previous chapter is optimization over parcellations given a connective density: can we find the best parcellation to describe the edge observations? Unfortunately, without regularization or constraint the trivial solution of "every voxel" is optimal. However, with a prior over relative spatial-scale and constraints or priors over shape (connectedness, relative convexity) there are certainly non-trivial solutions. In this chapter we develop a method to sample such parcellations.

A number of papers have focused on using the connection profile of either voxels or vertices to define the parcellation of the cortex (see [41] and the references therein). For connectivity based parcellation (CBP) methods using structural connectivity, two modeling choices are required: 1) a spatial resolution of the grid on which connections are defined, and 2) the criteria on which clusters should be formed. In almost every existing method, connectivity measures are first estimated for a high resolution grid of atomic units (choice one), e.g. voxels or vertices, and the atomic units are then combined under spatial constraints optimizing desiderata (choice two). The first decision is essentially a division of connectivity into two scales: the macro-region level and the higher resolution voxel/vertex level, where our task is to learn from the former the regions of the latter.

The second modeling choice is the criteria on which these atoms are clustered. Many popular choices come from the more general clustering literature, e.g. within-group sum of squares (explained variance), within-group statistical distances, and mixture model likelihoods [41, 124, 164]. These criteria use the connectivity profiles of each meso-scale atom without regard to the network structure they induce at the macro-scale. In other words, they treat each vertex or voxel as a data point with an associated vector (its row in the meso-scale adjacency matrix), and then cluster based on this vector space. If one vertex is changed from one group to another, these methods generally do not re-evaluate the quality of all the groups, though on the macro-scale each of their connective profiles would have changed.

In this chapter we will address both of these choices. We present a method framed in the context of generative models, specifically Bayesian non-parametric mixture models, which place priors over all possible partitions of the higher resolution grid and do not require the number of clusters to be predefined. One large class of such priors are the so-called "restaurant processes", used here. We implement a continuum form of connectivity for our mixture components, and further leverage a conjugate likelihood-prior relationship to produce closed form marginal likelihoods for network interactions, allowing efficient sampling. The content of this chapter was published as [116].

3.1 Model

Let $\Omega$ again be the white matter/gray matter interface (the inner cortical surface), with the acknowledgment that $\Omega$ is in general composed of two disjoint sheets, each with a boundary at the medial wall.
Fix a coordinate system over $\Omega$, and define a parcellation $P$ as any set of regions $\{E_i\}$ where $E_i \subseteq \Omega$, $\bigcup E_i = \Omega$, and $|\bigcap E_i| = 0$ (i.e. the regions $E_i$ are almost disjoint). We assume there exists a latent parcellation $P^*$ that accurately describes the cortical surface with respect to its underlying neuroanatomical structure. Our objective is the recovery of $P^*$, specifically using structural connectivity information, and without specifying the exact number of regions. In order to accomplish this, we construct a joint generative model of parcellations and connectivity.

We start by choosing a model of partitions. We use the distance dependent Chinese Restaurant Process (ddCRP) [18], a variant of the popular Chinese Restaurant Process (CRP) family of non-parametric Bayesian models. CRP models are most commonly used in mixture models, providing a prior over all possible label assignments for any number of label-parameter pairs. A main assumption of the CRP is the exchangeability of the data; the ddCRP removes this exchangeability assumption, allowing for non-trivial topologies of dependence between data points. This is discussed in detail in Section 3.1.1. Briefly, the ddCRP allows us to use non-parametric style mixture models on mesh grids, where we assume a priori that neighboring patches are dependent. For example, we assume there is spatial auto-correlation over the discrete manifold of the mesh. Practically speaking, the ddCRP is the component responsible for merging or splitting the parcels (clusters), and in general for their configuration.

We next choose a mixture component model; the distribution chosen here will generate the observed network between estimated regions from the ddCRP. We choose to follow the style of the Infinite Relational Model [84, 10], where we model interactions between clusters instead of the profiles of the clusters themselves. Thus, we need a separate parameter for each pair of regions. (These will be the step-wise approximations to a continuous density $\lambda$.)

Consider $\Omega \times \Omega$, the set of all possible tract endpoint pairs intersecting the cortical surface. We model the observation of these pairs of endpoints using the spatial point process from Chapter 2, with parameter $\lambda : \Omega \times \Omega \to \mathbb{R}^+$. In view of the overall model it is important to note one convenient property: disjoint regions have independent counts.

Moving back to the mixture components, we make the following simplifying assumption on the form of the tract endpoint process: each region pair interacts in a homogeneous manner. That is, we assume $\lambda$ is constant over any pair of parcels. Thus, for any finite configuration of $K$ parcels we have on the order of $K^2$ non-negative scalar parameters to estimate, and these parameters are the rate parameters for the Poisson spatial processes generating the evidence of connectivity. We choose to use the Gamma distribution to model these parameters (i.e., each pair of parcels $(g_i, g_j)$ draws a rate $\lambda_{ij}$ from a Gamma prior). As shown in Section 3.1.2, the conjugacy of the Gamma distribution with the homogeneous Poisson process allows for closed form marginal distributions, and thus efficient collapsed sampling methods.

We also choose to use the mesh faces $\{f_m\}_{m=1}^M$ as the elements of our ddCRP mixture. This is because connectivity is usually defined over intersections of tracts with areal units, and both the ddCRP and the Poisson process naturally operate over such regions.
Putting this all together, and leaving the meaning of $c_m$ for the next section, the model is as follows:

$$c_m, g_i \sim \mathrm{ddCRP}(\alpha, \mathrm{Adj})$$
$$\lambda_{ij} \sim \mathrm{Gamma}(a, b)$$
$$D_{ij} \sim \mathrm{PoissonPointProcess}(\lambda_{ij})$$

3.1.1 The distance dependent Chinese Restaurant Process (ddCRP)

As suggested by its name, the distance dependent Chinese Restaurant Process is a variant of the Chinese Restaurant Process (CRP), often used in non-parametric Bayesian mixture models as a prior over possible mixture components (i.e. a distribution over distributions). Let $\alpha$ be some positive constant concentration parameter, and let $G_0$ be a prior distribution over mixture component parameters (for us, a Gamma distribution). The original CRP mixture model [125] describes an endless stream of customers (data) entering a restaurant with an infinite number of tables (clusters). Each customer either chooses (with prescribed probability dependent on $\alpha$) to sit at an existing table (which has a particular component distribution) or to sit at an unoccupied table (draw a new component distribution from the prior). Up to the indexing of the tables, for any finite number of observations any number of clusters and configuration of cluster associations is possible.

In the original CRP, the data are assumed to be exchangeable; that is, the joint likelihood of any observations is invariant under permutations of observation indices. However, in our spatial context we have a topology of face adjacencies. Permutations of the face indices are non-trivial, and thus the faces are not exchangeable. To model this, the ddCRP allows each customer to choose another customer (possibly itself) to sit with, based on its dependencies. This forms a directed graph of seating choices; table assignments are then made to each group of customers who have chosen to sit with each other, i.e. each connected component of the seating choice graph. Mixture components are drawn for each table from $G_0$, and only then are the actual data drawn from each mixture component. Clearly this is a two stage procedure. In our context, this means each face will choose to be in a cluster with one of its neighbors or itself.

As above, let $\{f_m\}_{m=1}^M$ be the set of mesh faces, and let $\{c_m\}_{m=1}^M$ be the corresponding assignments, where each $c_m \in \{1, \dots, M\}$. We draw $c_m$ conditioned on adjacency information $\mathrm{Adj}$ as follows:

$$p(c_m = j \mid \mathrm{Adj}) \propto \begin{cases} 1 & \text{if } f_j \text{ is adjacent to } f_m \\ \alpha & \text{if } m = j \\ 0 & \text{otherwise} \end{cases}$$

We denote each cluster of faces as $g_k$ for $k \in \{1, \dots, K\}$, where $K$ is the number of clusters. Due to our restriction of $c_m$ to the indices of faces adjacent to $f_m$, each $g_k$ is a contiguous region, and the set of groups forms a valid parcellation of $\Omega$. We note that the original ddCRP is defined for more general distance functions.
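The seating prior above is simple to simulate. Below is a minimal illustrative sketch (not the thesis code), assuming a triangle-mesh adjacency list: each face picks one neighbor with weight 1 or itself with weight $\alpha$, and the clusters are the connected components of the resulting seating graph.

```python
import numpy as np
import networkx as nx

def sample_seating(adj, alpha, rng):
    """One draw from the ddCRP seating prior: face m links to a neighbor
    (weight 1 each) or to itself (weight alpha)."""
    c = np.empty(len(adj), dtype=int)
    for m, neighbors in enumerate(adj):
        options = list(neighbors) + [m]
        w = np.ones(len(options))
        w[-1] = alpha
        c[m] = rng.choice(options, p=w / w.sum())
    return c

def seating_to_clusters(c):
    """Tables g_k are connected components of the seating-choice graph."""
    g = nx.Graph(list(enumerate(c)))  # undirected view; self-loops are fine
    return list(nx.connected_components(g))

# adj: neighbor index lists per face; contiguity of each g_k is automatic.
# c = sample_seating(adj, alpha=0.01, rng=np.random.default_rng(0))
# clusters = seating_to_clusters(c)
```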
3.1.2 Mixture components: Poisson-Gamma

The evidence of pairwise interaction between regions in structural connectivity is the set of tract endpoints $D = \{(x_t, y_t)\}_{t=1}^T$. Since the regional clusters are defined over discrete grids of areal atoms (mesh faces), these are naturally aggregated to count measures over each pair of sub-regions. For any pair of regions $(g_i, g_j)$, define $D_{ij} = \{(x_t, y_t) \in g_i \times g_j\}$. We model the counts $|D_{ij}|$ using the Poisson process with fixed intensity $\lambda_{ij}$, where the area $g_i \times g_j$ contains a random count $|D_{ij}|$ distributed

$$|D_{ij}| \sim \mathrm{Poisson}\left( \int_{g_i \times g_j} \lambda\, dx\, dy \right)$$

Using the independence assumption on the tract endpoints, the likelihood of any configuration of tract endpoints can then be written

$$L(D) = \prod_{g_i, g_j} \exp\left\{ -\int_{g_i \times g_j} \lambda_{ij}\, dx\, dy \right\} \lambda_{ij}^{|D_{ij}|}$$

We use a Gamma prior for the $\lambda_{ij}$ parameters, the conjugate prior of the Poisson distribution. Using the Gamma distribution allows us to derive a simple closed form marginal distribution for $D_{ij}$ that "integrates out" the $\lambda_{ij}$'s, leaving a likelihood in terms of the prior parameters $a, b$. It is as follows:

$$P(D_{ij} \mid a, b) = \int P(D_{ij}, \lambda \mid a, b)\, d\lambda = \int P(D_{ij} \mid \lambda)\, P(\lambda \mid a, b)\, d\lambda$$
$$= \int \underbrace{\exp\left\{ -\int_{g_i} \int_{g_j} \lambda\, dA \right\} \prod_{t=1}^{|D_{ij}|} \lambda}_{\text{homogeneous point process}}\ \underbrace{\frac{b^a}{\Gamma(a)} \exp(-b\lambda)\, \lambda^{a-1}}_{\text{Gamma prior}}\, d\lambda$$
$$= Z(a, b) \int \underbrace{\exp\{-(|g_i \times g_j| + b)\lambda\}\, \lambda^{|D_{ij}| + a - 1}}_{\text{unnormalized Gamma posterior}}\, d\lambda = \frac{Z(a, b)}{Z(a', b')}$$
$$= \left( \frac{b}{|g_i \times g_j| + b} \right)^a \left( \frac{1}{|g_i \times g_j| + b} \right)^{|D_{ij}|} \frac{\Gamma(a + |D_{ij}|)}{\Gamma(a)}$$

Here, $Z(a, b) = \frac{b^a}{\Gamma(a)}$, with posterior parameters $a' = a + |D_{ij}|$ and $b' = b + |g_i \times g_j|$.
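Because the marginal above is closed form, evaluating it costs only a few log-gamma calls, which is what makes the collapsed sampler cheap. Below is a minimal sketch (illustrative, not the thesis code) of the log marginal exactly as derived:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(n_ij, area_ij, a=1.0, b=1.0):
    """Closed-form log P(D_ij | a, b) with lambda_ij integrated out.
    n_ij    = |D_ij|, the endpoint-pair count for regions (g_i, g_j);
    area_ij = |g_i x g_j|, the product-space area of the region pair."""
    return (a * np.log(b)
            - (a + n_ij) * np.log(area_ij + b)
            + gammaln(a + n_ij) - gammaln(a))
```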
3.1.3 Combined Model and Collapsed Sampling Scheme

We estimate the model via collapsed Gibbs sampling, specifically using the closed form integral over $\lambda_{ij}$ to avoid sampling the interaction parameters. Starting at iteration $\ell = 0$, we update each $c_m^\ell$ by the following conditional likelihood:

$$P(c_m^{\ell+1} = k \mid D) \propto P(c_m = k) \prod_{i, j} P(D_{ij} \mid a, b, c_m^{\ell+1} = k, \{c_r^{\ell+1}\}_{r < m}, \{c_s^\ell\}_{s > m})$$

Since we assume $c_m$ is restricted by our mesh topology, we have only a small number of options to evaluate. We denote the seating graph edge $c_m^\ell$ a "critical edge" if, for any other node $f_{m'}$ such that $c_{m'} = m$, there exists a path to the face with index $c_m^\ell$. Let $g_{\mathrm{old}}$ be $f_m$'s previous component, and $g_{\mathrm{crit}}$ be the component of $f_m$ without its own edge (its critical component). Without loss of generality we may further order the neighbors of $f_m$ as $f_n$ for $n = 1, \dots, N$, and their groups as $g(n)$. Using these definitions, for triangle meshes we can write out all possible scenarios:

1. If $c_m$ is not critical, and all neighbors are of the same component before the update, then we can simply choose $c_m$ via $P(c_m = k)$, as there is no difference with respect to the induced components.

2. If $c_m$ is not critical, but not surrounded by the same component, then we are asking essentially "Should $c_m$'s previously induced component join one of its as-of-yet independent neighbors?" Thus, for each neighbor $n$,

$$P(c_m^{\ell+1} = n \mid D) \propto P(c_m = n) \prod_k^K P(D_{*,k} \mid g_{\mathrm{new}} = g_{\mathrm{old}} \cup g_n,\ g_k) \prod_{\substack{\hat n = 1 \\ \hat n \neq n,\ \hat n \neq \mathrm{old}}}^{N} \prod_k^K P(D_{g(\hat n), k} \mid g_{\hat n},\ g_k)$$

Here $D_{*,k} = \{(x_t, y_t) \in (g_{\mathrm{old}} \cup g_n) \times g_k\}$.

3. If $c_m$ is critical, then for each neighboring component (including the component $g_{\mathrm{old}} \setminus g_{\mathrm{crit}}$ among the neighbors) we have

$$P(c_m^{\ell+1} = n \mid D) \propto P(c_m = n) \prod_k^K P(D_{*,k} \mid g_{\mathrm{new}} = g_{\mathrm{crit}} \cup g_n,\ g_k) \prod_{\hat n \neq n} \prod_k^K P(D_{\hat n, k} \mid g_{\hat n},\ g_k).$$

We iteratively update the face assignments using $P(c_m^{\ell+1} = k \mid D)$, collecting samples after every pass. While this generates a posterior distribution over $c_m$, we simply take the maximum a posteriori (MAP) estimate as our selected parcellation.

In general, updates made in Gibbs sampling algorithms are done sequentially; this is because strong dependencies between concurrent updates will destabilize some samplers. However, in cases of low dependence, approximate asynchronous parallel updates have been used with empirically strong results (so-called "Hogwild" updates [76]). In our case, most updates are either within components or between small components (with correspondingly small interdependencies), so a small degree of parallelism is possible. In practice we use a compromise between the serial algorithm and the parallel version: we use a shared memory parallel sampler for calculating the likelihoods of a small batch $c_i$, then make serial updates based on these likelihoods. This allows a roughly linear speed-up in the number of threads used, though there is a slowly scaling cost of the serial update.

3.1.4 Implementation notes

In fixing a coordinate system, it is common to split the white matter/gray matter interface into two spheres, each with a null region where the corpus callosum bridges the longitudinal fissure. Thus, an easy system can be constructed using spherical coordinates and a marker for hemisphere.

The symmetry of each tract's endpoints requires careful consideration to avoid double counting; while the intuition of the model can be understood without thinking about the symmetry of the data, when evaluating joint probabilities it is important to include each data point only once. This can be achieved by only evaluating $P(D_{ij} \mid a, b)$ for $i \leq j$.

When computing the parallel updates, in our experience it is much more efficient to keep the threads active but idle, and simply have a single thread do the serial update. This avoids the overhead of repeated thread spawns, which for some implementations/architectures can be costly.

3.2 Procedure and Results

In order to test our proposed model, we use two open datasets, one composed of 20 subjects each scanned twice from the Institute of Psychology, Chinese Academy of Sciences (IPCAS) subset of the Consortium for Reliability and Reproducibility (CoRR) dataset [175], and the other composed of 30 subjects from the Human Connectome Project (HCP) S900 release [153]. The pre-processing differs slightly between the datasets, to account for the different imaging parameters. In general the HCP dataset has higher resolution (both in voxel size and angular resolution), leading to different tractographies. On each dataset we compare the performance of the proposed method against two recommended alternatives: Ward's method, a greedy hierarchical clustering method [41], and Spectral Clustering [124].

3.2.1 Preprocessing and Tractography

HCP: We used the minimally preprocessed T1-weighted (T1w) and diffusion weighted (DWI) images rigidly aligned to MNI space. Briefly, the preprocessing of these images included motion correction and eddy current correction (DWI), and linear and nonlinear alignment (between T1w and DWI). We used the HCP Pipeline (version 3.13.1) FreeSurfer protocol to run an optimized version of the recon-all pipeline that computes surface meshes in a higher resolution (0.7mm isotropic) space. We again resample this space to a geodesic grid after computing tract intersections with the surface. Probabilistic streamline tractography was conducted using the DWI in the native 1.25mm isotropic voxel size in MNI space, using Dipy's [56] implementation of constrained spherical deconvolution (CSD) [150] with a harmonic order of 6. Tractography streamlines were seeded at 2 random locations in each white matter voxel labeled by FSL's FAST. Streamline tracking followed directions randomly in proportion to the orientation function at each sample point at 0.5mm steps, starting bidirectionally from each seed point with 8 restarts per seed.
As per Dipy's Anatomically Constrained Tractography (ACT) [134], we retained only tracts longer than 5mm with endpoints in likely gray matter.

IPCAS: This dataset was processed in the same manner as in Chapter 2.

3.2.2 Fitting and Results

We fit the proposed method using our parallel sampling scheme, using 60 passes of the sampler with 8 parallel threads (approximately 600,000 updates per subject), with $a = 1$, $b = 1$, and $\alpha = 0.01$. We use the MAP estimate as our result. We fitted Ward clustering by maximizing Explained Variance over a naive search of every possible merge. For the Spectral Clustering method we use an exponential kernel, using the normalized cosine distance as a metric. We use a number of eigenvectors equal to the number of clusters. For both baselines we take the vector of connections as our feature vector. In both clustering schemes, we specify the number of clusters to be equal to that of the proposed method.

We assess cluster quality using a KL-divergence based measure. We take the number of tracts from each face $f_m$ to each region $g_i$ as the objective distribution, and measure how well this is approximated by the average number of tracts from $g(f_m)$ to $g_i$. These form two matrices of dimension $M \times K$. We then normalize these matrices to sum to one and measure their KL divergence as in [124]. If a cluster is well represented by its average connectivity profile, then this divergence will be low.

For the IPCAS dataset we have an additional measure of Test-Retest reproducibility. This is measured by Normalized Mutual Information (NMI) [41, 10], which measures cluster similarity without requiring similar numbers of clusters. Let $Z$ be a binary matrix of cluster assignments where, for each row $i$, each entry $Z_{ij}$ is 1 if $f_i$ is in cluster $j$ and zero otherwise. NMI is defined as $I(Z_1, Z_2) / \sqrt{H(Z_1) H(Z_2)}$, where $I(\cdot,\cdot)$ is mutual information, $H(\cdot)$ is entropy, and $Z_1$ and $Z_2$ are the cluster assignments for the first and second scan respectively. (This uses the convention customary in information theory that $0 \log 0 = 0$.) NMI is also invariant under permutations of labels.

As can be seen in Figure 3.1 and Figure 3.2, the proposed method performs well compared to the baseline methods. The HCP dataset uses more clusters (around 250 per hemisphere) than the IPCAS dataset (around 175-200 per hemisphere). The difference here may be due in part to the higher resolution of the HCP dataset, leading to greater resolving power with respect to the regional connections. These averages are at the upper range of the number suggested by Van Essen et al. [154].

[Figure 3.1: Plots of the KL Divergence based goodness of fit measure, for the three methods, on both datasets. Here, lower is better.]

[Figure 3.2: Left: Normalized Mutual Information between Test-Retest scans. Here, higher is better. Right: histograms of the number of clusters selected for each subject.]

[Figure 3.3: An exemplar parcellation from an HCP subject. Region colors are random.]
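As a concrete sketch (illustrative, not the thesis code), the NMI score above can be computed from two integer label vectors as below; scikit-learn's normalized_mutual_info_score with average_method="geometric" should give the same quantity.

```python
import numpy as np

def nmi(z1, z2):
    """I(Z1; Z2) / sqrt(H(Z1) H(Z2)) for integer label vectors z1, z2,
    using the 0 log 0 = 0 convention. Invariant to label permutation."""
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    joint = np.zeros((z1.max() + 1, z2.max() + 1))
    for i, j in zip(z1, z2):
        joint[i, j] += 1.0 / len(z1)
    h1, h2 = entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0))
    mi = h1 + h2 - entropy(joint.ravel())  # I = H1 + H2 - H12
    return mi / np.sqrt(h1 * h2)
```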
Chapter 4: Tractography Distributions

4.1 Introduction

It is common practice in brain imaging studies to estimate the trajectories of white matter fascicles, the bundles of axons that connect different gray matter regions in the brain. It is difficult to observe the fascicles directly in vivo; instead, researchers currently rely on diffusion imaging - a variant of standard anatomical MRI - and the "tube like" structure of most major neural pathways to reconstruct the brain's major fiber bundles. We observe measurements of water diffusion in brain tissue, fit local diffusion models, and then reconstruct the trajectories by curve fitting. This process is known as tractography, and the reconstructed fibers are referred to as tracts¹.

¹ It is important to distinguish between white matter fibers (fascicles) and observed "tracts." In this paper we use "tracts" to denote the 3D curves recovered from diffusion-weighted imaging via tractography algorithms.

The number of possible curves is combinatorially large, and grows rapidly as image resolution increases. Closed form likelihoods for sets of tracts (tractograms) have not yet been found, even under simple local models. Thus, tract sampling is performed, where only a subset of the possible tracts is recovered. Sampling is usually performed in an ad-hoc manner, and post-hoc analyses usually treat tracts as direct observations instead of realizations of global trajectories from local models. The concept that tracts are drawn from a distribution is rarely acknowledged, and the issue of convergence is almost never addressed.

A simple question remains unanswered: how many tracts should we sample? Even in the most general sense there is no consensus on this topic. Recent works have used 100,000 [59], 500,000 [57], and 10,000,000 [135] tracts for scans of isotropic resolution 2 mm (with 64 directions), 1.25 mm (90 directions), and 2.5 mm (60 directions), respectively. The pragmatic tractography user might then think to simply take the maximum of these options to ensure convergence; for small studies this may be feasible, but given the trajectory of imaging studies - now performed in biobanks of over 10,000 scans - this choice clearly will not scale to the regimes demanded by current and future clinical studies² using current technology. The answer to whether or not such large tractogram sizes are required will dictate practical planning in large studies.

² The largest study to date aims to have 100,000 subjects participating in the imaging cohort [104]. By our back-of-the-envelope estimate, 100,000 subjects with 10,000,000 tracts each would require about 1.5 petabytes of disk space, just for the tractograms.

It is useful then to understand empirically the sample complexity of these phenomena. If we choose a sample size too small, our tractograms will be sensitive to random fluctuations; if instead we choose a sample size too large, we could be wasting very large amounts of storage and computation. The trade-off between accuracy and cost forms a curve of solutions to the (ill-posed) question of "how many tracts?" In this chapter we provide a method for measuring this trade-off. We leverage cross entropy as a relative measure of accuracy, applying our estimator to a test set of subjects under several different tractography methods. The content of this chapter was published as [118].

4.1.1 Relevant Prior Work

Surprisingly, literature on the number of samples required is sparse. In the documentation of Probtrackx [16] the number of posterior samples is taken to be large (5,000 per vertex) specifically to ensure convergence. This solution is not feasible for most other methods, as Probtrackx does not usually keep computed trajectories, but it is the most direct reference to tractogram convergence that we found in the literature.

In Cheng et al. [28] the authors address the convergence of network measures. While this is a similar issue, it is seen through the lens of graph theory.
In a network context the absolute (objective) number of tracts strongly influences the downstream measures; thus, the authors advise against larger numbers of seeds, in order to avoid spurious fibers. Similarly, Prasad et al. [127] investigate the effect of the number of streamlines on a group-wise comparison task. The authors note that larger numbers of streamlines tend to decrease p-values in nodal degree comparisons. Relatedly, Maier-Hein et al. [99] provide a tractography challenge in which the measure of success depends in part on how few false positives are identified. This work is an important if not critical step for the community, addressing issues raised by others (e.g., Thomas et al. [146]), but its evaluation design runs counter to our premise here that tractography is a simulation: it is equivalent to taking the support of the distribution as its characterizing quality.

Jordan et al. [78] propose a method for measuring streamline reproducibility. While they do not provide as much theoretical analysis, nor do they identify the estimation method employed (leading to non-standard choices in their empirical work), their method is at its core the same as the one described here.

4.2 Framework and Methods

4.2.1 Minimal Surprise as a Stopping Condition

Diffusion MRI does not provide direct observations of whole white matter fascicles; the unit of observation is a q-space measurement of diffusion propensity for given gradient angles, which is resolved and processed into quantized voxel measurements. Researchers simulate fascicle trajectories from fitted local (voxel-wise) models, which have parameters estimated from the diffusion signal (for reference and review, see Sotiropoulos and Zalesky [138]). Some of these simulations are deterministic given a seed point, while others are stochastic. In both cases, however, seed points are not bijective with tracts. If the choice of seed point is random, it is then reasonable to view the tractography generation process as sampling from a distribution $P$ defined over the space of possible tracts, stratified by seed points.

The space of possible tracts has at minimum a qualitative notion of similarity. While there is no agreed-upon metric, various distances and similarities have been proposed [121, 159, 55]. Choosing a reasonable metric, a simplistic estimator $\hat{P}$ of the tract distribution can be constructed using Kernel Density Estimation (KDE). These estimators enjoy concentration results even in our general context [32], with minimal constraints on metric choice or kernel shape. (In short, we are asserting that $\hat{P}$ converges as the number of samples used to construct it increases; this does not guarantee that $\hat{P}$ converges to $P$.)

The construction of distribution estimates $\hat{P}$ leads us to a naïve condition for assessing convergence: we should stop sampling when $\hat{P}$ stops gaining information ("stops learning"). $\hat{P}$ is a representation of our beliefs about $P$. A sample from $P$ containing information not in $\hat{P}$ (i.e., contrary to our beliefs) is then qualitatively surprising. As we update our beliefs $\hat{P}$ we learn about $P$; when the surprise no longer decreases we have stopped updating our beliefs, so we have stopped gaining information, which is exactly our stopping condition. Thus we should sample until we have minimal surprise.

Before moving on, it should be made explicit: clearly $\hat{P}$ will be misspecified. Without a natural metric, it is not clear that an estimator class could ever be guaranteed to include $P$ in every case. With this in mind, our method is left open-ended toward the choice of metric, kernel shape, smoothing parameter, etc. Further, for different datasets the convergence rate of $\hat{P}$ will be different.
Higher resolution data provide more measurements, and thus more intrinsic information in $P$, meaning $\hat{P}$ has more surprise to overcome. The point of this chapter is not to estimate $P$ exactly, nor to establish objective numerical standards for the number of tracts in every case, but to provide a method for investigating the complexity of tract generation. It is our belief that with reasonable choices a conservative estimate can be made of the number of samples required.

4.2.2 Quantitative Measures of Surprise

Kullback-Leibler divergence (KL divergence) is a classic measure of the difference between distributions. While it has numerous interpretations and applications, in the information theory context it measures the "extra information" required to encode samples drawn from $P$ using the optimal encoding for $\hat{P}$ [34]. It is often used as a pseudo-distance, as "close" distributions have similar optimal encodings. KL divergence has the following analytic form, with $0 \log 0 = 0$ by convention:
$$\mathrm{KL}[\,P \,\|\, \hat{P}\,] = \mathbb{E}[\log P(x) - \log \hat{P}(x)] = \int P(x)\,[\log P(x) - \log \hat{P}(x)]\,dx \quad (4.1)$$
$$= \underbrace{\int P(x) \log P(x)\,dx}_{\text{Negative Entropy}} \;\; \underbrace{-\int P(x) \log \hat{P}(x)\,dx}_{\text{Cross Entropy}}. \quad (4.2)$$
For any $P$ the Negative Entropy term is clearly fixed, regardless of $\hat{P}$. Following our previous intuition, it corresponds to the intrinsic surprise of $P$: the number of bits required to encode $P$ under the optimal encoding. The Cross Entropy term corresponds to the induced surprise of using $\hat{P}$'s encoding in place of $P$'s. It is thus our desired measure of surprise, and measuring Cross Entropy is the focus of our empirical method. Both terms have units of bits of information (or nats, depending on the base of the logarithm).

4.2.3 Fixed-KDE Cross Entropy

We construct a simple estimator of Cross Entropy using Kernel Density Estimation (KDE). Let $\{x_i\}_{i=1}^N$ be samples from a distribution $P$, and let $d(x_i, x_j)$ be a metric. The (unnormalized) exponential kernel function for distance $d$ is defined as
$$k(x_i, x_j) = \exp(-\sigma\, d(x_i, x_j)). \quad (4.3)$$
Using this function, we can then define an unnormalized KDE:
$$\hat{Q}(x_i) = \frac{1}{N} \sum_{j=1}^N k(x_i, x_j). \quad (4.4)$$
We write $Q(x_i)$ to emphasize that this distribution is off by a normalizing constant, and $\hat{Q}(x_i)$ to denote that this is an estimate of $Q$ based on $\{x_i\}_{i=1}^N$. We assume there exists a finite normalization constant $Z = \int Q(x)\,dx < \infty$. An approximation of $P$, the distribution of tracts, is then
$$\hat{P}(x) = \frac{1}{Z}\hat{Q}(x) = \frac{1}{ZN}\sum_{j=1}^N k(x, x_j). \quad (4.5)$$
Using Eq. 4.5, we can form an approximation to the Cross Entropy up to a constant:
$$H[\,\hat{P}(x); P(x)\,] = -\int P(x) \log \hat{P}(x)\,dx \quad (4.6)$$
$$= -\int P(x) \log \hat{Q}(x)\,dx + \text{Const} \quad (4.7)$$
$$\approx -\frac{1}{N}\sum_{x_i} \left[\, \log \frac{1}{N} \sum_{j=1}^N k(x_i, x_j) \,\right] + \text{Const}. \quad (4.8)$$

4.2.4 Calculating Tract Sample Cross Entropy

Several metric spaces have been proposed for streamlines [121, 55]. These allow us to apply our KDE estimator to streamline distributions. We use the Mean Direct Flip (MDF) distance of Garyfallidis et al. [55] as the distance metric in Eq. 4.3. We then estimate the Cross Entropy of samples relative to tract distributions in these spaces using Eq. 4.8. While Eq. 4.8 is always off by an unknown constant, the rate of information gain is preserved. This allows us to measure the marginal value (in bits) of increasing sample sizes. Further, for fixed $\sigma$ the objective information content is comparable across distributions.
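Concretely, Eq. 4.8 reduces to kernel evaluations over pairwise streamline distances. The following is a minimal sketch using Dipy's MDF distance; the function name and defaults are our own illustrative assumptions, and resampling each streamline to a fixed number of points is required by the MDF implementation.

```python
import numpy as np
from dipy.tracking.streamline import set_number_of_points
from dipy.tracking.distances import bundles_distances_mdf

def cross_entropy_estimate(eval_tracts, kde_tracts, sigma=1.0, n_points=20):
    """Estimate Eq. 4.8 (up to the constant log Z): the surprise of the
    evaluation sample under a fixed-bandwidth KDE built from kde_tracts."""
    xi = set_number_of_points(eval_tracts, n_points)   # outer samples x_i ~ P
    xj = set_number_of_points(kde_tracts, n_points)    # KDE support points x_j
    dists = bundles_distances_mdf(xi, xj)              # pairwise MDF distances
    kernel = np.exp(-sigma * dists)                    # Eq. 4.3
    return -np.mean(np.log(kernel.mean(axis=1)))       # Eq. 4.8
```

Plotting this quantity while growing `kde_tracts` (holding `eval_tracts` fixed and large) traces out the learning curves used below.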
As tracts are simulated and each observation is independent, many issues in our approximation are simplified. In particular, we separate the $\{x_i\}$ and $\{x_j\}$ samples, letting the number of $x_i$ be large ($\approx 10^6$) no matter the choice of $N$ (the number of $x_j$), in order to ensure a good estimate of the outer expectation. We can further separate the kernel computation step and the summation step, which allows for fast bootstrap estimates as well as faster computation for increasing sizes of $N$. Again, this coincides with Jordan et al. [78], who use a KDE-like form to assign a weight (density) to each streamline. The authors do not identify this as a KDE (and accordingly use a non-standard power-law kernel), but otherwise produce the same method.

4.2.5 Scale Considerations

Note that each KDE uses a fixed, un-tuned bandwidth $1/\sigma$. Thus, our KDE converges to $P * K$ instead of $P$, where $K(x) = Z^{-1} k(x, 0)$. We believe this to be a feature and not a bug. First, though the tractography simulation is often conducted with high numerical precision (using, e.g., floating point representations), the trajectories of streamlines are determined in part by the diffusion measurements, which are ill-resolved relative to machine precision. Second, later analyses of streamline distributions are rarely conducted at a local scale. Fixing $\sigma$ allows "convergence" to be contextualized to a particular spatial scale, albeit after the choice of a tract metric. Thus, if the practitioner does not need to resolve small-scale differences, a corresponding choice of $\sigma$ can be made.

4.3 Empirical Assessment

We apply the Cross Entropy estimator to 44 test subjects from the publicly available Human Connectome Project dataset [153]. We used the minimally preprocessed T1-weighted (T1w) and diffusion-weighted images (DWI) rigidly aligned to MNI 152 space. Briefly, the preprocessing of these images included gradient nonlinearity correction (T1w, DWI), motion correction (DWI), eddy current correction (DWI), and linear alignment (T1w, DWI). We used the HCP Pipeline FreeSurfer [45] protocol, running the recon-all pipeline.

Tractography was conducted using the DWI in 1.25-mm isotropic MNI 152 space. Three separate types of tractography were performed, using three separate local models and two different tracking methods, all implemented in Dipy [56]. The local models were as follows:

1. Constrained Spherical Deconvolution (CSD) [150] with a harmonic order of 8;
2. Constant Solid Angle (CSA) [2] with a harmonic order of 8;
3. Diffusion Tensors (DTI) (see, e.g., [91]).

Probabilistic streamline tractography was performed using Dipy's probabilistic direction getter for the CSD and CSA local models, while deterministic tracking was performed using EuDX [54] for DTI and the Local Tracking module for CSD. For each local model, streamlines were seeded at 2 million random locations (total) in voxels labeled as likely white matter by FSL's FAST [173], and propagated in 0.5 mm steps, constrained to at most 30° angles between segments. Each set of tracts was split into two equally sized subsets. Cross entropy was then calculated using Eq. 4.8 for five separate length-scale choices $\sigma \in \{\tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4\}$, using 1000-streamline intervals to allow for parallelization and efficient storage. This portion of the procedure was written in C++ with OpenMP.
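For concreteness, the probabilistic CSD arm of this pipeline looks roughly like the sketch below. It is written against recent Dipy interfaces, whose module layout differs from the version used at the time; the variable names (`data`, `gtab`, `wm_mask`, `affine`) and parameter defaults are assumptions rather than a verbatim record of our processing code.

```python
from dipy.data import default_sphere
from dipy.direction import ProbabilisticDirectionGetter
from dipy.reconst.csdeconv import (ConstrainedSphericalDeconvModel,
                                   auto_response_ssst)
from dipy.tracking.local_tracking import LocalTracking
from dipy.tracking.stopping_criterion import BinaryStoppingCriterion
from dipy.tracking.streamline import Streamlines
from dipy.tracking.utils import random_seeds_from_mask

# data: 4D DWI array; gtab: gradient table; wm_mask: FAST white matter labels.
response, _ = auto_response_ssst(gtab, data, roi_radii=10, fa_thr=0.7)
csd_model = ConstrainedSphericalDeconvModel(gtab, response, sh_order=8)
csd_fit = csd_model.fit(data, mask=wm_mask)

# Probabilistic direction getter, limited to 30-degree turns between steps.
dg = ProbabilisticDirectionGetter.from_shcoeff(csd_fit.shm_coeff,
                                               max_angle=30.0,
                                               sphere=default_sphere)

# Random seeds in the white matter; tracking stops upon leaving the mask.
seeds = random_seeds_from_mask(wm_mask, affine, seeds_count=2_000_000,
                               seed_count_per_voxel=False)
stop = BinaryStoppingCriterion(wm_mask)
streamlines = Streamlines(LocalTracking(dg, stop, seeds, affine,
                                        step_size=0.5))
```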
In the top row of Fig. 4.1 we plot the cross entropy (surprise) as a function of the number of samples when sampling at random from within the white matter mask. The curve for $\sigma = 0.25$ is nearly flat, while for $\sigma = 0.5$ and $\sigma = 1$ the curves quickly bottom out after a short, steep learning period. As the length scale increases, the learning rate becomes quite slow. At $\sigma = 4$ the curves begin to bottom out at 250,000 samples. For each of these scales, however, no changes occur after 500,000 samples. If there is relevant signal in these samples that is not contained on average in the previous half-million samples, it carries a very small relative amount of information about the distribution of tracts.

In general, for each length scale, both deterministic methods have less entropy than the probabilistic methods. Further, DTI converges (flattens) much faster than the other methods. These results match our prior intuitions about these methods. Probabilistic CSD has the largest information content, though we temper this observation: in general, noise adds information, albeit useless information. Thus this is not by itself a measure of quality, though it does differentiate the method as certainly sampling a tract distribution, unlike the others.

In the bottom row of Fig. 4.1 we plot the incremental (marginal) entropy rate. For each length scale and each method, the marginal information gained per thousand streamlines after the initial learning period is very low; this is unsurprising, but raises questions about the utility of sampling past several million tracts.

Figure 4.1: Top row: the self cross entropy (surprise) of each method as a function of the number of tracts, conditioned on the length scale (given by the plot colors). Bottom row: the incremental (marginal) entropy gain per 1000 streamlines, also conditioned on the length scale. In both rows we eschew plotting the x-axis to the full one million samples, as the unplotted section is essentially flat.

In order to build intuition for the qualitative differences between values of $\sigma$, we plot in Fig. 4.2 the spatial extent of the kernel function centered on one streamline from the inferior longitudinal fasciculus, for each of the five length scales tested. While in general the spatial extent is difficult to visualize, we can demonstrate its support practically by plotting all sampled tracts above a threshold, in this case 5% of the function maximum. As can be seen in the figure, for low values of $\sigma$ the spatial extent is quite large, while for larger values of $\sigma$ the function is extremely peaked.

Figure 4.2: A streamline from the inferior longitudinal fasciculus, plotted with every tract above 5% of the maximum kernel amplitude for each given $\sigma$. At $\sigma = 4$ only the target tract itself is left.

The learning curves in Fig. 4.1 are artificially smooth due to the binning of the number of samples by thousands; while in each plot the curves are monotonic, this is an overall trend (at the scale of thousands of samples) and not true locally (at the scale of single samples).

We performed a second experiment using the same tracking parameters, but seeding only in a single voxel at a time. For each voxel in the white matter we generated two tractograms (one train set, one test set), seeding only in that voxel, 100 times. We then computed the cross entropy between the two tractograms as a function of the number of samples in the training set. In Fig. 4.3 we plot the marginal change in entropy over the white matter mask for one subject, both at the start of the process and at the end. For many regions the sampling process converges very quickly.
However, for certain regions the sampling process converges quite slowly (albeit without considering samples from other voxels). This suggests, at surface level, high variance in the complexity of the white matter when aggregated into voxels. Further, it provides evidence that a weighted seeding scheme might be more efficient than gridded or uniformly random schemes.

4.4 Discussion

Our results suggest that, in general, the number of tracts required to closely approximate the underlying distribution is low using the methods tested here. This number can only be reduced by conditioning tracts or seeds on specific anatomic priors (e.g., seeding only in the corpus callosum or in a specific region). Further, the Human Connectome Project [153] data are widely regarded as exemplary of the state of the art. It is our hypothesis that other datasets may have less signal content, leading to sharper (steeper) curves and even faster convergence (as more signal content would, we hypothesize, better resolve complex crossing-fiber regions).

Our theoretical analysis and method are also inherently limited: we cannot differentiate between "valid" signal and spurious systemic error. By sampling to the scales described in this chapter, researchers can accurately recover the streamline distribution (up to the chosen spatial scale). If this distribution includes "false" modes, i.e. modes that do not arise from a corresponding biophysical structure, we will also consistently recover those false modes. Larger sample sizes in simulations do nothing to guard against false signal; modes of both types will be reinforced, while random fluctuations will attenuate.

Consider a tractography method that, when queried ten times, generates complete noise nine times and a real tract one time, on average. (The definition of complete noise is actually tricky, but we could use the proxy of drawing a tract length uniformly at random between 1 and the number of voxels, then filling its sequence with points drawn uniformly from the domain of the image.) Clearly this method is very noisy (90% useless!) and would not do well in major tractography challenges (e.g., [99]), but, we claim, this hypothetical method would still have value. Since the number of real tracts is finite and the number of possible noise tracts is comparatively large, increasing the sample size will lead to each real tract having more corresponding samples than the noise tracts. Given enough time we may recover the set of real tracts.

Further consider a tractography method that, when queried ten times, generates complete noise eight times, a real tract one time, and a false but non-random tract the remaining time, on average. Like the first hypothetical method, this second method produces 90% useless tracts. However, increasing the sample size will not lead to vanishing error in this second case, due to the false mode (the non-random false tract). Clearly no method is purely in the first case; all methods likely have false modes. However, we posit that understanding the number, location, and size of these false modes is just as important as measuring the false positive rate. It is illustrative that the F1-score is unable to differentiate between these two hypothetical methods. We believe such considerations should be taken into account, in that valuable methods might be slow to converge, and thus currently ignored. Further, we believe that in undersampled regimes the assessment of signal peaks is inherently inaccurate.
It is important to note that "oversampling" carries only computational drawbacks. Aside from the time and disk space costs, there is no danger in sampling more under i.i.d. conditions, e.g. 10,000,000 tracts [135]. Further, most if not all downstream analyses will not be completely information efficient (that would be akin to an electrical machine being completely energy efficient); thus, sampling further may be required for later methods applied to the tractogram. Our results are lower bounds, and are not meant to be the end-all-be-all for the number of tracts to sample. "Undersampling", however, i.e. using a number of samples below the start of the converged regime, is dangerous, and risks producing spurious results without other diagnostic warning. By the data processing inequality, the amount of information in the tractogram is the maximum amount of information about the tract-generating distribution available to downstream methods.

Figure 4.3: Displayed are slices of the white matter of one subject, centered on (60, 85, 61) (MNI). Colors denote the average change in entropy per marginal tract in each voxel, conditioned on seeding in that voxel. The left column shows the average change for the first five tracts seeded in each voxel, while the right column shows the average for tracts 96-100. Most regions converge almost immediately, but specific regions converge very slowly. Note that these measurements do not account for similarity to tracts seeded in other voxels.

Part II
Invariance Representations, with Applications

Chapter 5
Invariance Methods Overview

An invariant representation is a function or mapping of some data that does not change, usually conditioned on a context. For example, summing a list of numbers (1, 2, 3, 4) is invariant to their order: here, the mapping is the summation, and the context is the order of the list.

The removal of unwanted information is a surprisingly common task. Transform-invariant features in computer vision, "fair" encodings from the algorithmic fairness community, and the two-stage regressions often used in scientific studies are all cases of the same general concept: we wish to remove the effect of some outside variable $c$ on our data $x$, while remaining relevant to our original task. In the context of representation learning, we wish to map $x$ into an encoding $z$ that is uninformative of $c$, yet also optimal for our task loss $\mathcal{L}(z, \dots)$.

These objectives are often operationalized as an independence constraint, $z \perp c$. Features and encodings satisfying this condition are invariant under changes in $c$, and are thus called "invariant representations". Historically, for spatial $c$, i.e. translation, scale, and more general affine transformations, features have been carefully constructed to be analytically invariant. Representation learning has since ushered in the desire to learn features automatically, yet also constrain them to be approximately invariant. Thus, in practice, constraints on learned encodings are often relaxed to other measures. In this chapter we first take a short review of invariant features and their applications, then describe adversarial training, one particular relaxation common in the literature, transplanted from the generative literature to encoder/decoder settings. In the following chapter we describe our own advances in analyzing this method, and push beyond adversarial training using a novel bound derived from information theory.
5.1 Review of Invariance Methods

5.1.1 Invariance for c uncorrelated with y (nuisances)

Invariance is often a highly desirable property of data representations, particularly in computer vision, where the desired invariances are frequently under physical or stereographic transformations: translations, scales, and general affine transformations. Examples include Steerable Filters [51, 65], the Scale Invariant Feature Transform (SIFT) descriptors [96] (which in turn built on a long history of scale-space matching methods), Shape Spectral Methods [171], as well as shared-weight convolutions [63] and data augmentation [63]. Though the preferred methods change over time, the intuition is approximately the same for vision tasks: the desired output of a specific task should be invariant to changes in the field of view. Put more plainly, no matter how a person is holding a camera or the shape of its aperture, we would like our machine to identify the cats or dogs, or where the person is in the picture, or where they are walking to or from. Invariance in this context can be analytically modeled (affine transformations or subsets thereof) and, roughly speaking, does not change the outcome (cats translated to the left are still cats).

Borrowing terminology from abstract algebra, invariance in vision usually looks for a quotient representation, collapsing group orbits (e.g. rotations) to single equivalence classes; all rotations of a cat are valid cats in these constructions. Filters and features can then be constructed specifically with this assumption in mind, analytically designed to be invariant to these transformations. Cohen et al. [31] codify this algebraic intuition into a working method, introducing a group-invariant convolution in which the group action is known and analytic, and the group is finite.

Following this algebraic intuition, we can view data augmentation as enforcing these assumptions through the training data. Data augmentation is the application of a priori known transformations to extant data in order to create "new" data points. In effect this process fills out the orbits under which we assume the label (e.g. dog vs. cat) is invariant. This forces generic classifiers without invariance mechanisms to ignore the chosen orbits; since position along a chosen orbit artificially carries no information about the label, efficient classifiers will not include features derived from it (i.e. the orientation of the object will no longer have an effect on its predicted class).

Models including "nuisance" factors were also considered by Soatto and Chiuso [136], who propose definitions of both nuisances and invariant representations, following the group theoretic concept of nuisances. More abstractly, it can be shown that maximally information-efficient features $z$ for tasks $y$ that are uncorrelated with nuisances $c$ do not include information about $c$. This result can be seen from analysis of the information bottleneck [1], but can also be described intuitively without symbol manipulation: if prediction of $y$ does not depend on $c$, then when learning features to predict $y$, including information about $c$ does not improve performance. The removal of $c$-information should then occur in well-performing classifiers in general, not just those explicitly built to remove it. Achille and Soatto [1] directly utilize these results, providing the same criterion and relaxation for invariance that we will use here: minimal mutual information between the representation $z$ and the covariate $c$.
While nuisances form only a small subsection of their paper, the authors propose and test a sampling-based approach to learning invariant representations. Their method is predicated on the ability to sample from the nuisance distribution (e.g. adding occlusions to images).

5.1.2 Invariance for c correlated with y

A separate class of invariant features exists which, unlike the previous class, do affect the outcome variables $y$ (at least with respect to our empirically collected data), yet are nevertheless desirable to remove. Examples include: accounting post hoc for uncontrollable observational conditions (instruments, experimental conditions) which could not be harmonized (made uniform) by the experimenters; conditions that affect data collection, leading to imbalances in training datasets (upside-down cats are rare, but in some systems it is desirable to still label such upside-down examples as cats); and situations in which the actual data distributions do differ in label conditioned on context ($p(y|x,c) \neq p(y|x,c')$).

The removal of covariate factors from scientific data has a long history. Observational studies in the sciences often cannot control for every factor influencing subjects, and a large literature has thus been generated on the topic of removing such factors after data collection. Simple statistical techniques usually involve modifying analyses with corresponding covariate effects [128], or sometimes multi-level regressions [50]. Modern methods have included more nuanced feature generation and more complex models, but follow the same vein [44, 49]. "Regressing out" or "controlling for" unwanted factors can be effective, but places strong constraints on later analyses (namely, the observation of the same unwanted covariates).

Invariance to correlated factors has also been a focus of the algorithmic fairness community [80, 167]. Derived in part from the desire to avoid discriminating against protected classes of individuals (and in part to avoid breaking laws and/or becoming the subject of a civil suit), the objective of these methods is to preserve task accuracy (usually classification or regression) while removing bias against the protected class. Relevant to our own work are the methods proposed by Louizos et al. [95] and Xie et al. [162], which have a similar problem setup: both generate representations that make it difficult for an adversary to recover the protected class, but that are still useful for a classification task. Louizos et al. propose the "Variational Fair Auto-Encoder" (VFAE), which, as its name suggests, modifies the VAE of Kingma and Welling [86] to produce fair encodings, as well as a supervised variant providing fair classifications. (The definition of "fair" in an algorithmic setting is a matter of some debate. A fair encoding here is one uninformative of protected classes; we offer no opinion on whether this is truly "fair" in a general or legal sense, taking the word at face value as used by Louizos et al.) Xie et al. combine this concept with adversarial training, adding (inverted) gradient information to produce fair representations and classifications. This adversarial solution coincides exactly with the conceptual framework used in a computer vision application for constructing Fader Networks [90].

5.1.3 Adversarial Invariance

Construction of $c$-invariant features for distributions where $c \perp y$ does not hold has been explored in deep learning using adversarial training.
Re-appearing at least three separate times in different subfields (domain adaptation [53], vision [90], and fairness [162]), and while the actual architectures vary slightly, the three methods of Ganin et al. [53], Lample et al. [90], and Xie et al. [162] have similar if not identical mechanisms: for a general learned representation $z$ from an encoder $q: x \to z$, create an "adversarial network" $\mathcal{A}: z \to c$ that attempts to predict $c$ from $z$. When training $q$, we can penalize configurations of $q$ that allow $\mathcal{A}$ to predict $c$ well. Since $q$ is constantly being updated, $\mathcal{A}$ must also be updated to remain accurate. In order to pass training information (gradients) to $\mathcal{A}$, we need either a second phase of training (as in, e.g., Fader Networks [90]) or another mechanism, such as the "gradient flip trick" of Ganin et al. [53] and Xie et al. [162].

More rigorously, we can derive the adversarial invariance loss as an add-on to generic feature learning criteria. Consider some general loss $\mathcal{L}[q(z|x)]$ for a learned feature mapping (encoder) $q: x \to z$. One example is the auto-encoder loss (requiring a decoder $p: z \to x$): $\mathcal{L} = \|p(x|q(z|x)) - x\|$. We would also like $z$ to be minimally informed of $c$; this might be codified as a minimax loss:
$$\mathcal{L}_{\text{inv}} = \min_q \max_{\mathcal{A}} \; -\big\|\mathcal{A}(q(z|x)) - c\big\|, \quad (5.1)$$
where $\mathcal{A}$ again represents a predictor of $c$ (the "adversary"). We need to train $\mathcal{L}_{\text{inv}}$ at the same time as the main loss (some other abstract $\mathcal{L}$). Two primary methods have surfaced, both suitable for first-order optimization techniques:

1. Two-step optimization: Commonly used in most adversarial settings [63], this method alternates updates between $\mathcal{L}$ and $\mathcal{L}_{\text{inv}}$. It can be tuned with a so-called "critic number", i.e. the number of $\mathcal{L}_{\text{inv}}$ iterations per $\mathcal{L}$ iteration. For reasons we show in the next chapter, this number should usually be greater than 1.

2. Gradient Reversal (sketched below): Introduced by Ganin et al. [53], this method takes the gradients passed by $\partial \mathcal{L}_{\text{inv}} / \partial \mathcal{A}$ and multiplies them by $-1$ before passing them onward to the primary network (via the chain rule). It is used both by Ganin et al. [53] and Xie et al. [162], but is avoided in the vision literature.

The main advantages of these methods are that they are very easy to implement (though sometimes difficult to train), general enough for almost any $c$, and compatible with almost every other architecture. As a rule of thumb, if $c$ could be accurately predicted using a neural network, then we can make an adversary for $c$.
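As a concrete illustration of the gradient flip trick, the following is a minimal PyTorch sketch (our own, not taken from the cited papers): an autograd function that is the identity on the forward pass and negates, and optionally scales, gradients on the backward pass.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; multiplies incoming gradients by -lam on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage: the adversary sees z normally, but the encoder receives reversed
# gradients, so minimizing the adversary's loss pushes q to hinder it:
#   c_logits = adversary(GradReverse.apply(z, 1.0))
```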
5.1.4 Equivariance versus Invariance

In some cases, the transformation changes the outcome by an analytically tractable function (tracking cats translated to the left means that object markers should similarly be translated). This is not a true invariance (i.e. for a function $f$, $p(y|x) \neq p(y|f(x))$), but has instead been referred to as an equivariance: for a translating function $f$ there should be a corresponding "label translation" $g$ so that $p(y|x) = p(g(y)|f(x))$ [31].

5.1.5 Comparing and Contrasting Adversarial Invariance with the GAN Literature

Adversarial invariance methods share their namesake mechanism with Generative Adversarial Networks (GANs) [64]. While these models have very similar properties and training mechanisms, their differences are non-trivial and lead to separate behaviors. GANs are generative models which sample from a known distribution $z$, then transform $z$ to $x_{\text{gen}}$ through some (highly non-linear) function $G$ (usually a neural network). This $x_{\text{gen}}$ should be close in distribution to an empirical data distribution $x_{\text{emp}}$. To achieve this, during training a separate adversary function $D$ is trained to tell apart $x_{\text{gen}}$ from $x_{\text{emp}}$. Using computational methods identical to those of adversarial invariance, GANs extract learning gradients from $D$ to train $G$ [63]. For each step of training of $G$, $D$ is subsequently trained as well, possibly for many more epochs than $G$.

In the original GAN paper (Goodfellow et al. 2014 [64]), the authors show that under cross-entropy loss the optimal adversary is $p_{\text{gen}}/(p_{\text{gen}} + p_{\text{emp}})$. Contrasting this with the adversarial invariance case, this function essentially describes $p(m|x)$, where $m$ is a coin flip between $x_{\text{gen}}$ and $x_{\text{emp}}$, and $x$ is drawn from their (even) mixture. In the adversarial invariance case we instead approximate $p(c|x)$, where $c$ is arbitrarily correlated with $x$.

Because of this difference, analyses of GAN training dynamics, asymptotics, etc. do not necessarily carry over to adversarial invariance architectures, and vice versa. In particular, we will derive a statement about adversaries showing that the adversarial loss under the best of all adversaries is equivalent, up to a constant, to $I(z;c)$. The corresponding statement about GANs involves the term $I(m;x)$, which a) lives on the data domain instead of the latent domain, and b) involves the auxiliary variable $m$, which does not actually appear in the data (i.e. $m$ is an artifact of the GAN training scheme). This means that in GANs we have much stronger control over $m$ than we do over the empirically defined $c$ variable, and much less control over the marginal $p(x_{\text{emp}})$, and thus $p(x)$, than we do over $p(z)$. In practice, this may translate to differing error modes and design constraints.

Further, many evaluations of GANs are qualitative, as $x$ is a human-readable medium (images, text, etc.). There are thus acceptable GAN solutions that may not be acceptable adversarial invariance solutions. Predictive bias in invariance criteria is not qualitative and, due to its reliance on correctly estimating $p(c|x)$, is not always quantitatively accurate.

Finally, because the original formulations of GANs are in a generative context, there is no bottleneck behavior to rely on or even consider: adversarial invariance seeks to perform a task $y$ using data $x$ through an $x \to z$ feature transform, while GANs take $z$ from a known distribution and create a synthetic $x$, which is a different task. A more comparable architecture is the family of X-to-Y GAN mappings, e.g. image-to-image mappings [73], referred to as conditional GANs; these take inputs $x$ and random samples from a known distribution $z$, and output generated data $y$.

Due to the prevalence of GANs in the neural network literature after 2014, it is fairly common to compare architectures to them qualitatively, especially if they include an adversarial training mechanism. The reader is advised to use caution here, as not all results in one domain have equivalent results in the other, much less similar practical implications.

Chapter 6
Invariant Representations without Adversarial Training

In this chapter we replace the adversarial relaxation of $z \perp c$ with a penalty on the mutual information $I(z;c)$, a different but equally valid relaxation. We provide an analysis of this loss, showing that:

1. $I(z;c)$ admits a useful variational upper bound. This is in contrast to the usual lower bounds, e.g. bounds on $I(z;y)$ for some labels $y$.
2. When placed alongside the Variational Auto-Encoder (VAE) and Variational Information Bottleneck (VIB) frameworks, the upper bound on $I(z;c)$ produces a computationally tractable form for learning $c$-agnostic encodings and predictors.

3. The adversarial approach can also be derived as a procedure to minimize $I(z;c)$, but it does not provide an upper bound.

Our proposed methods have the practical advantage of requiring $c$ only at training time, and not at test time. They are thus viable for production settings where accessing $c$ is expensive (requiring human labeling), impossible (requiring underlying transformations), or legally inadvisable (sharing protected data). On the other hand, our method also produces a conditional decoder taking both $c$ and $z$ as inputs. While for some purposes this might be discarded at test time, it can also be manipulated to function similarly to Fader Networks [90], in that we can generate realistic-looking transformations of an input image where some class label has been altered.

We empirically test our proposed $c$-agnostic VAE and VIB methods on two standard "fair prediction" tasks, as well as an unsupervised learning task demonstrating Fader-like capabilities. The content of this chapter was published as [112].

6.1 Constructing Codes with Limited Mutual Information

Consider a general task that includes an encoding of observed data $x$ into latent variables $z$ through the conditional likelihood $q(z|x)$ (an encoder). Further assume that we observe a variable $c$ which exhibits statistical dependence with $x$ (possibly non-linear dependence). We would like to find a $q(z|x)$ that minimizes our loss function $\mathcal{L}$ from the original task, but that also produces a $z$ independent of $c$. This is clearly a difficult optimization; independence is a very strong condition. A natural relaxation is the minimization of the mutual information $I(z;c)$. We can write our relaxed objective as
$$\min_q \; \mathcal{L}(q; x) + \lambda\, I(z;c), \quad (6.1)$$
where $\lambda$ is a trade-off parameter between the two objectives. $\mathcal{L}$ might involve other variables as well, e.g. labels $y$. Without details of $\mathcal{L}$ and its associated task, we can still provide insight into $I(z;c)$. Before continuing, it is important to note that all entropic quantities related to $z$ are with respect to the encoding distribution $q$ unless explicitly stated otherwise; where entropies depend on prior distributions $p(z)$, this is explicitly noted.

From the properties of mutual information, we have that $I(z;c) = I(z;x) - I(z;x|c) + I(z;c|x)$. Here, we note that $q(z|x)$ is the function we are optimizing over, and thus the distribution of $z$ depends solely on $x$. Thus,
$$I(z;c|x) = H(z|x) - H(z|x,c) = H(z|x) - H(z|x) = 0. \quad (6.2)$$
Using mutual information identities and a variational inequality, we can then write the following:
$$I(z;c) = I(z;x) - I(z;x|c) \quad (6.3)$$
$$= I(z;x) - H(x|c) + H(x|z,c) \quad (6.4)$$
$$\leq I(z;x) - H(x|c) - \mathbb{E}_{x,c,z\sim q}[\log p(x|z,c)] \quad (6.5)$$
$$= \mathbb{E}_{z,x}[\log q(z|x) - \log q(z)] - H(x|c) - \mathbb{E}_{x,c,z\sim q}[\log p(x|z,c)] \quad (6.6)$$
$$= \mathbb{E}_x[\,\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,]\,] - H(x|c) - \mathbb{E}_{x,c,z\sim q}[\log p(x|z,c)]. \quad (6.7)$$
$H(x|c)$ is a constant and can be ignored. In Eq. 6.5 we introduce the variational distribution $p(x|z,c)$, which will play the traditional role of the decoder. $I(z;c)$ is thus bounded, up to a constant, by a divergence and a reconstruction error. The result is similar in appearance to the bound from Variational Auto-Encoders [86], wherein we balance the divergence between $q(z|x)$ and a prior $p(z)$ against a reconstruction error. Here our penalty on $I(z;c)$ amounts to encouraging $q(z|x)$ to be close to its marginal $q(z)$, i.e. to vary less across inputs $x$, no matter the form of $q(z|x)$ or $q(z)$.
From a coding viewpoint, our penalty encourages compressing $x$-information out of $z$ through the $I(z;x)$ term of Eq. 6.5. In both interpretations, these penalties are tempered by the conditional reconstruction error. This provides additional intuition: by adding a copy of $c$ into the reconstruction, we ensure that compressing away information in $z$ about $c$ is not penalized. In other words, conditional reconstruction combined with compressive regularization leads to invariance with respect to the conditional input.

6.1.1 Invariant Codes through VAE

We apply our proposed penalty to the VAE of Kingma and Welling [86], inspired by the similarity of the penalty in Eq. 6.7 to the VAE loss function. The original VAE stems from the classical unsupervised task of constructing latent factors $z$ so that $p(z), p(x|z)$ define a generative model that maximizes the log likelihood of the data, $\mathbb{E}_x[\log p(x)]$. This generally intractable expression is lower-bounded using Jensen's inequality and a variational approximation:
$$\log p(x) \geq -\mathrm{KL}[\,q(z|x)\,\|\,p(z)\,] + \mathbb{E}_{z \sim q(z|x)}[\log p(x|z)]. \quad (6.8)$$
Kingma and Welling [86] frame $q(z|x)$ and $p(x|z)$ as an encoder/decoder pair. They then provide a re-parameterization trick that, when used with standard function approximators (neural networks), allows for efficient estimation of latent codes $z$. In short, the reparameterization is the following:
$$q(z|x) = g_\theta(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Sigma_\theta(x)), \quad (6.9)$$
where $g_\theta$ is a deterministic function (neural network) with parameters $\theta$, and $\varepsilon$ is an independent random variable from a Normal distribution, also parameterized by $\theta$. (In the original paper this was defined more generally; here we consider only the Normal distribution case.)

We can reformulate Kingma and Welling's VAE to include our penalty on $I(z;c)$. Define data $\{x_i\}_{i=1}^N$ and latent factors $\{z_i\}$, but also define observed $\{c_i\}$ upon which $x$ may have non-trivial dependence. That is,
$$p(x, z, c) = p(z, c)\, p(x|z, c). \quad (6.10)$$
The invariant coding task is to find $q(z|x), p(x|z,c)$ that maximize $\mathbb{E}_{(x,c)}[\log p(x|c)]$, subject to $z \perp c$ under $z \sim q$ (i.e. subject to the estimated code $z$ being invariant to $c$). We make the same relaxation as in Eq. 6.1 to formulate our objective:
$$\max \;\; \mathbb{E}_{(x,c)}[\log p(x|c)] - \lambda\, I(z;c). \quad (6.11)$$
Starting with the first term, we can derive a familiar-looking encoder/decoder loss function that now includes $c$:
$$\log p(x|c) = \log \int p(x, z|c)\,dz \quad (6.12)$$
$$= \log \int \frac{p(x, z|c)}{q(z|x)}\, q(z|x)\,dz = \log \mathbb{E}_{z\sim q}\!\left[\frac{p(x, z|c)}{q(z|x)}\right] \quad (6.13)$$
$$\geq \mathbb{E}_{z\sim q}[\log p(x, z|c) - \log q(z|x)] \quad (6.14)$$
$$= \mathbb{E}_{z\sim q}[\log p(z|c) - \log q(z|x) + \log p(x|z,c)]. \quad (6.15)$$
Because $p(z|c)$ is a prior, we can make the assumption that $p(z|c) = p(z)$, the prior marginal distribution over $z$. This is a willful model misspecification: for an arbitrary encoder, the latent factors $z$ are probably not independent of $c$. However, practically we wish to find $z$ that are independent of $c$, so it is reasonable to include such a prior belief in our generative model. Under this assumption, we have
$$\log p(x|c) \geq \mathbb{E}_{z\sim q}[\log p(z) - \log q(z|x)] + \mathbb{E}_{z\sim q}[\log p(x|z,c)] \quad (6.16)$$
$$= -\mathrm{KL}[\,q(z|x)\,\|\,p(z)\,] + \mathbb{E}_{z\sim q}[\log p(x|z,c)]. \quad (6.17)$$
This is almost exactly the VAE objective of Eq. 6.8, except that our decoder $p(x|z,c)$ requires $c$ as well as $z$. Putting this together with the penalty term of Eq. 6.7, we have the following variational bound on the combined objective (up to a constant):
$$\mathbb{E}_{(x,c)}[\log p(x|c)] - \lambda I(z;c) \;\geq\; \mathbb{E}_{(x,c)}\big[ -\mathrm{KL}[\,q(z|x)\,\|\,p(z)\,] - \lambda\,\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,] + (1+\lambda)\,\mathbb{E}_{z\sim q}[\log p(x|z,c)] \big]. \quad (6.18)$$
We use this bound to learn $c$-invariant auto-encoders.
6.1.1.1 Derivation of an Approximation for the Conditional-Marginal Divergence

Equation 6.18 is our desired loss function for learning invariant codes $z$ in an unsupervised context. Unfortunately, it contains $q(z)$, the empirical marginal distribution of the latent code $z$, which is difficult to compute. Using the re-parameterization trick, $q(z)$ becomes a mixture distribution, and this allows us to approximate its divergence from $q(z|x)$:
$$\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,] = -H(q(z|x)) - \int q(z|x) \log \int q(z|x')\,p(x')\,dx'\,dz \quad (6.19)$$
$$= -H(q(z|x)) - \int q(z|x) \log \mathbb{E}_{x'}[q(z|x')]\,dz \quad (6.20)$$
$$\leq -H(q(z|x)) - \mathbb{E}_{z\sim q(\cdot|x)}\big[\mathbb{E}_{x'}[\log q(z|x')]\big] \quad (6.21)$$
$$\approx -H(q(z|x)) - \frac{1}{N}\sum_{x'} \mathbb{E}_{z\sim q(\cdot|x)}[\log q(z|x')] \quad (6.22)$$
$$= \frac{1}{N}\sum_{x'} \big[-H(q(z|x)) - \mathbb{E}_{z\sim q(\cdot|x)}[\log q(z|x')]\big] \quad (6.23)$$
so that, averaging over $x$,
$$\mathbb{E}_x\big[\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,]\big] \;\lessapprox\; \frac{1}{N^2}\sum_{x}\sum_{x'} \underbrace{\mathrm{KL}[\,q(z|x)\,\|\,q(z|x')\,]}_{\text{KL between Gaussians}}. \quad (6.24)$$
We can thus approximate $\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,]$ from the pairwise divergences $\mathrm{KL}[\,q(z|x)\,\|\,q(z|x')\,]$, all of which have closed forms due to the reparameterization trick. While this requires $O(b^2)$ operations for batch size $b$, pairwise Gaussian KL divergence reduces to matrix algebra, making this computation fast in practice.

This further provides insight into the previously proposed Variational Fair Auto-Encoder of Louizos et al. [95]. In that paper the authors add a Maximum Mean Discrepancy penalty as a somewhat ad hoc regularizer. This nevertheless works quite well in practice, as it encourages the statistical moments of each $q(z|c)$ to be the same across the varying values of $c$. Our condition on $\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,]$ has equivalent minima, and shares the "$q$-regularizing" flavor of the MMD penalty.
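In code, the approximation of Eq. 6.24 and the objective of Eq. 6.18 amount to a few lines of matrix algebra. The PyTorch sketch below is our own illustration, not the original experiment code: it assumes a reparameterized encoder returning diagonal-Gaussian parameters `(mu, logvar)`, a conditional decoder taking `(z, c)`, and a Gaussian reconstruction term; the names and defaults are ours.

```python
import torch
import torch.nn.functional as F

def pairwise_gaussian_kl(mu, logvar):
    """(b, b) matrix of KL[q(z|x_i) || q(z|x_j)] for diagonal Gaussians (Eq. 6.24).
    Entry (i, j) = 0.5 * sum_k [var_i/var_j + (mu_i - mu_j)^2/var_j - 1
                                + logvar_j - logvar_i]."""
    var = logvar.exp()
    inv_var = 1.0 / var
    trace = var @ inv_var.t()
    mahal = (mu ** 2) @ inv_var.t() - 2.0 * mu @ (mu * inv_var).t() \
            + ((mu ** 2) * inv_var).sum(dim=1).unsqueeze(0)
    logdet = logvar.sum(dim=1).unsqueeze(0) - logvar.sum(dim=1).unsqueeze(1)
    return 0.5 * (trace + mahal - mu.shape[1] + logdet)

def invariant_vae_loss(x, c, encoder, decoder, lam):
    """Negative of the bound in Eq. 6.18, suitable for minimization."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # Eq. 6.9
    recon = F.mse_loss(decoder(z, c), x, reduction="sum")    # -E[log p(x|z, c)]
    kl_prior = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    kl_marginal = pairwise_gaussian_kl(mu, logvar).mean()    # Eq. 6.24 estimate
    return (1 + lam) * recon + kl_prior + lam * kl_marginal
```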
6.1.1.2 An Alternate Derivation Leads to the Adversarial Loss

In Equation 6.3 we used the identity $I(z;c) = I(z;x) - I(z;x|c) + I(z;c|x)$, with the caveat that the third term, $I(z;c|x)$, is zero. We could instead have used another identity, $I(z;c) = H(c) - H(c|z)$. Here the first term is constant, but expanding the second term provides the following:
$$H(c|z) = \mathbb{E}_{c,z\sim q}[-\log p(c|z)] \quad (6.25)$$
$$= \inf_{r(c|z)} \mathbb{E}_{c,z\sim q}[-\log r(c|z)], \quad (6.26)$$
$$\mathbb{E}_{(x,c)}[\log p(x|c)] - \lambda I(z;c) \;\geq\; \mathbb{E}_{(x,c)}\big[-\mathrm{KL}[\,q(z|x)\,\|\,p(z)\,] + \mathbb{E}_{z\sim q}[\log p(x|z,c)]\big] + \lambda \inf_{r(c|z)} \mathbb{E}_{c,z\sim q}[-\log r(c|z)], \quad (6.27)$$
where the last inequality again holds up to a constant term. Interpreting this in machine learning parlance: another possible approach to minimizing $I(z;c)$ is to optimize the conditional distribution so that $r$, the lowest-entropy predictor of $c$ given $z$, has the highest possible entropy (i.e. is as inaccurate as possible at predicting $c$). This is often operationalized by adversarial learning, and subsequent error may be due in part to the adversary not achieving the infimum. Practically speaking, this indicates that over-training adversaries may benefit performance, by bringing the adversarial gradient closer to that of the infimum adversary.

6.1.2 Supervised Invariant Codes through the Variational Information Bottleneck

Learned encodings are often used for downstream supervised prediction tasks. Just as in the Variational Fair Auto-Encoder [95], we can model both tasks at the same time to offer $c$-invariant predictions. Our formulation of this problem fits into the Information Bottleneck framework [148] and mirrors the Variational Information Bottleneck (VIB) [3].

Conceptually, VAEs have strong connections to the Information Bottleneck [3]. Stepping out of the generative context, we can "reroute" our decoder to a label variable $y$. This gives us the following computational model:
$$x \xrightarrow{\;q(z|x)\;} z \xrightarrow{\;p(y|z)\;} y. \quad (6.28)$$
The bottleneck paradigm prescribes optimizing over $q$ and $p$ so that $I(z;y)$ is maximal while $I(x;z)$ is minimized ("maintaining information about $y$ with maximal compression of $x$ into $z$"). As illustrated by Alemi et al. [3], this can be approximated using variational inference. We can produce $c$-invariant codes in the supervised Information Bottleneck context using the relaxation from Eq. 6.1. Beginning with the bottleneck objective $\max_{q,p} I(z;y) - \beta I(x;z)$ and including the minimization of $I(z;c)$, we have
$$\max_{p,q} \;\; I(z;y) - \beta\, I(x;z) - \lambda\, I(z;c). \quad (6.29)$$
We can then apply the same bound as in Eq. 6.5 to obtain, up to the constant $\lambda H(x|c)$, the following:
$$\max_{p,q} \;\; I(z;y) - (\beta + \lambda)\, I(x;z) + \lambda\, \mathbb{E}[\log p(x|z,c)]. \quad (6.30)$$
This objective includes a maximization of the likelihood $p(x|z,c)$. This is a decoder loss, adding a third branch to our network. Following the derivation in Alemi et al. [3], as well as a path similar to that of Section 6.1.1, the variational bound on the objective is
$$I(z;y) - (\beta + \lambda)\, I(x;z) + \lambda\, \mathbb{E}[\log p(x|z,c)] \;\geq\; \mathbb{E}_{(x,c)}\big[\mathbb{E}_{z,y}[\log p(y|z)] - (\beta+\lambda)\,\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,] + \lambda\, \mathbb{E}_z[\log p(x|z,c)]\big]. \quad (6.31)$$
We use Eq. 6.31 to learn $c$-invariant predictors. Optimization is performed over three function approximations: an encoder $q(z|x)$, a conditional decoder $p(x|z,c)$, and a predictor $p(y|z)$. We must further compute $\mathrm{KL}[\,q(z|x)\,\|\,q(z)\,]$ from the $I(x;z)$ penalty term; instead of following Alemi et al. [3], we again use the approximation from Eq. 6.24.
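Continuing the sketch from above (and reusing `pairwise_gaussian_kl`), a minimal training objective for Eq. 6.31 might look as follows; the predictor branch and the loss choices are again our own illustrative assumptions.

```python
def invariant_vib_loss(x, c, y, encoder, decoder, predictor, beta, lam):
    """Negative of the bound in Eq. 6.31, suitable for minimization."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    pred = F.cross_entropy(predictor(z), y, reduction="sum")   # -E[log p(y|z)]
    recon = F.mse_loss(decoder(z, c), x, reduction="sum")      # -E[log p(x|z, c)]
    kl_marginal = pairwise_gaussian_kl(mu, logvar).mean()      # KL[q(z|x) || q(z)]
    return pred + (beta + lam) * kl_marginal + lam * recon
```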
6.2 Computation and Empirical Evaluation

We now have two losses: a modified VAE loss (Eq. 6.18) and a modified VIB loss (Eq. 6.31). In both we must learn an encoder/decoder pair $q(z|x)$ and $p(x|z,c)$; we use feed-forward networks to approximate these functions. For $q(z|x)$ we use the Gaussian reparameterization trick, and for $p(x|z,c)$ we simply concatenate $c$ onto $z$ as extra input features to be decoded. The modified VIB adds a predictor branch $p(y|z)$, which we also parameterize with a feed-forward network. Specific architectures (e.g. the number of layers and nodes per layer for each branch) vary by domain.

We evaluate the performance of our proposed invariance penalty on two datasets with a "fair classification" task. We also demonstrate "Fader Network"-like capabilities for manipulating specified factors in generative modeling on the MNIST dataset.

6.2.1 Fair Classification

For each fair classification dataset/task we evaluate both prediction accuracy and the adversarial error in predicting $c$ from the latent code. We compare against the Variational Fair Autoencoder (VFAE) [95] and the adversarial method proposed in Xie et al. [162]. Both datasets are from the UCI repository, and the preprocessing of both follows Zemel et al. 2013 [167], which is also the source for the preprocessing in our baselines [95, 162].

The first dataset is the German dataset, containing 1000 samples of personal financial data. The objective is to predict whether a person has a good credit score, and the protected class is Age (which, as per [167], is binarized). The second dataset is the Adult dataset, containing 45,222 data points of US census data. The objective is to predict whether or not a person has over 50,000 dollars saved in the bank. The protected factor for the Adult dataset is Gender. (In some papers the protected factor for the Adult dataset is reported as Age, but those papers also reference Zemel et al. [167] for the processing and experimental scheme, which specifies Gender.)

Wherever possible we use architectural constraints from previous papers. All encoders and decoders are single-layer, as specified by Louizos et al. [95] (including those in the baselines); for both datasets we use 64 hidden units in our method, as in Xie et al., while for the VFAE we use their described architecture. We use a latent space of 30 dimensions in each case. We train using Adam with the same hyperparameter settings as in Xie et al. and a batch size of 128. Optimization and parameter tuning are done via a held-out validation set.

For each tested method we train a discriminator to predict $c$ from the generated latent codes $z$. These discriminators are trained independently of the encoder/decoder/within-method adversaries. We use the architecture from Xie et al. [162] for these post-hoc adversaries: a three-layer feed-forward network trained using batch normalization and Adam (with a learning rate of 0.001), with 64 hidden units per layer, using absolute error. We generalize this to four adversaries of increasing depth. Each discriminator is trained post hoc for each model, even in cases with a discriminator in the model (e.g. the model proposed by Xie et al. [162]).

6.2.2 Unsupervised Learning

We demonstrate a form of unsupervised image manipulation inspired by Fader Networks [90] on the MNIST dataset. We use the digit label as the covariate class $c$, which pushes all non-class stylistic information into the latent space while attempting to remove information about the exact digit being written. This allows us to manipulate the decoder at test time to produce different artificial digits based on the style of one digit. We use 2 hidden layers with 512 nodes each for both the encoder and the decoder.

6.3 Results

German Dataset        Adv. Loss   Pred. Acc.
Maj. Class            0.725       0.695
VFAE [95]             0.717       0.720
Xie et al. [162]      0.811       0.695
Proposed              0.698       0.710

Adult Dataset         Adv. Loss   Pred. Acc.
Maj. Class            0.675       0.752
VFAE [95]             0.882       0.842
Xie et al. [162]      0.888       0.831
Proposed              0.675       0.844

Figure 6.1: On the left we display the adversarial loss (the accuracy of the adversary on $c$) and the predictive accuracy on $y$ for three methods, plus the majority-class baseline, on both the Adult and German datasets. For adversarial loss lower is better, while for predictive accuracy higher is better. On the right we plot adversarial loss by varying adversarial strength (indicated by color), parameterized by the number of layers from zero (logistic regression) to three. All evaluations are performed on the hold-out test sets.

For the German dataset, shown in the top table of Figure 6.1, the methods are roughly equivalent. All methods have comparable predictive accuracy, while the VFAE and the proposed method have competitive adversarial loss. In general, however, this smaller dataset does not differentiate the methods.

For the larger Adult dataset, shown in the bottom table of Figure 6.1, all three methods again have comparable predictive accuracy. However, against stronger adversaries each baseline has very high loss. Our proposed method has accuracy comparable to the VFAE, while providing the best adversarial error across all four adversarial difficulty levels.

We further visualize a projection of the latent codes $z$ using t-SNE [97]; invariant representations should produce inseparable embeddings for each class. All methods have large red-only regions; this is somewhat expected for the majority class. However, both baseline methods have blue-only regions, while the proposed method has only a heterogeneous region.

Figure 6.2: t-SNE plots of the latent encodings of (left to right) the VFAE, Xie et al., and our proposed method on the Adult dataset (first 1000 pts., test split). The value of the $c$ variable is given by color, where red is the majority class.

Figure 6.3 demonstrates our ability to manipulate the conditional decoder. The left column contains the actual images (randomly selected from the test set), while the right columns contain images generated using the decoder. Particularly notable are the transfer of azimuth and thickness, and the failure of some styles to transfer to some digits (usually curved to straight digits, or vice versa).

Figure 6.3: We demonstrate the ability to generate stylistically similar images of varying classes using the MNIST dataset. The left column is mapped into a $z$ that is invariant to its digit label $c$. We can then generate an image using $z$ and any other specified digit $c'$, as shown on the right.

6.4 Discussion

As shown analytically in Section 6.1.1.2, in the optimal case adversarial training can perform as well as our derived method; it is also intuitively simple and allows for more nuanced tuning. However, it introduces an extra layer of complexity (indeed, a second optimization problem) into the system. In this particular case of invariant representation, our results lead us to believe that adversarial training is unnecessary.

This does not mean that adversarial training for invariant representations is strictly worse in practice. There are certainly cases where training an adversary may be easier or less restrictive than other methods, and due to its shared literature with Generative Adversarial Networks [64], there may be training heuristics or other techniques that can improve performance. On the other hand, we believe that our derivations shed light on why these methods might fail: specific failure modes of adversarial training can be attributed to Eq. 6.27, where the adversary fails to achieve the infimum. Bad approximations (i.e. weak or poorly trained adversaries) may provide bad gradient information to the system, leading to poor performance of the encoder against a post-hoc adversary.

Our experimental results do not match those reported in Xie et al. While in general their method has comparable predictive accuracy, we do not find that their adversarial error is low; instead, we find that the encoder/adversary pair becomes stuck in local minima. We also find that the adversary trained alongside the encoder performs badly against the encoder (i.e. that adversary cannot predict $c$ well), but a post-hoc trained adversary performs very well, easily predicting $c$ (as demonstrated by our experiments).

It may be that we have inadvertently built a stronger adversary. We have attempted to follow the authors' experimental design as closely as possible, using the same architecture and the same adversary (using the gradient flip trick and 3-layer feed-forward networks). With the details provided we could not replicate their reported adversarial error for their method, nor for the VFAE method. However, we are able to reproduce the adversarial error reported in Louizos et al., which uses logistic regression. In general, for stronger adversaries the adversarial loss will increase, but the relative rankings should remain roughly the same.
6.4.1 Chapter Summary

We have derived a variational upper bound for the mutual information between latent representations and covariate factors. Provided a dataset with labeled covariates, we can train both supervised and unsupervised learning methods that are invariant to these factors without the use of adversarial training. After training, our method can be used in production without requiring covariate labels. Finally, our approach also enables manipulation of specified factors when generating realistic data. Our direct, information-theoretic optimization approach avoids the pitfalls inherent in adversarial learning for invariant representation, and produces results that match or exceed the capabilities of these state-of-the-art methods.

Chapter 7
Scanner Invariant Representations

Observational conditions may vary strongly within a medical imaging study. Researchers are often aware of these conditions (e.g., scanner, site, technician, facility, etc.) but are unable to modify the experimental design to compensate, due to cost or geographic necessity. In magnetic resonance imaging (MRI), variations in scanner characteristics such as the magnetic field strength, scanner vendor, receiver coil hardware, applied gradient fields, or primary image reconstruction methods may have strong effects on collected data [27, 49, 79]; multi-site studies in particular are subject to these effects [71, 83, 98, 166]. Data harmonization is the process of removing or compensating for this unwanted variation through post-hoc corrections.

In the present work we focus on harmonization for diffusion MRI (dMRI), a modality known to have scanner/site biases [17, 33, 58, 122, 123, 157, 160, 169, 170, 168], as well as several extra possible degrees of freedom with respect to protocol (e.g., angular resolution, b-values, gradient waveform choice, etc.).

Several prior methods approach diffusion MRI harmonization as a regression problem. Supervised image-to-image transfer methods have been proposed [19, 143], while for the unsupervised case site effects are often modeled as covariate effects, either at a summary statistic level [49, 166] or on the image directly [107]. All of these methods directly transfer scans from one site/scanner context to another. Further, while all methods require paired scans to correctly validate their results (subjects or phantoms scanned on both target and reference scanners), supervised methods also require paired training data. The collection of such data is expensive and difficult at large scale.

In this chapter we instead frame the harmonization problem as the invariant representation learning task of the preceding chapters. We propose that a subset of harmonization solutions may be found by learning scanner-invariant representations, i.e., representations of the images that are uninformative of the scanner on which the images were collected. These representations and the mappings between them may then be manipulated to provide image reconstructions that are minimally informative of their original collection site. We thus provide an encoder/decoder method for learning mappings to and from invariant representations computationally. This method has several advantages over regression-based methods, including a practical implementation that does not require paired data (i.e., a traveling phantom) as training input, and an extension to the multi-site case.
We demonstrate our proposed method on the MICCAI Computational Diffusion MRI challenge dataset [119, 145, 174], showing substantial improvement compared to a recently published baseline method. We further introduce technical improvements to the training of neural architectures on diffusion-weighted data, and discuss the limitations and error modes of our proposed method. The content of this chapter was published as [117].

7.1 Relevant Prior Work on Diffusion MRI Harmonization

Harmonization has been an acknowledged problem in MR imaging, and specifically in diffusion imaging, for some time [174]. Numerous studies have noted significant differences in diffusion summary measures (e.g., fractional anisotropy; FA) between scanners and sites [122, 157, 160]. Further protocol differences arise between sites due to limitations of the available scanners, unavoidable changes or upgrades in scanners or protocols, or when combining data retrospectively from multiple studies; effects of variations in scanning protocols on derived measures include effects of voxel size [123], b-values (the diffusion weightings used) [17, 33, 123], and angular resolution or q-space sampling [58, 169, 170, 168], among other parameters. These problems were also examined by the MICCAI Computational Diffusion MRI 2017 and 2018 challenges [119, 145], which held an open comparison of methods for a supervised (paired) task.

Most previously proposed harmonization methods have relied on forms of regression. Harmonization of summary statistics (voxel-wise or region-wise) includes random/mixed-effect models [166] as well as the scale-and-shift random effects regression of ComBat [49, 166]. This latter method was adapted from the genomics literature [77], and employs a variational Bayes scheme to learn model coefficients.

A more nuanced family of regression methods for diffusion imaging was recently introduced in a series of papers by Mirzaalian et al. [106, 107, 105]. This was later analyzed empirically by Karayumak et al. [81], who compared it against ComBat [77] for summary statistics. This family of methods computes a power spectrum from a spherical harmonic (SH) basis, then generates a template from these images using multi-channel diffeomorphic mappings. The resulting template is used to compute spatial maps of average SH power spectra by scanner/protocol, which are then used in a scale regression on individual subjects. While these papers take a very different approach from our own, the resulting method has a very similar usage pattern and output. We compare our approach directly to this method.

In a supervised (paired) task, direct image-to-image transfer has been explored both in the harmonization context [88, 82, 119] as well as in the similar super-resolution context [19, 143]. This family of methods generally relies on function approximators with high expressive capacity (e.g., neural networks) to map directly between patches of pairs of images. This requires explicitly paired data, in that the same brains must be scanned at all sites. These methods perform well empirically, as tested by the CDMRI challenge [145], but require paired data in the training set. Our proposed method does not require paired data to train; however, in our opinion, best-practice validation still requires paired data in the (holdout) test set.

7.2 Theory

Our goal is to learn a representation of MRI scans (in this case dMRI scans) with minimal scanner/site-specific information, but otherwise maximally informative of the image, and then map that representation back to a new image.
We would like to remove trends and biases in our data that are informative of which site the data were collected at, but we otherwise do not want to remove extra information. To do this, we construct an encoding function q that takes each image x to a corresponding vector z, and a conditional decoding function p that takes each z and a specified site c back to an image x̂ (the "reconstruction" of the original image). We optimize q and p so that q(z|x) has minimal scanner-specific information, i.e., minimizing I(z;c), where c indexes the scanners/sites, but also so that q(z|x) has minimal differences from the original data when mapped back through p; this optimization is done with respect to a parameter λ regulating the trade-off between the two objectives. By the data processing inequality [34], if we can remove scanner/site-specific information from z, then any images x̂ we reconstruct from z will also have no information about the original scanner/site.

As we show in the following section, learning the mapping q does not require matching pairs of data (x, x′) from pairs of sites (c, c′). Best practices in validation and testing do require such data, but during training we can minimize I(z;c) without having examples of the same subject collected on different scanners. This is due to our bound on I(z;c) described in Eq. 7.1, wherein we can decompose an upper bound on I(z;c) into terms not reliant on inter-site correspondence.

During training, evaluation of the conditional decoder p will use the same site as the input data. During testing, however, we can manipulate this mapping to reconstruct images at a different site than the one they were originally collected at; we use this mapping as our harmonization tool. Again, by the data processing inequality, the amount of information these (new) reconstructed images contain about their original collection site is bounded by I(z;c), which we explicitly minimize.

The theory backing our method is largely explored in Moyer et al. [112], where it is used in the context of algorithmic fairness. We reproduce it here for clarity, and further reinterpret their theoretical results in the imaging harmonization context, adding our own data processing inequality interpretation of test-time remapping.

7.2.1 Scanner Invariant Variational Auto-Encoders

We wish to learn a mapping q from data x (associated with scanner c) to some latent space z such that z is independent of c (denoted z ⊥ c), yet also z is maximally relevant to x. We start by relaxing z ⊥ c to minimizing I(z;c), and then bounding I(z;c), as done in Eq. 6.7 from the previous chapter:

    I(z;c) \le \underbrace{-\,\mathbb{E}_{x,c,z\sim q}[\log p(x|z,c)]}_{\text{Cond. Reconstruction}} + \underbrace{\mathbb{E}_{x}\,\mathrm{KL}[\, q(z|x)\,\|\, q(z)\,]}_{\text{Compression}} - \underbrace{H(x|c)}_{\text{Const.}}    (7.1)

where q(z) is the empirical marginal distribution of z under q(z|x), the specified encoding which we control, and p(x|z,c) is a variational approximation to the conditional likelihood of x given z and c, again under q(z|x).

The bound in Eq. 7.1 has three components: a conditional reconstruction term, a compressive divergence term, and a constant term denoting the conditional entropy of the scan given the scanner. Intuitively, this breakdown makes sense: if we reconstruct given c, and are otherwise compressing z, the optimal compressive z keeps no information about c; q(z|x) can always remove information about c without penalty, because the reconstruction term is handed that information directly. Further, if x is highly correlated with c, i.e., H(x|c) is very low, then our bound will be worse.
In the continuous case, this may be a trivial bound if x is uniquely determined by c. This provides insight into the operating characteristics of this bound, and a specific error mode in which scanner c is highly informative of the image x.

Inspired by the functional form of our bound on a relaxation of the constraint z ⊥ c, we can now construct a variational encoding/conditional-decoding pair q and p with which our variational bound fits nicely, and which also fits our overall goal of re-mapping x accurately through p(x|z,c). Following Kingma and Welling [86], we use a generative log-likelihood as an objective:

    \max \; \mathbb{E}_{(x,c)}[\log p(x|c)]    (7.2)

Here, however, we inject the conditional likelihood to match our bound on I(z;c). This also fits our test-time desired Markov chain (with condition z ⊥ c): x (collected at site c) is encoded to z, and z is decoded at a chosen site c′ to produce x̂.

Following the original VAE derivation (again from Kingma and Welling), we can derive a similar c-invariant VAE by introducing the encoder q(z|x), as in Eq. 6.18 from the previous chapter:

    \log p(x|c) = \log \int \frac{p(x,z|c)}{q(z|x)}\, q(z|x)\, dz = \log \mathbb{E}_{z\sim q}\!\left[\frac{p(x,z|c)}{q(z|x)}\right]    (7.3)
    \ge \mathbb{E}_{z\sim q}[\log p(x,z|c) - \log q(z|x)]    (7.4)
    = \mathbb{E}_{z\sim q}[\log p(z|c) - \log q(z|x) + \log p(x|z,c)].    (7.5)

This, again following the previous chapter, assumes that the prior p(z|c) = p(z), i.e., that the conditional prior is equal to the marginal prior over z. In the generative context, this would be a strong model mis-specification: if we believe that there truly are generating latent factors, it is unlikely that those factors would be independent of c. However, we are not in such a generative frame, and instead would like to find a code z that is invariant to c, so it is reasonable to use a prior that also has this property. Taking this assumption, we have

    \log p(x|c) \ge -\mathrm{KL}[\, q(z|x)\,\|\, p(z)\,] + \mathbb{E}_{z\sim q}[\log p(x|z,c)].    (7.6)

This is a conditional extension of the VAE objective from Kingma and Welling [86]. Putting this objective together with the penalty term in Eq. 7.1, we have the following variational bound on the combined objective (up to a constant):

    \mathbb{E}_{(x,c)}[\log p(x|c)] - \lambda I(z;c) \ge \mathbb{E}_{(x,c)}\Big[\underbrace{-\mathrm{KL}[q(z|x)\,\|\,p(z)]}_{\text{Div. from Prior}} - \underbrace{\lambda\,\mathrm{KL}[q(z|x)\,\|\,q(z)]}_{\text{Div. from Marg.}} + \underbrace{(1+\lambda)\,\mathbb{E}_{z\sim q}[\log p(x|z,c)]}_{\text{Cond. Reconstruction}}\Big].    (7.7)

We use the negation of Eq. 7.7 as the main loss term for learning q and p, where we want to minimize the negative of the bound.

As we have written it in Eq. 7.7, the site variable c has ambiguous dimension. For applications with only two sites, c might be binary, while in the multi-site case c might be a one-hot vector.¹ In this latter multi-site case, we would like the code z to be uninformative of all input sites, and we further expect p(x|z,c) to be able to reconstruct images from each site. Since we are able to optimize our proposed objective in Eq. 7.7 for either a single site (binary c) or multiple sites (one-hot vector c), we conduct experiments for both in Sections 7.3 and 7.4.

¹ For a categorical variable with value k out of K possible values, its corresponding one-hot vector is a K-dimensional vector with zeros in every entry except the k-th entry, which is one.

More complex c values are also possible. We could, for example, input c variables describing scanner vendors, head coils, software, etc., essentially describing the configurations of each site in more detail. One would expect similar configurations to be similar in their biases. Unfortunately our data do not permit such considerations, but they are, in theory, possible.
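As a concrete illustration of the two divergence terms in Eq. 7.7 for a diagonal-Gaussian encoder, the following NumPy sketch (function names are ours) computes the closed-form KL to a standard-normal prior, and the pairwise-KL upper bound on the divergence from the marginal (cf. the closed-form bound of Eq. 6.24 used later in this chapter):

import numpy as np

def kl_to_std_normal(mu, logvar):
    # KL[ N(mu, diag(exp(logvar))) || N(0, I) ], one value per batch element.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def pairwise_kl_bound(mu, logvar):
    # Upper bound on E_x KL[ q(z|x) || q(z) ], where q(z) = (1/N) sum_j q(z|x_j):
    # by convexity, KL[ q_i || (1/N) sum_j q_j ] <= (1/N) sum_j KL[ q_i || q_j ].
    var = np.exp(logvar)                                   # shape (N, d)
    log_ratio = logvar[None, :, :] - logvar[:, None, :]    # log var_j - log var_i
    quad = (var[:, None, :] + (mu[:, None, :] - mu[None, :, :])**2) / var[None, :, :]
    kl_ij = 0.5 * np.sum(log_ratio + quad - 1.0, axis=2)   # KL[q_i || q_j], (N, N)
    return kl_ij.mean()                                    # average over i, then over j

In practice both terms are written in an autodiff framework (the experiments below use TensorFlow) so that gradients flow back to the encoder parameters; the conditional reconstruction term is an ordinary decoder log-likelihood.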
7.2.2 Diffusion-space Error Propagation from SH representations

A convenient representation for diffusion-weighted MRI is the spherical harmonics (SH) basis [105]. These provide a countable set of basis functions on the sphere, to and from which projection is easy and often performed (e.g., in graphics). In this paper, our input data and the reconstruction error are computed with respect to the SH coefficients. However, for the eventual output, the data representation that we would like to use is not in this basis, but in the original image representation, which is conditional on a set of gradient vectors (b-vectors). These vectors are in general different for each subject due to spatial positioning and motion, and often change in number between sites/protocols. Rigid transformation and alignment of scan data, used in many pre-processing steps, also changes vector orientation. While the ℓ₂ function norm is preserved under projection to the SH basis (i.e., asymptotically SH projection is an isomorphism for ℓ₂), this is not the case for a norm on general finite sets of vectors.

To correct for this, we construct a projection matrix from the SH basis to each subject's particular b-vector set. This can then be used in conjunction with the decoder output p(x|z,c) to map output SH coefficients to the original subject-specific b-vector representation. We allow each b₀ channel to "pass through" the projection (mapped as identity), as they are without orientation. While we use the SH representation for both input and reconstruction (to leverage our invariance results), we augment the loss function from Eq. 7.1 with a "real-space" loss function: the reconstruction loss in each subject's original domain.
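To make the SH fitting and projection concrete, the following NumPy/SciPy sketch builds a real, even-order SH design matrix for a gradient table and uses the pseudo-inverse fit described in Section 7.3.1. Function names are ours; `signal` is assumed to be an (n_voxels × n_directions) array for a single shell, `bvecs_src`/`bvecs_dst` two gradient tables, and real-basis sign conventions differ slightly between toolboxes:

import numpy as np
from scipy.special import sph_harm

def real_sh_basis(order, azim, polar):
    # Design matrix of a real, even-order SH basis: one row per direction,
    # one column per (degree n, order m) pair; order 8 yields 45 columns.
    cols = []
    for n in range(0, order + 1, 2):              # even degrees only (antipodal symmetry)
        for m in range(-n, n + 1):
            y = sph_harm(abs(m), n, azim, polar)  # SciPy argument order: (m, n, azimuth, polar)
            if m < 0:
                cols.append(np.sqrt(2.0) * y.imag)
            elif m == 0:
                cols.append(y.real)
            else:
                cols.append(np.sqrt(2.0) * y.real)
    return np.stack(cols, axis=-1)

def bvecs_to_angles(bvecs):
    # bvecs: (n_dirs, 3) unit vectors mapped to (azimuth, polar) angles.
    azim = np.arctan2(bvecs[:, 1], bvecs[:, 0]) % (2.0 * np.pi)
    polar = np.arccos(np.clip(bvecs[:, 2], -1.0, 1.0))
    return azim, polar

# Minimal-l2 (pseudo-inverse / SVD) fit, then projection onto a new b-vector set:
B_src = real_sh_basis(8, *bvecs_to_angles(bvecs_src))   # (n_dirs_src, 45)
coeffs = signal @ np.linalg.pinv(B_src).T               # (n_vox, 45) SH representation
B_dst = real_sh_basis(8, *bvecs_to_angles(bvecs_dst))   # (n_dirs_dst, 45)
resampled = coeffs @ B_dst.T                            # DWI signal on the new directions

The design matrix evaluated at a subject's own b-vectors plays the role of the subject-specific projection injected after the decoder: it maps output SH coefficients back to that subject's DWI representation.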
This encourages the overall loss function to be faithful to our use case in the original image space.

7.2.3 Computational Implementation

We parameterize q and p using neural networks, fitting their parameters by gradient-based optimization. Though the loss in Eq. 7.7 is defined generally, and clearly one can learn invariant representations using many different parameterizations, the flexibility of neural networks as function approximators makes them ideal for this application. We apply these networks to small image patches, concatenating patch-wise outputs to create harmonized images. The overall architecture is shown in Figure 7.1, and the training and testing configurations are diagrammed in Figure 7.2, with exact parameters given in Section 7.3. We discuss the use of patches and its relative advantages and drawbacks in Section 7.5.

Figure 7.1: Our network is composed of an encoder branch q(z|x) (at left) and a conditional decoder p(x|z,c), as well as two augmenting losses. The DWI-space reconstruction loss (in blue) is computed using the injected subject-specific projection matrix (from SH to b-vector representation). The patch adversary (in green) attempts to predict whether a reconstructed patch is originally from a given site ("remapped" vs. "real" patches). At test time, the site id c is manipulated to map data onto one specific site.

Figure 7.2: Diagram of the proposed training and testing schema, in the left and right boxes respectively. Site bias is represented by the differing colors. In both configurations, the data are mapped to a site-invariant space z, the colorless center column of both boxes. The remaining (site-independent) information can then be reconstructed into an image, given a site. In the training configuration, data are remapped to their original site, and the loss is calculated from Eq. 7.8 (secondary loss terms omitted from the figure). Weights are then trained using the derivative of the loss (backpropagation [132]). In the testing configuration, the data are mapped to the selected site c′. The outputs in the right-hand column of the right box have bounded mutual information about their original site, vanishing with respect to the loss function.

Our primary reconstruction loss is computed in the SH domain with respect to the entire patch. We then add a secondary loss function for the center voxel based on the SH-to-DWI projection, and an adversarial loss which attempts to predict which scanner/protocol each reconstructed patch is from (seen at the right of Figure 7.1). We added this branch in order to provide additional information toward keeping remapped patches "reasonable" when remapping to new sites; this prediction can be performed without explicit pairing of patches. Our loss function is then, in abstract,

    \mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \alpha\,\mathcal{L}_{\mathrm{proj}} + \gamma\,\mathcal{L}_{\mathrm{adv}} + \lambda\, I(z;c)    (7.8)

where L_recon is the SH reconstruction loss (using MSE), L_proj is the DWI-space loss, and L_adv is the adversarial loss on the SH reconstruction, with three hyperparameters (α, γ, λ) controlling the trade-offs between objectives. This loss function trivially extends from the single-site case (one target site to/from one base site) to a multi-site case, where c is categorical.

We optimize these networks over Eq. 7.8 by differentiating the loss function with respect to the network weights (i.e., backpropagation [132]) and then using the Adam optimizer [85], a first-order optimization method. To compute gradients of the divergences in Eq. 7.7 efficiently, we use the reparameterization trick of Kingma and Welling [86], using both a diagonal Gaussian conditional q(z|x) and a Gaussian prior p(z). We also use the closed-form bound for KL[q(z|x)‖q(z)] from Eq. 6.24.

7.3 Methods

7.3.1 Data and Pre-processing

To evaluate our method, we use the 10 training subjects from the 2018 CDMRI Challenge Harmonization dataset [144, 145]. These subjects were imaged on two different scanners: a 3 T GE Excite-HD "Connectom" and a 3 T Siemens Prisma scanner. For each scanner, two separate protocols were collected, one of which matches between the scanners at a low resolution, and another which does not match, at a high resolution. This results in four different "site" combinations, for which all subjects were scanned, resulting in forty different acquisitions (10 subjects, 2 scanners, 2 protocols each). We split this into 8 training subjects and two held-out test subjects.

The low-resolution matching protocol had an isotropic spatial resolution of 2.4 mm with 30 gradient directions (TE = 89 ms, TR = 7200 ms) at two shells, b = 1200, 3000 s/mm², as well as a minimum of 4 b₀ acquisitions, at least one of these with reverse phase encoding. These volumes were then corrected for EPI distortions, subject motion, and eddy current distortions using FSL's TOPUP/eddy [5, 6].
Subjects from the "Connectom" scanner were then registered to the "Prisma" scanner using an affine transformation, fit to a co-temporally acquired T1-weighted image volume (previously registered to each corresponding FA volume). The b-vectors were then appropriately rotated. In the case of the "Connectom" scanner, geometric distortions due to gradient non-linearities were corrected using in-house software [61, 131].

The high-resolution protocols are identical in pre-processing to their low-resolution counterparts, but have isotropic voxel sizes of 1.5 mm (TE = 80 ms, TR = 4500 ms) and 1.2 mm (TE = 68 ms, TR = 5400 ms) for the "Prisma" and "Connectom" scanners respectively, each with 60 gradient directions per shell and the same b-shell configurations (b = 1200, 3000 s/mm²). We downsample the spatial resolution of the high-resolution scans to 2.4 mm isotropic to test the multi-task method, but keep the angular resolution differences. To simplify notation, we refer to the four scanner/protocol combinations by their scanner make and number of gradient directions: Prisma 30, Prisma 60, Connectom 30, and Connectom 60.

All scans were masked for white matter tissue, with the cerebellum removed. We map each of these scans to an 8th-order SH representation for input into our method, but retain the original domain for training outputs. We use the minimal ℓ₂-weighted solution in the case of under-determined projections, which corresponds to the SVD solution (using the pseudo-inverse). This is well-defined, unlike direct projection.

7.3.2 Experimental Protocol

The original CDMRI 2018 challenge [145] specified three supervised tasks, mapping between one base "site" (Prisma 30) and the three target "sites" (Prisma 60, Connectom 30, and Connectom 60). We modify this task, removing correspondence/pairing knowledge between sites (keeping this information for validation and testing), and including the inverse mapping task (target to base). This results in six tasks, two for each target site.

We train a "single-site" network for each of the six tasks, learning representations for Prisma 30 and a single target site, a multi-site variant across all six tasks, as well as the Mirzaalian et al. [107] baseline. We measure the performance of each on the holdout set of subjects using Root Mean Squared Error (RMSE), with Prisma 30 as input, outputs from each method, and ground truth as the target scanner/protocol volume in the original DWI basis (after pre-processing).

7.3.3 Configuration and Parameters

We implemented our method for image patches composed of a center voxel and each of its six immediate neighbors. Each of these voxels has two shells of DWI signal, which we mapped to the 8th-order SH basis, plus one b₀ channel. Unravelling these patches and shells, the input is then a vector with 91 × 7 = 637 elements.

We use two-layer fully connected neural networks for the encoder q(z|x) and conditional decoder p(x|z,c), with 128 and 64 hidden units respectively for the encoder, and the reverse (64 then 128) for the decoder. The latent code z is parameterized by a 32-unit Gaussian layer. This layer is then concatenated with the scanner/protocol one-hot representation c, and input into the decoder. We use tanh(x) transformations at each hidden layer. We use a linear output from the encoder to parameterize the mean of the Gaussian layer, and a soft-plus rectified output to parameterize the variance. The adversary is a fully connected two-layer network with 64 units at each layer, again with tanh(x) units at each hidden node.
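A minimal tf.keras sketch of the networks just described (layer sizes are those given in the text; all names, the softmax adversary output, and other details are illustrative rather than the exact experiment code):

import tensorflow as tf

D_IN, D_Z, N_SITES = 637, 32, 4   # (45 SH coeffs x 2 shells + b0) x 7-voxel patch

# Encoder q(z|x): 128 -> 64 -> diagonal Gaussian over z
x_in = tf.keras.Input((D_IN,))
h = tf.keras.layers.Dense(128, activation="tanh")(x_in)
h = tf.keras.layers.Dense(64, activation="tanh")(h)
mu = tf.keras.layers.Dense(D_Z)(h)                          # linear mean
var = tf.keras.layers.Dense(D_Z, activation="softplus")(h)  # rectified variance
z = mu + tf.sqrt(var) * tf.random.normal(tf.shape(mu))      # reparameterization trick
encoder = tf.keras.Model(x_in, [mu, var, z])

# Conditional decoder p(x|z,c): concatenate one-hot site, 64 -> 128 -> SH patch
z_in = tf.keras.Input((D_Z,))
c_in = tf.keras.Input((N_SITES,))
h = tf.keras.layers.Concatenate()([z_in, c_in])
h = tf.keras.layers.Dense(64, activation="tanh")(h)
h = tf.keras.layers.Dense(128, activation="tanh")(h)
x_hat = tf.keras.layers.Dense(D_IN)(h)
decoder = tf.keras.Model([z_in, c_in], x_hat)

# Patch adversary: two layers of 64 tanh units, predicting the site of a
# reconstructed patch.
a_in = tf.keras.Input((D_IN,))
h = tf.keras.layers.Dense(64, activation="tanh")(a_in)
h = tf.keras.layers.Dense(64, activation="tanh")(h)
adversary = tf.keras.Model(a_in, tf.keras.layers.Dense(N_SITES, activation="softmax")(h))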
For each task we train our network for 2000 epochs, which took 8 hours in the pair-wise case on a standard desktop (32 GB of CPU RAM, 4 cores) equipped with an external NVIDIA Titan Xp with 12 GB of RAM, using TensorFlow. We loosely tune the hyperparameters so the losses are approximately of the same order of magnitude, with α = 0.1, γ = 1.0, and λ = 10.0. We use these same parameters for both the pair-wise tasks and the multi-task experiments. We use an Adam learning rate of 0.0001.

Figure 7.3: Voxel-wise performance, as measured by RMSE, of the Mirzaalian et al. [107] method and our two proposed methods on each of the three harmonization tasks (Prisma 30 to each of the other scanner/site combinations); lower is better. The naive baseline (direct substitution) is approximately an order of magnitude worse than any of these methods. Mirzaalian et al. [107] and the naive baseline are the only other unsupervised methods we are aware of in the literature.

7.4 Results

Figure 7.3 plots the root mean squared error (RMSE) by voxel of the baseline, the single-site proposed method, and the multi-site proposed method, as evaluated on the held-out test subjects. Our proposed methods show improvement over the baseline method in each case. In the pair-wise task between similar protocols (mapping between Prisma 30 and Connectom 30), these improvements have non-overlapping inter-quartile ranges. For dissimilar protocols, i.e., mapping between Prisma 30 and Prisma 60 or Connectom 60, our proposed method shows improvements, though the difference is less pronounced.

For similar protocols, the multi-site method does not perform as well as the single-site method, and in fact does not outperform the baseline. Surprisingly, however, for higher-resolution target images the multi-site method performs as well as or better than the pair-wise method and the baseline; this may be due to the multi-task method receiving many more volumes overall, allowing it to gather more information (albeit biased by other scanners) or preventing it from overfitting.

Figures 7.4, 7.5, and 7.6 show the spatial distribution of the error for each tested method on a single test subject, for mappings between Prisma 30 and Connectom 30, Prisma 60, and Connectom 60, respectively. For the Prisma 30 to Connectom 30 mapping, the Mirzaalian baseline [107] has higher overall error than the other methods, as shown by the overall coloring. More importantly, the Mirzaalian baseline [107] and the multi-site proposed method show significant white matter patterning (though in varying degrees); optimally, we would like to see uncorrelated residuals, like those shown by the single-site method. For the mappings between dissimilar protocols, the multi-site method has both lower error and less qualitative patterning in white matter regions. The Connectom 60 error plots (Fig. 7.6) have strong spatial patterns at both the occipital and frontal poles, shown in all methods. This wide-scale effect is somewhat mitigated by the proposed methods, but is still present in all error distributions.

7.5 Discussion

This method does not erase the necessity of balancing site cohorts with respect to specific treatment variables. Consider x to be a disease effect instead of an image, and consider Eq. 7.1 in a case in which treatment/control cases are highly correlated with c (e.g., all patients with a specific disease are scanned at one site, and all controls at another). Then H(x|c) = −∞, and our bound is useless.
Clearly this is an extreme case, but it leads us to the following rule of thumb: if c could be used as a proxy (instrumental variable) for a treatment effect, or is partially confounded with a predictor of interest, it cannot be removed without also partially removing that effect.

This method further cannot remove long-range scanner biases; this is due to the patch-based architecture. It is important to differentiate between large-scale and long-range. If, for example, all voxels in the frontal lobe had extra apparent diffusion in a specific direction for only one site, this bias would be large-scale but not long-range (all biases are within-voxel, but occur at a large number of voxels). On the other hand, if the apparent grey matter/white matter boundary had been deformed by 2 mm across 20-mm subsections of the frontal lobe for only one site, this deformation would be a relatively long-range bias (the actual bias signal is on a 20-mm scale, which is larger than the size of our patches). In theory, with larger patches, we could avoid this limitation; current hardware, in particular GPU memory and bus speeds, limits our computation to small patches. Specific work in this domain has been done to reduce memory load [19], but the problem is by no means solved, especially for high angular resolution data such as the HCP dataset [137]. We hypothesize that a similar architecture with larger patches or whole images (architectures that may become accessible with increased hardware capability, or with better model compression/computational reduction techniques) could rectify this particular problem.

Figure 7.4: Spatial distribution of RMSE per voxel, displayed in slices centered at (x,y,z) = (10,22,35), for mappings from the Prisma 30 protocol to the Connectom 30 protocol. The top row is from the Mirzaalian [107] baseline, the center row from the single-site proposed method, and the bottom row from the multi-site proposed method. The color scale is the same between the rows, as well as between this figure, Fig. 7.5, and Fig. 7.6. Mirzaalian et al. [107] and the naive baseline are the only other unsupervised methods we are aware of in the literature.

Figure 7.5: Spatial distribution of RMSE per voxel, displayed in slices centered at (x,y,z) = (10,22,35), for mappings from the Prisma 30 protocol to the Prisma 60 protocol. Rows and color scale as in Fig. 7.4.

Figure 7.6: Spatial distribution of RMSE per voxel, displayed in slices centered at (x,y,z) = (10,22,35), for mappings from the Prisma 30 protocol to the Connectom 60 protocol. Rows and color scale as in Fig. 7.4.

We test our method on matching spatial resolutions, which necessitated downsampling for two of the datasets.
While our SH basis-injection method allows for varying angular outputs, changing spatial resolutions will probably necessitate differences in decoder architectures. This is possible in direct methods, e.g., [19], but is currently an open issue for unsupervised methods.

In the current method we reconstruct images for a specific target site c′. We might instead look for a site-agnostic image. This is philosophically challenging: each image must be collected at a site. While we can manipulate our method to produce an "average site" c, the output image may not be representative of any of the images at all. It may be that all images must have site information, and that the quotient representation is not an image at all. On the other hand, for other tasks y that are not images, e.g., prediction of pathology or prognosis, we can use z to make unbiased (scanner-agnostic) predictions of y. In cases where the actual goal is not in the image domain (and the harmonization task is a pre-processing step toward it), such a formulation may be beneficial, and could be built from our proposed method.

Appendix A: A neuro-anatomy primer/review for computer scientists

Physical dissection of the human head reveals a brain with three main parts [15]: the cerebrum, the largest part of the human brain, characterized by two large hemispheres divided by a large fissure; the cerebellum, a smaller structure below and behind the cerebrum that mainly handles motor control; and the brain stem, a long stalk of nerve fibers that connects the brain to the spinal cord.¹ A fourth set of structures, known as the limbic system, is also found buried below the cerebrum.

¹ More nuanced anatomic perspectives from the fields of evolutionary and developmental biology would identify each of these as subsets of the hindbrain, midbrain, and forebrain respectively, along with numerous smaller structures in each region [120]. This distinction is more accurate, especially when considering the developmental trajectory of the pre-natal CNS in humans as well as other species, but complicates the description of the system beyond what is necessary here.

Each of these structures is mainly composed of two distinct tissues [15]: grey matter, a soft tissue composed mostly of nerve cell bodies (soma), and white matter, a fibrous fatty tissue composed of long tubes of cell appendages (axons). Both tissues contain supporting cells (glia) as well as capillaries (small blood vessels). In the cerebrum and cerebellum the outermost layers of each structure are grey matter (the cerebral cortex and the cerebellar cortex), while the interior portions are white matter. Grey and white matter (and the associated supporting cells) can also be found in the brain stem and the spinal cord. Grey matter also contains many short cell appendages (dendrites or dendritic trees) which extend through the grey matter to other nearby cells. Both these tissues are immersed in cerebrospinal fluid (CSF), a clear body fluid circulating throughout the brain via the ventricles (a set of large open cavities in the interior of the organ).

As previously implied, neurons in the brain have on average one large body (the soma), many small local appendages (dendrites), and usually one large appendage (an axon) that connects to another, non-local neuron [120]. The axons encountered in the white matter are often grouped into larger fiber groups known as fascicles, fasciculi, or tracts. These groupings are visible upon even rudimentary dissection, but most are notoriously difficult to follow, even for modern methods.
The cell bodies of the grey matter similarly associate into larger groups, both in layers along the surface of the brain and in columns perpendicular to that surface. Further macro-organization is observed in independent structures (e.g., the cerebellum, the sub-regions of the mid-brain, etc.) as well as in specific local surface regions [4] (e.g., the visual cortex).

The macro-organization of the brain is closely related to functional segregation. In general, different regions have different functions (purposes, uses, etc.). While it is hypothesized that they all interact to produce general cognition, specific structures are responsible for specific cognitive or physical sub-routines. Examples include the visual cortex, the motor cortices, and the somatosensory cortex. Nomenclature here, however, overstates this connection: for example, the motor cortex interfaces with several other structures to produce movement control, notably the cerebellum, several sub-cortical and brain stem structures, and the spinal cord.

While the brain probably does not function as an electronic computer, e.g., a von Neumann architecture, it can be helpful (for neuroimaging purposes) to view the grey matter as various processors, active main memory, memory management units, GPUs, etc., and to view the white matter as the bus cables. While in MR-based imaging we cannot image the circuitry (the transistors in our analogy), we are able to image the macro connections.

During the development of individuals, the growth of both these tissues leads to the characteristic folding of the cortex [120]. The raised, convex portions of these folds are known as gyri (singular: gyrus), and the indented, concave portions of the folds are known as sulci (singular: sulcus). Folds on the cortical surface often correspond with grey and white matter changes in and below the cortical substrate, and each form of macrostructural organization (tracts, regions, and folds) is partially informative of the other two. In particular, the folding and shape of the surface is often used in the image registration process (see Appendix B).

A.1 Brain Mapping

Many tracts, regions, and folds are qualitatively regular in cognitively normal adult humans (i.e., their relative boundaries have qualitatively low variance across individuals). Such regularity has led to the construction of brain maps (atlases). These atlases often name and delineate specific regions on a "canonical" brain. These sets of regions often cover only a subset of tissues, e.g., cortical grey matter atlases, and are often presented with caveats from the data from which they were derived (physical samples, images, anecdotal case evidence). This means that there are grey matter cortical atlases using cyto-architecture (micro/meso-structure) that are not incompatible with imaging-derived white matter atlases, but perhaps are incomparable without further nuance.

The criteria for such delineation vary greatly [21]. Nomenclature and definitions for many tracts and regions are notoriously contentious; brain mapping is an active field, with atlases for various structures being proposed regularly. Many different criteria and methods have been proposed for region selection (parcellation) in the grey matter [37, 43, 47, 60] as well as in the white matter [109, 163, 172]. It is clear that structure is conserved between individuals, but the exact nature of that structure is difficult to describe.
It is unclear which, if any, tracts or regions are anatomically asymmetric, sexually dimorphic, environmentally induced, or otherwise varying [25, 4], and by and large their genetic and epigenetic controlling factors are unknown. While the overall gross anatomy of the brain and the relative segregation of its tissues have been known for centuries, the exploration of its meso- and macro-scopic organization has been a relatively recent development.

The exploration of functional segregation has largely been regulated by instrumentation [149]. Before the development of digital imaging techniques, brain mapping was largely conducted qualitatively via animal dissection [15], post-mortem human dissection [21, 66], qualitative analysis of behavior following injury [66, 20], or some combination of the three. These methods clearly remain valid and indeed active [72], but are widely supplemented with imaging.

The advent of less invasive methods (up to wholly non-invasive methods) led to efforts to image healthy human subjects and generate digital atlases [142], often en masse from population averages [101]. By collecting large collections of brain images along with ancillary information (demographics, conditions, etc.) and combining this information through any number of computational methods [149] (e.g., registration, hierarchical modelling, etc.), we arrive at a modern brain atlas [42].

Initially limited to structural MRI (sequence families such as T1 and T2), these efforts have expanded to functional MRI [147] and diffusion imaging [109], tractography [163], as well as multi-modal efforts [60]. These maps come in varying forms; early digital brain atlases were usually themselves brains (e.g., [142]), while later structural atlases form an average [149] (note that there is no natural choice of metric here, so the choice of which particular average varies). Surface-based analyses based on, e.g., FreeSurfer [45] have popularized an alternative image-derived domain, the cortical surface (usually, in fact, the grey matter/white matter interface), and averages on that domain. Atlases are almost synonymous with label maps (defined regions, parcellations), which are drawn on the atlas image or atlas surfaces.

Parcellation sets are the current reification of several centuries of anatomic abstraction; each set of sets represents beliefs (usually informed by data, quantitative or qualitative) about the physical delineation of groups of neurons on an idealized (average) brain. These label sets are the maps of brain mapping, and the corresponding literature associating these regions to functions is analogous to human geography, adding context, annotation, and depth to regional maps.

A.1.1 Contrasting Parcellation with Geography

Due to similarity in both content and nomenclature, it is tempting to make analogies between brain mapping and geographic or other physical mapping. While these are often helpful, it is the purpose of this section to describe where these fields do not align, even if there are obvious correspondences at the surface level. Several of these "obvious correspondences" lead to conceptualizations of brain mapping topics that may not be helpful.

One real object vs. an idealized amalgam from a population: Maps of the earth operate on one shared domain (namely, the earth). This means that though the data these maps portray may vary, the substrate on which they show those data is relatively constant. This is obviously not true for population neuroimaging.
Different subjects have different brains, and these brains have different shapes, sizes, sulci and gyri, etc. Though the topology and relative geometry are usually conserved, i.e., the relative placement of specific regions and their relative sizes/shapes/boundaries are the same for most humans, these generalizations are purposefully vague, since there is no one true brain geometry or even topology. Even identical twins with similar environments have variation in their brain structures [75]. The variation in genetically identical subjects (e.g., cloned mice) is lower than in the general population, so brain maps are making statements about an idealized object that is the amalgam (and, in modern atlases, the numerical average in a chosen sense) of many instances of the class of objects (brains), a class which has intrinsic variation insofar as we can measure.

Boundaries: Often in political maps we are presented with sharp boundaries. Due to the accuracy of modern geospatial measurement and the precision with which human-defined boundaries may be defined, these boundaries are not only sharp in reality but are truly the data being displayed. In other words, the boundary is the meaningful data of political maps: delineation of the contents of one state from another. Neuroimaging does not usually have such maps. Boundaries between brain regions are sometimes sharp (famously, the cyto-architecture and thus the visual texture of varying regions was noted by Brodmann [21]), but such resolution is almost certainly inaccessible for imaging methods. Moreover, in light of the discussion of idealized map domains (brains), the real boundary of these regions is not clearly meaningful. However, due to this uncertainty, the center of each region is much more meaningful than the centers of regions in political maps. This is because (assuming some abstract variation of boundaries) the center is more certain.

A.1.2 Brain Networks

It is common to consider not only data aggregated by parcel but also data from pairwise interactions between parcels. The set of interactions collectively is sometimes referred to as a brain network, and the study thereof as network neuroscience [14]. Often this intuition is extended into a graph-theoretic frame, where the domain is now the discrete and finite set of parcels, and the interactions between the parcels have the additional semantic value of empirically defining the topology (i.e., neighborness) and geometry (relative positioning) of those parcels. This graph topology and geometry may not be reflective of the physical geometry and topology, but instead intrinsically define the domain through the observed interactions. The study of brain networks is known as connectomics [140, 14].

Ideas and intuition about brain networks have a long history. Pre-dating cell theory, physical dissections of both animal and human brains led to primitive concepts of functional segregation in anatomy, starting as early as Galen in the 2nd century AD [30]. Renaissance and Enlightenment scholars observed clear "tubules" in the fibrous white matter, and phrenology through the 19th century, misguided as it was, also added to the intuition that cognitive responsibilities were physically subdivided [26], with overall cognition provided by the interaction of separate parts.
The discovery of cell theory and the eventual discovery of the neuron validated these theories, in spirit, at a micro-scale; new histological techniques led to the famous descriptions by Golgi and Ramón y Cajal of cell bodies, short branching dendrites, and long primary axons. Ramón y Cajal's subsequent electro-physical work eventually led to the neuron doctrine [26]. Implicit in the belief that the neuron is the seat of computation and cognition is the concept that individual neurons interact, forming a network. The degree to which function is distributed may be contested, but clearly segregated cells must, at some level, work together to produce unified outputs. Slightly after these discoveries, Korbinian Brodmann provided similar insight into macroscale architecture.

Intuitions of network interactions at the macro-scale pre-date the neuron doctrine [26], but mechanistic models required first the work of early brain mappers such as Brodmann [21], and Vogt and Vogt [156]. Early exploration of these concepts was largely focused on functional electrophysiology in animal models, e.g., [110]. With the popularization of tracers, work began in earnest exploring corresponding structural features backing or refuting previous network (connectomic) discoveries [62, 89, 11].

The advent of tractography, high angular resolution diffusion imaging (HARDI) and its successors, and more advanced functional imaging methods has likewise contributed to the network neuroscience literature. This is covered in depth in Section 1.1.

Appendix B: A short MRI primer for computer scientists

Magnetic Resonance Imaging, or MRI, is a volumetric imaging method reliant on a strong static magnetic field and orthogonal dynamic fields (gradient fields). The actual physical mechanism depends on the modality: T1w/T2w images are similar to NMR for adipose tissue and water, while diffusion-weighted images (DWI) measure water diffusion propensity in a given direction and at a given strength (and are thus "hyper-spectral"). Functional imaging, or fMRI (time-series imaging), is also possible; the most common variant uses the blood oxygen-level dependent (BOLD) signal, which measures changes in arterial blood flow volumetrically. Other, more esoteric modalities exist, including magnetic susceptibility imaging, thermal imaging, elastography (measuring compression moduli), and magnetic resonance spectroscopy (MRS, equivalent to a spatial NMR measurement, usually on living tissue).

Usually MRI modalities are completely non-invasive, requiring no physical intervention at all. The exception is the injection of contrast agents, which change the response profile of various tissues. These non-invasive and minimally invasive properties are the main attraction for in vivo imaging of living organisms, particularly human imaging.

MR images are collected in frequency space, commonly along a 2D physical plane (the "imaging plane"). Typical imaging resolutions in modern scanners for T1w and T2w images are 1-2 mm, with usually identical voxel dimensions within the imaging plane. The out-of-plane resolution varies, though it is standard in neuroimaging research to use isotropic (i.e., all equal) imaging dimensions. Fields of view depend on settings, scanner, and coils, but typical head scans have around 20-25 cm on a side. For example, a standard atlas, MNI-152 Nonlinear, in 1 mm isotropic resolution has dimensions 182 × 218 × 182, in a single channel (with float32 non-negative values).
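As a small illustration of these conventions, a volume and its geometry can be inspected with the nibabel library (the file path here is hypothetical):

import nibabel as nib

img = nib.load("mni152_nonlin_t1_1mm.nii.gz")  # hypothetical path to an atlas volume
print(img.shape)                 # e.g. (182, 218, 182) for MNI-152 at 1 mm isotropic
print(img.header.get_zooms())    # voxel sizes in mm, e.g. (1.0, 1.0, 1.0)
print(img.affine)                # 4x4 voxel-to-world coordinate transform
data = img.get_fdata()           # the volume itself, as a float array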
More nuanced modalities often have multiple channels or timepoints, as well as other associated data, such as b-value/vector tables or time-series event data.

Due to their collection in frequency space, MR images are susceptible to frequency artifacts such as Gibbs ringing and aliasing. Image reconstruction (Fourier to image space transformation) is commonly done on the imaging machine. Reconstruction algorithms vary by machine vendor and model. Subject motion is also problematic, particularly in long sequences (such as DWI). DWI and fMRI are "Echo Planar Imaging" (EPI) methods, which have "ghost" artifacts due to the nature of the collection method. Finally, the magnetic susceptibility of most organisms is heterogeneous. Strong non-linear deformations can be found near sharp transitions, e.g., the sinuses versus bone versus CSF.

Over the course of four decades of imaging, computer vision methods have been developed for each of these problems. While few are completely solved, tools and techniques exist to ameliorate their effects. Further, for brain imaging, a menagerie of tools for tissue classification, surface reconstruction, registration/atlas fitting, and, in the case of DWI, local diffusion model fitting (usually per-voxel) have been constructed and distributed [46, 47, 45, 56, 52, 138].

The raw data are technically in frequency space, but often are automatically transformed by the software on the scanner unit itself. The resulting data are volumetric scalar fields, sometimes with many channels (parameterized by angle) or timepoints. These are often reviewed for head motion or other non-recoverable artifacts, then (pre-)processed with automated standard tools to correct known artifacts, bring them to a standard atlas, register them to one another, or recover surfaces (e.g., the WM/GM interface), which are then registered. This provides the correspondence through which the actual analysis is performed.

Diffusion-weighted imaging (DWI or dMRI) is a subset of MRI that measures the propensity for water to diffuse within a volume or slice. This depends on two critical scan parameters: the gradient angle (b-vector) and the gradient strength (b-value, grouped into "shells"). In short, these parameters determine the strength at which the tissue is queried for diffusion, and in which direction. Scans with large numbers of angles (> 60) and a small number of shells (1 or 2) are referred to as HARDI scans (High Angular Resolution Diffusion Images). An alternative paradigm known as Diffusion Spectrum Imaging (DSI) instead uses a large number of shells.

Local model fitting is a common task for diffusion imaging [150]. This takes measurements at each voxel conditioned on angle and gradient strength (a field on ℝ³ × S² × ℝ₊) to parameterized models at each voxel (a field ℝ³ → P, for a model parameter space P). A common functional basis for the diffusion propensity (non-negative functions on the sphere) is the spherical harmonics. From these, tracts are simulated, often using a mixture of differential equation solving (using each voxel model as a diffusion propagator) and heuristics [56, 138].

References

[1] Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.
[2] Iman Aganj et al. Reconstruction of the orientation distribution function in single- and multiple-shell q-ball imaging within constant solid angle. Magnetic Resonance in Medicine, 64(2):554–566, 2010.
[3] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
[4] Katrin Amunts and Karl Zilles. Architectonic mapping of the human brain beyond Brodmann. Neuron, 88(6):1086–1107, 2015.
[5] Jesper LR Andersson, Stefan Skare, and John Ashburner. How to correct susceptibility distortions in spin-echo echo-planar images: application to diffusion tensor imaging. NeuroImage, 20(2):870–888, 2003.
[6] Jesper LR Andersson and Stamatios N Sotiropoulos. An integrated approach to correction for off-resonance effects and subject movement in diffusion MR imaging. NeuroImage, 125:1063–1078, 2016.
[7] Salim Arslan. Connectivity-driven parcellation methods for the human cerebral cortex.
[8] Salim Arslan, Sofia Ira Ktena, Antonios Makropoulos, Emma C Robinson, Daniel Rueckert, and Sarah Parisot. Human brain mapping: A systematic comparison of parcellation methods for the human cerebral cortex. NeuroImage, 2017.
[9] Salim Arslan, Sofia Ira Ktena, Antonios Makropoulos, Emma C Robinson, Daniel Rueckert, and Sarah Parisot. Human brain mapping: A systematic comparison of parcellation methods for the human cerebral cortex. NeuroImage, 170:5–30, 2018.
[10] Christopher Baldassano, Diane M Beck, and Li Fei-Fei. Parcellating connectivity in spatial maps. PeerJ, 3:e784, 2015.
[11] C Baleydier and F Mauguière. Network organization of the connectivity between parietal area 7, posterior cingulate cortex and medial pulvinar nucleus: a double fluorescent tracer study in monkey. Experimental Brain Research, 66(2):385–393, 1987.
[12] Albert-László Barabási et al. Network Science. Cambridge University Press, 2016.
[13] Peter J Basser, Sinisa Pajevic, Carlo Pierpaoli, Jeffrey Duda, and Akram Aldroubi. In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine, 44(4):625–632, 2000.
[14] Danielle S Bassett and Olaf Sporns. Network neuroscience. Nature Neuroscience, 20(3):353, 2017.
[15] Mark F Bear, Barry W Connors, and Michael A Paradiso. Neuroscience, volume 2. Lippincott Williams & Wilkins, 2007.
[16] Timothy EJ Behrens et al. Probabilistic diffusion tractography with multiple fibre orientations: What can we gain? NeuroImage, 34(1):144–155, 2007.
[17] S Bisdas, DE Bohning, N Bešenski, JS Nicholas, and Z Rumboldt. Reproducibility, interrater agreement, and age-related changes of fractional anisotropy measures at 3T in healthy subjects: effect of the applied b-value. American Journal of Neuroradiology, 29(6):1128–1133, 2008.
[18] David M Blei and Peter I Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12(Aug):2461–2488, 2011.
[19] Stefano B Blumberg, Ryutaro Tanno, Iasonas Kokkinos, and Daniel C Alexander. Deeper image quality transfer: Training low-memory neural networks for 3D images. In MICCAI, pages 118–125. Springer, 2018.
[20] Joseph E Bogen and GM Bogen. Wernicke's region – where is it? Annals of the New York Academy of Sciences, (280):834–843, 1976.
[21] Korbinian Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. Barth, 1909.
[22] Randy L Buckner, Jorge Sepulcre, Tanveer Talukdar, Fenna M Krienen, Hesheng Liu, Trey Hedden, Jessica R Andrews-Hanna, Reisa A Sperling, and Keith A Johnson. Cortical hubs revealed by intrinsic functional connectivity: mapping, assessment of stability, and relation to Alzheimer's disease. Journal of Neuroscience, 29(6):1860–1873, 2009.
[23] Elisa Canu, Federica Agosta, and Massimo Filippi. A selective review of structural connectivity abnormalities of schizophrenic patients at different stages of the disease. Schizophrenia Research, 161(1):19–28, 2015.
[24] Hengyi Cao, Sarah C McEwen, Jennifer K Forsyth, Dylan G Gee, Carrie E Bearden, Jean Addington, Bradley Goodyear, Kristin S Cadenhead, Heline Mirzakhanian, Barbara A Cornblatt, et al. Toward leveraging human connectomic data in large consortia: generalizability of fMRI-based brain graphs across sites, sessions, and paradigms. Cerebral Cortex, 29(3):1263–1279, 2018.
[25] M Catani. The functional anatomy of white matter: from postmortem dissections to in vivo virtual tractography. Diffusion MRI: Theory, Methods and Applications, pages 5–18, 2010.
[26] Marco Catani, Michel Thiebaut de Schotten, David Slater, and Flavio Dell'Acqua. Connectomic approaches before the connectome. NeuroImage, 80:2–13, 2013.
[27] Jiayu Chen, Jingyu Liu, Vince D Calhoun, Alejandro Arias-Vasquez, Marcel P Zwiers, Cota Navin Gupta, Barbara Franke, and Jessica A Turner. Exploration of scanning effects in multi-site structural MRI studies. Journal of Neuroscience Methods, 230:37–50, 2014.
[40] Victor M Eguiluz, Dante R Chialvo, Guillermo A Cecchi, Marwan Baliki, and A Vania Apkarian. Scale-free brain functional networks. Physical review letters, 94(1):018102, 2005. [41] Simon B Eickho, Bertrand Thirion, Ga el Varoquaux, and Danilo Bzdok. Connectivity-based parcellation: Critique and implications. Human Brain Mapping, 36(12):4771{4792, 2015. [42] Alan C Evans, D Louis Collins, SR Mills, ED Brown, RL Kelly, and Terry M Peters. 3d statistical neuroanatomical models from 305 mri volumes. In 1993 IEEE conference record nuclear science symposium and medical imaging conference, pages 1813{1817. IEEE, 1993. 103 [43] Lingzhong Fan, Hai Li, Junjie Zhuo, Yu Zhang, Jiaojian Wang, Liangfu Chen, Zhengyi Yang, Congying Chu, Sangma Xie, Angela R Laird, et al. The human brainnetome atlas: a new brain atlas based on connectional architecture. Cerebral cortex, 26(8):3508{3526, 2016. [44] Rogier A Feis et al. Ica-based artifact removal diminishes scan site dierences in multi-center resting-state fMRI. Frontiers in Neuroscience, 9:395, 2015. [45] Bruce Fischl. Freesurfer. NeuroImage, 2(62):774{781, 2012. [46] Bruce Fischl et al. High-resolution intersubject averaging and a coordinate system for the cortical surface. Human Brain Mapping, 8(4):272{284, 1999. [47] Bruce Fischl, Andr e Van Der Kouwe, Christophe Destrieux, Eric Halgren, Florent S egonne, David H Salat, Evelina Busa, Larry J Seidman, Jill Goldstein, David Kennedy, et al. Auto- matically parcellating the human cerebral cortex. Cerebral cortex, 14(1):11{22, 2004. [48] Alex Fornito, Andrew Zalesky, and Michael Breakspear. The connectomics of brain disorders. Nature Reviews Neuroscience, 16(3):159, 2015. [49] Jean-Philippe Fortin et al. Harmonization of multi-site diusion tensor imaging data. Neu- roimage, 161:149{170, 2017. [50] Robert P Freckleton. On the misuse of residuals in ecology: regression of residuals vs. multiple regression. Journal of Animal Ecology, 71(3):542{545, 2002. [51] William T Freeman, Edward H Adelson, et al. The design and use of steerable lters. IEEE Transactions on Pattern analysis and machine intelligence, 13(9):891{906, 1991. [52] Karl J Friston et al. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping, 2(4):189{210, 1994. [53] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran cois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096{2030, 2016. [54] Eleftherios Garyfallidis. Towards an accurate brain tractography. PhD thesis, University of Cambridge, 2013. [55] Eleftherios Garyfallidis et al. Quickbundles, a method for tractography simplication. Fron- tiers in neuroscience, 6:175, 2012. [56] Eleftherios Garyfallidis et al. Dipy, a library for the analysis of diusion MRI data. Front. Neuroinform, 8(8), 2014. [57] Eleftherios Garyfallidis et al. Recognition of white matter bundles using local and global streamline-based registration and clustering. NeuroImage, 2017. [58] Marco Giannelli, Mirco Cosottini, Maria Chiara Michelassi, Guido Lazzarotti, Gina Belmonte, Carlo Bartolozzi, and Mauro Lazzeri. Dependence of brain DTI maps of fractional anisotropy and mean diusivity on the number of diusion weighting directions. Journal of Applied Clinical Medical Physics, 11(1):176{190, 2010. [59] Gabriel Girard et al. Towards quantitative connectivity analysis: reducing tractography biases. Neuroimage, 98:266{278, 2014. 
104 [60] Matthew F Glasser, Timothy S Coalson, Emma C Robinson, Carl D Hacker, John Harwell, Essa Yacoub, Kamil Ugurbil, Jesper Andersson, Christian F Beckmann, Mark Jenkinson, et al. A multi-modal parcellation of human cerebral cortex. Nature, 536(7615):171{178, 2016. [61] Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Andersson, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R Poli- meni, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neu- roimage, 80:105{124, 2013. [62] Patricia S Goldman-Rakic and Michael L Schwartz. Interdigitation of contralateral and ipsi- lateral columnar projections to frontal association cortex in primates. Science, 216(4547):755{ 757, 1982. [63] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016. [64] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672{2680, 2014. [65] Hayit Greenspan, Serge Belongie, Rodney Goodman, Pietro Perona, Subrata Rakshit, and Charles H Anderson. Overcomplete steerable pyramid lters and rotation invariance. 1994. [66] Yosef Grodzinsky and Katrin Amunts. Broca's region. Oxford University Press, 2006. [67] Boris Gutman et al. Registering cortical surfaces based on whole-brain structural connectivity and continuous connectivity analysis. In MICCAI 2014, pages 161{168. Springer, 2014. [68] Patric Hagmann, Leila Cammoun, Xavier Gigandet, Reto Meuli, Christopher J Honey, Van J Wedeen, and Olaf Sporns. Mapping the structural core of human cerebral cortex. PLoS biology, 6(7):e159, 2008. [69] Peter Hall and James Stephen Marron. Extent to which least-squares cross-validation min- imises integrated square error in nonparametric density estimation. Probability Theory and Related Fields, 74(4):567{581, 1987. [70] Deborah L Harrington, Mikail Rubinov, Sally Durgerian, Lyla Mourany, Christine Reece, Katherine Koenig, Ed Bullmore, Jerey D Long, Jane S Paulsen, PREDICT-HD investiga- tors of the Huntington Study Group, et al. Network topology and functional connectivity disturbances precede the onset of huntington's disease. Brain, 138(8):2332{2346, 2015. [71] Colin Hawco, Joseph D Viviano, Soa Chavez, Erin W Dickie, Navona Calarco, Peter Kochunov, Miklos Argyelan, Jessica Turner, Anil K Malhotra, Robert W Buchanan, et al. A longitudinal human phantom reliability study of multi-center T1-weighted, DTI, and resting state fMRI data. Psychiatry Research: Neuroimaging, 2018. [72] AE Hillis, Peter B Barker, NJ Beauchamp, Bradford D Winters, M Mirski, and RJ Wityk. Restoring blood pressure reperfused wernicke's area and improved language. Neurology, 56(5):670{672, 2001. [73] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125{1134, 2017. 105 [74] N Jahanshad et al. Alzheimer's Disease Neuroimaging I (2013) genome-wide scan of healthy human connectome discovers SPON1 gene variant in uencing dementia severity. Proc Natl Acad Sci USA, 110(12):4768{4773. [75] Neda Jahanshad, Agatha D Lee, Marina Barysheva, Katie L McMahon, Greig I de Zubicaray, Nicholas G Martin, Margaret J Wright, Arthur W Toga, and Paul M Thompson. Genetic in uences on brain asymmetry: a dti study of 374 twins and siblings. 
Neuroimage, 52(2):455{ 469, 2010. [76] Matthew Johnson et al. Analyzing hogwild parallel Gaussian Gibbs sampling. In Advances in Neural Information Processing Systems, pages 2715{2723, 2013. [77] W Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch eects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118{127, 2007. [78] Kesshi M Jordan et al. Cluster condence index: A streamline-wise pathway reproducibility metric for diusion-weighted mri tractography. Journal of Neuroimaging, 28(1):64{69, 2018. [79] Jorge Jovicich, Silvester Czanner, Douglas Greve, Elizabeth Haley, Andre van Der Kouwe, Randy Gollub, David Kennedy, Franz Schmitt, Gregory Brown, James MacFall, et al. Re- liability in multi-site structural MRI studies: eects of gradient non-linearity correction on phantom and human data. Neuroimage, 30(2):436{443, 2006. [80] Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2nd International Conference on Computer, Control and Communication, pages 1{6. IEEE, 2009. [81] Suheyla Cetin Karayumak et al. Retrospective harmonization of multi-site diusion MRI data acquired with dierent acquisition parameters. Neuroimage, 184:180{200, 2019. [82] Suheyla Cetin Karayumak, Marek Kubicki, and Yogesh Rathi. Harmonizing diusion mri data across magnetic eld strengths. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 116{124. Springer, 2018. [83] Sinead Kelly, N Jahanshad, A Zalesky, P Kochunov, I Agartz, C Alloza, OA Andreassen, C Arango, N Banaj, S Bouix, et al. Widespread white matter microstructural dierences in schizophrenia across 4322 individuals: results from the ENIGMA schizophrenia DTI working group. Molecular Psychiatry, 2017. [84] Charles Kemp et al. Learning systems of concepts with an innite relational model. 2006. [85] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [86] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [87] Arno Klein, Jason Tourville, et al. 101 labeled brain images and a consistent human cortical labeling protocol. Front. Neurosci, 6(171):10{3389, 2012. [88] Simon Koppers et al. Spherical harmonic residual network for diusion signal harmonization. arXiv preprint arXiv:1808.01595, 2018. [89] HGJM Kuypers, CE Catsman-Berrevoets, and RE Padt. Retrograde anoxal transport of uorescent substances in the rat's forebrain. Neuroscience Letters, 6(2-3):127{133, 1977. 106 [90] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Informa- tion Processing Systems, pages 5969{5978, 2017. [91] Denis Le Bihan et al. Diusion tensor imaging: concepts and applications. Journal of magnetic resonance imaging, 13(4):534{546, 2001. [92] Changhong Li, Biao Huang, Ruibin Zhang, Qing Ma, Wanqun Yang, Lijuan Wang, Limin Wang, Qin Xu, Jieying Feng, Liqing Liu, et al. Impaired topological architecture of brain structural networks in idiopathic parkinson's disease: a dti study. Brain imaging and behavior, 11(1):113{128, 2017. [93] Anton Lord, Stefan Ehrlich, Viola Borchardt, Daniel Geisler, Maria Seidel, Stefanie Huber, Julia Murr, and Martin Walter. Brain parcellation choice aects disease-related topology dif- ferences increasingly from global to local network levels. Psychiatry Research: Neuroimaging, 249:12{19, 2016. 
[94] Louis-David Lord, Paul Expert, Henrique M Fernandes, Giovanni Petri, Tim J Van Hartevelt, Francesco Vaccarino, Gustavo Deco, Federico Turkheimer, and Morten L Kringelbach. In- sights into brain architectures from the homological scaolds of functional connectivity net- works. Frontiers in systems neuroscience, 10:85, 2016. [95] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015. [96] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91{110, 2004. [97] Laurens van der Maaten and Georey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579{2605, 2008. [98] Vincent A Magnotta, Joy T Matsui, Dawei Liu, Hans J Johnson, Jerey D Long, Bradley D Bolster Jr, Bryon A Mueller, Kelvin Lim, Susumu Mori, Karl G Helmer, et al. Multicenter reliability of diusion tensor imaging. Brain Connectivity, 2(6):345{355, 2012. [99] Klaus H Maier-Hein et al. The challenge of mapping the human connectome based on diusion tractography. Nature Communications, 8(1):1349, 2017. [100] Paul M Matthews and Adam Hampshire. Clinical concepts emerging from fmri functional connectomics. Neuron, 91(3):511{528, 2016. [101] John Mazziotta, Arthur Toga, Alan Evans, Peter Fox, Jack Lancaster, Karl Zilles, Roger Woods, Tomas Paus, Gregory Simpson, Bruce Pike, et al. A probabilistic atlas and ref- erence system for the human brain: International consortium for brain mapping (icbm). Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 356(1412):1293{1322, 2001. [102] Peter McColgan, Kiran K Seunarine, Adeel Razi, James H Cole, Sarah Gregory, Alexandra Durr, Raymund AC Roos, Julie C Stout, Bernhard Landwehrmeyer, Rachael I Scahill, et al. Selective vulnerability of rich club brain regions is an organizational principle of structural connectivity loss in huntington's disease. Brain, 138(11):3327{3344, 2015. 107 [103] John D Medaglia, Mary-Ellen Lynall, and Danielle S Bassett. Cognitive network neuroscience. Journal of cognitive neuroscience, 27(8):1471{1491, 2015. [104] Karla L Miller et al. Multimodal population brain imaging in the uk biobank prospective epidemiological study. Nature neuroscience, 19(11):1523, 2016. [105] Hengameh Mirzaalian et al. Harmonizing diusion MRI data across multiple sites and scan- ners. In MICCAI, pages 12{19. Springer, 2015. [106] Hengameh Mirzaalian et al. Inter-site and inter-scanner diusion MRI data harmonization. Neuroimage, 135:311{323, 2016. [107] Hengameh Mirzaalian et al. Multi-site harmonization of diusion MRI data in a registration framework. Brain Imaging and Behavior, 12(1):284{295, 2018. [108] Jesper Moller and Rasmus Plenge Waagepetersen. Statistical inference and simulation for spatial point processes. CRC Press, 2003. [109] Susumu Mori, Kenichi Oishi, Hangyi Jiang, Li Jiang, Xin Li, Kazi Akhter, Kegang Hua, Andreia V Faria, Asif Mahmood, Roger Woods, et al. Stereotaxic white matter atlas based on diusion tensor imaging in an icbm template. Neuroimage, 40(2):570{582, 2008. [110] Vernon B Mountcastle. Modality and topographic properties of single neurons of cat's somatic sensory cortex. Journal of neurophysiology, 20(4):408{434, 1957. [111] Daniel Moyer et al. A continuous model of cortical connectivity. In MICCAI, pages 157{165. Springer, 2016. [112] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. 
Invariant representations without adversarial training. In Advances in Neural Information Processing Systems 31, pages 9102{9111. Curran Associates, Inc., 2018. [113] Daniel Moyer, Boris A Gutman, Joshua Faskowitz, Neda Jahanshad, and Paul M Thompson. An empirical study of continuous connectivity degree sequence equivalents. arXiv preprint arXiv:1611.06197, 2016. [114] Daniel Moyer, Boris A Gutman, Joshua Faskowitz, Neda Jahanshad, and Paul M Thompson. Continuous representations of brain connectivity using spatial point processes. Medical image analysis, 41:32{39, 2017. [115] Daniel Moyer, Boris A Gutman, Neda Jahanshad, and Paul M Thompson. Product space de- compositions for continuous representations of brain connectivity. In International Workshop on Machine Learning in Medical Imaging, pages 353{361. Springer, 2017. [116] Daniel Moyer, Boris A Gutman, Neda Jahanshad, and Paul M Thompson. A restaurant process mixture model for connectivity based parcellation of the cortex. In International Conference on Information Processing in Medical Imaging, pages 336{347. Springer, 2017. [117] Daniel Moyer, Greg Ver Steeg, Chantal MW Tax, and Paul M Thompson. Scanner invariant representations for diusion mri harmonization. arXiv preprint arXiv:1904.05375, 2019. [118] Daniel C Moyer, Paul Thompson, and Greg Ver Steeg. Measures of tractography conver- gence. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 295{307. Springer, 2018. 108 [119] Lipeng Ning, Elisenda Bonet-Carne, Francesco Grussu, Farshid Sepehrband, Enrico Kaden, Jelle Veraart, Stefano B Blumberg, Can Son Khoo, Marco Palombo, Jaume Coll-Font, et al. Multi-shell diusion MRI harmonisation and enhancement challenge (MUSHAC): progress and results. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 217{224. Springer, 2018. [120] John Nolte. The human brain: an introduction to its functional anatomy. 2002. [121] Lauren J O'Donnell and Carl-Fredrik Westin. Automatic tractography segmentation using a high-dimensional white matter atlas. IEEE Trans on Med Imaging, 26(11), 2007. [122] Elisabetta Pagani, Jochen G Hirsch, Petra JW Pouwels, Mark A Horseld, Elisabetta Perego, Achim Gass, Stefan D Roosendaal, Frederik Barkhof, Federica Agosta, Marco Rovaris, et al. Intercenter dierences in diusion tensor MRI acquisition. Journal of Magnetic Resonance Imaging, 31(6):1458{1468, 2010. [123] Nico Dario Papinutto, Francesca Maule, and Jorge Jovicich. Reproducibility and biases in high eld brain diusion MRI: An evaluation of acquisition and analysis variables. Magnetic Resonance Imaging, 31(6):827{839, 2013. [124] Sarah Parisot et al. Tractography-driven groupwise multi-scale parcellation of the cortex. In International Conference on Information Processing in Medical Imaging, pages 600{612. Springer, 2015. [125] Jim Pitman et al. Combinatorial stochastic processes. 2002. [126] Leslie Gross Portney and Mary P Watkins. Statistical measures of reliability. Foundations of Clinical Research: Applications to Practice, 2:557{586, 2000. [127] Gautam Prasad, Talia M Nir, Arthur W Toga, and Paul M Thompson. Tractography density and network measures in alzheimer's disease. In Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on, pages 692{695. IEEE, 2013. [128] Stephen W Raudenbush. Random eects models. The handbook of research synthesis, 421, 1994. [129] Rafael Romero-Garcia, Mercedes Atienza, Line H Clemmensen, and Jose L Cantero. 
Eects of network resolution on topological properties of human neocortex. Neuroimage, 59(4):3522{ 3532, 2012. [130] Mikail Rubinov and Olaf Sporns. Complex network measures of brain connectivity: uses and interpretations. NeuroImage, 52(3):1059{1069, 2010. [131] SU Rudrapatna, GD Parker, J Roberts, and DK Jones. Can we correct for interactions between subject motion and gradient-nonlinearity in diusion MRI. In Proc. Int. Soc. Mag. Reson. Med., volume 1206, 2018. [132] David E Rumelhart, Georey E Hinton, and Ronald J Williams. Learning internal represen- tations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985. [133] Theodore D Satterthwaite and Christos Davatzikos. Towards an individualized delineation of functional neuroanatomy. Neuron, 87(3):471{473, 2015. 109 [134] Robert E Smith et al. Anatomically-constrained tractography: improved diusion MRI streamlines tractography through eective use of anatomical information. NeuroImage, 62(3):1924{1938, 2012. [135] Robert E Smith et al. SIFT2: Enabling dense quantitative assessment of brain white matter connectivity using streamlines tractography. Neuroimage, 119, 2015. [136] Stefano Soatto and Alessandro Chiuso. Visual representations: Dening properties and deep approximations. arXiv preprint arXiv:1411.7676, 2014. [137] Stamatios N Sotiropoulos, Saad Jbabdi, Junqian Xu, Jesper L Andersson, Steen Moeller, Edward J Auerbach, Matthew F Glasser, Moises Hernandez, Guillermo Sapiro, Mark Jenk- inson, et al. Advances in diusion MRI acquisition and processing in the human connectome project. Neuroimage, 80:125{143, 2013. [138] Stamatios N Sotiropoulos and Andrew Zalesky. Building connectomes using diusion MRI: Why, how and but. NMR in Biomedicine, 2017. [139] Olaf Sporns. Networks of the Brain. 2010. [140] Olaf Sporns, Giulio Tononi, and Rolf K otter. The human connectome: a structural description of the human brain. PLoS computational biology, 1(4):e42, 2005. [141] Kaustubh Supekar, Vinod Menon, Daniel Rubin, Mark Musen, and Michael D Greicius. Network analysis of intrinsic functional brain connectivity in alzheimer's disease. PLoS com- putational biology, 4(6):e1000100, 2008. [142] Jean Talairach. Co-planar stereotaxic atlas of the human brain-3-dimensional proportional system. An approach to cerebral imaging, 1988. [143] Ryutaro Tanno, Daniel E Worrall, Aurobrata Ghosh, Enrico Kaden, Stamatios N Sotiropou- los, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer with CNNs: Exploring uncertainty in dMRI super-resolution. In MICCAI, pages 611{619. Springer, 2017. [144] Chantal MW Tax, Francesco Grussu, Enrico Kaden, Lipeng Ning, Umesh Rudrapatna, John Evans, Samuel St-Jean, Alexander Leemans, Simon Koppers, Dorit Merhof, et al. Cross- scanner and cross-protocol diusion mri data harmonisation: A benchmark database and evaluation of algorithms. Neuroimage, 2019. [145] Chantal MW Tax, Francesco Grussu, Enrico Kaden, Lipeng Ning, Umesh Rudrapatna, John Evans, Samuel St-Jean, Alexander Leemans, Santi Puch, Matt Rowe, et al. Cross-vendor and cross-protocol harmonisation of diusion MRI data: a comparative study. In International Symposium on Magnetic Resonance in Medicine (Paris), volume 471, 2018. [146] Cibu Thomas, Q Ye Frank, M Okan Irfanoglu, Pooja Modi, Kadharbatcha S Saleem, David A Leopold, and Carlo Pierpaoli. Anatomical accuracy of brain connections derived from diusion mri tractography is inherently limited. 
Proceedings of the National Academy of Sciences, 111(46):16574{16579, 2014. [147] BT Thomas Yeo, Fenna M Krienen, Jorge Sepulcre, Mert R Sabuncu, Danial Lashkari, Marisa Hollinshead, Joshua L Roman, Jordan W Smoller, Lilla Z ollei, Jonathan R Polimeni, et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology, 106(3):1125{1165, 2011. 110 [148] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000. [149] Arthur W Toga and John C Mazziotta. Brain mapping: the methods. Academic press, 2002. [150] J-Donald Tournier, Chun-Hung Yeh, Fernando Calamante, Kuan-Hung Cho, Alan Connelly, and Ching-Po Lin. Resolving crossing bres using constrained spherical deconvolution: vali- dation using diusion-weighted imaging phantom data. NeuroImage, 42(2):617{625, 2008. [151] Martijn P Van Den Heuvel and Olaf Sporns. Rich-club organization of the human connectome. Journal of Neuroscience, 31(44):15775{15786, 2011. [152] Martijn P van den Heuvel, Olaf Sporns, Guusje Collin, Thomas Scheewe, Ren e CW Mandl, Wiepke Cahn, Joaqu n Go~ ni, Hilleke E Hulsho Pol, and Ren e S Kahn. Abnormal rich club organization and functional brain dynamics in schizophrenia. JAMA psychiatry, 70(8):783{ 792, 2013. [153] David C Van Essen, WU-Minn HCP Consortium, et al. The WU-Minn human connectome project: an overview. NeuroImage, 80:62{79, 2013. [154] David C Van Essen, Matthew F Glasser, Donna L Dierker, John Harwell, and Timothy Coalson. Parcellations and hemispheric asymmetries of human cerebral cortex analyzed on surface-based atlases. Cerebral Cortex, 22(10):2241{2262, 2012. [155] Bernadette CM Van Wijk, Cornelis J Stam, and Andreas Daertshofer. Comparing brain networks of dierent size and connectivity density using graph theory. PloS one, 5(10):e13701, 2010. [156] C ecile Vogt and Oskar Vogt. Die vergleichend-architektonische und die vergleichend- reizphysiologische felderung der grohirnrinde unter besonderer ber ucksichtigung der men- schlichen. Naturwissenschaften, 14(50):1190{1194, 1926. [157] Christian Vollmar et al. Identical, but not the same: intra-site and inter-site reproducibility of fractional anisotropy measures on two 3.0 T scanners. Neuroimage, 51(4):1384{1394, 2010. [158] Jinhui Wang et al. Parcellation-dependent small-world brain functional networks: A resting- state fMRI study. Human Brain Mapping, 30(5):1511{1523, 2009. [159] Demian Wassermann et al. Unsupervised white matter ber clustering and tract probabil- ity map generation: Applications of a gaussian process framework for white matter bers. NeuroImage, 51(1):228{241, 2010. [160] Tonya White et al. Global white matter abnormalities in schizophrenia: a multisite diusion tensor imaging study. Schizophrenia bulletin, 37(1):222{232, 2009. [161] Mingrui Xia, Jinhui Wang, and Yong He. Brainnet viewer: a network visualization tool for human brain connectomics. PloS one, 8(7):e68910, 2013. [162] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pages 585{596, 2017. 111 [163] Fang-Cheng Yeh, Sandip Panesar, David Fernandes, Antonio Meola, Masanori Yoshino, Juan C Fernandez-Miranda, Jean M Vettel, and Timothy Verstynen. Population-averaged atlas of the macroscale human structural connectome and its network topology. NeuroImage, 178:57{68, 2018. [164] BT Thomas Yeo et al. 
The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106(3), 2011. [165] Andrew Zalesky, Alex Fornito, Ian H Harding, Luca Cocchi, Murat Y ucel, Christos Pantelis, and Edward T Bullmore. Whole-brain anatomical networks: does the choice of nodes matter? Neuroimage, 50(3):970{983, 2010. [166] Artemis Zavaliangos-Petropulu, Talia M Nir, Sophia I Thomopoulos, et al. Diusion MRI indices and their relation to cognitive impairment in brain aging: The updated multi-protocol approach in ADNI3. bioRxiv, page 476721, 2018. [167] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair repre- sentations. In International Conference on Machine Learning, pages 325{333, 2013. [168] Liang Zhan et al. How do spatial and angular resolution aect brain connectivity maps from diusion MRI? In Biomedical Imaging (ISBI), 2012 9th IEEE International Symposium on, pages 1{4. IEEE, 2012. [169] Liang Zhan, Alex D Leow, Neda Jahanshad, Ming-Chang Chiang, Marina Barysheva, Agatha D Lee, Arthur W Toga, Katie L McMahon, Greig I De Zubicaray, Margaret J Wright, et al. How does angular resolution aect diusion imaging measures? Neuroim- age, 49(2):1357{1371, 2010. [170] Liang Zhan, Bryon A Mueller, Neda Jahanshad, Yan Jin, Christophe Lenglet, Essa Yacoub, Guillermo Sapiro, Kamil Ugurbil, Noam Harel, Arthur W Toga, et al. Magnetic resonance eld strength eects on diusion measures and brain connectivity networks. Brain Connectivity, 3(1):72{86, 2013. [171] Dengsheng Zhang and Guojun Lu. Review of shape representation and description techniques. Pattern recognition, 37(1):1{19, 2004. [172] Fan Zhang, Ye Wu, Isaiah Norton, Laura Rigolo, Yogesh Rathi, Nikos Makris, and Lauren J O'Donnell. An anatomically curated ber clustering white matter atlas for consistent white matter tract parcellation across the lifespan. NeuroImage, 2018. [173] Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain MR images through a hidden markov random eld model and the expectation-maximization algorithm. Medical Imaging, IEEE Transactions on, 20(1):45{57, 2001. [174] Alyssa H Zhu, Daniel C Moyer, Talia M Nir, Paul M Thompson, and Neda Jahanshad. Challenges and opportunities in dMRI data harmonization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 157{172. Springer, 2018. [175] Xi-Nian Zuo et al. An open science resource for establishing reliability and reproducibility in functional connectomics. Scientic Data, 1, 2014. 112
Abstract
Diffusion Magnetic Resonance Imaging (dMRI) is an imaging modality that, conditional on direction, queries the propensity for water diffusion in living tissue. These images are desirable for studying tissues with anisotropic diffusion profiles, e.g. the white matter of the human brain. In this manuscript I describe two problems at the interface between diffusion imaging and computational science, and our proposed solutions for those problems, reconciling differences between current methodology, existing theory, and practical constraints.

The first part of this manuscript describes methodology from the intersection of network science and neuroscience. Previous literature has worked to estimate white matter axon bundles (fasciculi) from dMRI. Combined with segmentations of the grey matter, this leads intuitively to the construction of macro-scale brain networks, where nodes are cortical regions and edges are estimates of axonal connections. The work I present here deconstructs this progression, reconciling the inherently spatially continuous cortex with discrete network theory. The resulting model is a generalization of discrete graphs, using point process theory to create continuous interaction densities (continuum adjacency functions).

The second part of this manuscript describes a method for learning representations of data that are invariant under changes in specified outside factors. These can be applied to diffusion imaging data, which in particular suffers from instrument/observer biases ("site"/"scanner" biases). Due to the relative complexity of the domain, modelling the particular effects of each instrument may be difficult
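To make the first part's central object concrete, here is a minimal sketch in illustrative notation (the symbols Omega, lambda, and E_i are my own shorthand here, not necessarily the thesis's exact notation). Streamline endpoint pairs are treated as a realization of a point process on the product of the cortical surface with itself; any discrete connectome is then recovered from the process's intensity by integrating over pairs of parcels:

\[
  \lambda : \Omega \times \Omega \to \mathbb{R}_{\ge 0},
  \qquad
  C(E_i, E_j) \;=\; \int_{E_i}\!\int_{E_j} \lambda(x, y)\, \mathrm{d}y\, \mathrm{d}x,
\]

where \(\Omega\) denotes the cortical surface, \(\lambda\) the continuous interaction density (the continuum adjacency function), and \(C(E_i, E_j)\) the expected edge weight between parcels \(E_i\) and \(E_j\) of a given parcellation. Under this reading, a parcellated connectome is a binned summary of \(\lambda\) rather than the primary object of study.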
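For the second part, a minimal sketch of one common route to such invariance, assuming PyTorch: a conditional variational autoencoder whose decoder receives the nuisance label c (e.g. a one-hot site indicator), so that reconstruction does not pressure the learned code z to retain c. This is an illustrative simplification, not the exact objective developed in the thesis; all names and dimensions below are hypothetical.

import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Toy conditional VAE: the decoder sees the nuisance label c,
    so the code z is not forced to carry site/scanner information."""
    def __init__(self, x_dim=64, c_dim=4, z_dim=8, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def loss(self, x, c):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_hat = self.dec(torch.cat([z, c], dim=-1))           # decoder conditions on c
        recon = (x_hat - x).pow(2).sum(-1).mean()             # Gaussian reconstruction
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return recon + kl

# Hypothetical usage: 32 samples, 64 features, 4 one-hot site labels.
model = ConditionalVAE()
x = torch.randn(32, 64)
c = torch.eye(4)[torch.randint(0, 4, (32,))]
print(model.loss(x, c).item())

Conditioning the decoder on c is only the first ingredient; stronger invariance typically adds an explicit penalty on the residual dependence between z and c.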
representation learning