Data Modeling Approaches for Continuous Neuroimaging Genetics
by
Qifan Yang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND
BIOINFORMATICS)
December 2024
Copyright 2024 Qifan Yang
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Neurodegenerative Disorders and Alzheimer’s Disease
1.2 Genetic Association Analysis
1.3 Dynamic Data Modeling Framework
1.4 Automatic Hypothesis Testing Framework
1.5 Contributions
1.6 Outlines
Chapter 2 Genetic Associations across Temporally or Spatially Associated Brain Imaging Traits
2.1 Introduction
2.2 Materials and Methods
2.2.1 Model Specification
2.2.2 Parameter Estimation
2.2.3 Hypothesis Testing
2.2.4 Simulation
2.2.5 Model Evaluation
2.3 Application I: Temporal Modeling
2.3.1 Data Acquisition
2.3.2 Temporal Autoregressive Linear Mixed Modeling
2.4 Application II: Spatial Modeling
2.4.1 Data Acquisition
2.4.2 Spatial Autoregressive Linear Mixed Modeling
2.5 Results
2.5.1 Simulation
2.5.2 Temporal Modeling
2.5.3 Spatial Modeling
2.6 Discussion
Chapter 3 Automatic Hypothesis Testing Framework
3.1 Introduction
3.2 Materials and Methods
3.2.1 Scientific method as line of inquiry
3.2.2 Specifying inquiries: hypotheses and questions
3.2.3 Organizing datasets through ontologies
3.2.4 Finding data
3.2.5 Analyzing data
3.2.6 Updating results continuously
3.2.7 The implementation of NeuroDISK
3.3 Results
3.3.1 Replicating the published ENIGMA GWAS meta-analysis
3.3.2 Asking and answering novel questions with meta-data
3.3.3 Continuous updating of results
3.4 Discussion
Chapter 4 Conclusions
Bibliography
List of Tables
2.5.2 Demographic information for subcortical volume analysis in different diagnostic groups applied to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset
2.5.3.1 Estimated SNP parameters of VCAN gene association with FA measures using longGWAS, MMHE and ARLMM packages for spatial modeling
2.5.3.2 Spatial modeling predictive metrics for longGWAS, MMHE and ARLMM
3.2.3 Overview of the current ODS-ENIGMA ontologies in NeuroDISK
3.3.1 Reproduced genetic association results for precentral surface area and rs1080066 for ENIGMA discovery cohorts with and without UK Biobank
List of Figures
2.2.5 Autoregressive Linear Mixed Model illustration
2.5.1.1 Visualization of simulated genetic structure for one population or two populations
2.5.1.2 Time and space complexity of different longitudinal linear mixed models using simulated data
2.5.1.3 Power analysis and mixed model parameter estimates accuracy when simulated data sample size increases
2.5.2.1 Analysis of longitudinal subcortical volumes in four diagnostic groups
2.5.2.2 Genetic association results for longGWAS, MMHE and ARLMM across 4 diagnostic groups separately and total groups for temporal modeling
2.5.2.3 Temporal model evaluation and comparison using predictive metrics for longGWAS, MMHE and ARLMM
3.1.1 The same analysis can be conducted in different contexts as more and more data become available for collaborative efforts
3.1.2 NeuroDISK is designed to automate the processes that scientists follow to answer questions using existing datasets
3.2.1 NeuroDISK uses Lines of Inquiries to represent the approach that a scientist would follow to answer different types of questions
3.2.2.1 An overview of the Scientific Questions Ontology at the top, illustrating the main concepts and terms
3.2.2.2 The NeuroDISK user interface for specifying questions
3.2.3 An overview of the main concepts in the NeuroDISK ODS-ENIGMA ontology
3.2.4.1 The LOI variable mappings between the variables in the question template and the variables in the LOI data query template
3.2.4.2 The data query generated from the user’s question
3.2.4.3 The cohort data that are retrieved using Line of Inquiries
3.2.5.1 The LOI’s meta-analysis is done through a meta-workflow for meta-regression
3.2.5.2 NeuroDISK captures the provenance of all execution results of the analysis and meta-analysis workflows and uses it to generate explanations for the user
3.2.6 NeuroDISK continuously checks if the results for a user question have changed due to new data becoming available or due to workflow or meta-workflow updates
3.2.7.1 A diagram of the architecture components of DISK
3.2.7.2 A diagram of the DISK APIs and adapters that enable interoperability with other data sources and workflow systems
3.3.1 Meta analysis workflow results to validate the NeuroDISK framework can reproduce the published meta analysis results
3.3.2 Meta regression workflow results show a scatter plot displaying the association between the effect size of an association of interest against mean age of each cohort
3.3.3 Continuously updated findings
Abstract
For complex human brain mapping analysis, each subject may have multiple
measurements of a quantitative trait corresponding to different time points or spatial locations. It
is important to model genetic factors for repeated measurements, accounting for within-subject
variation, especially when the datasets contain many diagnostic trajectories, or the measurements
are taken at different spatial locations. Genetic associations often require tens of thousands of
individuals to detect, thus powerful and robust approaches are needed. We propose a robust and
accurate Autoregressive Linear Mixed Model (ARLMM) which incorporates support vector
regression and joint modeling to address different types of within-subject and between-subject
dependences, including genetic differences, scanner effects, and temporal or spatial differences.
We applied our model to genetic association analysis on the Alzheimer’s Disease Neuroimaging
Initiative (ADNI) T1-weighted MRI dataset and the UK Biobank diffusion-weighted MRI dataset.
When data is collected continuously, it remains challenging to perform systematic data
analysis without human intervention. We developed NeuroDISK, a user-friendly
three-component automatic analysis framework, which includes data storing and retrieval,
analytical workflow integration and data visualization to perform continuous data analysis. To
demonstrate applications of interest to the general scientific community without the need for
individual level data, NeuroDISK was evaluated as a tool for meta-analysis. We incorporated
both an inverse variance weighted meta-analysis and a meta-regression framework to showcase
the effect of specific genotypes in select brain regions. The NeuroDISK framework can be
generalized beyond this use case, providing users with enough flexibility to define questions, run
workflows, and access results interactively and continuously.
Chapter 1 Introduction
This thesis mainly focuses on developing and implementing novel data mining
approaches in the field of neuroimaging genetics. Neuroimaging genetics combines insights from
imaging and genetics to investigate how genetic factors influence brain structure, function, and
the likelihood of neurological diseases (JL Stein et al., 2012; DJ Smit et al., 2012). The main aim
of neuroimaging genetics is to deepen our understanding of the biological processes underlying
neurological disorders, enabling better approaches for preventing and treating these disorders.
1.1 Neurodegenerative Disorders and Alzheimer’s Disease
The most commonly studied neurological disorders for human brain research are
neurodegenerative disorders, which involve the progressive loss of neurons in the central
nervous system (CNS) or peripheral nervous system (PNS) over time (D Moujalled et al., 2021;
DM Wilson et al., 2023). For example, Alzheimer’s Disease (AD) is a progressive
neurodegenerative disorder that affects an estimated 6.9 million Americans aged 65 and older
as of 2024 (KB Rajan et al., 2021; Alzheimer's Association et al., 2024). Current therapies for
the management of AD help control the progression of the disease rather than eliminate its
root causes. These interventions include pharmacological treatments, cognitive
training and physical exercise (J Mendiola-Precoma et al., 2016; N Shusharina et al., 2023).
Unless AD can be effectively treated or prevented, the number of people with it will increase
significantly if the current population aging trend continues, because advancing age is
one of the most important risk factors for AD (R Guerreiro et al., 2015; N Zhao et al., 2020).
Data collection for imaging genetics studies is growing exponentially in terms of sample
size, variety of experiments, and information coverage (S Marek et al., 2022; M Sanjana et al.,
2023). Data continue to be collected beyond baseline genetic, imaging and demographic
information, spanning time scales as subjects return at later time points for repeated
scans, or extending to spatial scales where brain imaging data are analyzed across different
brain regions to examine location-specific variations. In particular, longitudinal data analysis,
in which data are collected at multiple time points, provides several advantages in genetic
studies over cross-sectional studies that use only baseline information. It provides additional
information on the onset time of diseases, which is particularly valuable for traits with
variable age of onset and for brain imaging traits that are heterogeneous with regard to disease
development over time (B Kerner et al., 2009).
There is a long history of building longitudinal neuroimaging datasets to analyze
genetic risk factors during the progression of AD. The Alzheimer’s Disease
Neuroimaging Initiative (ADNI) (https://adni.loni.usc.edu/) is a longitudinal, multi-site,
four-phase research study, encompassing ADNI1 (2004–2009), ADNI2/GO (2009–2016),
ADNI3 (2016–2023), and ADNI4 (2023–present), of more than 2400 subjects scanned at baseline
and at follow-ups up to 60 months (MW Weiner et al., 2010; EN Manning et al., 2017; CR Jack Jr
et al., 2024; S Walter et al., 2024). It collects longitudinal imaging data along with genetic,
clinical, demographic, and other biomarker data at over 60 enrollment sites across the United
States and Canada. Participants, aged 55 to 90 years, are categorized as Cognitively Normal
(CN), Mild Cognitive Impairment (MCI), or Alzheimer’s Disease (AD). To measure the
progression of AD, ADNI primarily uses the Clinical Dementia Rating (CDR) as its diagnostic
criterion (CR Jack Jr et al., 2024; H Wilk et al., 2024) to assess each subject’s performance in six
cognitive and functional domains: Memory, Orientation, Judgment and
Problem-Solving, Community Affairs, Home and Hobbies, and Personal Care. CN participants
have a CDR global score of 0 and Mini-Mental State Examination (MMSE) scores between 24 and 30,
are non-depressed, non-MCI, and non-demented, and are assessed with education-adjusted scores
on delayed recall of one paragraph from the Wechsler Memory Scale Logical Memory II. MCI
participants have CDR global scores of 0.5, MMSE scores between 24 and 30, and objective memory
loss measured by education-adjusted scores on delayed recall of one paragraph from the Wechsler
Memory Scale Logical Memory II. AD patients have CDR global scores of either 0.5 or 1, MMSE
scores between 20 and 26, and meet the NINCDS/ADRDA criteria for probable AD
(https://adni.loni.usc.edu/data-samples/adni-data/study-cohort-information/). While diagnostic
criteria for AD continue to evolve to better reflect the underlying biology of AD (MT Ly et al.,
2024; CR Jack Jr et al., 2024), the CDR criteria remain widely used for research purposes, along
with the MMSE and the Wechsler Logical Memory II subscale, to establish the final CN, MCI and
AD diagnoses for the ADNI dataset. For our study, the analysis within diagnostic groups focused on
patterns (e.g., imaging, genetic, or aging factors) that are independent of the diagnosis grouping
criteria to minimize bias, avoid loss of generalization, and prevent inflated or overly optimistic
performance metrics.
Several other datasets, such as UK Biobank (https://www.ukbiobank.ac.uk/), go beyond
the typical scale of tens to hundreds of participants by broadly sampling from the general
population. This allows the researchers to discover novel genetic markers and develop predictive
algorithms for human diseases (M Garg et al., 2024). The UK Biobank has an extensive
collection of high-quality neuroimaging data with baseline and follow-up scans from around
100,000 volunteers, and contains data from more than 500,000 subjects aged between 40 and 69
at recruitment in 2006 (KL Miller et al., 2016; C Bycroft et al., 2018). Approximately 90% of the
UK Biobank participants do not have a diagnosis of neurological conditions (N Veronese et al.,
2021). It aims to provide a comprehensive, large-scale biomedical database that is continuously
updated with health records, de-identified genetic data, and lifestyle and health information,
making it an invaluable resource for researchers worldwide (C Bycroft et al., 2018; M Garg et
al., 2024). The emergence of biobank-level longitudinal datasets offers new opportunities to
reliably detect the relatively small effects of single nucleotide polymorphisms (SNPs) on the
human brain structure over time through genetic association analysis.
1.2 Genetic Association Analysis
Genetic association analysis of longitudinal datasets requires methods that effectively
address relatedness, population structure, and other biases introduced by confounders. Population
stratification, in particular, refers to the presence of subpopulation(s) within a study population
that have different allele frequencies due to ancestral differences rather than true associations
with the trait of interest. In a structured population, spurious genetic associations with a
phenotype can arise as a result of confounding with some genetic variants or environmental
factors (BJ Vilhjálmsson et al., 2013). Without properly controlling for these confounders, it
becomes challenging to determine the extent to which observed phenotypic differences between
diagnostic groups are attributable to genetic factors versus other factors (JG Schraiber et al.,
2024).
Among various statistical approaches to adjust for population structure, two methods
stand out (Y Zhang et al., 2015): Principal Component Analysis (PCA) (GE Hoffman, 2013; H
Li et al., 2019; Y Yao et al., 2023), which is widely used, probably due to its early appearance,
simplicity and effectiveness; and the linear mixed model (LMM) based approach (NA Furlotte et al.,
2012; PR Loh et al., 2015; T Ge et al, 2017; X Wu et al., 2018; Q Yang et al., 2019), which has
emerged more recently as one of the most flexible and effective methods, especially for datasets
with complex genetic structures. The top principal components (PCs) are the first few eigenvectors
corresponding to the largest eigenvalues obtained by applying PCA, a dimensionality reduction
method, to the genotype matrix, in which subjects are represented by rows and SNPs by columns. To
correct for population stratification, PCs are included as covariates in linear regression or
fixed-effects models. In comparison, LMMs extend fixed-effects models with
random effects, whose covariance may be specified by the genetic relationship matrix (GRM). The
GRM reflects the pairwise genetic similarity between subjects and accounts for family
relationships that PCs might not fully address. The standard GRM is also calculated from the
genotype matrix, by standardizing the genotypes at each marker to mean 0 and variance 1, then
summing the products of standardized genotypes for each pair of subjects across all genetic
markers, typically normalized by the number of markers. This results in a symmetric $N \times N$
matrix, where $N$ is the number of subjects being studied. The diagonal entries represent
self-relatedness, typically close to 1, and the off-diagonal entries represent pairwise relatedness
between subjects.
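The GRM construction described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used in this thesis: the function name `genetic_relationship_matrix`, the toy genotype values, the choice to skip monomorphic SNPs, and the normalization by the number of retained markers are all assumptions made for the example.

```python
from statistics import mean, pstdev


def genetic_relationship_matrix(genotypes):
    """Return the subjects-by-subjects GRM for a subjects-by-SNPs matrix
    of allele counts (0, 1 or 2)."""
    n_subjects = len(genotypes)
    standardized = []  # one list of per-subject z-scores for each retained SNP
    for snp in zip(*genotypes):  # iterate over SNP columns
        sd = pstdev(snp)
        if sd == 0:
            continue  # monomorphic SNP: cannot be standardized, so skip it
        mu = mean(snp)
        standardized.append([(g - mu) / sd for g in snp])
    if not standardized:
        raise ValueError("no polymorphic SNPs to build the GRM from")
    n_markers = len(standardized)
    # Entry (i, j) averages the product of standardized genotypes over markers.
    return [
        [sum(z[i] * z[j] for z in standardized) / n_markers
         for j in range(n_subjects)]
        for i in range(n_subjects)
    ]


# Toy data, made up for illustration: 4 subjects (rows) and 5 SNPs (columns).
toy_genotypes = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [2, 0, 0, 1, 1],
    [1, 2, 1, 2, 1],
]
grm = genetic_relationship_matrix(toy_genotypes)
for row in grm:
    print([round(value, 3) for value in row])
```

With this in-sample standardization the diagonal entries average exactly 1, which is consistent with the self-relatedness values described above; real pipelines usually standardize with allele frequencies instead, so their diagonals are only close to 1.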
1.3 Dynamic Data Modeling Framework
An interesting question would be, if longitudinal data come from different groups
(diagnostic, sex, etc.), will joint modeling of different groups (J Guedj et al., 2011; W Ge et al.,
2017; B Lei et al., 2020) help us better understand when and where the genetic factor(s)
influence human brain structure during AD progression? By modeling multiple groups
simultaneously, it may be possible to identify shared and group-specific patterns of genetic
effects, offering a more comprehensive and dynamic understanding of how genetics interact with
other factors. Additionally, it enables the detection of shared-group effects that may be missed in
group-specific analyses of smaller sample sizes (M Li et al., 2015; S Marek et al., 2022). With
longitudinal data collected over time, joint modeling can reveal when genetic factors begin to
exert influence, or exert their strongest influence, during disease progression, potentially
identifying critical windows for intervention.
To jointly model the diagnostic groups more accurately, we combine the time series
techniques with mixed models, focusing on subject-level dynamics over time, such as
autocorrelation, which might be overlooked in LMMs that are not designed to explicitly model
the time-dependent relationships between observations. Time series analysis such as
Autoregressive (AR) modeling (S Makridakis et al., 1997; I Funatogawa et al., 2012; TI Lin et
al., 2020; K Tian et al., 2024) focuses on a single subject or entity, analyzing how its
measurements evolve, while longitudinal analysis tracks multiple subjects over time, often
accounting for variability both within and between subjects. The field of image analysis has been
striving to develop large autoregressive models (Y Zheng et al., 2014; B Guo et al., 2022; A
El-Nouby et al., 2024), inspired by the success of the Generative Pre-trained Transformer (GPT)
series and other Large Language Models (LLMs) in the Natural Language Processing (NLP) field
(TB Brown, 2020; J Achiam et al., 2023). GPT models predict one token or word at a time,
so that the output mimics human language production and maintains the coherence and
relevance of the text. When applied to images, AR modeling is a self-supervised learning
strategy that discretizes continuous images into multi-dimensional grids of tokens (e.g., 2D),
flattens them into a 1D sequence, and then predicts the next token in the sequence, a simple
yet powerful approach (K Tian et al., 2024).
Here, we propose an Autoregressive Linear Mixed Model (ARLMM) that introduces a
temporally or spatially varying random effect with AR modeling, using the same flattening
strategy as K Tian et al. (2024) on the 2D covariance structures in the random effects of
mixed models. We further incorporate support vector regression to model the complex genetic
relationships more robustly, especially for small sample sizes (D Basak et al., 2007; F Zhang et
al., 2020). Using joint modeling, our model can be applied to disease-specific brain imaging
traits to infer the model parameters and perform hypothesis testing across groups, from stable
diagnostic groups to converter groups.
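To make the idea of an autoregressive random-effect covariance concrete, the sketch below builds a first-order autoregressive, AR(1), correlation matrix over repeated measurements, in which the correlation between two measurements decays as rho ** |i - j| with their separation. It rests on assumptions that this chapter does not state: the AR order, equally spaced time points (or spatial locations), and the names `ar1_correlation` and `rho` are hypothetical. The actual ARLMM covariance specification is given in Chapter 2.2.

```python
def ar1_correlation(n_measurements, rho):
    """Return the AR(1) correlation matrix with entries rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [
        [rho ** abs(i - j) for j in range(n_measurements)]
        for i in range(n_measurements)
    ]


# Three equally spaced visits (e.g., baseline, 12 months, 24 months) with a
# hypothetical lag-one correlation of 0.5.
corr = ar1_correlation(3, 0.5)
for row in corr:
    print(row)
```

Measurements one visit apart share a correlation of 0.5 and measurements two visits apart share 0.25, so a single parameter captures the within-subject dependence that a plain random intercept would treat as constant.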
1.4 Automatic Hypothesis Testing Framework
As ongoing studies and new research initiatives collect neuroimaging data continuously,
opportunities arise to review and refine previous findings. While systematic meta-analyses
can capture discrepancies in common trends across different studies and datasets, new data can
often challenge existing conclusions or generate new ones. A user-friendly system that captures
the evolution of such changes without human intervention has not yet been developed. Moreover,
most published automatic hypothesis testing frameworks, such as the Exploratory Hypothesis
Testing System (G Liu et al., 2011) and NeuroCI (J Sanz-Robinson et al., 2022), focus on hypothesis
comparison problems rather than general hypothesis testing applicable to any workflow. We
developed NeuroDISK, a three-component framework (data storing, workflow integration, and data
visualization) to perform continuous data analysis for genetic associations. We evaluated
NeuroDISK by implementing meta-analysis and meta-regression workflows using data
from the ENIGMA consortium, which aggregates information from human brain imaging studies at
multiple sites worldwide. NeuroDISK gives users and researchers enough flexibility to
define their own questions of interest, run the corresponding workflow(s) and access the results
interactively.
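As an illustration of the kind of meta-analysis workflow mentioned above, the following sketch pools per-cohort effect sizes with inverse-variance weights, the weighting scheme named in the Abstract. It is a minimal example under stated assumptions rather than the NeuroDISK workflow itself: it assumes a fixed-effect model, the function name `inverse_variance_meta` is hypothetical, and the cohort effect sizes and standard errors are made-up numbers.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted pooling.

    Returns (pooled_beta, pooled_se, z_score).
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect size")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Hypothetical per-cohort SNP effect sizes and standard errors (not real results).
cohort_betas = [0.10, 0.20, 0.15]
cohort_ses = [0.05, 0.10, 0.05]
pooled_beta, pooled_se, z_score = inverse_variance_meta(cohort_betas, cohort_ses)
print(f"pooled beta = {pooled_beta:.4f}, SE = {pooled_se:.4f}, z = {z_score:.2f}")
```

Because larger or more precise cohorts have smaller standard errors, they receive more weight, and the pooled standard error shrinks as cohorts are added. Rerunning the same pooling whenever a new cohort arrives is what makes continuous updating of results natural.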
1.5 Contributions
1. We proposed an autoregressive linear mixed model (ARLMM) approach designed for repeated
measurements across temporal or spatial scales, using large-scale imaging genetic data from ADNI
and UK Biobank. Building on existing work on longitudinal mixed models, our method addresses a
critical gap by introducing a novel approach that avoids time-consuming iterative steps through
moment matching and autoregressive models. Furthermore, by incorporating robust support
vector regression, our approach demonstrates the highest predictive power among the
longitudinal approaches of the past twenty years. It is the first longitudinal mixed model
approach that balances sensitivity and specificity, and it uses joint modeling to provide a
dynamic profile of the genetic effects on human brain structures over disease progression.
2. We developed NeuroDISK, a three-component automatic hypothesis testing system
comprising a structured, crowdsourced knowledge graph in the form of a semantic wiki (ODS); a
platform, WINGS, for executing semantic computational experiments as scientific workflows;
and a user-friendly interface, DISK, that orchestrates continuous data integration and automated
hypothesis testing and revision, and presents data visualizations of results. Building on the DISK
framework, we integrated and extended meta-analysis and meta-regression implementations and
showcased an example use case from imaging genetics. This is a research area where large sample
sizes and replications are of utmost importance due to the small effect sizes of common genetic
variants.
1.6 Outlines
Chapter 2 describes the Autoregressive Linear Mixed Model (ARLMM), a fast and accurate
approach that balances specificity and sensitivity, for genetic association analysis of temporally
or spatially correlated brain imaging traits. Chapter 2.1 first introduces the background of the
APOE genetic risk factor and its importance for analyzing the progression of Alzheimer's Disease.
Next, we discuss the benefits and challenges of using longitudinal data compared with
cross-sectional data, and review existing methods developed for longitudinal mixed models. Finally,
we summarize the motivation for combining autoregressive models with linear mixed models,
and showcase our method through genetic association analysis using subcortical volumes
measured at baseline, 12 months and 24 months in the ADNI dataset, and FA measures from three
sections of the corpus callosum in UK Biobank. Chapter 2.2 details the ARLMM model specification,
parameter estimation, and hypothesis testing methods, and describes how we compare the
performance of ARLMM with other longitudinal linear mixed models on simulated and real data.
Chapter 2.3 and Chapter 2.4 detail how ADNI and UK Biobank imaging, genetic and
demographic data were acquired and preprocessed to fit the fixed-effects and
random-effects framework of ARLMM. Chapter 2.5 presents the genetic association results on
simulated and real data, and Chapter 2.6 summarizes the problems we have worked on, the
challenges, and future directions for developing longitudinal mixed model approaches for genetic
association analysis.
Chapter 3 describes NeuroDISK, an automatic hypothesis testing platform comprising
three components: data storing and preprocessing, workflow execution, and interactive user
interfaces. Chapter 3.1 first introduces the importance and challenges of analyzing neuroimaging
and genetic data without human intervention when data continue to be collected through
independent studies worldwide, coordinated by consortia such as ENIGMA. We then review
existing methods that are able to perform continuous data analysis, and propose NeuroDISK to
replicate the ENIGMA3 meta-analysis results for human cortical surface area and thickness in
order to verify its reliability, scalability, security and maintainability. Chapter 3.2 details how the
three-component structure of NeuroDISK is organized: 1) Data storing and preprocessing: the
Organic Data Science dataset catalog (ODS), which uses a semantic wiki, allows users to record
unique project-specific properties for different datasets from the ENIGMA3 cortical GWAS working
group; 2) Workflow execution: the WINGS system executes specific workflows; 3) Interactive user
interfaces: Automated DIscovery of Scientific Knowledge (DISK) provides question templates,
expressed in SPARQL for the Resource Description Framework (RDF), allowing users to filter data
from ODS by setting variables of interest. Chapter 3.3 shows the performance of NeuroDISK
by running meta-analysis and meta-regression workflows through simulation studies and
real-world data applications. Chapter 3.4 summarizes how NeuroDISK automates data analysis
through the inquiry-driven approach and outlines future work to improve NeuroDISK.
Chapter 4 discusses how ARLMM and NeuroDISK could be integrated to establish a
unified continuous data analysis pipeline, aiming to provide accurate, dynamic and continuous
profiling of genetic effects on the human brain structures during disease progression.
Chapter 2 Genetic Associations across Temporally or Spatially
Associated Brain Imaging Traits
2.1 Introduction
Accelerated brain tissue atrophy is an early imaging marker of cognitive impairment and
dementia. The APOE4 genotype is a well-known risk factor for Alzheimer’s Disease (AD) and
related dementias that is also associated with smaller volumes of some brain regions, including
the hippocampus (MK Lupton et al., 2016; A Montagne et al., 2020). Fortunately, many
individuals at high genetic risk will never be diagnosed with AD. If the pattern of longitudinal
brain atrophy associated with APOE4 differs between individuals ultimately diagnosed with AD
versus those who are not, this may shed light on mechanisms and interacting risk factors that
promote disease in those at higher risk.
Understanding how genetic factors like APOE4 influence brain atrophy patterns during
AD progression requires longitudinal approaches to disentangle the complex relationships
between genetics, brain structures, and disease progression (S Adaszewski et al., 2013; J Fortea
et al., 2024). However, such analyses often face challenges due to the limited availability of
longitudinal brain imaging datasets, underscoring the importance of collaborative efforts and
large-scale data sharing initiatives such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI)
(MW Weiner et al., 2010; EN Manning et al., 2017; CR Jack Jr et al., 2024) and the UK Biobank
(KL Miller et al., 2016; C Bycroft et al., 2018; N Veronese et al., 2021). These datasets provide
unprecedented opportunities to detect and validate genetic associations with brain structures and
microstructures, through repeated measurements across different stages of disease
progression, using non-invasive magnetic resonance imaging (MRI) (LT Elliott et al., 2018)
and Diffusion Tensor Imaging (DTI).
Temporal correlations in longitudinal brain imaging datasets are often complex, arising
from the repeated measurements within subjects over time, which are inherently intertwined with
other sources of correlations such as genetic influences, group-specific characteristics such as
diagnostic criteria, and environmental factors (FD Bowman et al., 2014). Repeat measurements
taken on a single individual at different timepoints are almost always correlated, with measures
taken closer in time being more highly correlated than measures taken further apart in time (RC
Littell et al., 2000; Q Yang et al., 2019). Ignoring the dependence of repeated measures both
within and between subjects may increase Type I error and reduce statistical power, especially
when there is a known genetic correlation structure, or other confounding effects such as MRI
scanner or site differences. Furthermore, in longitudinal studies of multi-diagnostic populations,
such as ADNI, an additional correlation structure may exist between the time-series of
observations from subjects within the same diagnostic group. In order to accurately account for
different correlation structures, a linear mixed model (LMM) (PR Loh et al., 2015; T Ge et al.,
2017; X Wu et al., 2018; B Zhao et al., 2021) is often used to explicitly model random effects,
and specify the covariance structure of both random effects and measurement errors with
variances and correlation matrices. Furthermore, LMM accounting for disease-specific
correlations can help map the trajectories of brain decline with greater specificity, and improve
sensitivity in statistical inferences of genetic effects (RC Littell et al., 2000).
Inspired by the remarkable predictive improvements achieved by Autoregressive (AR) models in
natural language processing (NLP), particularly through models like the Generative Pre-trained
Transformer (GPT) series and other large language models (LLMs) (TB Brown, 2020; J Achiam
et al., 2023), which employ a straightforward but powerful ‘next-token prediction’ strategy, we
combined AR models with mixed models to perform the
‘next-temporal-measurement-prediction’ or ‘next-spatial-location-prediction’ tasks. For language
processing, AR models generate coherent and contextually relevant text by predicting the next
word based on the previous few words in a sequence, naturally following the left-to-right order
of a sentence (K Tian et al., 2024; K Jian et al., 2024). However, brain images inherently lack
such predefined order, presenting unique challenges in applying AR models to this domain. To
address this, we defined the order of the imaging measurements of brain regions of interest
(ROIs) based on either a time-increasing order or anatomical order along the anterior-posterior
axis.
Recently, many publicly available mixed model packages appropriate for
longitudinal genetic association analysis with imaging applications have been developed,
including longGWAS (NA Furlotte et al., 2012), Moment Matching Heritability Estimation
(MMHE) (T Ge et al., 2017) and L-GATOR (X Wu et al., 2018). The longGWAS package
applies the Restricted Maximum Likelihood (REML) parameter estimation method for fitting linear
mixed models, which has a high time complexity of $O(N^3)$, making it impractical to test
large-scale longitudinal datasets (NA Furlotte et al., 2012). MMHE improves the parameter
estimation time to $O(N^2)$, but there may be substantial power loss when it is applied to data with
a moderate sample size, a small number of repeated measurements and larger measurement errors (T
Ge et al., 2017). L-GATOR introduces a flexible random effect structure with an exponential
covariance to account for the phenotypic autocorrelation between repeated measurements, but is
designed for family or twin studies and is still computationally inefficient, similar to longGWAS (X
Wu et al., 2018).
Extending work from exponential covariance modeling in L-GATOR (X Wu et al., 2018),
we propose an autoregressive covariance structure for the linear mixed model (ARLMM), which is GPU
accelerated and designed for imaging genetic analysis even in the presence of long-term repeated
measurements. ARLMM considers a simplified version of the exponential covariance structure for a
random effect, integrating autoregressive phenotypic correlations to account for temporal or
spatial variability. This approach enables the inference of parameters specific to disease
progression, transitioning from stable groups to converter groups. When estimating the
parameters of the mixed models, we combined moment matching techniques from MMHE (T Ge
et al., 2017; J Mbatchou et al., 2024) with support vector regression (SVR) (S Balasundaram et
al., 2019) to improve statistical power and predictive accuracy while reducing dependency on
large sample sizes, making ARLMM a robust and scalable solution for longitudinal
imaging-genetic analyses. Compared with the publicly available longitudinal mixed model
packages developed over the past twenty years (NA Furlotte et al., 2012; T Ge et al., 2017; X Wu et al.,
2018), ARLMM is the first approach that balances sensitivity and specificity while jointly modeling all
diagnostic groups, providing dynamic profiles of genetic effects on brain changes during
disease progression. Simulation studies have shown that ARLMM is applicable to both genetically
correlated subjects and uncorrelated general populations, demonstrating the highest predictive
accuracy and consistently improved power. To the best of our knowledge, ARLMM is the
fastest longitudinal mixed model method to date, efficiently balancing memory usage while
accounting for both within- and between-subject variations, and is capable of handling moderate
to large sample sizes, and weak to strong autocorrelation signals between repeated measurements
specific to disease subgrouping.
ARLMM can be applied to a wide range of longitudinal genetic association studies with
brain imaging traits. We showcase how to apply ARLMM as follows: 1) We analyze the effect of
APOE4, the strongest genetic risk factor for AD, on human brain subcortical volumes extracted from
T1-weighted MRIs. Data were obtained from the Alzheimer’s Disease Neuroimaging Initiative
(ADNI) across the United States and Canada, including 645 European-ancestry subjects scanned
at three time points (baseline, 12-month, and 24-month). ARLMM effectively modeled the
temporal correlations using autoregressive parameters fitted to different stages of AD, enabling the
accurate identification of the influence of APOE4 on brain volume changes over time. 2)
Expanding the application from neuroimaging data with complex temporal correlations to spatial
correlations, we analyzed DTI data, focusing on fractional anisotropy (FA) measures for three
sections of the corpus callosum (genu, body, and splenium) from 4,000 UK Biobank
subjects of European ancestry (SR Cox et al., 2016; LT Elliott et al., 2018; B Zhao et al.,
2021). The VCAN gene was identified as the only genetic locus significantly associated with FA
measures across all three sections of the corpus callosum (SR Cox et al., 2016; LC Rutten et al.,
2018). To validate our approach, we tested the effect of the VCAN gene on FA measures for the
three corpus callosum sections as a positive control. As a negative control, we assessed the
association of APOE4 with FA measures, as APOE4 has been extensively studied in
large-scale GWAS and has shown no significant associations with any of the three corpus callosum
sections (B Zhao et al., 2021). This dual-control framework further strengthens the reliability of
our findings and provides a benchmark for evaluating genetic effects on spatially continuous
neuroimaging traits.
2.2 Materials and Methods
2.2.1 Model Specification
We consider the matrix form of the Autoregressive Linear Mixed Model (ARLMM):
$$y = Xb + g + t + c + e \tag{1}$$
$$g \sim N(0, \sigma_g^2 \Sigma_g), \quad t \sim N(0, \sigma_t^2 \Sigma_t), \quad c \sim N(0, \sigma_c^2 \Sigma_c), \quad e \sim N(0, \sigma_e^2 \Sigma_e) \tag{2}$$
Suppose we have brain imaging phenotypes, covariates and genotype data for $N$ individuals,
each with $T_i$ ($i = 1, 2, ..., N$) repeated measurements. $y$ (of size $\Sigma T_i \times 1$) denotes the imaging
phenotypes, and $X$ (of size $\Sigma T_i \times p$) denotes the fixed effects, which may include time-varying
covariates such as age, and static covariates such as sex and the first four genetic principal
components (PCs) using the ENIGMA genetic protocol (SE Medland et al., 2022). When testing
genetic associations, we represent the genotype as a vector $s$ (size $\Sigma T_i \times 1$) while $\beta$ (constant)
and $b$ (size $p \times 1$) are coefficients as follows:
$$y = Xb + s\beta + g + t + c + e \tag{3}$$
The goal is to test the null hypothesis $\beta = 0$. The distributions of random effects are specified
by unknown variance components $\sigma_g^2$, $\sigma_t^2$, $\sigma_c^2$, $\sigma_e^2$; and through known or modeled relationships
including the GRM $\Sigma_g$, block-diagonal matrices $\Sigma_t$ and $\Sigma_c$, and the identity correlation matrix $\Sigma_e$.
There are two types of random effects: within-subject variations $g$ and $t$, and between-subject variations $c$
and $e$. $g$ and $t$ represent genetic relatedness and autocorrelation for one single individual's
phenotypes, respectively; $c$ corresponds to the between-subject variations due to scanner or site
differences, where each sub-block of $\Sigma_c$ is an all-ones matrix, corresponding to individuals whose
images were acquired on the same scanner. $e$ represents between-subject environmental effects
or measurement error.
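As a minimal sketch of how the scanner/site correlation matrix described above could be assembled, assuming one scanner label per observation (the function name `scanner_correlation` and the labels are hypothetical and not taken from the ARLMM implementation):

```python
def scanner_correlation(scanner_ids):
    """Return the 0/1 matrix whose (a, b) entry is 1 when observations a and b
    were acquired on the same scanner, and 0 otherwise."""
    n = len(scanner_ids)
    return [[1 if scanner_ids[a] == scanner_ids[b] else 0 for b in range(n)]
            for a in range(n)]


# Hypothetical labels: two observations on scanner "A", three on scanner "B".
sigma_c = scanner_correlation(["A", "A", "B", "B", "B"])
for row in sigma_c:
    print(row)
```

When the observations are sorted by scanner, as in this toy input, the result is block-diagonal with one all-ones block per scanner.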
Four diagnostic groups are defined for disease progression modeling, corresponding to
three stable clinical groups and a converter group for the diagnoses of cognitively normal (CN), mild
cognitive impairment (MCI) and Alzheimer’s Disease (AD): 1) stable CN (sCN): subjects who
were diagnosed as CN at baseline and remained CN until the last time point; 2) stable MCI
(sMCI): subjects who were diagnosed as MCI at baseline and remained unchanged until the
last time point; 3) converter MCI (cMCI): subjects who were diagnosed as MCI at baseline
but converted to AD by the last time point. In particular, if each subject is measured three times,
those who converted to AD from MCI at the second time point were early converter MCI
(ecMCI) and those who converted to AD from MCI at the third time point were late converter
MCI (lcMCI); 4) stable AD (sAD): subjects who were diagnosed as AD at baseline and
remained unchanged until the last time point.
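The grouping rules above can be sketched as a small function. This is an illustration only: the function name `diagnostic_group` and the choice to return `None` for trajectories outside the four groups (for example CN -> MCI) are assumptions, not part of the original pipeline.

```python
def diagnostic_group(diagnoses):
    """Map a subject's diagnosis sequence (e.g. ["MCI", "MCI", "AD"]) to a group label.

    Returns None for trajectories outside the four groups (e.g. CN -> MCI).
    """
    first, last = diagnoses[0], diagnoses[-1]
    if all(d == first for d in diagnoses):
        return {"CN": "sCN", "MCI": "sMCI", "AD": "sAD"}.get(first)
    # Converter MCI: MCI for the first k >= 1 visits, AD for every later visit.
    if first == "MCI" and last == "AD":
        k = diagnoses.index("AD")
        if all(d == "MCI" for d in diagnoses[:k]) and all(d == "AD" for d in diagnoses[k:]):
            if len(diagnoses) == 3:
                return "ecMCI" if k == 1 else "lcMCI"
            return "cMCI"
    return None


print(diagnostic_group(["MCI", "AD", "AD"]))   # early converter
print(diagnostic_group(["MCI", "MCI", "AD"]))  # late converter
```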
To account for a possibly temporal-varying or spatial-varying correlation, we assume $\Sigma_t$
contains unknown parameters that need to be estimated. For subject $i$ at time point $j$, $t_{ij}$ is
proportional to the most recent information $t_{i,j-1}$ under a first-order autoregressive model
[AR(1)]: $t_{ij} = \rho_{ij} t_{i,j-1}$ (TT Chong, 2001; B Jiang et al., 2023), for which autoregressive
fluctuations are absorbed by the between-subject environmental errors. The disease progression
parameter, $\rho_{ij}$, is assumed to depend only on the diagnostic classification at time points $j$ and
$j - 1$. Thus, it is specified for each diagnostic progression step: the stable groups α (CN ->
CN), β (MCI -> MCI), γ (AD -> AD), and conversion θ (MCI -> AD). For each subject, the
block-diagonal correlation submatrices of $\Sigma_t$ have the following form, corresponding to sCN
(CN->CN->CN), sMCI (MCI->MCI->MCI), lcMCI (MCI->MCI->AD), ecMCI
(MCI->AD->AD) and stable AD (AD->AD->AD), where lcMCI and ecMCI correlation matrices
are combined into one correlation matrix for converter MCI diagnostic group:
(4)
When it comes to mixed modeling for temporal data or spatial data, the autoregressive
parameters may refer to different concepts separately, which will be explained in detail in the
following sections.
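A minimal sketch of how such per-subject correlation blocks could be generated is given here. It assumes the usual AR(1) convention that the correlation between two visits is the product of the step parameters between them; the numeric values, the `STEP_PARAM` table and the `ar1_block` name are hypothetical, and the way lcMCI and ecMCI are merged into a single converter matrix is not covered.

```python
# Hypothetical values for the step-specific autoregressive parameters.
STEP_PARAM = {
    ("CN", "CN"): 0.9,    # alpha
    ("MCI", "MCI"): 0.8,  # beta
    ("AD", "AD"): 0.7,    # gamma
    ("MCI", "AD"): 0.6,   # theta
}


def ar1_block(diagnoses, step_param=STEP_PARAM):
    """Correlation block for one subject under a step-dependent AR(1) assumption.

    Entry (a, b) is the product of the step parameters between visits a and b.
    """
    n = len(diagnoses)
    steps = [step_param[(diagnoses[j - 1], diagnoses[j])] for j in range(1, n)]
    block = [[1.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            corr = 1.0
            for rho in steps[a:b]:
                corr *= rho
            block[a][b] = block[b][a] = corr
    return block


late_converter = ar1_block(["MCI", "MCI", "AD"])  # lcMCI: beta, then theta
for row in late_converter:
    print(row)
```

Under this assumption a stable group reduces to the familiar AR(1) pattern with entries 1, ρ and ρ², while a converter block mixes two different step parameters.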
2.2.2 Parameter Estimation
Our parameter estimation for the Autoregressive Linear Mixed Model follows a two-step approach.
In the first step, we computed variance estimates by converting the mixed model estimation
problem to a regression problem using moment matching (T Ge et al., 2017), and minimized the
L2-regularized support vector regression ε-insensitive loss. In the second step, beta estimates for
fixed effects (e.g., age, sex) and the SNP were obtained by maximum likelihood, given that the
covariance structures are known.
Step 1:
a) First, for the stable diagnostic groups, fixed effects $X$ and SNP(s) $s$ together, $[X \; s]$,
were regressed out from the original phenotypes $y$ (Eq. (3)) by calculating a projection matrix $P$
such that $PX = 0$. The covariance of the projected phenotypes can be expressed as linear
combinations of the projected Genetic Relationship Matrix (GRM) $\Sigma_g$, the projected temporal-varying
or spatial-varying correlation matrix $\Sigma_t$, the projected imaging site/cohort correlation $\Sigma_c$ and the
correlation for measurement errors (identity matrix):
$$\mathrm{Cov}(Py) = \sigma_g^2 P\Sigma_g P^T + \sigma_t^2 P\Sigma_t(\rho)P^T + \sigma_c^2 P\Sigma_c P^T + \sigma_e^2 P \tag{5}$$
If we apply vectorization to the covariance of the transformed phenotypes, equation (5) is
equivalent to the following equation (6):
$$\mathrm{vec}(Pyy^TP^T) = \sigma_g^2\,\mathrm{vec}(P\Sigma_g P^T) + \sigma_t^2\,\mathrm{vec}(P\Sigma_t(\rho)P^T) + \sigma_c^2\,\mathrm{vec}(P\Sigma_c P^T) + \sigma_e^2\,\mathrm{vec}(P) \tag{6}$$
To get the variance component estimates $\sigma_g^2$, $\sigma_t^2$, $\sigma_c^2$, $\sigma_e^2$, ARLMM minimized the L2-regularized
ε-insensitive loss $L$, which is used in Support Vector Regression with a linear
kernel and ignores any training data close to the predicted model within a threshold ε (D Basak
et al., 2007; YG Wang et al., 2023).
$$L = \max\left(0, \left|\mathrm{vec}(Pyy^TP^T) - \sigma_g^2\,\mathrm{vec}(P\Sigma_g P^T) - \sigma_t^2\,\mathrm{vec}(P\Sigma_t(\rho)P^T) - \sigma_c^2\,\mathrm{vec}(P\Sigma_c P^T) - \sigma_e^2\,\mathrm{vec}(P)\right| - \varepsilon\right) \tag{7}$$
b) We used joint modeling (NJ Law et al., 2002; M Li et al., 2005; J Guedj et al., 2011) to
obtain the estimates of α (CN -> CN), β (MCI -> MCI) and γ (AD -> AD) using the moment matching
techniques in equations (5) and (6) and support vector regression (SVR) in equation (7). We then
applied the estimated α, β and γ from the stable groups to obtain θ (MCI -> AD) in the converter group,
so that the knowledge about the covariance structure learned from the stable groups is transferred to
the converter group, which shares a similar disease progression stage before conversion.
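The moment-matching regression of Step 1a can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: it uses only two variance components (a toy relatedness matrix and the identity), it fits the regression by ordinary least squares as a stand-in for the L2-regularized SVR fit, and it only evaluates the ε-insensitive loss of Eq. (7) on the residuals. The function names and simulated data are hypothetical.

```python
import numpy as np


def projection_matrix(X):
    """P = I - X (X^T X)^{-1} X^T, so that P X = 0."""
    n = X.shape[0]
    return np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)


def moment_matching(y, X, sigmas):
    """Regress vec(P y y^T P^T) on vec(P Sigma_k P^T) for each component k.

    Ordinary least squares is used here as a stand-in for the
    L2-regularized SVR fit described in the text.
    """
    P = projection_matrix(X)
    Py = P @ y
    target = np.outer(Py, Py).ravel()
    design = np.column_stack([(P @ S @ P.T).ravel() for S in sigmas])
    var_hat, *_ = np.linalg.lstsq(design, target, rcond=None)
    return var_hat, design, target


def eps_insensitive_loss(residuals, eps):
    """Mean of max(0, |r| - eps) over all residuals."""
    return np.maximum(0.0, np.abs(residuals) - eps).mean()


rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
A = rng.normal(size=(n, 500))
grm = A @ A.T / 500                                    # toy relatedness matrix
noise = rng.multivariate_normal(np.zeros(n), 0.6 * grm + 0.4 * np.eye(n))
y = X @ np.array([1.0, 0.5]) + noise

var_hat, design, target = moment_matching(y, X, [grm, np.eye(n)])
loss = eps_insensitive_loss(target - design @ var_hat, eps=0.1)
print(var_hat, loss)
```

Replacing the least-squares call with a linear support vector regressor would bring the sketch closer to the loss actually minimized in Eq. (7).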
Step 2:
After the variance component estimates are obtained, the L2-regularized Mean Squared Error
loss is minimized to get beta estimates for the fixed effects (e.g., age, sex) and the SNP included in the
fixed effects $X$, assuming the phenotype is normally distributed conditional on
the covariates:
$$y \sim N(Xb, \; \sigma_g^2\Sigma_g + \sigma_t^2\Sigma_t + \sigma_c^2\Sigma_c + \sigma_e^2\Sigma_e) \tag{8}$$
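When the covariance in Eq. (8) is treated as known, the maximum likelihood estimate of the fixed effects coincides with generalized least squares. The sketch below shows that estimator with NumPy; it omits the L2 penalty mentioned above, and the function name and toy data are assumptions made for illustration.

```python
import numpy as np


def gls_fixed_effects(y, X, V):
    """Generalized least squares: b = (X^T V^{-1} X)^{-1} X^T V^{-1} y.

    Returns the estimates and their standard errors, assuming V is the
    (known or previously estimated) covariance of y.
    """
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    xtvx = X.T @ Vinv_X
    b_hat = np.linalg.solve(xtvx, X.T @ Vinv_y)
    se = np.sqrt(np.diag(np.linalg.inv(xtvx)))
    return b_hat, se


# Toy check: noise-free phenotypes with an AR(1)-like covariance recover b exactly.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
V = 0.5 ** np.abs(np.subtract.outer(np.arange(30), np.arange(30)))
b_true = np.array([2.0, -1.0])
b_hat, se = gls_fixed_effects(X @ b_true, X, V)
print(b_hat, se)
```

A Wald-type test of the SNP coefficient would then divide its estimate by the corresponding standard error.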
2.2.3 Hypothesis Testing
We implemented a permutation test (R Schweiger et al., 2018) to compute the p-values
for heritability in the linear mixed models, applied only to the residuals after regressing the
fixed effects out of the phenotypes. Note that the permutation test normally requires exchangeability
of the residuals, which does not hold in general in the presence of non-constant covariates. We
first projected the phenotypes onto the space orthogonal to the fixed effects as in Step 1a
(section 2.2.2), then permuted the transformed phenotypes while keeping the genetic structure
intact, ensuring that any observed associations between genetic variants and phenotypes would
be purely random. For each permutation round, we refitted the linear mixed model and estimated
the heritability. This process was repeated for a large number of permutations, generating a null
distribution of heritability estimates. Finally, we calculated the p-value for heritability by
counting the proportion of permutations in which the heritability estimate from the permuted
data was equal to or greater than the heritability estimate from the original, non-permuted data.
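The counting rule in the last step can be sketched generically. In the code below, `statistic` stands in for "refit the mixed model on the permuted phenotypes and return the heritability estimate"; the toy statistic, the weights and the function names are hypothetical.

```python
import random


def permutation_p_value(observed, values, statistic, n_perm=1000, seed=0):
    """Proportion of permutations whose statistic is >= the observed one."""
    rng = random.Random(seed)
    shuffled = list(values)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            count += 1
    return count / n_perm


# Toy statistic standing in for a refitted heritability estimate (hypothetical).
weights = [0.1 * i for i in range(20)]


def toy_statistic(values):
    return abs(sum(v * w for v, w in zip(values, weights)))


phenotypes = [0.5 * w for w in weights]  # perfectly aligned with the weights
observed = toy_statistic(phenotypes)
p_value = permutation_p_value(observed, phenotypes, toy_statistic, n_perm=200)
print(p_value)
```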
2.2.4 Simulation
Beyond testing on real data, simulated data for genetic association analysis
offer several advantages: the simulation approach can be scaled to an arbitrary number of
subjects; simulated data are inherently publicly available and shareable; and the transparency of
the simulation process can be leveraged to expose specific failure modes of different methods (E
Alsentzer et al., 2023). To examine the runtime and memory requirements, type I error rate,
statistical power, and the similarity or difference of the heritability and effect sizes produced by
different mixed model methods for repeated measurements, we designed and implemented a
simulation pipeline as follows.
We tested the runtime and memory of longitudinal linear mixed model packages such as
longGWAS (NA Furlotte et al., 2012), MMHE (T Ge et al., 2017) and ARLMM by varying
sample sizes, systematically measuring execution time and peak memory consumption. To assess
Type I error rates and statistical power, we performed a series of simulations under both null and
alternative hypotheses, following the steps in (X Wu et al., 2018): 1) Type I Error Assessment:
We simulated longitudinal subcortical volumes, covariates including age and sex, SNP(s), and the
correlation structure, assuming no genetic association. Each subject has longitudinal imaging
phenotypes measured three times, structured with first-order autoregressive (AR(1)) properties to
reflect realistic temporal or spatial correlations. Specifically, we generated a baseline
phenotype (e.g., hippocampus volume), followed by the second and third phenotypes, each
sequentially generated with a consistent correlation value based on AR(1). After generating the
data, we conducted hypothesis tests on each simulated dataset and calculated the proportion of tests
with p-values below a predetermined significance threshold (e.g., 0.05). This proportion
represents the Type I error rate, indicating the rate of false positives expected in our study. 2)
Power Analysis: To estimate statistical power, we simulated data under the alternative hypothesis
by introducing a genetic variant with a known effect size (e.g., 0.05). This effect size is
incorporated into the model to represent a true genetic association. As in the Type I error
assessment, we maintained the AR(1) correlation structure for repeated measurements and calculated
the proportion of tests with SNP p-values below the significance threshold (e.g., 0.05). This
proportion represents the statistical power of our study design, indicating the likelihood of
detecting a true genetic effect when it exists.
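Both quantities reduce to the same computation on the p-values collected across simulated datasets, as the following sketch shows. The helper name `rejection_rate` is hypothetical, and the uniform p-values are a stand-in for the output of a well-calibrated test under the null.

```python
import random


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated datasets whose p-value falls below alpha.

    Under the null this estimates the Type I error rate; under the
    alternative it estimates statistical power.
    """
    return sum(p < alpha for p in p_values) / len(p_values)


# Under the null, p-values from a well-calibrated test are uniform on [0, 1],
# so the rejection rate should be close to alpha.
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(10_000)]
type_i_error = rejection_rate(null_p_values, alpha=0.05)
print(type_i_error)
```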
Simulating Genotypes
Number of Loci: Total number of genetic loci $M$ used for simulating genetic correlations
between subjects (e.g., 1000, 5000, 10000).
Minor Allele Frequency (MAF): We simulated SNP genotypes assuming Hardy-Weinberg
equilibrium based on randomly chosen minor allele frequencies (MAFs) within a reasonable
range (e.g., 0.05 to 0.5). A genotype matrix $G$ for one population is computed where rows
represent subjects and columns represent SNPs. Each entry of $G$ is encoded as 0,
1 or 2, the number of copies of the minor allele, following the binomial distribution with 2 trials
and success probability $p$ equal to the MAF:
$$G \sim \mathrm{Binomial}(2, p) \tag{9}$$
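A minimal sketch of this genotype simulation using only the Python standard library is shown below. Drawing the MAFs uniformly from the stated range is an assumption (the text only says they are randomly chosen), and the function name is hypothetical.

```python
import random


def simulate_genotypes(n_subjects, n_snps, maf_range=(0.05, 0.5), seed=0):
    """Draw MAFs uniformly from maf_range, then genotypes ~ Binomial(2, MAF)."""
    rng = random.Random(seed)
    mafs = [rng.uniform(*maf_range) for _ in range(n_snps)]
    genotypes = [
        # Two independent allele draws per SNP give a count of 0, 1 or 2.
        [int(rng.random() < p) + int(rng.random() < p) for p in mafs]
        for _ in range(n_subjects)
    ]
    return genotypes, mafs


genotypes, mafs = simulate_genotypes(n_subjects=100, n_snps=50)
print(genotypes[0][:10])
```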
Genetic Relationship Matrix (GRM): We used the simulated SNPs to calculate the genetic
relationship matrix (GRM) and quantify the genetic similarity between subjects as in (PR Loh et
al., 2015). First, for each SNP $j$, compute the allele frequency $p_j$:
$$p_j = \frac{1}{2N}\sum_{i=1}^{N} G_{ij} \tag{10}$$
Then center the genotype values in $G$ to $G'$ with a mean of 0:
$$G'_{ij} = G_{ij} - 2p_j \tag{11}$$
Then scale the centered genotypes $G'$ to $G''$ by normalizing the SNP genotypes with the square root of
$2p(1-p)$ (Eq. 12) and take the dot product between subjects' genotypes (Eq. 13) (D Speed et al.,
2017):
$$G''_{ij} = \frac{G'_{ij}}{\sqrt{2p_j(1-p_j)}} \tag{12}$$
$$K = \frac{1}{M}\, G'' G''^{T} \tag{13}$$
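The four steps in Eqs. (10) to (13) can be sketched with NumPy as follows. The scaling uses the square root of 2p(1-p), the standard GRM normalization; the function name, the toy dimensions and the filter for monomorphic SNPs are assumptions added for illustration.

```python
import numpy as np


def genetic_relationship_matrix(G):
    """GRM from an (N subjects x M SNPs) matrix of 0/1/2 minor-allele counts."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                           # Eq. (10): allele frequencies
    centered = G - 2.0 * p                             # Eq. (11): column-centering
    scaled = centered / np.sqrt(2.0 * p * (1.0 - p))   # Eq. (12): scaling
    return scaled @ scaled.T / G.shape[1]              # Eq. (13): average over SNPs


rng = np.random.default_rng(1)
mafs = rng.uniform(0.05, 0.5, size=2000)
G = rng.binomial(2, mafs, size=(50, 2000))
G = G[:, G.min(axis=0) != G.max(axis=0)]  # drop monomorphic SNPs to avoid division by zero
K = genetic_relationship_matrix(G)
print(K.shape, K.diagonal().mean())
```

Because every SNP column is centered, each row of the resulting matrix sums to zero, which is a convenient sanity check.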
Population Structure: To further simulate population structure, we showcase two
subpopulations and how to simulate the desired population structure by adjusting the MAFs in
subpopulation 1 and subpopulation 2. First, randomly divide the total subjects into two distinct
subpopulations, assuming a proportion $w_1$ of all subjects comes from the first subpopulation, and a
proportion $w_2$ ($w_2 = 1 - w_1$) from the second subpopulation. The genotype matrices for the two
subpopulations are $G_1$ and $G_2$, respectively, following binomial distributions with 2 trials and
success probabilities given by the MAFs $p_1$ and $p_2$:
$$G_1 \sim \mathrm{Binomial}(2, p_1) \tag{14}$$
$$G_2 \sim \mathrm{Binomial}(2, p_2) \tag{15}$$
We introduce a relationship between $p_1$ and $p_2$ that reflects the desired population
structure, assuming the two populations differ slightly due to small genetic drift or local adaptation
(Eq. 16), by adding a small random value from a normal distribution to $p_1$ to obtain $p_2$, where
$\varepsilon \sim N(0, \sigma^2)$. Further details about the model and inferences can be found in (DJ Cutler et al.,
2010; T Günther et al., 2013).
$$p_2 = \min(\max(p_1 + \varepsilon, 0), 1) \tag{16}$$
Values of $\sigma^2$ are chosen to align with the evolutionary processes that influence the
structure of genetic variation within and among human populations (L Excoffier et al., 2008;
KE Holsinger et al., 2009). We begin with $\sigma^2 = 0.01$ and increase it to 0.03 if the observed
divergence seems too low. This range represents closely related human populations with minimal
genetic differentiation, corresponding to a fixation index $F_{ST}$ between 0.01 and 0.05 (NA
Rosenberg et al., 2002; A Bergström et al., 2020).
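The perturbation in Eq. (16) can be sketched in a few lines. The function name `drift_maf`, the uniform draw for the first subpopulation's MAFs and the seed are assumptions; the variance value follows the starting point suggested above.

```python
import random


def drift_maf(p1, sigma2=0.01, rng=random):
    """p2 = min(max(p1 + eps, 0), 1) with eps ~ N(0, sigma2), as in Eq. (16)."""
    eps = rng.gauss(0.0, sigma2 ** 0.5)
    return min(max(p1 + eps, 0.0), 1.0)


rng = random.Random(42)
p1 = [rng.uniform(0.05, 0.5) for _ in range(1000)]      # MAFs in subpopulation 1
p2 = [drift_maf(p, sigma2=0.01, rng=rng) for p in p1]   # perturbed MAFs in subpopulation 2
print(p1[:3], p2[:3])
```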
Effect Size: We simulate the SNP effect size β following the approach implemented in the Genome-Wide
Complex Trait Analysis (GCTA) software (J Yang et al., 2010). The SNP genotype(s) are first standardized such
that each column has a variance of 1. GCTA assumes a strong negative relationship between
the variance of the effect size β and the MAF $p$ (D Speed et al., 2017; J Zeng et al., 2018). Specifically, the
variance of the effect size is modeled to be inversely proportional to $p(1 - p)$:
$$\beta \sim N\left(0, \frac{c}{p(1-p)}\right) \tag{18}$$
$c$ is a scaling factor that controls the magnitude of the effect size, for example $c = 0.1$ for highly
polygenic traits typically influenced by thousands of SNPs with small effect sizes (J Yang et al.,
2010).
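As a small illustration of Eq. (18), the sketch below draws an effect size whose variance is inversely proportional to p(1 - p). The function name and the `scale` argument (standing for the scaling factor) are hypothetical labels.

```python
import random


def simulate_effect_size(maf, scale=0.1, rng=random):
    """Draw beta ~ N(0, scale / (maf * (1 - maf)))."""
    variance = scale / (maf * (1.0 - maf))
    return rng.gauss(0.0, variance ** 0.5)


rng = random.Random(0)
common = [simulate_effect_size(0.5, rng=rng) for _ in range(20_000)]
rare = [simulate_effect_size(0.05, rng=rng) for _ in range(20_000)]
var_common = sum(b * b for b in common) / len(common)
var_rare = sum(b * b for b in rare) / len(rare)
print(var_common, var_rare)
```

With a scaling factor of 0.1, the theoretical variance is 0.4 at a MAF of 0.5 and about 2.1 at a MAF of 0.05, so rarer variants receive larger simulated effects.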
Simulating Phenotypes
Number of Subjects: We simulated different numbers of subjects in the study (e.g., 200, 400,
600, …, 1,000, 10,000) that correspond to real neuroimaging datasets from small scale to large
scale. While subjects are not required to have the same number of repeated measurements, for
simplicity, our simulation assumes that each subject has three measurements.
Phenotypes: Phenotypes were simulated based on a first-order autoregressive (AR(1)) model,
where the autocorrelation ρ between any two consecutive phenotypic measurements is constant, for
example weak (0.2<=|ρ|<0.3), moderate (0.3<=|ρ|<0.5) or strong (0.5<=|ρ|<=1) correlation. A
baseline phenotype (e.g., hippocampus volume) was generated following LMMs, then follow-up
phenotypic measurements were sequentially generated. Depending on whether the phenotype is
continuous or binary, we may further transform the phenotype to normalize its distribution for
continuous traits (e.g., using log or square root transformations) or to balance class distributions
for binary traits (e.g., using oversampling, undersampling, or synthetic data generation methods).
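The sequential generation of AR(1) phenotypes can be sketched as below. For simplicity the baseline is drawn from a plain normal distribution rather than from a fitted LMM, and the innovation variance is chosen so that every visit has the same marginal variance; both choices, along with the function names, are assumptions of this sketch.

```python
import math
import random


def simulate_ar1_phenotypes(n_subjects, rho, n_visits=3, mean=0.0, sd=1.0, seed=0):
    """Per-subject series with constant correlation rho between consecutive visits."""
    rng = random.Random(seed)
    innovation_sd = sd * math.sqrt(1.0 - rho ** 2)
    data = []
    for _ in range(n_subjects):
        series = [rng.gauss(mean, sd)]  # baseline measurement
        for _ in range(n_visits - 1):
            series.append(mean + rho * (series[-1] - mean) + rng.gauss(0.0, innovation_sd))
        data.append(series)
    return data


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


data = simulate_ar1_phenotypes(n_subjects=5000, rho=0.5)
visit = [[row[j] for row in data] for j in range(3)]
print(pearson(visit[0], visit[1]), pearson(visit[0], visit[2]))
```

Under AR(1) the correlation between the first and third visits is approximately ρ², which the second printed value should reflect.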
Covariates: We included covariates that may affect the association (e.g., age, sex, principal
components). Age values at baseline were randomly sampled from the age range associated
with the disease (e.g., 30-90 years) and increased by one year with each subsequent
measurement. We simulated sex with a binomial distribution to assign subjects as either male or
female, for example using a probability of 0.35 for male, since AD is more common in
females than males, and to align with previous studies reporting that after age 65, approximately
two-thirds of AD patients are female (A Gustavsson et al., 2023). When analyzing subcortical measures,
Intracranial Volumes (ICVs) were included as a fixed covariate in the linear mixed
models to control for head size (TG van Erp et al., 2016). ICVs were simulated separately for
females and males, as males typically have larger ICVs than females, with differences ranging
from 10% to 13% on average (AN Ruigrok et al., 2014). For females, ICV values were sampled
from a normal distribution with a mean of 1,400 cm³ (representing average values within the
normal range) and a standard deviation of 200 cm³. For males, the mean ICV was increased by
10% (1,540 cm³), with a corresponding proportional increase in the standard deviation (220 cm³).
Simulated PCs were used to account for confounding due to population stratification. Principal
components were computed by performing PCA on the standardized genetic matrix, revealing
the main axes of genetic variation. The first two PCs were plotted to see whether subjects
separate based on population structure or other genetic differences.
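The covariate settings described above (age, sex and ICV) can be sketched in long format, one row per subject and visit. Sampling baseline age uniformly over the stated range, the dictionary layout and the function name are assumptions; the numeric settings are the examples given in the text.

```python
import random


def simulate_covariates(n_subjects, n_visits=3, p_male=0.35, seed=0):
    """One row per subject and visit with age, sex and intracranial volume (cm^3)."""
    rng = random.Random(seed)
    rows = []
    for subject in range(n_subjects):
        male = rng.random() < p_male
        baseline_age = rng.uniform(30, 90)
        # ICV is static per subject: N(1540, 220) for males, N(1400, 200) for females.
        icv = rng.gauss(1540, 220) if male else rng.gauss(1400, 200)
        for visit in range(n_visits):
            rows.append({
                "subject": subject,
                "visit": visit,
                "sex": "M" if male else "F",
                "age": baseline_age + visit,  # one year older at each visit
                "icv": icv,
            })
    return rows


rows = simulate_covariates(n_subjects=2000)
print(rows[0])
```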
Simulating Environment Setups
Python version 3.10.2 and R version 4.3.2 were configured to test different linear mixed
model packages for repeated measurements. All simulations were run on one Tensor Processing
Unit (TPU) v2-8 in Google Colab, which consists of 8 cores with up to 334.6 GB of system RAM
and 225.3 GB of disk storage (high-RAM configuration). TPUs, which are publicly
available on the Google Cloud Platform (GCP), offer a flexible and accessible framework for
distributed computation at high resolution, reducing the dependency on specialized
high-performance computing (HPC) hardware, enabling researchers to conduct
resource-intensive simulations in a cost-effective manner (NP Jouppi et al., 2017; Wang Y et al.,
2020; TO Kehinde et al., 2023). Additionally, TPUs enhance reproducibility and replicability by
streamlining complex computational setups, making cutting-edge research more accessible and
easier to replicate in diverse computational environments.
2.2.5 Model Evaluation
To compare ARLMM with publicly available packages including longGWAS (NA
Furlotte et al., 2012) and MMHE (T Ge et al., 2017), we evaluate performance using standard
predictive metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared
(R2) (D Chicco et al., 2021; AV Tatachar et al., 2021). RMSE is calculated as the square
root of the average of the squared differences between the observed and predicted values,
providing a measure of model accuracy by penalizing larger errors more heavily. MAE
represents the average absolute difference between the observed and predicted values, offering a
more interpretable measure of error in the same units as the data. RMSE and MAE are
normalized to be scale-independent, between 0 and 1, by dividing the RMSE and MAE values by the
mean of the phenotypes across all time points or spatial locations. R-squared, or the coefficient
of determination, ranging from 0 to 1, quantifies the proportion of variance in the phenotypes
explained by the predicted model, providing insight into how well the model fits the data and
captures the underlying variability.
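These metrics can be written out directly. The sketch below uses only the standard library; the function names and the toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def normalize_by_mean(error, y_true):
    """Divide an error value by the mean of the observed phenotypes."""
    return error / (sum(y_true) / len(y_true))


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
print(normalize_by_mean(rmse(y_true, y_pred), y_true))
```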
A standard longitudinal 5-fold cross validation (CV) (C Bergmeir et al., 2018) was used
to evaluate the performance and associated prediction errors of mixed models on the ADNI and
UK Biobank datasets. The subjects were first randomly split into 5 sets (random 5-fold CV). Each
time, a different set was withheld for testing and the mixed model was trained on the other 4 sets,
including all repeated measurements of the remaining subjects, ensuring
that all measurements from a given subject were either in the training or testing set to prevent
data leakage. In each test set, RMSE, MAE, and R-squared values were computed to generate
CV predictive performance metrics. These metrics were aggregated across all folds to provide
final robust and unbiased estimates of overall model performance.
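A subject-level split of this kind can be sketched as follows, assuming long-format data with one subject identifier per row. The function name and the round-robin assignment of shuffled subjects to folds are assumptions of this sketch.

```python
import random


def subject_kfold(subject_ids, k=5, seed=0):
    """Yield (train_idx, test_idx) so that all rows of a subject share a fold."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of = {s: i % k for i, s in enumerate(subjects)}
    for fold in range(k):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != fold]
        yield train, test


# Ten hypothetical subjects with three repeated measurements each.
subject_ids = [s for s in range(10) for _ in range(3)]
folds = list(subject_kfold(subject_ids, k=5))
for train, test in folds:
    print(len(train), len(test))
```

Splitting on subjects rather than rows is what keeps a subject's baseline scan from appearing in training while the follow-up scans are in the test set.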
Figure 2.2.5: Autoregressive Linear Mixed Model illustration. A) illustrates the data acquisition procedures including imaging
phenotypes, genotypes and demographic variables for genetic association analysis applied to temporally correlated or spatially
correlated imaging phenotypes. For temporally correlated imaging phenotypes, we collected T1-weighted MRI scans from 645
subjects in ADNI1 and ADNI2 at baseline, 12-month, and 24-month time points. For spatially correlated imaging phenotypes, we
used DTI Fractional Anisotropy (FA) measurements from three sections of the Corpus Callosum, obtained from the first 10,000
imaged subjects released in the UK Biobank, totaling 8,330 subjects. Genotype information includes genome-wide SNP data
available from ADNI and UK Biobank, where SNPs that showed protective or risk effects on the imaging phenotypes serve as
the positive control, and SNPs with neutral effects as the negative control. Demographics include 1) static variables, such as sex,
which remain constant across repeated measurements; 2) dynamic variables, such as age, which vary across repeated
measurements. B) demonstrates the workflow of our Autoregressive Linear Mixed Model (ARLMM) including model
specification and model evaluation. ARLMM jointly models stable groups (stable CN, stable MCI, and stable AD) to infer the
autoregressive correlations within the converter MCI group, where autoregressive parameters were estimated using Support
Vector Regression (SVR). ARLMM performed comprehensive model evaluation of imaging phenotypes, genotypes to be tested
for associations, and correlations from different sources, such as genetic similarities for heritability calculations, scanner or site
effects, and disease progression stages.
2.3 Application I: Temporal Modeling
2.3.1 Data Acquisition
ADNI study sample
The data used in this study were obtained from the Alzheimer’s Disease Neuroimaging
Initiative (ADNI) (http://adni.loni.usc.edu), a large ongoing longitudinal study that has collected
genetic, imaging and demographic data for more than 2400 subjects aged between 55 and 90 years
since 2004. ADNI aims to discover, standardize and validate biomarkers and clinical
measures used for AD progression. T1-weighted brain MRI scans were obtained from more than
60 sites across the United States and Canada. Our analysis included 645 subjects of European
ancestry from phases 1 and 2 of ADNI, all of whom had complete baseline, 12-month, and
24-month data for the following measurements: age, sex, diagnosis (CN/MCI/AD), genome-wide
SNPs including APOE genotypes, subcortical volumes, and intracranial volumes (ICVs).
ADNI diagnostic groups
The subjects were divided into four diagnostic groups based on their diagnoses at
baseline, 12-month and 24-month: stable CN (CN -> CN -> CN) (N = 217), stable MCI (MCI ->
MCI -> MCI) (N = 214), stable AD (AD -> AD -> AD) (N = 97) and converter MCI group
(MCI -> MCI -> AD or MCI -> AD -> AD) (N = 117). To eliminate the effects of diagnosis
changes at later time points, we excluded 12 subjects: 1) those in the stable groups who
converted from CN to MCI or AD, or from MCI to AD after the 24-month; 2) those in the
converter group who improved from AD to MCI or CN after the 24-month.
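The grouping rule above can be written as a small lookup. This is an illustrative sketch only; the function name and the string labels are assumptions, and sequences outside the four definitions (for example reversions) simply return `None`.

```python
def diagnostic_group(baseline, month12, month24):
    """Map a three-visit diagnosis sequence to one of the four study groups.

    Returns None for sequences that match none of the four definitions.
    """
    sequence = (baseline, month12, month24)
    stable = {
        ("CN", "CN", "CN"): "stable CN",
        ("MCI", "MCI", "MCI"): "stable MCI",
        ("AD", "AD", "AD"): "stable AD",
    }
    if sequence in stable:
        return stable[sequence]
    if sequence in {("MCI", "MCI", "AD"), ("MCI", "AD", "AD")}:
        return "converter MCI"
    return None
```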
ADNI genetic preprocessing
We downloaded APOE information from the ADNI genetic core section
(https://adni.loni.usc.edu/category/genetics-core/) and encoded it as 0, 1, 2, representing the number
of APOE ε4 alleles a subject carries. To mitigate potential conflicts between the protective
effects of the APOE ε2 allele and the risk-enhancing effects of the ε4 allele, we excluded
subjects carrying any ε2 alleles (ε2/ε2, ε2/ε3, and ε2/ε4). The analysis focused exclusively on
participants with the APOE genotypes ε3/ε3, ε3/ε4, and ε4/ε4 (Table 2.5.2). We
used GCTA (J Yang et al., 2010) to estimate the genetic relationship matrix (GRM), built with
644,855 SNPs, to capture the between-subject genetic similarities. The genetic data downloaded
from the ADNI database were processed following the Enhancing Neuro Imaging Genetics Through Meta Analysis
(ENIGMA) genetic imputation protocol
(https://enigma.ini.usc.edu/protocols/genetics-protocols/) and the quality control (QC) criteria
(MAF < 0.01; genotype call rate < 95%; Hardy-Weinberg equilibrium p < $10^{-6}$) (SE Medland
et al., 2022). The first four multidimensional scaling (MDS) components were included in the
fixed effects to adjust for population stratification, following the ENIGMA genetic processing
protocol.
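A minimal sketch of the genotype encoding and QC thresholds described above. The genotype string format (`"e3/e4"`), the function names, and the reading of the three QC conditions as exclusion thresholds are assumptions made for illustration, not the ENIGMA or ADNI tooling.

```python
def apoe4_dosage(genotype):
    """Return the number of APOE e4 alleles (0, 1 or 2) for a genotype
    string such as 'e3/e4', or None if the subject carries any e2 allele
    and would therefore be excluded from the analysis."""
    alleles = genotype.lower().split("/")
    if len(alleles) != 2 or any(a not in {"e2", "e3", "e4"} for a in alleles):
        raise ValueError(f"unrecognized APOE genotype: {genotype!r}")
    if "e2" in alleles:
        return None
    return alleles.count("e4")


def passes_qc(maf, call_rate, hwe_p):
    """Assumed reading of the QC criteria: a SNP is dropped when
    MAF < 0.01, call rate < 95%, or the HWE p value < 1e-6."""
    return maf >= 0.01 and call_rate >= 0.95 and hwe_p >= 1e-6
```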
ADNI MRI acquisition
T1-weighted brain MRI scans using General Electric (GE), Philips, and Siemens 3T
scanners at baseline, 12-month, and 24-month from ADNI1 and ADNI2 databases were
obtained, following ADNI MRI scanner protocol
(https://adni.loni.usc.edu/data-samples/adni-data/neuroimaging/mri/mri-scanner-protocols/). The
left and right subcortical volumes of hippocampus, amygdala, thalamus, pallidum, putamen,
nucleus accumbens, caudate were extracted with longitudinal FreeSurfer version 5.3 (B Fischl et
al., 2012; M Reuter et al., 2012). Automatic subcortical parcellation was applied to detect all 7
subcortical structures based on a probabilistic atlas (S Rane et al., 2017; CR Jack Jr et al., 2024).
The left and right subcortical volumes were averaged as the final phenotypes to be analyzed. To
account for variability in head sizes, intracranial volume (ICV) was included as a covariate,
calculated with eTIV (https://surfer.nmr.mgh.harvard.edu/fswiki/eTIV) in FreeSurfer. For each
diagnostic group, outlier detection was performed across all three time points for subcortical
volumes for each subcortical region. This was achieved using a probability density-based
spatial-temporal clustering algorithm, DBSCAN (E Schubert et al., 2017), which identified core
subcortical volume data samples belonging to high-density clusters, while labeling others as
outliers to be removed (N = 37). This clustering-based preprocessing step enhanced the
reliability of the longitudinal subcortical volume measurements within each diagnostic group,
and mitigated the influence of noise in the imaging data.
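The density-based outlier step can be illustrated with a compact DBSCAN written in plain Python. This is a sketch under stated assumptions: each subject is represented as a point of three volumes (baseline, 12-month, 24-month), and the `eps` and `min_samples` values are placeholders, since the thesis does not report them in this section. A production analysis would more likely use an established implementation such as scikit-learn's `DBSCAN`.

```python
def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster index (0, 1, ...) or -1 for
    noise. A point is a core point when at least `min_samples` points
    (itself included) lie within Euclidean distance `eps`."""
    n = len(points)

    def neighbors(i):
        return [
            j for j in range(n)
            if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2
        ]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels
```

Subjects labeled `-1` would be the candidates for removal within a diagnostic group.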
2.3.2 Temporal Autoregressive Linear Mixed Modeling
Our goal is to analyze the effect of APOE4 on the volumetric phenotypes of seven
subcortical regions across three time points: baseline, 12-month, and 24-month. ARLMM
requires the following inputs: 1) The phenotype comprised the averaged left and right
subcortical volumes, ordered by increasing time for each subject (baseline, 12-month, 24-month).
2) Fixed effects included age, sex, ICV, and the first four MDS components. 3) The genotype was
encoded as 0, 1, 2 for APOE4 status, the number of APOE ε4 alleles each subject carries. 4)
The GRM correlation matrix was estimated by GCTA to capture genetic similarities between subjects (J
Yang et al., 2010). 5) A site correlation matrix Σ was designed to capture between-subject variation
due to scanner or site differences. The complete imaging, genetic and demographic ADNI1 and
ADNI2 data were collected from 58 clinical sites, so we extracted the site ID from the subject ID:
its first three digits form a unique numeric code indicating the clinical site at which the subject’s data were
collected. The site matrix consisted of multiple all-ones submatrices, where each submatrix corresponded to
subjects scanned at the same site. 6) An autoregressive correlation matrix Σ introduced separate autoregressive
parameters for the stable CN, stable MCI and stable AD groups; the parameters
estimated from the stable groups were then held fixed when estimating those of the converter group.
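The two structured correlation matrices in inputs 5) and 6) can be sketched as below. The function names and the example subject IDs are illustrative assumptions; the AR(1) form, with correlation decaying as a power of the lag, is the standard first-order autoregressive structure and is used here as a stand-in for the thesis's exact parameterization.

```python
def site_id(subject_id):
    """Take the leading three-digit site code from an ADNI-style subject ID
    such as '011_S_0002' (example ID shown for illustration only)."""
    return subject_id[:3]


def site_correlation(subject_ids):
    """Entry (i, j) is 1 when subjects i and j share a site, else 0.
    With subjects sorted by site this gives all-ones diagonal blocks."""
    sites = [site_id(s) for s in subject_ids]
    return [[1.0 if a == b else 0.0 for b in sites] for a in sites]


def ar1_correlation(rho, n_times):
    """AR(1) correlation matrix: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]
```

For three visits and a hypothetical `rho = 0.5`, `ar1_correlation(0.5, 3)` gives correlations of 0.5 between adjacent visits and 0.25 between baseline and 24-month.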
The genetic association analyses were conducted in the following cases for group
comparison and model comparison: 1) across the four diagnostic groups (stable CN, stable
MCI, converter MCI and stable AD) and all groups combined; 2) with different longitudinal mixed model
packages, such as longGWAS, MMHE and ARLMM.
2.4 Application II : Spatial Modeling
2.4.1 Data Acquisition
UK Biobank study sample
The UK Biobank (UKBB) (https://www.ukbiobank.ac.uk/) is a large
prospective epidemiological study, with an extensive collection of imaging, genetic, demographic
and clinical data from more than 500,000 subjects aged between 40 and 69 years, collected since 2006
(KL Miller et al., 2016; C Bycroft et al., 2018). This study is part of the UKBB application
#11559. UKBB imaging data have been obtained through 4 scanning sites, Cheadle Imaging
Centre (2014), Newcastle Imaging Centre (2015), Reading Imaging Centre (2016) and Bristol
Imaging Centre (2017). We used DTI data from the first 10,000 imaged UKBB subjects released
in 2017, who were scanned at the Cheadle Imaging Centre. Our study included 8,330 UKBB
subjects of European ancestry who had complete data for the following measurements: age, sex,
genome-wide SNPs including APOE and VCAN genotypes, and FA measures for the genu, body
and splenium of the corpus callosum (CC). Our goal is to comprehensively evaluate the
performance of mixed model methods published within the last twenty years, including longGWAS (NA
Furlotte et al., 2012), MMHE (T Ge et al., 2017) and L-GATOR (X Wu et al., 2018). We first
randomly selected 1,000 subjects from the total of 8,330 UKBB subjects, then incrementally added
100 subjects at a time until reaching 4,000 subjects, at which point longGWAS reached its
computational limits, requiring several weeks to produce genetic association results even with GPU or
TPU acceleration.
UK Biobank diagnostic groups
We excluded UKBB subjects with diagnoses of neurological conditions, leaving only one
diagnostic group, cognitively normal (CN), to be analyzed. The neurological conditions were
identified using the 3-character codes of the International Classification of Diseases, 10th Revision
(ICD-10), based on self-report, primary care, and hospital inpatient data from 2019. The UKBB data
fields used to identify overall and specific neurological conditions are detailed in N Veronese et
al. (2019).
UK Biobank genetic preprocessing
We downloaded top SNP information for the VCAN (rs35544841) and APOE4 (rs429358
and rs7412) genes from the UKBB Research Analysis Platform (RAP)
(https://ukbiobank.dnanexus.com/) and encoded each SNP as 0, 1, 2, representing the number of
copies of the alternative allele relative to the reference allele. Raw genotypes (~18,000,000
genetic variants) were downloaded and used to create the GRM using the raremetalworker
(RMW) software package (https://genome.sph.umich.edu/wiki/RAREMETALWORKER). We
did not use the X chromosome in our GRM calculation. The first 4 genetic MDS
components were included in the fixed effects to adjust for population stratification.
UK Biobank DTI acquisition
The UK Biobank's Diffusion Tensor Imaging (DTI) data were acquired using Siemens
Skyra 3T MRI scanners, equipped with standard 32-channel head coils (KL Miller et al., 2016).
This uniformity in imaging hardware across all assessment centers ensures consistency and
reliability in the DTI measurements. We downloaded imaging-derived phenotypes (IDPs) related
to DTI measures from the UKBB database, including mean FA in genu of corpus callosum,
body of corpus callosum and splenium of corpus callosum. In addition to the UKBB brain
imaging QC protocol, DTI measures underwent quality control with the density-based clustering method
DBSCAN (E Schubert et al., 2017) for outlier detection (N = 129).
2.4.2 Spatial Autoregressive Linear Mixed Modeling
We performed genetic association analysis for VCAN and APOE4 with mean FA
measures in the genu, body and splenium of the CC. The association with the VCAN gene acts as a
positive control, as it is the only locus associated with FA measures in all three sections of the CC,
and the association with APOE4 acts as a negative control, as its SNPs showed no effect on
FA measures in any CC region (SR Cox et al., 2016; LT Elliott et al., 2018; B Zhao et al., 2021).
ARLMM requires the following inputs: 1) The phenotype comprised FA measures in the
genu, body and splenium of the CC, arranged in spatial order from anterior to posterior. 2)
Fixed effects included age, sex, and the first 4 genetic PCs. 3) The genotype was encoded as 0,
1, 2 for the VCAN or APOE4 SNPs. 4) The GRM correlation matrix was estimated with the RMW software
package for between-subject genetic similarities. 5) The scanner-related correlation matrix Σ
was excluded, as UKBB DTI data were acquired using a standard imaging protocol
to minimize variability in the imaging data due to hardware differences. 6) The autoregressive correlation matrix
Σ introduced autoregressive parameters only within the stable CN group.
The genetic association analyses were conducted in the following cases for group
comparison and model comparison: 1) for one diagnostic group, stable CN, comprising all subjects
without neurological conditions; 2) with different longitudinal mixed model packages, such as
longGWAS, MMHE and ARLMM.
2.5 Results
2.5.1 Simulation
We first conducted simulation studies to test whether the longitudinal mixed models
(longGWAS, MMHE, ARLMM) can scale to large longitudinal datasets across varying
parameter settings. We used the genotype data from our simulation pipeline, where the
population structure was visualized with scatter plots of the first principal component (PC1) against
the second principal component (PC2) (Figure 2.5.1.1A and Figure 2.5.1.1B). Two
subpopulations were generated by shifting the global distribution of minor allele frequencies
with an additional normal perturbation with a mean of 0 and variance of 0.05. As the sample size
increases (N = 200, 400, 600, 800, 1000), scatter plots of PC1 vs PC2 consistently reveal either one cluster
(Figure 2.5.1.1A) or two clusters (Figure 2.5.1.1B), corresponding to one or two
populations, respectively. Similarly, the genetic relationship matrix (GRM), which quantifies the
genetic similarity between subjects and was derived from the simulated SNP data of 10,000 loci, contains
one or two block submatrices, corresponding to either one population (Figure 2.5.1.1C) or two populations
(Figure 2.5.1.1D).
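The simulation of subpopulation structure and the resulting block pattern in the GRM can be sketched as follows. This is an illustration under several assumptions that are not taken from the thesis pipeline: global allele frequencies are drawn uniformly from [0.05, 0.5], shifted frequencies are clipped to [0.05, 0.95], genotypes are drawn as two independent Bernoulli alleles, and the GRM uses the commonly used standardized-genotype form (each locus centered by twice its sample allele frequency and scaled by the binomial variance).

```python
import math
import random


def simulate_genotypes(n_per_pop, n_loci, n_pops=2, shift_var=0.05, seed=0):
    """Simulate 0/1/2 genotypes for n_pops populations whose allele
    frequencies are the global ones plus N(0, shift_var) noise."""
    rng = random.Random(seed)
    global_freq = [rng.uniform(0.05, 0.5) for _ in range(n_loci)]
    sd = math.sqrt(shift_var)  # the text gives a variance, so take the root
    genotypes = []
    for _ in range(n_pops):
        if n_pops == 1:
            freq = global_freq
        else:
            freq = [min(0.95, max(0.05, p + rng.gauss(0.0, sd))) for p in global_freq]
        for _ in range(n_per_pop):
            genotypes.append([int(rng.random() < p) + int(rng.random() < p) for p in freq])
    return genotypes


def grm(genotypes):
    """Genetic relationship matrix from standardized genotypes,
    skipping loci that are monomorphic in the sample."""
    n, m = len(genotypes), len(genotypes[0])
    freq = [sum(row[k] for row in genotypes) / (2 * n) for k in range(m)]
    keep = [k for k in range(m) if 0.0 < freq[k] < 1.0]
    z = [
        [(row[k] - 2 * freq[k]) / math.sqrt(2 * freq[k] * (1 - freq[k])) for k in keep]
        for row in genotypes
    ]
    return [
        [sum(a * b for a, b in zip(z[i], z[j])) / len(keep) for j in range(n)]
        for i in range(n)
    ]
```

With two populations, the off-diagonal entries within a population are on average larger than those between populations, which is the block structure visible in a GRM heatmap.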
Figure 2.5.1.1: Visualization of simulated genetic structure for one population or two populations. A) Scatter plot of the
simulated first two principal components (PC1 vs PC2) with only one population simulated. Principal component analysis (PCA)
was performed on the simulation datasets with different sample sizes (N = 200, 400, 600, 800, 1000). B) Scatter plot of the
simulated first two principal components shows two clusters corresponding to two populations, where each subpopulation was
generated with subpopulation-specific minor allele frequencies (MAFs) by slightly shifting the global MAFs with a normal
distribution N(0, 0.05). C) Heatmap of the one-population genetic relationship matrix (GRM). We show only 1000 samples for better
visualization of the between-subject genetic correlation (N = 1000). D) Heatmap of the two-population genetic relationship matrix (GRM),
where subjects within the same population are closely genetically correlated (N = 1000).
We compared our method ARLMM with the other longitudinal linear mixed model methods,
longGWAS and MMHE, on simulated genotype and phenotype datasets generated by
our pipeline as described in Section 2.2.4. ARLMM is the fastest method among the
publicly available mixed model packages of the last twenty years, and remains feasible for
high-dimensional and large-scale neuroimaging data with moderate memory
requirements (Figure 2.5.1.2). The running time of longGWAS grows polynomially (close to $O(N^3)$)
as the number of subjects increases, making it infeasible for large-scale datasets with
thousands of individuals and for intensive simulations that require multiple rounds of hypothesis
tests. MMHE is at least 2 times slower than ARLMM, and becomes very time consuming at
biobank-scale sample sizes of 10,000 subjects or more (Figure
2.5.1.2A). Let $N$ denote the total number of study participants and $t$ the maximum number of repeated
measurements per individual. The running time of ARLMM depends mainly on the
cost of matrix computation for support vector regression with a linear kernel, which scales as $O(N^2t^2)$
in the worst case and $O(Nt)$ in the best case with matrix sparsification techniques, and is
sped up by GPU/TPU parallel computing. The running time also varies with heritability
values, genetic relatedness, and population structure. In terms of peak memory usage,
longGWAS has the lowest requirement, $O(Nt)$, while
ARLMM requires $O(N^2t^2)$ memory, mainly for storing and processing the genetic
correlation and phenotype correlation matrices, slightly less than MMHE (Figure
2.5.1.2B).
Figure 2.5.1.2: Time and space complexity of different longitudinal linear mixed models using simulated data. A) shows
the relationship between the number of subjects and the average running time (in seconds) required for three linear mixed model
packages, longGWAS, MMHE and ARLMM. Simulated longitudinal subcortical imaging phenotypes following a first-order
autoregressive (AR(1)) correlation, covariates including age, sex and ICV, and the SNP to be tested, encoded as 0, 1, 2, were randomly
generated. The data points, represented by red, blue and yellow circles, indicate the observed running times of ARLMM, MMHE
and longGWAS for different numbers of subjects, ranging from 200 to 1000. The yellow trend line for longGWAS fitted through the data
points indicates polynomial growth ($O(N^3t^3)$, where $N$ is the number of subjects and $t$ the maximum number of repeated
measurements per subject) in computational time as the sample size increases. The blue trend line for
MMHE and the red line for ARLMM are computationally achievable at $O(N^2t^2)$ in the worst case, while our method ARLMM is at
least 2 times faster than MMHE and scales as $O(Nt)$ with matrix sparsification techniques. B) illustrates the average peak
memory usage (in MB) when the number of subjects increases from 200 to 1000. The red circles (ARLMM) represent the
observed peak memory usage across different subject counts, showing a near-linear increase as the sample size grows. In
comparison, longGWAS requires significantly less space without complex matrix computations, while MMHE requires slightly
more memory than ARLMM.
We further evaluated the estimation accuracy of heritability and phenotypic autocorrelations,
and performed power analysis for the different longitudinal mixed model packages. To analyze the
effect of population stratification on the accuracy of heritability estimation, we fixed the number of
simulation rounds per subject (T = 2000), and compared the heritability estimates for 200
subjects or 1000 subjects. We performed Kolmogorov-Smirnov tests on the mean cumulative
distribution functions (CDFs) of heritability estimates (Figure 2.5.1.3C and Figure 2.5.1.3D).
The results showed no significant differences between the heritability estimates for one
population and two populations (p > 0.05). We also assessed the power of ARLMM, longGWAS,
and MMHE, either empirically from our simulations or based on previous studies (X Wu et
al., 2018). There was no significant difference in power between ARLMM and longGWAS in
small samples (N < 200) (NA Furlotte et al., 2012; T Ge et al., 2017; X Wu et al., 2018),
but ARLMM achieved higher power than longGWAS and MMHE in larger samples (N < 1000)
(Figure 2.5.1.3E). The novelty of the Autoregressive Linear Mixed Model (ARLMM) lies in
incorporating an autoregressive framework into the random effects, specifically to jointly model
different diagnostic trajectories. To the best of our knowledge, no previous study has evaluated
the accuracy of autoregressive parameter estimates through simulation studies. First, for
imaging phenotypes whose repeated measures are generated independently of any
previous ones (N = 1000, T = 2000), with the other mixed model parameters, such as the genetic
and environmental variance components, held fixed, the estimated
autoregressive parameter ρ is not significant (p = 0.73). Second, when the correlation between
the next phenotype and the current phenotype is primarily driven by the autoregressive mechanism, we
tested the autocorrelation for weak (0.2 <= |ρ| < 0.3), moderate (0.3 <= |ρ| < 0.5) and strong
relationships (0.5 <= |ρ| <= 1). Notably, even at the borderline of a strong correlation (ρ = 0.5),
ARLMM produced an accurate and unbiased estimate of the autoregressive parameter (Figure
2.5.1.3F).
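The idea of checking whether an autoregressive parameter is recovered from simulated repeated measures can be illustrated with a deliberately simplified sketch. It does not reproduce ARLMM, which estimates the parameter with support vector regression inside the mixed model; here phenotypes are pure stationary AR(1) series with unit variance, and the estimator is a pooled lag-1 Pearson correlation. All function names are hypothetical.

```python
import math
import random


def simulate_ar1(n_subjects, n_times, rho, seed=0):
    """Simulate stationary AR(1) series with unit marginal variance,
    one series of n_times repeated measures per subject."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        y = [rng.gauss(0.0, 1.0)]
        for _ in range(n_times - 1):
            y.append(rho * y[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        data.append(y)
    return data


def estimate_rho(data):
    """Pooled lag-1 Pearson correlation across all subjects."""
    x = [row[k] for row in data for k in range(len(row) - 1)]
    y = [row[k + 1] for row in data for k in range(len(row) - 1)]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Calling `estimate_rho(simulate_ar1(1000, 3, 0.5))` returns a value close to 0.5, and a series simulated with `rho = 0.0` yields an estimate near zero, mirroring the two scenarios described above.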
Figure 2.5.1.3: Power analysis and accuracy of mixed model parameter estimates as the simulated sample size increases.
A) shows the original heritability statistic for ARLMM, represented as a red line, and a histogram of the distribution of
heritability statistics from 1000 permutations (N = 200). The genetic differences are simulated so as not to contribute to the observed
variation in the trait within the population, and the estimated h2 is 0.0005 (p = 0.33). B) shows the original heritability statistic
and a histogram of the distribution of heritability statistics from 1000 permutations (N = 400). The estimated h2 (h2 =
2.35E-11) is closer to the true h2 (h2 = 0) as the number of subjects increases (p = 0.78). C) The empirical cumulative
distribution function (CDF) of permuted heritability for a small sample size (N = 200). D) The empirical CDF
of permuted heritability as the sample size increases (N = 400). Permuted heritability clusters more tightly around
the original heritability estimate when the number of subjects increases from 200, 400, 1000, ..., to 10,000 while the
number of repeated measurements stays the same, indicating that the ARLMM variance parameter estimates related to the
genetic correlations between subjects control type I error well and do not overestimate heritability. E) Empirical power
analysis for ARLMM for N = 200, 400, ..., 1000 with 3 repeated measurements per subject. ARLMM reaches
slightly higher power (0.83) than L-GATOR (0.82), which was simulated on the same number of subjects (N = 1000) but with more
repeated measurements per subject (t = 5). There is no significant difference in power between longGWAS, ARLMM and L-GATOR in
small samples (N = 100, 200) (X Wu et al., 2018), but power analysis with longGWAS on large-scale datasets is very computationally
intensive and is not shown in the figure. F) To avoid interactions from simulating multiple random-effect parameters
at the same time, we vary only the autoregressive parameter, which accounts for the temporal or spatial
correlation between repeated measurements, and freeze other parameters such as heritability (0 < h2 < 1) and the variance of
environmental effects. When the true value of the autoregressive parameter was set to 0.5 and 2000 simulations were run (N = 1000),
the ARLMM estimate of the autoregressive parameter ρ is unbiased and accurate when within-subject phenotype correlations are
primarily driven by an autoregressive mechanism.
2.5.2 Temporal Modeling
Detailed demographic information (e.g. mean age across all time points or spatial locations, sex,
and number of subjects in each diagnostic group) and APOE genotypes (e3/e3, e3/e4, and e4/e4) for
phases 1 and 2 of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) datasets, as used for
subcortical volumes and cortical thickness, are shown in Table 2.5.2. T1-weighted brain MRI
data scanned at baseline, 12-month, and 24-month were analyzed (N = 645). Among the four
diagnostic groups, the stable CN (N = 217) and stable MCI (N = 214) groups
are larger than the converter MCI group (N = 117), whose diagnosis is MCI at baseline and AD at the third
time point, while the stable AD group is the smallest (N = 97) (Table 2.5.2). The smaller converter group may lead to less
precise parameter estimates of longitudinal linear mixed models and reduced power to detect meaningful genetic effects
within this group.
Table 2.5.2: Demographic information for subcortical volume analysis in different diagnostic groups applied to the
Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (N = 645).
Diagnostic Group                                                  N     Age          Sex (F / M)   APOE (e3/e3 / e3/e4 / e4/e4)
Stable cognitive normal controls: sCN (CN -> CN -> CN)            217   75.57±5.38   104 / 113     151 / 60 / 6
Stable mild cognitive impairment: sMCI (MCI -> MCI -> MCI)        214   73.07±6.86   79 / 135      108 / 85 / 21
MCI converting to AD: cMCI (MCI -> MCI -> AD or MCI -> AD -> AD)  117   73.89±7.43   45 / 72       40 / 52 / 25
Stable AD: sAD (AD -> AD -> AD)                                   97    75.12±7.77   40 / 57       27 / 47 / 23
The imbalance between stable and converter groups needs careful consideration to ensure
that genetic association findings are valid, reliable, and applicable to all diagnostic groups. We
performed additional association analysis on the total group, which combines all subgroups at
different disease progression stages. We further examined the motivation for introducing
autoregressive parameters into the mixed effect models using only baseline ADNI neuroimaging
and genetics data (Table 2.5.2), by performing covariance and correlation analysis of
hippocampal volumes (Figure 2.5.2.1A, Figure 2.5.2.1B, Figure 2.5.2.1C and Figure
2.5.2.1D). The hippocampus is one of the first brain regions to show volume reduction as ADNI1 and ADNI2
subjects diagnosed with MCI at baseline convert to AD by the third time point, making it a
critical biomarker for understanding genetic effects on brain structural changes during
Alzheimer’s Disease progression.
Figure 2.5.2.1A shows the covariance matrices of averaged left and right hippocampal
volumes at baseline, 12 months and 24 months in the stable CN, stable MCI, converter MCI and stable
AD groups, respectively. Box’s M tests were performed to assess the equality of covariance
matrices between each pair of groups (Figure 2.5.2.1A). There were significant differences in the covariance
of hippocampal volumes between stable CN and stable MCI (chi-squared = 137.52, p =
3.34E-27), stable CN and converter MCI (chi-squared = 19.05, p = 4.08E-03), stable MCI and converter MCI
(chi-squared = 73.15, p = 9.21E-14), and stable MCI and stable AD (chi-squared = 34.45, p =
5.51E-06). The covariance differences between stable MCI and converter MCI, and between stable MCI and
stable AD, suggest the importance of jointly modeling different diagnostic groups rather than
pooling subgroups. Furthermore, Jennrich’s tests, which compare two correlation
matrices to determine whether they are significantly different, did not find significant
correlation differences in hippocampal volumes between any diagnostic groups (Figure 2.5.2.1B).
Likelihood Ratio Tests (LRT) on the correlation matrices of hippocampal volumes in each
subgroup show that the correlation matrices of the four diagnostic groups, stable CN
(chi-squared = 0.53, p = 0.91), stable MCI (chi-squared = 0.11, p = 0.99), converter MCI
(chi-squared = 0.15, p = 0.98), and stable AD (chi-squared = 0.66, p = 0.88), do not
significantly deviate from a lag-1 autoregressive (AR(1)) structure (Figure 2.5.2.1C). Figure
2.5.2.1D shows the estimated autoregressive parameters and the different trajectories for stable CN,
stable MCI, converter MCI and stable AD when the observation period is extended to a
long-term follow-up of up to 60 months, the standard protocol of the Alzheimer’s Disease
Neuroimaging Initiative longitudinal data collection design.
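As a worked illustration of the AR(1) structure tested above, the correlation between two visits depends only on the number of visit intervals separating them. The notation is assumed here for illustration and is not taken from the thesis's model definition: $y_{i,j}$ is the volume of subject $i$ at visit $j$, visits are taken to be equally spaced 12 months apart, and $\rho_g$ is the autoregressive parameter of diagnostic group $g$.

```latex
\operatorname{Corr}\left(y_{i,j},\, y_{i,k}\right) = \rho_g^{\,|j-k|},
\qquad
R(\rho_g) =
\begin{pmatrix}
1 & \rho_g & \rho_g^{2} \\
\rho_g & 1 & \rho_g \\
\rho_g^{2} & \rho_g & 1
\end{pmatrix}
```

Under these assumptions, extending the observation window to 60 months corresponds to a lag of five intervals, so the implied correlation with baseline is $\rho_g^{5}$. For an arbitrary example value of $\rho_g = 0.9$, this gives $0.9^{5} \approx 0.59$.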
Having observed that autoregressive (AR) modeling within subgroups may capture the within-subject
imaging phenotypic correlations with higher precision, our final goal is to
analyze the influence of genetic factors, particularly APOE4, on subcortical brain structures in
aging and the progression of Alzheimer’s disease. A linear regression analysis was conducted for
APOE4 and the subcortical volumes of all 7 brain regions (including the hippocampus) using the
baseline information from ADNI1 and ADNI2, adjusting for age, sex, and intracranial volume (ICV)
to control for head size. The findings are summarized as follows. Combining all diagnostic
groups (N = 645), which ignores the phenotypic variation due to disease progression, shows that
APOE4 is negatively associated with the volumes of the accumbens (p = 2.44E-03), amygdala (p =
5.69E-12) and hippocampus (p = 5.61E-17), but the association of APOE4 with accumbens
volume cannot be detected in any of the subgroups. After Bonferroni correction, APOE4 is
negatively associated only with the volumes of the amygdala (p = 8.18E-03) and hippocampus (p =
5.11E-03) in the stable AD group, but not at any stage before the onset of Alzheimer’s Disease.
To address this inconsistency, in which the association between APOE4 and
subcortical volumes varies between the subgroups and the combined total group, linear mixed
models were applied to baseline and follow-up measurements to improve statistical power
compared with linear regression on baseline information alone. The data-driven approach ARLMM,
which models the autoregressive correlation pattern of subcortical volumes (e.g. hippocampus), is
compared with non-autoregressive linear mixed models such as longGWAS (NA Furlotte et
al., 2012) and MMHE (T Ge et al., 2017).
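For reference, Bonferroni correction multiplies each p value by the number of tests performed and caps the result at 1. The sketch below is generic; the number of tests used in the thesis is not stated in this passage, so the example call uses arbitrary p values rather than the reported results.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values: each p is multiplied by the
    number of tests and capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p values for three tests (not taken from the thesis).
adjusted = bonferroni_adjust([0.001, 0.02, 0.5])
significant = [p < 0.05 for p in adjusted]
```

An association is then called significant when its adjusted p value falls below the chosen level, which is equivalent to comparing the raw p value with the level divided by the number of tests.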
Figure 2.5.2.1: Analysis of longitudinal subcortical volumes in four diagnostic groups, stable CN, stable MCI, converter
MCI and stable AD. A) The empirical covariance of hippocampal volumes shows similar values between baseline and 12
months, baseline and 24 months, and 12 months and 24 months, especially in the stable CN and stable MCI groups, but much
larger covariances in the converter MCI and stable AD groups. B) Correlation matrices of the stable CN, stable MCI, converter MCI and
stable AD groups all follow the pattern that the longer the interval between two time points, the lower the correlation
between the volumes measured at those time points, which may satisfy autoregressive modeling assumptions. C) Histograms of
Box’s M test chi-squared statistics for pairwise group comparisons, testing whether two covariance matrices of hippocampal volumes are
significantly different. D) Estimated autoregressive parameters when modeling the correlation matrix in each diagnostic group
with AR(1). We extend the prediction of phenotypic correlation to 60 months, the period over which ADNI allows clinical, genetic, and imaging
data collection. E) APOE4 z-scores mapped to the 7 human brain subcortical regions, accumbens, amygdala, caudate,
hippocampus, pallidum, putamen, thalamus (upper), and -log10-transformed p values of the APOE4 association with subcortical
volumes mapped to the corresponding regions. Linear regression on the stable CN, stable MCI, converter MCI, and stable AD groups
combined finds that APOE4 is negatively associated with the volumes of the accumbens (p = 2.44E-03), amygdala (p = 5.69E-12) and
hippocampus (p = 5.61E-17), while the association pattern is not consistent across the diagnostic groups, in which subjects’
cognitive function declines from Cognitive Normal (CN) through Mild Cognitive Impairment (MCI) to Alzheimer’s Disease
(AD). After Bonferroni correction, APOE4 is negatively associated only with the volumes of the amygdala (p = 8.18E-03) and
hippocampus (p = 5.11E-03) in the stable AD group.
We tested the genetic association of APOE4 with all 7 averaged left and right
subcortical volumes (hippocampus, amygdala, thalamus, pallidum, putamen, nucleus
accumbens, caudate) at baseline, 12-month and 24-month for the 4 diagnostic groups and the total
group, then analyzed the association results for mixed model evaluation and comparison in a
hierarchical way. First, at different stages of Alzheimer’s Disease progression, we examined
whether the longitudinal volumes of any subcortical region are heritable (Figure
2.5.2.2A). Second, for subcortical regions identified as heritable, we evaluated when the strongest genetic risk
factor for Alzheimer’s Disease starts to drive brain abnormalities (Figure 2.5.2.2B and Figure
2.5.2.2C). longGWAS, MMHE and ARLMM estimated heritability with less variability for the total
group than for the subgroups: longGWAS reported missing
heritability for most subcortical volumes (< 0.01), MMHE reported high heritability
estimates for all subcortical volumes (0.79 ~ 0.86), and ARLMM heritability
estimates ranged from 0.12 to 0.48. In particular, longGWAS does not find any subcortical
region heritable when applied to the total group (h2 < 0.01), while amygdala heritability in
the stable CN subgroup is the highest (h2 = 0.72) and falls within the normal range of amygdala
heritability, 0.34 to 0.83 (Figure 2.5.2.2A) (YN Ou et al., 2023). The missing heritability
revealed in the total group was also present in the subgroups, with exceptions for the
pallidum (h2 = 0.22) and putamen (h2 = 0.23) in the stable MCI group, as well as the
accumbens (h2 = 0.14) and caudate (h2 = 0.40) in the converter MCI group. For all other
subcortical volumes analyzed in the four diagnostic groups using longGWAS, heritability estimates
were below 0.01. MMHE, a computationally more efficient linear mixed model method (Figure
2.5.1.2A), reported subcortical heritability estimates between 0.79 and 0.86 in all of the stable CN,
stable MCI, converter MCI and stable AD groups, significantly higher than all of the
longGWAS heritability estimates for subcortical volumes and near the upper boundary of the normal
range for subcortical volumes (DP Hibar et al., 2015). Using ARLMM, pallidum heritability in
the converter MCI group is the highest (h2 = 0.48), significantly lower than the MMHE heritability
estimates for all subcortical regions. As subjects progress from Cognitive Normal (CN) through
Mild Cognitive Impairment (MCI) to Alzheimer’s Disease (AD), ARLMM
heritability estimates show consistently decreasing patterns for amygdala, hippocampus, caudate,
pallidum, and thalamus volumes, but an increasing pattern for accumbens volume. The genetic
architecture of the accumbens may be more static or resilient, leading to a consistent or even
increasing heritability estimate over time, while other subcortical regions show a reduction in
heritability as AD progresses and environmental factors exert more influence.
Figure 2.5.2.2B and Figure 2.5.2.2C show the APOE4 z-scores and negative
log10-transformed p values of longGWAS, MMHE and ARLMM. After Bonferroni correction,
longGWAS reported that APOE4 is negatively associated only with hippocampus volume in the stable
MCI (p = 2.08E-03) and stable AD (p = 5.87E-04) groups, and with amygdala volume in the
stable AD group (p = 1.60E-03). In comparison, MMHE detected APOE4 associations with all
subcortical structures in the converter MCI and stable AD groups, but none with any of the
subcortical volumes in the stable CN or stable MCI groups. ARLMM found that
APOE4 is negatively associated with hippocampus volume in the converter MCI (p = 2.88E-04)
and stable AD (p = 4.83E-05) groups, and with amygdala volume in the converter MCI (p =
9.17E-06) and stable AD (p = 8.23E-04) groups.
Figure 2.5.2.2. Genetic association results for three longitudinal linear mixed model packages (longGWAS, MMHE,
ARLMM) across the 4 diagnostic groups separately and the total groups for temporal modeling. A) Heritability values of
subcortical volumes in 7 ROIs. Longitudinal heritability is defined as the proportion of genetic variance over the total phenotypic
variance. B) APOE4 z-scores of subcortical volumes in 7 ROIs. C) APOE4 -log10 p values of subcortical volumes in 7 ROIs.
We compared regression predictive metrics, namely Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R squared (R2), after applying ARLMM, longGWAS
and MMHE to all subcortical volumes in the four diagnostic groups and in all groups combined
(Figure 2.5.2.3). ARLMM demonstrated lower RMSE values than longGWAS for the caudate,
hippocampus, putamen, and thalamus volumes. However, ARLMM exhibited higher
RMSE values for the accumbens, amygdala (except in the converter MCI group), and pallidum volumes
across all four diagnostic groups and the combined groups. The relative prediction accuracy of
ARLMM and longGWAS, as evaluated by RMSE, seems to be region-specific and may not depend on the
specific diagnostic group. When compared to MMHE, ARLMM showed generally lower RMSE
values across diagnostic groups, except for the amygdala in the stable CN group and the
hippocampus in the stable AD group, indicating broader predictive reliability (Figure 2.5.2.3A).
ARLMM and longGWAS demonstrated similar patterns using MAE (Figure 2.5.2.3B) across
diagnostic groups, consistent with the patterns observed using RMSE (Figure 2.5.2.3A).
ARLMM generally exhibited lower MAE values than MMHE across groups, except for specific
subcortical regions under certain disease conditions: the accumbens in stable MCI, the amygdala
in stable CN, the caudate and hippocampus in stable AD, the pallidum in stable MCI, and the
putamen in both stable MCI and stable AD groups (Figure 2.5.2.3B). RMSE and MAE alone do
not provide sufficient evidence to determine whether ARLMM is a better mixed model
framework than longGWAS for predicting longitudinal brain imaging phenotypes using APOE4.
However, in 61% of the test data that include all three time points of the subcortical volumes
ordered by time, ARLMM shows a smaller RMSE than MMHE, and this proportion increases to 80% in
terms of MAE.
We further evaluated the third predictive metric, R-squared (R2), as shown in Figure
2.5.2.3C. Of the three packages, only longGWAS generated a negative R2, for putamen in the
stable AD group (R2 = -0.20). This may suggest that longGWAS does not correctly predict the
changes in putamen volume across all three timepoints. Notably, the ARLMM R2 for accumbens is
much higher than that of longGWAS in the stable CN, converter MCI and stable AD groups, and only
marginally smaller in the stable MCI group, suggesting ARLMM is the best of all three packages when
predicting accumbens volumes. ARLMM shows higher R2 than longGWAS in the converter MCI
and stable AD groups for amygdala; the stable CN and stable AD groups for caudate; the stable CN,
converter MCI and stable AD groups for hippocampus; and the stable CN and stable AD groups for
pallidum and putamen volumes. The standard errors of the ARLMM R2 are smaller than those of
longGWAS for all subcortical volumes in all diagnostic groups, except for accumbens in the
converter MCI group, where they are only marginally larger.
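The three metrics compared above can be illustrated with a minimal sketch. The function names and the toy volumes below are assumptions made for illustration only; they are not taken from the ARLMM, longGWAS or MMHE packages. The example also shows how R2 becomes negative when predictions follow the wrong trend, that is, when the model fits worse than simply predicting the mean of the observed values.

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large errors more heavily.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the prediction errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    # 1 - SS_res / SS_tot; negative when the model is worse than the mean predictor.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical standardized volumes at three timepoints (not thesis data).
observed = [1.0, 2.0, 3.0]
good_fit = [1.1, 1.9, 3.2]      # follows the observed trend
wrong_trend = [3.0, 2.0, 1.0]   # predicts the opposite trend

print(rmse(observed, good_fit), mae(observed, good_fit), r_squared(observed, good_fit))
print(r_squared(observed, wrong_trend))  # negative R2
```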
Figure 2.5.2.3: Temporal model evaluation and comparison using predictive metrics RMSE, MAE and R2 for longGWAS,
MMHE and ARLMM. Longitudinal cross validation was applied: 80% of subjects, with all of their corresponding
subcortical volume measurements in increasing time order, were chosen for training, and the remaining 20% were used for testing.
Predictive ability was evaluated by applying the trained models from longGWAS (Figure A), MMHE (Figure B) and ARLMM
(Figure C) to the test data for the four diagnostic sub-groups and the total groups to get the predictive statistics R2.
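A minimal sketch of the subject-level split described in the caption is given below. It assumes each record carries a subject identifier and a visit month; the field names, the random seed and the helper name `split_by_subject` are hypothetical. The key property is that all visits of one subject land on the same side of the split, so the test subjects are never seen during training.

```python
import random

def split_by_subject(records, train_fraction=0.8, seed=0):
    """Split longitudinal records so that each subject is entirely in train or test."""
    subject_ids = sorted({r["subject_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(subject_ids)
    n_train = int(round(train_fraction * len(subject_ids)))
    train_ids = set(subject_ids[:n_train])

    def by_time(record):
        # Order measurements within each subject by increasing visit time.
        return (record["subject_id"], record["month"])

    train = sorted((r for r in records if r["subject_id"] in train_ids), key=by_time)
    test = sorted((r for r in records if r["subject_id"] not in train_ids), key=by_time)
    return train, test

# Hypothetical data: 10 subjects, each scanned at baseline, 12 and 24 months.
records = [
    {"subject_id": f"S{i:02d}", "month": m, "volume": 1.0}
    for i in range(10)
    for m in (24, 0, 12)
]
train, test = split_by_subject(records)
print(len(train), len(test))
```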
2.5.3 Spatial Modeling
DTI imaging data from 4,000 healthy UK Biobank participants of European ancestry
were analyzed. Heritability is defined as the variance due to genetic relatedness divided by the total
phenotypic variance. For the negative control, none of longGWAS, MMHE and ARLMM showed
an association between APOE4 and the FA measures of the three sections of the corpus callosum
(genu/body/splenium). In the VCAN analysis, ARLMM reported a heritability of 0.55 for the three
sections of the corpus callosum, similar to the UK Biobank DTI GWAS results (B Zhao et al., 2021), where the
heritability of the FA measures for all three sections of the corpus callosum is between 0.5 and 0.6. Only our
method ARLMM detects a significant association between the VCAN gene and all three sections of the
corpus callosum FA measures (Table 2.5.3.1), in line with DTI GWAS results for around 8000
participants in UK Biobank (LC Rutten-Jacobs et al., 2018).
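The heritability definition used above can be written compactly as follows. This is a sketch under the assumption of a simple two-component variance decomposition; the symbols are introduced here for illustration, and the full ARLMM model may include further variance components.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_P^2}, \qquad \sigma_P^2 = \sigma_g^2 + \sigma_e^2
```

Here \sigma_g^2 denotes the variance due to genetic relatedness, \sigma_e^2 the residual variance, and \sigma_P^2 the total phenotypic variance.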
Comparing all three packages in spatial modeling, longGWAS and MMHE output
negative R squared values, which may indicate that these packages do not correctly predict the
spatial properties of corpus callosum FA measures across the three sections (Table 2.5.3.2). R squared on
training and test data using ARLMM was 0.01, slightly smaller than, but comparable to, that of linear
regression using only baseline with averaged FA features.
Table 2.5.3.1: Estimated SNP parameters (beta value and corresponding P value) of the VCAN gene association with FA
measures using the longGWAS, MMHE and ARLMM packages for spatial modeling. 1) longGWAS, MMHE and ARLMM did
not find an association between APOE4 and FA measures (p > 0.05), while all three find associations with the VCAN gene. 2)
Heritability is defined as the variance due to genetic relatedness divided by the total phenotypic variance. Heritability reported by
ARLMM is 0.55 for each section. The ARLMM heritability estimate is also in line with published UK Biobank DTI GWAS results (B
Zhao et al., 2021), where BCC/GCC/SCC heritability is between 0.5 and 0.6.
Method      Beta Value    P value
longGWAS    -3.50E-03     3.60E-03
MMHE        -3.65E-03     2.43E-06
ARLMM       -3.64E-03     1.91E-56
Table 2.5.3.2: Spatial modeling predictive metrics (RMSE/MAE/R2) for longGWAS, MMHE and ARLMM. longGWAS
and MMHE output negative R2 values, which indicates that these packages do not correctly predict the trend of the BCC, GCC and SCC data.
R2 on training and test data using ARLMM was 0.01, positive and comparable with linear regression results.
Method      RMSE (Train)   RMSE (Test)   MAE (Train)   MAE (Test)   R2 (Train)   R2 (Test)
longGWAS    0.05           0.05          0.04          0.04         -0.01        -0.02
MMHE        0.05           0.03          0.04          0.02         0.01         -0.38
ARLMM       0.05           0.05          0.04          0.04         0.01         0.01
2.6 Discussion
We propose a fast, powerful and accurate Autoregressive Linear Mixed Model (ARLMM)
for genetic association tests that jointly models different diagnostic groups
to analyze genetic effects on brain structure over time instead of at fixed time points.
ARLMM uses joint modeling to account for phenotypic covariance variations more accurately:
autoregressive parameters in the random effect are estimated from the stable groups (Figure
2.5.1.3F) and used to model the autoregressive parameter in the converter groups, which have
fewer data points (Um TT et al., 2014; H Malik et al., 2020). Compared with longGWAS and MMHE,
publicly available longitudinal mixed model packages from the past twenty years, ARLMM achieved
the highest predictive accuracy as measured by RMSE, MAE and R2
(Figure 2.5.2.3).
Intensive simulations have shown that when the sample size was relatively small (N <
400), all three longitudinal methods (longGWAS, MMHE, ARLMM) were underpowered, as
expected (power < 0.75), and performed very similarly. When the sample size increased to 1000, we
compared only ARLMM with MMHE, because longGWAS took too long to run multiple rounds
of simulations; ARLMM showed consistently higher power than MMHE, and both powers were
above 0.8 (Figure 2.5.1.3E).
When we tested genetic effects on subcortical volumes during AD progression, by testing
the associations between the strongest genetic risk factor, APOE4, and subcortical volumes
from all 7 brain regions using ADNI subjects scanned at baseline, 12 months and 24 months, the
extremely low longGWAS heritability estimates (<0.01) and the high MMHE heritability estimates
(0.79 ~ 0.86) were somewhat concerning, even after we carefully accounted for genetic
correlation, population structure and site effects. ARLMM provides subcortical volume
heritability estimates between 0.12 and 0.48, slightly smaller than those from twin studies but
comparable (DP Hibar et al., 2015). Additionally, all three packages (longGWAS, MMHE and
ARLMM) detect the APOE4 association with hippocampus and amygdala volumes in the stable
AD group, in line with previous studies suggesting that APOE4 may influence human brain
subcortical structure in Alzheimer's Disease (AD) patients (ET Reas et al., 2024). In the stable
cognitively normal (CN) group, which served as the negative control, no APOE4 effect was found for
any of the subcortical volumes using any of the three packages (Figure 2.5.2.2B and Figure 2.5.2.2C). The
performance of the longitudinal mixed models differs markedly between subjects who were diagnosed
with Mild Cognitive Impairment (MCI) at baseline and did not convert to AD within 24 months
and those who converted. In the stable MCI group, only the Restricted Maximum Likelihood
(REML) based method longGWAS found an association of APOE4 with hippocampal volume.
In the converter group, both the moment-matching-based method MMHE and ARLMM detected APOE4
associations with hippocampus and amygdala volumes, as in the stable AD group, but MMHE also found
APOE4 to be associated with the volumes of all subcortical structures (Figure 2.5.2.2B and
Figure 2.5.2.2C). These findings suggest that ARLMM offers an approach that balances sensitivity and
specificity compared with MMHE and longGWAS, capturing genetic effects
in a way that is neither too conservative nor too inclusive of weak associations, and that it is
well-suited for tracking AD progression at the late stages.
Linear mixed models for temporally correlated traits can be further applied to spatially
correlated brain imaging traits. For spatial modeling, as the negative control, none of the three
packages (longGWAS, MMHE, ARLMM) detected an association between APOE4 and corpus
callosum FA measures. All three detect VCAN gene associations with the three sections of the corpus
callosum FA measures (Table 2.5.3.1). Our method ARLMM shows much higher predictive
accuracy with R2 evaluation metrics compared with longGWAS and MMHE, and is the only one
to find significant association(s) in line with GWAS results of UK Biobank studies.
In summary, our proposed method ARLMM may provide an accurate and powerful approach that
balances sensitivity and specificity for genetic association analysis of repeated
measurements across many different settings, including temporal or spatial correlation analysis,
case-control studies, and analysis of general populations. ARLMM can be extended to the
Generalized Linear Mixed Model (GLMM) with autoregressive modeling by applying a link
function to the imaging phenotypes. Further directions may include extending the application of
our method to genome-wide association studies (GWAS) and lifting the restriction to even time
intervals or a fixed number of timepoints by adjusting the autoregressive covariance structure
accordingly.
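One way the autoregressive covariance structure could be adjusted for uneven time intervals is sketched below. This is an assumption made for illustration, not the ARLMM implementation: it uses a continuous-time first-order autoregressive correlation, in which the correlation between two visits decays as rho raised to the elapsed time measured in a chosen unit (here 12 months). The function name `ar1_correlation` and the parameter values are hypothetical.

```python
def ar1_correlation(times, rho, unit=12.0):
    """Correlation matrix with entries rho ** (|t_i - t_j| / unit)."""
    return [[rho ** (abs(ti - tj) / unit) for tj in times] for ti in times]

# Evenly spaced visits (baseline, 12 and 24 months) give the usual AR(1) pattern.
even = ar1_correlation([0, 12, 24], rho=0.5)

# Unevenly spaced visits (baseline, 6 and 24 months) use the same formula,
# so closer visits are more strongly correlated than distant ones.
uneven = ar1_correlation([0, 6, 24], rho=0.5)

for row in even:
    print(row)
for row in uneven:
    print(row)
```

With even spacing this reduces to the familiar pattern 1, rho, rho squared along each row, so the sketch is consistent with a fixed-interval AR(1) structure as a special case.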
Chapter 3 Automatic Hypothesis Testing Framework
3.1 Introduction
Sample sizes for studies involving human subjects are often limited by the costs of
collecting data. Unfortunately, in these studies, variables of interest that have small to moderate
effect sizes have been particularly susceptible to spurious findings that do not replicate. Recent
efforts have recognized these concerns in neuroimaging-heavy fields of psychological and
neurological sciences (KS Button et al., 2013; W Boekel et al., 2015; A Bowring et al., 2019; RA
Poldrack et al., 2019; SM Hodge et al., 2021), and even more specifically, in
neuroimaging-genetics (SM Smith et al., 2018; SE Medland et al., 2014). A driving factor in this
reproducibility and replication crisis has been a lack of sufficiently well-powered studies, so
larger scale efforts are of growing interest.
Merging data from multiple sources has paved a way for larger sample sizes and
reproducible findings. Individual multi-site studies such as the disorder-specific Alzheimer’s
Disease Neuroimaging Initiative (ADNI) (MW Weiner et al., 2010; EN Manning et al., 2017; CR
Jack Jr et al., 2024) and the population-based UK Biobank (KL Miller et al., 2010; KL Miller et
al., 2016; C Bycroft et al., 2018; N Veronese et al., 2021) have aimed to collect both
neuroimaging and genetic data across individuals from multiple locations for larger and more
efficient data collection efforts. Multi-study consortia such as the Enhancing NeuroImaging
Genetics through Meta Analysis (ENIGMA) consortium (PM Thompson et al., 2020) have also
been established to coordinate analyses and pool data across tens of thousands of brain imaging
datasets from hundreds of independent studies around the world. ENIGMA has over 35 active
working groups with targeted clinical, biological or methodological interests. These working
groups, including one dedicated to neuroimaging genetics, pool data from independent existing
studies from around the world. Consortia that are built on existing and available data resources
not only ensure large sample sizes for well-powered analyses, but also include diverse samples
and heterogeneous study designs to allow for robust and generalizable findings. Multi-study
efforts in ENIGMA ensure adequate sample sizes for genome-wide association studies (GWAS)
on brain imaging derived traits (SE Medland et al., 2022) and have led to the identification of
numerous genetic variants that influence brain structure through some of the largest studies to
date.
As ENIGMA studies are being conducted, new neuroimaging genetics initiatives and new
studies with relevant data continue to be funded and collected. These efforts may eventually be
incorporated into other multi-site and multi-study initiatives, further empowering larger and
more representative studies. For example, in 2012 a multi-study GWAS of hippocampal volume
was published by the ENIGMA consortium (JL Stein et al., 2012). Interest soon grew in the
consortium, and a follow up study that again included a GWAS of the hippocampal volume was
conducted with nearly twice the sample size (DP Hibar et al., 2015). The same inquiry was then
posed jointly by ENIGMA and another multi-study consortium, CHARGE (Cohorts for Heart
and Aging Research in Genomic Epidemiology), more than doubling the sample size once again
(CL Satizabal et al., 2019). These studies have iteratively shown the progression of insights into
the genetic architecture of regional brain volumes, including that of the increasing confidence in
the effect of a genetic locus on chromosome 12 on the volume of the hippocampus (Figure
3.1.1). This was the only locus to show a genome-wide significant effect in the first analysis, and
confidence of this finding only grew in subsequent analyses. The evolution of scientific
knowledge is captured by repeatedly making the same inquiry, yet with more or different data.
Artificial intelligence (AI) systems that can automatically generate these updates as more data
become available, with minimal human intervention, can greatly facilitate research efficiency
and accelerate scientific advances.
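The effect of repeating the same inquiry with more data can be sketched with a fixed-effect, inverse-variance-weighted meta-analysis. The cohort effect sizes and standard errors below are hypothetical, and the function name is an assumption; the sketch is not the pipeline used in the cited consortium studies. It only illustrates how pooling additional cohorts with consistent effects shrinks the pooled standard error and the p-value.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted pooled effect, its standard error and two-sided p-value."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return pooled_beta, pooled_se, p_value

# Hypothetical per-cohort estimates: the second analysis adds two cohorts.
first_betas, first_ses = [0.10, 0.12], [0.05, 0.06]
second_betas, second_ses = [0.10, 0.12, 0.11, 0.09], [0.05, 0.06, 0.04, 0.05]

first = fixed_effect_meta(first_betas, first_ses)
second = fixed_effect_meta(second_betas, second_ses)
print(first)
print(second)
```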
Figure 3.1.1: The same analysis can be conducted in different contexts as more and more data become available for
collaborative efforts. In the above graphic, we show the evolution of results related to the genetic associations with MRI-derived
hippocampal volume. The most significant genetic association with hippocampal volume in the first ENIGMA consortium study
(ENIGMA1) (JL Stein et al., 2012), which performed a genome-wide association study (GWAS) meta-analysis of the
hippocampus and the brain’s intracranial volume with data for almost 8,000 individuals across 17 cohorts, was in a locus on
chromosome 12. In the second ENIGMA GWAS (ENIGMA2) (DP Hibar et al., 2015), which evaluated hippocampal volume
along with 6 other structures in a pooled sample of over 13,000 individuals from 28 cohorts, the same locus (middle panel)
emerged as significant with greater confidence (smaller p-value). When the results of ENIGMA2 were then meta-analyzed with
those from the CHARGE consortium in an extended analysis of 9 subcortical structures using a total sample of over 26,000
individuals from 46 discovery cohorts (ENIGMA+CHARGE) (CL Satizabal et al., 2019), again the same genetic locus showed
significant association with the hippocampal volume with greater confidence. The thickness of the red circle indicates the
strength of the association, highlighting the greater significance as the dataset is expanded (bottom to top). Specifically, blue
points reflect the ENIGMA1 study from 2012 (JL Stein et al., 2012); yellow points correspond to the ENIGMA2 study from
2015 (DP Hibar et al., 2015), and burgundy points represent the ENIGMA+CHARGE joint analysis from 2017 (DP Hibar et
al., 2017) (thickest red circle). We show the extent to which the significance changed from study to study. The same locus was
then separately shown to be significant in independent data from over 8,000 individuals in the UK Biobank dataset (LT Elliot et
al., 2018), which had not been used in any of the initial three ENIGMA or ENIGMA-CHARGE publications.
These AI systems would conduct continuous monitoring to detect new data and
re-execute analyses to update findings. Intelligent automation can further be used to interrogate
the data as more and more of the population becomes represented in the available data. For
example, if the support for an association becomes stronger or weaker once more data is added
and the sample becomes more diverse, an intelligent system may be able to identify aspects of
the populations that were driving the changes in the association strength. We have previously
found that the mean age of a study’s participants may drive key associations between
neuroimaging and genetic markers, including the association between hippocampal volume and
one of the strongest genetic risk factors for Alzheimer’s disease and related dementias (RM
Brouwer et al., 2022; D Garijo et al., 2019); here associations were only identified in cohorts, or
datasets, of studies where the average age was over 60 years.
This work presents a two-fold AI approach to: 1) perform automated inquiry-driven
analyses, and 2) continuously update these analyses. We have designed NeuroDISK, an AI
system built on the DISK framework (Y Gil et al., 2016; D Garijo et al., 2017), which currently
focuses on a pilot set of neuroimaging genetics tasks given a structured data ontology and
specific analytical workflows.
Frameworks that aim to automate portions of scientific research have been described.
Early concepts such as the Exploratory Hypothesis Testing System (G Liu et al., 2011) described
a data-driven approach to generate potential hypothesis-tests based on available data and
combinations of groups that can be compared against each other. This framework allows
potential relationships to be discovered that were not originally considered by the researchers.
More recently, tools have been developed for continuous integration (CI) to use updated data
sources and analytical pipelines as they become available to enhance reproducibility (M
Krafczyk et al., 2019). However, few CI tools have been implemented in practice. One of these
tools is NeuroCI (J Sanz-Robinson et al., 2022), which, similar to our work here, focuses on
neuroimaging analytics and data. NeuroCI is a novel tool which integrates computationally intensive
neuroimage data processing pipelines and allows users to compare the reproducibility and
robustness of these different pipelines on user-defined statistical analyses with visualizations on
a platform. Here, we aim to integrate aspects of both continuous integration and exploratory
hypothesis testing into a fully automated scientific workflow for multi-dataset analyses.
We demonstrate the value of NeuroDISK by using and building on published data from
the large-scale multi-study GWAS meta-analysis of MRI-derived cortical structure (KL Grasby
et al., 2020). We cataloged the published data and meta-data from all cohort studies that
contributed to that work, and automated the statistical meta-analysis to demonstrate a successful
replication of the available original findings. Our AI Scientist was then able to identify newly
cataloged data that match the requirements for the specific neuroimaging genetic inquiry and
incorporated it into the analysis, ultimately updating the findings of the original paper. We
further demonstrated the capabilities of NeuroDISK for asking additional questions of the data,
beyond what was originally proposed; in this demonstration, we investigated whether
study-specific effects are driven by particular aspects of the study cohort, in particular, mean age.
The contributions of NeuroDISK include:
1. A novel AI approach to scientific problem solving designed to capture the strategies that a
scientist follows to answer a question or test a hypothesis, including finding data, analyzing it,
and extracting findings.
2. A novel concept of lines of inquiry that can automate hypothesis-driven discovery processes
using AI knowledge representation and reasoning techniques that include ontologies, constraint
reasoning, and workflows.
3. An implementation of this approach in NeuroDISK, which extends the general
domain-independent DISK framework with hypothesis ontologies and lines of inquiry for
multi-site neuroimaging genetics.
4. A reproduction of a published paper to demonstrate this framework, with explicit questions
and hypotheses driving the system, and an extension of the published results to demonstrate its
use for continuous updates.
NeuroDISK is designed to mimic how human scientists pursue a scientific question or
hypothesis. We focus on investigations that reuse already available data from previously
conducted studies, rather than designing and carrying out new experiments. In many areas of
science, collaborative projects and data-sharing agreements are commonplace. This is true for the
ENIGMA consortium in the field of neuroimaging genetics, where studies from different groups
have focused on cohorts of individuals with different demographics and diagnoses and include
different types of neuroimaging data.
Scientists pose questions or hypotheses and consider different approaches to answering
them. The first step is typically finding relevant data, by researching published papers or
consulting colleagues. Depending on the available data, different methods are appropriate to
analyze them. For example, MRI data would be analyzed using computational imaging methods
while genomics data would be analyzed using GWAS statistical analysis methods. The results are
then consolidated and presented to the scientists.
Figure 3.1.2: NeuroDISK is designed to automate the processes that scientists follow to answer questions using existing
datasets. A scientific question (1) is mapped to a structured form that is machine-readable (2) and enables NeuroDISK to reason
about it and select a general approach (3) to answer that question. The approach typically involves formulating a query that will
access existing data sources to find relevant datasets (4), setting up and running analyses for the data available (5), consolidating
the results for different datasets (6), and explaining (7) and presenting (8) the findings. When new data becomes available (9),
NeuroDISK revisits the original question and re-runs its analyses so the findings can be updated (10).
Figure 3.1.2 illustrates this general process in the green loop in the center, and shows an
example of how a question could be tested automatically by a system that has knowledge about
common strategies that human scientists pursue. In the center of the figure is a key novel
contribution in this work: defining lines of inquiry that capture the reasoning of scientists about
how to answer a question, including how to specify their question in a machine-readable way,
how to find the appropriate data, how to analyze it, and how to present the findings.
NeuroDISK demonstrates how to automate this inquiry-driven discovery process for
neuroimaging genomics. NeuroDISK uses AI representations and reasoning to test and revise
hypotheses based on automatic analysis of scientific data repositories that grow over time. Two
key features of NeuroDISK are: 1) Inquiry-driven automated analysis: Given an input hypothesis
or scientific question, NeuroDISK can automatically search for relevant data in shared
repositories and apply appropriate methods to test it; 2) Continuous automated updates of
findings: NeuroDISK checks for new data availability, allowing it to reconsider prior analyses
and revise its findings accordingly.
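The second feature, continuous automated updates, can be sketched as a check for previously unseen datasets followed by a re-run of the analysis. This is a minimal illustration under stated assumptions: the catalog structure, the function names `update_findings` and `mean_effect`, and the cohort values are all hypothetical and are not part of the NeuroDISK code base.

```python
def update_findings(catalog, seen_ids, analyze):
    """Re-run `analyze` only when the catalog holds datasets that were not seen before."""
    current_ids = {dataset["id"] for dataset in catalog}
    new_ids = current_ids - seen_ids
    if not new_ids:
        return None, seen_ids          # nothing new: keep the previous findings
    return analyze(catalog), current_ids

def mean_effect(datasets):
    # Placeholder analysis: the average reported effect across datasets.
    return sum(d["effect"] for d in datasets) / len(datasets)

catalog = [{"id": "cohortA", "effect": 0.10}, {"id": "cohortB", "effect": 0.14}]
finding, seen = update_findings(catalog, set(), mean_effect)    # first run
unchanged, seen = update_findings(catalog, seen, mean_effect)   # no new data
catalog.append({"id": "cohortC", "effect": 0.09})
updated, seen = update_findings(catalog, seen, mean_effect)     # re-run with new cohort
print(finding, unchanged, updated)
```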
3.2 Materials and Methods
3.2.1 Scientific method as line of inquiry
NeuroDISK uses a line of inquiry (LOI) to represent the approach that a scientist would
follow to pursue a general type of question in their discipline, including steps to get data from a
shared data source and to analyze it with computational workflows. A LOI has several key
components:
● Hypothesis or question template: a general question containing variables that are
matched against the specific question posed by a scientist.
● Data query template: indicates how a data source should be queried in order to
obtain data that is relevant to the question. The data query template includes the
variables that appear in the question template, as well as additional variables that
characterize the data requirements in detail for the approach being pursued.
● Workflows: specify the multi-step methods to be used to analyze the data
retrieved. The workflows use variables to indicate how to take the data retrieved
as input. Workflows also represent the computational steps of the analysis
method.
● Meta-workflows: specify the method to combine results from multiple workflow
executions in order to derive an answer to the user’s question. Meta-workflows
may generate overall statistics, such as an effect size, confidence interval, p-value,
or a refinement of the original question or hypothesis, as well as visualizations of
results and findings.
The individual LOI components are described in more detail in this section.
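As a rough illustration of how the four components fit together, an LOI can be pictured as a simple record. The class below is a sketch under stated assumptions, not the DISK data model: the field names, the workflow names and the extra query variable are hypothetical. The check reflects the statement above that the data query template includes the variables of the question template plus additional ones.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class LineOfInquiry:
    question_template: str
    question_variables: Set[str]
    data_query_variables: Set[str]
    workflows: List[str]
    meta_workflows: List[str]

    def query_covers_question(self) -> bool:
        # The data query must mention every variable of the question template.
        return self.question_variables <= self.data_query_variables

loi = LineOfInquiry(
    question_template=(
        "Is the effect size of [Genotype] in [Brain Imaging Trait] of [Region] "
        "associated with [Demographic Attribute] for cohorts of [Genetic Ancestry]?"
    ),
    question_variables={
        "Genotype", "Brain Imaging Trait", "Region",
        "Demographic Attribute", "Genetic Ancestry",
    },
    data_query_variables={
        "Genotype", "Brain Imaging Trait", "Region",
        "Demographic Attribute", "Genetic Ancestry",
        "Cohort Results File",          # hypothetical additional query variable
    },
    workflows=["per_cohort_association"],   # hypothetical workflow name
    meta_workflows=["meta_regression"],     # hypothetical meta-workflow name
)
print(loi.query_covers_question())
```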
Figure 3.2.1: NeuroDISK uses Lines of Inquiry (LOIs) to represent the approach that a scientist would follow to answer
different types of questions. A LOI consists of: (1) Documentation about the LOI, which can include literature citations that
introduce the approach; (2) a question template that will be matched against the user’s question; (3) a data query template that
specifies how to retrieve data that is relevant to answering the question; (4) workflows that specify how to analyze the data
retrieved; and (5) meta-workflows that indicate how to combine the results of all the workflow executions.
Figure 3.2.1 illustrates the main components of a LOI for investigating if the effect size
of a particular genotype on a specific brain region is associated with a demographic attribute of
the cohorts for those filtered by another demographic attribute (in this case genetic ancestry),
using a meta-regression on the effect sizes of the individual cohorts retrieved from the data
sources.
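A meta-regression of this kind can be sketched as a weighted least-squares fit of cohort effect sizes on a cohort-level attribute such as mean age, with inverse-variance weights. The sketch assumes a fixed-effect model with known within-cohort variances; the function name and the cohort values are hypothetical and do not come from the NeuroDISK workflows.

```python
import math

def weighted_meta_regression(x, y, se):
    """Weighted least squares of effect sizes y on covariate x with weights 1 / se^2."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    slope_se = math.sqrt(1.0 / sxx)  # valid under the fixed-effect assumption
    return intercept, slope, slope_se

# Hypothetical cohorts: mean age, effect size and standard error of the effect.
mean_age = [25.0, 40.0, 55.0, 70.0]
effect = [0.06, 0.09, 0.12, 0.15]
effect_se = [0.02, 0.03, 0.02, 0.03]

intercept, slope, slope_se = weighted_meta_regression(mean_age, effect, effect_se)
print(intercept, slope, slope_se)
```

A slope that is large relative to its standard error would suggest that the effect size varies with the cohort attribute.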
NeuroDISK captures knowledge in machine-readable representations that use semantic
web standards, in particular the W3C Resource Description Framework (RDF) (M Cyganiak et
al., 2014), the W3C Web Ontology Language (OWL) (https://www.w3.org/TR/owl-ref/), and
the W3C Semantic Protocol and RDF Query Language (SPARQL)
(https://www.w3.org/TR/sparql11-query/).
3.2.2 Specifying inquiries: hypotheses and questions
NeuroDISK is an inquiry-driven system in that it expects users to provide a structured
hypothesis or scientific question that will drive its reasoning and data analysis. The user is
guided through pre-defined question templates and selects one to specify an inquiry in a
structured, machine-readable representation. Once that selection is made, the user’s question is
matched against the question templates of the available LOIs, which triggers the appropriate one
among them.
We use, as our running example, the top finding from the 2020 ENIGMA cortical
structure GWAS paper (KL Grasby et al., 2020), which found an association between the genetic
variant rs108066 and the surface area of the precentral gyrus after meta-analyzing genome-wide
association results from 48 discovery cohorts of individuals of European ancestry, replicating the
findings in the UK Biobank, and generalizing the findings in datasets of non-European
individuals. A novel scientific question presented in NeuroDISK is to use available meta-data
and individual cohort results to determine whether the effect size of rs108066 on the surface area
of the precentral cortex is associated with the mean age of the cohorts. The user would start by
selecting from a set of question templates. In this example, they would select: “Is the effect size
of [Genotype] in [Brain Imaging Trait] of [Region] associated with [Demographic Attribute] for
cohorts of [Genetic Ancestry]?”, with the brackets indicating variables. Then the user would be
offered choices to select the desired variable values which will generate the user's question.
Question templates are expressed as a machine-readable question pattern that consists of
RDF triples of the form {subject predicate object}. For our running example, this would be an
RDF triple in a question pattern:
effectSize isAssociatedWith DemographicAttribute
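The relationship between the natural-language template and the machine-readable pattern can be sketched as follows. The sketch represents triples as plain Python tuples rather than using an RDF library, and the option labels chosen for the variables (for example "Mean Age" and "European Ancestry") are hypothetical values for illustration; only the template text, the variant identifier and the triple come from the running example.

```python
import re

QUESTION_TEMPLATE = (
    "Is the effect size of [Genotype] in [Brain Imaging Trait] of [Region] "
    "associated with [Demographic Attribute] for cohorts of [Genetic Ancestry]?"
)

# One triple of the question pattern, written as (subject, predicate, object).
QUESTION_PATTERN = [("effectSize", "isAssociatedWith", "DemographicAttribute")]

def fill_template(template, bindings):
    """Replace every [Variable] slot with the value selected by the user."""
    return re.sub(r"\[([^\]]+)\]", lambda m: bindings[m.group(1)], template)

def bind_pattern(pattern, bindings):
    """Substitute bound variables inside each triple; other terms stay unchanged."""
    return [tuple(bindings.get(term, term) for term in triple) for triple in pattern]

user_choices = {
    "Genotype": "rs108066",
    "Brain Imaging Trait": "Surface Area",
    "Region": "the Precentral Gyrus",
    "Demographic Attribute": "Mean Age",
    "Genetic Ancestry": "European Ancestry",
}

question = fill_template(QUESTION_TEMPLATE, user_choices)
triples = bind_pattern(QUESTION_PATTERN, {"DemographicAttribute": "MeanAge"})
print(question)
print(triples)
```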
Figure 3.2.2.1: An overview of the Scientific Questions Ontology (SQO) at the top, illustrating the main concepts and
terms. Question templates in NeuroDISK have variables that users will fill out based on the options specified in the Scientific
Domain Ontology (SDO) with additional options that are dynamically retrieved from the data source and expressed in a Data
Source Ontology (DSO), as shown in the lower part of the figure. In NeuroDISK, the SDO is the SDO-ENIGMA ontology and
the DSO is the ODS-ENIGMA ontology.
To specify question patterns, we followed a principled approach by using a Scientific
Questions Ontology (SQO) that organizes question templates and variables
(https://w3id.org/sqo). This allows us to relate different scientific questions, and connect the user
questions to LOIs. The SQO ontology is extended to create a Scientific Domain Ontology (SDO)
with additional terms from a new domain that can be used to specify question variables and
choices. The data source would also use a metadata ontology to describe datasets, which we refer
to as the Data Source Ontology (DSO). Ideally, the DSO extends the SDO, or is well mapped to it,
supporting the specification of the data queries needed to answer the anticipated user queries. In
NeuroDISK, the SDO is the SDO-ENIGMA ontology and the DSO is the ODS-ENIGMA
ontology, which will be described in detail in Section 3.2.3.
Figure 3.2.2.1 illustrates the key concepts in the SQO and how they are used to create
questions in the SDO in NeuroDISK. The SQO includes:
● Question Category: Helps organize all the question templates into broad
types of scientific inquiries such as association, counterfactual, prediction, etc.
● Question: The class of user questions. Each question class includes:
○ A Template which consists of a text in natural language containing slots
for question variables to be specified by the user.
○ A Pattern, a collection of RDF triples that combines the pattern fragments
for all the variables in the question.
○ Constraints, which are logical expressions that represent the valid
combinations of variable values. These are used to ensure that the user
question makes scientific sense.
○ Variables that are filled by the user to express their question. In the
example above, a variable can be the demographic which can be used to
express user questions about age.
● Question Variable: Each question variable is represented by:
○ Variable Name which denotes the name of the variable, e.g., [Demographic].
○ Option: Indicates the values that the question variables can take. The
possible values can be indicated in several ways:
■ Static options: A list of options pre-defined on the SDO. In our
running example, the SDO-ENIGMA ontology has a variable
[BrainImagingTrait] that has options ‘Surface Area’,
‘Thickness’,…
■ Dynamic options: A list of options generated at run time by
querying the data source, which would return terms from DSO.
This query will use the question’s Constraints. For example, to
generate options for a [Region] variable, the DSO for the data
source would return the brain regions that are covered in the
datasets available, such as the cortical regions used in this work:
‘Precentral’, ‘Insula’, etc.
■ User input options: Allows the user to specify new variable values
directly through the user interface as free form input. Users can
only do this if they are very familiar with the domain and the
datasets.
○ Min/max Cardinality: Minimum/maximum number of options that can be
selected by the user for the variable. By default each variable accepts and
requires only one option (cardinality of 1).
○ Pattern Fragment: The semantic expression (RDF triple) about this
particular question variable, and that is part of the question pattern for a
question.
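The structure listed above can be sketched as a small data model. This is only an illustrative sketch in Python, not the SQO itself: the class and field names are hypothetical, dynamic options and Constraints are left out, and the example values are the ones used in the running example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class QuestionVariable:
    name: str                         # e.g. "BrainImagingTrait"
    pattern_fragment: Triple          # RDF triple this variable adds to the pattern
    static_options: List[str] = field(default_factory=list)
    allow_user_input: bool = False    # whether free-form values are permitted
    min_cardinality: int = 1
    max_cardinality: int = 1

    def accepts(self, chosen: List[str]) -> bool:
        """Check cardinality and, for static-only variables, membership."""
        if not (self.min_cardinality <= len(chosen) <= self.max_cardinality):
            return False
        if self.static_options and not self.allow_user_input:
            return all(value in self.static_options for value in chosen)
        return True


@dataclass
class Question:
    category: str                     # e.g. "association"
    template: str                     # natural-language text with [slots]
    variables: List[QuestionVariable]

    def pattern(self) -> List[Triple]:
        """Collect the pattern fragments of all variables."""
        return [v.pattern_fragment for v in self.variables]


# Hypothetical instance built from the running example.
trait = QuestionVariable(
    name="BrainImagingTrait",
    pattern_fragment=("sdo-e:effectSize", "sdo-e:targetCharacteristic",
                      "sdo-e:BrainImagingTrait"),
    static_options=["Surface Area", "Thickness"],
)
question = Question(
    category="association",
    template="What is the effect size of [Genotype] on [Region] [Brain Imaging Trait]?",
    variables=[trait],
)
```

With the default cardinality of 1, `trait.accepts(["Surface Area"])` holds, while an empty selection or a value outside the static options is rejected.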
Question patterns are expressed as a set of triples using terms from the SQO as well as
the SDO. For our running example, the question template has the following question pattern
(using the ‘sdo-e:’ prefix for terms in the SDO-ENIGMA ontology):
sdo-e:effectSize sdo-e:sourceGene sdo-e:Genotype .
sdo-e:effectSize sdo-e:targetCharacteristic sdo-e:BrainImagingTrait .
sdo-e:effectSize sdo-e:targetCharacteristic sdo-e:Region .
sdo-e:effectSize sdo-e:isAssociatedWith sdo-e:DemographicAttribute .
sdo-e:effectSize sdo-e:collectedFrom sdo-e:GeneticAncestry .
When the user specifies their question and chooses variable values, the user question
pattern will be (using the ‘ods-e:’ prefix for the values extracted dynamically from the data
source and that are in the ODS-ENIGMA ontology):
sdo-e:effectSize sdo-e:sourceGene ods-e:rs1080066 .
sdo-e:effectSize sdo-e:targetCharacteristic ods-e:Precentral .
sdo-e:effectSize sdo-e:targetCharacteristic ods-e:Surface Area .
sdo-e:effectSize sdo-e:isAssociatedWith ods-e:Mean_Age .
sdo-e:effectSize sdo-e:collectedFrom ods-e:European .
Each LOI has a LOI question template expressed similarly as the user question templates
above. The LOI question pattern of each LOI will be matched against the user question pattern
through logical unification (F Baader et al, 2011). All LOIs that match will be triggered and their
methods executed. The user’s question variables will set up the LOI to find data and run
computational workflows, as we describe in Section 3.2.4.
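The matching step can be illustrated with a small sketch. It is not DISK's implementation: it performs one-sided pattern matching, a simplified special case of unification in which the user question pattern is fully ground and only the LOI pattern contains variables. The `?name` variable syntax and the shortened three-triple pattern are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]


def match_triple(pattern: Triple, triple: Triple,
                 bindings: Bindings) -> Optional[Bindings]:
    """Match one LOI triple against one ground user triple."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must bind consistently across the whole pattern.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def unify(loi_pattern: List[Triple], user_pattern: List[Triple],
          bindings: Optional[Bindings] = None) -> Optional[Bindings]:
    """Return variable bindings if every LOI triple matches, else None."""
    bindings = dict(bindings or {})
    if not loi_pattern:
        return bindings
    first, rest = loi_pattern[0], loi_pattern[1:]
    for triple in user_pattern:
        trial = match_triple(first, triple, bindings)
        if trial is not None:
            result = unify(rest, user_pattern, trial)  # backtrack on failure
            if result is not None:
                return result
    return None


# Shortened patterns based on the running example.
loi_pattern = [
    ("sdo-e:effectSize", "sdo-e:sourceGene", "?genotype"),
    ("sdo-e:effectSize", "sdo-e:isAssociatedWith", "?attribute"),
    ("sdo-e:effectSize", "sdo-e:collectedFrom", "?ancestry"),
]
user_pattern = [
    ("sdo-e:effectSize", "sdo-e:sourceGene", "ods-e:rs1080066"),
    ("sdo-e:effectSize", "sdo-e:isAssociatedWith", "ods-e:Mean_Age"),
    ("sdo-e:effectSize", "sdo-e:collectedFrom", "ods-e:European"),
]
bindings = unify(loi_pattern, user_pattern)
```

The resulting bindings are the values that a matched LOI would then use to set up its data query.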
The question templates defined in DISK for investigating questions about the brain with
neuroimaging genomics data in ENIGMA are:
● What is the effect size of [Genotype] on [Region] [Brain Imaging Trait]?
● Is the effect size of [Genotype] on [Brain Imaging Trait] of [Region] associated
with [Demographic Attribute]?
We have also implemented these same questions with the addition of a filter, in this case
filtering by specific ancestry labels of the cohorts:
● What is the effect size of [Genotype] on [Brain Imaging Trait] of [Region] for
cohorts of [Genetic Ancestry]?
● Is the effect size of [Genotype] on [Brain Imaging Trait] of [Region] associated
with [Demographic Attribute] for cohorts of [Genetic Ancestry]?
These question templates can be generalized further. For example we used genetic
ancestry as a filtering criterion to replicate the discovery analysis for the original paper, but the
ontology could be easily extended to support other demographic characteristics or cohort level
meta-data, for example filtering by cohorts that use MRIs with a particular magnetic field
strength or a particular genotyping chip.
Figure 3.2.2.2: The NeuroDISK user interface for specifying questions. (1) The user chooses among a set of pre-defined
question templates, which can be specific hypotheses with a posited outcome or simply exploratory questions. These templates
contain variables and their possible instantiations. (2) The user chooses values for each of the question variables through
pull-down menus. (3) The user’s question is turned into a question pattern consisting of RDF triples.
Figure 3.2.2.2 illustrates how users specify their questions in the NeuroDISK user
interface, and the question pattern that results from it. Users do not need to be familiar with the
ontologies or the structure of the data repository in order to pose their questions to NeuroDISK.
Note that the genetic ancestry options are dynamically obtained from the data source and come
from ODS-ENIGMA, populated by using the exact terminology of the original publication.
3.2.3 Organizing datasets through ontologies
Before we show how LOIs are triggered and executed, we describe in more detail how
the ODS-ENIGMA is used to organize datasets in the data sources that NeuroDISK uses for
ENIGMA.
The data available in the data source should be described with semantic metadata that is
rich enough to capture the terms that a scientist would use to express what data they would
consider relevant to a question or hypothesis. Many data sources have semantic annotations,
using metadata vocabularies to describe characteristics of the data. In the case of NeuroDISK,
the data source represents the available information from clinical research studies that collect
data from a cohort of participants according to a specific study design with some
inclusion/exclusion criteria, for example age range or clinical diagnosis. The inclusion criteria
and other characteristics of the participants and the study design may be important for the user’s
question. This description of the datasets is often through semantic metadata, defined as part of
one of several ontologies.
Figure 3.2.3: An overview of the main concepts in the NeuroDISK ODS-ENIGMA ontology. It describes how cohorts and
subsets of cohorts are used for specific projects.
Figure 3.2.3 illustrates the main concepts in the ODS-ENIGMA ontology. It represents
useful entities in the ENIGMA collaboration such as datasets, cohorts, organizations, protocols,
instruments, software, working groups, projects, and persons, together with the relationships
among them. It extends popular vocabularies for describing entities and actions
(https://schema.org/) and the W3C semantic standard PROV
(https://www.w3.org/TR/prov-primer/) for provenance recording. The design rationale for the
ODS-ENIGMA ontology is described in (https://www.isi.edu/publications/trpublic/pdfs/isi-tr-723.pdf).
In the ODS-ENIGMA ontology, a central concept is that of a Working Group that
represents how several organizations in the ENIGMA consortium work together to analyze
datasets on a particular topic of common interest. Each Working Group uses multiple Cohorts
(defined as a group of individuals who participated in a specific clinical study) contributed by its
members. For example, the ADNI project team manages and organizes the ADNI cohort and
provides it to the ENIGMA Genetics Working Group for genetics related projects.
Metadata about cohorts include the Principal Investigator, any Covariates, and Brain
Scan Data Type, as well as relevant aspects of data collection, design type, and other statistical
information such as the mean age of participants.
Each working group organizes its members to collaborate in one or more Projects. A
project will use data and information from multiple cohorts contributed by the collaborators. For
example, the ENIGMA Cortical GWAS Project involves multiple cohorts and their associated
data properties for the GWAS analysis of cortical measures. A cohort can be used in multiple
projects and multiple working groups, but does not need to participate in all projects within any
one group.
A project often uses the subset of a cohort that meets specific inclusion criteria, called a
Cohort Group. A cohort can have multiple cohort groups, each with specific inclusion and/or
exclusion criteria that defines them. For example, a cohort group may be the subset of a cohort
considered “controls” or those without any neurological or psychiatric conditions. This
distinction helps describe the different assessments that may be applied to particular subsets of
the cohort; for example “controls” may not have been asked to fill out questionnaires regarding
medication use or have follow-up data, whereas the subset of individuals with a diagnosis of
interest would have that information. A Cohort Project Group is then the
subset of the cohort group included in a particular project, namely the participants who meet all project inclusion criteria
and pass quality control, so that exact statistics on included participants can be retained (e.g.,
N = 117, mean age = 39.5). This can also be extended for different analyses within a project.
This representation for projects has the ability to capture provenance by maintaining
descriptions of the cohort specifics that were used in a project at a specific point in time. This
system preserves the version of each cohort as it was used under particular
conditions. For example, the cohort ABCD can have a cohort project
ABCD_proj_ENIGMA3_Cortical_GWAS, which contains all baseline cohort information
available at the time of, and relevant to, the ENIGMA3 Cortical GWAS project. As more
information is added to ABCD (such as new participants, covariates, assessments, etc.), this
original Cohort Project remains untouched for its associated analyses. This makes documentation
for past studies readily accessible.
To retrieve datasets described with this vocabulary, we express the queries using the W3C
SPARQL semantic query standard (https://www.w3.org/TR/sparql11-query/). For example, the
following query pattern retrieves all cohorts and their respective cohort projects and types of
brain scans:
?cohort a ods-e:Cohort .
?cohort ods-e:HasCohortProject ?cohortProject .
?cohort ods-e:HasBrainScanDataType ?ScanDataType .
The ODS-ENIGMA ontology is modular, and is composed of several smaller ontologies. Table
3.2.3 gives an overview of these ontologies, which are described in detail in (https://www.isi.edu/publications/trpublic/pdfs/isi-tr-723.pdf).
Table 3.2.3: Overview of the current ODS-ENIGMA ontologies in NeuroDISK.
Name                      Description
Core Ontology             Main concepts of ODS-ENIGMA, elaborated further in other ontologies
Organization Ontology     Organizations (institutions) of investigators that contribute to ENIGMA
Cohort Ontology           Cohorts of study participants selected based on inclusion criteria
Demographic Ontology      Working Group Ontology
Working Group Ontology    Dedicated working groups within the ENIGMA consortium
Project Ontology          Projects undertaken by ENIGMA working groups
Person Ontology           Researchers that participate in ENIGMA working groups and projects
3.2.4 Finding data
Once a LOI is matched with a user’s question, the variables in the question are used to set
up the LOI. The first step is to set up the data query to send to the data source in order to retrieve
relevant datasets. Recall that the LOI has a data query template. To set up the LOI data query,
LOIs have pre-defined LOI variable mappings between the LOI question template and the LOI
data query template through their variables (Figure 3.2.4.1).
The LOI data query template has many variables that reflect how the data source is
organized. In the case of ENIGMA, datasets are contributed by members who participate in
ENIGMA working groups, where each member contributes data collected in their own clinical
studies. Analogous to user questions and LOI questions, LOI data query templates have a data
query pattern as a collection of triples that express the characteristics of the datasets sought.
Figure 3.2.4.1: The LOI variable mappings between the variables in the question template and the variables in the LOI
data query template. The mapping indicates to DISK how to retrieve appropriate datasets from the data source. The data query
template is a SPARQL query, and the actual data query will be completely specified once the user choices for the user question
variables replace the corresponding LOI query template variables according to the variable mappings.
The LOI variable mappings are used to incorporate the variable choices in the user
question pattern into the LOI data query pattern to form the data query issued to the data source.
Figure 3.2.4.2 shows the data query in SPARQL for our running example, highlighting the LOI
variable mappings that appeared in Figure 3.2.4.1.
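As a rough sketch of how variable mappings could turn a query template into a concrete data query, the snippet below substitutes the user's choices for the mapped template variables and leaves all other variables free. The function name, the mapping format, and the `ods-e:HasGeneticAncestry` property are hypothetical and chosen only for illustration; the first two triple patterns follow the query pattern shown in Section 3.2.3.

```python
import re
from typing import Dict

# Hypothetical LOI data query template (the body of a SPARQL WHERE clause).
QUERY_TEMPLATE = """\
?cohort a ods-e:Cohort .
?cohort ods-e:HasCohortProject ?cohortProject .
?cohortProject ods-e:HasGeneticAncestry ?ancestry ."""

# Question variable name -> variable in the data query template.
VARIABLE_MAPPINGS = {"GeneticAncestry": "?ancestry"}


def build_data_query(template: str, mappings: Dict[str, str],
                     user_choices: Dict[str, str]) -> str:
    """Replace each mapped template variable with the user's chosen value."""
    query = template
    for question_var, query_var in mappings.items():
        if question_var not in user_choices:
            continue  # unmapped or unanswered variables stay free
        value = user_choices[question_var]
        # \b keeps "?cohort" from also matching inside "?cohortProject".
        pattern = re.escape(query_var) + r"\b"
        query = re.sub(pattern, lambda _match: value, query)
    return query


data_query = build_data_query(
    QUERY_TEMPLATE, VARIABLE_MAPPINGS, {"GeneticAncestry": "ods-e:European"}
)
```

Variables that are not mapped, such as `?cohort`, remain in the query and become the columns returned by the data source.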
Figure 3.2.4.2: The data query generated from the user’s question. The query highlights the relevant LOI variable mappings.
The data query will be used to retrieve datasets from the data source that are relevant to a user’s question.
Once the data query is executed, the data source returns several datasets. The LOI
indicates which variables in the LOI data query template would be useful for a user to see about
those datasets. Figure 3.2.4.3 shows the results of the data query for the running example.
Figure 3.2.4.3: The cohort data that are retrieved using Line of Inquiries. It shows for each cohort the demographics and
other information that appeared in the user’s question.
3.2.5 Analyzing data
NeuroDISK LOIs can analyze data in two stages: analysis and meta-analysis. Analysis
typically refers to a method applied to a particular study. Meta-analysis typically refers to
consolidating the results from individual analyses. This is done because the data within a study is
often analyzed on site for privacy reasons. We represent them as workflows and meta-workflows
respectively.
Workflows specify the steps and data dependencies needed to carry out a computational
analysis, while meta-workflows capture the steps to aggregate the results from one or multiple
workflows. When the execution of a workflow is completed, the provenance of all workflow
execution results is captured including the input and intermediate datasets as well as the software
components used. Meta-workflows have access to all the outputs generated by the workflows, as
well as to all the datasets retrieved by the LOI data query in case they are needed for the
meta-analysis.
ENIGMA working groups often have projects that conduct meta-analysis, as individual
subject information may not always be sharable. In this case, statistical analysis, such as the
genome-wide association, is carried out locally at the site that has stewardship over the data, and
only summary results are shared for meta-analysis. For our running example based on a project
within the ENIGMA Genetics Working Group, NeuroDISK conducts the meta-analysis to
combine the results of the individual site(cohort)-level analyses, and therefore, the currently
implemented LOIs do not conduct any subject-wise analysis. In other words, the data source for
the current examples includes a partial set of the results of the individual cohort analyses; when
the data query retrieves these results, then the LOI variable mappings pass the information to the
meta-analysis workflow (Figure 3.2.5.1). The results of running NeuroDISK for a user question
are highlighted in Figure 3.2.5.2, explaining the LOI used, the datasets retrieved, and the results
obtained. These explanations are generated from the provenance records that DISK keeps,
including provenance records for the workflow executions that WINGS provides.
Figure 3.2.5.1. The LOI’s meta-analysis is done through a meta-workflow for meta-regression. The LOI variable mappings
are used to take the results of the data query and set the inputs to the meta-workflow.
Figure 3.2.5.2: NeuroDISK captures the provenance of all execution results of the analysis and meta-analysis workflows
and uses it to generate explanations for the user. In this case: (1) the original question or hypothesis is shown along with the
name of the selected LOI; (2) the datasets retrieved corresponding to several cohorts; (3) datasets and datasets characteristics that
were input to the meta-analysis; (4) the results of the meta-analysis include a confidence value as well as visualizations of the
results.
3.2.6 Updating results continuously
NeuroDISK continuously checks for new information that could be used to answer an
active query; new information could be new cohort data included in the data source or new
workflows added to reflect new analysis methods. In such cases, NeuroDISK re-runs the LOI to
update the findings. Therefore, there may be several runs of the same LOI in response to a
standing user query. Figure 3.2.6 illustrates how the user sees this process. In this case, initially
10 datasets are available, and eventually up to 50 cohorts become available in the data source.
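The re-run behaviour can be sketched as a simple check that compares the datasets returned by the data query with those used in the previous run. This is a simplified illustration under stated assumptions: the function name and the run record are hypothetical, only changes in the retrieved data are considered (not new workflows), and the analysis is stubbed out.

```python
from typing import Callable, Dict, List, Optional


def rerun_if_changed(query_fn: Callable[[], List[str]],
                     analysis_fn: Callable[[List[str]], object],
                     history: List[Dict]) -> Optional[Dict]:
    """Re-execute the analysis only when the retrieved datasets differ
    from those used in the most recent run."""
    datasets = frozenset(query_fn())
    if history and history[-1]["datasets"] == datasets:
        return None  # nothing new, keep the previous result
    run = {"datasets": datasets, "result": analysis_fn(sorted(datasets))}
    history.append(run)
    return run


# Toy usage: the "analysis" just counts cohorts.
source = ["cohort_a", "cohort_b"]
history: List[Dict] = []
first = rerun_if_changed(lambda: source, len, history)
unchanged = rerun_if_changed(lambda: source, len, history)
source.append("cohort_c")
second = rerun_if_changed(lambda: source, len, history)
```

Keeping every run in `history` is what lets a user compare how the result changes as cohorts are added.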
Figure 3.2.6: NeuroDISK continuously checks if the results for a user question have changed due to new data becoming
available or due to workflow or meta-workflow updates. (1) The standing user question continuously triggers the matched
LOI; (2) the user hypothesis contains variables; (3) those variables are used to set up a query to retrieve relevant data, (4) the data
is then analyzed with workflows; and (5) the LOI is re-executed every time that additional data is retrieved from the data source,
and the user can see how the results change. On the left is the user’s view of the hypothesis and the results over time. Users can
ask for details on each run, and can request to see the LOI as shown on the right.
3.2.7 The implementation of NeuroDISK
An integrated overview of the components of NeuroDISK is shown in Figure 3.2.7.1. We
distinguish among users who formulate questions and receive the results, advanced users who
define the types of questions and corresponding lines of inquiry, and developers who integrate
new data sources and workflow software into the framework.
Figure 3.2.7.1: A diagram of the architecture components of DISK. Users specify hypotheses or questions that conform to
pre-defined templates. Advanced users have added pre-defined LOIs that are triggered when they match the user’s question. The
LOIs generate queries to retrieve data from a data source, and the data is analyzed through workflows and meta-workflows. A
developer can set up new data sources and new workflows in the DISK back-end.
In NeuroDISK, the data source is implemented in the Organic Data Science (ODS)
framework (Y Gil et al., 2015), an extension of Semantic MediaWiki (D Vrandečić et al., 2023).
This enables users to view metadata of different datasets, and to add new metadata as needed. In
NeuroDISK we set up a separate ODS site per ENIGMA working group, since each group
operates largely independently. The ODS data source we make available as part of this
publication is one which contains published information from the cortical GWAS project of the
Genetics Working Group (KL Grasby et al, 2020).
In NeuroDISK, workflows and meta workflows are executed through WINGS (Y Gil et
al., 2011), an intelligent workflow system that propagates semantic constraints to ensure that the
analysis is valid for the data at hand, and to set up parameter values that are appropriate for the
input datasets. WINGS exports provenance information to DISK, so it is accessible to users.
Figure 3.2.7.2. A diagram of the DISK APIs and adapters that enable interoperability with other data sources and
workflow systems. DISK provides abstract classes to implement both data adapters (for data sources) and method adapters (for
workflow systems). The current implementation of DISK includes method adapters for two workflow systems, WINGS and
AirFlow, and one data adapter for the Semantic MediaWiki platform.
The architecture of DISK is modular, and other data sources and workflow systems can
be integrated. Figure 3.2.7.2 shows the data adapters and APIs defined for DISK. The software
for DISK and NeuroDISK is available open source, with separate code for the backend
(https://github.com/KnowledgeCaptureAndDiscovery/disk-ui) and the UI
(https://github.com/KnowledgeCaptureAndDiscovery/disk-ui). The system can be installed as a
software container or built from source code. The system is documented in detail at
https://disk.readthedocs.io/, with guides for users, developers, and system administrators.
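The adapter idea can be illustrated with abstract base classes. The sketch below is in Python purely for illustration and does not reproduce DISK's actual classes or method signatures; all names are hypothetical, and the two concrete adapters are toy stand-ins for a real data source and workflow system.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DataAdapter(ABC):
    """Interface a data source must implement (hypothetical)."""

    @abstractmethod
    def query(self, sparql: str) -> List[Dict[str, str]]:
        """Run a data query and return one dict per result row."""


class MethodAdapter(ABC):
    """Interface a workflow system must implement (hypothetical)."""

    @abstractmethod
    def run(self, workflow: str, inputs: Dict[str, object]) -> Dict[str, object]:
        """Execute a workflow and return its outputs."""


class InMemoryDataAdapter(DataAdapter):
    """Toy data source that ignores the query and returns fixed rows."""

    def __init__(self, rows: List[Dict[str, str]]):
        self.rows = rows

    def query(self, sparql: str) -> List[Dict[str, str]]:
        return list(self.rows)


class EchoMethodAdapter(MethodAdapter):
    """Toy workflow system that only reports what it was asked to run."""

    def run(self, workflow: str, inputs: Dict[str, object]) -> Dict[str, object]:
        return {"workflow": workflow, "n_inputs": len(inputs)}


data = InMemoryDataAdapter([{"cohort": "ABCD"}])
method = EchoMethodAdapter()
rows = data.query("SELECT ...")
outputs = method.run("meta-regression", {"cohorts": rows})
```

Because the rest of the system would depend only on the two abstract interfaces, a new data source or workflow system can be added by writing one more subclass.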
3.3 Results
We demonstrate the use of NeuroDISK to simulate hypothesis-driven discovery and
continuous result updates using and extending a well-known ENIGMA publication (KL Grasby et
al., 2020). In that study, researchers from institutions around the world followed specific
standardized protocols to: 1) extract neuroimaging-derived traits from the cortical brain surface;
2) impute the genotyping data; 3) quality control all the data; and 4) run a series of linear
association models (or mixed effects models depending on population structure) to identify
genome-wide associations with cortical brain imaging measures. A total of 70 brain imaging traits
were assessed genome-wide (~18,000,000 genetic variants). The results from cohorts of
European ancestry were meta-analyzed together using an inverse-variance based weighting.
Replication analysis was performed by meta-analyzing results from the ENIGMA cohorts with
that of a single large cohort, UK Biobank, with over 10,000 individuals. These final
meta-analyzed results were then used to demonstrate generalizability in other cohorts of
individuals from non-European ancestry.
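For readers unfamiliar with inverse-variance weighting, a minimal sketch follows. It assumes a fixed-effect model and a normal approximation for the p-value, which may differ from the software and settings used in the original publication; the function name and the input numbers are made up for illustration.

```python
import math
from typing import List, Tuple


def inverse_variance_meta(effects: List[float],
                          std_errors: List[float]) -> Tuple[float, float, float, float]:
    """Fixed-effect meta-analysis: weight each cohort by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value


# Illustrative (made-up) per-cohort effect sizes and standard errors.
pooled, pooled_se, z, p_value = inverse_variance_meta(
    effects=[-100.0, -120.0], std_errors=[10.0, 10.0]
)
```

Cohorts with smaller standard errors, typically the larger ones, dominate the pooled estimate, which is why adding a single very large cohort can shift the result.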
While final meta-analyzed results are available for the full genome-wide and image-wide
set of variables on the ENIGMA website, it has been suggested that raw cohort level results can
be used to identify an individual participant (N Homer, et al., 2008; R Cai et al., 2015).
Therefore, here we work only with a subset of results. The subset of summary statistics includes
the effect size, standard error, and p-value calculated for 14 of the significant associations
discussed in the paper between specific single nucleotide polymorphisms (SNPs) and surface area
or thickness for cortical regions of interest. The individual cohort-level results were made
available as part of the Forest plots in the supplementary materials of the original publication.
The set of results per cohort were uploaded into NeuroDISK as a single file, with the results of
each SNP as an individual row, allowing the set of findings made available to easily be extended
in the future. While all 14 associations are available for users to peruse, our running example
uses the most significant association identified in the original publication – the SNP rs1080066
and the surface area of the precentral cortex. We also capture meta-data related to the sample
size, sex distribution, mean age, and ancestry information available from participating cohorts.
We first reproduce the results of the publication (KL Grasby et al., 2020) with a
meta-analysis of all effect-sizes as weighted by the inverse variance of the effects. Next, we
show how cohort-specific meta-data can be integrated to ask a novel question of the data, which
may help identify discrepancies in results across cohorts. Finally, we add data from the
Adolescent Brain Cognitive Development Study (https://nda.nih.gov/abcd) (TL Jernigan et al.,
2018), a cohort that was not available as part of the original 2020 paper, and highlight how
incrementally adding data can change results, altering confidence in the original meta-analysis
and the novel meta-data based analyses. We detail these results below.
3.3.1 Replicating the published ENIGMA GWAS meta-analysis
In (KL Grasby et al, 2020), we and over 300 co-authors performed a standardized
genome-wide association study meta-analysis as described above. Here, we show that when the
individual cohort-level summary results for the SNP of interest are stored in the ODS database,
we can query the association with NeuroDISK and trigger a meta-analysis workflow. The query
specifically searches for association results of SNP rs1080066 on the precentral surface area,
filters for cohorts of European ancestry, and performs a meta-analysis. Our meta-analysis
workflow also generates a Forest plot for visual comparison of effect sizes per cohort. Users are
presented with the individual datasets that the query returns and have the option to remove
individual cohorts. Here we show that a meta-analysis of all discovery ENIGMA cohorts and the
UK Biobank results yields results very similar to that of the published work (Figure 3.3.1). The
slight discrepancy arises because in the publication the ENIGMA cohorts were meta-analyzed
together first and UK Biobank was then meta-analyzed with those results, whereas here
we meta-analyze results from all cohorts (including UK Biobank) together. We show this by
comparing the results for the association of rs1080066 with the precentral surface area from both
the original publication and NeuroDISK’s reassessment, with and without UK Biobank
(Table 3.3.1). The resulting effects are nearly identical without UK Biobank, but
differ slightly in significance with UK Biobank because of how that cohort was
included: the Grasby et al. paper meta-analyzed the combined result of all 48 ENIGMA cohorts
(considered discovery) with that of UK Biobank (considered replication), while here we
meta-analyze all 49 cohorts simultaneously.
Figure 3.3.1: Meta-analysis workflow results validating that the NeuroDISK framework can reproduce the published meta-analysis
results (KL Grasby et al., 2020). Left: Published Forest plot for the effect size of SNP rs1080066 on the precentral surface
area, from the supplements of the published results (KL Grasby et al., 2020). Right: Forest plot from the NeuroDISK meta-analysis
workflow, highlighting successful replication.
Table 3.3.1: Reproduced genetic association results for precentral surface area and rs1080066 for ENIGMA discovery
cohorts with and without UK Biobank.
Precentral surface area and          ENIGMA Discovery Cohorts          ENIGMA Discovery + UK Biobank
rs1080066-A allele                   Published        Replicated       Published         Replicated
Effect size (unstandardized beta)    -110.56          -110.51          -111.78           -116.32
P value                              2.53 x 10^-95    6.70 x 10^-95    3.81 x 10^-137    2.44 x 10^-135
3.3.2 Asking and answering novel questions with meta-data
When working across such diverse cohorts and performing meta-analyses, one may be
interested to know whether there are aspects of particular datasets that are driving the
associations. For example, is the effect of a genetic association a function of the age of a cohort?
We can ask and answer such questions within the NeuroDISK framework. Specific association
results and covariates are retrieved from the ODS, and meta analysis or meta regression
workflows are provided as choices. In the case of rs1080066 and Precentral Surface Area, the
meta regression workflow corresponds to “Is the effect size of rs1080066 on Precentral SA
associated with mean age of the cohort?” We performed a meta regression to determine whether
there was an association between the effect size of the SNP’s association with the cortical
structure and mean age of each cohort. Our meta-regression workflow weights the effects of the
cohort by its sample size, and regresses the effects against the cohort-specific factor, in this case,
mean age. The workflow includes a scatterplot as a visual output, and allows the user the option
to filter cohorts by limits on the factor. For example, in Figure 3.3.2 we show results of the
meta-regression across the full set of cohorts (left) and after filtering to only include cohorts with
mean age under 60 years (right). Although not shown, filtering can also be done as part of the
meta-analysis in section 3.3.1.
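The meta-regression described above can be sketched as a weighted least-squares fit in which each cohort's effect is weighted by its sample size. This is an illustrative sketch only: the function names and the cohort values are made up, and the significance test that the workflow reports for the slope is omitted.

```python
from typing import Dict, List, Tuple


def weighted_meta_regression(factor: List[float], effects: List[float],
                             sample_sizes: List[int]) -> Tuple[float, float]:
    """Weighted least squares of effect on factor, weights = sample size.
    Returns (intercept, slope)."""
    total = float(sum(sample_sizes))
    x_bar = sum(n * x for n, x in zip(sample_sizes, factor)) / total
    y_bar = sum(n * y for n, y in zip(sample_sizes, effects)) / total
    sxx = sum(n * (x - x_bar) ** 2 for n, x in zip(sample_sizes, factor))
    sxy = sum(n * (x - x_bar) * (y - y_bar)
              for n, x, y in zip(sample_sizes, factor, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def filter_by_mean_age(cohorts: List[Dict], low: float, high: float) -> List[Dict]:
    """Keep cohorts whose mean age is at least low and under high."""
    return [c for c in cohorts if low <= c["mean_age"] < high]


# Made-up cohorts: effect sizes here lie exactly on a line in mean age.
cohorts = [
    {"mean_age": 10.0, "effect": -130.0, "n": 500},
    {"mean_age": 40.0, "effect": -100.0, "n": 200},
    {"mean_age": 70.0, "effect": -70.0, "n": 300},
]
intercept, slope = weighted_meta_regression(
    [c["mean_age"] for c in cohorts],
    [c["effect"] for c in cohorts],
    [c["n"] for c in cohorts],
)
younger = filter_by_mean_age(cohorts, 0.0, 60.0)
```

Filtering before fitting, as in the right panel of Figure 3.3.2, simply passes the reduced cohort list to the same regression.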
Figure 3.3.2: Meta regression workflow results show a scatter plot displaying the association between the effect size of an
association of interest, here being the effect of SNP rs1080066 on precentral surface area (y-axis), against the mean age of each
cohort (x-axis). The visualization is based on R Shiny (https://www.rstudio.com/products/shiny/) and is interactive such that users
can click any point to check the cohort information. Clicking the regression line shows the beta value and p value of association.
The size of each point is a reflection of the sample size of the cohort. (Right) Users can also adjust the demographic variable of
interest to selectively display cohorts that fall within a specified range, i.e., a mean age between 0 and 60 years.
3.3.3 Continuous updating of results
The NeuroDISK framework can re-execute queries as more and more data becomes
available. We demonstrate this re-execution through continuous retrieval of data and updating of
results by artificially simulating the availability of new data over time. We subsampled the
available data, starting with 10 cohorts (N = 10) and then adding 10 more cohorts at a time (N =
20, 30, 40) until all 48 cohorts of the original ENIGMA discovery set (not
including UK Biobank) were included (N = 48). We then added UK Biobank, which was analyzed separately in the
original paper (N = 49). Finally, we added a new external cohort (ABCD), which was not in the
original publication, so its associations were calculated separately and uploaded onto ODS (N = 50).
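The simulated release schedule can be sketched as nested random subsets followed by the two later cohorts. The cohort names, the seed, and the function name are placeholders, and the assumption that each release contains the previous one is ours.

```python
import random
from typing import List


def cumulative_releases(discovery: List[str], sizes: List[int],
                        later_cohorts: List[str], seed: int = 0) -> List[List[str]]:
    """Build nested subsets of the discovery cohorts, then append the
    later cohorts one at a time."""
    order = list(discovery)
    random.Random(seed).shuffle(order)        # random but reproducible order
    releases = [order[:k] for k in sizes]     # each release extends the last
    current = list(order)
    for cohort in later_cohorts:
        current = current + [cohort]
        releases.append(current)
    return releases


discovery = [f"cohort_{i:02d}" for i in range(1, 49)]  # 48 placeholder names
releases = cumulative_releases(
    discovery, [10, 20, 30, 40, 48], ["UK Biobank", "ABCD"]
)
release_sizes = [len(r) for r in releases]
```

Each element of `releases` corresponds to one state of the data source against which the standing question is re-run.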
ABCD is the largest longitudinal neuroimaging study for child health in the United
States, which has recruited more than 10,000 children around 9 to 10 years old (BJ Casey et al.,
2018) with deep genotyping using the Affymetrix Axiom Smokescreen Array. We extracted
demographic information such as age, sex, and scanner information (SIEMENS/GE/Philips) for
the baseline (N = 11,362) data. T1 weighted MRI from ABCD 4.0 release data were processed
by the ABCD group (DJ Hagler Jr et al, 2019) and baseline cortical surface area or thickness
measures were extracted using Freesurfer version 7.1.1 available in release 4.0. We calculated
the first four components of MDS using the ENIGMA protocol and followed the ENIGMA
genetic imputation protocol and QC criteria (Minor Allele Frequency < 0.01; Genotype Call Rate
< 95%; Hardy-Weinberg Equilibrium < 1 x 10^-6). We filtered for individuals of European ancestry
as in the ENIGMA Genetics Imputation protocol available online
(https://github.com/ENIGMA-git/Genetics/tree/main/ENIGMA2/Imputation), where a radius of
0.0066 was set around the first, second and third MDS components of the CEU centroid,
resulting in a sample size of N = 5,202. The distance was calculated by covering 95% of the first,
second and third MDS components of the European centroid data. We used GCTA (J Yang et al.,
2011) to estimate the genetic relationship matrix (GRM) from 452,544 SNPs. The GRM was
then used in a linear mixed model regression testing the association of the specific SNPs of
interest with the brain regions of interest using GCTA-MLMA (J Yang et al., 2014), with age, sex, and
scanner manufacturer (Siemens, GE, or Philips) included as fixed covariates along with the first four
genetic components derived from multidimensional scaling, as performed in the original
ENIGMA publication (KL Grasby et al., 2020).
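The ancestry filter described above can be sketched as follows. This is a minimal illustration that assumes a Euclidean distance in the space of the first three MDS components and takes the centroid as the mean of the reference samples; the function name and toy coordinates are hypothetical, and the ENIGMA protocol linked above remains the authoritative description of the procedure.

```python
import numpy as np

def filter_by_centroid_radius(mds, reference_mds, radius=0.0066, n_components=3):
    """Keep samples within `radius` of the reference centroid in MDS space.

    mds:           (n_samples, n_dims) MDS coordinates of the study samples
    reference_mds: (n_ref, n_dims) MDS coordinates of the reference (e.g. CEU) samples
    Returns a boolean mask over the study samples.
    """
    mds = np.asarray(mds, dtype=float)[:, :n_components]
    reference = np.asarray(reference_mds, dtype=float)[:, :n_components]
    centroid = reference.mean(axis=0)
    distance = np.linalg.norm(mds - centroid, axis=1)  # Euclidean distance (assumed)
    return distance <= radius

# Toy coordinates (hypothetical): the fourth MDS component is ignored by the filter.
reference = np.array([[0.010, 0.020, 0.000],
                      [0.012, 0.018, 0.002]])
study = np.array([[0.011, 0.019, 0.001, 0.5],
                  [0.050, 0.019, 0.001, 0.5]])
mask = filter_by_centroid_radius(study, reference)
print(mask)
```

In this toy example the first study sample sits on the reference centroid and is kept, while the second lies well outside the 0.0066 radius and is dropped.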
As ABCD was not used in the original publication, these genetic association tests were
run separately before corresponding genotype-phenotype association results were uploaded onto
the ODS for follow-up continuous meta-analysis and meta-regressions.
Figure 3.3.3 shows the meta-regression results of the consecutive runs with increasing
sample size. On the UI itself (Figure 3.16A), the user can see the final p-value of the association,
while the plots help visualize the individual data points (Figure 3.16B). As the
number of cohorts increases, the effect of mean age on the effect of the SNP on the cortex appears
to reach nominal significance (p = 0.02), suggesting a trend towards a greater genetic effect
in younger cohorts.
Figure 3.3.3: Continuously updated findings. The NeuroDISK user interface shows results across all runs in one location for ease
of comparison (KL Grasby et al., 2020). We display the meta-regression results of consecutive runs of the effect of SNP
rs1080066 on the precentral surface area versus mean age of the cohort. To demonstrate a continuously growing database of
cohort data, the first 10, 20, 30, and 40 cohorts were taken at random from the initial list of 48 cohorts of European ancestry
involved in the original ENIGMA cortical GWAS publication. The 49th cohort was the UK Biobank dataset, which served as a
meta-analysis replication in the original publication, and the 50th cohort was ABCD, a cohort not involved in the
original publication. As the number of cohorts increases, the effect of mean age on the effect of the SNP appears to
reach nominal significance; however, because the result hovers around the nominal p = 0.05 threshold, more data will be needed to
confirm the stability of the effect. The -log10 of the p-value is plotted on the y-axis, so points higher along the y-axis are more
significant.
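The meta-regression of per-cohort SNP effects on mean cohort age can be sketched with weighted least squares. The example below assumes a fixed-effect model with inverse-variance weights and a normal approximation for the p-value; the workflow used in NeuroDISK may differ (for example, by using a random-effects model), and all input values are hypothetical.

```python
import math
import numpy as np

def meta_regression(effects, ses, moderator):
    """Fixed-effect meta-regression of per-cohort effects on one moderator.

    Fits effect_i = b0 + b1 * moderator_i with weights 1 / se_i**2 and
    returns (b0, b1, se_b1, p_b1, neg_log10_p).
    """
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2
    X = np.column_stack([np.ones_like(y), np.asarray(moderator, dtype=float)])
    xtwx = X.T @ (w[:, None] * X)
    coef = np.linalg.solve(xtwx, X.T @ (w * y))
    cov = np.linalg.inv(xtwx)                 # covariance under known variances
    se_b1 = math.sqrt(cov[1, 1])
    z = coef[1] / se_b1
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    neg_log10_p = -math.log10(p) if p > 0 else math.inf
    return coef[0], coef[1], se_b1, p, neg_log10_p

# Hypothetical per-cohort SNP effects, standard errors, and mean ages.
effects = [0.30, 0.27, 0.22, 0.20, 0.15]
ses = [0.05, 0.05, 0.05, 0.05, 0.05]
mean_age = [10, 25, 40, 55, 70]

b0, b1, se_b1, p, neg_log10_p = meta_regression(effects, ses, mean_age)
print(f"slope = {b1:.4f} per year, p = {p:.3g}, -log10(p) = {neg_log10_p:.2f}")
```

A negative slope in this setup corresponds to the pattern discussed in the text, where younger cohorts show a larger genetic effect.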
In summary, our results show that NeuroDISK can perform data analysis continuously
and automatically. Our methods yield results that replicate the original publication and allow
users to delve deeper into the data and ask questions, even of summary results, that were not
answered in the original study. NeuroDISK is ideal for meta-analytical studies, as demonstrated
in this work, but can also be extended to large-scale coordinated analyses using individual
subject-level data points.
3.4 Discussion
We have presented a novel AI approach to scientific problem solving that captures how
scientists pursue inquiry-driven discoveries through explicit knowledge structures called lines of
inquiry. We use AI techniques to represent how scientists pose questions or hypotheses, how
they characterize the data needed to address a given question, and what methods are appropriate
to analyze the data found. We use AI reasoners to do expression matching, variable assignments,
constraint propagation, and other forms of problem solving needed to set up and run lines of
inquiry. Our approach is implemented in NeuroDISK, which extends the general-purpose DISK
framework with lines of inquiry for multi-site imaging genetics, including datasets with
descriptive metadata and workflows for analysis. We demonstrated NeuroDISK by reproducing
the results of a well-known publication following our approach, and showed how the results can
be updated automatically when new data becomes available.
NeuroDISK has demonstrated important capabilities for the automation of inquiry-driven
discovery, specifically:
● Guiding scientists to specify neuroimaging genetics inquiries that can be tested with
available datasets. NeuroDISK can guide scientists to specify an inquiry (question or
hypothesis) by selecting from possible questions that can be answered using available
datasets. This is possible because NeuroDISK uses question patterns that are tied to
ontologies representing the kinds of terms in questions that could be answered with the
data available. The underlying DISK framework provides a general ontology of scientific
questions and hypotheses that provides overarching concepts that guided the design of the
neuroimaging genetics question ontology in NeuroDISK.
● Automatically selecting among possible approaches that represent how to answer
common types of neuroimaging genetics inquiries with existing datasets. NeuroDISK
contains lines of inquiry that express the general approaches that a scientist would pursue
for a given inquiry, including what kinds of datasets to use and how to analyze them.
NeuroDISK lines of inquiry provide the basis for taking a scientific question and
automatically generating a query to a data repository, then setting up and running
analyses for the data found, and finally consolidating the results to answer the question.
● Automatically finding data by generating data queries that describe characteristics of
datasets from existing neuroimaging cohorts that make them useful for different types of
scientific questions. NeuroDISK includes ontologies to describe cohort datasets
collected in neuroimaging studies, and to describe subsets of those cohorts that have been
used in ENIGMA projects. Dataset characteristics include genotypic, phenotypic, and
other demographic information. These representations enable NeuroDISK to
automatically retrieve and reuse the available datasets. This approach is consistent with
the FAIR data principles (MD Wilkinson et al., 2016) by providing descriptive metadata
to find, access, integrate, and reuse datasets. For NeuroDISK, this metadata about project
cohorts is in the data repository, and is machine readable and machine accessible through
APIs.
● Automatically applying methods by elaborating and executing appropriate steps to analyze
the available data to investigate an inquiry posed by a scientist. NeuroDISK reasons
about the representations of scientific questions and the representations of lines of inquiry
in order to automate this process. In that process, NeuroDISK generates data queries to
retrieve appropriate datasets and sets up executable workflows for analyzing that data
with the required inputs. This process may be decomposed into analysis, done with workflows,
and meta-analysis, done with meta-workflows.
● Updating results by continuously monitoring the available data sources and triggering
new runs of the relevant lines of inquiry. NeuroDISK keeps user inquiries as standing inquiries
and reconsiders the lines of inquiry that query data sources for new datasets; when new data is
returned, the analyses are executed again and the results are updated.
● Presenting the findings through explanations and summaries of the analyses conducted.
NeuroDISK generates explanations of how a result was generated based on the
provenance records of the execution of a triggered line of inquiry and its associated
queries and workflows. When new datasets are added, NeuroDISK summarizes the
differences among all the triggered lines of inquiry so that the changes in findings can be
tracked.
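The capabilities listed above (querying for data, running a workflow, re-running when new data appears, and summarizing runs from provenance) can be sketched in a few lines. The classes and names below are hypothetical stand-ins written for illustration; they do not reproduce the DISK or NeuroDISK data model or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Run:
    """Provenance record for one triggered execution of a line of inquiry."""
    dataset_ids: List[str]
    result: float

@dataclass
class LineOfInquiry:
    """Hypothetical stand-in for a line of inquiry: a data query plus a workflow."""
    question: str
    query: Callable[[], Dict[str, float]]           # returns {dataset_id: value}
    workflow: Callable[[Dict[str, float]], float]   # analyzes the retrieved data
    runs: List[Run] = field(default_factory=list)

    def check_for_updates(self) -> bool:
        """Re-run the workflow only if the query returns datasets not seen before."""
        data = self.query()
        seen = set(self.runs[-1].dataset_ids) if self.runs else set()
        if set(data) <= seen:
            return False
        self.runs.append(Run(sorted(data), self.workflow(data)))
        return True

    def summary(self) -> List[str]:
        """One line per run, so that changes in findings can be tracked."""
        return [f"run {i + 1}: {len(r.dataset_ids)} datasets, result = {r.result:.3f}"
                for i, r in enumerate(self.runs)]

# Toy repository and workflow (illustrative values only).
repository = {"cohortA": 0.10, "cohortB": 0.20}
loi = LineOfInquiry(
    question="Is the SNP associated with the regional surface area?",
    query=lambda: dict(repository),
    workflow=lambda data: sum(data.values()) / len(data),
)
loi.check_for_updates()          # first run
loi.check_for_updates()          # no new data, so nothing is executed
repository["cohortC"] = 0.30     # a new cohort becomes available
loi.check_for_updates()          # triggers a second run
print("\n".join(loi.summary()))
```

Keeping each run as a separate record, rather than overwriting the previous result, is what makes it possible to compare findings across runs as new datasets are added.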
Our system has certain limitations, given that NeuroDISK was designed with a specific
scope based on the select ENIGMA projects that we targeted. Although we designed NeuroDISK with
modularity in mind so that it can be extended to other datasets, projects, and studies,
new ontologies and workflows would have to be created for neuroimaging data of
other modalities, such as diffusion MRI and functional MRI.
NeuroDISK is designed to support a scientist in exploring questions and hypotheses. Rather
than carrying out the above steps and processes manually, a scientist can rely on NeuroDISK to
automate the process and present the results in context. This can save the scientist significant
time and effort. The scientist also has some assurance that the analysis is valid, since it uses
proven methods represented in NeuroDISK. In addition, NeuroDISK reconsiders a scientist’s question whenever
new relevant data becomes available. This would add a novel dimension to scientific research,
namely obtaining updates of published findings as more data is collected in various
neuroimaging studies over time. This could help increase the confidence in published findings,
or lead to revisiting previously explored questions to refine the scope of a past investigation. As
new datasets are incorporated into the analysis, a scientist can consider a more specific question
such as filtering by a demographic characteristic where enough data is now available.
NeuroDISK could be extended to automatically generate visualizations that highlight the results
in different populations so the scientist can consider potential future investigations.
Beyond individual users, there are many benefits of a framework like NeuroDISK for
scientific collaborations such as the ENIGMA consortium. First, method validity, since the lines
of inquiry and workflows in NeuroDISK would be used for many analyses and checked by many people.
This would be in contrast with current practices, where individual researchers set up the methods
and software. Second, dissemination of new methods as well as method reuse, because it would
be easy to run existing methods with little or no learning curve.
Third, transparency and reproducibility of the analyses, since everyone could access the
provenance traces for any analyses in the collaboration. Fourth, promotion of FAIR data and FAIR
method practices, since the AI reasoners need explicit metadata and representations that are also
useful for human scientists. Fifth, easier comparison of experimental results, as the
provenance and lines of inquiry would be comparable side by side.
Adopting a framework such as NeuroDISK could raise concerns about stifling creativity
since the framework would run the same workflows and methods for all data while
human-driven analyses would naturally include many variants (using different software, different
order in the steps, etc). We would argue the opposite, that current human-driven processes resist
creativity and innovation. Each researcher has a tendency to use software that they have used
before, since using new algorithms and methods can have a significant learning curve. With
NeuroDISK it would be easy to replace workflow fragments with new methods that become
available in the literature. Scientists could easily compare and contrast new methods, and see the
results of re-running previous analyses using the new methods. We believe that this would
stimulate more creativity in terms of exploring new combinations of methods and incorporation
of new algorithms.
Intelligent generation of visualizations is also an area of future work. Different types of
visualizations may be more useful depending on the scientific question at hand. For example, one
visualization may focus on the distribution of the input data for statistical analysis and
significance, while others may focus on domain-specific illustrations, such as 3D brain images, to
aid exploration. Including these visualizations in our workflows requires domain-specific
knowledge from researchers. Visualization is a key aspect of understanding the outcome of an
analysis, and easing data exploration may lead to additional scientific questions or prompt
researchers to create new datasets to fill in existing gaps.
Because all the scientific questions in NeuroDISK are machine readable, our approach
can easily be extended to retrieve semantically similar questions and datasets that have been
posed by other investigators and may be related to a new question. NeuroDISK can group similar
questions so that their results can be compared. Improving this support for comparing scientific
questions may provide researchers with new insights.
NeuroDISK automates many key steps and processes for discovery that are mostly
carried out manually today. This is a crucial contribution to the development of AI scientists that
can automate scientific exploration and discovery. Today, these steps and processes are typically
described as part of the methods section of publications. However, they are not in a
machine-readable representation. Moreover, the descriptions of methods are often partial or
incomplete (D Garijo et al., 2013) which makes it difficult for AI systems to access this
information. In contrast, NeuroDISK makes these steps and processes machine-readable and
therefore accessible to AI reasoners.
NeuroDISK is the first AI Scientist that demonstrates the automation of sophisticated
reasoning involved in scientific discovery processes including data retrieval and analysis through
AI planning, reasoning, and execution techniques.
Scientific practice would be fundamentally transformed through these developing AI
scientists. Scientific advances would be greatly accelerated through the automation of analyses.
New investigations could potentially be completed in a matter of hours or days rather
than months or years. In addition, new results would be more reliable, as they would come with
some guarantees of correctness by using lines of inquiry that reflect well known methods and
best practices, rather than be prone to human error and misunderstandings. Extensive provenance
and explanation would be available for examination, rather than the limited documentation
provided in scientific publications. We imagine a future where, once a scientific article is
published, its findings are continuously updated when new evidence comes to light, opening the
way to fully digital and comprehensively computational paradigms for science.
Chapter 4 Conclusions
We developed a fast, robust autoregressive linear mixed model (ARLMM) to detect
genetic associations in temporally or spatially correlated brain imaging traits with high precision.
The model explicitly accounts for between-subject effects (e.g.,
cohort-specific or scanner effects and measurement errors) and within-subject effects (e.g.,
time-varying or spatially varying effects). Time-varying or spatially varying random effects were
introduced into the longitudinal mixed model framework under a first-order autoregressive
assumption: the measurement at the current time point or spatial location depends directly only on the
immediately preceding time point or spatial location, not on any earlier one. We used the moment-matching
method (T Ge et al., 2017) to regress the second moments of the phenotype
on the covariance structures of all random effects, which converts the parameter estimation problem
for the fixed and random effects of the linear mixed model into a linear regression problem. Instead of directly
applying ordinary least squares to this regression, we applied support vector regression (S
Balasundaram et al., 2019) to find parameters that fit the regression line
within a certain error threshold, making the estimates less sensitive to outliers and noise in medical
imaging or genetic data.
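The moment-matching idea summarized above can be illustrated on a deliberately simplified model. The sketch below assumes mean-zero repeated measures whose covariance is a subject-level random intercept plus a first-order autoregressive term with known correlation, and it uses ordinary least squares rather than the support vector regression used in ARLMM; it also omits fixed effects and the genetic random effect. All names and simulated values are hypothetical.

```python
import numpy as np

def ar1_correlation(n_points, rho):
    """AR(1) correlation matrix with entries rho ** |i - j|."""
    idx = np.arange(n_points)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def fit_variance_components(second_moment, rho):
    """Moment matching: regress the second-moment matrix on its structure.

    Assumes Cov(y) = var_subject * J + var_ar * R(rho), where J is all ones
    (a subject-level random intercept) and R(rho) is the AR(1) correlation.
    Returns (var_subject, var_ar) from ordinary least squares.
    """
    n_points = second_moment.shape[0]
    design = np.column_stack([
        np.ones(n_points * n_points),               # vec(J)
        ar1_correlation(n_points, rho).ravel(),     # vec(R(rho))
    ])
    estimate, *_ = np.linalg.lstsq(design, second_moment.ravel(), rcond=None)
    return estimate[0], estimate[1]

# Simulate mean-zero repeated measures for illustration (all values hypothetical).
rng = np.random.default_rng(0)
n_subjects, n_points, rho = 20000, 4, 0.5
true_cov = 1.0 * np.ones((n_points, n_points)) + 2.0 * ar1_correlation(n_points, rho)
y = rng.multivariate_normal(np.zeros(n_points), true_cov, size=n_subjects)
empirical_second_moment = y.T @ y / n_subjects
print(fit_variance_components(empirical_second_moment, rho))
```

With the exact covariance matrix as input, the regression recovers the two variance components exactly; with simulated data, the estimates approach the true values (1.0 and 2.0 here) as the number of subjects grows. Swapping the least-squares step for a robust regressor is where an approach such as support vector regression would enter.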
To handle the growing volume of correlated and uncorrelated data, we proposed NeuroDISK to
perform continuous data analysis, not only for multiple hypothesis comparison but also for
general hypothesis testing with any corresponding workflows. NeuroDISK automates many key
steps and processes for discovery that are mostly carried out manually today. This is a crucial
contribution to the development of AI scientists that can automate scientific exploration and
discovery. Today, these steps and processes are typically described as part of the methods section
of publications. However, they are described in text and not in a machine-readable
representation. Moreover, the descriptions of methods are often partial or incomplete (D Garijo
et al., 2013), which makes it hard for AI systems to access this information. In contrast,
NeuroDISK makes these steps and processes machine-readable and therefore accessible to AI
reasoners. NeuroDISK is the first AI scientist that demonstrates the automation of sophisticated
reasoning involved in scientific discovery processes including data retrieval and analysis through
AI planning, reasoning, and execution techniques.
Future work will include extending ARLMM to more time points and variable time intervals,
incorporating deep-learning-based backbones for parameter estimation, and linking NeuroDISK with more
neuroimaging workflow ontologies, for example ReproNim, to integrate cohort-level analyses
with multicohort meta- and mega-analyses.
Bibliography
Stein JL, Medland SE, Vasquez AA, Hibar DP, Senstad RE, Winkler AM, Toro R, Appel K,
Bartecek R, Bergmann Ø, Bernard M. Identification of common variants associated with human
hippocampal and intracranial volumes. Nature genetics. 2012 May;44(5):552-61.
Smit DJ, van 't Ent D, De Zubicaray G, Stein JL. Neuroimaging and genetics: exploring,
searching, and finding. Twin Research and Human Genetics. 2012 Jun;15(3):267-72.
Moujalled D, Strasser A, Liddell JR. Molecular mechanisms of cell death in neurological
diseases. Cell Death & Differentiation. 2021 Jul;28(7):2029-44.
Wilson DM, Cookson MR, Van Den Bosch L, Zetterberg H, Holtzman DM, Dewachter I.
Hallmarks of neurodegenerative diseases. Cell. 2023 Feb 16;186(4):693-714.
Rajan KB, Weuve J, Barnes LL, McAninch EA, Wilson RS, Evans DA. Population estimate of
people with clinical AD and mild cognitive impairment in the United States (2020-2060).
Alzheimers Dement 2021;17(12):1966-75.
Alzheimer's Association. 2024 Alzheimer's disease facts and figures. Alzheimer's & Dementia.
2018 Mar 1;14(3):367-429.
Mendiola-Precoma J, Berumen LC, Padilla K, Garcia-Alcocer G. Therapies for prevention and
treatment of Alzheimer’s disease. BioMed research international. 2016;2016(1):2589276.
Shusharina N, Yukhnenko D, Botman S, Sapunov V, Savinov V, Kamyshov G, Sayapin D,
Voznyuk I. Modern methods of diagnostics and treatment of neurodegenerative diseases and
depression. Diagnostics. 2023 Feb 3;13(3):573.
Guerreiro R, Bras J. The age factor in Alzheimer’s disease. Genome medicine. 2015 Dec;7:1-3.
Zhao N, Ren Y, Yamazaki Y, Qiao W, Li F, Felton LM, Mahmoudiandehkordi S, Kueider-Paisley
A, Sonoustoun B, Arnold M, Shue F. Alzheimer’s risk factors age, APOE genotype, and sex
drive distinct molecular pathways. Neuron. 2020 Jun 3;106(5):727-42.
Marek S, Tervo-Clemmens B, Calabro FJ, Montez DF, Kay BP, Hatoum AS, Donohue MR,
Foran W, Miller RL, Hendrickson TJ, Malone SM. Reproducible brain-wide association studies
require thousands of individuals. Nature. 2022 Mar 24;603(7902):654-60.
Murali S, Ding H, Adedeji F, Qin C, Obungoloch J, Asllani I, Anazodo U, Ntusi NA, Mammen
R, Niendorf T, Adeleke S. Bringing MRI to low‐and middle‐income countries: directions,
challenges and potential solutions. NMR in Biomedicine. 2024 Jul;37(7):e4992.
Kerner B, North KE, Fallin MD. Use of longitudinal data in genetic studies in the genome‐wide
association studies era: summary of Group 14. Genetic epidemiology. 2009;33(S1):S93-8.
Weiner MW, Aisen PS, Jack Jr CR, Jagust WJ, Trojanowski JQ, Shaw L, Saykin AJ, Morris JC,
Cairns N, Beckett LA, Toga A. The Alzheimer's disease neuroimaging initiative: progress report
and future plans. Alzheimer's & Dementia. 2010 May 1;6(3):202-11.
Manning EN, Leung KK, Nicholas JM, Malone IB, Cardoso MJ, Schott JM, Fox NC, Barnes J,
Alzheimer’s Disease Neuroimaging Initiative. A comparison of accelerated and non-accelerated
MRI scans for brain volume and boundary shift integral measures of volume change: evidence
from the ADNI dataset. Neuroinformatics. 2017 Apr;15:215-26.
Jack Jr CR, Arani A, Borowski BJ, Cash DM, Crawford K, Das SR, DeCarli C, Fletcher E, Fox
NC, Gunter JL, Ittyerah R. Overview of ADNI MRI. Alzheimer's & Dementia. 2024 Sep 11.
Walter S, Long R, Hummel CH. “Make research more people‐centered”: Two ADNI participants
share recommendations for more inclusive Alzheimer's disease research. Alzheimer's &
Dementia. 2024 Aug 21;20(10):7420.
Wilks H, Benzinger TL, Schindler SE, Cruchaga C, Morris JC, Hassenstab J. Predictors and
outcomes of fluctuations in the clinical dementia rating scale. Alzheimer's & Dementia. 2024
Mar;20(3):2080-8.
Ly MT, Adler J, Loy AF, Edmonds EC, Bondi MW, Delano-Wood L. Comparing
neuropsychological, typical, and ADNI criteria for the diagnosis of mild cognitive impairment in
Vietnam-era veterans. Journal of the International Neuropsychological Society. 2024
Jun;30(5):439-47.
Jack Jr CR, Andrews JS, Beach TG, Buracchio T, Dunn B, Graf A, Hansson O, Ho C, Jagust W,
McDade E, Molinuevo JL. Revised criteria for diagnosis and staging of Alzheimer's disease:
Alzheimer's Association Workgroup. Alzheimer's & Dementia. 2024 Jun.
Garg M, Karpinski M, Matelska D, Middleton L, Burren OS, Hu F, Wheeler E, Smith KR, Fabre
MA, Mitchell J, O’Neill A. Disease prediction with multi-omics and biomarkers empowers
case–control genetic discoveries in the UK Biobank. Nature Genetics. 2024 Sep;56(9):1821-31.
Miller KL, Alfaro-Almagro F, Bangerter NK, Thomas DL, Yacoub E, Xu J, Bartsch AJ, Jbabdi
S, Sotiropoulos SN, Andersson JL, Griffanti L. Multimodal population brain imaging in the UK
Biobank prospective epidemiological study. Nature neuroscience. 2016 Nov;19(11):1523-36.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D,
Delaneau O, O’Connell J, Cortes A. The UK Biobank resource with deep phenotyping and
genomic data. Nature. 2018 Oct;562(7726):203-9.
Veronese N, Smith L, Barbagallo M, Giannelli G, Caruso MG, Cisternino AM, Notarnicola M,
Cao C, Waldhoer T, Yang L. Neurological diseases and COVID-19: prospective analyses using
the UK Biobank. Acta Neurologica Belgica. 2021 Oct;121(5):1295-303.
Vilhjálmsson BJ, Nordborg M. The nature of confounding in genome-wide association studies.
Nature Reviews Genetics. 2013 Jan;14(1):1-2.
Schraiber JG, Edge MD. Heritability within groups is uninformative about differences among
groups: Cases from behavioral, evolutionary, and statistical genetics. Proceedings of the National
Academy of Sciences. 2024 Mar 19;121(12):e2319496121.
Zhang Y, Pan W. Principal component regression and linear mixed model in association analysis
of structured samples: competitors or complements?. Genetic epidemiology. 2015
Mar;39(3):149-55.
Hoffman GE. Correcting for population structure and kinship using the linear mixed model:
theory and extensions. PloS one. 2013 Oct 28;8(10):e75707.
Li H, Ralph P. Local PCA shows how the effect of population structure differs along the genome.
Genetics. 2019 Jan 1;211(1):289-304.
Yao Y, Ochoa A. Limitations of principal components in quantitative genetic association models
for human studies. Elife. 2023 May 4;12:e79238.
Furlotte NA, Eskin E, Eyheramendy S. Genome‐wide association mapping with longitudinal
data. Genetic epidemiology. 2012 Jul;36(5):463-71.
Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, Chasman DI,
Ridker PM, Neale BM, Berger B, Patterson N. Efficient Bayesian mixed-model analysis
increases association power in large cohorts. Nature genetics. 2015 Mar;47(3):284-90.
Ge T, Holmes AJ, Buckner RL, Smoller JW, Sabuncu MR. Heritability analysis with repeat
measurements and its application to resting-state functional connectivity. Proceedings of the
National Academy of Sciences. 2017 May 23;114(21):5521-6.
Wu X, McPeek MS. L-gator: genetic association testing for a longitudinally measured
quantitative trait in samples with related individuals. The American Journal of Human Genetics.
2018 Apr 5;102(4):574-91.
Yang Q, Thomopoulos SI, Ding L, Surento W, Thompson PM, Jahanshad N, Alzheimer’s
Disease Neuroimaging Initiative. Support vector based autoregressive mixed models of
longitudinal brain changes and corresponding genetics in Alzheimer’s disease. In Predictive
Intelligence in Medicine: Second International Workshop, PRIME 2019, Held in Conjunction
with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 2 2019 (pp. 160-167).
Springer International Publishing.
Guedj J, Thiébaut R, Commenges D. Joint modeling of the clinical progression and of the
biomarkers' dynamics using a mechanistic model. Biometrics. 2011 Mar;67(1):59-66.
Ge W, Yu Y. Borrowing treasures from the wealthy: Deep transfer learning through selective
joint fine-tuning. In Proceedings of the IEEE conference on computer vision and pattern
recognition 2017 (pp. 1086-1095).
Lei B, Yang M, Yang P, Zhou F, Hou W, Zou W, Li X, Wang T, Xiao X, Wang S. Deep and joint
learning of longitudinal data for Alzheimer's disease prediction. Pattern Recognition. 2020 Jun
1;102:107247.
Li M, Liu X, Bradbury P, Yu J, Zhang YM, Todhunter RJ, Buckler ES, Zhang Z. Enrichment of
statistical power for genome-wide association studies. BMC biology. 2014 Dec;12:1-0.
Makridakis S, Hibon M. ARMA models and the Box–Jenkins methodology. Journal of
forecasting. 1997 May;16(3):147-63.
Tian K, Jiang Y, Yuan Z, Peng B, Wang L. Visual autoregressive modeling: Scalable image
generation via next-scale prediction. arXiv preprint arXiv:2404.02905. 2024 Apr 3.
Basak D, Pal S, Patranabis DC. Support vector regression. Neural Information
Processing-Letters and Reviews. 2007 Oct;11(10):203-24.
Zhang F, O'Donnell LJ. Support vector regression. In Machine learning 2020 Jan 1 (pp. 123-140).
Academic Press.
Zheng Y, Zhang YJ, Larochelle H. Topic modeling of multimodal data: an autoregressive
approach. In Proceedings of the IEEE conference on computer vision and pattern recognition
2014 (pp. 1370-1377).
Guo B, Zhang X, Wu H, Wang Y, Zhang Y, Wang YF. Lar-sr: A local autoregressive model for
image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition 2022 (pp. 1909-1918).
El-Nouby A, Klein M, Zhai S, Bautista MA, Toshev A, Shankar V, Susskind JM, Joulin A.
Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541.
2024 Jan 16.
Brown TB. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J,
Altman S, Anadkat S, Avila R. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023
Mar 15.
Liu G, Feng M, Wang Y, Wong L, Ng SK, Mah TL, Lee EJ. Towards exploratory hypothesis
testing and analysis. In 2011 IEEE 27th International Conference on Data Engineering 2011 Apr
11 (pp. 745-756). IEEE.
Sanz-Robinson J, Jahanpour A, Phillips N, Glatard T, Poline JB. NeuroCI: Continuous
Integration of Neuroimaging Results Across Software Pipelines and Datasets. In 2022 IEEE 18th
International Conference on e-Science (e-Science) 2022 Oct 11 (pp. 105-116). IEEE.
Lupton MK, Strike L, Hansell NK, Wen W, Mather KA, Armstrong NJ, Thalamuthu A,
McMahon KL, de Zubicaray GI, Assareh AA, Simmons A. The effect of increased genetic risk
for Alzheimer's disease on hippocampal and amygdala volume. Neurobiology of aging. 2016
Apr 1;40:68-77.
Adaszewski S, Dukart J, Kherif F, Frackowiak R, Draganski B, Alzheimer's Disease
Neuroimaging Initiative. How early can we predict Alzheimer's disease using computational
anatomy?. Neurobiology of aging. 2013 Dec 1;34(12):2815-26.
Montagne A, Nation DA, Sagare AP, Barisano G, Sweeney MD, Chakhoyan A, Pachicano M,
Joe E, Nelson AR, D’Orazio LM, Buennagel DP. APOE4 leads to blood–brain barrier
dysfunction predicting cognitive decline. Nature. 2020 May 7;581(7806):71-6.
Fortea J, Pegueroles J, Alcolea D, Belbin O, Dols-Icardo O, Vaqué-Alcázar L, Videla L, Gispert
JD, Suárez-Calvet M, Johnson SC, Sperling R. APOE4 homozygozity represents a distinct
genetic form of Alzheimer’s disease. Nature medicine. 2024 May 6:1-8.
Elliott LT, Sharp K, Alfaro-Almagro F, Shi S, Miller KL, Douaud G, Marchini J, Smith SM.
Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nature. 2018
Oct;562(7726):210-6.
Bowman FD. Brain imaging analysis. Annual review of statistics and its application. 2014 Jan
3;1(1):61-85.
Littell RC, Pendergast J, Natarajan R. Modelling covariance structure in the analysis of repeated
measures data. Statistics in medicine. 2000 Jul 15;19(13):1793-819.
Zhao B, Li T, Yang Y, Wang X, Luo T, Shan Y, Zhu Z, Xiong D, Hauberg ME, Bendl J, Fullard
JF. Common genetic variation influencing human white matter microstructure. Science. 2021 Jun
18;372(6548):eabf3736.
Jiang K, Huang J. A Survey on Vision Autoregressive Model. arXiv preprint arXiv:2411.08666.
2024 Nov 13.
Mbatchou J, McPeek MS. JASPER: fast, powerful, multitrait association testing in structured
samples gives insight on pleiotropy in gene expression. The American Journal of Human
Genetics. 2024 Aug 8;111(8):1750-69.
Cox SR, Ritchie SJ, Tucker-Drob EM, Liewald DC, Hagenaars SP, Davies G, Wardlaw JM, Gale
CR, Bastin ME, Deary IJ. Ageing and brain white matter structure in 3,513 UK Biobank
participants. Nature communications. 2016 Dec 15;7(1):13629.
Medland SE, Grasby KL, Jahanshad N, Painter JN, Colodro‐Conde L, Bralten J, Hibar DP, Lind
PA, Pizzagalli F, Thomopoulos SI, Stein JL. Ten years of enhancing neuro‐imaging genetics
through meta‐analysis: An overview from the ENIGMA Genetics Working Group. Human Brain
Mapping. 2022 Jan;43(1):292-9.
Rutten-Jacobs LC, Tozer DJ, Duering M, Malik R, Dichgans M, Markus HS, Traylor M. Genetic
study of white matter integrity in UK Biobank (N= 8448) and the overlap with stroke,
depression, and dementia. Stroke. 2018 Jun.
Chong TT. Structural change in AR (1) models. Econometric Theory. 2001 Feb;17(1):87-155.
Jiang B, Li J, Yao Q. Autoregressive networks. Journal of Machine Learning Research.
2023;24(227):1-69.
Wang YG, Wu J, Hu ZH, McLachlan GJ. A new algorithm for support vector regression with
automatic selection of hyperparameters. Pattern Recognition. 2023 Jan 1;133:108989.
Law NJ, Taylor JM, Sandler H. The joint modeling of a longitudinal disease progression marker
and the failure time process in the presence of cure. Biostatistics. 2002 Dec 1;3(4):547-63.
Li M, Boehnke M, Abecasis GR. Joint modeling of linkage and association: identifying SNPs
responsible for a linkage signal. The American Journal of Human Genetics. 2005 Jun
1;76(6):934-49.
Schweiger R, Fisher E, Weissbrod O, Rahmani E, Müller-Nurasyid M, Kunze S, Gieger C,
Waldenberger M, Rosset S, Halperin E. Detecting heritable phenotypes without a model using
fast permutation testing for heritability and set-tests. Nature communications. 2018 Nov
21;9(1):4919.
Alsentzer E, Finlayson SG, Li MM, Undiagnosed Diseases Network, Kobren SN, Kohane IS.
Simulation of undiagnosed patients with novel genetic conditions. Nature Communications. 2023
Oct 12;14(1):6403.
Speed D, Cai N, UCLEB Consortium, Johnson MR, Nejentsev S, Balding DJ. Reevaluation of SNP
heritability in complex human traits. Nature genetics. 2017 Jul 1;49(7):986-92.
Cutler DJ, Jensen JD. To pool, or not to pool? Genetics. 2010 Sep 1;186(1):41-3.
Günther T, Coop G. Robust identification of local adaptation from allele frequencies. Genetics.
2013 Sep 1;195(1):205-20.
Excoffier L, Ray N. Surfing during population expansions promotes genetic revolutions and
structuration. Trends in ecology & evolution. 2008 Jul 1;23(7):347-51.
Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating
and interpreting FST. Nature Reviews Genetics. 2009 Sep;10(9):639-50.
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW.
Genetic structure of human populations. Science. 2002 Dec 20;298(5602):2381-5.
Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast
P, Kamm J, Blanché H. Insights into human genetic variation and population history from 929
diverse genomes. Science. 2020 Mar 20;367(6484):eaay5012.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC,
Martin NG, Montgomery GW, Goddard ME. Common SNPs explain a large proportion of the
heritability for human height. Nature genetics. 2010 Jul;42(7):565-9.
Zeng J, De Vlaming R, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, Yap CX, Xue A,
Sidorenko J, McRae AF, Powell JE. Signatures of negative selection in the genetic architecture
of human complex traits. Nature genetics. 2018 May;50(5):746-53.
Gustavsson A, Norton N, Fast T, Frölich L, Georges J, Holzapfel D, Kirabali T, Krolak‐Salmon
P, Rossini PM, Ferretti MT, Lanman L. Global estimates on the number of persons across the
Alzheimer's disease continuum. Alzheimer's & Dementia. 2023 Feb;19(2):658-70.
Van Erp TG, Hibar DP, Rasmussen JM, Glahn DC, Pearlson GD, Andreassen OA, Agartz I,
Westlye LT, Haukvik UK, Dale AM, Melle I. Subcortical brain volume abnormalities in 2028
individuals with schizophrenia and 2540 healthy controls via the ENIGMA consortium.
Molecular psychiatry. 2016 Apr;21(4):547-53.
Ruigrok AN, Salimi-Khorshidi G, Lai MC, Baron-Cohen S, Lombardo MV, Tait RJ, Suckling J.
A meta-analysis of sex differences in human brain structure. Neuroscience & Biobehavioral
Reviews. 2014 Feb 1;39:34-50.
Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N,
Borchers A, Boyle R. In-datacenter performance analysis of a tensor processing unit.
In Proceedings of the 44th annual international symposium on computer architecture 2017 Jun 24
(pp. 1-12).
Wang Y, Wang Q, Shi S, He X, Tang Z, Zhao K, Chu X. Benchmarking the performance and
energy efficiency of AI accelerators for AI training. In 2020 20th IEEE/ACM International
Symposium on Cluster, Cloud and Internet Computing (CCGRID) 2020 May 11 (pp. 744-751).
IEEE.
Kehinde TO, Chung SH, Chan FT. Benchmarking TPU and GPU for Stock Price Forecasting
Using LSTM Model Development. In Science and Information Conference 2023 Jul 13 (pp.
289-306). Cham: Springer Nature Switzerland.
Chicco D, Warrens MJ, Jurman G. The coefficient of determination R-squared is more
informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation.
PeerJ Computer Science. 2021 Jul 5;7:e623.
Bergmeir C, Hyndman RJ, Koo B. A note on the validity of cross-validation for evaluating
autoregressive time series prediction. Computational Statistics & Data Analysis. 2018 Apr
1;120:70-83.
Fischl B. FreeSurfer. Neuroimage. 2012 Aug 15;62(2):774-81.
Reuter M, Schmansky NJ, Rosas HD, Fischl B. Within-subject template estimation for unbiased
longitudinal image analysis. Neuroimage. 2012 Jul 16;61(4):1402-18.
Grasby KL, Jahanshad N, Painter JN, Colodro-Conde L, Bralten J, Hibar DP, Lind PA, Pizzagalli
F, Ching CR, McMahon MA, Shatokhina N. The genetic architecture of the human cerebral
cortex. Science. 2020 Mar 20;367(6484):eaay6690.
Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN revisited, revisited: why and how
you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS). 2017 Jul
31;42(3):1-21.
Ou YN, Wu BS, Ge YJ, Zhang Y, Jiang YC, Kuo K, Yang L, Tan L, Feng JF, Cheng W, Yu JT.
The genetic architecture of human amygdala volumes and their overlap with common brain
disorders. Translational Psychiatry. 2023 Mar 11;13(1):90.
Hibar DP, Stein JL, Renteria ME, Arias-Vasquez A, Desrivières S, Jahanshad N, Toro R, Wittfeld
K, Abramovic L, Andersson M, Aribisala BS. Common genetic variants influence human
subcortical brain structures. Nature. 2015 Apr 9;520(7546):224-9.
Um TT, Park MS, Park JM. Independent joint learning: A novel task-to-task transfer learning
scheme for robot models. In 2014 IEEE International Conference on Robotics and Automation
(ICRA) 2014 May 31 (pp. 5679-5684). IEEE.
Malik H, Farooq MS, Khelifi A, Abid A, Qureshi JN, Hussain M. A comparison of transfer
learning performance versus health experts in disease diagnosis from medical imaging. IEEE
Access. 2020 Jun 24;8:139367-86.
Reas ET, Triebswetter C, Banks SJ, McEvoy LK. Effects of APOE2 and APOE4 on brain
microstructure in older adults: modification by age, sex, and cognitive status. Alzheimer's
Research & Therapy. 2024 Jan 11;16(1):7.
Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. Power
failure: why small sample size undermines the reliability of neuroscience. Nature reviews
neuroscience. 2013 May;14(5):365-76.
Boekel W, Wagenmakers EJ, Belay L, Verhagen J, Brown S, Forstmann BU. A purely
confirmatory replication study of structural brain-behavior correlations. Cortex. 2015 May
1;66:115-33.
Bowring A, Maumet C, Nichols TE. Exploring the impact of analysis software on task fMRI
results. Human brain mapping. 2019 Aug 1;40(11):3362-84.
Poldrack RA, Whitaker K, Kennedy DN. Introduction to the special issue on reproducibility in
neuroimaging.
Hodge SM, Haselgrove C, Honor L, Kennedy DN, Frazier JA. An assessment of the autism
neuroimaging literature for the prospects of re-executability. F1000Research. 2021 Mar
4;9:1031.
Smith SM, Nichols TE. Statistical challenges in “big data” human neuroimaging. Neuron. 2018
Jan 17;97(2):263-8.
Medland SE, Jahanshad N, Neale BM, Thompson PM. Whole-genome analyses of whole-brain
data: working within an expanded search space. Nature neuroscience. 2014 Jun;17(6):791-800.
Thompson PM, Jahanshad N, Ching CR, Salminen LE, Thomopoulos SI, Bright J, Baune BT,
Bertolín S, Bralten J, Bruin WB, Bülow R. ENIGMA and global neuroscience: A decade of
large-scale studies of the brain in health and disease across more than 40 countries. Translational
psychiatry. 2020 Mar 20;10(1):100.
Satizabal CL, Adams HH, Hibar DP, White CC, Knol MJ, Stein JL, Scholz M, Sargurupremraj
M, Jahanshad N, Roshchupkin GV, Smith AV. Genetic architecture of subcortical brain structures
in 38,851 individuals. Nature genetics. 2019 Nov;51(11):1624-36.
Hibar DP, Adams HH, Jahanshad N, Chauhan G, Stein JL, Hofer E, Renteria ME, Bis JC,
Arias-Vasquez A, Ikram MK, Desrivières S. Novel genetic loci associated with hippocampal
volume. Nature communications. 2017 Jan 18;8(1):13624.
Brouwer RM, Klein M, Grasby KL, Schnack HG, Jahanshad N, Teeuw J, Thomopoulos SI,
Sprooten E, Franz CE, Gogtay N, Kremen WS. Genetic variants associated with longitudinal
changes in brain structure across the lifespan. Nature neuroscience. 2022 Apr;25(4):421-32.
Garijo D, Fakhraei S, Ratnakar V, Yang Q, Endrias H, Ma Y, Wang R, Bornstein M, Bright J, Gil
Y, Jahanshad N. Towards automated hypothesis testing in neuroscience. In Heterogeneous Data
Management, Polystores, and Analytics for Healthcare: VLDB 2019 Workshops, Poly and
DMAH, Los Angeles, CA, USA, August 30, 2019, Revised Selected Papers 5 2019 (pp.
249-257). Springer International Publishing.
Gil Y, Garijo D, Ratnakar V, Mayani R, Adusumilli R, Boyce H, Mallick P. Automated
hypothesis testing with large scientific data repositories. In Proceedings of the Fourth Annual
Conference on Advances in Cognitive Systems (ACS) 2016 Jun (Vol. 2, p. 4).
Garijo D, Gil Y, Ratnakar V. The DISK Hypothesis Ontology: Capturing Hypothesis Evolution
for Automated Discovery. In K-CAP Workshops 2017 Dec 4 (pp. 40-46).
Liu G, Feng M, Wang Y, Wong L, Ng SK, Mah TL, Lee EJ. Towards exploratory hypothesis
testing and analysis. In 2011 IEEE 27th International Conference on Data Engineering 2011 Apr
11 (pp. 745-756). IEEE.
Krafczyk M, Shi A, Bhaskar A, Marinov D, Stodden V. Scientific tests and continuous
integration strategies to enhance reproducibility in the scientific software context. In Proceedings
of the 2nd International Workshop on Practical Reproducible Evaluation of Computer Systems
2019 Jun 17 (pp. 23-28).
Sanz-Robinson J, Jahanpour A, Phillips N, Glatard T, Poline JB. NeuroCI: Continuous
Integration of Neuroimaging Results Across Software Pipelines and Datasets. In 2022 IEEE 18th
International Conference on e-Science (e-Science) 2022 Oct 11 (pp. 105-116). IEEE.
Baader F, Ghilardi S. Unification in modal and description logics. Logic Journal of IGPL. 2011
Dec 1;19(6):705-30.
Gil Y, Michel F, Ratnakar V, Read J, Hauder M, Duffy C, Hanson P, Dugan H. Supporting open
collaboration in science through explicit and linked semantic description of processes. In The
Semantic Web. Latest Advances and New Domains: 12th European Semantic Web Conference,
ESWC 2015, Portoroz, Slovenia, May 31--June 4, 2015. Proceedings 12 2015 (pp. 591-605).
Springer International Publishing.
Vrandečić D, Pintscher L, Krötzsch M. Wikidata: The making of. In Companion Proceedings of
the ACM Web Conference 2023 2023 Apr 30 (pp. 615-624).
Gil Y, Gonzalez-Calero PA, Kim J, Moody J, Ratnakar V. A semantic framework for automatic
generation of computational workflows using distributed data and component catalogues. Journal
of Experimental & Theoretical Artificial Intelligence. 2011 Dec 1;23(4):389-467.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA,
Nelson SF, Craig DW. Resolving individuals contributing trace amounts of DNA to highly
complex mixtures using high-density SNP genotyping microarrays. PLoS genetics. 2008 Aug
29;4(8):e1000167.
Cai R, Hao Z, Winslett M, Xiao X, Yang Y, Zhang Z, Zhou S. Deterministic identification of
specific individuals from GWAS results. Bioinformatics. 2015 Jun 1;31(11):1701-7.
Jernigan TL, Brown SA. ABCD Consortium Coordinators. Introduction. Dev Cogn Neurosci.
2018;32(1-3):10-16.
Casey BJ, Cannonier T, Conley MI, Cohen AO, Barch DM, Heitzeg MM, Soules ME, Teslovich
T, Dellarco DV, Garavan H, Orr CA. The adolescent brain cognitive development (ABCD)
study: imaging acquisition across 21 sites. Developmental cognitive neuroscience. 2018 Aug
1;32:43-54.
Hagler Jr DJ, Hatton S, Cornejo MD, Makowski C, Fair DA, Dick AS, Sutherland MT, Casey
BJ, Barch DM, Harms MP, Watts R. Image processing and analysis methods for the Adolescent
Brain Cognitive Development Study. Neuroimage. 2019 Nov 15;202:116091.
Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and pitfalls in the
application of mixed-model association methods. Nature genetics. 2014 Feb;46(2):100-6.
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N,
Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J. The FAIR Guiding Principles for
scientific data management and stewardship. Scientific data. 2016 Mar 15;3(1):1-9.
Garijo D, Kinnings S, Xie L, Xie L, Zhang Y, Bourne PE, Gil Y. Quantifying reproducibility in
computational biology: the case of the tuberculosis drugome. PloS one. 2013 Nov
27;8(11):e80278.
Abstract
In complex human brain mapping analyses, each subject may have multiple measurements of a quantitative trait, corresponding to different time points or spatial locations. Modeling genetic factors for such repeated measurements requires accounting for within-subject variation, especially when a dataset contains many diagnostic trajectories or when measurements are taken at different spatial locations. Because genetic associations often require tens of thousands of individuals to detect, powerful and robust approaches are needed. We propose a robust and accurate Autoregressive Linear Mixed Model (ARLMM), which incorporates support vector regression and joint modeling to address different types of within-subject and between-subject dependence, including genetic differences, scanner effects, and temporal or spatial differences. We applied our model to genetic association analysis of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) T1-weighted MRI dataset and the UK Biobank diffusion-weighted MRI dataset.
When data are collected continuously, it remains challenging to perform systematic data analysis without human intervention. We developed NeuroDISK, a user-friendly automatic analysis framework with three components: data storage and retrieval, analytical workflow integration, and data visualization. Together these support continuous data analysis. To demonstrate applications of interest to the general scientific community without the need for individual-level data, NeuroDISK was evaluated as a tool for meta-analysis. We incorporated both an inverse-variance-weighted meta-analysis and a meta-regression framework to showcase the effect of specific genotypes in select brain regions. The NeuroDISK framework can be generalized beyond this use case, giving users enough flexibility to define questions, run workflows, and access results interactively and continuously.
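As a rough illustration of the inverse-variance-weighted meta-analysis mentioned above, the following is a minimal sketch of the standard fixed-effect estimator, which pools per-cohort effect sizes using only summary statistics. It is not the NeuroDISK workflow itself; the function name `ivw_meta_analysis` and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def ivw_meta_analysis(betas, ses):
    """Fixed-effect inverse-variance-weighted pooling of per-cohort effects.

    betas: per-cohort effect estimates (hypothetical input, e.g. regression slopes)
    ses:   their standard errors, in the same order
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its variance: w_i = 1 / se_i^2
    weights = [1.0 / (se * se) for se in ses]
    total_w = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Illustrative (made-up) summary statistics from two cohorts
beta, se, z, p = ivw_meta_analysis([0.1, 0.3], [0.1, 0.1])
print(beta, se, z, p)
```

Because only effect estimates and standard errors enter the calculation, this kind of pooling needs no individual-level data. A meta-regression would additionally model the per-cohort effects as a function of cohort-level covariates, which this sketch does not attempt.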
Asset Metadata
Creator
Yang, Qifan (author)
Core Title
Data modeling approaches for continuous neuroimaging genetics
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2024-12
Publication Date
01/13/2025
Defense Date
07/25/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Alzheimer’s disease, automatic hypothesis testing, autoregressive models, continuous data analysis, disease modeling, genetic association analysis, linear mixed model, OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Thompson, Paul (committee chair), Edge, Michael (committee member), Jahanshad, Neda (committee member)
Creator Email
aliceyang1224@gmail.com,qifan.yang@usc.edu
Unique identifier
UC11399FAHA
Identifier
etd-YangQifan-13751.pdf (filename)
Legacy Identifier
etd-YangQifan-13751
Document Type
Dissertation
Rights
Yang, Qifan
Internet Media Type
application/pdf
Type
texts
Source
20250113-usctheses-batch-1234 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu