Improving Arterial Spin Labeling in Clinical Application with Deep
Learning
by
Qinyang Shou
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)
December 2024
Copyright 2024 Qinyang Shou
Dedication
Dedicated to myself, my family and friends
Acknowledgments
First of all, I would like to express my special thanks to my academic advisor, Professor Danny JJ Wang, for giving me the chance to be his Ph.D. student and for both academic guidance and emotional motivation. His guidance not only helped me develop the skills for conducting research in MRI technologies but also encouraged me to explore multiple research fields with great patience. His knowledge and dedication to research influenced me in both my work and my life. I would also like to thank Prof. Qifa Zhou, Prof. Yonggang Shi, Prof. Hosung Kim and Dr. John Ringman for supporting my work and providing insightful comments and suggestions.

I would like to express my great thankfulness to all my friends and research fellows in Prof. Wang's lab, at the USC Stevens Neuroimaging and Informatics Institute, and in the Biomedical Engineering department at USC. It is your encouragement and kindness that helped me get through the hardest times of my Ph.D. life. I am also grateful to the International Society for Magnetic Resonance in Medicine for providing great opportunities to learn about new MRI technologies and to communicate with the greatest scientists and researchers in the MR field from all over the world.

Finally, I would like to thank my parents and my relatives for understanding and supporting me throughout my Ph.D. study in the U.S. To my girlfriend, Kexin, thank you for your company and your love. The time we met at the conference was the happiest moment of my Ph.D. life.
Publications
Wang K, Shou Q, Ma SJ, Liebeskind D, Qiao XJ, Saver J, Salamon N, Kim H, Yu Y, Xie Y, Zaharchuk G. Deep learning detection of penumbral tissue on arterial spin labeling in stroke. Stroke. 2020 Feb;51(2):489-97.

Shou Q, Shao X, Wang DJ. Super-resolution arterial spin labeling using slice-dithered enhanced resolution and simultaneous multi-slice acquisition. Frontiers in Neuroscience. 2021 Oct 29;15:737525.

Shao X, Guo F, Shou Q, Wang K, Jann K, Yan L, Toga AW, Zhang P, Wang DJ. Laminar perfusion imaging with zoomed arterial spin labeling at 7 Tesla. NeuroImage. 2021 Dec 15;245:118724.

Wang K, Ma SJ, Shao X, Zhao C, Shou Q, Yan L, Wang DJ. Optimization of pseudo-continuous arterial spin labeling at 7T with parallel transmission B1 shimming. Magnetic Resonance in Medicine. 2022 Jan;87(1):249-62.

Shao X, Zhao C, Shou Q, St Lawrence KS, Wang DJ. Quantification of blood–brain barrier water exchange and permeability with multidelay diffusion-weighted pseudo-continuous arterial spin labeling. Magnetic Resonance in Medicine. 2023 May;89(5):1990-2004.

Zhao C, Shao X, Shou Q, Ma SJ, Gokyar S, Graf C, Stollberger R, Wang DJ. Whole-cerebrum distortion-free three-dimensional pseudo-continuous arterial spin labeling at 7T. NeuroImage. 2023 Aug 15;277:120251.

Shao X, Shou Q, Felix K, Ojogho B, Jiang X, Gold BT, Herting MM, Goldwaser EL, Kochunov P, Hong LE, Pappas I. Age-related decline in blood-brain barrier function is more pronounced in males than females in parietal and temporal regions. eLife. 13:RP96155.

Shou Q, Zhao C, Shao X, Jann K, Kim H, Helmer KG, Lu H, Wang DJ. Transformer-based deep learning denoising of single and multi-delay 3D arterial spin labeling. Magnetic Resonance in Medicine. 2024 Feb;91(2):803-18.

Shou Q, Zhao C, Shao X, Herting MM, Wang DJ. High-resolution multi-delay arterial spin labeling with Transformer-based denoising for pediatric perfusion MRI. NeuroImage (under review).

Guo F, Zhao C, Shou Q, Jin N, Jann K, Shao X, Wang DJ. Assessing cerebral microvascular volumetric pulsatility with high-resolution 4D CBV MRI at 7T. medRxiv. 2024 Sep 5.

Shou Q, Cen S, Chen NK, Ringman JM, Wen J, Kim H, Wang DJ. Diffusion model enables quantitative CBF analysis of Alzheimer's Disease. Radiology Advances (under review).

Shao X, Guo F, Kim J, Ress D, Zhao C, Shou Q, Jann K, Wang DJ. Laminar multi-contrast fMRI at 7T allows differentiation of neuronal excitation and inhibition underlying positive and negative BOLD responses. Imaging Neuroscience. 2024 Sep 25.
Table of Contents

Dedication
Acknowledgments
Publications
List of Tables
List of Figures
List Of Abbreviations
Abstract

Chapter 1: Introduction
1.1 Cerebral Blood Flow (CBF) for Diagnosis
1.2 Measuring CBF with Arterial Spin Labeling (ASL)
1.3 Transformers, Diffusion Model and Pretrained Foundation Model in Medical Imaging
1.3.1 Transformer-based model
1.3.2 Diffusion Model
1.3.3 Pretrained Foundation Model
1.4 Current Deep Learning Applications in medical imaging and ASL

Chapter 2: Characteristics of ASL imaging
2.1 Image acquisition
2.1.1 Labeling method
2.1.2 Readout method
2.1.3 Background suppression
2.2 Processing Methods
2.3 CBF Quantification
2.4 Beyond Conventional ASL

Chapter 3: Improving ASL image quality with Swin Transformer-based model
3.1 Swin Transformer model for image restoration
3.2 Pseudo3D SwinIR for single-delay ASL denoising
3.3 Pseudo4D SwinIR for multi-delay ASL denoising
3.4 Evaluation of the results on quantification

Chapter 4: Self-supervised learning for high-resolution pediatric choroid plexus perfusion imaging
4.1 Choroid plexus perfusion imaging and accelerated acquisition
4.2 Accelerated acquisition of high-resolution multi-delay ASL
4.3 Self-supervised learning with k-space weighted image average (KWIA) and Noise2Void
4.4 Evaluation of test-retest reproducibility

Chapter 5: Generating M0 enables CBF quantification for ADNI
5.1 Conditional latent diffusion model
5.1.1 Denoising diffusion probabilistic model (DDPM)
5.1.2 Latent diffusion model for conditional M0 generation
5.2 Diffusion model for M0 generation in ADNI dataset
5.2.1 Dataset collection
5.2.2 Model training and evaluation
5.3 Analysis of clinical significance
5.3.1 Group analysis of brain regions
5.3.2 Machine learning for AD classification

Chapter 6: Discussion and Future Direction
6.1 Study limitations
6.2 Direct quantification for multi-parametric ASL
6.3 Interpretable classification model for AD diagnosis with ASL
6.4 Conclusion

References

Appendix A: CNN and Transformer comparison for patch-based training strategy
List of Tables

3.1 Details of the datasets used in this study, including the number of subjects and cases, patient age and gender, MRI scanner vendors, ASL parameters including LD and PLD, image resolution, matrix size, number of repetitions (L/C pairs) and the number of testing cases.
3.2 Details of the input settings in the experiment. (M0 is the proton density image)
3.3 Comparison of similarity metrics for all model backbones and input settings.
3.4 SNR performance for different vendors with different proportions of averages. The best performance across models is shown in bold. (SNR 2 means the image is averaged over 2 repetitions; SNR 4 means the image is averaged over 4 repetitions; SNR all means the image is averaged over all available repetitions; the last column is the SNR for all images from the 3 datasets with all available repetitions.)
4.1 SNR of perfusion images and of CBF and ATT maps in GM for the baseline and the different denoising methods; the percentages show improvements compared to the baseline.
4.2 SNR of perfusion images and of CBF and ATT maps in CP for the baseline and the different denoising methods; the percentages show improvements compared to the baseline.
4.3 Intraclass correlation coefficient (ICC) of CBF and ATT in GM and CP.
4.4 Within-subject coefficient of variation (wsCV) of CBF and ATT in GM and CP.
5.1 Demographic information for the in-house dataset with paired ASL data.
5.2 Demographic information for subjects (baseline).
5.3 Results of machine learning classification with either CBF or perfusion as features. Machine learning methods include AdaBoost, Elastic Net and Random Forest.
List of Figures

1.1 A typical architecture for the Transformer encoder and its adaptation to image processing. The Transformer encoder includes a multi-head attention layer and a multi-layer perceptron (MLP) with residual connections. The vision Transformer separates the image into smaller patches and adds a positional embedding to each patch according to its location. The embedded patches are then sent to the Transformer encoders for processing. (Figure from He et al. Intelligent Medicine 2023)
1.2 Illustration of the workflow of a score-based diffusion model used to reconstruct an undersampled MRI dataset. (Figure from Chung and Ye. Medical Image Analysis 2022)
1.3 Illustration of the foundation model. Training data come from broad sources, and the model can be adapted to a number of downstream tasks. (Figure from Bommasani, Rishi, et al. arXiv 2021)
2.1 Signal evolution for pulsed ASL and pseudo-continuous ASL. (Figure from Michael Chappell, Bradley Macintosh and Thomas Okell. Introduction to Quantification using Arterial Spin Labeling)
2.2 (a) 2D versus (b) 3D readout ASL imaging in a normal subject. Both images were acquired with approximately 5 min of imaging at 3T with pCASL labeling (label duration of 1.5 s and a post-label delay of 2 s). The 2D readout method was a single-shot gradient echo spiral. The 3D readout was a segmented stack-of-spirals fast spin echo. Note the artifacts associated with the 2D single-shot method in regions of high susceptibility (arrows). (Figure from Alsop, David C., et al. Magnetic Resonance in Medicine 73.1 (2015): 102-116.)
2.3 The timing diagram for background suppression. a. Unconstrained scheme, b. constrained scheme, and c. an unconstrained background suppression scheme interleaved with CASL. (Figure from Nasim Maleki, Weiying Dai and David Alsop. Magnetic Resonance Materials in Physics, Biology and Medicine 25 (2012): 127-133.)
2.4 Demonstration of zoomed pCASL acquisition and diagrams of motor and visual tasks. A. Illustration of the imaging volume on a 3D brain surface. A small FOV (100 × 50 × 24 mm³) covering the dominant motor cortex (red) or visual cortex (blue) was acquired with zoomed GRASE. Axial, coronal and sagittal views of GRASE images are shown in enlarged red and blue boxes, respectively. B. Illustration of the pCASL labeling plane in sagittal (yellow line) and coronal (yellow lines and shade) views. Intracranial arteries were revealed by maximal-intensity-projection (MIP) of the T1w structural MRI, and the pCASL labeling plane was placed above the circle of Willis (CoW) and simultaneously perpendicular to the M3 segment of the middle cerebral artery (MCA), the P2 segment of the posterior cerebral artery (PCA) and the A2 segment of the anterior cerebral artery (ACA). (Figure from Shao et al. NeuroImage 245 (2021): 118724.)
3.1 Illustration of self-attention and multi-head self-attention. (Figure from Vaswani et al. Advances in Neural Information Processing Systems (2017).)
3.2 Illustration of the Vision Transformer. (Figure from Dosovitskiy, Alexey. ICLR 2021.)
3.3 An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture. In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l + 1 (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections among them. (Figure from Liu et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.)
3.4 The architecture of the proposed SwinIR for image restoration. (Figure from Liang et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.)
3.5 Input, reference, and prediction perfusion images processed by the deep learning (DL) models of different settings. The DL-processed image has higher SNR compared to the input image and high similarity to the reference. The label "2D" means the model takes 1 slice as the input; "3," "5," and "7" mean the model takes pseudo-3D inputs of three, five, and seven slices of the perfusion images, respectively. "2D M0" means the model takes one slice of the perfusion image and the corresponding slice of the M0 image as the input. "3 M0" means the model takes pseudo-3D inputs of three slices of the perfusion image and the center slice of the M0 image as the input.
3.6 Example with larger coverage of a representative subject. The reference is averaged over all input perfusion images. Each input perfusion image was denoised by the deep learning model and averaged over different portions of the time points (2, 4, and all). More averaging results in higher SNR.
3.7 Comparison of the structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) of different model backbones and input settings. For both SSIM and PSNR, performance improves with more adjacent slices in the input channels. Adding an M0 channel results in the largest improvement. The significance of the difference is indicated on the bar plot (*p<0.05, **p<0.01, ***p<0.001).
3.8 Prediction and difference map of UNETR and SwinIR ("3" stands for using pseudo-3D of 3 slices as the input).
3.9 Prediction and difference map of TransUNet and SwinIR ("3" stands for using pseudo-3D of 3 slices as the input).
3.10 Denoising performance of three models for three representative cases from independent testing cases from different vendors. For GE data, the original image is the input to the networks. For the other two vendors, the reference images are averaged over all input repetitions. The SNR of the images is shown above each case.
3.11 The framework for the arterial spin labeling (ASL) denoising task. The input to the model is an image slice combined with several other channels, which are images from spatially adjacent slices, or images from temporally adjacent post-labeling delays.
3.12 An example of the multi-delay dataset. Input perfusion images were averaged with 1/1/1/2/2 repetitions for post-labeling delays (PLDs) of 500/1000/1500/2000/2500 ms. Reference perfusion images were averaged with 9/9/9/9/3 repetitions for PLDs of 500/1000/1500/2000/2500 ms. The denoised perfusion images were deep learning predictions with the input, which show improved SNR for each PLD. "slc3" means taking three slices of the PLD; "PLD1" and "PLD3" mean pseudo-3D and 4D models that use one or three PLDs as input, respectively.
3.13 Fitted CBF and ATT maps for two representative subjects. Red arrows show a spot where a fitting error occurs because of a spike in the input, which was resolved by the spatiotemporal denoising models but not by the spatial-only models. "slc3" means taking three slices of the PLD; "PLD1" and "PLD3" mean pseudo-3D and 4D models that use one or three PLDs as input, respectively.
3.14 Similarity metrics of the predicted CBF and ATT maps with reference. The performance for the CBF map is close between spatial-only and spatiotemporal denoising, but for the ATT map, spatiotemporal denoising results are better than spatial-only results. The significance of the difference is indicated on the bar plot (*: p<0.05, **: p<0.01, ***: p<0.001).
3.15 Mean difference of CBF values in whole brain, gray matter, and white matter (relative values (A) and in percentage (B)) for different model backbones and input settings.
3.16 Scatter plots of the denoised cerebral blood flow (CBF) values compared to reference CBF values in whole brain, gray matter, and white matter of the three models, respectively. "3" means the model takes pseudo-3D inputs of three slices.
3.17 Bias analysis for different models. The CBF difference is small for all model conditions, but for ATT, spatial-only models have a much higher difference in white matter ATT.
4.1 The acquisition and undersampling scheme. For each measurement, a time-dependent CAIPIRINHA undersampling pattern with 2×4 acceleration was applied. A full k-space can be obtained by averaging all 8 segments to reconstruct a single image or used to estimate sensitivity maps, and each segment can be used to reconstruct a single image with the TGV method.
4.2 KWIA processing pipeline. The center of k-space is kept, and the outer ring is averaged across neighboring PLDs to produce a denoised k-space and perfusion image.
4.3 Blind-spot masking scheme used during Noise2Void training. (a) A noisy training image. (b) A magnified image patch from (a). During N2V training, a randomly selected pixel is chosen (blue rectangle) and its intensity copied over to create a blind spot (red and striped square). This modified image is then used as the input image during training. (c) The target patch corresponding to (b). The original input with unmodified values is also used as the target. The loss is only calculated for the blind-spot pixels masked in (b). (Figure from Krull et al. IEEE/CVF 2019)
4.4 Original and denoised perfusion images of the five PLDs with different methods, including TGV, KWIA and the deep learning methods. The red arrows show the choroid plexus signals in the lateral ventricles.
4.5 CBF and ATT images fitted from the original and denoised multi-delay perfusion images with different methods, including TGV, KWIA and the deep learning methods. The zoom-in panels show choroid plexus areas in the axial and coronal views.
4.6 Quantitative comparison of SNR among different methods. (a) SNR for perfusion images of each PLD in GM. (b) SNR for CBF and ATT images after fitting in GM. (c) SNR for perfusion images of each PLD in CP. (d) SNR for CBF and ATT images after fitting in CP.
5.1 Illustration of the conditional latent diffusion model (LDM) and the inference process. a. Model structure of the latent diffusion model. The image (M0) in the original pixel space is first encoded into the latent space with the encoder (E). The diffusion process is conducted in the latent space to improve computational efficiency. The condition (control image) is also encoded into the latent space with the same encoder and concatenated with the latent-space feature of M0. The reverse denoising network ϵθ takes the time step t, the latent-space image zt and the latent-space image of the condition τθ(y) as input and produces the latent-space image of the previous step. The final M0 image is produced with a decoder that recovers the image from the latent space. b. The sampling process (generation). The input images are first scaled with histogram matching to match the histogram of the training data. The trained LDM generates 20 samples for each scan, which are averaged to get a robust sample. The scale factor saved in the previous histogram matching is used to scale the image back to the original scale.
5.2 Sequence diagram of the PASL used in the ADNI protocol for Siemens scanners. a. The original product sequence to acquire control/label pairs for the PASL. b. Modified sequence to acquire M0. The background suppression pulses are disabled, and the inversion time is prolonged from 2 s to 5 s to enable full recovery of the longitudinal magnetization (Mz).
5.3 Uncertainty evaluation of the diffusion model. a, b and c. Normalized mean square error (NMSE), structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) between generated and ground truth M0 images with different numbers of averaged samples. The shaded area shows the standard deviation across test images. d and e. A representative case of the ground truth (GT) and the generated image. f. The difference map between the ground truth and the generated M0 image. g. Standard deviation map across 100 generated samples. Red arrows show CSF areas where the standard deviation is higher than in other brain tissues.
5.4 Qualitative and quantitative model performance evaluation. a. A representative case of the ground truth and generated M0 and CBF maps. Both the M0 and CBF maps produced by the diffusion model show high fidelity and similarity to the ground truth image. b. Quantitative similarity metrics of the generated image and the ground truth, including NMSE, SSIM and PSNR. c and d. Bias in CBF quantification measurements. c. Scatter plot of the averaged CBF values of the ground truth and generated CBF images in both gray matter (GM) and white matter (WM). d. Averaged CBF difference in the whole brain, GM and WM. The mean difference is 1.07±2.12 ml/100g/min for whole brain, 1.29±2.51 ml/100g/min for GM and -0.04±1.44 ml/100g/min for WM.
5.5 Results of the T1 evaluation. a. Bloch simulation results for background suppression of various T1 values. b. Illustration of the relationship of Mz/M0 with different T1 values. c. Fitted T1 map with real M0 and control images. d. Fitted T1 map with generated M0 and real control images. e. Scatter plot of gray matter and white matter T1 values of real and generated T1 maps. f. Summary of the GM and WM values of real and generated T1 maps of all subjects.
5.6 Averaged perfusion and CBF maps for different groups in the ADNI dataset, including CN, SMC, MCI and AD. The perfusion maps show larger inhomogeneity across different brain regions, while the CBF maps are more homogeneous. Both Siemens and GE data show a decreasing trend from CN/SMC to MCI to AD, especially in the CBF maps. Siemens pulsed ASL data show more vascular signals, while GE pCASL data are smoother across the whole brain.
5.7 Regional analysis of the trend in different groups of subjects. Four AD-related ROIs are shown, including the posterior cingulate cortex (PCC), precuneus, inferior parietal gyrus and angular gyrus. Data from Siemens and GE are shown in blue and red, respectively. Both show a similar trend in these ROIs, decreasing from CN/SMC to MCI to AD.
5.8 Regional analysis of more ROIs, including the cuneus, inferior parietal gyrus, supramarginal gyrus and hippocampus. Overall, GE and Siemens data show similar trends across the different groups of subjects. Siemens data show larger variance across subjects than GE data.
6.1 (A) Image processing pipeline. (B) Network architectures. Architectures are shown in dashed-line boxes, whereas the legend is shown in the solid-line box. BN, batch normalization; PLD, post-label delay; ReLU, rectified linear unit.
6.2 Grad-CAM overview: Given an image and a class of interest (e.g., "tiger cat" or any other type of differentiable output) as input, we forward propagate the image through the CNN part of the model and then through task-specific computations to obtain a raw score for the category. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the rectified convolutional feature maps of interest, which we combine to compute the coarse Grad-CAM localization (blue heatmap), which represents where the model has to look to make the particular decision. Finally, we pointwise multiply the heatmap with guided backpropagation to get Guided Grad-CAM visualizations, which are both high-resolution and concept-specific. (Figure from Selvaraju et al. ICCV 2017)
6.3 User interface for the radiologists in the study. First, they were shown chest radiographs (CXRs) with and without a present pneumothorax and the vision Transformer (ViT) prediction score. In the second step, a saliency map was additionally shown. For both parts, radiologists had to detect whether a pneumothorax was present and then determine if the saliency map was (subjectively) useful for aiding detection. ID = image identifier. (Figure from Wallek et al. RSNA 2022)
A.1 Illustration of patch-wise training and full-image inference for (A) the CNN network and (B) the SwinIR network.
List Of Abbreviations
AD Alzheimer’s Disease
ASL Arterial Spin Labeling
ATT Arterial Transit Time
CAIPIRINHA Controlled Aliasing in Parallel Imaging Results in Higher Acceleration
CBF Cerebral Blood Flow
CNN Convolutional Neural Network
CP Choroid Plexus
DL Deep Learning
FOV Field Of View
GAN Generative Adversarial Networks
GPU Graphics Processing Unit
GRASE Gradient And Spin Echo
IRB Institutional Review Board
KWIA K-space Weighted Image Average
LD Labeling Duration
LDM Latent Diffusion Model
MPRAGE Magnetization-Prepared 180 degrees Radio-frequency pulses And rapid Gradient-Echo
MRA Magnetic Resonance Angiography
MRI Magnetic Resonance Imaging
PET Positron Emission Tomography
ROC Receiver Operating Characteristic
ROI Region of Interest
SNR Signal-to-Noise Ratio
TGV Total Generalized Variation
TR Repetition Time
Abstract
Arterial spin labeling (ASL) is a magnetic resonance imaging (MRI) technique that can measure human cerebral blood flow (CBF) non-invasively. However, clinical application of this technique remains challenging due to its intrinsically low signal-to-noise ratio (SNR) and long scan time. In addition, the heterogeneity of ASL imaging protocols across vendor platforms makes quantification unreliable. Traditional denoising methods usually assume an image model and/or noise characteristics, which may not represent real data well. Deep Learning (DL)-based models can learn the underlying patterns purely from real data. Recent developments of DL in image processing and image generation provide powerful tools to improve clinical applications of medical imaging, such as improving image quality and reducing acquisition time. However, while a handful of studies have demonstrated the feasibility of DL applications in ASL, there remain large gaps in the reliable application of DL methods for improving the clinical use of ASL on multiple vendor platforms with different imaging protocols (e.g., single-delay and multi-delay).

The purpose of this work is to adapt, optimize and apply some of the latest DL techniques, including Transformers and diffusion models, to improve the clinical translation of ASL by improving image quality and/or reducing scan time, and by generating the missing modality to enable CBF quantification and improve standardization across vendors for the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset.
There are three specific aims in this study. In the first aim, a flexible Transformer-based DL denoising scheme will be developed and evaluated for 3D ASL to improve SNR and/or reduce scan time for both single-delay and multi-delay ASL data. Our hypothesis is that, with the completion of aim 1, we will be able to improve the image quality of ASL acquired from multiple vendors with the trained model without introducing bias in the quantification of cerebral blood flow (CBF) and/or arterial transit time (ATT).

In the second aim, the proposed DL framework and the trained model will be adapted to a high-resolution pediatric multi-delay ASL dataset for perfusion imaging of the pediatric choroid plexus. Since there are no reference images for this cutting-edge application, self-supervised learning techniques will be explored. We will compare the performance of the proposed deep learning method with state-of-the-art conventional denoising methods such as total generalized variation (TGV). Our hypothesis is that, with the completion of aim 2, the proposed deep learning method will show better performance than the traditional method, improving both image quality and test-retest reliability for pediatric choroid plexus perfusion imaging.
In the third aim, a generative diffusion model will be applied to generate the M0 from the control image for Siemens 3D pulsed ASL (PASL) scans in the ADNI-3 dataset, where M0 is not acquired. The generated M0 can be used to quantify CBF for analysis of CBF variations among subjects with normal cognition, mild cognitive impairment (MCI) and AD. This can help resolve the heterogeneity of different types of ASL scans and standardize the quantification of CBF in the AD dataset. Our hypothesis is that, with the completion of aim 3, Siemens 3D PASL scans in the ADNI-3 dataset can be used to quantify CBF with the acquired control images and the generated M0. Analysis of CBF data from different MR vendors will reveal similar characteristic deficits in quantitative CBF in MCI and AD subjects compared to normal controls, which can provide better differentiation between AD and normal subjects than non-standardized perfusion images.
In conclusion, with the completion of the three specific aims, we will show that the latest DL methods, such as Transformers and diffusion models, have the potential to improve ASL in clinical applications by enhancing image quality and improving standardization.
Chapter 1
Introduction
Magnetic Resonance Imaging (MRI) has been playing a pivotal role in the medical field in helping physicians make diagnoses. Modern MRI provides not only structural information of the human body, with techniques such as T1- or T2-weighted MRI, but also functional information, including brain activity, with techniques such as functional MRI. One of the largest challenges in MRI, compared to other medical imaging modalities like computed tomography (CT) and Positron Emission Tomography (PET), is the long scan time, which stems from the sequential spatial encoding required to acquire an MR image. Traditional approaches to this issue apply image processing techniques or constrained reconstruction methods. With the emergence of deep learning (DL), it is now possible to improve MRI through end-to-end models trained with large amounts of real-world data.

The topic of this dissertation is the adaptation of recently developed DL techniques to solve problems for arterial spin labeling (ASL), a non-invasive MRI modality that measures cerebral blood flow (CBF). The studies focus on improving image quality, as well as on evaluating the feasibility and generalizability of DL models under different conditions, including neurodevelopmental and neurodegenerative studies. As ASL is an emerging MRI modality, there are limited datasets available for model training and testing. Also, accuracy of quantification and diagnosis needs to be evaluated in addition to image-similarity-based metrics. These two challenges will be discussed throughout the study presented below.
The first chapter introduces the importance of CBF in clinical diagnosis and some basics of measuring CBF with ASL techniques. The challenges of ASL in clinical applications are discussed next. Then some of the latest developments in deep learning-based methods for medical imaging are discussed, including Transformers, diffusion models, and pretrained foundation models. The last portion of this chapter discusses existing studies on the applications of DL in ASL.
1.1 Cerebral Blood Flow (CBF) for Diagnosis
CBF is defined as the amount of blood that passes through a certain quantity of brain tissue during a given period of time. It plays a crucial role in meeting the brain's nutrition and oxygen demands and ensuring its proper function. Changes in CBF are closely related to many brain diseases such as Alzheimer's Disease (AD)[1], acute ischemic stroke[2, 3], brain tumor[4] and some cerebral small vessel diseases (SVD)[5].

Taking AD as an example, research has consistently demonstrated reduced CBF in specific brain regions among individuals with AD[6], particularly in the temporal cortex[7] and subcortical regions such as the hippocampus[8]. These regions are known to be involved in the memory and cognitive functions that are commonly impaired in AD. The reduction in CBF is believed to result from vascular dysfunction, including narrowing and occlusion of small blood vessels, endothelial dysfunction, and impaired regulation of blood flow[9]. The severity of CBF reductions has also been found to correlate with the degree of cognitive impairment in AD patients[10]. Furthermore, CBF is normally coupled with metabolism, so it provides a surrogate marker of neuronal function and brain metabolism, which is usually assessed by FDG-PET[11]. This characteristic CBF alteration may serve as a potential biomarker for AD and aid in the diagnostic process. Quantitative assessment of CBF using imaging techniques such as PET and MRI can provide valuable insights into the underlying vascular and neurodegenerative processes in AD. Continued research in this area holds promise for developing novel diagnostic approaches and therapeutic strategies targeting CBF and vascular dysfunction in AD.
1.2 Measuring CBF with Arterial Spin Labeling (ASL)
ASL is a non-invasive magnetic resonance imaging (MRI) technique that measures perfusion by using radiofrequency (RF) pulses to tip down the spins of blood water, making the water in blood an endogenous tracer[12]. Compared to other perfusion MRI techniques, ASL has the following advantages. On one hand, ASL is completely non-invasive, which is an advantage over contrast-based perfusion MRI methods such as dynamic contrast-enhanced (DCE)[13] and dynamic susceptibility contrast (DSC)[14] imaging. Because these methods require the injection of a gadolinium-based contrast agent, they may cause problems for patients with renal dysfunction. On the other hand, ASL can measure CBF values quantitatively, which has been validated against 15O PET, the gold standard of quantitative CBF measurement[15]. The ability to quantify CBF makes it possible for ASL to be used as an imaging biomarker for the diagnosis of brain diseases[16].

Despite these promises and progress, challenges remain for the reliable use of ASL in clinical application and cutting-edge research. One major challenge is the intrinsically low signal-to-noise ratio (SNR) of the ASL signal, which is only about 1% of the tissue signal. Since the SNR of an MRI image is proportional to the voxel volume and the square root of the total scan time, there is always a tradeoff between SNR, image resolution and scan duration. In order to get sufficient SNR for an ASL image, a typical scan on a common 3T scanner has a resolution of about 4×4×4 mm³ and is acquired in about 5 minutes[12]. Increasing the scan time also increases the risk of motion artifacts during the scan, especially for pediatric subjects and elderly patients. Since the recommended ASL protocols use a segmented 3D readout, motion occurring between segments is difficult to correct and can cause image distortion. The loss in SNR may lead to inaccurate quantification of CBF. On the other hand, a coarse resolution leads to partial volume effects, which makes it hard to quantify CBF values for small brain structures such as the hippocampus, amygdala, etc. Furthermore, this relatively long scan duration generally allows only single-delay ASL, which may be susceptible to variations in arterial transit time (ATT). Multi-delay ASL was proposed to measure both CBF and ATT but needs to divide the repetitions among multiple post-labeling delays (PLDs)[3], which may result in lower SNR for each PLD.

Another large challenge in ASL is the heterogeneity of implementations and protocols across vendors. There are two aspects to this heterogeneity. On one hand, the implementation of the ASL method differs: on some vendor platforms, pulsed ASL (PASL) is the default method, while on others, pseudo-continuous ASL (pCASL) is the default. This causes differences in signal modeling as well as in perfusion contrast, which will be discussed in the next chapter. On the other hand, some vendors perform preprocessing such as repetition averaging before images are exported, while other vendors keep the original images. This affects how further image processing methods can be applied, as will be discussed in Chapter 3.
1.3 Transformers, Diffusion Model and Pretrained Foundation Model in Medical Imaging
During the past decades, deep learning (DL)[17] in computer vision (CV) has become a transformative technology, which has inspired many applications of DL in medical image analysis[18]. In general, DL is a subset of machine learning that uses neural networks with multiple layers to model complex patterns and representations in data. It has shown excellent performance in tasks such as image recognition[19], natural language processing[20], and speech analysis[21] by learning from vast amounts of data. A key architecture in deep learning is the Convolutional Neural Network (CNN)[22], which is especially powerful for image data. CNNs use convolutional layers to detect local patterns like edges and textures in images, making them highly efficient at recognizing and classifying objects. By leveraging shared weights and spatial hierarchies, CNNs have become the backbone of many state-of-the-art applications in CV. Typical CNN structures, such as U-Net[23] and ResNet[24], can effectively capture spatial patterns, especially local features of images, through convolutional blocks. However, there have been new advances beyond CNNs that boost performance in the aforementioned tasks, such as Transformers. In this section, we will briefly introduce some of them and how they can be applied to medical imaging.
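To make the convolutional building block concrete, the following minimal PyTorch sketch shows the kind of layer stack that U-Net and ResNet are assembled from; the channel counts and kernel size are illustrative choices, not parameters of any model used in this work.

import torch
import torch.nn as nn

# A basic convolutional block: the 3x3 kernel weights are shared across all
# spatial locations (weight sharing), so each output value depends only on a
# small local neighborhood of the input (the local receptive field).
class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

x = torch.randn(1, 1, 64, 64)      # one single-channel 64x64 image slice
print(ConvBlock(1, 16)(x).shape)   # torch.Size([1, 16, 64, 64])

Stacking such blocks with downsampling between them produces the spatial hierarchy described above.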
1.3.1 Transformer-based model
Recently, Transformer-based models[25, 26, 27] have been developed, initially for natural language processing (NLP), and adapted to CV[28]. In particular, the Transformer is a type of neural network built on the self-attention mechanism. Figure 1.1 demonstrates the major components of a Transformer and how it can be adapted to image processing. The original image is first divided into smaller patches, each flattened into a vector. The multi-head self-attention operator computes the importance of each patch in relation to all other patches within the image. Self-attention captures global and local dependencies by allowing each patch to attend to different parts of the image simultaneously. After self-attention, each patch representation passes through a fully connected feed-forward neural network independently. This network applies the same transformation to each patch representation individually, allowing it to capture local patterns.

Figure 1.1: A typical architecture for the Transformer encoder and its adaptation to image processing. The Transformer encoder includes a multi-head attention layer and a multi-layer perceptron (MLP) with residual connections. The vision Transformer separates the image into smaller patches and adds a positional embedding to each patch according to its location. The embedded patches are then sent to the Transformer encoders for processing. (Figure from He et al. Intelligent Medicine 2023)
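The patch-embedding and self-attention steps described above can be sketched in a few lines of PyTorch; the patch size, embedding dimension and head count below are illustrative, and layer normalization is omitted for brevity.

import torch
import torch.nn as nn

# Split a 64x64 slice into 8x8 patches and project each to a 128-d token;
# a strided convolution performs split + flatten + project in one step.
img = torch.randn(1, 1, 64, 64)
patch, dim = 8, 128
to_tokens = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 64 patches, 128)
pos = torch.zeros(1, tokens.shape[1], dim)           # positional embedding (learnable in practice)
tokens = tokens + pos

# Multi-head self-attention lets every patch attend to every other patch
# (global dependencies); the per-patch MLP then transforms each token
# independently (local patterns). Both use residual connections.
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
x = tokens + attn(tokens, tokens, tokens)[0]
x = x + mlp(x)
print(x.shape)   # torch.Size([1, 64, 128])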
There have already been many applications of Transformers in medical imaging in the past few years. For example, Zhang et al.[29] applied a Transformer to classify COVID-19 from chest CT images. Hatamizadeh et al.[30] developed a U-shaped encoder-decoder structure with a Transformer to segment brain tumors. Feng et al.[31] built a multi-task Transformer framework for joint MRI reconstruction and super-resolution. Compared to CNNs, Transformer-based models have the following advantages. First, Transformers are able to capture both short-range and long-range dependencies, while CNNs focus on local features. Second, Transformers can adaptively adjust the receptive field, while a CNN usually has a fixed receptive field. Third, Transformers can be made interpretable through the explicit self-attention mechanism and attention weightings[32]. However, since Transformers do not use the inductive biases of locality and translation invariance like CNNs, they need more training data to achieve their best performance. Also, their superiority over CNNs for various CV tasks has not been comprehensively and rigorously validated, which is an ongoing research direction.
1.3.2 Diffusion Model
A diffusion model[33] or score-based model[34] is a type of generative model that produces images from random noise by iteratively denoising with learned conditional probabilities. It is usually composed of two main steps: a forward step and a backward step. In the forward step, random Gaussian noise is incrementally added to an image through a predefined diffusion process or stochastic differential equation (SDE) over several steps, eventually transforming the image into a tractable target distribution, typically an isotropic Gaussian distribution. In traditional diffusion models, training focuses on estimating the conditional probabilities of the reverse process using maximum likelihood estimation (MLE). This helps the model learn a mapping from noisy data to clean data. Score-based models, on the other hand, take a different approach by directly estimating the gradient of the log data distribution (the score function) using a neural network. The score function is then employed to numerically solve the reverse SDE to generate new samples. To generate new data, the model samples an initial point from the prior distribution and applies either denoising steps or the reverse SDE with the predictor-corrector algorithm[34]. Figure 1.2 illustrates a framework using a score-based diffusion model for MRI reconstruction.

Figure 1.2: Illustration of the workflow of a score-based diffusion model used to reconstruct an undersampled MRI dataset. (Figure from Chung and Ye. Medical Image Analysis 2022)
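As a concrete illustration of the forward step and the denoising objective, the sketch below implements one DDPM-style training step under a linear noise schedule; `model` stands for any network taking a noisy image and a time step, and the schedule values are commonly used defaults rather than settings from this work.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction

def ddpm_training_step(model, x0):
    # Sample a random time step and noise the clean image x0 in closed form:
    # x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
    # Train the network to predict the noise that was added (simple MSE loss).
    return torch.mean((model(x_t, t) - eps) ** 2)

Generation then reverses this process, starting from pure Gaussian noise and applying the learned denoiser step by step.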
Diffusion models have also been explored in several applications in medical imaging. Chung et al.[35] developed a score-based MRI reconstruction scheme for undersampled data that shows stable performance across different sampling patterns. Lyu et al.[36] utilized a diffusion model to perform modality transformation, obtaining CT images from T2w MR images. Kim et al.[37] introduced a novel diffusion-based method for image registration that uses a diffusion model to estimate the deformation field and a spatial transformation layer to generate the coregistered image. Diffusion models have some major advantages. First, training a diffusion model is more stable than training generative adversarial networks (GANs), which may suffer from the adversarial nature of the generator/discriminator and from mode collapse. Second, diffusion models allow for progressive denoising through a sequence of reverse steps, making the generation process more controllable. This can be leveraged for techniques like in-painting, super-resolution and conditional generation. Third, since the model learns the distribution of the data, a diffusion model can be used to estimate the distribution and the uncertainty of the output by Monte Carlo simulation. This property can be used to assist the diagnosis of disease or the quantification of lesions with indicated levels of uncertainty. There are still some limitations to this method. First, the generation of new samples takes longer, since it involves a number of iterative steps, typically 1000. Second, due to the multi-step nature of the diffusion process, diffusion models are computationally intensive. Both training and sampling require running through many diffusion steps, which can lead to high computational costs and longer training times compared to other generative models. However, these limitations do not diminish their capability to generate high-quality data, making the application of diffusion models in medical imaging an emerging and promising field for current research.
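The Monte Carlo uncertainty estimation mentioned above can be sketched in a few lines: repeat the stochastic sampling for the same input and summarize the samples voxel-wise. Here `sample_fn` is a hypothetical stand-in for a full reverse-diffusion sampling loop.

import torch

def mc_uncertainty(sample_fn, condition, n: int = 20):
    # Each call starts from fresh noise, so samples differ; their spread
    # reflects the model's uncertainty about the output.
    samples = torch.stack([sample_fn(condition) for _ in range(n)])
    return samples.mean(dim=0), samples.std(dim=0)   # estimate, uncertainty map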
1.3.3 Pretrained Foundation Model
The pretrained foundation model is another emerging area of DL[38]. It refers to a DL model that is trained on massive datasets to learn a general representation of the data, which can then be adapted to a variety of downstream tasks, as depicted in Figure 1.3.

Figure 1.3: Illustration of the foundation model. Training data come from broad sources, and the model can be adapted to a number of downstream tasks. (Figure from Bommasani, Rishi, et al. arXiv 2021)
Some well-known pretrained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pretrained Transformer (GPT) for language processing and Stable Diffusion for image generation, have all proven to work well on many downstream tasks such as language translation and prompted image generation. Foundation models have several characteristics. First, a foundation model can be trained on broad data with self-supervised learning techniques, which do not require data annotation by hand. This increases the amount of data that can be used for training, which generally improves the capability of the model. Second, the foundation model is usually trained to learn a general task, such as image generation or image restoration, that can then serve a more specific downstream task like object detection or segmentation. Third, foundation models offer the flexibility to be adapted to a different task or domain via an adaptation step, typically fine-tuning, which can achieve better performance than training a new model from scratch when there is limited data available for the specific task.
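A minimal sketch of this adaptation step, assuming a generic torchvision backbone as the pretrained model (any pretrained encoder could take its place): freeze the general representation and train only a small task-specific head.

import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")       # stand-in pretrained model
for p in model.parameters():
    p.requires_grad = False                     # keep the learned representation
model.fc = nn.Linear(model.fc.in_features, 2)   # new head for a 2-class task
# During fine-tuning, only model.fc.parameters() are passed to the optimizer.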
In the case of medical imaging, there are challenges in directly transferring a model trained on natural images, such as those in ImageNet[39]. There are many differences between medical images and natural images. For example, in the case of classification, the diagnosis of a medical image is usually made from a local variation of texture, while a natural image has a clearer global subject. The value of transfer learning in medical imaging is still under study. A preliminary study showed that a foundation model trained on 3D chest CT scans can be transferred and performs well on other tasks, even brain tumor segmentation with MRI, better than learning from scratch[40]. This suggests that pretraining a foundation model on medical images, especially in 3D, can be beneficial for the performance of downstream tasks in medical imaging.
1.4 Current Deep Learning Applications in medical imaging and ASL
Tasks in medical imaging share some properties with regular CV tasks, such as lesion segmentation, image denoising, image super-resolution, etc. However, there are also some differences in DL application between medical imaging and natural images. First, medical image datasets are usually small (on the order of hundreds or thousands of images) compared to regular image datasets (usually more than 10K). Since DL methods usually benefit from more training data, some strategies need to be used when generating datasets for medical imaging. These include training on image or volume patches rather than full-sized images[41, 42], and data augmentation techniques that enlarge the training dataset through transformations of the original data[41, 43]; a sketch of both is shown after this paragraph. Another possible solution is to apply transfer learning techniques to transfer models trained on large regular image datasets to the specific task[39, 44]. Second, medical images acquired from different scanners (i.e., modalities, systems), or even the same scanner with different imaging protocols, may show different properties such as image contrast or intensity scale[45, 46]. This means models trained on one dataset may not work for a dataset from another site, so it is essential to consider the generalization ability of DL models in medical imaging, and evaluation of performance should always involve a different dataset. Third, while image contrast alone is sometimes enough for diagnosis, quantitative values in some imaging modalities have clinical meaning or impact on decision making, such as the Hounsfield unit for CT imaging and the apparent diffusion coefficient (ADC) for diffusion MRI. Special care needs to be taken to ensure the DL algorithm does not induce systematic bias in the images, which could weaken diagnostic results or change patterns in populations. Last but not least, although other fields have benefited from large pretrained models, it would be hard to train a large foundation model for medical images. Although thousands of medical images are acquired every day, most involve private patient information, which makes them hard to make publicly available. One way to overcome this may be federated learning, which updates the model at each site without the need for data sharing. Also, medical images from different datasets may contain population or systematic biases, which may affect performance when transferring to a target task. Nevertheless, preliminary studies[47, 40] of foundation models for medical imaging show potential applications and superiority over methods that learn from scratch.
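The sketch referenced above shows the two dataset-enlarging strategies in NumPy: random 3D patch extraction and simple flip/rotation augmentation. The patch size and transform choices are illustrative.

import numpy as np

def random_patch(vol: np.ndarray, size: int = 64) -> np.ndarray:
    # Crop a random cube from a 3D volume; drawing many patches per volume
    # effectively multiplies the size of a small training set.
    z, y, x = (np.random.randint(0, s - size + 1) for s in vol.shape)
    return vol[z:z + size, y:y + size, x:x + size]

def augment(patch: np.ndarray) -> np.ndarray:
    # Random flip along one axis plus a random in-plane rotation.
    if np.random.rand() < 0.5:
        patch = np.flip(patch, axis=np.random.randint(3))
    return np.rot90(patch, k=np.random.randint(4), axes=(1, 2)).copy()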
As previously mentioned, the greatest challenge for ASL is its low SNR. Therefore, the main tasks for ASL data processing are denoising the perfusion images and precise quantification. ASL has some special characteristics compared to other MRI modalities. First, although the primary outcome of an ASL acquisition is a perfusion-weighted image, the raw data is obtained in a 4D fashion, as several 3D volumes of paired control and label images, often acquired along with a proton density image (M0) as a reference for CBF quantification. ASL images therefore contain both spatial and temporal information, which can potentially be utilized in post-processing. There have been studies developing spatiotemporal denoising and reconstruction of ASL images with spatiotemporal regularizers[48, 49]. However, applying DL algorithms to spatiotemporal denoising would be computationally expensive and hard to train with limited data. Current studies on ASL denoising have mostly focused on image-to-image translation[50, 51, 52] and have not fully utilized the temporal information. Second, quantification of CBF and other ASL parameters is usually done through a model fitting process. The models used for fitting are well established and follow the dynamics of the T1 relaxation of blood and tissue. However, a theoretical model may not always be correct for in-vivo data, which may contain various kinds of noise and artifacts. Also, the calculation of CBF may be time-consuming, especially for multi-delay ASL. Therefore, some studies have investigated applying DL algorithms to bypass the model fitting steps and directly obtain quantitative maps from the acquired perfusion images[53, 54].
Chapter 2
Characteristics of ASL imaging
Before discussing how deep learning can be adapted to improve the reliability and accuracy of clinical applications, it is worth introducing the characteristics of ASL imaging. ASL is still an emerging field and has been developing over the past decades. With improvements both in hardware (higher magnetic field strength, better coil sensitivity, etc.) and in pulse sequence development (the use of pseudo-continuous ASL for higher labeling efficiency, segmented 3D GRASE readout for higher SNR, etc.), image quality has been continuously improving. However, compared to other widely used modalities, such as T1w and T2w imaging, there are still challenges in using ASL for clinical studies. In this chapter, some recent developments in ASL imaging are briefly reviewed. First, the process of ASL acquisition is introduced, including labeling methods, readout methods and the importance of background suppression. Next, ASL image processing methods and CBF quantification are covered. Finally, some recent research on cutting-edge applications of ASL beyond conventional brain ASL is reviewed. From these perspectives, it will become clear how deep learning can improve the efficiency and image quality of ASL imaging.
2.1 Image acquisition
Most modern MRI sequences can be divided into two parts: signal preparation or signal encoding, and signal readout. For ASL, signal preparation refers to the labeling part, where one or multiple RF pulses are applied to tip down the spins in the blood that later flows into the tissue of interest. Paired labeled and unlabeled images are acquired consecutively, and the perfusion signal is calculated by subtracting the blood-labeled signal from the unlabeled (control) signal. Since it takes time for labeled blood to travel from the labeling area to the tissue, and the labeled spins undergo T1 relaxation during this period, the final acquired perfusion signal is only about 1% of the tissue signal in the human brain, which is close to the noise level. Therefore, the goal for improvement at the image acquisition stage is to maximize the signal and minimize the noise.
2.1.1 Labeling method
There are three main labeling methods for ASL: continuous or pseudo-continuous labeling, pulsed labeling, and velocity-selective labeling. Among these, pseudo-continuous labeling and pulsed labeling are the most widely used in clinical diagnosis, while velocity-selective labeling is still under development and will not be discussed in the following sections.
Pulsed ASL (PASL) and pseudo-continuous ASL (pCASL) differ in the labeling region and labeling duration, which results in different signal properties. In PASL, a bolus of blood water outside the imaging volume is labeled with a single inversion RF pulse, typically as short as 10 to 20 ms. The inversion time (TI) is defined as the gap between the inversion pulse and the readout, during which the labeled blood flows into the imaged tissue. In pCASL, a series of small selective RF pulses is applied at a thin plane, called the labeling plane, which is usually placed perpendicular to the feeding arteries; these pulses tip down the spins in the blood as it flows across the plane. This is achieved with a special flow-driven adiabatic pulse[55]. The total labeling duration of this pulse train is typically 1 to 2 s. The post-labeling delay (PLD) is defined as the gap between the end of the labeling pulses and the readout, analogous to the definition of TI.
Figure 2.1: Signal evolution for pulsed ASL and pseudo-continuous ASL. (Figure from Michael Chappell, Bradley Macintosh and Thomas Okell, Introduction to Quantification using Arterial Spin Labeling.)
The SNR of pCASL is typically higher than that of PASL for two main reasons, as depicted in Figure 2.1. First, the total amount of labeled blood is larger for pCASL than for PASL: the labeling volume for PASL is limited by the transmit RF coverage, and the tail of the labeling bolus decays more by the time it reaches the imaging volume, whereas for pCASL all blood experiences the same inversion as it passes through the labeling plane. Therefore, the area under the arterial input function (AIF) for PASL is smaller than for pCASL, resulting in less labeled blood and less signal, as shown in Figure 2.1(a). Second, in PASL the inversion of the blood starts to decay right after the labeling pulse, while in pCASL freshly labeled blood is continuously supplied at the labeling plane, so the average T1 decay for pCASL is less than for PASL, resulting in an overall higher signal at the readout, as shown in Figure 2.1(c).
For these reasons, pCASL is the recommended workhorse for clinical applications of ASL[12]. However, PASL is still used in some product sequences and in specific cases such as spinal cord perfusion imaging[56] and ultra-high field ASL[57].
2.1.2 Readout method
Readout methods can be grouped into 2D and 3D methods: 2D methods excite single or multiple slices sequentially to build up the final image volume, while 3D methods excite the whole imaging volume and resolve the slice information through encoding. For 2D readout, a simultaneous multi-slice (SMS) approach is usually combined with echo planar imaging (EPI) to achieve a single-shot acquisition[58]. This method has the advantage of being less sensitive to motion artifacts, which is useful when motion is of great concern, such as in pediatric imaging. In comparison, a 3D readout excites the whole 3D volume and thus needs to acquire a 3D k-space to resolve the slice information. Since the signal decays during the readout period, it is usually not possible to acquire the whole k-space in a single shot without acceleration techniques. Instead, multiple repetition times (TRs) are needed to fill the whole k-space, which is known as a segmented readout. Motion that occurs between these segments is difficult to correct and results in image blurring.
Figure 2.2 demonstrates the difference between ASL images acquired with 2D and 3D readouts: the 3D readout yields clearly higher SNR than the 2D readout. For this reason, even with the concern about motion artifacts, the 3D readout is preferable for its better SNR efficiency. Moreover, since a 3D segmented readout involves a single excitation per TR, it is compatible with background suppression, which will be discussed in the following section. For these reasons, 3D readout is used as the readout method in clinical and in some research protocols.
Figure 2.2: (a) 2D versus (b) 3D readout ASL imaging in a normal subject. Both images were
acquired with approximately 5 min of imaging at 3T with pCASL labeling (label duration
of 1.5 sec and a post-label delay of 2 sec). The 2D readout method was a single-shot
gradient echo spiral. The 3D readout was a segmented stack-of-spirals fast spin echo. Note
the artifacts associated with the 2D single shot method in regions of high susceptibility
(arrows). (Figure from Alsop, David C., et al. Magnetic resonance in medicine 73.1 (2015):
102-116.)
2.1.3 Background suppression
Background suppression (BS) is another important component of the ASL sequence. Since the perfusion signal is only about 1% of the brain tissue signal, fluctuations in the background tissue, including thermal and physiological noise, can have a large impact on the resulting perfusion signal. BS uses one or multiple inversion RF pulses to encode spins of different T1 values, so that the longitudinal magnetization of spins within a selected T1 range is close or equal to zero at the time of readout. The optimal timings of the BS pulses can be found by least-squares minimization of the sum of residual signals across the T1 range at the time of readout. Taking a BS scheme with two BS pulses as an example, the optimization objective can be expressed as:
as the example, the objective for optimization can be expressed as the following equation:
min
t1,t2
Xn
i=1
(f (T1i
;t1, t2))2
(2.1)
where $t_1$ and $t_2$ are the timings of the first and second inversion pulses, and $f(T_{1i}; t_1, t_2)$ is the residual longitudinal signal at the time of readout for a tissue with relaxation time $T_{1i}$.
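As a concrete illustration of this optimization, the sketch below numerically minimizes the objective in Equation 2.1 with SciPy. It assumes an initial saturation pulse at t = 0, ideal inversion pulses, and purely illustrative readout time and T1 range; none of these values are taken from the protocols used in this work.

```python
import numpy as np
from scipy.optimize import minimize

def residual_mz(t_inv, T1, t_readout):
    """Longitudinal magnetization at readout for one T1 value, assuming an
    initial saturation (Mz = 0) at t = 0 and ideal inversion pulses."""
    mz, t_prev = 0.0, 0.0
    for t in np.sort(np.asarray(t_inv)):
        mz = 1.0 + (mz - 1.0) * np.exp(-(t - t_prev) / T1)  # relax toward M0 = 1
        mz = -mz                                            # ideal 180-degree inversion
        t_prev = t
    return 1.0 + (mz - 1.0) * np.exp(-(t_readout - t_prev) / T1)

def bs_cost(t_inv, T1_range, t_readout):
    # Sum of squared residual signals across the T1 range (Eq. 2.1)
    return sum(residual_mz(t_inv, T1, t_readout) ** 2 for T1 in T1_range)

t_readout = 3.0                        # readout time in seconds (illustrative)
T1_range = np.linspace(0.7, 1.5, 20)   # roughly WM-to-GM T1 values at 3T (s)
res = minimize(bs_cost, x0=[1.0, 2.5], args=(T1_range, t_readout),
               bounds=[(0.05, t_readout - 0.05)] * 2)
print("optimized inversion times (s):", np.sort(res.x))
```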
The T1 range for optimization should be chosen according to the application. For example, for human brain ASL at 3T, the gray matter and white matter T1 values are the most important targets for BS. In other cases, such as spinal cord ASL, where fluctuation of the cerebrospinal fluid (CSF) signal has a great impact, the CSF T1 should be a main objective for BS. BS can be incorporated into both PASL and pCASL schemes. In PASL, BS pulses can only be placed between the initial labeling pulse and the readout; in pCASL, BS pulses can be placed either during the labeling or in the PLD. If one or more BS pulses are placed during the labeling, the scheme is called unconstrained BS, while a scheme in which all BS pulses are placed during the PLD is called constrained BS[59]. Constrained BS is easier to implement because it does not interrupt the labeling pulses, but it can only be used when the PLD is long enough to fit the BS pulses. For a short PLD, unconstrained BS should be considered in order to achieve better suppression performance. Since a BS inversion also flips the labeled blood, the label/control scheme should also be flipped when a BS pulse is executed during the labeling. The background suppression schemes for ASL are illustrated in Figure 2.3.
Figure 2.3: The timing diagram for background suppression: a. an unconstrained scheme, b. a constrained scheme and c. another unconstrained background suppression scheme interleaved with CASL. (Figure from Nasim Maleki, Weiying Dai and David Alsop. Magnetic Resonance Materials in Physics, Biology and Medicine 25 (2012): 127-133.)
2.2 Processing Methods
So far, we have discussed improvements to ASL at the image acquisition stage. The image quality using the recommended scan parameters on a regular 3T scanner is reasonable for a 5-minute scan at a relatively low resolution of 3 × 3 × 3 mm³. Although the SNR can be improved with a larger number of repetitions, the added scan time would be a greater concern for clinical applications of ASL. Therefore, post-processing of ASL is usually necessary to improve image quality.
Both non-DL and DL methods exist for post-processing. Non-DL methods usually depend on a predefined image model or use temporal information from the acquired image series. For example, total generalized variation (TGV) based ASL denoising[60] uses an image model that enforces smooth image texture as well as smooth transitions at the edges. The objective function can be described as Equation 2.2:
$$\mathrm{TGV}(\alpha_1, \alpha_0; u) = \min_{v} \left\{ \alpha_1 \cdot \|\nabla u - v\|_1 + \alpha_0 \cdot \|\mathcal{E}v\|_1 \right\} \qquad (2.2)$$
where $u$ is the image to be denoised and $v$ is the smooth part within that image; $\nabla u - v$ is the gradient remainder, $\mathcal{E}v = \frac{1}{2}\left(\nabla v + (\nabla v)^T\right)$ is the symmetric gradient, and $\alpha_0$ and $\alpha_1$ are the weights of the two penalty terms.
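Full TGV denoising requires a dedicated primal-dual solver; as a rough, hedged illustration of this family of model-based denoisers, the sketch below applies plain first-order total variation denoising with scikit-image instead (the data are synthetic, and the `weight` parameter plays the role of the regularization weights above).

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

# A noisy perfusion slice (synthetic here; replace with real data).
noisy = 0.5 + 0.1 * np.random.randn(96, 96).astype(np.float32)

# First-order TV denoising; `weight` controls the smoothness of the result.
# Full TGV adds a second-order term (the symmetric gradient Ev above), which
# avoids the staircasing artifacts that plain TV can introduce in smooth regions.
denoised = denoise_tv_chambolle(noisy, weight=0.05)
```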
K-space weighted image average (KWIA)[61] is another ASL denoising method; it does not depend on an image model but instead uses temporal information shared between frames. Since the image contrast is stored in the k-space center while the noise resides mostly in the high-frequency region, denoising can be performed by averaging the outer rings of k-space across frames while keeping the center. This technique has been successful in CT perfusion image denoising[62] as well as in multi-delay ASL denoising[61].
2.3 CBF Quantification
In the previous sections, we discussed how ASL images are acquired and processed. Although perfusion-weighted images can already provide useful information for clinical diagnosis, quantification standardizes the meaning of ASL images across platforms and scanner types, and potentially makes ASL a useful biomarker for some diseases. The quantification of CBF involves fitting the acquired perfusion signal to a predefined ASL kinetic model[63]. In single-delay ASL, since only one point on the kinetic curve is sampled, some assumptions must be made for quantification. For pCASL, the PLD has to be longer than the arterial transit time (ATT), so that at the time of image acquisition all labeled blood has arrived in the imaging volume. For PASL, the time between the QUIPSS II[64] saturation pulse and the readout (TI − TI1) needs to be larger than the ATT for the same reason. Another assumption is that after the blood has arrived in the tissue, the T1 relaxation is governed by the blood T1 rather than the tissue T1. Under these assumptions, CBF can be calculated with the following equations:
$$\mathrm{CBF}_{\mathrm{pCASL}} = \frac{6000 \cdot \lambda \cdot (SI_{\mathrm{control}} - SI_{\mathrm{label}}) \cdot e^{\mathrm{PLD}/T_{1,\mathrm{blood}}}}{2 \cdot \alpha \cdot T_{1,\mathrm{blood}} \cdot SI_{\mathrm{PD}} \cdot \left(1 - e^{-\tau/T_{1,\mathrm{blood}}}\right)}\ [\mathrm{ml/100\,g/min}] \qquad (2.3)$$

$$\mathrm{CBF}_{\mathrm{PASL}} = \frac{6000 \cdot \lambda \cdot (SI_{\mathrm{control}} - SI_{\mathrm{label}}) \cdot e^{\mathrm{TI}/T_{1,\mathrm{blood}}}}{2 \cdot \alpha \cdot \mathrm{TI}_1 \cdot SI_{\mathrm{PD}}}\ [\mathrm{ml/100\,g/min}] \qquad (2.4)$$
where λ is the brain/blood partition coefficient in ml/g, SI_control − SI_label is the averaged perfusion signal, τ is the labeling duration, and α is the labeling efficiency, which measures how completely the spins in the blood have been inverted. If BS is used, its effect should also be included in the calculation of α. Note that SI_PD is a proton density image that needs to be acquired separately; typically it is acquired with a long PLD/TI and no BS pulses.
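The single-delay quantification in Equations 2.3 and 2.4 is straightforward to implement. The sketch below uses the parameter values recommended by the consensus paper at 3T (λ = 0.9 ml/g, T1,blood = 1.65 s, α = 0.85 for pCASL and 0.98 for PASL) as defaults; these should be adapted to the actual protocol, and in practice a brain mask should be applied to avoid dividing by near-zero M0 values.

```python
import numpy as np

def cbf_pcasl(control, label, m0, pld, tau, t1_blood=1.65, alpha=0.85, lam=0.9):
    """Single-delay pCASL CBF in ml/100g/min (Eq. 2.3).
    control, label, m0: image arrays; pld, tau, t1_blood in seconds."""
    dm = control - label
    num = 6000.0 * lam * dm * np.exp(pld / t1_blood)
    den = 2.0 * alpha * t1_blood * m0 * (1.0 - np.exp(-tau / t1_blood))
    return num / den

def cbf_pasl(control, label, m0, ti, ti1, t1_blood=1.65, alpha=0.98, lam=0.9):
    """Single-delay PASL (QUIPSS II) CBF in ml/100g/min (Eq. 2.4)."""
    dm = control - label
    return 6000.0 * lam * dm * np.exp(ti / t1_blood) / (2.0 * alpha * ti1 * m0)
```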
Multi-delay ASL[3] collects ASL images at multiple TIs or PLDs to sample multiple points on the ASL dynamic curve, which offers a better estimation of CBF and can also be used to estimate other parameters, such as ATT, depending on the physics model. The benefit of using multiple delays is that the ATT assumption is no longer needed; instead, ATT can be estimated from the collected data. This reduces the CBF underestimation that occurs in single-delay ASL when the ATT is longer than the PLD or TI. Moreover, ATT is itself an important diagnostic measurement for patients with steno-occlusive diseases. However, multi-delay ASL takes longer to acquire and also needs more processing steps. More importantly, since the repetitions used in single-delay ASL are distributed over multiple PLDs, the SNR of the image at each PLD is lower. Therefore, multi-delay ASL is still not the recommended default method, but it can provide benefits in some clinical applications[65].
2.4 Beyond Conventional ASL
ASL was originally developed to measure blood flow of the whole brain. However, besides regular whole-brain ASL, other studies have utilized the idea of ASL to study brain function or blood flow in other parts of the body. For example, in a recent study[66], the authors developed a zoomed ASL technique that provides high-resolution (nominal 1 mm isotropic) measurements of blood flow in a small region of cortex, such as the visual cortex (V1) or motor cortex (M1), also known as laminar ASL. The imaging region and ASL labeling scheme are shown in Figure 2.4. This technique can be used to study the detailed laminar activity of the cerebral cortex not only in the resting state, but also during tasks such as visual stimulation.
There have also been studies applying the idea of ASL to other parts of the body, such as the liver[67] or kidneys[68]. For body ASL, B0/B1 inhomogeneity and body motion are the largest problems; they can reduce labeling efficiency and cause subtraction artifacts.
ASL at ultra-high field such as 7T[57, 69, 70] is another cutting-edge area of ASL research. Compared to regular ASL performed at lower field strengths such as 3T, ASL at ultra-high field has the advantages of both increased SNR and longer T1, which reduces the decay of the labeled blood. However, 7T has its own problems of high specific absorption rate (SAR) and worse B0/B1 field inhomogeneity, which can cause signal loss or image distortion. So far, few studies have shown convincing evidence that ASL at ultra-high field is better than regular 3T, which makes this an emerging field of research.
Figure 2.4: Demonstration of zoomed pCASL acquisition and diagrams of motor and visual tasks. A. Illustration of the imaging volume on a 3D brain surface. A small FOV (100 × 50 × 24 mm³) covering the dominant motor cortex (red) or visual cortex (blue) was acquired with zoomed GRASE. Axial, coronal and sagittal views of GRASE images are shown in the enlarged red and blue boxes, respectively. B. Illustration of the pCASL labeling plane in sagittal (yellow line) and coronal (yellow lines and shade) views. Intracranial arteries were revealed by maximum-intensity projection (MIP) of the T1w structural MRI, and the pCASL labeling plane was placed above the circle of Willis (CoW) and simultaneously perpendicular to the M3 segment of the middle cerebral artery (MCA), the P2 segment of the posterior cerebral artery (PCA) and the A2 segment of the anterior cerebral artery (ACA). (Figure from Shao et al. NeuroImage 245 (2021): 118724.)
These cutting-edge ASL applications have image properties and problems that traditional image processing methods cannot handle, raising the need for processing techniques that solve application-specific problems and improve the reliability of ASL.
Chapter 3
Improving ASL image quality with Swin Transformer-based model
This chapter introduces how DL can help with ASL denoising and how the latest model, the Swin Transformer, can improve results over convolutional neural networks (CNNs). DL is a category of machine learning that uses a series of neural network layers to mimic processing in the human brain. CNNs were proposed to learn from images with convolutional kernels, which makes the process more efficient and effective. Transformer-based models were originally proposed for sequential data such as text, to better model long-term dependencies, and were later adapted into the Vision Transformer (ViT), which has been shown to outperform CNNs in image tasks as well. There have been several works using DL to improve ASL; the two main tasks among them are ASL image denoising and end-to-end quantification. For denoising, CNN-based networks have been used with either perfusion or CBF images. However, previous studies did not utilize all repetitions acquired during the scan, which may not achieve optimal performance. On the other hand, whether Transformer-based models can improve performance over these CNN models has not been studied. Here, a Swin Transformer-based model was developed for denoising ASL images. The model was further optimized with pseudo-3D input channels for single-delay ASL denoising, and pseudo-4D input channels for multi-delay ASL denoising. The performance of the model was evaluated on an in-house test dataset, as well as on additional test datasets from different vendors.
Figure 3.1: Illustration of self-attention and multi-head self-attention. (Figure from Vaswani et al. Advances in Neural Information Processing Systems (2017).)
3.1 Swin Transformer model for image restoration
The most important component of the Transformer is the self-attention block, whose structure is shown in Figure 3.1. First, the input sequence is split into tokens and each token is embedded into a vector with an embedding layer. Next, the embeddings of the sequence are further encoded with three linear layers, whose outputs are known as the Q, K and V matrices. Attention is calculated as the scaled dot product of these three encoded matrices with the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \qquad (3.1)$$
where $d_k$ is the dimension of the Q or K vectors, used to scale the attention values across different encoding dimensions and make training more stable. The output of the attention operation is an attention-weighted sequence of the input. Multi-head attention means that several parallel attention-weighted sequences are computed at the same time and combined afterwards into a single output sequence; this enables the network to model more complex relationships in the sequence. Since attention is computed for all parts of the sequence simultaneously rather than processing each part sequentially, it handles long-term dependencies better than earlier sequence models such as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks. This is the largest advantage of Transformer-based networks.
Figure 3.2: Illustration of the Vision Transformer. (Figure from Dosovitskiy, Alexey. ICLR 2021.)
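A minimal PyTorch sketch of the scaled dot-product attention in Equation 3.1 is shown below; the tensor shapes and the toy usage are illustrative only.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k); implements Eq. (3.1)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ v                              # attention-weighted values

# Toy usage: batch of 2 sequences, 4 heads, 16 tokens, head dimension 32
q = torch.randn(2, 4, 16, 32)
out = scaled_dot_product_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([2, 4, 16, 32])
```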
When adapting the Transformer to image processing, the first major work was the ViT[28], illustrated in Figure 3.2. In ViT, the whole image is split into small patches and each patch is treated as a token, just like a word in a sentence. A positional embedding is added to the embedding of each patch to encode its position with respect to the whole image. Although this method has been successful on several benchmark tasks, its largest problem is that it does not scale well, as the space complexity of the attention calculation is quadratic in the size of the input image. To solve this problem, the Swin Transformer[26] was proposed, potentially serving as the backbone for general-purpose computer vision tasks. The key improvement of the Swin Transformer is shown in Figure 3.3. The Swin Transformer first divides the whole image into small parts called windows, and self-attention is computed within each window instead of across the whole image. In two consecutive layers, the windows are shifted by half of the window size to incorporate connections between windows. This calculation of attention in local windows makes the model more scalable to different input image sizes, as the cost of attention becomes linear in the image size instead of quadratic.
Figure 3.3: An illustration of the shifted window approach for computing self-attention in the Swin Transformer architecture. In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l + 1 (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections among them. (Figure from Liu et al. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.)
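The sketch below illustrates the window partitioning and the cyclic half-window shift that the Swin Transformer applies in alternating layers; the tensor layout and sizes are illustrative assumptions, not the SwinIR implementation.

```python
import torch

def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # (windows*B, ws*ws, C)

# In alternating Swin layers the feature map is cyclically shifted by half the
# window size before partitioning, so the new windows straddle old boundaries.
x = torch.randn(1, 8, 8, 96)                  # toy feature map, window size 4
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
wins = window_partition(shifted, ws=4)        # self-attention would run per window
print(wins.shape)                             # torch.Size([4, 16, 96])
```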
The model used in this study is the Swin Transformer for image restoration (SwinIR)[27], illustrated in Figure 3.4. This model combines CNN and Swin Transformer components. The first layer from the input is a CNN layer for shallow feature extraction. This is followed by several residual Swin Transformer blocks for deep feature extraction. The output of these blocks is sent to the last layers, which are also CNN layers, for high-quality image restoration.
Figure 3.4: The architecture of SwinIR for image restoration. (Figure from Liang et al. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.)
3.2 Pseudo3D SwinIR for single-delay ASL denoising
As introduced in the previous chapter, ASL is generally acquired with multiple pairs of control and label images, to obtain an average perfusion image with sufficient SNR, and a proton density image (M0) for quantification. Here, a flexible scheme is proposed that performs DL denoising on each individual perfusion image; all denoised perfusion images can then be averaged to obtain a perfusion image with higher SNR. Alternatively, this scheme allows fewer repetitions and a shorter scan time to achieve a perfusion image with sufficient SNR. In this study, we focused on 3D pCASL data acquired on 3T MRI scanners from different vendors. Three-dimensional acquisition is the form of ASL acquisition recommended by the consensus paper[12] and offers higher SNR compared to 2D acquisitions.
A summary of the datasets used in this study is provided in Table 3.1. The participants (total n=111, age 61±18 years, 72 females) were generally healthy without major neurologic/psychiatric disorders or severe systemic disease, and provided written informed consent.
                          Dset1             Dset2             Dset3             Dset4             Dset5           Dset6
Vendor                    Siemens Prisma    Siemens Prisma    Siemens Prisma    Philips Achieva   GE 750W         Siemens Prisma
No. subjects              55 (15 males)     20 (8 males)      10 (4 males)      10 (3 males)      10 (4 males)    6 (5 males)
Age                       69±7              37±20             65±10             69±8              71±9            35±13
No. scans                 98                35                10                10                10              10
LD/PLD (ms)               1500/2000         1800/2000         2000/2500         2000/2500         2000/2500       1800/500-2500
Resolution (mm³)          2.5×2.5×2.5       2.5×2.5×2.5       3.4×3.4×4         3.4×3.4×4         3.4×3.4×4       2.5×2.5×2.5
Matrix size               96×96×48          96×96×40          64×64×36          80×80×36          128×128×36      96×96×40
Repetitions (L/C pairs)   7                 9                 8                 5                 1               9/9/9/9/3 (delays 1-5)
No. test scans            9                 5                 10                10                10              three-fold cross-validation

Table 3.1: Details of the datasets used in this study, including the number of subjects and scans, participant age and sex, MRI scanner vendor, ASL parameters including LD and PLD, image resolution, matrix size, number of repetitions (L/C pairs) and the number of test scans.
The participants of datasets 1 and 3-5 were part of the MarkVCID Consortium study[71] at the University of Southern California (USC), Massachusetts General Hospital, and Johns Hopkins University, whereas the participants of datasets 2 and 6 underwent MRI scanning at USC.
For single-delay ASL, the training data were chosen from the first two datasets, acquired on a 3T Siemens Prisma scanner using background-suppressed 3D gradient and spin echo (GRASE) pCASL. The trained DL models were first tested on unseen data from the same cohorts (datasets 1-2) as the training data, and were also independently tested on 3D pCASL data acquired on Siemens Prisma, Philips Achieva, and GE Discovery 750 3T scanners, respectively (datasets 3-5), with different imaging parameters[72]. Each scan in the dataset was organized as a 4D matrix of perfusion images, with the first three dimensions as the image volume and the last dimension as the repetition. For training, each individual repetition served as the input and the average of all repetitions was used as the reference.
For model comparison, two other state-of-the-art CNN-based backbone architectures, DWAN[50] and ResNet[73], were investigated, along with two other Transformer-based networks, UNETR[30] and TransUNet[74]. For each backbone, the baseline model uses a 2D structure that takes a single slice as input and produces a single slice as output. Given that the perfusion image is a 3D volume, adjacent slices may provide useful information for denoising. The pseudo-3D method is a way to incorporate adjacent slices into the input[75]: instead of a single slice, a sub-volume of N slices (an odd number) containing the center slice and N−1 adjacent slices is used as input, which feeds more spatial information to the model.

Single delay               Center slice   Adjacent slices   M0
Baseline 2D                ✓              0                 ×
Pseudo3D 3 slice           ✓              2                 ×
Pseudo3D 5 slice           ✓              4                 ×
Pseudo3D 7 slice           ✓              6                 ×
2D with M0                 ✓              0                 ✓
Pseudo3D 3 slice with M0   ✓              2                 ✓

Table 3.2: Details of the input settings in the experiment. (M0 is the proton density image)
In this study, several pseudo-3D conditions were investigated (N = 1, 3, 5, 7, where N = 1 is the standard 2D model). We also investigated multi-modality input conditions, where the M0 image was included as an additional input channel[52]. Because the M0 image is usually acquired with ASL for quantification, this method does not require a separate scan or co-registration. The detailed settings of the experiments are summarized in Table 3.2. The 2D-with-M0 condition took one slice of the perfusion image and the M0 image of the same slice as input. The three-slice-with-M0 condition took three slices of the perfusion images and the center slice of the corresponding M0 image as input.
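A sketch of how such a pseudo-3D (optionally M0-augmented) input stack can be assembled is shown below; the handling of edge slices by index clipping is an assumption for illustration, not necessarily the exact pipeline behavior.

```python
import numpy as np

def pseudo3d_input(volume, slice_idx, n_slices=3, m0=None):
    """Stack the center slice and (n_slices - 1) adjacent slices as input channels.
    volume: (H, W, S) perfusion volume; m0: optional (H, W, S) proton density image.
    Edge slices are handled by clipping indices to the valid range."""
    half = n_slices // 2
    idx = np.clip(np.arange(slice_idx - half, slice_idx + half + 1),
                  0, volume.shape[2] - 1)
    channels = [volume[:, :, i] for i in idx]
    if m0 is not None:
        channels.append(m0[:, :, slice_idx])  # M0 center slice as an extra channel
    return np.stack(channels, axis=0)         # (n_channels, H, W)

# Example: three perfusion slices plus the matching M0 slice -> 4 input channels
perf = np.random.randn(96, 96, 48).astype(np.float32)  # toy data
m0 = np.random.randn(96, 96, 48).astype(np.float32)
x = pseudo3d_input(perf, slice_idx=24, n_slices=3, m0=m0)
print(x.shape)  # (4, 96, 96)
```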
All networks were implemented with PyTorch 1.12[76] and Python 3.7, and trained on a Lambda cluster with NVIDIA 3090 GPUs. The ASL control and label images were motion corrected using SPM12 and pairwise subtracted to obtain the perfusion images. The perfusion images were averaged to obtain the reference for each scan. Since large artifacts or severe head motion might occur across the perfusion series, a quality control step was performed to ensure the training data did not include outliers: perfusion images with mean signal greater or less than the mean ± 2 standard deviations of the perfusion image time series were considered outliers and excluded from the training data (on average, less than one time frame per scan was excluded). The top and bottom 10% of the slices were cropped since they have insufficient SNR and/or often contain artifacts. The data were divided 8:1:1 into training, validation and test sets. Each perfusion image and its corresponding reference were considered an input-reference pair, resulting in a total of 104 training scans with 28443 slices and 15 validation scans with 4018 slices.
A sub-volume training strategy was used, in which a 48×48×N sub-volume randomly chosen from the image, rather than the whole image, served as the input. Data augmentation strategies included random rotation in the range of −60 to 60 degrees and random flipping along the x direction.
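The quality control rule described above can be sketched as follows; this is a minimal illustration of the mean ± 2 SD exclusion criterion, not the exact pipeline code.

```python
import numpy as np

def qc_exclude_outliers(perf_series, n_sd=2.0):
    """Drop perfusion frames whose mean signal deviates from the series mean
    by more than n_sd standard deviations. perf_series: (T, H, W, S)."""
    frame_means = perf_series.reshape(perf_series.shape[0], -1).mean(axis=1)
    mu, sd = frame_means.mean(), frame_means.std()
    keep = np.abs(frame_means - mu) <= n_sd * sd
    return perf_series[keep], keep
```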
A combined loss with $\ell_1$ and structural similarity index (SSIM) terms,
$$L(I_{\mathrm{output}}, I_{\mathrm{ref}}) = \ell_1(I_{\mathrm{output}}, I_{\mathrm{ref}}) + L_{\mathrm{SSIM}}(I_{\mathrm{output}}, I_{\mathrm{ref}}) \qquad (3.2)$$
was used as the loss function, to preserve both local features and perceptual image quality. The Adam optimizer was used with an initial learning rate of 1e-3, and the learning rate was reduced if performance had not improved after 20 epochs. The batch size was 16 for all model settings. Each model was trained for 500 epochs, and the parameters of the model that achieved the best performance on the validation set were recorded.
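A sketch of the combined loss in Equation 3.2 is given below; for brevity it uses a uniform averaging window in the SSIM term, which is a simplification of the usual Gaussian-windowed SSIM and is not claimed to match the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, c1=0.01**2, c2=0.03**2):
    """1 - SSIM with a uniform averaging window (a simplification of the
    Gaussian window). x, y: (B, 1, H, W) images scaled to roughly [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    var_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()

def combined_loss(output, ref):
    # Eq. (3.2): l1 term plus SSIM term
    return F.l1_loss(output, ref) + ssim_loss(output, ref)
```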
Note that during training, patches were used as input and output, while for testing, the whole image (96×96×N) was used. Figure A.1 illustrates the difference between training and inference with different image sizes for CNN-based networks and window-based Transformer networks. During inference, the trained CNN kernels are applied to the full input image through spatial convolution, while for the Swin Transformer, the full image is divided into non-overlapping windows and the trained self-attention weights are applied to these windows, with the window shifting mechanism enabling communication among all windows. In addition, we conducted a comparison of patch-wise and full-image training to test whether this affects denoising performance; the details of the experiments and results can be found in Appendix A.
Figure 3.5 shows the DL-denoised perfusion images and difference maps compared to the reference for different model backbones and input settings on a representative subject. The input image is an individual perfusion image calculated by subtraction of one control-label pair, and the subtitles indicate the model by which the input was processed. Compared to the input image, the DL-denoised images had higher SNR and better gray-white matter contrast. Figure 3.6 shows an example of the prediction by a three-slice SwinIR model, with six slices covering the cortex and zoomed patches. The reference was averaged over all repetitions of the original perfusion images, and the prediction was averaged over 2 repetitions, 4 repetitions, and all repetitions of the denoised perfusion images, respectively. The SNR of the output perfusion images was highest after DL processing and averaging all repetitions, which improved the SNR approximately twofold compared to the reference image.
Figure 3.5: Input, reference, and predicted perfusion images processed by the deep learning (DL) models with different settings. The DL-processed image has higher SNR compared to the input image and high similarity to the reference. The label " 2D" means the model takes 1 slice as input; " 3," " 5," and " 7" mean the model takes pseudo-3D inputs of three, five, and seven slices of the perfusion images, respectively. " 2D M0" means the model takes one slice of the perfusion image and the corresponding slice of the M0 image as input. " 3 M0" means the model takes pseudo-3D inputs of three slices of the perfusion image and the center slice of the M0 image as input.
Figure 3.6: Example with larger coverage for a representative subject. The reference is the average of all input perfusion images. Each input perfusion image was denoised by the deep learning model and averaged over different portions of the time points (2, 4, and all). More averaging results in higher SNR.
Figure 3.7 shows the comparison of similarity metrics for the different model backbones and settings. Within the same backbone, SSIM and PSNR improved as the number of slice channels increased, although the improvement from five slices to seven slices was marginal. The models with the M0 image as additional input outperformed all settings with only perfusion images as input. Among the three model backbones, SwinIR outperformed ResNet and DWAN in all model settings (p < 0.001), except that for the 2D model the SSIM of ResNet was slightly higher than that of SwinIR (p < 0.01). The detailed metrics of all models are listed in Table 3.3. The results of the comparison with UNETR and TransUNet, using three slices of perfusion images as input, are shown in Figures 3.8 and 3.9, respectively. The 3D UNETR produced blurred images compared to the reference, whereas the performance of the pseudo-3D TransUNet was inferior to that of the pseudo-3D SwinIR.
Figure 3.7: Comparison of the structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) for different model backbones and input settings. For both SSIM and PSNR, performance improves with more adjacent slices in the input channels; adding an M0 channel yields the largest improvement. The significance of differences is indicated on the bar plots (*p<0.05, **p<0.01, ***p<0.001).
Figure 3.8: Prediction and difference maps of UNETR and SwinIR (" 3" stands for using pseudo-3D inputs of 3 slices).
Figure 3.10 shows the denoising performance for perfusion images from different cohorts acquired on three MR vendors' scanners. For all three vendors, there was an increase in SNR for the denoised perfusion images. For the GE data, because the original data were already the average of three repetitions with high SNR (individual repetition data were not provided by the vendor), the improvement by DL denoising was marginal. Table 3.4 details the SNR improvement of the different model backbones with three-slice input for the three vendors. All three models improved SNR over the input.
Figure 3.9: Prediction and difference maps of TransUNet and SwinIR (" 3" stands for using pseudo-3D inputs of 3 slices).
Model NMSE SSIM PSNR
Input 0.465±0.164 0.876±0.021 22.498±0.829
DWAN 2D 0.401±0.147 0.870±0.020 23.176±0.954
DWAN 3 0.351±0.129 0.877±0.019 23.767±0.973
DWAN 5 0.315±0.118 0.881±0.019 24.217±1.001
DWAN 7 0.360±0.123 0.878±0.017 23.610±0.885
DWAN 2D M0 0.312±0.112 0.882±0.020 24.240±1.008
DWAN 3 M0 0.318±0.115 0.884±0.019 24.182±1.049
ResNet 2D 0.359±0.146 0.875±0.022 23.673±0.884
ResNet 3 0.343±0.134 0.879±0.021 23.859±0.885
ResNet 5 0.328±0.130 0.882±0.021 24.061±0.924
ResNet 7 0.327±0.131 0.883±0.020 24.071±0.935
ResNet 2D M0 0.243±0.080 0.893±0.019 25.230±0.996
ResNet 3 M0 0.238±0.073 0.892±0.018 25.304±0.935
SwinIR 2D 0.336±0.129 0.871±0.021 23.937±0.914
SwinIR 3 0.319±0.124 0.879±0.020 24.162±0.950
SwinIR 5 0.276±0.106 0.887±0.019 24.761±1.020
SwinIR 7 0.272±0.105 0.889±0.019 24.840±1.018
SwinIR 2D M0 0.228±0.069 0.895±0.019 25.456±0.978
SwinIR 3 M0 0.227±0.067 0.896±0.020 25.485±0.966
TransUNet mini 3 0.331±0.127 0.871±0.022 23.992±0.920
TransUNet original 3 0.335±0.126 0.870±0.021 23.935±0.881
UNETR original (3D) 0.378±0.117 0.858±0.025 23.369±1.101
UNETR mini (3D) 0.385±0.078 0.786±0.280 23.055±1.144
UNETR medium (3D) 0.282±0.078 0.858±0.025 24.578±1.015
Table 3.3: Comparison of similarity metrics for all model backbones and input settings.
Figure 3.10: Denoising performance of the three models for three representative cases from independent test sets from different vendors. For the GE data, the original image is the input to the networks. For the other two vendors, the reference images are averages of all input repetitions. The SNR of the images is shown above each case.
          Dataset4 (Philips)                        Dataset3 (Siemens)                        Dataset5 (GE)   Combined
Model     SNR 2        SNR 4        SNR all        SNR 2        SNR 4        SNR all        SNR all         SNR all
ResNet 3  3.509±1.077  5.005±1.607  5.582±1.701    3.109±1.308  4.278±1.935  6.379±2.705    4.398±0.457     5.453±1.394
SwinIR 3  3.108±1.229  4.476±1.594  4.988±1.580    3.104±1.281  4.391±2.017  6.485±2.789    4.561±0.210     5.345±2.034
DWAN 3    3.001±1.069  4.244±1.554  4.714±1.576    2.716±1.141  3.843±1.777  5.754±2.632    4.725±0.589     5.064±2.023
Input     2.171±0.861  3.101±1.107  3.524±1.143    2.088±0.884  2.948±1.352  4.359±1.975    4.105±0.504     3.996±1.394

Table 3.4: SNR performance for different vendors with different proportions of averages; the best performance across models is shown in bold. (SNR 2: the image is averaged over 2 repetitions; SNR 4: averaged over 4 repetitions; SNR all: averaged over all available repetitions; the last column is the SNR over all images from the 3 datasets with all available repetitions.)
ResNet achieved the best performance for the Philips data, SwinIR achieved the best performance for the Siemens data, and DWAN performed best on the GE data.
3.3 Pseudo4D SwinIR for multi-delay ASL denoising
The aforementioned DL denoising schemes for single-delay ASL data were extended to multi-delay ASL data, with three-slice and temporally adjacent perfusion images (pseudo-4D; see Figure 3.11) as input for spatiotemporal denoising. For multi-delay ASL, it is important not only to improve the SNR of each perfusion image, but also to preserve the temporal relationship across different PLDs. Therefore, we included a temporal dimension in the input to constrain the dynamic relationship between PLDs and improve the model performance in estimating quantitative maps.
Figure 3.11: The framework for the arterial spin labeling (ASL) denoising task. The input to the model is an image slice combined with several other channels, which are images from spatially adjacent slices or from temporally adjacent post-labeling delays.
In this experiment, we acquired a separate multi-delay 3D GRASE pCASL dataset (dataset 6 in Table 3.1) with five delays (500/1000/1500/2000/2500 ms) and 9/9/9/9/3 repetitions per delay; fewer repetitions were acquired for the last PLD because of time limits. As in the previous experiments, each individual perfusion image was used as input and the average of all time points
of that delay was used as the reference. The three model backbones, DWAN, ResNet, and SwinIR, were tested with either spatial input (center + two adjacent slices) or spatiotemporal input (center + two adjacent slices and two adjacent PLDs). The temporal dimension was padded to provide enough input for the first and last PLDs. We used three-fold cross-validation for training; in each fold, one-third of the subjects were left out as the test group, so that every subject was used once for evaluation. For evaluation, predicted CBF and ATT maps were fitted from the DL-processed five-delay perfusion images with 1/1/1/2/2 repetitions, which would correspond to a 5-minute scan. Reference CBF and ATT maps were fitted from the five-delay perfusion images averaged over all repetitions. The similarity metrics
Figure 3.12: An example from the multi-delay dataset. Input perfusion images were averaged with 1/1/1/2/2 repetitions for post-labeling delays (PLDs) of 500/1000/1500/2000/2500 ms. Reference perfusion images were averaged with 9/9/9/9/3 repetitions for PLDs of 500/1000/1500/2000/2500 ms. The denoised perfusion images are the deep learning predictions given the input, showing improved SNR at each PLD. "slc3" means taking three slices per PLD; "PLD1" and "PLD3" denote pseudo-3D and pseudo-4D models that use one or three PLDs as input, respectively.
between the output and reference quantitative maps were calculated.
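A sketch of assembling such a spatiotemporal (pseudo-4D) input stack is shown below; replicating edge slices and edge PLDs is an assumed padding choice for illustration, as the exact padding scheme is not specified above.

```python
import numpy as np

def pseudo4d_input(series, slice_idx, pld_idx, n_slices=3, n_plds=3):
    """Stack spatially adjacent slices and temporally adjacent PLDs as channels.
    series: (P, H, W, S) perfusion images over P post-labeling delays.
    Edge slices/PLDs are replicated by index clipping (an assumption)."""
    half_s, half_p = n_slices // 2, n_plds // 2
    s_idx = np.clip(np.arange(slice_idx - half_s, slice_idx + half_s + 1),
                    0, series.shape[3] - 1)
    p_idx = np.clip(np.arange(pld_idx - half_p, pld_idx + half_p + 1),
                    0, series.shape[0] - 1)
    chans = [series[p, :, :, s] for p in p_idx for s in s_idx]
    return np.stack(chans, axis=0)  # (n_plds * n_slices, H, W)
```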
Figure 3.12 shows denoising results for a representative multi-delay dataset. The SNR of the perfusion image at each PLD was improved by all models. Figure 3.13 shows the fitted CBF and ATT maps of two representative subjects. The input CBF and ATT maps were fitted from the five-delay perfusion images with fewer repetitions (1/1/1/2/2), and the reference maps were fitted from the five-delay perfusion images with all repetitions (9/9/9/9/3). The DL models denoised the input perfusion image of each delay, and the CBF and ATT maps were fitted from the denoised five-delay perfusion images. After DL denoising, the fitted CBF and ATT maps achieved higher SNR compared to those calculated from the input images. The models with an extra temporal denoising dimension showed better performance in the WM than the spatial-only denoising models. The red arrows in Figure 3.13 indicate spurious high ATT values in the input images that were suppressed in the spatiotemporally denoised images using three-PLD input.
Figure 3.14 shows that the spatial and spatiotemporal denoising models achieved similar performance for the CBF maps; nevertheless, the spatiotemporal models achieved significantly higher similarity for the fitted ATT maps. In all experimental settings, SwinIR outperformed ResNet and DWAN in the quantitative metrics (SSIM and PSNR) of both CBF and ATT maps.
Figure 3.13: Fitted CBF and ATT maps for two representative subjects. Red arrows show a spot where a fitting error occurs because of a spike in the input; it was resolved by the spatiotemporal denoising models but not by the spatial-only models. "slc3" means taking three slices per PLD; "PLD1" and "PLD3" denote pseudo-3D and pseudo-4D models that use one or three PLDs as input, respectively.
Figure 3.14: Similarity metrics of the predicted CBF and ATT maps relative to the reference. Performance on the CBF maps is close between spatial-only and spatiotemporal denoising, but for the ATT maps, spatiotemporal denoising outperforms spatial-only denoising. The significance of differences is indicated on the bar plots (*p<0.05, **p<0.01, ***p<0.001).
3.4 Evaluation of the results on quantification
The similarity metrics show the models' ability to map the input to the reference output. However, for clinical application, the fidelity of the model output needs to be evaluated as well. In this study, besides the similarity evaluation, whole-brain, gray matter (GM), and white matter (WM) CBF values were calculated within masks for each processing method and for the reference, to evaluate the systematic bias introduced by the models. The mean difference between the reference and predicted CBF values, along with its 95% confidence interval, was calculated. GM, WM, and whole-brain masks were segmented from the M0 image using SPM12. Because the existing literature reports test-retest variability of ASL scans on the order of 10%[77, 78, 79, 80, 81], we consider CBF and ATT biases <10% to be within the normal variation range and clinically acceptable.
Figure 3.15: Mean difference of CBF values in the whole brain, gray matter, and white matter (relative values (A) and percentages (B)) for different model backbones and input settings.
For the single-delay ASL experiment, the details of the bias analysis can be found in Figure 3.15. Figure 3.15A shows the mean difference of CBF values for the whole brain, GM, and WM using all repetitions for the different model settings. Figure 3.15B shows that for the models that did not include M0 as input, the mean CBF differences were relatively small (within 10%), whereas the models that included M0 as input had significantly higher biases in WM CBF than the other input settings, for all three model backbones. For the SwinIR models without M0 as input, the mean value in the GM was slightly increased by <5%, whereas the mean value in the WM was decreased by <10%, except for the seven-slice pseudo-3D input. Figure 3.16 shows the scatter plots of the whole-brain, GM, and WM CBF values after denoising with the three models against the reference. Overall, all three models produced high correlations with the reference CBF values (R²>0.97). Based on the above results, we chose the pseudo-3D input with three slices as the optimal input condition, balancing increased SNR against minimal bias for CBF quantification.
Figure 3.16: Scatter plots of the denoised cerebral blood flow (CBF) values against the reference CBF values in the whole brain, gray matter, and white matter for the three models, respectively. " 3" means the model takes pseudo-3D inputs of three slices.
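The bias analysis described above can be sketched as follows, computing the paired mean difference between reference and predicted regional CBF values across subjects, together with its 95% confidence interval; this is a minimal illustration, not the exact analysis code.

```python
import numpy as np
from scipy import stats

def cbf_bias(ref_vals, pred_vals, alpha=0.05):
    """Paired mean difference (predicted - reference) with a 95% CI across
    subjects, plus the bias as a percentage of the reference mean."""
    diff = np.asarray(pred_vals) - np.asarray(ref_vals)
    mean, sem = diff.mean(), stats.sem(diff)
    half = sem * stats.t.ppf(1 - alpha / 2, len(diff) - 1)
    pct = 100.0 * mean / np.mean(ref_vals)
    return mean, (mean - half, mean + half), pct
```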
For the multi-delay ASL experiment, Figure 3.17 shows the differences in the CBF and ATT values between the denoised images and the reference. For the CBF values, all models achieved relatively small biases (the largest differences in GM and WM CBF were 5.97% and 6.29%, respectively). For the ATT values, all models showed small biases in the GM (within 10%), but the spatiotemporal models performed better, with small ATT biases in the WM (about 2%), whereas the spatial-only models showed a difference of about 8% in WM ATT.
Figure 3.17: Bias analysis for different models. The CBF difference is small for all model conditions, but for ATT, the spatial-only models show a much larger difference in white matter ATT.
Chapter 4
Self-supervised learning for high-resolution pediatric choroid plexus perfusion imaging
In Chapter 3, we discussed how DL with Transformers can help improve ASL in regular clinical settings. However, as discussed in previous chapters, ASL is an emerging technique, and there are continual new technical developments in both the acquisition and the application of ASL. In these scenarios, traditional DL techniques may not fit, because there are not sufficient training data and there may not be any reference for evaluation. In this chapter, we showcase an example of how DL with self-supervised learning and transfer learning can help ASL in cutting-edge fields.
4.1 Choroid plexus perfusion imaging and accelerated acquisition
The choroid plexus (CP) consists of a monolayer of epithelial cells joined together by tight junctions to form the blood-CSF barrier (BCSFB). The main role of the CP is to produce cerebrospinal fluid (CSF), which circulates through the ventricular system and the subarachnoid space, buffering the brain and spinal cord[82]. CP blood flow, or perfusion, is an important physiological parameter for assessing the normal function of the CP as well as its aberrations in brain disorders such as hydrocephalus. By maintaining normal intracranial pressure and CSF circulation, the CP is also key to healthy neurodevelopment during childhood, when the brain undergoes rapid growth and reorganization.
CP perfusion has been studied using arterial spin labeling (ASL) in adults but not in pediatric populations[83, 84]. Limited studies have reported that CP volume increases with age from infants to children to adults, with an average size of 3 to 3.5 mm in children[85]. This imposes significant challenges on accurately measuring CP perfusion in pediatric populations: 1) existing ASL techniques generally have a coarse spatial resolution of around 4×4×4 mm³, making it difficult to visualize small brain structures due to partial volume effects; and 2) although multi-delay ASL (MD-ASL) is preferred over single-delay ASL for improved quantification accuracy, the longer scan time required for MD-ASL is difficult in children, for whom motion artifact is a great concern.
To achieve robust high-resolution MD-ASL for pediatric CP perfusion imaging, accelerated image acquisition/reconstruction and denoising techniques are necessary; these may be applied separately or jointly through denoising-reconstruction techniques. Compressed sensing[86] is a widely used technique that employs an L1 wavelet or total variation (TV) regularizer for denoising and reconstruction. In another previous study[48], the authors proposed a robust single-shot acquisition of MD-ASL with a time-dependent CAIPIRINHA undersampling pattern and spatiotemporal reconstruction with total generalized variation (TGV) regularization. However, these methods require predefined constraints for optimization and generally take a relatively long time for image reconstruction due to the iterative optimization steps. Another way to denoise multi-delay ASL images is the k-space weighted image average (KWIA)[61], which divides k-space into multiple rings and applies a progressively wider temporal window for moving averages of the peripheral k-space data to reduce noise. This method does not require predefined constraints or iterations and preserves the spatiotemporal resolution of the original image series. However, the image noise becomes temporally correlated after KWIA denoising, resulting in little improvement in the fitting results, i.e., the cerebral blood flow (CBF) and arterial transit time (ATT) maps[61].
DL methods can process noisy images efficiently and effectively once the model has been trained. However, training a DL model usually relies on paired images, with a low-SNR image as input and a high-SNR image as the reference. This makes it hard to apply DL in cases where limited or no high-SNR reference images are acquired, such as high-resolution MRI in children, where lengthy scans are difficult. While using a pretrained model may partly help with this issue, its performance may decline on a new dataset with a different data distribution, such as a different image contrast/resolution or a different age group, so fine-tuning of the pretrained model is usually warranted.
4.2 Accelerated acquisition of high-resolution multi-delay ASL
The recommended acquisition for ASL according to the consensus paper[12] uses a segmented 3D readout. In a recent study[48], the authors used a modified segmented 3D GRASE readout with a time-dependent 2D CAIPIRINHA[87] pattern to improve the SNR and robustness of acquiring multi-delay pseudo-continuous ASL (pCASL) images at isotropic 3 mm resolution. In this work, we extended the previous work with a higher acceleration factor (from ×6 to ×8) to achieve higher resolution (isotropic 2 mm) multi-delay pCASL. The undersampling pattern for each measurement is depicted in Figure 4.1. A total of 8 segments with an acceleration factor of 2 in the phase encoding (PE) (y) direction and 4 in the partition (z) direction were used in the readout to shorten each echo train. There were shifts between subsequent control/label pairs in both the phase encoding and partition directions to increase temporal incoherence and reduce aliasing artifacts in the reconstruction. This pattern enables the estimation of coil sensitivity maps directly from the combined k-space without separate calibration data, and is more robust to motion artifacts than a standard segmented acquisition[48]. Furthermore, it allows the image to be reconstructed either with parallel imaging techniques for each segment or by direct inverse Fourier transform (IFT) of the segment-combined k-space.
Figure 4.1: The acquisition and undersampling scheme. For each measurement, a time-dependent CAIPIRINHA undersampling pattern with 2×4 acceleration was applied. A full k-space can be obtained by averaging all 8 segments to reconstruct a single image or to estimate sensitivity maps, and each segment can also be reconstructed individually with the TGV method.
As the regular processing pipeline for this acquisition, a spatiotemporal TGV-constrained reconstruction was applied to the k-space data of each PLD. The reconstruction solves the following optimization problem:
$$(c^\star, l^\star) \in \operatorname*{arg\,min}_{c,\,l}\; \frac{\lambda_c}{2}\|Kc - d_c\|_2^2 + \frac{\lambda_l}{2}\|Kl - d_l\|_2^2 + \gamma_1(w)\,\mathrm{TGV}_{\alpha_1,\alpha_0,\beta}(l) + \gamma_1(w)\,\mathrm{TGV}_{\alpha_1,\alpha_0,\beta}(c) + \gamma_2(w)\,\mathrm{TGV}_{\alpha_1,\alpha_0,\beta}(c - l) \qquad (4.1)$$
where $c$ and $l$ are the control and label images for one PLD; $K$ is the forward encoding matrix, including the coil sensitivity profiles, Fourier operation and undersampling mask; and $d_c$ and $d_l$ are the acquired k-space data for control and label, respectively. $\lambda_c$ and $\lambda_l$ are weights that balance the data consistency terms; $\alpha_1$, $\alpha_0$ and $\beta$ are TGV reconstruction parameters according to[88]; and $\gamma_1(w)$, $\gamma_2(w)$ are the weights that balance the regularization of the control, label, and perfusion images. The TGV regularizer enforces piecewise-smooth images for both the control/label and the perfusion images. The TGV-constrained reconstruction method can improve the SNR of the reconstructed perfusion images; however, the reconstruction is time-consuming and demands substantial GPU memory, as it takes the k-space data of multiple coils and multiple time points into the reconstruction algorithm.
In this study, we collected a dataset of high-resolution MD-pCASL using the protocol developed above. Twenty-one typically developing children (age 13±2.5 years, 13 males) without neurological/psychiatric disorders or developmental delays were recruited under IRB approval. All subjects were scanned twice, two weeks apart, for test-retest purposes on a 3T MR system (Prisma, Siemens Healthcare, Germany) using a 32-channel head coil. The MD-ASL images were acquired with the proposed scheme, along with a separate T1w MPRAGE scan (1 mm³ isotropic).
A time-dependent CAIPIRINHA undersampling pattern with 8 segments was employed for each PLD of the multi-delay ASL protocol[48]. The complete imaging protocol was as follows: TR = 6180 ms, TE = 52.5 ms, FOV = 192×192×96 mm³, resolution of 2×2×2 mm³, matrix size of 96×96×48. A phase contrast (PC) MRA image was acquired before
ASL, and the labeling plane was placed at a straight segment of the intracranial and vertebral arteries according to the PC image to improve labeling efficiency. A labeling duration of 1500 ms and 5 PLDs (600, 1000, 1400, 1800 and 2200 ms) were used, considering the shorter ATT in younger populations[89]. Background suppression with 2 inversion pulses was used and optimized for each PLD. The scan duration for this multi-delay ASL protocol was 8 minutes and 33 seconds. The acquisition yields one complete measurement (1 control and 1 label image) per delay if all segments are averaged, or 8 measurements per delay if each segment is reconstructed separately. Segment-combined images were used as the input for denoising. The above acquisition results in a dataset containing a total of 42 scans with 5 delays and a total of 10080 slices.
4.3 Self-supervised learning with k-space weighted image average (KWIA) and Noise2Void
KWIA denoises an image within a time series by first dividing the whole k-space into several ring-shaped regions. The central k-space region uses the original data to preserve the image contrast and temporal resolution, while the outer k-space is weighted-averaged among neighboring time frames to increase SNR while preserving spatial resolution. The weighting for the averaged k-space is adjusted according to the number of neighboring time frames to be averaged. The KWIA-processed k-space data can be described by the following equation:
$$F_{\mathrm{KWIA}}(k_x, k_y, t') = \sum_{t = t' - (N-1)}^{t' + (N-1)} W_{\mathrm{KWIA}}(k_x, k_y, t, t')\, F(k_x, k_y, t) \qquad (4.2)$$
where $F_{\mathrm{KWIA}}$ and $F$ denote the k-space data after and before denoising, respectively, and $W_{\mathrm{KWIA}}$ denotes the weights for the averaged rings. Figure 4.2 illustrates the process of applying KWIA with 2 rings to 5-delay ASL data. KWIA builds on the theory that most of the image contrast and temporal information is stored in the low-frequency part of k-space, while the image resolution and fine details are determined by the extent of the k-space coverage. Therefore, averaging the outer rings while keeping the center k-space unchanged can improve the SNR while preserving the spatiotemporal resolution and contrast of the original image series. The advantage of KWIA is that it is easy to implement and does not cause much spatial blurring, unlike some traditional denoising methods such as Gaussian filtering. However, although KWIA can theoretically improve the SNR of each time-frame image by about twofold, the quantitative fitting results show little improvement, since KWIA induces temporally correlated noise and thus does not help when all images are used for fitting[61]. As a result, KWIA can only improve the SNR and the visualization of fine details for each time frame, but not the quantitative mapping.
Figure 4.2: KWIA processing pipeline. The center of k-space is kept, and the outer ring is averaged across neighboring PLDs to produce a denoised k-space and perfusion image.
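A minimal sketch of KWIA in the spirit of Equation 4.2 is shown below; it uses equal weights within each ring and a temporal window that widens by one frame per ring, which are simplifications of the actual W_KWIA weighting rather than the published implementation.

```python
import numpy as np

def kwia_denoise(kspace, n_rings=2):
    """Simplified KWIA: keep the central k-space ring, average outer rings
    over neighboring time frames. kspace: (T, H, W) complex series."""
    T, H, W = kspace.shape
    ky, kx = np.meshgrid(np.arange(H) - H / 2, np.arange(W) - W / 2, indexing="ij")
    radius = np.hypot(ky, kx)
    edges = np.linspace(0, radius.max() + 1e-6, n_rings + 1)
    out = kspace.copy()
    for r in range(1, n_rings):  # ring 0 (center) is left untouched
        mask = (radius >= edges[r]) & (radius < edges[r + 1])
        for t in range(T):
            lo, hi = max(0, t - r), min(T, t + r + 1)  # wider window for outer rings
            out[t][mask] = kspace[lo:hi, mask].mean(axis=0)
    return out  # inverse FFT each frame to obtain the denoised images
```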
Noise2Void[90] is an image-patch-based self-supervised learning method that only requires noisy images as training data. This is achieved by replacing the value in the center of
each input patch with a randomly selected value form the surrounding area, and the model
53
Figure 4.3: Blind-spot masking scheme used during NOISE2VOID training. (a) A noisy
training image. (b) A magnified image patch from (a). During N2V training, a randomly
selected pixel is chosen (blue rectangle) and its intensity copied over to create a blind-spot
(red and striped square). This modified image is then used as input image during training.
(c) The target patch corresponding to (b). We use the original input with unmodified values
also as target. The loss is only calculated for the blind-spot pixels we masked in (b) (Figure
from Krull et al. IEEE/CVF 2019)
is trained to predict the original value. Noise2Void works under the assumption that the
signal is not pixel-wise independent and the noise is conditionally pixel-wise independent
given the signal. It trains a model to predict the value of a masked pixel using only its noisy
neighboring pixels, leveraging the spatial correlations in the image. The illustration of the
Noise2Void training scheme of blind masking is shown in Figure 4.3. This method has been
shown to perform better than non-training-based methods, but worse than supervised methods, for natural images[90].
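A minimal sketch of the blind-spot masking step is shown below; the number of masked pixels and the neighborhood size are illustrative, and for simplicity the replacement value may occasionally equal the center pixel itself, which the original method avoids.

import numpy as np

def blind_spot_mask(patch, n_masked=8, neighborhood=5):
    # Returns (network input, target, loss mask) for one noisy patch.
    h, w = patch.shape
    inp = patch.copy()
    mask = np.zeros_like(patch, dtype=bool)
    r = neighborhood // 2
    for y, x in zip(np.random.randint(0, h, n_masked),
                    np.random.randint(0, w, n_masked)):
        # Replace the blind-spot pixel with a value from its local window.
        window = patch[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        inp[y, x] = np.random.choice(window.ravel())
        mask[y, x] = True
    return inp, patch, mask

# During training, the loss is evaluated only at the masked pixels:
# loss = ((prediction - target)[mask] ** 2).mean()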
Given the desirable features of KWIA denoising, including its computational efficiency, contrast preservation, and minimal spatial blurring, we proposed in this study a DL framework for denoising high-resolution MD-ASL with a Transformer-based model using KWIA-denoised images as reference. We showed that the trained model can outperform KWIA not only in the SNR of the raw perfusion images, but also in the SNR of the fitted perfusion parametric maps (CBF and ATT).
Existing works on DL ASL denoising usually trained a new model on a specific dataset. However, in recent years, pretrained models have been shown to be powerful, especially in the case
Figure 4.4: Original and denoised perfusion images of the five PLDs with different methods
including TGV, KWIA and the deep learning methods. The red arrows show the choroid
plexus signals in lateral ventricles.
where there are not sufficient training data[91, 92, 93]. In general, a pretrained model is
developed on a larger dataset with a more general task, and can then be applied to specific
downstream tasks with fine-tuning. It has been shown that a pretrained model can provide a better starting point and needs less training time. But whether the performance of fine-tuning and training from scratch will differ is not clear, especially for a specific task such
as ASL denoising. In this study, besides training a new model from scratch, we also tried
to utilize a pretrained model for lower resolution single-delay ASL denoising in adults. We
compared the performance of training a new model from scratch and fine-tuning a pretrained
denoising model in adults for pediatric perfusion MRI.
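The two training strategies can be summarized by the following PyTorch sketch; the tiny network and the checkpoint path are placeholders for the actual Swin Transformer model and the pretrained adult-ASL weights, not the code used in this study.

import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    # Stand-in for the Swin Transformer denoiser used in this work.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x)

model_scratch = TinyDenoiser()   # strategy 1: random initialization
model_finetune = TinyDenoiser()  # strategy 2: start from pretrained weights
# Hypothetical checkpoint pretrained on adult single-delay ASL:
# model_finetune.load_state_dict(torch.load("adult_asl_denoiser.pt"), strict=False)

# Fine-tuning typically uses a smaller learning rate than training from scratch.
opt_scratch = torch.optim.Adam(model_scratch.parameters(), lr=1e-4)
opt_finetune = torch.optim.Adam(model_finetune.parameters(), lr=1e-5)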
Figure 4.4 shows a representative dataset of the baseline and denoised perfusion images
of five PLDs using traditional and proposed methods. The baseline image was the IFT of the
segment-combined k-space. Both non-DL and DL methods improved the SNR compared to
55
Figure 4.5: CBF and ATT images fitted from the original and denoised multi-delay perfusion
images with different methods including TGV, KWIA and the deep learning methods. The
zoom-in panels show choroid plexus areas in the axial and coronal view.
the baseline qualitatively. The DL models trained with KWIA reference show clear perfusion
signals in the CP area, as indicated by the red arrows. Figure 4.5 shows a representative
dataset of fitted CBF and ATT maps from baseline perfusion images and denoised perfusion
images with different denoising methods. DL-based methods, including both train-from-scratch and fine-tune, show higher SNR in both CBF and ATT maps while also preserving
the image details. The zoom-in panels show the CP perfusion overlaid on T1 images, with
clear delineation of perfusion signals in the CP. Figure 4.6 shows the comparison of SNR
performance of perfusion images, CBF and ATT maps, both in the GM and CP. For GM, the
Figure 4.6: Quantitative comparison of SNR among different methods. (a) SNR of perfusion images at each PLD in GM. (b) SNR of CBF and ATT maps after fitting in GM. (c) SNR of perfusion images at each PLD in CP. (d) SNR of CBF and ATT maps after fitting in CP.
SNR PLD1 PLD2 PLD3 PLD4 PLD5 CBF ATT
Baseline 1.54±0.54 1.42±0.46 1.48±0.36 1.19±0.28 0.94±0.27 2.79±0.39 1.47±0.16
KWIA 1.84±0.64 1.90±0.63 1.99±0.51 1.65±0.39 1.14±0.33 2.73±0.40 1.44±0.18
(19%) (34%) (34%) (39%) (21%) (-2%) (-2%)
TGV 2.29±0.80 2.20±0.73 2.22±0.60 1.72±0.47 1.25±0.37 2.97±0.60 1.59±0.30
(49%) (55%) (50%) (45%) (33%) (6.5%) (8%)
Noise2Void 2.41±0.91 2.23±0.79 2.33±0.64 1.96±0.50 1.55±0.48 3.56±0.75 1.46±0.26
(56%) (45%) (57%) (65%) (65%) (32%) (-1%)
KWIA scratch 2.90±1.07 2.83±1.01 3.03±0.83 2.41±0.70 1.68±0.55 3.22±0.92 1.77±0.22
(88%) (99%) (105%) (103%) (79%) (15%) (20%)
KWIA finetune 2.86±1.09 2.83±1.09 2.96±0.88 2.33±0.67 1.65±0.53 3.19±0.87 1.80±0.25
(86%) (99%) (100%) (96%) (76%) (14%) (22%)
Table 4.1: SNR of perfusion images and fitted CBF and ATT maps in GM for the baseline and different denoising methods; the percentages show improvements relative to the baseline.
SNR PLD1 PLD2 PLD3 PLD4 PLD5 CBF ATT
Baseline 0.41±0.30 0.46±0.26 0.55±0.33 0.59±0.28 0.51±0.25 1.48±0.35 2.26±0.29
KWIA 0.46±0.33 0.56±0.30 0.65±0.39 0.72±0.31 0.58±0.29 1.45±0.36 2.29±0.34
(12%) (21%) (18%) (22%) (14%) (-2%) (1%)
TGV 0.39±0.31 0.46±0.28 0.57±0.33 0.60±0.30 0.55±0.29 1.46±0.34 2.35±0.25
(-5%) (0%) (3%) (2%) (8%) (-1%) (4%)
Noise2Void 0.44±0.41 0.52±0.34 0.64±0.44 0.73±0.38 0.62±0.36 1.40±0.35 2.56±0.47
(7%) (13%) (16%) (24%) (22%) (-5%) (13%)
KWIA scratch 0.91±0.24 0.90±0.17 0.98±0.17 0.98±0.16 0.87±0.15 1.74±0.35 2.68±0.36
(122%) (96%) (78%) (66%) (71%) (18%) (19%)
KWIA finetune 0.97±0.45 0.90±0.18 0.98±0.17 0.96±0.15 0.86±0.14 1.76±0.34 2.69±0.37
(136%) (96%) (78%) (63%) (69%) (19%) (19%)
Table 4.2: SNR of perfusion images and fitted CBF and ATT maps in CP for the baseline and different denoising methods; the percentages show improvements relative to the baseline.
performance of the KWIA-supervised train-from-scratch and fine-tune models is averaged
across five folds. All DL methods outperformed traditional methods including KWIA and
TGV-based reconstruction across all five PLDs. For CBF and ATT maps, KWIA shows
little improvement over the original, but DL-based methods trained on the KWIA-reference images show higher SNR than both the KWIA and TGV methods. For CP, both KWIA-supervised
methods outperform other methods in perfusion images as well as CBF and ATT maps.
The comparisons of SNR values in GM and CP are shown in Table 4.1 and Table 4.2
respectively. For perfusion images, KWIA can improve SNR by an average of about 30%,
which is consistent with the simulation results in a 2-ring KWIA setting[61]. DL trained
with KWIA-filtered images as reference can improve the SNR of the perfusion images by
an average of 93%. The denoising performance is relatively consistent across all five PLDs
for TGV and Noise2Void, but both KWIA and KWIA-supervised DL methods show better
performance in PLDs 2, 3 and 4 than in PLDs 1 and 5. For CBF and ATT maps, KWIA does not improve the performance much due to correlated noise between time frames,
which is consistent with previous results[61]. But the DL model trained with the KWIA reference shows an average of 14.5% improved SNR in CBF maps and an average of 21% improved SNR in ATT maps. Noise2Void yields a moderate improvement in the perfusion images and the fitted CBF maps, but does not improve the SNR of the ATT maps. In the CP area, Noise2Void does not show as much improvement as the KWIA-supervised methods.
4.4 Evaluation of Test-retest reproducibility
To evaluate the robustness of the methods across multiple visits, the intraclass correlation coefficient (ICC) for a two-way mixed-effects model with absolute agreement and a single rater[94] was calculated for the test-retest data from 21 subjects using the following equation:
\[
\mathrm{ICC} = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\left(MS_C - MS_E\right)} \tag{4.3}
\]
where $MS_R$ is the mean square for rows, $MS_E$ is the mean square for error, $MS_C$ is the mean square for columns, $n$ is the number of subjects, and $k$ is the number of measurements. The ICC was calculated to evaluate the test-retest reliability of the acquisition and the influence of the denoising algorithm. The within-subject coefficient of variation (wsCV) of the test-retest was
calculated with the following equation:
\[
\mathrm{wsCV} = \sqrt{\frac{1}{N}\sum \frac{(\mathrm{visit}_1 - \mathrm{visit}_2)^2 / 2}{\left((\mathrm{visit}_1 + \mathrm{visit}_2)/2\right)^2}} \tag{4.4}
\]
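As a sketch, both metrics can be computed from a subjects-by-visits matrix as follows; the toy CBF values are made up for illustration.

import numpy as np

def icc_two_way(data):
    # data: (n subjects, k measurements); ICC per Eq. 4.3.
    n, k = data.shape
    grand = data.mean()
    ms_r = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # rows
    ms_c = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # columns
    resid = data - data.mean(axis=1, keepdims=True) \
                 - data.mean(axis=0, keepdims=True) + grand
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))                # error
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k / n * (ms_c - ms_e))

def wscv(visit1, visit2):
    # Within-subject CV for test-retest pairs per Eq. 4.4.
    return np.sqrt(np.mean(((visit1 - visit2) ** 2 / 2)
                           / ((visit1 + visit2) / 2) ** 2))

cbf = np.array([[52.1, 50.3], [61.0, 63.2], [45.7, 47.0]])  # toy GM CBF values
print(icc_two_way(cbf), wscv(cbf[:, 0], cbf[:, 1]))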
Table 4.3 and Table 4.4 show the ICC and wsCV values of CBF and ATT in GM and CP processed with different methods.
ICC Baseline KWIA TGV Noise2Void KWIA scratch KWIA finetune
CBF
GM 0.38 0.36 0.05 0.42 0.49 0.53
CP 0.74 0.74 0.65 0.72 0.68 0.67
ATT
GM 0.63 0.63 0.56 0.60 0.58 0.54
CP 0.06 -0.04 -0.05 -0.04 0.06 0.20
Table 4.3: Intraclass correlation coefficient (ICC) of CBF and ATT in GM and CP
wsCV Baseline KWIA TGV Noise2Void KWIA scratch KWIA finetune
CBF
GM 0.08 0.08 0.12 0.08 0.09 0.09
CP 0.16 0.16 0.14 0.20 0.16 0.15
ATT
GM 0.06 0.07 0.07 0.07 0.08 0.08
CP 0.11 0.12 0.08 0.14 0.09 0.13
Table 4.4: Within-subject coefficient of variation (wsCV) of CBF and ATT in GM and CP
The ICC of the baseline CBF and ATT images shows moderate repeatability, given that the resolution is relatively high for multi-delay ASL and that the reported repeatability in the pediatric population is lower than in adults[89]. For the ICC of CBF in GM, all DL-based models show improvement in reproducibility, while the TGV method shows a large decrease in ICC. In the CP area, all methods show ICC values similar to the baseline. The ICC of ATT in CP is lower, which is due to the small inter-subject variation. The wsCV results show comparable within-subject reproducibility for all methods relative to the baseline.
Chapter 5
Generating M0 enables CBF quantification for ADNI
In the previous chapters, we have utilized Transformer-based models for ASL denoising.
Recent advances in generative models have brought significant breakthroughs in medical imaging. One of the greatest potentials of generative models is that they can help create and share medical data via synthetic datasets without privacy issues. On the other
hand, generative models can also be used to perform regular discriminative tasks such as
image-to-image translation[36] and image reconstruction[35] with conditional generation. In
this chapter, a conditional latent diffusion model is developed and utilized to generate missing
M0 data in the ADNI-3 dataset to enable CBF quantification to address heterogeneity across
vendors. As we have discussed in chapter 2, the quantification of CBF using ASL data needs
both perfusion data and M0 data. In cases where M0 data were not acquired, M0 generated from the acquired control image may enable the quantification of CBF.
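For context, the single-compartment PASL quantification recommended by the ASL consensus paper[12] can be sketched as follows, showing where M0 enters; the parameter defaults are illustrative and not necessarily the exact values used for the ADNI processing, and the helper name is hypothetical.

import numpy as np

def pasl_cbf(delta_m, m0, ti=2.0, ti1=0.8, t1_blood=1.65, alpha=0.98, lam=0.9):
    # Single-compartment PASL (QUIPSS II) CBF in ml/100g/min; delta_m is the
    # control-label perfusion difference and m0 the calibration image.
    m0 = np.maximum(m0, 1e-6)  # guard against division by zero outside the brain
    return 6000.0 * lam * delta_m * np.exp(ti / t1_blood) / (2.0 * alpha * ti1 * m0)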
5.1 Conditional latent diffusion model
5.1.1 Denoising diffusion probabilistic model (DDPM)
A DDPM or diffusion model is a type of generative model that has gained attention for its
ability to generate high-quality data, particularly in tasks such as image synthesis[33]. The
core idea behind diffusion models is to gradually transform data into noise and then learn
how to reverse this process to recover the original data. The algorithm involves two stages: a forward diffusion process, where noise is added to the data step-by-step over
a sequence of time steps, and a reverse process, where the model is trained to denoise the
data at each step to gradually reconstruct it. The forward process can be expressed by the
following equation:
\[
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, (1 - \alpha_t) I\right) \tag{5.1}
\]
where $q(x_t \mid x_{t-1})$ is the conditional probability of the data at step $t$ given the previous state, $\alpha_t$ is a predefined noise schedule that controls the noise added to the image, and the notation $\mathcal{N}$ indicates that each step follows a Gaussian distribution. One useful property of this
definition is that the sample at any time point t can be sampled directly from the input with
the following equation:
\[
q(x_t \mid x_0) = \mathcal{N}\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right) \tag{5.2}
\]
where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ is the product of the noise schedule up to time step $t$.
The reverse step can be expressed by the following equation:
\[
p(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right) \tag{5.3}
\]
where $p(x_{t-1} \mid x_t)$ is the learned conditional probability distribution that predicts the previous step from step $t$, and $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are the mean and variance of the previous step $x_{t-1}$ learned by the model. The model is typically trained using a variational lower
bound on the data likelihood. One common approach is to simplify the training objective
by making the model predict the added noise at each time step, using a mean squared error
(MSE) loss between the predicted noise $\epsilon_\theta(x_t, t)$ and the actual noise $\epsilon$ added in the forward step[33]:
\[
L_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[\,\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2\,\right] \tag{5.4}
\]
In other words, the model learns to remove the noise at step $t$ to recover the previous step $x_{t-1}$. By removing the noise step by step, the model can ultimately generate new
data samples by sampling from noise and denoising it iteratively. Unlike GANs, which rely
on adversarial training, diffusion models use likelihood-based approaches and have shown
promise in generating detailed, high-quality images with better training stability.
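A minimal PyTorch sketch of one training step under the simplified objective (Eq. 5.4) is given below; the convolutional noise predictor and the linear noise schedule are placeholders for the actual network and schedule.

import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t from Eq. 5.2

class NoisePredictor(nn.Module):
    # Placeholder for the denoising network epsilon_theta.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, x_t, t):
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand_as(x_t)  # crude t embed
        return self.net(torch.cat([x_t, t_map], dim=1))

model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=2e-6)

x0 = torch.randn(4, 1, 64, 64)                 # a batch of training images
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # forward process, Eq. 5.2
loss = ((eps - model(x_t, t)) ** 2).mean()     # L_simple, Eq. 5.4
opt.zero_grad(); loss.backward(); opt.step()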
5.1.2 Latent diffusion model for conditional M0 generation
A latent diffusion model (LDM) is a type of diffusion model that applies the diffusion process in a lower-dimensional latent space, rather than directly on high-dimensional data such as whole images. This approach allows for efficient and scalable generation of complex data, while retaining high fidelity and flexibility in the generative process. Another useful property is that it enables a flexible conditioning mechanism with cross-attention modules. The conditional latent diffusion model in this study is illustrated in Figure 5.1a. It consists of an encoder, a diffusion module in the latent space, and a decoder. Prior to
the forward diffusion process, the input image (M0) is first encoded into the latent space to
reduce image dimension while not losing the perceptual content of the original image. Random Gaussian noise is added to the latent-space image for T steps (T=1000 in this study)
to generate the intermediate representations until the data is transformed to a sample that
follows Gaussian distribution. The reverse denoising process is modelled using a trainable
network to iteratively recover the latent-space image in the previous step from these intermediate representations. The final output of the reverse diffusion process is fed to the decoder
to convert the latent-space image to the image in the original space. This encoder-decoder
pair was pretrained as in[95] and the model weights were directly adopted in this study. The
conditioning mechanism is shown on the right side of Figure 5.1a. The condition (control
image) is first encoded with the same encoder and its representation in the latent space is
concatenated with the latent-space image in the intermediate representations of M0 in each
step of the reverse diffusion process to perform the conditional probability propagation.
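The conditioning step can be sketched as follows; the single strided convolution stands in for the pretrained encoder, and the shapes are illustrative only.

import torch
import torch.nn as nn

encoder = nn.Conv2d(1, 4, kernel_size=8, stride=8)  # stand-in for encoder E
m0 = torch.randn(1, 1, 128, 128)
control = torch.randn(1, 1, 128, 128)
z0 = encoder(m0)          # latent-space M0, here (1, 4, 16, 16)
cond = encoder(control)   # latent-space condition tau_theta(y)

# At every reverse step t, the noisy latent z_t is concatenated with the
# (fixed) condition latent along the channel axis before entering eps_theta.
z_t = torch.randn_like(z0)
denoiser_input = torch.cat([z_t, cond], dim=1)  # shape (1, 8, 16, 16)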
Figure 5.1: Illustration of the conditional latent diffusion model (LDM) and the inference process. a. Model structure of the latent diffusion model. The image (M0) in the original pixel space is first encoded into the latent space with the encoder (E). The diffusion process is conducted in the latent space to improve computational efficiency. The condition (control image) is also encoded into the latent space with the same encoder and concatenated with the latent-space features of M0. The reverse denoising network ϵθ takes the time step t, the latent-space image zt, and the latent-space image of the condition τθ(y) as input and produces the latent-space image of the previous step. The final M0 image is produced with a decoder that recovers the image from the latent space. b. The sampling (generation) process. The input images are first scaled with histogram matching to match the histogram of the training data. The trained LDM generates 20 samples for each scan, which are averaged to obtain a robust sample. The scale factor saved during the histogram matching is used to scale the image back to the original scale.
5.2 Diffusion model for M0 generation in the ADNI dataset
In this work, we developed a conditional LDM to generate the M0 images for Siemens ASL
data in ADNI-3 to quantify CBF and hypothesized that the pattern of decreasing CBF
with AD progression could be observed in the generated data. The ADNI was launched
in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner,
MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging
(MRI), positron emission tomography (PET), other biological markers, and clinical and
neuropsychological assessment can be combined to measure the progression of mild cognitive
impairment (MCI) and early Alzheimer’s disease (AD). Figure 5.1b shows the workflow of the
study evaluation. We first demonstrated the feasibility of using conditional LDM to generate
M0 images with high fidelity and accuracy for CBF quantification on our in-house dataset.
Then, we showed the feasibility of applying the trained model to the Siemens ASL dataset
in ADNI-3 to impute the missing M0 images for CBF quantification. Further, we compared
the patterns of CBF variation with AD progression from generated (Siemens) and acquired
(GE) CBF data with regional analyses. Three machine learning classifiers were applied to
evaluate the capability of CBF and perfusion-weighted images to classify different AD stages.
Our work demonstrates that the proposed diffusion model can impute missing data modality
and provide sufficient power and precision for studying ASL CBF as a biomarker in AD.
5.2.1 Dataset collection
Since ADNI-3 did not acquire M0 for Siemens data, in order to build an M0 generation model from the control images, we collected a new dataset with paired data of 1) PASL with the same protocol as the ADNI-3 Siemens protocol, and 2) the same sequence with BS pulses manually disabled and the inversion time (TI) extended to 5s, as shown in Figure 5.2(a) and (b). The details of the
imaging parameters of ADNI-3 Siemens PASL include TR=4000ms, TE=20.26ms, segmented
3D GRASE readout with a matrix size of 128×128×32 and a resolution of 1.9×1.9×4.5 mm³,
Figure 5.2: Sequence diagram of the PASL used in the ADNI protocol for Siemens scanners.
a. The original product sequence to acquire control/label pairs for the PASL. b. Modified
sequence to acquire M0. The background suppression pulses are disabled, and the inversion time is prolonged from 2s to 5s to enable full recovery of the longitudinal magnetization (Mz).
TI=2000ms and TI1=800ms. Two BS pulses were applied after the PASL inversion pulse to
suppress the control/label images. 55 subjects (age = 73±6.9 years, 15 males) were enrolled
and scanned on a 3T Siemens Prisma scanner (Siemens Healthcare, Erlangen, Germany)
with a 32-channel head coil. All subjects provided written informed consent according to a
protocol approved by the Institutional Review Board.
5.2.2 Model training and evaluation
The conditional LDM and training were implemented with Python and PyTorch. The
encoder-decoder pair was adopted from the pretrained model without fine-tuning. The
diffusion model was trained with the data described in Table 5.1. Of the 55 datasets, 39, 6, and 10 were randomly selected as the training, validation, and test data, respectively.
Vendor Siemens
N subjects 55
Age (years) 73±7
Gender 40 females
Imaging protocol PASL
Table 5.1: Demographic information for the in-house dataset with paired ASL data
Training was performed on a Lambda cluster with an NVIDIA 3090 GPU.
The Adam optimizer was used with an initial learning rate of 2e-6, and the model was trained for 1000 epochs. The loss function for training the denoising network was the same as described in [95].
Given the stochastic nature of generative models, the generated images differ from sample to sample, which enables measurement of the uncertainty of the model generation. We optimized the number of averaged samples by first generating 100 samples for each subject and averaging different numbers of samples, and then comparing the averaged image to the
ground truth image to determine the best number of samples to achieve good performance.
The standard deviation map of the samples was generated as a by-product to evaluate the
spatial distribution of the uncertainty. To quantitatively compare the performance, we calculated the similarity (NMSE, PSNR and SSIM) between the generated M0 and the reference.
To evaluate the accuracy of CBF quantification using M0 images generated from the LDM
model, we calculated the same similarity metrics on CBF maps, as well as the bias in CBF
values averaged in the whole brain, GM and WM regions of interest (ROI). The ROIs were
segmented with SPM12. To further validate whether our method complies with the physics
model, we fitted the T1 maps from the generated M0 images and the control images which
were compared with those fitted with acquired M0 and control images.
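This sample-averaging evaluation can be sketched as follows, with a random generator standing in for the trained LDM's sampling function:

import numpy as np

def evaluate_averaging(generate, reference, max_samples=100):
    # generate: callable returning one generated M0 volume per call.
    samples = np.stack([generate() for _ in range(max_samples)])
    uncertainty = samples.std(axis=0)  # voxelwise standard deviation map
    nmse = []
    for n in range(1, max_samples + 1):
        avg = samples[:n].mean(axis=0)
        nmse.append(np.sum((avg - reference) ** 2) / np.sum(reference ** 2))
    return np.array(nmse), uncertainty

ref = np.random.rand(32, 64, 64)  # toy reference volume
nmse_curve, std_map = evaluate_averaging(
    lambda: ref + 0.1 * np.random.randn(*ref.shape), ref)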
Multiple M0 images were first generated and then averaged, and the curves of performance metrics as a function of the number of averaged images are plotted, as shown in
Figure 5.3a, b, and c for normalized mean squared error (NMSE), structural similarity index
(SSIM) and peak signal-to-noise ratio (PSNR) between the generated M0 and the ground
Figure 5.3: Uncertainty evaluation of the diffusion model. a, b and c. Normalized mean
square error (NMSE), structural similarity index (SSIM) and peak signal-to-noise ratio
(PSNR) between generated and ground truth M0 images with different numbers of averaged
samples. The shaded area shows standard deviation across test images. d and e. A representative case of the ground truth (GT) and the generated image. f. The difference map
between the ground truth and the generated M0 image. g. Standard deviation map across
100 generated samples. Red arrows show CSF areas where the standard deviation is higher
than other brain tissues.
truth (acquired M0 image), respectively. The quality of the averaged M0 image improved
with an increasing number of averages and stabilized after averaging 20 samples. The
variation of performance metrics across all test subjects (indicated by the shaded area) illustrates the stability of the conditional LDM across different subjects. Figure 5.3d, e, and
f show one representative case in the test dataset for an averaged M0 image, the ground
truth, and the difference map between the two images, respectively. Figure 5.3g shows the
standard deviation map across 100 generated samples. Most brain regions have low standard
deviation, supporting the high certainty of the trained conditional LDM. The regions with
relatively higher standard deviations correspond to cerebrospinal fluid (CSF) or vasculature,
as red arrows indicate, which tend to have high fluctuations in the MR images.
Based on the results of our uncertainty evaluation, we chose to average 20 generated
images for each case in our following analysis to achieve the optimal balance between generation time and image quality. We systematically compared the generated M0 and CBF maps
to the ground truth images in the following aspects. First, image similarity to the reference
image was inspected. A representative case of generated M0 and CBF images is shown in Figure 5.4a, both of which show high similarity to the ground truth. Second, the quantitative
similarity metrics of the generated images are summarized in Figure 5.4b, demonstrating the
outstanding performance of the conditional LDM. The mean gray matter (GM) and white
matter (WM) CBF values for both generated CBF maps and the ground truths are shown in
the scatter plot, showing a high consistency (r=0.97) between the generated and true CBF
values. Finally, Figure 5.4d shows the mean difference between the generated CBF values
and the ground truths in the whole brain, GM, and WM across all test cases. The mean difference is 1.07±2.12ml/100g/min for the whole brain, 1.29±2.51ml/100g/min for GM, and
-0.04±1.44ml/100g/min for WM, which is less than 5% of the corresponding CBF values.
This shows our model’s feasibility and fidelity in preserving the qualitative and quantitative
features of the M0 image.
We further evaluated the fidelity of our generated M0 images based on their consistency
Figure 5.4: Qualitative and quantitative model performance evaluation. a. A representative
case of the ground truth and generated M0 and CBF maps. Both M0 and CBF map produced
by the diffusion model show high fidelity and similarity to the ground truth image. b.
Quantitative similarity metrics of the generated image and the ground truth including NMSE,
SSIM and PSNR. c and d. Bias in CBF quantification measurements. c. Scatter plot of the
averaged CBF values of the ground truth and generated CBF images in both gray matter
(GM) and white matter (WM). d. Averaged CBF difference in the whole brain, GM and
WM. Mean difference is 1.07±2.12ml/100g/min for whole brain, 1.29±2.51ml/100g/min for
GM and -0.04±1.44ml/100g/min for WM.
Figure 5.5: Results for T1 evaluation. a. Bloch simulation results for background suppression
of various T1 values. b. Illustration of the relationship of Mz/M0 with different T1 values.
c. Fitted T1 map with real M0 and control images. d. Fitted T1 map with generated M0 and real control images. e. Scatter plot of gray matter and white matter T1 values of real and generated T1 maps. f. Summary of the GM and WM values of real and generated T1 maps
of all subjects.
with the MR physics model. The relationship between M0 and the control image is dependent
on the timing of background suppression (BS) pulses, as shown in Figure 5.5. We calculated
the T1 map from the generated M0 image and the acquired control image and compared
that with the ground truth and the reported T1 values of normal brain tissue at 3T in the
literature. Figure 5.5c and d show a high similarity between the T1 map calculated from
the generated M0 and that calculated from the acquired M0. The largest variation between
the generated T1 map and the ground truth T1 map occurs in the area containing blood
vessels and CSF, where the signal fluctuation is relatively more significant. Figure 5.5e and
f show the averaged T1 values of the ground truth and fitted from the generated M0. T1
values fitted from the generated M0 are close to the ground truth, and T1 values of both GM
and WM are consistent with that reported in literature[96], confirming that our generative
model preserves the MR physics properties of the generated M0.
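The underlying forward model can be sketched with an ideal-pulse Bloch simulation; the inversion timings below are illustrative placeholders, not the protocol's actual background-suppression schedule.

import numpy as np

def mz_at_readout(t1, t_inv=(0.5, 1.2), ti=2.0):
    # Normalized Mz/M0 at readout after ideal inversions at times t_inv (s);
    # timings here are illustrative, not the sequence's actual values.
    mz, t_prev = 1.0, 0.0
    for t in t_inv:
        mz = 1.0 - (1.0 - mz) * np.exp(-(t - t_prev) / t1)  # T1 relaxation
        mz, t_prev = -mz, t                                  # inversion pulse
    return 1.0 - (1.0 - mz) * np.exp(-(ti - t_prev) / t1)

# The control/M0 ratio as a function of T1; inverting this curve voxelwise
# (e.g., by lookup) yields the T1 map used for the physics validation.
t1_grid = np.linspace(0.3, 4.0, 200)
ratio_curve = np.array([mz_at_readout(t1) for t1 in t1_grid])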
Since the diffusion model was trained on an in-house dataset acquired with the same vendor and an identical protocol to the ADNI dataset, the acquired images were expected to come from a similar data distribution. Therefore, we applied the trained model to generate the missing
M0 images for the Siemens ASL dataset in ADNI-3 in a zero-shot approach. For the ADNI-3
Siemens dataset, 10 pairs of control and label images were acquired for each subject. All
control images were averaged and used as the condition in the LDM to generate M0 images.
As the intensity of the MRI images may vary across different sites, the data were normalized
before they were input to the network[97] using a histogram matching method. Specifically,
the histogram of the training data was averaged, and the percentiles of GM and WM on the histogram were used as landmarks. The histogram of the new data was calculated, and the
values of the same percentile of the histogram were compared to the values of the landmarks
to determine the coefficient for normalization. The process of normalization can be described
by the following equation:
\[
I_{\mathrm{norm}} = \frac{T_1 + T_2}{V_{p1} + V_{p2}} \times I_{\mathrm{orig}} \tag{5.5}
\]
where $T_1$ and $T_2$ are the learned landmark values for GM and WM, and $V_{p1}$ and $V_{p2}$ are the values of percentiles $p_1$ and $p_2$ on the histogram of the image to be normalized. This scale factor was calculated and saved for each individual dataset. After the generation of the M0 image, the saved factor was used to rescale the generated M0 back to the original scale.
Vendor GE Siemens
N Subjects 186 211
Age (years) 73.6±7.3 76.2±7.1
Gender 91 females 111 females
Group (CN, SMC, MCI, AD) 87, 21, 65, 13 102, 15, 85, 9
Imaging Protocol pCASL PASL
Table 5.2: Demographic information for subjects (Baseline)
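Returning to the intensity normalization above, a minimal sketch of the landmark-based scaling (Eq. 5.5) is given below; the landmark values and percentiles are placeholders for those learned from the training histograms.

import numpy as np

def landmark_normalize(image, landmarks=(700.0, 450.0), percentiles=(75, 25)):
    # landmarks: learned GM/WM landmark values (T1, T2) from the training data.
    v_p1, v_p2 = np.percentile(image[image > 0], percentiles)
    scale = sum(landmarks) / (v_p1 + v_p2)  # Eq. 5.5 scale factor
    return scale * image, scale

control = np.random.rand(128, 128) * 1000  # toy control image
normalized, scale = landmark_normalize(control)
# After generation, the saved factor restores the original scale:
# m0 = m0_generated / scale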
ADNI-3 acquired ASL data with 10 pairs of control/label images for each scan, and the average control image was used as the condition to generate the M0 image. Intensity normalization was performed to match the intensity of the control images from the ADNI dataset with that of the in-house dataset for optimal performance. As with the in-house data evaluation, 20 samples were generated for each scan and averaged to produce the final M0 image for CBF calculation. CBF maps were calculated with the acquired perfusion
images and the generated M0 images. The demographic information of the ADNI dataset
is summarized in Table 5.2. All subjects were characterized into four diagnostic groups,
including cognitively normal (CN), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. To compare the spatial patterns among the four groups, averaged
CBF maps were calculated for each group. Both averaged perfusion maps (i.e., a direct subtraction between control and label images) and CBF maps were calculated and compared
to justify the need for quantitative analysis. Figure 5.6a-d show the averaged perfusion and
CBF maps of each group for both Siemens and GE data, respectively. The perfusion maps
are in arbitrary units, while the CBF maps are standardized to the unit of ml/100g/min. A
greater degree of spatial inhomogeneity can be observed in the perfusion maps compared to
CBF maps, especially in top and bottom slices, which is likely due to spatial variations of
coil sensitivities and different head positions. The CBF maps show improved homogeneity
by normalizing the perfusion signal with a calibration scan (M0).
Figure 5.6: Averaged perfusion and CBF maps for different groups in the ADNI dataset
including CN, SMC, MCI and AD. The perfusion maps show larger inhomogeneity across
different brain regions, while CBF maps are more homogeneous. Both Siemens and GE data
show a decreasing trend from CN/SMC to MCI to AD, especially in the CBF maps. Siemens
pulsed ASL data show more vascular signals, while GE pCASL data are smoother across the
whole brain.
5.3 Analysis of clinical significance
For each subject in the ADNI-3 dataset, a T1-weighted structural MRI was acquired along
with ASL at a resolution of 1×1×1 mm³. In order to perform regional analysis, the
generated M0 images and perfusion images were first coregistered to the T1 image and then
normalized to the Montreal Neurological Institute (MNI) template space using SPM12. CBF
maps were calculated and averaged for each diagnostic group, including CN, MCI, AD and
SMC. Automated anatomical labeling (AAL) template was used to get regional CBF values
within each brain regions.
5.3.1 Group analysis of brain regions
Statistical analyses were performed to compare regional differences among the 4 groups, as
well as the difference between GE data and generated Siemens data. A generalized linear
model was used to include an interaction term between diagnostic groups and scanner types.
The interaction term represented the difference between scanners of the difference among
the 4 groups. A non-statistically significant interaction test indicated no statistically significant differences in CBF variation across NC, SMC, MCI and AD groups between generated
(Siemens) and acquired (GE) data. Since the data did not follow a normal distribution, we applied Wilcoxon rank score transformations; thus, all statistical tests were conducted by comparing rank scores (medians) instead of means. Model integrity was inspected with residual plots. In this study, we did not adjust the α level for multiple comparisons. As we hypothesized similarity between generated and acquired images, statistical non-significance would be the "positive" finding; penalizing the α level would inflate the chance of claiming no difference (the positive finding). SAS 9.4 (www.sas.com) was used for
statistical analyses.
Regional CBF values of baseline visits in ADNI-3 were compared between generated data
(Siemens) and acquired data (GE) across the four diagnostic groups using a generalized linear
Figure 5.7: Regional analysis of the trend in different groups of the subjects. Four AD-related
ROIs are shown including posterior cingulate cortex (PCC), precuneus, inferior parietal gyrus
and angular gyrus. Data from Siemens and GE are shown in blue and red respectively. Both
data show similar trends in these ROIs, with a decrease from CN/SMC to MCI to AD.
Figure 5.8: Regional analysis of more ROIs including cuneus, inferior parietal gyrus, supramarginal gyrus and hippocampus. Overall, GE and Siemens data show similar trends in
different groups of subjects. Siemens data show larger variance across subjects than GE data.
model with an interaction between diagnostic group and scanner type. The interaction test showed that in 84 out of 90 ROIs there were no statistically significant differences in CBF variation across NC, SMC, MCI and AD groups between generated (Siemens) and acquired (GE)
data. Based on the latest systematic review[98], the temporoparietal and posterior cingulate cortex (PCC) regions are the most affected in AD. Figure 5.7 shows boxplots of CBF values
among the four groups in 4 representative ROIs, including PCC, precuneus, angular gyrus,
and inferior temporal gyrus for both left and right hemispheres. Boxplots of CBF values
in 4 other AD-affected regions (cuneus, inferior parietal gyrus, supramarginal gyrus, and
hippocampus on both hemispheres) are shown in Figure 5.8. GE and Siemens data are
shown in red and blue, respectively. Visual inspection concurred with the interaction test
and showed a high level of similarity between generated and acquired CBF data, with a trend
of decreasing CBF from CN to MCI and AD, while there is little CBF difference between CN
and SMC. Siemens data had greater variances than GE data, as shown by larger interquartile
ranges in the box plot.
5.3.2 Machine learning for AD classification
To study whether using quantitative CBF data can improve the performance of binary classification between each pair of the 4 diagnostic groups than using qualitative perfusion data,
three machine learning (ML) algorithms were used to build classifiers: Random Forest (RF),
Real AdaBoost[99], and Elastic Net. RF and AdaBoost are considered non-parametric approaches, while Elastic Net is considered a parametric approach in the case of strong linear predictors. The hyperparameter setting for RF was based on a grid search over (5, 10, 25, 50, 100) for the number of variables to enter and 200 to 1000 trees in steps of 100. Other hyperparameters were set to a maximal depth of 50 and a leaf size of 5. The optimal hyperparameters were selected by minimizing the out-of-bag misclassification rate. For AdaBoost, since it is more efficient, only 25 trees were built with a depth of 3, as recommended
by [100]. For Elastic Net, hyperparameter tuning was conducted using an L2 value between
Machine Learning Method Comparative groups Vendor AUC from CBF AUC from Perfusion
Ada Boost MCI, CN Siemens 0.64 95% CI: (0.56, 0.72) 0.62 95% CI: (0.53, 0.7)
Elastic Net MCI, CN Siemens 0.64 95% CI: (0.56, 0.72) 0.62 95% CI: (0.54, 0.7)
Random Forest MCI, CN Siemens 0.65 95% CI: (0.57, 0.73) 0.66 95% CI: (0.58, 0.74)
Ada Boost SMC, CN Siemens 0.58 95% CI: (0.44, 0.73) 0.48 95% CI: (0.33, 0.64)
Elastic Net SMC, CN Siemens 0.58 95% CI: (0.4, 0.76) 0.54 95% CI: (0.36, 0.72)
Random Forest SMC, CN Siemens 0.63 95% CI: (0.46, 0.79) 0.59 95% CI: (0.46, 0.73)
Ada Boost AD, CN Siemens 0.75 95% CI: (0.62, 0.89) 0.67 95% CI: (0.45, 0.9)
Elastic Net AD, CN Siemens 0.71 95% CI: (0.55, 0.87) 0.61 95% CI: (0.39, 0.83)
Random Forest AD, CN Siemens 0.68 95% CI: (0.5, 0.85) 0.68 95% CI: (0.46, 0.89)
Ada Boost AD, MCI Siemens 0.69 95% CI: (0.52, 0.87) 0.63 95% CI: (0.39, 0.87)
Elastic Net AD, MCI Siemens 0.69 95% CI: (0.49, 0.88) 0.6 95% CI: (0.38, 0.82)
Random Forest AD, MCI Siemens 0.65 95% CI: (0.47, 0.84) 0.66 95% CI: (0.41, 0.91)
Ada Boost MCI, CN GE 0.58 95% CI: (0.49, 0.67) 0.63 95% CI: (0.54, 0.72)
Elastic Net MCI, CN GE 0.59 95% CI: (0.5, 0.68) 0.5 95% CI: (0.5, 0.5)
Random Forest MCI, CN GE 0.6 95% CI: (0.51, 0.69) 0.59 95% CI: (0.5, 0.68)
Ada Boost SMC, CN GE 0.79 95% CI: (0.68, 0.9) 0.72 95% CI: (0.59, 0.85)
Elastic Net SMC, CN GE 0.63 95% CI: (0.52, 0.75) 0.5 95% CI: (0.5, 0.5)
Random Forest SMC, CN GE 0.73 95% CI: (0.6, 0.85) 0.68 95% CI: (0.56, 0.8)
Ada Boost AD, CN GE 0.9 95% CI: (0.83, 0.98) 0.81 95% CI: (0.72, 0.91)
Elastic Net AD, CN GE 0.84 95% CI: (0.74, 0.94) 0.5 95% CI: (0.5, 0.5)
Random Forest AD, CN GE 0.8 95% CI: (0.64, 0.96) 0.82 95% CI: (0.71, 0.93)
Ada Boost AD, MCI GE 0.6 95% CI: (0.43, 0.76) 0.66 95% CI: (0.53, 0.8)
Elastic Net AD, MCI GE 0.67 95% CI: (0.5, 0.83) 0.5 95% CI: (0.5, 0.5)
Random Forest AD, MCI GE 0.68 95% CI: (0.5, 0.86) 0.8 95% CI: (0.7, 0.9)
Table 5.3: Results of machine learning classification with either CBF or perfusion as features.
Machine learning methods include Ada Boost, Elastic Net and Random Forest.
0 to 1 via a grid search to minimize predicted residual sum of squares (CVPRESS). For RF
and AdaBoost, the Gini impurity index was used as the loss function. The Loh method[101] with intensive chi-square computation was used for variable selection. For imbalanced outcomes, prior correction as described by [102] was used. For all 3 classifiers, 10-fold cross-validation
was used to evaluate model performance. The full dataset was equally divided into 10 folds.
The learning process was re-iterated 10 times and the classifier was applied to each of the
testing samples. Thus, each study sample served as an independent testing case once. The Receiver Operating Characteristic (ROC) curve was constructed using the predicted probabilities
from 10 testing datasets combined and the area under the curve (AUC) with 95% confidence
interval was used to assess prediction accuracy.
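The analyses here were run in SAS; a rough Python analogue of the 10-fold evaluation with pooled predicted probabilities, using scikit-learn and toy regional-CBF features, might look as follows (the hyperparameters echo, but do not exactly reproduce, the settings above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy stand-ins for regional CBF features (subjects x ROIs) and binary labels.
X = np.random.rand(100, 90)
y = np.random.randint(0, 2, 100)

# Each subject serves as an independent test case exactly once; the pooled
# predicted probabilities give a single ROC curve and AUC.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=200, max_depth=50, min_samples_leaf=5)
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, proba))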
The performances of the ML classifiers are shown in Table 5.3. The receiver operating characteristic (ROC) curves of the three ML methods for 4 pairwise group classifications, including AD vs CN, AD vs MCI, MCI vs CN, and SMC vs CN, are displayed. For both Siemens and
GE data, the performance is above the acceptable level for clinical use[103] in separating AD from CN,
with the best area-under-the-curve (AUC) of 0.75 95% CI: (0.62, 0.89) for Siemens data and
0.9 95% CI: (0.83, 0.98) for GE data, while the performance for classifying MCI and CN
is below the clinically acceptable level of 0.7 for both Siemens and GE. The performance
using perfusion features is lower than that using quantitative CBF features in most cases
except for MCI vs CN, supporting the importance of using quantitative CBF rather than
qualitative perfusion values for classification.
Chapter 6
Discussion and Future Direction
The past decade has seen tremendous technical improvements in ASL. However, the intrinsically low SNR of ASL images still poses an obstacle to the general use of ASL in clinical diagnosis. The rise of deep learning has shed new light on ASL, demonstrating the potential to enhance it by leveraging the latest advancements in DL methodologies. The previous chapters have shown examples of how DL can help improve the clinical applications of ASL.
In chapter 3, Swin Transformer-based models with a modified pseudo-3D or pseudo-4D
input were developed for single-delay and multi-delay ASL respectively. The model can
denoise each repetition of the ASL scan and thus reduce the total scan time or improve the
image quality according to clinical needs. The performance of the models was evaluated in
three aspects: similarity of the denoised image with the reference image, if applicable; SNR
of the denoised image without reference; CBF and ATT quantification bias compared to the
original image. The evaluation was performed on images from multiple vendors to assess the
generalizability of the model.
In chapter 4, the pretrained model was adapted to a cutting-edge case of high-resolution pediatric ASL. Since there were no reference images for DL training, the model was trained with self-supervised methods, including a KWIA-based reference and the Noise2Void method. These methods were compared to the regular TGV-based denoising and reconstruction method, as well as the KWIA-based method. The results show that DL with the KWIA reference performs best, with improved SNR in both the perfusion images and the fitted CBF and ATT maps.
In chapter 5, a generative diffusion model was adapted to generate missing M0 data in an existing public dataset, ADNI, to enable CBF quantification with the generated M0. An LDM was trained on an in-house dataset with paired M0 and control images and applied to the existing ADNI dataset. The results show that the LDM can generate M0 for the existing ADNI dataset in a zero-shot manner. The resulting CBF maps show similar patterns to data acquired from another vendor with both perfusion and M0. Machine learning for binary classification shows improved classification accuracy for AD and CN with generated quantitative CBF maps over qualitative perfusion maps.
In the rest of this chapter, the limitations of the studies in the previous chapters will be discussed. Potential future directions and some preliminary results will also be introduced.
6.1 Study limitations
There are several limitations to this work. For ASL denoising, first, we observed biases in fitted CBF and ATT values with DL denoising. Although the differences were not large, caution still needs to be taken in clinical applications. Because existing literature reported
test–retest variability of ASL scans on the order of 10%[77, 78, 79, 80, 81] , we consider biases
<10% acceptable in clinical applications. One possible solution to this may be developing
direct mapping from the input to quantitative maps, like the work in[53], and this would be
a direction in future studies. Another solution may be directly reducing the number of PLDs
or directly applying denoising to CBF maps[104]. Second, the training sample size of the
model is still relatively small compared to other computer vision tasks. This is important
for improving the generalizability of the model to make it work for ASL data of multiple
vendors rather than the patterns seen in the training data (e.g., time-encoded pCASL)[105].
Third, only generally healthy subjects were included in the training, and the performance
of the proposed DL method needs to be further tested on subjects with neurologic disorders
such as stroke and brain tumor. Future work may include more clinical ASL cases in training
and/or combined with a fine-tuning approach.
For pediatric choroid plexus imaging, first, a gold-standard reference is not available in
our dataset. Our study population is children, and it is difficult to acquire the high-SNR
reference for this dataset due to the long scan time and the higher probability of head motion. We
tested our performance based on non-reference metrics such as SNR. However, it is warranted that this method be tested on other datasets with real reference images. Second, we only
used a five-PLD dataset. While it is still not clear whether increasing the number of delays
will improve the accuracy of quantitative maps, the performance of KWIA improves with
the number of rings. With only five delays, we could only use a two-ring KWIA; a three-ring KWIA would be more desirable for datasets with more than 5 delays. Third, although
we considered the change of blood T1 with age and gender, we did not use T1 mapping to quantitatively measure the T1 values of the choroid plexus. Since the main purpose of this study was to validate the proposed framework, further studies can explore more details of the
quantification of blood flow in small brain structures.
For M0 generation in ADNI-3 dataset, first, ground truth is not available for the generated
M0 images in the ADNI-3 Siemens dataset, which prevents more rigorous evaluation, as
well as fine-tuning of the model. In this study, we evaluated the fidelity of the proposed
method by several validation methods, such as validating the T1 maps with the physics
model and literature, as well as by comparing the detected patterns of CBF reductions
across different groups between generated and acquired CBF data. Although this cannot completely replace the necessity of a ground truth image, this approach is suitable for real-world scenarios where the dataset has already been acquired. Future studies could utilize
datasets with existing references to validate further the use of generative models in large
multi-site datasets. Second, pCASL data has higher SNR than PASL, which is also indicated
in our results, where GE pCASL data showed better performance in classification than
Siemens PASL data. Future studies should consider this difference and further standardize
the imaging protocol of ASL. Third, the ADNI dataset is imbalanced, with more CN and
MCI individuals than AD, which may affect the statistical significance of group comparisons
and the performance of ML classifications. Testing of the performance on a more balanced
dataset of AD and CN is warranted. Fourth, the comparison between Siemens and GE
was based on different patient cohorts, which could be confounded by demographic factors
and comorbidities. According to the recent clinical standard for AD diagnosis[106], the
classification of AD should consider imaging and fluid biomarkers such as Amyloid PET,
rather than relying only on behavioral observations. In the future, this study can be further extended
to analysis of ASL CBF and PET-based diagnosis, which can be more meaningful to early
staging of AD.
6.2 Direct quantification for multi-parametric ASL
One of the limitations of DL-based denoising or generation methods is that they can introduce bias in the quantification of CBF or ATT. For ASL, even though perfusion-weighted image contrast can provide useful information for diagnosis by identifying hypoperfused or hyperperfused regions, quantitative values are still desired for the scan. One possible solution would be training a model to perform an end-to-end mapping from perfusion images to quantitative maps, such as the work in [53]. The purpose of that work was to develop an automated way of CBF and ATT quantification with multi-delay ASL data to replace traditional model fitting methods for better efficiency. The model structure and framework used in that work are shown in Figure 6.1. The model takes the 6-PLD images as the input,
and outputs the CBF and ATT quantitative maps.
The model is trained in a supervised manner with the model-fitted CBF and ATT maps
as reference. This approach may benefit ASL denoising by reducing its bias: adding the quantification module after the denoising module can correct for the introduced CBF/ATT bias.
By training the two parts of the model separately, the models can achieve the optimal
performance for denoising and quantification respectively. However, how to combine these
two parts would need to be optimized in the future.
Figure 6.1: (A) Image processing pipeline. (B) Network architectures. Architectures are
shown in dashed-line boxes, whereas the legend is shown in the solid-line box. BN, batch
normalization; PLD, post-label delay; ReLU, rectified linear unit
Another possible solution is to add more constraints to the training process. In the current work, only image-similarity terms were included in the training loss function. It is possible to add loss terms that enforce consistency of quantification. The improved loss function would look as follows:
\[
\mathrm{Loss} = \alpha \times \mathrm{Loss}_{l1} + \mathrm{Loss}_{\mathrm{values}} \times \mathrm{mask} \tag{6.1}
\]
where $\mathrm{Loss}_{l1}$ is the traditional $l_1$ loss, while $\mathrm{Loss}_{\mathrm{values}}$ penalizes differences in values within the predefined foreground mask. However, a more complex loss function can make training unstable, which is another potential problem that needs to be solved in the future.
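A sketch of such a combined loss (Eq. 6.1) in PyTorch is shown below; the mean-squared form of the value-consistency term and the weighting α are illustrative choices, not a fixed design.

import torch

def combined_loss(pred, target, mask, alpha=1.0):
    # Image-similarity l1 term plus a value term restricted to the brain mask.
    loss_l1 = torch.mean(torch.abs(pred - target))
    loss_values = torch.mean(((pred - target) * mask) ** 2)
    return alpha * loss_l1 + loss_values

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
target, mask = torch.rand(1, 1, 64, 64), torch.ones(1, 1, 64, 64)
combined_loss(pred, target, mask).backward()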
6.3 Interpretable classification model for AD diagnosis
with ASL
Neuroimaging has already proven useful in the diagnosis of AD. Previous studies[107, 108, 109] have tried to leverage MRI or other modalities for classification among AD, MCI and CN, which can assist decision-making for diagnosis. However, most studies
relied on structural data, such as T1. Although using T1-only data can achieve a high
classification accuracy between AD and CN, structural changes in the brain, such as atrophy,
are relatively late biomarkers for AD. Meanwhile, functional imaging such as PET and ASL
may provide an alternative early biomarker for the early detection of AD. Thus, it may be
possible to develop a DL model for early detection of AD with ASL or PET imaging.
While DL can provide a way to make a diagnosis with any kind of input, the process of how the model works is important for people to trust the decision, since most DL models are black boxes. This has led to the development of explainable or interpretable models[110, 111]. Gradient-weighted Class Activation Mapping (Grad-CAM)[112] is a popular visualization technique used to interpret
Figure 6.2: Grad-CAM overview: Given an image and a class of interest (e.g., ‘tiger cat’ or
any other type of differentiable output) as input, we forward propagate the image through the
CNN part of the model and then through task-specific computations to obtain a raw score for
the category. The gradients are set to zero for all classes except the desired class (tiger cat),
which is set to 1. This signal is then backpropagated to the rectified convolutional feature
maps of interest, which we combine to compute the coarse Grad-CAM localization (blue
heatmap) which represents where the model has to look to make the particular decision.
Finally, we pointwise multiply the heatmap with guided backpropagation to get Guided
Grad-CAM visualizations which are both high-resolution and concept-specific. (Figure from
Selvaraju et al. ICCV 2017)
Figure 6.3: User interface for the radiologists in the study. First, they were shown chest
radiographs (CXRs) with and without a present pneumothorax and the vision Transformer
(or, ViT) prediction score. In the second step, a saliency map was additionally shown. For
both parts, radiologists had to detect if a pneumothorax was present and then determine if
the saliency map was (subjectively) useful for aiding detection. ID = image identifier.(Figure
from Wallek et al. RSNA 2022)
the decisions of CNNs. It works by highlighting the important regions in an input image
that most strongly influence a model’s prediction. The workflow of Grad-CAM is shown
in Figure 6.2. Grad-CAM achieves its function by using the gradients of the target class,
flowing back through the network to produce a coarse localization map that emphasizes key
areas of interest in the convolutional layers.
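A minimal PyTorch sketch of this procedure on a toy CNN is given below; a real application would hook the last convolutional block of the trained classifier rather than this placeholder model.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
acts, grads = {}, {}
model[0].register_forward_hook(lambda m, i, o: acts.update(a=o))
model[0].register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.rand(1, 1, 64, 64)
model(x)[0, 1].backward()  # raw score of the target class

weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of the gradients
cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted sum + ReLU
cam = cam / (cam.max() + 1e-8)                       # coarse localization map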
On the other hand, Transformer-based models can be explained by creating attention-based saliency maps using transformer multimodal explainability (TMME)[32]. This technique has been used to explain the model's decisions in pneumothorax classification
with chest radiographs[113]. A representative case is shown in Figure 6.3. The saliency
map provided by the ViT can aid radiologists in detection and diagnosis. These
methods can be used in future studies of explainable models for AD classification with ASL
or multi-modality MRI.
6.4 Conclusion
In conclusion, several of the latest DL techniques have been adapted to ASL for better clinical application. A Swin Transformer-based model has been developed for single-delay and multi-delay ASL denoising. A self-supervised method has been developed for high-resolution pediatric choroid plexus perfusion image denoising. A generative diffusion model has been developed for an existing ASL dataset to enable CBF quantification.
Future studies will focus on reducing the quantitative bias by using a two-step denoising
and quantification model or enforcing a value-consistency term in the loss function. New
classification models for AD, MCI and CN can be developed with CNN- and Transformer-based networks, while gaining more interpretability with Grad-CAM or TMME, which can help radiologists better understand the decisions made by the models.
References
[1] Wenna Duan et al. “Cerebral blood flow is associated with diagnostic class and cognitive decline in Alzheimer’s disease”. In: Journal of Alzheimer’s Disease 76.3 (2020),
pp. 1103–1120.
[2] Danny JJ Wang et al. “The value of arterial spin-labeled perfusion imaging in acute
ischemic stroke: comparison with dynamic susceptibility contrast-enhanced MRI”. In:
Stroke 43.4 (2012), pp. 1018–1024.
[3] Danny JJ Wang et al. “Multi-delay multi-parametric arterial spin-labeled perfusion
MRI in acute ischemic stroke—comparison with dynamic susceptibility contrast enhanced perfusion imaging”. In: NeuroImage: Clinical 3 (2013), pp. 1–7.
[4] Jalal B Andre et al. “Cerebral blood flow changes in glioblastoma patients undergoing bevacizumab treatment are seen in both tumor and normal brain”. In: The
neuroradiology journal 28.2 (2015), pp. 112–119.
[5] Catriona R Stewart et al. “Associations between white matter hyperintensity burden, cerebral blood flow and transit time in small vessel disease: an updated meta-analysis”. In: Frontiers in Neurology 12 (2021), p. 647848.
[6] Oliver Bracko et al. “Causes and consequences of baseline cerebral blood flow reductions in Alzheimer’s disease”. In: Journal of Cerebral Blood Flow & Metabolism 41.7
(2021), pp. 1501–1516.
[7] Weiying Dai et al. “Mild cognitive impairment and alzheimer disease: patterns of
altered cerebral blood flow at MR imaging”. In: Radiology 250.3 (2009), pp. 856–866.
[8] Frank J Wolters et al. “Cerebral perfusion and the risk of dementia: a population-based study”. In: Circulation 136.8 (2017), pp. 719–728.
[9] Alexander Kunz and Costantino Iadecola. “Cerebral vascular dysregulation in the
ischemic brain”. In: Handbook of clinical neurology 92 (2008), pp. 283–305.
[10] Maja AA Binnewijzend et al. “Cerebral blood flow measured with 3D pseudocontinuous arterial spin-labeling MR imaging in Alzheimer disease and mild cognitive
impairment: a marker for disease severity”. In: Radiology 267.1 (2013), pp. 221–230.
[11] David A Wolk and John A Detre. “Arterial spin labeling MRI: an emerging biomarker
for Alzheimer’s disease and other neurodegenerative conditions”. In: Current opinion
in neurology 25.4 (2012), pp. 421–428.
[12] David C Alsop et al. “Recommended implementation of arterial spin-labeled perfusion
MRI for clinical applications: a consensus of the ISMRM perfusion study group and
the European consortium for ASL in dementia”. In: Magnetic resonance in medicine
73.1 (2015), pp. 102–116.
[13] Yaron Gordon et al. “Dynamic contrast-enhanced magnetic resonance imaging: fundamentals and application to the evaluation of the peripheral perfusion”. In: Cardiovascular diagnosis and therapy 4.2 (2014), p. 147.
[14] Hannu J Aronen and Jussi Perkiö. “Dynamic susceptibility contrast MRI of gliomas”.
In: Neuroimaging Clinics 12.4 (2002), pp. 501–523.
[15] Dennis FR Heijtel et al. “Accuracy and precision of pseudo-continuous arterial spin
labeling perfusion during baseline and hypercapnia: a head-to-head comparison with
15O H2O positron emission tomography”. In: Neuroimage 92 (2014), pp. 182–192.
[16] Christina E Wierenga, Chelsea C Hays, and Zvinka Z Zlatar. “Cerebral blood flow
measured by arterial spin labeling MRI as a preclinical marker of Alzheimer’s disease”.
In: Journal of Alzheimer’s Disease 42.s4 (2014), S411–S419.
[17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553
(2015), pp. 436–444.
[18] Dinggang Shen, Guorong Wu, and Heung-Il Suk. “Deep learning in medical image
analysis”. In: Annual review of biomedical engineering 19.1 (2017), pp. 221–248.
[19] Athanasios Voulodimos et al. “Deep learning for computer vision: A brief review”.
In: Computational intelligence and neuroscience 2018.1 (2018), p. 7068349.
[20] Daniel W Otter, Julian R Medina, and Jugal K Kalita. “A survey of the usages of deep
learning for natural language processing”. In: IEEE transactions on neural networks
and learning systems 32.2 (2020), pp. 604–624.
[21] DeLiang Wang and Jitong Chen. “Supervised speech separation based on deep learning: An overview”. In: IEEE/ACM transactions on audio, speech, and language processing 26.10 (2018), pp. 1702–1726.
[22] Jiuxiang Gu et al. “Recent advances in convolutional neural networks”. In: Pattern
recognition 77 (2018), pp. 354–377.
[23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer. 2015, pp. 234–241.
[24] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[25] Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information
Processing Systems (2017).
[26] Ze Liu et al. “Swin transformer: Hierarchical vision transformer using shifted windows”. In: Proceedings of the IEEE/CVF international conference on computer vision.
2021, pp. 10012–10022.
[27] Jingyun Liang et al. “SwinIR: Image restoration using Swin Transformer”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021, pp. 1833–
1844.
[28] Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
[29] Lei Zhang and Yan Wen. “A transformer-based framework for automatic COVID-19
diagnosis in chest CTs”. In: Proceedings of the IEEE/CVF international conference
on computer vision. 2021, pp. 513–518.
[30] Ali Hatamizadeh et al. “UNETR: Transformers for 3D medical image segmentation”. In:
Proceedings of the IEEE/CVF winter conference on applications of computer vision.
2022, pp. 574–584.
[31] Chun-Mei Feng et al. “Task transformer network for joint MRI reconstruction and
super-resolution”. In: Medical Image Computing and Computer Assisted Intervention–
MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–
October 1, 2021, Proceedings, Part VI 24. Springer. 2021, pp. 307–317.
[32] Hila Chefer, Shir Gur, and Lior Wolf. “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 397–406.
[33] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models”. In: Advances in neural information processing systems 33 (2020), pp. 6840–6851.
[34] Yang Song et al. “Score-based generative modeling through stochastic differential
equations”. In: arXiv preprint arXiv:2011.13456 (2020).
[35] Hyungjin Chung and Jong Chul Ye. “Score-based diffusion models for accelerated
MRI”. In: Medical image analysis 80 (2022), p. 102479.
[36] Qing Lyu and Ge Wang. “Conversion between CT and MRI images using diffusion
and score-matching models”. In: arXiv preprint arXiv:2209.12104 (2022).
[37] Boah Kim, Inhwa Han, and Jong Chul Ye. “DiffuseMorph: Unsupervised deformable
image registration using diffusion model”. In: European conference on computer vision. Springer. 2022, pp. 347–364.
[38] Rishi Bommasani et al. “On the opportunities and risks of foundation models”. In:
arXiv preprint arXiv:2108.07258 (2021).
[39] Maithra Raghu et al. “Transfusion: Understanding transfer learning for medical imaging”. In: Advances in neural information processing systems 32 (2019).
[40] Zongwei Zhou et al. “Models genesis: Generic autodidactic models for 3d medical
image analysis”. In: Medical Image Computing and Computer Assisted Intervention–
MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17,
2019, Proceedings, Part IV 22. Springer. 2019, pp. 384–393.
[41] Kai Wang et al. “Deep learning detection of penumbral tissue on arterial spin labeling
in stroke”. In: Stroke 51.2 (2020), pp. 489–497.
[42] Jie-Zhi Cheng et al. “Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans”. In:
Scientific reports 6.1 (2016), p. 24454.
[43] Ben A Duffy et al. “Retrospective motion artifact correction of structural MRI images using deep learning improves the quality of cortical surface reconstructions”. In:
Neuroimage 230 (2021), p. 117756.
[44] Umang Gupta et al. “Transferring Models Trained on Natural Images to 3D MRI
via Position Encoded Slice Models”. In: 2023 IEEE 20th International Symposium on
Biomedical Imaging (ISBI). IEEE. 2023, pp. 1–5.
[45] Blake E Dewey et al. “DeepHarmony: A deep learning approach to contrast harmonization across scanner changes”. In: Magnetic resonance imaging 64 (2019), pp. 160–
170.
[46] Mariana Bento et al. “Deep learning in large and multi-site structural brain MR
imaging datasets”. In: Frontiers in Neuroinformatics 15 (2022), p. 805669.
[47] Sihong Chen, Kai Ma, and Yefeng Zheng. “Med3d: Transfer learning for 3d medical
image analysis”. In: arXiv preprint arXiv:1904.00625 (2019).
[48] Stefan M Spann et al. “Robust single-shot acquisition of high resolution whole brain
ASL images by combining time-dependent 2D CAPIRINHA sampling with spatio-temporal TGV reconstruction”. In: Neuroimage 206 (2020), p. 116337.
[49] Xingfeng Shao et al. “Prospective motion correction for 3D GRASE pCASL with volumetric navigators”. In: Proceedings of the International Society for Magnetic Resonance in Medicine Scientific Meeting and Exhibition. Vol. 25. 2017, p. 0680.
[50] Danfeng Xie et al. “Denoising arterial spin labeling perfusion MRI with deep machine
learning”. In: Magnetic resonance imaging 68 (2020), pp. 95–105.
[51] Lei Zhang et al. “Improving sensitivity of arterial spin labeling perfusion MRI in
Alzheimer’s disease using transfer learning of deep learning-based ASL denoising”.
In: Journal of Magnetic Resonance Imaging 55.6 (2022), pp. 1710–1722.
[52] Enhao Gong et al. “Deep learning and multi-contrast-based denoising for low-SNR
Arterial Spin Labeling (ASL) MRI”. In: Medical Imaging 2020: Image Processing.
Vol. 11313. SPIE. 2020, pp. 119–126.
[53] Nicholas J Luciw et al. “Automated generation of cerebral blood flow and arterial
transit time maps from multiple delay arterial spin-labeled MRI”. In: Magnetic resonance in medicine 88.1 (2022), pp. 406–417.
[54] Donghoon Kim et al. “Parametric cerebral blood flow and arterial transit time mapping using a 3D convolutional neural network”. In: Magnetic resonance in medicine
90.2 (2023), pp. 583–595.
[55] Weiying Dai et al. “Continuous flow-driven inversion for arterial spin labeling using
pulsed radio frequency and gradient fields”. In: Magnetic Resonance in Medicine: An
Official Journal of the International Society for Magnetic Resonance in Medicine 60.6
(2008), pp. 1488–1497.
[56] Seongtaek Lee et al. “Relationships between spinal cord blood flow measured with
flow-sensitive alternating inversion recovery (FAIR) and neurobehavioral outcomes in
rat spinal cord injury”. In: Magnetic Resonance Imaging 78 (2021), pp. 42–51.
[57] Kai Wang et al. “Optimization of adiabatic pulses for pulsed arterial spin labeling
at 7 tesla: comparison with pseudo-continuous arterial spin labeling”. In: Magnetic
resonance in medicine 85.6 (2021), pp. 3227–3240.
[58] David A Feinberg, Alexander Beckett, and Liyong Chen. “Arterial spin labeling with
simultaneous multi-slice echo planar imaging”. In: Magnetic resonance in medicine
70.6 (2013), pp. 1500–1506.
[59] Nasim Maleki, Weiying Dai, and David C Alsop. “Optimization of background suppression for arterial spin labeling perfusion imaging”. In: Magnetic Resonance Materials in Physics, Biology and Medicine 25 (2012), pp. 127–133.
[60] Stefan M Spann et al. “Spatio-temporal TGV denoising for ASL perfusion imaging”.
In: Neuroimage 157 (2017), pp. 81–96.
[61] Chenyang Zhao et al. “k-space weighted image average (KWIA) for ASL-based dynamic MR angiography and perfusion imaging”. In: Magnetic resonance imaging 86
(2022), pp. 94–106.
[62] Chenyang Zhao et al. “Low dose CT perfusion with K-space weighted image average
(KWIA)”. In: IEEE transactions on medical imaging 39.12 (2020), pp. 3879–3890.
[63] Richard B Buxton et al. “A general kinetic model for quantitative perfusion imaging
with arterial spin labeling”. In: Magnetic resonance in medicine 40.3 (1998), pp. 383–
396.
[64] Wen-Ming Luh et al. “QUIPSS II with thin-slice TI1 periodic saturation: a method
for improving accuracy of quantitative perfusion imaging using pulsed arterial spin labeling”. In: Magnetic Resonance in Medicine: An Official Journal of the International
Society for Magnetic Resonance in Medicine 41.6 (1999), pp. 1246–1254.
[65] Joseph G Woods et al. “Recommendations for quantitative cerebral perfusion MRI
using multi-timepoint arterial spin labeling: Acquisition, quantification, and clinical
applications”. In: Magnetic resonance in medicine (2024).
[66] Xingfeng Shao et al. “Laminar perfusion imaging with zoomed arterial spin labeling
at 7 Tesla”. In: NeuroImage 245 (2021), p. 118724.
[67] Petros Martirosian et al. “Spatial-temporal perfusion patterns of the human liver
assessed by pseudo-continuous arterial spin labeling MRI”. In: Zeitschrift für Medizinische Physik 29.2 (2019), pp. 173–183.
[68] Fabio Nery et al. “Robust kidney perfusion mapping in pediatric chronic kidney
disease using single-shot 3D-GRASE ASL with optimized retrospective motion correction”. In: Magnetic Resonance in Medicine 81.5 (2019), pp. 2972–2984.
[69] Chenyang Zhao et al. “Whole-cerebrum distortion-free three-dimensional pseudo-continuous arterial spin labeling at 7T”. In: Neuroimage 277 (2023), p. 120251.
[70] Roy AM Haast et al. “Insights into hippocampal perfusion using high-resolution,
multi-modal 7T MRI”. In: Proceedings of the National Academy of Sciences 121.11
(2024), e2310044121.
[71] Hanzhang Lu et al. “MarkVCID cerebral small vessel consortium: II. Neuroimaging
protocols”. In: Alzheimer’s & Dementia 17.4 (2021), pp. 716–725.
[72] Kay Jann et al. “Cross-Vendor Test-Retest Analysis of 3D pCASL Cerebral Blood
Flow”. In: ISMRM 2021, https://index.mirasmart.com/ISMRM2021/PDFfiles/1846.html
(2021).
[73] Bee Lim et al. “Enhanced deep residual networks for single image super-resolution”.
In: Proceedings of the IEEE conference on computer vision and pattern recognition
workshops. 2017, pp. 136–144.
[74] Jieneng Chen et al. “TransUNet: Transformers make strong encoders for medical image
segmentation”. In: arXiv preprint arXiv:2102.04306 (2021).
[75] Minh H Vu et al. “Evaluation of multislice inputs to convolutional neural networks
for medical image segmentation”. In: Medical Physics 47.12 (2020), pp. 6216–6231.
[76] Adam Paszke et al. “PyTorch: An imperative style, high-performance deep learning
library”. In: Advances in neural information processing systems 32 (2019).
[77] Yufen Chen, Danny JJ Wang, and John A Detre. “Test–retest reliability of arterial
spin labeling with common labeling strategies”. In: Journal of Magnetic Resonance
Imaging 33.4 (2011), pp. 940–949.
[78] Emily Kilroy et al. “Reliability of two-dimensional and three-dimensional pseudo-continuous arterial spin labeling perfusion MRI in elderly populations: Comparison
with 15O-water positron emission tomography”. In: Journal of Magnetic Resonance
Imaging 39.4 (2014), pp. 931–939.
[79] DJ Hodkinson et al. “Quantifying the test–retest reliability of cerebral blood flow measurements in a clinical model of on-going post-surgical pain: A study using pseudo-continuous arterial spin labelling”. In: NeuroImage: Clinical 3 (2013), pp. 301–310.
[80] Tianye Lin et al. “Test-retest reliability and reproducibility of long-label pseudo-continuous arterial spin labeling”. In: Magnetic Resonance Imaging 73 (2020), pp. 111–
117.
[81] Kay Jann et al. “Evaluation of cerebral blood flow measured by 3D PCASL as
biomarker of vascular cognitive impairment and dementia (VCID) in a cohort of
elderly Latinx subjects at risk of small vessel disease”. In: Frontiers in Neuroscience
15 (2021), p. 627627.
[82] Melody P Lun, Edwin S Monuki, and Maria K Lehtinen. “Development and functions
of the choroid plexus–cerebrospinal fluid system”. In: Nature Reviews Neuroscience
16.8 (2015), pp. 445–457.
[83] Li Zhao et al. “Non-invasive measurement of choroid plexus apparent blood flow with
arterial spin labeling”. In: Fluids and Barriers of the CNS 17 (2020), pp. 1–11.
[84] Jarrod J Eisma et al. “Choroid plexus perfusion and bulk cerebrospinal fluid flow
across the adult lifespan”. In: Journal of Cerebral Blood Flow & Metabolism 43.2
(2023), pp. 269–280.
[85] Megha Madhukar et al. “Choroid plexus: normal size criteria on neuroimaging”. In:
Surgical and radiologic anatomy 34 (2012), pp. 887–895.
[86] Jong Chul Ye. “Compressed sensing MRI: a review from signal processing perspective”. In: BMC Biomedical Engineering 1.1 (2019), p. 8.
[87] Felix A Breuer et al. “Controlled aliasing in volumetric parallel imaging (2D CAIPIRINHA)”. In: Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 55.3 (2006), pp. 549–556.
[88] Matthias Schloegl et al. “Infimal convolution of total generalized variation functionals
for dynamic MRI”. In: Magnetic resonance in medicine 78.1 (2017), pp. 142–155.
[89] Varsha Jain et al. “Longitudinal reproducibility and accuracy of pseudo-continuous
arterial spin–labeled perfusion MR imaging in typically developing children”. In: Radiology 263.2 (2012), pp. 527–536.
[90] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. “Noise2Void: Learning denoising from single noisy images”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2019, pp. 2129–2137.
[91] Juan Miguel Valverde et al. “Transfer learning in magnetic resonance brain imaging:
A systematic review”. In: Journal of imaging 7.4 (2021), p. 66.
[92] Ahmad Waleed Salehi et al. “A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope”. In: Sustainability 15.7 (2023), p. 5930.
[93] Hee E Kim et al. “Transfer learning for medical image classification: a literature
review”. In: BMC medical imaging 22.1 (2022), p. 69.
[94] Terry K Koo and Mae Y Li. “A guideline of selecting and reporting intraclass correlation coefficients for reliability research”. In: Journal of chiropractic medicine 15.2
(2016), pp. 155–163.
[95] Robin Rombach et al. “High-resolution image synthesis with latent diffusion models”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2022, pp. 10684–10695.
[96] Jorge Zavala Bojorquez et al. “What are normal relaxation times of tissues at 3 T?”
In: Magnetic resonance imaging 35 (2017), pp. 69–80.
[97] László G Nyúl, Jayaram K Udupa, and Xuan Zhang. “New variants of a method of
MRI scale standardization”. In: IEEE transactions on medical imaging 19.2 (2000),
pp. 143–150.
[98] Cecily G Swinford et al. “Altered cerebral blood flow in older adults with Alzheimer’s
disease: a systematic review”. In: Brain imaging and behavior 17.2 (2023), pp. 223–
256.
[99] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Additive logistic regression:
a statistical view of boosting (with discussion and a rejoinder by the authors)”. In:
The annals of statistics 28.2 (2000), pp. 337–407.
[100] Trevor Hastie et al. The elements of statistical learning: data mining, inference, and
prediction. Vol. 2. Springer, 2009.
[101] Wei-Yin Loh. “Improving the precision of classification trees”. In: The Annals of
Applied Statistics (2009), pp. 1710–1737.
[102] Gary King and Langche Zeng. “Logistic regression in rare events data”. In: Political
analysis 9.2 (2001), pp. 137–163.
[103] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic
regression. John Wiley & Sons, 2013.
[104] Yiran Li and Ze Wang. “Deeply Accelerated Arterial Spin Labeling Perfusion MRI
for Measuring Cerebral Blood Flow and Arterial Transit Time”. In: IEEE journal of
biomedical and health informatics (2023).
[105] Joseph G Woods et al. “Time-encoded pseudo-continuous arterial spin labeling: Increasing SNR in ASL dynamic angiography”. In: Magnetic Resonance in Medicine
89.4 (2023), pp. 1323–1341.
[106] Clifford R Jack Jr et al. “Revised criteria for diagnosis and staging of Alzheimer’s
disease: Alzheimer’s Association Workgroup”. In: Alzheimer’s & Dementia (2024).
[107] Heung-Il Suk and Dinggang Shen. “Deep learning-based feature representation for
AD/MCI classification”. In: Medical Image Computing and Computer-Assisted Intervention–
MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013,
Proceedings, Part II 16. Springer. 2013, pp. 583–590.
[108] Junhao Wen et al. “Convolutional neural networks for classification of Alzheimer’s
disease: Overview and reproducible evaluation”. In: Medical image analysis 63 (2020),
p. 101694.
[109] Taeho Jo, Kwangsik Nho, and Andrew J Saykin. “Deep learning in Alzheimer’s disease: diagnostic classification and prognostic prediction using neuroimaging data”. In:
Frontiers in aging neuroscience 11 (2019), p. 220.
[110] Plamen P Angelov et al. “Explainable artificial intelligence: an analytical review”. In:
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11.5 (2021),
e1424.
[111] Christoph Molnar. Interpretable machine learning. Lulu.com, 2020.
[112] Ramprasaath R Selvaraju et al. “Grad-cam: Visual explanations from deep networks
via gradient-based localization”. In: Proceedings of the IEEE international conference
on computer vision. 2017, pp. 618–626.
[113] Alessandro Wollek et al. “Attention-based saliency maps improve interpretability of
pneumothorax classification”. In: Radiology: Artificial Intelligence 5.2 (2022), e220187.
[114] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks
for semantic segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 3431–3440.
Appendix A
CNN and Transformer comparison for patch-based training strategy
Figure A.1 illustrates the training and testing procedure of our methods. For CNN-based networks without a down-sampling and up-sampling (U-Net-like) structure, the trained models can be applied to images of any size, because the trained convolutional kernels simply slide over the input regardless of its dimensions. For the Swin Transformer structure, as described in the original paper [26], attention is calculated inside windows of fixed size, so the image size can vary and affects only the number of windows. For example, a 48×48 image is partitioned into four 24×24 windows, while a 96×96 image is partitioned into sixteen 24×24 windows, and the attention calculation is conducted in parallel across these windows. Shifting the windows changes neither the window size nor the number of windows. The Swin Transformer can therefore be applied flexibly to different image sizes, which is one of its advantages over other Transformer-based networks, as illustrated in the sketch below.
As shown in previous work [114], both patch-wise training and full-image training can be made to reproduce any distribution of the input. In this work, a randomized patch-sampling strategy was used during training to create more patches for weight updates; a sketch of this step is given after this paragraph. To study the difference between patch-wise and full-image training, we conducted a comparative experiment: the same SwinIR model was trained on smaller sub-volumes (48×48×3) and on sub-volumes with the full in-plane matrix size (96×96×3), and both models were tested on Datasets 1 and 2.
Figure A.1: Illustration of patch-wise training and full-image inference for (A) the CNN network and (B) the SwinIR network.
Abstract
Arterial spin labeling (ASL) is a magnetic resonance imaging (MRI) technique that can measure human cerebral blood flow (CBF) non-invasively. However, clinical application of this technique remains challenging due to its intrinsically low signal-to-noise ratio (SNR) and long scan time. Moreover, the heterogeneity of ASL imaging protocols across vendor platforms makes quantification unreliable. Traditional denoising methods usually assume an image model and/or noise characteristics, which may not represent the real data well. Deep learning (DL)-based models can learn the underlying patterns purely from real data, and recent developments of DL in image processing and image generation provide powerful tools to improve clinical applications of medical imaging, such as improving image quality and reducing acquisition time. However, while a handful of studies have demonstrated the feasibility of DL applications for ASL, large gaps remain in the reliable application of DL methods for improving the clinical use of ASL across multiple vendor platforms with different imaging protocols (e.g., single-delay and multi-delay).
The purpose of this work is to adapt, optimize, and apply some of the latest DL techniques, including Transformers and diffusion models, to improve the clinical translation of ASL by enhancing image quality and/or reducing scan time, and by generating a missing modality to enable CBF quantification and improve standardization across vendors in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset.
There are three specific aims in this study. In the first aim, a flexible Transformer-based DL denoising scheme will be developed and evaluated for 3D ASL to improve SNR and/or reduce scan time for both single-delay and multi-delay ASL data. Our hypothesis is that, with the completion of aim 1, the trained model will improve the image quality of ASL acquired from multiple vendors without introducing bias in the quantification of CBF and/or arterial transit time (ATT).
In the second aim, the proposed DL framework and the trained model will be adapted to a high-resolution pediatric multi-delay ASL dataset for perfusion imaging of the pediatric choroid plexus. Since there are no reference images for this cutting-edge application, self-supervised learning techniques will be explored. We will compare the performance of the proposed DL method with a state-of-the-art conventional denoising method, total generalized variation (TGV). Our hypothesis is that, with the completion of aim 2, the proposed DL method will outperform the traditional method, improving both image quality and test-retest reliability for pediatric choroid plexus perfusion imaging.
In the third aim, a generative diffusion model will be applied to generate the M0 image from the control image for Siemens 3D pulsed ASL (PASL) scans in the ADNI-3 dataset, where M0 is not acquired. The generated M0 can then be used to quantify CBF for analysis of CBF variations among subjects with normal cognition, mild cognitive impairment (MCI), and Alzheimer's disease (AD). This can help resolve the heterogeneity of different types of ASL scans and standardize the quantification of CBF in the AD dataset. Our hypothesis is that, with the completion of aim 3, Siemens 3D PASL scans in the ADNI-3 dataset can be used to quantify CBF from the acquired control images and the generated M0, and that analysis of CBF data from different MR vendors will reveal similar patterns of quantitative CBF deficits in MCI and AD subjects relative to cognitively normal subjects, providing better differentiation between AD and normal subjects than non-standardized perfusion images.
In conclusion, with the completion of the three specific aims, we will show that the latest DL methods, such as Transformers and diffusion models, have the potential to improve ASL in clinical applications through enhanced image quality and better standardization.