EFFICIENT DEEP LEARNING FOR INVERSE PROBLEMS
IN SCIENTIFIC AND MEDICAL IMAGING
by
Zalan Fabian
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2024 Zalan Fabian
Acknowledgements
First, I would like to extend my deepest gratitude to my PhD advisor, Mahdi Soltanolkotabi, for his invaluable guidance, unwavering support and encouragement throughout my PhD journey. His great expertise
and insight have fundamentally shaped my research, and I am grateful to have had the opportunity to work
under his mentorship. This work would not have been possible without him.
I would like to extend my appreciation to the members of my qualifying exam and dissertation committees, Krishna Nayak, Vatsal Sharan, Salman Avestimehr, Robin Jia and Hamid Palangi. Their thoughtful
feedback and advice have greatly improved the quality of this dissertation.
I would like to thank my research collaborator at TUM, Reinhard Heckel, for his guidance and support
and for the useful intellectual exchange between our labs. Furthermore, I would like to extend my gratitude to my collaborators at the department, Richard Leahy and Justin Haldar for sharing their invaluable
expertise.
I am thankful for my lab mates and fellow graduate students for their camaraderie and helpful discussions. Special thanks to Berk Tinaz for the countless brainstorming sessions and the shared enthusiasm
for our work.
My heartfelt appreciation goes to my family and friends for their endless support, understanding and
encouragement throughout my PhD.
This dissertation would not have been possible without the contributions of all these individuals. Thank
you for being an integral part of my academic journey.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem formulation and classical approaches . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Deep learning for inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Road map of this dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: 3D phase retrieval at nano-scale via Accelerated Wirtinger Flow . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Phaseless imaging in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 From 3D object to 2D exit waves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 From 2D exit waves to phaseless measurements. . . . . . . . . . . . . . . . . . . . 11
2.2.3 Ambiguity challenge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Reconstruction via 3D-AWF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Convergence theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 3: Data augmentation for deep learning based accelerated MRI reconstruction with
limited data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Background and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Accelerated MRI acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Traditional accelerated MRI reconstruction methods . . . . . . . . . . . . . . . . . 33
3.2.3 Deep learning based MRI reconstruction methods . . . . . . . . . . . . . . . . . . . 33
3.3 MRAugment: a data augmentation pipeline for MRI . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Data augmentation needs to preserve noise statistics . . . . . . . . . . . . . . . . 35
3.3.2 Transformations used for data augmentation . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Scheduling and application of data augmentations . . . . . . . . . . . . . . . . . . 40
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Low-data regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 High-data regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.4 Model robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.5 Naive data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter 4: HUMUS-Net: Hybrid Unrolled Multi-scale Network Architecture for Accelerated MRI
Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Inverse Problem Formulation of Accelerated MRI Reconstruction . . . . . . . . . . 54
4.2.2 Deep Learning-based Accelerated MRI Reconstruction . . . . . . . . . . . . . . . . 56
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 HUMUS-Block Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Multi-scale Hybrid Feature Extraction via MUST . . . . . . . . . . . . . . . . . . . 62
4.4.3 Iterative Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.4 Adjacent Slice Reconstruction (ASR) . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.1 Benchmark Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5.3 Direct comparison of image-domain denoisers . . . . . . . . . . . . . . . . . . . . 70
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 5: DiracDiffusion: Denoising and Incremental Reconstruction with Assured Data-Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Degradation severity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Deterministic and stochastic degradation processes . . . . . . . . . . . . . . . . . . 78
5.3.3 SDP as a Stochastic Differential Equation . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.4 Denoising - learning a score network . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.5 Incremental reconstructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.6 Data consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.7 Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.8 Perception-distortion trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.9 Degradation scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Conclusions and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 6: Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models . . . . 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.1 Severity encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.2 Sample-adaptive Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4.1 Severity encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.4.2 Sample-adaptive reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Chapter 7: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
A Appendix to 3D Phase retrieval at nano-scale via Accelerated
Wirtinger Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
A.1 Proof of Theorem 2.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
B Appendix to Data augmentation for deep learning based accelerated MRI reconstruction
with limited data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
B.1 Experiments on the fastMRI dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 146
B.1.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
B.1.2 Additional experimental results on the fastMRI dataset . . . . . . . . . . 147
B.2 Experiments on the Stanford 2D FSE dataset . . . . . . . . . . . . . . . . . . . . . 150
B.3 Experiments on the Stanford Fullysampled 3D FSE Knees dataset . . . . . . . . . . 151
B.4 Robustness experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
B.5 Ablation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
C Appendix to HUMUS-Net: Hybrid unrolled multi-scale network
architecture for accelerated MRI reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 159
C.1 HUMUS-Net baseline details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
C.2 Ablation study experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . 160
C.3 Results on additional accelerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
C.4 Effect of the number of unrolled iterations . . . . . . . . . . . . . . . . . . . . . . . 161
C.5 Effect of Adjacent Slice Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . 162
C.6 Iterative denoising visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
C.7 Vision Transformer terminology overview . . . . . . . . . . . . . . . . . . . . . . 164
C.8 Detailed validation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
C.9 Additional figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
D Appendix to DiracDiffusion: Denoising and Incremental
Reconstruction with Assured Data-Consistency . . . . . . . . . . . . . . . . . . . . . . . . 169
D.1 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
D.1.1 Denoising score-matching guarantee . . . . . . . . . . . . . . . . . . . . 169
D.1.2 Theorem 3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
D.1.3 Incremental reconstruction loss guarantee . . . . . . . . . . . . . . . . . 172
D.1.4 Theorem 3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
D.2 Degradation scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
D.3 Guidance details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
D.4 Note on the output of the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 180
D.5 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
D.6 Incremental reconstruction loss ablations . . . . . . . . . . . . . . . . . . . . . . . 184
D.7 Further incremental reconstruction approximations . . . . . . . . . . . . . . . . . . 185
D.8 Comparison methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
D.8.1 DPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
D.8.2 DDRM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
D.8.3 PnP-ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
D.8.4 ADMM-TV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
D.9 Further reconstruction samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
E Appendix to Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
E.1 Training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
E.2 Comparison method details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
List of Tables
3.1 Scanner transfer results using 2% (top) and 100% (bottom) of training data. . . . . . . . . 50
4.1 Performance of state-of-the-art accelerated MRI reconstruction techniques on the fastMRI
knee test dataset. Most models are trained only on the fastMRI training dataset, if available
we show results of models trained on the fastMRI combined training and validation dataset
denoted by (†). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Validation SSIM of HUMUS-Net on various datasets. For datasets with multiple train-validation split runs we show the mean and standard error of the runs. . . . . . . . . . . . 66
4.3 Results of ablation studies on HUMUS-Net, evaluated on the Stanford 3D MRI dataset. . . 66
4.4 Direct comparison of denoisers on the Stanford 3D dataset. Mean and standard error of 3
random training-validation splits is shown. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Experimental results on the FFHQ (top) and ImageNet (bottom) test splits. . . . . . . . . . 90
6.1 Experimental results on the FFHQ test split. . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Robustness experiments on the FFHQ test split. . . . . . . . . . . . . . . . . . . . . . . . . 113
A1 Number of available slices for each scanner type in the train and validation splits of the
fastMRI dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A2 Data augmentation configuration for all fastMRI experiments. . . . . . . . . . . . . . . . . 153
A3 Comparison of peak validation SSIM applying various sets of augmentations on 1% of
fastMRI training data, multi-coil acquisition. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A4 Comparison of peak validation SSIM using different augmentation probability schedules
on 1% of fastMRI training data, multi-coil acquisition. . . . . . . . . . . . . . . . . . . . . . 153
A5 Ablation study experimental details. We show the number of STL layers per RSTB block
and the number of attention heads for multi-scale networks in downsampling (D), bottleneck
(B) and upsampling (U) paths separately. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
A6 Experiments on various acceleration factors on the Stanford 3D dataset. Mean and
standard error of 3 random training-validation splits is shown. . . . . . . . . . . . . . . . 161
A7 Results of the adjacent slice reconstruction ablation study on the Stanford 3D dataset.
Mean and standard error over 3 random train-validation splits is shown. ASR improves
the performance of E2E-VarNet. However, HUMUS-Net outperforms E2E-VarNet in all
cases even without ASR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
A8 Detailed validation results of HUMUS-Net on various datasets. For datasets with multiple
train-validation split runs we show the mean and standard error of the runs. . . . . . . . . 166
A9 Architectural hyper-parameters for the score-models for Dirac (top) and other diffusion-based methods (bottom) in our experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . 183
A10 Settings for perception optimized (PO) and distortion optimized (DO) sampling for all
experiments on test data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
List of Figures
2.1 Projection approximation of wave propagation. Incident beam $B_{r'_x,r'_y}$ passes through the 3D object, and produces a pixel in the projection plane at $(r'_x, r'_y)$. $I_{r'_x,r'_y}$ denotes the set of voxel indices intersected by the beam. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 In ptychography, the sample is illuminated by a probe function from various angles. The
diffraction pattern in the far field is the Fourier transform of the exit wave multiplied by
the probe function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Evolution of relative reconstruction error before and after correction across iterations.
L = 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Relative reconstruction error before and after correction vs. number of angles. . . . . . . 21
2.5 Difference between the exponential model and its linear approximation. We plot the
normalized mean squared error between an exit wave obtained from the non-linear
model and the linearized model at various wavelengths (normalized by the wavelength
used in the experiment). At high energies (short wavelengths) the linear approximation
significantly deviates from the exponential model. . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Normalized pixelwise squared difference between exit waves calculated from the
exponential propagation model and the linearized model. The error is significantly higher
at pixels resulting from the illuminating beam passing through metallic parts, such as
copper interconnects in the object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Magnitude of ground truth of a slice (x − y plane at z = 1) of 3D-AWF and 2-Step
reconstructions after correction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Phase of ground truth of a slice (x − y plane at z = 60) of 3D-AWF and 2-Step
reconstructions after correction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 3D rendering of the magnitude and phase of the ground truth and reconstructed volumes
using 3D-AWF and 2-Step (L = 100). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Flowchart of MRAugment, our data augmentation pipeline for MRI. . . . . . . . . . . . . . 36
3.2 Transformations used in MRAugment applied to a ground truth slice. . . . . . . . . . . . . 40
3.3 Experimental results on the Stanford 2D FSE dataset. . . . . . . . . . . . . . . . . . . . . . 44
3.4 Visual comparison of reconstructions on the Stanford 2D FSE dataset with and without
data augmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Experimental results on the Stanford Fullysampled 3D FSE dataset. . . . . . . . . . . . . . 45
3.6 Visual comparison of single-coil (top row) and multi-coil (bottom-row) reconstructions
using varying amounts of training data with and without data augmentation. We achieve
reconstruction quality comparable to the state of the art but using 1% of the training data.
Without DA fine details are completely lost. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 Single-coil (left) and multi-coil (right) validation SSIM vs. # of training images. . . . . . . 46
3.8 Hallucinated features appear on reconstructions without data augmentation. . . . . . . . . 49
3.9 Val. performance of models trained on knee and evaluated on brain MRI. . . . . . . . . . . 50
3.10 Experimental results comparing MRAugment with naive data augmentation. . . . . . . . . 50
4.1 Overview of the HUMUS-Block architecture. First, we extract high-resolution features
FH from the input noisy image through a convolution layer fH. Then, we apply a
convolutional feature extractor fL to obtain low-resolution features and process them
using a Transformer-convolutional hybrid multi-scale feature extractor. The shallow,
high-resolution and deep, low-resolution features are then synthesized into the final
high-resolution denoised image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Depiction of different RSTB modules used in the HUMUS-Block. . . . . . . . . . . . . . . . 63
4.3 Patch merge and expand operations used in our multi-scale feature extractor. . . . . . . . 63
4.4 Architecture of convolutional blocks for feature extraction and reconstruction. . . . . . . 63
4.5 Adjacent slice reconstruction (depicted in image domain for visual clarity): HUMUS-Net takes a volume of adjacent slices $(\tilde{x}^{c-a}, \ldots, \tilde{x}^{c}, \ldots, \tilde{x}^{c+a})$ and jointly reconstructs a volume $(\hat{x}^{c-a}, \ldots, \hat{x}^{c}, \ldots, \hat{x}^{c+a})$. The reconstruction loss $L$ is calculated only on the center slice $\hat{x}^{c}$. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1 Severity of degradations: We can always find a more degraded image $y_{t''}$ from a less degraded version of the same clean image $y_{t'}$ via the forward degradation transition function $G_{t'\to t''}$, but not vice versa. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Perception-distortion trade-off on CelebA-HQ deblurring: distortion metrics initially
improve, peak fairly early in the reverse process, then gradually deteriorate, while
perceptual metrics improve. We plot the mean of n = 30 trajectories starting from the
same initial noisy measurement. The shaded area depicts ±std. We quantify uncertainty
via mean pixel-wise standard deviation across different reverse process trajectories. We
observe low uncertainty at the distortion peak, with gradual increase during the reverse
process. Image regions with fine details correspond to high uncertainty in the final
reconstructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Left: Data consistency curves for FFHQ inpainting. $\epsilon_{dc} := \|\tilde{y} - \mathcal{A}_1(\hat{x}_0(y_t))\|^2$ measures how consistent the clean image estimate is with the original noisy measurement. We expect $\epsilon_{dc}$ to approach the noise floor $\sigma_1^2 = 0.0025$ in the case of perfect data consistency. We plot $\bar{\epsilon}_{dc}$, the mean over the validation set. Dirac maintains data consistency throughout the reverse process. Center: Data consistency is not always achieved with DPS. Right: Number of reverse diffusion steps vs. perceptual quality. . . . . . . . . . . . . . . . . . . 89
5.4 Visual comparison of reconstructions on images from FFHQ (top 2 rows) and ImageNet
(bottom 2 rows) on the Gaussian deblurring task. . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Visual comparison of reconstructions on images from FFHQ (top 2 rows) and ImageNet
(bottom 2 rows) on the inpainting task with Gaussian masks. . . . . . . . . . . . . . . . . . 91
6.1 Overview of FlashDiffusion: we estimate the degradation severity of corrupted images
in the latent space of an autoencoder. We leverage the severity predictions to find the
optimal start time in a latent reverse diffusion process on a sample-by-sample basis. As a
result, inference cost is automatically scaled by the difficulty of the reconstruction task at
test time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 The optimal number of reverse diffusion steps varies depending on the severity of
degradations. Fixing the number of steps results in over-diffusing some samples, whereas
others could benefit from more iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 Effect of degradation on predicted severities: given a ground truth image corrupted by
varying amount of blur, σˆ is an increasing function of the blur amount. . . . . . . . . . . . 105
6.4 Left: Blur amount (t) vs. predicted degradation severity (σˆ). Outliers indicate that the
predicted degradation severity is not solely determined by the amount of blur. The
bottom image is ’surprisingly easy’ to reconstruct, as it is overwhelmingly smooth with
features close to those seen in the training set. The top image is ’surprisingly hard’, due
to more high-frequency details and unusual features not seen during training. Right:
Contributions to predicted severity. Degraded images with approx. the same σˆ (red dots
on left plot) may have different factors contributing to the predicted severity. The main
contributor to σˆ in the top image is the image degradation (blur), whereas the bottom
image is inherently difficult to reconstruct. See Section 6.4.1 for further detail. . . . . . . . 109
6.5 Visual comparison of FFHQ reconstructions under varying levels of Gaussian blur (top
2 rows) and nonlinear motion blur (bottom 2 rows), both with additive Gaussian noise
(σ = 0.05). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 Comparison of adaptive reconstruction with fixed number of diffusion steps. Left and
center: We plot the average number of reverse diffusion steps performed by our algorithm
vs. CCDF-L with a fixed number of steps. We observe the best FID and near-optimal
LPIPS across any fixed number of steps using our method. Right: We plot the histogram
of predicted number of reverse diffusion steps for our algorithm. The spread around the
mean highlights the adaptivity of our proposed technique. . . . . . . . . . . . . . . . . . . 112
A1 Single-coil (left) and multi-coil (right) validation PSNR vs. # of training images. . . . . . . 148
A2 MRAugment recovers additional details in the moderate-data regime when 10% of fastMRI
training data is used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A3 Results on the i-RIM network. We achieve SSIM comparable to the baseline with only
10% of the training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A4 Visual comparison of reconstructions on the Stanford Fullysampled 3D FSE Knees dataset
under various amounts of training data, with and without data augmentation. . . . . . . . . 151
A5 Visual comparison of fastMRI single-coil reconstructions presented in Figure 3.6 extended
with additional images corresponding to various amounts of training data. . . . . . . . . . . 155
A6 Visual comparison of fastMRI multi-coil reconstructions presented in Figure 3.6 extended
with additional images corresponding to various amounts of training data. . . . . . . . . . . 155
A7 Visual comparison of fastMRI single-coil reconstructions using varying amounts of
training data with and without data augmentation. . . . . . . . . . . . . . . . . . . . . . . 156
A8 Visual comparison of fastMRI multi-coil reconstructions using varying amounts of
training data with and without data augmentation. . . . . . . . . . . . . . . . . . . . . . . 157
A9 Visual comparison of reconstructions on the Stanford 2D FSE dataset under various
amounts of training data, with and without data augmentation. . . . . . . . . . . . . . . . . 158
A10 Validation SSIM as a function of the number of cascades (unrolled iterations) in HUMUS-Net
on the Stanford 3D dataset. We observe a steady increase in reconstruction performance
with more cascades. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A11 Visualization of intermediate reconstructions in HUMUS-Net. . . . . . . . . . . . . . . . . 164
A12 Swin Transformer Layer, the fundamental building block of the Residual Swin Transformer
Block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A13 Depiction of iterative unrolling with sensitivity map estimator (SME). HUMUS-Net
applies a highly efficient denoiser to progressively improve reconstructions in a cascade
of sub-networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A14 Patch merge and expand operations used in MUST. . . . . . . . . . . . . . . . . . . . . . . 167
A15 Visual comparison of reconstructions from the fastMRI knee dataset with ×8 acceleration.
HUMUS-Net reconstructs fine details on MRI images that other state-of-the-art methods
may miss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A16 Results of degradation scheduling from Algorithm 3. Left: Gaussian blur with kernel
std wt on CelebA-HQ. Center: inpainting with Gaussian mask with kernel width wt on
CelebA-HQ. Right: inpainting with Gaussian mask on ImageNet. . . . . . . . . . . . . . . . 176
A17 Effect of guidance step size on best reconstruction in terms of LPIPS. We perform
experiments on the CelebA-HQ validation set on the deblurring task. . . . . . . . . . . . . 180
A18 Effect of incremental reconstruction loss step size on the CelebA-HQ validation set for
deblurring (left) and inpainting (middle). Visual comparison of inpainted samples is
shown on the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
A19 Distortion and Perception optimized deblurring results for the CelebA-HQ dataset (test
split). Uncertainty is calculated over n = 10 reconstructions from the same measurement. 191
A20 Distortion and Perception optimized inpainting results for the CelebA-HQ dataset (test
split). Uncertainty is calculated over n = 10 reconstructions from the same measurement.
For distortion optimized runs, images are generated in one-shot, hence we don’t provide
uncertainty maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
A21 Distortion and Perception optimized deblurring results for the ImageNet dataset (test
split). Uncertainty is calculated over n = 10 reconstructions from the same measurement. 193
A22 Distortion and Perception optimized inpainting results for the ImageNet dataset (test
split). Uncertainty is calculated over n = 10 reconstructions from the same measurement.
For distortion optimized runs, images are generated in one-shot, hence we don’t provide
uncertainty maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Abstract
A multitude of scientific problems can be formulated as an inverse problem, where the goal is to recover
a clean signal from noisy or corrupted measurements. Developing efficient algorithms for solving inverse
problems is crucial in advancing a wide range of fields, spanning from MRI and CT reconstruction to imaging protein structures and even microchips. Deep learning has recently been extremely successful in various applications in computer vision; however, much of this success can be attributed to training on massive,
often internet-scale, datasets. For many important scientific and medical problems, collecting
such large, labeled datasets for training is prohibitively costly. Moreover, processing complex scientific
data often requires special considerations when compared with simple image or text data found on the
internet. Our goal is to remove these roadblocks from efficient deep learning for scientific and medical
imaging inverse problems. Our three-pronged approach is aimed at 1) reducing the cost and time required
to acquire scientific data, 2) reducing the reliance of deep learning models on large amounts of training data
and 3) improving the compute efficiency of models in order to tackle the complexity of scientific data processing. In this work, we adapt powerful deep learning models, such as vision transformers and diffusion
models, for a variety of imaging inverse problems and highlight their great potential in revolutionizing
scientific and medical applications.
Chapter 1
Introduction
1.1 Motivation
Deep learning is a computational paradigm in machine learning that leverages artificial neural networks to
learn patterns in data. Deep learning models are composed of a stack of network layers that progressively
transform the input to extract information relevant to the task at hand. Even though the first multilayer
perceptron [138] learning algorithms have been around since the 1960s, recent advances in hardware
(namely GPUs) have elicited a revolution in deep learning that started in the late 2000s, transformed machine
learning by the mid-2010s, and is still rapidly developing today.
A key contributor to the recent success of deep learning is the increased ease of access to data through
the internet. Contemporary state-of-the-art models are often trained on massive, internet-scale datasets
that can be collected at a low cost from publicly available sources online. Compared with other machine
learning techniques, the performance of deep learning is typically unparalleled where training data is
abundant. In fact, the contemporary landscape of general domain (i.e. data is easily accessible at scale)
computer vision (CV) and natural language processing (NLP) is dominated by deep learning approaches.
Beyond the undeniable success in generic CV and NLP tasks, deep learning has shown great promise in
addressing a swath of highly impactful problems in the scientific and medical domain with the potential to
fundamentally transform entire fields. Deep learning has been successfully deployed to tackle the protein
folding problem with atomic accuracy [77], expand the number of stable materials known to humanity by
an order of magnitude [109] and to control plasma configurations for nuclear fusion reactors [36]. These
remarkable feats are attributed to the ability of deep learning algorithms to efficiently extract patterns
from data that would otherwise require immense human labor and intricate scientific instruments.
Despite all of these promising advances, it is important to understand the current limitations of deep
learning and how these limitations impact its deployment to scientific and medical imaging tasks. In
particular, a crucial drawback of deep learning is its heavy reliance on large training datasets for optimal
performance. However, collecting large-scale datasets may pose a significant challenge for many scientific
and medical applications. First, acquiring scientific data is often time consuming due to physical limitations
of the sensor and the need for careful calibration procedures. Second, the costs associated with collecting
such data can be extremely high due to expensive instruments and high operating expenses. Moreover,
the data acquisition process can even be harmful to the subject, as it is often the case in X-ray imaging
techniques. Finally, the target real-world phenomenon may have a rare occurrence, severely limiting the
amount of data that can be collected (e.g. rare diseases).
An additional weakness of deep learning models is their lack of robustness in the face of often minor deviations from the training setup. Therefore, the real-world environment where the model is to be deployed
needs to be closely matched during training in order to minimize the distribution shift. This can be extremely challenging or even impossible in many scientific problems, where simulating the subtleties of
real data is infeasible. Moreover, deep learning models are known to break down when their inputs are
exposed to small, carefully selected ’adversarial’ perturbations imperceptible to the human eye, rendering
the model vulnerable to attacks from bad actors. However, in the scientific and medical domains the stakes
are much higher (e.g. medical diagnosis) than in common AI applications where deep learning has already
been widely adopted. Reliability is a fundamental requirement that often comes before model performance
and can be a matter of life and death in safety-critical applications.
Compounding the challenges associated with data scarcity and reliability, scientific data is often more
complex than data in the general domain (e.g. multi-channel measurements, complex-valued images, high
floating point precision), imposing stringent requirements on the compute-efficiency of the deployed deep
learning technique. Beyond computational challenges, the underlying patterns in scientific observations
can be exceedingly intricate, calling for innovative algorithmic solutions able to effectively reason over and
learn from such data.
In order to unlock the full potential of deep learning in scientific and medical applications, it is therefore
imperative to address the above challenges by developing novel deep learning techniques that are efficient
in terms of both data and computational resources and able to adapt to the unique requirements imposed
by such critical applications.
1.2 Problem formulation and classical approaches
We acquire information about the physical world by observing noisy physical processes through imperfect sensors. The image sensor in our camera captures light reflected from an object and turns it into an
image by converting light waves into electrical signals. The resulting image may be corrupted by noise
due to thermal effects and quantum fluctuations in the sensor and distorted by blur effects due to motion
or improper focus. Thus, the measurement (i.e. the resulting image) carries only a fraction of the information that entered the camera. In other cases, such as accelerated MRI or X-ray computed tomography,
acquisition itself may be costly or harmful and therefore we are forced to collect limited measurements
insufficient to fully characterize the signal of interest. All of these problems can be described as an inverse
problem in the form
y = A (x) + n, (1.1)
where $x \in \mathbb{R}^n$ denotes the clean signal that we would like to recover from the noisy and possibly degraded
measurement $y \in \mathbb{R}^m$. Here, $A(\cdot)$ denotes the forward operator that represents the data acquisition
process and $n \in \mathbb{R}^m$ is measurement noise. Typically, $A(\cdot)$ is noninvertible, and thus perfect recovery of
$x$ from $y$ is not possible. A multitude of problems in computational and medical imaging admit the form
in (1.1).
The popular problem of image deblurring [189] can be written as
y = Dx + n, (1.2)
where D is a linear operator representing convolution with a known blur kernel (e.g. Gaussian blur or
motion blur). The operator D acts as a low-pass filter; thus, information is fundamentally lost in the
measurement process. Another example is accelerated MRI [104]: the MRI scanner obtains noisy measurements in the Fourier domain (also called k-space); however, acquiring fully-sampled measurements is
time-consuming. Thus, the scan can be accelerated by undersampling in k-space, that is, acquiring only
a subset of the Fourier-domain information necessary for perfect reconstruction. Accelerated MRI can be
written as an inverse problem in the form
y = MF Sx + n, (1.3)
where S represents the sensitivity map of receiver coils (determined by the physical placement of the
coils), F is a multi-dimensional Fourier transform and M is a masking operator that sets missing (unsampled) frequency components to 0. Further examples of scientific and medical imaging problems that
can be represented as an inverse problem include computed tomography, ptychography, cryogenic electron
microscopy and many more. Moreover, most image restoration problems, such as denoising, inpainting,
superresolution or JPEG compression artifact removal can be formulated as inverse problems.
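To make the forward models in (1.2) and (1.3) concrete, the minimal NumPy sketch below implements both operators. The function names, array shapes, noise handling and seeds are illustrative assumptions and not part of the dissertation's implementation.

```python
import numpy as np

def blur_forward(x, kernel, noise_std=0.01, seed=0):
    """Deblurring forward model (1.2): y = D x + n, where D is a circular convolution
    with a known blur kernel and n is additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    Dx = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kernel, s=x.shape)))
    return Dx + noise_std * rng.standard_normal(x.shape)

def mri_forward(x, coil_maps, mask, noise_std=0.01, seed=0):
    """Accelerated MRI forward model (1.3): y = M F S x + n for a stack of receiver coils."""
    rng = np.random.default_rng(seed)
    coil_images = coil_maps * x[None, ...]            # S x: weight the image by each coil sensitivity
    kspace = np.fft.fft2(coil_images, norm="ortho")   # F: per-coil 2D Fourier transform
    noise = noise_std * (rng.standard_normal(kspace.shape)
                         + 1j * rng.standard_normal(kspace.shape))
    return mask[None, ...] * (kspace + noise)         # M: keep only the sampled frequencies
```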
Due to the lossy forward operator and measurement noise, perfect recovery of the signal from the
observations is not possible. Compressed sensing [42] methods thus incorporate prior knowledge on clean
signals into the reconstruction process in the form of structural assumptions on x (typically sparsity in
some transform domain). One can formulate this as an optimization problem written as
$$\hat{x} = \arg\min_{x} \; \|y - A(x)\|^2 + R(x), \qquad (1.4)$$
where $\hat{x}$ is the reconstructed signal, $R$ is a regularizer that incorporates prior knowledge on $x$, and the norm
in the first term depends on the measurement noise (e.g. $\ell_2$ for Gaussian noise). In traditional compressed
sensing approaches, $R$ is tailored to the specific problem; for instance, $R(x) = \|\Psi x\|_{\ell_1}$ is a popular choice
[188], where $\Psi$ represents some analysis transform (e.g. wavelet) and $\|\cdot\|_{\ell_1}$ induces sparsity. Even though
the compressed sensing framework is supported by rigorous theoretical guarantees and the hand-crafted
regularizer has well-understood properties, more recently data-driven methods that learn to reconstruct
from training examples have achieved superior performance.
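As a concrete instance of (1.4), the sketch below runs proximal gradient descent (ISTA) with an $\ell_1$ regularizer. For simplicity it assumes a linear forward operator given as a matrix and sparsity directly in the signal domain (i.e. $\Psi$ is the identity), which is a simplification of the analysis-transform formulation above rather than a method used in this dissertation.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(y, A, lam=0.1, n_iters=200):
    """Proximal gradient descent for min_x ||y - A x||_2^2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    L = 2.0 * np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the data-fidelity gradient
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ x - y)        # gradient of ||y - A x||_2^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```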
1.3 Deep learning for inverse problems
Deep learning has achieved great success in solving many imaging inverse problems, far surpassing traditional signal processing and machine learning approaches. Deep learning has established new state of the art in image deblurring [195], superresolution [182] and inpainting [128]. Furthermore, it has demonstrated
great promise in medical imaging, such as MRI reconstruction [159], with some deep learning techniques
even obtaining FDA approval for CT imaging [169]. Moreover, deep learning has garnered much interest
in computational microscopy with success in both reconstructing images [114] and designing the microscope's illumination pattern [85]. In the context of computational imaging, deep learning has enabled low-light
imaging [20] and depth estimation of objects in a scene [50], and has even been used for white balancing in
the production version of smartphones [98]. A comprehensive overview can be found in [122].
Despite the promising results, scientific problems pose additional challenges to deep learning that are
not present in applications where the technology is already widely adopted, such as natural language processing or generic (natural) image recognition. First, in many scientific problems training data is extremely
limited. This can be due to the costly or harmful data acquisition process, data confidentiality or even the
rare occurrence of the phenomenon under study (e.g. rare diseases). Furthermore, scientific and medical
data can be more complex and computationally demanding to process due to high precision requirements,
complex-valued data, and multiple data channels leading to extremely high dimensionality. For instance,
a typical MRI scan consists of dozens of (spatially) 2D slices, each with complex-valued frequency components describing phase and magnitude and split between multiple coil channels. Processing such data
requires additional considerations in terms of data pre-processing, network architecture and training.
1.4 Road map of this dissertation
In this work, we provide techniques to address the above difficulties that may hinder the wider adoption of
artificial intelligence techniques in scientific and medical applications. Our approach is threefold. First, we
address the challenge of costly data acquisition by reducing the amount of measurement data required for
high quality reconstruction in a 3D computational microscopy application in Chapter 2. Second, we propose an effective data augmentation scheme for accelerated MRI reconstruction in Chapter 3 that markedly
improves reconstruction performance in the low-data setting, a common occurrence in medical imaging.
Third, we introduce a compute-efficient deep learning architecture for MRI reconstruction in Chapter 4
that is tailored to the complexities of raw multicoil k-space data and establishes new state of the art on
a large competitive MRI reconstruction benchmark. Finally, we explore the novel opportunities of diffusion models, a class of powerful generative deep learning models, in image reconstruction. We propose a
state-of-the-art method in Chapter 5 that can flexibly navigate the trade-off between high perceptual image quality and faithfulness to the measurement dictated by the so-called perception-distortion trade-off.
Moreover, we introduce a novel computational framework in Chapter 6 that can adapt the reconstruction
cost to the corruption level at test time, achieving great efficiency through sample-adaptivity.
Chapter 2
3D phase retrieval at nano-scale via Accelerated Wirtinger Flow
Acquiring scientific data at scale can pose a significant challenge due to the time-consuming measurement
procedure, high costs and potential damage to the subject. In this chapter, we introduce a computational
technique to recover the 3D structure of nano-scale objects with a significantly reduced number of measurements compared to prior art, thus accelerating image acquisition while simultaneously reducing the
cost associated with ptychographic reconstruction in 3D.
This chapter is based on the following works:
• Zalan Fabian, Justin Haldar, Richard Leahy, and Mahdi Soltanolkotabi. "3D phase retrieval at nanoscale via accelerated Wirtinger flow." In 2020 28th European Signal Processing Conference (EUSIPCO), pp. 2080-2084. IEEE, 2021.
• Zalan Fabian, Justin Haldar, Richard Leahy, and Mahdi Soltanolkotabi. "3D phase retrieval at nanoscale via accelerated Wirtinger flow." arXiv preprint arXiv:2002.11785 (2020).
2.1 Introduction
Imaging nano-structures at fine resolution has become increasingly important in diverse fields of science
and engineering. For instance, quality control/examination of modern multi-layered integrated circuits requires detailed imaging of intricate 3D structures with 10nm features. Similarly, real-time non-destructive
imaging of biological specimens, such as protein complexes, on the molecular scale could provide invaluable insight into many biological processes that are little understood. Imaging at finer resolution necessitates high-energy beams with shorter wavelengths. Building optical components such as mirrors and
lenses on this scale is very difficult and often phaseless coherent diffraction methods are required. This
necessity triggered a major revival in phaseless imaging techniques and experiments [1, 29, 110, 115, 124,
145, 39, 41, 67, 146, 162, 69, 164, 163, 200, 18] as well as algorithms for phase retrieval. See [186] for a
comprehensive overview of algorithmic approaches and [10, 17, 176] for theoretical work. Some authors
leverage prior knowledge on the signal structure such as sparsity [74, 150] in order to further decrease the
necessary number of measurements.
Despite all of this recent progress on phaseless reconstruction methods, there has been significantly
less focus on 3D imaging at nano-scale. We briefly discuss a few recent efforts. [108] uses a multi-slice
approach to image thick specimens in 3D, where the wave front is propagated through the object layer
by layer. Authors in [165] use this multi-slice forward model combined with Fourier-ptychography to
successfully reconstruct thick biological samples. [112] uses filtered backprojection for the reconstruction
of flat specimens. [41] uses a two step approach where they first reconstruct 2D projections of the object
from phaseless measurements, then obtain the 3D structure via tomography from the 2D reconstructions.
A more recent line of work investigates a joint technique that alternates between a ptychography step on
exit waves and a tomographic reconstruction step on the object based on the updated projections [52, 4, 117,
19]. Most of these techniques typically use a first-order approximation of the projections in the tomography
step due to the challenges introduced by the non-linearity. Even though the linear regime provides a good
estimation for small biological samples, it becomes increasingly inaccurate for extended specimens and
for materials used in electronics. An additional challenge of 3D imaging at very fine resolution is the
extremely sensitive calibration process that greatly increases data acquisition time.
In this work, we introduce a 3D reconstruction technique where the object is reconstructed directly in
lieu of separating reconstruction into ptychography and tomography steps or alternating between those
two as in prior work. Furthermore, we use a highly non-linear wave propagation model without linear
approximation. We expect this model to be more accurate than the linear approximation, especially for
larger specimens where the path length of the beam passing through the object is longer. Our work builds
upon AWF [186], an accelerated optimization technique used for 2D phase retrieval. We extend this framework to 3D reconstruction by directly incorporating tomography in the algorithm and by adding weighted
TV-regularization, which we term 3D Accelerated Wirtinger Flow (3D-AWF). We show that the merit of
TV-regularization is threefold: (1) it offers a computationally inexpensive method to alleviate the effect of
ambiguities introduced by the non-linear model by leveraging prior knowledge, (2) it significantly accelerates data acquisition by reducing the number of measurements needed for a given level of reconstruction
accuracy and (3) effectively incorporates the structure of integrated circuits by promoting a piecewise
constant reconstruction. We demonstrate through numerical simulations on realistic chip data that our
non-linear model results in significantly more accurate reconstructions compared to its linear approximation. Moreover, we provide mathematically rigorous guarantees for convergence of our algorithm.
2.2 Phaseless imaging in 3D
We are interested in reconstructing the complex-valued 3D refractive index of the object, where we model
the object as shifts of a voxel basis function over a cubic lattice. Let $x \in \mathbb{C}^N$ represent the complex refractive index of the discretized 3D object obtained from vectorizing $X \in \mathbb{C}^{N_1\times N_2\times N_3}$, for which $X_{n_1,n_2,n_3}$
is the complex refractive index at voxel $(n_1, n_2, n_3)$ on a cubic lattice. Here, $N_1$, $N_2$ and $N_3$ denote the
number of voxels along each dimension and obey $N_1 N_2 N_3 = N$. The object is of the form $x = d + ib$
with $d, b \in \mathbb{R}^N$, where $d$ denotes the phase shift and $b$ the attenuation associated with wave propagation
through the object, and $i$ denotes the imaginary unit. Our forward model consists of two stages. The first stage
consists of applying a non-linear projection to the 3D object resulting in a 2D complex exit wave. Then,
the exit wave is passed through a linear mapping and its magnitude is measured in the far field.
2.2.1 From 3D object to 2D exit waves.
Let $T_\ell \in \mathbb{R}^{P\times N}$ represent the part of the conventional Radon transform projection operator corresponding
to the $\ell$th projection angle. Based on the projection approximation of wave propagation [41] (Fig. 2.1), for
a wavelength $\lambda$ the mapping from $x$ to the discretized exit wave in orientation $\ell$ can be represented as
$$g_\ell := g_\ell(x) = \exp\!\left(\frac{2\pi i}{\lambda}\, T_\ell(d + ib)\right), \qquad (2.1)$$
where exponentiation should be interpreted element-wise.
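For intuition, a minimal sketch of the exit-wave model (2.1) follows. It stands in for the Radon projection $T_\ell$ with a simple axis-aligned sum of voxels, which corresponds to a single orientation rather than the general rotated projections used in this chapter.

```python
import numpy as np

def exit_wave(x, wavelength, axis=2):
    """Discretized exit wave g = exp((2*pi*i / wavelength) * T x) under the projection
    approximation. Here T is an axis-aligned line integral (sum of voxels along `axis`);
    a rotated Radon projection would be used for a general orientation."""
    projection = x.sum(axis=axis)   # T (d + i b): complex 3D object -> complex 2D projection
    return np.exp(2j * np.pi / wavelength * projection)
```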
2.2.2 From 2D exit waves to phaseless measurements.
In 3D ptychography, a sample is illuminated with several different illumination functions (or "probes")
from L different orientations and the corresponding diffraction patterns for each probe are measured by
a detector in the far field (Fig. 2.2). In many cases, the different probe functions $p_k(r')$ are obtained
as different spatial shifts of the same basic probe function. Let $g_\ell(r')$ represent the 2D exit wave as a
function of the spatial position in the projection plane $r' = (r'_x, r'_y)$ ((2.1) is the corresponding discretization).
Then, the complex field at the detector plane resulting from the $k$th probe in orientation $\ell$ is given by
$\zeta_{k,\ell}(r') = \mathcal{F}\{p_k(r')\, g_\ell(r')\}$, where $\mathcal{F}$ denotes the Fourier transform. However, we cannot sense the
complex far field directly, only its magnitude. Therefore our phaseless measurement corresponding to the
$k$th probe and $\ell$th illumination angle takes the form $y_{k,\ell} = |\zeta_{k,\ell}|$. All measurements obtained in the $\ell$th
orientation can be written in the more compact form $y_\ell = |A g_\ell|$, where $A$ represents the ptychographic
propagation model described in [186].
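A minimal sketch of this measurement model is given below. The probe list, the FFT normalization and the absence of probe scan shifts are simplifying assumptions; the full ptychographic operator $A$ of [186] additionally encodes the scan positions of the probe.

```python
import numpy as np

def phaseless_measurements(g_ell, probes):
    """Far-field magnitude measurements y_{k,l} = |F{p_k * g_l}| for each probe p_k
    applied to the exit wave g_l of a single orientation."""
    return np.stack([np.abs(np.fft.fft2(p * g_ell, norm="ortho")) for p in probes])
```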
Figure 2.1: Projection approximation of wave propagation. Incident beam $B_{r'_x,r'_y}$ passes through the 3D object, and produces a pixel in the projection plane at $(r'_x, r'_y)$. $I_{r'_x,r'_y}$ denotes the set of voxel indices intersected by the beam.
Figure 2.2: In ptychography, the sample is illuminated by a probe function from various angles. The
diffraction pattern in the far field is the Fourier transform of the exit wave multiplied by the probe function.
2.2.3 Ambiguity challenge.
Tomography. Recovering the phase of the ground truth object based on phaseless measurements is only
possible up to some ambiguity factors. First, the mapping from the object $x$ to the 2D projections $\{T_\ell x\}_{\ell=1}^{L}$
may not be invertible and therefore the 2D projection images may result from infinitely many possible 3D
objects. This ambiguity is rather pronounced when we only have measurements from a few orientations.
However, with a sufficiently large number of angles, the mapping is typically invertible.
Global phase. Another source of ambiguity arises from the phaseless measurements. Recovering the
2D exit waves is only possible up to a phase factor, since the magnitude measurements are invariant to a
global shift in phase.
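The global phase ambiguity can be verified numerically: multiplying an exit wave by a unit-magnitude constant leaves every magnitude measurement unchanged. The toy check below assumes a plain 2D Fourier transform as the measurement operator.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))  # toy exit wave
phi = 0.7                                                                # arbitrary global phase
same = np.allclose(np.abs(np.fft.fft2(g)), np.abs(np.fft.fft2(np.exp(1j * phi) * g)))
print(same)  # True: |F{e^{i phi} g}| = |F{g}|
```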
Phase wrapping. Phase wrapping is another source of ambiguity that appears in the 2D exit waves and
originates in the projection model in (2.1). Specifically, let x
∗ = d
∗+ib
∗ be the ground truth object we wish
to reconstruct, xb = db+ ibb be the estimate obtained from our reconstruction algorithm, and de:= db− d
∗
13
the error in the real part of the reconstruction. Then the reconstructed exit wave at orientation ℓ takes the
form
gbℓ = e
− 2π
λ
Tℓ
bb
· e
i
2π
λ
Tℓd
∗
· e
i
2π
λ
Tℓde
. (2.2)
From this identity it is clear that if $\tilde{d}$ is such that $T_\ell \tilde{d} = \lambda k_\ell$ for some $k_\ell \in \mathbb{Z}^N$, then the reconstruction will be consistent with the measurements, and one cannot hope to differentiate between the reconstruction and the ground truth exit wave. This effect translates to an ambiguity in the real part of each voxel of the 3D object: $\hat{d}$ and $\hat{d} + \lambda k$ are indistinguishable for any $k \in \mathbb{Z}^N$ in our model. To elaborate further, consider an incident beam $B^\ell_j$ at angle $\ell$ that produces pixel $\hat{r}^\ell_j$ on the projection image. Denote by $I := \{i \,|\, x_i \in B^\ell_j\}$ the set of indices of voxels intersected by the beam. Explicitly writing out the Radon transform for the real part of this pixel is simply a sum in the discrete case:
$\Re(\hat{r}^\ell_j) = \sum_{i \in I} \hat{d}_i = \sum_{i \in I} d^*_i + \sum_{i \in I} \tilde{d}_i .$
Assume that the voxel-wise reconstruction error can be written as $\tilde{d}_i = k_i \lambda$, $k_i \in \mathbb{Z}$ for all $i$; in this case
$\Re(\hat{r}^\ell_j) = \sum_{i \in I} d^*_i + \lambda \sum_{i \in I} k_i = \sum_{i \in I} d^*_i + \lambda k', \quad k' \in \mathbb{Z},$
which results in the same exit wave pixel as the ground truth object and is therefore indistinguishable from the ground truth in our model. Moreover, due to phase wrapping we lose all information on each individual $k_i$.
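The following small numerical check (with made-up values for $\lambda$ and the line integrals, not actual chip data) illustrates the wrapping ambiguity: shifting the real part of a line integral by an integer multiple of $\lambda$ produces exactly the same exit-wave pixel.

import numpy as np

lam = 0.2        # wavelength, hypothetical value in the same units as the line integrals
d = 3.7          # line integral of the real part along one beam
b = 0.05         # line integral of the imaginary (attenuation) part

def exit_pixel(d, b, lam):
    # exit-wave pixel under the projection approximation
    return np.exp(2j * np.pi / lam * (d + 1j * b))

print(np.allclose(exit_pixel(d, b, lam), exit_pixel(d + 5 * lam, b, lam)))  # True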
2.3 Reconstruction via 3D-AWF
Our goal is to find $x \in \mathbb{C}^N$ that best explains our phaseless measurements under the propagation model. Formally, we solve the optimization problem
$\hat{x} = \arg\min_{x \in \mathbb{C}^N} \; L(x) + \lambda_{TV}\, TV_{3D}(x; w) = \arg\min_{x \in \mathbb{C}^N} \; L_{total}(x),$   (2.3)
where $\lambda_{TV} \in [0, \infty)$ is the regularization strength and
$L(x) := \sum_{\ell=1}^{L} \big\| y_\ell - |A g_\ell| \big\|_2^2 .$
The second term penalizes the weighted total variation of the reconstruction defined as
$TV_{3D}(x; w) = \sum_{i,j,k} \big( w_x |x_{i+1,j,k} - x_{i,j,k}| + w_y |x_{i,j+1,k} - x_{i,j,k}| + w_z |x_{i,j,k+1} - x_{i,j,k}| \big),$
where w = [wx, wy, wz] is a fixed vector of non-negative weights that can be used to leverage prior
knowledge on the structure of the object along different spatial dimensions. The optimization problem in
(2.3) is nonconvex and in general does not admit a closed form solution. Classical gradient descent requires
a differentiable loss landscape and the loss in (2.3) is not complex differentiable. However, this does not
pose a significant challenge since the loss function is differentiable except for isolated points, and we can
define generalized gradients at non-differentiable points [30]. We use the notion of Wirtinger-derivatives
and apply a proximal variant of AWF [186], which we call 3D-AWF with update rule
$z_{\tau+1} = x_\tau + \beta_\tau (x_\tau - x_{\tau-1}) - \mu_\tau \nabla L\big(x_\tau + \beta_\tau (x_\tau - x_{\tau-1})\big)$
$x_{\tau+1} = \mathrm{prox}_{TV}(z_{\tau+1}),$   (2.4)
where $\mathrm{prox}_f$ denotes the proximal mapping associated with function $f$. More details on Wirtinger derivatives, their properties and applications to phase retrieval can be found in [10]. The generalized gradient of $L(x)$ takes the form
$\nabla L(x) = -\frac{2\pi i}{\lambda} \sum_{\ell=1}^{L} T_\ell^H \,\mathrm{diag}(g_\ell)\, A^H \big( A g_\ell - y_\ell \odot \mathrm{sgn}(A g_\ell) \big),$   (2.5)
where sgn(·) denotes the complex signum function and ⊙ stands for elementwise multiplication. We
choose the step size $\mu_\tau = \frac{1}{\Gamma_\tau}$, where
$\Gamma_\tau = \frac{4\pi^2}{\lambda^2} \left[ \left\| \sum_{\ell=1}^{L} \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2)\, \mathrm{diag}(|g_\ell|^2) \right\|_2 + \left\| \mathrm{diag}\Big[ \Big(\tfrac{\partial}{\partial g_\ell} L(x)\Big)^{H} \odot g_\ell \Big] \right\|_2 \right].$   (2.6)
We note that Γτ can be determined from the known probe and quantities computed whilst calculating the
gradient (current exit wave estimate, gradient w.r.t. exit wave) and hence requires no additional effort.
This step size is motivated by a theoretical bound on the spectral norm of the loss Hessian that describes
the maximum variation of the loss landscape and works well in practice. In the next section, we provide
formal convergence guarantees for a slightly more conservative step size.
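For illustration, a single sweep of the update rule in (2.4) can be written as the short Python sketch below. The gradient, step_size and prox_tv callables stand in for the Wirtinger gradient of (2.5), the step size $1/\Gamma_\tau$ of (2.6), and a 3D TV proximal operator respectively; this is a schematic outline under those assumptions, not our exact implementation.

def awf_iterations(x0, gradient, step_size, prox_tv, num_iters):
    # Accelerated proximal Wirtinger Flow updates, cf. update rule (2.4)
    x_prev, x = x0.copy(), x0.copy()
    for tau in range(1, num_iters + 1):
        beta = (tau + 1) / (tau + 3)            # momentum weight beta_tau
        q = x + beta * (x - x_prev)             # extrapolated point
        z = q - step_size(q) * gradient(q)      # gradient step at the extrapolated point
        x_prev, x = x, prox_tv(z)               # proximal TV step
    return x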
Due to the ambiguities discussed in Section 2.2, the loss landscape L(x) has many undesired global
optima. Furthermore, due to the highly nonlinear nature of the forward model the loss is highly nonconvex
with many local optima. 3D-AWF biases the optimization process towards the desired reconstruction by
exploiting a priori knowledge on the structure of the solution via TV proximal mappings. The benefit
of TV-regularization is threefold: (1) it drastically shortens data acquisition time by decreasing the number of measurements required for accurate reconstruction, (2) it helps resolve the ambiguity introduced by phase wrapping to a high degree, and (3) it serves as an excellent prior for integrated circuits due to their highly structured, piecewise constant nature.
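As a reference point, the weighted anisotropic penalty $TV_{3D}(x; w)$ defined above can be evaluated with a few array differences. The sketch below is a direct transcription of the definition for a complex NumPy volume; the default weights $w = [1, 1, 0.1]$ are only the example values used later in our experiments.

import numpy as np

def tv3d(x, w=(1.0, 1.0, 0.1)):
    # weighted anisotropic 3D total variation of a (complex) volume x
    wx, wy, wz = w
    tv  = wx * np.abs(np.diff(x, axis=0)).sum()   # |x_{i+1,j,k} - x_{i,j,k}|
    tv += wy * np.abs(np.diff(x, axis=1)).sum()   # |x_{i,j+1,k} - x_{i,j,k}|
    tv += wz * np.abs(np.diff(x, axis=2)).sum()   # |x_{i,j,k+1} - x_{i,j,k}|
    return tv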
In Section 2.2 we showed that there is a voxel-level ambiguity in the real part of the object due to
phase wrapping. Applying 3D TV regularization promotes a piecewise constant structure over the 3D
reconstruction. Since we know a priori that the ground truth object is piecewise constant, this in turn
ensures that the ambiguity in d is also piecewise constant. Therefore, it opens up a way to mitigate the
phase wrapping effect by facilitating the approximation of the ambiguity by a single constant over the
object: $\hat{d} \approx d^* + \mathbf{1}\tilde{d}$ with $\tilde{d} \in \mathbb{R}$. Finding the optimal constant $\tilde{d}$ necessitates some knowledge on the ground truth object. We assume that some pixels of the ground truth exit waves are known, which translates to knowing some line integrals through the ground truth object. This information is readily available by using the part of the 3D object which is known to be vacuum or a given substrate. Denote $D_\ell$ the diagonal operator that masks out unknown pixels in the ground truth projection image in orientation $\ell$, so that $D_\ell T_\ell d^*$ is known. Then we can obtain $\tilde{d}$ by solving
$\min_{\tilde{d}} \; \sum_{\ell=1}^{L} \big\| D_\ell T_\ell \hat{d} - \mathbf{1}\tilde{d} - D_\ell T_\ell d^* \big\|_2^2,$   (2.7)
for which the solution can be easily calculated in closed form by
$\tilde{d} = \frac{ \mathbf{1}^H \sum_{\ell=1}^{L} T_\ell^H \big( D_\ell T_\ell \hat{d} - D_\ell T_\ell d^* \big) }{ \mathbf{1}^H \sum_{\ell=1}^{L} T_\ell^H D_\ell T_\ell \mathbf{1} }.$
Let $\hat{x}_T$ be the full reconstruction obtained from running 3D-AWF for $T$ iterations. Then, our correction technique yields the final reconstruction $\hat{x}_F$ given by
$\hat{x}_F = \hat{x}_T - \mathbf{1}\tilde{d}.$   (2.8)
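In code, the correction amounts to a scalar least-squares fit followed by a constant shift. The sketch below takes the projection matrices $T_\ell$, the masks $D_\ell$, and the known masked projections $D_\ell T_\ell d^*$ as inputs and evaluates the closed-form expression above; all names are placeholders and dense matrices are used purely for illustration.

import numpy as np

def phase_wrap_correction(x_rec, T_list, D_list, known_projs):
    # x_rec:       complex reconstruction x_T, flattened to shape (N,)
    # T_list:      list of projection matrices T_l of shape (P, N)
    # D_list:      list of 0/1 mask vectors of shape (P,) (diagonal of D_l)
    # known_projs: list of known masked projections D_l T_l d_star, each of shape (P,)
    d_hat = x_rec.real
    ones = np.ones_like(d_hat)
    num, den = 0.0, 0.0
    for T, D, known in zip(T_list, D_list, known_projs):
        residual = D * (T @ d_hat) - known            # D_l T_l d_hat - D_l T_l d_star
        num += ones @ (T.T @ residual)                # numerator: 1^H sum_l T_l^H ( ... )
        den += ones @ (T.T @ (D * (T @ ones)))        # denominator: 1^H sum_l T_l^H D_l T_l 1
    d_tilde = num / den
    return x_rec - d_tilde                            # x_F = x_T - 1 * d_tilde  (eq. (2.8))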
2.4 Convergence theory
The loss function in (2.3) is non-differentiable and highly non-convex. Therefore it is completely unclear
why 3D-AWF even converges. In the next theorem we ensure convergence to a stationary point. We defer
the proof to Appendix A.1.
Theorem 2.4.1. Let $x \in \mathbb{C}^N$ represent the object and assume we have noisy measurements of the form $y_\ell = |A g_\ell| + n_\ell$ corresponding to projection angles $\ell = 1, \dots, L$. Here, $g_\ell \in \mathbb{C}^P$ is defined per (2.1) and $n_\ell$ is used to denote arbitrary noise on the measurements from the $\ell$th angle. We run 3D-AWF updates of the form (2.4) with $\beta_\tau = 0$ with step size
$\mu \le \left[ \frac{4\pi^2}{\lambda^2} \left( (1 + \sqrt{P})\, L\, \lambda_{\max} + \sqrt{\lambda_{\max}} \sum_{\ell=1}^{L} \|y_\ell\|_{\ell_2} \right) \right]^{-1},$
where $\lambda_{\max} = \left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2) \right\|_2$. Furthermore, let $x^*$ be a global optimum of $L_{total}(x)$. Then, we have
$\lim_{\tau \to \infty} \| \mathrm{prox}_{TV}(z_\tau) - x_\tau \|_{\ell_2} = 0,$
and more specifically
$\min_{\tau \in \{1,2,...,T\}} \| \mathrm{prox}_{TV}(z_\tau) - x_\tau \|_{\ell_2}^2 \le \mu\, \frac{L_{total}(x_0) - L_{total}(x^*)}{T + 1}.$
This theorem guarantees that if we choose the step size smaller than a constant which can be calculated purely based on our measurements and the known probe function, then 3D-AWF will converge to a stationary point. Moreover, the norm of the difference of iterates diminishes proportionally to $\frac{1}{T}$. It is
important to note that even though Theorem 2.4.1 is formulated in terms of TV regularization for this particular application, a more general result in Appendix A.1 shows that 3D-AWF converges for any convex
regularizer.
2.5 Numerical experiments
In this section, we investigate the performance of 3D-AWF in the context of ptychographic phaseless imaging of 3D samples. We perform the reconstruction on a complex 3D test image of size 124×124×220 voxels
($N \approx 3.4 \cdot 10^6$) obtained from a highly realistic synthetic IC structure specified in [186]. We use a simulated x-ray source with an energy of 6.2 keV ($\lambda_0 = 0.2$ nm). To generate the measurements we repeat the ptychographic acquisition procedure with parameters described in [186] for $L = \{5, 10, 25, 50, 100, 250, 400\}$
different illumination angles, where the object is rotated by $\frac{\pi}{L}$ increments about its $y$ axis.

Algorithm 1 3D-AWF
Input: $\lambda_{TV}$, $y := \{y_\ell\}_{\ell=1,2,..,L}$
1: $\hat{x}_0 \leftarrow 0$   ▷ Initialization
2: for $\tau = 1$ to $T$ do
3:   $\beta_\tau \leftarrow \frac{\tau+1}{\tau+3}$
4:   $q_{\tau+1} \leftarrow x_\tau + \beta_\tau (x_\tau - x_{\tau-1})$   ▷ Temporary variable
5:   $\nabla L(q_{\tau+1}) \leftarrow \mathrm{gradient}(q_{\tau+1}, y)$   ▷ Gradient from (2.5)
6:   $\mu_\tau \leftarrow 1/\Gamma_\tau$   ▷ Step size from (2.6)
7:   $z_{\tau+1} \leftarrow q_{\tau+1} - \mu_\tau \nabla L(q_{\tau+1})$
8:   $x_{\tau+1} \leftarrow \mathrm{prox}_{TV}(z_{\tau+1})$
9: $\hat{x}_F = \mathrm{correction}(x_T)$   ▷ Correction from (2.8)
Output: $\hat{x}_F$   ▷ Final reconstruction

First, we implement 3D-AWF (Algorithm 1) to minimize the TV-regularized problem defined in (2.3) with the iterative proximal update rule in (2.4) for T = 550 iterations. We tune the regularization strength by minimizing
reconstruction error with L = 100 illumination angles. For experiments with different number of angles
we scale the regularizer linearly with L to maintain the ratio of TV-penalty to the total loss. The tuned
value for 3D-AWF at 100 angles is $\lambda_{TV}^{AWF} = 0.1$. The chip has a fine, layered structure along the $z$-axis, therefore we set the regularization weights to $w = [1, 1, 0.1]$ to enforce a piecewise constant structure mostly in the $x$-$y$ plane. We report the relative error on the corrected reconstruction (output of Algorithm 1) as $RE_{final} = \|M(\hat{x}_F - x^*)\|_2 \,/\, \|M x^*\|_2$. Here, $M$ extracts the center $62 \times 62 \times 110$ voxel region of the object (the region of interest); outside of this region the object did not receive enough illumination from the probes, so we do not expect an accurate reconstruction there.
We compare our results to a combined, two-step (2-Step) approach in which we first perform 2D phase
retrieval then reconstruct the object from projections via tomography. In the first step, we reconstruct the
exit waves by minimizing $\sum_{\ell=1}^{L} \| y_\ell - |A f_\ell| \|_2^2$, yielding estimated exit waves $\hat{f}_\ell$. In this method the exit wave is approximated based on its Taylor-series expansion as $\exp\big(\frac{2\pi i}{\lambda} T_\ell x\big) \approx 1 + \frac{2\pi i}{\lambda} T_\ell x$, yielding the loss function for tomography
$\sum_{\ell=1}^{L} \Big\| H \hat{f}_\ell - \frac{2\pi i}{\lambda} H T_\ell x \Big\|_2^2 + \lambda_{TV}\, TV_{3D}(x),$   (2.9)
where H represents the ramp filter used in filtered backprojection aimed at inverting the Radon transform.
We will assume that the exit waves have been reconstructed perfectly in the phase retrieval step (that is
$\hat{f}_\ell = f^*_\ell$, $\ell = 1, 2, .., L$) and run conjugate gradient descent on the loss function in (2.9) for T = 550 iterations. We tune and scale the regularizer by the same methodology as for 3D-AWF, with $\lambda_{TV}^{2\text{-}Step} = 10^4$ for L = 100 angles. To perform the correction, we assume that the same pixel values are known as in the case of 3D-AWF and report the final reconstruction error after correction.
In the case of 2-Step, we observe that the relative reconstruction error achieves its minimum fairly early (100-150 iterations) and increases afterwards, with consistently worse reconstructions at iteration 550. Therefore we show the best reconstruction across all iterations for this technique. On the other hand, as observed in Fig. 2.3, the 3D-AWF reconstruction error decreases throughout the iterations and therefore we report results for the last iteration. In other words, the 3D-AWF reconstruction improves with more iterations, which cannot be said for 2-Step. We note that the relative error before correction is consistently high for both algorithms.
We attribute this fact to the inherent ambiguity of the reconstruction problem, which emphasizes the need
to incorporate some form of prior knowledge. After applying the correction technique described in (2.8),
the reconstruction error decreases drastically for both algorithms.
Figure 2.3: Evolution of relative reconstruction error before and after correction across iterations. L = 100.

Figure 2.4: Relative reconstruction error before and after correction vs. number of angles.
Fig. 2.4 depicts the relative reconstruction error achieved by 3D-AWF and 2-Step for various numbers of
illumination angles. These results show that 3D-AWF achieves significantly better reconstruction accuracy
with significantly fewer angles. We attribute most of this difference to the inaccuracy of the linear model
used in 2-Step. Fig. 2.5 shows how the linear approximation increasingly deviates from the exponential
model at shorter wavelengths, such as the one used in our simulation. Imaging with high energy beams
(or short wavelengths) is crucial for obtaining nano-scale resolution. Moreover, our experiments show
that the presence of metallic parts in the object further increases the inaccuracy of the linear model (Fig.
2.6). This is due to the fact that metals typically have high attenuation (represented by b in Section 2.2,
the imaginary part of the complex refractive index). All these observations highlight the advantage of
the exponential model over the linear approximation for high resolution imaging of integrated circuits of
significant spatial extent.
Lastly, we plot the magnitude of a slice of the ground truth object and reconstructions after correction in Fig. 2.7 for various projection angles. Even though the reconstructions significantly improve with
more illumination angles, visible reconstruction quality saturates after 100 angles.

Figure 2.5: Difference between the exponential model and its linear approximation. We plot the normalized mean squared error between an exit wave obtained from the non-linear model and the linearized model at various wavelengths (normalized by the wavelength used in the experiment). At high energies (short wavelengths) the linear approximation significantly deviates from the exponential model.

Figure 2.6: Normalized pixelwise squared difference between exit waves calculated from the exponential propagation model and the linearized model. The error is significantly higher at pixels resulting from the illuminating beam passing through metallic parts, such as copper interconnects in the object.

Reconstruction of the magnitude image using 3D-AWF is highly accurate, with sharp edges even with a low number of measurements. Edges on the 2-Step magnitude plot are less well-defined and magnitude values are inaccurate. The
phase plots (Figure 2.8) show drastic differences between the two reconstruction algorithms. In general,
the phase of the object converges significantly slower than the magnitude and is less accurate, which is
due to the loss of phase information in the measurement process. One may observe that the phase plot of 2-Step exhibits serious inaccuracies, even after correction. A 3D rendering of the reconstructed volume using L = 100 illumination angles can be seen in Figure 2.9. The quality of the 3D-AWF reconstruction is
visibly better throughout the volume, and we observe lower reconstruction error close to the center of the
object due to the geometry of the setup.
Figure 2.7: Magnitude of the ground truth and of the 3D-AWF and 2-Step reconstructions after correction, for a slice (x-y plane at z = 1).
Figure 2.8: Phase of the ground truth and of the 3D-AWF and 2-Step reconstructions after correction, for a slice (x-y plane at z = 60).
Figure 2.9: 3D rendering of the magnitude and phase of the ground truth and reconstructed volumes using
3D-AWF and 2-Step (L = 100).
Chapter 3
Data augmentation for deep learning based accelerated MRI
reconstruction with limited data
In many scientific and medical imaging applications collecting large training datasets is infeasible. This is
especially the case with MRI data, due to slow data acquisition, high operating costs and stringent patient
confidentiality requirements when handling the data. In this chapter, we introduce a pipeline to enlarge the
training dataset using synthetic samples generated via augmenting existing data for MRI reconstruction.
The proposed technique can not only significantly improve the quality of MRI images when training data
is limited, but also endows the model with increased robustness against shifts in the instrumental setup
and scanned anatomy.
This chapter is based on the following works:
• Zalan Fabian, Reinhard Heckel, and Mahdi Soltanolkotabi. "Data augmentation for deep learning
based accelerated MRI reconstruction with limited data." In International Conference on Machine
Learning, pp. 3057-3067. PMLR, 2021.
• Zalan Fabian, Reinhard Heckel, and Mahdi Soltanolkotabi. "Data augmentation for deep learning
based accelerated MRI reconstruction with limited data." arXiv preprint arXiv:2106.14947 (2021).
3.1 Introduction
In magnetic resonance imaging (MRI), an extremely popular medical imaging technique, it is common to
reduce the acquisition time by subsampling the measurements, because this reduces cost and increases
accessibility of MRI to patients. Due to the subsampling, there are fewer equations than unknowns, and
therefore the signal is not uniquely identifiable from the measurements. To overcome this challenge there
has been a flurry of activity over the last decade aimed at utilizing prior knowledge about the signal, in a
research area referred to as compressed sensing [11, 43].
Compressed sensing methods reduce the required number of measurements by utilizing prior knowledge about the images during the reconstruction process, traditionally via a convex regularization that
enforces sparsity in an appropriate transformation of the image. More recently, deep learning techniques
have been used to enforce much more nuanced forms of prior knowledge (see [120] and references therein
for an overview). The most successful of these approaches aim to directly learn the inverse mapping from
the measurements to the image by training on a large set of training data consisting of signal/measurement
pairs. This approach often enables faster reconstruction of images, but more importantly, deep learning
techniques yield significantly higher quality reconstructions. Thus, deep learning techniques enable reconstructing a high-quality image from fewer measurements which further reduces image acquisition times.
For instance, in an accelerated MRI competition known as fastMRI Challenge [191], all the top contenders
used deep learning reconstruction techniques.
Contrary to classical compressive sensing approaches, however, deep learning techniques typically
rely on large sets of training data consisting of images along with the corresponding measurement. This
is also true about the use of deep learning techniques in other areas such as computer vision and Natural
Language Processing (NLP) where superb empirical success has been observed. While large datasets have
been harvested and carefully curated in areas such as vision and NLP, this is not feasible in many scientific
applications including MRI. It is difficult and expensive to collect the necessary datasets for a variety of
reasons, including patient confidentiality requirements, cost and time of data acquisition, lack of medical
data compatibility standards, and the rarity of certain diseases.
A common strategy to reduce reliance on training data in classification tasks is data augmentation.
Data augmentation techniques are used in classification tasks to significantly increase the performance
on standard benchmarks such as ImageNet and CIFAR-10. For a comprehensive survey of image data
augmentation in deep learning see [147]. More specific to medical imaging, data augmentation techniques
have been successfully applied to registration, classification and segmentation of medical images. Recently,
several studies [199, 80, 198] have demonstrated that data augmentation can significantly reduce the data
needed for GAN training for high quality image generation. In a classification setting, data augmentation consists of adding additional synthetic data obtained by performing invariant alterations to the data
(e.g. flips, translations, or rotations) which do not affect the response (i.e., the label).
In image reconstruction tasks, however, data augmentation techniques are less common and much
more difficult to design because the response (the measurement) is affected by the data augmentation. For
example, measurements of a rotated image are not the same as measurements from the original image.
In the context of accelerated MRI reconstruction, augmentation techniques such as randomly generated
undersampling masks [100] and simple random flipping [93] have been applied, and authors in [144] note
the importance of rigid transforms in avoiding overfitting. However, an effective pipeline of augmentations
for training data reduction and thorough experimental studies thereof are lacking.
The goal of this paper is to explore the benefits of data augmentation techniques for accelerated MRI
with limited training data. By carefully taking into account the physics behind the MRI acquisition process
we design a data augmentation pipeline, which we call MRAugment∗, that can successfully reduce the
amount of training data required. Our contributions are as follows:
∗Code: https://github.com/MathFLDS/MRAugment
• We perform an extensive empirical study of data augmentation in accelerated MRI reconstruction.
To the best of our knowledge, this work is the first in-depth experimental investigation focusing on
the benefits of data augmentation in the context of training data reduction for accelerated MRI.
• We propose a data augmentation technique tailored to the physics of the MR reconstruction problem. It is not obvious how to perform data augmentation in the context of accelerated MRI or in
inverse problems in general, because by changing an image to enlarge the training set, we do not
automatically get a corresponding measurement, contrary to classification problems, where the label
is retained.
• We demonstrate the effectiveness of MRAugment on a variety of datasets. On small datasets we
report significant improvements in reconstruction performance on the full dataset when MRAugment is applied. Moreover, on small datasets we are able to surpass full dataset baselines by using
only a small fraction of the available training data by leveraging our proposed data augmentation
technique.
• We perform an extensive study of MRAugment on a large benchmark accelerated MRI data set,
specifically on the fastMRI [191] dataset. For 8-fold acceleration and multi-coil measurements (multi-coil measurements are the standard acquisition mode for clinical practice) we achieve performance
on par with the state of the art with only 10% of the training data. Similarly, again for 8-fold acceleration and single-coil experiments (an acquisition mode popular for experimentation) MRAugment
can achieve the performance of reconstruction methods trained on the entire dataset while using
only 33% of training data.
• We reveal additional benefits of data augmentation on model robustness in a variety of settings.
We observe that MRAugment has the potential to improve generalization to unseen MRI scanners,
field strengths and anatomies. Furthermore, due to the regularizing effect of data augmentation,
MRAugment prevents overfitting to training data and therefore may help eliminate hallucinated
features on reconstructions, an unwanted side-effect of deep learning based reconstruction.
3.2 Background and Problem Formulation
MRI is a medical imaging technique that exploits strong magnetic fields to form images of the anatomy.
MRI is a prominent imaging modality in diagnostic medicine and biomedical research because it does not
expose patients to ionizing radiation, contrary to competing technologies such as computed and positron
emission tomography.
However, performing an MR scan is time intensive, which is problematic for the following reasons.
First, patients are exposed to long acquisition times in a confined space with high noise levels. Second, long
acquisition times induce reconstruction artifacts caused by patient movement, which sometimes requires
patient sedation in particular in pediatric MRI [173]. Reducing the acquisition time can therefore increase
both the accuracy of diagnosis and patient comfort. Furthermore, decreasing the acquisition time needed
allows more patients to receive a scan using the same machine. This can significantly reduce patient cost,
since each MRI machine comes with a high cost to maintain and operate.
Since the invention of MRI in the 1980s there has been tremendous research focused on reducing its acquisition time. The two main ideas are to i) perform multiple acquisitions simultaneously [148, 125, 51]
and to ii) subsample the measurements, known as accelerated acquisition or compressed sensing [105].
Most modern scanners combine both techniques, and therefore we consider such a setup.
3.2.1 Accelerated MRI acquisition
In magnetic resonance imaging, measurements of a patient’s anatomy are acquired in the Fourier-domain,
also called k-space, through receiver coils. In the single-coil acquisition mode, the k-space measurement $k \in \mathbb{C}^n$ of a complex-valued ground truth image $x^* \in \mathbb{C}^n$ is given by
$k = F x^* + z,$
where $F$ is the two-dimensional Fourier transform, and $z \in \mathbb{C}^n$ denotes additive noise arising in the
measurement process. In parallel MR imaging, multiple receiver coils are used, each of which captures a
different region of the image, represented by a complex-valued sensitivity map $S_i$. In this multi-coil setup, coils acquire k-space measurements modulated by their corresponding sensitivity maps:
$k_i = F S_i x^* + z_i, \quad i = 1, .., N,$
where $N$ is the number of coils. Obtaining fully-sampled k-space data is time-consuming, and therefore in accelerated MRI we decrease the number of measurements by undersampling in the Fourier domain. This undersampling can be represented by a binary mask $M$ that sets all frequency components not sampled to zero:
$\tilde{k}_i = M k_i, \quad i = 1, .., N.$
We can write the overall forward map concisely as
$\tilde{k} = \mathcal{A}(x^*),$
where $\mathcal{A}(\cdot)$ is the linear forward operator and $\tilde{k}$ denotes the undersampled coil measurements stacked into a single column vector. The goal in accelerated MRI reconstruction is to recover the image $x^*$ from the set of k-space measurements $\tilde{k}$. Note that, without making assumptions on the image $x^*$, it is in general impossible to perfectly recover the image, because we have fewer measurements than variables to recover. This recovery problem is known as compressed sensing. To make image recovery potentially possible, recovery methods make structural assumptions about $x^*$, such as that it is sparse in some basis or, implicitly, that it looks similar to images from the training set.
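As a concrete illustration of the forward map, the sketch below builds $\mathcal{A}(\cdot)$ from its three ingredients: coil sensitivity modulation, a 2D Fourier transform, and a binary sampling mask. The sensitivity maps, mask and noise level are hypothetical placeholders, and the orthonormal FFT convention is an assumption made for this sketch.

import numpy as np

def forward_operator(x, sens_maps, mask, noise_std=0.0):
    # x:         complex image of shape (H, W)
    # sens_maps: complex coil sensitivities S_i of shape (N, H, W)
    # mask:      binary undersampling mask M of shape (H, W)
    coil_imgs = sens_maps * x[None]                                # S_i x
    kspace = np.fft.fft2(coil_imgs, axes=(-2, -1), norm="ortho")   # F S_i x
    noise = noise_std * (np.random.randn(*kspace.shape)
                         + 1j * np.random.randn(*kspace.shape))    # additive complex noise z_i
    return mask[None] * (kspace + noise)                           # k_tilde_i = M k_i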
3.2.2 Traditional accelerated MRI reconstruction methods
Traditional compressed sensing recovery methods for accelerated MRI are based on assuming that the image $x^*$ is sparse in some dictionary, for example the wavelet transform. Recovery is then typically posed as a convex optimization problem:
$\hat{x} = \arg\min_x \; \big\| \mathcal{A}(x) - \tilde{k} \big\|_2 + R(x),$
where R(·) is a regularizer enforcing sparsity in a certain domain. Typical functions used in CS based
MRI reconstruction are ℓ1-wavelet and total-variation regularizers. These optimization problems can be
numerically solved via iterative gradient descent based methods.
3.2.3 Deep learning based MRI reconstruction methods
In recent years, several deep learning algorithms have been proposed and convolutional neural networks
established a new state of the art in MRI reconstruction, significantly surpassing the classical baselines.
Encoder-decoder networks such as the U-Net [136] and its variants were successfully used in various
medical image reconstruction [72, 55] and segmentation problems [27, 203]. These models consist of two
sub-networks: the encoder repeatedly filters and downsamples the input image with learned convolutional
filters resulting in a concise feature vector. This low-dimensional representation is then fed to the decoder
consisting of subsequent upsampling and learned filtering operations. Another approach that can be considered a generalization of iterative compressed sensing reconstructions consists of unrolling the data flow
graph of popular algorithms such as ADMM [187] or gradient descent iterations [193] and mapping them
to a cascade of sub-networks. Several variations of this unrolled method have been proposed recently for
MR reconstruction, such as i-RIM [127], Adaptive-CS-Net [123], Pyramid Convolutional RNN [178] and
E2E VarNet [158].
Another line of work, inspired by the deep image prior [171] focuses on using the inductive bias of
convolutional networks to perform reconstruction without any training data [76, 34, 60, 59, 172]. Those
methods do perform significantly better than classical un-trained networks, but do not perform as well as
neural networks trained on large sets of training data.
3.3 MRAugment: a data augmentation pipeline for MRI
In this section we propose our data augmentation technique, MRAugment, for MRI reconstruction. We
emphasize that data augmentation in this setup and for inverse problems in general is substantially different from DA for classification problems. For classification tasks, the label of the augmented image is
trivially the same as that of the original image, whereas for inverse problems we have to generate both an
augmented target image and the corresponding measurements. This is non-trivial as it is critical to match
the noise statistics of the augmented measurements with those in the dataset.
We are given training data in the form of fully-sampled MRI measurements in the Fourier domain,
and our goal is to generate new training examples consisting of a subsampled k-space measurement along
with a target image. MRAugment is model-agnostic in that the generated augmented training example
can be used with any machine learning model and therefore can be seamlessly integrated with existing
reconstruction algorithms for accelerated MRI, and potentially beyond MRI.
Our data augmentation pipeline, illustrated in Figure 3.1, generates a new example consisting of a subsampled k-space measurement $\tilde{k}_a$ along with a target image $\bar{x}_a$ as follows. We are given training examples as fully-sampled k-space slices, which we stack into a single column vector $k = \mathrm{col}(k_1, k_2, ..., k_N)$ for notational convenience. From these, we obtain the individual coil images by applying the inverse Fourier transform as $x = F^{-1}k$. We generate augmented individual coil images with an augmentation function $D$, specified later, as $x_a = D(x)$. From the augmented images, we generate an undersampled measurement by applying the forward model as $\tilde{k}_a = \mathcal{A}(x_a)$. Both $x$ and $x_a$ are complex-valued: even though the MR scanner obtains measurements of a real-valued object, due to noise the inverse Fourier transform of the measurement is complex-valued. Therefore the augmentation function has to generate complex-valued images, which adds an extra layer of difficulty compared to traditional data augmentation techniques pertaining to real-valued images (see Section 3.3.1 for further details). Finally, the real-valued ground truth image is obtained by combining the coil images $x_{a,i}$ by pixel-wise root-sum-squares (RSS) followed by center-cropping $C$:
$\bar{x}_a = C\big(\mathrm{RSS}(x_a)\big) = C\left( \sqrt{ \sum_{i=1}^{N} |x_{a,i}|^2 } \right).$
In the following subsections we first argue why we generate individual coil images with the augmentation
function, then discuss the design of the augmentation function D itself.
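Before going into those details, the sketch below summarizes one MRAugment step in NumPy, mirroring the flow of Figure 3.1. The augment callable, the orthonormal FFT convention and the helper names are assumptions made for this illustration; they do not reproduce our released implementation verbatim.

import numpy as np

def mraugment_example(kspace, mask, augment, crop_size):
    # kspace:    fully-sampled coil k-space, complex array of shape (N, H, W)
    # mask:      binary undersampling mask of shape (H, W)
    # augment:   callable applying the same transformation to every complex coil image
    # crop_size: (h, w) size of the center-cropped target
    coil_imgs = np.fft.ifft2(kspace, axes=(-2, -1), norm="ortho")       # x = F^{-1} k
    coil_imgs_aug = augment(coil_imgs)                                   # x_a = D(x)
    kspace_aug = np.fft.fft2(coil_imgs_aug, axes=(-2, -1), norm="ortho")
    kspace_under = mask[None] * kspace_aug                               # k_tilde_a = M F x_a
    rss = np.sqrt((np.abs(coil_imgs_aug) ** 2).sum(axis=0))              # root-sum-of-squares image
    h, w = crop_size
    H, W = rss.shape
    target = rss[(H - h) // 2:(H + h) // 2, (W - w) // 2:(W + w) // 2]   # center crop C
    return kspace_under, target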
3.3.1 Data augmentation needs to preserve noise statistics
As mentioned before, we are augmenting complex-valued, noisy images. This noise enters in the measurement process when we obtain the fully-sampled measurement of an image $x^*$ as $k = F x^* + z$, and is well approximated by i.i.d. complex Gaussian noise, independent in the real and imaginary parts of each pixel [118]. Therefore, we can write $x = x^* + z'$, where $z'$ has the same distribution as $z$ due to $F$ being unitary.

Figure 3.1: Flowchart of MRAugment, our data augmentation pipeline for MRI.

Since the noise distribution is characteristic of the instrumental setup (in this case the MR scanner and the acquisition protocol), assuming that the training and test images are produced by the same
setup, it is important that the augmentation function preserves the noise distribution of training images
as much as possible. Indeed, a large mismatch between training and test noise distribution leads to poor
generalization [88].
Let us demonstrate why it is non-trivial to generate augmented measurements for MRI through a
simple example. A natural but perhaps naive approach for data augmentation is to augment the real-valued
target image x¯ instead of the complex valued x. This would allow us to directly obtain real augmented
images from a real target image just as in typical data augmentation. However, this approach leads to
different noise distribution in the measurements compared to the test data due to the non-linear mapping
from individual coil images to the real-valued target and works poorly. Experiments demonstrating the
weakness of this naive approach of data augmentation can be found in Section 3.4.5.
In contrast, if we augment the individual coil images x directly with a linear function D, which is our
main focus here, we obtain the augmented k-space data
$k_a = F D x = F D(x^* + z') = F D x^* + F D z',$
where $F D x^*$ represents the augmented signal and the noise $F D z'$ is still additive complex Gaussian. A
key observation is that in case of transformations such as translation, horizontal and vertical flipping and
rotation the noise distribution is exactly preserved. Moreover, for general linear transformations the noise
is still Gaussian in the real and imaginary parts of each pixel.
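The first claim can be verified directly: flips, integer translations and rotations by multiples of 90° merely permute pixels, so an i.i.d. Gaussian noise field retains exactly the same distribution. The toy check below (with an arbitrary noise level) confirms that the empirical noise statistics are untouched by such a transform.

import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.03, size=(320, 320)) + 1j * rng.normal(scale=0.03, size=(320, 320))

flipped = np.flip(noise, axis=0)     # vertical flip: a pure permutation of pixels
rotated = np.rot90(noise)            # rotation by 90 degrees: also a permutation

print(np.isclose(noise.real.std(), flipped.real.std()))   # True
print(np.isclose(noise.imag.var(), rotated.imag.var()))   # True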
To elaborate further, in the multi-coil case our augmentation pipeline applies transformations to the
underlying object modulated by the different coil sensitivity maps. In particular, the fully sampled measurement of the ith coil in the image domain takes the form
$x_i = S_i x^* + z'_i,$   (3.1)
where $z'_i = F^{-1} z_i$ is i.i.d. Gaussian noise obtained via a unitary transform of the original measurement noise. Assuming linear augmentations, the augmented coil image from MRAugment can be written as
$x_{a,i} = D(S_i x^* + z'_i) = D S_i x^* + D z'_i,$   (3.2)
where the additive noise is still Gaussian. As seen in (3.2), MRAugment transforms images modulated by
the coil sensitivities, therefore the sensitivity maps are also indirectly augmented. However, the models
we experimented with had no issues learning the proper mapping from augmented measurements with
transformed sensitivity maps as our experimental results show.
It is natural to ask if data augmentation would be possible by directly augmenting the object, before the
coil sensitivities are applied. If the sensitivity maps are known or are estimated a priori, one may recover
the object from the various coils as
$x = \sum_{j=1}^{N} S^*_j x_j = \sum_{j=1}^{N} S^*_j (S_j x^* + z'_j) = x^* + \sum_{j=1}^{N} S^*_j z'_j,$
where $S^*_j$ is the complex conjugate of $S_j$ and $\sum_{j=1}^{N} S^*_j S_j = I$ due to typical normalization [158]. Then, we can apply the augmentation as
$x_a = D x = D\Big(x^* + \sum_{j=1}^{N} S^*_j z'_j\Big) = D x^* + D \sum_{j=1}^{N} S^*_j z'_j.$
Finally, we obtain the augmented coil images as
$x_{a,i} = S_i x_a = S_i D x^* + S_i D \sum_{j=1}^{N} S^*_j z'_j.$   (3.3)
Comparing (3.3) with (3.2), one may see that now the augmentation is directly applied to the ground
truth signal bypassing the coil sensitivities. However, comparing this result in (3.3) with the original
unaugmented coil images in (3.1) reveals that the additive noise in (3.3) has a very different distribution from the original i.i.d. Gaussian; even worse, the noise on different augmented coil images is now correlated.
Finally, the sensitivity maps are typically not known and need to be estimated before we can apply this
augmentation technique, which can introduce additional inaccuracies in the augmentation pipeline.
This discussion motivates our choice to i) augment complex-valued images directly derived from the
original k-space measurements, ii) consider simple transformations which preserve the noise distribution
and iii) augment individual coil images as in (3.2). Next we overview the types of augmentations we
propose in line with these key observations.
3.3.2 Transformations used for data augmentation
We apply the following two types of image transformations D in our data augmentation pipeline:
Pixel preserving augmentations, which do not require any form of interpolation and simply result in a
permutation of pixels over the image. Such transformations are vertical and horizontal flipping, translation
by integer number of pixels and rotation by multiples of 90◦
. As we pointed out in Section 3.3.1, these
transformations do not affect the noise distribution on the measurements and therefore are suitable for
problems where training and test data are expected to have similar noise characteristics.
General affine augmentations, which can be represented by an affine transformation matrix and in general
require resampling the transformed image at the output pixel locations. Augmentations in this group are:
translation by arbitrary (not necessarily integer) coordinates, arbitrary rotations, scaling and shearing.
Scaling can be applied along any of the two spatial dimensions. We differentiate between isotropic scaling,
in which the same scaling factor s is applied in both directions (s > 1: zoom-in, s < 1: zoom-out) and
anisotropic scaling in which different scaling factors (sx, sy) are applied along different axes.
Figure 3.2 provides a visual overview of the types of augmentations applied in this paper. Numerous
other forms of transformations may be used in this framework such as exposure and contrast adjustment,
image filtering (blur, sharpening) or image corruption (cutout, additive noise). However, in addition to the
noise considerations mentioned before that have to be taken into account, some of these transformations
are difficult to define for complex-valued images and may have subtle effects on image statistics. For
instance, brightness adjustment could be applied to the magnitude image, the real part only or both real
and imaginary parts, with drastically different effects on the magnitude-phase relationship of the image.
That said, we hope to incorporate additional augmentations in our pipeline in the future after a thorough
study of how they affect the noise distribution.
Figure 3.2: Transformations used in MRAugment applied to a ground truth slice (original, vertical flip, horizontal flip, rotation by multiples of 90°, arbitrary rotation, translation, zoom-in, zoom-out, anisotropic scaling, shearing).
3.3.3 Scheduling and application of data augmentations
With the different components in place we are now ready to discuss the scheduling and application of the
augmentations, as depicted in the bottom half of Figure 3.1. Recall that MRAugment generates a target
image $\bar{x}_a$ and corresponding undersampled k-space measurement $\tilde{k}_a$ from a full k-space measurement. Which augmentation is applied and how frequently is determined by a parameter $p$, the common parameter determining the probability of applying a transformation to the ground truth image during training, and the weights $W = (w_1, w_2, ..., w_K)$ pertaining to the $K$ different augmentations, controlling the weights of transformations relative to each other. We apply a given transformation $t_i$ with probability $p_i = p \cdot w_i$. The augmentation function is applied to the coil images; specifically, the same transformation is applied with the same parameters to the real and imaginary parts $(\Re\{x_1\}, \Im\{x_1\}, \Re\{x_2\}, \Im\{x_2\}, ..., \Re\{x_N\}, \Im\{x_N\})$ of the coil images. If a transformation $t_i$ is sampled (recall that we select them with probabilities $p_i$), we randomly select the parameters of the transformation from a pre-defined range (for example, rotation angle in $[0, 180°]$). To avoid aliasing artifacts, we first upsample the image before transformations that
require interpolation. Then the result is downsampled to the original size.
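A minimal sketch of such an interpolating transformation, assuming SciPy's ndimage routines and a 2x upsampling factor chosen only for illustration, applies identical parameters to the real and imaginary channels of a coil image:

import numpy as np
from scipy import ndimage

def rotate_complex(x, angle_deg, upsample=2):
    # rotate a complex coil image; real and imaginary parts are transformed identically
    def transform(channel):
        up = ndimage.zoom(channel, upsample, order=1)                  # upsample to reduce aliasing
        rot = ndimage.rotate(up, angle_deg, reshape=False, order=1)    # rotate about the image center
        return ndimage.zoom(rot, 1.0 / upsample, order=1)              # downsample to the original size
    return transform(x.real) + 1j * transform(x.imag)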
A critical question is how to schedule p over training in order to obtain the best model. Intuitively, in
initial stages of training no augmentation is needed, since the model can learn from the available original
training examples. As training progresses the network learns to fit the original data points and their utility
decreases over time. We find schedules starting from p = 0 and increasing over epochs to work best in
practice. The ideal rate of increase depends on both the model size and amount of available training data.
3.4 Experiments
In this section we explore the effectiveness of MRAugment in the context of accelerated MRI reconstruction
in various regimes of available training data sizes on various datasets. We start with providing a summary
of our main findings, followed by a detailed description of the experiments. Additional reconstructions
and more experimental details can be found in the supplementary material.
In the low-data regime (up to ≈ 4k images), data augmentation very significantly boosts reconstruction performance. The improvement is large both in terms of raw SSIM and visual reconstruction quality.
Using MRAugment, fine details are recovered that are completely missing from reconstructions without
DA. This suggests that DA improves the value of reconstructions for medical diagnosis, since health experts typically look for small features of the anatomy. This regime is especially important in practice, since
large public datasets are extremely rare.
In the moderate-data regime ( ≈ 4k − 15k images) MRAugment still achieves significant improvement in reconstruction SSIM. We want to emphasize the significance of seemingly small differences in
SSIM close to the state of the art and invite the reader to visit the fastMRI Challenge Leaderboard that
demonstrates how close the best performing models are.
In the high-data regime (more than 15k images) data augmentation has diminishing returns. It does
not notably improve performance of the current state of the art, but it does not degrade performance either.
Our experiments in the latter two regimes however strongly suggest that data augmentation combined
with much larger models may lead to significant improvement over the state of the art, even in the high-data regime. However, without larger models it is expected that in a regime of abundant data, DA does not
improve performance. For the models and problem considered here, this is around 15k images. We hope
to investigate the effectiveness of MRAugment combined with such larger models in our future work.
Additional benefits of data augmentation include improved robustness under shifts in test distribution, such as improved generalization to new MRI scanners and field strengths. Furthermore, we observe
that data augmentation can help to eliminate hallucinations by preventing overfitting to training data.
3.4.1 Experimental setup
We use the state-of-the-art End-to-End VarNet model [158], which is as of now one of the best performing neural network models for MRI reconstruction. We measure performance in terms of the structural
similarity index measure (SSIM), which is a standard evaluation metric for medical image reconstruction.
We study the performance of MRAugment as a function of the size of the training set. We construct different subsampled training sets by randomly sampling volumes of the original training dataset and adding
all slices of the sampled volumes to the new subsampled dataset. For all experiments, we apply random
masks by undersampling whole k-space lines in the phase encoding direction by a factor of 8 and including
4% of lowest frequency adjacent k-space lines in order to be consistent with baselines in [191]. For both
the baseline experiments and for MRAugment, we generate a new random mask for each slice on-the-fly
while training by uniformly sampling k-space lines, but use the same fixed mask for each slice within the
same volume on the validation set (different across volumes). This technique is standard for models trained
on the fastMRI dataset and not specific to our data augmentation pipeline. For augmentation probability
scheduling we use
$p(t) = \frac{p_{\max}}{1 - e^{-c}} \left(1 - e^{-tc/T}\right),$
where $t$ is the current epoch, $T$ denotes the total number of epochs, $c = 5$ and $p_{\max} = 0.55$ unless specified otherwise. This schedule works reasonably well on datasets of various sizes that we have studied and has not been fine-tuned to individual experiments. Ablation studies on the effect of the scheduling function are deferred to the supplementary material.
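For completeness, the schedule is a one-liner; the sketch below evaluates p(t) with the constants reported above (c = 5, pmax = 0.55).

import numpy as np

def augmentation_probability(t, total_epochs, p_max=0.55, c=5.0):
    # exponential ramp-up of the augmentation probability: p(0) = 0 and p(T) = p_max
    return p_max / (1.0 - np.exp(-c)) * (1.0 - np.exp(-t * c / total_epochs))

# probability at the start, middle and end of a 50-epoch run
print([round(augmentation_probability(t, 50), 3) for t in (0, 25, 50)])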
3.4.2 Low-data regime
For the low-data regime we work with two different datasets, the Stanford 2D FSE dataset and the 3D FSE
Knee dataset described below and demonstrate significant gains in reconstruction performance.
Stanford 2D FSE dataset. First, we perform experiments on the Stanford 2D FSE [22] dataset, a
public dataset of 89 fully-sampled MRI volumes of various anatomies including lower extremity, pelvis
and more. We use an 80%-20% training-validation split, randomly sampled by volumes. We generate 5
random splits in order to minimize variations in reconstruction metrics due to validation set selection and
report the mean validation SSIM over 5 runs along with the standard errors.
We plot a training curve of validation SSIM with and without data augmentation in Figure 3.3a. The
regularizing effect of data augmentation prevents overfitting to the training set and improves reconstruction SSIM on the validation dataset, even when training 4× longer than in the baseline experiments
without data augmentation. Figure 3.3b compares mean validation SSIM when the model is trained in different data regimes from 25% to 100% of all training data. MRAugment leads to significant improvement
in reconstruction SSIM and this improvement is consistent across different train-val splits and training set
sizes. We achieve higher mean SSIM using only 25% of the training data with MRAugment than training on the full dataset without DA. On the full dataset, we improve reconstruction SSIM from 0.8950 to
0.9120, and MRAugment achieves even larger gains in the lower data regime. Figure 3.4 provides a visual
comparison of a reconstructed slice emphasizing the benefit of data augmentation.
Figure 3.3: Experimental results on the Stanford 2D FSE dataset. (a) Validation SSIM vs. epochs. MRAugment prevents the model from overfitting to training data. (b) Validation SSIM vs. number of training images. Mean and standard error over 5 train/val splits are depicted.
Figure 3.4: Visual comparison of reconstructions on the Stanford 2D FSE dataset with and without data
augmentation.
Stanford Fullysampled 3D FSE Knees dataset. The Stanford Fullysampled 3D FSE Knees dataset
[143] consists of 20 fully-sampled k-space volumes of knees. We use the same methodology to generate
training and validation splits and evaluate results as in the case of the Stanford 2D FSE dataset. This dataset has significantly less variation compared to the Stanford 2D FSE dataset. Consequently,
we observe strong overfitting early in training if no data augmentation is used (Figure 3.5a). However,
applying data augmentation successfully prevents overfitting. Furthermore, in accordance with observations on the Stanford 2D FSE dataset, data augmentation significantly boosts reconstruction SSIM across different data regimes (Figure 3.5b).

Figure 3.5: Experimental results on the Stanford Fullysampled 3D FSE Knees dataset. (a) Validation SSIM vs. training epochs. We observe strong overfitting without data augmentation. (b) Validation SSIM vs. number of training images. Mean and standard error over 5 train/val splits are depicted.
3.4.3 High-data regime
Next, we perform an extensive study on the fastMRI dataset [191], the largest publicly available fully-sampled MRI dataset with competitive baseline models, which allows us to investigate the utility of MRAugment across a wide range of training data regimes. More specifically, we use the fastMRI knee dataset, for
which the original training set consists of approximately 35k MRI slices in 973 volumes and we subsample
to 1%, 10%, 33% and 100% of the original size. We measure performance on the original (fixed) validation
set separate from the training set.
Single-coil experiments. For single-coil acquisition we are able to exactly match the performance
of the model trained on the full dataset using only a third of the training data as depicted on the left in
Fig. 3.7. Moreover, with only 10% of the training data we achieve comparable SSIM to the model trained
on the full dataset. The visual difference between reconstructions with and without data augmentation
becomes striking in the low-data regime.

Figure 3.6: Visual comparison of single-coil (top row) and multi-coil (bottom row) reconstructions using varying amounts of training data with and without data augmentation. We achieve reconstruction quality comparable to the state of the art but using 1% of the training data. Without DA fine details are completely lost.

Figure 3.7: Single-coil (left) and multi-coil (right) validation SSIM vs. # of training images.

As seen in the top row of Fig. 3.6, the model without DA was
unable to reconstruct any of the fine details and the results appear blurry with strong artifacts. Applying
MRAugment greatly improves reconstruction quality both in a quantitative and qualitative sense, visually
approaching that obtained from training on the full dataset but using a hundred times less data.
Multi-coil experiments. As depicted on the right in Fig. 3.7 for multi-coil acquisition we closely
match the state of the art while significantly reducing training data. More specifically, we approach the
state of the art SSIM within 0.6% using 10% of the training data and within 0.25% with 33% of training
data. As seen in the bottom row of Fig. 3.6, when using only 1% of the training data we successfully reconstruct fine details comparable to that obtained from training on the full dataset, while high frequencies
are completely lost without DA.
Finally, we perform ablation studies on the fastMRI dataset and demonstrate that both pixel preserving and interpolating transformations individually improve reconstruction SSIM. Furthermore their effect
is complementary: the best results are obtained by adding all transformations to the pipeline. Moreover,
we investigate the effect of the data augmentation scheduling function and demonstrate that exponential scheduling results in better performance compared to a constant augmentation probability. We also
evaluate a range of different augmentation schedules and show that both significantly lower or higher
probabilities lead to poorer reconstruction SSIM. All ablation experiments can be found in the supplementary material.
3.4.4 Model robustness
In this section we investigate further potential benefits of data augmentation in scenarios where training
examples from the target data distribution are not only scarce as studied before, but unavailable. Distribution shifts can have a detrimental effect on a variety of reconstruction methods [33]. Furthermore, we
show some initial experimental results on how data augmentation may help avoid hallucinated features
appearing on reconstructions due to overfitting.
Unseen MR scanners. First, we explore how data augmentation impacts generalization to new MRI
scanner models not available at training time. Different MRI scanners may use different field strengths
for acquisition, and higher field strength typically correlates with higher SNR. Approximately half of the
volumes in the fastMRI knee dataset have been acquired by a 1.5T scanner, whereas the rest by three
different 3T scanners. We perform the following experiments:
• 3T → 3T: We train and validate on volumes acquired using 3T scanners. Volumes in the validation
set have been imaged by a 3T scanner not in the training set.
• 3T → 1.5T: We train on all volumes acquired by 3T scanners and validate on the 1.5T scanner.
• 1.5T → 3T: We train on all volumes acquired by the 1.5T scanner and validate on all other 3T
scanners.
Table 3.1 summarizes our results. Data augmentation consistently improves reconstruction SSIM on unseen scanner models. Similarly to our main experiments, the improvement is especially significant in the
low-data regime. We observe that DA provides the greatest benefit when training on 1.5T scanners and
testing on 3T models. We hypothesize that data augmentation can hinder the model from overfitting to
the higher noise level present on 1.5T acquisitions during training thus resulting in better generalization
on the lower noise 3T volumes.
Unseen anatomies. We demonstrate how data augmentation may help improve generalization to new anatomies not included in the training set, even in the high-data regime. We train a VarNet model on the complete fastMRI knee training dataset using the hyperparameters recommended in [158], and evaluate the network on the fastMRI brain validation dataset throughout training. We repeat the experiment with the same hyperparameters, but with MRAugment turned on. The results can be seen in Fig. 3.9. The regularizing effect of data augmentation impedes the network from overfitting to the training dataset, thus the
resulting model is more robust to shifts in test distribution. This results in higher reconstruction quality
in terms of SSIM on unseen brain data when using MRAugment.
Hallucinations. An unfortunate side-effect of deep learning based reconstructions may be the presence of hallucinated details. This is especially problematic in providing accurate medical diagnosis and lessens the trust of medical practitioners in deep learning. We observe that data augmentation has the
potential benefit of increased robustness against hallucinations by preventing overfitting to training data, as seen in Fig. 3.8.

Figure 3.8: Hallucinated features appear on reconstructions without data augmentation.
3.4.5 Naive data augmentation
We would like to emphasize the importance of applying DA in a way that takes into account the measurement noise distribution. When applied incorrectly, DA leads to significantly worse performance than not
using any augmentation.
We train a model using ’naive’ data augmentation without considering the measurement noise distribution as described in Section 3.3.1, by augmenting real-valued target images. We use the same exponential
augmentation probability scheduling for MRAugment and the naive approach. As Fig. 3.10a demonstrates,
reconstruction quality degrades over training using the naive approach. This is due to the fact that as the augmentation probability increases, the network sees fewer and fewer original, unaugmented images, whereas the poorly augmented images are detrimental for generalization due to the mismatch between the training and validation noise distributions.

Table 3.1: Scanner transfer results using 2% (top) and 100% (bottom) of training data.
2% train     no DA    DA
3T → 3T      0.8646   0.9049
3T → 1.5T    0.8241   0.8551
1.5T → 3T    0.8174   0.8913
100% train   no DA    DA
3T → 3T      0.9177   0.9185
3T → 1.5T    0.8686   0.8690
1.5T → 3T    0.9043   0.9062

Figure 3.9: Val. performance of models trained on knee and evaluated on brain MRI.

On the other hand, MRAugment clearly helps and validation performance steadily
improves over epochs. Fig. 3.10b provides a visual comparison of reconstructions using naive DA and
our data augmentation method tailored to the problem. Naive DA reconstruction exhibits high-frequency
artifacts and low image quality caused by the mismatch in noise distribution. These drastic differences
underline the vital importance of taking a careful, physics-informed approach to data augmentation for
MR reconstruction.
Figure 3.10: Experimental results comparing MRAugment with naive data augmentation. (a) Naive data augmentation degrades generalization performance. (b) Naive data augmentation that does not preserve the measurement noise distribution leads to significantly degraded reconstruction quality.
3.5 Conclusion
In this paper, we develop a physics-based data augmentation pipeline for accelerated MR imaging. We find
that MRAugment yields significant gains in the low-data regime which can be beneficial in applications
where only little training data is available or where the training data changes quickly. We also demonstrate that models trained with data augmentation are more robust against overfitting and shifts in the
test distribution. We believe that this work opens up many interesting avenues for further research with
respect to data augmentation for inverse problems in the low data regime. First, learning efficient data
augmentation from the training data has been investigated in prior literature [31, 95, 168] and would be a
natural extension of our method. Second, finding the optimal augmentation strength throughout training
is challenging, therefore an adaptive scheme that automatically schedules the augmentation probability
would potentially further improve upon our results. Such a method is proposed in [80] for discriminator
augmentation in GAN training, where the augmentation strength is adjusted based on a discriminator
overfitting heuristic with great success. Finally, combining our technique with a generative framework
such as AmbientGAN [9] that generates high quality samples of a target distribution from noisy partial
measurements, could be potentially used to synthesize fully-sampled MRI data from few noisy k-space
measurements.
Chapter 4
HUMUS-Net: Hybrid Unrolled Multi-scale Network Architecture for
Accelerated MRI Reconstruction
Transformer-based architectures have become the new state of the art in a multitude of computer vision
applications in the general domain. It is crucial to adapt these innovations to scientific and medical applications in order to harness the potential of such techniques in critical real-world problems. In this chapter,
we introduce a novel architecture for MRI reconstruction that can leverage the efficiency of convolutions
and the power of Transformers simultaneously, resulting in an algorithm that establishes new state of the
art in MRI reconstruction.
This chapter is based on the following works:
• Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. "HUMUS-Net: Hybrid unrolled multi-scale network architecture for accelerated MRI reconstruction." Advances in Neural Information Processing
Systems 35 (2022): 25306-25319.
• Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. "HUMUS-Net: Hybrid unrolled multi-scale
network architecture for accelerated MRI reconstruction." arXiv preprint arXiv:2203.08213 (2022)
4.1 Introduction
Magnetic resonance imaging (MRI) is a medical imaging technique that uses strong magnetic fields to image the anatomy and physiological processes of the patient. MRI is one of the most popular imaging modalities as it is non-invasive and does not expose the patient to harmful ionizing radiation. The MRI scanner obtains measurements of the body in the spatial frequency domain, also called k-space. The data acquisition process in MRI is often time-consuming. Accelerated MRI [106] addresses this challenge by undersampling in the k-space domain, thus reducing the time patients need to spend in the scanner. However, recovering the underlying anatomy from undersampled measurements is an ill-posed problem (fewer measurements than unknowns), and thus incorporating some form of prior knowledge is crucial for obtaining high quality reconstructions. Classical MRI reconstruction algorithms rely on the assumption that the
underlying signal is sparse in some transform domain and attempt to recover a signal that best satisfies
this assumption in a technique known as compressed sensing (CS) [12, 44]. These classical CS techniques
have slow reconstruction speed and typically enforce limited forms of image priors.
With the emergence of deep learning (DL), data-driven reconstruction algorithms have far surpassed
CS techniques (see [121] for an overview). DL models utilize large training datasets to extract flexible, nuanced priors directly from data resulting in excellent reconstruction quality. In recent years, there has been
a flurry of activity aimed at designing DL architectures tailored to the MRI reconstruction problem. The
most popular models are convolutional neural networks (CNNs) that typically incorporate the physics of
the MRI reconstruction problem, and utilize tools from mainstream deep learning (residual learning, data
augmentation, self-supervised learning). Comparing the performance of such models has been difficult
mainly due to two reasons. First, there has been a large variation in evaluation datasets spanning different scanners, anatomies, acquisition models and undersampling patterns, rendering direct comparison
challenging. Second, medical imaging datasets are often proprietary due to privacy concerns, hindering
reproducibility.
More recently, the fastMRI dataset [192], the largest publicly available MRI dataset, has been gaining ground as a standard benchmark to evaluate MRI reconstruction methods. An annual competition, the fastMRI Challenge [111], attracts significant attention from the machine learning community and acts as a driver of innovation in MRI reconstruction. However, over the past years the public leaderboard has been dominated by a single architecture, the End-to-End VarNet [160]∗, with most models concentrating very closely around the same performance metrics, hinting at the saturation of current architectural choices.
In this work, we propose HUMUS-Net: a Hybrid, Unrolled, MUlti-Scale network architecture for accelerated MRI reconstruction that combines the advantages of well-established architectures in the field with
the power of contemporary Transformer-based models. We utilize the strong implicit bias of convolutions,
but also address their weaknesses, such as content-independence and inability to model long-range dependencies, by incorporating a novel multi-scale feature extractor that operates over embedded image patches
via self-attention. Moreover, we tackle the challenge of high input resolution typical in MRI by performing
the computationally most expensive operations on extracted low-resolution features. HUMUS-Net establishes new state of the art in accelerated MRI reconstruction on the largest available MRI knee dataset.
At the time of writing this paper, HUMUS-Net is the only Transformer-based architecture on the highly
competitive fastMRI Public Leaderboard. Our results are fully reproducible and the source code is available
online.†
4.2 Background
4.2.1 Inverse Problem Formulation of Accelerated MRI Reconstruction
An MR scanner obtains measurements of the patient anatomy in the frequency domain, also referred to
as k-space. Data acquisition is performed via various receiver coils positioned around the anatomy being imaged, each with different spatial sensitivity.
∗At the time of writing this paper, the number one model on the leaderboard is AIRS-Net; however, the complete architecture/training details as well as the code of this method are not available to the public, and we were unable to reproduce their results based on the limited available information.
†Code: https://github.com/MathFLDS/HUMUS-Net.
Given a total number of $N$ receiver coils, the measurements obtained by the $i$th coil can be written as
$$k_i = F S_i x^* + z_i, \quad i = 1, \ldots, N,$$
where $x^* \in \mathbb{C}^n$ is the underlying patient anatomy of interest, $S_i$ is a diagonal matrix that represents the sensitivity map of the $i$th coil, $F$ is a multi-dimensional Fourier transform, and $z_i$ denotes the measurement noise corrupting the observations obtained from coil $i$. We use $k = (k_1, \ldots, k_N)$ as a shorthand for the concatenation of individual coil measurements and $x = (x_1, \ldots, x_N)$ as the corresponding image-domain representation after inverse Fourier transformation.
Since MR data acquisition time is proportional to the portion of k-space being scanned, obtaining fully-sampled data is time-consuming. Therefore, in accelerated MRI scan times are reduced by undersampling in the k-space domain. The undersampled k-space measurements from coil $i$ take the form
$$\tilde{k}_i = M k_i, \quad i = 1, \ldots, N,$$
where $M$ is a diagonal matrix representing the binary undersampling mask, which has $0$ entries for all frequency components that have not been sampled during accelerated acquisition.
The forward model that maps the underlying anatomy to coil measurements can be written concisely as $\tilde{k} = \mathcal{A}(x^*)$, where $\mathcal{A}(\cdot)$ is the linear forward mapping and $\tilde{k}$ is the stacked vector of all undersampled coil measurements. Our target is to reconstruct the ground truth object $x^*$ from the noisy, undersampled measurements $\tilde{k}$. Since we have fewer observations than variables to recover, perfect reconstruction in general is not possible. In order to make the problem solvable, prior knowledge on the underlying object is typically incorporated in the form of sparsity in some transform domain. This formulation, known as
compressed sensing [12, 44], provides a classical framework for accelerated MRI reconstruction [106]. In particular, the above recovery problem can be formulated as a regularized inverse problem
$$\hat{x} = \arg\min_{x} \; \|\mathcal{A}(x) - \tilde{k}\|^2 + \mathcal{R}(x), \tag{4.1}$$
where $\mathcal{R}(\cdot)$ is a regularizer that encapsulates prior knowledge on the object, such as sparsity in some wavelet domain.
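For concreteness, a minimal sketch of this multi-coil forward model is given below. The tensor shapes, random sensitivity maps and column-wise mask are illustrative assumptions rather than the exact fastMRI data layout.

```python
import torch

def forward_model(x, sens_maps, mask, noise_std=0.0):
    """Map a complex image x (H, W) to undersampled multi-coil k-space.

    sens_maps: (N, H, W) complex coil sensitivities S_i
    mask:      (H, W) binary undersampling mask M
    Returns k_tilde of shape (N, H, W): M * (F S_i x + z_i).
    """
    coil_images = sens_maps * x                          # S_i x for each coil
    kspace = torch.fft.fft2(coil_images, norm="ortho")   # F S_i x
    noise = noise_std * torch.randn_like(kspace)         # measurement noise z_i
    return mask * (kspace + noise)                       # undersampling with M

# Example with random data (shapes are for illustration only)
H, W, N = 64, 64, 4
x = torch.randn(H, W, dtype=torch.complex64)
sens = torch.randn(N, H, W, dtype=torch.complex64)
mask = (torch.rand(1, W) < 0.25).float().expand(H, W)   # random phase-encode lines
k_tilde = forward_model(x, sens, mask, noise_std=0.01)
```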
4.2.2 Deep Learning-based Accelerated MRI Reconstruction
More recently, data-driven deep learning-based algorithms tailored to the accelerated MRI reconstruction problem have surpassed the classical compressed sensing baselines. Convolutional neural networks
trained on large datasets have established new state of the art in many medical imaging tasks. The highly
popular U-Net [137] and other similar encoder-decoder architectures have proven to be successful in a
range of medical image reconstruction [73, 56] and segmentation [28, 202] problems. In the encoder path,
the network learns to extract a set of deep, low-dimensional features from images via a series of convolutional and downsampling operations. These concise feature representations are then gradually upsampled
and filtered in the decoder to the original image dimensions. Thus the network learns a hierarchical representation over the input image distribution.
Unrolled networks constitute another line of work that has been inspired by popular optimization
algorithms used to solve compressed sensing reconstruction problems. These deep learning models consist
of a series of sub-networks, also known as cascades, where each sub-network corresponds to an unrolled
iteration of popular algorithms such as gradient descent [194] or ADMM [161]. In the context of MRI
reconstruction, one can view network unrolling as solving a sequence of smaller denoising problems,
instead of the complete recovery problem in one step. Various convolutional neural networks have been
employed in the unrolling framework achieving excellent performance in accelerated MRI reconstruction
[126, 53, 54]. E2E-VarNet [160] is the current state-of-the-art convolutional model on the fastMRI dataset. E2E-VarNet transforms the optimization problem in (4.1) to the k-space domain and unrolls the gradient descent iterations into $T$ cascades, where the $t$th cascade represents the computation
$$\hat{k}^{t+1} = \hat{k}^{t} - \mu^{t} M (\hat{k}^{t} - \tilde{k}) + G(\hat{k}^{t}), \tag{4.2}$$
where $\hat{k}^{t}$ is the estimated reconstruction in the k-space domain at cascade $t$, $\mu^{t}$ is a learnable step size parameter and $G(\cdot)$ is a learned mapping representing the gradient of the regularization term in (4.1). The first term is also known as the data consistency (DC) term as it enforces the consistency of the estimate with the available measurements.
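A single cascade of the update in (4.2) can be sketched as below; refinement_net stands in for the learned mapping G and is a generic placeholder, not the actual E2E-VarNet module.

```python
import torch
import torch.nn as nn

class KSpaceCascade(nn.Module):
    """One unrolled iteration of (4.2): data consistency plus learned refinement."""

    def __init__(self, refinement_net: nn.Module):
        super().__init__()
        self.refinement = refinement_net               # plays the role of G(.)
        self.step_size = nn.Parameter(torch.ones(1))   # learnable step size mu^t

    def forward(self, k_hat, k_tilde, mask):
        # Data consistency: pull sampled k-space locations towards the measurements.
        dc = self.step_size * mask * (k_hat - k_tilde)
        # Learned term approximating the gradient of the regularizer.
        return k_hat - dc + self.refinement(k_hat)
```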
Figure 4.1: Overview of the HUMUS-Block architecture. First, we extract high-resolution features FH from
the input noisy image through a convolution layer fH. Then, we apply a convolutional feature extractor fL
to obtain low-resolution features and process them using a Transformer-convolutional hybrid multi-scale
feature extractor. The shallow, high-resolution and deep, low-resolution features are then synthesized into
the final high-resolution denoised image.
4.3 Related Work
Transformers in Vision – Vision Transformer (ViT) [45], a fully non-convolutional vision architecture, has demonstrated state-of-the-art performance on image classification problems when pre-trained on
large-scale image datasets. The key idea of ViT is to split the input image into non-overlapping patches,
embed each patch via a learned linear mapping and process the resulting tokens via stacked self-attention
and multi-layer perceptron (MLP) blocks. For more details we refer the reader to Appendix C.7 and [45].
The benefit of Transformers over convolutional architectures in vision lies in their ability to capture long-range dependencies in images via the self-attention mechanism.
Since the introduction of ViT, similar attention-based architectures have been proposed for many other
vision tasks such as object detection [15], image segmentation [179] and restoration [14, 21, 97, 190, 181].
A key challenge for Transformers in low-level vision problems is the quadratic compute complexity of the
global self-attention with respect to the input dimension. In some works, this issue has been mitigated by
splitting the input image into fixed size patches and processing the patches independently [21]. Others
focus on designing hierarchical Transformer architectures [61, 179] similar to popular ResNets [58]. Authors in [190] propose applying self-attention channel-wise rather than across the spatial dimension, thus
reducing the compute overhead to linear complexity. Another successful architecture, the Swin Transformer [102], tackles the quadratic scaling issue by computing self-attention on smaller local windows. To
encourage cross-window interaction, windows in subsequent layers are shifted relative to each other.
Transformers in Medical Imaging – Transformer architectures have been proposed recently to
tackle various medical imaging problems. Authors in [13] design a U-Net-like architecture for medical
image segmentation where the traditional convolutional layers are replaced by Swin Transformer blocks.
They report strong results on multi-organ and cardiac image segmentation. In [201] a hybrid convolutional
and Transformer-based U-Net architecture is proposed tailored to volumetric medical image segmentation
with excellent results on benchmark datasets. Similar encoder-decoder architectures for various medical
segmentation tasks have been investigated in other works [71, 185]. However, these networks are tailored
for image segmentation, a task less sensitive to fine details in the input, and thus larger patch sizes are
often used (for instance 4 in [13]). This allows the network to process larger input images, as the number of token embeddings is greatly reduced, but as we demonstrate in Section 4.5.2, embedding individual pixels as 1 × 1 patches is crucial for MRI reconstruction. Thus, compute and memory barriers stemming from
large input resolutions are more severe in the MRI reconstruction task and therefore novel approaches are
needed.
Promising results have been reported employing Transformers in medical image denoising problems,
such as low-dose CT denoising [177, 107] and low-count PET/MRI denoising [196]. However, these studies
fail to address the challenge of poor scaling to large input resolutions, and only work on small images via
either downsampling the original dataset [107] or by slicing the large input images into smaller patches
[177]. In contrast, our proposed architecture works directly on the large resolution images that often arise
in MRI reconstruction. Even though some work exists on Transformer-based architectures for supervised
accelerated MRI reconstruction [70, 99, 48], and for unsupervised pre-trained reconstruction [90], to the
best of our knowledge ours is the first work to demonstrate state-of-the-art results on large-scale MRI
datasets such as the fastMRI dataset.
4.4 Method
HUMUS-Net combines the efficiency and beneficial implicit bias of convolutional networks with the powerful general representations of Transformers and their capability to capture long-range pixel dependencies. The resulting hybrid network processes information both in image representation (via convolutions)
and in patch-embedded token representation (via Transformer blocks). Our proposed architecture consists
of a sequence of sub-networks, also called cascades. Each cascade represents an unrolled iteration of an underlying optimization algorithm in k-space (see (4.2)), with an image-domain denoiser, the HUMUS-Block.
First, we describe the architecture of the HUMUS-Block, the core component in the sub-network. Then,
we specify the high-level, k-space unrolling architecture of HUMUS-Net in Section 4.4.3.
4.4.1 HUMUS-Block Architecture
The HUMUS-Block acts as an image-space denoiser that receives an intermediate reconstruction from
the previous cascade and performs a single step of denoising to produce an improved reconstruction for
the next cascade. It extracts high-resolution, shallow features and low-resolution, deep features through
a novel multi-scale transformer-based block, and synthesizes high-resolution features from those. The
high-level overview of the HUMUS-Block is depicted in Fig. 4.1.
High-resolution Feature Extraction – The input to our network is a noisy complex-valued image $x_{in} \in \mathbb{R}^{h \times w \times c_{in}}$ derived from undersampled k-space data, where the real and imaginary parts of the image are concatenated along the channel dimension. First, we extract high-resolution features $F_H \in \mathbb{R}^{h \times w \times d_H}$ from the input noisy image through a convolution layer $f_H$ written as
$$F_H = f_H(x_{in}).$$
This initial $3 \times 3$ convolution layer provides early visual processing at a relatively low cost and maps the input to a higher, $d_H$-dimensional feature space. It is important to note that the resolution of the extracted features is the same as the spatial resolution of the input image.
Low-resolution Feature Extraction – In the case of MRI reconstruction, input resolution is typically significantly higher than in commonly used image datasets ($32 \times 32$ - $256 \times 256$), posing a significant challenge to contemporary Transformer-based models. Thus we apply a convolutional feature extractor $f_L$ to obtain low-resolution features
$$F_L = f_L(F_H),$$
with $F_L \in \mathbb{R}^{h_L \times w_L \times d_L}$, where $f_L$ consists of a sequence of convolutional blocks and spatial downsampling operations. The specific architecture is depicted in Figure 4.4. The purpose of this module is to perform deeper visual processing and to provide a manageable input size to the subsequent computation- and memory-heavy hybrid processing module. In this work, we choose $h_L = \frac{h}{2}$ and $w_L = \frac{w}{2}$, which strikes a balance between preserving spatial information and resource demands. Furthermore, in order to compensate for the reduced resolution we increase the feature dimension to $d_L = 2 \cdot d_H := d$.
Deep Feature Extraction – The most important part of our model is MUST, a MUlti-scale residual Swin Transformer network. MUST is a multi-scale hybrid feature extractor that takes the low-resolution image representations $F_L$ and performs hierarchical Transformer-convolutional hybrid processing in an encoder-decoder fashion, producing deep features
$$F_D = f_D(F_L),$$
where the specific architecture behind $f_D$ is detailed in Subsection 4.4.2.
High-resolution Image Reconstruction – Finally, we combine information from the shallow, high-resolution features $F_H$ and the deep, low-resolution features $F_D$ to reconstruct the high-resolution residual image via a convolutional reconstruction module $f_R$. The residual learning paradigm allows us to learn the difference between noisy and clean images and helps information flow within the network [58]. Thus the final denoised image $x_{out} \in \mathbb{R}^{h \times w \times c_{in}}$ is obtained as
$$x_{out} = x_{in} + f_R(F_H, F_D).$$
The specific architecture of the reconstruction network is depicted in Figure 4.4.
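The overall data flow through the HUMUS-Block can be summarized in a few lines; the modules below are simplified convolutional stand-ins for $f_H$, $f_L$, the MUST extractor $f_D$ and $f_R$, chosen only to make the shapes explicit, and are not the actual implementation.

```python
import torch
import torch.nn as nn

class HUMUSBlockSketch(nn.Module):
    """Schematic HUMUS-Block: x_out = x_in + f_R(F_H, F_D)."""

    def __init__(self, c_in=2, d_h=32):
        super().__init__()
        d = 2 * d_h  # low-resolution feature dimension d_L = 2 * d_H
        self.f_h = nn.Conv2d(c_in, d_h, 3, padding=1)          # high-res features F_H
        self.f_l = nn.Conv2d(d_h, d, 3, stride=2, padding=1)   # 2x spatial downsampling
        self.f_d = nn.Identity()    # placeholder for the MUST deep feature extractor
        self.up = nn.ConvTranspose2d(d, d_h, 2, stride=2)      # back to full resolution
        self.f_r = nn.Conv2d(2 * d_h, c_in, 3, padding=1)      # reconstruction module

    def forward(self, x_in):
        f_h = self.f_h(x_in)            # F_H: (B, d_H, h, w)
        f_l = self.f_l(f_h)             # F_L: (B, d, h/2, w/2)
        f_d = self.f_d(f_l)             # F_D from the multi-scale extractor
        fused = torch.cat([f_h, self.up(f_d)], dim=1)
        return x_in + self.f_r(fused)   # residual learning
```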
4.4.2 Multi-scale Hybrid Feature Extraction via MUST
The key component of our architecture is MUST, a multi-scale hybrid encoder-decoder architecture that performs deep feature extraction in both image and token representation (Figure 4.1, bottom). First, individual pixels of the input representation of shape $\frac{h}{2} \times \frac{w}{2} \times d$ are flattened and passed through a learned linear mapping to yield $\frac{h}{2} \cdot \frac{w}{2}$ tokens of dimension $d$. Tokens corresponding to different image patches are subsequently merged in the encoder path, resulting in a concise latent representation. This highly descriptive representation is passed through a bottleneck block and progressively expanded by combining tokens from the encoder path via skip connections. The final output is rearranged to match the exact shape of the input low-resolution features $F_L$, yielding a deep feature representation $F_D$.
Our design is inspired by the success of Residual Swin Transformer Blocks (RSTB) in image denoising and super-resolution [97]. RSTB features a stack of Swin Transformer layers (STL) that operate on
tokens via a windowed self-attention mechanism [102], followed by convolution in image representation.
However, RSTB blocks operate on a single scale, therefore they cannot be readily applied in a hierarchical encoder-decoder architecture. Therefore, we design three variations of RSTB to facilitate multi-scale
processing as depicted in Figure 4.2.
RSTB-B is the bottleneck block responsible for processing the encoded latent representation while
maintaining feature dimensions. Thus, we keep the default RSTB architecture for our bottleneck block,
which already operates on a single scale.
RSTB-D has a similar function to convolutional downsampling blocks in U-Nets, but it operates on embedded tokens. Given an input of size $h_i \cdot w_i \times d$, we pass it through an RSTB-B block and apply the PatchMerge operation. PatchMerge linearly combines tokens corresponding to $2 \times 2$ non-overlapping image patches, while simultaneously increasing the embedding dimension (see Figure 4.3, top, and Figure A14 in the appendix for more details), resulting in an output of size $\frac{h_i}{2} \cdot \frac{w_i}{2} \times 2d$. Furthermore, RSTB-D outputs the higher-dimensional representation before patch merging, to be subsequently used in the decoder path via a skip connection.
Figure 4.2: Depiction of different RSTB modules used in the HUMUS-Block.
Figure 4.3: Patch merge and expand operations used in our multi-scale feature extractor.
Figure 4.4: Architecture of convolutional blocks for feature extraction and reconstruction.
RSTB-U, used in the decoder path, is analogous to convolutional upsampling blocks. An input of size $h_i \cdot w_i \times d$ is first expanded into a larger number of lower-dimensional tokens through a linear mapping via PatchExpand (see Figure 4.3, bottom, and Figure A14 in the appendix for more details). PatchExpand reverses the effect of PatchMerge on feature size, thus resulting in $2h_i \cdot 2w_i$ tokens of dimension $\frac{d}{2}$. Next, we mix information from the obtained expanded tokens with skip embeddings from higher scales via TokenMix. This operation linearly combines tokens from both paths and normalizes the resulting vectors. Finally, the mixed tokens are processed by an RSTB-B block.
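The token-level PatchMerge and PatchExpand operations amount to the following shape bookkeeping (2×2 spatial merging with doubled embedding dimension, and its reversal); this is a simplified sketch, not the exact HUMUS-Net implementation.

```python
import torch
import torch.nn as nn

class PatchMerge(nn.Module):
    """Merge 2x2 neighboring tokens: (B, H*W, d) -> (B, H/2 * W/2, 2d)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(4 * d)
        self.reduce = nn.Linear(4 * d, 2 * d, bias=False)

    def forward(self, tokens, H, W):
        B, _, d = tokens.shape
        x = tokens.view(B, H, W, d)
        # Gather the four tokens of each 2x2 patch along the channel dimension.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        x = x.view(B, (H // 2) * (W // 2), 4 * d)
        return self.reduce(self.norm(x))

class PatchExpand(nn.Module):
    """Expand tokens: (B, H*W, d) -> (B, 2H * 2W, d/2), reversing PatchMerge."""
    def __init__(self, d):
        super().__init__()
        self.expand = nn.Linear(d, 2 * d, bias=False)  # produces 4 tokens of dim d/2

    def forward(self, tokens, H, W):
        B, _, d = tokens.shape
        x = self.expand(tokens).view(B, H, W, 2, 2, d // 2)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, d // 2)
        return x.reshape(B, 4 * H * W, d // 2)
```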
4.4.3 Iterative Unrolling
Architectures derived from unrolling the iterations of various optimization algorithms have proven to be
successful in tackling various inverse problems including MRI reconstruction. These architectures can be
interpreted as a cascade of simpler denoisers, each of which progressively refines the estimate from the
preceding unrolled iteration (see more details in Appendix C.6).
Following [160], we unroll the gradient descent iterations of the inverse problem in (4.1) in the k-space domain, yielding the iterative update scheme in (4.2). We apply regularization in the image domain via our proposed HUMUS-Block, that is, we have
$$G(k) = F\left(E\left(D\left(R\left(F^{-1}(k)\right)\right)\right)\right),$$
where $D$ denotes the HUMUS-Block, $R(x_1, \ldots, x_N) = \sum_{i=1}^{N} S_i^{*} x_i$ is the reduce operator that combines coil images via the corresponding sensitivity maps, and $E(x) = (S_1 x, \ldots, S_N x)$ is the expand operator that maps the combined image back to individual coil images. The sensitivity maps can be estimated a priori using methods such as ESPIRiT [170] or learned in an end-to-end fashion via a Sensitivity Map Estimator (SME) network proposed in [160]. In this work we aspire to design an end-to-end approach, and thus we use the latter method and estimate the sensitivity maps from the low-frequency (ACS) region of the undersampled input measurements during training using a standard U-Net network. A visual overview of our unrolled architecture is shown in Figure A13 in the appendix.
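A minimal sketch of the reduce and expand operators and the resulting regularization mapping G is shown below; the denoiser argument is a placeholder for the HUMUS-Block.

```python
import torch

def reduce_op(coil_images, sens_maps):
    # R(x_1, ..., x_N) = sum_i S_i^* x_i : combine coil images into a single image.
    return (sens_maps.conj() * coil_images).sum(dim=0)

def expand_op(image, sens_maps):
    # E(x) = (S_1 x, ..., S_N x) : map the combined image back to coil images.
    return sens_maps * image

def regularizer_G(kspace, sens_maps, denoiser):
    # G(k) = F( E( D( R( F^{-1}(k) ) ) ) ), with D an image-domain denoiser.
    coil_images = torch.fft.ifft2(kspace, norm="ortho")
    combined = reduce_op(coil_images, sens_maps)
    denoised = denoiser(combined)
    return torch.fft.fft2(expand_op(denoised, sens_maps), norm="ortho")
```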
4.4.4 Adjacent Slice Reconstruction (ASR)
We observe improvements in reconstruction quality when, instead of processing the undersampled data slice-by-slice, we jointly reconstruct a set of adjacent slices via HUMUS-Net (Figure 4.5). That is, if we have a volume of undersampled data $\tilde{k}_{vol} = (\tilde{k}^1, \ldots, \tilde{k}^K)$ with $K$ slices, when reconstructing slice $c$ we instead reconstruct the volume $(\tilde{k}^{c-a}, \ldots, \tilde{k}^{c-1}, \tilde{k}^{c}, \tilde{k}^{c+1}, \ldots, \tilde{k}^{c+a})$ by concatenating the slices along the coil channel dimension, where $a$ denotes the number of adjacent slices added on each side. However, we only calculate and backpropagate the loss on the center slice $c$ of the reconstructed volume. The benefit of ASR is that the network can remove artifacts corrupting individual slices as it sees a larger context of the slice by observing its neighbors. Even though ASR increases compute cost, it is important to note that it does
not impact the number of token embeddings (spatial resolution is unchanged) and thus can be combined favorably with Transformer-based methods.
Figure 4.5: Adjacent slice reconstruction (depicted in image domain for visual clarity): HUMUS-Net takes a volume of adjacent slices $(\tilde{x}^{c-a}, \ldots, \tilde{x}^{c}, \ldots, \tilde{x}^{c+a})$ and jointly reconstructs a volume $(\hat{x}^{c-a}, \ldots, \hat{x}^{c}, \ldots, \hat{x}^{c+a})$. The reconstruction loss $\mathcal{L}$ is calculated only on the center slice $\hat{x}^{c}$.
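In terms of implementation, ASR only changes how the network input and the loss are assembled; a sketch follows, where model, loss_fn, and the slice/coil layout are illustrative placeholders.

```python
import torch

def asr_step(kspace_volume, targets, c, a, model, loss_fn):
    """One training step with adjacent slice reconstruction.

    kspace_volume: (K, N, H, W) undersampled k-space for K slices, N coils
    targets:       (K, H, W) ground truth images
    c: index of the center slice, a: number of adjacent slices per side
    """
    # Stack slices c-a, ..., c+a along the coil channel dimension.
    neighborhood = kspace_volume[c - a : c + a + 1]               # (2a+1, N, H, W)
    stacked = neighborhood.reshape(-1, *neighborhood.shape[2:])   # ((2a+1)*N, H, W)

    recon = model(stacked.unsqueeze(0))   # assumed output: (1, 2a+1, H, W) slices
    center_recon = recon[0, a]            # keep only the center slice

    # Loss (and gradients) are computed on the center slice only.
    return loss_fn(center_recon, targets[c])
```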
4.5 Experiments
In this section we provide experimental results on our proposed architecture, HUMUS-Net. First, we
demonstrate the reconstruction performance of our model on various datasets, including the large-scale
fastMRI dataset. Then, we justify our design choices through a set of ablation studies.
4.5.1 Benchmark Experiments
We investigate the performance of HUMUS-Net on three different datasets. We use the structural similarity
index measure (SSIM) [183] as a basis of our evaluation, which is the most common evaluation metric
in medical image reconstruction. In all of our experiments, we follow the setup of the fastMRI multi-coil knee track with 8× acceleration in order to provide comparison with state-of-the-art networks of
the Public Leaderboard. That is, we perform retrospective undersampling of the fully-sampled k-space
data by randomly subsampling 12.5% of whole k-space lines in the phase encoding direction, keeping
4% of lowest frequency adjacent k-space lines. Experiments on other acceleration ratios can be found in
Appendix C.3. During training, we generate random masks following the above method, whereas for the
validation dataset we keep the masks fixed for each k-space volume. For HUMUS-Net, we center crop and
pad inputs to 384 × 384. We compare the reconstruction quality of our proposed model with the current best performing network, E2E-VarNet. For details on HUMUS-Net hyperparameters and training, we refer the reader to Appendix C.1. For E2E-VarNet, we use the hyperparameters specified in [160].
fastMRI knee multi-coil ×8 test data
Method                 # of params (approx.)   SSIM(↑)   PSNR(↑)   NMSE(↓)
E2E-VarNet [160]       30M                     0.8900    36.9      0.0089
E2E-VarNet† [160]                              0.8920    37.1      0.0085
XPDNet [132]           155M                    0.8893    37.2      0.0083
Σ-Net [54]             676M                    0.8877    36.7      0.0091
i-RIM [126]            300M                    0.8875    36.7      0.0091
U-Net [192]            214M                    0.8640    34.7      0.0132
HUMUS-Net (ours)       109M                    0.8936    37.0      0.0086
HUMUS-Net (ours)†                              0.8945    37.3      0.0081
HUMUS-Net-L (ours)     228M                    0.8944    37.3      0.0083
HUMUS-Net-L (ours)†                            0.8951    37.4      0.0080
Table 4.1: Performance of state-of-the-art accelerated MRI reconstruction techniques on the fastMRI knee test dataset. Most models are trained only on the fastMRI training dataset; if available, we show results of models trained on the combined fastMRI training and validation datasets, denoted by (†).
Validation SSIM
Method              fastMRI knee   Stanford 2D       Stanford 3D
E2E-VarNet [160]    0.8908         0.8928 ± 0.0168   0.9432 ± 0.0063
HUMUS-Net (ours)    0.8934         0.8954 ± 0.0136   0.9453 ± 0.0065
Table 4.2: Validation SSIM of HUMUS-Net on various datasets. For datasets with multiple train-validation split runs we show the mean and standard error of the runs.
Ablation studies
Method         Unrolled?   Multi-scale?   Low-res features?   Patch size   Embed. dim.   SSIM
Un-SS          ✓           ✗              ✗                   1            12            0.9319 ± 0.0080
Un-MS          ✓           ✓              ✗                   1            12            0.9357 ± 0.0038
Un-MS-Patch2   ✓           ✓              ✗                   2            36            0.9171 ± 0.0075
HUMUS-Net      ✓           ✓              ✓                   1            36            0.9449 ± 0.0064
SwinIR                                                                                   0.9336 ± 0.0069
E2E-VarNet                                                                               0.9432 ± 0.0063
Table 4.3: Results of ablation studies on HUMUS-Net, evaluated on the Stanford 3D MRI dataset.
fastMRI – The fastMRI dataset [192] is the largest publicly available MRI dataset with competitive
baseline models and a public leaderboard, and thus provides an opportunity to directly compare different
algorithms. Specifically, we run experiments on the multi-coil knee dataset, consisting of close to 35k
slices in 973 volumes. We use the default HUMUS-Net model defined above with 3 adjacent slices as
input. Furthermore, we design a large variant of our model, HUMUS-Net-L, which has increased embedding
dimension compared to the default model (see details in Appendix C.1). We train models both only on the
training split, and also on the training and validation splits combined (additional ≈ 20% data) for the
leaderboard. Table 4.1 demonstrates our results compared to the best published models from the fastMRI
Leaderboard‡
evaluated on the test dataset. Our model establishes new state of the art in terms of SSIM on
this dataset by a large margin, and achieves comparable or better performance than other methods in terms
of PSNR and NMSE. Moreover, as seen in the second column of Table 4.2, we evaluated our model on the
fastMRI validation dataset as well and compared our results to E2E-VarNet, the best performing model from
the leaderboard. We observe similar improvements in terms of the reconstruction SSIM metric to the test
dataset. Visual inspection of reconstructions shows that HUMUS-Net recovers very fine details in images
that may be missed by other state-of-the-art reconstruction algorithms (see Figure A15
in the appendix). Comparison based on further image quality metrics can be found in Appendix C.8. We
point out that even though our model has more parameters than E2E-VarNet, our proposed image-domain
denoiser is more efficient than the U-Net deployed in E2E-VarNet, even when their number of parameters
are matched as discussed in Section 4.5.3. Overall, larger model size does not necessarily correlate with
better performance, as seen in other competitive models in Table 4.1. Finally, we note that the additional
training data (in the form of the validation split) provides a consistent small boost to model performance.
We refer the reader to [87] for an overview of scaling properties of reconstruction models.
Stanford 2D – Next, we run experiments on the Stanford2D FSE [22] dataset, a publicly available MRI
dataset consisting of scans from various anatomies (pelvis, lower extremity and more) in 89 fully-sampled
volumes. We randomly sample 80% of volumes as train data and use the rest for validation. We randomly
generate 3 different train-validation splits this way to reduce variations in the presented metrics. As slices
in this dataset have widely varying shapes across volumes, we center crop the target images to keep spatial
resolution within 384 × 384. We use the default HUMUS-Net defined above with single slices as input.
‡https://fastmri.org/leaderboards
Our results comparing the best performing MRI reconstruction model with HUMUS-Net are shown in the
third column of Table 4.2. We present the mean SSIM of all runs along with the standard error. We achieve
improvements of similar magnitude as on the fastMRI dataset. These results demonstrate the effectiveness
of HUMUS-Net on a more diverse dataset featuring multiple anatomies.
Stanford 3D – Finally, we evaluate our model on the Stanford Fullysampled 3D FSE Knees dataset
[143], a public MRI dataset including 20 volumes of knee MRI scans. We generate train-validation splits
using the method described for Stanford 2D and perform 3 runs. We use the default HUMUS-Net network
with single slices as input. The last column of Table 4.2 compares our results to E2E-VarNet, showing
improvements of similar scales as on other datasets we have investigated in this work. This experiment
demonstrates that HUMUS-Net performs well not only on large-scale MRI datasets, but also on smaller
problems.
4.5.2 Ablation Studies
In this section, we motivate our design choices through a set of ablation studies. We start from SwinIR, a
general image reconstruction network, and highlight its weaknesses for MRI. Then, we demonstrate step-by-step how we addressed these shortcomings and arrived at the HUMUS-Net architecture. We train the
models on the Stanford 3D dataset. More details can be found in Appendix C.2. The results of our ablation
studies are summarized in Table 4.3.
First, we investigate SwinIR, a state-of-the-art image denoiser and super-resolution model that features
a hybrid Transformer-convolutional architecture. In order to handle the 10× larger input sizes (320×320)
in our MRI dataset compared to the input images this network has been designed for (128×128), we reduce
the embedding dimension of SwinIR to fit into GPU memory (16 GB). We find that compared to models
designed for MRI reconstruction, such as E2E-VarNet (last row in Table 4.3) SwinIR performs poorly. This
is not only due to the reduced network size, but also due to the fact that SwinIR is not specifically designed
to take the MRI forward model into consideration.
Next, we unroll SwinIR and add a sensitivity map estimator. We refer to this model as Un-SS. Due
to unrolling, we have to further reduce the embedding dimension of the denoiser and also decrease the
depth of the network in order to fit into GPU memory. Un-SS, due to its small size, performs slightly worse
than vanilla SwinIR and significantly lags behind the E2E-VarNet architectures. We note that SwinIR
operates over a single, full-resolution scale, whereas state-of-the-art MRI reconstruction models typically
incorporate multi-scale processing in the form of U-Net-like architectures.
Thus, we replace SwinIR by MUST, our proposed multi-scale hybrid processing unit, but keep the
embedding dimension in the largest-resolution scale fixed. The obtained network, which we call Un-MS, has overall lower computational cost when compared with Un-SS; however, as Table 4.3 shows, MRI
reconstruction performance has significantly improved compared to both Un-SS and vanilla SwinIR, which
highlights the efficiency of our proposed multi-scale feature extractor. Reconstruction performance is
limited by the low dimension of patch embeddings, which we are unable to increase further due to our
compute and memory constraints originating in the high-resolution inputs.
The most straightforward approach to tackle the challenge of high input resolution is to increase the
patch size. To test this idea, we take Un-MS and embed 2 × 2 patches of the inputs, thus reducing the
number of tokens processed by the network by a factor of 4. We refer to this model as Un-MS-Patch2.
This reduction in compute and memory load allows us to increase network capacity by increasing the
embedding dimension 3-folds (to fill GPU memory again). However, Un-MS-Patch2 performs much worse
than previous models using patch size of 1. For classification problems, where the general image context
is more important than small details, patches of 16×16 or 8×8 are considered typical [45]. Even for more
dense prediction tasks such as medical image segmentation, patch size of 4 × 4 has been used successfully
[13]. However, our experiments suggest that in low-level vision tasks such as MRI reconstruction using
patches larger than 1 × 1 may be detrimental due to loss of crucial high-frequency detail information.
Our approach to address the heavy computational load of Transformers for large input resolutions
where increasing the patch size is not an option is to process lower-resolution features extracted via convolutions. That is, we replace MUST in Un-MS by a HUMUS-Block, resulting in our proposed HUMUS-Net
architecture. We train a smaller version of HUMUS-Net with the same embedding dimension as Un-MS.
As seen in Table 4.3, our model achieves the best performance across all other proposed solutions, even
surpassing E2E-VarNet. This series of incremental studies highlights the importance of each architectural
design choice leading to our proposed HUMUS-Net architecture. Further ablation studies on the effect
of the number of unrolled iterations and adjacent slice reconstruction can be found in Appendix C.4 and
Appendix C.5 respectively.
4.5.3 Direct comparison of image-domain denoisers
In order to further demonstrate the advantage of HUMUS-Net over E2E-VarNet, we provide direct comparison between the image-domain denoisers used in the above methods. E2E-VarNet unrolls a fully convolutional U-Net, whereas we deploy our hybrid HUMUS-Block architecture as a denoiser (Fig. 4.1). We
scale down the HUMUS-Block used in HUMUS-Net to match the size of the U-Net in E2E-VarNet in terms of the number of model parameters. To this end, we reduce the embedding dimension from 66 to 30. We train both networks on magnitude images from the Stanford 3D dataset. The results are summarized in Table 4.4. We observe that given a fixed parameter budget, our proposed denoiser outperforms the widely used convolutional U-Net architecture in MRI reconstruction, further demonstrating the efficiency of HUMUS-Net.
This experiment suggests that our HUMUS-Block could serve as an excellent denoiser, replacing convolutional U-Nets, in a broad range of image restoration applications outside of MRI, which we leave for future
work.
Model         # of parameters   SSIM(↑)           PSNR(↑)      NMSE(↓)
U-Net         2.5M              0.9348 ± 0.0072   39.0 ± 0.6   0.0257 ± 0.0007
HUMUS-Block   2.4M              0.9378 ± 0.0065   39.2 ± 0.5   0.0246 ± 0.0004
Table 4.4: Direct comparison of denoisers on the Stanford 3D dataset. Mean and standard error of 3 random training-validation splits are shown.
4.6 Conclusion
In this chapter, we introduce HUMUS-Net, an unrolled, Transformer-convolutional hybrid network for accelerated MRI reconstruction. HUMUS-Net achieves state-of-the-art performance on the fastMRI dataset and greatly outperforms all previously published and reproducible methods. We demonstrate the performance of our proposed method on two other MRI datasets and perform fine-grained ablation studies to
motivate our design choices and emphasize the compute and memory challenges of Transformer-based architectures on low-level and dense computer vision tasks such as MRI reconstruction. A limitation of our
current architecture is that it requires fixed-size inputs, which we intend to address with a more flexible
design in the future. This work opens the door for the adoption of a multitude of promising techniques
introduced recently in the literature for Transformers, which we leave for future work.
Chapter 5
DiracDiffusion: Denoising and Incremental Reconstruction with
Assured Data-Consistency
Diffusion models have established new state of the art in generative modeling, capable of creating synthetic
images of exceptional fidelity and realism. In this chapter, we explore the potential of such models for image
reconstruction and demonstrate how diffusion can be tailored to a specific inverse problem. The proposed
algorithm has great flexibility in controlling the trade-off between faithfulness to the measurements and
perceptual image quality.
This chapter is based on the following work:
• Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. "DiracDiffusion: Denoising and Incremental
Reconstruction with Assured Data-Consistency." arXiv preprint arXiv:2303.14353 (2023).
5.1 Introduction
Diffusion models are powerful generative models capable of synthesizing samples of exceptional quality
by reversing a diffusion process that gradually corrupts a clean image by adding Gaussian noise. Diffusion
models have been explored from two perspectives: Denoising Diffusion Probabilistic Models (DDPM) [149,
63] and Score-Based Models [153, 155], which have been recently unified under a general framework of
Stochastic Differential Equations (SDEs) [157]. Diffusion models have established new state of the art in
image generation [40, 116, 141, 131, 135, 65, 140], audio [89] and video synthesis [66], and recently have
been deployed for solving inverse problems with great success.
In inverse problems, one wishes to recover a signal x from a noisy observation y connected via the forward model A and measurement noise z in the form y = A(x)+z. As A is typically non-invertible, prior
knowledge on x has been incorporated in a variety of ways including sparsity-inducing regularizers [12,
44], plug-and-play priors [174] and learned deep learning priors [121]. Diffusion models learn a prior over
the data distribution by matching the gradient of the log density (Stein-score). The unconditional score
function learned by diffusion models has been successfully leveraged to solve inverse problems without
any task-specific training [78, 75, 142]. However, as the score of the posterior distribution is intractable
in general, different methods have been proposed to enforce consistency between the generated image
and the corresponding observations. These methods include alternating between a step of unconditional update and a step of projection [156, 26, 25], or other correction techniques [23, 24] that guide the diffusion process towards data consistency. Another line of work proposes diffusion in the spectral space of the forward operator, achieving high quality reconstructions; however, this approach requires a costly singular value decomposition [84, 82, 83]. Concurrent work uses pseudo-inverse guidance [152] to incorporate the model into the reconstruction process. All of these methods utilize a score function learned via denoising score-matching on a standard diffusion process that simply adds Gaussian noise to clean images, and therefore the specific corruption model of the inverse problem is not incorporated into model training directly.
There has been a recent push to broaden the notion of Gaussian diffusion, such as extension to other
noise distributions [35, 113, 119]. In the context of image generation, there has been work to generalize
the corruption process, such as blur diffusion [94, 68], inverse heat dissipation [134] and arbitrary linear corruptions [32]. [6] questions the necessity of stochasticity in the generative process altogether
and demonstrates empirical results on noiseless corruptions with arbitrary deterministic degradations.
However, designing the diffusion process specifically for inverse problem solving has not been explored
extensively yet. A recent example is [184], which proposes adding an additional drift term to the forward SDE
that pulls the iterates towards the corrupted measurement and demonstrates high quality reconstructions
for JPEG compression artifact removal. Concurrent work [37] defines the forward process as convex combinations between the clean image and the corrupted observation, obtaining promising results in motion
deblurring and super-resolution.
Despite all of these successes of diffusion models in high-quality image generation, the requirements
imposed on inverse problems are very different from synthetic image generation. First, due to the
perception-distortion trade-off [8], as diffusion models generate images of exceptional detail, typically
reconstruction methods relying on diffusion underperform in distortion metrics, such as PSNR and SSIM
[23], that are traditionally used to evaluate image reconstructions. Moreover, as data consistency is not
always explicitly enforced during reverse diffusion, we may obtain visually appealing reconstructions that are in fact not faithful to our original observations.
In this paper, we propose a novel framework for solving inverse problems using a generalized notion
of diffusion that mimics the corruption process that produced the observation. We call our method Dirac:
Denoising and Incremental Reconstruction with Assured data-Consistency. As the forward model and
noising process are directly incorporated into the framework, our method maintains data consistency
throughout the reverse diffusion process, without any additional steps such as projections. Furthermore,
we make the key observation that details are gradually added to the posterior mean estimates during the
sampling process. This property imbues our method with great flexibility: by leveraging early-stopping
we can freely trade off perceptual quality for better distortion metrics and sampling speedup or vice versa.
We provide theoretical analysis on the accuracy, performance and limitations of our method that is well-supported by empirical results. Our numerical experiments demonstrate state-of-the-art results in terms
of both perceptual and distortion metrics with fast sampling.
5.2 Background
Diffusion models – Diffusion models are generative models based on a corruption process that gradually transforms a clean image distribution $q_0$ into a known prior distribution which is tractable, but contains no information about the data. The corruption level, or severity as we refer to it in this paper, is indexed by time $t$ and increases from $t = 0$ (clean images) to $t = 1$ (pure noise). The typical corruption process consists of adding Gaussian noise of increasing magnitude to clean images, that is, $q_t(x_t|x_0) \sim \mathcal{N}(x_0, \sigma_t^2 I)$, where $x_0 \sim q_0$ is a clean image and $x_t$ is the corrupted image at time $t$. By learning to reverse the corruption process, one can generate samples from $q_0$ by sampling from a simple noise distribution and running the learned reverse diffusion process from $t = 1$ to $t = 0$.
Diffusion models have been explored along two seemingly different trajectories. Score-Based Models [153, 155] attempt to learn the gradient of the log likelihood and use Langevin dynamics for sampling, whereas DDPM [149, 63] adopts a variational inference interpretation. More recently, a unified framework based on SDEs [157] has been proposed. Namely, both Score-Based Models and DDPM can be expressed via a Forward SDE in the form $dx = f(x, t)dt + g(t)dw$ with different choices of $f$ and $g$. Here $w$ denotes the standard Wiener process. This SDE is reversible [2], and the Reverse SDE can be written as
$$dx = [f(x, t) - g^2(t)\nabla_x \log q_t(x)]dt + g(t)d\bar{w}, \tag{5.1}$$
where $\bar{w}$ is the standard Wiener process in which time flows in the reverse direction. The true score $\nabla_x \log q_t(x)$ is approximated by a neural network $s_\theta(x_t, t)$ from the tractable conditional distribution $q_t(x_t|x_0)$ by minimizing
$$\mathbb{E}_{t \sim U[0,1],\, (x_0, x_t)}\left[ w(t) \left\| s_\theta(x_t, t) - \nabla_{x_t} \log q_t(x_t|x_0) \right\|^2 \right], \tag{5.2}$$
where $(x_0, x_t) \sim q_0(x_0) q_t(x_t|x_0)$ and $w(t)$ is a weighting function. By applying different discretization schemes to (5.1), one can derive various algorithms to simulate the reverse diffusion process for sample generation.
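As a concrete reference point, a Monte Carlo estimate of the denoising score matching objective in (5.2) for the standard Gaussian corruption can be sketched as follows; the score network s_theta is a placeholder and the weighting is an illustrative choice.

```python
import torch

def dsm_loss(s_theta, x0, sigma_t, t, weight=1.0):
    """Monte Carlo estimate of (5.2) for q_t(x_t | x_0) = N(x_0, sigma_t^2 I)."""
    x_t = x0 + sigma_t * torch.randn_like(x0)
    # For Gaussian corruption the conditional score is (x_0 - x_t) / sigma_t^2.
    target_score = (x0 - x_t) / sigma_t ** 2
    residual = s_theta(x_t, t) - target_score
    per_sample = residual.pow(2).flatten(start_dim=1).sum(dim=1)
    return (weight * per_sample).mean()   # w(t) = 1 here; other weightings are possible
```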
Diffusion Models for Inverse Problems – Our goal is to solve a noisy, possibly nonlinear inverse problem
$$\tilde{y} = \mathcal{A}(x_0) + z, \quad z \sim \mathcal{N}(0, \sigma^2 I), \tag{5.3}$$
with $\tilde{y}, x_0 \in \mathbb{R}^n$ and $\mathcal{A}: \mathbb{R}^n \to \mathbb{R}^n$. That is, we are interested in solving a reconstruction problem, where we observe a measurement $\tilde{y}$ that is known to be produced by applying a non-invertible mapping $\mathcal{A}$ to a ground truth signal $x_0$ and is corrupted by additive noise $z$. We refer to $\mathcal{A}$ as the degradation, and $\mathcal{A}(x_0)$ as a degraded signal. Our goal is to recover $x_0$ as faithfully as possible, which can be thought of as generating samples from the posterior distribution $q(x_0|\tilde{y})$. However, as information is fundamentally lost in the measurement process in (5.3), prior information on clean signals needs to be incorporated to make recovery possible. In the classical field of compressed sensing [12], a sparsity-inducing regularizer is directly added to the reconstruction objective. An alternative is to leverage diffusion models as the prior to obtain a reverse diffusion sampler for sampling from the posterior based on (5.1). Using Bayes' rule, the score of the posterior can be written as
$$\nabla_x \log q_t(x|\tilde{y}) = \nabla_x \log q_t(x) + \nabla_x \log q_t(\tilde{y}|x), \tag{5.4}$$
where the first term can be approximated using score-matching as in (5.2). On the other hand, the second term cannot be expressed in closed form in general, and therefore a flurry of activity has emerged recently to circumvent computing the likelihood directly. The most prominent approach is to alternate between an unconditional update from (5.1) and some form of projection to enforce consistency with the measurement [156, 26, 25]. In recent work, it has been shown that the projection step may throw the sampling path off
the data manifold and therefore additional correction steps are proposed to keep the solver close to the
manifold [24, 23].
5.3 Method
In this work, we propose a novel perspective on solving ill-posed inverse problems. In particular, we
assume that our noisy observation y˜ results from a process that gradually applies more and more severe
degradations to an underlying clean signal.
5.3.1 Degradation severity
To define severity more rigorously, we appeal to the intuition that given two noiseless, degraded signals $y$ and $y^+$ of a clean signal $x_0$, then $y^+$ is corrupted by a more severe degradation than $y$ if $y$ contains all the information necessary to find $y^+$ without knowing $x_0$.
Definition (Severity of degradations). A mapping $\mathcal{A}^+ : \mathbb{R}^n \to \mathbb{R}^n$ is a more severe degradation than $\mathcal{A} : \mathbb{R}^n \to \mathbb{R}^n$ if there exists a surjective mapping $G_{\mathcal{A} \to \mathcal{A}^+} : \mathrm{Image}(\mathcal{A}) \to \mathrm{Image}(\mathcal{A}^+)$. That is,
$$\mathcal{A}^+(x_0) = G_{\mathcal{A} \to \mathcal{A}^+}\left(\mathcal{A}(x_0)\right) \quad \forall x_0 \in \mathrm{dom}(\mathcal{A}).$$
We call $G_{\mathcal{A} \to \mathcal{A}^+}$ the forward degradation transition function from $\mathcal{A}$ to $\mathcal{A}^+$.
Take image inpainting as an example (Fig. 5.1) and let $\mathcal{A}_t$ denote a masking operator that sets pixels to $0$ within a centered box, where the box side length is $l(t) = t \cdot W$, where $W$ is the image width and $t \in [0, 1]$. Assume that we have an observation $y_{t'} = \mathcal{A}_{t'}(x_0)$, which is a degradation of a clean image $x_0$ where a small center square with side length $l(t')$ is masked out. Given $y_{t'}$, without having access to the complete clean image, we can find any other masked version of $x_0$ where a box with at least side length $l(t')$ is masked out. Therefore every other masking operator $\mathcal{A}_{t''}$, $t' < t''$, is a more severe degradation
Figure 5.1: Severity of degradations: We can always find a more degraded image $y_{t''}$ from a less degraded version of the same clean image $y_{t'}$ via the forward degradation transition function $G_{t' \to t''}$, but not vice versa.
than $\mathcal{A}_{t'}$. The forward degradation transition function $G_{\mathcal{A}_{t'} \to \mathcal{A}_{t''}}$ in this case is simply $\mathcal{A}_{t''}$. We also note here that the reverse degradation transition function $H_{\mathcal{A}_{t''} \to \mathcal{A}_{t'}}$ that recovers $\mathcal{A}_{t'}(x_0)$ from a more severe degradation $\mathcal{A}_{t''}(x_0)$ for any $x_0$ does not exist in general.
5.3.2 Deterministic and stochastic degradation processes
Using this novel notion of degradation severity, we can define a deterministic degradation process that
gradually removes information from the clean signal via more and more severe degradations.
Definition (Deterministic degradation process). A deterministic degradation process is a differentiable mapping $\mathcal{A} : [0, 1] \times \mathbb{R}^n \to \mathbb{R}^n$ that has the following properties:
1. Diminishing severity: $\mathcal{A}(0, x) = x$.
2. Monotonically degrading: $\forall t' \in [0, 1)$ and $t'' \in (t', 1]$, $\mathcal{A}(t'', \cdot)$ is a more severe degradation than $\mathcal{A}(t', \cdot)$.
We use the shorthand $\mathcal{A}(t, \cdot) = \mathcal{A}_t(\cdot)$ and $G_{\mathcal{A}_{t'} \to \mathcal{A}_{t''}} = G_{t' \to t''}$ for the underlying forward degradation transition functions for all $t' < t''$.
Our deterministic degradation process starts from a clean signal $x_0$ at time $t = 0$ and applies degradations with increasing severity over time. If we choose $\mathcal{A}(1, \cdot) = 0$, then all information in the original signal is destroyed over the degradation process. One can easily sample from the forward process, that is, the process that evolves forward in time starting from a clean image $x_0$ at $t = 0$: a sample from time $t$ can be computed directly as $y_t = \mathcal{A}_t(x_0)$.
So far we have shown how to write a noiseless measurement yt as part of a deterministic degradation
process. However, typically our measurements are not only degraded by a non-invertible mapping, but
also corrupted by noise as seen in (5.3). Thus, one can combine the deterministic degradation process with
a stochastic noising process that gradually adds Gaussian noise to the degraded measurements.
Definition (Stochastic degradation process (SDP)). $y_t = \mathcal{A}_t(x_0) + z_t$, $z_t \sim \mathcal{N}(0, \sigma_t^2 I)$ is a stochastic degradation process if $\mathcal{A}_t$ is a deterministic degradation process, $t \in [0, 1]$, and $x_0 \sim q_0(x_0)$ is a sample from the clean data distribution. We denote the distribution of $y_t$ as $q_t(y_t) \sim \mathcal{N}(\mathcal{A}_t(x_0), \sigma_t^2 I)$.
A key contribution of our work is looking at a noisy, degraded signal as a sample from the forward
process of an underlying SDP, and considering the reconstruction problem as running the reverse process
of the SDP backwards in time in order to recover the clean sample.
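Using the box-inpainting example from Section 5.3.1, the deterministic degradation A_t and a sample from the corresponding stochastic degradation process can be sketched as follows; square images and the specific schedule are illustrative assumptions.

```python
import torch

def box_mask_degradation(x0, t):
    """A_t: zero out a centered square of side length l(t) = t * W (square images)."""
    W = x0.shape[-1]
    side = int(t * W)
    y = x0.clone()
    if side > 0:
        start = (W - side) // 2
        y[..., start:start + side, start:start + side] = 0.0
    return y

def forward_transition(y_less_severe, t_more_severe):
    # G_{t' -> t''}: a more severely masked image is obtained from a less severely
    # masked one without access to x0, here simply by masking again.
    return box_mask_degradation(y_less_severe, t_more_severe)

# A sample from the stochastic degradation process at severity t:
# y_t = A_t(x0) + z_t with z_t ~ N(0, sigma_t^2 I).
x0 = torch.rand(1, 64, 64)
t, sigma_t = 0.4, 0.05
y_t = box_mask_degradation(x0, t) + sigma_t * torch.randn_like(x0)
```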
5.3.3 SDP as a Stochastic Differential Equation
We can formulate the evolution of our degraded and noisy measurements $y_t$ as an SDE:
$$dy_t = \dot{\mathcal{A}}_t(x_0)\,dt + \sqrt{\frac{d\sigma_t^2}{dt}}\,dw. \tag{5.5}$$
This is an example of an Itô SDE, and for a fixed $x_0$ the above process is reversible, where the reverse diffusion process is given by
$$dy_t = \left[\dot{\mathcal{A}}_t(x_0) - \frac{d\sigma_t^2}{dt}\,\nabla_{y_t}\log q_t(y_t)\right] dt + \sqrt{\frac{d\sigma_t^2}{dt}}\,d\bar{w}. \tag{5.6}$$
One would solve the above SDE by discretizing it (for example via Euler-Maruyama), approximating differentials with finite differences:
$$y_{t-\Delta t} = y_t + \underbrace{\mathcal{A}_{t-\Delta t}(x_0) - \mathcal{A}_t(x_0)}_{\text{incremental reconstruction}} - \underbrace{(\sigma_{t-\Delta t}^2 - \sigma_t^2)\,\nabla_{y_t}\log q_t(y_t)}_{\text{denoising}} + \sqrt{\sigma_t^2 - \sigma_{t-\Delta t}^2}\; z, \tag{5.7}$$
where $z \sim \mathcal{N}(0, I)$. The update in (5.7) lends itself to an interesting interpretation. One can look at it as the combination of a small, incremental reconstruction and denoising steps. In particular, assume that $y_t = \mathcal{A}_t(x_0) + z_t$ and let
$$R(t, \Delta t; x_0) := \mathcal{A}_{t-\Delta t}(x_0) - \mathcal{A}_t(x_0). \tag{5.8}$$
Then, the first term $y_t + R(t, \Delta t; x_0) = \mathcal{A}_{t-\Delta t}(x_0) + z_t$ will reverse a $\Delta t$ step of the deterministic degradation process, equivalent in effect to the reverse degradation transition function $H_{t \to t-\Delta t}$. The second term is analogous to a denoising step in standard diffusion, where a slightly less noisy version of the image is predicted. However, before we can simulate the reverse SDE in (5.7) to recover $x_0$, we face two obstacles.
Denoising: We do not know the score of $q_t(y_t)$. This is commonly tackled by learning a noise-conditioned score network that matches the conditional log-probability $\log q_t(y_t|x_0)$, which we can easily compute. We are also going to follow this path.
Incremental reconstruction: We do not know $\mathcal{A}_{t-\Delta t}(x_0)$ and $\mathcal{A}_t(x_0)$ for the incremental reconstruction step, since $x_0$ is unknown to us when reversing the degradation process (it is in fact what we would like to recover). We are going to address the above issues one-by-one.
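Once these two quantities are replaced by the learned estimates developed in Sections 5.3.4 and 5.3.5, one discretized reverse update of (5.7) can be sketched as follows; phi_theta (the clean-image predictor), the degradation A, and the noise schedule sigma are placeholders.

```python
import torch

def reverse_step(y_t, t, dt, sigma, A, phi_theta):
    """One Euler-Maruyama-style reverse update following (5.7).

    sigma(t): noise schedule, A(x, t): degradation A_t(x),
    phi_theta(y, t): network estimate of the clean image x_0.
    """
    x0_hat = phi_theta(y_t, t)
    # Incremental reconstruction estimate, cf. (5.13).
    incr_recon = A(x0_hat, t - dt) - A(x0_hat, t)
    # Conditional score estimate, cf. (5.11): (A_t(x0_hat) - y_t) / sigma_t^2.
    score = (A(x0_hat, t) - y_t) / sigma(t) ** 2
    denoise = (sigma(t - dt) ** 2 - sigma(t) ** 2) * score
    noise = (sigma(t) ** 2 - sigma(t - dt) ** 2) ** 0.5 * torch.randn_like(y_t)
    return y_t + incr_recon - denoise + noise
```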
5.3.4 Denoising - learning a score network
To run the reverse SDE, we need the score of the noisy, degraded distribution $\nabla_{y_t}\log q_t(y_t)$, which is intractable. However, we can use the denoising score matching framework to approximate the score. In particular, instead of the true score, we can easily compute the score of the conditional distribution when the clean image $x_0$ is given:
$$\nabla_{y_t}\log q_t(y_t|x_0) = \frac{\mathcal{A}_t(x_0) - y_t}{\sigma_t^2}. \tag{5.9}$$
During training, we have access to clean images $x_0$ and can generate any degraded, noisy image $y_t$ using our SDP formulation $y_t = \mathcal{A}_t(x_0) + z_t$. Thus, we learn an estimator of the conditional score function $s_\theta(y_t, t)$ by minimizing
$$L_t(\theta) = \mathbb{E}_{(x_0, y_t)}\left[ \left\| s_\theta(y_t, t) - \frac{\mathcal{A}_t(x_0) - y_t}{\sigma_t^2} \right\|^2 \right], \tag{5.10}$$
where $(x_0, y_t) \sim q_0(x_0) q_t(y_t|x_0)$. One can show that the well-known result of [175] applies to our SDP formulation, and thus by minimizing the objective in (5.10) we can learn the score $\nabla_{y_t}\log q_t(y_t)$. The technical condition that all conditional distributions $q_t(y_t|x_0)$ are fully supported requires that we can get to any $y_t$ from a given $x_0$ with some non-zero probability, which is achieved by adding Gaussian noise. We include the theorem in the supplementary material.
We parameterize the score network as follows:
$$s_\theta(y_t, t) = \frac{\mathcal{A}_t(\Phi_\theta(y_t, t)) - y_t}{\sigma_t^2}, \tag{5.11}$$
that is, given a noisy and degraded image as input, the model predicts the underlying clean image $x_0$. Other parametrizations are also possible, such as predicting $z_t$ or (equivalently) predicting $\mathcal{A}_t(x_0)$. However, as pointed out in [32], this might lead to learning the image distribution only locally, around degraded images. Furthermore, in order to estimate the incremental reconstruction $R(t, \Delta t; x_0)$, we not only need to estimate $\mathcal{A}_t(x_0)$, but also other functions of $x_0$, and thus estimating $x_0$ directly gives us more flexibility. Rewriting (5.10) with the new parametrization leads to
$$L(\theta) = \mathbb{E}_{t,\, (x_0, y_t)}\left[ w(t) \left\| \mathcal{A}_t(\Phi_\theta(y_t, t)) - \mathcal{A}_t(x_0) \right\|^2 \right], \tag{5.12}$$
where $t \sim U[0, 1]$, $(x_0, y_t) \sim q_0(x_0) q_t(y_t|x_0)$, and typical choices in the diffusion literature for the weights $w(t)$ are $1$ or $1/\sigma_t^2$. Intuitively, the neural network receives a noisy, degraded image, along with the degradation severity, and outputs a prediction $\hat{x}_0(y_t) = \Phi_\theta(y_t, t)$ such that the degraded ground truth $\mathcal{A}_t(x_0)$ and the degraded prediction $\mathcal{A}_t(\hat{x}_0(y_t))$ are consistent.
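A training step under the loss in (5.12) with this x_0-parametrization can be sketched as follows; phi_theta, the degradation A and the sampled severity/noise level are placeholders.

```python
import torch

def dirac_training_loss(phi_theta, A, x0, t, sigma_t, weight=1.0):
    """Monte Carlo estimate of (5.12) for a batch of clean images x0."""
    # Sample y_t = A_t(x0) + z_t from the stochastic degradation process.
    y_t = A(x0, t) + sigma_t * torch.randn_like(x0)
    # Predict the clean image and compare in the degraded domain.
    x0_hat = phi_theta(y_t, t)
    residual = A(x0_hat, t) - A(x0, t)
    per_sample = residual.pow(2).flatten(start_dim=1).sum(dim=1)
    return (weight * per_sample).mean()
```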
5.3.5 Incremental reconstructions
Now that we have an estimator of the score, we still need to approximate $R(t, \Delta t; x_0)$ in order to run the reverse SDE in (5.7). That is, we have to estimate how the degraded image changes if we very slightly decrease the degradation severity. Since we parameterized our score network in (5.11) to learn a representation of the clean image manifold directly, we can estimate the incremental reconstruction term as
$$\hat{R}(t, \Delta t; y_t) = \mathcal{A}_{t-\Delta t}(\Phi_\theta(y_t, t)) - \mathcal{A}_t(\Phi_\theta(y_t, t)). \tag{5.13}$$
One may consider this a look-ahead method, since we use yt with degradation severity t to predict a less
severe degradation of the clean image "ahead" in the reverse process. This becomes more obvious when
we note, that our score network already learns to predict At(x0) given yt due to the training loss in (5.12).
However, even if we learn the true score perfectly via (5.12), there is no guarantee that At−∆t(x0) ≈
At−∆t(Φθ(yt
, t)). The following result provides an upper bound on the approximation error.
Theorem 5.3.4. Let $\hat{R}(t, \Delta t; y_t)$ from (5.13) denote our estimate of the incremental reconstruction, where $\Phi_\theta(y_t, t)$ is trained on the loss in (5.12). Let $R^*(t, \Delta t; y_t) = \mathbb{E}[R(t, \Delta t; x_0)|y_t]$ denote the MMSE estimator of $R(t, \Delta t; x_0)$. Assume that the degradation process is smooth such that $\|A_t(x) - A_t(x')\| \leq L_x^{(t)} \|x - x'\|,\ \forall x, x' \in \mathbb{R}^n$ and $\|A_t(x) - A_{t'}(x)\| \leq L_t |t - t'|,\ \forall t, t' \in [0, 1],\ \forall x \in \mathbb{R}^n$. Further assume that the clean images have bounded entries $x_0[i] \leq B,\ \forall i \in \{1, 2, ..., n\}$, and that the error in our score network is bounded by $\|s_\theta(y_t, t) - \nabla_{y_t} \log q_t(y_t)\| \leq \frac{\epsilon_t}{\sigma_t^2},\ \forall t \in [0, 1]$. Then,
$$\|\hat{R}(t, \Delta t; y_t) - R^*(t, \Delta t; y_t)\| \leq \underbrace{\left(L_x^{(t)} + L_x^{(t-\Delta t)}\right)}_{\text{degr. smoothness}} \underbrace{\sqrt{n}B}_{\text{data}} + 2\underbrace{L_t}_{\text{scheduling}}\underbrace{\Delta t}_{\text{algorithm}} + 2\underbrace{\epsilon_t}_{\text{optimization}}.$$
The first term in the upper bound depends on the smoothness of the degradation with respect to input images, suggesting that smoother degradations are easier to reconstruct accurately. The second term indicates two crucial points: (1) sharp variations in the degradation with respect to time lead to potentially large estimation error and (2) the error can be controlled by choosing a small enough step size in the reverse process. This provides a possible explanation for why masking diffusion models are significantly worse in image generation than models relying on blurring, as observed in [32]. Masking leads to sharp jumps in pixel values at the border of the inpainting mask, thus $L_t$ can be arbitrarily large. This can be compensated to a certain degree by choosing a very small $\Delta t$ (a very large number of sampling steps), which has also been observed in [32]. Scheduling of the degradation over time is a design parameter, and Theorem 5.3.4 suggests that sharp changes with respect to $t$ should be avoided. Finally, the error grows with less accurate score estimation; however, with large enough network capacity, this term can be driven close to 0.
The main contributor to the error in Theorem 5.3.4 is that consistency under less severe degradations, that is $A_{t-\Delta t}(\Phi_\theta(y_t, t)) \approx A_{t-\Delta t}(x_0)$, is not enforced by the loss in (5.12). To this end, we propose a novel loss function, the incremental reconstruction loss, that combines learning to denoise and reconstruct simultaneously:
$$\mathcal{L}_{IR}(\Delta t, \theta) = \mathbb{E}_{t, (x_0, y_t)}\left[ w(t) \left\| A_\tau(\Phi_\theta(y_t, t)) - A_\tau(x_0) \right\|^2 \right], \tag{5.14}$$
where $\tau = \max(t - \Delta t, 0)$, $t \sim \mathcal{U}[0, 1]$, $(x_0, y_t) \sim q_0(x_0) q_t(y_t|x_0)$. It is clear that minimizing this loss directly improves our estimate of the incremental reconstruction in (5.13). We find that if $\Phi_\theta$ has large enough capacity, minimizing the incremental reconstruction loss in (5.14) also implies minimizing (5.12), and thus the true score is learned (denoising is achieved). Furthermore, we show that (5.14) is an upper bound on (5.12). More details are included in the supplementary. By minimizing (5.14), the model learns not only to denoise, but also to perform small, incremental reconstructions of the degraded image such that $A_{t-\Delta t}(\Phi_\theta(y_t, t)) \approx A_{t-\Delta t}(x_0)$. There is, however, a trade-off between incremental reconstruction performance and learning the score: we are optimizing an upper bound on (5.12), and thus the score estimate may be less accurate. We expect the incremental reconstruction loss to work best in scenarios where the degradation changes rapidly with respect to $t$, and hence a network trained to accurately estimate $A_t(x_0)$ from $y_t$ may become inaccurate when predicting $A_{t-\Delta t}(x_0)$ from $y_t$. This hypothesis is further supported by our experiments in Section 5.4.
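To illustrate, a minimal sketch of the incremental reconstruction loss in (5.14), using the same assumed `model`, `A`, and `sigma` interfaces as the earlier training-loss sketch:

```python
import torch

def incremental_reconstruction_loss(model, A, sigma, x0, delta_t=1.0):
    """Sketch of the loss in (5.14): the prediction and the ground truth are
    compared after degradation at the less severe level tau = max(t - delta_t, 0)."""
    batch = x0.shape[0]
    t = torch.rand(batch, device=x0.device)
    std = sigma(t).view(-1, 1, 1, 1)
    y_t = A(x0, t) + std * torch.randn_like(x0)
    x0_hat = model(y_t, t)
    tau = torch.clamp(t - delta_t, min=0.0)     # look-ahead severity
    residual = A(x0_hat, tau) - A(x0, tau)      # consistency at the less severe level
    return residual.flatten(1).pow(2).sum(dim=1).mean()
```

Setting `delta_t=0` recovers the loss in (5.12), while larger values emphasize look-ahead reconstruction, which is helpful for sharply varying degradations such as inpainting (see Section 5.4).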
5.3.6 Data consistency
Data consistency is a crucial requirement on generated images when solving inverse problems. That is, we
want to obtain reconstructions that are consistent with our original measurement under the degradation
model. More formally, we define data consistency as follows in our framework.
Definition (Data consistency). Given a deterministic degradation process $A_t(\cdot)$, two degradation severities $\tau \in [0, 1]$ and $\tau^+ \in [\tau, 1]$, and corresponding degraded images $y_\tau \in \mathbb{R}^n$ and $y_{\tau^+} \in \mathbb{R}^n$, $y_{\tau^+}$ is data consistent with $y_\tau$ under $A_t(\cdot)$ if $\exists x_0 \in \mathcal{X}_0$ such that $A_\tau(x_0) = y_\tau$ and $A_{\tau^+}(x_0) = y_{\tau^+}$, where $\mathcal{X}_0$ denotes the clean image manifold. We use the notation $y_{\tau^+} \overset{d.c.}{\sim} y_\tau$.
Simply put, two degraded images are data consistent if there is a clean image that may explain both under the deterministic degradation process. As our proposed technique is directly trained to reverse a degradation process, enforcement of data consistency is built in without applying additional steps, such as projection. The following theorem guarantees that in the ideal scenario, data consistency with the original measurement is maintained in each iteration of the reconstruction algorithm.
Theorem 5.3.6 (Data consistency over iterations). Assume that we run the updates in (5.7) with $s_\theta(y_t, t) = \nabla_{y_t} \log q_t(y_t),\ \forall t \in [0, 1]$ and $\hat{R}(t, \Delta t; y_t) = R(t, \Delta t; x_0),\ x_0 \in \mathcal{X}_0$. If we start from a noisy degraded observation $\tilde{y} = A_1(x_0) + z_1,\ x_0 \in \mathcal{X}_0,\ z_1 \sim \mathcal{N}(0, \sigma_1^2 I)$ and run the updates in (5.7) for $\tau = 1, 1 - \Delta t, ..., \Delta t, 0$, then
$$\mathbb{E}[\tilde{y}] \overset{d.c.}{\sim} \mathbb{E}[y_\tau], \quad \forall \tau \in \{1, 1 - \Delta t, ..., \Delta t, 0\}. \tag{5.15}$$
Proof is provided in the supplementary. Even though the assumption that we achieve perfect incremental reconstruction is strong, in practice data consistency is indeed maintained without additional guidance
from y˜ during the reverse process, as shown in Section 5.4, hinting at the robustness of the proposed
framework.
5.3.7 Guidance
So far, we have only used our noisy observation $\tilde{y} = A_1(x_0) + z_1$ as a starting point for the reverse diffusion process; however, the measurement is not used directly in the update in (5.7). We learned the score of the prior distribution $\nabla_{y_t} \log q_t(y_t)$, which we can leverage to sample from the posterior distribution $q_t(y_t|\tilde{y})$. In fact, using Bayes' rule the score of the posterior distribution can be written as
$$\nabla_{y_t} \log q_t(y_t|\tilde{y}) = \nabla_{y_t} \log q_t(y_t) + \nabla_{y_t} \log q_t(\tilde{y}|y_t), \tag{5.16}$$
Figure 5.2: Perception-distortion trade-off on CelebA-HQ deblurring: distortion metrics initially improve, peak fairly early in the reverse process, then gradually deteriorate, while perceptual metrics improve. We plot the mean of $n = 30$ trajectories starting from the same initial noisy measurement. The shaded area depicts ±std. We quantify uncertainty via mean pixel-wise standard deviation across different reverse process trajectories. We observe low uncertainty at the distortion peak, with gradual increase during the reverse process. Image regions with fine details correspond to high uncertainty in the final reconstructions.
where we already approximate $\nabla_{y_t} \log q_t(y_t)$ via $s_\theta(y_t, t)$. Finding the posterior distribution analytically is not possible, and therefore we use the approximation $q_t(\tilde{y}|y_t) \approx q_t(\tilde{y}|\Phi_\theta(y_t, t))$, a distribution from which we can easily sample. Since $q_t(\tilde{y}|\Phi_\theta(y_t, t)) \sim \mathcal{N}(A_1(\Phi_\theta(y_t, t)), \sigma_1^2 I)$, our estimate of the posterior score takes the form
$$s'_\theta(y_t, t) = s_\theta(y_t, t) - \eta_t \nabla_{y_t} \frac{\|\tilde{y} - A_1(\Phi_\theta(y_t, t))\|^2}{2\sigma_1^2}, \tag{5.17}$$
where $\eta_t$ is a hyperparameter that tunes how much we rely on the original noisy measurement. Even though we do not need to rely on $\tilde{y}$ after the initial update for our method to work, we observe small improvements from adding the above guidance scheme to our algorithm.
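The guidance term in (5.17) can be computed with automatic differentiation through the degradation operator. A minimal sketch, again under the assumed `model`, `A`, and `sigma` interfaces (the step size `eta_t` and the measurement noise level `sigma1` are inputs):

```python
import torch

def guided_score(model, A, sigma, y_tilde, y_t, t, sigma1, eta_t):
    """Sketch of the guided posterior score in (5.17).
    y_tilde is the original noisy measurement and y_t the current iterate."""
    y_t = y_t.detach().requires_grad_(True)
    x0_hat = model(y_t, t)
    # Unconditional score under the parametrization in (5.11).
    score = ((A(x0_hat, t) - y_t) / sigma(t).view(-1, 1, 1, 1) ** 2).detach()
    # Gradient of the data-fit term with respect to y_t.
    data_fit = (y_tilde - A(x0_hat, torch.ones_like(t))).pow(2).sum() / (2 * sigma1 ** 2)
    grad = torch.autograd.grad(data_fit, y_t)[0]
    return score - eta_t * grad
```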
86
5.3.8 Perception-distortion trade-off
Diffusion models generate synthetic images of exceptional quality, almost indistinguishable from real images to the human eye. This perceptual image quality is typically evaluated on features extracted by a pre-trained neural network, resulting in metrics such as Learned Perceptual Image Patch Similarity (LPIPS) [197] or Fréchet Inception Distance (FID) [62]. In image restoration, however, we are often interested in image distortion metrics that reflect faithfulness to the original image, such as Peak Signal-to-Noise Ratio (PSNR) or the Structural Similarity Index Measure (SSIM), when evaluating the quality of reconstructions. Interestingly, distortion and perceptual quality are fundamentally at odds with each other, as shown in the seminal work of [8]. As diffusion models tend to favor high perceptual quality, this often comes at the detriment of distortion metrics [23].
As shown in Figure 5.2, we empirically observe that in the reverse process of Dirac, the quality of reconstructions with respect to distortion metrics initially improves, peaks fairly early in the reverse process, then gradually deteriorates. Simultaneously, perceptual metrics such as LPIPS demonstrate stable improvement for most of the reverse process. More intuitively, the algorithm first finds a rough reconstruction that is consistent with the measurement but lacks fine details. This reconstruction is optimal with respect to distortion metrics, but visually overly smooth and blurry. Subsequently, image details progressively emerge during the rest of the reverse process, resulting in improving perceptual quality at the cost of deteriorating distortion metrics. Therefore, our method provides an additional layer of flexibility: by early-stopping the reverse process, we can trade off perceptual quality for better distortion metrics. The early-stopping parameter $t_{stop}$ can be tuned on the validation dataset, resulting in distortion- and perception-optimized reconstructions depending on the value of $t_{stop}$.
87
5.3.9 Degradation scheduling
In order to deploy our method, we need to define how the degradation changes with respect to the severity $t$, following the properties specified in Definition 5.3.3. That is, we have to determine how to interpolate between the identity mapping $A_0(x) = x$ for $t = 0$ and the most severe degradation $A_1(\cdot)$ for $t = 1$. Theorem 5.3.4 suggests that sharp changes in the degradation function with respect to $t$ should be avoided; beyond this, we propose a more principled method of scheduling. In particular, we use a greedy algorithm to select a set of degraded distributions such that the maximum distance between them is minimized. We define the distance between distributions as $\mathbb{E}_{x_0 \sim \mathcal{X}_0}[M(A_i(x_0), A_j(x_0))]$, where $M$ is a pairwise image dissimilarity metric. Details on our scheduling algorithm can be found in the Supplementary; a sketch of the greedy selection is given below. An overview of the complete Dirac algorithm is shown in Algorithm 2.
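A minimal sketch of such a greedy selection follows. The exact procedure is in the Supplementary; the candidate grid, the metric `M`, and the splitting rule here are illustrative assumptions.

```python
import numpy as np

def greedy_schedule(candidate_ts, degrade, images, M, num_levels):
    """Greedily pick num_levels severities from candidate_ts so that the largest
    distance between consecutive degraded distributions is reduced.
    degrade(x, t) applies A_t and M is a pairwise image dissimilarity metric."""
    def dist(t_a, t_b):
        # Monte Carlo estimate of E_{x0}[M(A_{t_a}(x0), A_{t_b}(x0))].
        return float(np.mean([M(degrade(x, t_a), degrade(x, t_b)) for x in images]))

    schedule = [min(candidate_ts), max(candidate_ts)]   # always keep t = 0 and t = 1
    while len(schedule) < num_levels:
        schedule.sort()
        # Find the consecutive pair with the largest distance...
        _, a, b = max((dist(a, b), a, b) for a, b in zip(schedule[:-1], schedule[1:]))
        # ...and insert the candidate severity that best splits it.
        midpoints = [t for t in candidate_ts if a < t < b]
        if not midpoints:
            break
        schedule.append(min(midpoints, key=lambda t: max(dist(a, t), dist(t, b))))
    return sorted(schedule)
```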
Algorithm 2 Dirac
Input: $\tilde{y}$: noisy observation, $\Phi_\theta$: score network, $A_t(\cdot)$: degradation function, $\Delta t$: step size, $\sigma_t$: noise std at time $t$, $\eta_t$: guidance step size, $\forall t \in [0, 1]$, $t_{stop}$: early-stopping parameter
$N \leftarrow \lfloor 1/\Delta t \rfloor$
$y \leftarrow \tilde{y}$
for $i = 1$ to $N$ do
    $t \leftarrow 1 - \Delta t \cdot i$
    if $t \leq t_{stop}$ then break ▷ Early-stopping
    $z \sim \mathcal{N}(0, \sigma_t^2 I)$
    $\hat{x}_0 \leftarrow \Phi_\theta(y, t)$ ▷ Predict posterior mean
    $y_r \leftarrow A_{t-\Delta t}(\hat{x}_0) - A_t(\hat{x}_0)$ ▷ Incremental reconstruction
    $y_d \leftarrow -\frac{\sigma_{t-\Delta t}^2 - \sigma_t^2}{\sigma_t^2}\,(A_t(\hat{x}_0) - y)$ ▷ Denoising
    $y_g \leftarrow -(\sigma_{t-\Delta t}^2 - \sigma_t^2)\,\nabla_y \|\tilde{y} - A_1(\hat{x}_0)\|^2$ ▷ Guidance
    $y \leftarrow y + y_r + y_d + \eta_t y_g + \sqrt{\sigma_t^2 - \sigma_{t-\Delta t}^2}\, z$
Output: $y$ ▷ Alternatively, output $\hat{x}_0$
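For concreteness, below is a minimal PyTorch-style transcription of the sampling loop in Algorithm 2, using the same assumed `model`, `A`, and `sigma` interfaces as the training sketches; it is an illustration rather than the exact implementation.

```python
import torch

def dirac_sample(model, A, sigma, y_tilde, delta_t=0.02, t_stop=0.0, eta=1.0):
    """Sketch of the Dirac reverse process (Algorithm 2)."""
    y, x0_hat = y_tilde.clone(), y_tilde.clone()
    N = int(1.0 / delta_t)
    for i in range(1, N + 1):
        t = 1.0 - delta_t * i
        if t <= t_stop:                                    # early-stopping
            break
        t_vec = torch.full((y.shape[0],), t, device=y.device)
        t_prev = torch.clamp(t_vec - delta_t, min=0.0)
        var_t = sigma(t_vec).view(-1, 1, 1, 1) ** 2
        var_prev = sigma(t_prev).view(-1, 1, 1, 1) ** 2
        x0_hat = model(y, t_vec).detach()                  # predict posterior mean
        y_r = A(x0_hat, t_prev) - A(x0_hat, t_vec)         # incremental reconstruction
        y_d = -(var_prev - var_t) / var_t * (A(x0_hat, t_vec) - y)   # denoising
        # Guidance: gradient of the data-fit term w.r.t. y.
        with torch.enable_grad():
            y_in = y.detach().requires_grad_(True)
            fit = (y_tilde - A(model(y_in, t_vec), torch.ones_like(t_vec))).pow(2).sum()
            y_g = -(var_prev - var_t) * torch.autograd.grad(fit, y_in)[0]
        z = sigma(t_vec).view(-1, 1, 1, 1) * torch.randn_like(y)   # noise drawn per Algorithm 2
        y = (y + y_r + y_d + eta * y_g
             + torch.sqrt(torch.clamp(var_t - var_prev, min=0.0)) * z).detach()
    return y, x0_hat
```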
5.4 Experiments
Experimental setup – We evaluate our method on CelebA-HQ (256 × 256) [79] and ImageNet (256 ×
256) [38]. For competing methods that require a score model, we use pre-trained SDE-VP models. For
Figure 5.3: Left: Data consistency curves for FFHQ inpainting. $\epsilon_{dc} := \|\tilde{y} - A_1(\hat{x}_0(y_t))\|^2$ measures how consistent the clean image estimate is with the original noisy measurement. We expect $\epsilon_{dc}$ to approach the noise floor $\sigma_1^2 = 0.0025$ in case of perfect data consistency. We plot $\bar{\epsilon}_{dc}$, the mean over the validation set. Dirac maintains data consistency throughout the reverse process. Center: Data consistency is not always achieved with DPS. Right: Number of reverse diffusion steps vs. perceptual quality.
Dirac, we train models from scratch using the NCSN++ [157] architecture. As the pre-trained score models for competing methods have been trained on the full CelebA-HQ dataset, for a fair comparison we test all methods on the first 1k images of the FFHQ [81] dataset. For ImageNet experiments, we sample 1 image from each class from the official validation split to create disjoint validation and test sets of 1k images each. We only train our model on the train split of ImageNet.
We investigate two degradation processes with very different properties: Gaussian blur and inpainting, both with additive Gaussian noise. In all cases, noise with $\sigma_1 = 0.05$ is added to the measurements in the $[0, 1]$ range. We use standard geometric noise scheduling with $\sigma_{max} = 0.05$ and $\sigma_{min} = 0.01$ in the SDP. For Gaussian blur, we use a kernel size of 61 with a standard deviation of $w_{max} = 3$. We vary the standard deviation of the kernel between $w_{max}$ (strongest) and $0.3$ (weakest) to parameterize the severity of Gaussian blur in the degradation process, and use the scheduling method described in the supplementary to specify $A_t$. For inpainting, we generate a smooth mask of the form $\left(1 - \frac{f(x; w_t)}{\max_x f(x; w_t)}\right)^k$, where $f(x; w_t)$ denotes the density of a zero-mean isotropic Gaussian with standard deviation $w_t$ that controls the size of the mask, and $k = 4$ for a sharper transition. We set $w_1 = 50$ for CelebA-HQ/FFHQ inpainting and $30$ for ImageNet inpainting.
FFHQ                  |        Deblurring                |        Inpainting
Method                | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)  | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)
Dirac-PO (ours)       | 26.67   0.7418  0.2716   53.36   | 25.41   0.7595  0.2611   39.43
Dirac-DO (ours)       | 28.47   0.8054  0.2972   69.15   | 26.98   0.8435  0.2234   51.87
DPS [23]              | 25.56   0.6878  0.3008   65.68   | 21.06   0.7238  0.2899   57.92
DDRM [82]             | 27.21   0.7671  0.2849   65.84   | 25.62   0.8132  0.2313   54.37
PnP-ADMM [16]         | 27.02   0.7596  0.3973   74.17   | 12.27   0.6205  0.4471   192.36
ADMM-TV               | 26.03   0.7323  0.4126   89.93   | 11.73   0.5618  0.5042   264.62

ImageNet              |        Deblurring                |        Inpainting
Method                | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)  | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)
Dirac-PO (ours)       | 24.68   0.6582  0.3302   53.91   | 25.48   0.8077  0.2185   40.46
Dirac-DO (ours)       | 25.76   0.7085  0.3705   83.23   | 28.42   0.8906  0.1760   40.73
DPS [23]              | 21.51   0.5163  0.4235   52.60   | 22.71   0.8026  0.1986   34.55
DDRM [82]             | 24.53   0.6676  0.3917   61.06   | 25.92   0.8347  0.2138   33.71
PnP-ADMM [16]         | 25.02   0.6722  0.4565   98.72   | 18.14   0.7901  0.2709   101.25
ADMM-TV               | 24.31   0.6441  0.4578   88.26   | 17.60   0.7229  0.3157   120.22

Table 5.1: Experimental results on the FFHQ (top) and ImageNet (bottom) test splits.
Figure 5.4: Visual comparison of reconstructions on images from FFHQ (top 2 rows) and ImageNet (bottom
2 rows) on the Gaussian deblurring task.
Figure 5.5: Visual comparison of reconstructions on images from FFHQ (top 2 rows) and ImageNet (bottom
2 rows) on the inpainting task with Gaussian masks.
We compare our method against DDRM [82], a well-established diffusion-based linear inverse problem
solver; DPS [23], a recent, state-of-the-art diffusion technique for noisy inverse problems; PnP-ADMM [16],
a reliable traditional solver with learned denoiser; and ADMM-TV, a classical optimization technique. To
evaluate performance, we use PSNR and SSIM as distortion metrics and LPIPS and FID as perceptual quality
metrics.
Deblurring – We train our model on $\mathcal{L}_{IR}(\Delta t = 0, \theta)$, as we observed no significant difference when using other incremental reconstruction losses, due to the smoothness of the degradation. We show results for our perception-optimized (PO) reconstructions, tuned for best LPIPS, and our distortion-optimized (DO) reconstructions, tuned for best PSNR on a separate validation set via early-stopping at the PSNR peak (see Fig. 5.2). Our results, summarized in Table 5.1 (left side), demonstrate superior performance compared with other benchmark methods in terms of both distortion and perceptual metrics. Visual comparison in Figure 5.4 reveals that DDRM produces reliable reconstructions, similar to our DO images, but these reconstructions tend to lack detail. On the other hand, DPS produces detailed images, similar to our PO reconstructions, but often with hallucinated details inconsistent with the measurement.
Inpainting – We train our model on $\mathcal{L}_{IR}(\Delta t = 1, \theta)$, as we see improvement in reconstruction quality as $\Delta t$ is increased. We hypothesize that this is due to sharp changes in the inpainting operator with respect to $t$, which can be mitigated by the incremental reconstruction loss according to Theorem 5.3.4. We tuned models to optimize FID, as it is more suitable than pairwise image metrics to evaluate generated image content. Our results in Table 5.1 (right side) show the best performance in most metrics, followed by DDRM. Fig. 5.5 shows that our method generates high quality images even when limited context is available. Ablations on the effect of $\Delta t$ in the incremental reconstruction loss can be found in the supplementary.
Data consistency – Consistency between reconstructions and the original measurement is a crucial requirement in inverse problem solving. Our proposed method has the additional benefit of maintaining data consistency throughout the reverse process, as shown in Theorem 5.3.6 for the ideal case; here we empirically validate this claim as well. Figure 5.3 (left) shows the evolution of $\epsilon_{dc} := \|\tilde{y} - A_1(\hat{x}_0(y_t))\|^2$, where $\hat{x}_0(y_t)$ is the clean image estimate at time $t$ ($\Phi_\theta(y_t, t)$ for our method). Since $\tilde{y} = A_1(x_0) + z_1$ with $z_1 \sim \mathcal{N}(0, \sigma_1^2 I)$, we expect $\epsilon_{dc}$ to approach $\sigma_1^2$ in case of perfect data consistency. We observe that our method, without applying guidance, stays close to the noise floor throughout the reverse process, while other techniques approach data consistency only close to $t = 1$. In the case of DPS, we observe that data consistency is not always satisfied (see Figure 5.3, center), as DPS only guides the iterates towards data consistency, but does not directly enforce it. As our technique reverses an SDP, our intermediate reconstructions are always interpretable as degradations of varying severity of the same underlying image. This property allows us to early-stop the reconstruction and still obtain consistent reconstructions.
Sampling speed – Dirac requires a low number of reverse diffusion steps for high quality reconstructions, leading to fast sampling. Figure 5.3 (right) compares the perceptual reconstruction quality at different numbers of reverse diffusion steps for diffusion-based inverse problem solvers. Our method typically requires $20 - 100$ steps for optimal perceptual quality, and shows the most favorable scaling in the low-NFE regime. Due to early-stopping we can trade off perceptual quality for better distortion metrics and even further sampling speed-up. We obtain acceptable results even with one-shot reconstruction.
5.5 Conclusions and limitations
In this paper, we propose a novel framework for solving inverse problems based on reversing a stochastic degradation process. Our solver can flexibly trade off perceptual image quality for more traditional
distortion metrics and sampling speedup. Moreover, we show both theoretically and empirically that our
method maintains consistency with the measurement throughout the reconstruction process. Our method
produces reconstructions of exceptional quality in terms of both perceptual and distortion-based metrics,
surpassing comparable state-of-the-art methods on multiple high-resolution datasets and image restoration tasks. The main limitation of our method is that a model needs to be trained from scratch for each
inverse problem, whereas other diffusion-based solvers can leverage standard score networks trained via
denoising score matching. Incorporating pre-trained score models into our framework for improved speed
and reconstruction quality is an interesting direction for future work.
Chapter 6
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent
Diffusion Models
In this chapter, we highlight a key shortcoming of existing reconstruction techniques: they are unable
to dynamically adapt their compute to the corruption level of the sample, resulting in wasteful resource
allocation. We introduce an efficient algorithm that automatically scales its compute in test time to the
difficulty of the reconstruction problem on a sample-by-sample basis, greatly reducing the average cost of
inference while maintaining exceptional reconstruction quality.
This chapter is based on the following work:
• Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. "Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models." arXiv preprint arXiv:2309.06642 (2023).
6.1 Introduction
Inverse problems arise in a multitude of computer vision [92, 91, 180], biomedical imaging [3, 160] and
scientific [57, 47] applications, where the goal is to recover a clean signal from noisy and degraded observations. As information is fundamentally lost in the process, structural assumptions on the clean signal are
needed to make recovery possible. Traditional compressed sensing [12, 44] approaches utilize explicit regularizers that encourage sparsity of the reconstruction in transformation domains such as wavelets. More
Figure 6.1: Overview of Flash-Diffusion: we estimate the degradation severity of corrupted images in the latent space of an autoencoder. We leverage the severity predictions to find the optimal start time in a latent reverse diffusion process on a sample-by-sample basis. As a result, inference cost is automatically scaled by the difficulty of the reconstruction task at test time.
recently, data-driven supervised and unsupervised deep learning methods have established a new state of the art in tackling most inverse problems (see an overview in [121]).
A key shortcoming of available techniques is their inherent inability to adapt their compute power allocation to the difficulty of reconstructing a given corrupted sample. There is a natural sample-by-sample
variation in the difficulty of recovery due to multiple factors. First, variations in the measurement process
(e.g. more or less additive noise, different blur kernels) greatly impact the difficulty of reconstruction.
Second, a sample can be inherently difficult to reconstruct for the particular model, if it is different from
examples seen in the training set (out-of-distribution samples). Third, the amount of information loss due
to the interaction between the specific sample and the applied degradation can vary vastly. For instance,
applying a blur kernel to an image consisting of high-frequency textures destroys significantly more information than applying the same kernel to a smooth image. Finally, the implicit bias of the model architecture
towards certain signal classes (e.g. piece-wise constant or smooth for convolutional architectures) can be
a key factor in determining the difficulty of a recovery task. Therefore, expending the same amount of
compute to reconstruct all examples is potentially wasteful, especially on datasets with varied corruption
parameters.
Sample-adaptive methods that incorporate the difficulty of a reconstruction problem, or the severity of
degradation, and allocate compute effort accordingly are thus highly desired. To the best of our knowledge
however, such methods have not been studied extensively in the literature. So called unrolled networks
[194, 161] have been proposed for reconstruction problems, that map the iterations of popular optimization algorithms to learnable submodules, where deeper networks can be used to tackle more challenging
reconstruction tasks. However, network size is determined in training time and therefore these methods
are unable to adapt on a sample-by-sample basis. Deep Equilibrium Models [5] have been proposed to solve
inverse problems [49] by training networks of arbitrary depth through the construction of fixed-point iterations. These methods can adapt their compute in test time by scaling the number of fixed-point iterations
to convergence, however it is unclear how the optimal number of iterations correlates with degradation
severity.
Diffusion models have established new state-of-the-art performance in synthesizing data of various
modalities [40, 116, 131, 135, 141, 65, 140, 66, 89], inverse problem solving and image restoration [78, 142,
156, 26, 25, 23, 24, 84, 82, 83, 46]. Diffusion-based sampling techniques generate the missing information
destroyed by the corruption step-by-step through a diffusion process that transforms pure noise into a target distribution. Recent work [25] has shown that the sampling trajectory can be significantly shortened
by starting the reverse diffusion process from a good initial reconstruction, instead of pure noise. However,
this approach treats the noise level of the starting manifold as a hyperparameter independent of degradation severity. Therefore, even though sampling is accelerated, the same number of function evaluations
are required to reconstruct any sample. In [46] an early stopping technique is proposed for diffusion-based
reconstruction, however it is unclear how to determine the stopping time on a sample-by-sample basis.
More recently, latent domain diffusion, that is a diffusion process defined in the low-dimensional latent
space of a pre-trained autoencoder, has demonstrated great success in image synthesis [135] and has been
successfully applied to solving linear inverse problems [139] and in high-resolution image restoration
[103]. Latent diffusion has the clear benefit of improved efficiency due to the reduced dimensionality
of the problem leading to faster sampling. In addition to this, the latent space consists of compressed
representations of relevant information in data and thus provides a natural space to quantify the loss of
information due to image corruptions, which strongly correlates with the difficulty of the reconstruction
task.
In this paper, we propose a novel reconstruction framework (Figure 6.1), where the cost of inference is
automatically scaled based on the difficulty of the reconstruction task on a sample-by-sample basis. Our
contributions are as follows:
• We propose a novel method that we call severity encoding, to estimate the degradation severity of
noisy, degraded images in the latent space of an autoencoder. We show that the estimated severity
has strong correlation with the true corruption level and can give useful hints at the difficulty of
reconstruction problems on a sample-by-sample basis. Training the severity encoder is efficient, as
it can be done by fine-tuning a pre-trained encoder.
• We propose a reconstruction method based on latent diffusion models that leverages the predicted
degradation severities to fine-tune the reverse diffusion sampling trajectory and thus achieve
sample-adaptive inference times. Furthermore, we utilize latent diffusion posterior sampling to
maintain data consistency with the observations. Our framework can take advantage of pre-trained
latent diffusion models out of the box, reducing compute requirements compared to other image-domain diffusion solvers. We call our method Flash-Diffusion: Fast Latent Sample-Adaptive Reconstruction ScHeme.
• We perform numerical experiments on both linear and nonlinear inverse problems and demonstrate
that the proposed technique achieves performance comparable to state-of-the-art diffusion-based
techniques, while significantly reducing the computational cost.
6.2 Background
Diffusion models – Diffusion in the context of generative modeling consists of transforming a clean data distribution $x_0 \sim q_0(x)$ through a forward noising process, defined over $0 \leq t \leq T,\ t \in \mathbb{R}$, into some tractable distribution $q_T$. Typically, $q_t$ is chosen such that $x_t$ is obtained from $x_0$ by adding i.i.d. Gaussian noise, that is $q_t(x_t|x_0) \sim \mathcal{N}(x_0, \sigma_t^2 I)$, where $\sigma_t^2$ follows a known variance schedule. Diffusion models (DMs) [149, 63, 153, 155] learn to reverse the forward corruption process in order to generate data samples starting from a simple Gaussian distribution. The forward process can be described as an Itô stochastic differential equation (SDE) [157]
$$dx = f(x, t)dt + g(t)dw,$$
where $f$ and $g$ are also called the drift and diffusion coefficients, and $w \in \mathbb{R}^n$ is the standard Wiener process. The forward SDE is reversible [2] and the reverse SDE can be written as
$$dx = [f(x, t) - g(t)^2 \nabla_x \log q_t(x)]dt + g(t)d\bar{w}, \tag{6.1}$$
where $\bar{w}$ is the standard Wiener process with time flowing in the negative direction and $\nabla_x \log q_t(x)$ is referred to as the score of the data distribution. The score is approximated by a neural network $s_\theta(x_t, t)$ trained such that $s_\theta(x, t) \approx \nabla_x \log q_t(x)$. Then, $s_\theta(x_t, t)$ can be used to simulate the reverse SDE in (6.1), from which a variety of discrete time sampling algorithms can be derived. The continuous time interval $t \in [0, T]$ is typically discretized uniformly into $N$ time steps.
Denoising Diffusion Probabilistic Models (DDPMs) [149, 63] are obtained from the discretization of the variance preserving SDE with $f(x, t) = -\frac{1}{2}\beta_t x$ and $g(t) = \sqrt{\beta_t}$, where $\beta_t$ is a pre-defined variance schedule that is a strictly increasing function of $t$. One can sample from the corresponding forward diffusion process at any time step $i$ as $x_i = \sqrt{\bar{\alpha}_i}\, x_0 + \sqrt{1 - \bar{\alpha}_i}\, \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_i = \prod_{j=1}^i \alpha_j,\ \alpha_i = 1 - \beta_i$, which can be interpreted as a mixing of the clean signal with prescribed re-weighting factors. By minimizing the denoising score-matching objective
$$\mathcal{L}_{DM} = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I),\, i \sim \mathcal{U}[1, N],\, x_i \sim q_0(x_0)q_i(x_i|x_0)}\left[ \|\varepsilon_\theta(x_i, i) - \varepsilon\|^2 \right]$$
the (rescaled) score model $\varepsilon_\theta(x_i, i) = -\sqrt{1 - \bar{\alpha}_i}\, s_\theta(x_i, i)$ learns to predict the noise on the input corrupted signal. The associated reverse diffusion step derived from (6.1) takes the form
$$x_{i-1} = \frac{1}{\sqrt{\alpha_i}}\left(x_i + (1 - \alpha_i)s_\theta(x_i, i)\right) + \sqrt{1 - \alpha_i}\, \varepsilon,$$
which is iterated from $i = N$ to $i = 1$ to draw a sample from the data distribution, starting from $x_N \sim \mathcal{N}(0, I)$. Another line of work [153, 155] utilizes score matching with Langevin dynamics (SMLD), which can be obtained from the discretization of variance exploding SDEs with $f(x, t) = 0$ and $g(t) = \sqrt{\frac{d[\sigma_t]^2}{dt}}$, where $\sigma_t$ is a pre-defined strictly increasing function of $t$.
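To make the DDPM recursions above concrete, here is a brief sketch of forward sampling and a single reverse step; the noise-prediction network `eps_model` and the schedules `alpha`, `alpha_bar` (1-indexed tensors) are placeholders.

```python
import torch

def ddpm_forward_sample(x0, i, alpha_bar):
    """Sample x_i ~ N(sqrt(abar_i) x0, (1 - abar_i) I)."""
    eps = torch.randn_like(x0)
    return alpha_bar[i].sqrt() * x0 + (1 - alpha_bar[i]).sqrt() * eps

def ddpm_reverse_step(x_i, i, eps_model, alpha, alpha_bar):
    """One reverse update, with the score recovered from the rescaled model
    eps_theta = -sqrt(1 - abar_i) * s_theta."""
    score = -eps_model(x_i, i) / (1 - alpha_bar[i]).sqrt()
    mean = (x_i + (1 - alpha[i]) * score) / alpha[i].sqrt()
    noise = torch.randn_like(x_i) if i > 1 else torch.zeros_like(x_i)
    return mean + (1 - alpha[i]).sqrt() * noise
```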
Latent Diffusion Models – Latent Diffusion Models (LDMs) [135] aim to mitigate the computational burden of traditional diffusion models by running diffusion in a low-dimensional latent space of a pre-trained autoencoder. In particular, an encoder $E_0$ is trained to extract a compressed representation $z \in \mathbb{R}^d,\ d \ll n$, of the input signal $x$ in the form $z = E_0(x)$. In order to recover the clean signal from the latent representation $z$, a decoder $D_0$ is trained such that $D_0(E_0(x)) \approx x$. A score model that progressively denoises $z$ can be trained in the latent space of the autoencoder via the objective
$$\mathcal{L}_{LDM} = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I),\, i \sim \mathcal{U}[1, N],\, z_i \sim \tilde{q}_0(z_0)\tilde{q}_i(z_i|z_0)}\left[ \|\varepsilon_\theta(z_i, i) - \varepsilon\|^2 \right],$$
where $z_0 = E_0(x_0),\ x_0 \sim q_0(x_0)$ and $\tilde{q}_i(z_i|z_0) \sim \mathcal{N}(\sqrt{\bar{\alpha}_i} z_0, (1 - \bar{\alpha}_i)I)$ following the DDPM framework. The final generated image can be obtained by passing the denoised latent through $D_0$.
Diffusion models for solving inverse problems – Solving a general noisy inverse problem amounts to finding the clean signal $x \in \mathbb{R}^n$ from a noisy and degraded observation $y \in \mathbb{R}^m$ in the form
$$y = A(x) + n, \tag{6.2}$$
where $A: \mathbb{R}^n \to \mathbb{R}^m$ denotes a deterministic degradation (such as blurring or inpainting) and $n \sim \mathcal{N}(0, \sigma_y^2 I)$ is i.i.d. additive Gaussian noise. As information is fundamentally lost in the measurement process, structural assumptions on clean signals are necessary to recover $x$. Deep learning approaches extract a representation of clean signals from training data either by learning to directly map $y$ to $x$ or by learning a generative model $p_\theta(x)$ that represents the underlying structure of clean data and can be leveraged as a prior to solve (6.2). In particular, the posterior over clean data can be written as $p_\theta(x|y) \propto p_\theta(x)p(y|x)$, where the likelihood term $p(y|x)$ is specified by (6.2). Thus one can sample from the posterior distribution by querying the generative model. The score of the posterior log-probabilities can be written as
$$\nabla_x \log p_\theta(x|y) = \nabla_x \log p_\theta(x) + \nabla_x \log p(y|x),$$
where the first term corresponds to an unconditional score model trained to predict noise on the signal without any information about the forward model $A$. The score of the likelihood term, however, is challenging to estimate in general. Various approaches have emerged to incorporate the data acquisition model into an unconditional diffusion process, including projection-based approaches [156, 26, 25], restricting updates to stay on a given manifold [23, 24], spectral approaches [82], or methods that tailor the diffusion process to the degradation [184, 46, 37].
A key challenge of diffusion-based solvers is their heavy compute demand, as reconstructing a single sample typically requires $100 - 1000$ evaluations of a large score model. Come-Closer-Diffuse-Faster (CCDF) [25], a recently proposed solver, shortens the sampling trajectory by leveraging a good initial posterior mean estimate $\hat{x}_0$ from a reconstruction network. It initializes the reverse process by jumping to a fixed time step in the forward process via $x_k = \sqrt{\bar{\alpha}_k}\,\hat{x}_0 + \sqrt{1 - \bar{\alpha}_k}\,\varepsilon$, and only performs $k \ll N$ reverse diffusion steps, where $k$ is a fixed hyperparameter.
6.3 Method
6.3.1 Severity encoding
The goal of inverse problem solving is to recover the clean signal x from a corrupted observation y (see
(6.2)). The degradation A and additive noise n fundamentally destroy information in x. The amount
of information loss, or the severity of the degradation, strongly depends on the interaction between the
signal structure and the specific degradation. For instance, blurring removes high-frequency information,
which implies that applying a blur kernel to an image with abundant high-frequency detail (textures, hair,
background clutter etc.) results in a more severe degradation compared to applying the same kernel to a
smooth image with few details. In other words, the difficulty of recovering the clean signal does not solely
depend on the degradation process itself, but also on the specific signal the degradation is applied to. Thus,
tuning the reconstruction method’s capacity purely based on the forward model misses a key component
of the problem: the data itself.
Quantifying the severity of a degradation is a challenging task in the image domain. As an example, consider the forward model $y = cx,\ c \in \mathbb{R}^+$, that simply rescales the clean signal. Recovery of $x$ from $y$ is trivial; however, image similarity metrics such as PSNR or NMSE that are based on the Euclidean distance in image domain may indicate arbitrarily large discrepancy between the degraded and clean signals. On the other hand, consider $y = x + n,\ n \sim \mathcal{N}(0, \sigma^2 I)$, where the clean signal is simply perturbed by some additive random noise. Even though the image domain perturbation is (potentially) small, information is fundamentally lost and perfect reconstruction is no longer possible.
What is often referred to as the manifold hypothesis [7] states that natural images live in a lower
dimensional manifold embedded in n-dimensional pixel-space. This in turn implies that the information
contained in an image can be represented by a low-dimensional latent vector that encapsulates the relevant
features of the image. Autoencoders [86, 133] learn a latent representation from data by first summarizing
the input image into a compressed latent vector z = E0(x) through an encoder. Then, the original image
can be recovered from the latent via the decoder xˆ = D0(z) such that x ≈ xˆ. As the latent space of
autoencoders contains only the relevant information of data, it is a more natural space to quantify the loss
of information due to the degradation than the image domain.
In particular, assume that we have access to the latent representation of clean images $z_0 = E_0(x_0),\ z_0 \in \mathbb{R}^d$, for instance from a pre-trained autoencoder. We propose a severity encoder $\hat{E}_\theta$ that achieves two objectives simultaneously: (1) it can predict the latent representation of a clean image, given a noisy and degraded observation, and (2) it can quantify the error in its own latent estimation. We denote $\hat{E}_\theta(y) = (\hat{z}, \hat{\sigma})$ with $\hat{z} \in \mathbb{R}^d$ the estimate of $z_0$ and $\hat{\sigma} \in \mathbb{R}$ the estimated degradation severity, to be specified shortly. We use the notation $\hat{E}_z(y) = \hat{z}$ and $\hat{E}_\sigma(y) = \hat{\sigma}$ for the two conceptual components of our model; however, in practice a single architecture is used to represent $\hat{E}_\theta$. The first objective can be interpreted as image reconstruction in the latent space of the autoencoder: for $y = A(x) + n$ and $z_0 = E_0(x)$, we have $\hat{E}_z(y) = \hat{z} \approx z_0$. The second objective captures the intuition that recovering $z_0$ from $y$ exactly may not be possible, and the prediction error is proportional to the loss of information about $x$ due to the corruption. Thus, even though the predicted latent $\hat{z}$ might be far from the true $z_0$, the encoder quantifies the uncertainty in its own prediction. More specifically, we make the assumption that the prediction error in latent space can be modeled as zero-mean i.i.d. Gaussian. That is, $e(y) = \hat{z} - z_0 \sim \mathcal{N}(0, \sigma_*^2(y)I)$, and we interpret the variance of the prediction error $\sigma_*^2$ as the measure of degradation severity. We optimize the joint objective
$$\mathbb{E}_{x_0 \sim q_0(x_0),\, y \sim \mathcal{N}(A(x_0), \sigma_y^2 I)}\left[ \left\| z_0 - \hat{E}_z(y) \right\|^2 + \lambda_\sigma \left( \bar{\sigma}^2(y, z_0) - \hat{E}_\sigma(y) \right)^2 \right] := \mathcal{L}_{lat.rec.} + \lambda_\sigma \mathcal{L}_{err.}, \tag{6.3}$$
with $z_0 = E_0(x_0)$ for a fixed, pre-trained encoder $E_0$, and $\bar{\sigma}^2(y, z_0) = \frac{1}{d-1}\sum_{i=1}^d \left( e^{(i)} - \frac{1}{d}\sum_{j=1}^d e^{(j)} \right)^2$ is the sample variance of the prediction error estimating $\sigma_*^2$. Here $\lambda_\sigma > 0$ is a hyperparameter that balances between reconstruction accuracy ($\mathcal{L}_{lat.rec.}$) and error prediction performance ($\mathcal{L}_{err.}$). We empirically observe that even small loss values of $\mathcal{L}_{lat.rec.}$ (that is, fairly good latent reconstruction) may correspond to visible reconstruction error in the image domain, as semantically less meaningful features in image domain are often not captured in the latent representation. Therefore, we utilize an extra loss term that imposes image domain consistency with the ground truth image in the form
$$\mathcal{L}_{im.rec.} = \mathbb{E}_{x_0 \sim q_0(x_0),\, y \sim \mathcal{N}(A(x_0), \sigma_y^2 I)}\left[ \left\| x_0 - D_0(\hat{E}_z(y)) \right\|^2 \right],$$
resulting in the final combined loss
$$\mathcal{L}_{sev} = \mathcal{L}_{lat.rec.} + \lambda_\sigma \mathcal{L}_{err.} + \lambda_{im.}\mathcal{L}_{im.rec.},$$
with hyperparameter $\lambda_{im.} \geq 0$. We note that training the severity encoder is fast, as one can fine-tune the pre-trained encoder $E_0$.
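For illustration, a minimal sketch of the combined severity-encoding loss $\mathcal{L}_{sev}$, assuming hypothetical `severity_encoder`, frozen `encoder0`, and `decoder` modules (all names are placeholders):

```python
import torch

def severity_encoding_loss(severity_encoder, encoder0, decoder, x0, y,
                           lambda_sigma=1.0, lambda_im=1.0):
    """Sketch of L_sev = L_lat.rec. + lambda_sigma * L_err. + lambda_im * L_im.rec.
    severity_encoder(y) -> (z_hat, sigma_hat); encoder0/decoder form the frozen autoencoder."""
    with torch.no_grad():
        z0 = encoder0(x0)                                  # target latent of the clean image
    z_hat, sigma_hat = severity_encoder(y)
    e = z_hat - z0                                         # latent prediction error
    # Sample variance of the error entries, used as the severity target.
    sigma_bar2 = e.flatten(1).var(dim=1, unbiased=True)
    loss_lat = e.flatten(1).pow(2).sum(dim=1).mean()       # latent reconstruction
    loss_err = (sigma_bar2 - sigma_hat).pow(2).mean()      # severity (error) prediction
    loss_im = (x0 - decoder(z_hat)).flatten(1).pow(2).sum(dim=1).mean()  # image consistency
    return loss_lat + lambda_sigma * loss_err + lambda_im * loss_im
```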
6.3.2 Sample-adaptive Inference
Diffusion-based inverse problem solvers synthesize missing data that has been destroyed by the degradation process through diffusion. As shown in Figure 6.2, depending on the amount of missing information
Figure 6.2: The optimal number of reverse diffusion steps varies depending on the severity of degradations. Fixing the number of steps results in over-diffusing some samples, whereas others could benefit from more iterations.
Figure 6.3: Effect of degradation on predicted severities: given a ground truth image corrupted by a varying amount of blur, $\hat{\sigma}$ is an increasing function of the blur amount.
(easy vs. hard samples), the optimal number of diffusion steps may greatly vary. Too few steps may not
allow the diffusion process to generate realistic details on the image, leaving the reconstruction overly
smooth. On the other hand, diffusion-based solvers are known to hallucinate details that may be inconsistent with the ground truth signal, or even become unstable when too many reverse diffusion steps are
applied. The authors of [25] observe that there always exists an optimal spot between $0$ and $N$ diffusion steps
that achieves the best reconstruction performance. We aim to automatically find this "sweet spot" on a
sample-by-sample basis.
Our proposed severity encoder learns to map degraded signals to a noisy latent representation, where
the noise level is proportional to the degradation severity. This provides us with the unique opportunity to
leverage a latent diffusion process to progressively denoise the latent estimate from our encoder. Even
more importantly, we can automatically scale the number of reverse diffusion steps required to reach the
clean latent manifold based on the predicted degradation severity.
Finding the optimal starting time – We find the time index $i_{start}$ in the latent diffusion process at which the signal-to-noise ratio (SNR) matches the SNR predicted by the severity encoder. Assume that the latent diffusion process is specified by the conditional distribution $q_i(z_i|z_0) \sim \mathcal{N}(a_i z_0, b_i^2 I)$, where $a_i$ and $b_i$ are determined by the specific sampling method (e.g. $a_i = \sqrt{\bar{\alpha}_i}$ and $b_i^2 = 1 - \bar{\alpha}_i$ for DDPM). On the other hand, we have the noisy latent estimate $\hat{z} \sim \mathcal{N}(z_0, \sigma_*^2(y)I)$, where we estimate $\sigma_*^2$ by $\hat{E}_\sigma(y)$. Then, SNR matching gives us the starting time index
$$i_{start}(y) = \arg\min_{i \in [1, 2, .., N]} \left| \frac{a_i^2}{b_i^2} - \frac{1}{\hat{E}_\sigma(y)} \right|.$$
Thus, we start reverse diffusion from the initial reconstruction $\hat{z}$ provided by the severity encoder and progressively denoise it using a pre-trained unconditional score model, where the length of the sampling trajectory is directly determined by the predicted severity of the degraded example.
Noise correction – Even though we assume that the prediction error in latent space is i.i.d. Gaussian
in order to quantify the estimation error by a single scalar, in practice the error often has some structure.
This can pose a challenge for the score model, as it has been trained to remove isotropic Gaussian noise.
We observe that it is beneficial to mix zˆ with some i.i.d. correction noise in order to suppress structure in
the prediction error. In particular, we initialize the reverse process by
$$z_{start} = \sqrt{1 - c\hat{\sigma}^2}\, \hat{z} + \sqrt{c\hat{\sigma}^2}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),$$
where $c \geq 0$ is a tuning parameter.
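The SNR matching and noise correction steps above can be combined into a short initialization routine. This sketch assumes DDPM coefficients `alpha_bar` and a scalar severity estimate; all names are illustrative.

```python
import math
import torch

def adaptive_start(z_hat, sigma_hat2, alpha_bar, c=0.1):
    """Pick the reverse-diffusion start index by SNR matching and apply the
    correction noise z_start = sqrt(1 - c*sigma^2) z_hat + sqrt(c*sigma^2) eps."""
    s2 = float(sigma_hat2)                                  # predicted error variance
    # Diffusion SNR a_i^2 / b_i^2 for DDPM is alpha_bar / (1 - alpha_bar).
    snr_diffusion = alpha_bar / (1.0 - alpha_bar)
    i_start = int(torch.argmin((snr_diffusion - 1.0 / s2).abs()))
    eps = torch.randn_like(z_hat)
    z_start = math.sqrt(1.0 - c * s2) * z_hat + math.sqrt(c * s2) * eps
    return i_start, z_start
```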
Latent Diffusion Posterior Sampling – Maintaining consistency with the measurements is non-trivial in the latent domain, as common projection-based approaches are not applicable directly in latent space. We propose Latent Diffusion Posterior Sampling (LDPS), a variant of diffusion posterior sampling [23] that guides the latent diffusion process towards data consistency in the original data space. In particular, by applying Bayes' rule the score of the posterior in latent space can be written as
$$\nabla_{z_t} \log q_t(z_t|y) = \nabla_{z_t} \log q_t(z_t) + \nabla_{z_t} \log q_t(y|z_t).$$
The first term on the r.h.s. is simply the unconditional score that we already have access to in the form of pre-trained LDMs. As $q_t(y|z_t)$ cannot be written in closed form, following DPS we use the approximation
$$\nabla_{z_t} \log q_t(y|z_t) \approx \nabla_{z_t} \log q_t(y|\hat{z}_0(z_t)),$$
where $\hat{z}_0(z_t) = \mathbb{E}[z_0|z_t]$ is the posterior mean of $z_0$, which is straightforward to estimate from the score model via Tweedie's formula. This form is similar to PSLD in [139], but without the "gluing" objective. As $y \sim \mathcal{N}(A(D_0(z_0)), \sigma_y^2 I)$, we approximate the score of the likelihood as
$$\nabla_{z_t} \log q_t(y|z_t) \approx -\frac{1}{2\sigma_y^2}\, \nabla_{z_t} \|A(D_0(\hat{z}_0(z_t))) - y\|^2. \tag{6.4}$$
LDPS is a general approach to impose data consistency in the latent domain that works with noisy and potentially nonlinear inverse problems.
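A brief sketch of the guidance gradient implied by (6.4), using automatic differentiation through the frozen decoder and forward operator; `score_model`, `decoder`, `A`, and the DDPM coefficient `alpha_bar_t` are placeholder interfaces.

```python
import torch

def ldps_likelihood_score(z_t, t, y, score_model, decoder, A, sigma_y, alpha_bar_t):
    """Sketch of the likelihood-score approximation in (6.4). The posterior mean
    z0_hat is obtained via Tweedie's formula from the noise prediction eps_theta."""
    z_t = z_t.detach().requires_grad_(True)
    eps = score_model(z_t, t)                                   # predicted noise
    z0_hat = (z_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    residual = A(decoder(z0_hat)) - y                           # mismatch in image space
    data_fit = residual.pow(2).sum() / (2 * sigma_y ** 2)
    grad = torch.autograd.grad(data_fit, z_t)[0]
    return -grad                                                # approximate likelihood score
```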
6.4 Experiments
Dataset – We perform experiments on CelebA-HQ (256 × 256) [79] where we match the training and
validation splits used to train LDMs in [135], and set aside 1k images from the validation split for testing.
For comparisons involving image domain score models we test on FFHQ [81], as pre-trained image-domain
score models have been trained on the complete CelebA-HQ dataset, unlike LDMs.
Degradations – We investigate two degradations of diverging characteristics. Varying blur, fixed
noise: we apply Gaussian blur with kernel size of 61 and sample kernel standard deviation uniformly on
[0, 3], where 0 corresponds to no blurring. We add Gaussian noise to images in the [0, 1] range with noise
standard deviation of 0.05. Nonlinear blur, fixed noise: we deploy GOPRO motion blur simulated by a
neural network model from [167]. This is a nonlinear forward model due to the camera response function.
We randomly sample nonlinear blur kernels for each image and add Gaussian noise with standard deviation
0.05.
Comparison methods – We compare our method, Flash-Diffusion, with SwinIR [97], a state-of-the-art Transformer-based supervised image restoration model, DPS [23], a diffusion-based solver for noisy
inverse problems, and Come-Closer-Diffuse-Faster (CCDF) [25], an accelerated image-domain diffusion
sampler with two variants: (1) CCDF-DPS: we replace the projection-based data consistency method with
diffusion posterior sampling [23] to facilitate nonlinear forward models and (2) CCDF-L: we deploy CCDF
in latent space using the same LDM as for our method and we replace the data consistency step with LDPS
based on (6.4). The latter method can be viewed as a fixed diffusion steps version of Flash-Diffusion. For
all CCDF-variants we use the SwinIR reconstruction as initialization. Finally, we show results of decoding
our severity encoder’s latent estimate zˆ directly without diffusion, denoted by AE (autoencoded). Further
details on comparison methods are in Appendix E.2.
Models – We use pre-trained score models from [40] for image-domain diffusion methods∗
and from
[135] for latent diffusion models†
. We fine-tune severity encoders from pre-trained LDM encoders and
utilize a single convolution layer on top of zˆ to predict σˆ. For more details on the experimental setup and
hyperparameters, see Appendix E.1.
6.4.1 Severity encoding
In this section, we investigate properties of the predicted degradation severity σˆ. We perform experiments
on a 1k-image subset of the validation split (to be released along with source code). First, we isolate the
effect of degradation on σˆ (Fig. 6.3). We fix the clean image and apply an increasing amount of Gaussian
blur. We observe that the predicted σˆ is an increasing function of the blur amount applied to the image:
∗ https://github.com/ermongroup/SDEdit
† https://github.com/CompVis/latent-diffusion
Figure 6.4: Left: Blur amount ($t$) vs. predicted degradation severity ($\hat{\sigma}$); the two are strongly correlated ($R^2 = 0.824$). Outliers indicate that the predicted degradation severity is not solely determined by the amount of blur. The bottom image is 'surprisingly easy' to reconstruct, as it is overwhelmingly smooth with features close to those seen in the training set. The top image is 'surprisingly hard', due to more high-frequency details and unusual features not seen during training. Right: Contributions to predicted severity. Degraded images with approx. the same $\hat{\sigma}$ (red dots on left plot) may have different factors contributing to the predicted severity. The main contributor to $\hat{\sigma}$ in the top image is the image degradation (blur), whereas the bottom image is inherently difficult to reconstruct. See Section 6.4.1 for further detail.
heavier degradations on a given image result in higher predicted degradation severity. This implies that
the severity encoder learns to capture the amount of information loss caused by the degradation.
Next, we investigate the relation between σˆ and the underlying degradation severity (Fig. 6.4, left). We
parameterize the corruption level by t, where t = 0 corresponds to no blur and additive Gaussian noise
(σ = 0.05) and t = 1 corresponds to the highest blur level with the same additive noise. We vary the blur
kernel width linearly for t ∈ (0, 1). We observe that the predicted σˆ severities strongly correlate with the
corruption level. However, the existence of outliers suggests that factors other than the corruption level
may also contribute to the predicted severities. The bottom image is predicted to be ’surprisingly easy’, as
other images of the same corruption level are typically assigned higher predicted severities. This sample
is overwhelmingly smooth, with a lack of fine details and textures, such as hair, present in other images.
Moreover, the image shares common features with others in the training set. On the other hand, the top
image is considered ’surprisingly difficult’, as it contains unexpected features and high-frequency details
that are uncommon in the dataset. This example highlights the potential application of our technique to
hard example mining and dataset distillation.
                        |     Gaussian Deblurring          |     Non-linear Deblurring
Method                  | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)  | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)
Flash-Diffusion (ours)  | 29.16   0.8191  0.2241   29.46   | 27.22   0.7705  0.2694   36.92
AE                      | 29.43   0.8366  0.2668   58.47   | 27.15   0.7824  0.3351   73.81
SwinIR [97]             | 30.71   0.8598  0.2399   59.07   | 27.66   0.7964  0.3072   62.11
DPS                     | 28.35   0.7806  0.2470   55.17   | 22.82   0.6247  0.3603   72.20
CCDF-DPS [25]           | 30.02   0.8365  0.2324   50.62   | 26.98   0.7445  0.2840   56.92
CCDF-L                  | 29.55   0.8377  0.2346   49.06   | 27.25   0.7793  0.2833   55.85

Table 6.1: Experimental results on the FFHQ test split.
Finally, we analyze the contribution of different factors to the predicted degradation severities (Fig. 6.4,
right). To this end, we apply severity encoding to both the clean image with noise (no blur) and the noisy
and degraded image, resulting in predicted severities σˆnoisy and σˆdegr.+noisy. We quantify the difficulty of
samples relative to each other via percentiles of the above two quantities, where we use σˆnoisy as a proxy for
the difficulty originating from the image structure. We observe that for a fixed σˆdegr.+noisy, the composition
of degradation severity may greatly vary. The two presented images have been assigned approximately the
same σˆdegr.+noisy, however the top image is inherently easy to encode (low σˆnoisy percentile) compared
to other images in the dataset, therefore the severity is mostly attributed to the image degradation. On the
other hand, the bottom image with the same σˆdegr.+noisy is less corrupted by blur, but with high σˆnoisy
indicating a difficult image. This example further corroborates the interaction between ground truth signal
structure and the applied corruption in determining the difficulty of a reconstruction task.
6.4.2 Sample-adaptive reconstruction
Comparison with state-of-the-art – Table 6.1 summarizes our results on Gaussian deblurring with varying blur amount and non-linear deblurring. We observe that Flash-Diffusion consistently outperforms other diffusion-based solvers in terms of perceptual metrics such as LPIPS and FID. SwinIR, a state-of-the-art supervised image restoration model, achieves higher PSNR and SSIM than diffusion methods; however,
Figure 6.5: Visual comparison of FFHQ reconstructions (columns: Measurement, CCDF-DPS, CCDF-L, SwinIR, Autoencoder, Ours, Target) under varying levels of Gaussian blur (top 2 rows) and nonlinear motion blur (bottom 2 rows), both with additive Gaussian noise ($\sigma = 0.05$).
the reconstructions lack detail compared to other techniques. This phenomenon is due to the perception-distortion trade-off [8]: improving perceptual image quality is fundamentally at odds with distortion metrics. Furthermore, we highlight that our initial reconstructions are worse than the SwinIR reconstructions (compare AE vs. SwinIR) used to initialize the CCDF variants. Despite this, we still achieve overall better perceptual reconstruction quality.
We perform visual comparison of reconstructed samples in Figure 6.5. We observe that Flash-Diffusion
can reconstruct fine details, significantly improving upon the autoencoded reconstruction used to initialize
the reverse diffusion process. SwinIR produces reliable reconstructions, but with less detail compared to diffusion-based methods. Moreover, note that diffusion-based solvers with a fixed number of diffusion steps tend to under-diffuse (see 2nd row, lack of details) or over-diffuse (4th row, high-frequency artifacts), leading to subpar reconstructions.
Figure 6.6: Comparison of adaptive reconstruction with a fixed number of diffusion steps on CelebA-HQ (Gaussian blur and nonlinear blur). Left and center: We plot the average number of reverse diffusion steps performed by our algorithm vs. CCDF-L with a fixed number of steps. We observe the best FID and near-optimal LPIPS across any fixed number of steps using our method. Right: We plot the histogram of predicted number of reverse diffusion steps for our algorithm. The spread around the mean highlights the adaptivity of our proposed technique.
Efficiency of the method – In order to demonstrate the gains in terms of reconstruction efficiency, we compare Flash-Diffusion to CCDF-L across various (fixed) numbers of reverse diffusion steps (Fig. 6.6). We observe that our adaptive method achieves the best FID across any fixed number of steps by a large margin. Moreover, it achieves near-optimal LPIPS, often with 2× fewer diffusion steps on average. We observe that the predicted diffusion steps are spread around the mean and not closely concentrated, further highlighting the adaptivity of our proposed method.
Robustness against forward model mismatch – Our method relies on a severity encoder that has
been trained on paired data of clean and degraded images under a specific forward model. We simulate a
mismatch between the severity encoder fine-tuning operator and test-time operator in order to investigate
the robustness of our technique with respect to forward model perturbations. In particular, we run the
following experiments to assess the test-time shift: 1) we train the encoder on Gaussian blur and test on
non-linear blur and 2) we train the encoder on non-linear blur and test on Gaussian blur. The results
on the FFHQ test set are in Table 6.2. We observe minimal loss in performance when non-linear blur
encoder is used for reconstructing images corrupted by Gaussian blur. For the non-linear deblurring task,
using Gaussian blur encoder results in a more significant drop in the performance, while still providing
acceptable reconstructions. These results are expected, as Gaussian blur can be thought of as a special case
of the non-linear blur model we consider. Therefore even when the encoder is swapped, it can provide
meaningful mean and error estimation. However, the Gaussian blur encoder has never been trained on
images corrupted by non-linear blur. As such, the mean estimate is worse, resulting in a larger performance
drop. Note that we did not re-tune the hyper-parameters in these experiments and doing so may potentially
alleviate the loss in performance.
                              |     Gaussian Deblurring          |     Non-linear Deblurring
Method                        | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)  | PSNR(↑) SSIM(↑) LPIPS(↓) FID(↓)
Ours + Gaussian blur encoder  | 29.16   0.8191  0.2241   29.467  | 25.36   0.7238  0.3416   54.90
Ours + NL blur encoder        | 28.96   0.8129  0.2362   30.34   | 27.22   0.7705  0.2694   36.92

Table 6.2: Robustness experiments on the FFHQ test split.
6.5 Conclusions
In this work, we make the key observation that the difficulty of solving an inverse problem may vary
greatly on a sample-by-sample basis, depending on the ground truth signal structure, the applied corruption, the model, the training set and the complex interactions between these factors. Despite this natural
variation in the difficulty of a reconstruction task, most techniques apply a rigid model that expends the
same compute irrespective of the amount of information destroyed in the noisy measurements, resulting in
suboptimal performance and wasteful resource allocation. We propose Flash-Diffusion, a sample-adaptive
method that predicts the degradation severity of corrupted signals, and utilizes this estimate to automatically tune the compute allocated to reconstruct the sample. In particular, we use the prediction error
of an encoder in latent space as a proxy for reconstruction difficulty, which we call severity encoding.
Then, we leverage a latent diffusion process to reconstruct the sample, where the length of the sampling
trajectory is directly scaled by the predicted severity. We experimentally demonstrate that the proposed
technique achieves performance on par with state-of-the-art diffusion-based reconstruction methods, but
with greatly improved compute efficiency.
Chapter 7
Conclusions
7.1 Summary of contributions
In this work, we have proposed a variety of techniques to address the key challenges of deploying artificial
intelligence in medical and scientific applications. We have tackled the challenge of data scarcity via two
approaches: first by reducing the number of necessary measurements for high quality reconstruction in
a traditional, compressed sensing framework, and second by generating synthetic examples through data
augmentation in order to enlarge the training set. We have also addressed the computational challenges posed by MRI reconstruction from multicoil k-space data: we introduced a hybrid architecture that capitalizes on the advantages of both convolutional and transformer-based neural network architectures. Finally, with the recent emergence of diffusion models, a new class of powerful generative models, we investigated the unique opportunities the diffusion framework may provide in image reconstruction. We find that diffusion opens up novel and interesting directions, such as a flexible trade-off between perceptual image quality and distortion metrics (e.g. PSNR, SSIM), and even sample-adaptive reconstruction where compute cost is directly scaled with the degradation severity of the measurement.
7.2 Future directions
Deep learning remains a rapidly evolving field with a dynamically changing state of the art in terms of
network architectures, training practices and algorithmic developments. It is crucial to investigate the
viability of such emerging methodologies when applied to high-impact applications such as those arising in the medical field and the sciences. We identify the following key directions for potential future work.
Diffusion models – Diffusion models [64, 154] achieve exceptional quality in synthetic image generation and can serve as powerful image priors for reconstruction tasks. The key challenge left to be addressed is their slow inference speed compared to end-to-end reconstruction techniques. Accelerating diffusion-based solvers could facilitate wider adoption in computational imaging applications. Moreover, adapting diffusion models pre-trained on large-scale generic data to application-specific distributions (e.g. medical or microscopy images) is an interesting avenue for future work.
Foundation models – Foundation models are large (up to billions of parameters) neural network
models trained on massive, internet-scale data. Large language models [130, 166] have fueled the recent revolution in conversational agents, such as ChatGPT, and vision-only [96] and language-vision [129] foundation models have likewise been successful in tasks such as zero-shot recognition. In order to tap
into the knowledge of such models, they need to be adapted to downstream applications, as they are
overwhelmingly trained on generic data. Developing efficient adaptation techniques has great potential in
bringing the ’foundational’ knowledge of such models (e.g. visual representations, scene understanding,
visual reasoning) to data-limited applications, common in medical and scientific problems.
Multimodality – Multimodal foundation models learn an aligned representation space of images and
text and can leverage such joint representations for visual reasoning [101]. As textual information is often
available in scientific and medical reconstruction problems in the form of context (e.g. past medical reports of the patient, additional prior textual information on the imaged sample) or auxiliary information (make and model of the instrument, parameters of the acquisition protocol), exploring multimodal techniques to utilize
such information may open up new capabilities and push the frontier of reconstruction performance even
further.
Bibliography
[1] Brian Abbey, Keith A Nugent, Garth J Williams, Jesse N Clark, Andrew G Peele, Mark A Pfeifer,
Martin De Jonge, and Ian McNulty. “Keyhole coherent diffractive imaging”. In: Nature Physics
(2008).
[2] Brian DO Anderson. “Reverse-time diffusion equation models”. In: Stochastic Processes and their
Applications 12.3 (1982), pp. 313–326.
[3] Diego Ardila, Atilla P Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J Reicher, Lily Peng,
Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, et al. “End-to-end lung cancer
screening with three-dimensional deep learning on low-dose chest computed tomography”. In:
Nature medicine 25.6 (2019), pp. 954–961.
[4] Selin Aslan, Viktor Nikitin, Daniel J Ching, Tekin Bicer, Sven Leyffer, and Doğa Gürsoy. “Joint
ptycho-tomography reconstruction through alternating direction method of multipliers”. In:
Optics express (2019).
[5] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. “Deep equilibrium models”. In: Advances in Neural
Information Processing Systems 32 (2019).
[6] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang,
Micah Goldblum, Jonas Geiping, and Tom Goldstein. “Cold diffusion: Inverting arbitrary image
transforms without noise”. In: arXiv preprint arXiv:2208.09392 (2022).
[7] Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new
perspectives”. In: IEEE transactions on pattern analysis and machine intelligence 35.8 (2013),
pp. 1798–1828.
[8] Yochai Blau and Tomer Michaeli. “The perception-distortion tradeoff”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 6228–6237.
[9] Ashish Bora, Eric Price, and Alexandros G Dimakis. “Ambientgan: Generative models from lossy
measurements”. In: International Conference on Learning Representations. 2018.
[10] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. “Phase retrieval via Wirtinger flow:
Theory and algorithms”. In: IEEE Transactions on Information Theory (2015).
[11] Emmanuel J. Candes, Justin K. Romberg, and Terence Tao. “Stable Signal Recovery from
Incomplete and Inaccurate Measurements”. In: Communications on Pure and Applied Mathematics:
A Journal Issued by the Courant Institute of Mathematical Sciences 59.8 (2006), pp. 1207–1223.
[12] Emmanuel J. Candes, Justin K. Romberg, and Terence Tao. “Stable signal recovery from
incomplete and inaccurate measurements”. In: Communications on Pure and Applied Mathematics:
A Journal Issued by the Courant Institute of Mathematical Sciences 59.8 (2006), pp. 1207–1223.
[13] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and
Manning Wang. “Swin-Unet: Unet-like pure Transformer for medical image segmentation”. en.
In: arXiv:2105.05537 (May 2021). arXiv: 2105.05537. (Visited on 11/19/2021).
[14] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. “Video super-resolution Transformer”. In:
arXiv preprint arXiv:2106.06847 (2021).
[15] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. “End-to-end object detection with Transformers”. In: European Conference on
Computer Vision. Springer. 2020, pp. 213–229.
[16] Stanley H Chan, Xiran Wang, and Omar A Elgendy. “Plug-and-play ADMM for image
restoration: Fixed-point convergence and applications”. In: IEEE Transactions on Computational
Imaging 3.1 (2016), pp. 84–98.
[17] Rohan Chandra, Ziyuan Zhong, Justin Hontz, Val McCulloch, Christoph Studer, and
Tom Goldstein. “Phasepack: A phase retrieval library”. In: 2017 51st Asilomar Conference on
Signals, Systems, and Computers. IEEE. 2017.
[18] Huibin Chang, Pablo Enfedaque, Yifei Lou, and Stefano Marchesini. “Partially coherent
ptychography by gradient decomposition of the probe”. In: Acta Crystallographica Section A:
Foundations and Advances (2018).
[19] Huibin Chang, Pablo Enfedaque, and Stefano Marchesini. “Iterative Joint
Ptychography-Tomography with Total Variation Regularization”. In: arXiv preprint
arXiv:1902.05647 (2019).
[20] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. “Learning to see in the dark”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 3291–3300.
[21] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma,
Chunjing Xu, Chao Xu, and Wen Gao. “Pre-trained image processing Transformer”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 12299–12310.
[22] Joseph Y. Cheng. Stanford 2D FSE. http://mridata.org/list?project=Stanford 2D FSE.
[23] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. “Diffusion
posterior sampling for general noisy inverse problems”. In: arXiv preprint arXiv:2209.14687 (2022).
[24] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. “Improving diffusion models for
inverse problems using manifold constraints”. In: arXiv preprint arXiv:2206.00941 (2022).
[25] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. “Come-closer-diffuse-faster: Accelerating
conditional diffusion models for inverse problems through stochastic contraction”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 12413–12422.
[26] Hyungjin Chung and Jong Chul Ye. “Score-based diffusion models for accelerated MRI”. In:
Medical Image Analysis 80 (2022), p. 102479.
[27] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. “3D
U-Net: Learning dense volumetric segmentation from sparse annotation”. In: International
Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. 2016,
pp. 424–432.
[28] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. “3D
U-Net: Learning dense volumetric segmentation from sparse annotation”. In: International
Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. 2016,
pp. 424–432.
[29] JN Clark, L Beitra, G Xiong, A Higginbotham, DM Fritz, HT Lemke, D Zhu, M Chollet,
GJ Williams, Marc Messerschmidt, et al. “Ultrafast three-dimensional imaging of lattice dynamics
in individual gold nanocrystals”. In: Science (2013).
[30] Frank H Clarke. Optimization and nonsmooth analysis. Vol. 5. Siam, 1990.
[31] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. “Autoaugment:
Learning augmentation strategies from data”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2019, pp. 113–123.
[32] Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alexandros G Dimakis, and Peyman Milanfar.
“Soft diffusion: Score matching for general corruptions”. In: arXiv preprint arXiv:2209.05442 (2022).
[33] Mohammad Zalbagi Darestani, Akshay Chaudhari, and Reinhard Heckel. “Measuring Robustness
in Deep Learning Based Compressive Sensing”. In: International Conference on Machine Learning.
2021.
[34] Mohammad Zalbagi Darestani and Reinhard Heckel. “Can Un-Trained Neural Networks Compete
with Trained Neural Networks at Image Reconstruction?” In: arXiv:2007.02471 [cs, eess, stat]
(2020).
[35] Jacob Deasy, Nikola Simidjievski, and Pietro Liò. “Heavy-tailed denoising score matching”. In:
arXiv preprint arXiv:2112.09788 (2021).
[36] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey,
Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas,
et al. “Magnetic control of tokamak plasmas through deep reinforcement learning”. In: Nature
602.7897 (2022), pp. 414–419.
[37] Mauricio Delbracio and Peyman Milanfar. “Inversion by Direct Iteration: An Alternative to
Denoising Diffusion for Image Restoration”. In: arXiv preprint arXiv:2303.11435 (2023).
[38] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale
hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition.
Ieee. 2009, pp. 248–255.
[39] Junjing Deng, David J Vine, Si Chen, Youssef SG Nashed, Qiaoling Jin, Nicholas W Phillips,
Tom Peterka, Rob Ross, Stefan Vogt, and Chris J Jacobsen. “Simultaneous cryo X-ray
ptychographic and fluorescence microscopy of green algae”. In: Proceedings of the National
Academy of Sciences (2015).
[40] Prafulla Dhariwal and Alex Nichol. “Diffusion Models Beat GANs on Image Synthesis”. In: arXiv
preprint arXiv:2105.05233 (2021).
[41] Martin Dierolf, Andreas Menzel, Pierre Thibault, Philipp Schneider, Cameron M. Kewish,
Roger Wepf, Oliver Bunk, and Franz Pfeiffer. “Ptychographic X-ray computed tomography at the
nanoscale”. In: Nature (2010).
[42] David L. Donoho. “Compressed Sensing”. In: IEEE Transactions on Information Theory 52.4 (2006),
pp. 1289–1306.
[43] David L. Donoho. “Compressed Sensing”. In: IEEE Transactions on Information Theory 52.4 (2006),
pp. 1289–1306.
[44] David L. Donoho. “Compressed sensing”. In: IEEE Transactions on Information Theory 52.4 (2006),
pp. 1289–1306.
[45] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. “An image is worth 16x16 words: Transformers for image
recognition at scale”. In: arXiv:2010.11929 [cs] (2020).
[46] Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. “DiracDiffusion: Denoising and Incremental
Reconstruction with Assured Data-Consistency”. In: arXiv preprint arXiv:2303.14353 (2023).
[47] Berthy T Feng, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L Bouman, and
William T Freeman. “Score-Based Diffusion Models as Principled Priors for Inverse Imaging”. In:
arXiv preprint arXiv:2304.11751 (2023).
[48] Chun-Mei Feng, Yunlu Yan, Huazhu Fu, Li Chen, and Yong Xu. “Task Transformer network for
joint MRI reconstruction and super-resolution”. In: International Conference on Medical Image
Computing and Computer-Assisted Intervention. 2021, pp. 307–317.
[49] Davis Gilton, Gregory Ongie, and Rebecca Willett. “Deep equilibrium architectures for inverse
problems in imaging”. In: IEEE Transactions on Computational Imaging 7 (2021), pp. 1123–1133.
[50] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. “Unsupervised monocular depth
estimation with left-right consistency”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2017, pp. 270–279.
[51] Mark A Griswold, Peter M Jakob, Robin M Heidemann, Mathias Nittka, Vladimir Jellus,
Jianmin Wang, Berthold Kiefer, and Axel Haase. “Generalized autocalibrating partially parallel
acquisitions (GRAPPA)”. In: Magnetic Resonance in Medicine: An Official Journal of the
International Society for Magnetic Resonance in Medicine 47.6 (2002), pp. 1202–1210.
[52] Doğa Gürsoy. “Direct coupling of tomography and ptychography”. In: Optics letters (2017).
[53] Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P Recht, Daniel K Sodickson,
Thomas Pock, and Florian Knoll. “Learning a variational network for reconstruction of
accelerated MRI data”. In: Magnetic Resonance in Medicine 79.6 (2018), pp. 3055–3071.
[54] Kerstin Hammernik, Jo Schlemper, Chen Qin, Jinming Duan, Ronald M Summers, and
Daniel Rueckert. “Σ-net: Systematic Evaluation of Iterative Deep Neural Networks for Fast
Parallel MR Image Reconstruction”. In: arXiv preprint arXiv:1912.09278 (2019).
[55] Yoseob Han and Jong Chul Ye. “Framing U-Net via deep convolutional framelets: Application to
sparse-view CT”. In: IEEE Transactions on Medical Imaging 37.6 (2018), pp. 1418–1429.
[56] Yoseob Han and Jong Chul Ye. “Framing U-Net via deep convolutional framelets: Application to
sparse-view CT”. In: IEEE Transactions on Medical Imaging 37.6 (2018), pp. 1418–1429.
[57] Paul Hand, Oscar Leong, and Vlad Voroninski. “Phase retrieval under a generative prior”. In:
Advances in Neural Information Processing Systems 31 (2018).
[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: arXiv:1512.03385 (2015).
[59] Reinhard Heckel and Paul Hand. “Deep Decoder: Concise Image Representations from Untrained
Non-Convolutional Networks”. In: International Conference on Learning Representations. 2019.
[60] Reinhard Heckel and Mahdi Soltanolkotabi. “Compressive Sensing with Un-Trained Neural
Networks: Gradient Descent Finds the Smoothest Approximation”. In: International Conference on
Machine Learning. 2020.
[61] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh.
“Rethinking spatial dimensions of Vision Transformers”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 11936–11945.
[62] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
“Gans trained by a two time-scale update rule converge to a local nash equilibrium”. In: Advances
in neural information processing systems 30 (2017).
[63] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models”. In: arXiv
preprint arXiv:2006.11239 (2020).
[64] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models”. In:
Advances in neural information processing systems 33 (2020), pp. 6840–6851.
[65] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and
Tim Salimans. “Cascaded Diffusion Models for High Fidelity Image Generation.” In: J. Mach.
Learn. Res. 23.47 (2022), pp. 1–33.
[66] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and
David J Fleet. “Video diffusion models”. In: arXiv preprint arXiv:2204.03458 (2022).
[67] Mirko Holler, Manuel Guizar-Sicairos, Esther HR Tsai, Roberto Dinapoli, Elisabeth Müller,
Oliver Bunk, Jörg Raabe, and Gabriel Aeppli. “High-resolution non-destructive three-dimensional
imaging of integrated circuits”. In: Nature (2017).
[68] Emiel Hoogeboom and Tim Salimans. “Blurring diffusion models”. In: arXiv preprint
arXiv:2209.05557 (2022).
[69] Roarke Horstmeyer, Xiaoze Ou, Guoan Zheng, Phil Willems, and Changhuei Yang. “Digital
pathology with Fourier ptychography”. In: Computerized Medical Imaging and Graphics (2015).
[70] Jiahao Huang, Yingying Fang, Yinzhe Wu, Huanjun Wu, Zhifan Gao, Yang Li, Javier Del Ser,
Jun Xia, and Guang Yang. “Swin Transformer for Fast MRI”. In: arXiv preprint arXiv:2201.03230
(2022).
[71] Xiaohong Huang, Zhifang Deng, Dandan Li, and Xueguang Yuan. “MISSFormer: An effective
medical image segmentation Transformer”. In: arXiv preprint arXiv:2109.07162 (2021).
[72] Chang Min Hyun, Hwa Pyung Kim, Sung Min Lee, Sungchul Lee, and Jin Keun Seo. “Deep
learning for undersampled MRI reconstruction”. In: Physics in Medicine & Biology 63.13 (2018),
p. 135007.
[73] Chang Min Hyun, Hwa Pyung Kim, Sung Min Lee, Sungchul Lee, and Jin Keun Seo. “Deep
learning for undersampled MRI reconstruction”. In: Physics in Medicine & Biology 63.13 (2018),
p. 135007.
[74] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. “Recovery of sparse 1-D signals from the
magnitudes of their Fourier transform”. In: 2012 IEEE International Symposium on Information
Theory Proceedings. IEEE. 2012.
[75] Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir.
“Robust compressed sensing mri with deep generative priors”. In: Advances in Neural Information
Processing Systems 34 (2021), pp. 14938–14954.
[76] Kyong Hwan Jin, Harshit Gupta, Jerome Yerly, Matthias Stuber, and Michael Unser.
“Time-Dependent Deep Image Prior for Dynamic MRI”. In: arXiv:1910.01684 [cs, eess] (2019).
[77] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger,
Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. “Highly accurate
protein structure prediction with AlphaFold”. In: Nature 596.7873 (2021), pp. 583–589.
[78] Zahra Kadkhodaie and Eero Simoncelli. “Stochastic solutions for linear inverse problems using
the prior implicit in a denoiser”. In: Advances in Neural Information Processing Systems 34 (2021),
pp. 13242–13254.
[79] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for
Improved Quality, Stability, and Variation”. In: arXiv:1710.10196 [cs, stat] (2018).
[80] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila.
“Training generative adversarial networks with limited data”. In: arXiv preprint arXiv:2006.06676
(2020).
[81] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative
adversarial networks”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2019, pp. 4401–4410.
[82] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. “Denoising diffusion restoration
models”. In: arXiv preprint arXiv:2201.11793 (2022).
[83] Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael Elad. “Jpeg artifact correction using
denoising diffusion restoration models”. In: arXiv preprint arXiv:2209.11888 (2022).
[84] Bahjat Kawar, Gregory Vaksman, and Michael Elad. “SNIPS: Solving noisy inverse problems
stochastically”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 21757–21769.
[85] Michael Kellman, Emrah Bostan, Michael Chen, and Laura Waller. “Data-driven design for fourier
ptychographic microscopy”. In: 2019 IEEE International Conference on Computational Photography
(ICCP). IEEE. 2019, pp. 1–8.
[86] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint
arXiv:1312.6114 (2013).
[87] Tobit Klug and Reinhard Heckel. “Scaling Laws For Deep Learning Based Image Reconstruction”.
In: arXiv preprint arXiv:2209.13435 (2022).
[88] Florian Knoll, Kerstin Hammernik, Erich Kobler, Thomas Pock, Michael P Recht, and
Daniel K Sodickson. “Assessment of the Generalization of Learned Image Reconstruction and the
Potential for Transfer Learning”. In: Magnetic Resonance in Medicine 81.1 (2019), pp. 116–128.
[89] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. “Diffwave: A versatile
diffusion model for audio synthesis”. In: arXiv preprint arXiv:2009.09761 (2020).
[90] Yilmaz Korkmaz, Salman UH Dar, Mahmut Yurt, Muzaffer Özbey, and Tolga Cukur.
“Unsupervised MRI reconstruction via zero-shot learned adversarial Transformers”. In: IEEE
Transactions on Medical Imaging (2022).
[91] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas.
“Deblurgan: Blind motion deblurring using conditional adversarial networks”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2018, pp. 8183–8192.
[92] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham,
Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al.
“Photo-realistic single image super-resolution using a generative adversarial network”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4681–4690.
[93] Dongwook Lee, Jaejun Yoo, Sungho Tak, and Jong Chul Ye. “Deep residual learning for
accelerated MRI using magnitude and phase networks”. In: IEEE Transactions on Biomedical
Engineering 65.9 (2018), pp. 1985–1995.
[94] Sangyun Lee, Hyungjin Chung, Jaehyeon Kim, and Jong Chul Ye. “Progressive deblurring of
diffusion models for coarse-to-fine image synthesis”. In: arXiv preprint arXiv:2207.11192 (2022).
[95] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. “Smart augmentation learning an optimal
data augmentation strategy”. In: Ieee Access 5 (2017), pp. 5858–5869.
[96] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum.
“Mask dino: Towards a unified transformer-based framework for object detection and
segmentation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2023, pp. 3041–3050.
[97] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. “SwinIR:
Image restoration using Swin Transformer”. In: arXiv:2108.10257 (2021).
[98] Orly Liba, Kiran Murthy, Yun-Ta Tsai, Tim Brooks, Tianfan Xue, Nikhil Karnad, Qiurui He,
Jonathan T Barron, Dillon Sharlet, Ryan Geiss, et al. “Handheld mobile photography in very low
light.” In: ACM Trans. Graph. 38.6 (2019), pp. 164–1.
[99] Kang Lin and Reinhard Heckel. “Vision Transformers Enable Fast and Robust Accelerated MRI”.
In: Medical Imaging with Deep Learning. 2022.
[100] Fang Liu, Alexey Samsonov, Lihua Chen, Richard Kijowski, and Li Feng. “SANTIS:
Sampling-augmented neural network with incoherent structure for MR image reconstruction”. In:
Magnetic Resonance in Medicine 82.5 (2019), pp. 1890–1904.
[101] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. “Visual instruction tuning”. In: arXiv
preprint arXiv:2304.08485 (2023).
[102] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
“Swin Transformer: Hierarchical Vision Transformer using shifted windows”. In: arXiv:2103.14030
(2021).
[103] Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. “Refusion:
Enabling large-size realistic image restoration with latent-space diffusion models”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 1680–1691.
[104] Michael Lustig, David L. Donoho, Juan M. Santos, and John M. Pauly. “Compressed Sensing MRI”.
In: IEEE Signal Processing Magazine 25.2 (2008), pp. 72–82.
[105] Michael Lustig, David L. Donoho, Juan M. Santos, and John M. Pauly. “Compressed Sensing MRI”.
In: IEEE Signal Processing Magazine 25.2 (2008), pp. 72–82.
[106] Michael Lustig, David L. Donoho, Juan M. Santos, and John M. Pauly. “Compressed sensing MRI”.
In: IEEE Signal Processing Magazine 25.2 (2008), pp. 72–82.
[107] Achleshwar Luthra, Harsh Sulakhe, Tanish Mittal, Abhishek Iyer, and Santosh Yadav. “Eformer:
Edge Enhancement based Transformer for Medical Image Denoising”. In: arXiv preprint
arXiv:2109.08044 (2021).
[108] Andrew M Maiden, Martin J Humphry, and JM Rodenburg. “Ptychographic transmission
microscopy in three dimensions using a multi-slice approach”. In: JOSA A (2012).
[109] Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and
Ekin Dogus Cubuk. “Scaling deep learning for materials discovery”. In: Nature (2023), pp. 1–6.
[110] Jianwei Miao, Pambos Charalambous, Janos Kirz, and David Sayre. “Extending the methodology
of X-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens”. In:
Nature (1999).
[111] Matthew J Muckley, Bruno Riemenschneider, Alireza Radmanesh, Sunwoo Kim, Geunu Jeong,
Jingyu Ko, Yohan Jun, Hyungseob Shin, Dosik Hwang, Mahmoud Mostapha, et al. “Results of the
2020 fastMRI challenge for machine learning MR image reconstruction”. In: IEEE transactions on
Medical Imaging 40.9 (2021), pp. 2306–2317.
[112] Anton Myagotin, Alexey Voropaev, Lukas Helfen, Daniel Hänschke, and Tilo Baumbach.
“Efficient volume reconstruction for parallel-beam computed laminography by filtered
backprojection on multi-core clusters”. In: IEEE Transactions on Image Processing (2013).
[113] Eliya Nachmani, Robin San Roman, and Lior Wolf. “Denoising diffusion gamma models”. In: arXiv
preprint arXiv:2110.05948 (2021).
[114] Elias Nehme, Lucien E Weiss, Tomer Michaeli, and Yoav Shechtman. “Deep-STORM:
super-resolution single-molecule microscopy by deep learning”. In: Optica 5.4 (2018), pp. 458–464.
[115] Johanna Nelson, Xiaojing Huang, Jan Steinbrener, David Shapiro, Janos Kirz, Stefano Marchesini,
Aaron M Neiman, Joshua J Turner, and Chris Jacobsen. “High-resolution x-ray diffraction
microscopy of specifically labeled yeast cells”. In: Proceedings of the National Academy of Sciences
(2010).
[116] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew,
Ilya Sutskever, and Mark Chen. “Glide: Towards photorealistic image generation and editing with
text-guided diffusion models”. In: arXiv preprint arXiv:2112.10741 (2021).
[117] Viktor Nikitin, Selin Aslan, Yudong Yao, Tekin Biçer, Sven Leyffer, Rajmund Mokso, and
Doğa Gürsoy. “Photon-limited ptychography of 3D objects via Bayesian reconstruction”. In: OSA
Continuum (2019).
[118] Dwight George Nishimura. Principles of magnetic resonance imaging. Stanford University, 1996.
[119] Andrey Okhotin, Dmitry Molchanov, Vladimir Arkhipkin, Grigory Bartosh, Aibek Alanov, and
Dmitry Vetrov. “Star-Shaped Denoising Diffusion Probabilistic Models”. In: arXiv preprint
arXiv:2302.05259 (2023).
[120] Gregory Ongie, Ajil Jalal, Christopher A. Metzler Richard G. Baraniuk, Alexandros G. Dimakis,
and Rebecca Willett. “Deep Learning Techniques for Inverse Problems in Imaging”. In: IEEE
Journal on Selected Areas in Information Theory (2020).
[121] Gregory Ongie, Ajil Jalal, Christopher A. Metzler Richard G. Baraniuk, Alexandros G. Dimakis,
and Rebecca Willett. “Deep learning techniques for inverse problems in imaging”. In: IEEE
Journal on Selected Areas in Information Theory (2020).
[122] Gregory Ongie, Ajil Jalal, Christopher A Metzler, Richard G Baraniuk, Alexandros G Dimakis,
and Rebecca Willett. “Deep learning techniques for inverse problems in imaging”. In: IEEE
Journal on Selected Areas in Information Theory 1.1 (2020), pp. 39–56.
[123] Nicola Pezzotti, Elwin de Weerdt, Sahar Yousefi, Mohamed S Elmahdy, Jeroen van Gemert,
Christophe Schülke, Mariya Doneva, Tim Nielsen, Sergey Kastryulin, Boudewijn PF Lelieveldt,
et al. “Adaptive-CS-Net: fastMRI with adaptive intelligence”. In: arXiv preprint arXiv:1912.12259
(2019).
[124] Mark A Pfeifer, Garth J Williams, Ivan A Vartanyants, Ross Harder, and Ian K Robinson.
“Three-dimensional mapping of a deformation field inside a nanocrystal”. In: Nature (2006).
[125] Klaas P Pruessmann, Markus Weiger, Markus B Scheidegger, and Peter Boesiger. “SENSE:
sensitivity encoding for fast MRI”. In: Magnetic Resonance in Medicine: An Official Journal of the
International Society for Magnetic Resonance in Medicine 42.5 (1999), pp. 952–962.
[126] Patrick Putzky, Dimitrios Karkalousos, Jonas Teuwen, Nikita Miriakov, Bart Bakker,
Matthan Caan, and Max Welling. “I-RIM applied to the fastMRI Challenge”. In: arXiv:1910.08952
(2019).
[127] Patrick Putzky and Max Welling. “Invert to learn to invert”. In: Advances in Neural Information
Processing Systems. 2019, pp. 446–456.
[128] Zhen Qin, Qingliang Zeng, Yixin Zong, and Fan Xu. “Image inpainting based on deep learning: A
review”. In: Displays 69 (2021), p. 102028.
[129] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. “Learning transferable visual
models from natural language supervision”. In: International conference on machine learning.
PMLR. 2021, pp. 8748–8763.
[130] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. “Improving language
understanding by generative pre-training”. In: (2018).
[131] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. “Hierarchical
text-conditional image generation with clip latents”. In: arXiv preprint arXiv:2204.06125 (2022).
[132] Zaccharie Ramzi, Philippe Ciuciu, and Jean-Luc Starck. “XPDNet for MRI Reconstruction: an
application to the 2020 fastMRI Challenge”. In: arXiv preprint arXiv:2010.07290 (2020).
[133] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. “Generating diverse high-fidelity images with
vq-vae-2”. In: Advances in neural information processing systems 32 (2019).
[134] Severi Rissanen, Markus Heinonen, and Arno Solin. “Generative modelling with inverse heat
dissipation”. In: arXiv preprint arXiv:2206.13397 (2022).
[135] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
“High-resolution image synthesis with latent diffusion models”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 10684–10695.
[136] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for
biomedical image segmentation”. In: International Conference on Medical Image Computing and
Computer-Assisted Intervention. Springer. 2015, pp. 234–241.
[137] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for
biomedical image segmentation”. In: International Conference on Medical Image Computing and
Computer-Assisted Intervention. 2015, pp. 234–241.
[138] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and
organization in the brain.” In: Psychological review 65.6 (1958), p. 386.
[139] Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G Dimakis, and
Sanjay Shakkottai. “Solving Linear Inverse Problems Provably via Posterior Sampling with Latent
Diffusion Models”. In: arXiv preprint arXiv:2307.00619 (2023).
[140] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans,
David Fleet, and Mohammad Norouzi. “Palette: Image-to-image diffusion models”. In: ACM
SIGGRAPH 2022 Conference Proceedings. 2022, pp. 1–10.
[141] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton,
Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes,
et al. “Photorealistic text-to-image diffusion models with deep language understanding”. In: arXiv
preprint arXiv:2205.11487 (2022).
[142] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and
Mohammad Norouzi. “Image Super-Resolution via Iterative Refinement”. In: arXiv:2104.07636 [cs,
eess] (2021).
[143] Anne Marie Sawyer, Michael Lustig, Marcus Alley, Martin Uecker, Patrick Virtue, Peng Lai,
and Shreyas Vasanawala. “Creation of fully sampled MR data repository for compressed sensing
of the knee”. In: (2013).
[144] Jo Schlemper, Jose Caballero, Joseph V Hajnal, Anthony Price, and Daniel Rueckert. “A deep
cascade of convolutional neural networks for MR image reconstruction”. In: International
Conference on Information Processing in Medical Imaging. Springer. 2017, pp. 647–658.
[145] David Shapiro, Pierre Thibault, Tobias Beetz, Veit Elser, Malcolm Howells, Chris Jacobsen,
Janos Kirz, Enju Lima, Huijie Miao, Aaron M Neiman, et al. “Biological imaging by soft x-ray
diffraction microscopy”. In: Proceedings of the National Academy of Sciences (2005).
[146] David A Shapiro, Young-Sang Yu, Tolek Tyliszczak, Jordi Cabana, Rich Celestre, Weilun Chao,
Konstantin Kaznatcheev, AL David Kilcoyne, Filipe Maia, Stefano Marchesini, et al. “Chemical
composition mapping with nanometre resolution by soft X-ray microscopy”. In: Nature Photonics
(2014).
[147] Connor Shorten and Taghi M Khoshgoftaar. “A survey on image data augmentation for deep
learning”. In: Journal of Big Data 6.1 (2019), pp. 1–48.
[148] Daniel K Sodickson and Warren J Manning. “Simultaneous acquisition of spatial harmonics
(SMASH): fast imaging with radiofrequency coil arrays”. In: Magnetic Resonance in Medicine: An
Official Journal of the International Society for Magnetic Resonance in Medicine 38.4 (1997),
pp. 591–603.
[149] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. “Deep
unsupervised learning using nonequilibrium thermodynamics”. In: International Conference on
Machine Learning. PMLR. 2015, pp. 2256–2265.
[150] Mahdi Soltanolkotabi. “Structured signal recovery from quadratic measurements: Breaking
sample complexity barriers via nonconvex optimization”. In: IEEE Transactions on Information
Theory (2019).
[151] Jiaming Song, Chenlin Meng, and Stefano Ermon. “Denoising Diffusion Implicit Models”. In:
International Conference on Learning Representations. 2021. url:
https://openreview.net/forum?id=St1giarCHLP.
[152] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. “Pseudoinverse-Guided Diffusion
Models for Inverse Problems”. In: International Conference on Learning Representations.
[153] Yang Song and Stefano Ermon. “Generative Modeling by Estimating Gradients of the Data
Distribution”. In: arXiv:1907.05600 [cs, stat] (2020).
[154] Yang Song and Stefano Ermon. “Generative modeling by estimating gradients of the data
distribution”. In: Advances in neural information processing systems 32 (2019).
[155] Yang Song and Stefano Ermon. “Improved Techniques for Training Score-Based Generative
Models”. In: arXiv:2006.09011 [cs, stat] (2020).
[156] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. “Solving inverse problems in medical
imaging with score-based generative models”. In: arXiv preprint arXiv:2111.08005 (2021).
[157] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. “Score-based generative modeling through stochastic differential equations”. In: arXiv
preprint arXiv:2011.13456 (2020).
[158] Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C Lawrence Zitnick,
Nafissa Yakubova, Florian Knoll, and Patricia Johnson. “End-to-End Variational Networks for
Accelerated MRI Reconstruction”. In: arXiv preprint arXiv:2004.06688 (2020).
[159] Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C Lawrence Zitnick,
Nafissa Yakubova, Florian Knoll, and Patricia Johnson. “End-to-end variational networks for
accelerated MRI reconstruction”. In: Medical Image Computing and Computer Assisted
Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020,
Proceedings, Part II 23. Springer. 2020, pp. 64–73.
[160] Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick,
Nafissa Yakubova, Florian Knoll, and Patricia Johnson. “End-to-end variational networks for
accelerated MRI reconstruction”. In: Medical Image Computing and Computer Assisted
Intervention. 2020, pp. 64–73.
[161] Jian Sun, Huibin Li, Zongben Xu, et al. “Deep ADMM-Net for compressive sensing MRI”. In:
Advances in Neural Information Processing Systems 29 (2016).
[162] Pierre Thibault, Martin Dierolf, Andreas Menzel, Oliver Bunk, Christian David, and Franz Pfeiffer.
“High-resolution scanning x-ray diffraction microscopy”. In: Science (2008).
[163] Lei Tian, Ziji Liu, Li-Hao Yeh, Michael Chen, Jingshan Zhong, and Laura Waller. “Computational
illumination for high-speed in vitro Fourier ptychographic microscopy”. In: Optica (2015).
[164] Lei Tian and Laura Waller. “3D intensity and phase imaging from light field measurements in an
LED array microscope”. In: optica (2015).
[165] Lei Tian and Laura Waller. “3D intensity and phase imaging from light field measurements in an
LED array microscope”. In: optica (2015).
[166] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. “Llama: Open
and efficient foundation language models”. In: arXiv preprint arXiv:2302.13971 (2023).
[167] Phong Tran, Anh Tuan Tran, Quynh Phung, and Minh Hoai. “Explore image deblurring via
encoded blur kernel space”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2021, pp. 11956–11965.
[168] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. “A bayesian data
augmentation approach for learning deep models”. In: arXiv preprint arXiv:1710.10564 (2017).
[169] TrueFidelity. https://www.gehealthcare.com/products/truefidelity. Accessed: 2023-11-22.
[170] Martin Uecker, Peng Lai, Mark J Murphy, Patrick Virtue, Michael Elad, John M Pauly,
Shreyas S Vasanawala, and Michael Lustig. “ESPIRiT—an eigenvalue approach to autocalibrating
parallel MRI: where SENSE meets GRAPPA”. In: Magnetic Resonance in Medicine 71.3 (2014),
pp. 990–1001.
[171] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior”. In: Conference on
Computer Vision and Pattern Recognition. 2018, pp. 9446–9454.
[172] Dave Van Veen, Ajil Jalal, Mahdi Soltanolkotabi, Eric Price, Sriram Vishwanath, and
Alexandros G Dimakis. “Compressed sensing with deep image prior and learned regularization”.
In: arXiv preprint arXiv:1806.06438 (2018).
[173] Shreyas S Vasanawala, Marcus T Alley, Brian A Hargreaves, Richard A Barth, John M Pauly, and
Michael Lustig. “Improved pediatric MR imaging with compressed sensing”. In: Radiology 256.2
(2010), pp. 607–616.
[174] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. “Plug-and-play priors
for model based reconstruction”. In: 2013 IEEE Global Conference on Signal and Information
Processing. IEEE. 2013, pp. 945–948.
[175] Pascal Vincent. “A connection between score matching and denoising autoencoders”. In: Neural
computation 23.7 (2011), pp. 1661–1674.
[176] Irène Waldspurger. “Phase retrieval with random gaussian sensing vectors by alternating
projections”. In: IEEE Transactions on Information Theory (2018).
[177] Dayang Wang, Zhan Wu, and Hengyong Yu. “TED-net: Convolution-free T2T
VisionTransformer-based Encoder-decoder Dilation network for Low-dose CT Denoising”. In:
International Workshop on Machine Learning in Medical Imaging. 2021, pp. 416–425.
[178] Puyang Wang, Eric Z Chen, Terrence Chen, Vishal M Patel, and Shanhui Sun. “Pyramid
Convolutional RNN for MRI Reconstruction”. In: arXiv preprint arXiv:1912.00543 (2019).
[179] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo,
and Ling Shao. “Pyramid Vision Transformer: A versatile backbone for dense prediction without
convolutions”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021,
pp. 568–578.
[180] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. “Image inpainting via generative
multi-column convolutional neural networks”. In: Advances in neural information processing
systems 31 (2018).
[181] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. “Uformer: A general U-shaped
Transformer for image restoration”. In: arXiv preprint arXiv:2106.03106 (2021).
[182] Zhihao Wang, Jian Chen, and Steven CH Hoi. “Deep learning for image super-resolution: A
survey”. In: IEEE transactions on pattern analysis and machine intelligence 43.10 (2020),
pp. 3365–3387.
[183] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. “Multiscale structural similarity for image
quality assessment”. In: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers,
2003. Vol. 2. Ieee. 2003, pp. 1398–1402.
[184] Simon Welker, Henry N Chapman, and Timo Gerkmann. “DriftRec: Adapting diffusion models to
blind image restoration tasks”. In: arXiv preprint arXiv:2211.06757 (2022).
[185] Yixuan Wu, Kuanlun Liao, Jintai Chen, Danny Z Chen, Jinhong Wang, Honghao Gao, and
Jian Wu. “D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation”. In:
arXiv preprint arXiv:2201.00462 (2022).
[186] Rui Xu, Mahdi Soltanolkotabi, Justin P Haldar, Walter Unglaub, Joshua Zusman, Anthony FJ Levi,
and Richard M Leahy. “Accelerated wirtinger flow: A fast algorithm for ptychography”. In: arXiv
preprint arXiv:1806.05546 (2018).
[187] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. “Deep ADMM-Net for compressive sensing
MRI”. In: Proceedings of the 30th International Conference on Neural Information Processing
Systems. 2016, pp. 10–18.
[188] Jong Chul Ye. “Compressed sensing MRI: a review from signal processing perspective”. In: BMC
Biomedical Engineering 1.1 (2019), pp. 1–17.
[189] Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. “Image deblurring with blurred/noisy
image pairs”. In: ACM SIGGRAPH 2007 papers. 2007, 1–es.
[190] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and
Ming-Hsuan Yang. “Restormer: Efficient Transformer for high-resolution image restoration”. In:
arXiv:2111.09881 (2021).
[191] Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J Muckley, Mary Bruno, Aaron Defazio,
Marc Parente, Krzysztof J Geras, Joe Katsnelson, Hersh Chandarana, et al. “fastMRI: An open
dataset and benchmarks for accelerated MRI”. In: arXiv preprint arXiv:1811.08839 (2018).
[192] Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang,
Matthew J. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente,
Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal,
Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang,
Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui.
“fastMRI: An open dataset and benchmarks for accelerated MRI”. In: arXiv:1811.08839 (2019).
[193] Jian Zhang and Bernard Ghanem. “ISTA-Net: Interpretable optimization-inspired deep network
for image compressive sensing”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2018, pp. 1828–1837.
[194] Jian Zhang and Bernard Ghanem. “ISTA-Net: Interpretable optimization-inspired deep network
for image compressive sensing”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2018, pp. 1828–1837.
[195] Kaihao Zhang, Wenqi Ren, Wenhan Luo, Wei-Sheng Lai, Björn Stenger, Ming-Hsuan Yang, and
Hongdong Li. “Deep image deblurring: A survey”. In: International Journal of Computer Vision
130.9 (2022), pp. 2103–2130.
[196] Lipei Zhang, Zizheng Xiao, Chao Zhou, Jianmin Yuan, Qiang He, Yongfeng Yang, Xin Liu,
Dong Liang, Hairong Zheng, Wei Fan, et al. “Spatial adaptive and Transformer fusion network
(STFNet) for low-count PET blind denoising with MRI”. In: Medical Physics (2021).
[197] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. “The unreasonable
effectiveness of deep features as a perceptual metric”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 586–595.
[198] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. “Differentiable Augmentation for
Data-Efficient GAN Training”. In: arXiv:2006.10738 (2020). arXiv: 2006.10738.
[199] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, and Han Zhang. “Image Augmentations
for GAN Training”. In: arXiv:2006.02595 (2020). arXiv: 2006.02595.
[200] Guoan Zheng, Roarke Horstmeyer, and Changhuei Yang. “Wide-field, high-resolution Fourier
ptychographic microscopy”. In: Nature photonics (2013).
[201] Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu.
“nnFormer: Interleaved Transformer for Volumetric Segmentation”. In: arXiv preprint
arXiv:2109.03201 (2021).
[202] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. “Unet++:
A nested U-net architecture for medical image segmentation”. In: Deep Learning in Medical Image
Analysis and Multimodal Learning for Clinical Decision Support. 2018, pp. 3–11.
[203] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. “Unet++:
A nested u-net architecture for medical image segmentation”. In: Deep Learning in Medical Image
Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
Appendices
A Appendix to 3D Phase retrieval at nano-scale via Accelerated
Wirtinger Flow
A.1 Proof of Theorem 2.4.1
Here, we are going to prove our main result on the convergence of 3D-AWF to stationary points stated
in Theorem 2.4.1. We are going to use Wirtinger-derivatives in place of regular differentiation. For an
overview on the notion of Wirtinger-derivatives and some properties we refer the reader to [10]. Let $\bar{x}$ denote the complex conjugate of $x \in \mathbb{C}$ and for a matrix $A \in \mathbb{C}^{n \times m}$ we write $A^H = \bar{A}^T \in \mathbb{C}^{m \times n}$ for its Hermitian transpose.
First, we want to upper bound the spectral norm of the Hessian of $\mathcal{L}(x)$. Let
$$J_{g_\ell} = \frac{\partial}{\partial x}\, g_\ell$$
denote the Jacobian of $g_\ell$. Since
$$\{J_{g_\ell}\}_{i,j} = \frac{2\pi i}{\lambda}\, \{T_\ell\}_{i,j}\, \{g_\ell\}_i,$$
we have
$$J_{g_\ell} = \frac{2\pi i}{\lambda}\; T_\ell \odot \left[\, g_\ell \;\; g_\ell \;\; \cdots \;\; g_\ell \,\right] = \frac{2\pi i}{\lambda}\, \mathrm{diag}(g_\ell)\, T_\ell. \tag{1}$$
Note that the "mixed" derivatives vanish:
$$\frac{\partial}{\partial \bar{x}}\, g_\ell = 0, \qquad \frac{\partial}{\partial x}\, \bar{g}_\ell = \frac{\partial}{\partial x}\, e^{-\frac{2\pi i}{\lambda} T_\ell \bar{x}} = 0.$$
Moreover,
$$J_{\bar{g}_\ell}(x) = \frac{\partial}{\partial \bar{x}}\, \bar{g}_\ell = -\frac{2\pi i}{\lambda}\, \mathrm{diag}(\bar{g}_\ell)\, T_\ell.$$
Therefore, the complex gradient of the loss function takes the form
$$\nabla \mathcal{L}(x) = \sum_{\ell=1}^{L} J_{g_\ell}^H A^H\!\left(A g_\ell - y_\ell \odot \mathrm{sgn}(A g_\ell)\right) = -\frac{2\pi i}{\lambda} \sum_{\ell=1}^{L} T_\ell^H\, \mathrm{diag}(\bar{g}_\ell)\, A^H\!\left(A g_\ell - y_\ell \odot \mathrm{sgn}(A g_\ell)\right). \tag{2}$$
To find the Hessian, first consider the smoothed 1D problem in the form
$$\mathcal{L}_\epsilon(x) = \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left( \left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}} - y_{m,\ell} \right)^2, \tag{3}$$
where $a_m$ represents the $m$th row of $A$ as a column vector and $y_{m,\ell}$ is the $m$th entry of $y_\ell$. Rewriting (3) as a holomorphic function of $g_\ell$ and its conjugate, we obtain
$$\left( \frac{\partial}{\partial x} \mathcal{L}_\epsilon(x) \right)^{T} = \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{\left( g_\ell^T (a_m a_m^H)^T \bar{g}_\ell + \epsilon \right)^{\frac{1}{2}} - y_{m,\ell}}{\left( g_\ell^T (a_m a_m^H)^T \bar{g}_\ell + \epsilon \right)^{\frac{1}{2}}}\; J_{g_\ell}^{T} (a_m a_m^H)^{T}\, \bar{g}_\ell,$$
and therefore by substituting the Jacobian from (1) we have
$$\left( \frac{\partial}{\partial x} \mathcal{L}_\epsilon(x) \right)^{H} = -\frac{2\pi i}{\lambda} \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}} - y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}}\; T_\ell^{H}\, \mathrm{diag}(\bar{g}_\ell)\, (a_m a_m^H)\, g_\ell.$$
Now, applying the chain rule we obtain the second derivatives as
$$
\begin{aligned}
H_{gg} &= \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x} \mathcal{L}_\epsilon(x) \right)^{H}
= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left[ \frac{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}} - y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} + \frac{1}{2}\, \frac{|a_m^H g_\ell|^2\, y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}} \right] J_{g_\ell}^{H} (a_m a_m^H) J_{g_\ell} \\
&= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left[ 1 - \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} + \frac{1}{2}\, \frac{|a_m^H g_\ell|^2\, y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}} \right] T_\ell^{H}\, \mathrm{diag}(\bar{g}_\ell) (a_m a_m^H)\, \mathrm{diag}(g_\ell)\, T_\ell,
\end{aligned}
$$
$$
\begin{aligned}
H_{g\bar{g}} &= \frac{\partial}{\partial \bar{x}} \left( \frac{\partial}{\partial x} \mathcal{L}_\epsilon(x) \right)^{H}
= \frac{\partial}{\partial \bar{x}} \left[ -\frac{2\pi i}{\lambda} \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}} - y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}}\; T_\ell^{H}\, \mathrm{diag}\!\left[ (a_m a_m^H) g_\ell \right] \bar{g}_\ell \right] \\
&= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{1}{2}\, \frac{(a_m^H g_\ell)^2\, y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}}\; J_{g_\ell}^{H} (a_m a_m^T) J_{\bar{g}_\ell}
- \frac{2\pi i}{\lambda} \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left( 1 - \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} \right) T_\ell^{H}\, \mathrm{diag}\!\left[ (a_m a_m^H) g_\ell \right] J_{\bar{g}_\ell}.
\end{aligned}
$$
To find the largest singular value of the Hessian we want to upper bound the quadratic form
$$\begin{bmatrix} u \\ \bar{u} \end{bmatrix}^{H} \nabla^2 \mathcal{L}_\epsilon(x) \begin{bmatrix} u \\ \bar{u} \end{bmatrix} = u^H H_{gg}\, u + u^H H_{g\bar{g}}\, \bar{u} + u^T H_{\bar{g}g}\, u + u^T H_{\bar{g}\bar{g}}\, \bar{u}.$$
The first term takes the form
$$u^H H_{gg}\, u = \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left[ 1 - \frac{1}{2}\, \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} - \frac{\epsilon}{2}\, \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}} \right] |a_m^H J_{g_\ell} u|^2.$$
For the mixed terms we have
$$
\begin{aligned}
u^H H_{g\bar{g}}\, \bar{u} + u^T H_{\bar{g}g}\, u &= 2\, \Re\!\left( u^H H_{g\bar{g}}\, \bar{u} \right) \\
&= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}}\, \Re\!\left( (a_m^H g_\ell)^2\, (u^H J_{g_\ell}^H a_m)^2 \right) \\
&\quad - \frac{8\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left( 1 - \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} \right) \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ (a_m a_m^H) g_\ell \odot g_\ell \right] T_\ell\, \bar{u} \right) \\
&= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}}\, \Re\!\left( (a_m^H g_\ell)^2\, (u^H J_{g_\ell}^H a_m)^2 \right)
- \frac{8\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right).
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
\begin{bmatrix} u \\ \bar{u} \end{bmatrix}^{H} \nabla^2 \mathcal{L}_\epsilon(x) \begin{bmatrix} u \\ \bar{u} \end{bmatrix}
&= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left[ 1 - \frac{1}{2}\, \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{1}{2}}} - \frac{\epsilon}{2}\, \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}} \right] |a_m^H J_{g_\ell} u|^2 \\
&\quad + \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}}\, \Re\!\left( (a_m^H g_\ell)^2\, (u^H J_{g_\ell}^H a_m)^2 \right)
- \frac{8\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right) \\
&= \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \left[ 1 - \epsilon\, \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}} \right] |a_m^H J_{g_\ell} u|^2 \\
&\quad + \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} \frac{y_{m,\ell}}{\left( |a_m^H g_\ell|^2 + \epsilon \right)^{\frac{3}{2}}}\, \Re\!\left( (a_m^H g_\ell)^2\, (u^H J_{g_\ell}^H a_m)^2 - |a_m^H g_\ell|^2\, |u^H J_{g_\ell}^H a_m|^2 \right)
- \frac{8\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right) \\
&\le 2 \sum_{\ell=1}^{L} \sum_{m=1}^{M/L} |a_m^H J_{g_\ell} u|^2 - \frac{8\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right) \\
&= 2 \left[ \sum_{\ell=1}^{L} u^H J_{g_\ell}^{H} \left( \sum_{m=1}^{M/L} a_m a_m^H \right) J_{g_\ell}\, u
- \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right) \right]. \tag{4}
\end{aligned}
$$
Note that the diagonal matrix in the second term, $D_\ell = \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right]$, is directly calculated in each iteration, since it is the gradient corresponding to a certain angle before applying the adjoint operator $T_\ell^{H}$.
Focusing on the first term in Eq. (4) and letting $P = \sum_{k=1}^{K} \mathrm{diag}(p_k)^H \mathrm{diag}(p_k)$, a PSD diagonal matrix, we obtain
$$
\begin{aligned}
\sum_{\ell=1}^{L} u^H J_{g_\ell}^{H} \left( \sum_{m=1}^{M/L} a_m a_m^H \right) J_{g_\ell}\, u
&= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} u^H T_\ell^{H}\, \mathrm{diag}(g_\ell)^H P\, \mathrm{diag}(g_\ell)\, T_\ell\, u \\
&\le \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 \left\| T_\ell u \right\|^2
= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 \left\| \mathcal{F}\{T_\ell u\} \right\|^2
= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 \left\| \mathcal{F}\{u\}_\ell \right\|^2,
\end{aligned}
$$
where we first applied Parseval's theorem followed by the Fourier-slice theorem. $\mathcal{F}\{u\}_\ell$ denotes the slice in Fourier domain corresponding to angle $\ell$ in spatial domain. To maximize this sum we have to allocate the total energy of $u$ at the intersection of all slices, that is $\mathcal{F}\{u\}(k) = \delta(k)$. Therefore, the following holds:
$$\sum_{\ell=1}^{L} u^H J_{g_\ell}^{H} \left( \sum_{m=1}^{M/L} a_m a_m^H \right) J_{g_\ell}\, u \le \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 \left\| u \right\|^2.$$
Let $q_\ell = -\frac{2\pi i}{\lambda}\, T_\ell \bar{u}$; then the second term in Eq. (4) satisfies
$$
\begin{aligned}
-\frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \Re\!\left( u^H T_\ell^{H}\, \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] T_\ell\, \bar{u} \right)
&= \sum_{\ell=1}^{L} \Re\!\left( q_\ell^T D_\ell\, q_\ell \right)
\le \sum_{\ell=1}^{L} \left\| |D_\ell| \right\|_2 \left\| q_\ell \right\|^2
= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| |D_\ell| \right\|_2 \left\| T_\ell \bar{u} \right\|^2 \\
&= \frac{4\pi^2}{\lambda^2} \sum_{\ell=1}^{L} \left\| |D_\ell| \right\|_2 \left\| T_\ell u \right\|^2
\le \frac{4\pi^2}{\lambda^2} \left( \sum_{\ell=1}^{L} \left\| |D_\ell| \right\|_2 \right) \left\| u \right\|^2.
\end{aligned}
$$
To summarize the above results, we conclude that
$$
\begin{aligned}
\begin{bmatrix} u \\ \bar{u} \end{bmatrix}^{H} \nabla^2 \mathcal{L}_\epsilon(x) \begin{bmatrix} u \\ \bar{u} \end{bmatrix}
&\le 2\, \frac{4\pi^2}{\lambda^2} \left[ \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 + \left\| |D_\ell| \right\|_2 \right] \left\| u \right\|^2
= \frac{4\pi^2}{\lambda^2} \left[ \sum_{\ell=1}^{L} \left\| P\, \mathrm{diag}(|g_\ell|^2) \right\|_2 + \left\| |D_\ell| \right\|_2 \right] \left\| \begin{bmatrix} u \\ \bar{u} \end{bmatrix} \right\|^2 \\
&= \frac{4\pi^2}{\lambda^2} \left[ \sum_{\ell=1}^{L} \left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2)\, \mathrm{diag}(|g_\ell|^2) \right\|_2 + \left\| \mathrm{diag}\!\left[ \left( \tfrac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell \right] \right\|_2 \right] \left\| \begin{bmatrix} u \\ \bar{u} \end{bmatrix} \right\|^2.
\end{aligned}
$$
The final result is an iteration-dependent upper bound on the largest singular value of the loss Hessian that motivates our practical step size selection in (2.6). However, for the following convergence results to hold we need to find an upper bound that is satisfied in each iteration. First, note that $\|g_\ell\|_{\ell_\infty} \le 1$, reflecting the fact that a passive medium can only attenuate the incident beam. Hence,
$$\left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2)\, \mathrm{diag}(|g_\ell|^2) \right\|_2 \le \left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2) \right\|_2.$$
Notice that
$$\left( \frac{\partial}{\partial g_\ell} \mathcal{L}_\epsilon(x) \right)^{H} \odot g_\ell = \mathrm{diag}(g_\ell)\, A^H\!\left( A g_\ell - y_\ell \odot \mathrm{sgn}(A g_\ell) \right).$$
We are going to bound the $\ell_\infty$ norm of each term of the above quantity:
$$\left\| \mathrm{diag}(g_\ell)\, A^H A g_\ell \right\|_{\ell_\infty}
\le \left\| A^H A g_\ell \right\|_{\ell_\infty} \left\| g_\ell \right\|_{\ell_\infty}
\le \lambda_{\max}(A^H A) \left\| g_\ell \right\|_{\ell_2}
\le \lambda_{\max}(A^H A)\, \sqrt{P},$$
and
$$\left\| \mathrm{diag}(g_\ell)\, A^H (y_\ell \odot \mathrm{sgn}(A g_\ell)) \right\|_{\ell_\infty}
\le \left\| A^H (y_\ell \odot \mathrm{sgn}(A g_\ell)) \right\|_{\ell_\infty}
\le \left\| A^H \right\|_{\ell_2} \left\| y_\ell \odot \mathrm{sgn}(A g_\ell) \right\|_{\ell_2}
\le \sqrt{\lambda_{\max}(A^H A)}\, \left\| y_\ell \right\|_{\ell_2}.$$
Therefore, the Hessian spectral norm is upper bounded by
$$\Gamma = \frac{4\pi^2}{\lambda^2} \left[ (1 + \sqrt{P})\, L\, \left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2) \right\|_2 + \sqrt{\left\| \sum_{k=1}^{K} \mathrm{diag}(|p_k|^2) \right\|_2}\; \sum_{\ell=1}^{L} \left\| y_\ell \right\|_{\ell_2} \right], \tag{5}$$
independent of $\tau$.
Let $\mathcal{L}^{\epsilon}_{\mathrm{total}}(x) = \mathcal{L}_\epsilon(x) + h(x)$ be the smoothed version of the total loss, where $h(x)$ is an arbitrary convex scalar function. Using the Wirtinger-derivative version of Taylor's approximation theorem on $\mathcal{L}_\epsilon(x)$, the total loss at consecutive iterations can be written as
$$
\begin{aligned}
\mathcal{L}^{\epsilon}_{\mathrm{total}}(x_{\tau+1})
&= \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_\tau)
+ \begin{bmatrix} \nabla \mathcal{L}_\epsilon(x_\tau) \\ \overline{\nabla \mathcal{L}_\epsilon(x_\tau)} \end{bmatrix}^{H}
\begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix}
+ \frac{1}{2} \begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix}^{H}
\left( \int_0^1 \nabla^2 \mathcal{L}_\epsilon\!\left( x_\tau + t (x_{\tau+1} - x_\tau) \right) dt \right)
\begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix}
+ h(x_{\tau+1}) \\
&\le \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_\tau)
+ \begin{bmatrix} \nabla \mathcal{L}_\epsilon(x_\tau) \\ \overline{\nabla \mathcal{L}_\epsilon(x_\tau)} \end{bmatrix}^{H}
\begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix}
+ \frac{\Gamma}{2} \left\| \begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix} \right\|_{\ell_2}^{2}
+ h(x_{\tau+1}) \\
&= \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_\tau)
+ \mu \begin{bmatrix} (x_{\tau+1} - x_\tau)/\mu + \nabla \mathcal{L}_\epsilon(x_\tau) \\ \overline{(x_{\tau+1} - x_\tau)/\mu + \nabla \mathcal{L}_\epsilon(x_\tau)} \end{bmatrix}^{H}
\begin{bmatrix} (x_{\tau+1} - x_\tau)/\mu \\ \overline{(x_{\tau+1} - x_\tau)/\mu} \end{bmatrix}
- \mu \left( 1 - \frac{\Gamma \mu}{2} \right) \left\| \begin{bmatrix} (x_{\tau+1} - x_\tau)/\mu \\ \overline{(x_{\tau+1} - x_\tau)/\mu} \end{bmatrix} \right\|_{\ell_2}^{2}
+ h(x_{\tau+1}). \tag{6}
\end{aligned}
$$
By the definition of the proximal operator
$$x_{\tau+1} = \arg\min_{z \in \mathbb{C}^N} \frac{1}{2\mu} \left\| z - \left( x_\tau - \mu \nabla \mathcal{L}_\epsilon(x_\tau) \right) \right\|_{\ell_2}^{2} + h(z) = x_\tau - \mu\, G_\tau(x_\tau),$$
where $G_\tau(x)$ is the generalized gradient at $x$ in iteration $\tau$. Due to the necessary condition of optimality, we must have
$$z - \left( x_\tau - \mu \nabla \mathcal{L}_\epsilon(x_\tau) \right) + \mu v = 0,$$
where $v \in \partial h(z)$ is a subgradient of $h(x)$ at $z$. Substituting $z = x_\tau - \mu\, G_\tau(x_\tau)$ yields
$$v = G_\tau(x_\tau) - \nabla \mathcal{L}_\epsilon(x_\tau) = -(x_{\tau+1} - x_\tau)/\mu - \nabla \mathcal{L}_\epsilon(x_\tau).$$
Since $h(x)$ is convex and $v \in \partial h(x_{\tau+1})$ we have
$$h(x_{\tau+1}) \le h(x_\tau) + \mu \begin{bmatrix} -(x_{\tau+1} - x_\tau)/\mu - \nabla \mathcal{L}_\epsilon(x_\tau) \\ \overline{-(x_{\tau+1} - x_\tau)/\mu - \nabla \mathcal{L}_\epsilon(x_\tau)} \end{bmatrix}^{H}
\begin{bmatrix} (x_{\tau+1} - x_\tau)/\mu \\ \overline{(x_{\tau+1} - x_\tau)/\mu} \end{bmatrix}.$$
Combining this result with (6) yields
$$\mathcal{L}^{\epsilon}_{\mathrm{total}}(x_{\tau+1}) - \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_\tau)
\le -\mu \left( 1 - \frac{\Gamma \mu}{2} \right) \left\| \begin{bmatrix} (x_{\tau+1} - x_\tau)/\mu \\ \overline{(x_{\tau+1} - x_\tau)/\mu} \end{bmatrix} \right\|_{\ell_2}^{2}
\le -\frac{\Gamma}{2} \left\| \begin{bmatrix} x_{\tau+1} - x_\tau \\ \overline{x_{\tau+1} - x_\tau} \end{bmatrix} \right\|_{\ell_2}^{2}
= -\Gamma\, \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2}^{2}.$$
Summing over both sides up to some fixed iteration $T$ we have
$$\Gamma \sum_{\tau=0}^{T} \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2}^{2}
\le \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_0) - \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_{T+1})
\le \mathcal{L}^{\epsilon}_{\mathrm{total}}(x_0) - \mathcal{L}^{\epsilon}_{\mathrm{total}}(x^{*})$$
for a global minimizer $x^{*}$ of $\mathcal{L}^{\epsilon}_{\mathrm{total}}(x)$. Since the above expression holds for any $\epsilon$, we take $\epsilon \to 0$ and obtain
$$\Gamma \sum_{\tau=0}^{T} \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2}^{2} \le \mathcal{L}_{\mathrm{total}}(x_0) - \mathcal{L}_{\mathrm{total}}(x^{*}).$$
Since the series on the left hand side converges, we must have
$$\lim_{\tau \to \infty} \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2} = \lim_{\tau \to \infty} \left\| \mathrm{prox}_h(z_\tau) - x_\tau \right\|_{\ell_2} = 0.$$
Moreover,
$$\Gamma \sum_{\tau=0}^{T} \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2}^{2} \ge \Gamma\, (T+1) \min_{\tau \in \{0,1,\ldots,T\}} \left\| x_{\tau+1} - x_\tau \right\|_{\ell_2}^{2},$$
and therefore
$$\min_{\tau \in \{1,2,\ldots,T\}} \left\| \mathrm{prox}_h(z_\tau) - x_\tau \right\|_{\ell_2}
\le \frac{\mathcal{L}_{\mathrm{total}}(x_0) - \mathcal{L}_{\mathrm{total}}(x^{*})}{\Gamma (T+1)}
\le \frac{\mathcal{L}_{\mathrm{total}}(x_0) - \mathcal{L}_{\mathrm{total}}(x^{*})}{\mu (T+1)}.$$
We conclude the proof of Theorem 2.4.1 by picking $h(x) = \mathrm{TV}_{3D}(x; w)$, which is a convex function of $x$. Note that the same proof methodology works for any other convex regularizer, and includes total-variation as a special case.
B Appendix to Data augmentation for deep learning based accelerated
MRI reconstruction with limited data
Appendix outline
The following appendix provides additional experimental details, enlarged images of reconstructed slices
and extra discussions not included in the main paper. The organization of the supplementary material is
as follows:
FastMRI dataset. Appendix B.1 provides additional details on the fastMRI dataset and the experimental setup used in our experiments. We plot reconstruction metric results in PSNR in addition to the
SSIM comparison in the main paper (Fig. 3.7) and demonstrate gains comparable to those measured in SSIM.
Furthermore, we plot randomly picked reconstructions from the validation set in order to provide a comprehensive view of reconstruction quality. Finally, we apply MRAugment with a model different from the
one used in our main experiments on the fastMRI dataset to demonstrate the wider applicability of DA for
deep learning based MR reconstruction.
Stanford datasets. In Appendices B.2 and B.3 we provide more details on the Stanford datasets and
the experimental details. Moreover, further reconstructed slices are depicted complementing the ones in
the main paper.
Robustness. In Appendix B.4 we give more details on the robustness experiments from Section 3.4.4,
describing the MR scanner models used in the experiments.
Ablation studies. In Appendix B.5 we perform ablation studies on the fastMRI dataset to investigate
the utility of various augmentations and the effect of augmentation scheduling on the final reconstruction.
Finally, our code is published at https://github.com/MathFLDS/MRAugment. We refer to this code for additional detail regarding the implementation. We note that the MRAugment pipeline can be seamlessly integrated with any existing MR reconstruction code, and can be added to the fastMRI code base with only a couple of lines of additional code. We hope that its utility and ease of use will make MRAugment valuable to a wide range of practitioners.
B.1 Experiments on the fastMRI dataset
B.1.1 Experimental details
The fastMRI dataset [191] is a large open dataset of knee and brain MRI volumes. The train and validation splits contain fully-sampled k-space volumes and corresponding target reconstructions for both
(simulated) single-coil and multi-coil acquisition. The knee MRI dataset we are focusing on in this paper
includes 973 train volumes (34742 slices) and 199 validation volumes (7135 slices). The target reconstructions are fixed size 320 × 320 center cropped images corresponding to the fully-sampled data of varying
sizes. The undersampling ratio is either 25% (4× acceleration) or 12.5% (8× acceleration). Undersampling
is performed along the phase encoding dimension in k-space, that is columns in k-space are sampled. A
certain neighborhood of adjacent low-frequency lines are always included in the measurement. The size
of this fully-sampled region is 8% of all frequencies in case of 4× acceleration and 4% in case of 8×
acceleration.
Dataset sampling. We use the fastMRI [191] single-coil and multi-coil knee dataset for our experiments. For creating the sub-sampled datasets, we uniformly sample volumes from the training set, and add
all slices from the sampled volumes. Our validation results are reported on the whole validation dataset.
Images in the dataset have varying dimensions. Due to GPU memory considerations we center-cropped
the input images to 640 × 368 pixels (which covers most of the images). We use random undersampling
masks with 8× acceleration and a 4% fully-sampled low-frequency band, undersampled in the phase encoding direction by masking whole k-space lines. We generate a new random mask for each slice on-the-fly
while training, but use the same fixed mask for each slice within the same volume on the validation set
(different across volumes).
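To make the masking procedure concrete, the following is a minimal Python/NumPy sketch of how such a random column mask could be generated; the function name and the exact probability calculation are illustrative assumptions, and we refer to the fastMRI and MRAugment source code for the actual implementation.

import numpy as np

def random_column_mask(num_cols, acceleration=8, center_fraction=0.04, rng=None):
    # Sample a random phase-encoding (column) mask with a fully-sampled center band.
    rng = np.random.default_rng() if rng is None else rng
    num_low_freq = int(round(num_cols * center_fraction))
    # Probability for the remaining columns so that roughly num_cols / acceleration
    # columns are kept in total (illustrative; the exact rule follows the source code).
    prob = (num_cols / acceleration - num_low_freq) / (num_cols - num_low_freq)
    mask = rng.uniform(size=num_cols) < prob
    pad = (num_cols - num_low_freq + 1) // 2
    mask[pad:pad + num_low_freq] = True  # always keep the low-frequency band
    return mask

# Usage: zero out non-sampled columns of a (coils, height, width) k-space array.
# kspace_undersampled = kspace * random_column_mask(kspace.shape[-1])[None, None, :]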
Model. We train the default E2E-VarNet network from [158] with 12 cascades (approx. 30M parameters) for both the single-coil and multi-coil reconstruction problems. For single-coil data we remove the
Sensitivity Map Estimation sub-network as sensitivity maps are not relevant in this problem.
Hyperparameters and training. We use an Adam optimizer with 0.0003 learning rate following
[158]. We train the baseline model on the full training dataset for 50 epochs. For the smaller, sub-sampled
datasets we train for the same computational cost as the baseline, that is we train for N · 50 epochs on
1/Nth of the training data. Without data augmentation, we observe a saturation in validation SSIM during
this time. With data augmentation we trained 50% longer as we still observe improvement in validation
performance after the standard number of epochs. We report the best SSIM on the validation set throughout
training. We train on 4 GPUs for single-coil data and on 8 GPUs for multi-coil data. The batch size matches
the number of GPUs used for training, since a GPU can only hold a single datapoint.
Data augmentation parameters. The transformations and their corresponding probability weights
and ranges of values are depicted in Table A2. We adjust the weights so that groups of transformations such
as rotation (arbitrary, by k · 90◦
), flipping (horizontal or vertical) or scaling (isotropic or anisotropic) have
similar probabilities. For both the affine transformations and upsampling we use bicubic interpolation.
Due to computational considerations we only use upsampling before transformations for the single-coil
experiments.
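As an illustration of how the weighted transformations in Table A2 are applied, below is a minimal sketch; the callables, weight list and probability p are placeholders, and the real pipeline additionally keeps the k-space data consistent with the augmented image (see the source code).

import random

def apply_random_augmentations(image, transforms, weights, p):
    # Apply each candidate transform independently with probability p * w_i, where p is
    # the scheduled augmentation probability and w_i the weight assigned to the transform.
    for transform, w in zip(transforms, weights):
        if random.random() < p * w:
            image = transform(image)
    return image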
B.1.2 Additional experimental results on the fastMRI dataset
Comparison of additional metrics. In order to provide a more in-depth comparison for our main experiment, here we report results on PSNR as an additional image quality metric, extending our results from Figure 3.7. We observe significant and consistent improvement in PSNR when applying MRAugment (Fig. A1), with trends similar to SSIM: the improvement is most prominent in the low-data regime, but still significant in the moderate-data regime.
Figure A1: Single-coil (left) and multi-coil (right) validation PSNR vs. # of training images.
Additional reconstructions. In order to demonstrate that MRAugment works well across a wide
range of MR slices, here we provide additional reconstructions with and without data augmentation. In
multi-coil reconstructions the visual differences are more subtle, therefore we magnified regions with fine
details for better comparison.
Figures A5 and A6 provide a comprehensive set of reconstructions across all subsampling ratios with
and without data augmentation for the single-coil and multi-coil slices additional to the ones presented in
Figure 3.6.
Even though the most visible improvement on reconstructions is observed when training data is especially low (1% subsampling), Figure A2 provides a closer look at a slice where significant details are
recovered by MRAugment using 10% of training data.
Figures A7 and A8 provide more reconstructed slices randomly sampled from the validation dataset
with and without DA.
Other models. Even though we demonstrated our DA pipeline on E2E-VarNet, the potential of our
technique is not limited to a specific model. We performed preliminary experiments on i-RIM [127], another high performing model on single-coil MR reconstruction. We kept the hyperparameters proposed in
[127] for the single-coil problem with modifications as follows. Due to computational considerations, we
Figure A2: MRAugment recovers additional details in the moderate-data regime when 10% of fastMRI
training data is used.
decreased the number of invertible layers to 6 with [64, 128, 256, 256, 128, 64] hidden features inside the
reversible blocks and [1, 2, 4, 4, 2, 1] kernel strides in each layer, resulting in a model with 20M parameters. In order to further reduce training time, we trained on volumes without fat suppression that take up
50% of the full fastMRI knee dataset. We refer to this new reduced dataset as ’full’ in this section. Finally,
we trained on 368×368 center crops of input images for each experiment. We used ramp scheduling with augmentation probability linearly increasing from 0 to pmax = 0.4. The acceleration factor and undersampling mask were the same as for other experiments. As depicted in Fig. A3, our experiments show
that applying data augmentation to only 10% of the training data can match the performance of the model
trained on the full dataset.
Figure A3: Results on the i-RIM network. We achieve SSIM comparable to the baseline with only 10% of
the training data.
B.2 Experiments on the Stanford 2D FSE dataset
The Stanford 2D FSE [22] dataset is a public dataset of 89 fully-sampled MRI volumes of various anatomies
including lower extremity, pelvis and cardiac images. All measurements have been acquired by the same
MRI scanner using multi-coil acquisition, however volume dimensions and the number of receiver coils
vary between volumes. The total number of MRI slices is about 5% of the fastMRI knee training dataset.
Dataset sampling. When random sampling, we randomly select volumes of the original dataset and
add all slices of the sampled volumes. For volumes where multiple contrasts are available, we arbitrarily
pick the first one and discard the others. We scale all measurements by 10^{-7} to approximately match the
range of fastMRI measurements. Unlike the fastMRI dataset, Stanford 2D FSE is not separated into training,
validation and test sets. Therefore, we use an 80%–20% training-validation split in our experiments, where
we generate 5 random splits in order to minimize variations in reconstruction metrics due to validation
set selection and show the mean of validation SSIMs over all 5 runs. When performing experiments on
less training data, we keep 20% of the full dataset as validation set and only subsample the train split. We
use no center-cropping on the training images as volume dimensions vary strongly. We undersample the
measurements by a factor of 8 and generate masks the same way as in the fastMRI experiments detailed
in Section B.1.
Model. We train the default E2E-VarNet network as used in the multi-coil fastMRI experiments detailed in Section B.1.
Hyperparameters and training. We use Adam optimizer with a learning rate of 0.0003 as in our
other experiments. For the baseline experiments without data augmentation, we train the model for 50
epochs, after which we see no significant improvement in reconstruction SSIM and the model overfits to
the training dataset. With data augmentation we train for 200 epochs, as validation SSIM increases well
after 50 epochs and we observe no overfitting. In all experiments, we report the mean of best validation
SSIMs over 5 independent runs.
Data augmentation parameters. In all data augmentation experiments on the Stanford 2D FSE
dataset we use exponential scheduling with pmax = 0.55. The range of values for the various transformations is almost identical to that in Table A2. For more details we refer the reader to the attached source
code.
B.3 Experiments on the Stanford Fullysampled 3D FSE Knees dataset
The Stanford Fullysampled 3D FSE Knees dataset [143] is a public MRI dataset of 20 fully-sampled k-space
volumes of knees, acquired by the same MRI scanner. Each volume consists of 256 slices of 320 × 320
images with a multi-coil acquisition using 8 receiver coils. The full dataset consists of 5120 slices, or about
15% of the fastMRI knee training dataset.
Experimental setup. We apply the same dataset sampling, metric reporting method, model and
hyperparameters (including data augmentation scheduling) as in the Stanford 2D FSE experiments in Section B.2. We scale all original measurements by 10^{-6}.
Figure A4: Visual comparison of reconstructions on the Stanford Fullysampled 3D FSE Knees dataset under
various amounts of training data, with and without data augmentation.
B.4 Robustness experiments
Validation on unseen MRI scanners. We explore how data augmentation impacts generalization to
new MRI scanner models not available in training time. Different MRI scanners may use different field
strengths for acquisition, and higher field strength typically correlates with higher SNR. Volumes in the
fastMRI knee dataset have been acquired by the following 4 different scanners (followed by field strength):
MAGNETOM Aera (1.5T), MAGNETOM Skyra (3T), Biograph mMR (3T) and MAGNETOM Prisma Fit (3T).
The number of slices acquired by the different scanners are shown in Table A1. We perform the following
experiments:
• 3T → 3T: We train and validate on volumes acquired by scanners with 3T field strength. The training set only contains scans from MAGNETOM Skyra and MAGNETOM Prisma Fit, and we validate
on Biograph mMR scans.
• 3T → 1.5T: We train on all volumes acquired by 3T scanners and validate on the 1.5T scanner
(MAGNETOM Aera).
• 1.5T → 3T: We train on all volumes acquired by the 1.5T scanner and validate on all other 3T
scanners.
We combine volumes corresponding to the same scanner model from the official train and validation sets
to form the validation set, however for training we only use volumes of the given model from the training
set. Table 3.1 summarizes our results.
B.5 Ablation studies
Transformations. We performed ablation studies on 1% of the fastMRI knee training dataset in order to
better understand which augmentations are useful. We use the multi-coil experiment with all augmentations as baseline and tune augmentation probability for other experiments such that the probability that
Model Slices in train Slices in val
Aera (1.5T) 13856 3200
Skyra (3T) 15370 2967
Biograph (3T) 3961 748
Prisma Fit (3T) 1555 220
Table A1: Number of available slices for each
scanner type in the train and validation splits of
the fastMRI dataset.
Transform Range of values wi
H-flip flipped/not flipped 0.5
V-flip flipped/not flipped 0.5
Rot. by k · 90° k ∈ {0, 1, 2, 3} 0.5
Rotation [−180°, 180°] 0.5
Translation x: [−8%, 8%], y: [−12.5%, 12.5%] 1.0
Iso. scaling [0.75, 1.25] 0.5
Aniso. scaling [0.75, 1.25] along each axis 0.5
Shearing [−12.5°, 12.5°] 1.0
Table A2: Data augmentation configuration for all fastMRI experiments.
Augmentations SSIM
none 0.8396
pixel preserving only 0.8585
interpolating only 0.8731
all augmentations 0.8758
Table A3: Comparison of peak validation SSIM
applying various sets of augmentations on 1% of
fastMRI training data, multi-coil acquisition.
Augmentation scheduling SSIM
none 0.8396
exponential, 0.3 0.8565
constant, 0.3 0.8588
exponential, 0.6 0.8758
constant, 0.6 0.8611
exponential, 0.8 0.8600
Table A4: Comparison of peak validation SSIM
using different augmentation probability schedules on 1% of fastMRI training data, multi-coil acquisition.
a certain slice is augmented by at least one augmentation is the same across all experiments. We depict
results on the validation dataset in Table A3. Both pixel preserving and general (interpolating) affine transformations are useful and can significantly increase reconstruction quality. Furthermore, we observe that
their effect is complementary: they are helpful separately, but we achieve peak reconstruction SSIM when
all applied together. Finally, the utility of pixel preserving augmentations seems to be lower than that of
general affine augmentations, however they come with a negligible additional computational cost.
Augmentation scheduling. Furthermore, we investigate the effect of varying the augmentation probability scheduling function. The results on the validation dataset are depicted in Table A4, where exponential, p̂ denotes the exponential scheduling function
$$p(t) = \frac{p_{max}}{1 - e^{-c}}\left(1 - e^{-tc/T}\right),$$
with p_max = p̂, and constant, p̂ means we use a fixed augmentation probability p̂ throughout training.
We observe that scheduling starting from low augmentation probability and gradually increasing is better
than a constant probability, as initially the network does not benefit much from data augmentation as it
can still learn from the original samples. Furthermore, too low or too high augmentation probability both
degrade performance. If the augmentation probability is too low, the network may overfit to training data
as more regularization is needed. On the other hand, too much data augmentation hinders reconstruction
performance as the network rarely sees images close to the original training distribution.
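For reference, a minimal sketch of the exponential scheduling function above; the default decay constant c shown here is a placeholder, and the exact value used in our experiments is set in the source code.

import math

def augmentation_probability(t, T, p_max, c=5.0):
    # p(t) = p_max / (1 - exp(-c)) * (1 - exp(-t * c / T)), where t is the current epoch,
    # T the total number of epochs, p_max the peak probability and c a decay constant.
    return p_max / (1.0 - math.exp(-c)) * (1.0 - math.exp(-t * c / T))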
Figure A5: Visual comparison of fastMRI single-coil reconstructions presented in Figure 3.6 extended with
additional images corresponding to various amounts of training data.
Figure A6: Visual comparison of fastMRI multi-coil reconstructions presented in Figure 3.6 extended with
additional images corresponding to various amounts of training data.
Ground truth 100% train 1% train + DA 1% train
Figure A7: Visual comparison of fastMRI single-coil reconstructions using varying amounts of training
data with and without data augmentation.
Ground truth 100% train 1% training data + DA 1% training data
Figure A8: Visual comparison of fastMRI multi-coil reconstructions using varying amounts of training
data with and without data augmentation.
Figure A9: Visual comparison of reconstructions on the Stanford 2D FSE dataset under various amounts of
training data, with and without data augmentation.
C Appendix to HUMUS-Net: Hybrid unrolled multi-scale network
architecture for accelerated MRI reconstruction
C.1 HUMUS-Net baseline details
Our default model has 3 RSTB-D downsampling blocks, 2 RSTB-B bottleneck blocks and 3 RSTB-U upsampling blocks with 3 − 6 − 12 attention heads in the D/U blocks and 24 attention heads in the
bottleneck block. For Swin Transformers layers, the window size is 8 for all methods and MLP ratio
(hidden_dim/input_dim) of 2 is used. Each RSTB block consists of 2 STLs with embedding dimension of
66. For HUMUS-Net-L, we increase the embedding dimension to 96. We use 8 cascades of unrolling with
a U-Net as sensitivity map estimator (same as in E2E-VarNet) with 16 channels.
We center crop and reflection pad the input images to 384 × 384 resolution for HUMUS-Net and use
the complete images for VarNet. In all experiments, we minimize the SSIM loss between the target image x^* and the reconstruction x̂, defined as
$$\mathcal{L}_{SSIM}(x^*, \hat{x}) = 1 - \mathrm{SSIM}(x^*, \hat{x}).$$
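For clarity, a minimal sketch of this training loss, assuming some differentiable SSIM implementation ssim_fn is available; the function name is a placeholder.

def ssim_loss(target, reconstruction, ssim_fn):
    # L_SSIM(x*, x_hat) = 1 - SSIM(x*, x_hat); ssim_fn is assumed to be a
    # differentiable SSIM implementation returning a value in [0, 1].
    return 1.0 - ssim_fn(target, reconstruction)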
fastMRI experiments– We train HUMUS-Net using Adam for 50 epochs with a learning rate of
0.0001, dropped by a factor of 10 at epoch 40 and apply adjacent slice reconstruction with 3 slices. We
run the experiments on 8× Titan RTX 24GB GPUs with a per-GPU batch size of 1. We train HUMUS-Net-L using a learning rate of 0.00007 (using inverse square root scaling with the embedding dimension) for 45 epochs, dropped by a factor of 10 for a further 5 epochs. We trained HUMUS-Net-L on 4× A100 40GB GPUs
on Amazon SageMaker with per-GPU batch size of 1.
Stanford 3D experiments– We train for 25 epochs with a learning rate of 0.0001, dropped by a factor
of 10 at epoch 20 and reconstruct single slices. We run the experiments on 4× Quadro RTX 5000 16GB
GPUs with a per-GPU batch size of 1.
Stanford 2D experiments– We train for 50 epochs with a learning rate of 0.0001, dropped by a
factor of 10 at epoch 40 and reconstruct single slices. Moreover, we crop reconstruction targets to fit into
a 384 × 384 box. We run the experiments on 8× Quadro RTX 5000 16GB GPUs with a per-GPU batch size
of 1.
C.2 Ablation study experimental details
Here we discuss the experimental setting and hyperparameters used in our ablation study in Section 5.2.
In all experiments, we train the models on the Stanford 3D dataset for 3 different train-validation splits
and report the mean and standard error of the results. We use Adam optimizer with learning rate 0.0001
and train for 25 epochs, decaying the learning rate by a factor of 10 at 20 epochs. For all unrolled methods,
we unroll 12 iterations in order to provide direct comparison with the best performing E2E-VarNet, which
uses the same number of cascades. We reconstruct single slices and do not use the method of adjacent
slice reconstruction discussed in Section 4.4. In models with a sensitivity map estimator, we used the default
U-Net with 8 channels (same as default E2E-VarNet). For Swin Transformers layers, the window size is
8 for all methods and MLP ratio (hidden_dim/input_dim) of 2 is used. Further hyperparameters are
summarized in Table A5.
Method Embedding dim. # of STLs in RSTB # attention heads Patch size
Un-SS 12 2 − 2 − 2 6 − 6 − 6 1
Un-MS 12 D/U : 2 − 2 − 2, B : 2 D/U : 3 − 6 − 12, B : 24 1
Un-MS-Patch2 36 D/U : 2 − 2 − 2, B : 2 D/U : 3 − 6 − 12, B : 24 2
HUMUS-Net 36 D/U : 2 − 2 − 2, B : 2 D/U : 3 − 6 − 12, B : 24 1
SwinIR 66 6 − 6 − 6 − 6 6 − 6 − 6 − 6 1
Table A5: Ablation study experimental details. We show the number of STL layers per RSTB blocks and
number of attention heads for multi-scale networks in downsampling (D) , bottleneck (B) and upsampling
(U) paths separately.
We run the experiments on 4× Quadro RTX 5000 16GB GPUs with a per-GPU batch size of 1.
C.3 Results on additional accelerations
We observe that HUMUS-Net achieves state-of-the-art performance across a wide range of acceleration
ratios. We perform experiments on the Stanford 3D dataset using a small HUMUS-Net model with only
6 cascades, embedding dimension of 66, adjacent slice reconstruction with 3 slices and for the sake of
simplicity we removed the residual path from the HUMUS-Block. We call this model HUMUS-Net-S. We
set the learning rate to 0.0002 and train the model with the Adam optimizer. We generate 3 random training-validation splits on the Stanford 3D dataset (same splits as in other Stanford 3D experiments in this paper),
and show the mean and standard error of the results in Table A6. In our experiments, HUMUS-Net achieved
higher quality reconstructions in every metric over E2E-VarNet.
Acceleration Model SSIM(↑) PSNR(↑) NMSE(↓)
4x
E2E-VarNet 0.9623 ± 0.0038 42.9 ± 0.5 0.0103 ± 0.0001
HUMUS-Net-S 0.9640 ± 0.0040 43.3 ± 0.5 0.0096 ± 0.0002
8x
E2E-VarNet 0.9432 ± 0.0063 40.0 ± 0.6 0.0203 ± 0.0006
HUMUS-Net-S 0.9459 ± 0.0065 40.4 ± 0.7 0.0184 ± 0.0008
12x
E2E-VarNet 0.9259 ± 0.0084 37.7 ± 0.7 0.0347 ± 0.0018
HUMUS-Net-S 0.9283 ± 0.0052 38.0 ± 0.5 0.0298 ± 0.0019
Table A6: Experiments on various acceleration factors on the Stanford 3D dataset. Mean and standard
error of 3 random training-validation splits is shown.
C.4 Effect of the number of unrolled iterations
The number of cascades in unrolled networks has a fundamental impact on their performance. Deeper networks typically perform better, but also incur heavy computational and memory cost. In this experiment
we investigate the scaling of HUMUS-Net with respect to the number of iterative unrollings.
We perform an ablation study on the Stanford 3D dataset with ×8 acceleration for a fixed training-validation split, where 20% of the training set has been set aside for validation. We reconstruct single
Figure A10: Validation SSIM as a function of the number of cascades (unrolled iterations) in HUMUS-Net on
the Stanford 3D dataset. We observe a steady increase in reconstruction performance with more cascades.
slices, without applying adjacent slice reconstruction. We plot the highest SSIM throughout training on
the validation dataset for each network on Figure A10.
We observe improvements in reconstruction performance with increasing number of cascades, and
this improvement has not saturated yet in the range of model sizes we have investigated. In our main
experiments, we select 8 cascades due to memory and compute limitations for deeper models. However,
our experiment suggests that HUMUS-Net can potentially obtain even better reconstruction results given
enough computational resources.
C.5 Effect of Adjacent Slice Reconstruction
In this section, we demonstrate that even though adjacent slice reconstruction is in general helpful, it is
not the main reason why HUMUS-Net performs better than E2E-VarNet.
To this end, we add adjacent slice reconstruction to E2E-VarNet and investigate its effect on model
performance. We run experiments on the Stanford 3D dataset for 3 different random train-validation
Model ASR? SSIM PSNR NMSE
E2E-VarNet ✗ 0.9432 ± 0.0063 40.0 ± 0.6 0.0203 ± 0.0006
E2E-VarNet ✓ 0.9457 ± 0.0064 40.4 ± 0.7 0.0186 ± 0.0007
HUMUS-Net ✗ 0.9467 ± 0.0059 40.6 ± 0.6 0.0178 ± 0.0005
Table A7: Results of the adjacent slice reconstruction ablation study on the Stanford 3D dataset. Mean
and standard error over 3 random train-validation splits is shown. ASR improves the performance of E2E-VarNet. However, HUMUS-Net outperforms E2E-VarNet in all cases even without ASR.
splits. We use adjacent slice reconstruction with 3 slices. We have found that the default learning rate for
E2E-VarNet is not optimal with the increased input size, therefore we tune the learning rate using grid
search and set it to 0.0005. We compare the results with HUMUS-Net, where we match the number of
cascades in E2E-VarNet, and use the default embedding dimension of 66. We do not use adjacent slice
reconstruction when training HUMUS-Net in order to ablate its effect.
The results are summarized in Table A7. We observe that ASR boosts the reconstruction quality of
E2E-VarNet. However, ASR alone cannot close the gap between the two models, as HUMUS-Net without
ASR still outperforms the best E2E-VarNet model with ASR.
C.6 Iterative denoising visualization
Here, we provide more discussion on the iterative denoising interpretation of unrolled networks, such as
HUMUS-Net. Consider the regularized inverse problem formulation of MRI reconstruction as
$$\hat{x} = \arg\min_x \left\|\mathcal{A}(x) - \tilde{k}\right\|^2 + \mathcal{R}(x), \quad (7)$$
and design the architecture based on unrolling the gradient descent steps on the above problem. Applying
data consistency corresponds to the gradient of the first loss term in (7), whereas we learn the gradient of
the regularizer parameterized by the HUMUS-Block architecture yielding the update rule in k-space
$$\hat{k}^{t+1} = \hat{k}^t - \mu^t M(\hat{k}^t - \tilde{k}) + \mathcal{G}(\hat{k}^t).$$
Figure A11: Visualization of intermediate reconstructions in HUMUS-Net.
This can be conceptualized as alternating between enforcing consistency with the measurements and applying a denoiser based on some prior, represented by the HUMUS-Block in our architecture.
We show an example of a sequence of intermediate reconstructions in order to support the denoising
intuition in Figure A11. We plot the magnitude image of reconstructions at the outputs of consecutive
HUMUS-Blocks, along with the model input zero-padded reconstruction. We point out that in general
the intermediate reconstruction quality (for instance in terms of SSIM) is not necessarily an increasing
function of cascades. This is due to the highly non-linear nature of the mapping represented by the neural
network, which might not always be intuitive and interpretable.
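To make the alternating interpretation concrete, the following is a minimal sketch of a single unrolled update in k-space; the argument names (denoiser for G, mask for M) are illustrative assumptions, and the actual HUMUS-Net cascade additionally handles coil sensitivity estimation and wraps the image-domain denoiser with Fourier transforms.

def unrolled_cascade(k_hat, k_tilde, mask, denoiser, step_size):
    # One unrolled update: k_{t+1} = k_t - mu_t * M (k_t - k_tilde) + G(k_t).
    # k_hat: current k-space estimate, k_tilde: undersampled measurement,
    # mask: sampling mask M, denoiser: learned regularizer G applied in k-space,
    # step_size: learned step size mu_t.
    data_consistency = mask * (k_hat - k_tilde)
    return k_hat - step_size * data_consistency + denoiser(k_hat)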
C.7 Vision Transformer terminology overview
Here we provide a brief overview of the terms related to the Transformer architecture, some of which have
been carried over from the Natural Language Processing (NLP) literature. We introduce the key concepts
through the Vision Transformer architecture, which has inspired most Transformer-based architectures
in computer vision.
Traditional Transformers for NLP receive a sequence of 1D token embeddings. Each token may represent, for instance, a word in a sentence. To extend this idea to 2D images, the input image x ∈ R^{H×W×C} (H and W stand for spatial dimensions, C for the channel dimension) is split into N patches of size P × P, and each patch is flattened into a 1D vector to produce an input of flattened patches x_f with shape N × P²C:
$$x_f = \mathrm{PatchFlatten}(x).$$
In order to set the latent dimension of the input entering the Vision Transformer, a trainable linear mapping E ∈ R^{P²C×D} is applied to the flattened patches, resulting in the so-called patch embeddings z of shape N × D:
$$z = x_f E.$$
In order to encode information with respect to the position of image patches, a learnable positional embedding E_{pos} ∈ R^{N×D} is added to the patch embeddings:
$$z_0 = z + E_{pos}.$$
The input to the Transformer encoder is this N ×D representation, which we also refer to in the paper
as token representation, as each row in the representation corresponds to a token (in our case an image
patch) in the original input.
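A minimal PyTorch sketch of the patch embedding step described above; the class and argument names are illustrative, and real Vision Transformer implementations often realize the same projection with a strided convolution.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Flatten non-overlapping P x P patches, project them to D-dimensional tokens,
    # and add a learnable positional embedding. Assumes H and W are divisible by P.
    def __init__(self, img_size, patch_size, in_channels, embed_dim):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)  # E
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))    # E_pos

    def forward(self, x):
        # x: (B, H, W, C) -> flattened patches of shape (B, N, P*P*C), i.e. PatchFlatten(x)
        B, H, W, C = x.shape
        P = self.patch_size
        x = x.reshape(B, H // P, P, W // P, P, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, P * P * C)
        return self.proj(x) + self.pos_embed  # z_0 = x_f E + E_pos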
C.8 Detailed validation results
Dataset Model SSIM(↑) PSNR(↑) NMSE(↓)
fastMRI E2E-VarNet 0.8908 36.8 0.0092
HUMUS-Net 0.8934 37.0 0.0090
Stanford 2D E2E-VarNet 0.8928 ± 0.0168 33.9 ± 0.7 0.0339 ± 0.0037
HUMUS-Net 0.8954 ± 0.0136 33.7 ± 0.6 0.0337 ± 0.0024
Stanford 3D E2E-VarNet 0.9432 ± 0.0063 40.0 ± 0.6 0.0203 ± 0.0006
HUMUS-Net 0.9453 ± 0.0065 40.4 ± 0.6 0.0187 ± 0.0009
Table A8: Detailed validation results of HUMUS-Net on various datasets. For datasets with multiple train-validation split runs we show the mean and standard error of the runs.
C.9 Additional figures
Figure A12: Swin Transformer Layer,
the fundamental building block of the
Residual Swin Transformer Block.
Figure A13: Depiction of iterative unrolling with sensitivity map
estimator (SME). HUMUS-Net applies a highly efficient denoiser
to progressively improve reconstructions in a cascade of subnetworks.
Figure A14: Patch merge and expand operations used in MUST.
Figure A15: Visual comparison of reconstructions from the fastMRI knee dataset with ×8 acceleration.
HUMUS-Net reconstructs fine details on MRI images that other state-of-the-art methods may miss.
D Appendix to DiracDiffusion: Denoising and Incremental
Reconstruction with Assured Data-Consistency
D.1 Proofs
D.1.1 Denoising score-matching guarantee
Just as in standard diffusion, we approximate the score of the noisy, degraded data distribution ∇_{y_t} q_t(y_t) by matching the score of the tractable conditional distribution ∇_{y_t} q_t(y_t|x_0) via minimizing the loss in (5.12). For standard Score-Based Models with A_t = I, the seminal work of [175] guarantees that the true score is learned by denoising score-matching. More recently, [32] points out that this result holds for a wide range of corruption processes, with the technical condition that the SDP assigns non-zero probability to all y_t for any given clean image x_0. This condition is satisfied by adding Gaussian noise. For the sake of completeness, we include the theorem from [32] updated with the notation from this paper.
Theorem D.1. Let q_0 and q_t be two distributions in R^n. Assume that all conditional distributions q_t(y_t|x_0) are supported and differentiable in R^n. Let
$$J_1(\theta) = \frac{1}{2}\,\mathbb{E}_{y_t\sim q_t}\left[\|s_\theta(y_t, t) - \nabla_{y_t}\log q_t(y_t)\|^2\right], \quad (8)$$
$$J_2(\theta) = \frac{1}{2}\,\mathbb{E}_{(x_0,y_t)\sim q_0(x_0)q_t(y_t|x_0)}\left[\|s_\theta(y_t, t) - \nabla_{y_t}\log q_t(y_t|x_0)\|^2\right]. \quad (9)$$
Then, there is a universal constant C (that does not depend on θ) such that J_1(θ) = J_2(θ) + C.
The proof, which follows the calculations of [175], can be found in Appendix A.1 of [32]. This result implies that by minimizing the denoising score-matching objective in (9), the objective in (8) is also minimized; thus the true score is learned via matching the tractable conditional distribution q_t(y_t|x_0) governing SDPs.
D.1.2 Theorem 3.4.
Assumption (Lipschitzness of degradation). Assume that ‖A_t(x) − A_t(y)‖ ≤ L_x^{(t)} ‖x − y‖, ∀x, y ∈ R^n, ∀t ∈ [0, 1], and ‖A_{t'}(x) − A_{t''}(x)‖ ≤ L_t |t' − t''|, ∀x ∈ R^n, ∀t', t'' ∈ [0, 1].
Assumption (Bounded signals). Assume that each entry of clean signals x_0 is bounded as x_0[i] ≤ B, ∀i ∈ (1, 2, ..., n).
Lemma. Assume y_t = A_t(x_0) + z_t with x_0 ∼ q_0(x_0) and z_t ∼ N(0, σ_t^2 I), and that Assumption D.2 holds. Then, the Jensen gap is upper bounded as ‖E[A_{t'}(x_0)|y_t] − A_{t'}(E[x_0|y_t])‖ ≤ L_x^{(t')} √n B, ∀t, t' ∈ [0, 1].
Proof.
$$\begin{aligned}
\|\mathbb{E}[A_{t'}(x_0)|y_t] - A_{t'}(\mathbb{E}[x_0|y_t])\| &\overset{(1)}{\leq} \int \|A_{t'}(x_0) - A_{t'}(\mathbb{E}[x_0|y_t])\|\, p(x_0|y_t)\, dx_0 \\
&\overset{(2)}{\leq} \sqrt{\int \|A_{t'}(x_0) - A_{t'}(\mathbb{E}[x_0|y_t])\|^2\, p(x_0|y_t)\, dx_0} \\
&\leq L_x^{(t')} \sqrt{\int \|x_0 - \mathbb{E}[x_0|y_t]\|^2\, p(x_0|y_t)\, dx_0} \\
&\overset{(3)}{\leq} L_x^{(t')} \sqrt{\int \|x_0\|^2\, p(x_0|y_t)\, dx_0} \\
&\leq L_x^{(t')} \sqrt{\int n B^2\, p(x_0|y_t)\, dx_0} = L_x^{(t')} \sqrt{n}\, B.
\end{aligned}$$
Here (1) and (2) hold due to Jensen's inequality, and in (3) we use the fact that E[x_0|y_t] is the minimum mean-squared error (MMSE) estimator of x_0, thus we can replace it with 0 to get an upper bound.
Theorem 3.4. Let R̂(t, ∆t; y_t) = A_{t−∆t}(Φ_θ(y_t, t)) − A_t(Φ_θ(y_t, t)) denote our estimate of the incremental reconstruction, where Φ_θ(y_t, t) is trained on the loss in (13). Let R^*(t, ∆t; y_t) = E[R(t, ∆t; x_0)|y_t] denote the MMSE estimator of R(t, ∆t; x_0). If Assumptions D.3 and D.2 hold and the error in our score network is bounded by ‖s_θ(y_t, t) − ∇_{y_t} log q_t(y_t)‖ ≤ ϵ_t/σ_t^2, ∀t ∈ [0, 1], then
$$\|\hat{R}(t, \Delta t; y_t) - R^*(t, \Delta t; y_t)\| \leq \left(L_x^{(t)} + L_x^{(t-\Delta t)}\right)\sqrt{n}\,B + 2L_t\Delta t + 2\epsilon_t.$$
Proof. First, we note that due to Tweedie's formula,
$$\mathbb{E}[A_t(x_0)|y_t] = y_t + \sigma_t^2 \nabla_{y_t}\log q_t(y_t).$$
Since we parameterized our score model as
$$s_\theta(y_t, t) = \frac{A_t(\Phi_\theta(y_t, t)) - y_t}{\sigma_t^2},$$
the assumption that ‖s_θ(y_t, t) − ∇_{y_t} log q_t(y_t)‖ ≤ ϵ_t/σ_t^2 is equivalent to
$$\|A_t(\Phi_\theta(y_t, t)) - \mathbb{E}[A_t(x_0)|y_t]\| \leq \epsilon_t. \quad (10)$$
By applying the triangle inequality repeatedly, and applying Lemma D.4 and (10),
$$\begin{aligned}
\|\hat{R}(t,\Delta t; y_t) - R^*(t,\Delta t; y_t)\| &= \left\|\big(A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_t(\Phi_\theta(y_t,t))\big) - \big(\mathbb{E}[A_{t-\Delta t}(x_0)|y_t] - \mathbb{E}[A_t(x_0)|y_t]\big)\right\| \\
&\leq \|A_{t-\Delta t}(\Phi_\theta(y_t,t)) - \mathbb{E}[A_{t-\Delta t}(x_0)|y_t]\| + \|A_t(\Phi_\theta(y_t,t)) - \mathbb{E}[A_t(x_0)|y_t]\| \\
&\leq \|A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_{t-\Delta t}(\mathbb{E}[x_0|y_t]) + A_{t-\Delta t}(\mathbb{E}[x_0|y_t]) - \mathbb{E}[A_{t-\Delta t}(x_0)|y_t]\| + \epsilon_t \\
&\leq \|A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_{t-\Delta t}(\mathbb{E}[x_0|y_t])\| + L_x^{(t-\Delta t)}\sqrt{n}\,B + \epsilon_t \\
&\leq \|A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_t(\Phi_\theta(y_t,t))\| + \|A_t(\Phi_\theta(y_t,t)) - A_t(\mathbb{E}[x_0|y_t])\| \\
&\quad + \|A_t(\mathbb{E}[x_0|y_t]) - A_{t-\Delta t}(\mathbb{E}[x_0|y_t])\| + L_x^{(t-\Delta t)}\sqrt{n}\,B + \epsilon_t \\
&\leq \|A_t(\Phi_\theta(y_t,t)) - A_t(\mathbb{E}[x_0|y_t])\| + 2L_t\Delta t + L_x^{(t-\Delta t)}\sqrt{n}\,B + \epsilon_t \\
&\leq \|A_t(\Phi_\theta(y_t,t)) - \mathbb{E}[A_t(x_0)|y_t]\| + \|\mathbb{E}[A_t(x_0)|y_t] - A_t(\mathbb{E}[x_0|y_t])\| + 2L_t\Delta t + L_x^{(t-\Delta t)}\sqrt{n}\,B + \epsilon_t \\
&\leq 2L_t\Delta t + \left(L_x^{(t-\Delta t)} + L_x^{(t)}\right)\sqrt{n}\,B + 2\epsilon_t.
\end{aligned}$$
D.1.3 Incremental reconstruction loss guarantee
Assumption. The forward degradation transition function G_{t'→t''} for any t', t'' ∈ [0, 1], t' < t'', is Lipschitz continuous: ‖G_{t'→t''}(x) − G_{t'→t''}(y)‖ ≤ L_G(t', t'')‖x − y‖, ∀t', t'' ∈ [0, 1], t' < t'', ∀x, y ∈ R^n.
This is a very natural assumption, as we don’t expect the distance between two images after applying
a degradation to grow arbitrarily large.
Proposition. If the model Φ_θ(y_t, t) has large enough capacity, such that L_{IR}(∆t, θ) = 0 is achieved, then s_θ(y_t, t) = ∇_{y_t} log q_t(y_t), ∀t ∈ [0, 1]. Otherwise, if Assumption D.5 holds, then we have
$$\mathcal{L}(\theta) \leq \max_{t\in[0,1]}\left(L_G(\tau, t)\right)\,\mathcal{L}_{IR}(\Delta t, \theta). \quad (11)$$
Proof. We denote τ = max(0, t − ∆t). First, if L_{IR}(∆t, θ) = 0, then
$$A_\tau(\Phi_\theta(y_t, t)) = A_\tau(x_0)$$
for all (x_0, y_t) such that q_t(x_0, y_t) > 0. Applying the forward degradation transition function to both sides yields
$$G_{\tau\to t}(A_\tau(\Phi_\theta(y_t, t))) = G_{\tau\to t}(A_\tau(x_0)),$$
which is equivalent to
$$A_t(\Phi_\theta(y_t, t)) = A_t(x_0).$$
This in turn means that L(θ) = 0, and thus due to Theorem D.1 the score is learned.
In the more general case,
$$\begin{aligned}
\mathcal{L}(\theta) &= \mathbb{E}_{t,(x_0,y_t)}\left[w_t \|A_t(\Phi_\theta(y_t,t)) - A_t(x_0)\|^2\right] \\
&= \mathbb{E}_{t,(x_0,y_t)}\left[w_t \|G_{\tau\to t}(A_\tau(\Phi_\theta(y_t,t))) - G_{\tau\to t}(A_\tau(x_0))\|^2\right] \\
&\leq \mathbb{E}_{t,(x_0,y_t)}\left[w_t L_G(\tau,t) \|A_\tau(\Phi_\theta(y_t,t)) - A_\tau(x_0)\|^2\right] \\
&\leq \max_{t\in[0,1]}\left(L_G(\tau,t)\right)\,\mathbb{E}_{t,(x_0,y_t)}\left[w_t \|A_\tau(\Phi_\theta(y_t,t)) - A_\tau(x_0)\|^2\right] \\
&= \max_{t\in[0,1]}\left(L_G(\tau,t)\right)\,\mathcal{L}_{IR}(\Delta t, \theta).
\end{aligned}$$
This means that if the model has large enough capacity, minimizing the incremental reconstruction loss in (5.14) also implies minimizing (5.12), and thus the true score is learned (denoising is achieved). Otherwise, the incremental reconstruction loss is an upper bound on the loss in (5.12). Training a model on (5.14), the model learns not only to denoise, but also to perform small, incremental reconstructions of the degraded image such that A_{t−∆t}(Φ_θ(y_t, t)) ≈ A_{t−∆t}(x_0). There is however a trade-off between incremental reconstruction performance and learning the score: as Proposition D.6 indicates, we are optimizing an upper bound to (5.12), and thus it is possible that the score estimation is less accurate. We expect our proposed incremental reconstruction loss to work best in scenarios where the degradation may change rapidly with respect to t, and hence a network trained to accurately estimate A_t(x_0) from y_t may become inaccurate when predicting A_{t−∆t}(x_0) from y_t. This hypothesis is further supported by our experiments in Section 5.4. Finally, we mention that in the extreme case where we choose ∆t = 1, we obtain a loss function purely in the clean image domain.
D.1.4 Theorem 3.7
Lemma (Transitivity of data consistency). If y_{t^+} d.c.∼ y_t and y_{t^{++}} d.c.∼ y_{t^+} with t < t^+ < t^{++}, then y_{t^{++}} d.c.∼ y_t.
Proof. By the definition of data consistency, y_{t^{++}} d.c.∼ y_{t^+} ⇒ ∃x_0 : A_{t^{++}}(x_0) = y_{t^{++}} and A_{t^+}(x_0) = y_{t^+}. On the other hand, y_{t^+} d.c.∼ y_t ⇒ ∃x_0' : A_{t^+}(x_0') = y_{t^+} and A_t(x_0') = y_t. Therefore,
$$y_{t^{++}} = A_{t^{++}}(x_0) = G_{t^+\to t^{++}}(A_{t^+}(x_0)) = G_{t^+\to t^{++}}(y_{t^+}) = G_{t^+\to t^{++}}(A_{t^+}(x_0')) = A_{t^{++}}(x_0').$$
By the definition of data consistency, this implies y_{t^{++}} d.c.∼ y_t.
Theorem 3.7. Assume that we run the updates in (5.7) with s_θ(y_t, t) = ∇_{y_t} log q_t(y_t|x_0), ∀t ∈ [0, 1], and R̂(t, ∆t; y_t) = R(t, ∆t; x_0), x_0 ∈ X_0. If we start from a noisy degraded observation ỹ = A_1(x_0) + z_1, x_0 ∈ X_0, z_1 ∼ N(0, σ_1^2 I), and run the updates in (5.7) for τ = 1, 1 − ∆t, ..., ∆t, 0, then we have
$$\mathbb{E}[\tilde{y}] \overset{d.c.}{\sim} \mathbb{E}[y_\tau], \quad \forall \tau \in [1, 1-\Delta t, \ldots, \Delta t, 0]. \quad (12)$$
Proof. Assume that we start from a known measurement ỹ := y_t = A_t(x_0) + z_t at an arbitrary time t and run reverse diffusion from t with time step ∆t. Starting from t = 1, which we have looked at in the paper, is a subset of this problem. Starting from arbitrary y_t, the first update takes the form
$$\begin{aligned}
y_{t-\Delta t} &= y_t + A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_t(\Phi_\theta(y_t,t)) - (\sigma_{t-\Delta t}^2 - \sigma_t^2)\frac{A_t(\Phi_\theta(y_t,t)) - y_t}{\sigma_t^2} + \sqrt{\sigma_t^2 - \sigma_{t-\Delta t}^2}\, z \\
&= A_t(x_0) + z_t + A_{t-\Delta t}(\Phi_\theta(y_t,t)) - A_t(\Phi_\theta(y_t,t)) - (\sigma_{t-\Delta t}^2 - \sigma_t^2)\frac{A_t(\Phi_\theta(y_t,t)) - A_t(x_0) - z_t}{\sigma_t^2} + \sqrt{\sigma_t^2 - \sigma_{t-\Delta t}^2}\, z.
\end{aligned}$$
Due to our assumption on learning the score function, we have A_t(Φ_θ(y_t, t)) = A_t(x_0), and due to the perfect incremental reconstruction assumption, A_{t−∆t}(Φ_θ(y_t, t)) = A_{t−∆t}(x_0). Thus, we have
$$y_{t-\Delta t} = A_{t-\Delta t}(x_0) + \frac{\sigma_{t-\Delta t}^2}{\sigma_t^2} z_t + \sqrt{\sigma_t^2 - \sigma_{t-\Delta t}^2}\, z.$$
Since z and z_t are independent Gaussian, we can combine the noise terms to yield
$$y_{t-\Delta t} = A_{t-\Delta t}(x_0) + z_{t-\Delta t}, \quad (13)$$
with z_{t−∆t} ∼ N(0, ((σ_{t−∆t}^2/σ_t)^2 + σ_t^2 − σ_{t−∆t}^2) I). This form is identical to the expression for our original measurement ỹ = y_t = A_t(x_0) + z_t, but with slightly lower degradation severity and noise variance. It is also important to point out that E[y_t] d.c.∼ E[y_{t−∆t}]. If we repeat the update to find y_{t−2∆t}, we will have the same form as in (13) and E[y_{t−∆t}] d.c.∼ E[y_{t−2∆t}]. Due to the transitive property of data consistency (Lemma D.7), we also have E[y_t] d.c.∼ E[y_{t−2∆t}], that is, data consistency is preserved with the original measurement. This reasoning can then be extended to every further update using the transitivity property; therefore we have data consistency in each iteration.
D.2 Degradation scheduling
Figure A16: Results of degradation scheduling from Algorithm 3. Left: Gaussian blur with kernel std wt on
CelebA-HQ. Center: inpainting with Gaussian mask with kernel width wt on CelebA-HQ. Right: inpainting
with Gaussian mask on ImageNet.
When solving inverse problems, we have access to a noisy measurement y˜ = A(x0) + z and we would
like to find the corresponding clean image x0. In order to deploy our method, we need to define how the
degradation changes with respect to severity t following the properties specified in Definition 5.3.3. That
is, we have to determine how to interpolate between the identity mapping A0(x) = x for t = 0 and
the most severe degradation A1(·) = A(·) for t = 1. Theorem 5.3.4 suggests that sharp changes in the
degradation function with respect to t should be avoided, however a more principled method of scheduling
is needed.
In the context of image generation, [32] proposes a scheduling framework that splits the path between
the distribution of clean images D0 and the distribution of pure noise D1 into T candidate distributions
Di
, i ∈ [1/T, 2/T, ..., T −1
T
]. Then, they find a path through the candidate distributions that minimizes
the total path length, where the distance between Di and Dj is measured by the Wasserstein-distance.
However, for image reconstruction, instead of distance between image distributions, we are more interested in how much a given image degrades in terms of image quality metrics such as PSNR or LPIPS.
Therefore, we replace the Wasserstein-distance by a notion of distance between two degradation severities, d(t_i, t_j) := E_{x_0∼D_0}[M(A_{t_i}(x_0), A_{t_j}(x_0))], where M is some distortion-based or perceptual image quality metric that acts on a corresponding pair of images.
We propose a greedy algorithm to select a set of degradations from the set of candidates based on the
above notion of dataset-dependent distance, such that the maximum distance is minimized. That is, our
scheduler is not only a function of the degradation At
, but also the data. The intuitive reasoning to minimize the maximum distance is that our model has to be imbued with enough capacity to bridge the gap
between any two consecutive distributions during the reverse process, and thus the most challenging transition dictates the required network capacity. In particular, given a budget of m intermediate distributions
on [0, 1], we would like to pick a set of m interpolating severities S such that
$$S = \arg\min_{\mathcal{T}} \max_i\, d(t_i, t_{i+1}), \quad (14)$$
where \mathcal{T} = \{t_1, t_2, \ldots, t_m \,|\, t_i \in [0, 1],\ t_i < t_{i+1}\ \forall i \in (1, 2, \ldots, m)\} is the set of possible interpolating severities with budget m. To this end, we start with S = {0, 1} and add new interpolating severities one-by-one, such that the new point splits the interval in S with the maximum distance. Thus, over iterations the maximum distance is non-increasing. We also have local optimality, as moving a single interpolating severity must increase the maximum distance by the construction of the algorithm. Finally, we use linear interpolation in between the selected interpolating severities. The technique is summarized in Algorithm 3, and we refer the reader to the source code for implementation details.
The results of our proposed greedy scheduling algorithm are shown in Figure A16, where the distance
is defined based on the LPIPS metric. In case of blurring, we see a sharp decrease in degradation severity
close to t = 1. This indicates, that LPIPS difference between heavily blurred images is small, therefore
most of the diffusion takes place at lower blur levels. On the other hand, we find that inpainting mask size
is scaled almost linearly by our algorithm on both datasets we investigated.
D.3 Guidance details
Even though Dirac does not need to rely on ỹ after the initial update for maintaining data-consistency, we observe small improvements in reconstructions when adding a guidance scheme to our algorithm. As described in Section 5.3.7, our approximation of the posterior score takes the form
$$s'_\theta(y_t, t) = s_\theta(y_t, t) - \eta_t \nabla_{y_t} \frac{\|\tilde{y} - A_1(\Phi_\theta(y_t, t))\|^2}{2\sigma_1^2}, \quad (15)$$
where η_t is a hyperparameter that tunes how much we rely on the original noisy measurement. For the sake of simplicity, in this discussion we merge the scaling of the gradient into the step size parameter as follows:
$$s'_\theta(y_t, t) = s_\theta(y_t, t) - \eta'_t \nabla_{y_t} \|\tilde{y} - A_1(\Phi_\theta(y_t, t))\|^2. \quad (16)$$
We experiment with two choices of step size scheduling for the guidance term η'_t (a minimal sketch of the resulting update follows the list below):
• Standard deviation scaled (constant): η_t = η · 1/(2σ_1^2), where η is a constant hyperparameter and σ_1^2 is the noise level on the measurements. This scaling is justified by our derivation of the posterior score approximation, and matches (15).
• Error scaled: η_t = η · 1/‖ỹ − A_1(Φ_θ(y_t, t))‖, which has been proposed in [23]. This method attempts to normalize the gradient of the data consistency term.
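The sketch below illustrates the guided score update with the two step-size schedules; the function and argument names are illustrative assumptions (score_fn for s_θ, phi for Φ_θ, A1 for the most severe degradation operator), and the actual implementation is in the source code.

import torch

def guided_score(y_t, t, score_fn, phi, A1, y_meas, eta, sigma1, mode="std"):
    # Guided score s'_theta(y_t, t) from Eqs. (15)-(16).
    y_t = y_t.detach().requires_grad_(True)
    residual = torch.norm(y_meas - A1(phi(y_t, t)))
    grad = torch.autograd.grad(residual ** 2, y_t)[0]
    if mode == "std":
        step = eta / (2.0 * sigma1 ** 2)          # standard deviation scaled (constant)
    else:
        step = eta / (residual.detach() + 1e-8)   # error scaled
    return score_fn(y_t, t) - step * grad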
In general, we find that a constant step size works better for deblurring, whereas error scaling performed slightly better for the inpainting experiments; however, the difference is minor. Figure A17 shows the results of our ablation study on the effect of η_t. We perform deblurring experiments on the CelebA-HQ validation set and plot the mean LPIPS (lower is better) with different step size scheduling methods and varying
Algorithm 3 Greedy Degradation Scheduling
Input: M: pairwise image dissimilarity metric, X0: clean samples, At: unscheduled degradation function, N: number of candidate points, m: number of interpolation points
  ts ← (0, 1/(N−1), 2/(N−1), ..., (N−2)/(N−1), 1)   ▷ N candidate severities uniformly distributed over [0, 1]
  S ← (1, N)   ▷ Array of indices of output severities in ts
  dmax ← Distance(ts[1], ts[N])   ▷ Maximum distance between two severities in the output array
  estart ← 1   ▷ Start index of edge with maximum distance
  eend ← N   ▷ End index of edge with maximum distance
  for i = 1 to m do
    s ← FindBestSplit(estart, eend, dmax)
    Append(S, s)
    dmax, estart, eend ← UpdateMax(S)
Output: S

procedure Distance(ti, tj)   ▷ Distance between degradation severities ti and tj
  d ← (1/|X0|) Σ_{x∈X0} M(A_{ti}(x), A_{tj}(x))
Output: d

procedure FindBestSplit(estart, eend, dmax)   ▷ Split edge into two new edges with minimal maximum distance
  MaxDistance ← dmax
  for j = estart + 1 to eend − 1 do
    d1 ← Distance(ts[estart], ts[j])
    d2 ← Distance(ts[j], ts[eend])
    if max(d1, d2) < MaxDistance then
      MaxDistance ← max(d1, d2)
      Split ← j
Output: Split

procedure UpdateMax(S)
  MaxDistance ← 0
  for i = 1 to |S| − 1 do
    estart ← S[i]
    eend ← S[i + 1]
    d ← Distance(ts[estart], ts[eend])
    if d > MaxDistance then
      MaxDistance ← d
      NewStart ← estart
      NewEnd ← eend
Output: MaxDistance, NewStart, NewEnd
step size. We see some improvement in LPIPS when adding guidance to our method, however it is not a
crucial component in obtaining high quality reconstructions, or for maintaining data-consistency.
Figure A17: Effect of guidance step size on best reconstruction in terms of LPIPS. We perform experiments
on the CelebA-HQ validation set on the deblurring task.
D.4 Note on the output of the algorithm
In the ideal case, σ_0 = 0 and A_0 = I. However, in practice, due to geometric noise scheduling (e.g. σ_0 = 0.01), a small amount of additive noise is expected on the final iterate. Moreover, in order to keep the scheduling of the degradation smooth, and due to numerical stability, in practice A_0 may slightly deviate from the identity mapping close to t = 0 (for example, a very small amount of blur). Thus, even close to t = 0, there may be a gap between the iterates y_t and the posterior mean estimates x̂_0 = Φ_θ(y_t, t). For these reasons, we observe that in some experiments taking Φ_θ(y_t, t) as the final output yields better reconstructions. In case of early stopping, taking x̂_0 as the output is instrumental, as an intermediate iterate y_t represents a sample from the reverse SDP, and thus it is expected to be noisy and degraded. However, as Φ_θ(y_t, t) always predicts the clean image, it can be used at any time step t to obtain an early-stopped prediction of x_0.
D.5 Experimental details
Datasets – We evaluate our method on CelebA-HQ (256 × 256) [79] and ImageNet (256 × 256) [38].
For CelebA-HQ training, we use 80% of the dataset for training, and the rest for validation and testing.
For ImageNet experiments, we sample 1 image from each class from the official validation split to create
disjoint validation and test sets of 1k images each. We only train our model on the official train split of
ImageNet. We center-crop and resize ImageNet images to 256×256 resolution. For both datasets, we scale
images to [0, 1] range.
Comparison methods – We compare our method against DDRM [82], the most well-established
diffusion-based linear inverse problem solver; DPS [23], a very recent, state-of-the-art diffusion technique
for noisy and possibly nonlinear inverse problems; PnP-ADMM [16], a reliable traditional solver with
learned denoiser; and ADMM-TV, a classical optimization technique. More details can be found in Section
D.8.
Models – For Dirac, we train new models from scratch using the NCSN++[157] architecture with 67M
parameters for all tasks except for ImageNet inpainting, for which we scale the model to 126M parameters.
For competing methods that require a score model, we use pre-trained SDE-VP models∗ (126M parameters for CelebA-HQ, 553M parameters for ImageNet). The architectural hyper-parameters for the various score-models can be seen in Table A9.
Training details – We train all models with Adam optimizer, with learning rate 0.0001 and batch
size 32 on 8× Titan RTX GPUs. We do not use exponential moving averaging or learning rate scheduling
schemes. We train for approximately 10M examples seen by the network. For the weighting factor w(t) in the loss, we set w(t) = 1/σ_t^2 in all experiments.
Degradations – We investigate two degradation processes of very different properties: Gaussian
blur and inpainting, both with additive Gaussian noise. In all cases, noise with σ1 = 0.05 is added to
∗CelebA-HQ: https://github.com/ermongroup/SDEdit
ImageNet: https://github.com/openai/guided-diffusion
the measurements in the [0, 1] range. We use standard geometric noise scheduling with σmax = 0.05
and σmin = 0.01 in the SDP. For Gaussian blur, we use a kernel size of 61, with standard deviation of
wmax = 3 to create the measurements. We change the standard deviation of the kernel between wmax
(strongest) and 0.3 (weakest) to parameterize the severity of Gaussian blur in the degradation process, and
use the scheduling method described in Section D.2 to specify A_t. We keep an imperceptible amount of blur for t = 0 to avoid numerical instability with very small kernel widths. For inpainting, we generate a smooth mask of the form
$$\left(1 - \frac{f(x; w_t)}{\max_x f(x; w_t)}\right)^k,$$
where f(x; w_t) denotes the density of a zero-mean isotropic Gaussian with standard deviation w_t that controls the size of the mask, and k = 4 for a sharper transition.
We set w1 = 50 for CelebA-HQ/FFHQ inpainting and 30 for ImageNet inpainting.
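For concreteness, the following is a minimal NumPy sketch of how such a smooth mask could be generated; the function name and grid convention are illustrative assumptions.

import numpy as np

def smooth_gaussian_mask(size, w_t, k=4):
    # Smooth inpainting mask (1 - f(x; w_t) / max_x f(x; w_t))**k, where f is the density
    # of a zero-mean isotropic Gaussian with std w_t controlling the masked region size.
    coords = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(coords, coords)
    f = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * w_t ** 2))  # unnormalized Gaussian density
    return (1.0 - f / f.max()) ** k  # values in [0, 1]; multiplying an image removes a central blob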
Evaluation method – To evaluate performance, we use PSNR and SSIM as distortion metrics and
LPIPS and FID as perceptual quality metrics. For the final reported results, we scale and clip all outputs to
the [0, 1] range before computing the metrics. We use validation splits to tune the hyper-parameters for
all methods, where we optimize for best LPIPS in the deblurring task and for best FID for inpainting. As
the pre-trained score-models for competing methods have been trained on the full CelebA-HQ dataset, we
test all methods for fair comparison on the first 1k images of the FFHQ [81] dataset. The list of test images
for ImageNet can be found in the source code.
Sampling hyperparameters – The settings are summarized in Table A10. We tune the reverse
process hyper-parameters on validation data. For the interpretation of ’guidance scaling’ we refer the
reader to the explanation of guidance step size methods in Section D.3. In Table A10, ’output’ refers to
whether the final reconstruction is the last model output (posterior mean estimate, x̂_0 = Φ_θ(y_t, t)) or the final iterate y_t.
Dirac(Ours)
Hparam Deblur/CelebA-HQ Deblur/ImageNet Inpainting/CelebA-HQ Inpainting/ImageNet
model_channels 128 128 128 128
channel_mult [1, 1, 2, 2, 2, 2, 2] [1, 1, 2, 2, 2, 2, 2] [1, 1, 2, 2, 2, 2, 2] [1, 1, 2, 2, 4, 4]
num_res_blocks 2 2 2 2
attn_resolutions [16] [16] [16] [16]
dropout 0.1 0.1 0.1 0.0
Total # of parameters 67M 67M 67M 126M
DDRM/DPS
Hparam Deblur/CelebA-HQ Deblur/ImageNet Inpainting/CelebA-HQ Inpainting/ImageNet
model_channels 128 256 128 256
channel_mult [1, 1, 2, 2, 4, 4] [1, 1, 2, 2, 4, 4] [1, 1, 2, 2, 4, 4] [1, 1, 2, 2, 4, 4]
num_res_blocks 2 2 2 2
attn_resolutions [16] [32, 16, 8] [16] [32, 16, 8]
dropout 0.0 0.0 0.0 0.0
Total # of parameters 126M 553M 126M 553M
Table A9: Architectural hyper-parameters for the score-models for Dirac (top) and other diffusion-based
methods (bottom) in our experiments.
PO Sampling hyper-parameters
Hparam Deblur/CelebA-HQ Deblur/ImageNet Inpainting/CelebA-HQ Inpainting/ImageNet
∆t 0.02 0.02 0.005 0.01
tstop 0.25 0.0 0.0 0.0
ηt 0.5 0.2 1.0 0.0
Guidance scaling std std error -
Output xˆ0 xˆ0 yt yt
DO Sampling hyper-parameters
Hparam Deblur/CelebA-HQ Deblur/ImageNet Inpainting/CelebA-HQ Inpainting/ImageNet
∆t 0.02 0.02 0.005 0.01
tstop 0.98 0.7 0.995 0.99
ηt 0.5 1.5 1.0 0.0
Guidance scaling std std error -
Output xˆ0 xˆ0 xˆ0 xˆ0
Table A10: Settings for perception optimized (PO) and distortion optimized (DO) sampling for all experiments on test data.
D.6 Incremental reconstruction loss ablations
We propose the incremental reconstruction loss, which combines learning to denoise and to reconstruct simultaneously, in the form
$$\mathcal{L}_{IR}(\Delta t, \theta) = \mathbb{E}_{t,(x_0,y_t)}\left[w(t)\,\|A_\tau(\Phi_\theta(y_t, t)) - A_\tau(x_0)\|^2\right], \quad (17)$$
where τ = max(t − ∆t, 0), t ∼ U[0, 1], (x_0, y_t) ∼ q_0(x_0)q_t(y_t|x_0). This loss directly improves incremental reconstruction by encouraging A_{t−∆t}(Φ_θ(y_t, t)) ≈ A_{t−∆t}(x_0). We show in Proposition D.6 that L_{IR}(∆t, θ) is an upper bound to the denoising score-matching objective L(θ). Furthermore, we show that, given enough model capacity, minimizing L_{IR}(∆t, θ) also minimizes L(θ). However, if the model capacity is limited compared to the difficulty of the task, we expect a trade-off between incremental reconstruction accuracy and score accuracy. This trade-off might not be favorable in tasks where incremental reconstruction is accurate enough due to the smoothness properties of the degradation (see Theorem 5.3.4). Here, we perform further ablation studies to investigate the effect of the look-ahead parameter ∆t in the incremental reconstruction loss.
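A minimal sketch of this training loss; phi, A and w are placeholders for the posterior-mean network Φ_θ, a severity-parameterized degradation operator and the loss weighting (we use w(t) = 1/σ_t^2), respectively, and the actual implementation can be found in the source code.

import torch

def incremental_reconstruction_loss(phi, A, w, x0, y_t, t, delta_t):
    # L_IR from Eq. (17): compare the degraded model output and the degraded clean image
    # at the less severe level tau = max(t - delta_t, 0). Assumes t is a per-sample tensor
    # and A(x, s) supports per-sample severities.
    tau = torch.clamp(t - delta_t, min=0.0)
    x0_hat = phi(y_t, t)
    residual = A(x0_hat, tau) - A(x0, tau)
    per_sample = residual.flatten(start_dim=1).pow(2).sum(dim=1)
    return (w(t) * per_sample).mean()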
Deblurring – In case of deblurring, we did not find a significant difference in perceptual quality with different ∆t settings. Our results on the CelebA-HQ validation set can be seen in Figure A18 (left). We observe that using ∆t = 0 (that is, optimizing L(θ)) yields slightly better reconstructions (a difference in the third digit of LPIPS) than optimizing with ∆t = 1, that is, minimizing
$$\mathcal{L}_{IR}(\Delta t = 1, \theta) := \mathcal{L}^{X_0}_{IR}(\theta) = \mathbb{E}_{t,(x_0,y_t)}\left[w(t)\,\|\Phi_\theta(y_t, t) - x_0\|^2\right]. \quad (18)$$
This loss encourages one-shot reconstruction and denoising from any degradation severity, intuitively the most challenging task to learn. We hypothesize that the blur degradation used in our experiments is smooth enough, and thus the incremental reconstruction as per Theorem 5.3.4 is accurate. Therefore, we do not need to trade off score approximation accuracy for better incremental reconstruction.
Inpainting – We observe very different characteristics in case of inpainting. In fact, using the vanilla score-matching loss L(θ), which is equivalent to L_{IR}(∆t, θ) with ∆t = 0, we are unable to learn a meaningful inpainting model. As we increase the look-ahead ∆t, reconstructions consistently improve. We obtain the best results in terms of FID when minimizing L^{X_0}_{IR}(θ). Our results are summarized in Figure A18 (middle). We hypothesize that, due to rapid changes in the inpainting operator, our incremental reconstruction estimator produces very high errors when trained on L(θ) (see Theorem 5.3.4). Therefore, in this scenario improving incremental reconstruction at the expense of score accuracy is beneficial. Figure A18 (right) demonstrates how reconstructions visually change as we increase the look-ahead ∆t. With ∆t = 0, the reverse process misses the clean image manifold completely. As we increase ∆t, reconstruction quality visually improves, but the generated images often have features inconsistent with natural images in the training set. We obtain high quality, detailed reconstructions for ∆t = 1 when minimizing L^{X_0}_{IR}(θ).
Figure A18: Effect of incremental reconstruction loss step size on the CelebA-HQ validation set for deblurring (left) and inpainting (middle). Visual comparison of inpainted samples is shown on the right.
D.7 Further incremental reconstruction approximations
In this work, we focused on estimating the incremental reconstruction
$$R(t, \Delta t; x_0) := A_{t-\Delta t}(x_0) - A_t(x_0) \quad (19)$$
in the form
$$\hat{R}(t, \Delta t; y_t) = A_{t-\Delta t}(\Phi_\theta(y_t, t)) - A_t(\Phi_\theta(y_t, t)), \quad (20)$$
which we call the look-ahead method. The challenge with this formulation is that we use y_t with degradation severity t to predict A_{t−∆t}(x_0) with less severe degradation t − ∆t. That is, as we discussed in the paper, Φ_θ(y_t, t) does not only need to denoise images with arbitrary degradation severity, but also has to be able to perform incremental reconstruction, which we address with the incremental reconstruction loss. However, other methods of approximating (19) are also possible, with different trade-offs. The key idea is to use different methods to estimate the gradient of A_t(x_0) with respect to the degradation severity, followed by a first-order Taylor expansion to estimate A_{t−∆t}(x_0).
Small look-ahead (SLA) – We use the approximation
$$A_{t-\Delta t}(x_0) - A_t(x_0) \approx \Delta t \cdot \frac{A_{t-\delta t}(x_0) - A_t(x_0)}{\delta t}, \quad (21)$$
where 0 < δt < ∆t, to obtain
$$\hat{R}_{SLA}(t, \Delta t; y_t) = \Delta t \cdot \frac{A_{t-\delta t}(\Phi_\theta(y_t, t)) - A_t(\Phi_\theta(y_t, t))}{\delta t}. \quad (22)$$
The potential benefit of this method is that A_{t−δt}(Φ_θ(y_t, t)) may approximate A_{t−δt}(x_0) much more accurately than A_{t−∆t}(Φ_θ(y_t, t)) can approximate A_{t−∆t}(x_0), since t − δt is closer in severity to t than t − ∆t. However, depending on the sharpness of A_t, the first-order Taylor approximation may accumulate a large error.
Look-back (LB) – We use the approximation
$$A_{t-\Delta t}(x_0) - A_t(x_0) \approx A_t(x_0) - A_{t+\Delta t}(x_0), \quad (23)$$
that is, we predict the incremental reconstruction based on the most recent change in image degradation. Plugging in our model yields
$$\hat{R}_{LB}(t, \Delta t; y_t) = A_t(\Phi_\theta(y_t, t)) - A_{t+\Delta t}(\Phi_\theta(y_t, t)). \quad (24)$$
The clear advantage of this formulation over (20) is that if the loss in (5.12) is minimized such that
$$A_t(\Phi_\theta(y_t, t)) = A_t(x_0),$$
then we also have
$$A_{t+\Delta t}(\Phi_\theta(y_t, t)) = G_{t\to t+\Delta t}(A_t(\Phi_\theta(y_t, t))) = G_{t\to t+\Delta t}(A_t(x_0)) = A_{t+\Delta t}(x_0).$$
However, this method may also accumulate a large error if A_t changes rapidly close to t.
Small look-back (SLB)– Combining the idea in SLA with LB yields the approximation
\[ A_{t-\Delta t}(x_0) - A_t(x_0) \approx \Delta t \cdot \frac{A_t(x_0) - A_{t+\delta t}(x_0)}{\delta t}, \tag{25} \]
where 0 < δt < ∆t. Using our model, the estimator of the incremental reconstruction takes the form
\[ \hat{R}_{SLB}(t, \Delta t; y_t) = \Delta t \cdot \frac{A_t(\Phi_\theta(y_t, t)) - A_{t+\delta t}(\Phi_\theta(y_t, t))}{\delta t}. \tag{26} \]
Compared with LB, we still have At+δt(Φθ(yt, t)) = At+δt(x0), and the error due to the first-order Taylor approximation is reduced; however, it is potentially higher than in the case of SLA.
Incremental Reconstruction Network – Finally, an additional model ϕθ′ can be trained to directly approximate the incremental reconstruction, that is, ϕθ′(yt, t) ≈ R(t, ∆t; x0). All these approaches are interesting directions for future work.
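To make the differences between these estimators concrete, the following minimal sketch implements the look-ahead, SLA, LB and SLB estimators side by side. It assumes a degradation operator A(x, t) that applies severity-t degradation to a clean image and a trained model phi(y, t) that predicts the clean image from a degraded input; these names and the step sizes are placeholders, not the actual codebase.

```python
def r_hat_lookahead(A, phi, y_t, t, dt):
    # Look-ahead (eq. 20): degrade the model's clean-image estimate at severities t - dt and t.
    x_hat = phi(y_t, t)
    return A(x_hat, t - dt) - A(x_hat, t)

def r_hat_sla(A, phi, y_t, t, dt, small_dt):
    # Small look-ahead (eq. 22): finite difference with a small step, scaled up to dt.
    x_hat = phi(y_t, t)
    return dt * (A(x_hat, t - small_dt) - A(x_hat, t)) / small_dt

def r_hat_lb(A, phi, y_t, t, dt):
    # Look-back (eq. 24): use the most recent change in degradation as a proxy for the next one.
    x_hat = phi(y_t, t)
    return A(x_hat, t) - A(x_hat, t + dt)

def r_hat_slb(A, phi, y_t, t, dt, small_dt):
    # Small look-back (eq. 26): look-back finite difference with a small step, scaled up to dt.
    x_hat = phi(y_t, t)
    return dt * (A(x_hat, t) - A(x_hat, t + small_dt)) / small_dt
```

The look-back variants only evaluate the operator at severities greater than or equal to t, so they remain exact whenever At(Φθ(yt, t)) = At(x0), while the small-step variants trade Taylor-expansion error for operator-approximation error.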
D.8 Comparison methods
For all methods, hyperparameters are tuned based on the first 100 images of folder "00001" for FFHQ and tested on folder "00000". For ImageNet experiments, we use the first samples of the first 100 classes of the ImageNet validation split for tuning, and the last samples of each class as the test set.
D.8.1 DPS
We use the default value of 1000 NFEs for all tasks. We make no changes to the Gaussian blurring operator in the official source code. For inpainting, we copy our operator and apply it in the image input range [0, 1]. The step size ζ′ is tuned via grid search for each task separately based on the LPIPS metric. The optimal values are as follows:
1. FFHQ Deblurring: ζ′ = 3.0
2. FFHQ Inpainting: ζ′ = 2.0
3. ImageNet Deblurring: ζ′ = 0.3
4. ImageNet Inpainting: ζ′ = 3.0
As a side note, at the time of writing, the official implementation of DPS† adds the noise to the measurement after scaling it to the range [−1, 1]. For the same noise standard deviation, the effect of the noise is therefore halved compared to applying it in the [0, 1] range. To compensate for this discrepancy, we set the noise standard deviation in the official code to σ = 0.1 for all DPS experiments, which corresponds to the same effective noise level as σ = 0.05 in our experiments.
† https://github.com/DPS2022/diffusion-posterior-sampling
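The factor-of-two relationship between the two noise conventions can be verified directly; the snippet below is only an illustration of the scaling argument, not part of either codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
x01 = rng.uniform(size=(64, 64))        # image in the [0, 1] range
x11 = 2 * x01 - 1                       # same image rescaled to [-1, 1]
noise = rng.standard_normal(x01.shape)
sigma = 0.1

# Adding noise with std sigma in [-1, 1] and mapping back to [0, 1] ...
y_from_pm1 = ((x11 + sigma * noise) + 1) / 2
# ... is identical to adding noise with std sigma / 2 directly in [0, 1].
y_in_01 = x01 + (sigma / 2) * noise
print(np.allclose(y_from_pm1, y_in_01))  # True
```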
D.8.2 DDRM
We keep the default settings ηB = 1.0, η = 0.85 for all of the experiments and sample for 20 NFEs with DDIM [151]. For the Gaussian deblurring task, the linear operator is implemented via separable 1D convolutions as described in Appendix D.5 of DDRM [82]. We note that for the blurring task, the operator is applied to the reflection-padded input. For the inpainting task, we set the left and right singular vectors of the operator to the identity (U = V = I) and store the mask values as the singular values of the operator. For both tasks, the operators are applied to the image in the [−1, 1] range.
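The SVD used for the inpainting operator is trivial because a pixelwise mask is a diagonal operator; a small sanity-check sketch (with illustrative names, not DDRM's actual interface) is shown below.

```python
import numpy as np

def mask_operator_svd(mask):
    """A binary pixel mask acts elementwise, i.e. as a diagonal matrix, so its SVD can be
    taken as U = I, V = I, with the mask entries themselves as the singular values."""
    s = mask.reshape(-1).astype(np.float64)   # singular values: 1 for observed pixels, 0 for missing
    apply_A = lambda x_flat: s * x_flat       # A x = U diag(s) V^T x reduces to elementwise masking
    return s, apply_A

mask = (np.random.rand(8, 8) > 0.3).astype(np.float64)
s, A = mask_operator_svd(mask)
x = np.random.randn(64)
print(np.allclose(A(x), mask.reshape(-1) * x))  # True
```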
D.8.3 PnP-ADMM
We take the implementation from the scico library; specifically, the code is modified from the sample notebook‡. We set the number of ADMM iterations to maxiter=12 and tune the ADMM penalty parameter ρ via grid search for each task based on the LPIPS metric. The values for each task are as follows:
1. FFHQ Deblurring: ρ = 0.1
2. FFHQ Inpainting: ρ = 0.4
3. ImageNet Deblurring: ρ = 0.1
4. ImageNet Inpainting: ρ = 0.4
The proximal mappings are performed via a pre-trained DnCNN denoiser with 17M parameters.
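For reference, the generic structure of plug-and-play ADMM that this setup follows is sketched below. This is not the scico implementation; the forward operator, its adjoint, the denoiser and the inner-solver step size are all placeholders.

```python
import numpy as np

def pnp_admm(y, A, At, denoiser, rho, maxiter=12, inner_iter=20, step=0.1):
    """Generic PnP-ADMM sketch: the data-fit proximal step is solved approximately with
    gradient descent, and the prior proximal step is replaced by a learned denoiser."""
    x = At(y)                      # crude initialization from the adjoint
    z = x.copy()
    u = np.zeros_like(x)
    for _ in range(maxiter):
        # x-update: argmin_x 0.5 ||y - A(x)||^2 + (rho / 2) ||x - (z - u)||^2
        v = z - u
        for _ in range(inner_iter):
            grad = At(A(x) - y) + rho * (x - v)
            x = x - step * grad    # step size is illustrative, not tuned
        # z-update: proximal step on the implicit prior, realized by the denoiser
        z = denoiser(x + u)
        # dual (scaled Lagrange multiplier) update
        u = u + x - z
    return x
```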
D.8.4 ADMM-TV
We want to solve the following objective:
\[ \arg\min_x \; \frac{1}{2} \| y - A_1(x) \|_2^2 + \lambda \| Dx \|_{2,1}, \]
‡ https://github.com/lanl/scico-data/blob/main/notebooks/superres_ppp_dncnn_admm.ipynb
where y is the noisy degraded measurement, A1(·) refers to the blurring/masking operator, and D is a finite difference operator. The term ∥Dx∥2,1 imposes total variation (TV) regularization on the prediction x, and λ controls the regularization strength. For a matrix A ∈ R^{m×n}, the matrix norm ∥·∥2,1 is defined as:
\[ \| A \|_{2,1} = \sum_{i=1}^{m} \sqrt{ \sum_{j=1}^{n} A_{ij}^2 }. \]
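In words, this is the sum of the Euclidean norms of the rows of A; a two-line check of the definition:

```python
import numpy as np

def norm_21(A):
    # Sum over rows of the l2 norm of each row.
    return np.sum(np.sqrt(np.sum(A ** 2, axis=1)))

A = np.arange(6, dtype=float).reshape(3, 2)
print(norm_21(A), np.sum(np.linalg.norm(A, axis=1)))  # both give the same value
```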
The implementation is taken from the scico library, where the code is based on the sample notebook§. We note that for consistency, the blurring operator is applied to the reflection-padded input. In addition to the penalty parameter ρ, we need to tune the regularization strength λ in this problem. We tune the pairs (λ, ρ) for each task via grid search based on the LPIPS metric (a sketch of this tuning loop is given after the list below). Optimal values are as follows:
1. FFHQ Deblurring: (λ, ρ) = (0.007, 0.8)
2. FFHQ Inpainting: (λ, ρ) = (0.02, 0.2)
3. ImageNet Deblurring: (λ, ρ) = (0.007, 0.5)
4. ImageNet Inpainting: (λ, ρ) = (0.02, 0.2)
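The LPIPS-based grid search referenced above (and used analogously for the other comparison methods) follows the pattern sketched below. The reconstruct callable and the validation pairs are placeholders; only the usage of the lpips package reflects its actual API.

```python
import itertools
import lpips

loss_fn = lpips.LPIPS(net="alex")  # expects tensors in [-1, 1], shape (N, 3, H, W)

def mean_lpips(reconstruct, val_pairs, lam, rho):
    scores = []
    for y, target in val_pairs:                   # small validation subset, e.g. 100 images
        x_hat = reconstruct(y, lam=lam, rho=rho)  # placeholder solver returning an image in [0, 1]
        scores.append(loss_fn(2 * x_hat - 1, 2 * target - 1).item())
    return sum(scores) / len(scores)

def grid_search(reconstruct, val_pairs, lams, rhos):
    # Pick the (lambda, rho) pair with the lowest average LPIPS on the validation subset.
    return min(itertools.product(lams, rhos),
               key=lambda p: mean_lpips(reconstruct, val_pairs, *p))
```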
D.9 Further reconstruction samples
Here, we provide more samples from Dirac reconstructions on the test splits of the CelebA-HQ and ImageNet datasets. We visualize the uncertainty of samples via the pixelwise standard deviation across n = 10 generated samples. In experiments where the distortion peak is achieved via one-shot reconstruction, we omit the uncertainty map.
§ https://github.com/lanl/scico-data/blob/main/notebooks/deconv_tv_padmm.ipynb
[Figure A19 shows a grid of measurement, distortion-optimized (DO) samples and uncertainty, perception-optimized (PO) samples and uncertainty, and target images, with a colorbar from 0 to 1.]
Figure A19: Distortion and Perception optimized deblurring results for the CelebA-HQ dataset (test split). Uncertainty is calculated over n = 10 reconstructions from the same measurement.
[Figure A20 shows a grid of measurement, distortion-optimized (DO) samples, perception-optimized (PO) samples and uncertainty, and target images, with a colorbar from 0 to 1.]
Figure A20: Distortion and Perception optimized inpainting results for the CelebA-HQ dataset (test split). Uncertainty is calculated over n = 10 reconstructions from the same measurement. For distortion optimized runs, images are generated in one shot, hence we don't provide uncertainty maps.
[Figure A21 shows a grid of measurement, distortion-optimized (DO) samples and uncertainty, perception-optimized (PO) samples and uncertainty, and target images, with a colorbar from 0 to 1.]
Figure A21: Distortion and Perception optimized deblurring results for the ImageNet dataset (test split). Uncertainty is calculated over n = 10 reconstructions from the same measurement.
[Figure A22 shows a grid of measurement, distortion-optimized (DO) samples, perception-optimized (PO) samples and uncertainty, and target images, with a colorbar from 0 to 1.]
Figure A22: Distortion and Perception optimized inpainting results for the ImageNet dataset (test split). Uncertainty is calculated over n = 10 reconstructions from the same measurement. For distortion optimized runs, images are generated in one shot, hence we don't provide uncertainty maps.
E Appendix to Adapt and Diffuse: Sample-adaptive Reconstruction via
Latent Diffusion Models
E.1 Training details
Here we provide additional details on the training setup and hyperparameters.
Model architecture – In all experiments, we use an LDM model pre-trained on the CelebA-HQ dataset
out of the box. We fine-tune the severity encoder from the LDM’s pre-trained encoder. We obtain the
degradation severity estimate σ̂ ∈ R+ from the latent reconstruction ẑ ∈ R^d as
\[ \hat{\sigma} = \frac{1}{d} \sum_{i=1}^{d} \left[ \mathrm{Conv}(\hat{z})^2 \right]_i , \]
where Conv(·) is a learned 1 × 1 convolution with d input and d output channels.
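A minimal PyTorch sketch of this severity head is given below, assuming the latent is a feature map with C channels and the 1×1 convolution acts on the channel dimension; the class and argument names are illustrative, not the actual code.

```python
import torch
import torch.nn as nn

class SeverityHead(nn.Module):
    """Sketch of the degradation-severity estimate: a learned 1x1 convolution applied to the
    latent reconstruction, followed by the mean of the squared entries."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, z_hat: torch.Tensor) -> torch.Tensor:
        # z_hat: latent reconstruction of shape (N, C, H, W)
        h = self.conv(z_hat)
        return (h ** 2).flatten(start_dim=1).mean(dim=1)  # one scalar severity per sample
```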
Training setup – We train severity encoders using Adam optimizer with batch size 28 and learning
rate 0.0001 for about 200k steps until the loss on the validation set converges. We use Quadro RTX 5000
and Titan GPUs.
Hyperparameters – We scale the reconstruction loss terms with their corresponding dimension (d for Llat.rec. and n for Lim.rec.), which we find to be sufficient without tuning λim.rec.. We tune λσ via grid search over [0.1, 1, 10] on the varying Gaussian blur task and set it to 10 for all experiments.
For latent diffusion posterior sampling, following [23] we scale the posterior score estimate with the
data consistency error as
\[ \nabla_{z_t} \log q_t(z_t \mid y) \approx s_\theta(z_t) - \eta_t \nabla_{z_t} \| A(D_0(\hat{z}_0(z_t))) - y \|^2 , \]
where
\[ \eta_t = \frac{\eta}{\| A(D_0(\hat{z}_0(z_t))) - y \|} , \]
and η > 0 is a tuning parameter. We perform a grid search over [0.5, 1.0, 1.5, 2.0] on a small subset of the validation set (100 images) and find η = 1.5 to work best for all tasks.
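The scaled data-consistency gradient above can be computed with automatic differentiation; a hedged sketch is shown below, where the score model, the estimate ẑ0(zt), the decoder and the forward operator are all placeholders for the corresponding components.

```python
import torch

def latent_dps_score(z_t, y, score_model, z0_from_zt, decoder, forward_op, eta):
    """Posterior score estimate: unconditional score minus a data-consistency gradient
    whose step size is normalized by the residual norm (eta_t = eta / ||residual||)."""
    z_t = z_t.detach().requires_grad_(True)
    score = score_model(z_t)                          # s_theta(z_t)
    z0_hat = z0_from_zt(z_t)                          # estimate of z_0 from z_t
    residual = forward_op(decoder(z0_hat)) - y        # A(D_0(z0_hat)) - y
    res_norm = residual.flatten().norm()
    grad = torch.autograd.grad(res_norm ** 2, z_t)[0] # gradient of the squared residual w.r.t. z_t
    eta_t = eta / res_norm.detach()
    return (score - eta_t * grad).detach()
```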
Similarly, we tune the noise correction parameter c on the same validation subset by grid search over
[0.5, 0.8, 1.0, 1.1, 1.2, 1.5] and find c = 1.2 for Gaussian blur and c = 1.1 for nonlinear blur to work the
best.
E.2 Comparison method details
For all methods, we use the train and validation splits provided for CelebA and FFHQ in the GitHub repo of "Taming Transformers"¶. For the test split, we subsample 1000 ids from the corresponding validation ids file. The specific ids we used will be made available when the codebase is released. We provide the specifics for each comparison method next.
SwinIR: For both Gaussian blur and non-linear blur experiments, we train SwinIR using Adam optimizer with batch size 28 for 100 epochs. We use learning rate 0.0002 for the first 90 epochs and drop it by
a factor of 10 for the remaining 10 epochs. We use Quadro RTX 5000 and Titan GPUs.
CCDF-DPS: We implement this method by modifying the official GitHub repo∥ of DPS [23]. Instead of projection-based conditioning, we use DPS updates to handle noise in the measurements and non-linear degradations. As the initial estimate, we use the output of the SwinIR model that we trained. We tune the data consistency step size ζ′ and the number of reverse diffusion steps N′ by doing a 2D grid search over [0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 20.0] × [1, 10, 20, 50, 100, 200, 1000] on the small subset of the validation split of the FFHQ dataset (100 images) based on the LPIPS metric. For Gaussian blur, we find the optimal hyperparameters to be (ζ′, N′) = (5.0, 20). For non-linear blur, the optimal values are (ζ′, N′) = (3.0, 100).
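Starting the reverse diffusion from the SwinIR output rather than from pure noise follows the CCDF idea of forward-diffusing an initial estimate to an intermediate timestep and running only the remaining reverse steps. A hedged sketch of this initialization (DDPM/VP convention; the noise schedule and step index are placeholders) is:

```python
import torch

def ccdf_init(x_init, n_prime, alphas_cumprod):
    """Forward-diffuse an initial reconstruction to timestep n_prime, so that only
    n_prime (conditional) reverse steps are needed instead of a full trajectory."""
    a_bar = alphas_cumprod[n_prime]
    noise = torch.randn_like(x_init)
    return a_bar.sqrt() * x_init + (1.0 - a_bar).sqrt() * noise

# usage sketch: x_t = ccdf_init(swinir_output, n_prime=20, alphas_cumprod=schedule),
# followed by n_prime reverse steps (here with DPS data-consistency updates) down to t = 0.
```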
CCDF-L: This method is the same as ours but with a fixed starting time and an initial estimate provided by the SwinIR model we trained. We tune the data consistency step size η and the number of reverse diffusion steps N′ by doing a grid search over [0.5, 1.0, 1.5, 2.0] × [20, 50, 100, 200] on the small subset of the validation split of FFHQ (100 images) based on the LPIPS metric. For the varying blur experiments, we found the optimal values to be (η, N′) = (1.0, 100). For non-linear blur, we found them to be (η, N′) = (1.5, 200).
¶ https://github.com/CompVis/taming-transformers/tree/master/data
∥ https://github.com/DPS2022/diffusion-posterior-sampling
DPS: This method can be seen as a special case of CCDF-DPS where the number of reverse diffusion steps is fixed to N′ = 1000. From the same 2D grid search we performed for CCDF-DPS, we find the optimal data consistency step size ζ′ to be 5.0 for Gaussian blur and 0.5 for non-linear blur.
Autoencoded (AE): We use the latent at the severity encoder's output and decode it without reverse diffusion to get the reconstruction.
Abstract
A multitude of scientific problems can be formulated as an inverse problem, where the goal is to recover a clean signal from noisy or corrupted measurements. Developing efficient algorithms for solving inverse problems is crucial in advancing a wide range of fields, spanning from MRI and CT reconstruction to imaging protein structures and even microchips. Deep learning has recently been extremely successful in various computer vision applications; however, much of this success can be attributed to training on massive, often internet-scale, datasets. For many important scientific and medical problems, collecting such large, labeled datasets for training is prohibitively costly. Moreover, processing complex scientific data often requires special considerations when compared with the simple image or text data found on the internet. Our goal is to remove these roadblocks from efficient deep learning for scientific and medical imaging inverse problems. Our three-pronged approach is aimed at 1) reducing the cost and time required to acquire scientific data, 2) reducing the reliance of deep learning models on large amounts of training data, and 3) improving the compute efficiency of models in order to tackle the complexity of scientific data processing. In this work, we adapt powerful deep learning models, such as vision transformers and diffusion models, for a variety of imaging inverse problems and highlight their great potential in revolutionizing scientific and medical applications.