Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Covariance-based distance-weighted regression for incomplete and misaligned spatial data
(USC Thesis Other)
Covariance-based distance-weighted regression for incomplete and misaligned spatial data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Covariance-based Distance-weighted Regression for
Incomplete and Misaligned Spatial Data
by
Khang Chau
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Biostatistics)
December 2021
Copyright 2021 Khang Chau
Epigraph
All environments are intractably local.
—JAMES C. SCOTT
Am I a God who is near, declares the LORD, and not a God far off?
—JEREMIAH 23:23
ii
Dedication
To Ingrid, my beloved.
iii
Acknowledgements
I would like to thank my dissertation committee for their supervision of this research
work, including their valuable questions, feedback, and suggestions. Dr. Yao-Yi Chiang
provided me with a clear understanding of spatial data and introduced me to PostgreSQL,
both of which made the data management part of my research much more, well, manage-
able. Dr. Juan Pablo Lewinger was eager to suggest alternative optimization methods to
improve the solving algorithm. Dr. Jill Johnston generously supported me through her
grant and exposed me to important epidemiological research on gas flaring from fracking.
Dr. Jim Gauderman provided thoughtful feedback on both the progress of my research
as well as the analyses involving the Children’s Health Study. Dr. Meredith Franklin has
been a wonderful Doktormutter who took me on as a student and walked with me through
many research projects even if some did not make it into the final form of the dissertation.
She has also been a tremendous colleague and friend.
I must also acknowledge my deep gratitude to the colleagues outside of USC whom
I have had the pleasure of learning from in the past few years. Huikyo Lee (JPL) pro-
vided us with a key assimilated aerosol dataset for the Middle East project. Meytar
Sorek-Hamer (Universities Space Research Association) has taught me a great deal about
AERONET and MAIAC AOD data. Lara Cushing (previously SFSU and currently UCLA)
iv
has been a joy to work with on our gas flaring studies. Olga Kalashnikova and Michael
Garay (JPL) provided amazing guidance and insights on the different projects I worked
on as an intern at JPL.
My academic work has also been sustained by the encouragement and prayers of my
friends, both at Fairview Heights Baptist Church and beyond. My pastor Dr. Paul Felix
and his wife Mrs. Marlean have continually prayed for me and never failed to encourage
my wife and me during my USC years. Michael Tu has encouraged me through our late
night chats, academic to academic. Beyond being a steadfast friend and my best man,
Taemin Jin also managed to procure Korean fried chicken at the final stages of both my
MS thesis as well as this dissertation.
Finally, my wife Ingrid has seen me through every stage of my doctoral studies and
has been the most constant source of encouragement to me through it all. Our home never
lacked the sound of laughter thanks to her infectious joy. I dedicate this dissertation to
you, my beloved.
v
Table of Contents
Epigraph ii
Dedication iii
Acknowledgements iv
List of Tables ix
List of Figures xi
Abstract xv
Chapter 1: Introduction 1
1.1 Health Effects of Air Pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Modeling Air Pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Current Modeling Approaches . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Satellite Aerosol Data Missingness . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Spatial Misalignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Current Approaches in PM
2.5
Estimation 9
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Particulate Matter Measurements . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 MISR Aerosol Optical Depth Data . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Meteorological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Health Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.5 Exposure Estimation Methods . . . . . . . . . . . . . . . . . . . . . . 15
2.2.6 Epidemiological Methods . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Exposure Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Health Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Supplementary Tables and Figures . . . . . . . . . . . . . . . . . . . . . . . . 30
vi
Chapter 3: Further Challenges in PM
2.5
Estimation 34
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Particulate Matter Data . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Aerosol Optical Depth Data . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Meteorology Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Assimilated Aerosol Data . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.5 Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 AOD Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Temporal and Spatial Autocorrelations . . . . . . . . . . . . . . . . . 52
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 4: A Covariance-based Distance-weighting Model 65
4.1 Gaussian Random Fields and Covariance Functions . . . . . . . . . . . . . . 65
4.1.1 Gaussian Processes and Random Fields . . . . . . . . . . . . . . . . . 65
4.1.2 Covariance Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.3 Multivariate Random Fields and Cross-Covariances . . . . . . . . . . 69
4.2 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 The Matérn Cross-Covariance Model . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Simulation Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.1 A Rough Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.2 Marginal Smoothness Parameters . . . . . . . . . . . . . . . . . . . . 81
4.5.3 Colocated Correlation Coefficient . . . . . . . . . . . . . . . . . . . . 82
4.5.4 Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 5: Application in PM
2.5
Estimation 89
5.1 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.1 Study Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.2 Satellite Aerosol Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.3 EPA Particulate Matter Data . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1.4 gridMET Meteorological Data . . . . . . . . . . . . . . . . . . . . . . . 94
5.1.5 Data Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.6 Univariate Direct Estimation . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.7 Multivariate Direct Estimation . . . . . . . . . . . . . . . . . . . . . . 97
5.1.8 Multi-stage Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1.9 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.1 Univariate PM
2.5
Estimation . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.2 Multivariate PM
2.5
Estimation . . . . . . . . . . . . . . . . . . . . . . 106
vii
5.2.3 Multi-stage PM
2.5
Estimation . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Chapter 6: Conclusion 118
6.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Future Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 123
viii
List of Tables
2.1 Summary of machine learning methods, MISR predictors, and best test R
2
for predicting each exposure of interest. . . . . . . . . . . . . . . . . . . . . . 19
2.2 Characteristics of the study population . . . . . . . . . . . . . . . . . . . . . 24
2.3 Air pollution single-pollutant effect estimates by outcome adjusted for
demographic factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
S2.1 Spearman correlation coefficients for annual means of air pollution
exposures during the three CHS follow-up periods (2007–2012). . . . . . . . 30
S2.2 Mean (SD) of FEV
1
and FVC by gender and race/ethnicity. . . . . . . . . . . 31
S2.3 Effect estimates of biological characteristics on lung function in mixed
models without air pollution exposure. . . . . . . . . . . . . . . . . . . . . . 31
3.1 Median test R
2
across machine learning methods, AOD sources, and
samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Univariate PM
2.5
estimation models using only MAIAC AOD. . . . . . . . . 96
5.2 Multivariate PM
2.5
estimation models using MAIAC AOD and gridMET
meteorology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Performance of univariate PM
2.5
models using only MAIAC AOD,
reported as median cross-validation R
2
and RMSE. . . . . . . . . . . . . . . 104
5.3 Performance of univariate PM
2.5
models using only MAIAC AOD,
reported as median cross-validation R
2
and RMSE. . . . . . . . . . . . . . . 105
5.4 Estimates for the parameter b
1
and the length-scale parameter l
1
for
MAIAC AOD in the three univariate CDR models fitted. . . . . . . . . . . . 105
ix
5.5 Performance of multivariate PM
2.5
models incorporating meteorological
variables in addition to MAIAC AOD, reported as median cross-validation
R
2
and RMSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Prediction performance comparing multi-stage machine learning models
with AOD-calibrated PM
2.5
to models using AOD directly, measured in
test R
2
and test RMSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
x
List of Figures
2.1 Map of the Children’s Health Study participants who were recruited in
Santa Barbara (left) and other communities (right). . . . . . . . . . . . . . . 16
2.2 Relative importance of the ten most important features in model predicting
PM
2.5
using gradient boosting. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Relative importance of the ten most important features in models
predicting SO
2
4
(top left), NO
3
(top right), EC (bottom left), and
dust (bottom right). Feature importance were calculated using relative
influences in GBM models (PM
2.5
, SO
2
4
, and EC) [115] and global
sensitivity analysis in SVM models (NO
3
and dust) [116]. SSR: surface
shortwave radiation; ERC: energy release component. . . . . . . . . . . . . . 21
2.4 Maps of mean MISR-derived PM
2.5
(μg/m
3
) from 2007 through 2012 over
Santa Barbara (left) and other communities in Southern California (right). . 22
2.5 Maps of mean MISR-derived SO
2
4
(top left), NO
3
(top right), EC (bottom
left), and dust (bottom right) from 2007 through 2012, all measured
in μg/m
3
over Santa Barbara (left in each) and other communities in
Southern California (right in each). . . . . . . . . . . . . . . . . . . . . . . . . 22
S2.1 Maps of EPA PM2.5 monitoring sites (left) and CSN monitoring sites
(right) from 2000 to 2018. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
S2.2 Maps of EPA PM2.5 monitoring sites and the MISR pixels within 4.4 km of
each site, in Santa Barbara (left) and other communities (right). . . . . . . . 32
S2.3 Boxplots of test R
2
from 20 iterations of each machine learning method
applied to predicting PM
2.5
speciation, in AOD-products (coral) and
AOD-mixtures (teal) models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Map of the study region with all PM
2.5
monitors (left) and only those in
Kuwait (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
xi
3.2 Areas within each country where temporal and spatial autocorrelations
for MAIAC AOD were evaluated (shaded pink) and the PM
2.5
monitors
(yellow circles) in each country. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Validation of AERONET AOD interpolated to 550 nm versus MAIAC
AOD (left), MISR AOD raw (center), and MISR AOD (right) with 1-to-1
line (dotted) and correlation line (red). . . . . . . . . . . . . . . . . . . . . . . 47
3.4 The top-10 most important variables for the random forest overall models
using MAIAC AOD (left) and MISR AOD (right), as measured by increase
in MSE when excluded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Test R
2
across different settings: overall and regional models, MISR and
MAIAC models, and models with and without AOD variables. . . . . . . . 50
3.6 The top-10 most important variables for the random forest models based
on monitors in Kuwait using MAIAC AOD (top left) and MISR AOD (top
right) and in the U.A.E. using MAIAC AOD (bottom left) and MISR AOD
(bottom right), as measured by increase in MSE when excluded. . . . . . . . 51
3.7 Autocorrelation functions for PM
2.5
(left) and median ACFs for MAIAC
AOD (right) at different sites in Kuwait and the U.A.E. . . . . . . . . . . . . 53
3.8 Daily semivariograms for MAIAC AOD in Kuwait and the U.A.E. (grey
lines in top row) and median semivariograms (red lines in both top and
bottom rows). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
S3.1 Out-of-region PM
2.5
predictions for Kuwait-trained models over the
U.A.E. (left column) and out-of-region PM
2.5
predictions for U.A.E.-
trained models over Kuwait (right column), by MAIAC and MISR AODs
(top and bottom rows, respective). . . . . . . . . . . . . . . . . . . . . . . . . 62
S3.2 Autocorrelation functions for ERA5 meteorological variables with high
importance in the random forests models. The Kuwait City and Central
monitors shared the same MERRA-2 pixel and thus the same ACF. . . . . . 63
S3.3 Autocorrelation functions for MERRA-2 assimilated aerosol extinction
variables, including dust, sulfate, black carbon, and organic carbon. The
Kuwait City and Central monitors shared the same MERRA-2 pixel and
thus the same ACF. Organic carbon was not found to be useful in any
models and was excluded in the final PM
2.5
models. . . . . . . . . . . . . . . 64
4.1 Sample univariate Gaussian processes under Matérn covariances with
different values ofn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
xii
4.2 Simulated examples with different values ofn
1
andn
2
. . . . . . . . . . . . . 76
4.3 Simulated examples with different values ofr
12
(4.3a, 4.3b), and fors
2
1
and
s
2
2
(4.3c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Simulation design where each field is a 101 101 grid. Modeling is
restricted to observations that fall within the yellow square in fieldY(s).
The sampledY(s)points (red) are matched with points inX(s)that fall
within their corresponding buffers (blue). . . . . . . . . . . . . . . . . . . . . 78
4.5 Boxplots of r for training and test sets for linear models (LM), generalized
additive model (GAM), and our proposed model (CDR). . . . . . . . . . . . 80
4.6 Results of evaluating our model in bivariate fields with half-integer values
of(n
1
,n
2
) while keepingn
12
= (n
1
+n
2
) /2. . . . . . . . . . . . . . . . . . . . 81
4.7 Boxplots of r for training and test sets for linear model (LM), generalized
additive model (GAM), and our proposed model (CDR) across a subset of
simulation parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8 Boxplots of r for training and test sets for linear model (LM), generalized
additive model (GAM), and our proposed model (CDR) across a subset of
simulation parameters, where missingness inX(s) was introduced. . . . . 87
S4.1 Samples of successfully retrieved AOD data on two consecutive days in
October 2018 over southern California. Los Angeles county is outlined in
yellow and the red dots are PM
2.5
monitors. Although it was during the
fall in a more temperate region, successful retrieval is not guaranteed. . . . 88
5.1 Comparison of monthly average MAIAC AOD retrieval success rates in
2018 for Kuwait, Los Angeles county, and Ulaanbaatar municipality. . . . . 91
5.2 Map of PM
2.5
monitors in south Los Angeles county (red triangles),
MAIAC pixels (left panel), 1 km coincident MAIAC pixels (blue squares),
gridMET pixels (right panel), and 4 km coincident gridMET pixels (green
squares). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 EPA PM
2.5
monitors (red dots) in Los Angeles county (blue outline) from
2000–2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Sample of MAIAC AOD observations and PM
2.5
measurements (triangles)
in Los Angeles county on twelve days in 2018. . . . . . . . . . . . . . . . . . 103
xiii
5.5 Distance weighting curves for AOD in the three univariate CDR models,
with the plain covariance function on the left and adjusting for b
1
on the
right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Partitions of the data for those with 1 km coincident AOD, 10 km
coincident AOD, and those whose nearest AOD pixel is more than 10 km
away. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7 Variable importance for the final random forest model estimating PM
2.5
;
permuted importance is measured by percentage increase in MSE when
each variable is excluded from the model. . . . . . . . . . . . . . . . . . . . . 109
S5.1 Sample of several MISR retrieval “paths” over North America. . . . . . . . . 114
S5.2 Example of a MISR path that does not fully cover the target area—Los
Angeles county. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
S5.3 Sample of different air pollutants monitored by the EPA and their linear
relationship with MAIAC AOD (550 nm). . . . . . . . . . . . . . . . . . . . . 115
S5.4 Prediction performance of univariate models estimating PM
2.5
, grouped
by the coincident MAIAC AOD subsets and the number of MAIAC AOD
pixels for the CDR model. LM: linear model; GAM: generalized additive
model; CDR: covariance-based distance-weighted regression; “-C”: 1 km
coincident AOD subset; and “-M”: missing coincident AOD. . . . . . . . . . 116
S5.5 Prediction performance of multivariate models estimating PM
2.5
. LM:
linear model; GAM: generalized additive model; GAM-F: fully smoothed
(meteorological) GAM; CDR: covariance-based distance-weighted regres-
sion; and CDR-F: fully distance-weighted (meteorological) CDR. . . . . . . . 117
S5.6 Proportions of the overall data where the nearest available AOD is not 1
km coincident, up to 50 km. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
xiv
Abstract
Estimating air pollution exposure, notably particulate matter with aerodynamic diameter
less than 2.5 μm(PM
2.5
), using statistical models is a critical component in epidemiological
research when the true exposure was not measured in situ. Earth-orbiting satellite instru-
ments have provided daily atmospheric aerosol data globally (measured as aerosol opti-
cal depth—AOD), which have been a great addition to PM
2.5
prediction studies. There
remain serious challenges to using satellite AOD, such as data missingness and spatial
misalignment. We demonstrated a common analysis pipeline where PM
2.5
concentrations
were estimated using AOD and then used as exposures in a cross-sectional study to as-
sess its impact on children’s lung functions in Southern California. We further examined
spatial and temporal autocorrelation trends for PM
2.5
and AOD in two Middle Eastern
regions to investigate possible explanations for different prediction performances, given
the same models and predictors. We proposed a novel approach to modeling PM
2.5
that
sought to overcome these modeling challenges, particularly missingness. Relying on the
Gaussian random fields framework, we constructed a regression model using covariance
xv
functions to weight contribution of nearby AOD data by their distances to PM
2.5
mon-
itors. We simulated bivariate random fields under different settings of surface smooth-
ness and spatial correlation and compared our covariance-based distance-weighted re-
gression (CDR) model against the linear model and the generalized additive model. We
further compared these models in the task of modeling PM
2.5
in Los Angeles county us-
ing satellite AOD and meteorology. In simulation, the CDR model outperformed the
linear and generalized additive models in recovering the correlation between the input
and outcome random fields, especially when the input field was less smooth. In modeling
PM
2.5
, the CDR model also showed better prediction than the standard models, although
the improvement was reduced when meteorological data were added to the models. In
multi-stage modeling with machine learning, the CDR model improved the final predic-
tion modestly. Our proposed model overcame the data missingness problem and used
larger sample sizes by relaxing the coincident AOD data requirement, while improving
the overall prediction performances. In regions where PM
2.5
concentrations were less
inconsistently monitored, our model could preserve the data sample sizes when the co-
incident AOD data was missing yet nearby AOD data were sufficiently available. We
did not explicitly model spatial trends in our regression model and future work could in-
tegrate the covariance-based distance-weighting term into more sophisticated methods,
including the generalized additive models.
xvi
Chapter 1
Introduction
In this introductory chapter, we surveyed epidemiological studies on the health effects of
air pollution, specifically particulate matter (PM) with aerodynamic diameter less than 2.5
μm (PM
2.5
). We also reviewed common approaches to modeling air pollution, from classi-
cal geostatistical methods to more current machine learning methods. We then identified
two challenges in modeling PM using satellite aerosol data that motivated our proposed
method.
1.1 Health Effects of Air Pollution
The harmful effect of fine particulates (including PM
2.5
and particulate matters with aero-
dynamic diameter less than 10 μm—PM
10
) garnered great public attention in the 1990s
following several landmark studies that associated them with mortality. Schwartz and
Marcus reported on the likely causal relation between particulates and daily mortality
in London [1]. These findings were replicated in various communities in the United
States, including Philadelphia and the “Harvard Six Cities” [2–4]. Subsequent studies
1
in the next two decades confirmed the correlation between particulates and all-cause and
cause-specific mortality, including one study that found decreases in PM
2.5
associated
with improved mean life expectancy across the U.S. [5–8].
Additional studies, including several that analyzed data from the American Cancer
Society, identified the significant associations between PM
2.5
and cardiopulmonary and
respiratory mortality, including lung cancer mortality [5–7, 9–11]. Altogether, these stud-
ies demonstrated at least two of Hill’s criteria for causation: consistency (the trends of
association have been observed in different populations) and coherence (cause-specific
mortality has largely been related to the cardiovascular and respiratory system) [12]. The
Harvard Six Cities study also led to the U.S. Environmental Protection Agency (EPA)
adopting a new standard for PM
2.5
[13].
Apart from lung cancer mortality, attention has also been given to studying the effect
of particulates on lung function, typically measured in forced expiratory volume in 1
second (FEV
1
) and forced vital capacity (FVC). Reduced lung function in both children
and adults of all ages has been associated with all-cause and cardiovascular mortality [14–
17]. Beyond mortality, reduced lung function and 2poor lung function growth trajectories
have also been associated with chronic illnesses such as asthma, persistent wheezing, and
chronic obstructive pulmonary disease (COPD) [17–20].
Several studies have related lung function with PM
2.5
and other pollutants, including
PM
10
, nitrogen dioxide (NO
2
), nitrogen oxides (NO
x
), ozone (O
3
), sulfur dioxide (SO
2
),
and black carbon. Earlier studies have typically relied on air pollution data from local
air quality monitors: investigators in the Children’s Health Study in southern California
and in Detroit established new monitors in the communities of interest [21–23], while
2
others employed personal PM
2.5
monitors [24]. Otherwise, researchers were restricted
to communities with existing monitors [25, 26]. Several studies have also incorporated
atmospheric dispersion models such as CALINE4 [27–29].
There have been a growing number of studies that relied on prediction models to es-
timate PM
2.5
and other pollutants before relating them to lung function outcomes [29–
35]. These studies have largely predicted air pollutant using land use regression, al-
though newer studies have also taken advantage of satellite-observed aerosols and ma-
chine learning methods to improve PM
2.5
estimation [33, 36, 37].
1.2 Modeling Air Pollution
1.2.1 Current Modeling Approaches
Due to its epidemiological impact, PM
2.5
has been the subject of extensive modeling in
recent years. The task of estimating PM
2.5
concentrations where monitors do not exist
has been tackled by a variety of spatial statistical methods and different sources of data.
Land use and meteorological data are the two most common additions to models of PM
2.5
and other air pollutants, both due to their wide availability as well as their strong associ-
ations with air pollution. Although geographically and temporally more abundant than
air quality monitors, meteorological data from weather stations typically still require a
certain amount of interpolation to align with air quality monitors for modeling [38–40].
Others have relied on data assimilation systems such as the European Reanalysis (ERA)
[41], the North America Regional Reanalysis (NARR) [42], and the North American Land
3
Data Assimilation System (NLDAS) [43, 44], which aggregate observed data from a large
network of monitors and, through physical models, provide spatially and temporally
dense gridded meteorological fields [45–47].
More recently, earth-orbiting satellite instruments such as the MODerate resolution
Imaging Spectroradiometer (MODIS) and the Multi-angle Imaging SpectroRadiometer
(MISR) have made available atmospheric aerosol data observed globally. Capturing and
processing the amount of reflected light, these satellite instruments estimate the amount
of aerosol from the earth surface to the top of the atmosphere, called total column aerosol
optical depth (AOD). Updated algorithms have made AOD data available at resolutions
as high as 1 km and at daily or almost daily frequency [48, 49]. These advancements in
remote sensing data have enabled the development of models that can predict air pollu-
tants at finer spatial scales by utilizing the high resolution of newer satellite AOD data [36,
50–53]. Most recent modeling efforts have integrated some or all three types of covariates
(AOD, land use, and meteorology) to build more powerful PM
2.5
estimating models [51,
54–57].
Geostatistical methods that predict values at unobserved locations by weighting the
distance between observed points (inverse distance weighting) or quantifying their un-
derlying spatial autocorrelation (kriging) were traditionally used to interpolate point-
based air quality measurements as they are computationally straightforward [38, 54, 58].
These methods require spatially dense sampling to achieve good performance and are
sensitive to the number as well as the spatial distribution of observed points. Air qual-
ity monitors, however, are not inexpensive to install and are typically located in urban
4
areas due to greater interest in automotive- and industrial-related air pollution. There-
fore, these interpolation methods are not well suited beyond localized models in densely
monitored regions [59].
Land use regression (LUR) models, which incorporate external data such as land use
and meteorology at air quality measurement locations, have often been limited to ur-
ban areas where data are more readily available and able to capture likely air pollution
sources such as highways or industrial lands [31, 52, 55, 60–63]. LUR models are typically
formulated as linear regression, which ignores potentially complex higher order or inter-
active terms. Generalized additive models (GAMs) have been very popular extensions
of LUR as they feature flexible smoothed terms for spatial as well as temporal trends [62,
64–67]. Spatial autoregressive models such as the simultaneous autoregressive (SAR) and
conditional autoregressive (CAR) models are very powerful in explaining spatial auto-
correlation [68], but are usually applied to aggregate, or areal data. Furthermore, their
computational complexity often makes them ill-suited for large-scale air pollution mod-
els when other less expensive methods achieve excellent results, although there has been
effort to incorporate them into multi-stage models [69, 70].
Recently, machine learning and deep learning methods have been widely adopted to
produce robust air pollution predictions [71–74]. Non-linear learners such as gradient
boosting, random forest, and neural network have proved suitable in relating quantities
such as PM
2.5
to satellite AOD due to their well-documented non-linear relationship [36,
50, 65, 66]. More recently, multi-stage models in which one or several predictors are first
“calibrated” prior to being integrated into the final predictive models [40, 53, 75–77].
5
A sophisticated proposal by Lin et al. (2020) integrated statistical theories along with
deep learning techniques to construct a powerful model cleverly named DeepLATTE
[78]. The proposed framework employed L
1
regularization to filter out weak input fea-
tures, generated a condensed feature embedding, enforced a similarity assumption for
spatiotemporal embeddings that were close together, and incorporated spatial autocor-
relation via assessing semivariograms. Its prediction over a 2-year study period proved
superior to those from standard approaches such as ordinary kriging, random forest, and
deep air learning. The model leveraged well the large quantity of data available in south-
ern California, both in land use as well as meteorological features. The complex loss
function (which is composed of different losses at each modeling task) demonstrated the
multi-faceted challenge of modeling air pollution.
It is less clear, however, whether the model could achieve similar promising perfor-
mance in regions whose features are more scarce. For example, the United States De-
partment of Veterans Affairs are currently very interested in understanding the types and
quantities of toxic fumes that U.S. soldiers were exposed to while stationed overseas in
places such as Afghanistan or Iraq. However, there were very few air quality monitors in
these locales and the few that were available were less likely to be consistently monitor-
ing over long periods of time. A recent study attempted to predict PM
2.5
over Iraq using
a model trained entirely only on data from 4 monitors in Kuwait and validated only on
data from Kuwait [79]. Similarly, although static features such as those based on land
use data are highly important to these models, how frequently or infrequently the data
is updated could affect how well the model predicts in locations with fast-paced urban
development (e.g., Playa Vista or, more recently, the SoFi Stadium in Inglewood).
6
Nevertheless, there remain two persistent challenges in the current modeling methods
when integrating satellite AOD data with other data sources: missing data and spatial
misalignment.
1.2.2 Satellite Aerosol Data Missingness
While land use and meteorological covariates in air pollution prediction models are gen-
erally complete both spatially and temporally, satellite AOD observations (e.g. from
MODIS and MISR) have consistently been plagued with missing data. This is due to
the satellite retrieval mechanisms’ reliance on reflected light form earth to estimate atmo-
spheric aerosols. There are circumstances that make aerosol retrieval more difficult, such
as cloud cover, overly bright surfaces due to snow cover, deserts, or certain structures
in developed metro areas. Under these and similar challenging circumstances, satellite
algorithms sometimes fail to retrieve and output missing data for the particular grid cells
[77]. In dealing with such missingness, there have been two common approaches: (a) only
use available data (thus excluding air pollution observations where the coincident AOD
is missing) or (b) impute the missing values by spatial interpolation methods or other
gap-filling approaches using surrounding available grid cells [36, 50, 80–82]. The first
approach reduces the sample size of the training models, while the latter introduces un-
certainty to the predicting covariates. In addition, missingness in AOD grid cells is often
not random and likely attributable to physical phenomena that lead to retrieval failure,
as described above [83].
7
1.2.3 Spatial Misalignment
When integrating different spatial data sources, one common challenge is the issue of
spatial misalignment. Gotway and Young (2002) and Banerjee (2015) provide extensive
discussions on this topic [84, 85]. In our context, the challenge is the so-called change of
support problem (COSP): relating data at different native resolutions. An example would
be integrating Multi-Angled Implementation of Atmospheric Correction (MAIAC) AOD
at 1-km resolution with gridMET meteorological data at 4-km [46, 48]. Depending on the
prediction needs, one could upscale the meteorology to 1-km, downscale the AOD to 4-
km, or align them to another desired grid. Regardless, the task is typically one of spatial
regression where the result is either or both of the data being predicted at the desired grid.
Again, this approach implicitly introduces uncertainty into the model prior to estimating
the intended measurements of interest (e.g., PM
2.5
).
8
Chapter 2
Current Approaches in PM
2.5
Estimation
We incorporated the fractionated aerosol measurements from the Multi-angle Imaging
SpectroRadiometer (MISR) instrument to build prediction models for PM
2.5
and its spe-
ciation (SO
2
4
, NO
3
, elemental carbon, and dust) in California. We then predicted these
quantities as exposures for Cohort E of the Children’s Health Study and conducted cross-
sectional analysis on the effect of these pollutants on the subjects’ lung function, as mea-
sured in forced expiratory volume in 1 second (FEV
1
) and forced vital capacity (FVC).
1
2.1 Background
Numerous studies have examined the association between exposure to traffic-related air
pollution and children’s respiratory health [27, 28, 32, 86]. Long-term exposure to air pol-
lution, indicated by concentrations of nitric oxides (NO, NO
2
, NO
x
), ozone (O
3
), partic-
ulate matter with aerodynamic diameter less than 2.5 μm(PM
2.5
), and particulate matter
with aerodynamic diameter less than 10 μm(PM
10
), has been shown to lead to reduced
1
This chapter was previously published in Remote Sensing in March 2020 [36].
9
lung development in children [21]. Fortunately, decreases in air pollution in Southern
California over the past 17 years have led to significant reductions in these detrimental
effects [23].
The air pollution exposures used in the aforementioned studies were estimated from
concentrations measured at national- and state-operated fixed-site monitors [87]. For ex-
ample, in a longitudinal assessment of air quality and lung development, Gauderman et
al. (2015) [23] used concentrations of nitrogen dioxide (NO
2
), O
3
, PM
2.5
, PM
10
from one
monitor within each community to determine exposures. With this approach, the 120 to
300 study subjects residing in each community were assigned the same exposure. Spa-
tial statistical techniques such as kriging, smoothing, and land use regression have been
used to incorporate additional information (e.g., traffic, population density, elevation,
land cover, and other geographic data) to characterize the spatial relationships in fixed-
site monitoring data and interpolate concentrations to the unmonitored locations where
there are health data [60, 88]. While these approaches are valuable for generating expo-
sures with greater spatial coverage, it has been shown that if their prediction performance
is poor, subsequent epidemiological studies can yield severe biases and underestimation
of standard errors in the health effects estimates [54].
Advances in using satellite observations of aerosol optical depth (AOD) to estimate
ground-level concentrations of particulate matter (PM) air pollution have been extremely
valuable in improving the spatial and temporal coverage of exposure estimates [89–93].
Among the satellite instruments most commonly used for PM estimation are the Multi-
angle Implementation for Aerosols (MAIAC) of the Moderate Resolution Imaging Spec-
troradiometer (MODIS) on-board the NASA Earth Observing System (EOS) Terra and
10
Aqua satellites [48], and the Multiangle Imaging SpectroRadiometer (MISR) on-board the
Terra satellite [94]. Recent algorithms applied to observations from these instruments
provide global, near-daily AOD at a spatially resolved grid resolutions (1 km, 4.4 km)
[48, 49].
Some studies have combined AOD from MODIS and MISR to derive PM
2.5
concentra-
tions [95] and more recently PM
2.5
speciation concentrations [96]. MISR, given its configu-
ration of nine cameras and four spectral bands, has the capability of differentiating aerosol
size and type resulting in fractionated AOD [97]. In a recent study over Southern Califor-
nia, we reliably estimated PM
2.5
with MISR 4.4-km resolution AOD small+medium, and
PM
10
with AOD large using generalized additive models (GAMs) [66]. The MISR AOD-
derived PM concentrations were well correlated (confirmed by leave-one-site-out cross
validation, CV) with EPA monitoring site data (PM
2.5
CV r = 0.71, PM
10
CV r = 0.66). In
the same region, speciated PM
2.5
(sulfate, SO
2
4
; nitrate, NO
3
; organic carbon, OC; and
elemental carbon, EC) were estimated using GAMs from 8 MISR component fractions
combined with meteorology and geographic characteristics [98].
In a simulation study, high-resolution exposure estimates derived from satellite AOD
were found to produce less biased acute and chronic health effects estimates with smaller
standard errors than did exposure estimates derived from kriging PM
2.5
concentrations
from fixed-site monitors [54]. Satellite-derived PM
2.5
concentrations have been instru-
mental in studies of the global burden of disease [99, 100]. A few epidemiological studies
of smaller cohorts have used satellite-derived PM
2.5
to estimate residential exposures in
longitudinal children’s health effects [33, 61, 101].
11
In this study, we derived daily PM
2.5
and PM
2.5
speciation (SO
2
4
, NO
3
, EC, dust)
exposures from 2000–2018 over the state of California by applying machine learning ap-
proaches to ground-level air quality measurements linked with the high-dimensional 4.4-
km MISR AOD products and mixtures. Estimated annual average concentrations were
then assigned to the residences of children in 8 Southern California communities to ex-
amine the chronic effects of exposure to PM
2.5
and the aforementioned PM
2.5
components
on lung function. This study is unique in that it is the first of its kind to examine the dif-
ferential effects of satellite-derived PM
2.5
speciation on children’s respiratory health.
2.2 Materials and Methods
2.2.1 Particulate Matter Measurements
The United States Environmental Protection Agency (EPA) provides data on PM
2.5
con-
centrations from outdoor monitoring sites across the US through their Air Quality Sys-
tem (AQS) [102], and PM
2.5
speciation concentrations including ions, carbons, and metals
through their Chemical Speciation Network (CSN) [103] (Figure S2.1). The AQS PM
2.5
data we used include daily averages of hourly measurements or 24-hour integrated mea-
surements from Federal Equivalent and Reference Method (FEM/FRM) instruments. The
CSN sites measure 24-hour chemical composition on a 1-in-3- or 1-in-6-day sampling
schedule. In this study, we used all available AQS and CSN data in California begin-
ning in March 2000, the earliest date for which MISR AOD data are available, and ending
in July 2018 for PM
2.5
speciation and in December 2018 for PM
2.5
mass. Among PM
2.5
12
species, we focused on predicting SO
2
4
ion, NO
3
ion, elemental carbon (EC), and dust,
which was calculated as a linear combination of aluminum (Al), calcium (Ca), iron (Fe),
silicon (Si), and titanium (Ti) [104]:
Dust= 2.2 Al+ 2.49 Si+ 1.63 Ca+ 1.94 Ti+ 2.42 Fe.
This definition of dust pertains to fugitive geological materials, and has been shown to
have a temporally stable compositional source profile over California [105].
2.2.2 MISR Aerosol Optical Depth Data
The MISR instrument has been collecting data from nine camera angles and four spec-
tral bands since early 2000. Due to its narrower retrieval swaths, the instrument over-
passes any given location every 3–5 days (between 10:00 and 13:00 local time) instead of
daily as is the case with MODIS/MAIAC. The latest version (V23) of the retrieval algo-
rithm re-processed the entire MISR mission at a 4.4-km spatial resolution (from 17.6-km
resolution) [49] with improved retrievals over water [106]. In addition to total column
AOD, MISR also characterizes the size, shape, and type of aerosol particles via these frac-
tionated measures: small, medium, and large AOD (amount of particles of each size),
nonspherical AOD (amount of nonspherical particles), and absorption AOD (amount of
light-absorbing particles). These AOD features are hereafter referred to as AOD products.
In its auxiliary data products, MISR further provides 74 AOD mixtures, which are
more fine-grained groupings of aerosol particles characteristics [97]. Specifically, AOD
mixtures numbered 1–30 are made up of spherical, non-absorbing components, mixtures
31–50 contain spherical, absorbing components, and mixtures 51–74 contain spherical and
13
nonspherical dust analogues. The MISR particle properties provide robust aerosol-type
classification for distinguishing aerosol mass types including polluted, smoky, maritime,
and dusty conditions [97, 107]. Details of processing MISR data, particularly AOD mix-
tures, are documented in our previous paper [50].
2.2.3 Meteorological Data
Similar to other studies [54, 61, 66, 95, 108], we used meteorological data including tem-
perature, relative humidity, wind speed, and wind direction to better inform the associ-
ation between ground-monitored air pollutants and satellite-observed AOD. The Clima-
tology Lab at the University of Idaho provides daily meteorological data, called gridMET,
in the contiguous United States from 1979–yesterday [109]. This dataset has been exten-
sively validated [46] and, at 4-km resolution, is particularly useful for our study. The
gridMET data also provide surface shortwave radiation (SSR–the amount of shortwave
radiation that reaches the earth), whose negative correlation with aerosol emissions has
been documented by Smith et al. (2016) [110] and Freychet et al. (2019) [111].
2.2.4 Health Data
Since its inception in the early 1990s, the Southern California Children’s Health Study
(CHS) has enrolled over 11,000 children in a series of five cohorts. In this study, we fo-
cused on the most recent cohort that began in 2003, enrolling approximately 3,000 chil-
dren at age 6–7 years, and followed until 2012 when they were 15–16 years old. These
14
children resided and went to school in eight communities in the greater Los Angeles, Cal-
ifornia area: Anaheim, Glendora, Long Beach, Mira Loma, Riverside, Santa Barbara, San
Dimas, and Upland (Figure 2.1).
From 2007–2012, pulmonary function tests were conducted on each child by trained
respiratory staff, measuring FEV
1
and FVC with pressure transducer-based spirometers
(ScreenStar Spirometers, Morgan Scientific, Haverhill, Massachusetts, USA). A written
questionnaire was also administered to obtain demographic information including age,
sex, self-identified race and ethnic background, parental education, occurrences of acute
respiratory illness, exercise, tobacco-smoke exposure (personal smoking or environmen-
tal), and housing characteristics (air conditioning, age of house, presence of mildew, pets
in the home). Ethnic background in the CHS specifically relates to Hispanic ancestry,
identifying Caucasian subjects with Hispanic and non-Hispanic ethnicity [112]. Study
protocols were approved by the Institutional Review Board at the University of Southern
California (USC), and additional details of CHS community and subject selection have
been previously reported [113, 114].
2.2.5 Exposure Estimation Methods
We expand upon our previous work [50, 66] to include MISR aerosol properties of absorp-
tion (absorbing or non-absorbing), shape (spherical or nonspherical), and type provided
by 74 weighted aerosol optical depths (mixtures) [97] to predict PM
2.5
and PM
2.5
SO
2
4
,
NO
3
, EC, and dust. We matched daily PM
2.5
and PM
2.5
speciation measurements from
ground monitors to the nearest available MISR pixel within 4.4 km (Figure S2.2) and then
15
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ● ●●● ● ● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Santa Barbara Anaheim Glendora Long Beach Mira Loma Riverside San Dimas Upland
Figure 2.1: Map of the Children’s Health Study participants who were recruited in Santa
Barbara (left) and other communities (right).
further matched them to the nearest gridMET pixel within 4 km. As some PM
2.5
monitors
are close to each other, it is possible they share the same MISR or gridMET pixel. In these
instances, we used an algorithm to pair the shared MISR/gridMET pixel to the nearest
monitor, and the remaining monitors were paired with the next available MISR/gridMET
pixel within 4.4/4 km. If there were no other available MISR/gridMET pixels within an
appropriate distance, the PM
2.5
measurement on that day for that monitor was removed
in order to avoid duplicating AOD and meteorological data in the dataset.
We trained the PM
2.5
models on 70% of the data using 5-fold CV and assessed per-
formance of the best model on the remaining 30%. We also trained the PM
2.5
speciation
models on 70% of the data with 5-fold CV and tested on 30%, but repeated the process
20 times to assess model stability due to much smaller sample sizes. Five machine learn-
ing methods were considered: Ridge regression, Least Absolute Shrinkage and Selection
16
Operator (LASSO), Gradient Boosting (GBM), Random Forests (RF), and Support Vector
Machines (SVM), all within a regression setting. Inputs to the models were meteorology
and either the MISR AOD products or the 74 MISR AOD mixtures. The optimal model for
each pollutant was chosen based on its test R
2
as the primary metric and its test RMSE as
the secondary metric. We further supplemented the model with geospatial (coordinates
of MISR pixels projected to UTM zone 11) and temporal (Julian date and month) predic-
tors. The best predicting model for each pollutant was trained on the full dataset prior to
estimating exposures for the epidemiological assessment.
2.2.6 Epidemiological Methods
To examine the association between air pollution exposure and lung function during the
children’s critical period of development, we focused on pulmonary function tests (FEV
1
and FVC) taken at ages 15–16 (2011–2012). During each assessment visit, the study partici-
pants reported their current addresses, which were geocoded. We identified MISR aerosol
optical depth and gridMET meteorological data within 4.4 km of each study participant’s
residence for the 12 months prior to their pulmonary function test, and predicted PM
2.5
,
NO
3
, SO
2
4
, EC, and dust concentrations spatially and temporally specific to each child.
The 12-month means of these air pollutant estimates were assigned as exposures of inter-
est for each child.
We fitted single-pollutant models to examine the effects of each predicted air pollu-
tant exposure on lung function. Multi-pollutant models were also fitted to assess whether
17
multiple predicted exposures better informed the associations of interest. Using mixed-
effects models, we adjusted for study community with a random intercept and for biolog-
ical characteristics such as age, gender, height, BMI, and race/ethnicity as fixed effects.
Based on previous studies of the CHS, log transformation of FEV
1
and FVC as well as
quadratic terms for height and BMI were considered [23, 27, 28, 32]. We separately fit-
ted similar models with central-site PM
2.5
to compare with MISR-derived PM
2.5
models.
Akaike information criterion (AIC) was used as the main metric for model comparison
and Bayesian information criterion (BIC), which favors more parsimonious models, as
the secondary metric. Generalized variance inflation factor was calculated to assess po-
tential collinearity among the predictors (using GVIF 10 as the cutoff).
2.3 Results
2.3.1 Exposure Estimation
From March 2000 through December 2018, there were 2828 days where MISR had com-
plete retrievals of AOD products that could be matched to EPA-PM
2.5
stations, resulting
in a dataset of N 56, 000. In the same period, there were 2864 days where MISR had
complete retrievals of AOD mixtures that could be matched to EPA-PM
2.5
stations, re-
sulting in a dataset of N 35, 000. While there were more days with successful AOD
mixtures retrieved, the AOD-products datasets was larger because AOD-products were
successfully retrieved in more locations per day compared to AOD mixtures. In a sim-
ilar time frame (ending in July 2018), MISR had 698 days with AOD-products data and
18
544 days with AOD-mixtures data that were matched to PM
2.5
speciation concentrations
from CSN sites in California, resulting in datasets of sizes N = 1965 for models using
AOD products and N = 1090 for models using AOD mixtures.
Non-linear machine learning methods (GBM, RF, and SVM) generally predicted better
than linear methods (ridge and LASSO) for all exposures of interest and for both AOD-
products and AOD-mixtures models. Although the AOD-mixtures models for PM
2.5
had
a smaller training dataset, the increase in model complexity with 74 AOD mixtures ver-
sus 6 AOD products led to much more time-expensive model fitting. We experimented
with a smaller sample size (N = 8000) but the AOD-mixtures models did not perform
better than the AOD-products models. The GBM model using AOD products had the
best performance for PM
2.5
(R
2
= 0.68) (Table 2.1).
Table 2.1: Summary of machine learning methods, MISR predictors, and best test R
2
for
predicting each exposure of interest.
Exposure ML Method MISR Predictors Best test R
2
Best test RMSE (μg/m
3
)
PM
2.5
GBM AOD Products 0.68 4.75
SO
2
4
GBM AOD Products 0.71 0.69
NO
3
SVM AOD Products 0.71 2.13
EC GBM AOD Products 0.63 0.37
Dust SVM AOD Mixtures 0.53 0.60
The boxplots of test R
2
(Figure S2.3) for PM
2.5
speciation models show that AOD-
products models outperformed AOD-mixtures models and did so consistently (narrower
boxplots) with the exception of dust, where AOD-mixtures models performed better. The
best test R
2
for SO
2
4
and NO
3
was 0.71, EC 0.63, and dust 0.53 (Table 2.1). The most
important variables for PM
2.5
include an interpretable mix of AOD, small and medium
AOD, as well as meteorological variables (surface shortwave radiation, wind speed, and
19
temperature) (Figure 2.2). Similar variables were important for SO
2
4
and NO
3
, but non-
spherical AOD played a larger role in both (Figure 2.3). Interestingly, AODs only ranked
8th and 9th for EC, with meteorology and temporal indicators playing a larger role in
its prediction. Finally, most important for dust were AOD mixtures relating primarily to
dust (mixtures 70 and 53) and non-absorbing (mixtures 4, 12, 13, 19, 21) particles.
20
Month
Temperature
Medium AOD
Longitude
Wind speed
Latitude
Small AOD
Julian date
AOD
SSR
0% 5% 10% 15% 20%
Feature importance
Figure 2.2: Relative importance of the ten most important features in model predicting
PM
2.5
using gradient boosting.
Nonspherical AOD
SSR
Small AOD
Wind direction
Humidity
Large AOD
Medium AOD
Temperature
Julian date
AOD
0% 10% 20% 30%
Feature importance
Temperature
ERC
Julian date
Longitude
Wind speed
Small AOD
SSR
AOD
Nonspherical AOD
Medium AOD
0% 5% 10% 15% 20%
Feature importance
ERC
Medium AOD
AOD
Humidity
Month
Wind direction
Temperature
Julian date
Wind speed
SSR
0% 10% 20% 30% 40%
Feature importance
AOD mixture 41
AOD mixture 19
AOD mixture 49
AOD mixture 13
AOD mixture 4
AOD mixture 53
AOD mixture 21
AOD mixture 12
AOD mixture 38
AOD mixture 70
0% 1% 2% 3%
Feature importance
Figure 2.3: Relative importance of the ten most important features in models predicting
SO
2
4
(top left), NO
3
(top right), EC (bottom left), and dust (bottom right). Feature
importance were calculated using relative influences in GBM models (PM
2.5
, SO
2
4
, and
EC) [115] and global sensitivity analysis in SVM models (NO
3
and dust) [116]. SSR:
surface shortwave radiation; ERC: energy release component.
21
Figure 2.4: Maps of mean MISR-derived PM
2.5
(μg/m
3
) from 2007 through 2012 over
Santa Barbara (left) and other communities in Southern California (right).
Figure 2.5: Maps of mean MISR-derived SO
2
4
(top left), NO
3
(top right), EC (bottom
left), and dust (bottom right) from 2007 through 2012, all measured in μg/m
3
over Santa
Barbara (left in each) and other communities in Southern California (right in each).
MISR AOD-products were used to predict all exposures except for dust, which was
predicted using MISR AOD mixtures. Figures 2.4 and 2.5 show 6-year (2007–2012) means
of the predicted air pollution exposures over greater Los Angeles and Santa Barbara.
Over this period the correlation between annual mean MISR-derived PM
2.5
and PM
2.5
22
speciation with its central-site counterpart were positive and statistically significant, with
Spearman r = 0.68 and p< 0.001 (Table S2.1).
2.3.2 Health Outcomes
There were 1206 children assessed in 2011–2012 with mean age 15.2 (SD = 0.6) years,
whose geocoded addresses were linked with MISR-derived exposure estimates. Sum-
mary statistics of the study population, including gender, height, weight, race/ethnicity,
and study community, are shown in Table 2.2. Both FEV
1
and FVC were significantly
higher among boys compared to girls; mean FEV
1
was 4111 mL among boys and 3304
mL among girls (t-test p < 0.001), mean FVC was 4801 mL among boys and 3752 mL
among girls (t-test p < 0.001). Among race/ethnicity groups, there were significant dif-
ferences in mean FEV
1
and FVC with one-way ANOVA test p = 0.003 and p < 0.001,
respectively (Table S2.2). The associations between biological factors and lung function
remained significant when evaluated in fully adjusted models (Table S2.3).
In single-pollutant mixed models adjusted for biological characteristics, MISR-derived
PM
2.5
, SO
2
4
, and dust were significantly associated with decreases in FEV
1
, and SO
2
4
and NO
3
were significantly associated with decreases in FVC. Notably, central-site PM
2.5
was not associated with either lung function measure (Table 2.3). Among multi-pollutant
models, we did not find any grouping of two or more pollutants to be significantly as-
sociated with FEV
1
or FVC. Furthermore, while certain multi-pollutant models achieved
modest reductions in AIC, none outperformed single-pollutant models in BIC. Log trans-
formation of the lung function measures, with slight improvements in normality on Q-Q
23
Table 2.2: Characteristics of the study population
Mean (SD) N (%)
N 1206 Gender
Lung function Boys 588 (48.8)
Forced Vital Capacity (mL) 4265 (856) Girls 618 (51.2)
Forced Expiratory Volume (mL) 3700 (703) Race/Ethnicity
Exposures (μg/m
3
) White 413 (34.2)
Central site Asian 54 (4.5)
PM
2.5
11.99 (2.9) Black 23 (1.9)
MISR-derived Hispanic 617 (51.2)
PM
2.5
14.22 (2.6) Others 99 (8.2)
Sulfate 1.37 (0.3) Community
Nitrate 2.36 (0.4) Anaheim 113 (9.4)
EC 0.82 (0.1) Glendora 228 (18.9)
Dust 1.12 (0.3) Long Beach 56 (4.6)
Demographics Mira Loma 146 (12.1)
Age 15.24 (0.6) Riverside 138 (11.4)
Height (cm) 166.20 (8.6) San Dimas 170 (14.1)
Weight (kg) 64.41 (15.8) Santa Barbara 186 (15.4)
BMI (kg/m
2
) 23.23 (5.0) Upland 169 (14.0)
Table 2.3: Air pollution single-pollutant effect estimates by outcome adjusted for demo-
graphic factors.
Lung Function Source Exposure Estimate
95% CI p
FEV
1
Central site PM
2.5
41 (161, 80) 0.521
MISR-derived PM
2.5
131 (232,35) 0.013
SO
2
4
158 (273,43) 0.008
NO
3
75 (265, 124) 0.447
EC 161 (446, 128) 0.289
Dust 177 (306,56) 0.011
FVC Central site PM
2.5
21 (162, 119) 0.775
MISR-derived PM
2.5
122 (260, 25) 0.103
SO
2
4
175 (310,29) 0.015
NO
3
212 (391,28) 0.026
EC 218 (547, 106) 0.206
Dust 106 (305, 95) 0.316
24
plots, did not improve model fit for either outcomes. Collinearity was not an issue in any
of the models (all GVIF 3).
Effect estimates in Table 2.3 represent the difference in mean FEV
1
and FVC between
the highest and the lowest exposure level for each pollutant in single-pollutant models.
After accounting for biological and community effects in single-pollutant mixed models,
FEV
1
on average decreased by 131 mL (95% CI:232,35) per 10.7-μg/m
3
increase in
PM
2.5
, by 158 mL (95% CI:273,43) per 1.2-μg/m
3
in SO
2
4
, and by 177 mL (95% CI:
306,56) per 1.6-μg/m
3
increase in dust. Meanwhile, FVC on average decreased by
175 mL (95% CI:310,29) per 1.2-μg/m
3
increase in SO
2
4
and by 212 mL (95% CI:
391,28) per 2.5-μg/m
3
increase in NO
3
.
2.4 Discussion
In this study, we used satellite observations of AOD, characterized by size, shape, and ab-
sorption properties as well as fractionated into 74 mixtures, to estimate PM
2.5
and select
PM
2.5
chemical components. We then incorporated these estimates into an epidemiolog-
ical assessment of their association with children’s lung function. In terms of exposure
estimation, MISR AOD products resulted in better and more robust estimates than did
AOD mixtures, except for dust. Non-linear models (GBM, RF, and SVM) performed bet-
ter than linear models (Ridge and LASSO), which was consistent with previous studies
where linear models were inadequate in explaining the relationship between AOD and
ground-monitored PM [50, 65, 66]. Although MISR aerosol data have coarser temporal
and spatial resolution compared to MAIAC (every 3–5 days vs. daily and 4.4 km vs. 1 km,
25
respectively), our model achieved high prediction performance (Table 2.1) using MISR-
specific data products on size, shape, and absorption, which proved vital in the prediction
models. At least two MISR AOD products were among the five most important features
for predicting PM
2.5
, SO
2
4
, and NO
3
, and the ten most important features for predicting
dust were all AOD mixtures (Figures 2.2 and 2.3).
Our PM
2.5
prediction performance was similar to those by Sorek-Hamer et al. (2013)
[65], who modeled PM
2.5
using AOD data from MODIS (Collection 5 Level 2) and the
Ozone Monitoring Instrument (OMI) at 10-km resolution over the Central Valley in Cal-
ifornia. In Southern California, we improved upon previous work that predicted PM
2.5
,
SO
2
4
, and NO
3
by Franklin et al. (2017) [66], whose meteorological data from NOAA
weather stations did not provide the spatial coverage of gridMET data. Surface short-
wave radiation provided by gridMET was also among the most important predictors for
PM
2.5
, NO
3
, and EC (Figures 2.2 and 2.3). Our PM
2.5
, SO
2
4
, NO
3
, and EC models per-
formed comparably to those by Meng et al. (2018) [98], who reconstructed fractional AOD
using the AOD mixtures (V22) while we relied on MISR AOD products (V23).
One limitation in our PM
2.5
speciation prediction models is the scarcity of data. As the
number of CSN sites in California increased from 3 in 2000 to 19 in 2013 and decreased
to 16 in 2018 (California PM
2.5
mass sites increased from 95 in 2000 to 157 in 2018), spa-
tial coverage was certainly restricted (Figure S2.1). Furthermore, the locations of these
sparsely available monitors are not necessarily representative of the population density
of Southern California. We used the coordinates of MISR pixels, which lacked a fixed grid,
instead of monitoring sites as geospatial predictors to help mitigate this problem by intro-
ducing additional spatial variability. We also attempted prediction models for PM
10
and
26
SO
2
(detailed results not reported here) to compare with our previous work over Mon-
golia [50]. While PM
10
models performed about the same, i.e., average test R
2
< 0.10,
SO
2
models over California performed much worse, with average test R
2
< 0.10 (vs.
test R
2
> 0.45 over Ulaanbaatar). Poor prediction performance for SO
2
was likely due
to much lower concentrations of SO
2
in California (mean SO
2
= 2.4 ppb in 2000–2018)
compared to Ulaanbaatar (9.7 ppb in 2008–2017), where SO
2
is the more dominant source
of PM. For the epidemiological purposes of the current study, we focused on models pre-
dicting PM
2.5
and its chemical components.
This study is unique in estimating air pollution exposure specifically to the residence
and follow-up period of each subject. Previously, exposures were assigned using annual
means from one central air pollution monitoring site for each study community [21, 23,
113], even if the children might have lived far away from these sites. Furthermore, the
follow-up period spanned about 211 days, yet the annual means of central-site air pollu-
tants for each community were calculated using a fixed time window. Leveraging MISR
aerosol and gridMET meteorological data, we improved upon these limitations by assign-
ing exposures that were spatially within 4.4 km of where each child lived and temporally
specific to the 12 months prior to each child’s assessment visit. Nevertheless, our expo-
sure prediction models are not without unexplained residual variance; our best models
had CV R
2
from 0.53 (dust) to 0.71 (SO
2
4
, NO
3
). As noted by Alexeeff et al. (2015) [54],
there can be 1-5% upward bias in subsequent health effects estimates when exposure pre-
dictions have performance statistics in the range we observed, and their standard errors
may be underestimated. It is difficult to mitigate these issues due to imperfect exposure
models, but it is worth keeping in mind while interpreting our epidemiological results.
27
While this is not the first study using AOD-derived PM
2.5
concentration in an epi-
demiological context, it is the first examining satellite-derived PM
2.5
speciation. Previous
studies of AOD-derived PM
2.5
include Rice et al. (2016) [33], who found that each 2 μg/m
3
increase in AOD-derived PM
2.5
was associated with a 28 mL (43.9 to 0.2 mL) decrease
in forced vital capacity (FVC) and higher odds of forced expiratory volume in 1 second
(FEV
1
) being less than 80% predicted (OR = 1.41, 1.03 to 1.93). In another study, AOD-
derived PM
2.5
concentrations were associated with an increased rate of asthma onset (HR
= 1.31, 1.28 to 1.33) in Quebec [101].
Similar to previous evaluations of the CHS [21, 23, 32], biological characteristics (age,
gender, race/ethnicity, height, height squared, BMI, and BMI squared) were significantly
associated with both measurements of lung function (Table S2.3). With these adjustments,
several MISR-derived estimates of air pollutants were able to explain the residual differ-
ences in lung function measurements among the children. Another strength of this study
is identifying the effect of specific PM
2.5
chemical components on lung function. While
MISR-derived PM
2.5
was significantly associated with decreases in FEV
1
, its effect, mea-
sured as the difference in FEV
1
between the highest and lowest exposure level for each
pollutant, is smaller than those of SO
2
4
and dust (Table 2.3). Similarly, although FVC was
only marginally statistically significantly associated with MISR-derived PM
2.5
, its associ-
ations with SO
2
4
and NO
3
were clinically significant. In California, secondary aerosols
including nitrate and sulfate have been shown to be the most abundant contributors to
ambient PM
2.5
, with nitrate accounting for as much as 55% of the total mass [117]. Geo-
logic dust can also contribute up to 20% of the mass in summer in more arid regions of
28
Southern California. Importantly, we were able to distinguish that these PM
2.5
species
had differentially stronger associations with children’s FEV
1
and FVC.
Urman et al. (2014) [32], who examined the cross-sectional effect of central-site air
pollution on lung function in the same cohort but at an earlier visit when the children
were at ages 11–12, found central-site PM
2.5
to be significant with both log-transformed
FEV
1
and FVC. In our study, we did not find central-site PM
2.5
to be significant with
either outcomes. We did find MISR-derived PM
2.5
to be significantly associated with log-
transformed FEV
1
, with a similar effect size, but not with log-transformed FVC. Children
exposed to the highest level of MISR-derived PM
2.5
on average were3.54% (95% CI:
6.24%,0.49%) lower in FEV
1
compared to those exposed to the lowest level. In the
same CHS cohort and during the same follow-up period, Franklin and Fruin [28] found
a significant association between NO
x
on FVC when adjusted for traffic-related noise ex-
posure. We found a similarly significant relationship between NO
3
and FVC where an
IQR increase in NO
3
(0.64 μg/m
3
) is associated with 53 mL decrease in FVC (95% CI:
98,7).
2.5 Conclusions
We have shown in this study that MISR AOD observations distinguishing size, shape, ab-
sorption and mixture properties can aid in predicting PM
2.5
and its chemical speciation in-
cluding SO
2
4
, NO
3
, EC, and dust, particularly when supplemented with spatio-temporal
information and high-resolution meteorological data. Machine learning methods such as
Gradient Boosting and Support Vector Machines were more suitable for characterizing the
29
non-linear relationships between air pollutants and AOD. We further showed that differ-
ent MISR-derived PM
2.5
composition, estimated specifically to the residence and follow-
up period of CHS study participants, were able to explain clinically significant differences
in lung function measurements FEV
1
and FVC. This demonstrates that satellite-observed
aerosol data products can be incorporated to strengthen epidemiological studies inves-
tigating the health effects of environmental pollution. Epidemiological assessments will
only be made more viable, particularly as the quality of remote sensing data and estima-
tion models continue to improve and exposure measurement error decreases.
2.6 Supplementary Tables and Figures
Table S2.1: Spearman correlation coefficients for annual means of air pollution exposures
during the three CHS follow-up periods (2007–2012).
PM
2.5
PM
†
2.5
SO
2
4
NO
3
EC Dust
PM
2.5
1 0.68 0.55 0.34 0.29 0.68
PM
†
2.5
1 0.74 0.59 0.15 0.38
SO
2
4
1 0.92 0.29 0.14
NO
3
1 0.37 0.12
EC 1 0.32
Dust 1
Central-site PM
2.5
;
†
MISR-derived PM
2.5
30
Table S2.2: Mean (SD) of FEV
1
and FVC by gender and race/ethnicity.
FEV
1
FVC
Gender
Female 3304 (453) 3752 (537)
Male 4111 (680) 4801 (797)
Race/Ethnicity
White 3746 (665) 4377 (831)
Asian 3492 (635) 3892 (746)
Black 3271 (548) 3777 (683)
Hispanic 3715 (733) 4258 (884)
Others 3634 (688) 4162 (774)
Table S2.3: Effect estimates of biological characteristics on lung function in mixed models
without air pollution exposure.
FEV
1
FVC
Characteristic Estimate 95% CI Estimate 95% CI
Age (year) 121 (81, 162) 112 (67, 156)
Gender
Girls (ref.) – (ref.) –
Boys 317 (256, 378) 440 (372, 506)
Race/ethnicity
White (ref.) – (ref.) –
Asian 21 (145, 102) 188 (323,54)
Black 331 (513,151) 454 (653,254)
Hispanic 96 (38, 154) 40 (24, 103)
Others 52 (42, 146) 12 (115, 91)
Height (cm) 47 (44, 51) 59 (55, 63)
Height
2
0.3 (0.0, 0.5) 0.5 (0.3, 0.8)
BMI (kg/m
2
) 35.7 (29, 42) 49 (42, 56)
BMI
2
2.4 (3.0,1.8) 2.8 (3.5,2.2)
31
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure S2.1: Maps of EPA PM2.5 monitoring sites (left) and CSN monitoring sites (right)
from 2000 to 2018.
● ● ● ● ●● ● ● ● ● ● ● ● ● ● MISR Pixels
EPA PM2.5 Sites
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure S2.2: Maps of EPA PM2.5 monitoring sites and the MISR pixels within 4.4 km of
each site, in Santa Barbara (left) and other communities (right).
32
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Elemental Carbon Dust
Sulfate Nitrate
Ridge LASSO GBM RF SVM Ridge LASSO GBM RF SVM
0.2
0.4
0.6
0.2
0.4
0.6
Method
Test R
2
Figure S2.3: Boxplots of test R
2
from 20 iterations of each machine learning method ap-
plied to predicting PM
2.5
speciation, in AOD-products (coral) and AOD-mixtures (teal)
models.
33
Chapter 3
Further Challenges in PM
2.5
Estimation
In this chapter, we constructed PM
2.5
prediction models in the Middle East over a region
that surrounds the Persian Gulf. We fitted a model using all available data in the region
as well as two local models using only data from Kuwait and the United Arab Emirates
(U.A.E.). We further investigated temporal and spatial autocorrelation trends to explore
possible reasons for the great disparity in prediction performance between the local mod-
els, and their implications for transferred models—the practice of using models train in
one region to estimate PM
2.5
in another region.
1
3.1 Background
Spatially and temporally resolved estimates of particulate matter with aerodynamic di-
ameter less than 2.5 mm (PM
2.5
) are of significant importance for studying the health ef-
fects associated with exposure to air pollution. Various statistical and machine learning
methods have been developed for this task, and have been facilitated by increases in both
1
This chapter was submitted to Remote Sensing in July 2021 and is currently being reviewed.
34
the quantity and types of spatio-temporal data. For instance, aerosols observed from
polar-orbiting satellites, retrieved as aerosol optical depth (AOD), have become attractive
for PM
2.5
estimation due to their vast coverage, both spatial and temporal. Many stud-
ies have found great success in leveraging AOD to predict PM
2.5
globally [76, 89, 90, 92,
93, 96]. The quality of AOD data has improved significantly in the past decade for the
MODerate resolution Imaging Spectroradiometer (MODIS) and the Multiangle Imaging
SpectroRadiometer (MISR) instruments, which originally provided AOD retrievals at 10
km and 17.6 km resolutions, respectively. Advanced processing algorithms have down-
scaled these retrievals to much higher resolutions (1 km and 4.4 km, respectively) with
improved accuracy in validation tests [48, 49].
While the spatio-temporal coverage of satellite-observed AOD has made it a key pre-
dictor in estimating PM
2.5
, the majority of studies have focused on data-rich regions of the
world such as the United States and China, which have large and established air quality
monitoring networks to train and validate models [53, 70, 76, 96]. In other locations such
as the Middle East where ground PM
2.5
monitoring is sparse or nonexistent, modeling
efforts have been limited.
Air quality in Middle Eastern cities is poor, with contributing sources ranging from
industry and construction to transportation [118]. Compounding these anthropogenic
sources, extreme meteorological events such as dust storms further contribute to high
particulate matter concentrations and make it challenging to estimate PM
2.5
accurately. A
recent study in Kuwait found dust events occurring in more than one third (37%) of the
days observed from October 2017 through October 2019 [119], and most particle samplers
are likely inadequate to collect samples during dust storms when concentrations exceed
35
200–400 μg/m
3
[120]. The same study also found dust storms and road dust to be the
second largest contributor (20%) to PM
2.5
in Kuwait City.
While it may be desirable to apply models trained in data-rich regions to make pre-
dictions where there is little validation data, they are not guaranteed to perform well
and estimates are likely to have large uncertainties. Most studies estimating PM
2.5
typ-
ically report an overall summary performance metric such as cross-validation R
2
, yet
these statistics may be deceiving because models are unlikely to predict with the same
accuracy at all locations and at all times within a given study domain. Spatial and tem-
poral variability in predictive performance has been shown through leave-one-out (LOO)
cross-validation (CV), either by leaving out one site or region or by leaving out one year
in each training-validating iteration [66, 67, 76, 88, 121–123].
For example, differences in predictive performance have appeared in large-scale stud-
ies including Engel-Cox et al. (2004), who found that MODIS AOD showed poorer corre-
lation with PM
2.5
in the western U.S. compared to eastern and Midwest portion of the U.S.
[124]. In their cross-validation modeling based on U.S. Census Bureau geographic subdi-
visions, Di et al. (2019) found that the West region had the poorest prediction: Mountain
subdivision had the lowest CV R
2
and Pacific subdivision had the highest CV RMSE (root
mean squared error). The Midwest and Southern subdivisions west of the Mississippi
River also predicted worse than all subdivisions east of the River [76].
Model adjustments can be made to account for differences in predictive performance.
In a global study, Shaddick et al. (2018) accounted for spatial variability in their AOD-
PM
2.5
models by adding random effect terms for country, region, and super-region [93]. In
another global study, Hammer et al. (2020) identified divergent temporal trends in PM
2.5
36
across regions that affected model performance, which was addressed through statisti-
cal fusion [125]. Modeling over the contiguous U.S., Di et al. (2019) included a spatially
lagged term for PM
2.5
to account for spatial autocorrelation [76].
Although Engel-Cox et al. (2004) and Di et al. (2019) cited sparse data (both fewer
PM
2.5
monitors and lower AOD retrieval success rate) and challenging terrain as possi-
ble reasons for heterogeneity in model performance, few studies have investigated other
possible mechanisms. We suggest that trends in spatial and temporal autocorrelation
for PM
2.5
, AOD, and possibly meteorology could impact the strength of the correlation
between PM
2.5
and model predictors. Several studies have examined the spatial and tem-
poral autocorrelation (or coherence) of PM
2.5
and AOD, yet its potential impact on models
predicting PM
2.5
remained under-explored.
Sullivan and Pryor (2014) found that PM
2.5
temporal variability had an impact on both
natural and anthropogenic cycles and indicated there was a link between PM
2.5
concen-
trations and meteorological conditions. They also found that, at a sub-daily scale, spatial
variability was 2–3 times greater than temporal variability [126]. In another study of
MODIS AOD over eastern North America, Sullivan et al. (2015) found that spatial coher-
ence r for AOD dropped below 0.3 at750 km on average, and that the range of the
semivariogram for the summer was almost twice that of the winter (2200 km vs.1300
km, respectively) [127].
Toth et al. (2019) studied decay in spatial correlation for PM
2.5
in the contiguous U.S.
(CONUS) and found the e-folding length in correlation (distance or time for correlation
to reduce below 1/e, about 0.37) to be600 km; regional analysis by 10 km bin aver-
ages found the e-folding length to be700 km in the eastern CONUS and300 km in
37
the western CONUS [128]. In their temporal autocorrelation analysis in the Southeastern
U.S., Kaku et al. (2018) found that the e-folding time was 3 days for ground-monitored
PM
2.5
and only 1 day for ground-monitored AERONET AOD. The authors noted that
strong single-day spikes in AOD–and hence day-to-day variability–were far more preva-
lent in AOD than in PM
2.5
[129].
Using the Middle East as our study region (Figure 3.1), we develop AOD-PM
2.5
mod-
els, evaluate their overall and regional predictive performance, examine the spatial het-
erogeneity in model performance, identify the most important predictors, and explore the
impact of spatial and temporal autocorrelation on relationship between PM
2.5
AOD, and
the other predictors.
Figure 3.1: Map of the study region with all PM
2.5
monitors (left) and only those in Kuwait
(right).
38
3.2 Materials and Methods
3.2.1 Particulate Matter Data
The PM
2.5
data used in this study were obtained from two sources. Through a collab-
orative project between researchers from Kuwait University and the Harvard School of
Public Health, PM
2.5
data were collected in years 2004–2005 and 2017–2019 at 3 loca-
tions in Kuwait: a “central” site at Kuwait University, a “northern” site in Um-Al-Aish—
approximately 55 km north of Kuwait City in the Al Jahra Governorate, and a “south-
ern” site in Um-Al-Haiman—approximately 45 km south of Kuwait City in Ali Sabah Al
Salem. The central and southern sites were in urban areas and the northern site is in a
desert area. Due to high concentrations of PM and extreme temperatures in the region,
Harvard high-capacity impactors were used to sample PM
2.5
[130, 131]. We refer to these
data as the HHI PM
2.5
dataset.
The second set of data were acquired from OpenAQ, an open-source platform that pro-
vides real-time and historical air quality data. OpenAQ aggregates government-measured
and research-grade data from different government entities and international organiza-
tions [132]. For this study, we acquired PM
2.5
data from measurements taken at United
States embassies and consulates from 5 cities in 4 countries: Bahrain (Manama), Iraq
(Baghdad), Kuwait (Kuwait City), and United Arab Emirates (Abu Dhabi and Dubai).
PM
2.5
data were available as early as 2016 (Manama) and we included data collected up
to mid-June, 2020. We refer to these data as the OpenAQ PM
2.5
dataset.
39
3.2.2 Aerosol Optical Depth Data
Launched in December, 1999, the polar-orbiting Terra satellite carries two instruments
that observe atmospheric aerosols: the MODerate resolution Imaging Spectroradiome-
ter (MODIS) and the Multi-angle Imaging SpectroRadiometer (MISR). The MODIS in-
strument uses a single sensor with a wide observation swath and provides daily aerosol
observations globally. Initially available at 10-km resolution, MODIS data have been re-
processed using the Multi-Angle Implementation of Atmospheric Correction (MAIAC)
and are currently available at 1-km resolution [48]. We refer to these as MAIAC AOD
data.
The MISR instrument relies on nine camera angles and four spectral bands to discern
aerosol particles of different sizes, shapes, and types. In addition to total column AOD,
MISR provides AODs fractionated into small, medium, and large (amount of particles
of each size), nonspherical (amount of nonspherical particles), and absorption (amount
of light-absorbing particles). One limitation of MISR AOD is that due to its narrower
observation swath, the instrument only overpasses any given location every 3–5 days.
Originally available at 17.6-km resolution, MISR AOD data has also been improved re-
processed and the latest version (V23) is available at 4.4-km [49].
The MISR quality-control procedures sometimes reject observations that are deemed
“low-quality”, one reason being unusually high values. However, all original AOD ob-
servations are still preserved in the "AUXILIARY" subgroup of the MISR AOD product
files and are labelled “raw” (e.g., raw total AOD, raw small AOD, and raw absorption
AOD). The “raw” version of each AOD variable contains both the “high-quality” and
40
“low-quality” observations, and therefore typically provides a slightly larger sample size.
As extreme weather conditions such as sandstorms are typical of our study region, we
considered the raw AOD variables as a candidate for modeling PM
2.5
. Although we fitted
predictive models using both “high-quality” and “raw” AOD data, we will only report
results of models with “raw” AOD variables as they consistently predicted better than
those with “high-quality” AOD variables. For the remainder of this paper, we refer to the
“raw” MISR AOD variables simply as MISR AOD variables, except for the AERONET
AOD validation portion where we report both and will be specific in that context.
3.2.3 Meteorology Data
Similar to previous efforts using satellite-observed AOD coupled with ground-monitored
measurements to develop PM
2.5
prediction models, we also incorporated meteorologi-
cal data [36]. For this study, we relied on reanalysis data from the European Centre for
Medium-Range Weather Forecasts (ECMWF), version 5 (ERA5). ERA5 provides a wide
range of meteorological data at hourly temporal and approximately 31-km horizontal
resolution [41]. We extracted ERA5 meteorological variables including wind speed, wind
direction, temperature, evaporation, surface pressure, cloud cover, precipitation, relative
humidity, boundary layer height (BLH), and downward shortwave (UV) radiation, which
we identified to be a key predictor variable in previous studies [36, 110, 111].
41
3.2.4 Assimilated Aerosol Data
To our models we also incorporated assimilated aerosol optical depth data from the
Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-
2), which is the latest atmospheric reanalysis of the modern satellite era produced by
NASA’s Global Modeling and Assimilation Office (GMAO) [133, 134]. Although MERRA-
2 data are provided at relatively low resolution (50 km), their source-specific AOD have
proven to be useful for estimating PM
2.5
in other studies [76, 79, 135]. While there have
been some efforts to downscale MERRA-2 to a finer spatial resolution [135], in this study
we used MERRA-2 AOD variables (dust, black carbon (BC), organic carbon (OC), and
sulfate (SO
4
)) at their native resolution.
3.2.5 Statistical Methods
We validated MAIAC and MISR AOD against AOD from the Aerosol Robotic Network
(AERONET). AERONET is an instrument network of ground-based sun photometers that
derive AOD at a number of visible and near-infrared wavelengths from direct sun obser-
vations and has served as the primary standard for validating satellite aerosol products
[136]. We matched MAIAC and MISR AOD to the Kuwait University AERONET when
it was active between 2006–2010 by identifying the nearest MAIAC AOD observation
within 1 km and the nearest MISR AOD observation within 4.4 km of the AERONET site.
As AERONET only measures AOD at 500 nm and 675 nm, we interpolated it to 550 nm
to match total column AOD from both MAIAC and MISR [67]. We averaged AERONET
AOD to the nearest hour, and filtered for measurements during the usual Terra satellite
42
overpass time (between 10:00 and 12:00 local time). Validation statistics including linear
correlation (r), RMSE, and bias between AERONET AOD and MAIAC and MISR AOD
were calculated.
We used a selection of 5 different machine learning models [36, 50] in a regression
framework (LASSO regression, ridge regression, gradient boosting machines (GBM), ran-
dom forests (RF), and support vector machines (SVM)) [137]. In previous studies, non-
linear methods (GBM, RF, and SVM) demonstrated stronger predictive performance com-
pared to linear learners (LASSO and ridge), due to the non-linear relationship between
PM
2.5
and satellite-observed AODs [36, 50, 65]. We considered the linear methods here to
assess whether the pattern of non-linearity held in this region with more extreme climate
conditions.
We fitted separate predictive models for MAIAC AOD and for the 6 fractionated MISR
AODs (total, small, medium, large, nonspherical, and absorption). In both cases, we in-
cluded ERA5 meteorological variables, MERRA-2 assimilated AODs, spatial coordinates
(the latitude and longitude of MAIAC or MISR pixels), and time (Julian date and month).
For each machine learning method, the model was trained on 70% of the data, and using
10-fold cross validation we selected the best fitting hyperparameters based on R
2
as the
performance metric. The models with the best performing hyperparameters were then
assessed on the remaining 30% test sample of data, also via R
2
. Due to the small sample
size of our data, particularly for the MISR-AOD dataset, the trained models could be very
sensitive to the observations in the training set. We therefore repeated this training-testing
process in 25 iterations, each time drawing a different 70-30 split in the data, to assess the
43
stability and robustness of each model. The models with the best median prediction test
R
2
were chosen as the final models.
Leave-one-out (LOO) cross-validation, typically reserving a single site or region, or a
time unit (e.g., year) for out-of-sample validation, is a common approach to assess model
generalizability—a key component of predictive modeling using machine learning tech-
niques [61, 76]. Given the small number of PM
2.5
monitoring sites (8) in our sample, the
overall small sample size, and the unbalanced sample sizes across regions, this approach
was less useful in our study so was not conducted (Kuwait and the U.A.E. respectively
made up 50% and 25% of the MISR-AOD data and 56% and 19% of the MAIAC-AOD
data).
In addition to developing a model using data from all sites, we divided the study
region into 2 subregions: Kuwait (4 sites) and the U.A.E. (2 sites). We excluded Manama
and Baghdad from this portion of the analysis as they each had only one PM
2.5
monitoring
site and thus provided little spatial variability. For the iterated training-testing process
described above, we calculated both the overall test R
2
and the regional test R
2
using the
data that included all sites. Further, we fitted separate models using only data from each
subregion, also once with MAIAC AOD and once with fractionated MISR AODs. These
regional models allowed us to assess the transferability of the models—i.e., predicting
PM
2.5
over an area using a model trained in a completely different location. Specifically,
we used the Kuwait-trained models to predict PM
2.5
in the U.A.E. and used the U.A.E.-
trained models to predict PM
2.5
in Kuwait, and assessed their respective prediction R
2
.
For the best performing machine learning method we estimated variable importance
to understand the relative importance of satellite-observed aerosol variables among the
44
models considered (MAIAC AOD vs. MISR AODs and overall vs. regional). Variable
importance was calculated using permutation accuracy importance, which is based on
the increase in mean squared error (MSE) when a variable is excluded, as it is more robust
to bias than the typical Gini impurity importance, which is based on node purity [138]. To
verify variable importance for AOD variables in these models, we fitted the same models
without satellite AOD predictors and assessed their predictive performance (R
2
).
Possible explanations for differences in performance and variable importance between
subregions were explored by estimating temporal and spatial autocorrelations. The tem-
poral autocorrelation function (ACF) was assessed for PM
2.5
, AOD, MERRA-2 variables,
and ERA5 meteorological variables at the Kuwait and U.A.E. PM
2.5
monitoring locations.
Due to the temporal gaps in MISR AODs (i.e., MISR being able to retrieve every 3–5 days
for a given location), high-resolution temporal ACFs for satellite observed aerosols were
only assessed using MAIAC AODs. The abundance of MAIAC data also allowed for au-
tocorrelation to be assessed more extensively. We collected MAIAC AOD in 2017–2019
for every pixel retrieved over land within a 2 degree by 2 degree area that covered the
PM
2.5
monitors in each country (Figure 3.2).
Instead of examining only coincident MAIAC pixels at each site, we estimated ACF
for AOD at every MAIAC pixel within 50 km of each site, and then used these ACFs to
calculate the median ACF for each site. To assess spatial autocorrelation, we estimated
the daily empirical semivariograms [139] in 1 km bins up to 50 km for each region, and
calculated the median semivariogram over the 3-year period. Due to the lower resolu-
tion of ERA5 and MERRA-2 data (31 and 50 km, respectively), assessment of the spatial
autocorrelation was only conducted for MAIAC AOD.
45
Figure 3.2: Areas within each country where temporal and spatial autocorrelations for
MAIAC AOD were evaluated (shaded pink) and the PM
2.5
monitors (yellow circles) in
each country.
3.3 Results
3.3.1 AOD Validation
From 2006 through 2010, the Kuwait University AERONET monitor measured AOD on
421 days and we were able to match to MAIAC AOD on 319 days, to MISR AOD on 40
days, and to MISR “raw” AOD on 47 days. Overall, all three satellite-observed AOD
samples showed strong correlations with ground-observed AERONET AOD (r = 0.79–
0.81, Figure 3.3). Although the MISR AOD samples were smaller, they showed slightly
better correlation with AERONET AOD (r = 0.799 for AOD and r = 0.813 for raw AOD)
compared to MAIAC AOD (r = 0.793). The higher r for MISR raw AOD over MISR AOD
could be due to the larger sample that included highly correlated observations. Notably,
46
MAIAC AOD showed a larger bias (0.140) relative to AERONET AOD compared to that
of MISR AOD (0.087) and MISR raw AOD (0.088).
Figure 3.3: Validation of AERONET AOD interpolated to 550 nm versus MAIAC AOD
(left), MISR AOD raw (center), and MISR AOD (right) with 1-to-1 line (dotted) and corre-
lation line (red).
3.3.2 Model Results
From February 2004 through February 2020, we were able to match PM
2.5
data (3 from
HHI and 5 from OpenAQ in the entire study region including 4 countries) to 411 unique
days of MISR (raw) AOD data and 1454 unique days of MAIAC AOD data for model
building. The PM
2.5
data included a 10-year temporal gap between 2005 and 2016 – only
the HHI monitors were active during 2004–2005 and were active again during 2017–2019,
and OpenAQ monitors in the study region began monitoring as early as 2016. Predictive
performances of the full models are summarized in Table 3.1.
The models using data from the entire study region generally did not predict well
and, at best, could only explain approximately 50% of the variability in PM
2.5
in the test
set. Non-linear machine learning methods performed better than linear methods, with
47
Table 3.1: Median test R
2
across machine learning methods, AOD sources, and samples.
Instrument Model N LASSO Ridge GBM RF SVM
MISR Overall 542 0.34 0.35 0.47 0.48 0.44
Kuwait 271 0.38 0.37 0.42 0.51 0.43
U.A.E. 138 0.57 0.57 0.64 0.66 0.53
MAIAC Overall 3334 0.28 0.29 0.50 0.53 0.47
Kuwait 1863 0.29 0.29 0.46 0.48 0.44
U.A.E. 642 0.51 0.49 0.63 0.65 0.61
random forests models demonstrating the highest predictive performance. Non-linear
methods had similar test R
2
between the MISR and AOD models despite the large differ-
ence in sample size (N = 542 for MISR and N = 3, 334 for MAIAC) while linear models
performed much worse in the MAIAC models.
The ten most important variables in the RF models are shown in Figure 3.4 in de-
scending order of importance. Variable importance was measured by percentage increase
in MSE; although the absolute percentages of increase in MSE were fairly small for each
variable excluded, their relative proportions are more meaningful in understanding vari-
able importance [138]. In the MISR model, total column AOD was responsible for the
largest increase in MSE when excluded and size-related fractionated AOD variables were
also among the ten most important. Although total column AOD was also among the
ten most important variables for the MAIAC model, its importance is relatively smaller
than those of MERRA-2 dust, temperature, and boundary layer height. MISR models
performed better when MERRA-2 sulfate extinction is included while MAIAC models
preferred MERRA-2 black carbon extinction.
In our regional analysis, the RF model using MAIAC AOD resulted in higher predic-
tion R
2
for the U.A.E. than for Kuwait (median test R
2
= 0.65 and 0.48, respectively; Table
48
Figure 3.4: The top-10 most important variables for the random forest overall models
using MAIAC AOD (left) and MISR AOD (right), as measured by increase in MSE when
excluded.
3.1). Relative to the overall test R
2
, the RF model predicted slightly poorer in Kuwait and
significantly better in the U.A.E. These results are fairly stable and robust to different
training-testing splits (i.e. narrow boxplots, Figure 3.5). The RF model using fractionated
MISR AODs also predicted better in the U.A.E. than in Kuwait (median test R
2
= 0.66 and
0.51, respectively). Relative to the overall test R
2
, the regional models predicted slightly
better in Kuwait and much better in the U.A.E. (the lower overall R
2
was likely due to
poorer prediction results in Baghdad and Manama, which are not shown). Wider box-
plots indicated that the MISR-AOD model was more sensitive to different training-test
splits, which was expected due to the smaller sample size (Figure 3.5).
For Kuwait, the MISR model had total column AOD and MERRA-2 dust as the two
most important variables in the RF models; nonspherical AOD was among the ten most
important variables along with size-related fractionated AOD variables (Figure 3.6). Sim-
ilar to the overall model, MAIAC AOD was not as important as MERRA-2 dust or tem-
perature.
49
Figure 3.5: Test R
2
across different settings: overall and regional models, MISR and MA-
IAC models, and models with and without AOD variables.
The regional models using only PM
2.5
monitors in the U.A.E. showed better predictive
performance with both MISR and MAIAC models explaining an additional 13–23% of
the variation in the test set compared to the overall and Kuwait-specific models (Table
3.1). While MERRA-2 dust remained among the most important variables, total column
AOD was less important, especially in the MAIAC model. Julian date, temperature, and
UV radiation became much more important in both MISR and MAIAC models. For the
MISR model, large and small AOD were still among the ten most important (Figure 3.6).
Differences in predictive performance between the regional models were also seen in the
overall models when test R
2
was estimated for each subregion (Figure 3.5).
50
Figure 3.6: The top-10 most important variables for the random forest models based on
monitors in Kuwait using MAIAC AOD (top left) and MISR AOD (top right) and in the
U.A.E. using MAIAC AOD (bottom left) and MISR AOD (bottom right), as measured by
increase in MSE when excluded.
In our assessment of model transferability we used the RF model trained only on data
in Kuwait to predict PM
2.5
in the U.A.E. and vice versa, and found that both regional
models performed much worse in predicting a different region than predicting their na-
tive region. In particular, the MAIAC- and MISR-based Kuwait models were only able
to explain 37.4% and 28.4%, respectively, of the variation in PM
2.5
in the U.A.E. Simi-
larly, the MAIAC- and MISR-based models developed for the U.A.E. were only able to
explain 12.8% and 13.2%, respectively, of the variation in PM
2.5
in Kuwait (Figure S3.1).
51
The poorer performance in the U.A.E. models was likely due to their under-estimation of
extremely high PM
2.5
values in Kuwait.
3.3.3 Temporal and Spatial Autocorrelations
The temporal autocorrelation functions for PM
2.5
and MAIAC AOD, grouped by monitor
location in Kuwait and the U.A.E. For PM
2.5
, are shown in Figure 3.7. We observed that
the decay in autocorrelation was much faster in Kuwait (e-folding length—time until ACF
drops below 0.37—at 1 day for all 4 sites) compared to both Abu Dhabi and Dubai (e-
folding length at 12 and 14 days, respectively). For MAIAC AOD, the median ACFs are
shown by site after aggregating the ACF for every MAIAC pixel within 50 km of each
site. The autocorrelation difference between Kuwait and the U.A.E. was still observed,
although was smaller; the e-folding length at 1 day for MAIAC AOD in Kuwait and at 3
days for MAIAC AOD in the U.A.E.
Among the meteorological variables, we calculated the ACF for those with highest
variable importance in the RF models: temperature, UV radiation, BLH, and wind speed
(Figure S3.2). At all 6 sites in both countries, temperature and UV radiation showed
very strong autocorrelations and seasonal trends. BLH and wind speed showed weaker
autocorrelations (e-folding length at 3 and 2 days, respectively), with the exception of
the Northern site in Kuwait where BLH had a e-folding length of almost 2 months and
showed a clear seasonal trend.
Among the assimilated aerosol MERRA-2 variables, the autocorrelation for dust ex-
tinction showed a small separation between PM
2.5
sites in Kuwait and the U.A.E. early
52
Figure 3.7: Autocorrelation functions for PM
2.5
(left) and median ACFs for MAIAC AOD
(right) at different sites in Kuwait and the U.A.E.
on yet they all had an e-folding length at 3 days, after which there was little distinction
between the two regions (Figure S3.3). Sulfate extinction showed clearer separation be-
tween the two regions for almost a month; the e-folding length was at 2 days for Kuwait
sites and 3 days for the U.A.E. There was also little separation between Kuwait and the
U.A.E. for black carbon extinction, and the e-folding length was just over a month for
Kuwait and slightly under a month for the U.A.E.
We restricted our analysis of spatial autocorrelation to days with at least 2,500 suc-
cessful AOD retrievals in each region. From 2017 through 2019, we analyzed daily semi-
variograms for 961 days in Kuwait, ranging from 2,515 to 19,552 AOD retrievals, with a
median of 18,116 AOD retrievals per day. In the same period, we analyzed daily semi-
variograms for 1,001 days in the U.A.E., ranging from 2,610 to 23,609 AOD retrievals per
day, with a median of 20,721 AOD retrievals per day. The top row of Figure 3.8 shows
53
the daily semivariograms for each region, with the median semivariograms in red (note
the y-axes are different for each region). To better show the aggregate trend of spatial
autocorrelation, median semivariograms are shown in the bottom row of Figure 3.8.
Figure 3.8: Daily semivariograms for MAIAC AOD in Kuwait and the U.A.E. (grey lines
in top row) and median semivariograms (red lines in both top and bottom rows).
Daily and median semivariograms of MAIAC AOD in Kuwait and the U.A.E. demon-
strated clear differences in spatial autocorrelation trends between the two regions. Daily
semivariograms in Kuwait showed greater semivariance and greater variation compared
to the U.A.E. where semivariance was both smaller and demonstrated less variation. The
54
median semivariograms showed that MAIAC AOD in Kuwait reached a partial sill at 0.1
with a range of 15 km while MAIAC AOD in the U.A.E. reached a partial sill at 0.004
with a range of 5 km. These semivariance trends showed that in the U.A.E. MAIAC AOD
showed much smaller spatial variation at distances as far as 50 km compared to Kuwait.
3.4 Discussion
In this study, we applied methods developed in our previous work to estimate PM
2.5
from
satellite-observed AOD and meteorology over the Middle East, a geographic region with
poor air quality and meteorological phenomena that render air quality estimation diffi-
cult. Noting regional differences in model performance, we explored possible statistical
explanations including variable importance in the best fitting models as well as temporal
and spatial autocorrelation trends in AOD and PM
2.5
.
Model performance in Kuwait was similar to our previous study in Mongolia, al-
though the Mongolia models did not include meteorology [50]. On the other hand, model
performance in the U.A.E. was better and similar to that of our models in California [36].
We observed that the non-linear relationship between PM
2.5
and satellite-retrieved AOD
also affected the predictive performance of machine learning methods differently: gradi-
ent boosting, random forest, and support vector machines performed better than ridge
and LASSO regression. Curiously, the linear methods did not perform much worse than
GBM and RF in the U.A.E. and even outperformed SVM in the MISR AOD models (Table
3.1). These findings indicate that linear approaches should be carefully considered when
constructing prediction models in other study regions.
55
The comparison between MAIAC AOD and MISR fractionated AODs also highlighted
important differences. First, our data showed stronger linear correlation between PM
2.5
and MISR AOD than MAIAC AOD (r = 0.50 and 0.38, respectively). Although the MA-
IAC models were developed on a data set with 4–6 times the sample size of the MISR
models, their predictive performances were equivalent, both in the overall and the re-
gional models (Table 3.1). MISR AOD also contributed more to the predictive accuracy of
RF models (0.030–0.087% increases in MSE when excluded) compared to MAIAC AOD
(0.009–0.044% increases in MSE when excluded). The MISR model in Kuwait saw greater
improvements in predictive accuracy attributable to total and fractionated AODs while
the MISR model in the U.A.E. saw greater reliance on temporal (Julian date) and meteo-
rological variables (temperature and UV radiation).
Similarly, while MAIAC AOD was less important than MERRA-2 dust, temperature,
and month of the year in predicting PM
2.5
over Kuwait, its importance was even smaller
in the U.A.E. When we excluded AOD variables entirely from each model the MAIAC-
based models performed similarly while MISR-based models performed worse. Notably,
the Kuwait model without MISR AODs observed the largest decrease in test R
2
(Figure
3.5) while the overall model predicted better in the U.A.E. without AOD variables.
The differences in predictive performance between Kuwait and the U.A.E. were part
of a larger trend in our study where both the overall and the regional models performed
differently by region. Kuwait models performed worst in most configurations than those
in the U.A.E. despite having twice the sample size, indicating that sample sizes could ex-
plain little of the regional differences. These differences were also consistent with other
PM
2.5
studies where geography-based leave-one-out cross-validation revealed that one
56
trained model did not perform the same everywhere. Di et al. (2019) obtained lower cross-
validated R
2
in western subdivisions of the contiguous U.S., which could be attributed to
the sparsity of PM
2.5
monitors in these regions [76]. They also suggested that in moun-
tainous regions such as the Appalachians and Rocky Mountains, terrain had an important
influence on model performance.
Regional analyses of temporal and spatial autocorrelation further revealed significant
regional heterogeneity. PM
2.5
measurements in Kuwait demonstrated poor temporal au-
tocorrelation (fast ACF decay) while PM
2.5
monitors in the U.A.E. showed slower auto-
correlation decay (Figure 3.7). MAIAC AOD also showed a similar but smaller difference
in temporal autocorrelation decays between the two regions. Anderson et al. (2003) ob-
served different ACF trends in ground-monitored AOD from two AERONET stations
(Bondville, Illinois and Spitzbergen, Svalbard and Jan Mayen) [140]. As the Abu Dhabi
station was only active for very short periods during 2004–2005 and did not provide suf-
ficient data, ACFs were calculated using MAIAC AOD instead. We observed different
ACF trends with MAIAC AOD in the two regions, although the difference was negligible
after 4 days. PM
2.5
autocorrelation decayed much slower than AOD autocorrelation in
the U.A.E., which was similar to the findings by Kaku et al. (2018) for the Southeastern
U.S. [129]. Regional semivariograms also demonstrated distinct differences in MAIAC
AOD spatial autocorrelation trends: AOD in the U.A.E. showed a smaller variance than
in Kuwait, as far as 50 km (Figure 3.8).
Differences in spatial and temporal autocorrelation trends could explain some of the
differences in variable importance between the regional models. For example, in the
U.A.E. where PM
2.5
showed stronger temporal autocorrelation, the model seemed to rely
57
more on variables with similarly stronger temporal autocorrelation such as temperature
and UV radiation (Figure S3.2), as well as temporal trends (Julian date), rather than satel-
lite AOD, which showed poor temporal autocorrelation (Figure 3.7). On the other hand,
in Kuwait where PM
2.5
correlated poorly over time, the model relied less on meteoro-
logical variables and more on aerosol-related variables, including MERRA-2 extinctions
(Figure S3.3).
These findings highlight the challenge of modeling PM
2.5
over large areas that cover
subregions with potentially heterogeneous temporal and spatial trends. Our regional
models focused on two subregions with similar geographical characteristics: the Kuwait
PM
2.5
monitors covered an east-facing coastal region approximately 40 km 90 km and
the U.A.E. monitors covered a west-facing coastal region approximately 85 km 95 km.
However, the two subregions showed very different trends in PM
2.5
temporal depen-
dence which likely affected the predictive power of the respective regional models. The
overall models with 8 monitors in the study area also masked the difference in test R
2
in Kuwait (lower) and the U.A.E. (higher). Our study still found greater utility in mod-
els with larger spatial coverage when the data were available; regional models did not
necessarily predict better than the overall model in their respective regions. We recom-
mend careful evaluation of subregional spatial and temporal trends to better understand
inter-region predictive performance differences.
Finally, a test of model transferability demonstrated the difficulty of using models
trained with data from one region to predict PM
2.5
in another region. Here we have in
mind studies similar to that of Li et al. (2021) where the final model trained only on PM
2.5
data in Kuwait was used to predict PM
2.5
in neighboring Iraq [79]. In our study, the
58
Kuwait-based model was transferred 850 km southeast to similarly coastal U.A.E. and
predicted very poorly. Therefore, using a model trained on only PM
2.5
data from Kuwait
to estimate PM
2.5
over Iraq—a region almost entirely landlocked and about 25 times the
surface area—seemed to warrant some degree of skepticism. To be sure, their model
took advantage of the relatively stable relationship between visibility and PM
2.5
, and the
severe scarcity of reliable PM
2.5
data in Iraq offered few modeling alternatives.
The largest limitation of this study is the small sample sizes. Although the dimension
of the overall MAIAC-AOD dataset was in the thousands and the PM
2.5
monitors in our
data covered a large region—from Baghdad to Dubai—more than half of the data were
in Kuwait. We introduced spatial variation in the data by using coordinates of MAIAC
pixels instead of PM
2.5
monitors, but the small number of monitors meant that predictions
to regions further away from these sites were likely less accurate and the ability to capture
local variability in PM
2.5
was limited. While this could be ameliorated by the addition
of air pollution monitors, retrospective prediction of historical PM
2.5
measurement may
remain unreliable.
Using assimilated meteorology and aerosol data at lower resolutions compared to
MAIAC and MISR native resolutions was another limitation. This was an example of
spatial misalignment, specifically the change of support problem discussed by Gotway
and Young (2002) as well as Banerjee et al. (2015) [84, 85]. Downscaling gridded data
is an active area of research with different interpolation approaches. For example, Li et
al. (2020) recently downscaled MERRA-2 data from 50 km to 1 km using residual deep
network [135]. In our context, such downscaling would not guarantee better prediction
performance as our handful of PM
2.5
monitors were largely located in urban areas with
59
complex anthropogenic sources of air pollution. Simple downscaling methods (e.g., bi-
linear interpolation) were likely to produce overly smoothed surfaces while sophisticated
methods (e.g., deep learning) were likely difficult to quantify uncertainty.
One shortcoming of our study concerned the general approach of relying on machine
learning methods to predict PM
2.5
, particularly non-parametric learners. Classical sta-
tistical models such as linear regression explicitly assume that the error term follows a
particular distribution (e.g., Gaussian). As such, prediction uncertainty can be quantified
in a parametric manner such as prediction intervals based on standard errors. However,
non-parametric machine learning methods such as gradient boosting and random forest
depend on hyper-parameters (such as number of trees and learning rate) that are typi-
cally chosen via non-statistical optimization methods such as cross-validation. As such,
it is difficult to quantity uncertainty for prediction estimates from these learners.
Several methods have been proposed such as bootstrap sampling by Hastie et al. (2009)
[137], yet few air pollution studies have reported uncertainty on their estimates. Predic-
tions by Di et al. (2019), which covered the contiguous U.S., did include PM
2.5
uncertainty
estimates by modeling the monthly standard deviation of the difference between daily
monitored and predicted PM
2.5
at 1 km resolution [76]. In our study, the small sample
size and the very limited number of PM
2.5
monitors meant that prediction intervals were
likely too wide and not very useful.
60
3.5 Conclusions
Our study contributes to a large body of research that has established (1) remotely-sensed
AOD as a helpful addition to models estimating PM
2.5
, particularly where PM
2.5
moni-
tors are scarce, and (2) machine learning, notably non-linear techniques, is a powerful
means of predicting PM
2.5
using AOD and meteorology. In both overall models as well
as regional models in Kuwait and the U.A.E., we noted heterogeneity in predictive per-
formance across regions despite sharing key geographical features. Spatial and temporal
autocorrelation trends for PM
2.5
and MAIAC AOD showed regional differences. We sug-
gest that these differences in autocorrelation trends can help explain the discrepancies
in model performance and feature importance between regions. Noting issues in model
transferability, we also caution against applying regionally-tuned models to different ge-
ographic areas.
PM
2.5
data scarcity in the Middle East will continue to pose serious challenges to mod-
eling efforts, yet understanding divergent spatial and temporal trends can better inform
future work in this area and others.
61
3.6 Supplementary Figures
Figure S3.1: Out-of-region PM
2.5
predictions for Kuwait-trained models over the U.A.E.
(left column) and out-of-region PM
2.5
predictions for U.A.E.-trained models over Kuwait
(right column), by MAIAC and MISR AODs (top and bottom rows, respective).
62
Figure S3.2: Autocorrelation functions for ERA5 meteorological variables with high im-
portance in the random forests models. The Kuwait City and Central monitors shared the
same MERRA-2 pixel and thus the same ACF.
63
Figure S3.3: Autocorrelation functions for MERRA-2 assimilated aerosol extinction vari-
ables, including dust, sulfate, black carbon, and organic carbon. The Kuwait City and
Central monitors shared the same MERRA-2 pixel and thus the same ACF. Organic car-
bon was not found to be useful in any models and was excluded in the final PM
2.5
models.
64
Chapter 4
A Covariance-based Distance-weighting Model
In this chapter, we reviewed the concept of Gaussian random fields and covariance func-
tions before introducing the covariance-based distance-weighted regression (CDR) model.
We then provided an overview of the Matérn cross-covariance model proposed by Gneit-
ing et al. (2010) [141], which we used to simulate bivariate Gaussian random fields. We
simulated bivariate fields with different parameter specifications to assess the perfor-
mance of the proposed model and compared it against the linear and the generalized
additive models. In particular, we examined how the models performed at different set-
tings of surface roughness as well as the underlying (“spatial”) co-colocated correlation.
4.1 Gaussian Random Fields and Covariance Functions
4.1.1 Gaussian Processes and Random Fields
The underlying framework of common spatial statistical approaches to modeling air pol-
lution (such as co-kriging and linear model of coregionalization—LMC) is to treat the
spatially-indexed air pollution and predicting variables as Gaussian random fields [142–
65
145]. Although recent modeling efforts that rely on machine learning approaches often
do not assume this framework, it can still be leveraged to address the two challenges
mentioned in Chapter 1: missing data and spatial misalignment.
A random variable X is Gaussian distributed if its probability density function takes
the form
f(t)=
1
p
2ps
exp
(tm)
2
2s
2
!
, t,m2R,s> 0.
Notably, Gaussian distributions are completely determined by its mean m and variance
s
2
(the first and second central moments, respectively).
AnR
d
-valued random variableX = (X
1
, . . . , X
d
) is multivariate Gaussian distributed
if for any vectora = (a
1
, . . . , a
d
)2R
d
, the variableaX
>
=å
d
i=1
a
i
X
i
is univariate Gaus-
sian. The probability density function of the multivariate Gaussian distribution takes the
from
f(t)=
1
(2p)
d/2
jCj
1/2
exp
1
2
(t)C
1
(t)
>
where= (m
1
, . . . ,m
d
)2R
d
is the mean vector, the positive semidefinite d d matrixC
is the covariance matrix where c
ij
=E
(X
i
m
i
)
X
j
m
j
, andjCj is the determinant
ofC. Similarly, multivariate Gaussian distributions are also fully described by andC
[146].
A random function Z(t) is Gaussian process distributed if for any 1 n < ¥ input
pointsS, the vector of function values Z(s)= (Z
s
1
, . . . , Z
s
n
) is multivariate Gaussian:
Z(s)Nfm(s), k(s,s+hj)g
66
wherem(s) is the mean function and k(s,s+hj) is the covariance function with param-
eters, sometimes also referred to as the kernel. A neat property of Gaussian processes
is that, assuming without loss of generality that m(s) = 0, they can be entirely described
by the covariance function k. Of primary interest in our context are (weakly) stationary
1
and isotropic processes such that the covariance functions depend only on the Euclidean
distances between input points: k(s,s
0
j) = k(dj) where d =ks(s+h)k =khk.
Finally, a 2-dimensional Gaussian random field is a Gaussian process that (1) takes values
in a Euclidean space and (2) is defined over a parameter spaceS of 2 dimensions (e.g.,
longitude and latitude) [146–148].
4.1.2 Covariance Functions
Covariance functions describe how similar two data points are in a Gaussian process
based on their proximity to each other. As the covariance matrixC is positive semidef-
inite for multivariate Gaussian distributions, a valid covariance function k must also be
positive semidefinite. In addition, covariance functions also determine the smoothness
of the Gaussian processes based on their mean square differentiability or continuity. For
stationary covariance functions, this is reduced to checking for continuity at k(0). Exam-
ples of valid classes of covariance functions can be found in Chapter 4 of Rasmussen and
Williams (2006) [147].
1
Weak stationarity, or covariance stationarity, describes a constant-mean random process whose covari-
ance is invariant to translations. In other words, the covariance function depends only on the difference
betweens ands
0
[147]. As our approach does not require strong stationarity, we refer to weakly stationary
processes or fields as stationary.
67
A popular choice of covariance for Gaussian processes is the squared exponential (SE)
covariance function, which takes the form
k
SE
(dj l)= exp
d
2
2l
2
where l is the (characteristic) lengthscale parameter. The lengthscale controls the rate of
decay in correlation between points in a process. As parameterized, larger l indicates
faster decay (and less similarity among nearby points) and smaller l indicates slower
decay (and more similarity among nearby points) [149]. The SE covariance function is
infinitely mean square (MS) differentiable and describes very smooth Gaussian processes.
Sums of valid covariance functions are also valid covariances and Matérn discussed
an infinite sum of the SE covariance function in [150], which is the rational quadratic (RQ)
covariance function
k
RQ
(dj l,a)=
1+
d
2
2al
2
a
where l,a > 0. When a! ¥, the RQ covariance returns to the SE covariance function.
Notably, the RQ covariance is infinitely MS differentiable for alla and thus also describes
very smooth processes.
Originally proposed by Matérn and later named by Stein, the Matérn covariance func-
tion has been widely adopted in geostatistics due to its flexibility [150, 151]. The covari-
ance takes the form
M(dj l,n)=
2
1n
G(n)
p
2nd
l
!
n
K
n
p
2nd
l
!
68
where l,n > 0 and K
n
is a modified Bessel function. As parameterized, the Matérn co-
variance converges to the SE covariance function as n! ¥. The parameter n controls
smoothness in that the Gaussian process Z(s) with a Matérn covariance is k-times MS
differentiable if and only if n > k. For small n such as n = 1/2, the process becomes
very rough whereas for n 7/2, the process approaches SE covariance smoothness. Fig-
ure 4.1 shows the smoothing effect of n on generated Gaussian processes. In addition,
n also affects the variance of the process: processes with larger n tends to have smaller
variances as points are more likely to be similar to one another. For half-integer values of
n= p+ 1/2 where p is a non-negative integer, the covariance can be re-parameterized as
follows:
M(dj l,n= p+ 1/2)= exp
p
2nd
l
!
G(p+ 1)
G(2p+ 1)
p
å
i=0
(p+ 1)!
i!(p i)!
p
2nd
l
!
p1
The advantage of this form is that it does not require evaluating the modified Bessel func-
tion [149]. For example, atn= 3/2, the covariance simplifies to
M(dj n= 3/2)=
1+
p
3d
l
!
exp
p
3d
l
!
.
4.1.3 Multivariate Random Fields and Cross-Covariances
Consider a p-dimensional multivariate random fieldZ(s) = (Z
1
(s), . . . , Z
p
(s)) defined
onR
2
where Z
i
(s) is the ith process at locations, i = 1, . . . , p. Extending our previous
definition of Gaussian random field from Section 4.1.1,Z(s) is a Gaussian multivariate
random field if it can be fully specified by its mean vector (s) = E[Z(s)] and the
69
(a)n= 0.5 (b)n= 2 (c)n= 10
Figure 4.1: Sample univariate Gaussian processes under Matérn covariances with differ-
ent values ofn.
covariance matrixS. Given a finite sample of n locationsS = (s
1
, . . . ,s
n
), the random
vector(Z(s
1
), . . . ,Z(s
n
))2R
np
has a covariance matrix of dimensions n n where
S=
2
6
6
6
6
6
6
6
6
6
6
4
C(s
1
,s
1
) C(s
1
,s
2
) C(s
1
,s
n
)
C(s
2
,s
1
) C(s
2
,s
2
) C(s
2
,s
n
)
.
.
.
.
.
.
.
.
.
.
.
.
C(s
n
,s
1
) C(s
n
,s
2
) C(s
h
,s
n
)
3
7
7
7
7
7
7
7
7
7
7
5
and S must be positive semidefinite. The diagonal terms (i = j) are called direct- or
marginal-covariance functions and the off-diagonal terms (i6= j) are called cross-covariance
functions [152]. Fors
1
,s
2
2R
2
, the cross-covariance takes the form
C(s
1
,s
2
)= cov[Z(s
1
),Z(s
2
)] =
C
ij
(s
1
,s
2
)
p
i,j=1
=
cov
Z
i
(s
1
), Z
j
(s
2
)
p
i,j=1
.
70
Assuming stationarity and isotropy, the task of specifying the cross-covariance struc-
ture for a zero-mean ((s)= 0) Gaussian multivariate random field, with d =khk de-
noting the Euclidean distance, is reduced to choosing the covariance structure such that
cov
Z
i
(s
1
), Z
j
(s
2
)
= C
ij
(d).
Various methods of constructing valid cross-covariance models for Gaussian multi-
variate random fields have been proposed with varying degrees of sophistication and re-
striction. Separable cross-covariance functions were among the earliest suggested; these
are the simplest but also most restrictive model [153]. The linear model of coregional-
ization for random fields is also straightforward and is popular in geostatistics [143, 154,
155]. Kernel and covariance convolution methods have also been proposed although they
require numerical integration [156, 157]. More recently, several cross-covariance models
based on the Matérn covariance function have been proposed [141, 158]. These models
offer great flexibility in specifying the cross-covariance structure between random fields
and will be discussed further in the simulation portion.
4.2 Proposed Model
Of the two challenges in modeling air pollution using satellite aerosol optical depth (AOD),
data missingness is perhaps the more critical problem to tackle. As described above, one
approach would be to restrict the data sample to days and locations where there were
successful AOD retrievals and another would be to interpolate the missing AOD using
nearby available AOD. However, AOD missingness is not always random and is often
71
in response to less than ideal conditions for retrievals (e.g., heavy cloud, snow cover, or
overly bright surfaces) [77]. In the former, excluding days and locations that failed to
retrieve could lead to class imbalance in the data and uneven predictive performances
(e.g., summertime data could outnumber wintertime data) [52, 54, 75, 83, 159, 160]. In the
latter, gap-filling techniques often carry some uncertainty that could be difficult to quan-
tify downstream should the predicted air pollution data be applied to epidemiological
studies [45, 51, 53, 59, 80, 81].
Under the multivariate Gaussian random field framework as described in Section 4.1,
we aim to leverage the fact that nearby spatially indexed predictor fields could correlate
with the outcome field even though the coincident location is missing in the former. In the
satellite aerosol data provided by the Multi-Angled Implementation of Atmospheric Cor-
rection (MAIAC) algorithm, for example, there are typically successful retrievals nearby
within a given distance.
Consider an outcome variableY(s) and a predictor variableX(s), both spatially in-
dexed on a 2-dimensional domains2R
2
. Our proposed model seeks to address scenar-
ios whereY is scarce—and therefore of greater interest in predicting (e.g., air pollution
concentrations such as PM
10
and PM
2.5
)—whilex is more spatially distributed with a
certain degree of missingness. Our model leverages the more abundantly available data
fromx and weight their contribution to the relationship withy using covariance func-
tions (such as those described above). For a given location j, within a given maximum
72
distance d
max
and a pre-defined P number of observations fromx, our model requires the
following:
y
j
: observation fromy,
x
j
=
x
j1
, . . . , x
jP
: P observations within distance d
max
of y
j
, and
d
j
=
d
j1
, . . . , d
jP
, d
jp
d
max
: P corresponding distances betweenx
j
and y
ij
.
The covariance-based distance-weighted regression model takes the form:
y
j
= b
0
+b
1
f
x
j
, d
j
+e
i
f
x
j
, d
j
=
P
å
p=1
k(d
jp
j)x
jp
where e
i
N(0, 1) and k(j) is a valid covariance function. In this text, we generally
refer to this model as the “proposed model” and abbreviate it as the “CDR” (Covariance-
based Distance-weighted Regression) model.
4.3 The Matérn Cross-Covariance Model
Consider a bivariate (p = 2) Gaussian random fieldZ(s) = (Z
1
(s), Z
2
(s)) where each
component is a univariate Gaussian random field Z
i
(s) evaluable at every location in
finite sampleS = (s
1
, . . . ,s
n
),s
i
2R
2
. Assuming zero-mean, stationarity, and isotropy,
Gneiting et al. (2010) proposed a cross-covariance model based on the Matérn covariance
function whose symmetric and positive semidefinite covariance matrix C takes the form:
C=
2
6
6
4
C
11
(d) C
12
(d)
C
21
(d) C
22
(d)
3
7
7
5
=
2
6
6
4
s
2
1
M(dj l
1
,n
1
) r
12
s
1
s
2
M(dj l
12
,n
12
)
r
12
s
1
s
2
M(dj l
12
,n
12
) s
2
2
M(dj l
2
,n
2
)
3
7
7
5
73
where M(dj l,n) are Matérn covariance functions. On the diagonal, s
2
1
,s
2
2
> 0 are vari-
ance parameters,n
1
,n
2
> 0 smoothness parameters, and l
1
, l
2
> 0 lengthscale parameters
for the Matérn marginal covariances, respectively. The cross-covariance terms are Matérn
functions with parametersn
12
and l
12
, and the colocated correlation coefficientsr
12
.
2
For multivariate Gaussian random fields with more than 2 fields (p > 2), Gneiting
et al. (2010) suggest a parsimonious Matérn cross-covariance model that imposes certain
restrictions on l
ij
andn
ij
, for i6= j, to reduce the its overall complexity without loss of
generality. Specifically, the parsimonious model provides a valid second-order structure
inR
2
if the following conditions are met for 1 i, j p:
n
ij
=
1
2
n
i
+n
j
for i6= j,
l
ij
= l for i6= j and l
1
= = l
p
= l, and
r
ij
= b
ij
G(n
i
+ 1)
1/2
G(n
i
)
1/2
G
n
j
+ 1
1/2
G
n
j
1/2
G
1
2
n
i
+n
j
G
1
2
(n
i
+n
j
)+ 1
for i6= j,
where
b
ij
1 affords great flexibility in specifying the degree of correlation among the
components of the multivariate Gaussian field.
3
The covariance of a bivariate parsimo-
nious Matérn model is then simplified to
C=
2
6
6
4
s
2
1
M(dj l,n
1
) r
12
s
1
s
2
M
dj l,
1
2
(n
1
+n
2
)
r
12
s
1
s
2
M
dj l,
1
2
(n
1
+n
2
)
s
2
2
M(dj l,n
2
)
3
7
7
5
,
2
A symmetric C requires thatC
12
(d)=C
21
(d), or thatr
12
= r
21
,n
12
= n
21
, and l
12
= l
21
.
3
The requirement for r
ij
has been simplified to 2-dimensional case of random fields inR
2
. See Theorem
1 in [141] for the generalized d-dimensional case.
74
where
jr
12
j
G(n
1
+ 1)
1/2
G(n
1
)
1/2
G(n
2
+ 1)
1/2
G(n
2
)
1/2
G
1
2
(n
1
+n
2
)
G
1
2
(n
1
+n
2
)+ 1
.
In the bivariate case, a full Matérn cross-covariance model removes the restrictions on
n
12
, l
1
, l
2
, and l
12
and requires the following inequality for the Matérn cross-covariance
model to be valid:
r
2
12
G(n
1
+ 1)G(n
2
+ 1)G(n
12
)
2
G(n
1
)G(n
2
)G(n
12
+ 1)
2
l
2n
1
1
l
2n
2
2
l
4n
12
12
inf
t0
l
2
12
+ t
2
2n
12
+2
l
2
1
+ t
2
n
1
+1
l
2
2
+ t
2
n
2
+1
.
In summary, to simulate a bivariate Gaussian random field under the full Matérn cross-
covariance model, we need only to specify these quantities:
n
1
,n
2
,n
12
: smoothness parameters wheren
12
1
2
(n
1
+n
2
) ,
l
1
, l
2
, l
12
: lengthscale parameters where l
2
12
1
2
l
1
1
+ l
2
2
,
s
2
1
,s
2
2
: variances of the individual random fields,
r
12
: colocated correlation coefficient.
Each required parameter listed above has fairly straightforward interpretability, which
is not necessarily the case for more sophisticated methods such as kernel and covariance
convolution [156, 157]. The number of parameters also offers greater flexibility in model
specification compared to the more restrictive linear model of coregionalization [143, 154,
155]. Being able to identify general scenarios and edge cases where our proposed models
may or may not perform well is crucial, hence the Matérn model is an appealing option
to simulate data for evaluating the usefulness of our model.
75
With some re-parameterization, these can be specified to simulate the desired bivari-
ate fields using theRMbiwm() function in theRandomFields package [161, 162]. To demon-
strate how these parameters affect the (point-wise) linear correlation r, we simulated sev-
eral sets of bivariate Gaussian random fieldsZ(s) = (X(s),Y(s)) by varying each pa-
rameter while holding others constant. These are shown in Figures 4.2 and 4.3.
(a)n
1
= 1,n
2
= 2, r = 0.45
(b)n
1
= 2,n
2
= 1, r = 0.76
Figure 4.2: Simulated examples with different values ofn
1
andn
2
.
76
(a)r
12
= 1, r = 0.37
(b)r
12
=1, r =0.39
(c)s
2
1
= 1,s
2
2
= 5, r = 0.47
Figure 4.3: Simulated examples with different values ofr
12
(4.3a, 4.3b), and fors
2
1
ands
2
2
(4.3c).
77
4.4 Simulation Design
Each set of bivariate Gaussian random fields were simulated under the following specifi-
cations. The two fields shared a common spatial grid with the x- and y-coordinates both
taking values from10 through 10 in increments of 0.2.
4
This resulted in a 101 101
grid with 10,201 points in total. Our proposed model required that each outcome y
i
has
an accompanying set of P observations fromX(s) within certain distance d
max
. To meet
this requirement, we only considered the subset of the fieldY(s) with both x- and y-
coordinates in the range[5, 5].
Figure 4.4: Simulation design where each field is a 101 101 grid. Modeling is restricted
to observations that fall within the yellow square in fieldY(s). The sampledY(s)points
(red) are matched with points inX(s)that fall within their corresponding buffers (blue).
Figure 4.4 shows an example of a simulated bivariate field with the dimensions as
described above with d
max
= 5. In the fieldY(s), only points within the yellow square
were considered for modeling. Also shown are several sampled points inY(s) (red) and
their corresponding sets of points fromX(s) (blue circles). At d
max
= 5, each point y
i
had
a total of about 1940 possible matched points inX(s), so we could choose any P 1, 940.
4
The choice of scale should be made with respect to the lengthscale parameter l.
78
For our evaluations, we chose d
max
= 5 and P= 500, i.e., among the 1940 possible points
inX(s) within distance of 5 from a given observation y
i
, we chose the nearest 500 to fit
in the proposed model.
We compared our model against two classical models: the simple linear regression
model (LM) and the generalized additive model (GAM). Linear regression is a good base-
line model to compare against, and the GAM can fit flexible curves to model non-linear
relationships. For our proposed model, we evaluated using the squared exponential co-
variance which only has 1 parameter, the lengthscale. We compared these three models:
y
i
= b
0
+b
1
500
å
p=1
exp
d
2
ip
2l
2
!
x
ip
+e
i
(4.1)
y
i
= b
0
+b
1
x
i
+e
i
(4.2)
y
i
= b
0
+ s(x
i
)+e
i
(4.3)
wheree
i
N(0, 1) and s() is the smoothed term for input fromX(s).
In each iteration, we sampled 25 points inY(s) from the grid for the training dataset
and another 250 points for the test dataset. Each point was paired with their coincident
points inX(s) for the LM and GAM. For our proposed model, we paired each point in
Y(s) with its corresponding 500 points inX(s). This sampling scheme emulated the
typical air pollution estimation situation where models are typically trained on a handful
of sites and predictions are made on the larger surface.
5
To evaluate model performances
across the three comparison models, we estimated the linear correlation coefficient (r)
between the observed outcome y
i
and the predicted outcome ˆ y
i
.
5
For example, there were about 12 active PM
2.5
monitors in Los Angeles county in 2016, and not every
monitor was active everyday of the year.
79
Figure 4.5: Boxplots of r for training and test sets for linear models (LM), generalized
additive model (GAM), and our proposed model (CDR).
4.5 Simulation Parameters
4.5.1 A Rough Example
We simulated 1000 bivariate Gaussian random fieldsZ(s) = (Y(s),X(s)) with n
1
=
n
2
= 0.5, n
12
= 1.5, and all other parameters r
12
= l
1
= l
2
= l
12
= s
2
1
= s
2
2
= 1. With the
parameters as specified, both surfaces were fairly rough. Each time we fitted the 3 models
on the training dataset, predicted on the test dataset, and calculated the linear correlation
r between the observed and predicted y
i
’s from each model. Figure 4.5 shows the box-
plots of the linear correlation r for the training and the test datasets for the three models.
Among the training sets, the GAM performed well which was expected. However, in the
test sets, our model generally performed better than both the linear model and the GAM.
In fact, the GAM showed the largest gap in performance between the training set and the
test set, which indicates overfitting.
80
4.5.2 Marginal Smoothness Parameters
(a) Linear correlation r between predicted and observedy in the test dataset across three methods
over 64 possible pairs of(n
1
,n
2
).
(b) Difference in linear correlation r in the test dataset between the proposed model and GAM
(left) and LM (right).
Figure 4.6: Results of evaluating our model in bivariate fields with half-integer values of
(n
1
,n
2
) while keepingn
12
= (n
1
+n
2
) /2.
In this next simulation, we considered a set of half-integer values for the marginal
smoothness parameters, which control the smoothness of each univariate field individ-
ually: n
1
,n
2
2 (0.5, 1.5, . . . , 7.5). Keeping n
12
= (n
1
+n
2
) /2, we simulated 100 bivariate
fields for every permutation ofn
1
andn
2
, totally 6,400 simulations. For each iteration, we
81
fitted the three models using the training dataset then predicted on and calculated the
correlation r for the test dataset. The results are summarized in Figure 4.6.
Whenn
1
andn
2
were similar, all three models performed well with high linear correla-
tion r. Whenn
1
was large andn
2
was small—i.e., the input surfaceX(s) was smooth but
the outcome surfaceY(s) was rough—all three model predicted poorly. When n
2
was
large and n
1
was small—i.e., the input surfaceX(s) was rough but the outcome surface
Y(s) was smooth—the linear model performed poorly while both the GAM and CDR
model perform well. In fact, the linear model seemed only to perform well when the two
surfaces were similar in terms of roughness, holding constant all other parameters of the
Matérn cross-covariance model.
Figure 4.6 (b) shows the difference in correlation r between the CDR model and the
GAM and linear model, respectively. Comparing against the linear model, our model
performed well for all scenarios where the outcome surfaceY(s) was smoother than the
input surfaceX(s). Comparing against the GAM, our model performed slightly better
when the input surfaceX(s) was very rough (n
1
= 0.5) and the outcome surfaceY(s)
was smoother, and our model also performed slightly worse when the outcome surface
Y(s) was rougher than the input surfaceX(s).
4.5.3 Colocated Correlation Coefficient
In the previous section, bivariate random fields were simulated by varying the marginal
smoothness parameters for theX(s) andY(s) surfaces while holding constant the colo-
cated correlation coefficient atr
12
= 1. In this section, we compare the performance of our
82
proposed model against the linear model and the GAM when the colocated correlation
coefficient varies. We will refer to this parameter as the “spatial” correlation coefficient
to distinguish it from point-wise (linear) correlation coefficient r, which was the primary
metric of prediction performance for the models being considered.
The bivariate random fields were simulated using the following parameters:
• r
12
2f0, 0.2, 0.4, 0.6, 0.8, 1g,
• n
1
,n
2
2f0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5g,
Although the spatial correlation coefficient can range from1 to 1, preliminary sim-
ulation and model fitting found that negative values of r
12
mirrored positive values in
terms of model performance, hence we only considered the latter in our sets of parame-
ters. For each of the 384 combinations of parameters, we simulated 100 bivariate random
fields and fitted our proposed model along with the linear model and the GAM using the
same study design described in section 4.4.
Figure 4.7 displays the boxplots for linear correlations between the observed and
predicted values of Y(s) in the test sets across different models and simulation spec-
ifications. For simplicity, only results for 27 combinations of parameters are shown:
r
12
2f0.2, 0.6, 1g, n
1
,n
2
2f0.5, 2.5, 4.5g. The proposed model generally predicted val-
ues that showed better correlation with the observed values compared to LM and GAM,
although the degree of improvement varied greatly by the smoothness of the inputX(s)
and outcomeY(s) fields, as well as the spatial correlationr
12
. All models predicted better
as the underlying spatial correlation increases from 0 to 1. There was a general decrease
of performance for our model asX(s) became smoother while LM and GAM performed
better.
83
Figure 4.7: Boxplots of r for training and test sets for linear model (LM), generalized
additive model (GAM), and our proposed model (CDR) across a subset of simulation
parameters.
More specifically, the proposed model showed the greatest improvement over LM and
GAM whenX(s) was rough (n
1
= 0.5),Y(s) was smooth (n
2
= 2.5 or 4.5), and the spatial
correlation was high (r
12
= 1). A possible explanation for these findings was that when
Y(s) was smooth and andX(s) was rough, the points onY(s) varied at a slower rate
than those onX(s); as such, although the spatial correlation r
12
was high, the pointwise
linear correlation might not track as well. Consequently, linear model and GAM, which
rely on the linear correlation between points inX(s) andY(s), struggled to capture the
underlying spatial relationship between the two surfaces while our model, which was
84
informed not only by the coincident point but also the nearest 500 points inX(s), was
able to better recover that relationship in the model.
Meanwhile, the proposed model seemed to perform slightly worse than LM and GAM
whenX(s) was smooth (n
1
= 4.5) andY(s) was rough (n
2
= 0.5), regardless of spatial
correlation. This was the inverse of the previous scenario, and the difference in smooth-
ness could also help explain the poorer performance in our proposed model: a smooth
X(s) meant that a set of points inX(s) were likely to have relatively small variations by
location, and thus they were unlikely to be able capture the higher degree of variation in
Y(s); our model, which relied on the coincident as well as neighboring points to predict
points onY(s), understandably did not do as well here.
4.5.4 Missingness
In our previous work with satellite-observed aerosol data, the rate of retrieval failure var-
ied greatly from day to day (Figure S4.1). However, on average we typically excluded
about 35% of our PM
2.5
data from modeling when restricted to using only those with co-
incident AOD data. To replicate the intended use case of our model—where coincident
input data was missing but neighboring input data were available—missingness was in-
troduced into our simulation at a rate of 35%.
Bivariate random fields were generated again using the same set of smoothness and
spatial correlation coefficient parameters. At each iteration, a set of 25 points inY(s) and
their corresponding points inX(s) was sampled for the linear and generalized additive
models. For the proposed model, we
85
• added another 15 points inY(s),
• randomly deleted 35% of points inX(s)(except those that were coincident for the
original 25 points inY(s)),
• located the nearest 500 points inX(s) for all 40 points inY(s).
The simulation, modeling fitting, and prediction procedures were repeated and a sam-
ple of the results are summarized in Figure 4.8. The findings here were generally consis-
tent with those in the previous section which did not include missingness. Although it
relied on points inX(s) that were likely further away from the points inY(s) to fit the
weighted covariance function, the proposed model improved its prediction performance
very slightly in settings where it excelled previously (rough X(s) and smooth Y(s)),
likely due to the additional data. The larger sample, however, did not improve on the
prediction performance where the model struggled to outperform LM and GAM (rough
Y(s) and smoothX(s)).
86
Figure 4.8: Boxplots of r for training and test sets for linear model (LM), generalized
additive model (GAM), and our proposed model (CDR) across a subset of simulation
parameters, where missingness inX(s) was introduced.
87
4.6 Supplementary Figures
Figure S4.1: Samples of successfully retrieved AOD data on two consecutive days in Oc-
tober 2018 over southern California. Los Angeles county is outlined in yellow and the
red dots are PM
2.5
monitors. Although it was during the fall in a more temperate region,
successful retrieval is not guaranteed.
88
Chapter 5
Application in PM
2.5
Estimation
In this chapter, we applied the Covariance-based Distance-weighted Regression (CDR)
model to a real-world problem: relating satellite-observed aerosol to ground-monitored
concentration of particulate matter with aerodynamic diameter less than 2.5 μm(PM
2.5
).
In chapters 2 and 3, we constructed PM
2.5
prediction models using only coincident aerosol,
retrieved at the exact pixel where each ground monitor was located. As described in
chapter 1, satellite instruments can fail to retrieve atmospheric aerosol for a variety of
reasons. In this chapter, we incorporated nearby aerosol data into the CDR model to
estimate PM
2.5
in three ways: (1) direct estimation in univariate models, (2) direct esti-
mation in multivariate models that included meteorological variables, and (3) multi-step
estimation that used results of the multivariate model as a predictor in machine learning
estimation models.
89
5.1 Materials and Methods
5.1.1 Study Region
We chose Los Angeles county as our study region for several reasons. First, sunny south-
ern California is known for its temperate climate with infrequent rain and few inclement
weather events, thus atmospheric aerosol would be better observed year-round by satel-
lite instrument with minimal cloud cover interference [50]. Figure 5.1 shows great re-
trieval success in southern California in winter months (December through February)
compared to Ulaanbaatar—the coldest capital in the world. Second, there were also a
good number of high-quality PM
2.5
monitors in the county from 2000 through 2019 and
several of them were active almost daily in that 20-year period. Although Kuwait—which
is very flat and is also one of the warmest capitals in the world [163, p. 374]—has better
aerosol retrieval rates than Los Angeles county, there were very few consistently active
PM
2.5
monitors in the region (see Chapter 3). Finally, we intended to use the best predic-
tive model in this chapter to estimate PM
2.5
for subjects in the Children’s Health Study.
This would let us assess the model’s out-of-area prediction performance by training on
PM
2.5
data from within the county and predicting on subjects outside the county in com-
munities such as Santa Barbara and Upland.
5.1.2 Satellite Aerosol Data
Atmospheric aerosols is commonly measured by satellite instruments as aerosol optical
depth (AOD) which reflects the amount of light that is scattered or absorbed by aerosols
90
Figure 5.1: Comparison of monthly average MAIAC AOD retrieval success rates in 2018
for Kuwait, Los Angeles county, and Ulaanbaatar municipality.
within a column through the atmosphere [164]. On board the polar-orbiting Terra satel-
lite, the MODerate resolution Imaging Spectroradiometer (MODIS) and the Multi-angle
Imaging SpectroRadiometer (MISR) instruments have been providing satellite-observed
atmospheric aerosols data since 2000. Due to their global coverage, both MODIS and
MISR have been popular sources of AOD data in studies estimating ambient air pollution
with remote-sensed predictors. MODIS AOD and MISR AOD were originally provided
at 10 km and 17.6 km spatial resolution, respectively. Recent advanced algorithms have
downscaled MODIS AOD to 1 km (using the the Multi-Angle Implementation of Atmo-
spheric Correction algorithm—MAIAC) and MISR AOD to 4.4 km resolution [48, 49].
AOD data provided by each instrument have different strengths and weaknesses. In
addition to the higher spatial resolution, the MODIS instrument is able to scan the whole
earth daily. However, MODIS only retrieves AOD at 470 nm and 550 nm, both of which
91
have been shown to correlate poorly with ambient PM
2.5
[65, 165]. Due to its narrower
observation swath, the MISR instrument takes 16 days to cover the entire globe; for any
given location, MISR AOD is typically retrieved every 3–6 days, with higher frequency
for locations closer to the poles due to greater overlapping retrieval paths (Figure S5.1).
The unique strength of the MISR instrument is its ability to retrieve AOD at different
sizes, shapes, and aerosol type [97, 107]. These fractionated AOD columns were able to
predict PM
2.5
with similar performance as models using MAIAC AOD at as little as 10%
of the sample size, as seen in Chapter 3. Despite their strength, the combination of lower
temporal and spatial resolution as well as spatial sparseness made MISR AOD variables
less suitable for our proposed model. The CDR model is designed to overcome miss-
ing coincident AOD by relying on AOD from nearby pixels, and the narrow observation
swaths unfortunately limited the usefulness of our model when the MISR paths did not
fully cover the target area (Figure S5.2). We therefore focused our application to MAIAC
AOD.
Daily MAIAC AOD (MCD19A2 Version 6) data product files were downloaded from
the Level-1 and Atmosphere Archive and Distribution System Distributed Active Archive
Center (LAADS DAAC) maintained by the National Aeronautics and Space Administra-
tion (NASA) [166], specifically tile h08v05 of the MODIS sinusoidal tile grid, which covers
the entire study area [167]. We processed MAIAC AOD from March 2000—the earliest
available—through December 2019 for Los Angeles county. The proposed model relies
on AOD from nearby MAIAC pixels to account for failed coincident aerosol satellite re-
trieval, so we processed MAIAC AOD for an additional 50 km buffer around the county.
This buffer would also accommodate monitors near the border of the county, such as
92
the PM
2.5
monitor located south of Lebec, California on the border between Los Angeles
county and Kern county. We filtered for MAIAC retrievals marked as “Clear” in the data
product files and excluded those marked as “Possibly cloudy” or “Cloudy” [48]. MAIAC
AOD was measured at the 550 nm wavelength and was log-transformed to improve lin-
earity in model-fitting. Figure 5.2 demonstrates the high resolution of the MAIAC grid
and highlights the coincident pixels (those within 1 km) of the PM
2.5
monitors.
Figure 5.2: Map of PM
2.5
monitors in south Los Angeles county (red triangles), MAIAC
pixels (left panel), 1 km coincident MAIAC pixels (blue squares), gridMET pixels (right
panel), and 4 km coincident gridMET pixels (green squares).
5.1.3 EPA Particulate Matter Data
The United States Environmental Protection Agency (EPA) Air Quality System (AQS)
is a database of ambient levels of important air pollutants across the U.S. Most of these
pollutants have been monitored as early as 1980 such as carbon monoxide (CO), lead (Pb),
93
nitrogen dioxide (NO
2
), ozone (O
3
), and sulfur dioxide (SO
2
), while particulate matters
were added later (PM
10
in 1988 and PM
2.5
in 1999). Excluding colorless pollutants (O
3
and
SO
2
), we chose PM
2.5
as the target for estimation as it had the highest linear correlation
with satellite-retrieved aerosol optical depth (AOD) at 550 nm (r = 0.18) and a greater
sample size compared to Pb, and PM
10
. Figure S5.3 shows a sample of each EPA pollutant
matched to available AOD data, and their respective linear correlation.
We acquired daily-averaged PM
2.5
data from the EPA from the Air Data website [102]
for monitors within Los Angeles county and for the same period that MAIAC AOD data
were processed—March 2000 through December 2019. We filtered for monitors that met
the requirements for the Federal Reference and Equivalent Method (FRM/FEM) using
parameter occurrence code (POC) 1. PM
2.5
concentrations were measured in μg/m
3
and
were log-transformed in the model-fitting process to improve linearity.
5.1.4 gridMET Meteorological Data
Meteorological data have been instrumental in models estimating PM
2.5
[38, 40, 47, 54,
56, 57, 61, 66, 95, 165, 168]. gridMET is a daily meteorology dataset at 4 km spatial reso-
lution by the Climatology Lab at the University of Idaho [109] and has been extensively
validated [46]. In our previous study over California, we found several gridMET vari-
ables to be important in machine learning models constructed to predict PM
2.5
, notably
shortwave surface radiation, temperature, and wind speed. We acquired gridMET data
for the study period over Los Angeles county, including the same 50 km buffer that we
94
provided for MAIAC AOD pixels. Figure 5.2 shows the resolution of the gridMET grid
and highlights the coincident pixels (those within 4 km) of the PM
2.5
monitors.
5.1.5 Data Integration
To construct the dataset for the proposed CDR model, we conducted the procedure de-
scribed below. We first chose d
max
(50 km), N
AOD
(250, 500, or 1,000), and N
MET
(50) in
our setting and for each PM
2.5
observation within Los Angeles county, we:
1. Identified up to N
AOD
observations of AOD within d
max
and sorted them by in-
creasing order of distance between PM
2.5
site and MAIAC pixel.
2. Excluded the PM
2.5
observation from our sample if there were fewer than N
AOD
observations of AOD available within d
max
. This observation would also have been
excluded in other coincident modeling approaches. It is possible for there to be
coincident AOD measurements while not having N
AOD
AOD observations in total,
but these cases were extremely rare.
3. Identified up to N
MET
pixels of gridMET, collected their corresponding meteorology
variables.
4. Assigned the AOD and gridMET variables from the nearest pixel in their respective
grids as the independent variables for the linear and generalized additive models.
95
5.1.6 Univariate Direct Estimation
We began with univariate models relating PM
2.5
to satellite AOD. Our primary compar-
isons were the linear model (LM) and the generalized additive model (GAM) as summa-
rized in Table 5.1, where s(AOD) is the GAM smoothed term for AOD.
Table 5.1: Univariate PM
2.5
estimation models using only MAIAC AOD.
Model Construction
LM PM= b
0
+b
1
AOD+e
GAM PM= b
0
+ s(AOD)+e
CDR PM= b
0
+b
1
å
500
p=1
exp
d
2
p
2l
2
AOD
p
+e
For the proposed model, we specified N
AOD
—the number of AOD observations to
be included in the distance-weighted term—at 250, 500, and 1,000 to examine whether
incorporating more AOD observation led to better predictive performance. N
AOD
does
affect the overall sample size: the sample was larger when we restricted the sample to
cases where there were at least 250 AOD observations for every PM
2.5
observation, and
the sample was smaller when the requirement is at least 1,000 AOD observations per
PM
2.5
observations.
In most studies relating PM
2.5
to satellite-observed aerosols, the prediction models
were typically fitted using only “coincident” AOD. This means that ground-monitored
PM
2.5
measurements would typically be paired with AOD observed at the nearest avail-
able pixel within the native resolution. For example, in Chapter 3 we matched PM
2.5
data
with MAIAC AOD within 1 km and with MISR AOD within 4.4 km. In this portion,
we wanted to compare performances between the models in the following subsets of our
data: (1) the entire available dataset and (2) the subset with coincident AOD within 1 km.
96
We were also curious how the proposed model would perform when in the subset without
coincident AOD.
In sum, for every level of N
AOD
of interest, we fitted seven models: 3 models for the
entire dataset, 3 models for the subset with 1 km coincident AOD, and 1 CDR model for
the subset without 1 km coincident AOD, for a total of 21 models.
5.1.7 Multivariate Direct Estimation
We then extended the univariate models by incorporating coincident meteorological co-
variates as linear terms in multivariate models. In addition to these three models (LM,
GAM, and CDR), we added two models: a “fully smoothed” GAM (GAM-F) where every
meteorological variables were also fitted with a smoother and a “fully distance-weighted”
CDR model (CDR-F) where we applied the same covariance-based distance-weighted
terms to the meteorological variables. For the CDR-F model, we incorporated the 4 km
coincident gridMET variables along with data from the nearest 50 gridMET pixels. The
choice of N
MET
= 50 here was made for two reasons: (1) gridMET is provided at much
lower resolution (4 km) and (2) gridMET data is complete and does not require searching
for data in pixels farther away, as is the case for MAIAC AOD. Among the gridMET me-
teorological variables, we attempted different combinations and settled on a set of three
variables that resulted in the best prediction:
We summarized in Table 5.2 the models being fitted, where x
ij
are linear meteorolog-
ical terms for j 2, s() are smoothed terms in the generalized additive models, and
97
the following covariance-based distanced-weighted terms using the squared exponential
covariance function:
f
1
(AOD)=
500
å
p=1
exp
d
2
p
2l
2
1
!
AOD
p
f
j
(x
j
)=
50
å
p=1
exp
d
2
p
2l
2
j
!
x
jp
.
Table 5.2: Multivariate PM
2.5
estimation models using MAIAC AOD and gridMET mete-
orology.
Model Construction
LM PM= b
0
+b
1
AOD+å
J
j=2
b
j
x
j
+e
GAM PM= b
0
+ s(AOD)+å
J
j=2
b
j
x
j
+e
GAM-F PM= b
0
+ s(AOD)+å
J
j=2
s
x
j
+e
CDR PM= b
0
+b
1
f
1
(AOD
i
)+å
J
j=2
b
j
x
j
+e
CDR-F PM= b
0
+b
1
f
1
(AOD
i
)+å
J
j=2
b
j
f
j
(x
j
)+e
Having examined the results of the univariate models, we observed that N
AOD
= 500
was a good specification for the CDR model in this context where it (1) improved on
predictive performance over N
AOD
= 250 and also (2) took less time to fit compared to
N
AOD
= 1, 000.
Based on the comparison in performance among univariate models in the subset miss-
ing 1 km coincident AOD, we wanted to further examine how the models could perform
when we included observations whose nearest available AOD was within 10 km (native
MODIS AOD resolution prior to re-processing by MAIAC). As such, we fitted the five
models in Table 5.2 twice, first in the sample with 1 km coincident AOD and then in the
sample with 10 km coincident AOD, for a total of 10 models.
98
5.1.8 Multi-stage Estimation
In recent years, several studies have constructed more complex multi-stage models to
predict PM
2.5
. One approach has been to first calibrate the relationship between PM
2.5
and AOD using linear mixed-effects models, sometimes adjusted for meteorology and
land-use variables included, and then model the residuals of the in a second stage using
the same or different set of predictors [82, 169]. Several of these studies also incorporated
interpolation techniques to account for missing AOD data from satellite instruments [72,
82]. Kloog et al. (2015) first modeled PM with AOD where AOD data were available, then
predicted PM where AOD was available but no PM monitor, then incorporated the newly
predicted PM to predict PM where both PM and AOD were missing [170]. More recently,
Di et al. (2019) trained an ensemble model to relate PM
2.5
with a collection predictors
then used this model to predict spatially and temporally lagged PM
2.5
and used these
predicted PM
2.5
measurements as predictors in a second round of ensemble models to
estimate PM
2.5
[76].
Here we wanted to evaluate whether first calibrating the relationship between PM
2.5
and AOD using the CDR model could improve the predictive performance of machine
learning models compared to using coincident AOD directly as a predictor in the ma-
chine learners. We also assessed whether relaxing the coincident requirement for MAIAC
AOD would affect prediction performance by fitting the models below in two subsets:
those with 1 km coincident AOD and those with 10 km coincident AOD. This multi-stage
modeling approach is summarized in three steps below.
99
1. Fit univariate/multivariate PM
2.5
vs. AOD models.
2. Predict AOD-calibrated PM
2.5
for all PM
2.5
locations
3. Fit machine learning models either AOD-calibrated PM
2.5
For step 3, we considered a set of five machine learning methods: ridge regression,
LASSO regression, gradient boosting, random forest, and support vectors. The last three
have shown great predictive utility in our previous studies [36, 66]. The linear methods
(ridge and LASSO) most often struggled to predict PM
2.5
due to the non-linear relation-
ship between PM
2.5
and AOD and we wanted to evaluate whether AOD-calibrated PM
2.5
improved linearity for these two models.
Based on preliminary results, we chose to focused on random forest for a few reasons:
(1) the linear learners did not improve when using AOD-calibrated PM
2.5
and (2) both
gradient boosting and support vector machines learned much slower than random forest
without showing any less reduction in overfitting (e.g., gradient boosting required tens
of thousands of trees to be competitive with random forest, which only required a few
hundred trees). We therefore only reported the results based on random forest models.
Finally, we estimated feature importance of the best predicting model using permutation
accuracy importance, which in a regression context is measured as the increase in mean
squared error (MSE) when the a given feature was excluded. Permutation importance
has been shown to be more robust than other popular importance metrics such as Gini
impurity [138].
100
5.1.9 Model Evaluation
In the univariate and multivariate direction estimation models, we used 10-fold cross-
validation to assess model performance. We divided up the data into ten subsets of equal
size and then in ten iterations, we fitted the models on 90% of the data (training set)
and predicted on the remaining 10% (validation set). At each iteration, we compared the
observed PM
2.5
against PM
2.5
measurements predicted by the different models. We used
the coefficient of determination (R
2
) and root mean squared error (RMSE) as the metrics
to compare predictive performance among the PM
2.5
estimation models described in the
three scenarios described above. R
2
helps us understand how much of the variation in
PM
2.5
can be explained by a given model and RMSE quantifies the average deviation
between the predicted and observed PM
2.5
.
For the multi-stage prediction models, we first split the data into two sets: 70% for
the training set and 30% for the test set. We then trained the machine learning models
on the 70% using 5-fold cross-validation to identify the best hyperparameters for each
of the learning methods (e.g., number of trees for gradient boosting and random forest
learners). Using the best models in the train set, we then predicted on the test set to assess
their out-of-sample prediction.
5.2 Results
From March 2000 through December 2019, there were 16 active PM
2.5
monitors in Los
Angeles county that met the EPA FRM/FEM requirement. These monitors—and their
corresponding number of daily PM
2.5
measurements collected—are shown in Figure 5.3.
101
Figure 5.3: EPA PM
2.5
monitors (red dots) in Los Angeles county (blue outline) from
2000–2019.
During the study period, about 42,000 daily PM
2.5
measurements were collected and more
than half (54.4%) were from the 4 most active monitors: Los Angeles (near Los Angeles
Department of Water and Power on Main Street), Long Beach - North, Long Beach - South,
and Azusa.
When matching MAIAC AOD to this sample, we were able to paired AOD with 33,601
PM
2.5
observations for N
AOD
= 250, 32,591 for N
AOD
= 500, and 31,111 for N
AOD
=
1, 000. Figure 5.4 shows a sample of successfully retrieved MAIAC AOD as well as active
PM
2.5
monitors during 12 days in the year 2018.
102
Figure 5.4: Sample of MAIAC AOD observations and PM
2.5
measurements (triangles) in
Los Angeles county on twelve days in 2018.
103
5.2.1 Univariate PM
2.5
Estimation
We summarized the prediction performance of the univariate models in Table 5.3; box-
plots for the cross-validated R
2
and RMSE are shown in Figure S5.4. We observed that at
all three N
AOD
of interest, both when using the entire sample and when using only the
sample with 1 km coincident AOD, the proposed CDR model was able to explain more
of the variable in PM
2.5
than the linear and generalized additive models. The CDR model
was also better at reducing the deviation between the observed and predicted PM
2.5
. Re-
stricting to the subset with 1 km coincident AOD reduced the sample sizes by 20–25%,
and the reduction was largest when reducing at lower N
AOD
. The separation in perfor-
mance between the CDR and the other models was reduced when restricting to 1 km
coincident AOD, although the separation remained clear. All three models performed
better when (1) restricting to 1 km coincident AOD and (2) increasing the required N
AOD
.
Finally, in the subset missing 1 km coincident AOD, the CDR performed very poorly.
Table 5.3: Performance of univariate PM
2.5
models using only MAIAC AOD, reported as
median cross-validation R
2
and RMSE.
N
AOD
Subset N Model R
2
RMSE
250 All 33,601 LM 0.030 9.12
GAM 0.033 9.09
CDR 0.065 8.94
1 km coincident AOD 25,221 LM 0.076 8.69
GAM 0.073 8.72
CDR 0.098 8.62
Missing coincident AOD 8,380 CDR 0.015 9.63
500 All 32,591 LM 0.031 9.13
GAM 0.040 9.08
CDR 0.072 8.87
1 km coincident AOD 25,167 LM 0.080 8.71
GAM 0.081 8.72
104
Table 5.3: Performance of univariate PM
2.5
models using only MAIAC AOD, reported as
median cross-validation R
2
and RMSE.
N
AOD
Subset N Model R
2
RMSE
CDR 0.107 8.60
Missing coincident AOD 7,424 CDR 0.022 9.93
1000 All 31,111 LM 0.045 9.14
GAM 0.050 9.10
CDR 0.083 8.90
1 km coincident AOD 25,056 LM 0.087 8.75
GAM 0.081 8.74
CDR 0.113 8.62
Missing coincident AOD 6,055 CDR 0.030 9.91
For the CDR models fitted using the N
AOD
= 500 subset, parameter estimates of the
distance-weighting function for AOD are showed in Table 5.4 and their curves plotted
in Figure 5.5. We found that the distance-weighting function for the subset with 1 km
coincident AOD decayed the fastest and while that for the subset missing 1 km coincident
AOD decayed the slowest; however, the parameter estimateb
1
for the former is also much
larger than that of the latter.
Table 5.4: Estimates for the parameter b
1
and the length-scale parameter l
1
for MAIAC
AOD in the three univariate CDR models fitted.
Subset b
1
l
1
All 0.084 4.81 10
5
1 km coincident AOD 0.285 1.91 10
4
Missing coincident AOD 0.046 2.70 10
5
105
Figure 5.5: Distance weighting curves for AOD in the three univariate CDR models, with
the plain covariance function on the left and adjusting for b
1
on the right.
5.2.2 Multivariate PM
2.5
Estimation
Choosing N
AOD=500
resulted in a total sample size of 32,591. Among these, the 5 mod-
els for comparison were fitted on two samples: those with only 1 km coincident AOD
(N = 25, 167) and those with 10 km coincident AOD (N = 29, 770). Figure 5.6 summarize
the proportions of this sample for those with 1 km coincident AOD (77%), those whose
nearest AOD was from 1–10 km (14%), and those whose nearest AOD was more than
10 km away (9%). Among the meteorological variables, we found that specific humid-
ity, minimum temperature, and wind speed were the best combination of meteorological
covariates to include in the multivariate models.
For the sample with 1 km coincident AOD, we found that the models were able to
explain an additional 18–22% of the variation in PM
2.5
(Table 5.6 and Figure S5.5). The
GAMs showed some separation in performance against the linear model while the 2-4%
gap in R
2
between the CDR model and the GAM remained generally the same (32.8%
106
Figure 5.6: Partitions of the data for those with 1 km coincident AOD, 10 km coincident
AOD, and those whose nearest AOD pixel is more than 10 km away.
vs. 28.5%, respectively). Adding meteorology smoothed term allowed the GAM to more
clearly separate itself from the LM (31% vs. 26.2%), although it still lagged behind the
CDR slightly. Meanwhile adding covariance-based distance-weighting terms for meteo-
rology did not improve the predictive R
2
of the CDR at all, although the cross-validated
RMSE seemed to decrease slightly.
Table 5.5: Performance of multivariate PM
2.5
models incorporating meteorological vari-
ables in addition to MAIAC AOD, reported as median cross-validation R
2
and RMSE.
Subset N Model R
2
RMSE
1 km coincident AOD 25,167 LM 0.262 8.34
GAM 0.285 8.34
GAM-F 0.310 8.20
CDR 0.328 8.18
CDR-F 0.328 8.15
10 km coincident AOD 29,770 LM 0.275 8.42
GAM 0.286 8.21
GAM-F 0.317 8.20
CDR 0.321 8.12
CDR-F 0.319 8.15
107
In the sample with 10 km coincident AOD (i.e., the nearest available AOD is within
10 km), we found that the predictive performance largely remained the same. The GAM
showed some more improvement in the fully smoothed model, while the CDR showed
some slight decreases in R
2
and some minor improvement in RMSE. As these are me-
dian R
2
and RMSE of 10-fold cross-validation, the random splits might have affected the
within-method performances somewhat, but the separation between the CDR and the
LM and GAM seemed stable.
5.2.3 Multi-stage PM
2.5
Estimation
Multi-stage estimation found that using AOD-calibrated PM
2.5
helped random forest
models predict better in the test set, both in explaining more of the variation in observed
PM
2.5
and reducing the deviation between observed and predicted PM
2.5
(Table 5.6). The
difference in performance between the model using AOD directly and one using AOD-
calibrated PM
2.5
became larger when we added observations with 10 km coincident AOD.
Other predictors included in the final model were meteorological variables (energy re-
lease component, humidity, shortwave surface radiation, temperature, wind speed, and
wind direction), temporal trends (Julian date and month), and spatial trends (longitude
and latitude of coincident MAIAC AOD). Permuted variable importance found AOD-
calibrated PM
2.5
to be the most important feature, although latitude was almost just as
important 5.7. Out of the five most important predictors, 3 were spatial (coordinates)
and temporal (Julian date) trends with wind speed being the fifth most important. These
results were similar to our findings in chapters 2 and 3.
108
Table 5.6: Prediction performance comparing multi-stage machine learning models with
AOD-calibrated PM
2.5
to models using AOD directly, measured in test R
2
and test RMSE.
Subset Input R
2
RMSE
1 km coincident AOD 0.710 6.14
CDR-PM 0.719 6.05
10 km coincident AOD 0.708 6.19
CDR-PM 0.726 5.98
Figure 5.7: Variable importance for the final random forest model estimating PM
2.5
; per-
muted importance is measured by percentage increase in MSE when each variable is ex-
cluded from the model.
5.3 Discussion
In univariate estimation, we observed that all comparison models performed better when
(1) restricting to 1 km coincident AOD and (2) selecting larger N
AOD
. The first scenario
made sense, especially when recognizing that the standard model fitting approach with
remote-sensed data would be to restrict to coincident measurements. Fitting a model to
109
explain PM
2.5
using satellite-observed AOD at a location 20 km away would be unlikely
to yield strong predictive results. We observed similar results in a previous study model-
ing PM
2.5
, PM
10
, and SO
2
in Ulaanbaatar, where models restricted to MISR AOD within
10 km predicted better than those using any AOD available, regardless of how far away
[50].
When N
AOD
increases, we were likely restricting to the subset on days where satel-
lite instruments were more successful in retrieving atmospheric aerosol. We have noted
before that the pattern of satellite retrieval failure is not always random and could be
related to inclement weather conditions or other phenomena such as wildfire smokes or
sand storms (see also Figure 5.4). As such, a higher N
AOD
could mean that the result-
ing sample contained more days with clear sky and thus leaving the model vulnerable
to poor prediction on days with less than ideal meteorological conditions. Therefore, we
suggest that caution should be taken when increasing N
AOD
, in addition to concerns for
computational time and resources.
Unfortunately, when the distances between the PM
2.5
monitor and the nearest avail-
able AOD extended beyond 10 km, the model deteriorated quickly in performance. This
can be seen in how the model upweighted the importance of AOD observations further
away in the sample of total data and the sample missing coincident AOD 5.5. However,
as seen in the curves adjusted for the b
1
parameter, when the distance-weighting had
to rely on AOD observations further away, the model modulates their contribution by
quite a lot (b
1
= 0.046–0.084). On the other hand, in the sample with 1 km coincident
AOD, the model essentially zeroed out all AOD observations beyond 15 km while also
110
increasing the average contribution of each nearby AOD observation (b
1
= 0.285). Re-
laxing the model to 10 km coincident AOD meant an addition of 14% of the total sample
size, although we still excluded almost 10% of the remaining non-coincident AOD sample
(Figures 5.6 and S5.6).
In multivariate estimation, the fully smoothed GAM where both AOD and meteoro-
logical variables were smoothed did improve over the GAM with smoothed AOD and
linear meteorology. However, distance-weighting meteorology using additional nearby
gridMET data did not improve the CDR model at all. One possible explanation could
be that gridMET was a product of integrating gridded data from the Parameter-elevation
Regressions on Independent Slopes Model (PRISM) with the North American Land Data
Assimilation System (NLDAS-2) [46]. As such, meteorological estimations from gridMET
were likely to have been overly smoothed when going pixel to pixel.
Recall in chapter 4 that our proposed model performed better when the input data
fieldsX(s) were rough (i.e., has higher variance) and worse when the input data fields
were smoother than the outcome field Y(s). As constructed, the covariance function
likely detected few changes in meteorological values in nearby gridMET pixels, and thus
its distance-weighting parameter (l) decayed quickly and the distance-weighted term ap-
proximated to a linear term. One implication of these findings is that the CDR model
might not be suitable when the input data were interpolated or gap-filled, which typi-
cally could be smoother than the truth. This limitation, however, does comport with one
important principle underlying our modeling approach: we avoid relying on interpola-
tion where possible as these gap-filling techniques typically carry a degree of uncertainty
that would be passed on further in complex multi-stage settings.
111
Feature importance results for our final random forest model indicated that our model
was heavily influenced by spatial and temporal trends rather than meteorology. Its over-
reliance on spatial trends was likely due to the sparseness of PM
2.5
monitors in Los Ange-
les county: there were only 16 active monitors in a county with 10.3 km
2
in land area over
the 20-year study period. This would certainly have implications on out-of-area predic-
tion. In our previous model over the entire California [36], we were able to incorporate
PM
2.5
monitors from the entire state, so the important features in those models were more
diverse among the target air pollutants (we also fitted using MISR fractionated AODs,
which often related better to PM
2.5
than MAIAC AOD). The strong temporal trend in the
data was also noted in the California study, a known long-term trend of PM
2.5
decreasing
over time in Southern California following more stringent air quality guidelines by the
EPA.
In constructing the generalized additive models to compare against in both univariate
and multivariate estimation, we intentionally left out a 2-dimensional smoothed term
for spatial trends (longitude and latitude). Although such a spatial smoothed term is
common when using GAM to relate PM
2.5
and AOD, it would have been inappropriate
in our setting. When given such a term, the GAM would excel in capturing the spatial
pattern of the outcome, PM
2.5
. Our proposed model is intended to improve the linear
relationship between PM
2.5
and AOD using covariance-based distance-weighting without
explicitly modeling spatial trends as a regression term. This study, of course, leaves room
for future work where one might do just that: incorporating a covariance-based distance-
weighting term into generalized additive models that do include an explicit smoothed
term for spatial trends.
112
Although the CDR maintained a certain improvement over the linear and general-
ized additive models when meteorology variables were incorporated, the relative differ-
ence became smaller. This is understandable by recognizing that despite constructing a
complex covariance-based distance-weighting term, the model is fundamentally a linear
model. Therefore, our model improves by being informed by nearby AOD measure-
ments rather than modeling the well-established non-linear relationship between AOD
and PM
2.5
, which the GAM and certain machine learners would also naturally excel at.
One major limitation in our modeling approach is the computational complexity, par-
ticularly in the data management process. For a sample of 32,591 PM
2.5
observations
at N
AOD
= 500, we had to incorporate an matrix of AOD measurements and a matrix
distances, both with dimensions 500 32, 591, and neither of them were sparse. Applica-
tion of our model is also likely to be limited to regions with good satellite retrieval rates
(basically warm and flat areas). This limitation prevents it from being used in regions
with more extreme climate whose air pollution concentrations were no less severe (e.g.,
Ulaanbaatar).
113
5.4 Supplementary Figures
Figure S5.1: Sample of several MISR retrieval “paths” over North America.
Figure S5.2: Example of a MISR path that does not fully cover the target area—Los Ange-
les county.
114
Figure S5.3: Sample of different air pollutants monitored by the EPA and their linear
relationship with MAIAC AOD (550 nm).
115
Figure S5.4: Prediction performance of univariate models estimating PM
2.5
, grouped by
the coincident MAIAC AOD subsets and the number of MAIAC AOD pixels for the CDR
model. LM: linear model; GAM: generalized additive model; CDR: covariance-based
distance-weighted regression; “-C”: 1 km coincident AOD subset; and “-M”: missing co-
incident AOD.
116
Figure S5.5: Prediction performance of multivariate models estimating PM
2.5
. LM: lin-
ear model; GAM: generalized additive model; GAM-F: fully smoothed (meteorological)
GAM; CDR: covariance-based distance-weighted regression; and CDR-F: fully distance-
weighted (meteorological) CDR.
Figure S5.6: Proportions of the overall data where the nearest available AOD is not 1 km
coincident, up to 50 km.
117
Chapter 6
Conclusion
In chapter 2, we implemented a common approach to modeling air pollution using satel-
lite observed fractionated aerosol optical depth (AOD) retrieved by the Multi-angle Imag-
ing SpectroRadiometer (MISR) instrument and were able to not only predict PM
2.5
but
also its speciation SO
2
4
, NO
3
, elemental carbon, and dust. These satellite-derived ex-
posures were then applied in a cross-sectional analysis for Cohort E in the Children’s
Health Study with some success. In chapter 3, we explored spatial and temporal autocor-
relation trends for both PM
2.5
and remote-sensed AOD in Kuwait and the United Arab
Emirates in order to explain the divergent predictive performances when fitting the same
PM
2.5
estimation model with the same set of covariates in different regions. The results
in this chapter also highlights the perilous difficulty of out-of-area prediction by using
models trained in one region to estimate air pollution in another, even if the two regions
are nearby and share similar geographic characteristics.
In chapter 4, we reviewed Gaussian random fields and the usefulness of covariance
functions in characterizing similarity between observations with some Euclidean prox-
imity. We then proposed the covariance-based distance-weighted regression model that
118
takes as its input not only the coincident data but also nearby data from theX(s) field
to help inform the relationship between theX(s) andY(s) fields. Using the Matérn
cross-covariance model with its flexible model specifications, we simulated different sce-
narios of bivariate random fields and assessed how our proposed model performed. We
observed better performance than the linear and the generalized additive model (GAM)
when the input fieldX(s) was less smooth (i.e., had more variance). We introduced ran-
dom missingness to theX(s) field to simulate scenarios similar to real life application
where satellite AOD is often missing.
In chapter 5, we applied the CDR model to the task of estimating PM
2.5
in Los Ange-
les county, using AOD re-processed by the Multi-Angle Implementation of Atmospheric
Correction (MAIAC) algorithm, which had higher spatial and temporal resolution than
MISR AOD. We performed PM
2.5
estimation in three different approaches: univariate
direct estimation using only AOD, multivariate direct estimation using AOD and meteo-
rology, and multi-stage estimation where we fitted a multivariate CDR model for PM
2.5
,
predicted PM
2.5
, then used the new AOD-calibrated PM
2.5
as a predictor in random forest
models. We also compared model performances based on whether the sample included
all observations, those with 1 km coincident AOD, or those with 10 km coincident AOD.
We found that our model predicted PM
2.5
better than the linear model and the GAM in
most contexts and our model maintained its prediction performance when relaxing the
coincident radius from 1 km (native MAIAC resolution) to 10 km (native MODIS resolu-
tion). The multi-stage estimation found slight improvement when using AOD-calibrated
PM
2.5
instead of the AOD measured by the satellite instrument.
119
6.1 Strengths
Although our model is computationally expensive—both in computing time as well as
memory storage—we were able to fit the model on a personal computer with reasonable
hardware (Intel Core i7 processor and 16 GB of RAM). One important principle of our
modeling approach is ease of implementation. We specifically did not want to rely on
high performance computing to fit our model, as many machine learning methods cur-
rently can achieve reasonably good prediction on very standard computer hardware.
One important strength of the CDR model was its adaptability to data availability.
When the data were made up mostly of coincident input data (AOD), the distance weight-
ing curve decayed quickly to upweight the data closely nearby. When we expanded the
model to include non-coincident input data, the distance-weighting curve had a fatter tail
to accommodate information from pixels further away. At the same time, the regression
parameterb would adjust to appropriately counter-balance the distance-weighting curve.
As such, another strength of the model is its legibility: we can easily understand how the
model behaves based on its parameter estimates.
6.2 Limitations
Although the CDR outperformed the GAM in both simulation and application settings,
the improvement was still fairly narrow and situation-specific. When the nearest input
data (AOD) were more than 10 km away, the CDR model performed much worse as
its distance-weighting curve attempted to bring in useless information that ultimately
120
were not able to relate the observed PM
2.5
. In addition, for the small gain in predictive
performance, the CDR model required a lot more work in data preparation (for the input
and distance matrices) and computational resources.
Overall, the task of estimating PM
2.5
remains difficult and, in our context, encounter
serious challenges from several places. PM
2.5
monitors are located in non-random places,
typically due to concerns for high concentration of ambient air pollution (e.g., near free-
ways or certain industrial locations). Therefore, it is not hard to imagine that if most
PM
2.5
monitors are found at highly polluted locations, prediction models might overesti-
mate in certain places without monitors. Satellite-retrieved AOD data, for all its strengths
in temporal and spatial coverage, remain inconsistent in its retrieval due to atmospheric
physics. MAIAC AOD—measured at 550 nm wavelength—remains poorly correlated
with PM
2.5
while MISR AOD suffers from spotty coverage. In addition, the time lag com-
ponent should not be underestimated. PM
2.5
data from the EPA were provided as daily
means, while satellite AOD are typically retrieved between 10:30 AM through 2:00 PM
local time.
These challenges could be further compounded when we move beyond regions with
temperate climate and relatively well monitored for ambient pollution. In places like
Ulaanbaatar and Kuwait, either PM
2.5
monitors are inconsistently active or satellite in-
struments struggle to retrieve or, in many cases, both.
121
6.3 Future Development
Our research has contributed to the ongoing effort exploring the usefulness of lever-
aging large-coverage aerosol data provided by satellite instruments. The covariance-
based distance-weighting method excels when the input field is densely observed. There
are new aerosol retrieving instruments currently being developed by NASA for future
launches and one can certainly look forward to better retrieval algorithms that can over-
come limitations of the current instruments. One in-development instrument in partic-
ular (Multi-Angle Imager for Aerosols—MAIA) is planned to retrieve AOD at different
wavelengths and the resulting data products might correlate better with PM
2.5
.
There are also great room for improvement in terms of optimization for estimating
the parameters, although I am not sure what methods to try. Sparse matrix techniques
are unlikely to be successful as the input and distance matrices tend to be very dense
with almost no zeros. Still to be implemented are covariance functions other than the
squared exponential used in both chapters 4 and 5, such as the rational quadratic and
the Matérn covariance functions. In this study, I was not able to get either of these to
(1) fit faster than the squared exponential or (2) produce better prediction. Finally, the
current covariance functions assume isotropy, which can be too stringent. However, im-
plementing anisotropic covariance functions is not so straightforward. In typical analy-
sis of anisotropic covariances, classical geostatistical methods typically involved visual
checks to determine the “major” axis in each field of spatial data instead of algorithmic
evaluations. This is also a potential interesting area of research.
122
Bibliography
[1] J. Schwartz and A. Marcus. “Mortality and Air Pollution in London: A Time
Series Analysis”. In: American Journal of Epidemiology 131.1 (1990), pp. 185–194.
DOI: 10.1093/oxfordjournals.aje.a115473.
[2] J. Schwartz and D. W. Dockery. “Increased Mortality in Philadelphia Associated
with Daily Air Pollution Concentrations”. In: American Review of Respiratory
Disease 145.3 (1992), pp. 600–604. DOI: 10.1164/ajrccm/145.3.600.
[3] D. W. Dockery et al. “An Association between Air Pollution and Mortality in Six
U.S. Cities”. In: New England Journal of Medicine 329.24 (1993), pp. 1753–1759. DOI:
10.1056/nejm199312093292401.
[4] J. Schwartz, D. W. Dockery, and L. M. Neas. “Is Daily Mortality Associated
Specifically with Fine Particles?” In: Journal of the Air & Waste Management
Association 46.10 (1996), pp. 927–939. DOI: 10.1080/10473289.1996.10467528.
[5] C. A. Pope et al. “Lung Cancer, Cardiopulmonary Mortality, and Long-term
Exposure to Fine Particulate Air Pollution”. In: JAMA 287.9 (2002), pp. 1132–1141.
ISSN: 0098-7484. DOI: 10.1001/jama.287.9.1132. eprint:
https://jamanetwork.com/journals/jama/articlepdf/194704/joc11435.pdf.
[6] B. Ostro et al. “Fine Particulate Air Pollution and Mortality in Nine California
Counties: Results from CALFINE”. In: Environmental Health Perspectives 114.1
(2006), pp. 29–33. DOI: 10.1289/ehp.8335.
[7] M. Franklin, A. Zeka, and J. Schwartz. “Association between PM2.5 and all-cause
and specific-cause mortality in 27 US communities”. In: Journal of Exposure Science
& Environmental Epidemiology 17.3 (2007), pp. 279–287. DOI: 10.1038/sj.jes.7500530.
[8] A. W. Correia et al. “Effect of Air Pollution Control on Life Expectancy in the
United States”. In: Epidemiology 24.1 (2013), pp. 23–31. DOI:
10.1097/ede.0b013e3182770237.
123
[9] M. C. Turner et al. “Long-term Ambient Fine Particulate Matter Air Pollution and
Lung Cancer in a Large Cohort of Never-Smokers”. In: American Journal of
Respiratory and Critical Care Medicine 184.12 (2011), pp. 1374–1381. DOI:
10.1164/rccm.201106-1011oc.
[10] O. Raaschou-Nielsen et al. “Air pollution and lung cancer incidence in 17
European cohorts: prospective analyses from the European Study of Cohorts for
Air Pollution Effects (ESCAPE)”. In: The Lancet Oncology 14.9 (2013), pp. 813–822.
DOI: 10.1016/s1470-2045(13)70279-1.
[11] K. Katanoda et al. “An Association Between Long-Term Exposure to Ambient Air
Pollution and Mortality From Lung Cancer and Respiratory Diseases in Japan”.
In: Journal of Epidemiology 21.2 (2011), pp. 132–143. DOI: 10.2188/jea.je20100098.
[12] A. B. Hill. “The Environment and Disease: Association or Causation?” In:
Proceedings of the Royal Society of Medicine 58.5 (1965). 14283879[pmid],
pp. 295–300. ISSN: 0035-9157. URL: https://pubmed.ncbi.nlm.nih.gov/14283879.
[13] J. Q. Koenig. “Health Effects of Particulate Matter”. In: Health Effects of Ambient
Air Pollution. Springer US, 2000, pp. 115–137. DOI: 10.1007/978-1-4615-4569-9_10.
[14] G. Ryan et al. “Decline in lung function and mortality: the Busselton Health
Study”. In: Journal of Epidemiology & Community Health 53.4 (1999), pp. 230–234.
DOI: 10.1136/jech.53.4.230.
[15] D. D. Sin, L. Wu, and S. F. P . Man. “The Relationship Between Reduced Lung
Function and Cardiovascular Mortality”. In: Chest 127.6 (2005), pp. 1952–1959.
DOI: 10.1378/chest.127.6.1952.
[16] V . V . Georgiopoulou et al. “Lung Function and Risk for Heart Failure Among
Older Adults: The Health ABC Study”. In: The American Journal of Medicine 124.4
(2011), pp. 334–341. DOI: 10.1016/j.amjmed.2010.12.006.
[17] A. Agustí et al. “Lung function in early adulthood and health in later life: a
transgenerational cohort analysis”. In: The Lancet Respiratory Medicine 5.12 (2017),
pp. 935–945. DOI: 10.1016/s2213-2600(17)30434-4.
[18] T. Islam et al. “Relationship between air pollution, lung function and asthma in
adolescents”. In: Thorax 62.11 (2007), pp. 957–963. DOI: 10.1136/thx.2007.078964.
[19] D. C. M. Belgrave et al. “Lung function trajectories from pre-school age to
adulthood and their associations with early life factors: a retrospective analysis of
three population-based birth cohort studies”. In: The Lancet Respiratory Medicine
6.7 (2018), pp. 526–534. DOI: 10.1016/s2213-2600(18)30099-7.
124
[20] D. S. Bui et al. “Childhood predictors of lung function trajectories and future
COPD risk: a prospective cohort study from the first to the sixth decade of life”.
In: The Lancet Respiratory Medicine 6.7 (2018), pp. 535–544. DOI:
10.1016/s2213-2600(18)30100-0.
[21] W. J. Gauderman et al. “The Effect of Air Pollution on Lung Development from 10
to 18 Years of Age”. In: New England Journal of Medicine 351.11 (2004),
pp. 1057–1067. DOI: 10.1056/nejmoa040610.
[22] T. C. Lewis et al. “Air Pollution–Associated Changes in Lung Function among
Asthmatic Children in Detroit”. In: Environmental Health Perspectives 113.8 (2005),
pp. 1068–1075. DOI: 10.1289/ehp.7533.
[23] W. J. Gauderman et al. “Association of Improved Air Quality with Lung
Development in Children”. In: New England Journal of Medicine 372.10 (2015),
pp. 905–913. DOI: 10.1056/nejmoa1414123.
[24] W. A. Jedrychowski et al. “Effect of prenatal exposure to fine particulate matter
on ventilatory lung function of preschool children of non-smoking mothers”. In:
Paediatric and Perinatal Epidemiology 24.5 (2010), pp. 492–501. DOI:
10.1111/j.1365-3016.2010.01136.x.
[25] A. Barraza-Villarreal et al. “Air Pollution, Airway Inflammation, and Lung
Function in a Cohort Study of Mexico City Schoolchildren”. In: Environmental
Health Perspectives 116.6 (2008), pp. 832–838. DOI: 10.1289/ehp.10926.
[26] E. I. Thaller et al. “Moderate Increases in Ambient PM2.5 and Ozone Are
Associated With Lung Function Decreases in Beach Lifeguards”. In: Journal of
Occupational and Environmental Medicine 50.2 (2008), pp. 202–211. DOI:
10.1097/jom.0b013e31816386b4.
[27] W. J. Gauderman et al. “Effect of exposure to traffic on lung development from 10
to 18 years of age: a cohort study”. In: The Lancet 369.9561 (2007), pp. 571–577.
DOI: 10.1016/s0140-6736(07)60037-3.
[28] M. Franklin and S. Fruin. “The role of traffic noise on the association between air
pollution and children’s lung function”. In: Environmental Research 157 (2017),
pp. 153–159. DOI: 10.1016/j.envres.2017.05.024.
[29] M. Wang et al. “Air Pollution and Lung Function in Dutch Children: A
Comparison of Exposure Estimates and Associations Based on Land Use
Regression and Dispersion Exposure Modeling Approaches”. In: Environmental
Health Perspectives 123.8 (2015), pp. 847–851. DOI: 10.1289/ehp.1408541.
125
[30] U. Gehring et al. “Air Pollution Exposure and Lung Function in Children: The
ESCAPE Project”. In: Environmental Health Perspectives 121.11-12 (2013),
pp. 1357–1364. DOI: 10.1289/ehp.1306770.
[31] M. Eeftens et al. In: Environmental Science & Technology 46.20 (2012),
pp. 11195–11205. DOI: 10.1021/es301948k.
[32] R. Urman et al. “Associations of children’s lung function with ambient air
pollution: joint effects of regional and near-roadway pollutants”. In: Thorax 69.6
(2014), pp. 540–547. DOI: 10.1136/thoraxjnl-2012-203159.
[33] M. B. Rice et al. “Lifetime Exposure to Ambient Pollution and Lung Function in
Children”. In: American Journal of Respiratory and Critical Care Medicine 193.8
(2016), pp. 881–888. DOI: 10.1164/rccm.201506-1058oc.
[34] A. Yang et al. “Children’s respiratory health and oxidative potential of PM2.5: the
PIAMA birth cohort study”. In: Occupational and Environmental Medicine 73.3
(2016), pp. 154–160. DOI: 10.1136/oemed-2015-103175.
[35] H. Lin et al. “Consumption of fruit and vegetables might mitigate the adverse
effects of ambient PM2.5 on lung function among adults”. In: Environmental
Research 160 (2018), pp. 77–82. DOI: 10.1016/j.envres.2017.09.007.
[36] K. Chau, M. Franklin, and W. J. Gauderman. “Satellite-Derived PM2.5
Composition and Its Differential Effect on Children’s Lung Function”. In: Remote
Sensing 12.6 (2020), p. 1028. DOI: 10.3390/rs12061028.
[37] S. Li et al. “Long-term exposure to PM2.5 and Children’s lung function: a
dose-based association analysis”. In: Journal of Thoracic Disease 12.10 (2020),
pp. 6379–6395. DOI: 10.21037/jtd-19-crh-aq-007.
[38] X. Li et al. “Characteristics of Particulate Pollution (PM2.5 and PM10) and Their
Spacescale-Dependent Relationships with Meteorological Elements in China”. In:
Sustainability 9.12 (2017), p. 2330. DOI: 10.3390/su9122330.
[39] G. Chen et al. “A machine learning method to estimate PM2.5 concentrations
across China with remote sensing, meteorological and land use information”. In:
Science of The Total Environment 636 (2018), pp. 52–60. DOI:
10.1016/j.scitotenv.2018.04.251.
[40] X. Ni et al. “Spatio-Temporal Pattern Estimation of PM2.5 in
Beijing-Tianjin-Hebei Region Based on MODIS AOD and Meteorological Data
Using the Back Propagation Neural Network”. In: Atmosphere 9.3 (2018), p. 105.
DOI: 10.3390/atmos9030105.
126
[41] H. Hersbach et al. “The ERA5 global reanalysis”. In: Quarterly Journal of the Royal
Meteorological Society 146.730 (2020), pp. 1999–2049. DOI: 10.1002/qj.3803.
[42] F. Mesinger et al. “North American Regional Reanalysis”. In: Bulletin of the
American Meteorological Society 87.3 (2006), pp. 343–360. DOI: 10.1175/bams-87-3-343.
[43] Y. Xia et al. “Continental-scale water and energy flux analysis and validation for
the North American Land Data Assimilation System project phase 2 (NLDAS-2):
1. Intercomparison and application of model products”. In: Journal of Geophysical
Research: Atmospheres 117.D3 (2012), pp. 1–27. DOI: 10.1029/2011jd016048.
[44] Y. Xia et al. “Continental-scale water and energy flux analysis and validation for
North American Land Data Assimilation System project phase 2 (NLDAS-2): 2.
Validation of model-simulated streamflow”. In: Journal of Geophysical Research:
Atmospheres 117.D3 (2012), pp. 1–23. DOI: 10.1029/2011jd016051.
[45] M. Z. Al-Hamdan et al. “Environmental public health applications using
remotely sensed data”. In: Geocarto International 29.1 (2012), pp. 85–98. DOI:
10.1080/10106049.2012.715209.
[46] J. T. Abatzoglou. “Development of gridded surface meteorological data for
ecological applications and modelling”. In: International Journal of Climatology 33.1
(2013), pp. 121–131. ISSN: 08998418. DOI: 10.1002/joc.3413.
[47] H. Xu et al. “National PM2.5 and NO2 exposure models for China based on land
use regression, satellite measurements, and universal kriging”. In: Science of The
Total Environment 655 (2019), pp. 423–433. DOI: 10.1016/j.scitotenv.2018.11.125.
[48] A. Lyapustin et al. “MODIS Collection 6 MAIAC algorithm”. In: Atmospheric
Measurement Techniques 11.10 (2018), pp. 5741–5765. DOI: 10.5194/amt-11-5741-2018.
[49] M. J. Garay, O. V . Kalashnikova, and M. A. Bull. “Development and assessment
of a higher-spatial-resolution (4.4 km) MISR aerosol optical depth product using
AERONET-DRAGON data”. In: Atmospheric Chemistry and Physics 17.8 (2017),
pp. 5095–5106. DOI: 10.5194/acp-17-5095-2017.
[50] M. Franklin et al. “Using Multi-Angle Imaging SpectroRadiometer Aerosol
Mixture Properties for Air Quality Assessment in Mongolia”. In: Remote Sensing
10.8 (2018), p. 1317. DOI: 10.3390/rs10081317.
[51] K. Huang et al. “Predicting monthly high-resolution PM2.5 concentrations with
random forest model in the North China Plain”. In: Environmental Pollution 242
(2018), pp. 675–683. DOI: 10.1016/j.envpol.2018.07.016.
127
[52] R. Li et al. “Using MAIAC AOD to verify the PM2.5 spatial patterns of a land use
regression model”. In: Environmental Pollution 243 (2018), pp. 501–509. DOI:
10.1016/j.envpol.2018.09.026.
[53] F. Liang et al. “MAIAC-based long-term spatiotemporal trends of PM2.5 in
Beijing, China”. In: Science of The Total Environment 616-617 (2018), pp. 1589–1598.
DOI: 10.1016/j.scitotenv.2017.10.155.
[54] S. E. Alexeeff et al. “Consequences of kriging and land use regression for PM2.5
predictions in epidemiologic analyses: insights into spatial variability using
high-resolution satellite data”. In: Journal of Exposure Science & Environmental
Epidemiology 25.2 (2015), pp. 138–144. DOI: 10.1038/jes.2014.40.
[55] H. J. Lee, R. B. Chatfield, and A. W. Strawa. “Enhancing the Applicability of
Satellite Remote Sensing for PM2.5 Estimation Using MODIS Deep Blue AOD
and Land Use Regression in California, United States”. In: Environmental Science
& Technology 50.12 (2016), pp. 6546–6555. DOI: 10.1021/acs.est.6b01438.
[56] H. J. Lee. “Benefits of High Resolution PM2.5 Prediction using Satellite MAIAC
AOD and Land Use Regression for Exposure Assessment: California Examples”.
In: Environmental Science & Technology 53.21 (2019), pp. 12774–12783. DOI:
10.1021/acs.est.9b03799.
[57] A. Mhawish et al. “Estimation of High-Resolution PM2.5 over the Indo-Gangetic
Plain by Fusion of Satellite Data, Meteorology, and Land Use Variables”. In:
Environmental Science & Technology 54.13 (2020), pp. 7891–7900. DOI:
10.1021/acs.est.0c01769.
[58] R. S. Bivand et al. Applied Spatial Data Analysis with R. New York, NY: Springer,
2013. ISBN: 978-1-4614-7617-7.
[59] M. Z. Al-Hamdan et al. “Methods for Characterizing Fine Particulate Matter
Using Ground Observations and Remotely Sensed Data: Potential Use for
Environmental Public Health Surveillance”. In: Journal of the Air & Waste
Management Association 59.7 (2009), pp. 865–881. DOI: 10.3155/1047-3289.59.7.865.
[60] P . H. Ryan and G. K. LeMasters. “A Review of Land-use Regression Models for
Characterizing Intraurban Air Pollution Exposure”. In: Inhalation Toxicology
19.sup1 (2007), pp. 127–133. DOI: 10.1080/08958370701495998.
[61] I. Kloog et al. “Incorporating Local Land Use Regression And Satellite Aerosol
Optical Depth In A Hybrid Model Of Spatiotemporal PM2.5 Exposures In The
Mid-Atlantic States”. In: Environmental Science & Technology 46.21 (2012),
pp. 11913–11921. DOI: 10.1021/es302673e.
128
[62] B. Zou et al. “Satellite Based Mapping of Ground PM2.5 Concentration Using
Generalized Additive Modeling”. In: Remote Sensing 9.1 (2016), p. 1. DOI:
10.3390/rs9010001.
[63] M. Sanchez et al. “Development of land-use regression models for fine particles
and black carbon in peri-urban South India”. In: Science of The Total Environment
634 (2018), pp. 77–86. DOI: 10.1016/j.scitotenv.2018.03.308.
[64] Y. Liu, C. J. Paciorek, and P . Koutrakis. “Estimating Regional Spatial and
Temporal Variability of PM2.5 Concentrations Using Satellite Data, Meteorology,
and Land Use Information”. In: Environmental Health Perspectives 117.6 (2009),
pp. 886–892. DOI: 10.1289/ehp.0800123.
[65] M. Sorek-Hamer et al. “Improved retrieval of PM2.5 from satellite data products
using non-linear methods”. In: Environmental Pollution 182 (2013), pp. 417–423.
DOI: 10.1016/j.envpol.2013.08.002.
[66] M. Franklin, O. V . Kalashnikova, and M. J. Garay. “Size-resolved particulate
matter concentrations derived from 4.4 km-resolution size-fractionated
Multi-angle Imaging SpectroRadiometer (MISR) aerosol optical depth over
Southern California”. In: Remote Sensing of Environment 196 (2017), pp. 312–323.
DOI: 10.1016/j.rse.2017.05.002.
[67] M. Sorek-Hamer et al. “Spatiotemporal Characteristics of the Association
between AOD and PM over the California Central Valley”. In: Remote Sensing 12.4
(2020), p. 685. DOI: 10.3390/rs12040685.
[68] A. E. Gelfand et al. Handbook of Spatial Statistics. Boca Raton: CRC Press, 2010.
ISBN: 978-1-4200-7288-4.
[69] B. M. Kazar et al. “Comparing Exact and Approximate Spatial Auto-regression
Model Solutions for Spatial Data Analysis”. In: Geographic Information Science.
Springer Berlin Heidelberg, 2004, pp. 140–161. DOI: 10.1007/978-3-540-30231-5_10.
[70] W. Wang et al. “Estimation of PM2.5 Concentrations in China Using a Spatial
Back Propagation Neural Network”. In: Scientific Reports 9.1 (2019). DOI:
10.1038/s41598-019-50177-1.
[71] X. Hu et al. “Estimating PM2.5 Concentrations in the Conterminous United States
Using the Random Forest Approach”. In: Environmental Science & Technology 51.12
(2017), pp. 6936–6944. DOI: 10.1021/acs.est.7b01210.
[72] A. Shtein et al. “Estimating Daily PM2.5 and PM10 over Italy Using an Ensemble
Model”. In: Environmental Science & Technology (2019). DOI:
10.1021/acs.est.9b04279.
129
[73] L. Zang et al. “Estimation of spatiotemporal PM1.0 distributions in China by
combining PM2.5 observations with satellite aerosol optical depth”. In: Science of
The Total Environment 658 (2019), pp. 1256–1264. DOI:
10.1016/j.scitotenv.2018.12.297.
[74] A. Analitis et al. “Prediction of PM2.5 concentrations at the locations of
monitoring sites measuring PM10 and NOx, using generalized additive models
and machine learning methods: A case study in London”. In: Atmospheric
Environment 240 (2020), p. 117757. DOI: 10.1016/j.atmosenv.2020.117757.
[75] H. H. Chang, X. Hu, and Y. Liu. “Calibrating MODIS aerosol optical depth for
predicting daily PM2.5 concentrations via statistical downscaling”. In: Journal of
Exposure Science & Environmental Epidemiology 24.4 (2013), pp. 398–404. DOI:
10.1038/jes.2013.90.
[76] Q. Di et al. “An ensemble-based model of PM2.5 concentration across the
contiguous United States with high spatiotemporal resolution”. In: Environment
International 130 (2019), p. 104909. DOI: 10.1016/j.envint.2019.104909.
[77] J. D. Stowell et al. “Estimating PM2.5 in Southern California using satellite data:
factors that affect model performance”. In: Environmental Research Letters 15.9
(2020), p. 094004. DOI: 10.1088/1748-9326/ab9334.
[78] Y. Lin et al. “Building Autocorrelation-Aware Representations for Fine-Scale
Spatiotemporal Prediction”. In: 2020 IEEE International Conference on Data Mining
(ICDM). IEEE, Nov. 2020. DOI: 10.1109/icdm50108.2020.00044.
[79] J. Li et al. “Estimation of ambient PM2.5 in Iraq and Kuwait from 2001 to 2018
using machine learning and remote sensing”. In: Environment International 151
(2021), p. 106445. DOI: 10.1016/j.envint.2021.106445.
[80] X. Hu et al. “Estimating ground-level PM2.5 concentrations in the Southeastern
United States using MAIAC AOD retrievals and a two-stage model”. In: Remote
Sensing of Environment 140 (2014), pp. 220–232. DOI: 10.1016/j.rse.2013.08.032.
[81] X. Hu et al. “10-year spatial and temporal trends of PM2.5 concentrations in the
southeastern US estimated using high-resolution satellite data”. In: Atmospheric
Chemistry and Physics 14.12 (2014), pp. 6301–6314. DOI: 10.5194/acp-14-6301-2014.
[82] Q. Xiao et al. “Full-coverage high-resolution daily PM2.5 estimation using
MAIAC AOD in the Yangtze River Delta of China”. In: Remote Sensing of
Environment 199 (2017), pp. 437–446. DOI: 10.1016/j.rse.2017.07.023.
130
[83] M. Lee et al. “Spatiotemporal prediction of fine particulate matter using
high-resolution satellite images in the Southeastern US 2003–2011”. In: Journal of
Exposure Science & Environmental Epidemiology 26.4 (2015), pp. 377–384. DOI:
10.1038/jes.2015.41.
[84] C. A. Gotway and L. J. Young. “Combining Incompatible Spatial Data”. In:
Journal of the American Statistical Association 97.458 (2002), pp. 632–648. DOI:
10.1198/016214502760047140.
[85] S. Banerjee, B. P . Carlin, and A. E. Gelfand. Hierarchical Modeling and Analysis for
Spatial Data. Boca Raton: CRC Press, Taylor & Francis Group, 2015. ISBN:
978-1439819173.
[86] R. McConnell et al. “Childhood Incident Asthma and Traffic-Related Air
Pollution at Home and School”. In: Environmental Health Perspectives 118.7 (2010),
pp. 1021–1026. DOI: 10.1289/ehp.0901232.
[87] United States Environmental Protection Agency. AQS (Air Quality System) User
Guide. URL: https://www.epa.gov/aqs/aqs-user-guide (visited on 08/01/2021).
[88] M. Franklin et al. “Predictors of intra-community variation in air quality”. In:
Journal of Exposure Science & Environmental Epidemiology 22.2 (2012), pp. 135–147.
DOI: 10.1038/jes.2011.45.
[89] M. Brauer et al. “Exposure Assessment for Estimation of the Global Burden of
Disease Attributable to Outdoor Air Pollution”. In: Environmental Science &
Technology 46.2 (2012), pp. 652–660. DOI: 10.1021/es2025752.
[90] A. de Sherbinin et al. “Using satellite data to develop environmental indicators”.
In: Environmental Research Letters 9.8 (2014), p. 084013. DOI:
10.1088/1748-9326/9/8/084013.
[91] A. van Donkelaar et al. “Optimal estimation for global ground-level fine
particulate matter concentrations”. In: Journal of Geophysical Research: Atmospheres
118.11 (2013), pp. 5621–5636. DOI: 10.1002/jgrd.50479.
[92] A. van Donkelaar et al. “Use of Satellite Observations for Long-Term Exposure
Assessment of Global Concentrations of Fine Particulate Matter”. In:
Environmental Health Perspectives 123.2 (2015), pp. 135–143. DOI:
10.1289/ehp.1408646.
[93] G. Shaddick et al. “Data integration model for air quality: a hierarchical approach
to the global estimation of exposures to ambient air pollution”. In: Journal of the
Royal Statistical Society: Series C (Applied Statistics) 67.1 (2018), pp. 231–253. DOI:
10.1111/rssc.12227.
131
[94] D. Diner et al. “Performance of the MISR instrument during its first 20 months in
Earth orbit”. In: IEEE Transactions on Geoscience and Remote Sensing 40.7 (2002),
pp. 1449–1466. DOI: 10.1109/tgrs.2002.801584.
[95] A. van Donkelaar et al. “Global Estimates of Ambient Fine Particulate Matter
Concentrations from Satellite-Based Aerosol Optical Depth: Development and
Application”. In: Environmental Health Perspectives 118.6 (2010), pp. 847–855. DOI:
10.1289/ehp.0901623.
[96] A. van Donkelaar et al. “Regional Estimates of Chemical Composition of Fine
Particulate Matter Using a Combined Geoscience-Statistical Method with
Information from Satellites, Models, and Monitors”. In: Environmental Science &
Technology 53.5 (2019), pp. 2595–2611. DOI: 10.1021/acs.est.8b06392.
[97] R. A. Kahn and B. J. Gaitley. “An analysis of global aerosol type as retrieved by
MISR”. In: Journal of Geophysical Research: Atmospheres 120.9 (2015), pp. 4248–4281.
DOI: 10.1002/2015jd023322.
[98] X. Meng et al. “Estimating PM2.5 speciation concentrations using prototype 4.4
km-resolution MISR aerosol properties over Southern California”. In: Atmospheric
Environment 181 (2018), pp. 70–81. DOI: 10.1016/j.atmosenv.2018.03.019.
[99] R. T. Burnett et al. “An Integrated Risk Function for Estimating the Global Burden
of Disease Attributable to Ambient Fine Particulate Matter Exposure”. In:
Environmental Health Perspectives 122.4 (2014), pp. 397–403. DOI:
10.1289/ehp.1307049.
[100] A. J. Cohen et al. “Estimates and 25-year trends of the global burden of disease
attributable to ambient air pollution: an analysis of data from the Global Burden
of Diseases Study 2015”. In: The Lancet 389.10082 (2017), pp. 1907–1918. DOI:
10.1016/s0140-6736(17)30505-6.
[101] L.-F. Tétreault et al. “Childhood Exposure to Ambient Air Pollutants and the
Onset of Asthma: An Administrative Cohort Study in Québec”. In: Environmental
Health Perspectives 124.8 (2016), pp. 1276–1282. DOI: 10.1289/ehp.1509838.
[102] United States Environmental Protection Agency. Download Daily Data. URL:
https://www.epa.gov/outdoor-air-quality-data/download-daily-data (visited on
08/01/2021).
[103] United States Environmental Protection Agency. Chemical Speciation Network
(CSN). URL: https://www.epa.gov/amtic/chemical-speciation-network-csn (visited on
08/01/2021).
132
[104] J. C. Chow et al. “Mass reconstruction methods for PM2.5: a review”. In: Air
Quality, Atmosphere & Health 8.3 (2015), pp. 243–263. DOI:
10.1007/s11869-015-0338-3.
[105] J. C. Chow et al. “Similarities and differences in PM10 chemical source profiles for
geological dust from the San Joaquin Valley, California”. In: Atmospheric
Environment 37.9-10 (2003), pp. 1317–1340. DOI: 10.1016/s1352-2310(02)01021-x.
[106] M. L. Witek et al. “New approach to the retrieval of AOD and its uncertainty
from MISR observations over dark water”. In: Atmospheric Measurement Techniques
11.1 (2018), pp. 429–439. DOI: 10.5194/amt-11-429-2018.
[107] R. A. Kahn et al. “Multiangle Imaging SpectroRadiometer global aerosol product
assessment by comparison with the Aerosol Robotic Network”. In: Journal of
Geophysical Research 115.D23 (2010). DOI: 10.1029/2010jd014601.
[108] J. Al-Saadi et al. “Improving National Air Quality Forecasts with Satellite Aerosol
Observations”. In: Bulletin of the American Meteorological Society 86.9 (2005),
pp. 1249–1262. DOI: 10.1175/bams-86-9-1249.
[109] University of Idaho. gridMET - Climatology Lab. URL:
http://www.climatologylab.org/gridmet.html (visited on 08/01/2021).
[110] D. M. Smith et al. “Role of volcanic and anthropogenic aerosols in the recent
global surface warming slowdown”. In: Nature Climate Change 6.10 (2016),
pp. 936–940. DOI: 10.1038/nclimate3058.
[111] N. Freychet et al. “The Local Aerosol Emission Effect on Surface Shortwave
Radiation and Temperatures”. In: Journal of Advances in Modeling Earth Systems
11.3 (2019), pp. 806–817. DOI: 10.1029/2018ms001530.
[112] G. M. Weaver and W. J. Gauderman. “Traffic-Related Pollutants: Exposure and
Health Effects Among Hispanic Children”. In: American Journal of Epidemiology
187.1 (2018), pp. 45–52. DOI: 10.1093/aje/kwx223.
[113] J. M. Peters et al. “A Study of Twelve Southern California Communities with
Differing Levels and Types of Air Pollution”. In: American Journal of Respiratory
and Critical Care Medicine 159.3 (1999), pp. 760–767. DOI:
10.1164/ajrccm.159.3.9804143.
[114] R. McConnell et al. “Traffic, Susceptibility, and Childhood Asthma”. In:
Environmental Health Perspectives 114.5 (2006), pp. 766–772. DOI: 10.1289/ehp.8594.
[115] J. H. Friedman. “Greedy function approximation: A gradient boosting machine”.
In: The Annals of Statistics 29.5 (2001), pp. 1189–1232. DOI: 10.1214/aos/1013203451.
133
[116] P . Cortez and M. J. Embrechts. “Using sensitivity analysis and visualization
techniques to open black box data mining models”. In: Information Sciences 225
(2013), pp. 1–17. DOI: 10.1016/j.ins.2012.10.039.
[117] S. Hasheminassab et al. “Spatial and temporal variability of sources of ambient
fine particulate matter (PM2.5) in California”. In: Atmospheric Chemistry and
Physics 14.22 (2014), pp. 12085–12097. DOI: 10.5194/acp-14-12085-2014.
[118] V . Tsiouri, K. E. Kakosimos, and P . Kumar. “Concentrations, sources and
exposure risks associated with particulate matter in the Middle East Area—a
review”. In: Air Quality, Atmosphere & Health 8.1 (2015), pp. 67–80. DOI:
10.1007/s11869-014-0277-4.
[119] B. Alahmad et al. “A two-year assessment of particulate air pollution and sources
in Kuwait”. In: Environmental Pollution 282 (2021), p. 117016. DOI:
https://doi.org/10.1016/j.envpol.2021.117016.
[120] N. R. Council. Review of the Department of Defense Enhanced Particulate Matter
Surveillance Program report. Washington, D.C: National Academies Press, 2010.
ISBN: 978-0-309-15413-0.
[121] I. Kloog et al. “Assessing temporally and spatially resolved PM2.5 exposures for
epidemiological studies using satellite aerosol optical depth measurements”. In:
Atmospheric Environment 45.35 (2011), pp. 6267–6275. DOI:
10.1016/j.atmosenv.2011.08.066.
[122] A. van Donkelaar et al. “High-Resolution Satellite-Derived PM2.5 from Optimal
Estimation and Geographically Weighted Regression over North America”. In:
Environmental Science & Technology 49.17 (2015), pp. 10482–10491. DOI:
10.1021/acs.est.5b02076.
[123] Y. Zheng et al. “Estimating ground-level PM2.5 concentrations over three
megalopolises in China using satellite-derived aerosol optical depth
measurements”. In: Atmospheric Environment 124 (2016), pp. 232–242. DOI:
10.1016/j.atmosenv.2015.06.046.
[124] J. A. Engel-Cox et al. “Qualitative and quantitative evaluation of MODIS satellite
sensor data for regional and urban scale air quality”. In: Atmospheric Environment
38.16 (2004), pp. 2495–2509. DOI: 10.1016/j.atmosenv.2004.01.039.
[125] M. S. Hammer et al. “Global Estimates and Long-Term Trends of Fine Particulate
Matter Concentrations (1998–2018)”. In: Environmental Science & Technology 54.13
(2020), pp. 7879–7890. DOI: 10.1021/acs.est.0c01764.
134
[126] R. Sullivan and S. Pryor. “Quantifying spatiotemporal variability of fine particles
in an urban environment using combined fixed and mobile measurements”. In:
Atmospheric Environment 89 (2014), pp. 664–671. DOI:
10.1016/j.atmosenv.2014.03.007.
[127] R. Sullivan, R. Levy, and S. Pryor. “Spatiotemporal coherence of mean and
extreme aerosol particle events over eastern North America as observed from
satellite”. In: Atmospheric Environment 112 (2015), pp. 126–135. DOI:
10.1016/j.atmosenv.2015.04.026.
[128] T. D. Toth et al. “A bulk-mass-modeling-based method for retrieving particulate
matter pollution using CALIOP observations”. In: Atmospheric Measurement
Techniques 12.3 (2019), pp. 1739–1754. DOI: 10.5194/amt-12-1739-2019.
[129] K. C. Kaku et al. “Assessing the Challenges of Surface-Level Aerosol Mass
Estimates From Remote Sensing During the SEAC4RS and SEARCH Campaigns:
Baseline Surface Observations and Remote Sensing in the Southeastern United
States”. In: Journal of Geophysical Research: Atmospheres 123.14 (2018),
pp. 7530–7562. DOI: 10.1029/2017jd028074.
[130] K. W. Brown et al. “Characterization of Particulate Matter for Three Sites in
Kuwait”. In: Journal of the Air & Waste Management Association 58.8 (2008),
pp. 994–1003. DOI: 10.3155/1047-3289.58.8.994.
[131] S. Masri et al. “A novel calibration approach using satellite and visibility
observations to estimate fine particulate matter exposures in Southwest Asia and
Afghanistan”. In: Journal of the Air & Waste Management Association 67.1 (2016),
pp. 86–95. DOI: 10.1080/10962247.2016.1230079.
[132] OpenAQ. OpenAQ. URL: https://openaq.org (visited on 08/01/2021).
[133] R. Gelaro et al. “The Modern-Era Retrospective Analysis for Research and
Applications, Version 2 (MERRA-2)”. In: Journal of Climate 30.14 (2017),
pp. 5419–5454. DOI: 10.1175/jcli-d-16-0758.1.
[134] M. G. Bosilovich, R. Lucchesi, and M. J. Suárez. MERRA-2: File Specification.
GMAO Office Note No. 9 (Version 1.1). GMAO, NASA GSFC. Greenbelt, MD, USA,
2016. URL: http://gmao.gsfc.nasa.gov/pubs/office_notes.
[135] L. Li et al. “Ensemble-based deep learning for estimating PM2.5 over California
with multisource big data including wildfire smoke”. In: Environment
International 145 (2020), p. 106143. DOI: 10.1016/j.envint.2020.106143.
[136] B. Holben et al. “AERONET—A Federated Instrument Network and Data
Archive for Aerosol Characterization”. In: Remote Sensing of Environment 66.1
(1998), pp. 1–16. DOI: 10.1016/s0034-4257(98)00031-5.
135
[137] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York, NY, USA: Springer, 2009. ISBN:
0387848576.
[138] C. Strobl et al. “Bias in random forest variable importance measures: Illustrations,
sources and a solution”. In: BMC Bioinformatics 8.1 (2007). DOI:
10.1186/1471-2105-8-25.
[139] M. Franklin et al. “Characterization of Subgrid-Scale Variability in Particulate
Matter with Respect to Satellite Aerosol Observations”. In: Remote Sensing 10.4
(2018), p. 623. DOI: 10.3390/rs10040623.
[140] T. L. Anderson et al. “Mesoscale Variations of Tropospheric Aerosols”. In: Journal
of the Atmospheric Sciences 60.1 (2003), pp. 119–136. DOI:
10.1175/1520-0469(2003)060<0119:mvota>2.0.co;2.
[141] T. Gneiting, W. Kleiber, and M. Schlather. “Matérn Cross-Covariance Functions
for Multivariate Random Fields”. In: Journal of the American Statistical Association
105.491 (2010), pp. 1167–1177. DOI: 10.1198/jasa.2010.tm09420.
[142] P . J. Brown, N. D. Le, and J. V . Zidek. “Multivariate spatial interpolation and
exposure to air pollutants”. In: Canadian Journal of Statistics 22.4 (1994),
pp. 489–509. DOI: 10.2307/3315406.
[143] A. M. Schmidt and A. E. Gelfand. “A Bayesian coregionalization approach for
multivariate pollutant data”. In: Journal of Geophysical Research: Atmospheres
108.D24 (2003), n/a–n/a. DOI: 10.1029/2002jd002905.
[144] T. R. Fanshawe and P . J. Diggle. “Bivariate geostatistical modelling: a review and
an application to spatial variation in radon concentrations”. In: Environmental and
Ecological Statistics 19.2 (2011), pp. 139–160. DOI: 10.1007/s10651-011-0179-7.
[145] M. Schlather. “Construction of Covariance Functions and Unconditional
Simulation of Random Fields”. In: Lecture Notes in Statistics. Springer Berlin
Heidelberg, 2011, pp. 25–54. DOI: 10.1007/978-3-642-17086-7_2.
[146] R. Adler and J. Taylor. Random Fields and Geometry. New York: Springer, 2007.
ISBN: 978-0-387-48112-8.
[147] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
Cambridge, Mass: MIT Press, 2006. ISBN: 026218253X.
[148] D. A. Moore and S. J. Russell. “Gaussian Process Random Fields”. In: arXiv
e-prints (2015). arXiv: 1511.00054 [cs.LG].
136
[149] R. Gramacy. Surrogates: Gaussian Process Modeling, Design, and Optimization for the
Applied Sciences. Boca Raton, FL: CRC Press, 2020. ISBN: 978-0-367-41542-6.
[150] B. Matérn. Spatial Variation. New York: Springer, 1986. ISBN: 0387963650.
[151] M. L. Stein. Interpolation of Spatial Data. Springer New York, 1999. DOI:
10.1007/978-1-4612-1494-6.
[152] M. G. Genton and W. Kleiber. “Cross-Covariance Functions for Multivariate
Geostatistics”. In: Statistical Science 30.2 (2015), pp. 147–163. DOI:
10.1214/14-sts487.
[153] K. V . Mardia and C. R. Goodall. “Spatial-temporal analysis of multivariate
environmental monitoring data”. In: Multivariate environmental statistics 6.76
(1993), pp. 347–385.
[154] M. Goulard and M. Voltz. “Linear Coregionalization Model: Tools for Estimation
and Choice of Cross-Variogram Matrix”. In: Mathematical Geology 24.3 (1992),
pp. 269–286. DOI: 10.1007/bf00893750.
[155] H. Wackernagel. Multivariate Geostatistics. Springer Berlin Heidelberg, 2003. DOI:
10.1007/978-3-662-05294-5.
[156] J. M. V . Hoef and R. P . Barry. “Constructing and fitting models for cokriging and
multivariable spatial prediction”. In: Journal of Statistical Planning and Inference
69.2 (1998), pp. 275–294. DOI: 10.1016/s0378-3758(97)00162-6.
[157] A. Majumdar and A. E. Gelfand. “Multivariate Spatial Modeling for
Geostatistical Data Using Convolved Covariance Functions”. In: Mathematical
Geology 39.2 (2007), pp. 225–245. DOI: 10.1007/s11004-006-9072-6.
[158] T. V . Apanasovich, M. G. Genton, and Y. Sun. “A Valid Matérn Class of
Cross-Covariance Functions for Multivariate Random Fields With Any Number
of Components”. In: Journal of the American Statistical Association 107.497 (2012),
pp. 180–193. DOI: 10.1080/01621459.2011.643197.
[159] A. van Donkelaar, R. V . Martin, and R. J. Park. “Estimating ground-level PM2.5
using aerosol optical depth determined from satellite remote sensing”. In: Journal
of Geophysical Research 111.D21 (2006). DOI: 10.1029/2005jd006996.
[160] G. Geng et al. “Satellite-Based Daily PM2.5 Estimates During Fire Seasons in
Colorado”. In: Journal of Geophysical Research: Atmospheres (2018). DOI:
10.1029/2018jd028573.
[161] M. Schlather et al. RandomFields: Simulation and Analysis of Random Fields. R
package version 3.3.8. 2020. URL: https://cran.r-project.org/package=RandomFields.
137
[162] M. Schlather et al. “Analysis, Simulation and Prediction of Multivariate Random
Fields with Package RandomFields”. In: Journal of Statistical Software 63.8 (2015),
pp. 1–25. DOI: 10.18637/jss.v063.i08.
[163] N. Silver. The signal and the noise : why so many predictions fail—but some don’t. New
York: Penguin Press, 2012. ISBN: 978-0143125082.
[164] N. G. S. F. Center. How Aerosols Are Measured. URL:
https://deepblue.gsfc.nasa.gov/science (visited on 08/01/2021).
[165] T. Enebish et al. “Predicting ambient PM2.5 concentrations in Ulaanbaatar,
Mongolia with machine learning approaches”. In: Journal of Exposure Science &
Environmental Epidemiology 31.4 (2020), pp. 699–708. DOI:
10.1038/s41370-020-0257-8.
[166] Level-1, A. Archive, and D. S. ( D. A. A. C. (DAAC). MODIS/Terra+Aqua Land
Aerosol Optical Depth Daily L2G Global 1km SIN Grid. URL:
https://ladsweb.modaps.eosdis.nasa.gov/missions-and-measurements/products/MCD19A2/
(visited on 08/01/2021).
[167] M. L. Team. MODIS Grids... URL:
https://modis-land.gsfc.nasa.gov/MODLAND_grid.html (visited on 08/01/2021).
[168] J. Lin et al. “Recent Changes in Particulate Air Pollution over China Observed
from Space and the Ground: Effectiveness of Emission Control”. In: Environmental
science & technology 44.20 (2010), pp. 7771–7776. DOI: 10.1021/es101094t.
[169] A. Shtein et al. “Estimating daily and intra-daily PM10 and PM2.5 in Israel using
a spatio-temporal hybrid modeling approach”. In: Atmospheric Environment 191
(2018), pp. 142–152. DOI: 10.1016/j.atmosenv.2018.08.002.
[170] I. Kloog et al. “Estimating daily PM 2.5 and PM 10 across the complex geo-climate
region of Israel using MAIAC satellite-based AOD data”. In: Atmospheric
Environment 122 (2015), pp. 409–416. DOI: 10.1016/j.atmosenv.2015.10.004.
138
Abstract (if available)
Abstract
Estimating air pollution exposure, notably particulate matter with aerodynamic diameter less than 2.5 μm (PM₂.₅), using statistical models is a critical component in epidemiological research when the true exposure was not measured in situ. Earth-orbiting satellite instruments have provided daily atmospheric aerosol data globally (measured as aerosol optical depth–AOD), which have been a great addition to PM₂.₅ prediction studies. There remain serious challenges to using satellite AOD, such as data missingness and spatial misalignment. We demonstrated a common analysis pipeline where PM₂.₅ concentrations were estimated using AOD and then used as exposures in a cross-sectional study to assess its impact on children's lung functions in Southern California. We further examined spatial and temporal autocorrelation trends for PM₂.₅ and AOD in two Middle Eastern regions to investigate possible explanations for different prediction performances, given the same models and predictors. We proposed a novel approach to modeling PM₂.₅ that sought to overcome these modeling challenges, particularly missingness. Relying on the Gaussian random fields framework, we constructed a regression model using covariance functions to weight contribution of nearby AOD data by their distances to PM₂.₅ monitors. We simulated bivariate random fields under different settings of surface smoothness and spatial correlation and compared our covariance-based distance-weighted regression (CDR) model against the linear model and the generalized additive model. We further compared these models in the task of modeling PM₂.₅ in Los Angeles County using satellite AOD and meteorology. In simulation, the CDR model outperformed the linear and generalized additive models in recovering the correlation between the input and outcome random fields, especially when the input field was less smooth. In modeling PM₂.₅, the CDR model also showed better prediction than the standard models, although the improvement was reduced when meteorological data were added to the models. In multi-stage modeling with machine learning, the CDR model improved the final prediction modestly. Our proposed model overcame the data missingness problem and used larger sample sizes by relaxing the coincident AOD data requirement, while improving the overall prediction performances. In regions where PM₂.₅ concentrations were less inconsistently monitored, our model could preserve the data sample sizes when the coincident AOD data was missing yet nearby AOD data were sufficiently available. We did not explicitly model spatial trends in our regression model and future work could integrate the covariance-based distance-weighting term into more sophisticated methods, including the generalized additive models.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Uncertainty quantification in extreme gradient boosting with application to environmental epidemiology
PDF
Using multi-angle imaging spectroradiometer aerosol mixture properties and meteorology for PM₂.₅ assessment in Iran
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
Prediction modeling with meta data and comparison with lasso regression
PDF
Machine learning approaches for downscaling satellite observations of dust
PDF
Hierarchical regularized regression for incorporation of external data in high-dimensional models
PDF
Incorporating prior knowledge into regularized regression
PDF
High-dimensional regression for gene-environment interactions
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Comparison of models for predicting PM2.5 concentration in Wuhan, China
PDF
Statistical downscaling with artificial neural network
PDF
Inference correction in measurement error models with a complex dosimetry system
PDF
Spatial analysis of PM₂.₅ air pollution in association with hospital admissions in California
PDF
Generalized linear discriminant analysis for high-dimensional genomic data with external information
PDF
Bayesian multilevel quantile regression for longitudinal data
PDF
Statistical analysis of high-throughput genomic data
PDF
Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Robust feature selection with penalized regression in imbalanced high dimensional data
PDF
Spatial modeling of non-tailpipe emissions and its association with children's lung function
Asset Metadata
Creator
Chau, Khang
(author)
Core Title
Covariance-based distance-weighted regression for incomplete and misaligned spatial data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Degree Conferral Date
2021-12
Publication Date
10/06/2021
Defense Date
08/27/2021
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Air pollution,covariance,exposure estimation,OAI-PMH Harvest,remote sensing,spatial misalignment,spatial statistics
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Franklin, Meredith (
committee chair
), Gauderman, Jim (
committee chair
), Chiang, Yao-Yi (
committee member
), Johnston, Jill (
committee member
), Lewinger, Juan Pablo (
committee member
)
Creator Email
khang.chau@usc.edu,khangcha@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC16022361
Unique identifier
UC16022361
Legacy Identifier
etd-ChauKhang-10137
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Chau, Khang
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
covariance
exposure estimation
remote sensing
spatial misalignment
spatial statistics