Machine learning approaches for downscaling satellite observations of dust
by
Tianxing Zhai
A Thesis Presented to the
Faculty of the USC Graduate School
University of Southern California
In Partial Fulfillment of the
Requirements for the Degree
Master of Science
(Biostatistics)
December 2019
Table of Contents
1. Abstract
2. Introduction
3. Methods
3.1 Data Extraction and Preprocessing
3.2 Feature Engineering
3.3 Train-test Splitting
3.4 Machine Learning Methods
3.4.1 Ridge Regression
3.4.2 Elastic-Net Regression
3.4.3 Random Forest Regression
3.4.4 XGBoost Regression
4. Results
5. Discussion
6. References
1. Abstract
Dust has a significant impact on both the climate and human health. Accurate estimation of its
spatiotemporal variability is therefore important to reliably generate exposures for health effects
studies and for understanding its role in climate change. The Modern-Era Retrospective analysis
for Research and Applications, Version 2 (MERRA-2) provides data-assimilated concentrations
of dust globally over a long time period, but with coarse spatial resolution (50 km). The GEOS-5
Nature Run (G5NR), which has the same data-assimilation backbone as MERRA-2, similarly
provides dust concentrations with refined spatial resolution (7 km) but only for 2 years (2005 –
2007). With G5NR dust data as the target, we developed and compared several machine learning
regression models including Ridge, Elastic-Net, Random Forest and extreme gradient boosting
(XGBoost) to downscale the MERRA-2 dust from 50 km to 7 km resolution. To capture
temporal differences, we trained seasonal machine learning models using 3 weeks of data as the
training samples. For validation, we used 2 types of testing samples: a random sample of 33% of
the data and leaving out the entire middle-day of the 3-week period. We found that in terms of
RMSE and R², among all 4 methods, XGBoost Regression performed best for all seasons and
both testing samples. In general, we found that there was an underestimation in downscaled dust
due to a systematic bias between the MERRA-2 and G5NR products. For our Middle East study
region, the bias is particularly notable in summer and over the Arabian Desert.
Our approach indicates that we can leverage high spatial resolution aerosol dust to downscale a
similar coarse resolution product using machine learning, but that region- and season-specific
bias correction is needed to obtain reliable and interpretable estimates.
2. Introduction
Dust, typically characterized as large and irregularly shaped particles, arises in the atmosphere
from a variety of sources. Both its size and chemical composition vary widely in relation to the
nature of the source and transport history. Dust consists predominantly of coarse particulate
matter (PM), which can be regarded as particles with an aerodynamic diameter of 10 μm or less
(PM10) (UK Air Pollution Information System, 2019). Smaller particles with a diameter of less
than 2.5 μm (PM2.5) have been studied extensively due to epidemiological associations with a
myriad of health effects (Atkinson et al., 2014), yet PM10 has also been linked with upper
respiratory health effects (Host et al., 2008) and vascular injury (Liu et al., 2015). In regions of the world with dust storms (e.g.
Middle East), acute adverse health effects have been found (Goudie, 2014). Dust can be
transported long distances impacting regions of China, where concentrations on storm days have
been associated with increases in daily mortality rates and hospital admissions attributable to
cardiovascular and respiratory causes (Chan et al., 2008; Chan and Ng, 2011; Cheng et al.,
2008).
Accurate estimation of the spatiotemporal variability of dust is important to reliably generate
exposure estimates for health effects studies (Li et al., 2015) and for understanding
its role in climate change (Voiland, 2010).
The Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2)
provides data beginning in 1980 (Randles et al., 2017). MERRA-2 assimilates multiple
sources of aerosol remote sensing, emissions, and meteorological data using the Goddard Earth
Observing System Model (GEOS) and further incorporates aerosols, chemistry, atmosphere,
land, ice, and ocean biogeochemistry. It provides global estimates of surface-level sea salt, dust,
black carbon, organic carbon, sulfates and PM2.5 at hourly time resolution. While the data are
available over a long time period, they have coarse spatial resolution (approximately 50 km in
the latitudinal direction) (Gelaro et al., 2017).
The GEOS-5 Nature Run (G5NR) is a MERRA-2 based 2-year global, non-hydrostatic
mesoscale simulation for the period June 2005 through May 2007 with a 7 km horizontal
resolution. In addition to standard meteorological parameters (wind, temperature, moisture,
surface pressure), G5NR includes 15 aerosol tracers (dust, sea salt, sulfate, black and organic
carbon), O3, CO and CO2 (NASA, 2014). The data have refined spatial resolution (approximately
7 km in the latitudinal direction) but are only available for 2 years (2005 – 2007).
Both MERRA-2 and G5NR aerosol tracers are derived using a coupled version of the Goddard
Chemistry, Aerosol, Radiation, and Transport model (GOCART) (Colarco et al., 2010). These
aerosols represent the decomposition of the total column aerosol optical depth (AOD, $\tau$) into
components by $\tau = \sum_{z,i} x_i \times b_{ext,i}(RH, \lambda) \times \delta z$, where $x_i$ is the aerosol mass mixing ratio,
$b_{ext,i}(RH, \lambda)$ is the component-specific extinction coefficient as a function of relative humidity $RH$ and wavelength $\lambda$, and $\delta z$ is the
atmospheric layer thickness (or depth). Dust extinction is one of the 15 components $i$ (Randles et
al., 2017) and is the target variable for this study. It typically ranges between 0 and 1 and is a
unitless measure of dust aerosol loading in the column from the surface to the top of
the atmosphere. However, it has been shown that MERRA-2 places most dust below 5 km
altitude (Kramer et al., 2018), while lidar observations place most of the free-tropospheric dust
somewhat lower, below 3 km.
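As a numerical illustration of the column sum for AOD, the sketch below evaluates it for a toy column with two layers and two components. All numbers are made up for illustration, the units are assumed to be chosen so that each product is dimensionless, and the extinction coefficients are treated as constants rather than functions of relative humidity and wavelength.

```python
import numpy as np

# Hypothetical toy column: rows are vertical layers z, columns are components i.
# Units are assumed to be chosen so that x * b_ext * dz is dimensionless.
mixing_ratio = np.array([[0.1, 0.2],
                         [0.3, 0.4]])   # x_i in each layer
b_ext = np.array([0.1, 0.2])            # extinction coefficient per component
layer_thickness = np.array([1.0, 2.0])  # delta z per layer

# Contribution of every (layer, component) pair to the optical depth.
contributions = mixing_ratio * b_ext[np.newaxis, :] * layer_thickness[:, np.newaxis]

# Summing over layers gives one optical depth per component (e.g. dust);
# summing over everything gives the total column AOD.
tau_per_component = contributions.sum(axis=0)
tau_total = contributions.sum()

print(tau_per_component, tau_total)
```

If the first column is taken to be dust, `tau_per_component[0]` plays the role of the dust extinction used as the target variable.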
We leveraged the spatial resolution of G5NR and linked it to MERRA-2 to understand the
spatiotemporal variability of dust. Then with G5NR dust data as the target, we built machine
learning regression models to downscale the MERRA-2 dust data from 50 km resolution to 7 km
resolution.
3. Methods
3.1 Data Extraction and Preprocessing
Fig 1. Flowchart of data extraction and preprocessing
We acquired MERRA-2 (50 km resolution, 2000-2018) and G5NR (7 km resolution, 2005-2007)
data from Global Modeling and Assimilation Office (GMAO) at the NASA Goddard Space
Flight Center (GSFC, https://gmao.gsfc.nasa.gov) in a series of NetCDF-4 format files. The
study area encompasses the region from Eastern Africa to Afghanistan (Latitude: 10.9375° to
42.0625° North; Longitude: 25.6875° to 74.875° East) and the original per-pixel hourly time
series were averaged to daily values. We focused on the G5NR time period, May 16th, 2005 to
May 16th, 2007, to build downscaling models.
We extracted dust values, longitude, latitude and date separately from G5NR and MERRA-2. To
spatially match the 7 km G5NR data points to the encompassing 50 km MERRA-2 grid cell, we
rounded the latitudes of the G5NR data to the nearest 0.5 degrees (e.g. 11.4375 to 11.5) and
the longitudes to the nearest 0.625 degrees (e.g. 26.1875 to 26.25), creating latitude_round and
longitude_round features. Then we merged the G5NR and MERRA-2 datasets of the same date
based on latitude_round = latitude_50 and longitude_round = longitude_50. Finally, we removed
redundant features (latitude_round, latitude_50, longitude_round, longitude_50) and renamed
latitude_7 as latitude and longitude_7 as longitude. For model testing and validation, we made 4
datasets to cover 4 seasons, each containing 3 weeks of daily data: spring (May 16th 2005 to
June 5th 2005), summer (August 15th 2005 to September 4th 2005), fall (November 14th 2005
to December 4th 2005) and winter (February 13th 2006 to March 5th 2006). Each dataset had
8,257,452 records and 5 columns.
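The matching step can be sketched with pandas as follows. This is a minimal illustration rather than the code used in the study: the column names dust_7, dust_50 and date, the helper names snap_to_grid and match_to_coarse, the tie-breaking rule in the rounding, and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

LAT_STEP, LON_STEP = 0.5, 0.625  # coarse grid spacing in degrees, as stated in the text


def snap_to_grid(values, step):
    """Round coordinates to the nearest multiple of `step` (ties round up; an assumption)."""
    return np.floor(np.asarray(values, dtype=float) / step + 0.5) * step


def match_to_coarse(g5nr, merra2):
    """Attach the enclosing coarse-grid dust value to every fine-grid point."""
    fine = g5nr.copy()
    fine["latitude_round"] = snap_to_grid(fine["latitude_7"], LAT_STEP)
    fine["longitude_round"] = snap_to_grid(fine["longitude_7"], LON_STEP)
    # Rename the coarse coordinates so both tables share the same key columns.
    coarse = merra2.rename(columns={"latitude_50": "latitude_round",
                                    "longitude_50": "longitude_round"})
    merged = fine.merge(coarse, on=["date", "latitude_round", "longitude_round"], how="inner")
    merged = merged.drop(columns=["latitude_round", "longitude_round"])
    return merged.rename(columns={"latitude_7": "latitude", "longitude_7": "longitude"})


# Hypothetical toy inputs: two fine-grid points that fall in one coarse cell.
g5nr = pd.DataFrame({
    "date": ["2005-05-16", "2005-05-16"],
    "latitude_7": [11.4375, 11.5],
    "longitude_7": [26.1875, 26.25],
    "dust_7": [0.12, 0.15],
})
merra2 = pd.DataFrame({
    "date": ["2005-05-16"],
    "latitude_50": [11.5],
    "longitude_50": [26.25],
    "dust_50": [0.20],
})
matched = match_to_coarse(g5nr, merra2)
print(matched)
```

Both grid spacings are exactly representable in binary floating point, so merging on the snapped coordinates is safe here; for other spacings an integer cell index would be a more robust key.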
3.2 Feature Engineering
Fig 2. Flowchart of feature engineering
We used a Box-Cox transformation to normalize the dust values, which were highly skewed.
Then we recoded date to consecutive numbers from 0 to 20 (3 weeks, 21 days of data). Finally,
we standardized all features by their mean and standard deviation: $X_{\text{standardized}} = \frac{X - \text{mean}}{\text{SD}}$.
Dust_7 was used as the outcome and the other 4 columns as predictors.
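The three feature engineering steps can be sketched as below. This is an illustration on synthetic data, not the study's pipeline: the column names, the lognormal toy values, the choice to apply Box-Cox to both dust columns, and the use of the population standard deviation are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 3-week dataset with skewed, strictly positive dust values.
rng = np.random.default_rng(0)
dates = pd.date_range("2005-05-16", periods=21, freq="D")
n_per_day = 50
n_rows = len(dates) * n_per_day
df = pd.DataFrame({
    "date": dates.repeat(n_per_day),
    "latitude": rng.uniform(11.0, 42.0, n_rows),
    "longitude": rng.uniform(26.0, 74.0, n_rows),
    "dust_50": rng.lognormal(-2.0, 0.6, n_rows),
    "dust_7": rng.lognormal(-2.0, 0.9, n_rows),
})

# 1. Box-Cox transform the dust values (lambda is estimated by maximum likelihood).
for col in ["dust_50", "dust_7"]:
    df[col], _ = stats.boxcox(df[col].to_numpy())

# 2. Recode calendar dates as consecutive day numbers 0..20.
df["day"] = (df["date"] - df["date"].min()).dt.days
df = df.drop(columns="date")

# 3. Standardize every column: (X - mean) / SD.
standardized = (df - df.mean()) / df.std(ddof=0)
print(standardized.describe().loc[["mean", "std"]])
```

Box-Cox requires strictly positive inputs, which matches the positive minima reported for the dust values.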
3.3 Train-test Splitting
We used two methods of train-test splitting: (1) Randomly sampling 33% of the observations
from the original dataset for the testing data, keeping the remaining 67% as the training data; (2)
Selecting one entire day in the middle of the 3-week time series (day 10) as the testing data
(about 5%) and the remaining 95% as the training data. Method 2 aims to more strictly test the
ability of the downscaling to regenerate the spatial surface.
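The two splitting schemes can be sketched as follows, assuming a data frame with a day column coded 0 to 20. The helper names random_split and middle_day_split and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def random_split(df, test_size=0.33, seed=0):
    """Method 1: hold out a random 33% of the rows."""
    return train_test_split(df, test_size=test_size, random_state=seed)


def middle_day_split(df, day_col="day", test_day=10):
    """Method 2: hold out every row from the middle day of the 3-week window."""
    is_test = df[day_col] == test_day
    return df[~is_test], df[is_test]


# Hypothetical toy data: 21 days with 100 grid points each.
rng = np.random.default_rng(0)
df = pd.DataFrame({"day": np.repeat(np.arange(21), 100),
                   "dust_7": rng.normal(size=2100)})

train_r, test_r = random_split(df)
train_m, test_m = middle_day_split(df)
print(len(test_r) / len(df), len(test_m) / len(df))
```

With 21 equally sized days, the middle-day test set is 1/21 of the rows, which is the "about 5%" quoted above. Unlike the random split, it contains no test locations whose same-day neighbours were seen in training.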
3.4 Machine Learning Methods
Four machine learning methods were chosen as candidate downscalers: ridge regression, elastic-
net regression, random forest regression, and extreme gradient boosting regression (XGBoost).
We compared the performance of each by their test R² and root-mean-square error (RMSE).
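For reference, both metrics can be computed with scikit-learn as in the sketch below. The helper name evaluate and the toy values are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return (RMSE, R^2) for a set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, r2


# Hypothetical toy values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
rmse, r2 = evaluate(y_true, y_pred)
print(rmse, r2)
```

R² can be negative on held-out data: it is below 0 whenever the model predicts worse than a constant equal to the mean of the test outcomes.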
3.4.1 Ridge Regression
Ridge regression is a linear regression method with a penalty on the size of the coefficients (l2
regularization). The ridge coefficients minimize a penalized residual sum of squares:
$$\min_{w} \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$
where $\|Xw - y\|_2^2$ is the ordinary least squares term and $\alpha \|w\|_2^2$ is the l2 regularization term.
𝛼 is a positive complexity parameter which controls the amount of shrinkage: the larger the value
of 𝛼 , the greater the amount of shrinkage and thus the coefficients become more robust to
collinearity (Pedregosa et al., 2011).
We used the scikit-learn library (v0.21.3) in Python (v3.7) to perform the ridge regression. We
set 𝛼 = 1 and all other hyperparameters were the default of the Ridge() function of scikit-learn.
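A minimal sketch of this configuration on synthetic data is shown below. The data and true weights are made up. The closed-form check assumes the default fit_intercept=True, under which scikit-learn centers X and y before solving.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: 4 predictors, as in the study, with made-up weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.5, size=200)

# Settings named in the text: alpha = 1, everything else left at its default.
model = Ridge(alpha=1.0).fit(X, y)

# Closed-form ridge solution on centered data: w = (X'X + alpha * I)^-1 X'y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
w_closed = np.linalg.solve(Xc.T @ Xc + 1.0 * np.eye(4), Xc.T @ yc)

print(model.coef_)
print(w_closed)
```

The two coefficient vectors agree, which confirms that the penalty acts on the sum of squared weights and leaves the intercept unpenalized.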
3.4.2 Elastic-Net Regression
Elastic-Net is a linear regression model trained with both l1 and l2-norm regularization of the
coefficients. This combination allows learning a sparse model where few of the weights are non-
zero, while still maintaining the regularization properties of Ridge.
The objective function to minimize is in this case:
$$\min_{w} \frac{1}{2 n_{samples}} \|Xw - y\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1 - \rho)}{2} \|w\|_2^2$$
where $\alpha \rho \|w\|_1$ is the l1 regularization term and $\frac{\alpha (1 - \rho)}{2} \|w\|_2^2$ is the l2 regularization term
(Pedregosa et al., 2011).
We used the scikit-learn library (v0.21.3) in Python (v3.7) to perform the Elastic-Net regression.
We set 𝛼 = 1, 𝜌 = 0.5 and all other hyperparameters were the default of the ElasticNet()
function of scikit-learn.
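The sketch below illustrates how the l1 term can remove features under these settings. In scikit-learn, $\rho$ is the l1_ratio argument. On standardized data, the objective above is minimized by an all-zero weight vector whenever every predictor's absolute correlation with the outcome is at most $\alpha \rho = 0.5$. The synthetic data are an assumption, chosen so that this condition holds; nothing here is taken from the study's data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

# Hypothetical standardized data with a weak linear signal (correlation near 0.3).
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
y = 0.3 * X[:, 0] + rng.normal(size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

# Settings named in the text: alpha = 1, rho (l1_ratio) = 0.5.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Largest absolute predictor-outcome correlation, to compare with alpha * rho = 0.5.
max_corr = float(np.max(np.abs(X.T @ y) / n))

print("max |corr|:", max_corr)
print("Elastic-Net coefficients:", enet.coef_)
print("Ridge coefficients:", ridge.coef_)
```

In this toy setting Elastic-Net reduces to predicting the mean, so its training R² is 0, while Ridge keeps all four weights.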
3.4.3 Random Forest Regression
Random forest is an ensemble method that combines decision tree predictors using
bagging. In random forest, each tree in the ensemble is built from a sample drawn with
replacement (i.e., a bootstrap sample) from the training set (Breiman, 2001). It is widely used
in both classification and regression tasks.
We used the scikit-learn library (v0.21.3) in Python (v3.7) to perform the random forest
regression. We set n_estimators = 200, max_depth = 3 and all other hyperparameters were the
default of the RandomForestRegressor() method of scikit-learn.
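A minimal sketch of this configuration on synthetic data follows. The data-generating function and the random_state value are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical nonlinear data with 4 predictors.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# Settings named in the text: 200 trees, each limited to depth 3.
forest = RandomForestRegressor(n_estimators=200, max_depth=3, random_state=0).fit(X, y)

# Each tree is grown on its own bootstrap sample; the forest averages their predictions.
depths = [tree.get_depth() for tree in forest.estimators_]
train_r2 = forest.score(X, y)
print(max(depths), train_r2)
```

Because averaging does not add depth, a forest of depth-3 trees can only represent fairly coarse piecewise-constant surfaces, however many trees are used.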
3.4.4 XGBoost Regression
XGBoost is an optimized distributed gradient boosting method designed to be highly efficient,
flexible and portable. It implements machine learning algorithms under the Gradient Boosting
framework. The XGBoost library provides parallel tree boosting (also known as GBDT, GBM)
that solves many data science problems in a fast and accurate way (Chen and Guestrin, 2016).
We used the XGBoost library (v0.90) in Python (v3.7) to perform the XGBoost regression. We
set learning_rate = 1, n_estimators = 200, max_depth = 3 and all other hyperparameters were the
default of the XGBRegressor() function of XGBoost.
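To make the boosting idea concrete, the sketch below implements plain gradient boosting for squared error with scikit-learn decision trees, using the same learning rate, number of trees and depth as above. It is a simplified stand-in and not XGBoost itself: it omits XGBoost's regularized objective, second-order updates and parallel tree construction. The function names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=200, learning_rate=1.0, max_depth=3):
    """Fit trees sequentially, each one to the residuals of the current ensemble."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees, train_rmse = [], []
    for _ in range(n_estimators):
        residual = y - pred  # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        train_rmse.append(float(np.sqrt(np.mean((y - pred) ** 2))))
    return base, trees, train_rmse


def predict_boosted(base, trees, X, learning_rate=1.0):
    """Sum the scaled contributions of all trees on top of the base prediction."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Hypothetical nonlinear data with 4 predictors.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

base, trees, train_rmse = fit_boosted_trees(X, y)
final_pred = predict_boosted(base, trees, X)
print(train_rmse[0], train_rmse[-1])
```

A learning rate of 1 applies each tree's correction in full, so the training error falls quickly. Smaller learning rates are a common way to trade training fit for better generalization.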
4. Results
The dust values of MERRA-2 (Dust_50) and G5NR (Dust_7) before normalization and
standardization are shown in Table 1. The mean, standard deviation (SD), minimum value (min)
and maximum value (max) were calculated for 4 seasons: spring (May 16th 2005 to June 5th
2005), summer (August 15th 2005 to September 4th 2005), fall (November 14th 2005 to December
4th 2005) and winter (February 13th 2006 to March 5th 2006).
Table 1. Descriptive statistics of dust data
Product                 Statistic  Spring     Summer  Fall       Winter
MERRA-2 dust (Dust_50)  Mean       0.198      0.187   0.076      0.138
MERRA-2 dust (Dust_50)  SD         0.130      0.127   0.056      0.140
MERRA-2 dust (Dust_50)  Max        0.969      1.034   0.623      2.270
MERRA-2 dust (Dust_50)  Min        0.728      0.007   6.173e-04  0.004
G5NR dust (Dust_7)      Mean       0.140      0.299   0.097      0.140
G5NR dust (Dust_7)      SD         0.131      0.250   0.102      0.158
G5NR dust (Dust_7)      Max        1.482      2.294   0.814      1.662
G5NR dust (Dust_7)      Min        2.906e-28  0.004   2.008e-04  7.512e-04
Except in spring, mean MERRA-2 dust was lower than mean G5NR dust, and it showed far less
seasonal variability (e.g. MERRA-2 ranged from 0.076 in fall to 0.198 in spring, compared with
G5NR from 0.097 in fall to 0.299 in summer).
Comparisons of spatially matched 3-week averages (Figure 3) further illustrate the differences
between the two products. The MERRA-2 dust tends to be lower than the G5NR dust (slopes are
all less than 1). Agreement between the two products is weakest in winter, where the R² is only 0.257.
Figure 3. Spatially matched 3-week averages of G5NR (x-axis) versus MERRA-2 (y-axis) dust
with 1-to-1 (dashed red) and linear regression (blue) lines. Clockwise from top left: Spring,
summer, winter, fall.
The results of 4 machine learning methods are presented below (Tables 2 – 5). There are 2 linear
methods (Ridge and Elastic-Net regression) and 2 tree-based methods (Random Forest and
XGBoost regression). The performance of models was evaluated using root mean square error
(RMSE) and R².
Table 2. Results of Ridge Regression Models
Splitting method  Metric         Spring  Summer  Fall   Winter
Random            Training RMSE  0.672   0.628   0.816  0.862
Random            Training R²    0.548   0.605   0.333  0.256
Random            Testing RMSE   0.673   0.628   0.816  0.862
Random            Testing R²     0.547   0.605   0.333  0.255
Middle-day-off    Training RMSE  0.679   0.626   0.819  0.857
Middle-day-off    Training R²    0.551   0.609   0.328  0.253
Middle-day-off    Testing RMSE   0.543   0.655   0.754  0.973
Middle-day-off    Testing R²     0.336   0.516   0.418  0.278
Table 3. Results of Elastic-Net Regression Models
Splitting method  Metric         Spring  Summer  Fall   Winter
Random            Training RMSE  0.983   0.868   0.981  0.999
Random            Training R²    0.032   0.246   0.037  0.000
Random            Testing RMSE   0.983   0.868   0.981  1.000
Random            Testing R²     0.032   0.246   0.037  -2.232
Middle-day-off    Training RMSE  0.986   0.868   0.986  0.991
Middle-day-off    Training R²    0.052   0.249   0.027  0.000
Middle-day-off    Testing RMSE   0.679   0.821   0.981  1.156
Middle-day-off    Testing R²     -0.041  0.242   0.016  -0.016
Table 4. Results of Random Forest Regression Models
Splitting method  Metric         Spring  Summer  Fall   Winter
Random            Training RMSE  0.649   0.622   0.769  0.774
Random            Training R²    0.578   0.612   0.408  0.399
Random            Testing RMSE   0.649   0.622   0.769  0.775
Random            Testing R²     0.578   0.612   0.408  0.399
Middle-day-off    Training RMSE  0.656   0.620   0.771  0.767
Middle-day-off    Training R²    0.579   0.616   0.404  0.400
Middle-day-off    Testing RMSE   0.517   0.660   0.734  0.915
Middle-day-off    Testing R²     0.395   0.510   0.449  0.361
Table 5. Results of XGBoost Regression Models
Splitting method  Metric         Spring  Summer  Fall   Winter
Random            Training RMSE  0.338   0.307   0.401  0.406
Random            Training R²    0.885   0.905   0.839  0.834
Random            Testing RMSE   0.339   0.307   0.400  0.407
Random            Testing R²     0.884   0.905   0.839  0.834
Middle-day-off    Training RMSE  0.338   0.303   0.400  0.410
Middle-day-off    Training R²    0.888   0.968   0.839  0.828
Middle-day-off    Testing RMSE   0.348   0.417   0.556  0.715
Middle-day-off    Testing R²     0.725   0.804   0.684  0.610
The XGBoost models produced the best results while the Elastic-Net models produced the worst
results for all seasons and both splitting methods. The testing R² values of the Elastic-Net models were
nearly 0 or even negative. The testing R² values of the XGBoost models were above 0.8 for randomly
selected test samples and above 0.6 for the middle-day test samples. Ridge regression and
random forest regression performed similarly.
We also found that the machine learning models performed differently across the 4 seasons. In terms
of R², all 4 models performed best for summer but worst for winter. In summer, the
XGBoost model achieved an R² of 0.905 for the randomly selected test sample and 0.804 for the
middle-day test sample. In winter, the XGBoost model achieved an R² of 0.834 for the randomly
selected test sample but only 0.610 for the middle-day test sample.
The machine learning models also performed differently for the 2 kinds of testing samples. The models
generally performed better on randomly selected test samples than on middle-day test samples. Models
performed similarly on training and testing data when we used randomly selected training and testing
samples. However, when we used the middle-day data as testing samples and the rest as training
samples, models performed worse on the testing data. For example, in winter, the XGBoost model had
an R² of 0.828 on the training data but only 0.610 on the testing data. The models showed signs of
overfitting and generalized poorly when we used them to predict a whole day rather
than randomly selected testing data points.
For each middle day, we made 2 maps to show the spatial distributions of true dust values at 7 km
resolution and 50 km resolution, and 2 maps to show the spatial distributions of XGBoost-predicted
dust values and their difference from the ground truth (Fig 4 – 7). In the difference maps
(prediction – ground truth), blue indicates underestimation and red indicates overestimation.
Fig 4. Spatial distribution of dust on May 26th 2005 (true values, predictions and difference)
Fig 5. Spatial distribution of dust on August 26th 2005 (true values, predictions and difference)
Fig 6. Spatial distribution of dust on November 24th 2005 (true values, predictions and difference)
Fig 7. Spatial distribution of dust on February 23rd 2006 (true values, predictions and difference)
We found that the dust data at 50 km and 7 km resolution did not show the same patterns, even
though they were from the same area on the same day. We also found that Saudi Arabia, Yemen, Oman
and Sudan had the highest levels of dust. XGBoost models performed badly and tended to
underestimate dust when the true dust values were extremely high. Interestingly, XGBoost
models tended to overestimate dust values over the sea, especially the Arabian Sea.
5. Discussion
Statistical approaches to reliably downscaling the MERRA-2 50 km product to finer spatial
resolution would be a great asset to the atmospheric research community. The G5NR is a unique
but extremely computationally intensive approach to provide high spatial resolution estimates of
MERRA-2 parameters that is not tractable for global prediction. Furthermore, MERRA-2
provides assimilated products such as dust over an extremely long period of time at hourly and
daily temporal resolution by combining data sources from the early 1980s to present, again
intractable to recreate at a fine spatial scale with a G5NR type of modeling effort. Training
statistical (machine learning) models on the available matched MERRA-2 and G5NR data provides
a tool that we can broadly apply to downscale MERRA-2 data at other time periods.
Our results showed that linear models, especially the Elastic-Net model, performed poorly
for the downscaling task. This is because we had too few features (only 4) and low variance
within features. In this situation, linear models were too simple to handle the problem and
underfit the data. Compared to Ridge, Elastic-Net has an additional l1 regularization term, which
may shrink the weights of some features to 0 and worsen the problem of having few features. That is
why the R² of the Elastic-Net models on some testing data was even negative.
Among all 4 models, XGBoost showed the best performance in terms of test R². Nevertheless,
when trying to predict an entire day of data, which replicates the intended use of the downscaler,
performance was not as robust. We also noted a fairly systematic bias in the downscaled results
due to the short- and longer-term differences between the MERRA-2 and G5NR dust products. This
difference has been noted by the G5NR team (Gelaro et al., 2015), but we observe that a global
bias correction is not applicable to all regions, particularly dusty ones with large spatial and
temporal gradients.
Future work will involve the introduction of additional features such as elevation and
meteorology that could potentially produce better results. Deep learning has also been used with
success for spatial downscaling (Li et al., 2019), and will be explored in this context. Further
investigation into the two data sources is also warranted to help provide a more detailed
explanation as to why they are so dissimilar in the same area and on the same day.
6. References
Atkinson RW, Kang S, Anderson HR, Mills IC, Walton HA. Epidemiological time series studies
of PM2.5 and daily mortality and hospital admissions: A systematic review and meta-analysis.
Thorax. 2014;69(7):660-665. doi:10.1136/thoraxjnl-2013-204492.
Chan, C.C., K.J. Chuang, W.J. Chen, W.T. Chang, C.T. Lee, and C.M. Peng. 2008. Increasing
cardiopulmonary emergency visits by long-range transported Asian dust storms in Taiwan.
Environ. Res. 106:393–400. doi:10.1016/j.envres.2007.09.006
Chan, C.C., and H.C. Ng. 2011. A case-crossover analysis of Asian dust storms and mortality in
the downwind areas using 14-year data in Taipei. Sci. Total Environ. 410–411:47–52.
doi:10.1016/j.scitotenv.2011.09.031
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-
794). ACM.
Cheng, M.F., S.C. Ho, H.F. Chiu, T.N. Wu, P.S. Chen, and C.Y. Yang. 2008. Consequences of
exposure to Asian dust storm events on daily pneumonia hospital admissions in Taipei, Taiwan.
J. Toxicol. Environ. Health A 71:1295–99. doi:10.1080/15287390802114808
Colarco P, Da Silva A, Chin M, Diehl T. Online simulations of global aerosol distributions in the
NASA GEOS-4 model and comparisons to satellite and ground-based aerosol optical depth. J
Geophys Res Atmos. 2010;115(14). doi:10.1029/2009JD012820.
Host S, Larrieu S, Pascal L, et al. Short-term associations between fine and coarse particles and
hospital admissions for cardiorespiratory diseases in six French cities. Occup Environ Med.
2008;65(8):544-551. doi:10.1136/oem.2007.036194.
Gelaro R, McCarty W, Suárez MJ, et al. The modern-era retrospective analysis for research and
applications, version 2 (MERRA-2). J Clim. 2017;30(14):5419-5454. doi:10.1175/JCLI-D-16-
0758.1.
Gelaro, R., Putman, W.M., Pawson, S., Draper, C., Molod, A., Norris, P.M., Ott, L., Prive, N.,
Reale, O., Achuthavarier, D. and Bosilovich, M., 2015. Evaluation of the 7-km GEOS-5 Nature
Run.
Kramer, S, Zuidema, P., Delgadillo, R., da Silvia, A, Alvarez, C., Custals, Barkley, L., Gaston,
C.J. and Prospero, J.M. Comparison of Saharan Dust Surface Mass Observations and Lidar in
Miami, FL, to the MERRA2 Reanalysis, American Meteorological Society 98th Annual Meeting,
Austin TX January 2018.
L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
Li, L., Franklin, M., Girguis, M., Lurmann, F., Wu, J., Pavlovic, N., Gilliland, F., Habre, R.
Spatiotemporal Imputation of MAIAC AOD Using Deep Learning with Downscaling, Remote
Sensing of Environment, Accepted October 2019.
Li, J., Carlson, E.B., & Lacis, A.A. (2015). How well do satellite AOD observations represent
the spatial and temporal variability of PM2.5 concentration for the United States? Atmos
Environ, 102, 260-273.
Liu L, Urch B, Poon R, Szyszkowicz M, Speck M, Gold DR. Effects of Ambient Coarse, Fine,
and Ultrafine Particles and Their Biological Constituents on Systemic Biomarkers: A Controlled
Human Exposure Study. Environ Health Perspect. 2015;534(6):534-540.
NASA, 2014. GEOS-5 Nature Run, Ganymed Release. Retrieved from
https://gmao.gsfc.nasa.gov/global_mesoscale/7km-G5NR/
Randles CA, da Silva AM, Buchard V, et al. The MERRA-2 aerosol reanalysis, 1980 onward.
Part I: System description and data assimilation evaluation. J Clim. 2017;30(17):6823-6850.
doi:10.1175/JCLI-D-16-0609.1.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
UK Air Pollution Information System, 2019. Dusts. Retrieved from
http://www.apis.ac.uk/overview/pollutants/overview_particles.htm
Voiland, A. (2010). Aerosols: Tiny Particles, Big Impact. In: NASA
Asset Metadata
Creator: Zhai, Tianxing (author)
Core Title: Machine learning approaches for downscaling satellite observations of dust
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Biostatistics
Publication Date: 12/04/2019
Defense Date: 12/03/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: downscaling, dust, Elastic-Net, G5NR, machine learning, MERRA-2, OAI-PMH Harvest, random forest, ridge, XGBoost
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Franklin, Meredith (committee chair), Lewinger, Juan Pablo (committee member), Marjoram, Paul (committee member)
Creator Email: txzhai0617@gmail.com, tzhai@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-242068
Unique identifier: UC11674246
Identifier: etd-ZhaiTianxi-7983.pdf (filename), usctheses-c89-242068 (legacy record id)
Legacy Identifier: etd-ZhaiTianxi-7983.pdf
Dmrecord: 242068
Document Type: Thesis
Rights: Zhai, Tianxing
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA