EFFECT OF SPATIAL PATTERNS ON SAMPLING DESIGN PERFORMANCE IN A VEGETATION
MAP ACCURACY ASSESSMENT
by
Benito Mark Lo
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements of the Degree
MASTER OF SCIENCE
(GEOGRAPHIC INFORMATION SCIENCE AND TECHNOLOGY)
April 2014
Copyright 2014 Benito Mark Lo
DEDICATION
I dedicate this document to my fellow labmates Kellie Uyeda, Spring Strahm and Julia Jesu. These three
provided for me the one thing I was lacking after graduating with my bachelor’s degree: opportunity. At a
time when I was at my lowest, Kellie, Spring and Julia gave me the chance to showcase what I can do in
both a work and an academic environment. The opportunity that these three afforded to me directly led
me to completing not one, but two master’s degrees in three years.
ACKNOWLEDGEMENTS
I’d like to acknowledge the Principal Investigators of my theses at USC and SDSU: Dr. Travis Longcore
and Dr. Douglas Deutschman. Both individuals have spent a tremendous amount of time and effort
guiding me in my academic decisions, helping me to understand complex experimental results, and
correcting countless grammatical mistakes in my writing. It’s been a long road completing two sets of
experiments, and I couldn’t have done it without the dedication and support of these two men. Thank you.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER ONE: INTRODUCTION
Background
Vegetation Map Accuracy Assessment Considerations
Sampling Designs
Spatial autocorrelation
CHAPTER TWO: METHODS
SANDAG’s 2012 San Diego County Vegetation Map
Construction of the Vegetation Base Map, Error, and Sampling Data Sets
Vegetation Base Map
Error Base Maps
Sampling Strategy Raster Datasets for Accuracy Assessment
Consolidation of Datasets
Validation of the ArcGIS Simulation
The Effect of Sample Size
Testing the Pseudorandom Number Generator for Bias away from a Random Distribution
CHAPTER THREE: RESULTS
Validating the ArcGIS Simulation
The Effect of Sample Size
Testing the Pseudorandom Number Generator for Bias away from a Random Distribution
CHAPTER FOUR: DISCUSSION
Estimated Accuracy and the Coefficient of Variance
Future Direction
The Issue of Spatial Autocorrelation
The Error Matrix and Classification-Specific Accuracy Analysis
Conclusion
References
LIST OF TABLES
Table 1: Numerical Codes Used in the ArcGIS Simulation
Table 2: Sum of Numerical Codes and Definitions
Table 3: Rank of Estimated Accuracy Precision and Biases
Table 4: Hypothetical Error Matrix
LIST OF FIGURES
Figure 1: Example of Accuracy and Precision
Figure 2: Disproportionate Cover Classifications
Figure 3: Thesis Experimental Design
Figure 4: ArcGIS Simulation Conceptual Design
Figure 5: Example of Vegetation Map Error
Figure 6: Raster-Converted Vegetation Base Map
Figure 7: Example of the 10% Random Error Map
Figure 8: Example of the 10% Clustered Error Map
Figure 9: Example of the Consolidation of Simulation Datasets
Figure 10: Boxplots of Simulation Sample Size as a Function of Target Sample Sizes
Figure 11: Estimated Accuracy by Experimental Permutations
Figure 12: Coefficient of Variance of Estimated Accuracy by Experimental Permutations
Figure 13: Coefficient of Variation of Estimated Accuracy after Filtering
Figure 14: Correlogram of Clustered and Random Error Base Maps
Figure 15: Comparison of Sampling Design Efficiency
LIST OF ABBREVIATIONS
AECOM Architecture, Engineering, Consulting, Operations and Maintenance
AF-CV Adenostoma fasciculatum – Ceanothus verrucosus
CDFG California Department of Fish and Game
CRS Clustered Random Sample
CV Coefficient of Variation
Esri Environmental Systems Research Institute
SRS Simple Random Sample
SANDAG San Diego Association of Governments
ABSTRACT
Vegetation classification maps can be important tools for academic research, environmental mitigation
and restoration, and governmental decision-making. Continued development in satellite and photographic
technology has resulted in an increase in production of remotely-sensed vegetation classification maps.
As vegetation classification maps become easier and more efficient to make, evaluating the accuracy of
these maps will become increasingly important. Traditional sampling strategies for vegetation
classification map accuracy assessments were developed using aspatial statistical theory. The
performance of these sampling strategies may be affected by spatial patterns in the vegetation
classification map in ways that are unpredictable. In this study, I simulated accuracy assessments using
Esri’s ArcGIS software. The goal of this study was to understand how different spatial patterns of map
error affect the performance of two common sampling strategies during an accuracy assessment.
CHAPTER ONE: INTRODUCTION
As our understanding of the environment has become more sophisticated, research trends in the fields of
ecology, conservation biology, and environmental science have been shifting to address complex
landscape-scale questions. With the continual growth of urbanization and agriculture, suitable floral
communities for wildlife have reduced in spatial coverage and are often interspersed in a matrix of natural
and urban environments. The heterogeneous and patchy distribution of natural habitats creates difficulties
for researchers hoping to understand environmental trends. Small-scale studies that ignore heterogeneous
patterns in ecosystems are informative, but can often simplify ecological interactions into models in a way
that is difficult to interpret when extrapolated to larger scale systems. Moreover, land managers of nature
reserves and forests look to ecological studies to create management policies that can be generalized over
large, often heterogeneous swaths of land. Because of the importance of recognizing landscape patterns,
of establishing large-scale studies and experiments, and of creating inter-reserve management policies, the
development of accurate large-scale vegetation maps has become an important goal for many
environmental organizations, government agencies, and scientists.
Mapping vegetation is both intellectually and practically challenging. Vegetation does not often
grow in homogeneous single species stands, but rather grows as a community of different species and
plant functional types. The spatial distribution of individual plant species is determined by a variety of
interrelated physical factors such as slope, soil characteristics, elevation and rainfall, as well as biotic
factors such as species competition and herbivory (Bartolome et al. 2007). Mapping individual plants or
individual plant species distributions can be prohibitively time-intensive and financially expensive.
Researchers have long discussed how to best fit vegetation communities into categories of plant
assemblages to avoid mapping individual plant or individual species distributions. While there is still
some disagreement on the validity of the categorization of plant assemblages, thematic maps of
vegetation classifications are common. Classification maps hold several advantages for researchers and
land managers. Visualizing vegetation data as plant assemblage categories can simplify interpretation and
analysis while still maintaining large-scale spatial structure. Large-scale visualization of complex data can
reveal patterns that may have previously gone unnoticed. Simplified representations of vegetation data
enabled by vegetation cover classification maps can also provide a means of visualization that can more
easily communicate complex messages to audiences with a wide variety of expertise. Not only do cover
classification vegetation maps allow for ease of interpretation and communication, but they are also
becoming progressively easier to construct. Advancements in satellite imagery and remotely controlled
aerial vehicles capable of capturing high-definition images have made constructing thematic vegetation
maps powerful and inexpensive relative to older methods that relied on ground sampling.
Creation of classification maps can, however, be prone to errors. Error from remote-sensed data
can occur in any or all steps of the map-making process. During the data collection phase, research
technicians may incorrectly input a value in the field or on the computer, or may misidentify a pixel of
remote-sensed imagery. Identification errors can further be exacerbated by sensor malfunctions or sub-
optimal ground conditions (Loveland et al. 1999). This data may in turn be mis-entered during the data
entry phase of data collection. Errors can also propagate as data is processed or transformed, such as
during geographic rectification or data conversion processes (Lunetta et al. 1991). With so many
opportunities to generate errors throughout the map-making process, reliable accuracy assessments are
needed to validate the information communicated through the maps.
The presence of errors is not the only important factor to consider when evaluating the accuracy
of vegetation cover classification maps; error is seldom randomly distributed in space,
and is almost always organized in some sort of pattern (Loveland et al. 1999). The distribution of error can
be dependent on a variety of factors. For example, a glitch in the imagery or a particular visual cue from
remote-sensed images may make a particular group of pixels more difficult to classify than others,
creating clustered (spatially autocorrelated) error (Moisen et al. 1994). At all scales of measurement,
inherent geographical patterns such as spatial autocorrelation and spatial heterogeneity in the error or
error structure (Fortin 1989) may interfere with analysis and interpretation of data collected and the
validity of classification map accuracy measurements. A reliable accuracy assessment of a vegetation
cover classification may need to be able to identify where error may be organized in a particular pattern.
Conducted by comparing classification maps to a reference map that is considered to be more
accurate or previously validated, accuracy assessments are an important tool for evaluating a vegetation
cover classification map. Agencies may have a variety of motivations that make accuracy assessments
important. Agencies may choose to conduct an accuracy assessment in order to: (1) estimate different
accuracy parameters, (2) provide information on rare land-cover classes, (3) compare the benefits of
different land-cover classification schemes, (4) assess change detection accuracy or (5) evaluate
conductor compliance (Stehman 1999).
Progression of technology enables faster and larger-scale production of vegetation maps. As map
production increases and as vegetation maps are used for a wider variety of applications, the accuracy
assessment of vegetation maps becomes increasingly important. Vegetation maps used in determining
public policy, supporting political arguments, or planning restoration strategies must necessarily contain
reliable information. A vegetation map is only as trustworthy as the accuracy assessment performed on
it. Accuracy assessment quality can depend on a variety
of factors, including the sampling strategy and the number of sampling locations used during the accuracy
assessment. Few studies, however, have looked at how the spatial distribution of error might affect the
performance of accuracy assessment strategies.
Sampling strategies applied over large-scale vegetation maps are both financially expensive and
time intensive to implement. Developing an understanding of sampling strategy effectiveness under
different circumstances is often not practical through in-field tests. In this study, I examined the potential
use of Esri’s ArcGIS as a low-cost modeling platform for understanding sampling strategy behavior under
varying spatial patterns of vegetation map error. Utilizing packaged tools to develop models within the
ArcGIS Desktop software, I constructed simulations that modeled (1) a realistic vegetation study area, (2)
error distributed in varying rates and patterns across a vegetation map, and (3) the application of different
sampling strategies on the vegetation map with modeled error. This study demonstrates that the ArcGIS
Desktop software can be an effective platform for researching sampling strategy efficacy for vegetation
monitoring.
Background
Vegetation maps, especially those that seek to describe regional vegetation communities, often
encompass large extents of land. Due to large study area extents, uneven topography, heterogeneous
landscapes and ownerships, and restricted access points, assessing the accuracy of these vegetation maps
can be difficult when assessors face time constraints and a limited budget. In order to develop
an understanding of a vegetation map’s accuracy within these restrictions, accuracy assessors will often
employ a sampling strategy.
A sampling strategy is a pre-determined set of methods designed to select a subset of the total
number of features in a map for assessment. By sampling a subset of the features on a map, accuracy
assessors can report a metric for a vegetation map’s accuracy without having to exert the same amount of
effort required to census an entire map. Typically, sampling strategies are made up of two components:
the sample size and the sampling design. A sampling strategy must have a large enough sample size such
that, when randomly selected, most map classifications are represented within the sample. The sampling
design is the methodology or protocol through which map features are selected to be a part of the sample.
The goal of a sampling strategy is to provide a sample that is an unbiased representation of the
population. An unbiased sample requires both a sample size large enough that randomly selected
features are likely to include most of the potential variations in classification, and a sampling design
that selects map features without artificially choosing certain map features more often than others.
Accuracy assessors are tasked with developing a sampling strategy that
utilizes an appropriate sample size and sampling design to maximize their sampling strategy’s efficacy in
estimating a map’s true accuracy.
Efforts from the 1980s and early 1990s to assess sampling strategy efficacy emphasized the
statistical importance of sampling design selection and appropriate sample sizes (Curran and Williamson
1986; Keith 1988; Congalton 1991). Much contemporary work has been focused on developing sampling
designs and analysis strategies for accuracy assessment of remote-sensing derived land cover
classification maps (Comber et al. 2012; Foody 2002; Stehman 2009). A better understanding of accuracy
assessment for regional-scale thematic maps, however, is still needed.
Accuracy in thematic maps can be defined in two ways: classification accuracy and positional
accuracy (Lo & Watson 1998). Classification accuracy describes the misidentification of a map’s unit of
measurement (raster cell, pixel, or polygon), while positional accuracy describes the spatial misplacement
of a map entity. Though both types of error are problematic, both field-based and remotely-sensed
mapping contain consistently small rates of positional error relative to misclassification errors (Hearn et
al. 2011). This work focuses solely on misclassification errors, and further usage of the word “error” will
refer to “misclassification error.”
Vegetation classification map accuracy measurements are reported in one of two ways: overall
map accuracy and classification-specific accuracy. These two metrics are used to inform agencies of the
amount of error present in a map. Overall map accuracy, as the name suggests, estimates the accuracy of
the entire vegetation map. Classification-specific accuracy reports the accuracy of individual
classification types (Stehman 2009). Accuracy assessors may be interested in analyzing classification-
specific accuracy if they have reason to believe that map error is distributed unevenly between
classification types. For simplicity, this study examines only overall map accuracy estimates, not
classification-specific accuracy.
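The distinction between the two metrics can be illustrated with a minimal sketch. The function names and the six-unit sample below are hypothetical and are not taken from the thesis; the per-class figure is computed over units grouped by their mapped label, which is only one of several ways classification-specific accuracy is reported.

```python
from collections import defaultdict

def overall_accuracy(map_labels, reference_labels):
    """Fraction of sampled units whose map label matches the reference label."""
    if len(map_labels) != len(reference_labels) or not map_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(m == r for m, r in zip(map_labels, reference_labels))
    return correct / len(map_labels)

def class_specific_accuracy(map_labels, reference_labels):
    """Per-class fraction of units mapped as a class that the reference confirms."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for m, r in zip(map_labels, reference_labels):
        totals[m] += 1
        hits[m] += (m == r)
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Hypothetical sample of six map units (labels are illustrative only).
mapped    = ["chaparral", "chaparral", "chaparral", "chaparral", "grassland", "grassland"]
reference = ["chaparral", "chaparral", "chaparral", "chaparral", "chaparral", "grassland"]

print(overall_accuracy(mapped, reference))         # 5 of 6 units agree
print(class_specific_accuracy(mapped, reference))  # agreement differs by class
```

In this illustrative sample the overall figure is high even though one class is correct only half the time, which is the situation in which classification-specific accuracy becomes informative.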
Sampling strategies are used to estimate the true accuracy of a vegetation map. The estimated
accuracy captured by an assessment can depend on the precision of the sampling strategy, which can be
affected by both the sampling design and sample size used by the accuracy assessors. A sampling
strategy’s accuracy estimate (hereafter referred to as “estimate”) is considered to be accurate if the
sampling strategy’s estimate is near in value to a vegetation map’s true accuracy value.
Precision on the other hand measures the consistency with which the sampling strategy estimates
map accuracy. High precision and high accuracy of estimates do not necessarily coincide with one
another. A sampling strategy that has high estimate accuracy in a certain environment should provide an
average estimate that is close in value to the true map accuracy. If the sampling strategy has low precision
in that environment, however, then any individual accuracy assessment using the strategy may by chance
not estimate the correct map accuracy. Sampling strategies that have high accuracy of estimates but have
low precision will produce estimates that are inconsistent between accuracy assessment iterations.
Conversely, a sampling strategy that has high precision and low accuracy of estimates in a certain
environment will produce consistent estimate values, but the average value of the estimates is
unlikely to equal the true map accuracy (Figure 1).
Figure 1 - Example of accuracy vs. precision. The distribution of points within each target represents a different
combination of accuracy and precision.
The goal of a sampling strategy should be to have high accuracy of estimates and high precision
because most accuracy assessors do not have the opportunity to replicate their accuracy assessment due to
time and financial constraints. Accuracy assessors should have confidence that their sampling strategy
leads to an estimate that is consistently close to the true map accuracy. If either sampling strategy estimate
accuracy or precision is low, the reliability of the accuracy assessment is subject to scrutiny.
In this study, I utilize the coefficient of variation (CV) as a metric for precision. CV is an effect
size metric widely used in biological sciences that allows for the direct comparison between different
accuracy measures (Kelley 2007; Chen and Wei 2009). CV is a ratio of the sample standard deviation and
mean, and is calculated with the following equation:
$$CV = \frac{s}{M}$$
where s is the sample standard deviation and M is the sample mean. Standard deviation measures the
amount of variation around the mean, where larger standard deviation values equate to greater variation.
Because CV is directly proportional to the standard deviation, a lower CV indicates higher precision
(less variation around the mean). Moreover, the standard deviation of an accuracy estimate shrinks as
sample size increases if all other variables are held constant. Thus, CV also
decreases as sample size increases, resulting in higher precision. In the following section, I outline lessons
learned from previous applications of accuracy assessments on vegetation maps.
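The relationship between CV and sample size described above can be sketched with a small simulation. This is a simplified illustration rather than the ArcGIS procedure used in this study: it assumes independent errors, a hypothetical true accuracy of 0.9, and 300 repeated assessments per sample size.

```python
import random
import statistics

def coefficient_of_variation(estimates):
    """CV = s / M: sample standard deviation divided by the sample mean."""
    return statistics.stdev(estimates) / statistics.mean(estimates)

def simulate_estimates(true_accuracy, sample_size, iterations, rng):
    """Repeat a simple random sample `iterations` times; each sampled unit is
    correct with probability `true_accuracy` (independent errors assumed)."""
    estimates = []
    for _ in range(iterations):
        correct = sum(rng.random() < true_accuracy for _ in range(sample_size))
        estimates.append(correct / sample_size)
    return estimates

rng = random.Random(42)  # fixed seed so the sketch is repeatable
cv_by_size = {
    n: coefficient_of_variation(simulate_estimates(0.9, n, 300, rng))
    for n in (100, 500)
}
print(cv_by_size)  # the CV for n = 500 should be smaller than for n = 100
```

Under these assumptions the spread of the estimates falls roughly with the square root of the sample size, so a fivefold increase in sample size reduces the CV by a little more than half.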
Vegetation Map Accuracy Assessment Considerations
While all vegetation maps should ideally be 100% accurate, many researchers have found that
reliable vegetation classification maps should contain no more than 15% classification error
(Edwards Jr. et al. 1998). To determine the degree of error in a vegetation map, researching organizations
must use field sampling, high-accuracy reference maps, or a combination of the two to determine whether
spatial entities within the vegetation map are correctly classified. When no high-accuracy reference map is
available, researchers must rely on field sampling data to assess the accuracy of the vegetation map.
Researchers must take into account a variety of considerations prior to conducting field sampling,
including desired metrics of map accuracy, map usage by end-users, issues with map creation
methodology and its potential for error, and the cost of accuracy assessments. The solutions to these
issues are not straightforward, and are often interconnected. For example, overall map accuracy is an
important metric to investigate because it provides the end-user with a general understanding of the map’s
reliability. End-users who want to use the map for a project that concerns only one group of plant species
may only be interested in the accuracy of a certain classification within the vegetation map.
The individual classification accuracy can be vastly different from the overall vegetation map
accuracy if the map makers or map-making algorithm are not as adept at identifying that individual
classification relative to their ability to identify other classification types. While many accuracy
assessments in the literature report overall vegetation map accuracy, many do not report individual
classification error (Foody 2002), even though most accuracy assessments that do report individual
classification accuracies show unequal accuracy between classification types (Foody 2002). Accuracy
assessors must identify potential end-users, and consider the priorities of the end-user prior to conducting
an accuracy assessment.
Usage by the end-user can dictate many of the decisions the accuracy assessors must make. Not
only might end-users be concerned with individual classification accuracy, different end-users may be
interested in different scales of accuracy. A state agency that is interested in broad, statewide planning
will only require accuracy assessments with a large-scale sampling design. End-users that plan on using
the vegetation map for more local needs would, on the other hand, be interested in accuracy assessments that
make use of a small-scale sampling design. The accuracy assessor’s choice of sampling design and sample
size should reflect the map’s potential intended uses. Sampling designs such as the Stratified Random
Sample are specialized to address the needs of small-scale end-users, while other strategies such as the
Simple Random Sample (SRS) can meet the requirements of multiple-scale users provided that the
assessors use a large sample size (Edwards Jr. et al. 1998).
Researchers have suggested several different designs to assess classification accuracy at a
regional scale. In choosing a sampling design, several major criteria should be met to ensure that the
sampling design meets the requirements of statistical, scientific and geographic rigor. In general, good
sampling designs must: satisfy the requirements of a probability sampling design (that is, ensuring that
the chosen sampling design represents a large enough percentage of the population to attain a meaningful
and truthful estimate of precision for the accuracy assessment), be practical and cost effective, be spatially
well distributed, and have low variability of accuracy estimates (Stehman 2009). The designs that
are examined in this study – Simple Random Sampling (SRS) and Clustered Random Sampling (CRS) –
each have their own unique strengths and weaknesses. This study evaluates the performance of these
sampling strategies by utilizing computer simulations that apply the sampling strategies to a vegetation
map with known error rates and spatial distributions.
Sampling Designs
The Simple Random Sample (SRS) is a sampling design that ensures true randomness in
sampling units, and thus is one of the most statistically valid choices for a sampling design (Fortin 1989;
Roleck et al. 2007; Chen & Wei 2009; Stehman 2009). SRS describes a sampling design that selects map
features (polygon, raster cell, or pixel) at random to be sampled and verified for accuracy. Sample size for
SRS can be estimated through power analysis to achieve the desired precision for accuracy estimates
(Stehman 1999). Because SRS is unbiased, SRS is a good overall estimator of accuracy across all cover in
a study area (Roleck et al. 2007; Chen & Wei 2009).
Although SRS reliably estimates accuracy over an entire study area, SRS is inherently aspatial,
which can lead to bias in an accuracy assessment (Roleck et al. 2007), especially in accuracy estimates on
the cover classification level. While patterns such as spatial autocorrelation of spatial entities have not
been found to have a strong effect on the efficacy of SRS (Chen and Wei 2009), the proportion of
classifications does have an effect on sampling strategy efficacy. SRS performs
differently when the relative proportions of two or more classifications are unequal than when
classifications are equal in abundance (Chen and Wei 2009).
Relative proportion of cover classifications can be understood in two ways. First, given map
features of uniform size, such as in raster-based maps, a certain cover class may be used to classify a
smaller percentage of map features relative to other cover classes. In this case, the cover class is rare
because it is assigned to a small number of map features relative to the total number of map features and
therefore represents a smaller proportion of the study area.
The second way to understand relative proportion of cover classification is when map features are
not uniform in size, and a cover class is identified with map features that are disproportionately large in
area relative to the number of map features identified as that cover classification. Figure 2 displays an
example taken from the 2012 SANDAG vegetation map. In Figure 2, the red highlighted polygons
represent all of the map features identified as the vegetation community “Adenostoma fasciculatum –
Ceanothus verrucosus association (AF-CV association)” within the map extent. Although there are over
400 polygons depicted in Figure 2, the four polygons identified as AF-CV association represent nearly
20% of the map’s area. These same four polygons represent less than 1% of the total number of polygons
found in the map extent. While it’s important to acknowledge both forms of potential bias caused by
differences in relative proportions of cover classifications, this study only examines the effect of spatial
pattern on sampling strategy efficacy in a raster-based map.
Bias in accuracy estimates provided by SRS results when either of these two forms of uneven relative
proportion is present in a map. If a random sample of polygons is selected from Figure 2, for example,
it is improbable that any of the four polygons identified as “AF-CV association” will be selected,
particularly when the sample is small relative to the number of polygons.
Leaving out a cover classification for assessment that represents nearly 20%
of total area in a study area can bias accuracy estimates in unpredictable ways, and violates one of the
criteria for good sampling designs defined above.
Figure 2 - Example of cover classification disproportionate in size to abundance. Four polygons, highlighted in red,
represent nearly 20% of map area, but less than 1% of total number of polygons.
When accuracy assessments are conducted at individual sampling points, rather than on entire
polygon scales, the inverse problem may occur. With random sampling points at a large sample size, we
might expect that approximately 20% of all sampling points will lie within the “AF-CV association”.
While an SRS of points will have good spatial representation, the sampling design may miss
classifications that might be more numerous in number of map features but less abundant in spatial
coverage. As a result, data collected from SRS designs may have an overrepresentation of spatially large,
numerically small classifications.
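The contrast between polygon-based and point-based SRS can be made concrete with a short calculation. The figures used here are round numbers assumed from the description of Figure 2 (400 polygons, four of them in the class, 20% of the map area), and the function names are hypothetical.

```python
from math import comb

def prob_class_missed(total_polygons, class_polygons, sample_size):
    """Probability that a simple random sample of polygons (drawn without
    replacement) contains none of the polygons belonging to one class."""
    return comb(total_polygons - class_polygons, sample_size) / comb(total_polygons, sample_size)

def expected_points_in_class(area_fraction, n_points):
    """Expected number of uniformly random points that land inside the class."""
    return area_fraction * n_points

# Assumed round numbers: 400 polygons, 4 in the class, class covers 20% of the area.
for n in (10, 50, 100):
    missed = prob_class_missed(400, 4, n)
    points = expected_points_in_class(0.20, n)
    print(f"n={n}: P(class missed by polygon SRS)={missed:.3f}, expected points in class={points:.1f}")
```

The sketch shows that the chance of missing the class under polygon-based SRS depends on how large the sample is relative to the number of polygons, while a point-based sample places points in the class roughly in proportion to its area.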
Clustered Random Sampling (CRS) is a sampling design that attempts to address some of the
spatial issues in sampling design. CRS is a strategy that randomly selects a map feature to be sampled as a
primary sampling unit. A fixed number of map features adjacent to the primary sampling units are then
sampled as secondary sampling units. One of the advantages of this approach is that a more thorough and
representative spatial distribution may be attained by sampling adjacent map features. In the example of
Figure 2, because the “AF-CV association” represents nearly 20% of the map’s area, there is a higher
chance that one of the “AF-CV association” polygons will be chosen as a secondary sampling unit than
one being selected under SRS; there are several polygons adjacent to the “AF-CV association” polygons.
With a point-based CRS sampling design, CRS may avoid over-representing spatially large,
numerically uncommon classifications by taking sampling points adjacent to one another if classifications
with small spatial coverage tend to clump together. A CRS may capture information from several
spatially rare classifications in one clustered sampling event, alleviating potential bias from SRS designs.
The disadvantage of CRS is that, while CRS addresses some spatial issues, the statistical validity
of the design may be in question due to its inherent lack of randomness. The accuracy estimates may be
statistically biased as a result.
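The two designs can be sketched on a raster grid as follows. This is an illustrative sketch, not the ArcGIS workflow used in this study: the grid dimensions, the use of the eight surrounding cells as "adjacent" features, and the function names are all assumptions.

```python
import random

def simple_random_sample(n_rows, n_cols, n_cells, rng):
    """Select n_cells distinct raster cells uniformly at random."""
    all_cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return rng.sample(all_cells, n_cells)

def clustered_random_sample(n_rows, n_cols, n_clusters, n_secondary, rng):
    """Select random primary cells, then n_secondary of each primary's
    8-connected neighbours as secondary sampling units."""
    selected = set()
    primaries = simple_random_sample(n_rows, n_cols, n_clusters, rng)
    for r, c in primaries:
        neighbours = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
        ]
        # Edge cells have fewer neighbours, so cap the request.
        secondaries = rng.sample(neighbours, min(n_secondary, len(neighbours)))
        selected.add((r, c))
        selected.update(secondaries)
    return sorted(selected)

rng = random.Random(7)  # fixed seed so the sketch is repeatable
srs_cells = simple_random_sample(50, 50, 100, rng)
crs_cells = clustered_random_sample(50, 50, 20, 4, rng)
print(len(srs_cells), len(crs_cells))
```

Because neighbouring clusters can overlap, the clustered sample may contain fewer distinct cells than the nominal target of 20 clusters of 5 cells, which is one reason realized and target sample sizes can differ under a clustered design.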
There is no single recommendation that can be made for sampling designs. Each sampling design
must be context specific, and depend on a combination of statistical rigor, ecological reality and spatial
sensitivity. Spatial behavior must be considered when evaluating the efficacy of accuracy assessment
sampling design. In particular, natural phenomena such as vegetation distributions tend to be spatially
autocorrelated (Legendre 1993).
Spatial autocorrelation
Spatial autocorrelation is used to describe data where data points that are close to one another in
spatial proximity are more similar in value than what would be expected under a perfectly random
distribution of data points. Spatial autocorrelation can often be observed in natural phenomena. Mountain
elevations are for example spatially autocorrelated. The elevation at any given point on a mountain is
likely to be similar in value to elevation at points within a few meters. As points are observed further
away from the original point, the elevation values should become increasingly dissimilar.
Spatial autocorrelation predicts that two points that are a large distance (relative to the nature of the
dataset) away from one another should contain values that are less similar than points that are a small
distance away from one another.
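One common index of spatial autocorrelation is Moran's I. It is used here purely as an illustration; this section does not name the statistic the study relies on. The sketch below assumes a small raster in which 1 marks a misclassified cell and 0 a correct one, and treats only edge-sharing (rook) cells as neighbours.

```python
def morans_i(grid):
    """Global Moran's I for a 2-D grid using rook (edge-sharing) neighbours,
    with each neighbour pair given a weight of 1."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant grid")
    cross, weight_sum = 0.0, 0
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (len(values) / weight_sum) * (cross / denom)

# Illustrative 4 x 4 grids: 1 marks a misclassified cell, 0 a correct one.
clustered = [[0, 0, 1, 1] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(morans_i(clustered), morans_i(checkerboard))
```

Values near +1 indicate that similar cells sit next to one another (clustered error), values near -1 indicate that neighbouring cells tend to differ, and values near 0 are what a random arrangement would be expected to produce.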
Most traditional statistical tests assume all data within a dataset are independent from one
another. The presence of spatial autocorrelation in ecological data may interfere with the ability to
reliably interpret statistical analyses (Legendre 1993). While statisticians in the late 1970s argued that
sampling designs inherently address potential problems spatial autocorrelation introduces (Cochran 1977;
Green 1979), more recent research has revealed that sampling design performance can vary when spatial
autocorrelation is present (Fortin et al. 1989; Lichstein et al. 2002; Chen and Wei 2009). More work
needs to be done to understand the behavior of sampling strategies under varying spatial patterns and
proportional abundance of the spatial entities classified as the error of the map.
This study examined the use of Esri’s ArcGIS desktop software as a tool to evaluate the efficacy of
two sampling strategies in capturing the true error rate of a vegetation map. While work has been done to
understand how different sampling strategies behave under different distributions of error, few studies
have shown how these sampling distributions behave with realistic distributions of vegetation
classifications. Moreover, few studies have shown the efficacy of sampling strategies in revealing the
patterns of error distribution. Specifically, this study was designed to answer the following questions:
1. In what ways does local spatial autocorrelation affect the ability of SRS and CRS to capture
vegetation map accuracy?
2. In what ways does varying error rate affect the ability of SRS and CRS to capture vegetation map
accuracy?
3. What sampling size is necessary to capture reliable estimates of vegetation map accuracy given
different map error rates and spatial autocorrelation?
CHAPTER TWO: METHODS
This study used a four-phase procedure, performed in Esri’s ArcGIS software, that was designed to simulate
sampling strategies applied to a raster-based vegetation map with known rates of error. Through the
simulations, I sought to show how map error rate, spatial distribution of map error, and sample size affect
the performance of the SRS and CRS sampling designs. As shown in Figure 3, the study compared the
performance of each sampling design given four rates of map error (10%, 20%, 50% and 80%), two patterns of
error distribution (random and clustered) and four sample sizes (100, 200, 300 and 500).
Figure 3 - Experimental Design. Study explores a 4 x 4 x 2 x 2 design to explain sampling strategy performance. The
independent variables are: Sampling Strategy, Sample Size, Error Rate and Error Distribution. SRS – Simple Random
Sample; CRS – Clustered Random Sample; RE – Random Error Distribution; CE – Clustered Error Distribution.
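The factorial design can be enumerated directly. The following is a minimal Python sketch, not part of the ArcGIS models; the constant and function names are illustrative assumptions, while the factor levels are those listed in the text.

```python
from itertools import product

# Factor levels as listed in the text and the Figure 3 caption.
SAMPLING_DESIGNS = ("SRS", "CRS")
SAMPLE_SIZES = (100, 200, 300, 500)
ERROR_RATES = (0.10, 0.20, 0.50, 0.80)
ERROR_DISTRIBUTIONS = ("RE", "CE")  # random error, clustered error


def experimental_permutations():
    """Return every combination of design, sample size, error rate and error pattern."""
    return list(product(SAMPLING_DESIGNS, SAMPLE_SIZES, ERROR_RATES, ERROR_DISTRIBUTIONS))


permutations = experimental_permutations()
print(len(permutations))  # 2 * 4 * 4 * 2 = 64
```

Each tuple corresponds to one experimental permutation, which the study repeats 500 times.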
Using the ModelBuilder application in ArcGIS 10.2, I developed four models composed of
different ArcGIS tools (Figure 4). Each model addressed one step of the simulation’s procedure:
(1) Create one vegetation base map for use in all iterations of the simulation
(2) Construct error into the vegetation base map at known rates and spatial distributions
(3) Create sampling points for accuracy assessment according to two sampling strategy protocols
(4) Combine the information produced in the first three steps to determine the percentage of
constructed error detected by the sampling at the simulated points within the vegetation
map’s spatial extent.
Figure 4 - Conceptual design of ArcGIS simulation used in this study.
Each of the first three models of the procedure created raster datasets, which were consolidated in
the final (4th) model of the procedure. During consolidation, the model first compared the sampling
locations designated in the sampling strategy datasets with the vegetation base map’s extent. All sampling
locations outside of the extent were discarded, keeping only those that were within vegetation classified
on the map. Next, the model sampled the error base map at the remaining sampling locations. The model
assumed that all raster cells sampled in the accuracy assessment were correctly assessed. The model
distinguished whether there was an error at each sampling location. Lastly, the model calculated the
estimated map accuracy based on the information that the sampling strategy collected using the following
equation:
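\[ \text{Estimated map accuracy} = \frac{\text{number of sampled locations correctly classified}}{\text{number of sampled locations correctly classified} + \text{number of sampled locations incorrectly classified}} \]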
In the following sections, I describe the San Diego vegetation map used and describe in detail the
development and utilization of the four models described above.
SANDAG’s 2012 San Diego County Vegetation Map
The San Diego Association of Governments (SANDAG) is an interdepartmental public agency
that works together with San Diego County and the county’s 18 city governments to tackle problems that
involve two or more governmental entities. When first assembled, SANDAG was tasked with sorting out
and providing grants for inherently inter-departmental traffic and transportation issues in San Diego
County. In recent years, SANDAG has evolved to encompass and address issues of land use and regional
growth, airport access, the local economy and safety. Moreover, SANDAG has increasingly placed
emphasis on environmental considerations in their planning to match the overall attitude of the State of
California towards environmental concerns.
The vegetation map of San Diego County published by SANDAG was collected by AECOM, an
international contracting company that is involved in several environmentally related projects throughout
San Diego County. AECOM, in partnership with the California Department of Fish and Game (CDFG),
developed a classification system for vegetation monitoring in San Diego that identifies vegetation
communities into “alliances” and “associations” (Sproul et al. 2011). In a press release, AECOM states
that their vegetation classification system coupled with digital data collection, “reduces time and potential
human error from transcribing field data forms into a database for analysis; saves… an estimated 250
labor hours for data form transcription at a cost savings of $25,000 to $30,000; [and] provides for an
updated and more accurate vegetation map that will facilitate regional species and habitat management”.
AECOM advertises that their maps will be used for real life conservation habitat management.
The map produced, however, is problematic due to poor protocol execution. Rather than field
sampling to create the map with the new classifications, AECOM used photo-interpretation from digital
images downloaded from Bing Maps to create and classify polygons. As a result of digitizing polygons
from low-resolution photo-imagery, peculiar artifacts exist within the data. For example, polygons as
small as 16 square feet in area, well below the reported 1-hectare minimum mapping unit for this dataset,
were found within the SANDAG vegetation map (Figure 5). Small polygons, as well as polygons with
widths smaller than the minimum mapping unit, may suffer from classification error.
Figure 5 – Polygon from SANDAG 2012 Vegetation Map 16 square feet in area. Spatial data downloaded from the San
Diego Geographic Information Source (SanGIS).
AECOM has reported an accuracy rate for the map of 80%, though it did not specify whether the
accuracy referred to classification or positional error. Furthermore, the sampling process used for
AECOM’s in-house accuracy assessment has not been made public. The lack of transparency from
AECOM in map construction and assessment, coupled with data artifacts and inconsistent dataset rules,
highlights the need to perform a statistically rigorous and spatially sensitive accuracy assessment.
Construction of the Vegetation Base Map, Error, and Sampling Data Sets
Two sets of raster base maps were created upon which the sampling strategies were applied: the
vegetation base map and eight error base maps. In a raster environment, cell values must be numerical to
be eligible for analysis and manipulation. Thus, information from the base maps such as the extent of the
vegetation and the presence or absence of error was numerically coded for analysis. I represented the
different types of information found in the vegetation base map, error base maps, and sampling strategy
dataset raster cell values as numerical values of different orders of magnitude (Table 1). In the following
sections, I outline the construction of the vegetation base map and the error base maps.
Table 1 – Numerical codes used in the ArcGIS computer simulation to designate vegetation assessment status in cells.
Dataset               Desired Cell Designation    Numerical Code
Vegetation Basemap    Within Study Area           100
Vegetation Basemap    Beyond Study Area           0
Error Basemaps        With Error                  1
Error Basemaps        Without Error               0
Sampling Dataset      Sampled                     10
Sampling Dataset      Not Sampled                 5
Vegetation Base Map
In this study, the vegetation base map served as a realistic spatial extent on which a region-scale
vegetation map accuracy assessment might take place. Using an actual vegetation map to provide the
spatial extent of the study area provides realistic vegetation community shapes. I obtained the vegetation
map of San Diego County from SANDAG’s spatial data warehouse ( www.sangis.com). The vegetation
map is a vector-based classification map that describes each map unit (polygons) with a hierarchical
vegetation classification system developed by AECOM.
To prepare the vegetation base map, the vector-based SANDAG vegetation map was converted
into a raster dataset in ArcGIS with the Polygon to Raster tool. Raster cells were set at 10,000 sq. ft. in
area to preserve realistic expectations for remote-sensed vegetation map cell-size scales (Stehman 2009)
without losing too much detail from the original vector dataset. I then homogenized the values of the
vegetation map raster cells by assigning a value of 100 to all the cells, indicating the total extent of
vegetation (Figure 6).
Figure 6 - Northern San Diego County vegetation base map example of raster-converted dataset.
Gray cells represent vegetation within study area extent.
Error Base Maps
I created error maps to mimic misclassified raster cells generated by map producers during map
construction processes. I created error base maps that contained two different types of spatial patterns:
random and clustered. The clustered error base maps contained simplified patterns of spatial
autocorrelation that were limited mostly by the square cells used in the construction of the clustered base
maps.
Eight error base maps were created, one for each combination of four error rates (10%, 20%, 50% and 80%)
and two spatial patterns of error (random and clumped). Each raster was positioned to the same
spatial extent and alignment as the vegetation base map raster. Error map cells were binary; each cell
was classified as either containing error or not containing error. Error base map cells labeled
“error” that overlap spatially with classified cells on the vegetation base map are considered to be
misclassified vegetation information in this simulation.
Depending on the quality of the data collection, data entry and data quality assurance phases of
real life map construction, actual vegetation classification maps can have different rates and spatial
distribution of error. For this simulation, I chose four error rates to test – 10% error, 20% error, 50% error
and 80% error – and two types of spatial distribution of error – random distribution and clumped
distribution. Error may be distributed randomly if errors in a map making process are independent of map
producer bias. Error can be distributed in a clumped, or spatially auto-correlated, manner if map qualities
influence a map producer’s ability to identify a group of cells, rather than individual cells. I combined
the four error rates with the two error distributions to create eight error base maps
in total.
I constructed the error maps in ArcGIS. To construct error maps with randomly distributed error, I
first generated four raster datasets with randomly assigned cell values ranging from 0 to 1 and cell sizes
of 10,000 sq. ft. in area using the Create Random Raster tool. I then reclassified the cell values of each
random raster dataset with the Reclassify Raster tool such that the percentage of cells designated as
containing error matched the desired rate of error. To obtain an error map with 10% error, I set the reclassification
parameters to reclassify all cell values between 0 and 0.1 generated by the Create Random Raster tool to a cell value
of 1, representing the presence of error. I reclassified all other cell values (from 0.100001 to 1) to a cell
value of 0, representing the absence of error. I repeated the process with new random raster
datasets to create the 20% error, 50% error and 80% error maps, reclassifying cells with random
values between 0 and 0.2, 0 and 0.5, and 0 and 0.8, respectively, to a cell value of 1. The process
resulted in four error maps with cells designated as containing error that were randomly distributed and
present at rates of 10%, 20%, 50% or 80% of the total cells in the raster datasets (Figure 7). When
the extent of the error maps was limited by the extent of the vegetation map layer, the percentage of cells
designated as containing error within the vegetation map extent still matched the target rates of 10%,
20%, 50% and 80% of the total cells within the vegetation map extent (accurate within 0.00001% for all
maps).
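The threshold-and-reclassify step described above can be illustrated outside ArcGIS. This is a minimal Python sketch under stated assumptions: it uses Python's `random` module in place of the Create Random Raster tool, a small grid in place of the county-wide raster, and hypothetical function names.

```python
import random


def random_error_map(n_rows, n_cols, error_rate, seed=None):
    """Return a grid of 0/1 codes in which 1 marks a cell containing error."""
    rng = random.Random(seed)
    # Each cell draws a uniform value in [0, 1); values below the target
    # rate are reclassified to 1 (error), all others to 0 (no error).
    return [[1 if rng.random() < error_rate else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def realized_rate(grid):
    """Proportion of cells in the grid that are coded as error."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)


grid = random_error_map(200, 200, 0.10, seed=1)
print(round(realized_rate(grid), 3))  # close to, but not exactly, 0.10
```

On a grid this small the realized rate fluctuates around the target; the much larger rasters used in the study would be expected to land far closer to it.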
Figure 7 - Example of a 10% random error map generated by the simulation.
To create error maps with clumped error distribution, I again generated raster datasets with
randomly assigned cell values. However, rather than creating cells all 10,000 sq. ft. in area, I created three
component raster datasets with raster cell sizes of 10,000 sq. ft., 40,000 sq. ft. and 90,000 sq. ft for each
of the desired error maps (four in total: one for each rate of error). To make the 40,000 sq. ft. and 90,000
sq. ft. datasets compatible in analysis with the 10,000 sq. ft. dataset, the cells from the larger-celled
datasets needed to be split so that they would be composed of multiple 10,000 sq ft. cells. To split the
cells into 2x2 and 3x3 (40,000 and 90,000 sq. ft.), I used the Resample tool with the 10,000 sq. ft. raster
dataset as the cell size template. The three component error maps were then reclassified to contain cells
designated with error present such that the sum of the percentage of cells designated as containing error
between the three error maps was equivalent to the desired error rate of the map (reclassification
parameters reported in Table 1). All cells that were designated as containing error were reclassified to a
value of 1. I consolidated the three component error maps using the sum parameter of the Cell Statistics
tool. All cells with values greater than 0 were cells designated as containing error, while cells with a value
of 0 were designated as containing no error. Finally, I reclassified the consolidated map such that all cells
with values greater than 1 were reclassified to have a value of 1. The resulting error map contained cells
designated with error that are locally clumped and present at rates of either 10%, 20%, 50% or 80% of the
total cells in the raster datasets (Figure 8). As with the randomly distributed error maps, when the
extent of the clumped error maps was limited by the extent of the vegetation map layer, the percentage of
cells designated as containing error within the vegetation map extent still matched the target rates of 10%,
20%, 50% and 80% of the total cells within the vegetation map extent (accurate within 0.001% for each
clustered error map).
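The construction of a clumped error map from 1x1, 2x2 and 3x3 components can be sketched as follows. This is an illustrative Python approximation, not the ArcGIS workflow itself: the function names are hypothetical, block repetition stands in for the Resample tool, and the component rates passed in are placeholders because the actual reclassification parameters are not reproduced here.

```python
import random


def block_raster(n_rows, n_cols, block, rate, rng):
    """Binary grid whose error flags are drawn once per block x block group of cells."""
    n_block_rows = -(-n_rows // block)  # ceiling division
    n_block_cols = -(-n_cols // block)
    coarse = [[1 if rng.random() < rate else 0 for _ in range(n_block_cols)]
              for _ in range(n_block_rows)]
    # Split each coarse cell into block x block single cells by repeating its value.
    return [[coarse[r // block][c // block] for c in range(n_cols)]
            for r in range(n_rows)]


def clustered_error_map(n_rows, n_cols, component_rates, seed=None):
    """Sum 1x1, 2x2 and 3x3 component maps, then reclassify sums above 0 to 1."""
    rng = random.Random(seed)
    components = [block_raster(n_rows, n_cols, block, rate, rng)
                  for block, rate in zip((1, 2, 3), component_rates)]
    return [[1 if sum(comp[r][c] for comp in components) > 0 else 0
             for c in range(n_cols)]
            for r in range(n_rows)]


# Placeholder component rates for illustration only.
grid = clustered_error_map(60, 60, (0.03, 0.03, 0.04), seed=7)
print(sum(map(sum, grid)) / (60 * 60))
```

Because the three components can flag the same cell, the realized error rate in this sketch falls somewhat below the sum of the component rates, so the component rates would need tuning to hit a target rate exactly.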
Figure 8 - Example of a 10% clustered error map generated by the simulation.
Sampling Strategy Raster Datasets for Accuracy Assessment
The sampling strategy raster datasets were used to indicate where, within the vegetation base map
extent, the simulation should sample for accuracy. Raster cells designated as sample locations indicate to
the simulation that the location should be sampled. The sampling strategy raster datasets were the final
datasets created prior to consolidation and analysis. I created eight different types of sampling strategy
raster datasets, one for each permutation of sampling design (SRS or CRS) and sample size (100, 200,
300, or 500). The sampling strategy raster sets were positioned and aligned to the same spatial extent as
the vegetation raster base maps. Raster cells in the sampling strategy datasets were designated as
sampling locations by following one of two sampling strategies. When juxtaposed with the vegetation
raster base map and the error raster base maps, the sampling strategy raster datasets indicate which
overlapping cells from the error base maps within the vegetation base map extent are sampled for
accuracy assessment in the simulation.
I created 500 randomly generated raster dataset iterations (through the procedures described below)
for each of the eight different types of sampling strategy raster datasets, for a total of
4,000 dataset iterations. One dataset iteration represents a single accuracy assessment simulation. That is,
each dataset iteration simulates an accuracy assessor conducting a single accuracy assessment utilizing a
sampling strategy defined by a specific sampling design and size. For example, I created 500 data
iterations of a sampling strategy that uses SRS with sample size 100. Because each iteration was
generated independent of any other iteration, and the cells designated to be sampled were randomly
assigned within each iteration, any individual data iteration within those 500 is distinct from all other
iterations. One of these iterations simulates an accuracy assessor conducting an assessment using a
sampling strategy that utilizes SRS with a sample size of 100. Collectively, the 500 iterations would then
simulate a situation where an assessor conducts 500 individual assessments with an SRS design and a
sample size of 100. Repeating each of the eight sampling strategy types 500 times allows us to understand
the repeatability of accuracy estimates derived from the sampling strategies.
The SRS sampling strategy datasets were designed such that single cells within the vegetation
study area extent were selected at random to be sampled for error. The SRS sampling strategy datasets
were created in similar fashion as the random error base map datasets. To construct the SRS datasets, I
first used the Create Random Raster tool to construct a raster dataset with cells 10,000 sq. ft. in size and
that had a random value between 0 and 1. I then reclassified a percentage of cells as “sampled” and “not
sampled” (coded numerically as “10” and “5”, respectively). The Create Random Raster tool creates a
rectangular dataset that extends beyond the vegetation study area extent (Figure 10). Because of this, the
percentage of cells reclassified as “sampled” was designed to take into account not only the total number
of cells created by the Create Random Raster tool, but also the probability of reclassified cells falling
beyond the vegetation study area extent. The percentages of cells reclassified as “sampled” were chosen
such that the resulting number of cells classified as “sampled” within the vegetation map study area extent
would be roughly equivalent to desired sample sizes. For example, for all sampling strategy datasets
designed to have a sample size of 100 samples, I reclassified all cells from the random raster dataset that
had a value between 0 and 4.4461 x 10^-5 as “sampled”. The range of 0 to 4.4461 x 10^-5 was derived as a
function of the total number of cells in the original random raster dataset, the sample size, and the
percentage of cells falling within the vegetation study area extent. The equation used to determine the upper bound of the range is as follows:
This equation produced a number of reclassified cells that, on average, was
equivalent to the desired sample size within the vegetation study area extent. Variation in actual sample size
around the desired sample size targets was observed between simulation iterations, and analysis of the
potential effect of variation in sample size is discussed in Chapter 6.
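The upper-bound equation itself is not reproduced in this text. The Python sketch below shows one form that is consistent with the three inputs named above (total cell count, target sample size, and share of cells inside the extent). The formula, the function name and the example raster dimensions are all assumptions made for illustration, not the author's stated equation.

```python
def sampled_upper_bound(target_sample_size, total_cells, fraction_in_extent):
    """Upper bound u such that cells with a random value in [0, u) are 'sampled'.

    Assumed relationship: expected in-extent samples
        = u * total_cells * fraction_in_extent.
    """
    cells_in_extent = total_cells * fraction_in_extent
    if cells_in_extent <= 0:
        raise ValueError("the extent must contain at least one cell")
    return target_sample_size / cells_in_extent


# Hypothetical raster: 5,000,000 cells, 45% of which fall within the extent.
u = sampled_upper_bound(100, 5_000_000, 0.45)
print(u)
print(u * 5_000_000 * 0.45)  # expected number of in-extent samples, about 100
```

Under this assumption the number of in-extent samples in any one iteration is random, which is in line with the iteration-to-iteration variation in sample size noted above.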
The CRS sampling datasets were designed to designate clusters of nine cells patterned in a 3x3
square as “sampled.” The CRS sampling datasets were created by again using the Create Random
Raster tool to generate a raster with cells 90,000 sq. ft. in size with randomly assigned values. The larger
cells were then reclassified as “sampled” and “not sampled” following the numerical coding (Table
1). The upper bound for reclassification was determined using the same equation described above. Next, the
90,000 sq. ft. cells were broken into smaller 10,000 sq. ft. cells using the Resample tool.
Consolidation of Datasets
Once the vegetation base map, error base maps and the sampling datasets were generated, I
consolidated the information of these datasets by summing the numerical codes of spatially overlapping
cells from the three datasets. The sum of the numerical codes indicated different types of information. If,
for example, the sum of cell values for three overlapping cells from the three datasets was 105, that
spatial location was within the vegetation extent, but did not contain error and was not sampled (Figure
9). Five numerical codes based on cell value sums were relevant to final analysis (Table 2). The
consolidation of datasets through summation was repeated for each of the 500 simulation iterations
generated for each of 64 experimental permutations designed for this study (Figure 3).
Table 2 - Sum of numerical codes and corresponding vegetation assessment status. Each cell is defined by 3 binary
categories: whether the cell is in the study area extent, whether the cell is assigned error, and whether the cell is sampled.
Numerical Code    Within Extent?    Contains Error?    Sampled?
0                 No                No                 No
105               Yes               No                 No
106               Yes               Yes                No
110               Yes               No                 Yes
111               Yes               Yes                Yes
The important numerical codes for analysis of estimated map accuracy were “110” – locations
that were within the extent and were sampled, but contained no error – and “111” – locations that were
within the extent, were sampled, and contained error.
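After consolidation, I calculated the estimated map accuracy of each sampling iteration. The
estimated map error is a function of the total number of cells sampled within the vegetation extent and the
number of errors caught by the sampling within the study area extent. The estimated map accuracy was
calculated using the following equation:
\[ \text{Estimated map accuracy} = 1 - \frac{n_{111}}{n_{110} + n_{111}} = \frac{n_{110}}{n_{110} + n_{111}} \]
where n_110 and n_111 are the numbers of cells with consolidated codes 110 and 111, respectively.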
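The consolidation and accuracy calculation can be sketched in a few lines of Python. This is an illustration under assumptions rather than the ArcGIS model: the function names are hypothetical, the within-extent code of 100 is taken from the cell sums in Table 2, and cells beyond the extent are represented as `None` (a stand-in for NoData) that consolidates to 0.

```python
from collections import Counter

# Codes implied by the cell sums in Table 2 (105 = 100 + 0 + 5, and so on).
IN_EXTENT = 100
ERROR, NO_ERROR = 1, 0
SAMPLED, NOT_SAMPLED = 10, 5


def consolidate(extent, error, sampling):
    """Sum the three codes cell by cell, returning one grid of code sums."""
    return [[0 if ext is None else ext + err + smp
             for ext, err, smp in zip(ext_row, err_row, smp_row)]
            for ext_row, err_row, smp_row in zip(extent, error, sampling)]


def estimated_accuracy(consolidated):
    """Share of sampled in-extent cells without error: n110 / (n110 + n111)."""
    counts = Counter(code for row in consolidated for code in row)
    n_sampled = counts[110] + counts[111]
    if n_sampled == 0:
        raise ValueError("no sampled cells fall within the extent")
    return counts[110] / n_sampled


extent = [[IN_EXTENT, IN_EXTENT, None], [IN_EXTENT, IN_EXTENT, IN_EXTENT]]
error = [[NO_ERROR, ERROR, ERROR], [NO_ERROR, ERROR, NO_ERROR]]
sampling = [[NOT_SAMPLED, NOT_SAMPLED, SAMPLED], [SAMPLED, SAMPLED, SAMPLED]]

codes = consolidate(extent, error, sampling)
print(codes)                      # [[105, 106, 0], [110, 111, 110]]
print(estimated_accuracy(codes))  # 2 of the 3 sampled in-extent cells are error-free
```

Using codes of different orders of magnitude means each sum maps back to exactly one combination of extent, error and sampling status.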
Figure 9 - Example of consolidation of vegetation base map, error base map, and sampling design datasets. Numerical
codes designate different vegetation assessment statuses. 105 = Within study area extent, no error, not sampled. 106 =
Within extent, error, not sampled. 110 = Within extent, no error, sampled. 111 = Within extent, error, sampled.
Validation of the ArcGIS Simulation
The Effect of Sample Size
The simulation used in this study targeted a sample size by reclassifying a percentage of the total
number of cells in a rectangular raster dataset. At low percentages relative to the entire raster dataset,
there was variation in the total number of actual cells reclassified as “sampled” within the vegetation
extent. As a result, individual simulation iteration sample sizes were not always identical to the sample
size targeted by the respective experimental permutation. However, the mean sample size of the 500
iterations within each of the eight sampling strategy types matched the target sample size for each
permutation (Figure 10). SRS simulation iterations had less variation in actual sample size than did CRS
simulation iterations. I hypothesized that the differing variation in sample size between the CRS and SRS
simulation iterations might have been driving the unusual pattern I observed in the coefficient of variation (CV)
of the accuracy estimates. To test the
hypothesis, I filtered the simulation dataset to remove all simulation iterations that did not have an actual
sample size within 10% of the target sample size. Filtering removed 27.9% of the iterations.
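The filtering rule can be expressed compactly. In this Python sketch the record layout (a tuple of actual sample size and estimated accuracy) and the function names are assumptions, as is the choice to treat an iteration exactly 10% from the target as retained.

```python
def within_tolerance(actual, target, tolerance=0.10):
    """True when the actual sample size is within `tolerance` of the target."""
    return abs(actual - target) <= tolerance * target


def filter_iterations(iterations, target, tolerance=0.10):
    """Keep (actual_sample_size, estimated_accuracy) records near the target size."""
    return [record for record in iterations
            if within_tolerance(record[0], target, tolerance)]


# Hypothetical iterations for a target sample size of 100.
iterations = [(89, 0.91), (95, 0.88), (104, 0.90), (111, 0.93)]
kept = filter_iterations(iterations, target=100)
print(kept)                             # [(95, 0.88), (104, 0.9)]
print(1 - len(kept) / len(iterations))  # share of iterations removed: 0.5
```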
Figure 10 - Actual sample size created by the simulation as a function of the target sample size for the simulation and
sampling design.
Testing the Pseudorandom Number Generator for Bias away from a Random Distribution
The simulation in this study used the Create Random Raster tool as the foundation for creating
all of the error and sampling datasets. The Create Random Raster tool uses the Rand.c function from
Microsoft Visual Studio 2008 to assign random values between 0 and 1 to each
cell. The Rand.c function uses a pseudorandom number generator that, according to Microsoft Support, is
not meant to produce high quality randomness at larger volumes. I hypothesized that the pattern I
observed in the CV might be due to bias in the random number generator. The random number generator
might introduce bias by over- or under-dispersing random numbers. Furthermore, perhaps the clustered
error datasets were not truly clustered.
To test for potential bias in the Create Random Raster tool, I clipped 500 x 500 cell subsets of
the Random Error and Clustered Error datasets. I examined deviations from randomness in these data
through correlograms. Correlograms visualize the correlation between cells within set spatial intervals of
one another. Positive coefficients in a correlogram represent positive correlations between cells within a
given distance, while a coefficient value of 0 represents no correlation. In this study, I examined the
correlations between cells at 1 cell length, or 100 feet.
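A simplified version of the correlogram check can be written in Python. This is a sketch under assumptions: it measures Pearson correlation only between horizontally offset cells, whereas the correlograms in this study may have been computed with a different statistic, and it uses small synthetic grids (one fully random, one built from 3x3 blocks) rather than the clipped error datasets. All function names are hypothetical.

```python
import random


def lag_correlation(grid, lag):
    """Pearson correlation between each cell and the cell `lag` columns to its right."""
    pairs = [(row[c], row[c + lag]) for row in grid for c in range(len(row) - lag)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


def binary_grid(size, block, rate, rng):
    """size x size grid of 0/1 values drawn once per block x block group of cells."""
    n_blocks = -(-size // block)  # ceiling division
    coarse = [[1 if rng.random() < rate else 0 for _ in range(n_blocks)]
              for _ in range(n_blocks)]
    return [[coarse[r // block][c // block] for c in range(size)] for r in range(size)]


rng = random.Random(0)
random_grid = binary_grid(120, 1, 0.5, rng)
clustered_grid = binary_grid(120, 3, 0.5, rng)
for lag in (1, 2, 3):
    print(lag,
          round(lag_correlation(random_grid, lag), 2),
          round(lag_correlation(clustered_grid, lag), 2))
```

For the block-built grid the correlation should sit near 2/3 at a lag of one cell, near 1/3 at two cells and near 0 at three cells, while the fully random grid should stay near 0 at every lag.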
CHAPTER THREE: RESULTS
All sampling permutations correctly estimated the percentage of map error over the course of 500 iterations,
regardless of sample size, sampling strategy, error distribution and error rate. When true map error is 10%, the average of each
sample size permutation predicts near 10% map error (Figure 11). Accuracy appears to increase slightly
with sample size, though not statistically significantly so. In the 10% Clustered Error – CRS experimental
permutation, where the differences in estimated accuracy between sample sizes are the most dramatic of any
experimental permutation, sample size is not a significant predictor of estimated accuracy (F = 0.311, p =
0.817).
Figure 11 - Simulation estimated accuracy as a function of error type, sampling strategy, and sample size. All graphs are
from experimental permutations with 10% true map error.
The estimates from the experimental permutations were similarly close to actual map error rates
across all error rates, irrespective of sample size, strategy or error distribution. Variation around the
mean estimates, however, changed dramatically with sample size, sampling strategy, error
distribution and error rate. That is, the probability that any single assessment would find the
actual amount of error varied widely by strategy. To quantify this variation around the mean, I calculated
the coefficient of variation (CV) of each experimental permutation (Figure 12).
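The CV of a set of repeated assessments is the standard deviation of the estimates divided by their mean. The Python sketch below illustrates the calculation on a simplified stand-in for the simulation: it assumes each sampled cell is independently in error with a fixed probability (the SRS on random error case only), and the function names are hypothetical.

```python
import random
import statistics


def simulate_srs_accuracy(error_rate, sample_size, iterations, seed=None):
    """Estimated accuracy from repeated simple random samples of independent cells."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        errors = sum(1 for _ in range(sample_size) if rng.random() < error_rate)
        estimates.append(1 - errors / sample_size)
    return estimates


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


for n in (100, 200, 300, 500):
    estimates = simulate_srs_accuracy(0.10, n, iterations=500, seed=n)
    print(n, round(statistics.mean(estimates), 3),
          round(coefficient_of_variation(estimates), 4))
```

The mean estimate stays close to the true accuracy at every sample size, while the CV shrinks as the sample size grows.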
Figure 12 - The Coefficient of Variation as a function of sample size, sampling strategy, error distribution and error rate.
Several repeatable trends appear in the CV of the experimental permutations. Across all
permutations, CV decreases as sample size increases. CV also seems to be sensitive to both sampling
strategy and error distribution. At all error rates, the Clustered Error – CRS permutation always has the
highest CV within the same sample size. The Random Error – SRS permutation always has the second
highest CV within the same sample size. The Random Error – CRS permutation always has the lowest
CV within the same sample size while the Clustered Error – SRS permutation always has an intermediate
CV between the Random Error – CRS and Random Error – SRS permutations.
Validating the ArcGIS Simulation
The Effect of Sample Size
After filtering out data points with sample sizes not within 10% of the target sample size, all
experimental permutations still estimated the correct map accuracy. Moreover, the patterns of CV did not
change even after filtration (Figure 13). Variation in sample size is thus not an important determining
factor in the unexpected CV patterns observed with the full dataset.
Figure 13 - Coefficient of Variation of simulation estimated accuracy in sample-sized filtered dataset as a function of
sample size, sample design, error distribution and error rate.
Testing the Pseudorandom Number Generator for Bias away from a Random Distribution
The clustered error dataset had fairly strong correlation within 100 feet, and moderate correlation
at 200 feet (Figure 14). By 300 feet, the relationship between cells was random. The results show that
within 300 feet, cells tend to be like one another, supporting the notion that the clustered error datasets
were indeed clustered.
Figure 14 - Correlogram of clustered error (red) and random error (blue) base maps. Coefficient value indicates
correlation between a cell's value and the value of a cell distance x away.
The random error dataset, by contrast, had coefficients near 0 regardless of distance. Had the
pseudorandom number generator created distributions that were overly dispersed, we should have
observed increasing correlation coefficients with increasing distance in the random error correlogram.
However, the correlation coefficients remained constant at 0 or near 0 with no discernible pattern. These
results show that the CV patterns observed did not originate from biases caused by the random number
generator.
CHAPTER FOUR: DISCUSSION
The results of this study show that, when repeated over 500 iterations, both SRS and CRS sampling
designs will on average correctly predict vegetation map accuracy. However, the precision of the
accuracy estimates was not constant across the experimental permutations. In this chapter, I
discuss the ramifications of the differences found in the coefficients of variation, and explore possible
future directions in sampling design efficacy research.
Estimated Accuracy and the Coefficient of Variation
Over 500 simulation iterations, both SRS and CRS sampling strategies correctly estimated the
true map accuracy regardless of error characteristics or sample size. Most accuracy assessors, however,
do not have the time or resources to implement a sampling strategy more than once. Because of these
limitations, precision is an important metric in order to understand the efficacy of a sampling strategy.
Several expected and unexpected patterns emerged during the analysis of the CV. As expected,
CV decreased as sample size increased in all experimental permutations. Increasing sample size should
coincide with decreasing CV because CV is directly proportional to the standard deviation of the estimates, which in turn is
inversely proportional to the square root of sample size. In other words, larger sample sizes reduce the probability that the
sampling strategy by chance identifies too little or too much of the error. The result of larger
sample sizes should be greater precision in accuracy estimates.
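The dependence of CV on sample size can be written out for the simplest case. This is a sketch that assumes n cells are sampled independently and that each is correctly classified with probability p, which fits SRS on randomly distributed error but not the clustered cases. The symbols p, n and \hat{p} (the estimated accuracy) are introduced here for illustration.

```latex
\[
\operatorname{SD}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}},
\qquad
\operatorname{CV}(\hat{p}) = \frac{\operatorname{SD}(\hat{p})}{p} = \sqrt{\frac{1-p}{n\,p}}
\]
```

With p = 0.9, this gives a CV of about 0.033 at n = 100 and about 0.015 at n = 500, so a fivefold increase in sample size roughly halves the CV.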
The CV pattern between the sampling strategies and error distribution experimental permutations
was, however, unexpected. The SRS - Random Error permutation should represent two spatial
distributions that are randomly distributed (unbiased) transposed upon one another because both were
generated based on a pseudorandom generator (Table 3). Conversely, CRS and the Clustered Error
datasets were created through different methods and contained different spatial patterns. Their differences
create two distinct biases away from what would be expected under a perfectly random distribution. The
SRS – Random Error permutation, containing two identically unbiased datasets, was the third most
precise permutation. The CRS – Clustered Error permutation, which contained two distinctly biased
datasets, was the least precise experimental permutation. Although the combination of two different
biases produces the least precise permutation, the combination of either of these biases with an
unbiased dataset, as is the case with the CRS – Random Error and the SRS – Clustered Error
permutations, produces the most and second most precise experimental permutations, respectively.
Table 3 - The rank of estimated accuracy precision and biases away from a random distribution in the experimental
permutations. CRS and CE have distinct biases away from the random distribution
Experimental Permutation   CRS - RE   SRS - CE   SRS - RE   CRS - CE
Rank                       1          2          3          4
Bias                       B1 U       U B2       U U        B1 B2
It is unclear why the combination of two datasets that are biased in different ways away from a
random distribution would be the least precise experimental permutation while the combination of one of
the biased datasets and an unbiased dataset results in the most precise experimental permutations. One
possibility is that unexpected biases could have been introduced through the simulation design or through
the use of the ArcGIS software as the platform for the simulation. Specifically, I examined the possible
effect of sample size variation on the CV, and examined whether the Create Random Raster tool within
ArcGIS, which uses a pseudorandom number generator, produced a truly random distribution. However,
sample size variation had little effect on CV, even when I restricted the analyses to iterations whose
sample sizes were close to the target sample size. Correlograms also revealed that the
pseudorandom number generator did in fact produce a spatially random distribution. More work is needed
to explain empirically the precision patterns observed among the
experimental permutations.
Regardless of why these patterns emerged among the experimental permutations, an
important result from this work highlights the consequences of ignoring spatial distribution when
planning a sampling strategy. According to our results, using a CRS design on a map with clustered error
is a poor choice, yielding the worst precision (and thus yielding the least reliability) given constant
sample size (Figure 15). In fact, given a map with clustered error, a CRS design with sample size 500
will yield the same precision as an SRS design with sample size 200. An accuracy assessor may save a
significant amount of resources using SRS with less than half of the sample size required by CRS to get
the same result. Similarly, when a map contains randomly distributed error, a CRS design with fewer than
100 samples should yield the same precision as an SRS design with 300 samples. By ignoring error
distribution, an accuracy assessor may potentially lose precision in their accuracy estimate or may expend
resources that can be better allocated elsewhere.
Figure 15 – Comparison of Coefficient of Variation of simulated accuracy estimates between experimental permutations
in an error map with 10% error. Dashed line shows a CV of 0.21. Different experimental permutations reach the same
level of precision with different sample sizes.
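The comparison between SRS and CRS on a map with clustered error can be sketched outside of ArcGIS. The example below is a simplified illustration under stated assumptions, not a reproduction of this study's simulation: error is placed in whole 4 x 4 blocks, each CRS cluster is a 3 x 3 window around a random cell, and both designs sample the same number of cells. All function names, grid dimensions, and parameters are hypothetical.

```python
import numpy as np

def make_clustered_error(rng, size=200, block=4, error_rate=0.10):
    """Boolean grid in which error occurs in whole block x block patches."""
    per_side = size // block
    flags = np.zeros(per_side * per_side, dtype=bool)
    n_error = int(round(error_rate * flags.size))
    flags[rng.choice(flags.size, n_error, replace=False)] = True
    coarse = flags.reshape(per_side, per_side)
    # Expand each coarse cell into a block x block patch.
    return np.repeat(np.repeat(coarse, block, axis=0), block, axis=1)

def srs_estimate(rng, error, n):
    """Estimated error rate from n cells drawn without replacement."""
    flat = error.ravel()
    return float(flat[rng.choice(flat.size, n, replace=False)].mean())

def crs_estimate(rng, error, n_clusters, half_width=1):
    """Estimated error rate from square windows centered on random cells."""
    size = error.shape[0]
    rows = rng.integers(half_width, size - half_width, size=n_clusters)
    cols = rng.integers(half_width, size - half_width, size=n_clusters)
    means = [error[r - half_width:r + half_width + 1,
                   c - half_width:c + half_width + 1].mean()
             for r, c in zip(rows, cols)]
    return float(np.mean(means))

def cv(estimates):
    """Coefficient of variation of a set of estimates."""
    estimates = np.asarray(estimates, dtype=float)
    return float(estimates.std(ddof=1) / estimates.mean())

def compare_designs(seed=0, iterations=300, n_clusters=20):
    """Return (CV under SRS, CV under CRS) for equal total sample sizes."""
    rng = np.random.default_rng(seed)
    error = make_clustered_error(rng)
    n_cells = n_clusters * 9  # each 3 x 3 cluster contributes nine cells
    srs = [srs_estimate(rng, error, n_cells) for _ in range(iterations)]
    crs = [crs_estimate(rng, error, n_clusters) for _ in range(iterations)]
    return cv(srs), cv(crs)
```

In this toy setting, cells within a cluster tend to fall in the same error patch, so each cluster carries less independent information than nine separately drawn cells, and the CRS estimates vary more across iterations than the SRS estimates do.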
Future Direction
The results of this study were both unexpected and novel. Contrary to expectations, the ArcGIS
simulation found that the combination of a randomly distributed dataset and a dataset that contained local
spatial autocorrelation yielded more precise estimations of accuracy. The results of the simulation suggest
that when accuracy assessors have reason to believe that the error in a vegetation map is randomly
distributed, they should apply a CRS. Conversely, when there is reason to believe that the error is
spatially autocorrelated at a local level, the assessors should use an SRS. In the literature, most studies
examining the spatial distribution of error during accuracy assessments report that spatially autocorrelated
patterns of error distribution skew accuracy estimates in often unpredictable ways (Loveland et al. 1999).
If accuracy assessors cannot make an educated guess at the spatial distribution of error in a vegetation
map, SRS may be a safe choice.
The results of this study are, however, dependent on the validity of the vegetation map, error map,
and sampling strategy datasets modeled in the ArcGIS simulation. Further work is needed to fully
understand these results. Specifically, the nature of the spatial autocorrelation pattern and
the degree of spatial autocorrelation should be examined because the spatial distributions of error and
sampling strategy modeled by this simulation drive the unexpected results.
Furthermore, ever-improving technological capabilities and development of the ArcGIS software
can enable future researchers to examine the effect of spatial patterns on sampling strategy efficacy in
more detailed and efficient ways. The simulation in this study, for example, does not address the need for
understanding classification-specific accuracy estimates.
The simulation in this study relied on a simplified model of both spatial autocorrelation of error
and vegetation map classifications. In ideal circumstances, this study’s simulation would have used a
model for spatial autocorrelation of error that would represent a more realistic picture of spatial clumping
of map error. A more realistic simulation should also be able to analyze within-class error in addition to
overall map error, as was done in this study. In the following sections, I will discuss in detail the issue of
spatial autocorrelation of error in this study and I will elaborate on the importance of understanding
within-class error.
The Issue of Spatial Autocorrelation
In this simulation, I created clustered error base maps that contained simplified patterns of spatial
autocorrelation. Spatial autocorrelation is observed when objects or locations adjacent to one another
contain values that are more similar than would be expected under a random distribution. In a vegetation
classification map, error may be spatially autocorrelated if map creators consistently misidentify specific
classification types that form associations close in proximity to one another, or if there are sensor
malfunctions that affect the interpretability of certain spectra (in the case of maps that were created
through remotely sensed data).
In this study, the simulation created spatial autocorrelation in the error basemaps between
cells that were, on average, one to two cells away from one another. The resulting
error basemaps were very locally autocorrelated. The spatial autocorrelation of the
error basemaps was most likely limited in pattern and degree by the geometric nature with which the
basemaps were created; that is, because the basemaps were created by overlaying a series of cells of three
different sizes, the largest being three times larger than the smallest cell size, the degree and pattern of
spatial autocorrelation were limited by the geometry of these cell shapes. These limitations are
underscored by the results of the correlogram, showing that the correlation between cells in the error base
maps extended only to the second lag, or two cells away.
Future research might consider developing a method for creating spatial autocorrelation more
organically, using a method that is founded more on probability theory than geometry. Haining and his
associates outline a multivariate probability distribution from which a two-dimensional plane with spatially
autocorrelated values may be generated (Haining et al. 1983). Chen and Wei describe a method using the
Focal Statistics tool in ArcGIS to reclassify and control the degree of spatial autocorrelation when
simulating a spatially autocorrelated dataset (Chen and Wei 2009). These techniques may be adapted for
future research to provide a more realistic representation of spatially autocorrelated error.
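One way to picture a less geometric approach is to smooth a random surface with a moving-window (focal) mean and then threshold it, so that the window radius controls how far the autocorrelation extends. The sketch below is an illustration in that spirit only; it is not the procedure of Haining et al. or of Chen and Wei, it does not use the ArcGIS Focal Statistics tool, and its function names, wrap-around edge handling, and parameters are assumptions made for the example.

```python
import numpy as np

def focal_mean(field, radius):
    """Mean over a (2*radius+1)^2 moving window, wrapping at the grid edges."""
    total = np.zeros_like(field, dtype=float)
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            total += np.roll(np.roll(field, dr, axis=0), dc, axis=1)
    return total / (2 * radius + 1) ** 2

def autocorrelated_error(rng, size=100, radius=3, error_rate=0.10):
    """Boolean error grid: the highest-valued cells of a smoothed random surface."""
    smooth = focal_mean(rng.random((size, size)), radius)
    return smooth > np.quantile(smooth, 1.0 - error_rate)

def lag_correlation(grid, lag):
    """Correlation between cells separated by `lag` columns (one correlogram point)."""
    a = grid[:, :-lag].ravel().astype(float)
    b = grid[:, lag:].ravel().astype(float)
    return float(np.corrcoef(a, b)[0, 1])
```

Evaluating `lag_correlation` over a range of lags gives a simple row-wise correlogram; a larger `radius` should extend the range of lags over which the error remains correlated, while a radius of zero leaves the surface unsmoothed.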
The Error Matrix and Classification-Specific Accuracy Analysis
Due to the limitations and time constraints of this study, I was unable to perform a classification-
specific accuracy analysis. Utilizing the ArcGIS software, however, it is possible to analyze the effect of
error distribution on sampling strategy efficacy in individual classifications, especially with smaller
datasets. When performing an accuracy assessment of a vegetation map, the classifications of the
vegetation map are traditionally compared to the classifications of reference data that are considered to be
either more recent, more accurate, or both (Stehman 2009). Reference data can originate from sensitive
remotely-sensed maps or from field-collected vegetation surveys that apply sampling strategies as
discussed above. The vegetation map classifications and the reference data classifications
can be compared through an error matrix.
An error matrix is a table that aids in calculating map accuracy by comparing both the vegetation
map data and more accurate reference vegetation data (Table 4). In an error matrix, all the vegetation
classification types used in the vegetation map are listed as individual rows of the matrix. All possible
vegetation classification types found in the reference data are listed as individual columns of the matrix.
Typically, equivalent classifications are listed in the same order; that is, the first listed classification in the
map classification rows should be identical (or equivalent) to the first listed classification in the reference
data columns. Each cell in the matrix represents the number of map entities that are identified both as a
particular classification on the original map and as a particular classification in the reference data. As an
example, Table 4 displays a hypothetical error matrix with vegetation classifications that range from
“class 1” to “class k”. The first cell in the error matrix, cell p_11, represents the number of map
entities that were identified as “class 1” in both the original vegetation map and the reference data.
Table 4 - Hypothetical error matrix. Rows are map classes, columns are accuracy assessment classes. Table reprinted
from Stehman 2009.
The error matrix allows for the easy organization of all relevant accuracy metrics. The error
matrix cells along the diagonal, from p_11 to p_kk, represent the map classifications that agreed with the more
accurate reference data. If there were no values in any cells outside of the diagonal, the map would be
100% accurate for all classification types. Any cells to the left or right of the p_11 to p_kk diagonal are map
classification errors. For example, if a value of 1 is input for cell p_12, the map identified 1 map entity as
“class 1” while the reference data indicates that the map entity should be “class 2”. Assuming the
reference data is more accurate, the error matrix shows that the map misidentified one “class 2” map
entity as “class 1”. The map accuracy for individual classifications can be calculated by taking the
diagonal value of a column and dividing it by the sum of all values in the column. Total map accuracy can
be calculated by taking the sum of all diagonal values and dividing it by the sum of all values in the table.
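The two calculations described above can be sketched as follows. The class labels, the small example dataset, and the function names are hypothetical; rows index map classes and columns index reference classes, following the layout of Table 4.

```python
import numpy as np

def error_matrix(map_labels, ref_labels, n_classes):
    """Count matrix: rows are map classes, columns are reference classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for mapped, reference in zip(map_labels, ref_labels):
        matrix[mapped, reference] += 1
    return matrix

def overall_accuracy(matrix):
    """Sum of the diagonal divided by the sum of all cells."""
    return float(np.trace(matrix) / matrix.sum())

def column_accuracy(matrix):
    """Diagonal value of each column divided by that column's total.

    A column with no entries would divide by zero, so every reference
    class is assumed to occur at least once.
    """
    return np.diag(matrix) / matrix.sum(axis=0)

# Hypothetical example: ten map entities, three classes coded 0, 1, 2.
map_labels = [0, 0, 0, 1, 1, 2, 2, 2, 0, 1]
ref_labels = [0, 0, 1, 1, 1, 2, 2, 0, 0, 1]
m = error_matrix(map_labels, ref_labels, 3)
```

For this example, eight of the ten entities fall on the diagonal, so `overall_accuracy(m)` is 0.8, and `column_accuracy(m)` gives 0.75, 0.75, and 1.0 for the three reference classes.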
A benefit of the error matrix is the ability to attribute error to individual classification types. While
estimates of overall map accuracy are beneficial, map errors that are disproportionately present in one
map classification type can be uniquely problematic. Map errors may aggregate in particular
classifications because the classifications may be harder to identify from remotely sensed data than other
classifications or because data technicians may have unintentionally created frame-shifts or localized data
errors during the data entry process. Increased error rate in specific classification types can be problematic
if the classification type is rare or if the classification type carries some disproportionate importance in the
ecosystem. The use of the error matrix allows for map accuracy assessments to be sensitive to potential
classification-specific problems.
Conclusion
In this study, I pursued two goals. First, I strove to develop a model through which a simulation of
a vegetation map accuracy assessment could be made. Second, I strove to use this model to understand
the effect of spatial patterns on sampling strategy efficacy. The results of the simulation I created strongly
suggest that the Simple Random Sample performs best when used to assess error that is spatially
autocorrelated, while the Clustered Random Sample performs best when used to assess randomly distributed
error. Moreover, the results suggest that selecting an appropriate design based on proper knowledge of
error distribution can reduce the required sample size (and thus sampling effort) more than 3-fold.
While the model used in this study utilized a simplified, geometric representation of spatial
autocorrelation, this study provides a foundation through which future research on sampling strategy efficacy
in the context of spatial distributions might take place. As the availability and accessibility of spatial
software increase in the environmental field, vegetation classification maps will become cheaper and
more efficient to make. There is a great need to understand the most reliable and cost-effective way to
assess the appropriateness of sampling strategy use in map accuracy assessments. The simulation used in
this study shows that ArcGIS can be used as an inexpensive and time-saving platform for understanding
sampling strategy behavior prior to application.
References
Bartolome, JW, Barry, WJ, Griggs, T and P Hopkinson (2007), Valley Grassland. Terrestrial Vegetation
of California, 3rd ed. Barbour M.G., T. Keeler-Wolf and A.A. Schoenherr, editors. 367-393.
Burnicki, AC (2013), Spatio-temporal errors in land-cover change analysis: implications for accuracy
assessment. International Journal of Remote Sensing 32(22):7487-7512.
Chen, D, and H Wei (2009), The effect of spatial autocorrelation and class proportion on the accuracy
measures from different sampling designs. ISPRS Journal of Photogrammetry and Remote
Sensing 64:140-150.
Chiarucci, A, Bacaro, B, Filibeck, G, Landi, S, Maccherini, S, and A Scoppola (2012), Scale dependence
of plant species richness in a network of protected areas. Biodiversity Conservation 21:503-516.
Cochran, WG (1977), Sampling Techniques, 3rd ed. John Wiley & Sons, New York.
Comber, A, Fisher, P, Brunsdon, C and A Khmag (2012), A GWR analysis of land cover accuracy.
Proceedings of the AGILE2012 International Conference on Geographic Information Science.
Congalton, RG (1991), A review of assessing the accuracy of classifications of remotely sensed data. Remote
Sensing of Environment 37:35-46.
Edwards Jr., TC, Moisen, GG and RD Cutler (1998), Assessing map accuracy in a remotely sensed,
ecoregion-scale cover map. Remote Sensing of Environment 63:73-83.
Foody, GM (2002), Status of land cover classification accuracy assessment. Remote Sensing of
Environment 80:185-201.
Fortin, MJ, Drapeau, P and P Legendre (1989), Spatial autocorrelation and sampling design in plant
ecology. Vegetatio 83:209-222.
Green, RH (1979), Sampling Design and Statistical Methods for Environmental Biologists. John Wiley &
Sons, New York.
Hearn, SM, Healey, JR, McDonald, MA, Turner, AJ, Wong, JLG, and GB Stewart (2011), The
repeatability of vegetation classification and mapping. Journal of Environmental Management
92:1174-1184.
Keith, LH (1988), The Principles of Environmental Sampling. American Chemical Society.
Legendre, P (1993), Spatial autocorrelation: Trouble or New Paradigm? Ecology 74(6): 1659-1673.
Lichstein, JW, Simons, TR, Shriner, SA and KE Franzreb (2002), Spatial autocorrelation and
autoregressive models in ecology. Ecological Monographs 72(3):445-463.
Lo, CP & LJ Watson (1998), The influence of geographic sampling methods on vegetation map accuracy
evaluation in a swampy environment. Photogrammetric Engineering and Remote Sensing
64:1189-1200.
Loveland, TR, Reed, BC, Brown, JF, Ohlen, DO, Zhu, Z, Yang, L and JW Merchant (2000),
Development of a global land cover characteristics database and IGBP DISCover from 1 km
AVHRR data. International Journal of Remote Sensing 21:1303-1330.
Lunetta, RS, Congalton, RG, Fenstermaker, LK, Jensen, JR, McGwire, KC, and LR Tinney (1991),
Remote-sensing and geographic information-system data integration: error sources and research
issues. Photogrammetric Engineering and Remote Sensing 57(6):677-687.
Moisen, GG, Edwards, TC, and DR Cutler (1994), Spatial sampling to assess classification accuracy of
remotely sensed data. In Environmental Information Management and Analysis: Ecosystem to
Global Scales (WK Michener, JW Brunt, and SG Stafford, Eds.), Taylor and Francis, New York.
Roleck, J, Chytry, M, Hajeck, M, Lvoncik, S and L Tichy (2007), Sampling design in large-scale
vegetation studies: Do not sacrifice ecological thinking for statistical purism! Folia Geobotanica
42:199-208.
Sproul, F, Keeler-Wolf, T, Gordon-Reedy, P, Dunn, J, Klein, A, and K Harper (2011), Vegetation
Classification Manual for Western San Diego County, 1st Edition, AECOM, San Diego.
Stehman, SV (1997), Estimating Standard Errors of Accuracy Assessment Statistics under Cluster
Sampling. Remote Sensing of Environment 60:258-289.
Stehman, SV (1999), Basic probability sampling designs for thematic map accuracy assessment.
International Journal of Remote Sensing 20(12): 2423-2441.
Stehman, SV (2009), Sampling designs for accuracy assessment of land cover. International Journal of
Remote Sensing 30(20):5243-5272.
Abstract
Vegetation classification maps can be important tools for academic research, environmental mitigation and restoration, and governmental decision‐making. Continued development in satellite and photographic technology has resulted in an increase in production of remotely‐sensed vegetation classification maps. As vegetation classification maps become easier and more efficient to make, there will also be increasing importance in evaluating the accuracy of these maps. Traditional sampling strategies for vegetation classification map accuracy assessments were developed using aspatial statistical theory. The performance of these sampling strategies may be affected by spatial patterns in the vegetation classification map in ways that are unpredictable. In this study, I simulated accuracy assessments using Esri’s ArcGIS software. The goal of this study was to understand how different spatial patterns of map error affect the performance of two common sampling strategies during an accuracy assessment.
Asset Metadata
Creator: Lo, Benito Mark (author)
Core Title: Effect of spatial patterns on sampling design performance in a vegetation map accuracy assessment
School: College of Letters, Arts and Sciences
Degree: Master of Science
Degree Program: Geographic Information Science and Technology
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: ArcGIS,classification map,Clustered Random Sample,coefficient of variation,landscape ecology,map accuracy assessment,OAI-PMH Harvest,sample size,sampling strategy,Simple Random Sample,spatial autocorrelation,Vegetation
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Longcore, Travis R. (committee chair), Kemp, Karen K. (committee member), Lee, Su Jin (committee member)
Creator Email: benitomlo@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-464718
Unique identifier: UC11288057
Identifier: etd-LoBenitoMa-2849.pdf (filename), usctheses-c3-464718 (legacy record id)
Legacy Identifier: etd-LoBenitoMa-2849.pdf
Dmrecord: 464718
Document Type: Thesis
Rights: Lo, Benito Mark
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA