Efficient and Accurate Object Extraction from Scanned Maps by Leveraging
External Data and Learning Representative Context
by
Weiwei Duan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2023 Weiwei Duan
Dedication
I am dedicating this dissertation to my beloved father, Chao Duan, and mother, Hongwei Zhou.
Acknowledgements
First and foremost, I would like to express my gratitude to my advisor, Professor Yao-Yi Chiang. His
support has profoundly impacted both my academic journey and personal life. His unwavering commitment
to engaging in discussions about research ideas and assisting me with paper writing has been incredibly
valuable throughout my academic journey. His encouragement has empowered me to explore the research
field with creativity. Furthermore, I am grateful for the positive atmosphere he cultivated in our group. The
heartwarming tradition of surprise birthday cakes not only brings joy but also unites our lab members as a
family. I am grateful for the annual birthday cakes he prepared for me. Lastly, I would like to convey my
gratitude for his help in my personal life. His help and suggestions have greatly eased my life, particularly
as an international student. I also value the wisdom he shared, which has been immensely beneficial.
I would also like to express my appreciation to my committee members: Professor Craig A. Knoblock,
Ram Nevatia, and John P. Wilson. Craig’s invaluable insights have played a pivotal role in shaping and
refining my research ideas. Ram’s perspective from the realm of computer vision has significantly enriched
my research. John’s geographer’s viewpoint has offered valuable perspectives to my work. Additionally, I
thank my undergraduate advisor, Professor Yanping Zhang, for her unwavering encouragement and insight-
ful advice, which ignited my passion for pursuing a Ph.D.
I would like to express my gratitude to all the members of our group. Thank you, Zekun Li, for discus-
sions on challenging research papers and coursework. Chatting with her also relieved my pressure. Yijun
Lin, I deeply appreciate her assistance in enhancing my paper writing and her insightful suggestions dur-
ing my presentation rehearsals. To Johanna Avelar Portillo and Lois Park, I want to express my thanks for
infusing our office with positivity and joy through their cheerful presence and delightful snacks. My sincere
thanks also go to all my labmates who have supported me in preparing for conference presentations, the
qualifying exam, the thesis proposal, and the defense. Your collective contributions have played a pivotal
role in shaping my academic journey.
My Ph.D. journey would not have been possible without the support and assistance of numerous indi-
viduals. I am especially grateful to my friend, Cheng Ding, with whom I engaged in late-night discussions
on research and life. I also extend my heartfelt thanks to Lizsl De Leon for her productive work on admin-
istrative issues.
I extend my heartfelt gratitude to my family, including my parents and fiancé, for their unwavering love
and support. My parents have consistently been a source of encouragement. My fiancé, Dr. Ying Chen, has
been instrumental in offering invaluable mental support throughout the last three years of my Ph.D. journey.
I am especially grateful for Ying Chen’s patient efforts in reviewing my hard-to-understand paper drafts and
assisting me in enhancing my papers’ clarity and quality.
My research is based upon work supported by the National Science Foundation under Award No. IIS
1564164 and NVIDIA Corporation, the National Endowment for the Humanities under Award No. HC-
278125-21, and the University of Minnesota, Computer Science & Engineering Faculty startup funds.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Labeling desired images from candidate images provided by the external vector
data with minimal manual work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Automatically labeling linear objects nearby the external vector data . . . . . . . . . 9
1.3.3 Extracting precise and continuous linear objects . . . . . . . . . . . . . . . . . . . . 10
1.4 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: Training Data Generation and Extraction for Polygonal Geographic Objects . . . . . . . 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Target-Guided Generative Model (TGGM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 The Notations for Target-Guided Generative Model . . . . . . . . . . . . . . . . . . 19
2.2.2 The Variational Inference (Encoder) . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 The Generative Process for Images (Decoder) . . . . . . . . . . . . . . . . . . . . . 20
2.2.4 The Generative and Inference Processes for Labeled Images . . . . . . . . . . . . . 22
2.2.5 Evidence Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.6 Network Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Polygonal Objects Extraction and Vectorization . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Experiment Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Experiment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 Ground Truth Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.4 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.5 Experiment Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 3: Automatic Training Data Generation for Linear Geographic Object Extraction . . . . . . 38
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Automatic Label Generation (ALG) Algorithm using Object’s Shape . . . . . . . . . . . . . 42
3.2.1 Chan-Vese Algorithm (Foregrounds Detection without Leveraging Object’s Shape) . 42
3.2.2 Area-of-interest and Object’s Shape Representation in the ALG Algorithm . . . . . 44
3.2.3 The Objective Function for the ALG Algorithm . . . . . . . . . . . . . . . . . . . . 46
3.2.4 The Optimization Process of the ALG Algorithm . . . . . . . . . . . . . . . . . . . 47
3.3 Experiment Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Experiment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.4 The ALG Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.5 Training Data Generation Results and Analysis . . . . . . . . . . . . . . . . . . . . 51
3.3.6 Extraction Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.6.1 The Impact of Label Quality on the Extraction Results . . . . . . . . . . . 55
3.3.6.2 With & Without the Normalized Cut Loss . . . . . . . . . . . . . . . . . 57
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 4: Linear Geographic Objects Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Linear object Detection TRansformer (LDTR) . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 LDTR Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.2 Training Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.3 Inference Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.3 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.4 Experiment Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.5 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.6 Sensitivity Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 5: Conclusion and Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 Contributions and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 The comparison among three groups of referenced wetland polygons using IoU. Group 1, 2,
and 3 are referenced wetland polygons edited by the GIScience expert, the knowledgeable
annotator, and the experienced annotator, respectively. . . . . . . . . . . . . . . . . . . . . . 28
2.2 The evaluation for the wetland extraction using IoU . . . . . . . . . . . . . . . . . . . . . . 31
3.1 The evaluation for the quality of training data labels using pixel-level precision, recall, and
F1 score for the linear object extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 The evaluation for the vectorized extraction results. "Original", "Vec2raster", and "ALG"
represent three groups of labeled data. "Ncut" and "no-Ncut" represent deeplab v3+ with
and without the normalized cut loss, respectively. The highlighted numbers indicate the
best extraction results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 The table shows the number of regions in the training, validation, and testing sets, the window
size for models' inputs, and the stride for the sliding window. The window size and stride
refer to the number of pixels. Fault lines⋆ refers to thrust fault lines. . . . . . . . . . . . . . 72
4.2 Evaluation results for four objects detection in scanned historical topographic and geological
maps. The correctness, completeness, and APLS evaluate the detection results’ precision,
coverage, and connectivity, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 The waterlines detection evaluation. LDTR+c-token⋆ refers to LDTR with a connectivity
token. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 The evaluation for waterlines detection with various N in the N-hop connectivity prediction
head. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5 The evaluation for waterlines detection with the different number of node tokens. . . . . . . 83
List of Figures
1.1 The arrows above point out the wetland areas on the scanned topographic maps from the
United States Geological Survey (USGS) covering Palm Springs, Florida, circa 1946, 1950,
and 2018, respectively. The sizes of wetland areas in three different years show the wetland
changes over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The bluegrass symbols in (a) represent wetlands in the USGS historical topographic maps.
The black line with crosses in (b) represents railroads on the USGS historical topographic
maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The railroad’s external vector data overlap with a topographic map. The green lines are the
railroad’s external vector data. Black lines with crosses are railroads on the topographic
map. The external vector data fail to label the left two railroads and misalign with the right
two railroads on the map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Examples of labeled images for wetlands, a type of polygonal object. Each labeled image
covers a bluegrass symbol, which represents wetlands on maps. . . . . . . . . . . . . . . . . 4
1.5 The blue area in the left figure shows the map area covering the external wetland vector
data. The right figure shows candidate images within the blue area in the left figure. The
candidate images, generated by the sliding window approach, include wetland symbols,
blue lines, and white backgrounds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 The examples of inputs and outputs for the wetland polygon extraction. . . . . . . . . . . . 5
1.7 The green area shows the map area covered by the external railroad vector data. . . . . . . . 6
1.8 The examples of inputs and outputs for the linear object extraction. . . . . . . . . . . . . . . 7
2.1 The green polygon delineates a wetland area on a scanned topographic map. Wetlands,
polygonal objects, consist of a group of bluegrass symbols on scanned topographic maps. . . 14
2.2 (a) shows an RoI from the external vector data. The blue area represents an RoI from
the external vector data for wetlands on a topographic map. (b) shows candidate images
cropped by a sliding window across the blue RoI in (a). . . . . . . . . . . . . . . . . . . . . 15
2.3 The images covering the wetland symbols within an RoI. . . . . . . . . . . . . . . . . . . . 16
2.4 TGGM, the proposed model, iteratively separates images within the RoI (the red region)
into the desired or non-desired cluster. In the first iteration, TGGM uses the few manually
labeled wetland symbols (purple box in the left-most figure) to form the desired and
non-desired cluster for wetland symbols. In the following iterations, TGGM uses the
images in the wetland cluster from the previous iteration to collect more wetland images.
The iteration ends when the number of images in the wetland cluster does not change. The
images in the wetland cluster are labeled data to train an extraction model. . . . . . . . . . . 18
2.5 The above figure shows TGGM's architecture. f_inf in the center model encodes x_u
concatenated with y into the distribution of the latent variable z, and f_gen decodes z into x̂_u,
weighted by q(y|x_u) from f_cls, represented by the right model. Section 3.1 describes f_inf and
f_cls. Section 3.2 describes f_prior, represented by the left model. Sections 3.3 and 3.4 show
the optimization procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Three different ways to group the wetland areas on the topographic map covering
Duncanville, Texas, circa 1995. The purple areas represent polygons created by three
different annotators to depict the wetland areas. . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 The three groups of referenced wetland polygons are for the Big Swamp map edited in 1993.
The red, orange, and green polygons represent the referenced wetland polygons created
by the GIScience expert, the knowledgeable annotator, and the experienced annotator,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 The three groups of referenced wetland polygons are for the Duncanville map edited in
1959. The red, orange, and green polygons represent the referenced wetland polygons
created by the GIScience expert, the knowledgeable annotator, and the experienced
annotator, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.9 The visualizations of extraction results from the six maps. The purple areas are the
extracted polygons for the wetland areas. The red, orange, and green polygons represent
the referenced wetland polygons generated by the expert, the knowledgeable annotator, and
the experienced annotator, respectively. The caption below each figure follows the format
"X, Y, Z". X is the city name covered by the map, Y is the map editing year, and Z is
the reference. Group 1, 2, and 3 are for referenced wetland polygons from the expert, the
knowledgeable annotator, and the experienced annotator, respectively. . . . . . . . . . . . . 35
3.1 The misaligned railroad labels (depicted by the green area) are from the external vector
data. The green area is the buffer zone of the external vector data. The buffer width is
equivalent to the railroad’s width on the map. The railroad is the black line with crosses on
the scanned topographic map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 The green and blue areas represent the area-of-interest and the object’s shape for the railroad
on a scanned topographic map, respectively. The railroad is the black line with crosses on
the map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 The areas within the blue polygons are the labeled pixels for the desired linear objects.
"Orig vec" in Figures 3.3a, 3.3d, and 3.3g refers to labels from the buffered original vector
data. "vec2raster" in Figures 3.3b, 3.3e, and 3.3h refers to labels from the aligned vector
data processed by the vector-to-raster algorithm. "ALG" in Figures 3.3c, 3.3f, and 3.3i
refers to labels from the proposed ALG algorithm. . . . . . . . . . . . . . . . . . . . . . . 52
3.4 (a) shows the false labels from the ALG algorithm. The ALG algorithm cannot correct the
label error when the original vector data shown in (b) are closer to other linear objects than
the desired objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Illustration of correctness and completeness calculation. . . . . . . . . . . . . . . . . . . . . 55
3.6 The above figure illustrates the extraction results. The sub-figures from top to bottom show
the results using the annotations from the original vector, vector-to-raster, and ALG group,
respectively. The green and red lines represent the true positive and false positive extraction
results, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 The waterlines extraction results on the Bray map. The green lines represent the true
positive extracted waterlines. For the labels, (b) and (c) used the ALG group, while (a) used
the vector-to-raster group. For the extraction models, (a) and (b) show the extraction from
the model without the Ncut loss, while (c) is with the Ncut loss. . . . . . . . . . . . . . . . 57
3.8 The false extracted railroads (red lines) on the Bray map. (a) and (b) are the results from
the model without and with the Ncut loss, respectively. . . . . . . . . . . . . . . . . . . . . 57
4.1 The black lines highlighted with the arrows are the historical railroad locations on the USGS
topographic maps in Los Angeles, California, circa 1928, 1966, and 2018, respectively. The
density of railroad areas in three different years shows the changes in railroads over time. . . 61
4.2 Examples of road (a) and railroad (b) on a USGS topographic map . . . . . . . . . . . . . . 62
4.3 The above figure shows the LDTR overview. LDTR takes images as inputs and generates
the graph for desired lines by predicting nodes and edges. Notably, the N-hop connectivity
prediction head in LDTR plays a crucial role in capturing complex spatial context among
nodes and hence improves the connectivity of the detected lines. Best viewed in color. . . . . 64
4.4 The above figure shows the LDTR’s architecture. LDTR consists of five components.
The first component, CNN, takes images as inputs and generates the feature maps. The
second component, Transformer, comprises two sub-components: an encoder and a
decoder. The Transformer’s encoder processes image-patch tokens obtained by combining
flattened feature maps and positional encodings to produce refined image-patch tokens.
The Transformer’s decoder takes node tokens, edge tokens, and refined image-patch tokens
as inputs and generates refined node and edge tokens. The third component, the edge
prediction head, takes the concatenation of pair-wise refined node tokens and the refined
edge token as inputs and predicts if an edge exists between two node tokens. The fourth
component, the node detection head, takes the refined node tokens as inputs and predicts if
a node token is a valid node in the graph plus the node location in the input image. The fifth
component, the N-hop connectivity prediction head, takes the concatenation of pair-wise
refined node tokens as inputs and predicts if two nodes are connected within a specified
N-hop distance. The fifth component is only used in training time. . . . . . . . . . . . . . . 66
4.5 Examples of tested linear objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Visualization of detection results. The figures in the first and second columns are map
images and ground truth, respectively. The figures in the remaining columns are the
detection results from three baselines (SIINet, CoANet, Relationformer) and LDTR. The
first two rows are for railroad detection. The third and fourth rows are for waterline
detection. The fifth row is for the scarp lines detection, and the sixth and seventh rows are
for thrust fault lines detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 The figures show the query node tokens’ reference points and attention toward the key node
tokens from LDTR (top row) and Relationformer (bottom row) for waterlines and scarp
lines detection. In the map images, the red dots represent the query node tokens, while the
remaining dots represent the top key node tokens which have high attention scores to the
query node tokens. The dots’ color gradient, ranging from yellow to orange, indicates the
magnitude of the attention scores, with shades closer to orange signifying higher scores.
Additionally, the squares on the map images are the query node tokens’ reference points
in the multi-scaled deformable cross-attention. In the images with black backgrounds, the
green dots and white lines are the predicted nodes and edges, respectively. . . . . . . . . . . 79
4.8 The waterlines detection results from LDTR and LDTR with a conn-token. LDTR+conn⋆
refers to LDTR with a conn-token. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.9 The detected waterlines from LDTR with various N in the N-hop connectivity prediction
head. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.10 The detected waterlines from LDTR with the various number of node tokens. . . . . . . . . 84
Abstract
Scanned historical maps contain valuable information about environmental changes and human development
over time. For instance, comparing historical waterline locations can reveal patterns of climate change.
Extracting geographic objects in map images involves two main steps: 1. obtaining a substantial amount
of labeled data to train extraction models, and 2. training extraction models to extract desired geographic
objects. However, the extraction process has two main challenges. One challenge is generating a large
amount of labeled data with minimal human effort, as manual labeling is expensive and time-consuming.
The other challenge is ensuring that the extraction model learns representative and sufficient knowledge
for the accurate extraction of geographic objects. The success of subsequent analyses, like calculating the
shortest paths after extracting railroads, heavily depends on the accuracy of the extractions.
To generate labeled data with minimal human effort, this dissertation presents semi- and fully automatic
approaches to generate labeled desired geographic objects by leveraging external data. The semi-automatic
approach requires one or a few manually labeled desired objects to collect all desired objects from candidates
provided by the external data. In contrast, existing methods require more than a few manually labeled
desired objects to achieve the same goal. On the other hand, the proposed automatic approach aims to
label the desired objects in close proximity to the external data. By fully using the location and shape
information from the external data, the proposed automatic approach can accurately label the desired
objects on the maps. In contrast, existing methods that do not utilize shape information may produce false labels.
The novel approaches introduced in this dissertation significantly reduce the need for manual labeling while
ensuring accurate labeling results.
Extracting accurate geographic objects is the other challenge due to the ambiguous appearances of ob-
jects and the presence of overlapped objects on maps. The extraction model presented in this dissertation
captures cartographic symbols to differentiate desired objects from other objects with similar appearances.
When the desired objects overlap with other objects on maps, the extracted results could be broken. The
proposed extraction model captures sufficient spatial context to reduce broken extraction. For example, the
proposed extraction model learns the long and continuous structure of linear objects to reduce the gaps in
the extracted lines. In contrast, existing extraction models lack the ability to learn sufficient spatial con-
text, resulting in broken extraction of linear objects. In summary, the proposed extraction model learns
representative cartographic symbols and sufficient spatial context to accurately extract desired objects.
The results of the experiment demonstrate the superiority of both the labeling and extraction approaches
compared to the existing methods. The proposed methods significantly improve the quality of training
data for extraction models by generating accurately labeled data. The extraction results from the proposed
extraction model have far fewer false extractions and better continuity than state-of-the-art baselines. The
combination of precise labeling and accurate extraction allows us to extract geographic objects from scanned
historical maps. Therefore, we can analyze and interpret historical map data effectively.
Chapter 1
Introduction
1.1 Background and Problem Statement
Large numbers of scanned historical map images store abundant and valuable information on the evolution
of natural features and human activities, such as changes in hydrography, the development of railroad net-
works, and the locations of mineral sites [5, 33, 80, 83, 84]. Figure 1.1 shows the wetland areas on scanned
topographic maps from the United States Geological Survey (USGS) covering Palm Springs, Florida, circa
1946, 1950, and 2018 from left to right. Wetlands, which are home to a wide range of plants and animals,
are reservoirs of biodiversity. Humans also benefit from wetlands purifying water, replenishing groundwa-
ter, and stabilizing shorelines. Analyzing changes in wetland areas over time in historical maps, such as
Figure 1.1, helps researchers study many topics, such as water quality. However, due to the large number of
historical maps, there is an urgent need for an efficient and accurate method to extract the historical locations
of geographical objects and store the extracted locations in a machine-readable format for further analysis.
In order to develop an extraction model capable of automatically processing thousands of historical
maps, a substantial amount of labeled data is essential for training the extraction model. During the training
phase, the extraction model utilizes the labeled data to learn the appearances and context of the desired
geographic objects. For example, as depicted in Figure 1.2a, the appearance of wetlands is characterized by
bluegrass symbols within white backgrounds. Similarly, Figure 1.2b shows that the appearance of railroads
(a) 1946 (b) 1950 (c) 2018
Figure 1.1: The arrows above point out the wetland areas on the scanned topographic maps from the United
States Geological Survey (USGS) covering Palm Springs, Florida, circa 1946, 1950, and 2018, respectively.
The sizes of wetland areas in three different years show the wetland changes over time.
is black lines with crosses amidst complex backgrounds. Labeled data generation is usually a resource-
intensive and time-consuming process [21]. This dissertation aims to use minimal manual annotations to
generate a large amount of labeled data for training the extraction models.
(a) Wetland symbols (b) Railroad symbols
Figure 1.2: The bluegrass symbols in (a) represent wetlands in the USGS historical topographic maps. The
black line with crosses in (b) represents railroads on the USGS historical topographic maps.
Besides efficiently generating labeled data, this dissertation aims to design an extraction model that
enables the accurate extraction of geographic objects from scanned map images. Therefore, achieving the
extraction goal involves two key components:
1. Develop methods to generate labeled data for training extraction models with minimal manual work.
2. Design extraction models to accurately extract geographic objects.
Leveraging external vector data helps reduce the manual work needed to generate labeled data. External
vector data refer to data from other sources representing the desired geographic objects depicted on the
map images. However, due to differences in editing years and coordinate projection systems between the
external vector data and the maps, the external vector data only offer partial and approximate locations of
desired objects from the maps. The external vector data lack annotations for the desired objects that might
have disappeared when the external vector data were edited. For example, in Figure 1.3, the green lines
depict the external railroad vector data overlapped with a topographic map. The external railroad’s vector
data do not provide labels for the two left railroads on the map, as the railroads were removed when the
external vector data were edited. Furthermore, the railroad’s external vector data inaccurately label the two
right railroads, as the vector data (green lines) are several pixels away from the true railroad’s locations on
the map. Therefore, the external vector data provide partial and misaligned labels for desired geographic
objects on maps.
Figure 1.3: The railroad’s external vector data overlap with a topographic map. The green lines are the
railroad’s external vector data. Black lines with crosses are railroads on the topographic map. The external
vector data fail to label the left two railroads and misalign with the right two railroads on the map.
For polygonal objects, the labeling task is labeling images covering desired symbols. Figure 1.4 shows
examples of labeled images for wetlands (a polygonal object). Each labeled image covers a bluegrass
symbol, which represents wetlands on maps. The external vector data help the labeling task by providing
a region-of-interest (RoI), which covers desired symbols and other objects. An example of an RoI is the
blue polygon in Figure 1.5a. The blue polygon contains wetland symbols and blue lines. To facilitate the
labeling process for polygonal objects, I use a sliding window approach to generate candidate images within
the RoI. Figure 1.5b shows examples of candidate images generated by the sliding window approach within
the RoI (the blue polygon) in Figure 1.5a. The candidate images include wetland symbols, blue lines, and
white backgrounds. Therefore, the labeling task for polygonal objects is to identify desired images from the
candidate images provided by the external vector data with minimal manual work.
Figure 1.4: Examples of labeled images for wetlands, a type of polygonal object. Each labeled image covers
a bluegrass symbol, which represents wetlands on maps.
After obtaining labeled images for polygonal objects, the main focus is extracting all desired symbols on
the entire maps. The external vector data only label partial desired symbols on the maps due to the different
editing years between the external vector data and maps. Therefore, the extraction process is necessary to
extract all desired symbols on the maps. The extraction process employs a sliding window approach that
traverses the entire map. As the sliding window moves horizontally and vertically by x pixels, the extraction
model detects if a window covers a desired symbol. The map areas covered by the detected windows form
the extracted polygons, encompassing all desired symbols on the maps. For example, the white area in
Figure 1.6b shows the extracted polygon for the wetland area in Figure 1.6a.
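The sliding-window extraction step can be summarized with a short sketch. The snippet below is a minimal illustration, assuming a trained window classifier exposed as a hypothetical `model.predict` interface; the window size and stride are placeholder values, not the settings used in the experiments.

```python
import numpy as np

def extract_polygon_mask(map_image, model, window=64, stride=32):
    """Slide a window across the map image and mark the areas whose windows
    the classifier flags as covering a desired symbol. The union of the
    detected windows forms the extracted polygon area."""
    height, width = map_image.shape[:2]
    mask = np.zeros((height, width), dtype=bool)
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = map_image[top:top + window, left:left + window]
            if model.predict(patch):  # does this window cover a desired symbol?
                mask[top:top + window, left:left + window] = True
    return mask
```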
Figure 1.5: The blue area in the left figure shows the map area covering the external wetland vector data. The
right figure shows candidate images within the blue area in the left figure. The candidate images, generated
by the sliding window approach, include wetland symbols, blue lines, and white backgrounds.
(a) An input map for wetland extraction (b) White area is the extracted wetlands
Figure 1.6: The examples of inputs and outputs for the wetland polygon extraction.
As for the linear object labeling task, the external vector data could provide misaligned labels. For
example, the green area in Figure 1.7 shows the labeled railroad pixels annotated by the external vector
data, while the true railroads on the map are several pixels away from the misaligned labels (the green area).
Accurate pixel-level annotations are essential for training semantic segmentation models, particularly for
small objects like linear objects on the maps. The accuracy of pixel-level annotations becomes critical for
small objects as the error-to-true label proportion is higher for small objects compared to large objects. The
misaligned annotations can lead to significant inaccuracies in extraction results. Therefore, the goal is to
design an algorithm that automatically labels the desired linear objects near the misaligned labels, ensuring
the accuracy of the labeled data.
Figure 1.7: The green area shows the map area covered by the external railroad vector data.
After obtaining labeled data, the main focus is developing an extraction model capable of accurately
extracting linear objects from maps. Linear objects pose particular challenges due to their ambiguous lo-
cal appearances and elongated, slender shapes. Existing extraction models [9, 56, 78] often produce false
and discontinuous extraction results. For example, the railroads and roads in Figure 1.8a are black lines
within a local area. The railroads are black lines with crosses, and the roads are dashed black lines next
to the railroads. Because railroads and roads both appear as black lines, an object extraction model
may falsely extract a small road segment as a railroad. Therefore, it is challenging to distinguish between
different geographic objects with ambiguous local appearances. Moreover, the railroad in Figure 1.8a inter-
sects with complex backgrounds. Most railroad segments are on white backgrounds, while some are on red
backgrounds or along brown contour lines. The complex background makes it difficult for object extrac-
tion models to maintain continuity in the extracted railroads. Therefore, the goal is to extract precise and
continuous desired linear objects. Figure 1.8b is an example of extracted railroads (white area).
In summary, the goals for labeling and extraction approaches in this dissertation are as follows:
• Semi- or fully automatically labeling desired symbols or linear objects by leveraging the external
vector data to generate a large amount of training data for training an extraction model
• Automatically extracting precise and continuous linear objects
(a) An input map for railroad extraction (b) White area is the extracted railroads
Figure 1.8: The examples of inputs and outputs for the linear object extraction.
1.2 Thesis Statement
I develop approaches to achieve the end-to-end process of accurately extracting geographic objects from
scanned historical maps with minimal manual annotations by designing machine learning methods to capture
representative and sufficient context in the maps and from external data.
1.3 Approach
This section briefly introduces my approaches to the goals of labeling and extraction.
1.3.1 Labeling desired images from candidate images provided by the external vector data
with minimal manual work
The task of labeling desired symbol images from candidate images involves a clustering process that groups
the candidate images into desired and non-desired clusters. For example, Figure 1.5b shows candidate im-
ages. The clustering process for candidate images in Figure 1.5b results in the formation of a desired cluster,
exclusively comprising wetland symbols, and a non-desired cluster, which includes images covering blue
lines and white backgrounds. The desired cluster exclusively contains the desired symbols, ensuring ac-
curate labels for training extraction models. However, achieving a cluster exclusively for desired symbols
while minimizing manual labeling efforts poses a significant challenge. The state-of-the-art (SOTA) unsu-
pervised and semi-supervised methods [39, 44, 53, 99, 105] have a trade-off between the amount of manual
work and the purity of the cluster specifically for desired symbols. In the absence of knowledge about the
desired symbols, the unsupervised clustering methods [39, 99] may not effectively learn the patterns dis-
tinguishing the desired and non-desired symbols, resulting in the absence of a distinct cluster exclusively
for the desired symbols. For instance, to separate desired and non-desired symbols within the blue poly-
gon in Figure 1.5a, the unsupervised clustering methods learn the blue color as the pattern to form clusters,
leading to one cluster containing wetland symbols and blue lines and the other cluster encompassing white
backgrounds. Consequently, the labeled wetland symbols still have blue lines as noise. On the other hand,
semi-supervised models [44, 53, 105] can form a cluster exclusively for desired symbols but require a small
set of manually labeled desired and non-desired symbols. To further reduce manual labeling efforts, we can
eliminate the need for manual annotations for non-desired symbols. By leveraging manually labeled desired
symbols as guidance to construct the desired cluster, we inherently obtain the other cluster for non-desired
symbols without the need for manual annotations on non-desired symbols.
The proposed generative clustering model aims to form one cluster exclusively for the desired symbols,
guided by one or a few manually labeled desired objects. The proposed model is built based on the Varia-
tional Autoencoder and Gaussian Mixture Models. The main contribution of the proposed model is that it
can exploit a few labeled data to guide the clustering process so that the data in one of the resulting clusters
are similar to the labeled data. Specifically, unlike the existing approaches in which the training process can
only update the Mixture of Gaussians (MoG) as a whole or use a subset of the data to update a specific com-
ponent in the MoG in separate optimization iterations (e.g., a ”minibatch”), The proposed model’s network
design allows the update of individual components in the MoG separately in the same optimization iteration.
Using labeled and all data in one optimization iteration is important because when the data in the desired
categories are (partially) labeled, an optimization iteration should update the MoG using both labeled and
all data so that the data distributions of some specific components in the MoG are specific for the partially
labeled data in the desired category.
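As a rough illustration of this joint update, the sketch below shows one optimization iteration that combines a reconstruction term computed over all candidate images (updating the whole MoG) with a term computed over the labeled desired images routed through the desired component only. The `model` interface and loss terms are placeholders in a PyTorch-style training loop, not the exact TGGM objective described in Chapter 2.

```python
def joint_optimization_step(model, all_images, labeled_desired, optimizer):
    """One iteration that updates the whole MoG with all candidate images and,
    in the same step, the desired component with the labeled desired images.
    `model` is a placeholder VAE-style network returning (reconstruction, KL) losses."""
    recon_all, kl_all = model(all_images)                            # all MoG components
    recon_des, kl_des = model(labeled_desired, component="desired")  # desired component only
    loss = recon_all + kl_all + recon_des + kl_des
    optimizer.zero_grad()
    loss.backward()                                                  # single backward pass
    optimizer.step()
    return loss.item()
```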
1.3.2 Automatically labeling linear objects nearby the external vector data
The proposed method aims to automatically label the pixels of the map images belonging to the desired
linear objects near the external vector data. The color information from map images plus location and shape
information from the external vector data ensure the labeled pixels are accurate. The map’s color information
means that the labeled pixels have homogeneous colors. Meanwhile, location and shape information require
the labeled pixels to be positioned close to the external vector data and share a similar shape to the external
vector data. However, the existing method [18] generates noisy labeled linear objects by only using the map
color information and location information from the external vector data. The omission of shape information
can lead to mislabeling other objects with colors similar to desired linear objects, even if their shapes do
not match those in the external vector data. Therefore, utilizing all color, location, and shape information
ensures that the labeled objects are for desired linear objects.
The proposed method automatically labels pixels that exhibit homogeneous colors, are in close prox-
imity to the vector data, and share a similar shape with the rasterized vector data. The rasterized vector
data is the image representation of the object, in which a group of pixels, overlapping with the vector data,
represents the object. To measure shape similarity, the proposed method calculates the overlap size between
the group of labeled pixels and the rasterized vector data. A larger overlap size represents a higher shape
similarity between the group of labeled pixels and the rasterized vector data. However, due to several pixel
offsets between the pixels belonging to the desired linear objects and the rasterized vector data, directly com-
puting the overlap area might result in inaccuracies. For example, Figure 1.7 illustrates the small overlap
size between the railroads and the green area (rasterized railroad vector data). To address the small over-
lap size caused by misalignment, the proposed method innovatively applies an affine transformation to the
rasterized vector data, effectively aligning the rasterized vector data with the group of labeled pixels. This
alignment ensures that the overlap size between the rasterized vector data and the group of labeled pixels ac-
curately represents the shape similarity. In summary, the proposed method employs an affine transformation
to leverage shape information, in addition to color and location information, to automatically label pixels
belonging to the desired linear objects on maps.
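A simplified sketch of the shape-similarity measurement is shown below. It searches over small integer shifts of the rasterized vector data, a translation-only stand-in for the affine transformation used by the proposed method, and scores each alignment with intersection-over-union; the function name and shift range are illustrative.

```python
import numpy as np

def aligned_shape_overlap(label_mask, raster_vec, max_shift=10):
    """Return the best IoU between the candidate label pixels and the rasterized
    vector data over small shifts; a higher score means the labeled pixels share
    a similar shape with the external vector data despite misalignment."""
    best_iou = 0.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(raster_vec, dy, axis=0), dx, axis=1)
            inter = np.logical_and(label_mask, shifted).sum()
            union = np.logical_or(label_mask, shifted).sum()
            if union > 0:
                best_iou = max(best_iou, inter / union)
    return best_iou
```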
1.3.3 Extracting precise and continuous linear objects
Extracting accurate linear objects requires the extraction model to capture both the image and the spatial
context. The image context should encompass the representative symbols of the desired linear objects. For
example, the black crosses in Figure 1.8a are representative symbols of railroads. In Figure 1.8a, there
are other black lines next to the railroads. Capturing the black crosses as an image context is crucial to
differentiate railroads from other black lines. Therefore, capturing the representative image context, such
as black crosses, can reduce false detection. Spatial context, on the other hand, refers to the correlations
among pixels in an image. For example, nearby pixels with similar colors are likely to belong to the same
object. Therefore, capturing the spatial context can improve the connectivity of extracted linear objects.
The extraction model must also be able to combine the image and spatial contexts, e.g., detecting black(ish)
pixels following an elongated area with repeated occurrences of the cross symbols to extract the railroads
accurately.
The SOTA linear object extraction models [22, 30, 31, 38, 48, 70, 75, 79, 87, 95, 103, 106, 107] have the
challenge of learning sufficient spatial context to extract continuous linear objects from maps. The SOTA
extraction models are in two categories: segmentation- and graph-based models. Segmentation-based mod-
els [22, 38, 70, 75, 79, 106, 107] independently predict if each pixel belongs to desired linear objects.
Independent prediction leads segmentation-based models to fail to capture spatial contexts among pixels.
The SOTA graph-based models [30, 31, 48, 70, 87, 95, 103] predict the vector lines consisting of nodes
and edges for the desired linear objects. SOTA graph-based extraction models, such as Relationformer [70],
focus on learning the adjacency information of the vector nodes. However, adjacency information is insuf-
ficient to encode the complex spatial context between nodes. For example, the spatial context of a curved
line is more complicated than that of a straight line. As a result, SOTA graph-based extraction models,
without capturing the curvature of linear objects from the adjacency information, would predict false edges,
leading to low connectivity in the extracted vector lines for desired linear objects.
The proposed graph-based extraction model aims to predict vector lines consisting of nodes and edges
for desired linear objects on maps. The proposed model is a variation of Transformer [88], which proposes
the attention mechanism to learn the context. To learn sufficient spatial context, the innovative N-hop con-
nectivity component of the proposed model explicitly encourages interactions among nodes within a given
N-hop distance in vector lines. The interactions enable the proposed extraction model to acquire sufficient
spatial context, including curvature of desired linear objects. The sufficient spatial context allows the pro-
posed model to generate vector lines with accurate connectivities. Furthermore, the multi-scale deformable
attention mechanism [108] in the proposed model learns representative image context for distinguishing
desired linear objects from other objects. In summary, the proposed extraction model generates precise
and connected vector lines for desired linear objects by learning representative image contexts and suffi-
cient spatial contexts, that is, connectivity information. The proposed model won first place in the United
States Geological Survey and the Defense Advanced Research Projects Agency 2022 AI for Critical Mineral
Assessment Competition (https://criticalminerals.darpa.mil/The-Competition), significantly outperforming
the second-place entry by 184%.
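To make the N-hop supervision concrete, the sketch below derives an N-hop connectivity matrix from a ground-truth adjacency matrix of the vector-line graph: entry (i, j) is 1 when node i reaches node j within N edges. This illustrates only how such targets could be computed, not LDTR's actual prediction head or loss.

```python
import numpy as np

def n_hop_connectivity(adj, n_hops=3):
    """Given a binary adjacency matrix of the ground-truth line graph, return a
    binary matrix whose (i, j) entry is 1 if node i can reach node j within
    `n_hops` edges. Such a matrix can serve as the target for an N-hop
    connectivity prediction head."""
    adj = (np.asarray(adj) > 0).astype(np.int64)
    step = np.eye(adj.shape[0], dtype=np.int64) + adj   # move one hop or stay put
    reach = np.eye(adj.shape[0], dtype=np.int64)
    for _ in range(n_hops):
        reach = (reach @ step > 0).astype(np.int64)     # extend reachability by one hop
    return reach
```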
1.4 Outline of the Dissertation
This dissertation presents labeling and extraction methods to achieve the overall goal: to extract polygon
and linear objects from scanned historical topographic and geological maps with minimal manual work.
Each chapter in the remainder of the dissertation presents one method for labeling or extraction and
discusses the related work as follows:
• Chapter 2 describes the end-to-end process of extracting polygonal objects from scanned historical
topographic maps. The method section in this chapter presents the details of the proposed method,
which aims to identify desired symbols from candidates provided by the external vector data with
minimal manual work. The experiment section in this chapter shows the polygon extraction results
from the extraction model trained by the labeled data from the proposed model. On average, the
extraction results achieve 80% accuracy.
• Chapter 3 presents the details of the method for automatically labeling the desired linear objects near
the vector data. The experiment will show the labeling accuracy of the proposed method and two
baselines. The proposed method outperforms the baselines, achieving an average 4% higher F1 score.
Furthermore, the extraction model trained with the labeled data generated by the proposed algorithm
achieves superior extraction results compared to the labeled data from the baselines. The superior
extraction results demonstrate the significant impact of labeled data quality on extraction results.
• Chapter 4 presents the extraction model for extracting precise and continuous linear objects from
topographic and geological maps. The experiment shows that the proposed extraction model achieves an
average correctness of 0.87 and significantly improves connectivity by approximately 20% compared to the
SOTA linear object detectors, representing a significant advancement.
• Chapter 5 summarizes the proposed approaches for labeling and extraction. Furthermore, I discuss
the potential future research directions in the field of geographic object extraction. The geographic
objects extracted using my methods have great potential to benefit researchers in various domains.
Chapter 2
Training Data Generation and Extraction for Polygonal Geographic Objects
This chapter is about extracting the locations of desired polygonal objects from the scanned topographic
maps. The difficulty of the extraction process is efficiently generating the labeled images that cover the
desired objects to train the extraction model. This chapter presents my method that uses location information
from external data to label desired symbols using a minimum amount of manual work. Firstly, this chapter
introduces my method of efficiently labeling desired symbols in detail. Secondly, the experiment in this
chapter shows the results of the object extraction using the labeled data from the proposed method. The
desired polygonal object in the experiment is the wetland, a type of polygonal geographic object, in the
United States Geological Survey (USGS) historical topographic maps.
2.1 Introduction
The extraction of polygonal objects, such as wetlands, from scanned historical topographic maps is cru-
cial for obtaining valuable information about natural features and human activities, including changes in
hydrography over time. Polygonal objects on topographic maps are specific areas indicated by distinct sym-
bols that define their presence and boundaries. For example, in Figure 2.1, the green polygon is the wetland, a
polygonal object encompassing a collection of bluegrass symbols on a scanned topographic map. Therefore,
polygonal object extraction involves extracting symbols from topographic maps. Convolutional Neural Net-
works (CNN) [71] achieve impressive results for various extraction tasks, such as vehicle extraction from
satellite imagery [27, 69, 90]. The primary challenge in utilizing CNN for symbol extraction from topo-
graphic maps is the limited availability of labeled symbols for training. Since manual labeling is costly and
time-consuming, acquiring a sufficient amount of labeled symbols is difficult. This chapter aims to generate
labeled symbols with minimal manual effort.
Figure 2.1: The green polygon delineates a wetland area on a scanned topographic map. Wetlands, polygonal
objects, consist of a group of bluegrass symbols on scanned topographic maps.
External vector data present a potential solution to reduce manual work for generating labeled symbols.
External vector data provide region-of-interest (RoI) covering groups of desired symbols on maps. An RoI
is a map area covered by external vector data. For example, the blue polygon in Figure 2.2a is an example
of RoI covering a group of wetland symbols. Using a sliding window to crop images within RoI generates
candidates for images covering desired symbols automatically. The candidate images consist of desired
and non-desired images. For example, the candidate images within the blue polygon in Figure 2.2b are
images covering wetland symbols (desired images) and images covering blue lines and white backgrounds
(non-desired images).
Figure 2.2: (a) shows an RoI from the external vector data. The blue area represents an RoI from the external
vector data for wetlands on a topographic map. (b) shows candidate images cropped by a sliding window
across the blue RoI in (a).
Grouping candidate images into the desired and the non-desired clusters helps generate labeled desired
symbols. The desired cluster exclusively contains images covering the desired symbols, ensuring accu-
rate labeling for CNN training. On the other hand, the non-desired cluster encompasses images that do
not contain the desired symbols. The state-of-the-art (SOTA) unsupervised and semi-supervised clustering
methods [39, 44, 53, 99, 105] can group images within an RoI into clusters. However, in the absence of
knowledge about the desired symbols, the unsupervised clustering methods [39, 99] may not effectively
learn the patterns distinguishing the desired and non-desired symbols, resulting in the absence of a distinct
cluster specifically for the desired symbols. For instance, in Figure 2.2b, the unsupervised clustering meth-
ods learn the blue color as the pattern to form clusters, leading to one cluster containing wetland symbols
and blue lines and the other cluster encompassing white backgrounds. Consequently, the labeled images
still have blue lines as noise. On the other hand, semi-supervised models [44, 53, 105] can group images
into a desired and non-desired cluster but require a small set of manually labeled desired and non-desired
images within RoI. Since our goal is to form a specific cluster for desired symbols to obtain the labeled
desired images for training CNN, we can further reduce the manual annotation work by providing only
manually labeled desired images to guide the formation of the desired cluster without needing manually
labeled non-desired images.
To further reduce manual annotations, this chapter presents the Target-Guided Generative Model (TGGM),
which requires a few manually labeled desired images as guidance to form a specific cluster exclusively for
the desired images. TGGM is a weakly-supervised probabilistic clustering method built under the Vari-
ational Auto-encoder (VAE) framework [43]. TGGM's architecture, following VAE [43], consists of an
encoder and a decoder network. TGGM’s encoder maps input images to a latent space. TGGM’s decoder
takes samples from the latent space and reconstructs the original images. The optimization process of
TGGM increases the similarity between the original and the reconstructed input images to learn the latent
space, effectively capturing the underlying distribution of input images. Figure 2.3 shows the various posi-
tions of wetland symbols within images from the desired cluster. Leveraging VAE's capacity of capturing
input variability [43], TGGM models images as a probability distribution in a latent space to capture the
position variability of desired symbols. As for the distribution of latent space, TGGM employs a Mixture
of Gaussian (MoG), which is a weighted combination of multiple Gaussian components. Each Gaussian
component in MoG corresponds to one cluster. During the inference phase, TGGM assigns an image to a
cluster according to the image’s distribution in the latent space. If the distribution of the image is closer to
the Gaussian component representing desired images than other Gaussian components, TGGM assigns the
image to the desired cluster.
Figure 2.3: The images covering the wetland symbols within an RoI.
The innovative architecture of TGGM enables multiple reconstructions of an input image, allowing it to learn the Gaussian component in the MoG specifically for desired images. Each of TGGM's reconstructions of an input image comes from one Gaussian component in the MoG. The novel design empowers TGGM to optimize the reconstruction from the desired Gaussian component using labeled desired images while simultaneously optimizing the reconstructions from all Gaussian components using all images. In contrast, existing approaches based on the VAE generate a single reconstruction for an input image. Consequently, the optimization process of the existing methods can only update the MoG as a whole or use a subset of the data to update a specific component in the MoG in separate optimization iterations (e.g., a "minibatch"). TGGM's network design allows the individual components in the MoG to be updated separately within the same optimization iteration. Using labeled and unlabeled data in one optimization iteration is important: when the data in some categories are (partially) labeled, an optimization iteration should update the MoG using both labeled and unlabeled data so that the distributions of the corresponding components in the MoG become specific to the partially labeled categories. Consequently, TGGM efficiently groups candidate images into the desired and non-desired clusters with minimal manual annotations, requiring only a few manually labeled desired images. Images belonging to the desired cluster are the training data for CNN training.
TGGM adopts an iterative learning process to gather the desired images within an RoI, as depicted in Figure 2.4. The labeled desired images for TGGM training start from a few manually labeled images and gradually increase as additional images are assigned to the desired cluster. In each iteration's training process, TGGM uses the images assigned to the desired cluster in the previous iteration to optimize the desired Gaussian component. During the inference process of each iteration, TGGM obtains additional desired images within the RoI. TGGM progressively enhances the coverage of the desired cluster by optimizing the desired Gaussian component with the additional labeled desired images. The iteration ends when the number of images in the desired cluster remains unchanged.
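The iterative procedure above can be summarized as a short control-flow sketch. The helper functions train_tggm and assign_cluster below are hypothetical placeholders for TGGM's training and inference steps, so this is an outline of the loop rather than the actual implementation.

```python
def collect_desired_images(candidate_images, seed_labeled, train_tggm, assign_cluster):
    """Iteratively grow the set of desired images within an RoI (cf. Figure 2.4).

    candidate_images: all images cropped from the RoI by the sliding window.
    seed_labeled:     the few manually labeled desired images.
    train_tggm:       hypothetical helper that trains TGGM; the desired Gaussian
                      component is optimized with the current desired set.
    assign_cluster:   hypothetical helper returning "desired" or "non-desired".
    """
    desired = list(seed_labeled)
    while True:
        model = train_tggm(candidate_images, desired)
        new_desired = [img for img in candidate_images
                       if assign_cluster(model, img) == "desired"]
        if len(new_desired) == len(desired):  # cluster size unchanged: stop iterating
            return new_desired                # these become the labeled data for the CNN
        desired = new_desired                 # feed the grown desired set to the next iteration
```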
The main contributions of TGGM are the following:
Figure 2.4: TGGM, the proposed model, iteratively separates images within the RoI (the red region) into the
desired or non-desired cluster. In the first iteration, TGGM uses the few manually labeled wetland symbols
(purple box in the left-most figure) to form the desired and non-desired cluster for wetland symbols. In
the following iterations, TGGM uses the images in the wetland cluster from the previous iteration to collect
more wetland images. The iteration ends when the number of images in the wetland cluster does not change.
The images in the wetland cluster are labeled data to train an extraction model.
• TGGM leverages the probability distribution to effectively capture the position variability of desired
symbols in images.
• By generating multiple reconstructions of an input image, TGGM can optimize one specific Gaussian
component using labeled desired images and all Gaussian components using all images simultane-
ously. Consequently, TGGM forms one specific cluster for desired images. The joint optimization
approach significantly reduces manual annotation, requiring only a few manually labeled desired im-
ages within an RoI.
• The iterative learning allows TGGM to gradually collect most of the desired images within the RoI, starting from a few manually labeled desired images.
The experiment section will show the symbol extraction results using the labeled desired images from TGGM. The extraction achieves an average accuracy of 80%.
2.2 Target-Guided Generative Model (TGGM)
2.2.1 The Notations for the Target-Guided Generative Model
Here are the symbols used for explaining TGGM. $x_t$ represents labeled images covering the desired objects, named target images. $x_u$ represents unlabeled images. $\hat{x}_u$ stands for the unlabeled images generated by TGGM. $z$ is a continuous variable representing the distribution of $x_t$ and $x_u$ in the latent space. $y$ is a categorical variable representing the labels for $x_u$: $y \in \{0, 1\}$, with 1 for target images and 0 for non-target images.
Figure 2.5 shows TGGM's architecture. $f_{inf}$ encodes $x_u$ concatenated with $y$ into $z$, and $f_{gen}$ then decodes $z$ into $\hat{x}_u$. $f_{cls}$ assigns $x_u$ to clusters. Section 2.2.2 describes the distribution of $z$ given $x_u$ and $y$ learned by $f_{inf}$ and $f_{cls}$. Section 2.2.3 describes the generative process involving $f_{prior}$. Section 2.2.5 formulates the evidence lower bounds of the marginal likelihood of $x_u$ and $x_t$ used to optimize TGGM. Sections 2.2.4 and 2.2.6 explain how TGGM leverages labeled target images to form the target and non-target clusters.
Figure 2.5: TGGM's architecture. $f_{inf}$, in the center, encodes $x_u$ concatenated with $y$ into the distribution of the latent variable $z$, and $f_{gen}$ decodes $z$ into $\hat{x}_u$, weighted by $q(y|x_u)$ from $f_{cls}$ (the model on the right). Section 2.2.2 describes $f_{inf}$ and $f_{cls}$. Section 2.2.3 describes $f_{prior}$ (the model on the left). Sections 2.2.5 and 2.2.6 show the optimization procedure.
2.2.2 The Variational Inference (Encoder)
TGGM learns the approximate posterior distribution for $z$, i.e., $q(z|x_u)$, to estimate the true posterior probability $p(z|x_u)$, which is intractable [43]. The approximate posterior probability of $z$ is an MoG with two Gaussian components, shown in eq. 2.1. The parameters of the Gaussian components are learned by $f_{inf}$ in eq. 2.2. The weights of the two components are the probabilities of the latent categorical variable $y$ in eq. 2.3. The latent categorical variable $y$ is learned by $f_{cls}$ activated by the softmax function. The joint approximate posterior probability of $z$ and $y$ is shown in eq. 2.4.

$$q(z|x_u) = \sum_y q(y|x_u)\, q(z|x_u, y) \qquad (2.1)$$

$$q(z|x_u, y) = \mathcal{N}(\tilde{\mu}_z, \tilde{\sigma}_z), \quad [\tilde{\mu}_z, \tilde{\sigma}_z] = f_{inf}(x_u, y) \qquad (2.2)$$

$$q(y|x_u) = \mathrm{Cat}(y \,|\, \pi(x_u)) = \mathrm{softmax}(f_{cls}(x_u)) \qquad (2.3)$$

$$q(z, y|x_u) = q(z|x_u, y)\, q(y|x_u) \qquad (2.4)$$
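To make eq. 2.1-2.4 concrete, a minimal PyTorch sketch of the encoder side is shown below. The module names mirror $f_{inf}$ and $f_{cls}$, but the two-layer MLP structure and the layer sizes are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGGMEncoder(nn.Module):
    """Sketch of f_cls (eq. 2.3) and f_inf (eq. 2.2) for flattened image vectors."""

    def __init__(self, x_dim, z_dim, n_classes=2, hidden=256):
        super().__init__()
        # f_cls predicts q(y|x_u) with a softmax over the two clusters (eq. 2.3).
        self.f_cls = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_classes))
        # f_inf maps x_u concatenated with a one-hot y to the Gaussian
        # parameters of q(z|x_u, y) (eq. 2.2).
        self.f_inf = nn.Sequential(nn.Linear(x_dim + n_classes, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * z_dim))
        self.n_classes = n_classes

    def forward(self, x):
        q_y = F.softmax(self.f_cls(x), dim=-1)  # q(y|x_u), the mixture weights in eq. 2.1
        mus, logvars = [], []
        for k in range(self.n_classes):         # one Gaussian component per cluster
            y = F.one_hot(torch.full((x.size(0),), k), self.n_classes).float()
            mu, logvar = self.f_inf(torch.cat([x, y], dim=-1)).chunk(2, dim=-1)
            mus.append(mu)
            logvars.append(logvar)
        # q(z|x_u) is the MoG whose components q(z|x_u, y) are weighted by q(y|x_u) (eq. 2.1).
        return q_y, torch.stack(mus, dim=1), torch.stack(logvars, dim=1)
```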
2.2.3 The Generative Process for Images (Decoder)
Generating a target image follows the steps below:
1. Sample a latent vector $z$ from $p(z|y=1)$.
2. Compute the distribution of $x_t$, which follows $\mathcal{N}(\mu_{x_t}, \sigma_{x_t})$.
3. Sample an $x$ from $\mathcal{N}(\mu_{x_t}, \sigma_{x_t})$.

The generative process for non-target images is similar to the above process, except that $z$ is sampled from $p(z|y=0)$ and $x$ is sampled from the Gaussian distribution for non-target images. $p(z|y)$ in step 1 follows a Gaussian distribution, shown in eq. 2.5. The parameters of the Gaussian distribution are learned by $f_{prior}(y)$ in eq. 2.6. The parameters of $\mathcal{N}(\mu_{x_t}, \sigma_{x_t})$ in step 2 are learned by $f_{gen}$ in eq. 2.7.

$$p(z|y) = \mathcal{N}(\mu_z, \sigma_z) \qquad (2.5)$$

$$[\mu_z, \sigma_z] = f_{prior}(y) \qquad (2.6)$$

$$[\mu_{x_t}, \sigma_{x_t}] = f_{gen}(z) \qquad (2.7)$$

According to the generative process above, the joint probability $p(x, z, y)$ can be factorized as:

$$p(x, z, y) = p(x|z, y)\, p(z|y)\, p(y) \qquad (2.8)$$

where $p(y)$ is the prior distribution for $y$ defined in eq. 2.9:

$$p(y) = \mathrm{Cat}(1/K) \qquad (2.9)$$

where $K$ is the number of categories. In our case, the number of categories is two: one for the target category and the other for the non-target one. At the beginning of the iterations, TGGM assumes that target and non-target images are uniformly distributed, i.e., $p(y)$ is uniform, since we do not have prior knowledge about the number of target objects in the RoI. This assumption is relaxed in later iterations.
2.2.4 The Generative and Inference Processes for Labeled Images
For the labeled target images, $y$ becomes an observation instead of a latent variable, i.e., $y = 1$. Because $y$ is an observation, TGGM only uses labeled target images to learn one Gaussian component of $z$ and to generate images from that component. As a result, the approximate posterior distribution of $z$ and the generative distribution of $x_t$ can be written as eq. 2.10 and eq. 2.11, respectively.

$$q(z|x_t) = q(z|x_t, y=1) \qquad (2.10)$$

$$p(x_t, z|y=1) = p(x_t|z, y=1)\, p(z|y=1) \qquad (2.11)$$
2.2.5 Evidence Lower Bounds
TGGM aims to separate target and non-target images into two clusters by maximizing the marginal likelihood of unlabeled images, $x_u$. Because the log-likelihood of $x_u$ (the leftmost term in eq. 2.12) is intractable, TGGM optimizes the evidence lower bound ($\mathcal{L}_{ELBO}$, the rightmost term in eq. 2.12) instead.

$$\log p(x_u) = \log \sum_y \int_z p(x_u, z, y)\, dz \;\geq\; \mathbb{E}_{q(z, y|x_u)}\Big[\log \frac{p(x_u, z, y)}{q(z, y|x_u)}\Big] = \mathcal{L}_{ELBO}(x_u) \qquad (2.12)$$
Eq. 2.13 shows the details of $\mathcal{L}_{ELBO}$ for unlabeled images. $\mathcal{L}_{ELBO}$ after using the reparameterization trick proposed in the VAE [43] can be written as:

$$\mathcal{L}_{ELBO}(x_u) = \mathbb{E}_{q(z, y|x_u)}\Big[\log \frac{p(x_u, z, y)}{q(z, y|x_u)}\Big] = \sum_y q(y|x_u) \log p(x_u|z, y) \;-\; \sum_y q(y|x_u)\, \mathrm{KL}\big[q(z|x_u, y) \,\|\, p(z|y)\big] \;-\; \mathrm{KL}\big[q(y|x_u) \,\|\, p(y)\big] \qquad (2.13)$$
According to the generative and variational inference processes for labeled target images, $x_t$, described in the last subsection, TGGM uses $x_t$ to optimize one Gaussian component of $z$ and generate $x_t$ from that component. $\mathcal{L}_{ELBO}$ for $x_t$ is defined in eq. 2.14.

$$\mathcal{L}_{ELBO}(x_t) = \mathbb{E}_{q(z|x_t, y=1)}\Big[\log \frac{p(x_t, z, y=1)}{q(z|x_t, y=1)}\Big] = \log p(x_t|z, y=1) \;-\; \mathrm{KL}\big[q(z|x_t, y=1) \,\|\, p(z|y=1)\big] \qquad (2.14)$$
The total $\mathcal{L}_{ELBO}$, the optimization goal of TGGM, is the summation of $\mathcal{L}_{ELBO}$ for $x_u$ and $x_t$, shown in eq. 2.15.

$$\mathcal{L}_{ELBO}(x) = \mathcal{L}_{ELBO}(x_u) + \mathcal{L}_{ELBO}(x_t) \qquad (2.15)$$
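A hedged sketch of how the combined objective in eq. 2.13-2.15 might be assembled as a training loss (the negative of the total ELBO) is shown below. It assumes Bernoulli reconstruction likelihoods, diagonal Gaussians with closed-form KL terms, and a uniform prior $p(y)$; these are common VAE choices and are assumptions here, not details taken from the text.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians, summed over z."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

def elbo_loss_unlabeled(x_u, x_hat, q_y, mu_q, logvar_q, mu_p, logvar_p):
    """Negative of eq. 2.13; x_hat[:, k] is the reconstruction from component k."""
    n_classes = q_y.size(1)
    recon = sum(
        q_y[:, k] * F.binary_cross_entropy(x_hat[:, k], x_u, reduction="none").sum(-1)
        for k in range(n_classes)
    )  # -sum_y q(y|x) log p(x|z, y)
    kl_z = sum(
        q_y[:, k] * gaussian_kl(mu_q[:, k], logvar_q[:, k], mu_p[k], logvar_p[k])
        for k in range(n_classes)
    )  # sum_y q(y|x) KL[q(z|x, y) || p(z|y)]
    log_prior = torch.log(torch.tensor(1.0 / n_classes))        # uniform p(y), eq. 2.9
    kl_y = torch.sum(q_y * (torch.log(q_y + 1e-8) - log_prior), dim=-1)  # KL[q(y|x) || p(y)]
    return (recon + kl_z + kl_y).mean()

def elbo_loss_labeled(x_t, x_hat_t, mu_q, logvar_q, mu_p1, logvar_p1):
    """Negative of eq. 2.14: only the target component (y = 1) is involved."""
    recon = F.binary_cross_entropy(x_hat_t, x_t, reduction="none").sum(-1)
    kl_z = gaussian_kl(mu_q, logvar_q, mu_p1, logvar_p1)
    return (recon + kl_z).mean()

def total_loss(unlabeled_args, labeled_args):
    # Negative of eq. 2.15: minimize -(L_ELBO(x_u) + L_ELBO(x_t)).
    return elbo_loss_unlabeled(*unlabeled_args) + elbo_loss_labeled(*labeled_args)
```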
2.2.6 Network Design
When the input is a labeled target image $x_t$, the encoder ($f_{inf}$) in TGGM, shown in Figure 2.5, encodes $x_t$ concatenated with $y = 1$ into $q(z|x_t, y=1)$; the decoder ($f_{gen}$) then decodes $z$ sampled from $p(z|y=1)$ into $\hat{x}_t$. In the optimization process, the goal of the second term in eq. 2.14 is for one component of the latent variable $z$ to represent the labeled target images. The goal of the first term in eq. 2.14 is for the decoder to generate $\hat{x}_t$ similar to $x_t$ from the component of $z$ that represents the labeled target images in the latent space. TGGM thus explicitly learns one component of $z$ to represent the labeled target images. We call this learning process the target guidance mechanism.
When the input is an unlabeled image $x_u$, $f_{inf}$ encodes $x_u$ concatenated with $y$ into the latent variable $z$. $f_{gen}$ decodes $z_0$, sampled from $p(z|y=0)$, and $z_1$, sampled from $p(z|y=1)$, into $\hat{x}_{u0}$ and $\hat{x}_{u1}$, respectively. In the optimization process, the goal of the first term in eq. 2.13 is to assign a high weight ($q(y|x_u)$) to $\hat{x}_{u0}$ or $\hat{x}_{u1}$, whichever is more similar to $x_u$. For example, when the input is an unlabeled image covering the target objects, $\hat{x}_{u1}$ from $p(z|y=1)$ should be more similar to $x_u$ than $\hat{x}_{u0}$ from $p(z|y=0)$, because $p(z|y=1)$, optimized with labeled target images through the target guidance mechanism, represents the target images. Therefore, the weight for $\hat{x}_{u1}$ is higher than the weight for $\hat{x}_{u0}$ (i.e., $q(y=1|x_u) > q(y=0|x_u)$). Similarly, the goal of the second term in eq. 2.13 is to assign a high weight ($q(y|x_u)$) to the component of $z$ that has the smaller KL value. For example, when the input is an unlabeled image covering a target object, the KL divergence between $q(z|x_u, y=1)$ and $p(z|y=1)$ should be smaller than the KL divergence between $q(z|x_u, y=0)$ and $p(z|y=0)$ (i.e., $\mathrm{KL}[q(z|x_u, y=1) \,\|\, p(z|y=1)] < \mathrm{KL}[q(z|x_u, y=0) \,\|\, p(z|y=0)]$), because $q(z|x_u, y=1)$ and $p(z|y=1)$, optimized with labeled target images through the target guidance mechanism, represent the target images. Hence, TGGM assigns a higher weight to the component of $z$ that represents the target images (i.e., $q(y=1|x_u) > q(y=0|x_u)$). The weights, $q(y|x_u)$, are the probabilities of an image belonging to the target and non-target clusters. With the target guidance mechanism, TGGM assigns unlabeled images covering the target objects to the target cluster.
2.3 Polygonal Objects Extraction and Vectorization
Extracting the desired polygonal objects from a map amounts to categorizing images across the map into either the desired or the non-desired class. We adopt a sliding window approach to generate the images. The sliding window moves across a map horizontally and vertically and crops an image at every fixed stride of x pixels. The image classifier is VGG [71], a convolutional neural network. The labeled desired images for training VGG come from TGGM. The labeled non-desired images come from two sources. One is to use the sliding window approach to crop images within map areas covered by the external data for other geographic objects. The other is to crop images across the map randomly. The two sources of non-desired images contribute to a comprehensive and diverse training dataset for the image classifier.
During the inference phase, VGG classifies cropped images (windows) generated by the sliding window
approach. The classified desired windows are combined to form polygons representing the desired polygonal
object on the maps. The polygons cover the areas where the classified desired windows overlap or are
adjacent to each other, effectively delineating the boundaries of the desired object on the maps.
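The sliding-window inference step can be sketched as follows. The window size, stride, and the classify callable are placeholders; the sketch simply marks every window classified as desired on a map-sized binary mask, whose merged regions then form the raster polygons described above.

```python
import numpy as np

def sliding_window_mask(map_image, classify, window=80, stride=40):
    """Classify every window and accumulate the desired windows into a binary mask.

    map_image: H x W x 3 array of the scanned map.
    classify:  a callable (e.g., a trained VGG wrapper) returning True when a
               window is predicted to cover the desired symbols.
    """
    h, w = map_image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patch = map_image[top:top + window, left:left + window]
            if classify(patch):
                # overlapping or adjacent desired windows merge into raster polygons
                mask[top:top + window, left:left + window] = 1
    return mask
```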
To obtain the analysis-ready data, we convert the raster polygons into vector data. We use the vectorization function in QGIS* to trace the boundary of the raster polygons and save the locations of the boundaries as the polygon vector data. The vector data are ready for further analysis.
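Outside of QGIS, the same raster-to-vector step can be performed programmatically. The sketch below uses rasterio's shape extraction as a stand-in for the QGIS polygonize tool; this is an assumed alternative tool, not the workflow used in the dissertation.

```python
import numpy as np
from rasterio.features import shapes

def polygonize_mask(mask):
    """Trace the boundaries of the raster polygons (mask == 1) and return them
    as GeoJSON-like polygon geometries in pixel coordinates."""
    mask = mask.astype(np.uint8)
    return [geom for geom, value in shapes(mask, mask=mask == 1) if value == 1]
```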
2.4 Experiment Results and Analysis
2.4.1 Experiment Data
The desired polygonal geographic objects in the experiment are wetlands on six scanned historical topographic maps. These maps belong to three regions with different publication dates and scales: Big Swamp city in California (published in 1993 and 1990, scale 1:24,000), Duncanville city in Texas (published in 1995 and 1959, scale 1:24,000), and Palm Beach city in Florida (published in 1987 and 1956, scale 1:250,000).
All six maps are from the United States Geological Survey (USGS).
For the labeling task, we utilize the external vector data from USGS, published in 2018. TGGM identi-
fies desired wetland symbols from the candidate images provided by USGS vector data.
2.4.2 Evaluation Metrics
We calculate the intersection over union (IoU) between the extracted and the referenced polygons to evaluate the performance of the extraction results. IoU is the ratio of the intersection area to the union area between the extraction results and the referenced polygons. IoU ranges from zero to one; an IoU close to one indicates high similarity between the predicted results and the referenced polygons.
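A minimal sketch of the pixel-level IoU computation between an extracted mask and a referenced mask follows, assuming both sets of polygons have been rasterized onto the same map grid.

```python
import numpy as np

def iou(extracted, referenced):
    """Intersection over union between two binary masks of equal shape."""
    extracted = extracted.astype(bool)
    referenced = referenced.astype(bool)
    union = np.logical_or(extracted, referenced).sum()
    if union == 0:
        return 1.0  # both empty: treat as a perfect match
    return np.logical_and(extracted, referenced).sum() / union
```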
* https://docs.qgis.org/2.8/en/docs/user_manual/processing_algs/gdalogr/gdal_conversion/polygonize.html
2.4.3 Ground Truth Generation
The referenced wetland areas on six maps are manually generated. The annotators create polygons to de-
lineate all wetland areas on the Big Swamp and Duncanville maps. For the Palm Beach area, annotators
annotate the wetland areas for the right half of the maps. As the wetland areas on the Palm Beach maps
cover over 70% of the total area, annotating an entire Palm Beach map would require at least eight hours
of manual work. However, since the wetland areas on the Palm Beach map edited in 1956 are present on
both the right and left halves of the map, and for the Palm Beach map edited in 1987, most wetland areas
are on the right half, evaluating the extraction results for the right half is sufficient to represent the overall
extraction performance of the entire maps.
The referenced wetland polygons were generated by three annotators with diverse Geographical Infor-
mation Science (GIScience) backgrounds. The first annotator is a GIScience expert with a Bachelor’s degree
in GIScience and extensive experience in editing topographic maps. The second annotator’s background lies
in Computer Science, but she possesses substantial knowledge in GIScience and has been working with to-
pographic maps for over four years. The third annotator is skilled in reading topographic maps and proficient
in Computer-aided design tools, which facilitates the learning of GIScience software (QGIS) for drawing
wetland polygons.
These three annotators exhibit varying levels of expertise and experience in GIScience, ranging from
expert to experienced. The process of drawing wetland polygons does not adhere to strict and rigid rules,
making it prone to human bias. By involving annotators with diverse GIScience backgrounds, we seek to
mitigate potential biases and ensure the referenced data are more objective and comprehensive.
The rule for drawing a wetland polygon is to encompass a group of wetland symbols within a topographic map. However, what constitutes a "group" of wetland symbols lacks a precise and rigid definition. Consequently, annotators group wetland symbols based on their own understanding and
interpretation. For example, Figure 2.6 shows three ways to group the same wetland symbols on the map
covering Duncanville, Texas, circa 1995. Based on the different understanding of the maximum interval
among wetland symbols in one wetland polygon, the wetland symbols in the same area can be grouped into
one, two, and three polygons. All three drawing methods are correct since all resulting polygons accurately
fit the boundaries of the wetland symbols on the map. Due to the subjective nature of the grouping process,
different annotators may arrive at slightly different interpretations of wetland groupings. Acknowledging
this inherent variability and considering multiple annotator perspectives is essential to ensure a comprehen-
sive and accurate representation of the wetland polygons.
Figure 2.6: Three different ways to group the wetland areas on the topographic map covering Duncanville,
Texas, circa 1995. The purple areas represent polygons created by three different annotators to depict the
wetland areas.
As the wetland polygons drawn by different annotators may differ, we use IoU to evaluate the difference
among wetland polygons drawn by the three annotators. Table 2.1 shows the evaluation results by calcu-
lating the IoU for all pairs of referenced wetland polygons. The average IoU is over 0.9, indicating high
similarity among the wetland areas drawn by the three annotators. There is only a slight difference in the
boundary of wetland polygons, which depends on the annotators’ individual interpretations of how tightly
the polygons should fit the wetland areas’ boundaries. For example, Figure 2.7 shows the comparisons for
wetland polygons edited by three annotators for the same wetland area on the Big Swamp map circa 1990.
The arrows in Figure 2.7a point out that the expert drew the tightest wetland boundary among the three
annotators, while Figure 2.7c shows that the wetland boundary from the experienced researcher is relatively
loose. As topographic maps do not provide detailed information about the exact boundaries of polygonal
geographic objects, both tight and loose boundaries are possible and acceptable interpretations. The high
IoU values indicate a strong agreement among the annotators in identifying and delineating wetland areas
on the maps.
In addition to delineating boundaries, the presence of other geographic objects’ symbols surrounding
the wetland symbols can lead to variations in the polygons drawn by different annotators. For example,
Figure 2.8 shows the wetland polygons created by the three annotators for the same wetland area on the
Duncanville map circa 1959. The annotators make choices about whether to include or exclude the contour
line areas in the top left area of the map and the river areas on the right side of the map. In reality, one area
may be occupied by more than one geographic object. However, on topographic maps, one area can only be
represented by one geographic object’s symbol. The referenced wetland polygons in both Figure 2.8a and
2.8b are plausible representations of the wetland areas in reality, according to the information provided by
the topographic map. Consequently, the reference editions from different annotators encompass all possible
locations of wetland areas on the topographic maps, considering the uncertainty arising from the presence
of multiple geographic objects within the same area.
Table 2.1: The comparison among three groups of referenced wetland polygons using IoU. Groups 1, 2, and 3 are referenced wetland polygons edited by the GIScience expert, the knowledgeable annotator, and the experienced annotator, respectively.

                 Big Swamp         Duncanville        Palm Beach
                 1990     1993     1959     1995      1956     1987
Group 1 vs. 2    0.9443   0.9365   0.7475   0.8868    0.9566   0.9406
Group 2 vs. 3    0.9509   0.9433   0.8452   0.8944    0.9686   0.9530
Group 1 vs. 3    0.9414   0.9599   0.7582   0.8953    0.9657   0.9459
(a) Group 1 (b) Group 2 (c) Group 3
Figure 2.7: The three groups of referenced wetland polygons are for the Big Swamp map edited in 1993.
The red, orange, and green polygons represent the referenced wetland polygons created by the GIScience
expert, the knowledgeable annotator, and the experienced annotator, respectively.
(a) Group 1 (b) Group 2 (c) Group 3
Figure 2.8: The three groups of referenced wetland polygons are for the Duncanville map edited in 1959.
The red, orange, and green polygons represent the referenced wetland polygons created by the GIScience
expert, the knowledgeable annotator, and the experienced annotator, respectively.
2.4.4 Experiment Settings
Hyperparameter Settings The hyperparameters come from two parts: data generation and model training. In the data generation process, both TGGM and VGG need images cropped by a sliding window with a fixed pixel stride. Because the dimensions and scales of the map sheets differ, the sizes of the wetland symbols and the gaps between two wetland symbols also differ. We set the stride sizes and the dimensions of the sliding windows according to the sizes of the wetland symbols on the maps. For the Big Swamp map edited in 1990 and the Palm Beach map edited in 1956, the dimensions of the sliding windows and the stride size are 48 × 48 pixels and 20 pixels, respectively. For the rest of the testing maps, the dimensions of the sliding windows and the stride size are 80 × 80 pixels and 40 pixels, respectively.
29
For the model settings, all submodels in TGGM are multilayer perceptrons with two fully connected layers. TGGM is optimized by the Adam optimizer with a learning rate of 1e-3. The iterative learning of TGGM for all tasks ends after around seven iterations. Each iteration converges in around 200 epochs. For VGG's settings, we employ VGG-16, which is 16 layers deep. VGG-16 is optimized by the Adam optimizer with a learning rate of 1e-4. The training process of VGG-16 converges in around 100 epochs.
Data Augmentation To augment the number of labeled target images, we translate the positions of the target objects within the images. For the 48 × 48-pixel images, the target objects shift along the x- and y-axes within the range of [-10 pixels, 10 pixels]. For the 80 × 80-pixel images, the target objects shift along the x- and y-axes within the range of [-20 pixels, 20 pixels].
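The translation-based augmentation can be sketched with a simple array shift. The shift ranges mirror the values above; the white background fill and the number of copies are assumptions for illustration.

```python
import numpy as np

def translate_augment(image, max_shift, n_copies=10, fill=255, seed=0):
    """Create augmented copies by shifting the symbol inside the crop.

    image:     H x W x C crop containing a labeled target symbol.
    max_shift: 10 for 48x48-pixel crops, 20 for 80x80-pixel crops.
    fill:      value used for the area exposed by the shift (white background assumed).
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    copies = []
    for _ in range(n_copies):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.full_like(image, fill)
        src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        shifted[max(0, dy):max(0, dy) + src.shape[0],
                max(0, dx):max(0, dx) + src.shape[1]] = src
        copies.append(shifted)
    return copies
```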
2.4.5 Experiment Results and Analysis
The average IoU for two Big Swamp maps in Table 2.2 is over 0.8. Figure 2.9a - 2.9f show the extraction
results and three groups of referenced wetland polygons. The purple polygons are the extraction results
from VGG. The red, orange, and green polygons represent the referenced wetland polygons created by the
GIScience expert, the knowledgeable annotator, and the experienced annotator, respectively. The extracted
wetland boundaries in Figures 2.9a - 2.9f are less precise than the referenced wetland polygons. The
approximate extracted boundary is the major reason for the loss of IoU. The classification of images situated
at the boundary of wetland areas is the reason for the approximate boundaries. Despite the images situated at
the boundary of wetland areas encompassing only partial wetland symbols, VGG classifies these images into
the desired object class. Images covering a partial wetland symbol include more background areas compared
to images covering a complete wetland symbol. Consequently, the images covering partial wetland symbols
lead to the extracted boundaries encompassing more background areas than what is present in the referenced
wetland polygons. The disparity in boundary coverage between extraction and referenced wetland polygons
contributes to the observed IoU loss. Although the extracted boundaries do not tightly fit the boundaries of
the wetland area on the maps, VGG can extract both small and large wetland areas from two Big Swamp
maps. Figures 2.9a - 2.9c show that VGG can extract small wetland areas from the Big Swamp map edited
in 1990. Figures 2.9d - 2.9f show that VGG can also extract large wetland area from the Big Swamp map
edited in 1993. Despite not achieving perfect boundary alignment, VGG successfully identifies wetland
areas of varying sizes from both maps.
Table 2.2: The evaluation for the wetland extraction using IoU.

           Big Swamp         Duncanville        Palm Beach
           1990     1993     1959     1995      1956     1987
Group 1    0.8243   0.8734   0.6813   0.7577    0.8749   0.8142
Group 2    0.8156   0.8617   0.6867   0.7703    0.8744   0.8257
Group 3    0.8086   0.8506   0.7040   0.7468    0.8691   0.8284
Avg.       0.8161   0.8619   0.6907   0.7583    0.8728   0.8228
The average IoU for two Duncanville maps in Table 2.2 is 0.6907 and 0.7583, respectively. Besides the
approximate extracted boundaries, the overlapped objects and incomplete symbols contribute to the reduced
IoU. The top zoom-in figures in Figures 2.9g - 2.9i show the contour line next to the wetland symbols. As
the area lacks wetland symbols, the extracted polygon does not cover it. However, the annotators consider it
a wetland area that overlaps with the contour line. Hence, all three groups of referenced wetland polygons
include this area. When other geographic objects are in close proximity to wetland symbols, it is challenging
to determine if the area occupied by the other object’s symbol is exclusively for that object or if it overlaps
with the wetlands. Consequently, both referenced wetland polygons and the extractions from VGG represent
possible wetland areas in Duncanville.
The presence of incomplete wetland symbols is another reason for the low IoU in the Duncanville map.
The bottom zoom-in figures in Figures 2.9g - 2.9i show that the map uses short horizontal blue lines
to represent the simplified wetland symbols. VGG learns the bluegrass symbol as the representation of
wetlands. Therefore, VGG fails to extract the wetland area represented by the short horizontal blue lines.
However, the annotators recognize that the short horizontal blue lines represent wetlands since the short
horizontal blue lines do not represent any other objects on the map and are located close to the completed
wetland symbols. Therefore, the referenced wetland polygons include the area represented by the short
horizontal blue lines. Although VGG cannot extract the wetland areas represented by the short horizontal
blue lines, VGG extracts the wetland areas represented by completed wetland symbols.
The average IoU for two Palm Beach maps in Table 2.2 exceeds 0.8. For the map edited in 1956, the
disparity between the referenced wetland polygons and the extracted wetland polygons lies in the areas
occupied by other geographic objects, specifically the lakes. The arrows in Figure 2.9m - 2.9o point out
that the areas occupied by the lakes are within the referenced wetland polygons but outside the extracted
polygons. The purple polygons are the extraction results from VGG. The red, orange, and green polygons
represent the referenced wetland polygons created by the GIScience expert, the knowledgeable annotator,
and the experienced annotator, respectively. The annotators believe that the areas covered by the lakes on
the map represent a combination of wetlands and lakes in reality, so they include these areas within the
referenced wetland polygons. However, VGG does not extract these areas as wetlands because the wetland
symbols do not appear within the regions. For the map edited in 1987, the extracted wetland areas have
false locations. Arrows in Figure 2.9p - 2.9r point out the false extracted wetland areas. We can see that
the background areas for wetlands on the map are more complex than the other maps. Wetlands are situated
around roads and buildings, creating intricate backgrounds, whereas wetlands on the other maps appear in
white or green backgrounds. The false extractions include blue lines in backgrounds similar to the actual
wetland area on the map. The similarity in backgrounds contributes to the false extractions. However, it’s
important to note that only small wetland areas are present in the map’s complex backgrounds. Achieving
an IoU of 0.8228 demonstrates that VGG successfully extracts the majority of wetland areas on the map.
Comparing the extracted wetland areas between the Duncanville map edited in 1959 and 1995, repre-
sented by purple polygons in Figure 2.9g - 2.9l, reveals noticeable differences and newly emerged wetland
areas. In the map edited in 1995, the bottom wetland area in Figure 2.9j - 2.9l does not exist in the map
edited in 1959. Additionally, the wetland area in the top right corner of the map edited in 1995 is larger
than the corresponding area in the map edited in 1959. Notably, there was a lake in the middle of the wet-
land areas in the map edited in 1959, but in the map edited in 1995, the lake disappeared, and the area had
transformed into wetlands. The detected changes demonstrate the effectiveness of automatically extracting
wetland areas from the maps to detect and highlight changes in wetland areas over time. The automated ap-
proach provides valuable insights into the dynamic nature of wetland landscapes and their evolution across
different historical map editions.
(a) Big Swamp, 1990, group 1 (b) Big Swamp, 1990, group 2 (c) Big Swamp, 1990, group 3
(d) Big Swamp, 1993, group 1 (e) Big Swamp, 1993, group 2 (f) Big Swamp, 1993, group 3
(g) Duncanville, 1959, Group 1 (h) Duncanville, 1959, Group 2 (i) Duncanville, 1959, Group 3
(j) Duncanville, 1995, Group 1 (k) Duncanville, 1995, Group 2 (l) Duncanville, 1995, Group 3
(m) Palm Beach, 1956, Group 1 (n) Palm Beach, 1956, Group 2 (o) Palm Beach, 1956, Group 3
(p) Palm Beach, 1987, Group 1 (q) Palm Beach, 1987, Group 2 (r) Palm Beach, 1987, Group 3
Figure 2.9: The visualizations of extraction results from the six maps. The purple areas are the extracted
polygons for the wetland areas. The red, orange, and green polygons represent the referenced wetland
polygons generated by the expert, the knowledgeable annotator, and the experienced annotator, respectively.
The caption below each figure follows the format "X, Y, Z". X is the city name covered by the map, Y is the map editing year, and Z is the reference group. Groups 1, 2, and 3 are the referenced wetland polygons from the
expert, the knowledgeable annotator, and the experienced annotator, respectively.
2.5 Related Work
This section briefly summarizes how SOTA object detectors deal with limited labeled images.
Weakly supervised object detection [24, 34, 42, 47, 49, 60, 68, 72, 97, 102, 109], aiming to localize
objects with image-level annotations, has attracted attention because bounding box annotations require in-
tensive manual work. However, existing weakly supervised object detectors focus on scenic images, such as
the PASCAL VOC Challenge [21]. The desired objects, such as cats, are usually large and notable in scenic
images. Because of the notability property and proportional size of target objects in scenic images, exist-
ing methods transfer prior knowledge learned from scenic images [34, 68] or utilize the multiple instances
learning methods [24] to localize the objects using image annotations. However, the notability property of
desired objects is usually not valid in topographic map images [59, 96]. The target objects on topographic
map images, like wetland symbols, are small, densely clustered, or spatially dispersed over large areas [67].
Therefore, the weakly supervised object detectors cannot directly apply to the polygonal object extraction
task on topographic map images.
Sliding-window-based detectors are commonly used to detect if a window covers small desired objects in
overhead images [25, 59, 67, 86] because SOTA detectors [61, 62] lose the information about small objects
when taking entire large-dimension images as the inputs due to the down-sampling operations. The sliding
window is a straightforward way to search for small objects across large-dimension images. Detecting if
the sliding window contains the desired objects is equivalent to classifying the images into desired or non-
desired categories. The SOTA image classifiers without or with a limited number of labeled images fall into
two categories: unsupervised and semi-supervised methods.
Unsupervised methods separate data into multiple clusters by identifying separable patterns in the data.
Generative models under the V AE framework are common unsupervised clustering methods [14, 39, 53, 54,
99], which learn the distribution to represent and separate inputs in the latent space. Generative clustering
models have shown impressive results by learning flexible distribution representations in the latent space.
However, images in sliding windows have many separable patterns, such as light-vs.-dark or desired-vs.-
non-desired patterns. The methods do not produce desired and non-desired clusters when unsupervised
methods use other separable patterns to cluster the images. In contrast, the target guidance mechanism
guides TGGM to separate images in sliding windows into the desired and non-desired clusters.
Semi-supervised methods [20, 35, 37, 44, 50, 53, 66, 93, 94, 98, 101] aim to automatically annotate large
amounts of unlabeled data using a small set of labeled data in each category. Recent work [44, 53] shows
that unsupervised generative clustering models can convert to semi-supervised models easily by adding a
cross-entropy loss for labeled data. The cross-entropy loss optimized by labeled data helps to improve
the clustering results. However, labeling both desired and non-desired images can require a huge amount
of manual work. In contrast, TGGM reduces the manual work for non-desired image annotations and
separates images into the desired and non-desired categories by leveraging the limited object categories
property within the RoI provided by the external vector data.
Chapter 3
Automatic Training Data Generation for Linear Geographic Object Extraction
This chapter presents my work exploiting external data to automatically generate training data for linear
object detection on scanned topographic maps. Thousands of scanned historical topographic maps contain
valuable information covering long periods of time, such as hydrography changes over time. Efficiently
unlocking the information in maps requires a large amount of labeled data for training a geographic object
extraction system. However, manually annotating data is expensive and time-consuming. To address the
challenge of limited labeled data, this chapter introduces my automatic method to generate labeled data by
leveraging external data. The rest of this chapter first presents the details of the proposed Automatic Label
Generation (ALG) algorithm. Secondly, the experiments show the label accuracy of the ALG algorithm on
two linear objects: railroads and waterlines, on scanned historical topographic maps.
3.1 Introduction
The availability of large-scale pixel-level annotated datasets allows semantic segmentation models, specifi-
cally Convolutional Neural Networks (CNN), to extract linear objects from scanned topographic maps auto-
matically. However, producing high-quality pixel-level annotations is a costly and time-consuming process.
External vector data covering the same areas as the topographic maps can help generate labeled training data
38
automatically. However, the labels from the external vector data might suffer from a misalignment error.
Figure 3.1 is an example of the misalignment error in the labeled railroads from the external vector data on a
scanned topographic map. In Figure 3.1, the green line represents the pixel-level railroad labels obtained
from the rasterized railroad vector data, while the black line with crosses represents the true railroads on the
map. It is apparent that the labels from the railroad vector data are several pixels away from the railroads
on the map. The position discrepancy is due to the different scales and coordinate projection systems be-
tween the external vector data and the maps. Misaligned labels can confuse a semantic segmentation model,
leading to inaccurate extraction of object locations on the maps.
Figure 3.1: The misaligned railroad labels (depicted by the green area) are from the external vector data.
The green area is the buffer zone of the external vector data. The buffer width is equivalent to the railroad’s
width on the map. The railroad is the black line with crosses on the scanned topographic map.
Leveraging the color information from map images, along with the location and shape information from
the external vector data, helps reduce misaligned labels. Topographic maps utilize homogeneous colors to
represent linear objects. The color information allows for finding many candidates for desired linear objects
across the topographic maps. The location information helps to reduce the number of candidates for desired
linear objects. Given that the vector data are situated in close proximity to the desired linear objects on the
maps, we only focus on the candidate linear objects near the vector data. Furthermore, the nearby area
may encompass multiple candidate linear objects. To accurately label the desired linear objects, we can
compare the shape similarity between the candidate linear objects and the external vector data. The labeled
desired linear objects have the highest shape similarity to the vector data among all candidates. In summary,
considering homogeneous colors, proximity to the vector data, and similarity in shape to the vector data
ensures the accurate labeling of desired linear objects on topographic maps.
Existing methods use only color and location information during or before training CNN (semantic
segmentation models) to reduce the impact of annotation errors on detection accuracy. To deal with noisy
annotations during training, existing models propose the noise-aware loss functions [2, 41, 92]. For example,
the road detector [92] adds the normalized cut loss (Ncut) during the training process to address the inac-
curate detection caused by noisy labels from OpenStreetMap (OSM), the external data. The Ncut loss uses
color and location information by calculating the similarities among the detected pixels in the color (RGB)
and spatial (XY) spaces. The underlying assumption is that true road pixels should exhibit similar colors and
be spatially connected. By minimizing the Ncut loss, the road detector encourages the detected road pixels
to have similar colors and maintain spatial proximity. However, the effectiveness of the noise-aware loss is
contingent upon the extent of misaligned annotations. If misaligned annotations, as depicted in Figure 3.1,
constitute a significant portion of the annotations, CNN struggles to learn accurate representations of the
desired linear objects. Consequently, training the CNN with predominantly misaligned annotations leads to
poor detection results.
To improve annotation accuracy for non-minor annotation noise, various existing detection systems de-
sign algorithms prior to the training process. For example, a recent vector-to-raster alignment algorithm [18]
leverages color and location information to relocate the vector data in topographic maps. The relocation
process aims to position the vector data in a map area that falls within the color range of the desired linear
objects and is close to the original location of the vector data in the maps. The relocated vector data are
often at the boundaries of the desired linear objects, which are also in the color range. However, relocation
to the boundary causes misaligned annotations. The annotations are in the map area overlapped with the
buffer zone of the relocated vector data. The buffer zone represents an expanded region around the relocated
vector data. Since the relocated vector data align with the boundary of desired linear objects, the buffer
zone extends to the background area surrounding the desired linear objects. Consequently, the resulting
annotations falsely encompass pixels surrounding the desired linear objects on the maps. Therefore, only
color and location information is insufficient to label desired linear objects accurately.
The proposed Automatic Label Generation (ALG) algorithm utilizes color, location, and shape infor-
mation to identify the desired foreground areas, which are the labeled desired linear objects for training
semantic segmentation models. The desired foreground areas have homogeneous colors, are in close prox-
imity to the vector data, and possess a similar shape to the vector data. To measure the shape similarity,
the ALG algorithm calculates the pixel-level overlap size between the desired foreground area and buffered
vector data. However, direct calculation of the overlap area can result in low shape similarity. For exam-
ple, in Figure 3.1, the overlap size between the rasterized railroad vector data and the railroad on the map
is calculated by the number of overlapping pixels between the green area and the railroad’s pixels on the
map. As we can see, the rasterized railroad vector data (the green area) and the railroad on the map have
a low shape similarity, as indicated by the small number of overlapping pixels between the green area and
the railroad on the map in Figure 3.1. The ALG algorithm introduces an innovative approach to increase
shape similarity: calculating an affine transformation to move the rasterized vector data toward the desired
foreground area. An affine transformation involves translation, rotation, scaling, and shearing operations ap-
plied to the rasterized vector data. For utilizing color and location information, the ALG algorithm assigns
pixels to foregrounds or backgrounds according to color homogeneity within a specified area-of-interest.
The area-of-interest is a map area that overlaps with the buffer zone of the vector data. The buffer zone
width is chosen to encompass the desired linear objects on the maps sufficiently. In summary, the ALG
algorithm innovatively leverages the object’s shape from the external vector data plus the color and location
information to ensure that the identified foreground areas are the desired linear objects on topographic maps.
The main contribution of the ALG algorithm is to leverage an affine transformation to harness shape
information from the external vector data. Using color, location, and shape information, the ALG algorithm
obtains accurate labels for the desired linear objects on scanned historical topographic maps. The experiment
section shows that the labeled linear objects from the ALG algorithm are more accurate than the SOTA
baseline, which only uses color and location information.
3.2 Automatic Label Generation (ALG) Algorithm Using Object's Shape
3.2.1 Chan-Vese Algorithm (Foreground Detection without Leveraging Object's Shape)
The ALG algorithm builds upon the Chan-Vese algorithm by incorporating a shape similarity criterion.
Thus, this subsection briefly summarizes the Chan-Vese algorithm as a foundation for understanding the
ALG algorithm.
The Chan-Vese algorithm [6], an image segmentation algorithm, assigns pixels in an image into the
foreground and background areas according to the pixels’ colors. The foreground pixels have colors closer
to the average color of foreground areas than the average color of background areas in an image. Similarly,
the background pixels have colors closer to the average color of background areas than the average color of
foreground areas in an image. In the optimization process, the Chan-Vese algorithm iteratively optimizes the
objective function to increase color homogeneity within the foreground and background areas. The objective
function is as follows:
$$E(c_1, c_2, \phi) = \int_\Omega \big\{(u(x,y) - c_1)^2 H(\phi(x,y)) + (u(x,y) - c_2)^2 (1 - H(\phi(x,y)))\big\}\, dx\, dy \qquad (3.1)$$

where $u(x,y)$ is the pixel color at coordinate $(x,y)$ in an image, which is defined on $\Omega \to \mathbb{R}^2$. $c_1$ and $c_2$ are two scalar variables representing the average colors of the foreground and background areas, respectively. $\phi(x,y)$ is a signed distance function describing the foreground and background areas in an image. Eq. 3.2 below defines a signed distance function:
$$\phi(x,y) = \begin{cases} 0, & (x,y) \text{ on the foreground boundary} \\ > 0, & (x,y) \text{ inside the foreground boundary} \\ < 0, & (x,y) \text{ outside the foreground boundary} \end{cases} \qquad (3.2)$$
$H(\phi(x,y))$ in eq. 3.1 is a Heaviside function, which binarizes the foreground and background areas in an image. Eq. 3.3 below defines a Heaviside function:

$$H(\phi(x,y)) = \begin{cases} 1, & (x,y) \text{ inside or on the foreground boundary} \\ 0, & (x,y) \text{ outside the foreground boundary} \end{cases} \qquad (3.3)$$
During the optimization process, the Chan-Vese algorithm iteratively updates $c_1$, $c_2$, and $\phi$ to minimize the objective function in eq. 3.1 until the pixel assignments no longer change. The following equations describe the optimization of $c_1$, $c_2$, and $\phi(x,y)$ in each iteration:

$$c_1 = \frac{\int_\Omega u(x,y)\, H(\phi(x,y))\, dx\, dy}{\int_\Omega H(\phi(x,y))\, dx\, dy} \qquad (3.4)$$

$$c_2 = \frac{\int_\Omega u(x,y)\, (1 - H(\phi(x,y)))\, dx\, dy}{\int_\Omega (1 - H(\phi(x,y)))\, dx\, dy} \qquad (3.5)$$

$$\frac{\partial \phi(x,y)}{\partial t} = -\big[(u(x,y) - c_1)^2 - (u(x,y) - c_2)^2\big]\, \delta(\phi(x,y)) \qquad (3.6)$$
In each iteration, the average colors of the foreground and background areas are calculated based on eq. 3.4 and 3.5, respectively. $H(\phi(x,y))$ in eq. 3.4 represents the foreground areas, while $(1 - H(\phi(x,y)))$ in eq. 3.5 represents the background areas. In each iteration, the Chan-Vese algorithm updates a pixel as foreground if the pixel color ($u(x,y)$) is closer to the foreground area's average color ($c_1$) than to the background area's average color ($c_2$), according to eq. 3.6. In practice, the Chan-Vese algorithm ignores $\delta(\phi(x,y))$ in eq. 3.6, the derivative of the Heaviside function, to accelerate the optimization process. Song et al. [73] provide a detailed proof of the correctness of the fast optimization of the Chan-Vese algorithm.
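A compact sketch of the fast two-phase Chan-Vese iteration described above (alternating the mean updates of eq. 3.4 and 3.5 with the pixel reassignment implied by eq. 3.6, with the delta term ignored) is shown below. It operates on a grayscale image and a binary foreground mask, which is a simplification of the level-set formulation.

```python
import numpy as np

def chan_vese_fast(image, init_mask, max_iter=200):
    """Fast two-phase Chan-Vese: alternate between the mean updates (eq. 3.4, 3.5)
    and the pixel reassignment implied by eq. 3.6 with delta(phi) ignored."""
    u = image.astype(float)
    fg = init_mask.astype(bool)
    for _ in range(max_iter):
        c1 = u[fg].mean() if fg.any() else 0.0        # average foreground color
        c2 = u[~fg].mean() if (~fg).any() else 0.0    # average background color
        new_fg = (u - c1) ** 2 < (u - c2) ** 2        # closer to c1 -> foreground
        if np.array_equal(new_fg, fg):                # assignments unchanged -> converged
            break
        fg = new_fg
    return fg
```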
To obtain the foreground areas only for the desired linear objects, the ALG algorithm adds the term
for the shape similarity in the objective function. While previous methods [7, 12] also incorporate a shape
term in the Chan-Vese algorithm, their shape similarity measurement assumes that the object’s shape and
the desired linear object are positioned identically and share the same scales and orientations. However, a
spatial transformation exists between the object’s shape provided by the external vector data and the desired
linear objects on maps due to the different scales and coordinate systems between the external vector data
and maps. Consequently, existing methods [7, 12] cannot directly utilize the object’s shape from the external
vector data to find the desired foreground area. To mimic the spatial transformation between the external vector
data and the desired linear objects on the maps, the ALG algorithm leverages an affine transformation to
move the object’s shape (the rasterized external vector data) toward the foreground area.
3.2.2 Area-of-Interest and Object's Shape Representation in the ALG Algorithm
Following the Chan-Vese algorithm’s utilization of the signed distance function to represent foreground
areas in images, the ALG algorithm also adopts the signed distance function (eq. 3.2) to represent the area-
of-interest and object’s shape. Given that the external vector data are situated several pixels away from the
desired linear objects on the maps, the ALG algorithm focuses on finding the desired foreground area within
the area-of-interest, which is in close proximity to the external vector data in maps. Therefore, an area-
of-interest refers to the map area that overlaps with the buffer zone of the external vector data, effectively
encompassing the nearby desired linear objects. For example, in Figure 3.2, the green area represents an
area-of-interest for a railroad. The railroad is a black line with crosses. To represent an area-of-interest, the
signed distance function assigns positive values to pixels inside the area-of-interest and negative values to
pixels outside the area-of-interest.
Figure 3.2: The green and blue areas represent the area-of-interest and the object’s shape for the railroad on
a scanned topographic map, respectively. The railroad is the black line with crosses on the map.
Similar to the area-of-interest, the object’s shape also corresponds to the map region that overlaps with
the buffer zone of external vector data on the maps. However, unlike the area-of-interest, the buffer zone
for an object’s shape is the same as the width of desired linear objects on the maps. For example, the blue
area in Figure 3.2 exemplifies the railroad’s shape. The signed distance function assigns positive values to
pixels inside the object’s shape and negative values to pixels outside the object’s shape. As we can see, the
railroad’s shape (the blue area) is not aligned with the railroad’s location in Figure 3.2. The misalignment
poses a challenge for measuring the shape similarity between the object’s shape and the desired linear
objects on the maps. To address the misalignment, the ALG algorithm applies an affine transformation
on the object’s shape represented by the signed distance function. The transformation aims to adjust the
location of the object’s shape toward the location of the desired linear objects on maps. Eq. 3.7 represents
applying an affine transformation to an object’s shape represented by a signed distance function.
$$\phi_2(x,y) = s\,\phi_1\Big[\tfrac{1}{s}\big((x - tr_x)(\cos\theta + sh_y \sin\theta) + (y - tr_y)(\sin\theta + sh_x \cos\theta)\big),\; \tfrac{1}{s}\big((x - tr_x)(-\sin\theta + sh_y \cos\theta) + (y - tr_y)(\cos\theta - sh_x \sin\theta)\big)\Big] \qquad (3.7)$$

where $\phi(\cdot)$ is the signed distance function. $\phi_1(\cdot)$ and $\phi_2(\cdot)$ represent an object's shape before and after an affine transformation, respectively. The parameters of an affine transformation are the translation along the x-axis ($tr_x$), the translation along the y-axis ($tr_y$), the scale ($s$), the rotation angle ($\theta$), the shear along the x-axis ($sh_x$), and the shear along the y-axis ($sh_y$).
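Applying an affine transformation such as eq. 3.7 to a rasterized shape can be sketched with SciPy. The sketch below moves a shape raster (a binary mask or signed distance map) by translation, rotation, scale, and shear; the composition order of the matrices and the row/column coordinate convention are simplifying assumptions, not the exact parameterization of eq. 3.7.

```python
import numpy as np
from scipy.ndimage import affine_transform

def transform_shape(shape_raster, tr_x=0.0, tr_y=0.0, s=1.0, theta=0.0, sh_x=0.0, sh_y=0.0):
    """Move a rasterized object's shape by an affine transform (cf. eq. 3.7).

    shape_raster: 2D array (binary mask or signed distance map) of the shape.
    The forward transform is p' = A @ p + t; scipy expects the inverse mapping,
    so we pass inv(A) and the corresponding offset.
    """
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    shear = np.array([[1.0, sh_x],
                      [sh_y, 1.0]])
    A = s * rot @ shear                        # scale, rotation, and shear combined (assumed order)
    t = np.array([tr_y, tr_x])                 # (row, col) translation
    A_inv = np.linalg.inv(A)
    return affine_transform(shape_raster, A_inv, offset=-A_inv @ t,
                            order=1, cval=shape_raster.min())
```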
3.2.3 The Objective Function for the ALG Algorithm
The objective function represents two goals of the ALG algorithm to find the desired foreground area in map
images. One goal is grouping the images into foreground and background areas by increasing color homo-
geneity. The foreground areas include the desired linear objects. The other goal is maximizing the similarity
between the foreground area and the object’s shape within an area-of-interest. The desired foreground area
corresponds to the foreground area within an area-of-interest that exhibits the highest shape similarity to the
object’s shape. The objective function of the ALG algorithm representing the two goals is:
$$E(c_1, c_2, \phi, L, \psi) = \int_\Omega (u - c_1)^2 H(\phi) + (u - c_2)^2 (1 - H(\phi))\, dx\, dy \;+\; \lambda \int_\Omega (H(\phi)H(L) - H(\psi))^2\, dx\, dy \qquad (3.8)$$

where the first and second terms in eq. 3.8 represent the color homogeneity goal and the shape similarity goal, respectively. The weight $\lambda$ in the second term balances the magnitudes of the first and second terms. The first term in eq. 3.8 is the same as the objective function of the Chan-Vese algorithm, as explained in Section 3.2.1. In the second term of eq. 3.8, the Heaviside function (eq. 3.3), denoted by $H(L)$ and $H(\psi)$, assigns binary values, $\{0, 1\}$, to pixels. Specifically, the Heaviside function assigns ones to pixels inside the area-of-interest or the object's shape, and zeros to pixels outside the area-of-interest or the object's shape. Section 3.2.2 elaborates on $H(L)$ and $H(\psi)$ in detail. $H(\phi)H(L)$ represents the desired foreground area, which is the intersection between the area-of-interest and the foreground areas. $H(\phi)H(L) - H(\psi)$ quantifies the shape similarity between the object's shape and the foreground area within the area-of-interest.
3.2.4 The Optimization Process of the ALG Algorithm
The ALG algorithm iteratively increases the color homogeneity of the desired foreground area and the shape similarity by minimizing the objective function in eq. 3.8. In each iteration, the ALG algorithm minimizes the objective function in two steps. In the first step, the ALG algorithm updates the average foreground and background colors ($c_1$ and $c_2$) according to eq. 3.4 and 3.5. The second step updates the foreground areas, the area-of-interest, and the object's shape according to eq. 3.9, 3.10, and 3.11, respectively. The iterative optimization ends when the pixel assignments to the foregrounds, the area-of-interest, and the object's shape are unchanged.

$$\frac{\partial \phi}{\partial t} = -\big[(u - c_1)^2 - (u - c_2)^2 + 2\lambda H(L)(H(\phi)H(L) - H(\psi))\big]\,\delta(\phi) \qquad (3.9)$$

$$\frac{\partial L}{\partial t} = -2\lambda H(\phi)\big[H(\phi)H(L) - H(\psi)\big]\,\delta(L) \qquad (3.10)$$

$$\frac{\partial \psi}{\partial t} = -2\lambda \big[H(\phi)H(L) - H(\psi)\big]\,\delta(\psi) \qquad (3.11)$$
For the foreground area updates in eq. 3.9, the ALG algorithm assigns pixels to foregrounds and backgrounds based on the pixels' colors, as well as the area-of-interest and the object's shape. Specifically, the ALG algorithm assigns a pixel to the foreground areas according to two criteria. One criterion considers the pixel's color, represented by $(u - c_1)^2 - (u - c_2)^2$ in eq. 3.9. A foreground pixel has a color value closer to the average foreground color ($c_1$) than to the average background color ($c_2$). The other criterion involves the pixel's location, represented by $H(L)(H(\phi)H(L) - H(\psi))$ in eq. 3.9. A foreground pixel is within the overlap area of the area-of-interest and the object's shape. The weight $\lambda$ balances the magnitudes of the color and location terms, allowing the method to consider both criteria when assigning pixels to foregrounds and backgrounds. In practice, the ALG algorithm, following the optimization strategy of the Chan-Vese algorithm, ignores $\delta(\phi)$ in eq. 3.9 to accelerate the optimization process.
For the area-of-interest updates in eq. 3.10, the area-of-interest's pixels are in the intersection between the foreground area and the object's shape, i.e., $H(\phi)[H(\phi)H(L) - H(\psi)] \geq 0$. To accelerate the optimization of the area-of-interest, the ALG algorithm only shrinks the size of the area-of-interest, following the optimization strategy proposed by Peng et al. [58]. Additionally, the area-of-interest optimization also ignores $\delta(L)$ in eq. 3.10, similar to the foreground updates.
The object's shape update in eq. 3.11 consists of two steps. In the first step, the ALG algorithm updates the parameters of an affine transformation. Eq. 3.12 to 3.17 describe the parameter updates of the affine transformation in each iteration. The updates of the affine transformation aim to move the object's shape toward the desired foreground area. In the second step, the ALG algorithm updates the object's shape according to eq. 3.11. A pixel in the object's shape is in the intersection area between the desired foreground area and the object's shape, i.e., $[H(\phi)H(L) - H(\psi)] \geq 0$.
$$\frac{\partial tr_x}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{\psi_{0x}(x^*, y^*)(\cos\theta + sh_y \sin\theta) - \psi_{0y}(x^*, y^*)(\sin\theta + sh_x \cos\theta)\big\}\,\delta(\phi)\, dx\, dy \qquad (3.12)$$

$$\frac{\partial tr_y}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{\psi_{0x}(x^*, y^*)(\sin\theta + sh_x \cos\theta) + \psi_{0y}(x^*, y^*)(\cos\theta - sh_x \sin\theta)\big\}\,\delta(\phi)\, dx\, dy \qquad (3.13)$$

$$\frac{\partial s}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{-\psi_0(x^*, y^*) + \psi_{0x}(x^*, y^*)\, x^* + \psi_{0y}(x^*, y^*)\, y^*\big\}\,\delta(\phi)\, dx\, dy \qquad (3.14)$$

$$\frac{\partial \theta}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{-s\,\psi_{0x}(x^*, y^*)\, y^* + s\,\psi_{0y}(x^*, y^*)\, x^*\big\}\,\delta(\phi)\, dx\, dy \qquad (3.15)$$

$$\frac{\partial sh_x}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{\psi_{0x}(x^*, y^*)\big((x - a)\cos\theta + (y - b)\sin\theta\big)\big\}\,\delta(\phi)\, dx\, dy \qquad (3.16)$$

$$\frac{\partial sh_y}{\partial t} = \int_\Omega -(H(\phi)H(L) - H(\psi))\,\big\{\psi_{0x}(x^*, y^*)\big((x - a)\cos\theta + (y - b)\sin\theta\big)\big\}\,\delta(\phi)\, dx\, dy \qquad (3.17)$$

where $\psi_{0x}$ and $\psi_{0y}$ are the partial derivatives of $\psi_0$ with respect to $x$ and $y$, respectively, described as follows:

$$\psi_{0x} = \frac{\partial \psi_0}{\partial x}, \qquad \psi_{0y} = \frac{\partial \psi_0}{\partial y} \qquad (3.18)$$
3.3 Experiment Results and Analysis
3.3.1 Experiment Data
We conduct the experiment on two linear objects: railroads and waterlines, on the scanned topographic
maps, which covered Bray city in California, published in 2001, and Louisville city in Colorado, published
in 1965, from the United States Geological Survey (USGS). The maps contain diverse geographic objects,
including railroads, waterlines, roads, lakes, mountains, and wetlands. The dimensions of the two map
sheets are large: the Bray map is 12,943 pixels in height and 16,188 pixels in width, and the Louisville map is 11,347 pixels in height and 13,696 pixels in width. The external vector data, sourced from USGS and published in
2018, provides the shape information about the desired linear object on the maps.
3.3.2 Evaluation Metrics
We use pixel-level precision, recall, and F_1 score to evaluate the quality of the labels. The manually digitized
railroads and waterlines on the maps are the ground truth for evaluating the ALG algorithm’s performance.
3.3.3 Baselines
The experiment has three groups of labeled data. Two groups are the baselines, and the other is from the
ALG algorithm.
• Group one: The original USGS vector data. We buffer the original vector data from USGS to the approximate width of the desired linear objects on the maps; the overlap areas between the buffered vector data and the maps are the labels for the desired objects. The buffer size, which approximates the width of the desired linear objects, is five pixels for the railroads and waterlines on the Bray map and three pixels for the railroads on the Louisville map. The labels in this group are named original vector. (A minimal sketch of this buffering step follows this list.)
• Group two: The vector-to-raster algorithm. We use the vector-to-raster algorithm proposed by Duan et al. [18] to align the original USGS vector data to the desired linear objects on the maps and then buffer the aligned vector data to the approximate width of the desired linear objects. The overlap areas between the buffered vector data and the maps are the labels for the desired objects. The buffer sizes are the same as in group one: five pixels for the railroads and waterlines on the Bray map and three pixels for the railroads on the Louisville map. The labels in this group are named vector-to-raster.
• Group three: The ALG algorithm. The desired foreground pixels from the ALG algorithm are the labels in this group, named ALG.
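The following is a minimal sketch of the buffering step shared by the two baseline groups, assuming the vector polylines are already expressed in the map image's pixel coordinate system (in practice this requires the maps' georeferencing). The function name and the use of shapely and rasterio are illustrative choices, not the dissertation's implementation.

```python
import numpy as np
from shapely.geometry import LineString
from rasterio.features import rasterize

def buffered_line_labels(polylines, map_height, map_width, buffer_px):
    """Rasterize buffered vector lines into a binary label mask.

    polylines : list of point sequences [(x1, y1), (x2, y2), ...], assumed to be
                already expressed in the map image's pixel coordinates.
    buffer_px : buffer distance in pixels (used directly as the shapely buffer
                distance here; whether that is the full or half line width follows
                the chapter's convention).
    """
    buffered = [LineString(pts).buffer(buffer_px) for pts in polylines]
    mask = rasterize(
        [(geom, 1) for geom in buffered],   # burn value 1 inside every buffered line
        out_shape=(map_height, map_width),
        fill=0,
        dtype="uint8",
    )
    return mask  # 1 = labeled as the desired linear object, 0 = background
```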
3.3.4 The ALG Algorithm Details
We set the buffer width of the external vector data for the area-of-interest as 30 pixels, sufficiently covering
the desired foreground area near the external vector data. As for the object’s shape, the buffer width of the
external vector data is the same as the width of desired linear objects on the maps. Therefore, the buffer
width of the external vector data for the object’s shape in the Bray and Louisville maps is five pixels and
three pixels, respectively. The weight λ in the objective function (eq. 3.8) is three.
3.3.5 Training Data Generation Results and Analysis
Table 3.1: The evaluation of training-data label quality using pixel-level precision, recall, and F_1 score for the linear object extraction.

                   Bray railroads                Louisville railroads          Bray waterlines
                   Precision  Recall   F_1       Precision  Recall   F_1       Precision  Recall   F_1
Original vector    37.02%     58.72%   45.41%    59.20%     60.31%   59.75%    65.01%     96.44%   77.67%
Vector-to-raster   77.78%     87.85%   82.55%    96.56%     83.73%   89.69%    86.24%     88.65%   87.43%
ALG                95.17%     86.15%   90.42%    99.78%     83.02%   90.63%    96.39%     88.98%   92.53%
(a) Railroad labels from orig vec (b) Railroad labels from vec2raster (c) Railroad labels from ALG
(d) Waterline labels from orig vec (e) Waterlines labels from vec2raster (f) Waterline labels from ALG
(g) Railroad labels from orig vec (h) Railroad labels from vec2raster (i) Railroad labels from ALG
Figure 3.3: The areas within the blue polygons are the labeled pixels for the desired linear objects. "Orig vec" in Figures 3.3a, 3.3d, and 3.3g refers to labels from the buffered original vector data. "vec2raster" in Figures 3.3b, 3.3e, and 3.3h refers to labels from the aligned vector data processed by the vector-to-raster algorithm. "ALG" in Figures 3.3c, 3.3f, and 3.3i refers to labels from the proposed ALG algorithm.
Table 3.1 demonstrates the significant improvement in label quality achieved by the ALG algorithm
compared to labels generated from the original vector data and the vector-to-raster alignment algorithm.
The average precision of labels from the ALG algorithm is around 10% higher than the two baselines.
Figure 3.3 shows the visualization of label quality from the ALG algorithm and two baselines. The areas
within the blue boundaries in Figure 3.3 are the positive (i.e., the desired linear objects) labels in the Bray
map and Louisville map from three groups of labels. The first column in Figure 3.3 shows the misaligned
labels from the original vector data due to the misalignment between the original vector data and the desired
linear objects on the maps. The second column in Figure 3.3 shows the label noise from the vector-to-
raster alignment algorithm. The noisy labels result from the vector lines from the vector-to-raster algorithm
aligning with the boundary of desired linear objects. In contrast, the ALG algorithm refines the boundary
labels using color information to separate the maps’ foreground and background areas. Therefore, the ALG
algorithm reduces the label noise located on the boundary of the desired linear objects, as shown in the last column in Figure 3.3. However, the ALG algorithm cannot refine the labels at the boundaries when a background pixel's color is closer to the foreground color than to the background color in the maps. The arrow in Figure 3.3f shows that the ALG algorithm falsely annotates the red background as waterlines because the red color is closer to the color of the waterlines than to the dominant green background color. In general, the color of the areas surrounding the desired linear objects tends to be more similar to the background color than to the color of the desired linear object areas. As a result, the average F_1 score from the ALG algorithm is 5% higher than that of the vector-to-raster algorithm.
The ALG algorithm achieves an average recall of 85%. The primary reason for not reaching 100% recall
is that the external vector data lack information for 14.84% of the desired linear objects on the maps; these objects did not exist when the vector data was originally published. Additionally, the ALG algorithm
falsely annotates other objects as desired linear ones on the maps when the desired lines are outside the
area-of-interest. For example, the area within the red polygon in Figure 3.4a shows the falsely labeled
railroads generated by the ALG algorithm in the Bray map. The true railroads are the bottom black line in
Figure 3.4a. The original vector (red) line in Figure 3.4b overlaps with the middle black line. Therefore, the
area-of-interest provided by the external vector data does not cover the railroad on the map. Consequently,
the ALG algorithm falsely labels another black line as a railroad on the Bray map. Nevertheless, the 85%
recall indicates that the misaligned original vector data is generally close to the desired linear objects in
most locations on the maps, enabling the ALG algorithm to correctly identify and label a substantial portion
of the desired linear objects.
(a) The area within the red polygon is the rail-
roads’ labels from the ALG algorithm in the
Bray map. However, the true railroad is the black
bottom line.
(b) The original railroad’s vector (red) line over-
laps with the middle black line rather than the
bottom railroad line.
Figure 3.4: (a) shows the false labels from the ALG algorithm. The ALG algorithm cannot correct the label error when the original vector data, shown in (b), is closer to other linear objects than to the desired objects.
3.3.6 Extraction Results and Analysis
In this section, we conduct two sets of experiments to assess the impact of label quality on the extraction
results. Firstly, we train deeplabv3+ using three groups of labels, as described in the previous subsection,
to analyze how the quality of labels influences the extraction results. Secondly, we compare the extraction
results from deeplabv3+ with and without the Ncut loss to assess the necessity of our ALG algorithm for
achieving accurate extraction results. The Ncut loss [77] is a noise-aware loss employed during the train-
ing of extraction models, specifically designed to reduce the influence of noisy labels on the accuracy of
extraction.
By examining the accuracy of the extraction results from deeplabv3+ using the Ncut loss (a noise-aware
loss) trained with noisy labels from the original vector data, we can verify if the ALG algorithm is needed
to generate accurate labels before the training process. If the extraction results are accurate without using
the ALG algorithm, it suggests that the Ncut loss effectively compensates for the noisy labels, making the
ALG algorithm unnecessary.
We use correctness, completeness [32], and average path length similarity (APLS) [85] to evaluate the
precision, coverage, and connectivity of the extracted lines, respectively. The calculation of correctness,
completeness, and APLS compares the detected graphs and ground truth. We manually annotate the loca-
tions of desired railroads and waterlines as ground truth. Correctness measures the extent to which detected
lines correspond to the true locations of desired lines. Completeness, on the other hand, measures the extent
to which the desired lines are detected. Figure 3.5 illustrates the calculation of correctness and completeness.
We set the buffer width in the experiment as seven pixels, the width of desired linear objects in maps.
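A minimal sketch of the buffer-based correctness and completeness computation is shown below, assuming the detected and ground-truth lines are available as vectorized polylines in pixel coordinates. The exact definitions follow Heipke et al. [32]; edge cases such as empty extractions are ignored here.

```python
from shapely.geometry import MultiLineString

def correctness_completeness(extracted_lines, gt_lines, buffer_px=7):
    """Buffer-based correctness and completeness for vectorized lines (sketch).

    extracted_lines, gt_lines : lists of coordinate sequences in pixel space.
    buffer_px                 : matching buffer width (7 pixels in this chapter).
    """
    extracted = MultiLineString(extracted_lines)
    gt = MultiLineString(gt_lines)
    # Correctness: share of extracted length lying within the ground-truth buffer.
    correctness = extracted.intersection(gt.buffer(buffer_px)).length / extracted.length
    # Completeness: share of ground-truth length covered by the extraction buffer.
    completeness = gt.intersection(extracted.buffer(buffer_px)).length / gt.length
    return correctness, completeness
```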
Average path length similarity (APLS) [85] quantifies the connectivity of the extracted lines by compar-
ing the similarity of all paths between the ground truth and extraction. The similarity is measured by the path
lengths. APLS penalizes disconnections in the extracted lines and falls within the range of [0,1]. Higher
values within the range indicate better connectivity in the extraction. Eq. 3.19 demonstrates the calculation
Figure 3.5: Illustration of correctness and completeness calculation.
of APLS. In the equation, L(a, b) represents the path length between two nodes in the ground truth, while L(a′, b′) represents the path length between the corresponding nodes of a and b in the extractions. In the experiments, we used a 7-pixel buffer to find the corresponding nodes; 7 pixels is the width of the desired linear objects in the maps.
APLS = 1 - \frac{1}{N} \sum_{N} \min\left\{ 1, \frac{|L(a,b) - L(a',b')|}{L(a,b)} \right\} \quad (3.19)
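The sketch below computes one direction of APLS with networkx, assuming the node correspondence between the ground truth and the extraction has already been established (with the 7-pixel buffer mentioned above). The full metric in [85] averages two such passes and handles unmatched nodes; this simplified version only illustrates eq. 3.19.

```python
import networkx as nx

def apls_one_direction(gt_graph, pred_graph, node_map):
    """One direction of APLS (eq. 3.19), assuming corresponding nodes are known.

    gt_graph, pred_graph : networkx graphs with an edge attribute 'length'.
    node_map             : dict mapping a ground-truth node to its matched predicted
                           node; unmatched nodes are simply absent from the dict.
    """
    penalties, n_paths = [], 0
    nodes = list(node_map.keys())
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if not nx.has_path(gt_graph, a, b):
                continue
            n_paths += 1
            l_gt = nx.shortest_path_length(gt_graph, a, b, weight="length")
            a2, b2 = node_map[a], node_map[b]
            if nx.has_path(pred_graph, a2, b2):
                l_pred = nx.shortest_path_length(pred_graph, a2, b2, weight="length")
                penalties.append(min(1.0, abs(l_gt - l_pred) / l_gt))
            else:
                penalties.append(1.0)  # a missing path receives the maximum penalty
    return 1.0 - sum(penalties) / n_paths if n_paths else 0.0
```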
3.3.6.1 The Impact of Label Quality on the Extraction Results
In Table 3.2, the correctness obtained by deeplabv3+ (the extraction model) without the Ncut loss demon-
strates that the accurately labeled data from the ALG algorithm achieves the highest correctness compared
to the results from the other two groups of labels. For visual comparisons, Figure 3.6a and 3.6b depict
extraction results for railroads and waterlines from the Bray map, respectively. Moreover, Figure 3.6c visu-
alizes the railroad extractions from the Louisville map. The rows from top to bottom represent the extraction
results using the labeled data from the original vector, vector-to-raster, and ALG groups, respectively. In
each sub-figure, the green and red lines represent the true positive and false positive extractions, respec-
tively. The visualization shows that the extraction results from the ALG algorithm have more true positive
lines compared to the original vector group while having fewer false positive lines when contrasted with
the vector-to-raster group. The completeness in Table 3.2 shows that deeplabv3+ trained using ALG’s la-
bels extracts 5% and 8% longer railroads and waterlines from the Bray map, respectively, compared to the
vector-to-raster group. The extraction visualization in Figure 3.7a and 3.7b illustrates that the extracted
waterlines from the ALG algorithm are longer than those from the vector-to-raster group. The extraction
result comparison among the three label groups highlights the significant impact of the quality of labeled
data on extraction performance. The extraction model trained using accurately labeled data from the ALG
algorithm achieves the best extraction results among the three label groups.
Table 3.2: The evaluation of the vectorized extraction results. "Original", "Vec2raster", and "ALG" represent the three groups of labeled data. "Ncut" and "no-Ncut" represent deeplab v3+ with and without the normalized cut loss, respectively. The highlighted numbers indicate the best extraction results.

                         Bray railroads                   Bray waterlines                  Louisville railroads
                         Original  Vec2raster  ALG        Original  Vec2raster  ALG        Original  Vec2raster  ALG
Correctness    Ncut      0.4437    0.2385      0.7915     0.0       0.7429      0.6743     0.1139    0.3230      0.5034
               no-Ncut   0.4937    0.7644      0.8078     0.7775    0.7968      0.8223     0.6084    0.7259      0.8661
Completeness   Ncut      0.4772    0.5025      0.6482     0.0       0.8524      0.9341     0.2167    0.4945      0.2704
               no-Ncut   0.4543    0.7918      0.8408     0.9454    0.8874      0.9671     0.4664    0.8182      0.7154
APLS           Ncut      0.3319    0.2437      0.6353     0.0       0.2880      0.5424     0.1239    0.1799      0.1704
               no-Ncut   0.3120    0.4052      0.5750     0.3533    0.3142      0.4600     0.5333    0.3059      0.4993
(a) Railroads on the Bray map (b) Waterlines on the Bray map (c) Railroads on the Louisville map
Figure 3.6: The above figure illustrates the extraction results. The sub-figures from top to bottom show the
results using the annotations from the original vector, vector-to-raster, and ALG group, respectively. The
green and red lines represent the true positive and false positive extraction results, respectively.
(a) Vec2raster + no-Ncut (b) ALG + no-Ncut (c) ALG + Ncut
Figure 3.7: The waterlines extraction results on the Bray map. The green lines represent the true positive
extracted waterlines. For the labels, (b) and (c) used the ALG group, while (a) used the vector-to-raster
group. For the extraction models, (a) and (b) show the extraction from the model without the Ncut loss,
while (c) is with the Ncut loss.
3.3.6.2 With & Without the Normalized Cut Loss
(a) no-Ncut (b) Ncut
Figure 3.8: The false extracted railroads (red lines) on the Bray map. (a) and (b) are the results from the
model without and with the Ncut loss, respectively.
APLS for the railroads and waterlines on the Bray map in Table 3.2 show that the Ncut loss signifi-
cantly improves the continuity of the extraction outcomes. Specifically, when using the ALG algorithm with
the Ncut loss, the APLS score is 12.40% higher compared to using the ALG algorithm without the Ncut
loss. Figure 3.7c shows that the extraction model with the Ncut loss extracts continuous waterlines (the
green lines), while Figure 3.7b shows that the extraction model without the Ncut loss extracts disconnected
waterlines. The Ncut loss promotes the grouping of neighboring pixels with similar colors into the same
class, effectively minimizing gaps in extraction results. However, the Ncut loss function also contributes to
a rise in false positive extractions that are in close proximity to the desired objects and share similar colors
with the desired objects. As a result, the Ncut loss results in an average decrease of 29.93% in correctness.
Figure 3.8 shows that the model with the Ncut loss extracts more continuous false railroads (red lines) on the
Bray map. Therefore, the Ncut loss increases the continuity of both true positive and false positive results.
The low correctness and completeness for the original vector labels in Table 3.2 show that the Ncut loss cannot handle the label noise problem directly. The labels from the original vector data have an average F_1 score of 60.94%, which shows that the label noise is not minor. The label noise mostly occurs in the background areas surrounding the target objects, so the noisy labels share similar appearances with one another. As a result, the Ncut loss, which relies on comparing the similarity among labels, is ineffective in correcting the label noise. Therefore, the Ncut loss cannot be directly applied to our noisy label problem.
3.4 Related Work
In this section, we discuss the existing methods that reduce label noise before training semantic segmentation
models, as the proposed ALG algorithm improves label accuracy before the training process. The existing
methods fall into two categories: semi-automatic and fully automatic approaches.
In the semi-automatic category, the two-stage building detector [23] addresses the misaligned annota-
tions from the OSM data by leveraging a small set of manually corrected annotations. The detector uses
the smaller set of manual labels to train an alignment correction network (ACN) before training a semantic
segmentation model. The ACN learns the translation relationships between the buildings’ annotations from
the external OSM data and the buildings in the overhead satellite imagery. During the training of the build-
ing detector, the ACN adjusts the OSM annotations by aligning them with the buildings in the imagery. However,
the semi-automatic approach requires manual work. In contrast, the ALG algorithm automatically generates
labels by fully leveraging local and shape information from the external vector data.
In the fully automatic category, the existing methods [8, 18, 64, 74] reduce annotation noise by auto-
matically aligning the external vector data to the desired linear objects on the map images. Some existing
methods [8, 64, 74] focus on aligning the road vector data and roads on the maps. The road alignment
algorithms propose diverse methods to detect and match the road intersections for the alignment. However,
the waterlines and railroads have fewer intersections than roads. Therefore, the existing methods for road
58
alignment cannot apply to our desired linear objects. The baseline of the experiment, the vector-to-raster
alignment algorithm [18], has a reward function to guide the vector data moving to the desired linear objects
on the maps. The reward function places the vector data at the pixels with colors falling within the color
range of the desired linear objects on the maps. However, the vector data may be placed at the desired linear
objects’ edges since edge colors are also in the color range. Consequently, labels from the aligned vector
data may falsely annotate background areas surrounding the desired linear objects. In contrast, the labels
from the ALG algorithm do not include the background surrounding the desired linear objects because the
ALG algorithm accurately separates foreground and background areas in maps based on color homogeneity.
Chapter 4
Linear Geographic Objects Extraction
This chapter introduces my extraction model for linear objects, such as railroads, on scanned historical to-
pographic or geological maps. Accurate linear object extraction can benefit various application areas, such
as hydrography change analysis and mining resource potential prediction. However, existing models en-
counter challenges in capturing adequate image context to differentiate desired linear objects from others
with similar local appearances and spatial context to delineate elongated, slender-shaped linear objects accu-
rately. Consequently, detection results from the existing models often suffer from inaccurate connectivities
and false detection. Accurate connectivity detection ensures that the extracted line segments are intersected,
merged, and split correctly. This chapter introduces my extraction model, the Linear Object Detection
TRansformer (LDTR), for directly generating accurate vector graphs for linear objects from scanned map
images. By leveraging the multi-scale deformable attention mechanism, LDTR learns representative im-
age context, reducing false detection. Furthermore, the innovative N-hop connectivity component in LDTR
explicitly encourages interactions among nodes within a given N-hop distance in a graph. The interac-
tions enable LDTR to acquire sufficient spatial context, generating graphs with accurate connectivities. The
experiment results in this chapter demonstrate that LDTR achieves an average correctness of 0.87 and sig-
nificantly improves connectivity by approximately 20% compared to state-of-the-art linear object detectors,
representing a significant advancement. LDTR won first place in the United States Geological Survey and the Defense Advanced Research Projects Agency 2022 AI for Critical Mineral Assessment Competition*, significantly outperforming second place by 184%.
4.1 Introduction
Detecting linear objects, such as railroads and fault lines, from the scanned historical topographic and geo-
logical map images helps extract valuable information about the natural features and human activities [10,
11, 40, 55], such as critical minerals [33] and the railroad networks development [1]. For example, Fig-
ure 4.1 shows the railroad locations in the same region covering Los Angeles, California, from the United
States Geological Survey (USGS) topographic maps published in 1928, 1966, and 2018, respectively. Com-
paring the railroad locations on the three maps facilitates the analysis of railroad changes over time, such as
railroad locations, length, and density. These types of analysis [65, 81, 82] rely on accurate linear object de-
tection from scanned maps [16, 19, 46, 91, 100]. For example, accurate fault lines extracted from geological
maps are often the only reliable source of such data for critical mineral assessment.†
(a) Year 1928 (b) Year 1966 (c) Year 2018
Figure 4.1: The black lines highlighted with the arrows are the historical railroad locations on the USGS
topographic maps in Los Angeles, California, circa 1928, 1966, and 2018, respectively. The density of
railroad areas in three different years shows the changes in railroads over time.
* https://criticalminerals.darpa.mil/The-Competition
† e.g., see https://criticalminerals.darpa.mil/Background
Detecting accurate linear objects requires the detector to capture both image and spatial context. The
image context should encompass the cartographic symbols of the desired linear objects. For example, the
black crosses, highlighted with red circles in Figure 4.2b, are representative cartographic symbols for rail-
roads. In Figure 4.2, both roads and railroads are black lines. Capturing the black crosses as an image
context is crucial to differentiate railroads from roads. Therefore, capturing the representative image context
can reduce false detection. Spatial context, on the other hand, refers to the correlations among pixels in an
image. For example, nearby pixels with similar colors are likely to belong to the same object. Therefore,
capturing the spatial context can improve the connectivity of detected linear objects. The detector must also
be able to combine the image and spatial contexts, e.g., detecting black(ish) pixels following an elongated
area with repeated occurrences of the cross symbols to accurately extract the railroads.
(a) (b)
Figure 4.2: Examples of road (a) and railroad (b) on a USGS topographic map
The state-of-the-art (SOTA) linear object detectors have two categories: segmentation- and graph-based
models. The SOTA segmentation-based models [22, 38, 70, 75, 79, 106, 107] propose two solutions to
learning image context: the attention mechanism and strip convolutional kernels. The attention mechanism,
proposed by the Transformer family [3, 26, 51, 88, 104, 108], aggregates image contexts for the cartographic
symbols of desired lines. The strip convolution kernels [13, 56, 78, 89] have horizontal, vertical, and
diagonal shapes to learn the image contexts along the linear objects, including cartographic symbols. After
learning the image context, the segmentation-based models independently predict each pixel in the input
image belonging to the desired object. However, independent predictions [15, 45, 57] in segmentation-
based models fail to explicitly use the spatial context among pixels.
The SOTA graph-based detectors [30, 31, 48, 70, 87, 95, 103] adopt the same solutions as segmentation-
based models to learn image context but directly generate vector graphs for linear objects to learn the spatial
context instead of predicting individual pixels. Here, the spatial context aims to capture linear objects’
connectivity and curvature. Relationformer [70], the latest graph-based detector, based on Deformable
DETR (def-DETR) [108], proposed a learnable relation token to capture the spatial context. The relation
token has high-order interactions with the node tokens, allowing the exchange of spatial context among node
tokens. In particular, Relationformer focuses on learning the spatial context of node adjacency (i.e., using
adjacency information to optimize the relation token). However, the adjacency relations are insufficient
to encode the complex spatial context among nodes. For example, the spatial context of a curved line is
more complicated than a straight line. As a result, Relationformer would produce high precision and recall
detection but with low connectivity, which is important for many real-world applications to analyze the
extracted line data (e.g., shortest paths).
To learn complex spatial contexts and improve connectivity prediction, this chapter proposes the Linear
object Detection TRansformer (LDTR) based on Relationformer [70] and includes a novel N-hop connectiv-
ity prediction module. The N-hop connectivity module encourages the node tokens to encode information
from adjacent node tokens and node tokens within a given N-hop distance. By considering a wide con-
nectivity extent, LDTR effectively captures complex spatial context, including the line’s orientation and
curvature as well as the topological relationships among multiple lines. LDTR’s learned spatial context
helps predict edges in a complex graph accurately. For example, when a node has multiple potential connec-
tions in various directions, knowing the graph’s topology helps LDTR correctly predict an edge that aligns
with the orientation of the linear object. As for learning representative image context, particularly carto-
graphic symbols, LDTR leverages deformable attention [108], which enables selective interactions among
local features. For example, local features for railroads are black lines or black crosses. With the learned
selective interactions, LDTR effectively incorporates black lines and black crosses into the representation
of railroads, ensuring the accurate extraction of desired linear objects. Figure 4.3 shows the overview of
the LDTR architecture. LDTR takes tokens for the image-patch, node, and edge as inputs and predicts the
nodes and edges to construct a graph. In summary, LDTR achieves accurate node localization by capturing
cartographic symbols using deformable attention. Additionally, the N-hop connectivity prediction module
captures spatial information to predict precise edges in the graph.
Figure 4.3: The above figure shows the LDTR overview. LDTR takes images as inputs and generates the
graph for desired lines by predicting nodes and edges. Notably, the N-hop connectivity prediction head
in LDTR plays a crucial role in capturing complex spatial context among nodes and hence improves the
connectivity of the detected lines. Best viewed in color.
The main contributions of LDTR are the following:
• LDTR utilizes the multi-scale deformable attention mechanism to effectively capture representative
image context, specifically cartographic symbols, which is important to reduce false detection.
• The N-hop connectivity component in LDTR facilitates the spatial information aggregation among
nodes within a given N-hop distance to capture sufficient spatial context, which is crucial to gen-
erating accurate connectivity in a graph. Accurate graph connectivity is the foundation for various
downstream analyses and applications, such as route planning (e.g., road navigation) and identifying
interconnected areas delineated by linear features.
• LDTR, developed under the def-DETR [108] framework, harnesses both image and spatial context to
generate accurate graphs for desired linear objects on scanned maps.
In the experiment section, we evaluate LDTR and baselines on real-world linear-object detection tasks in
scanned maps. The detected results from LDTR have significantly less false detection and better connectivity
than baselines.
4.2 Linear object Detection TRansformer (LDTR)
LDTR consists of five components: a Convolutional Neural Network (CNN) backbone, Transformer, a node
prediction head, an edge prediction head, and an N-hop connectivity prediction head. Figure 4.4 shows
the detailed LDTR architecture. The first four components follow the conventions in Deformable-DETR-
based object detection approaches [108] and Relationformer [70], while the fifth component, the N-hop
connectivity prediction head, is the novel component to improve the connectivity of linear object detection.
This section describes each component in detail.
4.2.1 LDTR Model Architecture
CNN Backbone: Starting from the initial image, I ∈ R^{3×H_0×W_0} (3 represents the three color channels of an image), a CNN backbone generates a lower-resolution feature map f ∈ R^{C×H×W}. Each C-length feature vector in the feature map represents an H_0/H × W_0/W grid cell in the initial image.
Deformable Transformer: We adopt an encoder-decoder Relationformer [70] architecture with multi-
scale deformable attention proposed by def-DETR [108]. The deformable attention enables the query tokens
to selectively attend to a small set of tokens within a spatial area determined from learned offsets of the
Figure 4.4: The above figure shows the LDTR’s architecture. LDTR consists of five components. The first
component, CNN, takes images as inputs and generates the feature maps. The second component, Trans-
former, comprises two sub-components: an encoder and a decoder. The Transformer’s encoder processes
image-patch tokens obtained by combining flattened feature maps and positional encodings to produce re-
fined image-patch tokens. The Transformer’s decoder takes node tokens, edge tokens, and refined image-
patch tokens as inputs and generates refined node and edge tokens. The third component, the edge prediction
head, takes the concatenation of pair-wise refined node tokens and the refined edge token as inputs and pre-
dicts if an edge exists between two node tokens. The fourth component, the node detection head, takes the
refined node tokens as inputs and predicts if a node token is a valid node in the graph plus the node location
in the input image. The fifth component, the N-hop connectivity prediction head, takes the concatenation
of pair-wise refined node tokens as inputs and predicts if two nodes are connected within a specified N-hop
distance. The fifth component is only used in training time.
reference points. The learnable selective attention is particularly beneficial for detecting linear objects, as
the existence of a linear object is typically confined to specific spatial positions. For example, the railroad
in Figure 4.2 only occupies a slender and elongated area in the middle of the image. By leveraging the
deformable attention, the tokens are able to effectively focus their attention on the relevant tokens that are
more informative for detecting the linear object accurately.
Encoder: LDTR follows the encoder in Relationformer [70], employing multi-scale deformable self-
attention. The encoder takes the sum of flattened image features from the CNN backbone and position
encodings as inputs. For positional encoding, we utilize the widely-used two-dimensional sinusoidal po-
sitional encoding method in Vision Transformer (ViT) [3, 4, 26, 104]. The multi-scale deformable self-
attention enables selective interactions among image-patch tokens within multi-scaled feature maps from
the CNN backbone. The interaction allows LDTR to effectively aggregate visual information along the de-
sired lines from fine to coarse resolutions. By aggregating visual information, the image-patch tokens obtain
representative features, specifically the distant cartographic symbols (e.g., railroad crosses), which play a
crucial role in distinguishing the desired lines from others on the map images.
Decoder: Following Relationformer's decoder, our decoder has N+1 tokens, T ∈ R^{(N+1)×d}, comprising N node tokens plus a single edge token, where d is the length of an image-patch token. LDTR randomly initializes each element of the N+1 tokens from a normal distribution, N(0, 1). The image-
LDTR randomly initializes each element in N+1 tokens from a normal distribution,N (0,1). The image-
patch tokens from the encoder serve as the second input of our decoder. The decoder uses two types of
attentions to process N+1 tokens. One type is deformable cross-attention [108] between the node tokens and
the image-patch tokens, the encoder’s outputs. The other type is the self-attention [88] among node and edge
tokens. In the multi-scale deformable cross-attention between the node and image-patch tokens, the node
tokens learn to attend to image-patch tokens at specific spatial positions, which are the positions of desired
linear objects. The deformable cross-attention mechanism enables the model to establish correspondences
between node tokens and relevant image-patch tokens in order to gather visual information within the object
regions in the images. Meanwhile, self-attention allows information exchange among node and edge tokens.
The node tokens capture the correlations to other node tokens, and the edge token learns how nodes interact
within other nodes’ contexts. Here, note that the edge token, following the design of Relationformer [70],
does not have cross-attention to image-patch tokens because the edge token focuses on describing the spatial
context among node tokens rather than the image context. Additionally, Relationformer [70] demonstrates
the edge token’s importance in accurate graph generation. Therefore, LDTR, following the Relationformer
design, has an edge token as input for the decoder.
Node Detection Head: The node detection head has two sub-components, sharing a similar design to
an object detection task [4, 108]. The first sub-component is a multi-layer perceptron (MLP) responsible for
regressing the node locations. The MLP takes the node tokens from the decoder and predicts the coordinates
of the nodes. We utilize normalized coordinates for the node locations regression to ensure scale invariance
in our predictions. The second sub-component is a single-layer classification module, which also takes
the node tokens as inputs. The classification module directly classifies each node token into one of the
categories: node or not-node. Following Relationformer [70], which employs one-stage def-DETR for road
graph detection in satellite imagery, LDTR also leverages one-stage def-DETR without the region proposal
module.
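A minimal PyTorch sketch of the node detection head is given below. The hidden sizes and the use of a sigmoid to produce normalized coordinates are illustrative assumptions; only the overall structure (an MLP regressor plus a single-layer classifier) follows the description above.

```python
import torch
import torch.nn as nn

class NodeDetectionHead(nn.Module):
    """Sketch of the node detection head: an MLP regressor for normalized (x, y)
    coordinates and a single-layer classifier (node vs. not-node)."""

    def __init__(self, token_dim=256, hidden_dim=256):
        super().__init__()
        self.regressor = nn.Sequential(             # MLP for node coordinates
            nn.Linear(token_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2), nn.Sigmoid()  # normalized coordinates in [0, 1]
        )
        self.classifier = nn.Linear(token_dim, 2)   # node / not-node logits

    def forward(self, node_tokens):                 # node_tokens: (batch, N, token_dim)
        return self.regressor(node_tokens), self.classifier(node_tokens)
```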
Edge Prediction Head: The edge prediction head in LDTR is also an MLP, which takes the concate-
nation of pair-wise node tokens and the edge token from the decoder as inputs. The output of the edge
prediction head is a binary value,{0,1}. A prediction of one indicates an edge between the pair of nodes,
while zero means absence. During training, LDTR randomly shuffles the node order to construct inputs for
the edge prediction head. Since the desired output graph is undirected, which is invariant to node ordering,
the random shuffling helps eliminate any bias or dependency on the specific ordering of the nodes.
N-hopConnectivityPredictionHead: The novel N-hop connectivity prediction head is also an MLP.
The inputs are the concatenation of pair-wise node tokens. The output is a binary value,{0,1}. A prediction
of one means two node tokens are connected within a given N-hop distance in the output graph, while zero
means two nodes are not connected within a given N-hop distance.
One main advantage of LDTR is that it does not require additional tokens for connectivity prediction
compared to using a special edge token for edge prediction in Relationformer. To capture multi-hop infor-
mation, one naive approach is to add a connectivity token, similar to the edge token in Relationformer [70].
However, unlike the edge token, which only learns adjacent information among two node tokens, a connec-
tivity token needs to store multi-hop complex information between multiple nodes. A single token would
not be able to store the multi-hop complex information for all node combinations, and using the number
of tokens equivalent to all possible node pairs would significantly increase the computational overhead. In
contrast, LDTR employs a simple but highly effective approach by concatenating pair-wise node tokens, en-
abling pairwise node token interactions. The N-hop connectivity component encourages strong correlations
between pairwise node tokens within a given N-hop distance while simultaneously promoting dissimilar-
ity between other node token pairs. Consequently, the nearby node tokens have high attention scores for
each other while low attention scores for other node pairs. The pairwise node token interactions enable
LDTR to effectively capture local connectivity information without needing an additional token. The local
connectivity information, including the topological relationships among lines and curvature of lines, helps
LDTR connect neighboring nodes within an N-hop extent correctly. Not using additional tokens also helps
keep similar memory footprints (i.e., numbers of parameters) as Relationformer and can handle large input
images.
Specifically, the N-hop connectivity component facilitates two types of attention for node tokens. One
is the self-attention among node tokens. The N-hop connectivity component helps the self-attention mecha-
nism aggregate the spatial context by encouraging the pairwise node token interactions within a given N-hop
distance. The aggregated spatial context helps LDTR generate accurate graphs. The other is the cross-
attention between image-patch and node tokens. The node tokens’ spatial information learned from the
self-attention helps the deformable cross-attentions locate reference points within desired objects’ regions
in an image. This way, LDTR can effectively capture image patterns for desired lines. The representative
image patterns help LDTR differentiate desired lines from others. In summary, the N-hop connectivity com-
ponent facilitates learning relevant spatial and image information through self-attention and cross-attention,
respectively.
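The sketch below illustrates the two ingredients described above: deriving binary N-hop connectivity targets from a ground-truth adjacency matrix, and an MLP head that scores concatenated pairs of node tokens. Tensor shapes, hidden sizes, and function names are assumptions for illustration, not LDTR's exact implementation.

```python
import torch
import torch.nn as nn

def n_hop_targets(adjacency, n_hops):
    """Binary N-hop connectivity targets from a 0/1 (N, N) adjacency matrix.
    Returns 1 where two nodes are connected within n_hops edges."""
    reach = adjacency.clone().bool()
    power = adjacency.clone().bool()
    for _ in range(n_hops - 1):
        power = (power.float() @ adjacency.float()) > 0   # reachable with one more hop
        reach |= power
    return reach.float()

class NHopConnectivityHead(nn.Module):
    """Pairwise node-token concatenation followed by an MLP (illustrative sizes)."""
    def __init__(self, token_dim=256, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * token_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, node_tokens):                  # node_tokens: (N, token_dim)
        n = node_tokens.shape[0]
        pairs = torch.cat([node_tokens.unsqueeze(1).expand(n, n, -1),
                           node_tokens.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.mlp(pairs).squeeze(-1)           # (N, N) connectivity logits
```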
The parameter N in the N-hop connectivity component plays a vital role in guiding the model to learn
relevant connectivity information. A small value of N cannot capture sufficient graph information. On the
other hand, if N is too large, a node token will obtain irrelevant information from other node tokens in the
same sub-graph but located far apart in terms of their shortest graph path. In the experiments, we will show
the impact of N values on the detection performance.
4.2.2 Training Phase
LDTR uses a combination of loss functions for optimization, including node classification loss (L_cls), node regression loss (L_reg), edge classification loss (L_edge), and N-hop connectivity loss (L_conn). For the node classification loss, LDTR calculates the loss and optimizes for each node token. For the remaining losses (node regression loss, edge classification loss, and N-hop connectivity loss), LDTR only considers predicted nodes assigned to the ground truth nodes by the Hungarian matcher proposed by DETR [4]. Following DETR [4], the Hungarian matcher in LDTR does one-to-one matching, assigning one predicted node to one ground truth node. The matching score is measured by the node classification and node regression losses.
Eq. 4.1 calculates the weighted overall loss to optimize LDTR:
\mathcal{L}_{total} = \lambda_{cls} \sum_{i=1}^{N} \mathcal{L}_{cls}\big(v^{i}_{cls}, \hat{v}^{i}_{cls}\big) + \lambda_{reg} \sum_{i=1}^{N} \Big[ \mathbb{1}_{v^{i}_{cls} \notin \varnothing} \, \mathcal{L}_{reg}\big(v^{i}_{node}, \hat{v}^{i}_{node}\big) \Big] + \sum_{\{i,j\} \notin \varnothing} \Big[ \lambda_{edge} \mathcal{L}_{edge}\big(e^{ij}_{edge}, \hat{e}^{ij}_{edge}\big) + \lambda_{conn} \mathcal{L}_{conn}\big(e^{ij}_{conn}, \hat{e}^{ij}_{conn}\big) \Big] \quad (4.1)
where L_cls, L_edge, and L_conn are binary cross-entropy losses, while L_reg is the Least Absolute Deviations (L1) loss. λ_cls, λ_reg, λ_edge, and λ_conn are the weights for the corresponding losses. L_cls and L_reg are the losses that optimize the node classification and regression in the node prediction head. The indicator 1_{v^i_cls ∉ ∅} in the second term selects the valid nodes assigned by the Hungarian matcher. L_edge and L_conn in the third term are the losses for the edge prediction head and the N-hop connectivity prediction head, respectively. Σ_{{i,j} ∉ ∅} sums over the node pairs in the valid set assigned by the Hungarian matcher.
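A simplified PyTorch sketch of eq. 4.1 is shown below, with the loss weights set to the values reported later in this chapter (3, 5, 4, and 1). How the Hungarian-matching results are represented (the valid_nodes and pair_mask masks) is an assumption made for illustration.

```python
import torch.nn.functional as F

def ldtr_loss(node_logits, node_labels, coord_pred, coord_gt, valid_nodes,
              edge_logits, edge_gt, conn_logits, conn_gt, pair_mask,
              w_cls=3.0, w_reg=5.0, w_edge=4.0, w_conn=1.0):
    """Weighted sum of LDTR's four losses (eq. 4.1), applied after Hungarian matching.

    node_labels / edge_gt / conn_gt are 0-1 float tensors; valid_nodes marks node
    tokens matched to ground-truth nodes, and pair_mask marks node pairs whose
    endpoints are both matched (these mask conventions are assumptions)."""
    l_cls = F.binary_cross_entropy_with_logits(node_logits, node_labels)       # all node tokens
    l_reg = F.l1_loss(coord_pred[valid_nodes], coord_gt[valid_nodes])          # matched nodes only
    l_edge = F.binary_cross_entropy_with_logits(edge_logits[pair_mask], edge_gt[pair_mask])
    l_conn = F.binary_cross_entropy_with_logits(conn_logits[pair_mask], conn_gt[pair_mask])
    return w_cls * l_cls + w_reg * l_reg + w_edge * l_edge + w_conn * l_conn
```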
4.2.3 Inference Phase
During the inference phase, LDTR only uses the node detection head and the edge prediction head to generate the graph for the desired linear objects since predicting nodes and edges is sufficient to construct a graph. Regarding edge prediction, LDTR performs two predictions by swapping the order of the two nodes and takes the average as the final prediction probability for the edge.
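The inference-time edge rule can be sketched as follows, assuming a hypothetical edge_head that consumes the concatenation of two node tokens and the edge token.

```python
import torch

def symmetric_edge_probability(edge_head, node_tokens, edge_token, i, j):
    """Average the edge score over both node orderings (the inference-time rule above).
    `edge_head` is a hypothetical callable taking [node_i ; node_j ; edge_token]."""
    ij = torch.cat([node_tokens[i], node_tokens[j], edge_token], dim=-1)
    ji = torch.cat([node_tokens[j], node_tokens[i], edge_token], dim=-1)
    return 0.5 * (torch.sigmoid(edge_head(ij)) + torch.sigmoid(edge_head(ji)))
```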
4.3 Experiment
4.3.1 Datasets
We test LDTR and three baselines on four linear objects on scanned historical topological and geological
map images from USGS. Figure 4.5 shows examples of four linear objects. The selection of these four linear
objects for testing LDTR is based on two considerations. First, all four linear objects possess representative
recurrent cartographic symbols, making them suitable for evaluating LDTR’s capacity to capture relevant
image context (i.e., cartographic symbols). Specifically, the cartographic symbols for railroads, waterlines,
scarp lines, and thrust fault lines are black crosses, blue dots, teeth-like short black lines, and black triangles,
respectively. Second, the four linear objects have diverse graph topologies. Railroads, scarp lines, and thrust
fault lines consist of closely situated lines with slow turns, while waterlines are curved lines with many
junctions and sharp turns. The variation in graph topology ensures that the four linear objects effectively
evaluate LDTR’s capacity of capturing relevant spatial context to construct graphs.
Following the convention of processing large maps [17, 30, 70], which often have dimensions of over
10,000× 10,000 pixels, we employ a two-step process to generate data for LDTR and the baselines. First,
we slice each map into non-overlapping regions of size 2,048× 2,048 pixels and divide these regions into
(a) Railroads (b) Waterlines (c) Scarp lines (d) Thrust fault lines
Figure 4.5: Examples of tested linear objects.
Table 4.1: The number of regions in the training, validation, and testing sets, the window size for the models' inputs, and the stride for the sliding window. The window size and stride are in pixels. Fault lines⋆ refers to thrust fault lines.

Object         #Train  #Val  #Test  Window size  Stride
Railroads      57      6     9      256 * 256    32
Waterlines     53      6     9      256 * 256    50
Scarp lines    250     27    91     256 * 256    32
Fault lines⋆   626     70    193    340 * 340    128
three sets: 80% for training, 10% for validation, and 10% for testing. Second, we use a sliding window
approach to generate inputs for LDTR and the baselines in each region. The approach crops an image every
time the sliding window moves an X-pixel stride horizontally or vertically across a region. Table 4.1 shows
the number of training, validation, and testing regions, the window size, and the stride for each linear object.
We use a larger window size for the thrust fault lines than the others because the distant cartographic symbols
representing thrust fault lines are further apart compared to the other three linear objects. The various strides
accommodate the varying densities of desired lines on the maps. To obtain an adequate number of training
images covering desired lines, we use a small stride to crop images for desired lines with sparse spatial
distribution in maps. To enhance the diversity of training data, we apply rotation and color augmentation on
the input images.
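A minimal sketch of the sliding-window cropping step is shown below; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def sliding_window_crops(region, window, stride):
    """Crop model inputs from a 2,048 x 2,048 region with a sliding window.

    region : (H, W, 3) numpy array of map pixels.
    window : crop size in pixels (e.g., 256); stride : step in pixels (e.g., 32).
    """
    h, w = region.shape[:2]
    crops = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crops.append(region[top:top + window, left:left + window])
    return crops
```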
4.3.2 Evaluation Metrics
We use correctness, completeness [32], and average path length similarity (APLS) [85] to evaluate the
precision, coverage, and connectivity of detected lines, respectively. The calculation of correctness, com-
pleteness, and APLS is between the detected graphs and ground truth. We manually annotate the locations
of desired railroads and waterlines as ground truth. As for the scarp lines and thrust fault lines, the DARPA
critical mineral assessment competition provided the ground truth‡. Correctness measures the extent to which the detected lines correspond to the true locations of the desired lines. Completeness, on the other hand, measures the extent to which the desired lines are detected. Heipke et al. [32] provide details about calculating correctness and
completeness within a buffer width. We set the buffer width in the experiment as 25 pixels to account for
slight location shifts between the detected and true lines on the maps. The small shifting results from the
regression method employed to predict the positions of the nodes. The predicted coordinate values usually
have a slight difference from the true coordinates. Consequently, the detected lines may deviate by a few
pixels from the true lines on the maps.
Average path length similarity (APLS) [85] measures the connectivity of the detected lines by comparing
the similarity of all paths between the ground truth and the detected graph. The similarity is measured by
the path lengths in the two graphs. The overall APLS is the average of two calculations. One calculates
APLS for the detected graph, which penalizes disconnections, and the other is for the ground truth graph,
which penalizes false detection. Eq. 4.2 illustrates how to calculate APLS in one round.
APLS = 1 - \frac{1}{N} \sum_{N} \min\left\{ 1, \frac{|L(a,b) - L(a',b')|}{L(a,b)} \right\} \quad (4.2)
‡ The ground truth can be found on Dropbox
where (a, b) and (a′, b′) are corresponding nodes between the two graphs, and L(·) represents the path length between two nodes. In the experiments, we used a 25-pixel buffer to find the corresponding nodes. The APLS range is [0, 1], where higher values indicate better connectivity in the detected graphs.
4.3.3 Model Details
CNN backbone: LDTR uses the ResNet model [29] pretrained on the ImageNet dataset as the CNN backbone. The output feature map from the ResNet backbone has a channel dimension of C = 2048 and height and width dimensions of H, W = H_0/32, W_0/32, where H_0 and W_0 are the height and width of the input image, respectively. Each 2048-length feature vector in the feature map represents a 32 × 32 grid cell in the initial image.
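A minimal sketch of such a backbone is shown below using torchvision (≥ 0.13 API). The choice of ResNet-50 is an assumption; the text only states that a pretrained ResNet [29] is used and that the output has 2048 channels at 1/32 of the input resolution.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ImageNet-pretrained ResNet truncated before the pooling/classification layers,
# producing a C=2048 feature map at 1/32 of the input resolution.
backbone = nn.Sequential(*list(resnet50(weights="IMAGENET1K_V1").children())[:-2])

image = torch.randn(1, 3, 256, 256)   # H0 = W0 = 256
features = backbone(image)            # shape: (1, 2048, 8, 8) = (1, C, H0/32, W0/32)
print(features.shape)
```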
Transformer: Both the encoder and decoder are three-layer multi-head attention modules. Each multi-
head attention module has eight heads and adopts the same architecture as def-DETR [108]. In the multi-
scale self- and cross-attention, LDTR has four scales: 1/4, 1/8, 1/16, and 1/32 dimensions of input images.
The number of node tokens in the decoder is 40, 60, 50, and 110 for railroads, waterlines, scarp lines, and
thrust fault lines, respectively. We set the number of node tokens according to the maximum number of
nodes observed in the training images.
Prediction Heads: We use a three-layer MLP for the node regression head, edge prediction head,
and N-hop connectivity head. N for N-hop connectivity head is five for railroads, scarp lines, and thrust
fault lines, while seven for waterlines. We set the N values according to the curvature of desired linear
objects. Railroads, scarp lines, and thrust fault lines have slow turns, while waterlines have many sharp
turns. Therefore, we choose a higher N value of seven for waterlines to effectively capture the sharp turns
while using a lower N value of five for the other linear objects.
Loss weights and Optimizer: The weights for node classification loss, node regression loss, edge
loss, and N-hop connectivity loss are three, five, four, and one, respectively. During the training phase, the
Table 4.2: Evaluation results for the detection of four objects in scanned historical topographic and geological maps. Correctness, completeness, and APLS evaluate the detection results' precision, coverage, and connectivity, respectively.

Object              Model type          Model           Correctness  Completeness  APLS
Railroads           segmentation-based  SIINet          0.7058       0.6011        0.3085
                                        CoANet          0.6855       0.9921        0.3582
                    graph-based         Relationformer  0.9740       0.9242        0.5048
                                        LDTR (Ours)     0.7905       0.9889        0.6592
Waterlines          segmentation-based  SIINet          0.8930       0.9105        0.3494
                                        CoANet          0.9033       0.9581        0.4231
                    graph-based         Relationformer  0.9653       0.9790        0.5320
                                        LDTR (Ours)     0.9890       0.9737        0.6571
Scarp lines         segmentation-based  SIINet          0.7719       0.9765        0.4469
                                        CoANet          0.5492       0.9404        0.3117
                    graph-based         Relationformer  0.7334       0.9628        0.5380
                                        LDTR (Ours)     0.8217       0.9611        0.6246
Thrust fault lines  segmentation-based  SIINet          0.6998       0.9930        0.4212
                                        CoANet          0.4649       0.9521        0.4626
                    graph-based         Relationformer  0.6200       0.9226        0.4587
                                        LDTR (Ours)     0.8872       0.9459        0.6191
optimizer is Adam, with a learning rate of 1e-4. Additionally, we incorporate a learning rate decay strategy by reducing the learning rate by 1e-4 every 80 epochs.
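The optimizer setup can be sketched as follows. Adam with a 1e-4 learning rate is stated above; the StepLR schedule and its decay factor below are only one plausible reading of the 80-epoch decay rule, not a confirmed detail.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for the LDTR model (illustration only)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)

for epoch in range(240):  # illustrative training-loop skeleton
    # ... forward pass, loss computation, and optimizer.step() would go here ...
    scheduler.step()      # decay the learning rate every 80 epochs
```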
Baselines’ details: We compare LDTR to SOTA linear object detectors, including SIINet [78] and
CoANet [56] from the segmentation-based category, as well as Relationformer [70] from the graph-based
category. For SIINet [78] and CoANet [56], we follow the hyperparameters settings and use the optimizer
mentioned in the papers. As for Relationformer [70], we keep the same hyperparameters settings and opti-
mizer as LDTR.
4.3.4 Experiment Results and Analysis
Table 4.2 presents the evaluation of precision, coverage, and connectivity for detection results using cor-
rectness, completeness, and APLS, respectively. Comparing the graph-based models (Relationformer and
LDTR) to the segmentation-based models (SIINet and CoANet) in Table 4.2, the correctness from graph-
based models is, on average, 23% higher than the segmentation-based models. The significant improvement
Figure 4.6: Visualization of detection results. The figures in the first and second columns are map images
and ground truth, respectively. The figures in the remaining columns are the detection results from three
baselines (SIINet, CoANet, Relationformer) and LDTR. The first two rows are for railroad detection. The
third and fourth rows are for waterline detection. The fifth row is for the scarp lines detection, and the sixth
and seventh rows are for thrust fault lines detection.
in correctness indicates that the graph-based models outperform segmentation-based models in reducing
false positive detection. The second and last row in Figure 4.6 shows that SIINet and CoANet (segmentation-
based models) falsely detect other black lines as railroads and thrust fault lines, respectively. SIINet and
CoANet use the strip convolutional kernels to capture image context along linear objects. However, the strip
convolutional kernels cannot capture the cartographic symbols as image context when the orientations of
desired lines do not align with the strip kernels. In contrast, the graph-based models using deformable at-
tention can capture distant cartographic symbols by flexible interactions among tokens positioned anywhere
in an image. This flexibility enables the graph-based detectors to capture distant cartographic symbols ef-
fectively. As a result, the detected lines from the graph-based models exhibit higher precision than the
segmentation-based models.
In the graph-based family, the correctness for thrust fault lines from LDTR is 78% higher than Rela-
tionformer. The last row in Figure 4.6 shows that Relationformer falsely detects small segments with single
triangles as thrust fault lines, while LDTR does not. Unlike Relationformer, the N-hop connectivity compo-
nent in LDTR explicitly allows the information exchange within a given N-hop distance, enabling LDTR to
emphasize the recurrent appearance of cartographic symbols as the pattern for desired lines. In the case of
the thrust fault lines detection, LDTR effectively captures that triangles appear multiple times within a given
N-hop distance. However, Relationformer, which only has the adjacency prediction component, does not
explicitly emphasize recurrent patterns and fails to learn the recurring patterns. As a result, Relationformer
falsely detects lines with single triangles as thrust fault lines. The LDTR’s ability to capture the recurrent
patterns highlights the importance of the N-hop connectivity component, which helps reduce false detection.
On average, achieving over 90% completeness in Table 4.2 shows that all models detected most of the
desired lines. However, APLS in Table 4.2 shows that detected lines from the segmentation-based models
have significantly more gaps than the graph-based models. For example, in the fourth row of Figure 4.6, wa-
terlines are blue lines with dots. The strip convolutional kernels in SIINet and CoANet (segmentation-based
models) fail to detect blue dots as patterns for curved waterlines, resulting in gaps in detected waterlines.
In contrast, the graph-based models, which can learn the distant patterns, outperform segmentation-based
models in generating complete waterline graphs, including the blue dots.
In the graph-based family, the connectivity of detected lines from LDTR is almost always more than 10%
higher than Relationformer in terms of APLS. The first row in Figure 4.6 shows that LDTR detects parallel
railroads, while Relationformer only detects one of the railroads. The railroads are long lines with slow
turns. Relationformer, only capturing the adjacency information, tends to connect all nodes into a single
long line, disregarding the topological relationship among multiple lines. In contrast, LDTR, capturing both
adjacency and N-hop connectivity information, learns the topological relationship among multiple lines. As
a result, the railroad’s detection in the first row of Figure 4.6 shows that LDTR can accurately detect parallel
railroads, while Relationformer predicts a single railroad. Besides constructing the accurate graph for lines
with slow turns, LDTR generates a more precise graph for curved lines compared to Relationformer, shown
in the third and fourth row in Figure 4.6. Relationformer, relying solely on adjacency information, fails to
capture the complex topology of waterlines, leading to false edge predictions. In contrast, LDTR leverages
the N-hop connectivity component to learn the local topological information of linear objects, including
the curvature of the lines. Consequently, LDTR generates more accurate graphs for curved waterlines than
Relationformer does.
We compare the learned patterns in Relationformer and LDTR by illustrating the attention and reference
points for the query node token in the self-attention and deformable cross-attention. In Figure 4.7, the
query node tokens are represented by red dots, while the remaining dots represent the key node tokens
that have high attention scores with the query node token in the self-attention of the decoder. The dots’
color gradient, ranging from yellow to orange, indicates the magnitude of the attention scores, with shades
closer to orange signifying higher scores. As we can see, the query tokens from LDTR have high attention
to the nearby tokens in the same line. In contrast, the query node tokens from Relationformer have high
Figure 4.7: The figures show the query node tokens’ reference points and attention toward the key node
tokens from LDTR (top row) and Relationformer (bottom row) for waterlines and scarp lines detection.
In the map images, the red dots represent the query node tokens, while the remaining dots represent the
top key node tokens which have high attention scores to the query node tokens. The dots’ color gradient,
ranging from yellow to orange, indicates the magnitude of the attention scores, with shades closer to orange
signifying higher scores. Additionally, the squares on the map images are the query node tokens’ reference
points in the multi-scaled deformable cross-attention. In the images with black backgrounds, the green dots
and white lines are the predicted nodes and edges, respectively.
attention to the tokens scattered across multiple lines. The high attention to the nearby tokens helps LDTR
learn the curvature and topological relations of the lines. To illustrate the cross-attention mechanism, we
plot reference points as squares in Figure 4.7. In the multi-scale cross-attentions, the query node token has
attention to image-patch tokens within offsets of the reference points. As we can see, the reference points
from LDTR locate on or close to query node tokens’ locations (red dots). In contrast, Relationformer’s
reference points situate away from the query node tokens. By having attention to more relevant node and
image-patch tokens compared to Relationformer, LDTR generates more precise graphs for waterlines and
scarp lines than Relationformer does, showing in figures in the black backgrounds in Figure 4.7.
4.3.5 Ablation Studies
We conduct two ablation studies for the N-hop connectivity component on waterline detection. The first
study investigates the impact of removing the N-hop connectivity component on the accuracy of graph
generation. The second study explores the effect of introducing an additional token for connectivity learning
on graph generation.
Similar to the edge token, the connectivity token (conn-token) interacts with the node tokens through the attention mechanism. After the decoder, the N-hop connectivity component takes the refined conn-token concatenated with pair-wise node tokens as input to predict whether two node tokens are connected within a given N-hop distance. The conn-token exclusively interacts with node tokens for two reasons. First, the conn-token's purpose is to capture spatial context, while the image-patch tokens focus on image context; thus, the conn-token does not attend to the image-patch tokens. Second, interacting with the edge token would result in the conn-token learning superficial connectivity information directly from the edge token. A deterministic mapping exists between edge existence (adjacency) and N-hop connectivity in a graph. If the conn-token attended to the edge token, the refined conn-token would obtain adjacency information from the edge token, and the N-hop connectivity component would simply learn the mapping from adjacency to N-hop connectivity. Therefore, interacting only with node tokens ensures that the conn-token learns spatial information from node tokens rather than deriving it from adjacency information.
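To make the relationship between adjacency and N-hop connectivity concrete, the following minimal sketch (not part of LDTR's implementation; the NumPy formulation and function name are illustrative) derives N-hop connectivity labels from a binary adjacency matrix by repeatedly expanding the set of reachable nodes:

```python
import numpy as np

def n_hop_connectivity(adjacency: np.ndarray, n: int) -> np.ndarray:
    """Return a boolean matrix whose (i, j) entry is True if node j is reachable
    from node i within at most n edges. This is the deterministic mapping from
    adjacency to N-hop connectivity mentioned in the text."""
    num_nodes = adjacency.shape[0]
    adj = (adjacency > 0).astype(int)
    reach = np.eye(num_nodes, dtype=int)                 # 0 hops: each node reaches itself
    for _ in range(n):
        reach = ((reach + reach @ adj) > 0).astype(int)  # extend reachability by one hop
    conn = reach.astype(bool)
    np.fill_diagonal(conn, False)                        # ignore trivial self-connectivity
    return conn

# A 4-node path graph 0-1-2-3: nodes 0 and 3 are connected within 3 hops but not 2.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
print(n_hop_connectivity(adj, 2)[0, 3], n_hop_connectivity(adj, 3)[0, 3])  # False True
```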
Without the N-hop connectivity component, LDTR reverts to its base model, Relationformer. Table 4.2 shows that the connectivity of the graphs from LDTR (measured by APLS) is on average 10% higher than that of Relationformer. Furthermore, Figure 4.7 shows that LDTR learns more relevant spatial and image context for graph generation. Hence, the N-hop connectivity component helps generate accurate graphs.
Table 4.3: The waterline detection evaluation. LDTR+c-token⋆ refers to LDTR with a connectivity token.

Model           Correctness  Completeness  APLS
LDTR            0.9890       0.9737        0.6571
LDTR+c-token⋆   0.9939       0.9328        0.5752
Table 4.3 shows that the conn-token leads to poor connectivity in the detected waterline graph, decreasing completeness by 0.04 and APLS by 0.08. Figure 4.8 shows that LDTR with the conn-token misses a long segment of waterlines because a single conn-token has limited capacity to capture the complete spatial context. The incomplete spatial context causes a connected sub-graph, such as the long waterline segment in Figure 4.8, to be missed. Therefore, a single conn-token has a negative impact on graph connectivity. To avoid the limited capacity of one conn-token, we could assign one conn-token to each pair of node tokens. However, the number of tokens would then increase dramatically: N node tokens require N² − 1 conn-tokens, and the resulting computation quickly becomes intractable. Therefore, directly concatenating pair-wise node tokens, as employed in LDTR, is a simple but highly effective design choice.
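As a rough illustration of this pairwise design, the PyTorch-style sketch below scores connectivity for all node-token pairs by direct concatenation. The class name, layer sizes, and hidden dimensions are hypothetical placeholders, not LDTR's actual configuration.

```python
import torch
import torch.nn as nn

class PairwiseConnectivityHead(nn.Module):
    """Minimal sketch of an N-hop connectivity head that scores every pair of
    refined node tokens by concatenating them (illustrative, not LDTR's exact code)."""

    def __init__(self, token_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * token_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # logit: connected within N hops or not
        )

    def forward(self, node_tokens: torch.Tensor) -> torch.Tensor:
        # node_tokens: (batch, num_nodes, token_dim)
        b, n, d = node_tokens.shape
        left = node_tokens.unsqueeze(2).expand(b, n, n, d)   # token i repeated along dim 2
        right = node_tokens.unsqueeze(1).expand(b, n, n, d)  # token j repeated along dim 1
        pairs = torch.cat([left, right], dim=-1)             # (b, n, n, 2d) pairwise concat
        return self.mlp(pairs).squeeze(-1)                   # (b, n, n) connectivity logits

# Usage: logits = PairwiseConnectivityHead(256)(torch.randn(2, 60, 256))
```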
Figure 4.8: The waterline detection results from LDTR and LDTR with a conn-token (panels from left to right: map image, ground truth, LDTR, LDTR+conn⋆). LDTR+conn⋆ refers to LDTR with a conn-token.
4.3.6 Sensitivity Studies
The sensitivity studies test the effect of the value of N in the N-hop connectivity component and the number of node tokens on the detection performance. We use waterlines, which are curved lines with complicated topology, for the sensitivity studies.
Table 4.4: The evaluation for waterline detection with various N in the N-hop connectivity prediction head.

N-hop connectivity  Correctness  Completeness  APLS
3-hop               0.9951       0.9719        0.4932
5-hop               0.9888       0.9697        0.5763
7-hop               0.9890       0.9737        0.6571
9-hop               0.9825       0.9319        0.4609
Figure 4.9: The detected waterlines from LDTR with various N in the N-hop connectivity prediction head (panels: map image, ground truth, 3-hop, 5-hop, 7-hop, 9-hop).
Table 4.4 evaluates the waterline detection with different N values in the N-hop connectivity prediction head. The correctness and completeness scores for 3-, 5-, 7-, and 9-hop connectivity are similar, but 7-hop connectivity achieves the highest APLS. Figure 4.9 shows the waterline detection results from 3-, 5-, 7-, and 9-hop connectivity. Overly small (3-hop) or large (9-hop) values of N can lead to inaccurate connectivity in the waterline graph, particularly when the graph contains multiple nearby waterlines. A comparison between the waterlines detected with 5-hop and 7-hop connectivity shows that 7-hop connectivity generates a smoother graph than 5-hop connectivity.
This experiment shows that N in the N-hop connectivity component is an important hyperparameter for learning the spatial information of linear objects. A small N cannot capture enough topological information for accurate graph generation. However, a larger N does not necessarily yield a more precise graph: a large N, such as 9 for waterline detection, makes the model attend to irrelevant nodes, i.e., nodes in the same sub-graph but located far apart. Therefore, choosing a proper N is crucial for predicting an accurate graph. We empirically set N to approximately M/2, where M is the number of nodes representing a line vector of the desired linear objects. Since curved lines are represented by more nodes than straight lines, we set N to seven for curved linear objects, such as waterlines, and five for linear objects with slow turns, such as railroads.
Table 4.5: The evaluation for waterline detection with different numbers of node tokens.

#Tokens  Correctness  Completeness  APLS
60       0.9890       0.9737        0.6571
80       0.9914       0.9435        0.4953
100      0.9536       0.9258        0.3891
Table 4.5 presents the evaluation of waterline detection with varying numbers of node tokens. Since some patches do not contain any desired lines, training with a large number of node tokens results in many more negative node tokens than positive node tokens. In Table 4.5, increasing the number of node tokens (and hence the imbalance ratio) decreases LDTR's completeness and APLS scores. The decrease in completeness indicates that LDTR misses more nodes as the number of node tokens increases. Figure 4.10 shows the waterlines detected by LDTR with 60, 80, and 100 node tokens. The waterlines detected with 100 node tokens contain more false-positive nodes than those detected with 60 or 80 node tokens. Consequently, the connectivity of the lines predicted with 100 node tokens is the worst among the three settings.
This experiment shows the significance of selecting an appropriate number of node tokens to mitigate the negative impact of the imbalanced training data on the detection results. To minimize the data imbalance, we adopt a common solution used in object detection tasks [70]: we set the number of node tokens to max(V), where V is the set of node counts over the training images.
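As a tiny illustrative snippet (the annotation format and field name are hypothetical), the node-token budget can be read off the training annotations as follows:

```python
# Hypothetical annotation format: each ground-truth graph stores its node list under "nodes".
def num_node_tokens(training_graphs):
    return max(len(graph["nodes"]) for graph in training_graphs)
```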
Figure 4.10: The detected waterlines from LDTR with various numbers of node tokens (panels: map image, ground truth, 60 tokens, 80 tokens, 100 tokens).
4.4 Related Work
There are two main SOTA linear object detector categories: graph-based and segmentation-based models.
Segmentation-based models predict whether each pixel belongs to a desired object in the image. Graph-based models
construct graphs for linear objects directly from images.
Segmentation-based detection models. The architecture of segmentation-based models [13, 56, 78, 89] is usually U-shaped [63], consisting of an encoder and a decoder. The encoder learns the image context using convolutional kernels with gradual downsampling of the spatial resolution, and the decoder gradually recovers a high-resolution prediction map. However, convolutional kernels are usually square, whereas linear objects have an elongated and slender shape. As a result, square convolutional kernels tend to capture irrelevant image context for linear objects. To capture relevant image context along the linear objects, the SOTA segmentation-based models [13, 56, 89] propose horizontal-, vertical-, and diagonal-shaped (strip) convolutional kernels. However, strip convolution kernels are insufficient to capture the image context along linear objects whose orientations are not horizontal, vertical, or diagonal.
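For intuition, a strip convolution simply uses an elongated kernel. The sketch below is a generic illustration with placeholder channel counts and kernel lengths (diagonal variants, which require rotated or custom sampling, are omitted), not the exact layers of the cited models.

```python
import torch
import torch.nn as nn

# Horizontal and vertical "strip" convolutions: elongated kernels that aggregate
# context along a row or a column instead of a square neighborhood.
horizontal_strip = nn.Conv2d(64, 64, kernel_size=(1, 9), padding=(0, 4))  # 1x9 kernel
vertical_strip   = nn.Conv2d(64, 64, kernel_size=(9, 1), padding=(4, 0))  # 9x1 kernel

features = torch.randn(1, 64, 128, 128)          # dummy feature map
out = horizontal_strip(features) + vertical_strip(features)  # combine directional context
```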
The encoders in some other segmentation-based detectors [22, 38, 75, 79, 106, 107] leverage the attention mechanism in Transformer or its variants [51, 108] to learn the image context along the linear objects. However, since these models predict each pixel independently and the attention mechanism does not explicitly capture two-dimensional spatial information, the models cannot explicitly capture the correlations among nearby pixels, such as the fact that nearby pixels with similar colors are likely to belong to the same object. As a result, small contextual variations near the desired linear objects can cause the models to miss a few pixels belonging to the desired lines. Consequently, the lines detected by segmentation-based models are likely to contain gaps.
Graph-based detection models. Some graph-based detectors [30, 48, 87] utilize CNNs or RNNs to construct graphs. For example, PolyMapper [48] uses a combination of a CNN and an RNN to predict graphs for road networks: the CNN extracts the roads' nodes, and the RNN connects the nodes. However, PolyMapper assumes that all the nodes extracted by the CNN are correct; consequently, the RNN generates a false graph when the nodes from the CNN are wrong. Another example is Sat2Graph [30], which proposes a graph-tensor encoding module (GTE) that encodes the adjacency information of road networks into a tensor. However, the context learned by the GTE primarily captures local adjacency relationships and does not adequately consider distant context (e.g., the recurrence of cartographic symbols along a linear object). Because the representative patterns for linear objects in maps are typically distant cartographic symbols, such as black crosses for railroads, Sat2Graph, without capturing distant context, cannot reduce false detections either.
Other graph-based models [31, 70, 95, 103] utilize Transformer to construct graphs. For example, RNGDet [95] constructs road graphs by iteratively predicting the next adjacent nodes; the Transformer in RNGDet captures the context among previously predicted nodes. However, without using all nodes as context, RNGDet cannot learn the complete spatial information of graphs, limiting its ability to improve connectivity. In contrast, Relationformer [70] incorporates all nodes as context, enabling the acquisition of more comprehensive spatial information than RNGDet. Relationformer's experiments [70] also show that Relationformer outperforms other SOTA graph-based detectors, such as Sat2Graph [30]. However, Relationformer suffers from inaccurate graph generation because it does not explicitly learn connectivity information among nodes. In summary, the existing graph-based methods lack the capability to adequately capture complete spatial context, resulting in limitations in improving connectivity for linear object detection from scanned historical topographic and geological map images.
Chapter 5
Conclusion and Future Extensions
In this chapter, I first summarize the contributions and limitations of proposed approaches for labeling and
extraction tasks. Second, I discuss potential future research directions.
5.1 Contributions and Limitations
This dissertation introduces innovative approaches that minimize the manual effort required to accurately
extract polygonal and linear geographic objects from scanned historical maps. Typically, generating labeled
data for training extraction models requires significant manual work. However, by using external vector
data, we can reduce the manual efforts to label desired objects on the maps. To label polygon symbols,
Chapter 2 presents the approach that only requires one or a few manually annotated symbols to identify all
the desired symbols from candidates provided by external vector data. For labeling linear objects, Chapter
3 presents the method to label desired linear objects near external vector data automatically. Subsequently,
Chapter 4 presents the extraction model capable of extracting precise and continuous vector lines for linear
objects. The rest of this section summarizes the contributions of each proposed labeling or extraction method
in detail.
In Chapter 2, the proposed target-guided generative model (TGGM) aims to identify desired symbols
from candidates provided by external vector data. Candidates are cropped images, including desired symbols
and other symbols, generated by a sliding-window approach across a map area covered by the external vector
polygon. The identifying process involves grouping candidate images into desired and non-desired clusters.
As a result, the images in the desired cluster serve as the labeled desired symbols to train an extraction
model.
TGGM is a weakly supervised probabilistic clustering framework based on the Variational Auto-encoder and Gaussian Mixture Models. The main contribution of TGGM is that it can exploit a few labeled desired symbols among the candidate images to guide the clustering process so that the images in one of the resulting clusters contain the desired symbols. Specifically, unlike existing approaches, in which the training process can only update the Mixture of Gaussians (MoG) as a whole or use a subset of the data to update a specific component of the MoG in separate optimization iterations (e.g., a "minibatch"), TGGM's network design allows individual components of the MoG to be updated separately within the same optimization iteration. Using labeled and unlabeled data in the same optimization iteration is important: when the data in some categories are (partially) labeled, each iteration should update the MoG with both labeled and unlabeled data so that specific components of the MoG specialize in the distributions of the partially labeled categories.
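To make this idea of a joint update concrete, here is a heavily simplified, hypothetical sketch. It ignores the VAE encoder/decoder and the full ELBO, and all function names and shapes are illustrative rather than TGGM's actual objective: unlabeled latent codes contribute a loss over the whole mixture, while codes of the labeled desired symbols contribute a loss only to the desired Gaussian component, within the same iteration.

```python
import torch

def mog_nll(z, means, log_vars, weights):
    """Negative log-likelihood (up to a constant) of latent codes z under a Mixture
    of Gaussians with diagonal covariance. Shapes: z (B, D); means, log_vars (K, D);
    weights (K,)."""
    diff = z.unsqueeze(1) - means.unsqueeze(0)                           # (B, K, D)
    log_prob = -0.5 * ((diff ** 2) / log_vars.exp() + log_vars).sum(-1)  # (B, K)
    return -torch.logsumexp(log_prob + weights.log(), dim=1).mean()

def tggm_style_loss(z_unlabeled, z_desired, means, log_vars, weights, desired_idx=0):
    """Hypothetical simplification: unlabeled codes update the whole MoG, while codes
    of labeled desired symbols update only the desired component, in one iteration."""
    loss_unlabeled = mog_nll(z_unlabeled, means, log_vars, weights)
    diff = z_desired - means[desired_idx]
    loss_desired = 0.5 * ((diff ** 2) / log_vars[desired_idx].exp()
                          + log_vars[desired_idx]).sum(-1).mean()
    return loss_unlabeled + loss_desired
```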
TGGM adopts an iterative learning process to gather the desired images within the candidate images. The
labeled desired images for TGGM training start from a few manually labeled images and gradually increase
as additional images are assigned to the desired cluster. In each iteration’s training process, TGGM uses
images belonging to the desired cluster from the last iteration to optimize the desired Gaussian component.
During the inference process of each iteration, TGGM obtains additional desired images. TGGM progres-
sively enhances the coverage of the desired cluster by optimizing the desired Gaussian component with
additional labeled desired images. Iterative learning ends when the number of images in the desired cluster
remains unchanged.
The experiment shows that TGGM achieves superior accuracy in wetland symbol labeling compared
to unsupervised generative clustering methods. Furthermore, TGGM’s labeling results are comparable to
the results from the semi-supervised generative clustering method but with the significant advantage of
requiring substantially less manual effort. Subsequently, the extraction model trained by the labeled data
from the TGGM archives an average accuracy of 80%. The extracted polygons successfully encompass most
of the polygonal objects on the maps. The slight reduction in accuracy can be attributed to the delineation
of polygon boundaries. The extracted polygons may not precisely align with the object boundaries on the
maps. Despite the slight imprecision in the extracted boundaries, the extracted polygons are sufficient for
further analysis, such as examining the spatial distribution of polygonal objects.
The effectiveness of TGGM relies on the availability of suitable external vector data to ensure the accu-
rate generation of labeled data. TGGM requires external vector data to cover map areas that include both the
desired symbols and other objects falling within limited categories. If the external vector data do not cover
the desired symbols on the maps, TGGM fails to label symbols. Additionally, labeled symbols could be-
come noisy when the external vector data contains other objects in diverse categories. TGGM fails to form
a compact cluster for diverse non-desired objects, which could lead to false assignments of non-desired ob-
jects to the desired cluster. Therefore, TGGM is most effective when the external vector data align with the
assumption, i.e., covering a group of desired symbols and other objects from limited categories.
In Chapter 3, the proposed automatic label generation algorithm (ALG) aims to automatically label
pixels belonging to the desired linear objects near external vector lines. The ALG algorithm utilizes the
color information from the map images plus the location and shape information from the external vector
data to label pixels belonging to the desired linear objects on the maps. The labeled group of pixels exhibits
homogeneous colors, is close to the external vector data, and has a shape similar to that of the external vector
data.
The ALG algorithm innovatively incorporates shape information from the external vector data to ensure
that the labeled pixels belong to the desired linear objects. Without using the shape information, existing
approaches could falsely label other objects close to the external vector data. The ALG algorithm calculates
an affine transformation as a key step in effectively comparing shape similarity between the labeled pixels
and the rasterized external vector data. To measure shape similarity, the ALG algorithm calculates the over-
lap size between the labeled pixels and the rasterized external vector data. However, due to misalignment
between the external vector data and desired linear objects on the maps, a direct calculation of the overlap
size cannot accurately represent the true shape similarity. The ALG algorithm applies an affine transfor-
mation to align the rasterized vector data with the labeled desired lines. After an affine transformation, the
overlap size can accurately measure the shape similarity between the desired labeled lines and the rasterized
vector data. In addition to high shape similarity, the ALG algorithm ensures that labeled pixels belong to
the desired linear objects by considering labeled pixels’ homogeneous color and proximity to the external
vector data.
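As a rough, simplified illustration of the overlap check after affine alignment (the estimation of the affine parameters is omitted; the function name, mask shapes, and the translation-only example are hypothetical, not the ALG algorithm's actual code):

```python
import numpy as np
from scipy import ndimage

def shape_overlap(labeled_mask: np.ndarray, vector_mask: np.ndarray,
                  affine: np.ndarray, offset: np.ndarray) -> float:
    """Warp the rasterized external vector mask with a given affine transformation,
    then measure how much of it overlaps the candidate labeled pixels."""
    warped = ndimage.affine_transform(vector_mask.astype(float), affine,
                                      offset=offset, order=0) > 0.5
    intersection = np.logical_and(labeled_mask, warped).sum()
    return intersection / max(warped.sum(), 1)   # fraction of the vector mask covered

# Example: a pure translation of (dy, dx) = (3, -2) expressed as identity matrix + offset.
labeled = np.zeros((64, 64), bool); labeled[10:12, 5:40] = True
vector  = np.zeros((64, 64), bool); vector[13:15, 3:38] = True
print(shape_overlap(labeled, vector, np.eye(2), np.array([3.0, -2.0])))  # close to 1.0
```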
The experiment shows that the labeled linear objects obtained from the ALG algorithm exhibit an aver-
age 10% higher precision than the SOTA baseline, which only uses color and location information. Conse-
quently, when using the labeled data from the ALG algorithm, the extracted linear objects are significantly
more accurate than the extraction obtained from the baselines.
The effectiveness of the ALG algorithm depends on the location information provided by the rasterized
external vector data. The ALG algorithm labels desired pixels near the rasterized external vector data. If the
rasterized external vector data are located far away from the desired linear objects on the maps, the ALG
algorithm fails to label the desired linear objects accurately. However, in most cases, there are small shifts
between the rasterized external vector data and the desired linear objects on the maps due to misalignment
caused by differences in coordinate systems between the external vector data and the maps.
In Chapter 4, the proposed linear object detection transformer (LDTR) aims to learn a representative
image context and sufficient spatial context to extract precise and continuous vector lines for desired linear
geographic objects on scanned historical topographic and geological maps. LDTR leverages the attention
mechanism proposed by Transformer [88] to learn the image and spatial context.
To learn sufficient spatial context, the novel N-hop connectivity prediction module in LDTR encourages
the node tokens to gather information from other node tokens connected within a given N-hop distance. By
considering a wide connectivity extent, LDTR effectively captures complex spatial context, including the
line’s orientations and curvature, as well as the topological relationships among multiple lines. The learned
spatial context helps LDTR predict edges in a complex graph accurately. In contrast, existing detectors
could generate false edges for linear objects with complex topological relationships, such as curved lines, because they only capture spatial information for adjacent nodes. To learn representative image context, particularly representative cartographic symbols, LDTR uses deformable attention [108], which lets each query selectively attend to a small set of relevant local image features. In summary, LDTR achieves precise extraction by learning representative image context, i.e., cartographic symbols, and obtains continuous extracted lines by learning sufficient spatial context from the N-hop connectivity prediction module.
The experiments show that LDTR significantly improves the connectivity of extracted lines compared to SOTA approaches. The significant connectivity improvement highlights the effectiveness of LDTR in capturing spatial context for linear objects. Additionally, LDTR won first place in the United States Geological Survey and Defense Advanced Research Projects Agency 2022 AI for Critical Mineral Assessment Competition (https://criticalminerals.darpa.mil/The-Competition), significantly outperforming the second-place entry by 184%.
5.2 Future Extensions
The georeferenced and machine-editable geographic objects extracted from scanned historical topographic
and geological maps offer many possibilities for future research. This section presents possible future direc-
tions based on the research in this dissertation.
Integrating uncertainty measurement [28, 76] into geographic object extraction acknowledges that map data are often imprecise, leading to variations in the accuracy of the extracted objects. By quantifying uncertainty, the extraction model can provide a more nuanced representation of the reliability of the extraction results. Uncertainty measurement for map images is particularly important when ambiguities or inaccuracies are inherent in the map data, especially for scanned maps subject to diverse scanning conditions. Various methods, such as probabilistic modeling and Bayesian analysis [52], can be employed to measure and convey uncertainty. Incorporating uncertainty into the extraction process enhances the overall understanding of the extracted geographic objects, allowing users to make more informed decisions based on the confidence levels associated with the results.
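For instance, one common approximate Bayesian technique is Monte-Carlo dropout. The sketch below is generic and not tied to any model in this dissertation; it assumes `model` is a segmentation network that contains dropout layers and outputs per-pixel logits.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, image: torch.Tensor, num_samples: int = 20):
    """Estimate per-pixel extraction probability and uncertainty by sampling the
    network several times with dropout kept active."""
    model.train()                      # keep dropout active at inference time
    probs = torch.stack([torch.sigmoid(model(image)) for _ in range(num_samples)])
    mean = probs.mean(dim=0)           # average extraction probability per pixel
    std = probs.std(dim=0)             # per-pixel uncertainty estimate
    return mean, std
```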
Spatial context learning [36] is a future direction to enhance extraction accuracy for geographic objects
on map images. Spatial context provides information about the relationships and arrangements of objects
within the geographical space. By considering the surrounding context of a desired object, the extraction
model can better discriminate between desired objects and potential false positives. For instance, in the case
of road extraction, understanding the typical patterns of road networks and their alignment with other objects
like buildings or water bodies aids in distinguishing roads from similar-looking linear features. Moreover,
spatial context also aids in resolving uncertainties caused by occlusions, overlapped objects, or map condi-
tions. By incorporating spatial context, the extraction model can make more informed decisions, resulting
in improved accuracy by leveraging the intrinsic relationships among objects within the geographical land-
scape.
The extracted vector data for polygon and linear objects can be input for the geospatial data linkage
task [65], which involves the process of connecting or associating different sets of geospatial data based
on the common spatial attributes of the data. Geospatial data linkage helps to combine information from
various sources to completely represent the real-world environment. The linked geospatial data help facili-
tate decision-making processes that require a comprehensive understanding of the spatial relationships and
interactions between different geospatial datasets, such as urban planning, environmental monitoring, and
disaster response.
5.3 Conclusions
This dissertation presents approaches to extracting polygon and linear geographic objects from scanned
historical topographic and geological maps with minimal human involvement. Most of the manual work typically lies in labeling data for training extraction models. The proposed approaches leverage external vector
data to reduce manual work for the labeling process. To label desired symbols of polygon objects, the
proposed generative clustering approach requires one or a few labeled symbols to guide the construction of
one cluster exclusively for the desired symbols. The images in the desired cluster are labeled data for training
the polygon object extraction model. For labeling linear objects, the proposed approach automatically labels
groups of pixels that exhibit homogeneous color, are close to the external vector data, and share a similar
shape to the external vector data. The labeled pixels serve as labeled linear objects to train a linear object
extraction model. The proposed extraction model uses multi-scale deformable attention to capture distant cartographic symbols, reducing false detections and enabling the extraction of precise linear objects. The
innovative N-hop connectivity component in the proposed model improves the connectivity of the detected
linear objects. Integrating all proposed methods allows the extraction of polygon and linear geographic
objects from scanned historical topographic and geological maps with minimum human work.
The extracted geographic objects help reveal valuable information about natural features and human
activities, such as critical minerals [33] and the development of the railroad networks [1]. Linking the ex-
traction results to other data sources allows spatial analysis and spatial data integration for various domains,
such as urban planning, environmental monitoring, disaster response, and public health.
References
[1] T Berger. “Railroads and rural industrialization: Evidence from a historical policy experiment”. In:
Explorations in Economic History 74 (2019), p. 101277.
[2] Bharath Bhushan Damodaran, Rémi Flamary, Vivien Seguy, and Nicolas Courty. “An Entropic
Optimal Transport Loss for Learning Deep Neural Networks under Label Noise in Remote Sensing
Images”. In: arXiv e-prints (2018), arXiv–1810.
[3] N Carion, F Massa, G Synnaeve, N Usunier, A Kirillov, and S Zagoruyko. “End-to-end object
detection with transformers”. In: European conference on computer vision. Springer. 2020,
pp. 213–229.
[4] N Carion, F Massa, G Synnaeve, N Usunier, A Kirillov, and S Zagoruyko. “End-to-end object
detection with transformers”. In: European conference on computer vision. Springer. 2020,
pp. 213–229.
[5] Elena Cervelli, Ester Scotto di Perta, and Stefania Pindozzi. “Identification of marginal landscapes
as support for sustainable development: GIS-based analysis and landscape metrics assessment in
southern Italy areas”. In: Sustainability 12.13 (2020), p. 5400.
[6] T.F. Chan and L.A. Vese. “Active contours without edges”. In: IEEE Transactions on Image
Processing 10.2 (2001), pp. 266–277. DOI: 10.1109/83.902291.
[7] Tony Chan and Wei Zhu. “Level set based shape prior segmentation”. In: 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Vol. 2. IEEE. 2005,
pp. 1164–1170.
[8] Bin Chen, Weihua Sun, and Anthony Vodacek. “Improving image-based characterization of road
junctions, widths, and connectivity by leveraging OpenStreetMap vector map”. In: 2014 IEEE
Geoscience and Remote Sensing Symposium. IEEE. 2014, pp. 4958–4961.
[9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam.
“Encoder-decoder with atrous separable convolution for semantic image segmentation”. In:
Proceedings of the European conference on computer vision (ECCV). 2018, pp. 801–818.
[10] Y-Y Chiang, W Duan, S Leyk, J Uhl, and C Knoblock. Using historical maps in scientific studies:
Applications, challenges, and best practices. Springer, 2020.
[11] Y-Y Chiang, S Leyk, and C Knoblock. “A survey of digital map processing techniques”. In: ACM
Computing Surveys (CSUR) 47.1 (2014), pp. 1–44.
[12] Daniel Cremers, Nir Sochen, and Christoph Schnörr. “Towards recognition-based variational
segmentation using shape priors and dynamic labeling”. In: International Conference on
Scale-Space Theories in Computer Vision. Springer. 2003, pp. 388–400.
[13] Ling Dai, Guangyun Zhang, and Rongting Zhang. “RADANet: Road Augmented Deformable
Attention Network for Road Extraction from Complex High-Resolution Remote-Sensing Images”.
In: IEEE Transactions on Geoscience and Remote Sensing 61 (2023), pp. 1–13.
[14] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni,
Kai Arulkumaran, and Murray Shanahan. “Deep unsupervised clustering with gaussian mixture
variational autoencoders”. In: arXiv:1611.02648 (2016).
[15] R Dong, X Pan, and F Li. “DenseU-net-based semantic segmentation of small objects in urban
remote sensing images”. In: IEEE Access 7 (2019), pp. 65347–65356.
[16] W Duan, Y-Y Chiang, S Leyk, J Uhl, and C Knoblock. “A Label Correction Algorithm Using Prior
Information for Automatic and Accurate Geospatial Object Recognition”. In: 2021 IEEE
International Conference on Big Data (Big Data). IEEE. 2021, pp. 1604–1610.
[17] W Duan, Y-Y Chiang, S Leyk, J Uhl, and C Knoblock. “Guided Generative Models using Weak
Supervision for Detecting Object Spatial Arrangement in Overhead Images”. In: 2021 IEEE
International Conference on Big Data (Big Data). IEEE. 2021, pp. 725–734.
[18] Weiwei Duan, Yao-Yi Chiang, Craig A Knoblock, Vinil Jain, Dan Feldman, Johannes H Uhl, and
Stefan Leyk. “Automatic alignment of geographic features in contemporary vector data and
historical maps”. In: Proceedings of the 1st workshop on artificial intelligence and deep learning
for geographic knowledge discovery. 2017, pp. 45–54.
[19] N Dumakor-Dupey and S Arya. “Machine Learning—A Review of Applications in Mineral
Resource Estimation”. In: Energies 14.14 (2021), p. 4079.
[20] M Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. “Infinite variational autoencoder
for semi-supervised learning”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2017, pp. 5888–5897.
[21] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual
Object Classes Challenge 2012 (VOC2012) Results.
http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[22] E Firkat, J Zhang, D Wu, M Yang, J Zhu, and A Hamdulla. “ARDformer: agroforestry road
detection for autonomous driving using hierarchical transformer”. In: Sensors 22.13 (2022),
p. 4696.
[23] Simone Fobi, Terence Conlon, Jayant Taneja, and Vijay Modi. “Learning to segment from
misaligned and partial labels”. In: Proceedings of the 3rd ACM SIGCAS Conference on Computing
and Sustainable Societies. 2020, pp. 286–290.
[24] Yan Gao, Boxiao Liu, Nan Guo, Xiaochun Ye, Fang Wan, Haihang You, and Dongrui Fan.
“C-midn: Coupled multiple instance detection network with segmentation guidance for weakly
supervised object detection”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2019, pp. 9834–9843.
[25] Austen Groener, Gary Chern, and Mark Pritt. “A comparison of deep learning object detection
models for satellite imagery”. In: 2019 IEEE Applied Imagery Pattern Recognition Workshop
(AIPR). IEEE. 2019, pp. 1–10.
[26] M-H Guo, T-X Xu, J-J Liu, Z-N Liu, P-T Jiang, T-J Mu, S-H Zhang, R Martin, M-M Cheng, and
S-M Hu. “Attention mechanisms in computer vision: A survey”. In: Computational visual media
8.3 (2022), pp. 331–368.
[27] Yinong Guo, Chen Wu, Bo Du, and Liangpei Zhang. “Density Map-based vehicle counting in
remote sensing images with limited resolution”. In: ISPRS Journal of Photogrammetry and Remote
Sensing 189 (2022), pp. 201–217.
[28] Reihaneh H Hariri, Erik M Fredericks, and Kate M Bowers. “Uncertainty in big data analytics:
survey, opportunities, and challenges”. In: Journal of Big Data 6.1 (2019), pp. 1–16.
[29] K He, X Zhang, S Ren, and J Sun. “Deep residual learning for image recognition”. In: Proceedings
of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[30] S He, F Bastani, S Jagwani, M Alizadeh, H Balakrishnan, S Chawla, M Elshrif, S Madden, and
M A Sadeghi. “Sat2Graph: road graph extraction through graph-tensor encoding”. In: Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part XXVII 16. Springer. 2020, pp. 51–67.
[31] Y He, R Garg, and A R Chowdhury. “TD-Road: Top-down road network extraction with holistic
graph construction”. In: European Conference on Computer Vision. Springer. 2022, pp. 562–577.
[32] C Heipke, H Mayer, C Wiedemann, and O Jamet. “Evaluation of automatic road extraction”. In:
International Archives of Photogrammetry and Remote Sensing 32.3 SECT 4W2 (1997),
pp. 151–160.
[33] J Hronsky and O Kreuzer. “Applying spatial prospectivity mapping to exploration targeting:
Fundamental practical issues and suggested solutions for the future”. In: Ore Geology Reviews 107
(2019), pp. 647–653.
[34] Tao Hu, Pascal Mettes, Jia-Hong Huang, and Cees GM Snoek. “Silco: Show a few images, localize
the common object”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 5067–5076.
[35] Young Kyun Jang and Nam Ik Cho. “Generalized Product Quantization Network for
Semi-Supervised Image Retrieval”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. 2020, pp. 3420–3429.
[36] Krzysztof Janowicz, Song Gao, Grant McKenzie, Yingjie Hu, and Budhendra Bhaduri. GeoAI:
spatially explicit artificial intelligence techniques for geographic knowledge discovery and beyond .
2020.
[37] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. “Consistency-based semi-supervised
learning for object detection”. In: Proceedings of Advances in Neural Information Processing
Systems. 2019, pp. 10759–10768.
[38] X Jiang, Y Li, T Jiang, J Xie, Y Wu, Q Cai, J Jiang, Ji Xu, and H Zhang. “RoadFormer: Pyramidal
deformable vision transformers for road network extraction with remote sensing images”. In:
International Journal of Applied Earth Observation and Geoinformation 113 (2022), p. 102987.
[39] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. “Variational deep
embedding: An unsupervised and generative approach to clustering”. In: arXiv:1611.05148 (2016).
[40] C Jiao, M Heitzler, and L Hurni. “A fast and effective deep learning approach for road extraction
from historical maps by automatically generating training data with symbol reconstruction”. In:
International Journal of Applied Earth Observation and Geoinformation 113 (2022), p. 102980.
[41] Jian Kang, Ruben Fernandez-Beltran, Xudong Kang, Jingen Ni, and Antonio Plaza.
“Noise-Tolerant Deep Neighborhood Embedding for Remotely Sensed Images With Label Noise”.
In: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021),
pp. 2551–2562.
[42] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. “Two-phase learning for weakly
supervised object localization”. In: Proceedings of the IEEE International Conference on
Computer Vision. 2017, pp. 3534–3543.
[43] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint
arXiv:1312.6114 (2013).
[44] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. “Semi-supervised
learning with deep generative models”. In: Proceedings of Advances in Neural Information
Processing Systems. 2014, pp. 3581–3589.
[45] A Kirillov, E Mintun, N Ravi, H Mao, C Rolland, L Gustafson, T Xiao, S Whitehead, A Berg,
W-Y Lo, et al. “Segment anything”. In: arXiv preprint arXiv:2304.02643 (2023).
[46] S Li, J Chen, and J Xiang. “Applications of deep convolutional neural networks in prospecting
prediction based on two-dimensional geological big data”. In: Neural computing and applications
32 (2020), pp. 2037–2053.
[47] Xiaoyan Li, Meina Kan, Shiguang Shan, and Xilin Chen. “Weakly supervised object detection with
segmentation collaboration”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2019, pp. 9735–9744.
[48] Z Li, J D Wegner, and A Lucchi. “Topological map extraction from overhead images”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1715–1724.
[49] Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, and Shuicheng Yan. “Towards
computational baby learning: A weakly-supervised approach for object detection”. In: Proceedings
of the IEEE International Conference on Computer Vision. 2015, pp. 999–1007.
[50] Jiamin Liu, Jianhua Yao, Mohammadhadi Bagheri, Veit Sandfort, and Ronald M Summers. “A
Semi-Supervised CNN learning method with pseudo-class labels for atherosclerotic vascular
calcification detection”. In: Proceedings of IEEE International Symposium on Biomedical Imaging.
2019, pp. 780–783.
[51] Z Liu, Y Lin, Y Cao, H Hu, Y Wei, Z Zhang, S Lin, and B Guo. “Swin transformer: Hierarchical
vision transformer using shifted windows”. In: Proceedings of the IEEE/CVF international
conference on computer vision. 2021, pp. 10012–10022.
[52] Yuchi Ma, Zhou Zhang, Yanghui Kang, and Mutlu Özdoğan. “Corn yield prediction and
uncertainty analysis based on remotely sensed variables using a Bayesian neural network
approach”. In: Remote Sensing of Environment 259 (2021), p. 112408.
[53] Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. “BIVA: A very deep hierarchy of
latent variables for generative modeling”. In: Proceedings of Advances in Neural Information
Processing Systems. 2019, pp. 6548–6558.
[54] Lars Maaløe, Marco Fraccaro, and Ole Winther. “Semi-supervised generation with cluster-aware
generative models”. In: arXiv:1704.00637 (2017).
[55] A Maxwell, M Bester, L Guillen, C Ramezan, D Carpinello, Y Fan, F Hartley, S Maynard, and
J Pyron. “Semantic segmentation deep learning for extracting surface mine extents from historic
topographic maps”. In: Remote Sensing 12.24 (2020), p. 4145.
[56] J Mei, R-J Li, W Gao, and M-M Cheng. “CoANet: Connectivity attention network for road
extraction from satellite imagery”. In: IEEE Transactions on Image Processing 30 (2021),
pp. 8540–8552.
[57] Y Mo, Y Wu, X Yang, F Liu, and Y Liao. “Review the state-of-the-art technologies of semantic
segmentation based on deep learning”. In: Neurocomputing 493 (2022), pp. 626–646.
[58] Danping Peng, Barry Merriman, Stanley Osher, Hongkai Zhao, and Myungjoo Kang. “A
PDE-based fast local level set method”. In: Journal of computational physics 155.2 (1999),
pp. 410–438.
[59] Mark Pritt. “Deep learning for recognizing mobile targets in satellite imagery”. In: 2018 IEEE
Applied Imagery Pattern Recognition Workshop (AIPR). IEEE. 2018, pp. 1–7.
[60] Shafin Rahman, Salman Khan, and Nick Barnes. “Transductive learning for zero-shot object
detection”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019,
pp. 6082–6091.
[61] Joseph Redmon and Ali Farhadi. “Yolov3: An incremental improvement”. In: arXiv:1804.02767
(2018).
[62] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object
detection with region proposal networks”. In: Advances in Neural Information Processing Systems.
2015, pp. 91–99.
[63] O Ronneberger, P Fischer, and T Brox. “U-net: Convolutional networks for biomedical image
segmentation”. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015:
18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.
Springer. 2015, pp. 234–241.
[64] JJ Ruiz-Lendinez, B Mackiewicz, P Motek, and T Stryjakiewicz. “Method for an automatic
alignment of imagery and vector data applied to cadastral information in Poland”. In: Survey
Review 51.365 (2019), pp. 123–134.
[65] B Shbita, C Knoblock, W Duan, Y-Y Chiang, J Uhl, and S Leyk. “Building linked spatio-temporal
data from vectorized historical maps”. In: The Semantic Web: 17th International Conference,
ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings 17. Springer. 2020,
pp. 409–426.
[66] Yunhan Shen, Rongrong Ji, Shengchuan Zhang, Wangmeng Zuo, and Yan Wang. “Generative
adversarial learning towards fast weakly supervised detection”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018, pp. 5764–5773.
[67] Jacob Shermeyer and Adam Van Etten. “The effects of super-resolution on object detection
performance in satellite imagery”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops. 2019, pp. 0–0.
[68] Miaojing Shi, Holger Caesar, and Vittorio Ferrari. “Weakly supervised object localization using
things and stuff transfer”. In: Proceedings of the IEEE International Conference on Computer
Vision. 2017, pp. 3381–3390.
[69] Yujiao Shi and Hongdong Li. “Beyond cross-view image retrieval: Highly accurate vehicle
localization using satellite image”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2022, pp. 17010–17020.
[70] S Shit, R Koner, B Wittmann, J Paetzold, I Ezhov, H Li, J Pan, S Sharifzadeh, G Kaissis, V Tresp,
et al. “Relationformer: A unified framework for image-to-graph generation”. In: European
conference on computer vision. Springer. 2022, pp. 422–439.
[71] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image
recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[72] Krishna Kumar Singh and Yong Jae Lee. “Hide-and-seek: Forcing a network to be meticulous for
weakly-supervised object and action localization”. In: 2017 IEEE international conference on
computer vision (ICCV). IEEE. 2017, pp. 3544–3553.
[73] Bing Song and Tony Chan. “A fast algorithm for level set based optimization”. In: UCLA Cam
Report 2.68 (2002).
[74] Wenbo Song, James M Keller, Timothy L Haithcoat, Curt H Davis, and Jason B Hinsen. “An
automated approach for the conflation of vector parcel map with imagery”. In: Photogrammetric
Engineering & Remote Sensing 79.6 (2013), pp. 535–543.
[75] Z Sun, W Zhou, C Ding, and M Xia. “Multi-Resolution Transformer Network for Building and
Road Segmentation of Remote Sensing Image”. In: ISPRS International Journal of
Geo-Information 11.3 (2022), p. 165.
[76] Kun Tan, Yusha Zhang, Xue Wang, and Yu Chen. “Object-based change detection using multiple
classifiers and multi-scale uncertainty analysis”. In: Remote Sensing 11.3 (2019), p. 359.
[77] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers.
“Normalized cut loss for weakly-supervised cnn segmentation”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 1818–1827.
[78] C Tao, J Qi, Y Li, H Wang, and H Li. “Spatial information inference net: Road extraction using
road-specific contextual information”. In: ISPRS Journal of Photogrammetry and Remote Sensing
158 (2019), pp. 155–166.
[79] J Tao, Z Chen, Z Sun, H Guo, B Leng, Z Yu, Y Wang, Z He, X Lei, and J Yang. “Seg-Road: A
Segmentation Network for Road Extraction Based on Transformer and CNN with Connectivity
Structures”. In: Remote Sensing 15.6 (2023), p. 1602.
[80] Anne-Marie Thow, Agnes Erzse, Gershim Asiki, Charles Mulindabigwi Ruhara, Gemma Ahaibwe,
Twalib Ngoma, Hans Justus Amukugo, Milka N Wanjohi, Mulenga M Mukanu,
Lebogang Gaogane, et al. “Study design: policy landscape analysis for sugar-sweetened beverage
taxation in seven sub-Saharan African countries”. In: Global Health Action 14.1 (2021),
p. 1856469.
[81] J Uhl, S Leyk, Y-Y Chiang, W Duan, and C Knoblock. “Automated extraction of human settlement
patterns from historical topographic map series using weakly supervised convolutional neural
networks”. In: IEEE Access 8 (2019), pp. 6978–6996.
[82] J Uhl, S Leyk, Z Li, W Duan, B Shbita, Y-Y Chiang, and C Knoblock. “Combining
remote-sensing-derived data and historical maps for long-term back-casting of urban extents”. In:
Remote sensing 13.18 (2021), p. 3672.
[83] Johannes H Uhl, Stefan Leyk, Yao-Yi Chiang, Weiwei Duan, and Craig A Knoblock. “Map archive
mining: visual-analytical approaches to explore large historical map collections”. In: ISPRS
international journal of geo-information 7.4 (2018), p. 148.
[84] Johannes H Uhl, Stefan Leyk, Caitlin M McShane, Anna E Braswell, Dylan S Connor, and
Deborah Balk. “Fine-grained, spatiotemporal datasets measuring 200 years of land development in
the United States”. In: Earth System Science Data 13.1 (2021), pp. 119–153.
[85] A Van Etten, D Lindenbaum, and T M Bacastow. “Spacenet: A remote sensing dataset and
challenge series”. In: arXiv preprint arXiv:1807.01232 (2018).
[86] Adam Van Etten. “Satellite imagery multiscale rapid detection with windowed networks”. In: 2019
IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE. 2019, pp. 735–743.
[87] S Vasu, M Kozinski, L Citraro, and P Fua. “Topoal: An adversarial learning approach for
topology-aware road segmentation”. In: Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer. 2020, pp. 224–240.
[88] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural
information processing systems 30 (2017).
[89] Q Wang, H Bai, C He, and J Cheng. “Fe-LinkNet: Enhanced D-LinkNet with Attention and Dense
Connection for Road Extraction in High-Resolution Remote Sensing Images”. In: IGARSS
2022-2022 IEEE International Geoscience and Remote Sensing Symposium. IEEE. 2022,
pp. 3043–3046.
[90] Shan Wang, Yanhao Zhang, Ankit Vora, Akhil Perincherry, and Hengdong Li. “Satellite image
based cross-view localization for autonomous vehicle”. In: 2023 IEEE International Conference on
Robotics and Automation (ICRA). IEEE. 2023, pp. 3592–3599.
[91] J Woodhead and M Landry. “Harnessing the power of artificial intelligence and machine learning
in mineral exploration—opportunities and cautionary notes”. In: SEG Discovery 127 (2021),
pp. 19–31.
[92] Songbing Wu, Chun Du, Hao Chen, Yingxiao Xu, Ning Guo, and Ning Jing. “Road extraction from
very high resolution images using weakly labeled OpenStreetMap centerline”. In: ISPRS
International Journal of Geo-Information 8.11 (2019), p. 478.
[93] F. Xing, T. C. Cornish, T. Bennett, D. Ghosh, and L. Yang. “Pixel-to-Pixel Learning With Weak
Supervision for Single-Stage Nucleus Recognition in Ki67 Images”. In: IEEE Transactions on
Biomedical Engineering 66.11 (2019), pp. 3088–3097. DOI: 10.1109/TBME.2019.2900378.
[94] Chaoyang Xu, Yuanfei Dai, Renjie Lin, and Shiping Wang. “Social image refinement and
annotation via weakly-supervised variational auto-encoder”. In: Knowledge-Based Systems 192
(2020), p. 105259.
[95] Z Xu, Y Liu, L Gan, Y Sun, X Wu, M Liu, and L Wang. “RNGDet: Road Network Graph
Detection by Transformer in Aerial Images”. In: IEEE Transactions on Geoscience and Remote
Sensing 60 (2022), pp. 1–12.
[96] Fan Yang, Heng Fan, Peng Chu, Erik Blasch, and Haibin Ling. “Clustered object detection in aerial
images”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019,
pp. 8311–8320.
[97] Ke Yang, Dongsheng Li, and Yong Dou. “Towards precise end-to-end weakly supervised object
detection network”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 8372–8381.
[98] Linxiao Yang, Ngai-Man Cheung, Jiaying Li, and Jun Fang. “Deep clustering by gaussian mixture
variational autoencoders with graph embedding”. In: Proceedings of the IEEE International
Conference on Computer Vision. 2019, pp. 6440–6449.
[99] Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. “Deep spectral clustering using dual
autoencoder network”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2019, pp. 4066–4075.
[100] C Yeomans, R Shail, S Grebby, V Nykänen, M Middleton, and P Lusty. “A machine learning
approach to tungsten prospectivity modelling using knowledge-driven feature extraction and model
confidence”. In: Geoscience Frontiers 11.6 (2020), pp. 2067–2081.
[101] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, and Yizhou Yu. “Multi-source
weak supervision for saliency detection”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2019, pp. 6074–6083.
[102] Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, and Lei Zhang. “Wsod2: Learning
bottom-up and top-down objectness distillation for weakly-supervised object detection”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 8292–8300.
[103] J Zhang, X Hu, Y Wei, and L Zhang. “Road Topology Extraction From Satellite Imagery by Joint
Learning of Nodes and Their Connectivity”. In: IEEE Transactions on Geoscience and Remote
Sensing 61 (2023), pp. 1–13.
[104] P Zhang, X Dai, J Yang, B Xiao, L Yuan, L Zhang, and J Gao. “Multi-scale vision longformer: A
new vision transformer for high-resolution image encoding”. In: Proceedings of the IEEE/CVF
international conference on computer vision. 2021, pp. 2998–3008.
[105] Xiang Zhang, Lina Yao, and Feng Yuan. “Adversarial variational embedding for robust
semi-supervised learning”. In: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. 2019, pp. 139–147.
[106] Z Zhang, C Miao, C Liu, and Q Tian. “DCS-TransUperNet: Road segmentation network based on
CSwin Transformer with dual resolution”. In: Applied Sciences 12.7 (2022), p. 3511.
[107] Z Zhang, C Miao, C Liu, Q Tian, and Y Zhou. “HA-RoadFormer: Hybrid Attention Transformer
with Multi-Branch for Large-Scale High-Resolution Dense Road Segmentation”. In: Mathematics
10.11 (2022), p. 1915.
[108] X Zhu, W Su, L Lu, B Li, X Wang, and J Dai. “Deformable detr: Deformable transformers for
end-to-end object detection”. In: arXiv preprint arXiv:2010.04159 (2020).
[109] Yi Zhu, Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. “Soft proposal networks for
weakly supervised object localization”. In: Proceedings of the IEEE International Conference on
Computer Vision. 2017, pp. 1841–1850.
Abstract
Scanned historical maps contain valuable information about environmental changes and human development over time. For instance, comparing historical waterline locations can reveal patterns of climate change. Extracting geographic objects in map images involves two main steps: 1. obtaining a substantial amount of labeled data to train extraction models, and 2. training extraction models to extract desired geographic objects. However, the extraction process has two main challenges. One challenge is generating a large amount of labeled data with minimal human effort, as manual labeling is expensive and time-consuming. The other challenge is ensuring that the extraction model learns representative and sufficient knowledge for the accurate extraction of geographic objects. The success of subsequent analyses, like calculating the shortest paths after extracting railroads, heavily depends on the accuracy of the extractions.
To generate labeled data with minimal human effort, this dissertation presents semi- and fully automatic approaches to generate labeled desired geographic objects by leveraging external data. The semi-automatic approach requires one or a few manually labeled desired objects to collect all desired objects from candidates provided by the external data. In contrast, existing methods require more than a few manually labeled desired objects to achieve the same goal. On the other hand, the proposed automatic approach aims to label the desired objects in close proximity to the external data. Using the location and shape information fully from the external data, the proposed automatic approach can accurately label the desired objects on the maps. On the contrary, existing methods that do not utilize shape information may lead to false labels. The novel approaches introduced in this dissertation significantly reduce the need for manual labeling while ensuring accurate labeling results.
Extracting accurate geographic objects is the other challenge due to the ambiguous appearances of objects and the presence of overlapped objects on maps. The extraction model presented in this dissertation captures cartographic symbols to differentiate desired objects from other objects with similar appearances. When the desired objects overlap with other objects on maps, the extracted results could be broken. The proposed extraction model captures sufficient spatial context to reduce broken extraction. For example, the proposed extraction model learns the long and continuous structure of linear objects to reduce the gaps in the extracted lines. On the contrary, existing extraction models lack the ability to learn sufficient spatial context, resulting in the broken extraction of linear objects. In summary, the proposed extraction model learns representative cartographic symbols and sufficient spatial context to accurately extract desired objects.
The results of the experiment demonstrate the superiority of both the labeling and extraction approaches compared to the existing methods. The proposed methods significantly improve the quality of training data for extraction models by generating accurately labeled data. The extraction results from the proposed extraction model have much less false extraction and better continuity than state-of-the-art baselines. The combination of precise labeling and accurate extraction allows us to extract geographic objects from scanned historical maps. Therefore, we can analyze and interpret historical map data effectively.