VISION-BASED AND DATA-DRIVEN ANALYTICAL AND EXPERIMENTAL STUDIES
INTO CONDITION ASSESSMENT AND CHANGE DETECTION OF EVOLVING CIVIL,
MECHANICAL AND AEROSPACE INFRASTRUCTURES
by
Preetham Aghalaya Manjunatha
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CIVIL ENGINEERING)
May 2022
Copyright 2022 Preetham Aghalaya Manjunatha
Dedicated to my parents, grandparents, sister, and wife.
Acknowledgments
I want to express my foremost appreciation and gratitude to my dissertation advisor, Professor
Sami F. Masri, for his guidance, encouragement, support, and “god-like” patience throughout
my doctoral studies. He was instrumental in encouraging me to conduct unconventional
and exceptional multidisciplinary research spanning Structural Engineering, Electrical
Engineering, and Computer Science to solve civil and mechanical condition assessment
problems. He was one of the people who believed in my ability during difficult times
and provided fatherly advice throughout this study. His constant positive impact on my
academic life, critical scientific thinking, and career development has been tremendous. This
dissertation would not have been possible without his ongoing help, support, and patience.
Second, I would like to thank my MS thesis advisor, Professor Carter Wellford, who has
inspired and supported me throughout my MS and doctoral studies since my second day
at USC. He was another person who believed in my ability every time and provided
fatherly advice.
I’m also thankful to my qualifying and dissertation committee members, Professors
Carter Wellford, Aiichiro Nakano, Ketan D. Savla, and Bora Gencturk, for their invaluable
advice and support throughout this doctoral study. In addition, I want to express my gratitude
to USC, the Viterbi School of Engineering, Dean Yannis C. Yortsos, and the former and present
chairs, Professors Lucio Soibelman and Burcin Becerik-Gerber, for their administrative support to
complete this dissertation. Finally, I would like to thank all the professors under whom I
expanded my teaching experience and knowledge as a course assistant.
I would also like to thank Professor Mohammad Reza Jahanshahi (Purdue University)
and Professor Anand Joshi (USC Viterbi, Electrical Engineering) for mentoring me during the
early days of my doctoral studies, and I appreciate the valuable time they spent collaborating
on the research. I appreciate the assistance and collaboration of Professors Rong Wang and
Zhi Xiong (Nanjing University) during the experimental phase of mobile robots in the condition
assessment of infrastructures. In addition, I want to thank Dr. Yulu Luke Chen for his
assistance and collaboration on multiple projects. I thank Professor Ramakant Nevatia and
his research group for their invaluable time during scientific discussions on computer
vision and deep learning. I am also grateful to the Viterbi Doctoral Programs Directors,
Jennifer Gerson (former) and Andy Chen (present), Doctoral Programs Coordinator Tracy Charles,
and academic advisor Christine Hsieh for their constant support all through these years.
I want to thank the team members of the structural health monitoring research group
at USC and my officemates, Drs. Miguel Ricardo Hernandez-Garcia, Mohammad-Reza
Jahanshahi, Reza Jafarkhani, Ali Bolourchi, Vahid Keshavarzzadeh, Mohamed H. Abdelbarr,
and Yulu Luke Chen, for providing a unique working experience. I also thank Drs. Prasanth
Koganti, Charanraj Thimmisetty, and Ramakrishna Tipireddy for the valuable time they spent
expanding my scientific knowledge through discussions during the early days of my doctoral studies.
I want to thank the Viterbi School of Engineering, the Department of Civil and Environmental
Engineering, the USC Dornsife College of Letters, Arts and Sciences, and the
Department of Physics and Astronomy for awarding graduate assistantships. In addition,
I acknowledge the partial financial support from a contract with the NCHRP under the IDEA
program, the Federal Highway Administration (FHWA), and Korean Airlines. Also, I am thankful
to Pratt & Whitney, an aerospace company, for their invaluable suggestions to hone
the research in the right direction. Furthermore, I am also grateful to USC Libraries for
providing challenge grant funding over two semesters, and to Caroline Muglia, who
administered the funds toward the research on smart libraries. Lastly, I am also thankful to
Jose Delgado, CAD services manager at USC Facilities Management Services, for providing
work opportunities during the semesters when I needed them.
I am thankful to Jason Walborn, Chris Clauser, and Anthony Espriu of Pro-pipe, a professional
pipe services company, for sharing their sewer pipe datasets and NASSCO sewer pipe
inspection materials for scientific research. Without their support and suggestions, the sewer
pipe study would not have been possible. I am grateful to have mentored MS and BS students
from Computer Science and Electrical Engineering in our lab over the last seven years. I
appreciate their efforts in the experimental setup, conscientious data cleaning and labeling,
and the development of auxiliary computer programs to aid this dissertation. A special thanks
and deep appreciation to Dr. Michael Roberts for reviewing this dissertation in detail and
promptly providing valuable comments and suggestions.
My education at USC would not have been as colorful and exciting without the countless friends
and colleagues from across the globe that I have made throughout this journey. Finally, I would
like to express my sincere gratitude to my parents, who worked hard to provide me with all
possible opportunities. In addition, I sincerely appreciate my sister’s constant support during
my doctoral studies. Last but not least, I thank my amazing wife, Reshma, for her emotional
support, encouragement, and patience at all times.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables xi
List of Figures xiii
Abstract xxv
Chapter 1: Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Crack detection using a multiscale hybrid algorithm based on anisotropic
diffusion filtering and eigenanalysis of Hessian matrix of a fractional
anisotropy tensor 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Review of the literature . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Synthetic cracks generation . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1.1 Elastic deformations of synthetic cracks . . . . . . . . . . 21
2.2.2 A hybrid filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2.1 Anisotropic diffusion filter . . . . . . . . . . . . . . . . . . 24
2.2.2.2 Multiscale fractional anisotropy tensor . . . . . . . . . . . 27
2.2.2.3 Supervised learning system . . . . . . . . . . . . . . . . . 30
2.2.2.3.1 Feature selection . . . . . . . . . . . . . . . . . . 31
2.2.2.3.2 Classification . . . . . . . . . . . . . . . . . . . . 31
2.3 Multiscale vesselness Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Multiscale morphological method . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Deep convolutional neural network . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1.1 Training dataset . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1.1.1 Non-cracks connected components: . . . . . . . . 35
2.6.1.1.2 Synthetic cracks: . . . . . . . . . . . . . . . . . . 37
2.6.1.2 Testing dataset . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6.2 Crack segmentation on real-world datasets . . . . . . . . . . . . . . 41
2.6.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.2.2 Effect of the anisotropic diffusion coefficients and iterations
on the crack width . . . . . . . . . . . . . . . . . . . . . . 42
2.6.2.3 On the choice of the parameters for the hybrid method . . 45
2.6.2.4 Comparison of the segmentation methods . . . . . . . . . 47
2.6.2.5 Effects of the post-processing procedures . . . . . . . . . . 52
2.6.3 Crack profile analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.6.4 Computational time analysis . . . . . . . . . . . . . . . . . . . . . . 57
2.7 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.8 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.9 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Chapter 3: CrackDenseLinkNet: A deep convolutional neural network for semantic
segmentation of cracks on concrete surface images 61
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.1.1 Review of the literature . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.1.1 Crack classification . . . . . . . . . . . . . . . . . . . . . . 63
3.1.1.2 Crack detection . . . . . . . . . . . . . . . . . . . . . . . . 64
3.1.1.3 Crack semantic segmentation . . . . . . . . . . . . . . . . 65
3.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Convolutional neural network preliminaries . . . . . . . . . . . . . . . . . . 70
3.2.1 Network layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.1.1 Convolutional layers . . . . . . . . . . . . . . . . . . . . . 72
3.2.1.2 Non-linearity . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.1.3 Pooling layers . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.2.1.4 Fully connected layers . . . . . . . . . . . . . . . . . . . . 74
3.2.1.5 Transposed convolution layer . . . . . . . . . . . . . . . . 75
3.2.2 Batch normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.3.1 An end-to-end encoder-decoder semantic segmentation convolutional
neural network . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.2 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . 85
3.4.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.4.2 Architecture and training details . . . . . . . . . . . . . . . . . . . 86
3.4.3 Training and validation results . . . . . . . . . . . . . . . . . . . . . 88
3.4.4 Crack segmentation on real-world datasets . . . . . . . . . . . . . . 90
3.4.5 Crack profile analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.4.6 Computational time analysis . . . . . . . . . . . . . . . . . . . . . . 100
3.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 4: An image registration procedure for quantifying crack evolution in
multi-image time series data 103
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1.1 Review of the literature . . . . . . . . . . . . . . . . . . . . . . . . 106
4.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.2 MAV model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.2.1 MAV - system overview . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2.1.1 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2.1.2 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.2.1.2.1 Forces and moments on a single rotor . . . . . . . 113
4.2.1.2.2 Robot dynamics . . . . . . . . . . . . . . . . . . 114
4.2.2 MAV control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2.3 State estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.2.3.1 Sensor model . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.2.3.2 State representation . . . . . . . . . . . . . . . . . . . . . 117
4.2.3.3 Measurement model . . . . . . . . . . . . . . . . . . . . . 118
4.2.4 Collision avoidance and path planning . . . . . . . . . . . . . . . . 118
4.3 Camera model and calibration . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.3.1 A pinhole camera model . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3.2 Camera calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4.1 Nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4.2 Multi-image registration and scene reconstruction . . . . . . . . . . 129
4.4.2.1 Feature matching . . . . . . . . . . . . . . . . . . . . . . . 131
4.4.2.2 Outlier rejection and image matching . . . . . . . . . . . . 132
4.4.2.3 Bundle adjustment for homographies . . . . . . . . . . . . 133
4.4.2.4 Gain compensation . . . . . . . . . . . . . . . . . . . . . . 134
4.4.2.5 Multi-band blending . . . . . . . . . . . . . . . . . . . . . 135
4.4.2.6 Image transformation and image resampling . . . . . . . . 136
4.4.3 Damage assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.4.3.1 Damage and change detection . . . . . . . . . . . . . . . . 139
4.4.3.2 Damage quantification . . . . . . . . . . . . . . . . . . . . 142
4.5 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . 145
4.5.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.5.1.1 MAV simulation synthetic dataset . . . . . . . . . . . . . 145
4.5.1.2 Real-world dataset . . . . . . . . . . . . . . . . . . . . . . 150
4.5.2 Crack images recovery after transformation and limitations . . . . . 154
4.5.3 Crack change detection and localization . . . . . . . . . . . . . . . 155
4.5.3.1 A demonstration on a single real crack on the concrete
surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.5.3.2 Multi-image datasets . . . . . . . . . . . . . . . . . . . . . 158
4.5.3.2.1 Crack width and length estimation on a synthetic
dataset . . . . . . . . . . . . . . . . . . . . . . . 158
4.5.3.2.2 Synthetic time-series dataset collected from a
MAV . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.5.3.2.3 Real-world time-series dataset . . . . . . . . . . . 167
4.5.4 Practical problems in crack localization . . . . . . . . . . . . . . . . 170
4.6 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.7 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Chapter 5: Automated classification of sewer pipe defects using visual bags-of-words
in closed-circuit television (CCTV) images 176
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.1.1 Review of the literature . . . . . . . . . . . . . . . . . . . . . . . . 177
5.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.2 Point-based feature detectors . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.2.1 Scale-invariant feature transform (SIFT) . . . . . . . . . . . . . . . 181
5.2.1.1 Scale-space extrema . . . . . . . . . . . . . . . . . . . . . 181
5.2.1.2 Detection of local extrema . . . . . . . . . . . . . . . . . . 183
5.2.1.3 Keypoint localization . . . . . . . . . . . . . . . . . . . . . 184
5.2.1.4 Orientation assignment . . . . . . . . . . . . . . . . . . . . 186
5.2.1.5 Keypoint descriptor . . . . . . . . . . . . . . . . . . . . . 186
5.2.2 Speeded up robust features (SURF) . . . . . . . . . . . . . . . . . . 187
5.2.2.1 Hessian matrix-based interest points . . . . . . . . . . . . 187
5.2.2.2 Scale space representation . . . . . . . . . . . . . . . . . . 188
5.2.2.3 Interest point localization . . . . . . . . . . . . . . . . . . 188
5.2.2.4 Orientation assignment . . . . . . . . . . . . . . . . . . . . 188
5.2.2.5 Point descriptor . . . . . . . . . . . . . . . . . . . . . . . . 189
5.3 Visual bag-of-words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3.1 Pre-processing and feature extraction locations . . . . . . . . . . . . 192
5.4 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . 193
5.4.1 Sewer pipe dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.4.2 Training, validation and testing dataset . . . . . . . . . . . . . . . . 195
5.4.3 Classification results . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Chapter 6: An approach for change detection, localization and evolution of defects
in mechanical systems by three-dimensional point cloud registration 200
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.1.1 Review of the literature . . . . . . . . . . . . . . . . . . . . . . . . 203
6.1.1.1 Mechanical systems . . . . . . . . . . . . . . . . . . . . . . 203
6.1.1.2 Aircraft components . . . . . . . . . . . . . . . . . . . . . 204
6.1.1.3 Civil infrastructures . . . . . . . . . . . . . . . . . . . . . 205
6.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.2 Three-dimensional data acquisition methods . . . . . . . . . . . . . . . . . 207
6.2.1 Vision-based camera technologies . . . . . . . . . . . . . . . . . . . 207
6.2.2 Structure-from-motion (a photogrammetry technique) . . . . . . . . 211
6.2.2.1 Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.2.2.2 Epipolar geometry . . . . . . . . . . . . . . . . . . . . . . 213
6.2.2.3 Bundle adjustment . . . . . . . . . . . . . . . . . . . . . . 217
6.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.1 Deep learning-based pairwise 3D point clouds registration . . . . . 220
6.3.1.1 Feature extraction and correspondence prediction . . . . . 222
6.3.1.2 Weighted Procrustes method . . . . . . . . . . . . . . . . 225
6.3.1.3 Global registration . . . . . . . . . . . . . . . . . . . . . . 225
6.3.2 A change detection method for 3D point clouds . . . . . . . . . . . 227
6.3.2.1 Cloud-to-cloud (C2C) distance . . . . . . . . . . . . . . . 228
6.3.2.2 Statistical change detection by ensemble averaging . . . . 228
6.4 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . 231
6.4.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.4.2 Change detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.4.3 Computational time analysis . . . . . . . . . . . . . . . . . . . . . . 240
6.4.4 Qualitative results of complex scene registration . . . . . . . . . . . 241
6.4.5 Decimation and noise effects on the pairwise registration of the
point clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
6.4.6 Practical problems in data acquisition and 3D point clouds
registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Chapter 7: Summary, conclusions and future work 252
7.1 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Bibliography 259
List of Tables
Table 2.1: Training dataset attributes. . . . . . . . . . . . . . . . . . . . . . . 36
Table 2.2: Synthetic cracks elastic deformation parameters. . . . . . . . . . . . 38
Table 2.3: Materials, texture, crack information and the images attributes of
the testing dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Table 2.4: Semantic segmentation results of the hybrid, vesselness, morphological
and DeepCrack on the datasets I to III without post-processing
procedures using ANN, k-NN, SVM and CNN as classifiers. A
pre-trained DeepCrack CNN was used in a classifier mode. . . . . 50
Table 2.5: Semantic segmentation results of the hybrid, vesselness, and
morphological methods on the datasets I to III with post-processing
procedures using ANN, k-NN and SVM as classifiers. . . . . . . . . 54
Table 2.6: Computational time for the three real-world datasets using the
hybrid, vesselness, morphological, and DeepCrack methods. . . . . . 58
Table 3.1: The detailed specifications of encoder-decoder architecture of
CrackDenseLinkNet. . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 3.2: Materials, texture, crack information and the image attributes of
the four training and testing dataset. . . . . . . . . . . . . . . . . . 85
Table 3.3: Semantic segmentation results of the FCN, DeepCrack, CrackSegNet,
FPHB, and CrackDenseLinkNet on the datasets FCN, DeepCrack,
CrackSegNet, and CrackDenseLinkNet. . . . . . . . . . . . . . . . . 92
Table 3.4: Computational time for the four real-world datasets using the
FCN, DeepCrack, CrackSegNet, FPHB, and CrackDenseLinkNet
methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Table 4.1: Camera intrinsic parameters for the MAV simulation and real Sony
cameras. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Table 4.2: Crack thickness quantities and correlation coefficient of two PDFs
for source and target images. . . . . . . . . . . . . . . . . . . . . . . 158
Table 4.3: Standardized values of the non-crack and crack samples by the
mean and standard deviation of the non-crack ensemble average for
change detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Table 4.4: Semantic segmentation metrics for the real-world datasets. . . . . . 169
Table 5.1: Summary of the best visual bags-of-words results. . . . . . . . . . . 197
Table 6.1: 3D vision technologies comparison [174]. . . . . . . . . . . . . . . . 210
Table 6.2: Specifications and general details of 3D technologies [174]. . . . . . . 211
Table 6.3: Standardized values of the no-change and change samples by the
mean and standard deviation of the no-change ensemble average. . . 240
Table 6.4: Qualitative datasets attributes. . . . . . . . . . . . . . . . . . . . . 242
List of Figures
Figure 1.1: A damage prognosis model. (a). Flowchart. (b). A deterioration
and end-life prediction curve obtained from the measurements,
model fitting and extrapolation by predictive analytics. . . . . . . . 4
Figure 1.2: A subset of the vision-based and data-driven condition assessment
problems that can assist damage prognosis model of civil,
mechanical, and aerospace infrastructures that are discussed in this
work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 2.1: Synthetic crack generation procedure. (a). Synthetic crack seam
overlayed on a Gaussian noise image. (b). Synthetic crack seam of
1 pixel width. (c). Morphological dilated synthetic crack with a
rectangular structuring element of size 30 × 15 pixels. . . . . . . . 20
Figure 2.2: Effects of performing an elastic deformation using a piecewise
linear function on a synthetically generated crack. (a). Synthetic
image with grid lines for reference. (b). Deformed image with the
parameters σ = 5, α = 35 and α_g = 100. . . . . . . . . . . . . . . 21
Figure 2.3: A block diagram of the proposed hybrid filter. . . . . . . . . . . . . 23
Figure 2.4: Different types of image filtering. (a). Isotropic filtering, where the
pixel intensities are smoothed uniformly. (b). Anisotropic filtering,
where edge pixel intensities are preserved. . . . . . . . . . . . . . . 25
Figure 2.5: Comparison of the isotropic and anisotropic diffusion using two
different functions. (a). Original image. (b). Isotropic filtering
using a Gaussian kernel. (c). Anisotropic filtering with exponential
diffusion function. (d). Anisotropic filtering with Tukey’s biweight
diffusion function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 2.6: A multiscale response of FAT crack function. From left to right,
scales σ = 1 to 4, and the final crack response varies from [0, 1]. . . 29
Figure 2.7: A crackmap response from the MFAT curvilinear filters at scales
σ = 1 to 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 2.8: A block diagram of the decision making system using supervised
learning paradigm (for training and testing). . . . . . . . . . . . . . 30
Figure 2.9: A crackmap from the union of the multiscale crack filters at scales
σ = 1 to 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 2.10: A crackmap from the union of the multiscale morphological filters
of a line structuring element, S, at scales 3 to 143 pixels with an
interval of 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 2.11: A crackmap of the DeepCrack after fusing side-outputs using
[165]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure 2.12: Training examples (due to the montage plotting, images of various
sizes have been resized to 100 × 100 pixels; thus, they look the same
but are of varying scales). (a). Non-crack samples obtained from
the responses of the three conventional filters. (b). Crack samples
generated from the seam carving technique with morphological
dilation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 2.13: Uniqueness of the synthetic crack training dataset. Pairwise dis-
tance matrix of the 100 ensemble average of 5000 samples feature
vectors that were sampled from 100,000 training images. (a).
Without elastic deformation; (b). With elastic deformation. (c).
Probability density function of the 100 ensemble average 5000
samples feature vectors with and without elastic deformation. (d).
Cumulative density function . . . . . . . . . . . . . . . . . . . . . 39
Figure 2.14: Two examples from the dataset III. Red dashed box shows the
region of interest and the green dashed line represents the crack
profile line. (a). Thin crack (note that this image has been rotated
90° counterclockwise to match the orientation of the image b). (b).
Thick crack image. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 2.15: Effects of anisotropic diffusion coefficients on two of dataset III
images. (a). Variation of the anisotropic diffusion coefficients on
a thin crack image that has a width of 14 pixels along the cross-
sectional profile line. (b). Variation of the normalized intensity
of the pixels of a thin crack along the cross-sectional profile line.
(c). Anisotropic diffusion coefficients variation on a thick crack
image of profile width 67 pixels. (d). Variation of the normalized
intensity of the pixels of a thick crack along the profile line. . . . . 43
Figure 2.16: Effects of diffusion iterations on two of dataset III images. (a).
Variation of the diffusion iterations on a thin crack image that
has a width of 14 pixels along the cross-sectional profile line. (b).
Variation of the normalized intensity of the pixels of a thin crack
along the cross-sectional profile line. (c). Anisotropic diffusion
iterations variation on a thick crack image of cross-sectional profile
width of 67 pixels. (d). Variation of the normalized intensity of
the pixels of a thick crack along the cross-sectional profile line. . . 44
Figure 2.17: Median variations of the F1-scores and relationship of Gaussian
filter diameter. (a) and (b). Anisotropic diffusion coefficients for
Synthetic500 and real-world datasets, respectively. (c) and (d).
Anisotropic diffusion iterations for Synthetic500 and real-world
datasets, respectively. (e). A relationship of actual crack width
and Gaussian filter diameter used in the MFAT filter to segment
the cracks on concrete surface. . . . . . . . . . . . . . . . . . . . . 46
Figure 2.18: A comparison of the crack segmentation methods. Columns left
to right show the images of, original color, ground-truth, hybrid,
vesselness, morphological and DeepCrack CNN. The rows 1 to 5,
6 to 10 and 11 to 15 show the images of datasets I, II and III,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 2.19: Three datasets median variations of the F1-scores against. (a).
Anisotropic diffusion coefficients. (b). Anisotropic diffusion itera-
tions. PP stands for with post-processing and No PP for without
post-processing. Better visualization in color. . . . . . . . . . . . . 53
Figure 2.20: Cross-sectional profile of two concrete surface images. (a). Thin
crack of a width 14 pixels. (b). Thick crack of a width 67 pixels. . 55
Figure 2.21: Statistics of the estimated crack’s physical properties on three
datasets. Rows 1, 2 and 3 represent crack’s thickness (averaged),
length and area, respectively. Columns 1, 2 and 3 represent dataset
I, II and III, respectively. The rectangular boxes represent the
range between the 25th and 75th percentiles. The horizontal
lines within the boxes denote the median values. The protruding
horizontal lines outside the boxes represent the minimum and
maximum values. The whiskers are lines extending above and below
each box. Mean values are indicated by the small squares inside
the boxes. Lastly, the red asterisk symbols are the outliers. . . . . 57
Figure 3.1: Convolution operations in a layer. . . . . . . . . . . . . . . . . . . 72
Figure 3.2: An example of the maximum and average pooling operations in a
layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 3.3: A typical transposed convolution operation in a layer. . . . . . . . 75
Figure 3.4: An encoder-decoder architecture of the DenseLinkNet semantic
segmentation convolutional neural network for concrete images. . . 82
Figure 3.5: Training scores for different learning rates vs. epochs. (a). Focal-
Dice losses. (b). Accuracies. (c). IOU scores. (d). Precision. (e).
Recalls. (f). F1-scores. . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 3.6: Validation scores for different learning rates vs. epochs. (a). Focal-
Dice losses. (b). Accuracies. (c). IOU scores. (d). Precisions. (e).
Recalls. (f). F1-scores. . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 3.7: A comparison of the CNN-based crack segmentation methods.
From left to right the columns show the images of, original color,
ground-truth, FCN, DeepCrack, CrackSegNet, FPHB and Crack-
DenseLinkNet (CrackDLNet) end-to-end CNNs. The rows 1 to 4,
5 to 8, 9 to 12 and 13 to 16 show the images of datasets FCN, Deep-
Crack, CrackSegNet and CrackDenseLinkNet, respectively. . . . . . 91
Figure 3.8: Feature maps at various layers of the segmentation network. (a).
First sequential batch normalization layer after the convolution
layer. (b). Last convolution layer of encoder block 2. (c). Last
batch normalization layer after the decoder block 4. (d). Transpose
convolution layer of decoder block 5. . . . . . . . . . . . . . . . . . 95
Figure 3.9: Filter maps or convolution weights at two layers of the segmentation
network. (a). Convolution layer 1 (size 7 × 7). (b). Convolution
layer in encoder block 1 (size 3 × 3). . . . . . . . . . . . . . . . . . 96
Figure 3.10: Cross-sectional profile of two concrete surface images. (a). Thin
crack of a width 14 pixels. (b). Thick crack of a width 67 pixels.
Better visualization in color. . . . . . . . . . . . . . . . . . . . . . . 97
Figure 3.11: Statistics of the estimated cracks physical properties on four
datasets. Rows 1, 2 and 3 represent crack thickness (averaged),
length and area, respectively. Columns 1 to 4 represent
datasets FCN, DeepCrack, CrackSegNet and CrackDenseLinkNet,
respectively. The rectangular boxes represent the range between
the 25th and 75th percentiles. The horizontal lines within the
boxes denote the median values. The protruding horizontal lines
outside the boxes represent the minimum and maximum values.
The whiskers are lines extending above and below each box. Mean
values are indicated by the small squares inside the boxes. Lastly,
the red asterisk symbols are the outliers. . . . . . . . . . . . . . . . 99
Figure 4.1: A six rotor micro aerial vehicle, AscTec Firefly, in Gazebo simula-
tor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Figure 4.2: Building blocks required to launch a MAV. . . . . . . . . . . . . . 112
Figure 4.3: Forces and moments operating on a single rotor’s center. . . . . . . 113
Figure 4.4: Body-centered body frame B and global world frame W in a
hexarotor sketch. The main body is being acted on by the primary
forces F_i from the various rotors, and F_G. . . . . . . . . . . . . . 114
Figure 4.5: The intended location p_d and the desired yaw angle ψ_d are shown
in the controller drawing. Position control is usually divided into
two parts: an outside trajectory tracking controller calculates
the attitude and thrust references that an inner attitude tracking
controller follows. . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Figure 4.6: A pinhole camera model coordinate system. . . . . . . . . . . . . . 121
Figure 4.7: Checkerboard pattern images used for the camera calibration
procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Figure 4.8: Camera extrinsic parameters. (a). Checkerboard location and
orientation with respect to the camera. (b). Mean reprojection
error per image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Figure 4.9: Nearest neighbor search strategy for finding the neighboring images.
Red dots are the image coordinates obtained from the IMU
data. Green and blue circles are the possible areas that enclose the
neighboring images for the image matching, registration and scene
reconstruction. R1 and R2 are the radii of the possible enclosing
circles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Figure 4.10: Feature-based multi-image registration pipeline. Firstly, feature
descriptors of the neighboring and current images are found. Sec-
ondly, based on the ratio threshold, putative matches are found
between the images. Thirdly, outlier keypoints are rejected by
the RANSAC algorithm and the images are matched and selected.
Fourthly, the bundle adjustment optimization was utilized to cor-
rect the camera parameters and minimize the drift error between
the images. Fifthly, the images are transformed by using the pro-
jective or affine transformation matrix. Sixthly, gain compensation
and multi-band blending are used to produce the seamless recon-
struction of the scene. Lastly, the reconstructed scene is cropped
to match the current view image and compared for the change
detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Figure 4.11: Two images of the same crack at time periods T_0 and T_1. (a).
Source/reference image. (b). Target image. . . . . . . . . . . . . . 131
Figure 4.12: Putative matches of the two images. . . . . . . . . . . . . . . . . . 132
Figure 4.13: Inlier matches of the two images. . . . . . . . . . . . . . . . . . . . 133
Figure 4.14: A sample crack image shown with three different linear
transformations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 4.15: A registered and resampled image. . . . . . . . . . . . . . . . . . . 139
Figure 4.16: Results of a semantic segmentation of two images of the same
crack at time periods T_0 and T_1. (a). Source/reference image. (b).
Source binary output crack image. (c). Target image. (d). Target
binary output crack image. . . . . . . . . . . . . . . . . . . . . . . 140
Figure 4.17: Hit-or-miss structuring elements. . . . . . . . . . . . . . . . . . . . 143
Figure 4.18: Thinning operation. (a). Crackmap obtained from hierarchical
hybrid filter. (b). Thinned crack skeleton. . . . . . . . . . . . . . . 143
Figure 4.19: Orientation of the crack pixels. (a). Skeleton obtained from
morphological thinning. (b). Orientation of the blue pixel about
the 5 × 5 neighborhood. . . . . . . . . . . . . . . . . . . . . . . . . 144
Figure 4.20: MAV captured synthetic images. (a). Experimental images at
time period T_0. (b). Experimental images at time period T_1. The
images follow snake zig-zag pattern (top row left to right, followed
by next row right to left) . . . . . . . . . . . . . . . . . . . . . . . 147
Figure 4.20: MAV captured synthetic images. (c). Experimental images at
time period T_2. (d). Experimental images at time period T_3. The
images follow snake zig-zag pattern (top row left to right, followed
by next row right to left) . . . . . . . . . . . . . . . . . . . . . . . 148
Figure 4.20: MAV captured synthetic images. (e). Experimental images at
time period T_4. (f). Experimental images at time period T_5. The
images follow snake zig-zag pattern (top row left to right, followed
by next row right to left) . . . . . . . . . . . . . . . . . . . . . . . 149
Figure 4.21: Real-world thin sized crack datasets. (a). Experimental images at
time period T_0. (b). Experimental images at time period T_1. The
images follow zig-zag pattern (top row left to right, followed by
next row left to right) . . . . . . . . . . . . . . . . . . . . . . . . . 151
Figure 4.22: Real-world medium sized crack datasets. (a). Experimental images
at time period T_0. (b). Experimental images at time period T_1.
The images follow zig-zag pattern (top row left to right, followed
by next row left to right) . . . . . . . . . . . . . . . . . . . . . . . 152
Figure 4.23: Real-world thick sized crack datasets. (a). Experimental images
at time period T_0. (b). Experimental images at time period T_1.
The images follow zig-zag pattern (top row left to right, followed
by next row left to right) . . . . . . . . . . . . . . . . . . . . . . . 153
Figure 4.24: RMS error due to linear transformation . . . . . . . . . . . . . . . 155
Figure 4.25: Results from the proposed crack detection method. (a). Source
crack map. (b). Target crack map. (c). Transformed target crack
map to source. (d). Aligned target crack map to source (green
color in the source crack pixels, yellow color is the transformed
target crack and red color is false positives). . . . . . . . . . . . . . 156
Figure 4.26: Crack normals and crack width variation along the centerline.
(a). Crack normals. Blue and red arrows represents the positive
and negative directions, respectively. (b). Crack width variation
contour along the center line. . . . . . . . . . . . . . . . . . . . . . 157
Figure 4.27: Comparison of the crack width PDF and CDF obtained from the
proposed MFAT approach for the same image taken at two different
times. (a). Probability distribution function. (b). Cumulative
distribution function. . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Figure 4.28: The correlation of the synthetic crack width (ground-truth) to the
algorithm output. (a). Crack width, the correlation coefficient
is r = 0.9306. (b). Crack length, the correlation coefficient is
r = 0.7834. The correlation coefficient ranges over [−1, 1]; −1 and
1 indicate the worst and best fit, respectively. . . . . . . . . . . . . 159
Figure 4.29: Synthetic images registration. (a). Five synthetic images are
registered with reference to the current view image (colored
quadrilateral shows projective transformed images bounding boxes
and red dashed rectangle is the cropped region). (b). Current
view/reference image at time period T_0. (c). Final aligned,
reconstructed and cropped image at time period T_1. . . . . . . . . 162
Figure 4.30: A registration of 63 complete synthetic images at trial one and
its semantic segmentation results at time periods T_0 – T_5. (a).
No-crack image at time period T_0. (b). Crack image at time period
T_1. (c). Crack image at time period T_2. (d). Crack image at time
period T_3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Figure 4.30: A registration of 63 complete synthetic images at trial one and its
semantic segmentation results at time periods T_0 – T_5. (e). Crack
image at time period T_4. (f). Crack image at time period T_5. . . . 164
Figure 4.31: Probability density functions of the synthetic datasets for change
detection at inspection rounds T_0 to T_5. (a). Crack width. (b).
Crack length. (c). Crack area. . . . . . . . . . . . . . . . . . . . . 165
Figure 4.32: Real-world images registration (thin cracks dataset). (a). Three
concrete crack images are registered with reference to the current
view image (colored quadrilateral shows projective transformed
images bounding boxes and red dashed rectangle is the cropped
region). (b). Current view image. (c). Final cropped image after
reconstruction. (d). Binary crackmap of current view image. (e).
Binary crackmap of cropped image. . . . . . . . . . . . . . . . . . . 168
Figure 4.33: Probability density functions of the real-world datasets for change
detection at inspection rounds T_0 to T_1. (a). Thin crack (correlation
coefficient, r = 0.8596, 0.8311). (b). Medium crack
(correlation coefficient, r = 0.9462, 0.8497). (c). Thick crack
(correlation coefficient, r = 0.8785, 0.8314). . . . . . . . . . . . . . 169
Figure 4.34: Keypoint based method (SURF) correspondences matching error.
(a). Two artificially drawn cracks on a smooth white wall at USC,
KAP. The transparent red region is the overlap area between two
images, (b). Falsely matched key points. Key points on the left
image are pointed towards the right image. . . . . . . . . . . . . . 171
Figure 5.1: An incremental difference-of-Gaussian. The left image shows the
octaves at each scale, and the right image shows the DOG. After
each octave, the Gaussian image is down-sampled by a factor of 2,
and the process is repeated. . . . . . . . . . . . . . . . . . . . . . . 182
Figure 5.2: Maxima and minima of the DOG as compared to 26 pixels. . . . . 184
Figure 5.3: A SIFT keypoint descriptor. Here, for illustrative purposes, it is
a 2 × 2 descriptor on the right which is sampled from an 8 × 8
sample array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Figure 5.4: A pictorial presentation of visual bags-of-words. . . . . . . . . . . . 190
Figure 5.5: An image showing the feature points location. (a). Conventional
SURF detector keypoints. The green circle denotes the scale σ,
and the line indicates the feature orientation. (b). Dense features
(SIFT/SURF) are extracted at the grid (square) points. . . . . . . 192
Figure 5.6: Challenges in sewer pipe assessment. Same class images with
highly diverse intensity distribution. . . . . . . . . . . . . . . . . . 193
Figure 5.7: Sample of various image classes that are trained and tested, non-
defective. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Figure 5.8: Operational and structural defect samples of various image classes
that are trained and tested. . . . . . . . . . . . . . . . . . . . . . . 194
Figure 5.9: Training, testing and validation samples distribution. . . . . . . . . 195
Figure 5.10: Classification accuracy comparison of detector and grid-based
feature points over 50 to 8000 dictionary size. (a) Precision. (b).
Recall. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Figure 6.1: Widely used 3D vision technologies. Left: stereo vision; middle:
structured light; right: time-of-flight. . . . . . . . . . . . . . . . . . 208
Figure 6.2: Off-the-shelf commercially available 3D sensors that were evaluated
in this study [174]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Figure 6.3: Overview of structure from motion. . . . . . . . . . . . . . . . . . . 211
Figure 6.4: 3D point triangulation. (a). 3D point triangulation by finding the
point p. (b). Rotation around an axis n̂ by an angle θ. . . . . . . . 212
Figure 6.5: Epipolar geometry relation to point P. . . . . . . . . . . . . . . . . 214
Figure 6.6: Chained transformations for projecting a 3D point, p_i, into a 2D
measurement, x_ij. . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Figure 6.7: An overview of the data extraction pipeline. . . . . . . . . . . . . . 218
Figure 6.8: Overview of registration and change detection methods for inspec-
tion. (a). Deep learning-based global registration of the point
clouds. (b). Change detection pipeline. . . . . . . . . . . . . . . . . 219
Figure 6.9: A random sampling and negative-mining strategy for contrastive
and triplet losses. Traditional vs. FCGF hardest-contrastive and
hardest-triplet losses that use the hardest negatives. Cyan circles
are positive feature vectors, and brown are negative. . . . . . . . . 222
Figure 6.10: Six-dimensional Residual U-Net type convolutional network ar-
chitecture for inlier likelihood prediction. The network has the
residual blocks between strided convolutions. . . . . . . . . . . . . 224
Figure 6.11: Data acquisition and C2C distance of the source/reference and
target point clouds. (a). Point clouds acquired by the Kinect
RGB-D camera. Camera is oriented with local axis. Registration
method finds the transformation matrix that minimizes the C2C
distance between source and target points. (b). Registered point
clouds. C2C is the shortest distance between a point in source and
target point clouds [174]. . . . . . . . . . . . . . . . . . . . . . . . 227
Figure 6.12: Source/reference and inspection point clouds at various time steps
T_0, ..., T_n. Corresponding C2C distances are depicted in the right
side of the figures. (a). Two reference and a single inspection
point clouds at 3 time steps T_0, ..., T_2. (b). Multiple reference
and single inspection point clouds from time steps T_0, ..., T_i. (c).
Multiple reference and inspection point clouds from time steps
T_0, ..., T_n. Blue and green color histogram represents the reference
(no-change) and inspection (change) point clouds mean of C2C
distances [174]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Figure 6.13: PDF/histogram constructed from using the first point cloud as
source/reference and the remaining as target point clouds. μ_0
and μ_s are the mean of all the C2C means of no-change and
change point clouds. Blue and green color histogram represents
the reference (no-change) and inspection (change) point clouds
mean of C2C distances [174]. . . . . . . . . . . . . . . . . . . . . . 230
Figure 6.14: Original images of the three datasets. (a). Clamped plate with
large bolts. (b). Plate with large and medium sized bolts. (c).
Car engine with alignment markers and large/medium sized bolts
[174]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Figure 6.15: Complete 3D point cloud dataset of the fixed plate. Sample
numbers 1 to n from top left to right for zig-zag rows. (a). no-
change. (b). Change with loose bolts. . . . . . . . . . . . . . . . . 232
Figure 6.16: Complete 3D point cloud dataset of the plate with large/medium
sized bolts. Sample numbers 1 to n from top left to right for
zig-zag rows. (a). no-change. (b). Change with one loose bolt.
(c). Change with two loose bolts. (d). Change with three loose
bolts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Figure 6.17: Complete 3D point cloud dataset of the car engine where a plate
with alignment markers are placed that includes large/medium
sized bolts. Sample numbers 1 to n from top left to right for
zig-zag rows. (a). no-change. (b). Change with one loose bolt.
(c). Change with two loose bolts. (d). Change with three loose
bolts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Figure 6.18: C2C distance of the point clouds against the target point cloud
for fixed plate. Sample numbers 1 to n from top left to right for
zig-zag rows. (a). no-change dataset. (b). Change dataset. . . . . . 235
Figure 6.19: C2C distance of the point clouds against the target point cloud
for the simply supported plate. Sample numbers 1 to n from top
left to right for zig-zag rows. (a). No-change dataset. (b). Change
with one loose bolt dataset. (c). Change with two loose bolts
dataset. (d). Change with three loose bolts dataset. All the bolts
are loosened by 6.4 mm. . . . . . . . . . . . . . . . . . . . . . . . . 237
Figure 6.20: C2C distance of the point clouds against the target point cloud for
an engine. Sample numbers 1 ton from top left to right for zig-zag
rows. (a). No-change dataset. (b). Change with one loose bolt
dataset. (c). Change with two loose bolts dataset. (d). Change
with three loose bolts dataset. . . . . . . . . . . . . . . . . . . . . . 238
Figure 6.21: Probability density functions of the three datasets for the change
detection. (a). Clamped plate. (b). Plate with large and medium
sized bolts. (c). Car engine with a plate that include alignment
markers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Figure 6.22: Qualitative results of the complex mechanical scene registration.
(a). Plate on floor with two loose bolts. (b). Honda Civic car engine
with partial target point cloud overlap with cable displacements.
(c). Full overlap of the hood region of the Toyota RAV4 car engine
with steel tubes chafing. (d). Lateral side of an aircraft engine
with no-change. The magenta ellipse shows the defect locations.
(a)-(c) produced by photogrammetry and unscaled. (d) generated
by Kinect RGB-D camera that was mounted on a drone in robotic
simulation, and the units are in meters. . . . . . . . . . . . . . . . 241
Figure 6.23: C2C distance variation on the first five no-change engine point
clouds. (a). Points retained after decimation range from 5% to
100%. (b). The variance of the Gaussian noise range from 0% to
100% with a step size of the 5%. . . . . . . . . . . . . . . . . . . . 243
Figure 6.24: Dispersion of the points at the edge of a plate known as “edge
effect” caused by a LiDAR scanner. The red boxes show the edge
effect area. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Figure 6.25: Uncertainty in data acquisition. While scanning a complex scene,
some portions may or may not be acquired from the deeper regions.
Hence, it requires large samples to cover all the possibilities. (a).
The first scan of a car engine. (b). A second scan of the same car
engine at a different viewpoint. Pink circles A, B, and A_1 and B_1
show the same regions of two scans with uncertain points clusters
[174]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Figure 6.26: In practical field applications, the scanned point clouds (Kinect)
usually contain the noise of geometry distortion due to the low
resolution of the sensor. However, the scanned point clouds have
similar distortion patterns; hence, it is suitable to estimate C2C
differences by comparing a scanned point cloud to another scanned
point cloud. (a). Kinect and Solidworks model point cloud differ-
ences. Blue is Solidworks, and red is Kinect (b). Two Kinect scans
of the same object [174]. . . . . . . . . . . . . . . . . . . . . . . . . 248
Abstract
Civil, mechanical, and aerospace infrastructures are subjected to applied loads and environmental
forces such as earthquakes, wind, and water waves over their operating lifespan. These
factors slowly deteriorate the structures during their service period, and the subtle indications
that precede substantial damage are often difficult to observe. Owing to the cost-effectiveness
of high-resolution color and depth cameras, location sensors, and Micro Aerial Vehicles (MAVs), image
processing, computer vision, and robotics techniques are gaining interest in Non-Destructive
Testing (NDT) and condition assessment of infrastructures. In this study, several promising
vision-based and data-driven, automated and semi-automated condition assessment techniques
are proposed and evaluated to detect and quantify a class of problems under the
umbrella of infrastructure condition assessment.
A synthetic crack generation methodology is introduced to generate “zero-labeled” samples
for training classical classifiers. These classifiers were tested on a real-world dataset using
the gradient-based hierarchical hybrid Multi-scale Fractional Anisotropy Tensor (MFAT)
filter to segment the cracks. The results demonstrate the promising capabilities of the
proposed synthetic crack generation method. Furthermore, textural noise suppression and
refinement are carried out using an anisotropic diffusion filter, and guidelines are provided
for selecting the parameters of the anisotropic diffusion filter. Further, this study presents the
semantic segmentation of cracks on concrete surface images using a deep Convolutional
Neural Network (CNN) that has fewer parameters to learn. Several illustrative examples are
presented to demonstrate the capabilities of the CNN-based crack segmentation procedure.
The CNN was tested on four real-world datasets, and the results show the proposed
CNN’s superiority over four state-of-the-art methods.
As a part of this study, an efficient and autonomous crack change detection, tracking,
and evolution methodology is introduced. Among image registration methods, feature-based
registration is robust to noise, intensity changes, and a partial affine motion model.
This study uses an efficient k-d tree-based nearest neighbor search, which is faster than the
quadratic computational complexity of an exhaustive pairwise search. Furthermore, unlike
other methods, the fixed-camera assumption is relaxed in this study. Another significant
contribution is a probabilistic measure of the reliability of the analysis results that can aid
damage prognosis models.
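As a concrete illustration of the complexity argument, the sketch below builds a k-d tree over hypothetical image-capture coordinates (such as those logged by an IMU/GPS) and retrieves the neighbors of the current view; the coordinates, radius, and variable names are assumptions made purely for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical 2D capture locations of previously archived images
# (e.g., recovered from IMU/GPS logs); values are made up for illustration.
rng = np.random.default_rng(42)
archive_xy = rng.uniform(0.0, 10.0, size=(500, 2))

tree = cKDTree(archive_xy)          # build once: O(n log n)
current_xy = np.array([4.8, 5.1])   # location of the current-view image

# All archived images within a 1.5 m radius of the current view;
# each query is O(log n) on average, versus O(n) for a linear scan
# (O(n^2) over all image pairs).
neighbor_ids = tree.query_ball_point(current_xy, r=1.5)
print(f"{len(neighbor_ids)} candidate images for registration")
```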
After the nearest neighbor search, SURF-based keypoints are extracted from the
images in the previous database and the current one. This is followed by Random Sample
Consensus (RANSAC)-based outlier rejection, bundle adjustment to refine the homographies,
gain/exposure compensation, and multi-band blending for seamless registration of the images.
Lastly, the registered image is compared with the current image to detect changes in
the crack's physical properties. To demonstrate the capabilities of the proposed method, two
datasets were utilized: a real-world dataset and a synthetic dataset. The experimental results
show that the performance of the proposed methodology is suitable for detecting the crack
changes in both datasets.
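A minimal sketch of the keypoint matching and RANSAC-based outlier rejection stage is shown below; it uses ORB as a freely available stand-in for SURF (which resides in opencv-contrib), and the file names, match ratio, and reprojection threshold are hypothetical rather than those used in this study.

```python
import cv2
import numpy as np

def estimate_homography(src_path, dst_path, ratio=0.75):
    """Match keypoints between two crack images and fit a homography
    with RANSAC (ORB used here as a stand-in for SURF)."""
    src = cv2.imread(src_path, cv2.IMREAD_GRAYSCALE)
    dst = cv2.imread(dst_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(src, None)
    kp2, des2 = orb.detectAndCompute(dst, None)

    # Putative matches via k-NN matching plus Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC rejects outlier correspondences while fitting the homography.
    H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
    return H, int(inlier_mask.sum())

# Usage (hypothetical file names):
# H, n_inliers = estimate_homography("crack_T0.png", "crack_T1.png")
# warped = cv2.warpPerspective(cv2.imread("crack_T0.png"), H, (width, height))
```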
This work also studies the condition assessment of public sewer pipelines. The visual-bags-
of-words model was evaluated for classifying the defective and non-defective sewer pipeline
images using two feature descriptors. Three classical classifiers are trained and tested on a
moderate-sized dataset of 14,404 images. The experimental results demonstrate that the
classification accuracy of the visual-bags-of-words model is satisfactory and comparable to
deep learning methods given the moderate dataset size.
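The following sketch outlines a generic visual bag-of-words classifier of the kind evaluated here, assuming SIFT descriptors, a k-means visual vocabulary, and an SVM; the vocabulary size, image paths, and labels are placeholders and do not reflect the exact configuration, descriptors, or classifiers used in this work.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def descriptors(path):
    """SIFT descriptors for one CCTV frame (grayscale)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    return des if des is not None else np.empty((0, 128), np.float32)

def bow_histogram(des, kmeans, k):
    """Quantize descriptors against the visual vocabulary and build a
    normalized word-frequency histogram."""
    hist = np.zeros(k, np.float32)
    if len(des):
        for w in kmeans.predict(des):
            hist[w] += 1.0
        hist /= hist.sum()
    return hist

def train_bow(image_paths, labels, k=200):
    """Fit the vocabulary (k-means over pooled descriptors) and an SVM."""
    all_des = [descriptors(p) for p in image_paths]
    kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(all_des))
    X = np.array([bow_histogram(d, kmeans, k) for d in all_des])
    clf = SVC(kernel="rbf", C=10.0).fit(X, labels)
    return kmeans, clf

# Usage (hypothetical paths/labels; a real run needs many images per class):
# kmeans, clf = train_bow(train_paths, train_labels)
# pred = clf.predict([bow_histogram(descriptors("pipe_new.jpg"), kmeans, 200)])
```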
Lastly, defect detection on the three-dimensional surfaces of mechanical parts is studied. A preliminary study on a vision-based, semi-autonomous, spatio-temporal method to detect, locate, and quantify defects such as loose bolts, displacements, pipe chafing, or deformation is presented. In addition, a probabilistic reliability quantification method based on the ensemble averaging of the Cloud-to-Cloud (C2C) distances is introduced for mechanical systems.
Several quantitative and qualitative examples are presented to illustrate the capabilities of
the proposed method. The results show that the proposed method is promising and robust in registering complex shapes and in detecting and locating changes in the mechanical systems.
Chapter 1
Introduction
1.1 Background
Civil, mechanical, and aerospace infrastructures are subjected to the external (applied) and
environmental forces in their operating life span. These factors will lead to the deterioration of
the structures during their service period [138]. However, infrastructure deterioration is slow, and the subtle signs that precede substantial damage are often challenging to observe. For many civil facilities (bridges, dams, pavements, and sewer pipes), the nature and rate of this deterioration are sufficiently well understood. However, regular periodic inspection and preventive maintenance remain costly because of the effort such assessment requires.
To minimize maintenance and the risk of structural failure and its potential impact on service, regular inspection must be performed. However, condition assessment and monitoring often appear highly expensive because facilities are massive, inaccessible, or difficult to examine without interrupting service. Nevertheless, if periodic inspections are performed, the costs are typically lower than those incurred when service deteriorates or fails.
Tracking of the dynamic change in infrastructures using digital image and video processing
is an extensively researched topic [223, 53]. Apart from civil infrastructures, change detection
on aircraft outer skin [224], land-cover transitions using remote sensing [30], and other fields of science and engineering has also been widely studied. In addition, structural health monitoring
based on sensor fusion [290] and wireless networks [197] is also implemented on large-scale
civil infrastructures like bridges, dams, and high-rise buildings. Change detection of traffic
systems using calibrated cameras [69], and wireless smart sensor networks [15] are also areas
of research.
The value of civil infrastructure in the United States exceeds $20 trillion in today's dollars [74]. These systems are subject to deterioration due to excessive usage, overloading, aging materials, insufficient maintenance, and inspection deficiencies. Most of the critical infrastructure that serves society today, including bridges, dams, highways, lifeline systems, and buildings, was erected several decades ago and is well past its design life. According to the 2021 Infrastructure Report Card published by the American Society of Civil Engineers (ASCE), the nation's bridges received a grade of C, with over 46,154 structurally deficient bridges requiring an estimated $125 billion for rehabilitation. In addition, there are over 800,000 miles of sewer pipeline and 500,000 miles of laterals connecting the wastewater network, graded D+, and the water and wastewater sectors require more than $434 billion for rehabilitation by the year 2029. Overall, an estimated $2.59 trillion over ten years is required to fix the US infrastructure [12].
Manual visual inspection by trained inspectors is still the main form of assessing civil
infrastructure’s physical and functional conditions at regular intervals to ensure that infras-
tructure still meets its expected service requirements and is operable. However, there are still
several accidents that are related to insufficient inspection and condition assessment [138].
For example, the collapse of the I-35W Highway Bridge in Minneapolis (Minnesota, USA) in
2007 killed 13 people, and 145 people were injured.
According to the American Association of State Highway and Transportation Officials
(AASHTO) manual for condition evaluation of bridges, there are five main categories of
inspections [172, 97]:
• Initial inspections: primary type carried out when the bridge is new.
• Routine inspections: usually performed every six months to 2 years, depending on the
importance of the bridge.
• In-depth inspections: close-up inspection using non-destructive evaluations when some
of the structural defects are not detected in routine inspection, which takes a long time.
• Damage inspections: carried out in response to damage caused by human actions or
environmental conditions.
• Special inspections: monitoring of any growing defect.
In the current bridge inspection procedure, a robotic arm mounted on a specialized truck is used to inspect inaccessible areas [264]. Many inspectors also need the ability to climb the scaffolding of tall structures [249, 164].
Structural cracks are the initial signs of structural deterioration. If left unnoticed and unmaintained, they can grow into fractures under externally applied loads and eventually lead to collapse. Localizing cracks on a structure is therefore of prime importance for quantifying crack propagation.
Sewer pipeline has an average service life of 70 years [219], and many existing underground
pipes were installed 50-60 years ago [120]. Insufficient inspection and maintenance are the
main reasons for the poor condition of sewer pipes in the United States, with a GPA D+
(ASCE report card). Human technicians inspect the recorded CCTV video footage to detect
defects to prepare an assessment report. This inspection technique is very subjective, and it
directly depends on the inspector’s experience. Furthermore, this approach is expensive as
human resources are not always cheap. The inspection of videotapes consumes 30% of the
total cost for this procedure [219].
Lastly, inspection of complicated mechanical systems (e.g., aircraft and car engines) is a tedious task even for specialized technicians, as the outer surface is cluttered with components [33, 124]. Furthermore, it is challenging to detect defects on the three-dimensional surfaces of mechanical parts, as they are cluttered by objects and occluded by other parts [33]. In addition, untimely inspection and maintenance of the components incur operational costs in the manufacturing and production sectors of the aerospace, automobile, and mechanical components industries. Thus, conventional inspection of civil, mechanical, and aerospace infrastructures is time-consuming, expensive, and subjective to the inspector's experience.
1.2 Motivation
Vision-based condition assessment approaches have been growing in popularity exponentially
for a decade as the computational resources have improved to solve complex problems, and
the cost of hardware (cameras, 3D sensors) has dramatically decreased. Autonomous or
semi-autonomous visual inspection by image processing and computer vision techniques are
efficient and reduce the labor cost. Using these methods, obtaining meaningful information
Figure 1.1: A damage prognosis model. (a) Flowchart. (b) A deterioration and end-life prediction curve obtained from the measurements, model fitting and extrapolation by predictive analytics.
from the raw color images, color fused depth/range images, and/or multimodal data of the
infrastructures assists in detecting, locating, and quantifying the defects, which is known
as the damage diagnosis model. Damage prognosis is a challenging problem that deals with estimating the end-life of an infrastructure system [109, 200]. It uses the current loading
condition and measures the response of the systems by health monitoring sensors. In addition,
the design load criteria, the feedback from the load prediction model, continuous health
monitoring, and condition assessment methods aid in predicting the end-life by estimating the
failure threshold. Figure 1.1a shows the simplified representational flowchart of the damage
prognosis model. Figure 1.1b shows the deterioration curve that is used to estimate the end-
life of infrastructure by estimating the failure threshold. Measurements are obtained by the
continuous monitoring systems such as networks of sensor arrays (Micro-Electro-Mechanical
Systems (MEMS)) or in-service visual inspection methods by humans or robots coupled
with image processing and computer vision techniques, or both. For these measurements,
analytical or computational models can be used to obtain the initial deterioration curve.
Lastly, probabilistic inference, probabilistic risk assessment, and reliability methods can be
used to extrapolate the deterioration curve to estimate the failure threshold.
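As a rough illustration of this last step, the minimal Python sketch below fits an assumed exponential deterioration model to synthetic periodic measurements (the model form, parameter names, and values are illustrative assumptions, not the prognosis model used later in this work) and extrapolates the fitted curve to the time at which a failure threshold would be crossed.

    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic periodic measurements of a deterioration index (illustrative values only).
    t_meas = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])        # years
    d_meas = np.array([0.02, 0.05, 0.09, 0.16, 0.26, 0.41])   # deterioration index

    # Assumed deterioration model: d(t) = a * (exp(b * t) - 1).
    def deterioration(t, a, b):
        return a * (np.exp(b * t) - 1.0)

    params, _ = curve_fit(deterioration, t_meas, d_meas, p0=(0.01, 0.1))
    a, b = params

    # Extrapolate to the time at which the fitted curve crosses a failure threshold.
    threshold = 0.8
    t_end_of_life = np.log(threshold / a + 1.0) / b
    print(f"Fitted a={a:.4f}, b={b:.4f}; predicted end-of-life ~ {t_end_of_life:.1f} years")

In practice, the extrapolation would be wrapped in a probabilistic framework (e.g., parameter uncertainty propagated to an end-of-life distribution) rather than a single point estimate.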
The advent of next-generation vision-based sensors, image processing, computer vision,
artificial intelligence techniques (machine learning and deep learning), and availability of
massive data paves the way for developing autonomous and semi-autonomous systems to
assist the conventional visual inspection methods. This dissertation focuses on a subset of
Figure 1.2: A subset of the vision-based and data-driven condition assessment problems that can assist the damage prognosis model of civil, mechanical, and aerospace infrastructures that are discussed in this work.
vision-based and data-driven damage diagnosis approaches that directly integrate with the
damage prognosis system. The primary focus is on developing techniques for detecting, locating, and quantifying defects such as structural cracks, as well as loose bolts, displacement, and chafing on mechanical systems. In addition, this work develops probabilistic reliability methods
on the change detection and quantification for the same defect types. Figure 1.2 shows the
organization of the subset of the vision-based and data-driven condition assessment problems
focused on in this research work.
Supervised decision-making machine learning algorithms require a large amount of data
to generalize well for unseen and new datasets. Generally, the labels for the dataset are
prepared conscientiously by human experts for classification, object detection, and semantic
segmentation problems. However, this needs special skills, patience, and painstaking efforts
to create accurate ground-truth labels and is financially expensive to produce. Therefore,
developing synthetic data (e.g., cracks) generation methods is essential to augment or partially
replace the real-world dataset.
Cracks are the most commonly found defects in civil infrastructures. Periodic detection,
localization, and quantification of cracks are necessary to monitor crack propagation. Most
of the prior work focused on using conventional image processing-based filters to denoise
the images. Although cracks are generally darker in color intensities, they are surrounded
by texture noise. Hence, textural noise suppression becomes necessary for the accurate
segmentation of the cracks. In contrast to the classical methods, deep learning-based end-to-
end fully Convolutional Neural Networks (CNNs) have proven efficient and accurate to crack
segmentation problems provided a large dataset. However, their training phase is longer
in computational time due to the number of learnable parameters. Thus, it is crucial to
adopt a semantic CNN with fewer parameters to learn and select the proper loss function to
overcome the class imbalance problems between crack and background pixels and achieve
higher accuracy.
Sewer pipelines deteriorate due to structural and environmental factors over time. Similarly, detecting and locating the defects in mechanical systems is vital to shorten the visual inspection by trained technicians. Thus, developing autonomous or semi-autonomous methods with no feature engineering to detect, localize, and classify the defects is essential to reduce the manual process. Earlier vision-based methods were agnostic to the tracking of defects through time, as they only considered defect detection in a single time frame. Therefore, change detection methods provide an opportunity to develop a probabilistic reliability model to assist the damage prognosis model. In addition, the results of the change detection and tracking methods can serve as the initial conditions for computational mechanics models.
1.3 Scope
Chapter 2 introduces a synthetic crack generation algorithm to aid the training process in the
supervised machine learning approach with “zero-labeled data”. In addition, a hierarchical
hybrid filter is proposed with two-frame architecture. First, an anisotropic diffusion filtering
technique was used to suppress the textural noise of the images and simultaneously improve
the crack regions. The experimental validation proposes a judicial range for the anisotropic
diffusion filtering parameters. Second, a Multiscale Fractional Anisotropy Tensor (MFAT)
was used to segment the cracks semantically. The linear relationship between the cracks
and MFAT filter scales was found experimentally on a synthetic dataset. Furthermore, a
comprehensive comparison between the classical gradient-based filters and deep learning-based
semantic segmentation CNN are presented. Lastly, the details of the proposed system are
presented, and several illustrative examples are shown to demonstrate the capabilities of the
proposed method on real-world datasets.
Chapter 3 details the deep learning-based semantic segmentation of cracks on concrete
surface images. An encoder-decoder-based fully convolutional neural network architecture
was used to segment the cracks. DenseNet was used as the encoder to obtain the feature-maps
of the crack images. Due to its computational efficiency, a LinkNet decoder was adapted to
segment the crack pixels from the background. To mitigate the adverse effects of the class
imbalance between the number of cracks and background pixels, a focal loss and dice loss
were implemented. In addition, the proposed method was tested on three public datasets
and one private dataset. Lastly, crack profile analysis and the statistical estimation of the
properties of the cracks, such as length, width, and area for the various methods, are also
presented.
Chapter 4 describes a robust methodology of automatic feature-based image registration
to solve the reconstruction of previous pairs of neighboring images to the current image
for crack quantification, and tracking the evolution of the cracks on the concrete surfaces.
This study used multiple time-series image data captured by a camera-mounted drone to
quantify the crack propagation in robotic simulation. In addition, the real-world data of
varying crack sizes (thin, medium, and thick) at two different time frames were evaluated
to understand the capabilities and limitations of the proposed system. The segmentation
method presented in Chapter 3 was used to extract the crack pixels. The crack physical
properties such as length, width, and area are also assessed over different time-frames to
develop the probabilistic reliability of the crack change detection and quantification.
Chapter 5 presents a general methodology for classifying the sewer pipe defective and
non-defective images using the visual-bags-of-words model with a moderate-sized dataset
according to today’s standard. The widely used feature descriptors, Scale-Invariant Feature
Transform (SIFT) and Speeded Up Robust Features (SURF), are evaluated for the sewer pipe
classification problem. This method is fully autonomous and it requires no hand-designed
features. In addition, classification scores are compared at the conventional detector and
fixed grid spatial location across the images.
In Chapter 6, the results of an exploratory study focused on the detection and location of
structural changes in mechanical systems are presented. The three-dimensional (3D) point
cloud data obtained from the 3D sensor (Microsoft Kinect 1) and photogrammetry technique
are used to detect and localize changes such as loose bolts, displacements, and chafing. It is
necessary to align the point clouds accurately to detect changes in the millimeter range. To
accomplish this, a deep learning-based registration algorithm was adapted. Prior work focused
on detecting and localizing the defect in the single time frame. However, this work proposes
a probabilistic reliability change quantification method based on the ensemble averaging of
the point cloud distance. Also, visualization methods are developed to track an evolution
(time-history) of the changes along the time domain. This provides greater flexibility for the
visual inspectors to identify the specific changes in the components.
Chapter 7 deals with the overall summary and conclusion of this work. In addition, it
provides a roadmap to future extensions and improvements of this work to aid the vision-
based condition assessment of the civil, mechanical, and aerospace infrastructure research
community or another dissertation.
Chapter 2
Crack detection using a multiscale hybrid algorithm based on anisotropic diffusion filtering and eigenanalysis of Hessian matrix of a fractional anisotropy tensor
2.1 Introduction
Public civil infrastructures are the backbone of the nation’s economy. The nation’s prosperity
and the public’s health and welfare directly depend on these. The infrastructure’s condition
directly impacts the nation’s economy. Due to regular usage, fatigue and failure are common
in civil infrastructures. ASCE report card grades the U.S. bridges, dams, and levees as C+,
D, and D, respectively [11]. More than 200,000 of the bridges in the U.S. are made of concrete
superstructure, and they are in wearing condition [72]. Periodic inspection of the bridges
are performed to assess the overall condition and surface defects such as cracking, spalling,
etc. [73, 71, 184]. It is essential to perform condition assessments regularly in order to
minimize serious structural failure. Structural Health Monitoring (SHM) is primarily focused
on vibration-based methods that have been practiced for more than four decades. However,
with the development of accurate, robust, and inexpensive visual sensors, vision-based autonomous SHM and condition assessment techniques have gained more attention. As a result of this
development, numerous works in the field of civil engineering on image processing, computer
vision, and their applications to solve complex vision-based condition assessment are active
topics in the research community.
Cracks on the surface of concrete structures are the most commonly found defects; image-
based methods are used to detect the cracks and quantify them [293, 274]. Moreover, many
applications of the image processing techniques are leveraged in the pavement distress analysis
systems, where the cracks, ruts, and undulations of the pavements are measured by using
vision-based methods [283, 76]. Further, during the late 1990’s most of the applications of
conditional assessment of the pavement motivated the automation of the industrial researchers
in the field of construction engineering. Consequently, the last two-and-a-half decades has
seen a significant improvement over developing vision-based sensors in robotic and mounting
platforms used to monitor bridge decks and other large scale structures [100, 145, 22, 89].
Visual inspection is very reliable, as most structural defects such as cracks, corrosion, fracture, spalling, and structural deformation are visible to the unaided eye [70]. However, it is highly subjective to the inspector's experience. Due to the physical and mental stresses involved in visual inspection, it is sometimes difficult to assess the infrastructure's condition in an efficient and consistent manner, which can lead to erroneous data due to human error. In addition, working near inaccessible regions of buildings or on tall structures poses a fatality risk to the human inspector.
A structural system undergoes complex dynamic loading during its life cycle, which makes it difficult to predict precisely where defects will form and grow over time.
Cracks are the most common structural defects in civil infrastructures [275]. Since they
vary in terms of size, shape, intensity, and texture, it is difficult to detect and quantify
them. In computer vision, crack recognition can be classified into three categories, namely,
classification, detection, and semantic segmentation. A large body of work is focused on
all three categories, but, the semantic segmentation of the cracks is more challenging and
interesting. Recently, in addition to the conventional methods, Deep Learning (DL) methods
have become popular since the tediousness involved in the feature engineering is not required
[147]. Also, the accuracy is higher compared to the conventional approaches. As a caveat, a
large amount of data is required to overcome the over-fitting and training of the DL models
[141]. Deep learning methods based on the convolutional neural networks are widely used
in the civil engineering community for the crack segmentation and detection on concrete,
pavement, and steel surfaces [34, 38, 285, 39, 40].
Although deep learning approaches have flourished in the research literature, direct comparisons between conventional filters (such as multiscale vesselness, morphological, and Fractional Anisotropy Tensor (FAT) filters) and a semantic segmentation Convolutional Neural Network (CNN) approach remain a gap in image-based crack detection studies, except for a few papers in
different domains. Abdel-Qader et al. [2] compared the fast Haar transform, Fourier trans-
form, Sobel filter, and Canny filter for crack detection on a concrete surface using 25 images.
Mohan and Poobal [182] reviewed several edge detection techniques for information-retrieval
in visual, thermal, and ultrasonic images in different datasets. Cha et al. [34] compared two
edge detectors, Canny and Sobel, to a CNN. Recently, Dorafshan et al. [60] compared various
edge detectors methods such as Roberts, Prewitt, Sobel, Laplacian of Gaussian, Butterworth,
and Gaussian in spatial and frequency domains to an AlexNet-based CNN [141] model.
This study focuses on three main aspects of image-based crack detection on concrete sur-
faces: Firstly, the study proposes synthetic crack generation and data augmentation methods
that aids in the training process of the machine learning models with “zero semantically
labeled images”. Secondly, a hierarchical filter that refines the texture-dominant (dark texture,
dirt, or surrounded by other similar contrast objects) images to crack segmentation and im-
proves decision making, using classical machine learning approaches. Lastly, a comprehensive
analysis and comparison of the conventional filters like multiscale vesselness, morphological,
and FAT to a semantic segmentation CNN.
2.1.1 Review of the literature
Song et al. [229, 230] developed a method that detects cracks on a randomly distributed
texture using the discrete Wigner model. Petrou et al. [199] have shown that a modified Walsh
transformation can be used in crack detection in random textured images. They implemented
both the probabilistic relaxation labeling and optimal line filter to distinguish between the
crack pixels and non-crack pixels. Although they successfully detected the crack within the
random texture, the captured image was of uniform intensity, and explicit training of the
non-defective texture patterns was performed.
In a Closed-Circuit-Television (CCTV)-based monitoring system, Fieguth and Sinha [75]
used digital image processing to detect the cracks in underground pipes employing statistical
parameters such as sample mean, variance, and cross-correlation with the crack pixels and
their surrounding window. Although this method can extract features like cracks and, to
some extent, joint openings of the buried pipe, it failed to minimize false positives. Using warped tunnel images from a tool called Sewer Scanner and Evaluation Technology (SSET), the works in [110, 228, 112] presented a method that segmented the cracks (eliminating joints, holes, and laterals) from the buried pipeline images by using statistical filters for crack detection, followed by cleaning and linking operations.
Decitre et al. [57] proposed a method to detect the faulty rivet images obtained from
a magnetic-optic imager. To perform the pattern classification, they implemented a multi-
layered neural network combined with morphological operators to extract the geometric
features. Iyer and Sinha [111] worked on multiple crack detection algorithms for contrast
enhancement, morphology, non-linear filtering, and curvature evaluation of crack patterns. It
was shown that morphological characteristics such as branches, thickness, and orientation
of cracks could be exploited for their autonomous recognition. Furthermore, with the
concept of Bayesian classification, mathematical morphological processes, and segmentation
by thresholding, Sinha and Fieguth [227] studied the feature extraction methods for multiple
cracks and joints in the sewer pipeline.
Oka et al. [191] suggested a sub-millimeter crack detection method by combining a near-
field millimeter-wave imaging technique and self-organizing neural network. In their approach,
a transmitted wave scatters if it encounters a crack, resulting in low signal confidence in
that region. Once the scanning is completed, a blurry grayscale image is obtained. It is
practically impossible to find a crack from this image. Therefore, they used a color image
to find the crack using a self-organizing map (SOM). Their approach is valuable in finding
cracks under tiles, wallpaper, and other cladding materials, but impractical for scanning large
areas. Additionally, their method is susceptible to the noise or random textures in a color
image. Fujita et al. [81], [83] and [82] used the Hessian matrix in order to extract cracks from
a concrete surface. In addition, they used Receiver Operating Characteristic (ROC) analysis
[66] to measure the performance of their method and used a probabilistic relaxation method
to connect the pixels belonging to the cracks.
Chen and Hutchinson [46] proposed a method to monitor and quantify the concrete
surface cracks. In their approach, multi-temporal images and motion-invariant features based
on the manifold-distance are extracted to localize the crack growth. Additionally, they used
level-sets to segment the crack pixels and a morphological method to obtain the dimensions
of the cracks. In their setup, cameras are stationary, and the use of the level-set method
has a limitation while working with any noisy texture data. Krause et al. [140] presented a
method for detecting cracks in volumetric images of composite materials obtained through
Computed Tomography (CT). A Radon transform is used to model the local variations of
3D volume. A second derivative of the radon transform segments the crack region, and a
multiscale approach is used to extract the cracks of various sizes. A CT image was obtained
through the sophisticated sensors. This ensures uniform intensity over the 2D image obtained
from the 3D reconstructed data, but in reality, the intensity varies sharply throughout the
image, which is not addressed in their work. Tsai et al. [246] proposed a micro-crack detection
method in solar wafers in heterogeneously intra-grain texture using anisotropic diffusion. The
anisotropic diffused image is subtracted from the original image to highlight the micro-cracks
of the solar wafer. Although this method can extract micro-cracks in solar wafers with a
piecewise-smooth intra-region of grains, it is unsure whether this method is suitable for highly
textured images like concrete and pavement.
Chen et al. [45] included two different lighting conditions to capture crack and non-crack
patterns on concrete surfaces. Three to six features were utilized to classify their dataset
of eleven images. Although they were able to achieve an accuracy of more than 80%, their
method lacked testing on a diverse set of images. Furthermore, their experimental setup
is cumbersome, and is less effective when capturing in outdoor lighting conditions. Linda
and Jiji [161] have shown a procedure for crack detection in an X-ray image based on
the minimization of a fuzzy measure. They divided the image histogram into three fuzzy
subsets to obtain the thresholding parameters. These parameters are used to segment the
skin, bone, and background region. Furthermore, when the segmented image is formed,
holes are filled, and the image is subtracted to obtain the residual patch (i.e., crack in this
case). Consequently, filtering is performed by using the mathematical morphological kernels.
Although this method works well for detecting cracks, it requires a robust prior knowledge
to classify the background and foreground regions, but in the case of infrastructures, it is
challenging to define segmentation regions as no explicit information can be extracted from
the images.
Additionally, Shen et al. [221] proposed a concept of gray-intensity wave transformation,
which overcomes the issue of non-uniform illumination in gray image binarization. Crack
detection was performed on satellite solar cell images. Firstly, they normalized the image
intensity by using the peaks and troughs of the thresholded intensity values. Secondly, they
transformed the image using a bi-directional convolution filter, followed by a feature reduction
using Principal Component Analysis (PCA). This image information was further thresholded
using a global threshold method such as Otsu’s method [192]. Lastly, a pre-defined match
filter was used to fit an ellipse by least squares to extract the curvilinear crack features,
smear, and chipping of the solar cells. Since cracks are structurally random, an assumption of
matched filter works roughly in practice. Ehrig et al. [64] also compared three different crack
detection methods. Firstly, they examined template matching, which used a pre-defined filter
to convolve over the given image. Secondly, a sheet filter based on the Hessian eigenvalues
was utilized. Thirdly, they investigated a percolation method that was based on a physical
model of liquid permeation. All these methods were tested on 3D CT datasets. Eventually,
noise effects on the images were excluded in crack detection, which is important to consider.
Choudhary and Dey [50] adapted neural networks to classify crack and non-crack images
(binary classification). Their methodology consists of using an edge detection method to
extract the crack pixel locations. They fed each element through the neural network to
classify them as either a crack or a non-crack entity. Although they achieved good accuracy,
edge detection is too sensitive to noise, increasing the risk of false positives. Bai et al. [14]
used genetic programming to extract the crack from concrete texture images. In their method,
a group of images was segregated as target, weight, and output. These images are convoluted
with multiple kernels. Based on the minimization of their cost function, a combination of the
filters are retained (i.e., responsible for minimum distance between output and target image).
Their method is computationally costly and also heuristic as far as the convolution kernels
are concerned.
Jahanshahi et al. [117] proposed a crack detection algorithm based on image-based 3D scene
reconstruction in which crack segmentation is performed by a mathematical morphology and
multiscale crack mapping techniques. Furthermore, they used pattern recognition classifiers
such as neural networks and a Support Vector Machine (SVM) to classify crack patterns.
In their method, the caveat was that the camera-to-object distance was needed to estimate the
structuring element size in the morphological operation. Torok et al. [244] proposed a
method for detecting crack and concrete spalling through a 3D surface reconstruction process.
Methodologically, a dense point cloud is generated by a feature-based bundle adjustment.
Then, the object is reconstructed from the point cloud using Poisson’s technique, and the
surface is labeled as damaged, undamaged, and cracked by thresholding the surface normal
at each vertex. Active contour-based anisotropic filtering was used by Tang and Gu [237] to
segment the cracks from a pavement surface. In their work, the images used were mostly
low textured. Thus, anisotropic diffusion alone is sufficient to extract the crack regions.
Ghanta et al. [87] proposed a crack segmentation method based on the Hessian matrix.
Segmentation accuracy and performance tests were carried out, and found that the Hessian
process produces good results. In their approach, defect detection and quantification of
potholes were considered.
Chen et al. [41] proposed a texture-based Bayesian data fusion method to detect the
cracks on the metallic surfaces of a nuclear reactor. Jahanshahi et al. [119] leveraged the fast-
marching method to extract the center-line of the microcracks on the nuclear reactor internal
components and used the crack center-line orthogonal orientation-based crack width algorithm
to estimate the crack thickness. Jahanshahi and Masri [115] used the photogrammetry
approach to correct the perspective error for crack width estimation using the binary strip
kernels.
2.1.2 Contribution
This work mainly focuses on three different contributions to the civil engineering research community on image-based crack detection on concrete surfaces. Firstly, deep learning-based object classification, detection, and semantic segmentation methods have been used overwhelmingly on crack images of various materials since 2017, but they require carefully labeled data for training purposes. In this work, a seam-carving-based synthetic crack generator and elastic-deformation-based data augmentation methods are developed to generate strands of cracks. This "zero semantically annotated images" approach minimizes the financial cost and tedious effort of preparing datasets for the training process of machine learning classifiers for crack and outlier detection. Although these methods are used here to train conventional classifiers, they can be generalized well to deep learning models.
Secondly, concrete-based civil infrastructures comprise intricate textures, and non-uniform
intensities are spread around the crack regions. Using a refinement method and anisotropic
diffusion, the non-crack features such as textural noise and small blobs can be suppressed
iteratively by enhancing the crack components. Furthermore, the performance of a hybrid
filter is invariant to texture, intensity, and scale variation on the concrete image surface.
Multiscale Fractional Anisotropy Tensor (MFAT) segments the cracks more precisely than
conventional filtering methods. Moreover, this work provides guidelines to use anisotropic
diffusion and its parameters in a judicious way for the texture noise suppression and crack
segmentation problem.
Lastly, a comprehensive analysis and comparison of the proposed method were made
against conventional vesselness and morphological methods using three different real-world
datasets of various sizes and texture attributes. Furthermore, for comparison, DeepCrack, a state-of-the-art deep learning model, and its publicly available dataset are used as a reference.
2.1.3 Scope
Sections 2.2.1 and 2.2.1.1 introduce the synthetic crack generation and data augmentation
techniques. In Section 2.2.2, the details of the hybrid filtering are presented. The proposed
hybrid filter consists of various stages. Firstly, an anisotropic diffusion smoothing is dealt
with in Section 2.2.2.1. Secondly, curvature dominant feature extraction is explained in
Section 2.2.2.2 for the segmentation of synthetic and real-world cracks. Lastly, the supervised
machine learning approach is discussed in the Section 2.2.2.3.
Sections 2.3 to 2.5 introduce the vesselness, morphological, and DeepCrack CNN methods. Section 2.6 presents the dataset preparation, experimental results, comparison of the various methods, and the related discussion. Lastly, Sections 2.7 and 2.8 conclude this article and provide some useful suggestions for future work.
2.2 Methodology
2.2.1 Synthetic cracks generation
Supervised machine learning algorithms require carefully annotated data [62]. These annota-
tions can be performed on text, images, videos, or other modalities to learn a generalizing
function that can decide on the new data. Annotations are primarily done by skilled humans
and are considered as the ground-truth or the “gold standard” in the Artificial Intelligence
(AI) community. Manual labor of annotations takes a long time, is tedious and expensive.
Ren et al. [211] mentioned that typically the crack semantic annotation of a 4032 × 3016 pixels image takes about 40 to 60 min for a skilled person.
Jahanshahi et al. [117] generated randomized synthetic cracks by manually segmenting real crack images using Adobe® Photoshop and augmenting them by random rotation and scaling. In their work, the initial manual labor cannot be eliminated. Lee et al. [148] proposed a
synthetic crack generation method based on the Brownian motion process. In real-world
scenarios, cracks tend to have high tortuosity, but their synthetic cracks (‘seams’ in this
work) were linear and less tortuous. In addition, Brownian movement produced very high
fluctuations in the crack strands, and the parameters were hard to set up. Kanaeva and Ivanova [126] used public semantic crack datasets and blended (overlaid) them onto pavement images with cropping, scaling, and rotation; however, that work did not generate the synthetic cracks themselves. Generative Adversarial Networks (GANs) and their variants have been used for synthetic image generation or augmentation [93]. However, these methods do not produce images with semantic labels.
In the proposed work, an image retargeting method [13] was used in a different manner. Their approach, called 'seam carving,' performs content-aware resizing of images without losing meaningful details. Here, the seam is defined as a pixel-wide connected component that joins the top row to the bottom row vertically (see Figure 2.1b). This article refers to
the seam as ‘crack seam.’
Figure 2.1: Synthetic crack generation procedure. (a) Synthetic crack seam overlaid on a Gaussian noise image. (b) Synthetic crack seam of 1-pixel width. (c) Morphologically dilated synthetic crack with a rectangular structuring element of size 30 × 15 pixels.
The energy function for an image is $e(I) = \left|\frac{\partial I}{\partial x}\right| + \left|\frac{\partial I}{\partial y}\right|$, where $I$ is an image of size $n \times m$ pixels and $x$ and $y$ are the spatial coordinates. The vertical seam is defined as $s^{x} = \{s_{i}^{x}\}_{i=1}^{n} = \{(x(i), i)\}_{i=1}^{n}$, s.t. $\forall i, \ |x(i) - x(i-1)| \leq 1$, where $x$ is a mapping $x : [1, \ldots, n] \rightarrow [1, \ldots, m]$. The goal is to minimize the cost function to obtain an optimal seam,
\[
s^{*} = \min_{s} \sum_{i=1}^{n} e\big(I(s_{i})\big). \tag{2.1}
\]
The above equation is solved by recursion using the dynamic programming approach [136]. Note that the above equations produce crack seams that are tortuous but largely linear. By maximizing Equation (2.1), more tortuous crack seams can be obtained.
Uniform and Gaussian random noise images were used to generate the crack seams
(see Figure 2.1a) for calculating the energy. Cracks on the concrete surfaces have varying
thicknesses with discontinuities along the edges. To simulate this behavior, the crack seams
are dilated using the morphological process by a rectangular structuring element of various
sizes [92]. This approach to generate the cracks is acceptable since the decision making
classifiers are trained to learn the geometric characteristics of the thin and thick cracks, with
or without edge discontinuities. One of the goals of this work is to compare the responses of
the conventional crack extraction filters. These filter outputs were the thresholded binary
crack maps representing either the crack or non-crack connected components as geometric
features. However, the same method can be used to generate synthetic cracks with texture
blending to represent the real-world cracks with intensity variations.
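To make the seam-carving step concrete, the following minimal Python sketch (a simplified re-implementation, not the exact code used in this study) computes a gradient-magnitude energy map on a Gaussian noise image, extracts one minimum-energy vertical seam by the dynamic programming recursion behind Equation (2.1), and dilates it into a crack-like band; the image size, seed, and structuring element are illustrative assumptions.

    import numpy as np
    from scipy import ndimage

    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 1.0, size=(256, 256))            # Gaussian noise image

    # Energy map e(I) = |dI/dx| + |dI/dy|.
    gy, gx = np.gradient(noise)
    energy = np.abs(gx) + np.abs(gy)

    # Dynamic programming: cumulative minimum energy from top to bottom.
    n, m = energy.shape
    cum = energy.copy()
    for i in range(1, n):
        left = np.roll(cum[i - 1], 1);   left[0] = np.inf
        right = np.roll(cum[i - 1], -1); right[-1] = np.inf
        cum[i] += np.minimum(np.minimum(left, cum[i - 1]), right)

    # Backtrack the minimum-energy vertical seam (one column index per row).
    seam = np.zeros(n, dtype=int)
    seam[-1] = int(np.argmin(cum[-1]))
    for i in range(n - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, m)
        seam[i] = lo + int(np.argmin(cum[i, lo:hi]))

    # Rasterize the 1-pixel-wide seam and dilate it into a crack-like band.
    crack = np.zeros((n, m), dtype=bool)
    crack[np.arange(n), seam] = True
    crack = ndimage.binary_dilation(crack, structure=np.ones((9, 3)))

Maximizing instead of minimizing the cumulative energy (as noted above) yields the more tortuous seams.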
2.2.1.1 Elastic deformations of synthetic cracks
Deformation happens when the external forces are applied to physical bodies. If the body
recovers to the original position, it is referred to as the ‘elastic deformation.’ Generally,
cracks are randomly shaped and have no pre-defined shape characteristics. A linear crack can
have a curvy shape when the crack propagates in time. Previously, the elastic deformation
Figure 2.2: Effects of performing an elastic deformation using a piecewise linear function on a synthetically generated crack. (a) Synthetic image with grid lines for reference (fixed and moving control points shown). (b) Deformed image with the parameters σ = 5, α = 35, and α_g = 100.
was used in the data augmentation of hand-written images [225]. Castro et al. [32] used the same approach to generate synthetic breast cancer screening images. The most widely used
augmentation techniques are random rotation, horizontal, vertical flipping, and chromatic
intensity variations [141]. All these techniques do not generate the new structural details,
meaning this preserves the content as it is. For example, the deformation of the cracks
produces another semantically and structurally varying crack that still resembles a crack.
Thus elastic deformation of the cracks is more convenient as compared to the regular image
augmentation approaches. To the best of the authors' knowledge, this is the first work in the civil engineering community that used the elastic deformation technique to augment a crack dataset. In addition, this work expands the technique by using various geometric transformations for achieving large deformations of synthetic cracks.
Synthetic augmentation consists of three parts. First, a random field with small displace-
ments, δx, and δy, in horizontal and vertical directions is produced. This ensures that the
pixels maintain proximity to its previous position. For each pixel, a uniform random value of
the range $U \in [-1, 1]$ is applied in both directions separately (Equations (2.2) and (2.3)). This guarantees that the synthetic cracks' edges are slightly discontinuous yet hold global shape characteristics. A Gaussian kernel, $\sigma$, is used to smooth the roughness in the pixel displacements:
\[
\delta_{x} = G(\sigma) \ast \big( \alpha\, U(n, m) \big), \tag{2.2}
\]
\[
\delta_{y} = G(\sigma) \ast \big( \alpha\, U(n, m) \big), \tag{2.3}
\]
\[
I_{t}\big( i + \delta_{x}(i, j),\ j + \delta_{y}(i, j) \big) = \alpha_{g}\, I(i, j), \tag{2.4}
\]
where $G$ is the Gaussian kernel of scale $\sigma$, $\alpha$ is the scaling factor, $n$ and $m$ are the dimensions of the image in the horizontal and vertical directions, and $I$ and $I_{t}$ are the original and transformed images, respectively.
Second, the control points for the geometric transformations (affine, projective, polynomial,
and piecewise linear) are carefully chosen [95]. After this, a scaling factor to displace the
fixed control points of the geometric transformation to the moving points, $\alpha_{g}$, is selected.
Figure 2.2a shows a synthetic crack with the fixed control points and moving control points
for a piecewise linear transformation. Lastly, the distortion transform is applied to the
original image using a bicubic interpolation. An illustration of the deformation is shown in
the Figure 2.2b.
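A minimal sketch of the random-displacement part of this augmentation (Equations (2.2)–(2.3)) is shown below; it uses assumed parameter values and scipy's map_coordinates for the warp, rather than the exact geometric-transformation toolchain described above.

    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def elastic_deform(image, alpha=35.0, sigma=5.0, seed=0):
        """Warp a 2-D image with Gaussian-smoothed uniform random displacements."""
        rng = np.random.default_rng(seed)
        n, m = image.shape
        # delta_x, delta_y = G(sigma) * (alpha * U(n, m)),  U in [-1, 1]  (Eqs. 2.2-2.3)
        dx = gaussian_filter(alpha * rng.uniform(-1, 1, (n, m)), sigma)
        dy = gaussian_filter(alpha * rng.uniform(-1, 1, (n, m)), sigma)
        rows, cols = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
        coords = np.array([rows + dy, cols + dx])
        # Bicubic interpolation (order=3) of the displaced coordinates.
        return map_coordinates(image, coords, order=3, mode="reflect")

    # Example: deform a binary synthetic crack image (values 0/1).
    crack = np.zeros((128, 128)); crack[:, 60:64] = 1.0
    deformed = elastic_deform(crack)

The piecewise linear (or affine, projective, polynomial) control-point transformation with the scaling factor α_g would be applied on top of this displacement field to obtain the larger deformations discussed above.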
2.2.2 A hybrid filter
Prior conventional approaches focused on the segmentation of the cracks either using denoising
filters such as mean and median [243] or by contrast enhancement [265]. These filters perform
well when the noise level is minimal. Furthermore, the noisy texture is the other important
aspect that directly affects the crack segmentation capability of algorithms. In this work, a
new refinement paradigm is proposed, which suppresses the texture with minimal harm to
the crack pixels. Generally, structural cracks are curvilinear. This aspect has been exploited
by using conventional filters such as multiscale Hessian-based vesselness and morphological
methods. Here, the MFAT filtering approach is adapted to segment the cracks on concrete
surfaces.
The organizational flowchart of the hybrid filter is shown in Figure 2.3. Firstly, an
Figure 2.3: A block diagram of the proposed hybrid filter: input image (color/grayscale) → anisotropic filtering → eigenanalysis of the Hessian matrix for the MFAT filter → image binarization (Otsu's method) → supervised learning → output image (crackmap).
input image will be filtered through an anisotropic diffusion. This ensures that most of
the small-curvature dominant non-crack features are eliminated. Secondly, multiscale Eigen
decomposition of the Hessian matrix of a FAT filter guarantees that the curvature-prevalent
cracks are highlighted. Thirdly, a crack response image from the MFAT filter will be binarized.
Lastly, a supervised classification system will distinguish the crack and non-crack entities.
2.2.2.1 Anisotropic diffusion filter
Windowed filtering kernels such as Gaussian, median, and average are widely used in denoising
the images. Although these kernels are effective in their task, as a downside, they will remove
the vital information if the kernel sizes are large. Furthermore, achieving the scale-invariant
properties through these filters is difficult. Usually, cracks and their surrounding regions have
a strong gradient in their intensity. If this information is suppressed, then the accuracy of
crack detection is compromised. Therefore, an anisotropic filter is used to smooth the noisy
texture region with minimal damage to the crack pixels [198, 86].
A continuous anisotropic diffusion equation is given by:
\[
\frac{\partial \varphi(\bar{x}, t)}{\partial t} = \nabla \cdot \big[ D(\bar{x}, t)\, \nabla \varphi(\bar{x}, t) \big], \tag{2.5}
\]
where $\bar{x}$ is the vector that represents the spatial coordinates ($\mathbb{R}^{n}$) and $t$ is the diffusion time, or iterations in a discrete sense. $D(\bar{x}, t)$ stands for the space-time varying diffusion coefficient, and the function $\varphi(\bar{x}, t)$ corresponds to the image intensity. Throughout this work the images used are two-dimensional ($\bar{x} \in \mathbb{R}^{2}$, i.e., $(x, y)$ is the pixel location).
If the diffusion coefficient is constant, then the isotropic diffusion equation is as shown below:
\[
\frac{\partial \varphi(\bar{x}, t)}{\partial t} = D\, \nabla^{2} \varphi(\bar{x}, t). \tag{2.6}
\]
The above equation is the popular heat equation. The fundamental solution of Equation (2.6) has the form
\[
\Phi(\bar{x}, t) = \frac{1}{\sqrt{(4 \pi D t)^{n}}} \exp\left( -\frac{\bar{x}^{T} \bar{x}}{4 D t} \right). \tag{2.7}
\]
Equation (2.7) is the linear heat equation solution, which is widely used in image processing
and computer vision applications. It hardly respects any change in the image gradient and
smooths out the edges of the given image, which is undesirable for the extraction of a crack
map. Figure 2.4 depicts the diffusion direction in isotropic and anisotropic filters for the
four-noded neighborhood. The two-dimensional space-time discretization of the Equation (2.5)
Figure 2.4: Different types of image filtering. (a) Isotropic filtering, where the pixel intensities are smoothed uniformly in all diffusion directions. (b) Anisotropic filtering, where edge (crack) pixel intensities are preserved.
is as follows:
\[
I(\bar{x}, t + \Delta t) = I(\bar{x}, t) + \Delta t\, \frac{\partial}{\partial t} I(\bar{x}, t), \tag{2.8}
\]
where the discrete form of $\frac{\partial}{\partial t} I(t)$ in an eight-noded neighbor scenario is given by:
\[
\begin{aligned}
\frac{\partial}{\partial t} I(\bar{x}, t) ={}& \frac{1}{\Delta x^{2}} \Big[ D\big(x + \tfrac{\Delta x}{2}, y, t\big)\big(I(x + \Delta x, y, t) - I(x, y, t)\big) \\
&\qquad - D\big(x - \tfrac{\Delta x}{2}, y, t\big)\big(I(x, y, t) - I(x - \Delta x, y, t)\big) \Big] \\
&+ \frac{1}{\Delta y^{2}} \Big[ D\big(x, y + \tfrac{\Delta y}{2}, t\big)\big(I(x, y + \Delta y, t) - I(x, y, t)\big) \\
&\qquad - D\big(x, y - \tfrac{\Delta y}{2}, t\big)\big(I(x, y, t) - I(x, y - \Delta y, t)\big) \Big] \\
&+ \frac{1}{\Delta d^{2}} \Big[ D\big(x + \tfrac{\Delta x}{2}, y + \tfrac{\Delta y}{2}, t\big)\big(I(x + \Delta x, y + \Delta y, t) - I(x, y, t)\big) \\
&\qquad - D\big(x - \tfrac{\Delta x}{2}, y - \tfrac{\Delta y}{2}, t\big)\big(I(x, y, t) - I(x - \Delta x, y - \Delta y, t)\big) \Big] \\
&+ \frac{1}{\Delta d^{2}} \Big[ D\big(x - \tfrac{\Delta x}{2}, y + \tfrac{\Delta y}{2}, t\big)\big(I(x - \Delta x, y + \Delta y, t) - I(x, y, t)\big) \\
&\qquad - D\big(x + \tfrac{\Delta x}{2}, y - \tfrac{\Delta y}{2}, t\big)\big(I(x, y, t) - I(x + \Delta x, y - \Delta y, t)\big) \Big],
\end{aligned}
\tag{2.9}
\]
where $\Delta x = \Delta y = 1$ pixel. The diagonal step size $\Delta d = \sqrt{2}$ pixels can be used if the cross pixels are considered. Also, $I(\bar{x})$ represents a grayscale image.
It is common to consider the immediate (four-node) neighboring pixels while taking the
partial derivatives. However, to enhance the diffusion flow, it is recommended to use the
eight-node [86]. In this study, an eight-node grid is considered to perform the diffusion in
image pixels Equation (2.9).
The anisotropic filter’s impressive property is that it sharpens and preserves the edges
while blurring the texture and small discontinuities when the gradient of the image is small.
Perona and Malik [198] used the diffusion or conductance function Equation (2.10) in their
work as it privileges high-contrast edges over low-contrast ones, but, as the diffusion iterations
increase, this function over-smooths the edges as shown in Figure 2.5c. This happens as
the error norm of the diffusion function asymptotically goes to zero as the iterations are
increased [25]. To overcome this, Black et al. [25] proposed Tukey’s biweight diffusion function
Equation (2.11) and the same is used in this study to perform the anisotropic smoothing of
the grayscale images as the cracks have high contrast compared to the background.
\[
D(\bar{x}, t) = \exp\left( -\left( \frac{|\nabla I(\bar{x}, t)|}{\kappa} \right)^{2} \right), \tag{2.10}
\]
\[
D(\bar{x}, t) =
\begin{cases}
\dfrac{1}{2} \left[ 1 - \left( \dfrac{|\nabla I(\bar{x}, t)|}{\kappa_{e}} \right)^{2} \right]^{2} & \text{if } |\nabla I(\bar{x}, t)| \leq \kappa_{e}, \\[8pt]
0 & \text{otherwise},
\end{cases}
\tag{2.11}
\]
where $\kappa$ is the arbitrary diffusion constant. It is defined based on the output of the crack map, and $\kappa_{e} = \kappa \sqrt{2}$. As the gradient of the image, $|\nabla I(\bar{x}, t)|$, increases, the diffusion effect decreases.
Figure 2.5: Comparison of the isotropic and anisotropic diffusion using two different functions. (a) Original image. (b) Isotropic filtering using a Gaussian kernel. (c) Anisotropic filtering with the exponential diffusion function. (d) Anisotropic filtering with Tukey's biweight diffusion function.
Figure 2.5 shows the outcome of the isotropic and anisotropic filtering procedures. Figure 2.5a shows a heterogeneous texture image of a concrete crack (grayscale image). Figures 2.5b to 2.5d illustrate the outcomes of the isotropic and anisotropic diffusion filters, respectively. For both filters, 200 iterations were used to smooth the image. In the case of isotropic (Gaussian) diffusion, a 3 × 3 kernel with σ = 0.5 was used. For anisotropic diffusion, Δt = 1/7 [86] and κ = 30 were used. In Figure 2.5c, it is evident that the anisotropic filter smoothed the texture details, but discontinuities exist in the cracks, and thin crack details were not preserved. From Figure 2.5d, the crack details were well preserved; a sharp noise due to the dark speckles remains visible in the image, which a decision-making system can take care of. In contrast, the Gaussian filter smoothed throughout the image irrespective of edges, and the texture noise of the image remained intact even after 200 iterations.
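The following minimal sketch illustrates the discrete diffusion update of Equations (2.8)–(2.11) with Tukey's biweight conductance; it is a simplified illustration on a four-node neighborhood (the study itself uses the eight-node stencil), and the iteration count, Δt, and κ values are the nominal ones quoted above.

    import numpy as np

    def tukey_conductance(grad, kappa):
        """Tukey's biweight diffusion function (Equation 2.11), with kappa_e = kappa * sqrt(2)."""
        kappa_e = kappa * np.sqrt(2.0)
        d = 0.5 * (1.0 - (grad / kappa_e) ** 2) ** 2
        return np.where(np.abs(grad) <= kappa_e, d, 0.0)

    def anisotropic_diffusion(img, iterations=200, dt=1.0 / 7.0, kappa=30.0):
        """Four-node (N, S, E, W) anisotropic diffusion of a grayscale image."""
        img = img.astype(float).copy()
        for _ in range(iterations):
            # Finite differences to the four neighbors (zero flux at the borders).
            dN = np.roll(img, 1, axis=0) - img;  dN[0, :] = 0
            dS = np.roll(img, -1, axis=0) - img; dS[-1, :] = 0
            dW = np.roll(img, 1, axis=1) - img;  dW[:, 0] = 0
            dE = np.roll(img, -1, axis=1) - img; dE[:, -1] = 0
            # Conductance-weighted flux sum (discrete form of Equations 2.8-2.9).
            img += dt * (tukey_conductance(dN, kappa) * dN +
                         tukey_conductance(dS, kappa) * dS +
                         tukey_conductance(dW, kappa) * dW +
                         tukey_conductance(dE, kappa) * dE)
        return img

Replacing tukey_conductance with the exponential function of Equation (2.10) reproduces the over-smoothing behavior seen in Figure 2.5c as the iterations increase.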
2.2.2.2 Multiscale fractional anisotropy tensor
Fractional anisotropy measures the degree of the anisotropy in the diffusion process. It is
used in the Diffusion Tensor Imaging (DTI) of the brain to estimate the fiber structures using
water diffusion [56]. Also, it is used for the vessel extraction in retinal images [6]. Most cracks
possess unique shape characteristics even within different materials and textural conditions;
curvature is one of these features. Cracks can be of any size and have curvy patterns, and
MFAT filter based on the Hessian matrix [6] is adapted to segment the cracks on the concrete
surface images and to minimize noise simultaneously. A two-dimensional Hessian matrix is
given by:
\[
\nabla^{2} I(x, y) =
\begin{bmatrix}
I_{xx}(x, y) & I_{xy}(x, y) \\
I_{yx}(x, y) & I_{yy}(x, y)
\end{bmatrix},
\tag{2.12}
\]
where $I_{xx}(x, y) = \frac{\partial^{2} I(x, y)}{\partial x^{2}}$, $I_{xy}(x, y) = \frac{\partial^{2} I(x, y)}{\partial x \partial y}$, $I_{yx}(x, y) = \frac{\partial^{2} I(x, y)}{\partial y \partial x}$, and $I_{yy}(x, y) = \frac{\partial^{2} I(x, y)}{\partial y^{2}}$.
Due to the associative property of the convolution, derivatives of Gaussian kernels are
possible. The general form is given by:
\[
\frac{\partial}{\partial \bar{x}} I(\bar{x}, \sigma_{i}) = \sigma_{i}^{\gamma}\, I(\bar{x}) \ast \frac{\partial}{\partial \bar{x}} G(\bar{x}, \sigma_{i}), \tag{2.13}
\]
where scale-space analysis is performed on the image by using multiscale Gaussian kernels $\sigma_{i}$, $i \in \{1, 2, 3, \ldots, n\}$, each $\sigma_{i}$ being the standard deviation; $\gamma$ is the scaling parameter, and $\frac{\partial}{\partial \bar{x}} G(\bar{x}, \sigma_{i})$ is the two-dimensional Gaussian kernel in this framework. The second-order partial derivatives are convolved with the image to obtain the Hessian matrix (of order 2 × 2) at each pixel.
The modified FAT equation that accounts for differently signed eigenvalues, for a more uniform response [6], is given by:
\[
\mu_{\text{crack}}(\sigma_{i}) = \sqrt{\frac{3}{2}}\;
\frac{\sqrt{ (p_{2} - \bar{D}_{p})^{2} + (p_{\rho} - \bar{D}_{p})^{2} + (p_{\nu} - \bar{D}_{p})^{2} }}
{\sqrt{ p_{2}^{2} + p_{\rho}^{2} + p_{\nu}^{2} }},
\tag{2.14}
\]
where $\bar{D}_{p} = \frac{1}{3} T_{r}$ is the mean diffusivity, $T_{r} = \sum_{i=1}^{2} \lambda_{i}$ is the trace of the diffusion tensor, and $p_{2} = \left| \frac{\lambda_{2}}{T_{r}} \right|$, $p_{\rho} = \left| \frac{\lambda_{\rho}}{T_{r}} \right|$, and $p_{\nu} = \left| \frac{\lambda_{\nu}}{T_{r}} \right|$ are the regularization terms. $\lambda_{\rho}$ and $\lambda_{\nu}$ are defined as:
\[
\lambda_{\rho,\nu} =
\begin{cases}
\lambda_{2}(x, y, \sigma) & \text{if } \lambda_{2}(x, y, \sigma) > \tau_{\rho,\nu} \max\limits_{x,y} \big( \lambda_{2}(x, y, \sigma) \big), \\[4pt]
\tau_{\rho,\nu} \max\limits_{x,y} \big( \lambda_{2}(x, y, \sigma) \big) & \text{if } 0 < \lambda_{2}(x, y, \sigma) \leq \tau_{\rho,\nu} \max\limits_{x,y} \big( \lambda_{2}(x, y, \sigma) \big), \\[4pt]
0 & \text{otherwise},
\end{cases}
\tag{2.15}
\]
where $\tau_{\rho}$ and $\tau_{\nu}$ are the cut-off thresholds between $[0, 1]$.
The response conditions that remove the noise from the background are given by:
\[
R_{\sigma} =
\begin{cases}
0 & \text{if } \lambda_{\rho} > \lambda_{\rho} - \lambda_{2} \ \lor\ \lambda_{\rho} \geq 0 \ \lor\ \lambda_{2} \geq 0 \ \lor\ \lambda_{\rho}\lambda_{2} = \max\limits_{x,y}\big( \lambda_{\rho}\lambda_{2} \big), \\[4pt]
1 & \text{if } \lambda_{\rho}\lambda_{2} = \min\limits_{x,y}\big( \lambda_{\rho}\lambda_{2} \big), \\[4pt]
1 - \mu_{\text{crack}}(\sigma_{i}) & \text{otherwise}.
\end{cases}
\tag{2.16}
\]
Using the magnitude regularization, the maximized co-addition of the response at the junctions is obtained at each scale $\sigma_{i}$, and the final enhancement equation is given by:
\[
\mu_{\text{crack}}(\sigma_{i}) = \mu_{\text{crack}}(\sigma_{i-1}) + \delta\big( R_{\sigma} - \delta \big), \tag{2.17}
\]
\[
\mu_{\text{final}} = \max\big( \mu_{\text{crack}}(\sigma_{i}),\ R_{\sigma} \big), \tag{2.18}
\]
where $\sigma_{i}$ is the current or $i$-th scale, $\sigma_{i-1}$ is the previous scale, and $\delta$ is the step size of the solution. A range between $[0, 1]$ for $\delta$ is ideal for crack segmentation.
Figure 2.6: A multiscale response of the FAT crack function. From left to right: scales σ = 1–4, and the final crack response, which varies from [0, 1].
Figure 2.6 shows the multiscale response of the MFAT filter, where σ = 1–4 are the scales used to extract the cracks on the concrete surface images. From the final response, it can be noted that small speckles are present due to the maximum response at the lower scales. All other anisotropic diffusion parameters remain the same, as discussed above. The crack measure response in Figure 2.6 ranges from 0 to 1. The higher values correspond to a high crack index, i.e., the solid presence of a crack pixel, while the lower values indicate the opposite in extracting the crack features. Figure 2.7 shows the binary crackmap after image binarization.

Figure 2.7: A crackmap response from the MFAT curvilinear filters at scales σ = 1–4.
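A minimal sketch of the multiscale Hessian eigenanalysis that underlies this filter is given below; it computes Gaussian-derivative Hessians at several scales and returns the scale-wise eigenvalues (sorted so that |λ1| ≤ |λ2|), which Equations (2.14)–(2.18) then convert into the crack response. The scale range follows the σ = 1–4 values quoted in this section, while γ and the other details are illustrative assumptions rather than the exact MFAT implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def multiscale_hessian_eigenvalues(img, sigmas=(1, 2, 3, 4), gamma=1.0):
        """Return (lam1, lam2) per scale, |lam1| <= |lam2|, from scale-normalized Hessians."""
        img = img.astype(float)
        results = []
        for s in sigmas:
            # Scale-normalized second derivatives (Gaussian-derivative convolutions, cf. Eq. 2.13).
            Ixx = (s ** gamma) * gaussian_filter(img, s, order=(0, 2))
            Iyy = (s ** gamma) * gaussian_filter(img, s, order=(2, 0))
            Ixy = (s ** gamma) * gaussian_filter(img, s, order=(1, 1))
            # Closed-form eigenvalues of the 2x2 Hessian (Equation 2.12) at every pixel.
            tmp = np.sqrt((Ixx - Iyy) ** 2 + 4.0 * Ixy ** 2)
            l1 = 0.5 * (Ixx + Iyy - tmp)
            l2 = 0.5 * (Ixx + Iyy + tmp)
            swap = np.abs(l1) > np.abs(l2)          # enforce |lam1| <= |lam2|
            lam1 = np.where(swap, l2, l1)
            lam2 = np.where(swap, l1, l2)
            results.append((lam1, lam2))
        return results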
2.2.2.3 Supervised learning system
Cracks are randomly shaped and hardly have any defining features to exploit. Hessian-
based MFAT, vesselness, and morphological filters can reasonably extract the cracks; however, false alarm pixels are also produced. To completely distinguish between crack and non-crack
patterns, an autonomous decision system is desired. In this work, machine learning approaches
such as Artificial Neural-Network (ANN), Support Vector Machine (SVM), and k-Nearest
Neighbors (k-NN) are used to filter the false positive pixels by keeping the actual crack pixels
[62, 24].
A block diagram of supervised learning is shown in Figure 2.8. Here, the input was the
binarized response image using Otsu’s method [192] for the methods in action. First, the
feature matrix was constructed for each of the individual connected components with their
corresponding truth labels. This was followed by the training operation of the aforementioned
classifiers. For the testing, a trained classifier is used to predict the labels of the binarized
response images.
Figure 2.8: A block diagram of the decision making system using the supervised learning paradigm (for training and testing): image binarization (Otsu's method) and the ground-truth or crackmap image provide the train/test labels for the classifiers.
The classifier attributes used during training and testing for all three conventional methods are as follows: firstly, for the ANN, a shallow network with one Hidden Layer (HL) and ten neuron units was used; secondly, for k-NN, 5 nearest neighbors were used; and the SVM was trained with a radial basis function kernel.
2.2.2.3.1 Feature selection In any supervised learning paradigm, selecting the features
plays an essential role in the classifiers’ outcome. In this study, initially, a 22-dimensional
feature vector was defined that included the chromatic and geometric information. Since the
synthetic cracks are binary, using geometric features was more suitable. Thus, the features
recommended by Jahanshahi et al. [117] were leveraged in this study, as they were more
concise and effective in the object segmentation task. The feature vector consists of the
following features: 1. correlation coefficient, 2. eccentricity, 3. area divided by the ellipse
area, 4. solidity, and 5. compactness. Features 2 and 3 are scale- and rotation-invariant. In
contrast, feature 1 is scale-invariant but not rotation-invariant, and the other two are scale-
and rotation-variant.
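As an illustration, the five geometric features above can be assembled per connected component with MATLAB's regionprops; the following is a hedged sketch (not the authors' feature-extraction code), and the exact definitions of the correlation-coefficient and compactness features used here are assumptions.

% BW: binarized response image. Builds an N-by-5 feature matrix F for its CCs.
CC    = bwconncomp(BW);
stats = regionprops(CC, 'Area', 'Eccentricity', 'Solidity', ...
                    'MajorAxisLength', 'MinorAxisLength', 'Perimeter', 'PixelList');
F = zeros(CC.NumObjects, 5);
for k = 1:CC.NumObjects
    s  = stats(k);
    xy = s.PixelList;                                  % pixel coordinates of the CC
    r  = corrcoef(xy(:, 1), xy(:, 2));                 % feature 1 (assumed: x-y correlation)
    ellipseArea = pi * (s.MajorAxisLength/2) * (s.MinorAxisLength/2);
    compactness = 4*pi*s.Area / max(s.Perimeter, 1)^2; % one common compactness definition
    F(k, :) = [r(1, 2), s.Eccentricity, s.Area/ellipseArea, s.Solidity, compactness];
end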
2.2.2.3.2 Classification In this study, two classes, namely non-crack and crack Connected
Components (CC), are considered. A training set of these CCs is used to train each classifier,
which is then tested on three independent real-world image sets. In order to quantify these
datasets by their texture, a classification system using the Gray-Level Co-Occurrence Matrix
(GLCM) features (Haralick et al. [102]) was utilized. Using k-means clustering, the textures
are grouped as high (rough) or low (smooth) based on their centroids' Euclidean distance.
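A minimal MATLAB sketch of this texture grouping is shown below (illustrative, not the dissertation's code; imgs is an assumed cell array of grayscale test images, and the choice of four Haralick properties is an assumption).

% GLCM (Haralick) features per image, then k-means into two texture groups.
feats = zeros(numel(imgs), 4);
for k = 1:numel(imgs)
    G = graycomatrix(imgs{k}, 'NumLevels', 8, 'Symmetric', true);
    p = graycoprops(G, {'Contrast', 'Correlation', 'Energy', 'Homogeneity'});
    feats(k, :) = [p.Contrast, p.Correlation, p.Energy, p.Homogeneity];
end
[grp, C] = kmeans(normalize(feats), 2);       % two clusters: rough vs. smooth
[~, rough] = max(C(:, 1));                    % label the higher-contrast centroid as rough
isHighTexture = (grp == rough);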
2.3 Multiscale vesselness filter
Multiscale, second-order local structures of the images (Hessian) were used to develop the
vessel enhancement filter [79], also known as “Frangi filter”. A vesselness or “crack” (as it
is called in this work) measure is obtained by the eigenvalues of the Hessian matrix of the
image, and the eigenvectors are used for the principal directions of the ridges of the vessels or
cracks. Originally, medical images such as two-dimensional Digital Subtraction Angiography
(DSA), three-dimensional aortoiliac, and cerebral Magnetic Resonance Angiography (MRA)
were used for the vessel extraction. Later, in the civil engineering community, Hessian-based
crack functions were adapted to segment the cracks on concrete and pavement surfaces
[81, 83, 82, 87].
The multiscale crack function measures the principal curvature [79] and is given by:
\mu_{\mathrm{crack}}(\sigma_{i}) =
\begin{cases}
\left[1 - \exp\!\left(-\dfrac{S^{2}}{2c^{2}}\right)\right]\exp\!\left(-\dfrac{M_{b}^{2}}{2\beta^{2}}\right) & \text{if } \lambda_{2} > 0, \\[6pt]
0 & \text{otherwise,}
\end{cases}
\qquad (2.19)
where λ_1 and λ_2 are the eigenvalues of Equation (2.12), M_b = λ_1/λ_2 measures the blobness
of each pixel, S = √(λ_1² + λ_2²), and β and c are the constants which govern the sensitivity of
the crack filter. λ_1 ≤ λ_2 for the dark cracks, and vice-versa for bright cracks. In this work, it
is noted that for most of the crack images, β = 0.5 and c = 25 produced clean crackmaps
consistently. Equation (2.19) can be viewed as a likelihood function which maps the geometric
meaning of the eigen-decomposition of the Hessian matrix to the crack criterion.
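To make Equation (2.19) concrete, the following single-scale MATLAB sketch (an illustrative implementation, not the dissertation's code) builds the Hessian from Gaussian second-derivative filters and evaluates the crack measure for dark cracks; I is assumed to be a grayscale image in [0, 1].

function mu = crackMeasure(I, sigma, beta, c)
    % Scale-normalized second-order Gaussian derivatives (Hessian entries).
    [x, y] = meshgrid(-ceil(3*sigma):ceil(3*sigma));
    G   = exp(-(x.^2 + y.^2) / (2*sigma^2)) / (2*pi*sigma^2);
    Ixx = sigma^2 * imfilter(I, ((x.^2 - sigma^2) / sigma^4) .* G, 'replicate');
    Iyy = sigma^2 * imfilter(I, ((y.^2 - sigma^2) / sigma^4) .* G, 'replicate');
    Ixy = sigma^2 * imfilter(I, (x .* y / sigma^4) .* G, 'replicate');
    % Eigenvalues of the 2-by-2 Hessian at every pixel (lam1 <= lam2).
    tmp  = sqrt((Ixx - Iyy).^2 + 4*Ixy.^2);
    lam1 = 0.5*(Ixx + Iyy - tmp);
    lam2 = 0.5*(Ixx + Iyy + tmp);
    Mb = lam1 ./ (lam2 + eps);                % blobness ratio
    S  = sqrt(lam1.^2 + lam2.^2);             % second-order structureness
    mu = (1 - exp(-S.^2 / (2*c^2))) .* exp(-Mb.^2 / (2*beta^2));
    mu(lam2 <= 0) = 0;                        % dark cracks require lambda_2 > 0
end

The per-scale responses from such a function are then merged across scales as described next.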
The responses at various scales, σ_i, have to be merged to obtain a single response image.
This ensures that the scale roughly matches the crack to be detected. The final crack response
is given by:

\mu_{\mathrm{final}} = \bigcup_{\sigma_{1}}^{\sigma_{n}} \mu_{\mathrm{crack}}(\sigma_{i}), \qquad i \in 1, 2, 3, \ldots, n \qquad (2.20)
Figure 2.9 shows the response of the crack filter (Equation (2.19)) to various Gaussian kernel
scales σ. Similar to Figure 2.6, the crack measure ranges from 0 to 1. It is evident that the
amount of false-alarm pixels is high compared to the MFAT response image, Figure 2.7.

Figure 2.9: A crackmap from the union of the multiscale crack filters at scales σ = 1 to 10.
2.4 Multiscale morphological method
The morphological method for image segmentation is motivated by mathematical set theory
[92]. Morphological methods are based on the two fundamental operations: dilation, and
erosion. Dilation expands the foreground CC objects of an image, and erosion shrinks the
same. An erosion followed by a dilation operation removes bright CCs from an image and is
called a morphological opening. In contrast, the opposite operations remove the dark CCs
from an image. Salembier [215] adapted these morphological methods with a bottom-hat
transform to segment the defects in the images. The following Equation (2.21) shows the
slightly modified bottom-hat transform adapted to segment the cracks on a concrete surface
[117].
T = \max\left( I \circ S_{\{0^{\circ},\,45^{\circ},\,90^{\circ},\,135^{\circ}\}} \bullet S_{\{0^{\circ},\,45^{\circ},\,90^{\circ},\,135^{\circ}\}},\; I \right) - I, \qquad (2.21)

where '∘' and '•' denote the morphological opening and closing operations, respectively, I
is the gray-scale image, and S is a structuring element.

Figure 2.10: A crackmap from the union of the multiscale morphological filters of a line
structuring element, S, at scales 3 to 143 pixels with an interval of 5.

A structuring element is used as a
spatial convolution kernel to keep or eliminate the neighborhood pixels in the morphological
operations. One should be careful while choosing a structuring element, as it directly
determines the shape and size of the cracks to be extracted from an image. In this work, a
linear structuring element (line-shaped) with four different orientations (0°, 45°, 90°, 135°) was
used to segment cracks from grayscale images. Figure 2.10 shows the multiscale response of
the modified bottom-hat transform of the structuring element at scales of 3 to 143 pixels with
an interval of 5. Similar to the MFAT and crack filters, the responses at various structuring
element scales have to be merged using Equation (2.20). It can be seen that even while using
a large structuring element, the noisy components are present as compared to the MFAT and
vesselness methods. It should be noted that when a smaller structuring element scale that
ranges from 7 to 142 pixels with the same interval was used, the response image has a large
number of false positives.
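A hedged MATLAB sketch of the multiscale modified bottom-hat transform of Equation (2.21) with line structuring elements is given below (illustrative only; the image file name is hypothetical, and the scales follow Figure 2.10).

I = im2double(imread('concrete.jpg'));          % hypothetical input image
if size(I, 3) == 3, I = rgb2gray(I); end
T = zeros(size(I));
for len = 3:5:143                               % structuring element scales
    for ang = [0 45 90 135]                     % line orientations (degrees)
        S  = strel('line', len, ang);
        oc = imclose(imopen(I, S), S);          % opening followed by closing
        T  = max(T, max(oc, I) - I);            % union (max) across scales and angles
    end
end
bw = imbinarize(T, graythresh(T));              % Otsu binarization of the response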
2.5 Deep convolutional neural network
Neural networks are universal function approximators [210]. With careful supervision, they
can be trained to approximate any function with considerable accuracy. Deep Convolutional
Neural Networks (DCNN) have revolutionized object detection and segmentation research over
the past 6-7 years [217]. Since 2017, DCNNs have become prominent in the civil engineering
community for recognizing defects in civil infrastructure [34]. This is due to their autonomous
feature-learning capability [147], in contrast to feature-engineering methods where the features
are hand-designed. As a caveat, DCNNs require a large amount of data to prevent overfitting
[141]. A DCNN requires
annotated data that has been labeled as positive, negative or multi-class. A typical DCNN
has input, convolutional, pooling, and output layers. The input layer reads the image and
transfers it to the convolutional layers. Next, the convolutional outputs are sub-sampled
using the maximum or average pooling layers, followed by the output layer, which, as the
name suggests, outputs the final results. In classification, object detection, and segmentation
scenarios, the outputs could be probabilities of the classes.
Figure 2.11: A crackmap of DeepCrack after fusing the side-outputs, using [165].
In this work, the conventional filters are compared against the state-of-the-art deep
learning-based CNN, DeepCrack [165]. DeepCrack architecture adapts the VGG network as
the backbone for the semantic segmentation of the cracks on concrete and pavement texture.
It has 13 convolutional layers, five side-output layers, and the refinement module based on
the Conditional Random Field (CRF) and Guided Filtering (GF). In this work, their public
dataset is used for the comparison of the conventional methods. Figure 2.11 shows the binary
image output of the CNN. Compared to all three traditional approaches, DeepCrack
produced a clean crack map. For more details, readers are referred to [165].
2.6 Experimental results and discussion
2.6.1 Dataset preparation
2.6.1.1 Training dataset
2.6.1.1.1 Non-cracks connected components: To train the classifier for the object
segmentation (crack/non-crack) task, 59,999 small patches of non-crack images were randomly
cropped from 50 images of size 5152 × 3864 pixels. Image patches of varying sizes were used
(50 × 50 to 480 × 480 pixels). The MFAT, vesselness, and morphological filters were applied
to these image patches. The responses from these filters produced 3.3 million non-crack
objects, or CCs, for the MFAT and vesselness filters, and 7.3 million for the morphological
method. To have balanced crack and non-crack datasets for training, 100,000 non-crack samples
were randomly down-sampled from the MFAT, vesselness, and morphological responses to
maintain class balance. This reduces the possibility of the classifier being biased toward a
single class, which can occur when imbalanced samples are used.
Dataset type        Image size (minimum)   Image size (maximum)   Texture type       Image type   Image quality
Non-cracks          50 x 50                480 x 480              Concrete           Color        Medium
Synthetic cracks    73 x 57                1280 x 1280            Smooth intensity   Binary       High

Table 2.1: Training dataset attributes.
Table 2.1 displays the training sample attributes. For the non-crack images, concrete-textured
images of color modality were used. Figure 2.12a shows the non-crack components used for the
classifier training. Due to the montage plotting, the images of varying sizes are resized to
100 × 100 pixels for display purposes; thus, some CCs look the same. The area of the non-crack
connected components varied from 5 to more than 5000 pixels.

Figure 2.12: Training examples (due to the montage plotting, images of various sizes have
been resized to 100 × 100 pixels; thus, they look the same, but they are of varying scales).
(a) Non-crack samples obtained from the responses of the three conventional filters. (b) Crack
samples generated from the seam carving technique with morphological dilation.
2.6.1.1.2 Synthetic cracks: Crack seams were generated based on the minimization and
maximization of the cost function; this produced seams that have medium to high tortuosity
between two points in a random noise image. In this study, either a uniform or a Gaussian noise
image is selected, each with a probability of 0.5. In total, 52,670 unique seams were generated
with a varying block size of 50 to 1280 pixels. Seams are rotated at a random angle between 0°
and 90° to account for the rotation of the cracks. To incorporate the scale effects of the crack
thickness, each seam is dilated morphologically using a rectangular structuring element of a
maximum width of 10 pixels and a depth of 2 pixels. This procedure produces discontinuous
edges similar to the real cracks on concrete surfaces.
After generating the synthetic cracks, an elastic deformation was performed on the images
to obtain the deformed synthetic cracks. A uniformly sampled [-1, 1] random displacement
field is filtered using a Gaussian kernel of scale σ to produce a smooth field among the pixels.
Next, this displacement field is scaled by the scalar value α to give it a larger magnitude. If α is
less than 15, the displacement field's magnitude is negligible for deforming the synthetic cracks,
as this will only produce jagged edges. It is necessary to have discontinuous edges, as most
concrete cracks possess these characteristics.
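The elastic deformation step can be sketched in MATLAB as follows (illustrative, not the dissertation's code; BW is an assumed binary synthetic crack image, and the values of σ and α are examples within the ranges discussed here).

sigma = 5;  alpha = 25;                              % illustrative kernel scale and magnitude
[h, w] = size(BW);
dx = imgaussfilt(2*rand(h, w) - 1, sigma) * alpha;   % smoothed random x-displacements
dy = imgaussfilt(2*rand(h, w) - 1, sigma) * alpha;   % smoothed random y-displacements
D  = cat(3, dx, dy);                                 % displacement field for imwarp
BWdef = imwarp(double(BW), D, 'cubic') > 0.5;        % elastically deformed crack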
During the vision-based inspection of concrete structures, the cracks in the scene can be
distorted when the camera pose is changed. This transformation is generally either affine or
projective (perspective) in nature. To simulate these effects, four geometric transformation
functions are used to warp the synthetic crack images, namely, affine, projective, polynomial,
and piecewise linear transformations. The first two help in mimicking the real world camera
viewpoint situation in perceiving the cracks, and the remaining two aid in increasing the
magnitude of the spatial deformation of the cracks. Each of these four methods requires
a minimum number of control points where the initial fixed points are transformed using
the bicubic interpolation technique. Table 2.2 shows the valid range of parameters used in
deforming the synthetic cracks.

Transformation type   Degree   Control points used   Control points range (α_g)   Filter scale (σ)   Distortion scale (α)
Affine                -        4                     [30, 70]                     5                  [15, 35]
Projective            -        4                     [30, 70]                     5                  [15, 35]
Polynomial            2        6                     [5, 10]                      5                  [15, 35]
Polynomial            3        10                    [2, 4]                       5                  [15, 35]
Polynomial            4        15                    1                            5                  [15, 35]
Piecewise linear      -        4                     [40, 60]                     5                  [15, 35]

Table 2.2: Synthetic cracks elastic deformation parameters.

It was observed that, based on the fixed points and the amount of
deformation, values higher than this range displace the pixels too much, and a meaningful
correlation between the pixels is not preserved. The training dataset consists of 100,000
synthetic samples, and Figure 2.12b demonstrates a few samples of the synthetic cracks generated
from the seam carving approach. As mentioned before, due to the montage plotting, the cracks
appear the same, but they are of different lengths and thicknesses.
To evaluate the uniqueness of the synthetically generated cracks, the pairwise distances of a
smaller pool of 5000 images, sampled from the total of 100,000, were considered. The pool was
limited to 5000 samples due to the constraint on computational and memory resources.
The pairwise distance between the synthetic cracks was calculated using the five features
described in Paragraph 2.2.2.3.1. The intuition in this approach is to measure
the dissimilarity among the generated synthetic crack samples. If a given crack is very
similar to the others, then the normalized Euclidean distance in the feature space is smaller, and
vice-versa. Thus, cracks need to be sampled such that they have a larger pairwise distance.
This ensures that the synthetic cracks have a larger difference in their spatial characteristics
while still resembling cracks.
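The uniqueness analysis can be reproduced along the following lines (a hedged MATLAB sketch, not the dissertation's code; F is an assumed 5000-by-5 matrix of the five geometric features).

Fz = normalize(F);                            % put the five features on a common scale
D  = squareform(pdist(Fz, 'euclidean'));      % 5000-by-5000 pairwise distance matrix
Dn = D ./ max(D(:));                          % normalized distances in [0, 1]
mask = triu(true(size(Dn)), 1);               % unique (upper-triangular) pairs only
meanDist = mean(Dn(mask));                    % average dissimilarity of the sampled pool
[f, xi] = ksdensity(Dn(mask));                % kernel density estimate, as in Figure 2.13c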
Figures 2.13a and 2.13b show the pairwise distance matrix of the 100-ensemble average of the
5000-sample feature vectors with and without elastic deformation. These were randomly
sampled from the 100,000 feature vectors of the synthetic images used for training. Figure 2.13c
shows the normal kernel density estimation of the pairwise distances.

Figure 2.13: Uniqueness of the synthetic crack training dataset. Pairwise distance matrix of
the 100-ensemble average of the 5000-sample feature vectors that were sampled from the 100,000
training images. (a) Without elastic deformation; (b) with elastic deformation. (c) Probability
density function of the 100-ensemble average of the 5000-sample feature vectors with and
without elastic deformation. (d) Cumulative density function.

The mean of the
density of the synthetic cracks without any transformation or deformation is 0.1879, and
with deformation it is 0.2998; a mean shift of 0.1119 results from the elastic deformation.
Figure 2.13d displays the cumulative distribution function of the same. Around 40% of the
elastically deformed samples have a normalized pairwise distance less than 0.30. A data
augmentation method closer to the normalized distance of 1 is preferred, as it produces
samples with higher spatial dissimilarities. Thus, elastic deformation produces synthetic
cracks with considerably varying shapes and spatial characteristics.
2.6.1.2 Testing dataset
In this study, three datasets were used to evaluate the crack segmentation capability of the
conventional filters and DeepCrack. Datasets I and II are private and were created around the
University of Southern California (USC) campus, and dataset III is available to the public
[165]. The complexity of the images increases gradually from dataset I to dataset III. Dataset I
consists of 200 crack samples, which vary in size, from the work of Jahanshahi et al. [117].
In this dataset, the crack widths are thinner. Also, the image quality is poor and requires
contrast enhancement; a gamma correction was used to pre-process the images from this dataset.
In addition, this dataset has a few images whose foreground is very dull compared to the
background, meaning the cracks are less distinguishable from the background. Furthermore, even
with the gamma correction, the crack content of a few images is not sufficiently enhanced. GLCM
features and k-means clustering were used to group the texture of the images in the three
datasets into two classes: if the normalized distance of one centroid is smaller than the other,
then its corresponding samples are classified as low texture, and vice-versa. Based on this,
199 of the images are of high texture, and one is of low texture.
Dataset II consists of 250 images of concrete surfaces. In this dataset, the cracks have
stronger contrast and texture and are wider relative to dataset I; two hundred and forty-nine
of the 250 images have high textural content. Lastly, dataset III is the public dataset
constructed by [165] to evaluate their method. This dataset was used in this work to
demonstrate the segmentation capability of the proposed and comparison methods. About
78% and 22% of the images have concrete and asphalt material surfaces [165], respectively.
The crack width varies largely across the images of this dataset. Around 21 images in this
dataset have low texture and 216 have high texture. Table 2.3 shows the materials, texture,
crack information, and image attributes of the three testing datasets used in this study.

Dataset                       I                       II           III
Image size                    74 x 69 to 345 x 153    448 x 252    384 x 544
Total images                  200                     250          237
Material                      Concrete                Concrete     Concrete and pavement
Texture                       High                    High         High
Crack width range (pixels)    4 to 12                 9 to 69      1 to 180 [165]
Image type                    Color                   Color        Color
Image quality                 Very low                High         Medium to high

Table 2.3: Materials, texture, crack information and the image attributes of the testing
datasets.
2.6.2 Crack segmentation on real-world datasets
2.6.2.1 Metrics
Eight different metrics are utilized to evaluate the crack semantic segmentation methods.
Four of them are the common semantic segmentation metrics which use pixel accuracy and
the region Intersection over Union (IoU) [220]. The rest are widely associated with the crack
segmentation literature.
• Global pixel accuracy (GA) = \sum_{i} n_{ii} \,/\, \sum_{i} t_{i}

• Mean accuracy (MA) = (1/n_{cl}) \sum_{i} n_{ii}/t_{i}

• Mean IoU (MI) = (1/n_{cl}) \sum_{i} n_{ii} \,/\, \left(t_{i} + \sum_{j} n_{ji} - n_{ii}\right)

• Weighted IoU (WI) = \left(\sum_{k} t_{k}\right)^{-1} \sum_{i} t_{i}\, n_{ii} \,/\, \left(t_{i} + \sum_{j} n_{ji} - n_{ii}\right)

where n_{ij} is the number of pixels of class i predicted to belong to class j, there are n_{cl}
different classes, and t_{i} = \sum_{j} n_{ij} is the total number of pixels of class i.
The relation of the true-positives (TP), false-positives (FP), true-negatives (TN), and
false-negatives (FN) to the specificity, precision, and recall is given below; the F1-score is
the harmonic average of precision and recall. All eight metrics range from 0 to 1, i.e., from
lowest to highest performance. A computational sketch of these pixel-level quantities is given
after the list.

• Specificity (SP) = TN / (TN + FP)

• Precision (PR) = TP / (TP + FP)

• Recall (RE) = TP / (TP + FN)

• F1-score (F1) = 2 · Precision · Recall / (Precision + Recall)
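As an illustration, these pixel-level metrics can be computed in MATLAB from a predicted binary crackmap and its ground truth as follows (a hedged two-class sketch, not the dissertation's evaluation code; pred and gt are assumed variable names).

% pred, gt: logical crackmaps of the same size (true = crack pixel).
TP = nnz( pred &  gt);   FP = nnz( pred & ~gt);
TN = nnz(~pred & ~gt);   FN = nnz(~pred &  gt);
SP = TN / (TN + FP);               % specificity
PR = TP / (TP + FP);               % precision
RE = TP / (TP + FN);               % recall
F1 = 2*PR*RE / (PR + RE);          % harmonic mean of precision and recall
GA = (TP + TN) / numel(gt);        % global pixel accuracy (two classes)
MA = (RE + SP) / 2;                % mean of the per-class accuracies
MI = (TP/(TP + FP + FN) + TN/(TN + FP + FN)) / 2;   % mean IoU over the two classes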
2.6.2.2 Effect of the anisotropic diffusion coefficients and iterations on the crack
width
Anisotropic diffusion is the refinement method used as pre-processing to minimize the textural
noise while preserving the edges of the cracks. By diffusing the image, texture details are
smoothed, and this also affects the edges of the cracks if they have a small width. Both the
anisotropic diffusion coefficient and the number of iterations play a major role in the
suppression of the texture. If the anisotropic diffusion coefficient is high, it disregards the
edges and smooths the cracks; if it is too small, then the effect of the diffusion (to suppress
the texture) will be minimal. The same holds for the number of anisotropic diffusion iterations.
Figure 2.14 shows the thin and thick crack images taken from dataset III. The thin and thick
cracks have widths of 14 and 67 pixels, respectively.

Figure 2.14: Two examples from dataset III. The red dashed box shows the region of interest
and the green dashed line represents the crack profile line. (a) Thin crack (note that this
image has been rotated 90° counterclockwise to match the orientation of image (b)). (b) Thick
crack image.
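The pre-processing step studied here can be reproduced approximately with MATLAB's built-in anisotropic diffusion; the following is a hedged sketch (imdiffusefilt is used as a stand-in for the dissertation's own diffusion implementation, and scaling κ by 255 assumes the coefficient is quoted on a 0-255 intensity range).

% I: grayscale crack image in [0, 1]; kappa and nIter follow the ranges studied below.
kappa = 30;  nIter = 50;
Idiff = imdiffusefilt(I, 'GradientThreshold', kappa/255, ...
                         'NumberOfIterations', nIter, ...
                         'ConductionMethod', 'exponential');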
The two examples were used to study the effect of the anisotropic diffusion coefficients
and iterations in the proposed hybrid method setup. The hybrid method's MFAT algorithm
was set with scales of σ = 1, 2, ..., 5. The cut-off thresholds τ_ρ and τ_ν were set to 0.25 and
0.5 throughout this study, as they produced clean crack maps. The cracks were segmented using
the hybrid method for anisotropic diffusion coefficients that varied over κ = 1, 2, ..., 100.
Any value beyond 100 made the images blurry, as the filter then behaves like an isotropic
diffusion.
Figure 2.15: Effects of the anisotropic diffusion coefficients on two of the dataset III images.
(a) Variation of the anisotropic diffusion coefficients on a thin crack image that has a width
of 14 pixels along the cross-sectional profile line. (b) Variation of the normalized intensity
of the pixels of the thin crack along the cross-sectional profile line. (c) Variation of the
anisotropic diffusion coefficients on a thick crack image of profile width 67 pixels. (d)
Variation of the normalized intensity of the pixels of the thick crack along the profile line.
The number of iterations was kept constant at 50. The time step size Δt was also set to 1/7
throughout this work. The thin and thick ground-truth and binarized response images were
filtered using a Gaussian kernel of size 5 × 5 with σ_G = 1, and 25 × 25 with σ_G = 5 (not to
be confused with the MFAT scales σ), respectively. This ensures smooth crack edges and is
acceptable because the same filter is used across all the images.
Figures 2.15a and 2.15c show the effect of the anisotropic diffusion coefficients on the thin
and thick cracks, respectively. As the anisotropic diffusion coefficients were incremented
iteratively, the crack width remained the same for κ ≤ 38 in both images. When κ > 40, the
diffusion effect increases similarly to an isotropic filter and disobeys the edge information. In
the case of the thin example, this specifically happens due to the discontinuity in the intensity
of the dark pixels along the crack profile line (refer to Figure 2.14a); this work refers to it as
the "smearing effect" of the diffusion process. Furthermore, this is corroborated by Figure 2.15b,
as the uncertainty along the right side of the profile line is larger because two pixel intensities
were displaced, or smeared. Relatively, the difference in the thickness is only 2 and 1 pixels for
the thin and thick cracks, respectively. Figure 2.15d shows that the intensity variation of
the thick crack varies from 0 to 1. Except for the pixel at the left edge of the profile line at
κ = 39, all values of κ produced an accurate estimation of the width. Based on the previous
argument, the neighboring pixels of the left edge of the thick crack were also displaced, thus
producing a kink in the crack width profile.
Similar to the diffusion coefficients, a study was conducted on the effect of the diffusion
iterations on the thin and thick crack profiles. Generally, cracks on concrete surfaces are
surrounded by high texture. Therefore, the diffusion iterations were varied from 0, 2, 4, ..., 500,
with κ = 15. Figures 2.16a and 2.16c show the change in crack width against the variation of
the anisotropic diffusion iterations on the thin and thick cracks. For the thin crack, the width
remains the same throughout the span of iterations, but there is a small variation in the
normalized intensity, as observed in Figure 2.16b. For the thick crack, a normalized-intensity
variation was observed for two pixels at iteration number 222 (see Figure 2.16d). The previously
given reason still holds here, as this was caused by the variation in the normalized pixel
intensity. An important observation is that the anisotropic diffusion operation does not change
the crack thickness for either the thin or the thick example. Further analysis of the effects of
the diffusion coefficients and iterations on the multiple datasets is presented in Section 2.6.2.5.

Figure 2.16: Effects of diffusion iterations on two of the dataset III images. (a) Variation
of the diffusion iterations on a thin crack image that has a width of 14 pixels along the
cross-sectional profile line. (b) Variation of the normalized intensity of the pixels of the thin
crack along the cross-sectional profile line. (c) Variation of the anisotropic diffusion iterations
on a thick crack image of cross-sectional profile width 67 pixels. (d) Variation of the
normalized intensity of the pixels of the thick crack along the cross-sectional profile line.
2.6.2.3 On the choice of the parameters for the hybrid method
The hybrid method depends on two sets of parameters: firstly, the anisotropic diffusion
coefficient, κ, the number of iterations, and the time-step integration constant, Δt; secondly,
the MFAT scale size, σ, the cut-off thresholds, τ_ρ and τ_ν, and the step size of the solution, δ.
Among these seven parameters, only the anisotropic diffusion coefficient, the iterations, and the
scale size play a crucial role in the crack segmentation. The rest are optimal within the ranges
proposed in the original literature. To estimate the valid range for the anisotropic diffusion
coefficient κ, the iterations, and the MFAT scale size σ, a validation dataset consisting of 500
synthetic crack images was constructed (referred to as Synthetic500). Crack seams were dilated
with a disk-shaped structuring element of known radius; thereby, synthetic cracks of known
width were produced. Known-width synthetic cracks help develop the relationship between the
actual crack width and the Gaussian filter (or kernel) diameter. Next, the binary synthetic
cracks were overlaid onto the textured non-crack images to replicate the real-world dataset. It
is worth noting that mimicking the exact variation of the intensities of the real cracks is very
difficult; thus, the synthetic cracks have darker pixels than real cracks. Later, semantic
segmentation was performed on the Synthetic500 images using the proposed method to obtain
the valid ranges of the parameters. Further details are provided below.
Figures 2.17a to 2.17d show the variation of the median F1-scores for the diffusion coefficients
and iterations on the Synthetic500 and real-world datasets. Since the median operation is
robust to outliers, it was preferred for visualizing the variations. The F1-scores gradually
increase in the range of 15 to 50 for the anisotropic diffusion coefficients and then gradually
decrease. In the case of the Synthetic500 dataset, the F1-score variation is much smaller than
for the real-world datasets, due to the dark pixels that were overlaid. For the anisotropic
diffusion iterations, after 150 iterations the F1-score gradually becomes stable for the real
datasets. Due to the darker pixels in the synthetic images, the variation of the F1-scores
remains constant; this shows that when the pixels are darker, the diffusion effect remains
negligible.
Figure 2.17: Median variations of the F1-scores and the relationship of the Gaussian filter
diameter. (a) and (b) Anisotropic diffusion coefficients for the Synthetic500 and real-world
datasets, respectively. (c) and (d) Anisotropic diffusion iterations for the Synthetic500 and
real-world datasets, respectively. (e) The relationship between the actual crack width and the
Gaussian filter diameter used in the MFAT filter to segment the cracks on concrete surfaces.
This work recommends choosing the anisotropic diffusion coefficients and iterations carefully
for minimizing the textural noise. The images used in this study (datasets I to III and
Synthetic500) have high texture; the acceptable range of κ was found to be within 15 and 50.
The iterations can vary from 50 to 250 for low-medium to high textural images of less than
512 × 512 in dimension. Although, using the hybrid approach, the best F1-score for dataset II
was achieved at 485 iterations, scores only about 0.01% lower were obtained within the
recommended range. Lastly, for the integration step size, Δt, perturbations around the
recommended value of 1/7 did not make a significant change in the experimental outcome.
For the segmentation of the cracks, the right size of the Gaussian filter is necessary. A
smaller filter size misses the crack pixels, and a larger one increases the false positives. To set
the appropriate filter size, the camera properties need to be known to translate from world
to camera coordinates [117, 44, 3]. Generally, the operation and maintenance department of
structures has the periodic history of the cracks' physical properties and the probable range of
thicknesses with a risk factor [70]. Based on this information and the type of structure, the
right filter size can be estimated in the case of the unavailability of the camera's intrinsic
parameters. Figure 2.17e shows the linear relationship, y = 1.1758x + 0.5153, R² = 0.9996,
between the actual crack width (x) and the Gaussian filter (or kernel) diameter (y). Here the
filter size is 2⌈2.355σ⌉ + 1, where ⌈·⌉ is the ceiling function and 2.355 is the Full Width at
Half Maximum (FWHM) value of a Gaussian function. Although the width of the crack is 2σ,
using the FWHM extracts the crack pixels clearly due to the larger tails of the filter. By
using Figure 2.17e, the filter scales σ = 0.71:0.25:2.43, 0.71:0.25:12.41, and 0.71:0.25:7.34
were set on datasets I, II, and III, respectively (the colon is the linear spacing step operator).
This was based on the minimum 1-pixel crack width and the average width of the cracks in the
datasets. The cut-off thresholds, τ_ρ and τ_ν, and the step size of the solution, δ, were set to
τ_ρ = 0.25, τ_ν = 0.5, and δ = 0.5 based on the range [0, 1] suggested in the original
literature. In this work, small perturbations around the used values did not produce an abrupt
change in the F1-scores on the three datasets. Also, the cut-off thresholds and step size were
ideal for crack segmentation.
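The relationship in Figure 2.17e can be used directly to pick the MFAT Gaussian scale for a target crack width; the following is a small hedged MATLAB sketch built from the fitted coefficients and the FWHM-based filter-size rule quoted above (the inversion from kernel diameter to σ is an assumption).

targetWidth = 8;                                % expected crack width in pixels (example)
kernelDiam  = 1.1758*targetWidth + 0.5153;      % fitted relation y = 1.1758x + 0.5153
sigma       = kernelDiam / 2.355;               % FWHM of a Gaussian is ~2.355*sigma
filterSize  = 2*ceil(2.355*sigma) + 1;          % odd filter size used by the MFAT filter
sigmas      = 0.71:0.25:sigma;                  % scale sweep, as done for datasets I-III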
2.6.2.4 Comparison of the segmentation methods
To evaluate all four methods on their crack segmentation capability, three real-world datasets
have been used in this study. The conventional methods were all trained on the geometric
features of the synthetic cracks and tested on datasets I to III. Datasets II and III had color
training images, whereas dataset I had only binary images; thus, to be consistent in the
evaluation process, DeepCrack CNN was tested only using a pre-trained model on datasets I
and II. Therefore, full training or transfer learning was not possible; this demonstrates the
generalization ability of the deep learning model. For the vesselness method, σ = 1:5, 1:15,
and 1:15 on datasets I, II and III, respectively (the colon is the linear spacing step operator).
β = 0.5 and c = 25 were used on all three datasets. Lastly, for the morphological method,
the line structuring element size ranged from 1 pixel to 25% of the maximum image dimension
with a step size of 3, 5, and 5 on datasets I, II, and III, respectively. These were the best
parameters selected after rigorous research.
Figure 2.18 shows the segmentation results of the four methods on the three datasets. In
this figure, column 1 contains the original color images of the three test datasets described in
Section 2.6.1.2. Columns 3-6 are the binary images of the segmented cracks obtained by
processing the original images of the datasets with the hybrid, vesselness, morphological, and
DeepCrack CNN methods, respectively. Otsu's thresholding method was used to binarize the
processed images for the conventional methods, and a 0.5 threshold was used to convert the
sigmoid-function crack probabilities of the DeepCrack output. The black and white regions in
columns 2-6 represent the crack and background pixels. Although the ANN, k-NN, and SVM
classifiers are all evaluated in this study, the ANN's crack maps are presented in this figure
owing to their better F1-scores.

Figure 2.18: A comparison of the crack segmentation methods. Columns, left to right, show
the original color, ground-truth, hybrid, vesselness, morphological, and DeepCrack CNN images.
Rows 1 to 5, 6 to 10, and 11 to 15 show the images of datasets I, II, and III, respectively.
It can be seen from Figure 2.18 that all four methods segmented the cracks considerably well
on most of the three datasets. The hybrid and vesselness methods produce slightly thicker
cracks because of the higher magnitude of the Gaussian scales σ. The morphological method
missed a few crack pixels even when a considerable structuring element length of 25% of the
maximum image dimension was used. DeepCrack missed a few whole crack segments in dataset I
completely, because the dataset I images have low foreground contrast. Thicker cracks, as
shown in rows 6-10, are segmented well by all four methods. Due to the sensitivity of the
MFAT eigenvalues, low-contrast cracks, as shown in rows 11 and 14, were segmented considerably
well by the hybrid approach compared to the other methods. An impressive fact to be noted
here is the generalization ability achieved by all the classifiers: they were able to classify
complicated cracks when trained only on single-stranded cracks of various lengths, thicknesses,
and orientations (see rows 7, 8, 9, 13, and 15 in Figure 2.18). Lastly, the hybrid and vesselness
methods have a limitation in detecting cracks when shadows are present. This is due to the
sensitivity of the principal curvature to the Hessian matrix eigenvalues, which results in the
detection of darker regions and, thus, the segmentation of the shadows.
Table 2.4 displays the metrics of the four methods in comparison. Global accuracy, mean
accuracy, and specificity consider the TNs, or non-crack pixels. Since the number of non-crack
pixels is very high compared to the crack pixels, these metrics tend to have higher values
than the others. In contrast, IoU, MIoU, precision, and recall are sensitive to the TP pixels,
so they have a lesser magnitude. Precision and recall are related to the number of FPs and
FNs, respectively. If the precision is high, then fewer FPs have contributed to the score,
suggesting that the classifier has not accepted many of the non-crack pixels. Similarly, for
recall, the same applies to FNs, meaning the classifier has missed fewer crack pixels. The
F1-score is the harmonic mean of both precision and recall.
Dataset  Method      Classifier   GA       MA       MI       WI       SP       PR       RE       F1
I        Hybrid      ANN          0.9702   0.8575   0.7244   0.9511   0.9793   0.5791   0.7358   0.6481
I        Hybrid      k-NN         0.9697   0.8582   0.7225   0.9505   0.9787   0.5731   0.7377   0.6451
I        Hybrid      SVM          0.9688   0.8534   0.7169   0.9492   0.9781   0.5638   0.7286   0.6357
I        Vesselness  ANN          0.9668   0.8601   0.7108   0.9468   0.9754   0.5401   0.7448   0.6261
I        Vesselness  k-NN         0.9644   0.8586   0.7008   0.9438   0.9729   0.5160   0.7443   0.6094
I        Vesselness  SVM          0.9637   0.8314   0.6887   0.9423   0.9744   0.5104   0.6883   0.5862
I        Morpho      ANN          0.9694   0.8206   0.7071   0.9491   0.9814   0.5785   0.6599   0.6165
I        Morpho      k-NN         0.9626   0.8166   0.6791   0.9406   0.9744   0.4991   0.6588   0.5679
I        Morpho      SVM          0.9509   0.8099   0.6415   0.9267   0.9623   0.4033   0.6575   0.4999
I        DeepCrack   CNN          0.9683   0.7396   0.6674   0.9453   0.9868   0.5905   0.4925   0.5370
II       Hybrid      ANN          0.9599   0.8582   0.8008   0.9258   0.9852   0.8452   0.7312   0.7841
II       Hybrid      k-NN         0.9595   0.8577   0.7993   0.9251   0.9848   0.8417   0.7306   0.7822
II       Hybrid      SVM          0.9588   0.8528   0.7952   0.9237   0.9851   0.8424   0.7204   0.7766
II       Vesselness  ANN          0.9582   0.8357   0.7872   0.9218   0.9887   0.8695   0.6827   0.7648
II       Vesselness  k-NN         0.9585   0.8377   0.7888   0.9223   0.9885   0.8682   0.6870   0.7670
II       Vesselness  SVM          0.9523   0.8245   0.7649   0.9123   0.9840   0.8214   0.6649   0.7349
II       Morpho      ANN          0.9554   0.8098   0.7678   0.9157   0.9915   0.8910   0.6281   0.7368
II       Morpho      k-NN         0.9549   0.8073   0.7652   0.9147   0.9915   0.8903   0.6230   0.7331
II       Morpho      SVM          0.9503   0.8033   0.7504   0.9079   0.9868   0.8387   0.6197   0.7128
II       DeepCrack   CNN          0.9539   0.7827   0.7513   0.9114   0.9965   0.9472   0.5688   0.7108
III      Hybrid      ANN          0.9751   0.8521   0.7645   0.9561   0.9867   0.7099   0.7174   0.7136
III      Hybrid      k-NN         0.9754   0.8583   0.7685   0.9568   0.9865   0.7099   0.7301   0.7199
III      Hybrid      SVM          0.9756   0.8558   0.7684   0.9569   0.9869   0.7148   0.7247   0.7197
III      Vesselness  ANN          0.9761   0.8919   0.7831   0.9586   0.9840   0.6938   0.7997   0.7430
III      Vesselness  k-NN         0.9761   0.8959   0.7845   0.9587   0.9837   0.6912   0.8081   0.7451
III      Vesselness  SVM          0.9674   0.8795   0.7378   0.9464   0.9757   0.5929   0.7833   0.6749
III      Morpho      ANN          0.9814   0.8833   0.8123   0.9662   0.9907   0.7909   0.7758   0.7833
III      Morpho      k-NN         0.9815   0.8888   0.8144   0.9664   0.9903   0.7854   0.7873   0.7863
III      Morpho      SVM          0.9723   0.8828   0.7613   0.9532   0.9808   0.6493   0.7849   0.7106
III      DeepCrack   CNN          0.9860   0.9700   0.8590   -        -        0.8680   0.8460   0.8650

Table 2.4: Semantic segmentation results of the hybrid, vesselness, morphological and
DeepCrack methods on datasets I to III without post-processing procedures, using ANN, k-NN,
SVM and CNN as classifiers. A pre-trained DeepCrack CNN was used in classifier mode.
If both increase, then the F1-score increases, and vice-versa. The F1-score also has a tighter
bound to the TPs, FPs, and FNs. In this work, precision, recall, and F1-scores are presented.
Due to the poor image quality and low contrast, the performance of the segmentation
methods on dataset I was relatively lower than on the others. The hybrid method with the k-NN
classifier obtained the best F1-score among all the methods. A pre-trained DeepCrack CNN
performed worst; even a pre-processing method like contrast enhancement did not improve
the image quality, and thus it suffered from a low F1-score. On dataset II, the hybrid method
again outperformed the others; here, a few crack pixels were missed, but the FPs were fewer.
A pre-trained DeepCrack was the worst performer, as it missed a considerable number of
crack pixels. For dataset III, DeepCrack performed well, with an F1-score of 86.50%, as it
was pre-trained on this dataset. The second-best performer was the morphological method, with
a 78.63% F1-score for the k-NN classifier. Overall, all three classifiers used in the conventional
methods were able to detect the cracks with good accuracy on all three datasets. The hybrid and
vesselness methods scored relatively low, as they produced slightly larger cracks due to the
scale bound σ and increased FPs. Overall, averaged across the datasets, the proposed hybrid
method outperformed the vesselness and morphological approaches by 4.38% and 7.00% of the
best F1-score for the SVM classifier, respectively. When compared against the pre-trained
state-of-the-art DeepCrack, the hybrid method had a better F1-score by 0.64% on the three
datasets for the same classifier.
The proposed anisotropic diffusion method reduces the FP pixels (usually rough textural
noise) and reinforces the classifier's prediction by minimizing the FP CCs. Across all three
datasets, the proposed refinement method on average improved the crack segmentation
F1-scores by 0.37%, 0.47%, and 0.54% for the ANN, k-NN, and SVM classifiers, respectively.
The generalization ability of the ANN, owing to the non-linearity at its neurons, was higher,
and it was able to classify the FP CCs well; k-NN and SVM required some boost in the reduction
of the FP CCs. Thus, the refinement yields a smaller F1-score improvement for the ANN than for
these two classifiers. Overall, the refinement method improved the average precision values by
0.98%, 1.05%, and 1.21%, and reduced the recall values by 0.44%, 0.31%, and 0.47%, for the
three classifiers, respectively, on the three datasets. This is due to the effect of the
anisotropic diffusion coefficient and iterations, whereby some crack pixels were damaged in the
refinement process.
Lastly, both the proposed and DeepCrack CNN methods have pros and cons. The
DeepCrack CNN is multiscale, does not require any special multiscale parameters to be
tuned, and segments the cracks accurately if it is fully trained. However, the generalization of
DeepCrack to new datasets (I and II) did not produce better results, as it requires fine-tuning
by transfer learning. On the other hand, the proposed method segmented cracks relatively well
on all the datasets, even when trained on the synthetic samples, but the algorithm parameters
need to be set in advance. In addition, the proposed method does not need retraining, as
opposed to the deep learning method, where fine-tuning is required whenever it encounters a
new dataset.
2.6.2.5 Effects of the post-processing procedures
The multiscale filters used for the crack segmentation output the responses at various scales
ranging from smaller to larger. Later, the responses of the scales are combined to form a final
response image. After the union at different scales, false-positive CCs can be formed. To
eliminate these CCs, a decision-making system is required. Usually, in a supervised learning
setup, a classifier is trained to predict the crack and non-crack CCs. Even a highly trained
classifier is susceptible to the false CCs that are similar to cracks. After the binarization of
the cracks, the effects of post-processing procedures are studied in this work. To perform
these operations, basic morphological operations such as removing orphans and eccentric
pixels, filling the holes, and bridging pixels were used.
In addition, one of the key issues in any crack detection algorithm is the removal of blobs,
such as dark or bright patches of different sizes [81, 82]. This work presents a statistical blob
filtering based on the area of the blobs. In general, cracks are larger in size relative to the
blobs. The distribution of the CC areas was assumed to be Gaussian. In texture-dominant
images, there is a high likelihood that a large number of speckles and discontinuities are
present even after the response image binarization. Most likely, they are confined around the
mean area, μ_area, within some ασ_blob of the standard deviation, where α is a non-negative
real number; σ_blob is set based on the normal distribution principle and the textures of the
image. Lastly, the circularity index of the blobs was considered in order to separate the CCs
that have round features from those with line-like features. The circularity index range was
found experimentally; if it is within the range of 0.02 to 0.2, the blobs are kept, else
discarded.
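A hedged MATLAB sketch of this statistical blob filtering is given below (illustrative only; the value of α and the way the two criteria are combined are assumptions, while the circularity bounds follow the text).

% BW: binarized crackmap. Remove blob-like CCs using area statistics and circularity.
CC   = bwconncomp(BW);
st   = regionprops(CC, 'Area', 'Perimeter');
A    = [st.Area];   P = max([st.Perimeter], 1);
circ = 4*pi*A ./ P.^2;                         % circularity index of each CC
muA  = mean(A);   sdA = std(A);   alphaB = 2;  % alphaB: chosen non-negative multiplier
keep = (circ >= 0.02 & circ <= 0.2) & (A > muA + alphaB*sdA);   % crack-like, larger CCs
BWf  = ismember(labelmatrix(CC), find(keep));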
Figure 2.19: Median variations of the F1-scores for the three datasets combined against (a)
the anisotropic diffusion coefficients and (b) the anisotropic diffusion iterations. PP stands
for with post-processing and No PP for without post-processing. Better visualized in color.
The same parameters presented in Section 2.6.2.4 are used for the hybrid method, except
that the post-processing procedures are now included. Figure 2.19 shows the median variations
of the F1-scores for datasets I to III combined. These metrics are plotted against the
anisotropic diffusion coefficients and the anisotropic diffusion iterations when the ANN, k-NN,
and SVM classifiers were used. Since the median operation is robust to outliers, it was
preferred for visualizing the variations. All three classifiers performed well without the
post-processing procedures in both the anisotropic-diffusion-coefficient and iteration domains.
When the post-processing procedures were used, the overall median recall across the datasets
decreased by 3.80%, 2.81%, and 1.70% for the ANN, k-NN, and SVM classifiers, respectively.
This is a clear indication that the post-processing procedures discarded crack CCs, thereby
increasing the false negatives. Moreover, a rise in the F1-scores is observed in the range
of 15 to 50 for the anisotropic diffusion coefficients. For the anisotropic diffusion iterations,
after 150 iterations the F1-score gradually becomes stable.
Table 2.5 presents the semantic segmentation results of the hybrid, vesselness, and
morphological methods on datasets I to III with post-processing procedures, using ANN, k-NN,
and SVM as classifiers. The hybrid method clearly outperformed the vesselness and
morphological methods by F1-score on datasets I and II.
Dataset  Method      Classifier   GA       MA       MI       WI       SP       PR       RE       F1
I        Hybrid      ANN          0.9698   0.8548   0.7219   0.9505   0.9791   0.5756   0.7304   0.6438
I        Hybrid      k-NN         0.9693   0.8549   0.7193   0.9498   0.9785   0.5684   0.7313   0.6397
I        Hybrid      SVM          0.9697   0.8542   0.7209   0.9503   0.9790   0.5736   0.7293   0.6422
I        Vesselness  ANN          0.9658   0.8652   0.7084   0.9456   0.9739   0.5290   0.7565   0.6226
I        Vesselness  k-NN         0.9653   0.8663   0.7067   0.9450   0.9732   0.5238   0.7593   0.6200
I        Vesselness  SVM          0.9653   0.8542   0.7029   0.9448   0.9743   0.5255   0.7342   0.6125
I        Morpho      ANN          0.9640   0.8562   0.6985   0.9432   0.9727   0.5123   0.7398   0.6054
I        Morpho      k-NN         0.9637   0.8560   0.6972   0.9428   0.9724   0.5091   0.7397   0.6031
I        Morpho      SVM          0.9633   0.8581   0.6967   0.9425   0.9718   0.5061   0.7443   0.6025
II       Hybrid      ANN          0.9600   0.8576   0.8007   0.9257   0.9854   0.8464   0.7299   0.7839
II       Hybrid      k-NN         0.9597   0.8572   0.7998   0.9254   0.9852   0.8447   0.7293   0.7827
II       Hybrid      SVM          0.9589   0.8523   0.7954   0.9238   0.9854   0.8445   0.7191   0.7768
II       Vesselness  ANN          0.9582   0.8366   0.7876   0.9218   0.9885   0.8676   0.6847   0.7654
II       Vesselness  k-NN         0.9586   0.8394   0.7898   0.9226   0.9882   0.8663   0.6905   0.7684
II       Vesselness  SVM          0.9573   0.8313   0.7825   0.9200   0.9886   0.8669   0.6740   0.7584
II       Morpho      ANN          0.9228   0.6609   0.6104   0.8585   0.9878   0.7519   0.3339   0.4625
II       Morpho      k-NN         0.9210   0.6608   0.6078   0.8565   0.9856   0.7208   0.3360   0.4583
II       Morpho      SVM          0.9193   0.6599   0.6047   0.8544   0.9837   0.6949   0.3361   0.4530
III      Hybrid      ANN          0.9746   0.8385   0.7572   0.9551   0.9875   0.7145   0.6894   0.7017
III      Hybrid      k-NN         0.9748   0.8472   0.7613   0.9556   0.9869   0.7096   0.7074   0.7085
III      Hybrid      SVM          0.9750   0.8441   0.7609   0.9557   0.9874   0.7148   0.7008   0.7077
III      Vesselness  ANN          0.9741   0.8641   0.7638   0.9551   0.9845   0.6852   0.7437   0.7133
III      Vesselness  k-NN         0.9735   0.8692   0.7624   0.9544   0.9834   0.6728   0.7550   0.7115
III      Vesselness  SVM          0.9740   0.8635   0.7627   0.9549   0.9844   0.6831   0.7427   0.7116
III      Morpho      ANN          0.9547   0.6005   0.5616   0.9203   0.9882   0.4501   0.2128   0.2890
III      Morpho      k-NN         0.9528   0.5995   0.5578   0.9182   0.9863   0.4127   0.2126   0.2807
III      Morpho      SVM          0.9522   0.5970   0.5551   0.9174   0.9858   0.3992   0.2082   0.2737

Table 2.5: Semantic segmentation results of the hybrid, vesselness, and morphological methods
on datasets I to III with post-processing procedures, using ANN, k-NN and SVM as classifiers.
Contrarily, the morphological method performed poorly on datasets II and III, due to the
conservative bound of the circularity index. Generally, the crack maps produced by the
morphological method have sharp, noisy ridges along the edges; this happens when the scale of
the structuring element is smaller than the cracks. Thus, the cracks that are not within the
circularity index range get discarded.

In the evaluation process, the proposed hybrid method performed well compared to the
other two techniques. This work recommends choosing the anisotropic diffusion coefficients
and iterations carefully for minimizing the textural noise. The images used in this study have
high texture; the acceptable range of κ was found to be within 15 and 50. The iterations
can vary from 50 to 250 for low-medium to high textural images of less than 512 × 512 in
dimension. Although, using the hybrid approach, the best F1-score for dataset II was achieved
at 485 iterations, scores only about 0.01% lower were obtained within the recommended range.
Lastly, this work recommends using the post-processing methods carefully: if the classifier
performs poorly at detecting the false-positive CCs, it is reasonable to use the post-processing
methods; otherwise, a well-trained classifier is the right choice.
2.6.3 Crack profile analysis
Cross-sectional profile analyses of the proposed (hybrid), vesselness, morphological, and
DeepCrack methods are presented in Figure 2.20. Two crack images from dataset III, as shown
in Figure 2.14, are utilized.

Figure 2.20: Cross-sectional profiles of two concrete surface images. (a) Thin crack of width
14 pixels (profile RMSE: hybrid 0.206, vesselness 0.099, morphological 0.110, DeepCrack 0.100).
(b) Thick crack of width 67 pixels (profile RMSE: hybrid 0.094, vesselness 0.067, morphological
0.070, DeepCrack 0.000).

The parameters used for all three conventional methods
remain the same as in Section 2.6.2.4. The proposed method overestimated the width of the
thin crack by 6 pixels in total (4 on the right and 2 on the left) and has the highest Root
Mean Square Error (RMSE) of 0.206 along the profile line. This was due to the higher scale
used in extracting the cracks for dataset III. All the other methods have an error of 2 pixels
on the left side of the crack edge, which shows that they also struggled to segment the thin
crack exactly, due to the same scale issue.
Figure 2.20b shows the cross-sectional profile of all the methods for a thick crack. The hybrid,
vesselness, and morphological methods underestimated the width by 9, 6, and 6 pixels,
respectively. In contrast, DeepCrack has a perfect cross-sectional profile. If a smaller scale is
used, the cross-sections of larger cracks will be missed, and vice-versa. This is one of the major
disadvantages of the conventional methods, as it is challenging to set the right scale value
that works for any crack width. The DeepCrack CNN performs well because of the learning
involved in the convolutional procedure on dataset III.
convolutional procedure on dataset III.
Crack width, length, and area are the physical properties that define the severity of
the defect. These are used in visual inspection procedures to rate the condition of civil
infrastructure. The width of the cracks is defined as the normal distance to a tangent at each
center-line pixel in both normal directions and averaged across all these points. The length
and area of the cracks are defined as the length of the skeleton and the total number of pixels
in the cracks.
A fast marching method [239] was employed to compute the skeletons of the cracks. Generally,
skeletons possess redundant branches; to overcome this effect, a skeleton pruning threshold of
10% was used in this study. After skeletonizing the cracks, the widths were measured by
counting the white pixels in the direction normal to the major ellipse axis (tangent) at every
third pixel, using the approach of [204]. Finally, the quantified crack thickness was averaged
over a small neighborhood of 7 pixels to reduce the effects of outliers.
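A distance-transform shortcut gives a comparable width estimate from the skeleton; the following hedged MATLAB sketch is illustrative only and does not reproduce the normal-direction pixel counting of [204] exactly.

% BW: binary crackmap. Width ~ 2 x (distance to background) sampled on the skeleton.
skel    = bwskel(BW, 'MinBranchLength', 10);   % prune short spurious branches
distMap = bwdist(~BW);                         % distance of each crack pixel to the background
widths  = 2 * double(distMap(skel));           % per-skeleton-pixel width estimates
meanW   = mean(widths);                        % averaged crack width
lenPx   = nnz(skel);                           % crack length as skeleton pixel count
areaPx  = nnz(BW);                             % crack area as total crack pixels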
Figure 2.21 shows the uncertainty statistics of the relative errors of the crack width, length,
and area for the three datasets. For the crack width, length, and area on dataset I, the
hybrid method performed well; on the other hand, DeepCrack missed many cracks, and thus its
variance was large. On dataset II, all four methods have similar statistics for the crack width
and area categories, whereas for the crack length, the morphological method performs poorly
and had a large variance. This was due to the presence of holes in the crack maps, which caused
the skeletonizing method to overestimate the length. Lastly, for dataset III on all three
quantities, DeepCrack had the best statistics and
Figure 2.21: Statistics of the estimated crack’s physical properties on three datasets. Rows
1, 2 and 3 represent crack’s thickness (averaged), length and area, respectively. Columns 1,
2 and 3 represent dataset I, II and III, respectively. The rectangular boxes represent the
range between the 25th and 75th percentiles. The horizontal lines within the boxes denote
the median values. The protruding horizontal lines outside the boxes represent the minimum
and maximum values. The whiskers are lines extending above and below each box. Mean
values are indicated by the small squares inside the boxes. Lastly, the red asterisk symbols
are the outliers.
minimum variance, as it segments the cracks accurately on the trained dataset. For the crack
width, hybrid and vesselness methods produced a few thinner and thicker crack maps.
2.6.4 Computational time analysis
All conventional and deep learning computations were performed on a desktop computer with
a 64-bit Ubuntu 20.04 operating system, 128 GB of memory, and an AMD Ryzen Threadripper
2950X 3.5 GHz 16-core processor. The programs for the conventional filters were developed
using MATLAB 2021a. The public source code of Liu et al. [165] was used in the Linux
environment to evaluate the DeepCrack CNN. Table 2.6 presents the computation time of the
methods discussed in this work.
Dataset      Time/image (seconds)
             Hybrid     Vesselness   Morpho     DeepCrack
I            0.4656     0.1095       0.7971     0.0698
II           1.8249     1.1399       8.5111     0.2842
III [165]    13.8449    5.4947       25.3114    0.5132

Table 2.6: Computational time for the three real-world datasets using the hybrid, vesselness,
morphological, and DeepCrack methods.
Averaged over the three datasets, the hybrid algorithm spent 87.67% of its cumulative time
performing the anisotropic diffusion and 8.66% segmenting the cracks. The rest of the wall-time
was used for the response image binarization and the class prediction for the individual
connected components. Around 14.30% of the total processing
time was consumed by the vesselness filter to produce a crack response map. The remaining
time was utilized for image binarization and predicting the crack or non-crack labels by the
classifier. The morphological method, on average, took 19.12% of the total time, and the
remaining was used for the class prediction. The computational time was more considerable
because the morphological approach produces many false positives. Lastly, DeepCrack took
less than a second to predict the probabilities of the crack and non-crack pixels without
considering the training time. Overall, DeepCrack is the fastest in inference, and the vesselness
method was the second-fastest, followed by the hybrid approach, where anisotropic diffusion
iterations took a considerable amount of time. The morphological method was the slowest
due to the number of arithmetic operations involved in the convolutions of the structuring
elements of various lengths and angles.
2.7 Summary and conclusions
Autonomous crack detection methods using vision-based approaches are receiving increasing
attention in the civil engineering research community. Supervised learning methods require
carefully annotated data. This work has demonstrated how synthetic data generation and
augmentation techniques can be leveraged to perform crack segmentation with "zero
annotated training data", and it has shown the potential of synthetic data generation and
augmentation methods to facilitate supervised machine learning strategies. Furthermore, an
anisotropic-diffusion-based hybrid multiscale crack detection algorithm built on MFAT is
proposed to smooth the texture details of the concrete images while preserving the cracks.
The proposed method outperformed the vesselness and morphological methods by 4.38% and
7.00% of the F1-score for the SVM classifier, respectively, across the datasets. When compared
against the pre-trained state-of-the-art DeepCrack, the hybrid method had a better F1-score by
0.64% on the three datasets for the same classifier. Furthermore, across the three datasets,
the proposed refinement method improves the crack segmentation F1-score by 0.37%, 0.47%,
and 0.54% for the ANN, k-NN, and SVM classifiers, respectively. DeepCrack performed well in
finding the accurate crack profile on dataset III, where it was trained. In addition, the proposed
method does not need retraining, as opposed to the deep learning method, where fine-tuning is
required whenever it encounters a new dataset. This work does not suggest that the proposed
approach is a replacement for the current state-of-the-art deep learning models; rather, it
bridges the gap in the literature for a comprehensive analysis and comparison of the
conventional multiscale filters and the semantic segmentation CNN, DeepCrack. Lastly, this
work provides guidelines for using anisotropic diffusion and its parameters judiciously for
texture noise suppression and for assisting in crack segmentation problems.
2.8 Future work
In the current method, the range of the parameters governing the anisotropic diffusion filter was found using a validation dataset. These parameters generally vary according to the image gradient, texture noise, and crack size. Current CNN models have better semantic segmentation accuracy than feature-engineered methods. Leveraging the strengths of both deep learning and the proposed refinement scheme, in the near future,
a cyclic hybrid CNN model using a learnable anisotropic diffusion module to predict the parameters based on texture will be developed. This work exhibits the potential of using synthetic cracks for training a decision system. A generalized closed-form synthetic data generator will be developed to generate single, multi-stranded, and surface crack seams. Using DL-based image-to-image translation and blending procedures for crack seams, new crack datasets can be produced. This will eventually assist in training large DL models to achieve better accuracy and minimize the annotation cost.
2.9 Acknowledgments
This study was supported in part by a contract with NHCRP under the IDEAS program.
The authors would like to thank Zhiye Lu and Youngseok Joung for their conscientious efforts
in creating semantic ground-truth images. Furthermore, the authors are thankful to Shravan
Ravi, Vinaykumar S. Hegde, and Milind Bhat, who helped prepare the crack and non-crack image database around the University of Southern California campus in Los Angeles.
Chapter 3
CrackDenseLinkNet: A deep convolutional neural network for semantic segmentation of cracks on concrete surface images
3.1 Introduction
Concrete structures undergo cyclic loading throughout their service span, which changes their material properties. Furthermore, phenomena such as fatigue, shrinkage, creep, and corrosion of the reinforcement cause cracks to develop during the life cycle. The development of cracks indicates the deterioration of the structures. Early detection and maintenance of concrete structures can prevent fractures and other catastrophic disasters. Current inspection standards require trained personnel to visit and visually inspect the concrete structures periodically and prepare a report on the condition of the structures. However, this process is entirely manual, tedious, and subject to the inspector's experience. The last two decades have seen extensive research on developing tools and techniques for the condition assessment of structures using vision-based methods, where conventional and sophisticated filter-based image processing and computer vision methods are utilized to detect, segment, and quantify the cracks [293, 274]. Additionally, decision-making systems based on machine learning were also developed by using conventional classifiers like ANN, SVM, k-NN, and ensemble learning methods (random forest and adaptive boosting).
Conventional classifiers used in the prior work relied on hand-crafted features (also known as feature-engineering) to aid learning. However, in classical image-based crack segmentation methods, designing the appropriate features was time-consuming and compromised the detection
accuracy due to the irregular shapes of the textural and background noise and the varying crack thicknesses and shapes. In the past 6-7 years, DCNNs have revolutionized a wide range of classification, object detection, and segmentation research communities [217]. Since 2017, DCNNs have been used overwhelmingly in the civil engineering community to solve challenging defect-recognition problems of civil infrastructures [34, 38, 285, 39, 40], because of the autonomy in feature learning rather than feature-engineering [147]. As a downside, DCNNs require a large amount of data to prevent overfitting [141]. Crack recognition problems can be classified into three categories: classification, detection, and semantic segmentation. Prior work focused on all three categories, whereas the semantic segmentation of cracks is the more challenging and interesting one. Compared to the conventional filter-based image processing and computer vision methods, DCNNs have demonstrated superior accuracy on segmentation tasks.
Prior works have extensively used DCNNs to classify, detect, and semantically segment structural cracks. Most of these works relied on well-established networks like AlexNet [141], VGG16, VGG19 [226], GoogLeNet [233], and the Residual Network (ResNet) [104] as their encoder to obtain the feature maps [34, 165, 273, 267, 85]. These networks have tens to hundreds of millions of trainable parameters and are computationally expensive. In this work, a deeper end-to-end encoder CNN, DenseNet, is used as the encoder network, with skip connections to every other layer for better gradient flow. Thus, it produces better feature maps for the crack segmentation network. Furthermore, a modified LinkNet with five decoder blocks is used as the decoder network. The proposed network is referred to as ‘CrackDenseLinkNet’ in this work. CrackDenseLinkNet reduces the trainable parameters to 19.15 million, speeding up the training process without compromising the segmentation accuracy. In addition, a compound focal and dice loss was used to counter the class imbalance problem encountered in the crack segmentation problem.
3.1.1 Review of the literature
This section details the available literature in crack recognition problems into three categories:
classification, detection, and semantic segmentation, primarily on the concrete surface images
using the DCNNs. Furthermore, extensive literature is available on the DCNN-based crack
recognition problems on the pavement distress and steel surface images. Eventually, this work
only focuses on the concrete surface, and other material types are not considered. In addition,
traditional crack segmentation approaches based on the image processing techniques and
computer vision methods are extensively discussed in the Chapter 2.
3.1.1.1 Crack classification
Cha et al. [34] used a CNN similar to AlexNet for the classification of cracks on concrete surface images under various illumination conditions. Gao and Mosalam [85] used transfer learning to overcome the problem of a small dataset and used the Structural ImageNet dataset with transfer-learned weights from a pre-trained VGG network to classify different structural defects. Li and Zhao [156] employed a modified AlexNet model for the task of crack detection on concrete surfaces. To improve the generalization capability of the network, a large 60K-image dataset was used in training. The network was validated on unseen data, and the proposed model showed high validation accuracy. Le et al. [146] used a CNN model to classify and detect crack fractures on concrete surfaces. Zhang and Yuen [287] proposed a crack classification network by fusing a feature-based broad learning system. Rao et al. [207] compared multiple CNN architectures for automated crack detection on concrete surfaces. Kyal et al. [144] detected cracks on concrete surfaces by using CNN features and a random forest classifier.
3.1.1.2 Crack detection
Chen and Jahanshahi [39] used a fusion of the naive Bayes method and a CNN known as Tubelets for crack detection in nuclear reactor surface image frames extracted from video. Kim and Cho [133] employed an AlexNet-based network to semantically segment cracks in concrete images. The parameters of the network were fine-tuned to achieve the best results. Kim et al. [134] used conventional image processing techniques along with CNNs to classify crack and non-crack regions in a concrete image. The proposed method uses Sauvola's binarization to identify candidate crack regions in the image and then employs a CNN with SURF features to classify the candidate region as crack or non-crack. Park et al. [195] proposed a two-stage approach for the detection of cracks in images obtained from black-boxes in cars. In the first stage, an FCN is used to extract only the road, followed by a patch-based detection of cracks using a customized 16-layer CNN module for crack detection. The results indicated high segmentation accuracy for road extraction when compared to crack detection. Zhang et al. [286] proposed a multi-staged patch-based segmentation of cracks in concrete images using the SegNet model. In the first step, an adaptive sliding window with a Sobel edge detector followed by non-maxima suppression is used to localize patches with cracks, which are passed on to the SegNet CNN model for pixel-wise segmentation. Jiang and Zhang [122] proposed a framework in which a UAV captured images and streamed them to a smartphone, which acts as the computing device for processing these images and detecting cracks. For real-time detection, a trained SSDLite-MobileNetV2 was deployed on the smartphone to detect and segment the cracks. Finally, the detected crack features are quantified. Deng et al. [59] employed the FasterRCNN model for detecting cracks in concrete images and differentiating them from pre-existing handwriting scripts on the concrete surface created during onsite inspection. The model was compared with YOLOv2, and the results indicate that FRCNN clearly outperforms YOLOv2. Park et al. [196] combined a laser and a vision sensor for detecting and quantifying cracks. The vision sensor is used to capture images, and the images are passed to a YOLO architecture with a darknet backbone
for crack detection. Two calibrated lasers are projected on the surface of the concrete during image capture, and they act as useful markers for accurately estimating the crack features. Zhang et al. [282] employed a single-shot YOLOv3 model for real-time detection of four classes of concrete damage. The authors used other geometrically similar datasets with many samples to pretrain the network and then trained for crack detection to achieve faster convergence. Transfer learning coupled with batch normalization and focal loss was shown to yield significant performance gains.
3.1.1.3 Crack semantic segmentation
Yang et al. [273] used an end-to-end fully convolutional network to segment concrete surface crack images. Li et al. [159] used LeNet-5 with post-processing; their loss function included the Fisher criterion, which is useful when data is limited. Dorafshan et al. [60] compared CNNs with edge detection techniques (spatial and frequency domain). Ni et al. [190] employed a combination of GoogLeNet and ResNet to semantically segment cracks at different scales. After processing the semantic crack output, a novel Zernike moment operator was used to estimate the width of the cracks, which is paramount for quantifying the crack. Xu et al. [266] proposed a modified CNN architecture for semantic segmentation of cracks in bridges, adding multilevel and multi-scale feature extraction capability to Fusion CNNs by making use of bypass stages (skip connections) inside the network. Ni et al. [189] employed GoogLeNet for detection and semantic segmentation of cracks in concrete images, and experimented with different feature fusion techniques to achieve the best performance. Ye et al. [276] proposed a novel CNN architecture called CiNet, a 16-layer model trained to detect structural defects. The CNN model was compared against traditional edge detection techniques and outperformed the conventional edge detectors. Hoskere et al. [106] utilized deep network architectures, a 23-layer ResNet as the damage segmenter and a modified VGG19 network as the damage classifier, to classify and segment cracks on post-earthquake structures.
Choi and Cha [49] proposed a fast, small CNN model called SDDNet that employs an
encoder-decoder strategy to provide pixel-level segmentation of cracks in the image. The
encoder-decoder module coupled with densely connected separable convolution, atrous spatial
pyramid pooling, and modified IoU loss provides high accuracy segmentation in real-time.
Lee et al. [148] proposed an encoder-decoder-based CNN model called CSN for segmenting
cracks under varied illumination conditions. A novel data augmentation technique that uses
a 2D Gaussian kernel to mimic a thick crack and Brownian motion to define the shape
of the crack was proposed. The CSN model was shown to detect cracks even in complex
environments robustly. Liu et al. [165] proposed a CNN architecture based on a Holistically-
Nested Edge Detection network to segment the crack images. Chen et al. [42] proposed a
switching module called SWM that is aimed at reducing the computational complexity while
training encoder-decoder networks. The SWM module is a binary classifier that detects
crack and non-crack images based on features extracted by the encoder. Only features with
cracks are passed to the decoder for segmentation, thereby integrating a two-stage pipeline
of classification and segmentation into a single efficient framework. The proposed module
shows reduced computational complexity when deployed in both U-Net and DeepCrack
architectures. Wang et al. [248] proposed a multi-layer ELM-based feature extractor that
is composed of a sparse ELM auto-encoder for hierarchical feature extraction, followed by
an incremental ELM classifier to differentiate between crack and non-crack features. Jang
et al. [121] presented an automated decision-making system to identify structural defects
in concrete images by making use of a transfer learned GoogLeNet architecture. Liu et al.
[168] employed encoder-decoder network U-Net with a focal loss for segmentation of cracks
in concrete images. The U-Net is compared with other DCNN models and was shown to
yield higher accuracy. Dung et al. [63] proposed a crack detection method based on Fully
Convolutional Networks for semantic segmentation on concrete crack images. Miao et al.
[179] proposed a crack detection method based on a modified U-Net model, which comprises
squeeze-and-excitation blocks and a residual block for semantic segmentation on concrete
crack images. The residual block allowed an easy flow of gradients, and squeeze-and-excitation is used for determining feature weights. Chen and Jahanshahi [40] detected cracks in metallic surfaces of nuclear power plants using a novel NB-FCN approach that is capable of detecting cracks in real-time. The scores from multiple frames are fused into one using a parametric data fusion scheme that yielded high precision in contrast with other methods. Drouyer [61] proposed a simple U-Net trained on multiple datasets to segment the crack images.
Lee et al. [149] proposed a framework for simultaneous detection of crack and crack
length by making use of shape-sensitive kernels called CK kernels along with a VGG-16
backbone. The rectangular CK kernels are used to represent the cracks better, and the
maximum length of a crack within the kernel is estimated using a distance transform. Zhang
et al. [284] proposed a labor-free structure library created using cycle-GANs, and crack
detection is formulated as a domain adaptation problem with the introduction of OneClass
discriminator with a large FOV, used to discriminate paths with and without cracks. Mei and
Gül [177] proposed a novel CNN architecture, DenseCrack201, to achieve pixel-level crack segmentation, which improved the segmentation performance. A DFS-based post-processing algorithm is used to remove small clusters of connected components in the segmented output.
Alipour and Harris [9] aimed to study how the type of material influences the accuracy of
crack detection and questions the generalization capability of CNNs based on data used
for training the CNN. The authors used ResNet for analysis and present three techniques -
joint training, sequential learning, and ensemble learning - which can be used to overcome
the dependence on prior knowledge of materials under different scenarios. Ren et al. [211]
presented a novel architecture called CrackSegNet with a modified VGG-16 backbone working
as the encoder, atrous convolution for increased image resolution, spatial pyramid pooling
for multi-scale image fusion, and skip connection to the decoder for better segmentation.
The results showed that the proposed network outperforms the well-known UNet model in
the task of crack segmentation. Mei et al. [178] proposed DenseNet201 - a dense encoder-
decoder network for semantic segmentation of cracks in pavements. A novel loss function
that considers eight neighbor connectivity for a pixel is employed to overcome the problem
of convolutional upsampling. The loss function coupled with skip connections improved the
overall segmentation accuracy.
Qu et al. [205] proposed a two-stage approach for solving the problem of pavement crack
segmentation. In the first stage, the modified LeNet-5 network is used to classify crack
and non-crack images. The crack images are then passed into a modified VGG-16 network
for fine-segmentation of cracks in the input image. Kalfarisi et al. [125] employed the well-
known FRCNN with SRFED and MaskRCNN to address the problem of crack detection
and segmentation. In the first approach, an FRCNN is used to identify cracks (the region
of interest), and these regions are passed to a structured edge detector for delineation of
cracks. In contrast to the previous approach, MaskRCNN, a single framework for both
crack identification and segmentation, outperformed FRCNN-SRFED. Further, a qualitative
assessment is performed in 3D using photogrammetry mesh-molding technology. Li and Zhao
[157] used a DenseNet-121 and an upsampling decoder network for the crack segmentation
on concrete surface images. Li et al. [154] proposed U-CliqueNet, a modified U-Net with each
convolutional and deconvolutional block replaced with a clique block. The dual passage of
information through the clique blocks helped speed up extraction of crack-relevant features
and led to the accurate segmentation of cracks in the image. The proposed network was
shown to outperform other popular encoder-decoder architectures.
Feng et al. [68] proposed a novel CDDS network - a modified SegNet with inspirations from
VGG and DeepCrack architectures. The proposed network was trained to detect and segment
cracks on walls of dams and was shown to give better performance than other well-known
segmentation architectures. Li et al. [153] proposed a novel NB-FCN model that is composed
of an FCN module for image feature extraction and a Naive Bayes decision-making module
that determined the presence of cracks using n feature slices extracted by the FCN. The model
was trained on 72,000 images of bridges and demonstrated significant performance scores on
this dataset. Kang et al. [127] proposed a three-stage approach - detection, segmentation,
and quantification of cracks in the image. For detection, a FasterRCNN network trained
on a custom dataset was used. A modified TuFF algorithm that used CLAHE for contrast
adjustment and the Hessian matrix for evolving the level set function was proposed to segment
the crack semantically. Finally, a modified distance transform is employed to determine the
length and thickness of cracks. Huang et al. [108] proposed a generative model in place
of sparsity regularization that can effectively capture a low dimensional representation of
images, and cracks from the compressed images are recovered automatically. Chen and
Jahanshahi [38] adapted a rotation invariant deep fully convolutional network for pixel-level
crack segmentation of concrete and pavement surface images. Yang and Ji [272] used U-Net++
and deep transfer learning for segmentation of concrete surface images.
3.1.2 Contribution
Many prior semantic segmentation networks used AlexNet, VGGNet, and ResNet architectures as the encoders. Although the AlexNet and VGGNet architectures have considerably fewer convolutional layers than ResNet (which has more than 150 convolutional layers), their numbers of trainable parameters are more than 60 and 144 million, respectively, in contrast to ResNet's 36.8 million. The skip connections in the ResNet model assist in a better flow of the gradients through the deeper network during backpropagation. It has been shown that deeper networks deliver better feature maps than shallow networks. Based on this principle, a DenseNet architecture [107], which is deeper and at the same time has around 12.5 million parameters, is chosen as the encoder model. Additionally, the decoder network is equally important for accurate semantic segmentation of the cracks. Thus, an efficient modified LinkNet with a decoder depth of five and 6.6 million parameters was adopted in this work. The total number of trainable parameters is less than 19.2 million; therefore, the training time for 200 epochs was around 8.25 hours. Lastly, prior methods used binary cross-entropy as the loss function. This loss function suffers from the serious class imbalance (there are many more background pixels than actual cracks) predominant in the crack semantic segmentation problem. To overcome
this, a focal loss, which can be seen as a variation of the binary cross-entropy loss, and a dice loss, which calculates the similarity between two images or entities (ground-truth and predicted cracks), are adopted in this work. Both loss functions work efficiently in reducing the predominant class imbalance problem. In this work, an end-to-end semantic segmentation network, CrackDenseLinkNet, with fewer trainable parameters is proposed for the crack segmentation task.
3.1.3 Scope
Section 3.2 discusses the convolutional neural network preliminaries and backpropagation
mathematical model. Section 3.3 introduces the proposed end-to-end deep convolutional
neural network architecture and loss function. Section 3.4 discusses the dataset preparation,
architecture details, detailed experimental results and the comparison to the state-of-the-art
segmentation models. Lastly, Sections 3.5 and 3.6 conclude the current work and explain the future work for improvement.
3.2 Convolutional neural network preliminaries
Convolutional Neural Networks (CNNs) are one of the most popular types of neural networks, mainly used for high-dimensional data (e.g., images and videos). Combined with other layers in the network, a CNN can be used as a dimensionality reduction tool for a large pool of classification, object detection, and semantic segmentation problems in images or videos. A CNN learns an image's or pixel's underlying classification model based on the convolution operation. Unlike a conventional neural network, a CNN uses filters of two or more dimensions to convolve the given data and find the underlying pattern. The CNN filters incorporate the learned spatial context of an object (image or video) by having a spatial shape similar to the input object's features [132]. These filter shapes are hyperparameters and are provided by a user before the training process. By sharing parameters, the number of learnable variables can be reduced significantly
across the layers of a CNN. This section briefly describes the CNN building blocks used for constructing CrackDenseLinkNet, the proposed semantic segmentation network.
3.2.1 Network layers
Network or CNN layers are the basic building blocks of a CNN architecture. A basic CNN can consist of a few convolutional, nonlinearity, pooling, and fully connected layers to form a complete network. Pre-processing of the input data is the first step before the network. Basic pre-processing techniques like mean-subtraction, normalization, and Principal Component Analysis (PCA) whitening are commonly used before the first layer, called the input layer (input image); a brief sketch of the first two steps is given after the list below.
• Mean-subtraction: The mean of the training samples (e.g., the color-channel mean values) is estimated, and the training/testing input images are zero-centered by subtracting the per-channel mean values from the image's corresponding channel values.
• Normalization: The mean-subtracted input data values (pixels) are divided by the standard deviation of each input image channel, calculated on the training dataset. This ensures that the values are normalized to unit variance.
• PCA whitening: PCA whitening reduces the correlations between different image
dimensions (channels) by independently normalizing them. First, the covariance matrix
is calculated between image channels of zero-centered data. Next, the covariance matrix
is decomposed using the Singular Value Decomposition (SVD) algorithm, and the
decorrelated data is projected onto the eigenvectors found via SVD. Lastly, each channel
of the image is normalized by dividing by its corresponding eigenvalue.
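As a minimal sketch of the mean-subtraction and normalization steps (assuming NumPy; the per-channel statistics and the sample image below are placeholder values, not the statistics of the datasets used in this work):

import numpy as np

# Per-channel mean and standard deviation estimated on the training set
# (placeholder values for illustration).
channel_mean = np.array([0.485, 0.456, 0.406])
channel_std = np.array([0.229, 0.224, 0.225])

def preprocess(image):
    """Zero-center each color channel and scale it to unit variance."""
    image = image.astype(np.float64) / 255.0   # bring pixel values into [0, 1]
    image = image - channel_mean               # mean-subtraction per channel
    return image / channel_std                 # normalization per channel

sample = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
print(preprocess(sample).shape)                # (512, 512, 3)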
After pre-processing the image dataset, each training and testing sample is fed into the CNN. The input of the convolutional layer is convolved, and the resulting output becomes the input to the batch normalization, nonlinearity, pooling, and other layers. This section presents brief concepts of these building blocks and their functionality in the CNN architecture.
3.2.1.1 Convolutional layers
A convolutional layer is the most critical layer of a CNN architecture and its most computationally expensive component. It comprises a set of convolutional filters or kernels convolved with a given input image to generate an output feature map. Each convolutional filter is a grid of numbers (usually a square filter, e.g., 3 × 3, is considered when the image is square, for the convenience of producing a feature map of larger size; in general, a convolutional filter can be of any positive integer size). The weights of the filters are randomly initialized and learned during the training procedure based on the spatial context of the image objects. The convolution operation can be represented as follows,
[Figure: a 3 × 3 kernel is convolved with 3 × 3 subarrays of a 6 × 6 single-channel input at stride 1, and a bias is added, producing a 4 × 4 output; output size = (I − R)/S + 1, where I is the input size, R the receptive field size, and S the stride.]
Figure 3.1: A convolution operation in a layer.
I_{out}(i, j, k) = \sum_{m=1}^{k} \sum_{n=1}^{k} I_{in}(i - m, j - n, l)\, w_k(m, n, l) + b_k, \qquad (3.1)

where I_{out} and I_{in} are the output and input feature maps of the convolutional layer, and w_k and b_k are the weights and bias for the layer l. Here the kernel shape is assumed to be square.
Figure 3.1 shows the convolution operation on a 6 × 6 single-channel image array. A pre-defined filter (bottom subarray) of size 3 × 3 is convolved with a subimage of size 3 × 3, resulting in a receptive field of size 3 × 3. The same kernel is slid over 3 × 3 subarrays via a raster scan of the image. The number of pixels skipped while sliding the kernel is called the stride of the kernel. The output of the convolutional layer is a feature map, and its size is given by (I − R)/S + 1, where I is the image size, R is the size of the receptive field, and S is the stride size of the kernel. As in Figure 3.1, the output size is 4 × 4. The bias can be added to produce the output feature map.
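As a minimal sketch of this strided convolution and the output-size relation (assuming NumPy; the random input and kernel below are placeholders, and the loop uses the cross-correlation form commonly used in CNN libraries):

import numpy as np

def conv2d_single_channel(image, kernel, stride=1, bias=0.0):
    """Valid 2-D convolution of a single-channel image with a square kernel."""
    I, R = image.shape[0], kernel.shape[0]
    out_size = (I - R) // stride + 1                 # output size = (I - R)/S + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + R, j * stride:j * stride + R]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

image = np.random.randn(6, 6)                        # 6 x 6 single-channel input
kernel = np.random.randn(3, 3)                       # 3 x 3 kernel, stride 1
print(conv2d_single_channel(image, kernel).shape)    # (4, 4), as in Figure 3.1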
3.2.1.2 Non-linearity
Generally, convolutional and fully connected layers are often followed by a non-linear activation (or a piece-wise linear) function. The non-linear activation function maps a real-valued number to the range [0, 1] or clips the negative values. The non-linear function plays an essential role in learning the non-linear mappings of the convolutional and fully connected layers. Without this, the learned weight layers will behave as a linear mapping. The activation functions used in deep learning models are differentiable to enable error backpropagation. The two activation functions used in CrackDenseLinkNet are given below:
• Sigmoid: In a sigmoid activation function, a real number is converted to a number in the range [0, 1]. The sigmoid is used as the last layer of a decoder network. Thresholding the sigmoid values produces a binary map of the input image. It is defined as:

f_{sigmoid}(x) = \frac{1}{1 + \exp(-x)}, \qquad (3.2)

where x is the input signal.

• Rectified Linear Unit (ReLU): It is a computationally fast and straightforward activation function, where the input is mapped to 0 if it is negative and left unchanged if it is positive. ReLU is used after the convolution operation, and it can be represented as follows:

f_{ReLU}(x) = \max(0, x), \qquad (3.3)

where x is the input signal.
3.2.1.3 Pooling layers
Pooling layers are used to down-sample the feature maps to a compact feature representation that is invariant to changes in scale, pose, and translation. A pooling layer operates on a block of the feature maps and combines the feature map activations. A pooling function such as the maximum (max) or mean/average can be used for the down-sampling. Similar to the convolution layer, the size of the pooling region and the stride have to be specified.
[Figure: max and mean pooling of a 7 × 7 single-channel input with a 3 × 3 pooling window and a stride of 2, each producing a 3 × 3 output; output size = (I − P)/S + 1, where I is the input size, P the pooling size, and S the stride.]
Figure 3.2: An example of the maximum and average pooling operations in a layer.
Figure 3.2 shows the max and mean pooling operations on a 7 × 7 image array of a single channel. A pre-defined pooling filter of size 3 × 3 is considered with a stride of 2 pixels. The maximum or mean operation is performed within the pooling filter size, producing the resulting down-sampled output. The output size of the pooling layer is given by (I − R)/S + 1, where I is the image size, R is the size of the pooling region (receptive field), and S is the stride size of the kernel. As in Figure 3.2, the output size is 3 × 3 for both the max and averaged feature map images.
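A minimal sketch of these pooling operations, assuming NumPy (the random input is a placeholder):

import numpy as np

def pool2d(image, size=3, stride=2, mode="max"):
    """Max or mean pooling of a single-channel image; output size = (I - R)/S + 1."""
    I = image.shape[0]
    out_size = (I - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

image = np.random.randn(7, 7)                           # 7 x 7 single-channel input
print(pool2d(image, size=3, stride=2).shape)            # (3, 3), as in Figure 3.2
print(pool2d(image, size=3, stride=2, mode="mean").shape)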
3.2.1.4 Fully connected layers
Fully connected layers are used to squash two- or more-dimensional feature maps into one-dimensional vectors. A fully connected layer is essentially a convolution layer with filters of size 1 × 1. Each of these filters is densely connected to all the units of the previous layer; therefore, the number of trainable parameters is extremely large. Fully connected layers are typically placed toward the end of the architecture. Mathematically, the fully connected operation is represented by a matrix multiplication followed by a bias addition and an element-wise nonlinear operation of an activation function. It is given by,
y = f(W^T x + b), \qquad (3.4)

where x and y are the vectors of input and output activations, respectively, W denotes the weight matrix between the layer units, and b is the bias term.
3.2.1.5 Transposed convolution layer
The standard convolutional layer maps the input to a lower dimension as the network goes
deeper and creates an abstract representation of an input image. This feature of the CNN
is useful in dimensionality reduction for classification problems, where only the abstract
representation is sufficient to predict the class labels. In contrast, for the semantic localization
of each pixel, the down-sampled feature maps do not provide high-resolution details for the
accurate segmentation of objects (cracks). Thus, to overcome this problem, a transposed
convolution layer can be utilized.
[Figure: a 2 × 2 input feature map is up-sampled by a 2 × 2 transposed-convolution kernel; each input element scales the kernel, and the resulting 3 × 3 blocks are summed with overlap to produce the 3 × 3 output.]
Figure 3.3: A typical transposed convolution operation in a layer.
A transposed convolution layer is equivalent to a convolution layer, but in the opposite direction, as in a backward pass during backpropagation. Figure 3.3 shows a typical transposed convolution operation in a layer where an input feature map of size 2 × 2 is up-sampled to 3 × 3 by a transpose convolution kernel of size 2 × 2. Each element in the feature map is multiplied by the transpose convolution kernel, and the resulting block is stored in a 3 × 3 matrix. After the element-wise multiplication, the resulting blocks are assembled by adding the four 3 × 3 matrices. This results in an up-sampled feature map. The output size of the transpose convolution layer is given by,
I'_y = S(\hat{I}_y - 1) + R - 2P + (I_y - R + 2P) \bmod S, \qquad (3.5)

I'_x = S(\hat{I}_x - 1) + R - 2P + (I_x - R + 2P) \bmod S, \qquad (3.6)

where I_y and I_x are the spatial dimensions of the input in the equivalent forward convolution, \hat{I}_y and \hat{I}_x denote the input dimensions without any zero-padding, R is the size of the receptive field (transpose convolution kernel), P is the padding, S is the stride, and mod is the modulo operator. Lastly, I'_y and I'_x are the output sizes of the transpose convolution layer in the y and x directions.
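As a minimal illustration of this up-sampling behavior (a sketch assuming the PyTorch library; the layer sizes below are illustrative and not the exact CrackDenseLinkNet configuration), a transposed convolution with a 4 × 4 kernel, stride 2, and padding 1 doubles the spatial resolution of a feature map:

import torch
import torch.nn as nn

# Transposed convolution that doubles the spatial resolution:
# output = S*(input - 1) + R - 2P = 2*(16 - 1) + 4 - 2*1 = 32
upsample = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                              kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 64, 16, 16)     # a 16 x 16 feature map with 64 channels
y = upsample(x)
print(y.shape)                     # torch.Size([1, 32, 32, 32])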
3.2.2 Batch normalization
Batch normalization normalizes the mean and variance of the output activations of a CNN layer to a unit Gaussian distribution. It reduces the “covariance shift” (the change in the distribution of activations of each layer during the training phase) of the layer activations during the training process of the deep CNN. If the distribution keeps changing during the training process, the computation time for the convergence of the network increases. Normalization of this distribution produces a consistent activation distribution in subsequent CNN layers during the training process. Furthermore, it speeds up the convergence and reduces network instability issues like the vanishing/exploding gradients that commonly occur in larger networks when the gradients do not flow during backpropagation. In addition, it reduces activation saturation, where the activation neurons fire equally for all elements or do not fire at all. Batch normalization is applied after the CNN layers and before the nonlinear activation function. It can be integrated into an end-to-end network because its computations over a batch are differentiable, and it is implemented as a CNN layer.
Consider a set of activations \{x^i : i \in [1, m]\} corresponding to an input batch of m images, where x^i = \{x^i_j : j \in [1, n]\} has n dimensions from a CNN layer. The mean and variance of the batch for each dimension of the activations are given by,

\mu_{x_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j, \qquad (3.7)

\sigma^2_{x_j} = \frac{1}{m} \sum_{i=1}^{m} \left( x^i_j - \mu_{x_j} \right)^2, \qquad (3.8)

where \mu_{x_j} and \sigma^2_{x_j} are the mean and variance for the j-th activation dimension computed over a batch, respectively. The normalized activation operation is given by,

\hat{x}^i_j = \frac{x^i_j - \mu_{x_j}}{\sqrt{\sigma^2_{x_j} + \epsilon}}, \qquad (3.9)

where \epsilon is a small tolerance added for numerical stability.
Plain normalization of the activations can alter them and disrupt the useful patterns that are learned by the network. Therefore, the normalized activations need to be rescaled and shifted to allow them to learn useful discriminative representations [132], given by,

y^i_j = \gamma_j \hat{x}^i_j + \beta_j, \qquad (3.10)

where y^i_j are the output activations, and \gamma_j and \beta_j are learned during the error back-propagation process.
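A minimal sketch of the batch-normalization computation in Eqs. (3.7)-(3.10), assuming NumPy (the batch of activations and the initial gamma and beta values are placeholders):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of activations x with shape (m, n), following Eqs. (3.7)-(3.10)."""
    mu = x.mean(axis=0)                         # Eq. (3.7): per-dimension batch mean
    var = x.var(axis=0)                         # Eq. (3.8): per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)       # Eq. (3.9): zero mean, unit variance
    return gamma * x_hat + beta                 # Eq. (3.10): learned rescale and shift

x = np.random.randn(8, 16)                      # batch of m = 8 samples, n = 16 activations
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=0).round(6))                  # approximately zero per dimension
print(y.std(axis=0).round(3))                   # approximately one per dimension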
Some of the advantages of batch normalization are: (a) network training becomes less sensitive to the choice of hyperparameters; (b) it stabilizes the network against bad weight initialization and negates the effects of vanishing/exploding gradients; (c) it improves the network convergence rate, thereby decreasing the training time; (d) it provides end-to-end learning by backpropagating the errors through the normalization layers; and (e) it makes the network less dependent on regularization techniques like dropout.
3.2.3 Backpropagation
Back-propagation is the most commonly used method for training multilayer feedforward and convolutional networks [213]. Back-propagation refers to a method for calculating the derivatives of a CNN training loss function with respect to the weights via the chain rule of derivatives. Furthermore, it describes a training algorithm (gradient descent, stochastic gradient descent, or adaptive moment estimation optimization) that uses those derivatives to adjust the weights and minimize the error of a loss function. The loss function is denoted by L, the weights by w_{ij}, where ij represents the layer and neuron index, the activation function by f(\cdot), and the error by E. The error function between the target (t) and true (y) outputs is given by E = L(t, y).
The error or loss function has to be minimized with respect to the weights, w_{ij}. This is done through the gradient,

\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial o_j} \frac{\partial o_j}{\partial w_{ij}}, \qquad (3.11)

where o_j is the output of the neuron j and is given by,

o_j = f\!\left( \sum_{k=1}^{n} w_{kj}\, o_k \right). \qquad (3.12)

f(\cdot) can be a sigmoid or ReLU activation function. To update the weights, w_{ij}, using gradient descent, a learning rate \eta > 0 must be chosen. The final update equation is given by,

\Delta w_{ij} = -\eta \frac{\partial L}{\partial w_{ij}}. \qquad (3.13)
Further details about the analytical expressions above are available in [213, 209]. Due to the complexity of the dimensions, the above equations are presented for a multilayer feedforward network. The same expressions can be expanded to accommodate the convolutions of multi-dimensional data like images with color channels. The traditional neural-network multiplications between weights and inputs are replaced with a convolution operation in a convolutional layer.
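A minimal numerical sketch of one back-propagation and gradient-descent step for a single sigmoid neuron, assuming NumPy; the squared-error loss L = 0.5 (t - y)^2 and all values below are illustrative choices, not the loss or weights used in this work:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])          # inputs o_k to the neuron
w = np.array([0.1, 0.4, -0.2])          # current weights w_kj
t, eta = 1.0, 0.1                       # target output and learning rate (eta > 0)

y = sigmoid(w @ x)                      # neuron output o_j, Eq. (3.12)
dL_dw = (y - t) * y * (1.0 - y) * x     # chain rule of Eq. (3.11) for L = 0.5*(t - y)^2
w = w - eta * dL_dw                     # weight update, Eq. (3.13)
print(w)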
3.3 Methodology
The proposed method includes designing an end-to-end segmentation convolutional neural network using encoder and decoder blocks. A deep encoder block, DenseNet-169 [107], was utilized to obtain the feature maps from the original images. Similarly, for the decoder network, a modified LinkNet [37] with an extra decoder block (there were four in the original LinkNet literature) was used for the semantic segmentation of the crack images. The combination of encoder-decoder blocks is referred to as CrackDenseLinkNet. In the proposed architecture, the encoder part is built through the transfer learning (fine-tuning) process of the DenseNet-169. The fully connected layers and the multi-class prediction softmax layer were discarded. The DenseNet-169 is an efficient CNN encoder network with 169 convolutional layers, batch normalization, non-linearity, and pooling layers. Although DenseNet-169 is a deep network, it has less than 12.5 million parameters.
Furthermore, the DenseNet-169 obtains significant improvements over the other state-of-the-art CNNs and requires less computational time to achieve good performance. In addition, by using the pre-trained weights on the ImageNet data [58] for the DenseNet-169 model, the abstraction of the lower-level features like lines and blobs is preserved well. Additionally, it significantly decreases the training time. For the efficiency of training and testing, the input image is resampled to 512 × 512 pixels and then input into the neural network. The prediction of the CNN is a sigmoid activation layer of the same size as the input image. Lastly, binary crack maps are generated by thresholding the sigmoid layer.
The advantages of the proposed neural network compared to other deep neural networks are: (a) due to the reuse of parameters in multiple layers, fewer parameters are required to train the encoder network efficiently with better performance and segmentation accuracy; (b) dense skip connections help better gradient flow throughout the network, thereby reducing the vanishing/exploding gradients problems, so the encoder network can be trained efficiently; and (c) the input of each encoder layer is bypassed to the output of its corresponding decoder. By doing this, the spatial information lost at the encoders can be recovered effectively, and the upsampling decoder operations produce accurate crack maps. The details of the CrackDenseLinkNet, based on DenseNet-169 and a modified LinkNet, and of the focal and dice loss functions are presented in this section.
3.3.1 An end-to-end encoder-decoder semantic segmentation convolutional neural network
CrackDenseLinkNet is an end-to-end semantic segmentation network where the input image’s
prediction values are converted to a binary crack map. A simplified pictorial representation of
the CrackDenseLinkNet architecture is illustrated in Figure 3.4, and its detailed specifications
are presented in Table 3.1. The complete CrackDenseLinkNet architecture is composed of
two parts: encoder and decoder blocks. The encoder part is comprised of the DenseNet-169,
which contains 169 convolutional, batch normalization, non-linearity, and pooling layers.
The decoder blocks consist of convolutional, transpose convolutional, batch normalization,
non-linearity, identity mapping, and sigmoid layers. In addition, the skip connections
provide a channel to transfer the spatial information of the encoder blocks (abstract feature
maps) after the max and average pooling to the corresponding decoder blocks. In the
encoder part, the DenseNet-169 connects each layer to every other layer in a feed-forward way.
ResNet-based convolutional networks with L layers have L connections (one between each layer and its subsequent layer), but the DenseNet-169 has L(L + 1)/2 direct connections in each Dense Block.
#    Layer / block                                              Kernel (padding, stride)    Output size            Output channels    Params
1    Input                                                      -                           512 × 512              3                  -
2    Convolution                                                7 × 7 (pad 3, stride 2)     256 × 256              64                 9.408k
3    Batch normalization                                        -                           256 × 256              64                 128
4    Max pooling                                                3 × 3 (pad 1, stride 2)     128 × 128              64                 -
5    Dense Block (1): 12 convs (1 × 1: 128 ch; 3 × 3: 32 ch)    -                           128 × 128              256                335.040k
6    Batch normalization                                        -                           128 × 128              256                512
7    Convolution                                                1 × 1 (stride 1)            128 × 128              128                32.768k
8    Average pooling                                            2 × 2 (pad 0, stride 2)     64 × 64                128                -
9    Dense Block (2): 24 convs (1 × 1: 128 ch; 3 × 3: 32 ch)    -                           64 × 64                512                919.68k
10   Batch normalization                                        -                           64 × 64                512                1.024k
11   Convolution                                                1 × 1 (stride 1)            64 × 64                256                131.072k
12   Average pooling                                            2 × 2 (pad 0, stride 2)     32 × 32                256                -
13   Dense Block (3): 64 convs (1 × 1: 128 ch; 3 × 3: 32 ch)    -                           32 × 32                1280               4316.16k
14   Batch normalization                                        -                           32 × 32                1280               2.56k
15   Convolution                                                1 × 1 (stride 1)            32 × 32                640                819.2k
16   Average pooling                                            2 × 2 (pad 0, stride 2)     16 × 16                640                -
17   Dense Block (4): 64 convs (1 × 1: 128 ch; 3 × 3: 32 ch)    -                           16 × 16                1664               5913.6k
18   Batch normalization                                        -                           16 × 16                1664               3.328k
19   Decoder Block (1): 1 × 1 conv, 4 × 4 transposed conv, 1 × 1 conv (each with batch normalization); 416 intermediate channels    16 × 16 → 32 × 32      1280    692.224k, 832, 2.769312M, 832, 532.48k, 2.56k
20   Decoder Block (2): same structure; 320 intermediate channels                            32 × 32 → 64 × 64      512     409.6k, 640, 1.63872M, 640, 163.84k, 1.024k
21   Decoder Block (3): same structure; 128 intermediate channels                            64 × 64 → 128 × 128    256     65.536k, 256, 262.272k, 256, 32.768k, 512
22   Decoder Block (4): same structure; 64 intermediate channels                             128 × 128 → 256 × 256  64      16.384k, 128, 65.6k, 128, 4.096k, 128
23   Decoder Block (5): same structure; 16 intermediate channels                             256 × 256 → 512 × 512  32      1.024k, 32, 4.112k, 32, 512, 64
24   Segmentation head: 1 × 1 convolution, identity, sigmoid    -                           512 × 512              1                  33
     Total trainable parameters                                                                                                       19.151057M

Table 3.1: Detailed specifications of the encoder-decoder architecture of CrackDenseLinkNet.
[Figure: encoder-decoder layout with the legend entries: input layer (image); convolution + batch normalization + ReLU; max pooling; dense block; average pooling; batch normalization; transpose convolution; sigmoid; output layer (binary image); pooling connections.]
Figure 3.4: An encoder-decoder architecture of the DenseLinkNet semantic segmentation
convolutional neural network for concrete images.
The main difference is that ResNet's feature maps are combined through summation before they are passed into a layer, whereas in DenseNet-169, feature maps are combined by concatenating them. Thus, the dense connectivity pattern requires fewer parameters than traditional ResNet-based convolutional networks, as feature maps are reused. Mathematically, the connection expression is as follows:

I_l = H([I_0, I_1, \ldots, I_{l-1}]), \qquad (3.14)

where I_i are the input images (or feature maps), l is the layer index of an overall layer L, H(\cdot) is a composite function of operations such as convolution, batch normalization, ReLU, and max/average pooling, and [I_0, I_1, \ldots, I_{l-1}] refers to the concatenation of the feature maps produced in layers 0, 1, \ldots, l - 1.
In DenseNet-169, the network’s input is an image resampled to 512 512 3, where the
3 refers to the image channel. The input image is passed through a convolutional layer with
a kernel of size 7 7, a depth of 64, and a stride of 2. Next, batch normalization and ReLU
operations are performed on all the feature maps. This is followed by the max-pooling of
82
kernel size 3 3 with a stride and padding of 2 and 1. After these operations, feature maps
of size 128 128 64 are generated. After this, a series of dense blocks are applied. Each
composite dense block includes two convolutions (consisting of 1 1 (element-wise operations)
and 3 3 convolutional layers), two batch normalizations, and two ReLU operations. The
dense block does not alter the length and width of feature maps; rather, the depth increases
due to the transition block with a convolutional layer, batch normalization, ReLU, and
average pooling layer. The output of the last densely connected layer has a size of 16 16.
In the decoder part, a modified LinkNet with five upsampling blocks is cascaded to each
block of the encoder network. First, transposed convolutional layers are applied to upsample
a 16 16 feature map obtained from the first and subsequent max and average pooling
layers. The convolutional layers in an upsampling block are composed of 4 4 convolution
with a zero-padding of 1 and a stride of 2, batch normalization, and ReLU. After transposed
convolutions, each of the feature maps is upsampled by a scale of 2. Finally, the feature
map from different layers is upsampled five times to 512 512. Penultimately, an identity
function is used to have the flow of the gradients from later layers to the earlier layers. Lastly,
the sigmoid activation function is used to obtain the crack class probabilities, and this is
converted to a binary image after thresholding.
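A minimal sketch of an encoder-decoder of this kind, assuming PyTorch and torchvision (version 0.13 or later for the weights argument); this is an illustrative simplification, not the exact CrackDenseLinkNet implementation: the encoder-to-decoder skip connections are omitted, and the decoder channel widths are illustrative rather than the values in Table 3.1:

import torch
import torch.nn as nn
import torchvision

class DecoderBlock(nn.Module):
    """LinkNet-style decoder block: 1x1 conv -> 4x4 transposed conv (2x upsample) -> 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = in_ch // 4
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid, mid, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class CrackSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        # DenseNet-169 feature extractor pre-trained on ImageNet; the fully connected
        # classifier is discarded. For a 512 x 512 input it yields a 1664-channel, 16 x 16 map.
        self.encoder = torchvision.models.densenet169(weights="IMAGENET1K_V1").features
        # Five decoder blocks upsample 16 x 16 back to 512 x 512.
        self.decoder = nn.Sequential(
            DecoderBlock(1664, 512),
            DecoderBlock(512, 256),
            DecoderBlock(256, 128),
            DecoderBlock(128, 64),
            DecoderBlock(64, 32),
        )
        self.head = nn.Sequential(nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))

model = CrackSegmenter()
probs = model(torch.randn(1, 3, 512, 512))    # crack probabilities in [0, 1]
binary_map = (probs > 0.5).float()            # thresholded binary crack map
print(probs.shape)                            # torch.Size([1, 1, 512, 512])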
3.3.2 Loss function
The loss function plays a crucial role in obtaining the best metrics in classification, object detection, or semantic segmentation problems. This work is inspired by medical image segmentation tasks, where the object of interest is small compared to the background, which is essentially a class imbalance problem. Crack segmentation is a perfect example, where the number of crack pixels is on the order of 2-5% of the whole image size (background). Therefore, in this study a focal loss and a dice loss are utilized in a compound fashion to counter the class imbalance problem. The focal loss (L_{focal}) is a variation of the binary cross-entropy loss. It penalizes the contribution of easy examples and rewards the model for learning hard examples [113]. It works well for highly imbalanced class scenarios due to the weighting parameter (1 - p_t)^{\gamma}. The focal loss is given by,
L_{focal}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \qquad (3.15)

where p_t is the estimated probability of the class and is given by,

p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases} \qquad (3.16)
where y is the actual value of the class. Setting \gamma > 0 counters the class imbalance problem, and when \gamma = 0 the focal loss behaves like the cross-entropy loss function. Similarly, \alpha is generally chosen from the range [0, 1]. In this study, \alpha = 0.9 and \gamma = 5 are selected, which produced the best metrics for the crack segmentation problem.
The dice coefficient is a widely used metric in computer vision to calculate the similarity between two images or objects [113]. Similar to the focal loss, the dice coefficient works well for highly imbalanced classes. It is given by,
L_{dice} = 1 - \frac{2yp + 1}{y + p + 1}, \qquad (3.17)

where y and p are the actual value of the class and the prediction probability, respectively. Here, 1 is added to the numerator and denominator to ensure the numerical stability of the function.
In the compound form, the focal loss and dice loss can be expressed as a convex combination,

L_{final} = \beta L_{focal} + (1 - \beta) L_{dice}, \qquad (3.18)

where \beta \in [0, 1]. In this study, \beta = 0.5 was utilized, as varying \beta did not improve the crack segmentation metrics.
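A minimal sketch of the compound loss in Eqs. (3.15)-(3.18), assuming PyTorch; here p is the sigmoid output of the network (values in [0, 1]), y is the binary ground-truth mask, and the default alpha, gamma, and beta follow the values selected above:

import torch

def focal_loss(p, y, alpha=0.9, gamma=5.0, eps=1e-7):
    """Binary focal loss, Eq. (3.15)."""
    p = p.clamp(eps, 1.0 - eps)
    p_t = torch.where(y == 1, p, 1.0 - p)                     # Eq. (3.16)
    return (-alpha * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()

def dice_loss(p, y):
    """Smoothed dice loss, Eq. (3.17), computed over the whole batch."""
    intersection = (p * y).sum()
    return 1.0 - (2.0 * intersection + 1.0) / (p.sum() + y.sum() + 1.0)

def compound_loss(p, y, beta=0.5):
    """Convex combination of the focal and dice losses, Eq. (3.18)."""
    return beta * focal_loss(p, y) + (1.0 - beta) * dice_loss(p, y)

p = torch.rand(2, 1, 512, 512)                                # predicted probabilities
y = (torch.rand(2, 1, 512, 512) > 0.97).float()               # sparse binary crack mask
print(compound_loss(p, y).item())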
3.4 Experimental results and discussion
3.4.1 Dataset preparation
In this study, four datasets were used to evaluate the crack segmentation capability of five CNN-based semantic segmentation networks. The FCN [273], DeepCrack [165], and CrackSegNet [211] datasets are publicly available. In addition, the dataset for the proposed method, CrackDenseLinkNet, is currently private (it will be made public upon publication) and was created around the University of Southern California (USC) campus. The complexity of the images increases gradually from CrackDenseLinkNet to FCN, DeepCrack, and CrackSegNet. The CrackDenseLinkNet dataset consists of 250 testing images of concrete surfaces.
Dataset                              FCN          DeepCrack        CrackSegNet      CrackDenseLinkNet
Image size        Minimum            256 × 256    384 × 544        256 × 256        448 × 252
                  Maximum            334 × 306    544 × 384        512 × 512
Total images      Training           620          300              735              786
                  Testing            154          237              184              250
Material                             Concrete     Concrete and     Concrete         Concrete
                                                  pavement
Texture           High    Training   618          286              731              785
(image number)            Testing    148          216              179              249
                  Low     Training   2            14               4                1
                          Testing    6            21               5                1
Crack width       Training           2 to 327     2 to 247         2 to 56          2 to 354
range (pixels)    Testing            2 to 326     1 to 544         1 to 70          3 to 254
Image type                           Color        Color            Color            Color
Image quality                        High         Medium to high   Medium to high   High

Table 3.2: Materials, texture, crack information, and image attributes of the four training and testing datasets.
In this dataset, the cracks have stronger contrast, more texture, and larger widths relative to the other datasets. About 249 of the 250 images have high textural content. The FCN dataset is primarily of concrete surfaces and consists of 154 testing crack samples that vary in size, of which 6 and 148 images have low and high texture, respectively. In this dataset, crack widths range from thin to thick. Furthermore, it consists of longitudinal, transverse, and surface cracks. Also, the image quality is high, and no special pre-processing was required. However,
it was observed that some of the ground-truth images were wrongly labeled (labels are thicker
than the actual width of the cracks).
Similar to the FCN dataset, the DeepCrack dataset consists of 237 testing samples of
various longitudinal, transverse, and surface cracks. About 78% and 22% of the images have
concrete and asphalt material surfaces [165], respectively. The crack width varies largely
across the images of this dataset. Around 21 and 216 images in this dataset have low and
high textures, respectively. Lastly, CrackSegNet consists of 184 testing images of the concrete
surface, 5 and 179 images of low and high texture, respectively. The crack width of this
dataset is relatively thinner compared to the other datasets. This is the most challenging
dataset to train the semantic segmentation CNN methods, as it contains highly blurry images,
textural noise, and paint artifacts. This dataset was purposefully included to assess the
limitations of all five methods in comparison.
Similar to Chapter 2, GLCM and k-NN methods were employed to cluster the texture of the images in the four datasets into two classes. If the normalized distance to one centroid is smaller than to the other, then the corresponding samples are classified as low texture, and vice-versa. Table 3.2 shows the materials, texture, crack information, and image attributes of the four datasets used in this study. For the preparation of the validation dataset, 10% of the samples of each of the four training datasets was randomly selected.
3.4.2 Architecture and training details
All CrackDenseLinkNet layers are summarized in Table 3.1; this architecture consists of 169 convolutional layers in the encoder. The input layer is the first layer, followed by convolution, batch normalization, ReLU, and max/average pooling layers. In the decoder network, five transpose convolutional layers were utilized for the upsampling of the feature maps. In total, the CrackDenseLinkNet consists of 557 layers of various operational blocks. Lastly, the upsampled transpose convolution output is converted to a crack or non-crack pixel based on the probabilities using a sigmoid function. A threshold in the range of [0, 1] was used for the binary crack map conversion.
The computations were performed on a desktop computer with a 64-bit Ubuntu 20.04 operating system, 128 GB of memory, and an AMD Ryzen Threadripper 2950X 3.5 GHz 16-core processor. In addition, four Nvidia GeForce RTX 2080 Ti Graphics Processing Units (GPUs) were used to speed up the training and testing process in parallel on the four datasets. All five CNN methods were fully trained. The CrackDenseLinkNet was trained with a batch size of eight images. For the optimization, the Adaptive Moment Estimation (ADAM) method was used. The hyperparameters of ADAM are as follows: a learning rate of 10^-5, coefficients of 0.9 and 0.999 for computing the running averages of the gradient and its square, and epsilon = 10^-8 to improve numerical stability. The total number of epochs was set to 200. Furthermore, no scheduler was used to modify or decay the learning rate; it was kept constant throughout the entire training process. The encoder network was loaded with pre-trained ImageNet weights, and the decoder layers were randomly initialized. None of the layers were frozen during training, i.e., backpropagation was carried out through both the encoder and decoder layers. On average, the CrackDenseLinkNet took 8.25 hours of training time for 200 epochs.
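A minimal sketch of this training configuration, assuming PyTorch; the CrackSegmenter model and compound_loss refer to the sketches given earlier in this chapter, and the random tensors below are placeholders for the actual augmented crack dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: replace with the real augmented crack images and masks.
images = torch.randn(16, 3, 512, 512)
masks = (torch.rand(16, 1, 512, 512) > 0.97).float()
train_loader = DataLoader(TensorDataset(images, masks), batch_size=8, shuffle=True)

model = CrackSegmenter()                        # encoder-decoder sketched in Section 3.3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             betas=(0.9, 0.999), eps=1e-8)

for epoch in range(200):                        # 200 epochs, constant learning rate (no scheduler)
    for x, y in train_loader:                   # batches of eight images
        optimizer.zero_grad()
        loss = compound_loss(model(x), y)       # compound focal + dice loss of Section 3.3.2
        loss.backward()                         # backpropagation through encoder and decoder
        optimizer.step()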
Lastly, deeper networks like CrackDenseLinkNet require a large amount of data to train well with the least overfitting. Therefore, the training dataset is augmented with geometric or pixel-based augmentation. In this work, online augmentation is performed to create an effective training dataset of 444,612 images (2,212 training images × 200 epochs + 2,212 original images). Some of the geometric or pixel-based augmentation methods used are random cropping, additive Gaussian noise, shift and rotate, random brightness, blurring, motion blurring, random contrast, etc.
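A minimal sketch of an online augmentation pipeline of this kind, assuming the albumentations library; the specific transforms, probabilities, and crop size below are illustrative choices rather than the exact configuration used in this work:

import numpy as np
import albumentations as A

# Geometric and pixel-based augmentations applied on the fly during training.
augment = A.Compose([
    A.RandomCrop(height=512, width=512, p=1.0),
    A.ShiftScaleRotate(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.GaussNoise(p=0.3),
    A.Blur(p=0.2),
    A.MotionBlur(p=0.2),
])

image = np.random.randint(0, 256, (600, 600, 3), dtype=np.uint8)   # placeholder image
mask = np.random.randint(0, 2, (600, 600), dtype=np.uint8)         # placeholder crack mask
out = augment(image=image, mask=mask)      # the same spatial transform is applied to both
aug_image, aug_mask = out["image"], out["mask"]
print(aug_image.shape, aug_mask.shape)     # (512, 512, 3) (512, 512)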
3.4.3 Training and validation results
Training and validation scores measure how well the network has trained. Furthermore, they provide an overview of the learning process over time. For example, after some epochs, an asymptotically converging trend indicates a well-learned model, while a diverging trend indicates an uninformed model.
[Figure: six panels of training curves over 200 epochs for initial learning rates of 0.00001, 0.0001, 0.0005, 0.001, and 0.005.]
Figure 3.5: Training scores for different learning rates vs. epochs. (a) Focal-Dice losses. (b) Accuracies. (c) IoU scores. (d) Precisions. (e) Recalls. (f) F1-scores.
If the desired converging or downward trend is not observed even after considerable training, the process can be stopped, and better hyperparameters
can be selected. Consequently, by observing the trend in training and validation losses, the
overfitting or underfitting of the network model can be evaluated. Overfitting is observed
when the validation loss is larger than the training loss. Similarly, underfitting is observed
when the training loss is larger than the validation loss. In a well-trained network, the
difference between the training and validation loss will be relatively small.
[Figure: six panels of validation curves over 200 epochs for initial learning rates of 0.00001, 0.0001, 0.0005, 0.001, and 0.005.]
Figure 3.6: Validation scores for different learning rates vs. epochs. (a) Focal-Dice losses. (b) Accuracies. (c) IoU scores. (d) Precisions. (e) Recalls. (f) F1-scores.
challenging to obtain identical trends for these losses. Therefore, graphical observation of the loss curves is sufficient to assess the learning process.
Generally, deep CNNs are susceptible to overfitting due to many trainable parameters
that need to be learned and relatively fewer training samples. Therefore, a data augmentation
procedure is often employed. Another critical parameter that affects the training and
validation scores is the batch size of input images. If the batch size is small, the network
generalizes well due to the regularizing effect (noise added to the learning process) offered by
small-batches [94] and vice-versa in some cases. Similarly, the learning rate will affect the
training and validation scores and convergence speed during CNN training. For example,
a larger learning rate might converge faster but jump around the minima, while a smaller
learning rate can take too long to converge.
To choose the best initial learning rate for the CrackDenseLinkNet, the initial learning
rates were set to 0.00001, 0.0001, 0.0005, 0.001, and 0.005 while training the network. The
CrackDenseLinkNet was trained for 200 epochs (55,300 iterations) under different initial learning rates with a batch size of eight images. Figures 3.5 and 3.6 show the training and validation losses and scores for the various learning rates. The training and validation loss curves show that the network suffered an overfitting problem even with 444,612 augmented images. The losses and scores for the learning rates 0.00001 and 0.0001 converge slowly and quickly, respectively. Around 70 epochs, the loss values for all learning rates stabilize in both the training and validation processes. It is worth noting that the training curves are smoother because they are averaged over a larger number of samples; in contrast, because there are fewer validation samples, the validation curves are jagged.
3.4.4 Crack segmentation on real-world datasets
In this section, the CrackDenseLinkNet method is evaluated, and its performance metrics are compared against four state-of-the-art CNN-based crack segmentation methods. Furthermore, the segmentation capabilities of these five methods are assessed on the four real-world datasets (refer to Section 3.4.1) using a complete end-to-end training procedure. Therefore, transfer learning or fine-tuning was not considered. The four state-of-the-art methods used in comparison are
FCN [273], DeepCrack [165], CrackSegNet [211], and FPHB [267]. The complete training procedure described in their original publications was followed for all four comparison methods. For FCN, 150 epochs were used for training, based on a pretrained VGG19. As for the hyperparameters, the initial learning rate was $10^{-4}$, and the batch size was 2. The input images were resampled to 224 × 224 pixels. While training the DeepCrack method, 400 epochs were used, as the model converges within that range. The training images were resampled to 256 × 256 and resampled back to the ground-truth image size for testing. After 300 epochs, the learning rate decays by $4 \times 10^{-4}$. The model was trained using a custom weighted cross-entropy loss function with a batch size of one image. For CrackSegNet, the mixed dataset was trained for at least 30 epochs as described in the original article. The training images were augmented as that article describes, with a rotation range of 20°, height and width shift ranges of 5%,
Figure 3.7: A comparison of the CNN-based crack segmentation methods. From left to right, the columns show the original color image, the ground-truth, and the outputs of the FCN, DeepCrack, CrackSegNet, FPHB, and CrackDenseLinkNet (CrackDLNet) end-to-end CNNs. Rows 1 to 4, 5 to 8, 9 to 12, and 13 to 16 show images from the FCN, DeepCrack, CrackSegNet, and CrackDenseLinkNet datasets, respectively.
Dataset Method Metrics
GA MI SP PR RE F1
FCN
FCN 0.9660 0.5672 0.9870 0.7858 0.6710 0.7239
DeepCrack 0.9599 0.5875 0.9671 0.6503 0.8589 0.7402
CrackSegNet 0.9794 0.7266 0.9878 0.8267 0.8572 0.8417
FPHB 0.9571 0.5712 0.9640 0.6297 0.8601 0.7271
CrackDenseLinkNet 0.9786 0.7262 0.9874 0.8286 0.8545 0.8414
DeepCrack
FCN 0.9789 0.5902 0.9911 0.7817 0.7066 0.7423
DeepCrack 0.9822 0.6958 0.9837 0.7231 0.9486 0.8206
CrackSegNet 0.9893 0.7759 0.9921 0.8291 0.9237 0.8738
FPHB 0.9757 0.6118 0.9795 0.6613 0.8909 0.7591
CrackDenseLinkNet 0.9896 0.7842 0.9948 0.8831 0.8751 0.8791
CrackSegNet
FCN 0.9852 0.2627 0.9963 0.5912 0.3211 0.4161
DeepCrack 0.9785 0.3385 0.9837 0.4605 0.6694 0.5058
CrackSegNet 0.9814 0.3891 0.9858 0.4583 0.7203 0.5602
FPHB 0.9778 0.3396 0.9826 0.4000 0.6923 0.507
CrackDenseLinkNet 0.9856 0.4750 0.9888 0.5416 0.7943 0.6441
CrackDenseLinkNet
FCN 0.9705 0.7441 0.9827 0.8459 0.8608 0.8533
DeepCrack 0.9626 0.7099 0.9672 0.7562 0.9207 0.8304
CrackSegNet 0.9747 0.7871 0.9500 0.8212 0.9500 0.8809
FPHB 0.9727 0.7722 0.9775 0.8202 0.9296 0.8715
CrackDenseLinkNet 0.9768 0.7959 0.9841 0.8633 0.9107 0.8864
Mean scores of
all datasets
FCN 0.9751 0.5411 0.9892 0.7511 0.6398 0.6839
DeepCrack 0.9708 0.5829 0.9754 0.6475 0.8494 0.7242
CrackSegNet 0.9812 0.6696 0.9789 0.7338 0.8628 0.7891
FPHB 0.9708 0.5737 0.9759 0.6278 0.8432 0.7161
CrackDenseLinkNet 0.9826 0.6953 0.9887 0.7791 0.8586 0.8127
Table 3.3: Semantic segmentation results of the FCN, DeepCrack, CrackSegNet, FPHB, and
CrackDenseLinkNet on the datasets FCN, DeepCrack, CrackSegNet, and CrackDenseLinkNet.
zooming by a range of 5%, and horizontal flipping. The training images were resampled to 512 × 512 and tested using the actual size of the ground-truth images. Lastly, the FPHB method was trained for 40,000 iterations, and its encoder network is a pre-trained VGG16 CNN.
Figure 3.7 shows the segmentation results of the five methods on four datasets. In
this figure, column 1 is the original color images of the four test datasets described in
Section 3.4.1. Columns 3-7 are the binary images of the segmented cracks obtained from
processing the original images of the datasets by FCN, DeepCrack, CrackSegNet, FPHB,
and CrackDenseLinkNet (CrackDLNet) end-to-end CNNs, respectively. The binary image
thresholding method was used to binarize the prediction images. The black and white color
in columns 2-7 represents the crack and background pixels, respectively. It can be seen from
Figure 3.7 that all five methods have segmented the cracks considerably well in most of the images of the four datasets.
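A minimal sketch of the thresholding step, assuming the network outputs a per-pixel crack probability map and using an (assumed) fixed cut-off of 0.5, is shown below.

import numpy as np

def binarize_prediction(prob_map, threshold=0.5):
    # Binarize a per-pixel crack probability map; the 0.5 cut-off is an assumed
    # value, not necessarily the one used in the experiments above.
    return (prob_map >= threshold).astype(np.uint8)

# Example: a small probability map produced by the sigmoid output of a network.
probs = np.array([[0.10, 0.70, 0.90, 0.20],
                  [0.05, 0.60, 0.80, 0.10],
                  [0.00, 0.30, 0.55, 0.00]])
mask = binarize_prediction(probs)   # 1 = crack pixel, 0 = background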
The FPHB method produces the cracks slightly thicker because of the pyramid levels
of the scales. The FCN method missed many thin crack segments in datasets FCN and
CrackSegNet due to the kernel sizes used in the FCN architecture. The DeepCrack and
CrackSegNet methods suffered from the continuities of the cracks in the FCN and CrackSegNet
datasets because some of the images in these two datasets have a low foreground contrast
and blurriness. As shown in rows 13-16, the thicker cracks are segmented well by all five
methods. CrackDenseLinkNet demonstrated precise extraction of crack pixels, without artifacts, in some of the images compared to the other four methods. The FPHB method struggled to detect cracks when darker regions were present, as in row 9. Overall, the
CrackDenseLinkNet segments the cracks well with better continuity than other methods
because of the better feature maps from the encoder network. The second best method is the
CrackSegNet method.
Six different metrics are utilized to evaluate the crack semantic segmentation CNN
methods. Two of them are the common semantic segmentation metrics: pixel accuracy and
Intersection over Union (IoU). The rest are widely associated with the crack segmentation
literature. Metrics details are provided in Section 2.6.2.1. Table 3.3 displays the metrics of
the five methods in comparison. Global accuracy, mean accuracy, and specificity consider
TNs or non-crack pixels.
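For reference, the sketch below shows a generic way of computing these pixel-wise metrics from the confusion counts of a binary prediction and its ground truth; it is a simplified illustration, not the exact evaluation code used in this work.

import numpy as np

def segmentation_metrics(pred, gt):
    # Pixel-wise metrics from binary prediction and ground-truth masks (1 = crack).
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    eps = 1e-12
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    global_accuracy = (tp + tn) / (tp + tn + fp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou,
            "global_accuracy": global_accuracy, "specificity": specificity}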
On the FCN dataset, the F1-scores of CrackDenseLinkNet and CrackSegNet methods
were almost the same. Here, the CrackDenseLinkNet method had the best precision and
comparable recall to the FPHB and CrackSegNet methods. Other methods overestimated a
considerable number of crack pixels. For the DeepCrack dataset, the CrackDenseLinkNet
method performed well with an F1-score of 87.91%, whereas DeepCrack had the highest recall
rate of 94.86%. The second-best performer was the CrackSegNet method with an 87.38%
F1-score. Lastly, for the CrackDenseLinkNet dataset, all the methods performed relatively
well, and the CrackDenseLinkNet method had the best F1-score of 88.64%. Due to the poor
image quality, low contrast, and blurriness of images, the performance of the segmentation
methods on the CrackSegNet dataset was relatively low. The CrackDenseLinkNet method
obtained the best F1-score of 64.41% compared to other methods for this dataset with the
highest recall due to the better feature maps. Overall, all five CNN methods segmented the cracks with good metrics on every dataset except the CrackSegNet dataset. The FCN
method missed many crack segments over some of the images in the four datasets. Overall,
averaged across the datasets, the proposed CrackDenseLinkNet method outperformed the
best CNN method, CrackSegNet, by 2.36% in F1-score. In addition, the CrackDenseLinkNet
method had the best precision, recall, and IoU rates averaged across the datasets. One of
the reasons why CrackDenseLinkNet obtained better binary crack images is the way its feature maps are learned: the combination of small and large receptive fields can segment both thin and thick cracks well, even in the lower-quality images of some datasets. Lastly, it was observed that the learning rate is also
crucial for better training and improved scores.
A CNN learns to abstract the complex spatial features associated with each input image
by learning the fixed size kernels. Visualizing the feature maps provides an insight into
what features have been learned by the encoder and decoder networks. In other words,
precise activation of neurons along the depth of the channels can be visualized. Similarly,
visualization of filter maps also illustrates how the weights of the filters have learned to
distinguish the spatial features of the objects in the input dataset like edges, blobs, and
Figure 3.8: Feature maps at various layers of the segmentation network. (a). First sequential
batch normalization layer after the convolution layer. (b). Last convolution layer of encoder
block 2. (c). Last batch normalization layer after the decoder block 4. (d). Transpose
convolution layer of decoder block 5.
others. Initial layers of the CNN extract these high-level features depending on the number
of filters used. Furthermore, the feature maps and filter weights can also help understand
Figure 3.9: Filter maps or convolution weights at two layers of the segmentation network. (a) Convolution layer 1 (size 7 × 7). (b) Convolution layer in encoder block 1 (size 3 × 3).
the incorrect predictions based on the spatial features that a human can interpret. As the
depth of the CNN increases, the feature maps or filters associated with the higher depths
are more abstract, and it is challenging to interpret the network performance based on the
deeper layers.
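A minimal PyTorch sketch of how such intermediate feature maps can be captured with forward hooks is given below; the stand-in model and layer name are illustrative only and do not correspond to the actual CrackDenseLinkNet layers.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Stand-in model; in practice this would be the trained segmentation network.
model = nn.Sequential(nn.Conv2d(3, 8, 7, padding=3), nn.BatchNorm2d(8), nn.ReLU())

feature_maps = {}

def save_activation(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Register a hook on the layer whose activations are to be inspected.
model[1].register_forward_hook(save_activation("first_batchnorm"))

with torch.no_grad():
    model(torch.randn(1, 3, 256, 256))   # forward pass on one input image

# Plot each channel of the captured feature map as a grayscale image.
maps = feature_maps["first_batchnorm"][0]
fig, axes = plt.subplots(2, 4, figsize=(10, 5))
for ax, fmap in zip(axes.ravel(), maps):
    ax.imshow(fmap.numpy(), cmap="gray")
    ax.axis("off")
plt.show()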
Figure 3.8 illustrates the feature maps obtained from an image of the DeepCrack dataset at
the first sequential batch normalization layer after the convolution layer, the final convolution
layer of encoder block 2, the final batch normalization layer after the decoder block 4, and
transpose convolution layer of decoder block 5. In Figure 3.8a some of the feature maps
display the activations due to the cracks. Furthermore, some of the feature maps have fewer
activated pixels. Similarly, in Figure 3.8b it can be seen how the texture and some crack
pixels of an input crack image are activated in the feature maps. Figure 3.8c shows the activation of the transpose convolutions at decoder block 3. The crack pixels are clear; however, due to the upsampling, some of the pixels in the feature maps are noisy. More
abstraction of the crack pixels is evident in Figure 3.8d at the greater depth of the decoder
block.
Figure 3.9 shows the filter maps at the first convolutional layer of size 7 × 7 and the second convolutional layer in encoder block 1 of size 3 × 3. In Figure 3.9a, the filter weights have
structure due to the presence of high-level features like edges, color blobs, and Gabor filter-like
features. In contrast, Figure 3.9b filter weights do not follow any patterns. This shows that
the high-level features are visually interpretable at the first layer of CNN, and as the depth
progresses, the filter weights lose definitive patterns.
3.4.5 Crack profile analysis
Cross-sectional profile analyses of the proposed CrackDenseLinkNet method and the FCN, DeepCrack, CrackSegNet, and FPHB methods are presented in Figure 3.10. Two crack images from the DeepCrack dataset, shown in Figure 2.14, are utilized. In Figure 3.10a,
Figure 3.10: Cross-sectional profiles (normalized intensity vs. distance from the crack centerline) of two concrete surface images. (a) Thin crack of width 14 pixels; RMSE against the ground-truth profile: FCN 0.192, DeepCrack 0.094, CrackSegNet 0.052, FPHB 0.185, CrackDenseLinkNet 0.110. (b) Thick crack of width 67 pixels; RMSE: FCN 0.069, DeepCrack 0.072, CrackSegNet 0.040, FPHB 0.037, CrackDenseLinkNet 0.073. Better visualization in color.
the thickness profile of a relatively thin crack is presented. The CrackDenseLinkNet method underestimated the width of the thin crack by three pixels in total (one on the right and two on the left) and has the third-highest Root Mean Square Error (RMSE) of 0.110 along the profile line.
This was due to missing pixels along the profile line. The CrackSegNet method had the best RMSE but missed two pixels, yielding a slightly narrower crack width. The FCN overestimated the crack width by two pixels and had the highest RMSE due to the inconsistency in the intensity values relative to the ground-truth image. Similarly, the FPHB method overshot the width by five pixels and had an RMSE of 0.185. Lastly, DeepCrack underestimated the width by five pixels. All methods either overestimated or underestimated the width by at least two pixels on the left or right side of the crack edge. This shows that all methods struggled to segment the thin crack precisely, due either to the convolutional scale issue or to overfitting or underfitting of the networks. It should
be noted that the RMSE is calculated by considering both the X-direction locations of the
pixels and the intensity values of the crack pixels. Therefore, even though the number of
pixels estimated is small or large, if the intensity values are off, the RMSE scores suffer.
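As a simple illustration of such a profile-based RMSE, the sketch below compares a ground-truth and a predicted normalized-intensity profile sampled at the same positions along the line normal to the crack; this is a simplified formulation and may differ in detail from the one used to produce the values above.

import numpy as np

def profile_rmse(gt_profile, pred_profile):
    # RMSE between ground-truth and predicted normalized-intensity profiles
    # sampled at identical distances from the crack centerline.
    gt = np.asarray(gt_profile, dtype=float)
    pred = np.asarray(pred_profile, dtype=float)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))

# Example: a 9-pixel window across a thin crack (1 = crack, 0 = background).
gt   = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])   # one crack pixel missed on the left
print(profile_rmse(gt, pred))                   # ~0.33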
Figure 3.10b shows the cross-sectional profile of all the methods for a thick crack. The FCN, DeepCrack, CrackSegNet, FPHB, and CrackDenseLinkNet methods underestimated the width by 6, 5, 4, 3, and 6 pixels, respectively. Of these, FPHB has the best cross-sectional profile and intensity values. In a nutshell, all the methods performed well in extracting the thick crack. This is because the receptive field kernels of all the methods are large enough to match the crack width; thus, the resulting crack maps are extracted precisely.
Crack width, length, and area are the physical properties that define the severity of
the defect. These are used in visual inspection procedures to rate the condition of civil
infrastructure. Similar to Section 2.6.3, the width of a crack is defined as the distance measured normal to the tangent at each center-line pixel, in both normal directions, averaged across all such points. The length and area of a crack are defined as the length of its skeleton and the total number of pixels it contains, respectively.
Similar to Section 2.6.3, a fast marching method was employed to compute the skeletons of
the cracks in this work [239]. Generally, skeletons possess redundant branches. To overcome
this effect, a skeleton pruning threshold of 10% was used in this study. After skeletonizing
the cracks, widths were measured by counting the white pixels in the normal direction of a
Figure 3.11: Statistics of the estimated cracks' physical properties on the four datasets, expressed as relative errors (%). Rows 1, 2, and 3 represent crack thickness (averaged), length, and area, respectively. Columns 1 to 4 represent the datasets FCN, DeepCrack, CrackSegNet, and CrackDenseLinkNet, respectively. The x-axis of each panel lists the methods FCN, DC (DeepCrack), CSN (CrackSegNet), FPHB, and CDLN (CrackDenseLinkNet). The rectangular boxes represent the range between the 25th and 75th percentiles. The horizontal lines within the boxes denote the median values. The protruding horizontal lines outside the boxes represent the minimum and maximum values. The whiskers are lines extending above and below each box. Mean values are indicated by the small squares inside the boxes. Lastly, the red asterisk symbols are the outliers.
major ellipse axis (tangent) for every three pixels using the approach of [204]. Finally, the
quantified crack thickness was averaged in a small neighborhood of seven pixels to reduce the
effects of outliers.
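A rough sketch of how skeleton-based width, length, and area estimates can be obtained with scikit-image is given below. It uses the medial-axis distance transform as a simpler surrogate for the fast marching, pruning, and orthogonal-traversal procedure described above, so the numbers it produces are only approximations of that pipeline.

import numpy as np
from skimage.morphology import medial_axis

def crack_properties(mask):
    # Approximate width, length, and area of the cracks in a binary mask (1 = crack).
    # Width: twice the medial-axis distance averaged over skeleton pixels.
    # Length: number of skeleton pixels.  Area: number of crack pixels.
    crack = mask.astype(bool)
    skeleton, distance = medial_axis(crack, return_distance=True)
    area = int(crack.sum())
    length = int(skeleton.sum())
    width = 2.0 * float(distance[skeleton].mean()) if length > 0 else 0.0
    return width, length, area

# Example: a synthetic 3-pixel-wide horizontal crack.
mask = np.zeros((20, 60), dtype=np.uint8)
mask[9:12, 5:55] = 1
print(crack_properties(mask))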
Figure 3.11 shows the uncertainty statistics of the relative errors of the cracks' width, length,
and area of the four datasets. For the crack’s width, length, and area on the FCN dataset,
the CrackDenseLinkNet method performed well. On the other hand, FPHB overestimated or
missed many cracks; thus the variance was large. On the DeepCrack dataset, the DeepCrack
method overestimated the lengths of cracks, and FPHB missed some of the cracks; thus, width
and area relative errors are higher. On the CrackSegNet dataset, FCN and CrackSegNet
methods did not precisely extract the cracks due to the dataset’s bad images. Thus, the crack
widths are overestimated or underestimated. For the crack length, the FCN performed poorly,
and all methods performed similarly in the crack area category and had a smaller variance.
Lastly, for the CrackDenseLinkNet dataset, on all three quantities, FCN and DeepCrack
underperformed due to missing cracks. Overall the CrackDenseLinkNet method had the best
statistics and minimum variance, as it segments the cracks accurately on the tested datasets.
3.4.6 Computational time analysis
All deep learning computations were performed on a desktop computer running a 64-bit Ubuntu 20.04 operating system with 128 GB of memory and a 3.5 GHz, 16-core AMD Ryzen Threadripper 2950X processor. Four Nvidia GeForce RTX 2080 Ti Graphics Processing Units (GPUs) were used for training and testing of all five deep learning methods on the four
datasets. The programs were developed using open-source deep learning frameworks like
Caffe, Keras, PyTorch, TensorFlow, and other Python libraries. Methods such as FCN,
DeepCrack, CrackSegNet, and FPHB are evaluated using their public source codes in the
Linux environment. Table 3.4 presents the computation time of the methods discussed in
this work.
Dataset Time (seconds)
FCN DeepCrack CrackSegNet FPHB CrackDenseLinkNet
FCN 0.1955 0.0026 0.0024 0.0700 0.0446
DeepCrack 0.2214 0.0031 0.0032 0.1399 0.0465
CrackSegNet 0.2098 0.0027 0.0025 0.1745 0.0447
CrackDenseLinkNet 0.2219 0.0028 0.0027 0.0848 0.0435
Average time 0.2121 0.0028 0.0027 0.1173 0.0448
Table 3.4: Computational time for the four real-world datasets using the FCN, DeepCrack,
CrackSegNet, FPHB, and CrackDenseLinkNet methods.
The DeepCrack and CrackSegNet algorithms were very similar in the computational
prediction time and 1560.18% faster than the CrackDenseLinkNet. These methods are faster
due to the low depth of the architecture. The FCN method took 0.2121 seconds on average
to predict the probabilities of the crack and non-crack pixels. FCN is 7757.40% and 373.54%
slower than the DeepCrack/CrackSegNet and CrackDenseLinkNet methods respectively.
Similarly, FPHB is 4244.44% and 161.83% slower than the DeepCrack/CrackSegNet and
CrackDenseLinkNet methods, respectively. Overall, the DeepCrack and CrackSegNet methods are the fastest at inference, and the CrackDenseLinkNet method was the second-fastest, followed by the
FPHB and FCN approaches. FCN and FPHB had more parameters than the other three
methods.
3.5 Summary and conclusions
Concrete structures undergo periodic cycles of loading. Furthermore, phenomena such as fatigue, shrinkage, creep, and corrosion of reinforcement cause cracks to develop during a structure's life cycle. Generally, trained inspectors use drawing tools and crack width
measuring equipment to record the location and physical properties like width, length, and
area of the cracks. However, this procedure is entirely manual, tedious, and subjective to the
inspector’s experience. In the last one-and-a-half to two decades, the condition assessment
research communities developed conventional and sophisticated filter-based image processing
and computer vision methods. In addition, supervised decision-making systems were also
developed by using the conventional classifiers like ANN, SVM, k-NN, and ensemble learning
methods (random forest and adaptive boosting). After mid 2017, CNN-based methods for
the classification, detection, and semantic segmentation of cracks have gained popularity. In this work, a deep end-to-end CNN named CrackDenseLinkNet, which combines a DenseNet encoder with a modified LinkNet decoder, was adopted to reduce the number of trainable parameters and speed up the training process without compromising the segmentation accuracy. In addition, a
combined focal and dice loss was used to counter the class imbalance problem in the crack
dataset.
Three public and one private datasets were used to assess the segmentation capabilities
and limitations of CrackDenseLinkNet in comparison with four state-of-the-art methods.
All four methods are fully trained on the four datasets using their public source code and
the hyperparameters prescribed in their work. The proposed CNN, CrackDenseLinkNet,
outperformed the best state-of-the-art method, CrackSegNet, by 2.36% in F1-score on average
across four datasets. Furthermore, the proposed method took 8.25 hours on average to train
for 200 epochs. In comparison, the FPHB was the slowest (taking more than 24 hours),
followed by FCN, CrackSegNet, and DeepCrack methods. Lastly, on a detailed analysis of
the crack profile for all four datasets, the proposed method had a lesser variance in relative
errors for the crack width, length, and area categories against the ground-truth data.
3.6 Future work
This work showed that an end-to-end encoder and decoder network with fewer parameters
achieved the state-of-the-art segmentation F1-score compared to the other methods. The
combination of the focal and dice loss was able to handle the class imbalance problem that
is often persistent in the crack semantic segmentation. A hybrid or compound (weighted)
loss combination of the focal, dice, and topology-preserving functions for crack segmentation
can be advantageous in the future. Lastly, ensemble learning of semantic CNNs for crack
segmentation can also be a prime topic to research in the future.
Chapter 4
An image registration procedure for quantifying crack evolution in multi-image time series data
4.1 Introduction
Cracks are common features on concrete surfaces that indicate deterioration of the structure in its early stages. Cracks are caused by various external factors, such as environmental loads like wind, earthquakes, floods, and hurricanes, and structural factors like cyclic loading, shrinkage, creep, and corrosion of reinforcements [167]. Although many non-destructive evaluation methods are available to detect and locate cracks, visual inspection is the predominant mode of quantifying them [138]. However, visual inspection
is a labor-intensive task that must be carried out at least bi-annually in many cases [36].
Generally, the human technician records the crack’s physical properties and tracks it by
hand-drawn methods over time. As a result, there exists a considerable statistical difference
in the inspection descriptions of the same bridge between different inspectors [97].
In the past, most structural health monitoring applications have relied on contact-based
sensors such as accelerometers and strain gauges. After the data processing and analysis,
these sensors can be helpful to sense the coarse changes in the structure on a global scale.
However, since the advancement and cost reduction of non-contact vision-based sensors, much
of the attention is oriented towards defect detection by Digital Image Processing (DIP) and
computer vision techniques. This assists in obtaining the more significant details of the defect
properties and location. Digital image processing techniques such as keyframes extraction
and 3D surface reconstruction by photogrammetry are widely used to extract the cracks
pixels from the images [166]. Recently, the cracks growth monitoring and assessment based
on CNNs were implemented and tested [139]. However, the camera was fixed at a location,
and the field of view was set to acquire the whole concrete beam test model in laboratory
conditions.
Many tall buildings, large dams, and bridges have inaccessible areas where special attention
must be given. Robotic arms and Micro Aerial Vehicles (MAVs) enable the efficient collection
of valuable data for inspection and analysis. MAVs equipped with a high-resolution camera
can collect a large amount of data in a short time and assist in reducing the dangerous,
laborious, and expensive manual inspection tasks. Generally, MAVs are equipped with
sophisticated sensors such as color and thermal cameras, inertial measurement unit, Global
Positioning System (GPS), and ultrasonic module to perform the condition assessment of
power facilities, bridges, building façades and sewer pipes [123].
Crack detection and tracking its evolution is a challenging task that requires multiple
sensors and analysis techniques. Cracks are not always concentrated at a single location.
They propagate over time and develop into fatigue or surface cracks if left untreated. Thus, it
requires a large area to be scanned, and precise crack localization is necessary to quantify the
evolution. An image stitching approach utilizes individual images of a large area to produce
a seamless scene reconstruction. Additionally, this aids in visualizing and assessing the large
area by providing the more significant details of the defective surface in a larger field of
view. The image stitching algorithm requires carefully identified control points to warp the
individual images onto a planar surface. The feature-based descriptors can automatically
and efficiently select these control points [169, 17]. Once the visual data interest points are
extracted, and the homography matches the images, image blending has to be performed to
obtain a seamless final image. Many researchers have used image-based crack propagation
methods when the camera is fixed and focused on the object. Ghorbani et al. [88] used digital
image correlation for the measurement of full-field deformation and mapping the cracks on
confined masonry walls. Wang et al. [250] used a high-resolution color camera to investigate
the crack propagation of an asphalt specimen using the Semi-Circular Bend (SCB) test.
Furthermore, the previous studies focused on automatically stitching the large scan area of
cracks and quantifying the cracks at a single time frame.
Data collection by using the MAVs has increased rapidly in the past few years due to the
cost of MAVs equipped with a high-resolution camera. In addition, MAVs are agile, easy
to maneuver and control in dense and scarce regions that are inaccessible. In this study, a
hexacopter MAV equipped with a color camera, Inertial Measurement Unit (IMU), GPS,
and a controller on board was used to collect the data in different periods. The simulation
environment is based on the physical equations for flight and control. The same computer
program developed for the simulation can be used to control a real MAV. In this work, the
assumption of the fixed camera is relaxed, and the cracks are stationary. Once the visual data
is collected, they are organized by the time periods (number of trials) for each experiment
where cracks exist. Later, the recorded crack/non-crack data will be aligned back with the
reference images to quantify the crack propagation. For this, IMU data is used to localize the
neighboring images for a current image. Generally, in the presence of the stains and blemishes
on the concrete surface, a few false-positive keypoints result from using the feature-descriptors
algorithms. By using the location data from an IMU for each image, better localization can
be achieved, and the computational cost for pairwise search in the database can be reduced
to a greater extent.
In addition, this study proposes a probabilistic reliability approach for quantifying the
changes in crack physical properties such as area, length, and width. A change quantification
by an ensemble averaging can negate the unforeseen effects of distortion in the image warping
when there are bad keypoint matches. Additionally, there are uncertainties in the flight
trajectory or the path of the MAV; ensemble averaging helps minimize these uncertainties over a large number of simulation runs. Running multiple flight trajectories on a real-world structure is
laborious and time-consuming. A simulated environment assists in filling the gap that is hard
to replicate in the real-world scans of structure. Lastly, the computational damage mechanics
model requires a precise location and physical characteristics of the cracks to estimate the
crack propagation. The proposed probabilistic reliability method helps obtain an accurate
initial condition for the computational methods with uncertainty quantification.
4.1.1 Review of the literature
Image stitching algorithms are among the most widely used algorithms to create a panorama
of a large scene in today’s world [235]. Most phones are equipped with a panorama stitcher by
default. Brown et al. [29] and Brown and Lowe [28] proposed a connected component-based
method to recognize the panorama and stitching of multiple images based on SIFT features.
Jahanshahi et al. [116] adapted the Brown and Lowe [28] method to reconstruct the scene
from multiple images and to compare it with a current image to track defect evolution in
structures. In their method, the camera was assumed stationary and rotating in the optical
axis to acquire the images of the large scene. Additionally, their method was used to track
large synthetic defects. In reality, the camera cannot be fixed at one location while using
mobile robots or drones. Also, the cracks are smaller in size than the synthetic defects
prepared for the evolution of the defects.
Rashidi et al. [208] proposed a keyframes selection method from the video recordings of
civil infrastructures. In their method, video clips are assessed for the quality of keyframes and
later selected for scene reconstruction and 3D point clouds generation. Although the video
frames can be used for the stitching, their purpose was to obtain better keypoints and 3D point
clouds. Lim et al. [160] proposed a method to map the cracks over a small controlled area
using a mobile robot equipped with a laser scanner, pan-tilt-zoom camera, and a laptop. They
adapted the Laplacian of Gaussian (LOG) filter to extract the crack features and a robotic
inspection path planning based on a genetic algorithm to scan the small area by complete
coverage. In their method, they could map the cracks on a concrete surface. However, crack
evolution was not considered. Chaiyasarn et al. [35] proposed a method to perform an image
mosaicing of the tunnel-lining images. In their method, structure-from-motion was utilized
to generate the 3D point clouds, and an SVM classifier was used to differentiate the tunnel
surface points and the non-surface points. Later, the surface is estimated to aid the image
warping and produce a panorama with line parallelism and straightness.
Aliakbar et al. [8] and Akbar et al. [5] adapted the SURF feature detector to extract
keypoints from the images. These keypoints are matched, and outliers are rejected using
the RANdom SAmple Consensus (RANSAC) algorithm. The stitched and current images
are subtracted to extract the crack pixels for change detection. However, the changes in
time dimension lack in their work. Jahanshahi et al. [118] introduced an adaptive resection-
intersection bundle adjustment approach to refine the 3D points and the camera poses. In
their approach, the potential misassociated features are omitted during the resection and
intersection stage. Their approach was convenient for aerial scene reconstruction and 3D
structure estimation. However, it is not tested on the crack evolution problems. Li et al.
[155] proposed a multilayer and multiscale homography model for stitching the pavement
images. In their method, SIFT features are extracted from the images. Image matching and outlier rejection were performed using the RANSAC algorithm. The bundle adjustment optimization was carried out on the homographies between the images. However, their method
was limited to the stitching of pavement images, and crack evolution was disregarded.
Yeum et al. [278] developed an image localization technique to automatically extract the
regions of interest in a full-scale highway sign truss structure. In their approach, images
of the structure were acquired by a camera-mounted MAV, and the projection matrix was
estimated using Structure-from-Motion (SFM). The SFM coordinates are unitless, and scale,
position, and orientation are unavailable. Thus, the structure model is used as a priori for the
3D coordinate transformation. Lastly, virtual spheres at the weld joints are used to localize
the regions of interest. In their work, regions of interest were extracted and localized, but
not the change detection of the defects along the time dimension. Bang et al. [16] proposed a
method to stitch the construction site images to understand their status. A camera mounted
MAV was used to acquire the construction site images. In their method, blur removal
and keyframe selection modules were developed to filter the blurry images and extract the
correct frames from the video sample. Camera lens distortion correction was performed by
estimating the camera intrinsic parameters. Their work focused on the macro-scale of the
site details. However, their method did not discuss the fine-scale details and change detection
of defects. Kromanis and Liang [142] adapted the vision-based tracking method to measure
the deformation of a wooden beam in a laboratory setting using smartphone cameras placed
at different locations. In their method, markers were used as the tracking aid, and change
detection of the defects was not considered.
Xie et al. [264] proposed an image stitching method with a robot inspection system which
combines both the 2D image point features and the 3D line features to reduce the drift of
large-scale images. In their approach, the inspection system scans the bottom area of the
bridge, and images and 3D point clouds are acquired from the color camera and laser range
finder. Keypoints are extracted from the group of images and are matched. Lastly, bundle
adjustment was used to refine the homographies, and a multi-band blending algorithm was
applied to generate the image mosaics of the bottom surface of the bridge. In their work, multi-image stitching was the focus, not defect evolution. Choi et al. [48] presented an
approach to evaluate the building façade of a hazardous structure after a disaster event. The
camera-mounted MAV acquired series of images, and orthographic images were generated
from the SFM, planar homography estimation, and image alignment. Regions of interest
were localized for damage detection. However, information in the time dimension was not
considered in the methodology. Similar to [278], Yeum et al. [277] expanded the localization
of regions of interest of a truss structure for defect detection. They used a CNN to filter out
the occluded regions. Again, they disregarded the change detection in the time dimension.
Ghosh and R. [89] developed an approach for the change detection and evolution of the cracks
on concrete structures. In their approach, SURF-based feature keypoints and RANSAC were
utilized for image matching. Also, to reduce the image labeling time, a region-based CNN
was used in crack detection, and localization by bounding-boxes, followed by a morphological
image approach for semantic segmentation of the cracks.
Schlagenhauf et al. [216] proposed a stitching algorithm similar to a line scan camera for
rotationally symmetric structural components. In their approach, the camera mounted near
the spindle acquires the image fragments. Based on the rotational velocity and camera frame
rate, the images are stitched using the transformation model, and alpha blending produces
the final images. A deep Visual Geometry Group, OxfordNet (VGG16) CNN, was used to
classify the image patches as defective and non-defective. However, change detection was
not considered in their work. Won et al. [258] proposed a stitching algorithm by adapting
the Deepmatching method for the dense correspondence extraction. In their work, image
sequences are paired by Delaunay triangulation to reduce the computational complexity of
the Deepmatching method compared to the conventional pairwise matching. All other flows
of the tasks were similar to a classical feature-based method that was proposed in [28]. In
this work, stitching of various concrete surface images was considered, but not the defect
evolution. Kang and An [128] proposed a method to remove the background of the structure’s
exterior scene by deep learning-based depth estimation. This was carried out to minimize
the distortion caused by the farther background of the inspection images. Feature keypoints
are extracted from the region of interest images using the SIFT descriptors, and RANSAC
was utilized to perform the image matching. This was followed by a mesh-based digital
image stitching method known as “natural image stitching”. In this work, stitching was the
preeminent focus rather than defect evolution.
4.1.2 Contribution
All the previous work considered a stationary camera and measured the change by image
difference or digital image correlation (DIC). Recent work focused on constructing a panorama
for offline visual inspection, but they lacked the defect evolution. In this study, the camera is
non-stationary and mounted on an unmanned aerial vehicle, and the crack propagation is
measured from the reference and current images. Recent work included the time dimension
for the evolution of the cracks on a concrete surface [89]. However, pairwise image matching
has a time complexity of $O(n^2)$ in this approach (where $n$ is the number of images in the dataset). In contrast, in the proposed work, neighboring images are localized by a nearest-neighbor search based on a k-d tree, which has a time complexity of $O(kn \log n)$ for building the tree, where $k = 2$ for a two-dimensional point search and $n$ is the number of image coordinates from the IMU data. For searching the tree, the time complexity is $O(n \log n)$. Generally, for $m$ neighboring images, $m \ll n^2$. In addition, the proposed method provides
the probabilistic reliability measure to detect the changes in the analysis results by acquiring
ensemble datasets to reduce the occluded regions. Lastly, this work can use an MFAT method
or CNN for semantic segmentation of the cracks instead of the morphological approach
which results in higher false positives. In this work, viewpoint change and the effect of
geometric transformation on the reference and current image are studied. Also, the crack
width measurement algorithm is adapted based on orthogonal projection since traversing is
faster in the normal direction of the orientation of the crack pixels neighborhood.
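A minimal sketch of this k-d-tree-based localization of neighboring images from IMU-derived 2D coordinates is given below, using SciPy's k-d tree as an assumed implementation; the coordinates are synthetic and for illustration only.

import numpy as np
from scipy.spatial import cKDTree

# 2D image-acquisition coordinates derived from the IMU data
# (synthetic values for illustration).
rng = np.random.default_rng(0)
image_positions = rng.random((500, 2)) * 10.0   # n images, k = 2 dimensions

tree = cKDTree(image_positions)                 # build: O(k n log n)

# For the current image, retrieve the m nearest neighbors (candidate matches),
# which is far cheaper than exhaustive pairwise matching over all n images.
current_position = np.array([4.2, 7.1])
m = 8
distances, indices = tree.query(current_position, k=m)
print(indices)   # indices of the m neighboring images to match against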
4.1.3 Scope
Sections 4.2 and 4.3 discuss the MAV and camera mathematical models. Section 4.4 introduces the nearest-neighbor-search-based image matching. This is followed by the feature-based multi-image registration technique and scene reconstruction procedures. In addition, it presents the crack detection, localization, and quantification methods. Section 4.5 discusses the experimental work under the MAV simulation and the laboratory setting. Lastly, Sections 4.6 and 4.7 conclude the current work and explain the future work that can be extended based on this work.
4.2 MAV model
Robotic systems are not always a one-size-fits-all proposition. Robot simulation software
aids in fine-tuning and customizing robotic designs so that they integrate easily with current
manufacturing processes and equipment. The entire idea of automation is defeated if the
Figure 4.1: A six rotor micro aerial vehicle, AscTec Firefly, in Gazebo simulator.
loading or unloading of the automation system creates bottlenecks. Simulation
software can be used to demonstrate the efficacy of an automation system. If precise, thorough
modeling is used, one can ensure the efficiency of a robotic system. To test algorithms on
Micro Aerial Vehicles (MAVs), access to expensive hardware is required, and field testing
often takes a significant amount of time and requires a qualified safety pilot. The majority of
mistakes in the actual systems are difficult to recreate and frequently result in MAV damage.
The RotorS simulation framework [84] was developed to shorten field testing duration and
isolate problems for testing, allowing for faster troubleshooting and, ultimately, minimizing
crashes of real MAVs. This is particularly useful for applications when access to an expensive
and complicated real-world platform is not always feasible. In this study, a MAV is used as
the vehicle to transport the sensor array and collect the data for analysis.
4.2.1 MAV - system overview
This section offers an overview of the RotorS simulator’s critical components shown in
Figure 4.2. Ideally, all components employed in the simulated environment should run
without modification on the actual platform. Therefore, Gazebo plugins and the Gazebo
physics engine are used to emulate all components present on real MAVs.
Figure 4.2: Building blocks required to launch a MAV.
4.2.1.1 Assembly
A MAV comprises a body, a set number of rotors that may be positioned at user-specified
places, and specific sensors attached to the body. Each rotor has motor dynamics that
account for the essential aerodynamic impacts. In addition, several sensors, including an
IMU, a standard odometry sensor, a visual-inertial sensor comprised of a stereo camera and
an IMU, and sensors designed by the user, can be connected to the body.
4.2.1.2 Modeling
The forces and moments operating on a MAV may be divided into forces and moments acting
on each rotor, and gravitational forces acting on the MAV’s Center of Gravity (CoG). All of
these forces, when combined, describe the entire dynamics of a MAV.
4.2.1.2.1 Forces and moments on a single rotor The thrust force $F_T$, drag force $F_D$, rolling moment $M_R$, and moment from drag $M_D$ of a rotor blade are given from [173]:
\begin{align}
F_T &= \omega^2\, C_T\, z_B, \tag{4.1} \\
F_D &= \omega\, C_D\, \nu_A^{\perp}, \tag{4.2} \\
M_R &= \omega\, C_R\, \nu_A^{\perp}, \tag{4.3} \\
M_D &= \varepsilon\, C_M\, F_T. \tag{4.4}
\end{align}
Here, $\omega$ is the rotor blade's positive angular velocity, $C_T$ is the rotor thrust constant, $C_D$ is the rotor drag constant, $C_R$ is the rolling moment constant, and $C_M$ is the rotor moment constant. The forces and moments acting on the rotor's center are shown in Figure 4.3. All the constants here are positive. The turning direction of the rotor is denoted by $\varepsilon = \pm 1$, where +1 indicates counterclockwise rotation and -1 indicates clockwise rotation. $\nu_A$ and $\nu_A^{\perp}$ are the velocities (normal and perpendicular) at which the propeller moves with respect to the geometric center of a rotor.
Figure 4.3: Forces and moments operating on a single rotor's center.
4.2.1.2.2 Robot dynamics The MAV's equations of motion can be derived from Newton's law and Euler's equation, given by:
\begin{align}
\mathbf{F} &= m\,\mathbf{a}, \tag{4.5} \\
\boldsymbol{\tau} &= J\,\dot{\boldsymbol{\omega}} + \boldsymbol{\omega} \times J\,\boldsymbol{\omega}. \tag{4.6}
\end{align}
Here $m$, $\mathbf{a}$, $J$, and $\boldsymbol{\omega}$ represent the mass, acceleration, inertia matrix, and angular velocity of the MAV, respectively. The primary forces acting on the main body are shown in Figure 4.4.
Figure 4.4: Body-centered body frame B and global world frame W in a hexarotor sketch. The main body is acted on by the primary forces $F_i$ from the various rotors, and by $F_G$.
4.2.2 MAV control
To operate a MAV, a mapping must be found between the system's output, which is the resultant thrust $T$ (the combined thrust of all rotors) and the torque $\boldsymbol{\tau}$ acting on the helicopter's center of gravity, and the system's input, which is the angular velocity of each rotor, $\omega_i$. The thrust forces of each rotor, their resultant moments, and the drag moments can be formulated with the following equation:
\begin{equation}
\begin{bmatrix} T \\ \boldsymbol{\tau} \end{bmatrix} = A \begin{bmatrix} \omega_0^2 \\ \omega_1^2 \\ \vdots \\ \omega_n^2 \end{bmatrix} \tag{4.7}
\end{equation}
The mapping matrix $A \in \mathbb{R}^{4 \times 4}$, also known as the allocation matrix, is given by:
\begin{equation}
A = \begin{bmatrix}
C_T & C_T & C_T & C_T \\
0 & l\,C_T & 0 & -l\,C_T \\
-l\,C_T & 0 & l\,C_T & 0 \\
-C_T C_M & C_T C_M & -C_T C_M & C_T C_M
\end{bmatrix} \tag{4.8}
\end{equation}
The RotorS hexacopter features rotor axis normals pointing along the z-axis of the body frame, $z_B$. For a MAV with all of the rotor axes facing in the same direction, only a thrust $T$ pointing in the direction of the rotor blades' normal vector, which coincides with $z_B$, can be created. As a result, only the thrust $T$ and the moments around the three body axes $x_B$, $y_B$, and $z_B$ may be directly controlled.
The vehicle must be oriented towards a set-point in order to travel in 3D space. As a result, the vehicle's total thrust, the direction of $z_B$ (through the roll and pitch angles), and the yaw rate $\omega_z$ must be regulated. This is commonly known as the attitude controller. Because the dynamics of attitude are generally significantly quicker than the dynamics of translation, a cascaded control method is frequently used [150]. Various components of the MAV's control system are shown in Figure 4.5.
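As an illustration of how the allocation matrix of Equation (4.8) is used in practice, the sketch below inverts it to map a desired collective thrust and body torques to squared rotor velocities. A four-rotor configuration is used, matching Equation (4.8), and the numerical constants are arbitrary placeholders rather than the actual AscTec Firefly parameters.

import numpy as np

# Placeholder constants: thrust constant, rotor moment constant, arm length.
C_T, C_M, l = 8.5e-6, 1.6e-2, 0.17

# Allocation matrix of Equation (4.8) for a four-rotor layout.
A = np.array([
    [C_T,        C_T,       C_T,        C_T      ],
    [0.0,        l * C_T,   0.0,       -l * C_T  ],
    [-l * C_T,   0.0,       l * C_T,    0.0      ],
    [-C_T * C_M, C_T * C_M, -C_T * C_M, C_T * C_M],
])

# Desired collective thrust [N] and body torques [N m] from the attitude controller.
wrench = np.array([15.0, 0.02, -0.01, 0.005])

# Solve A * omega^2 = wrench for the squared rotor angular velocities.
omega_squared = np.linalg.solve(A, wrench)
omega = np.sqrt(np.clip(omega_squared, 0.0, None))   # rotor speeds in rad/s
print(omega)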
4.2.3 State estimation
Accurate information about the MAV's state is a critical component for enabling steady and robust MAV flights. The characteristics of IMU measurements and posture measurements
Figure 4.5: The intended location $p_d$ and the desired yaw angle $\psi_d$ are shown in the controller drawing. Position control is usually divided into two parts: an outer trajectory tracking controller calculates the attitude and thrust references that an inner attitude tracking controller follows.
are highly complementary: Measurements from IMUs, which are often employed aboard
MAVs, are accessible at a high rate and with low latency, but they are contaminated by
noise and a time-varying bias. As a result, relying exclusively on time-discrete integration
(dead-reckoning) of these sensors renders a consistent assessment of the vehicle’s absolute
posture almost impossible. Methods for estimating the 6 DoF posture, on the other hand,
often exhibit no or extremely little drift. However, their data typically come at a considerably
lower rate with considerable delay due to their computational complexity. Combining the two
measurements results in an (almost) drift-free estimation of the state at a high rate and short
latency. The fundamentals of doing this with an Extended Kalman Filter (EKF) formulation
[4] are shown below.
4.2.3.1 Sensor model
A typical IMU model is provided by:
\begin{align}
\boldsymbol{\omega}_m &= \boldsymbol{\omega} + \mathbf{b}_{\omega} + \mathbf{n}_{\omega}, \tag{4.9} \\
\mathbf{a}_m &= \mathbf{a} + \mathbf{b}_{a} + \mathbf{n}_{a}, \tag{4.10}
\end{align}
where the subscript $m$ denotes the measured quantity, and $\mathbf{b}_{\omega}$ and $\mathbf{b}_{a}$ represent biases on the observed angular velocities and accelerations, respectively. These biases are represented by a random walk with zero-mean white Gaussian noise as the time derivative:
\begin{align}
\dot{\mathbf{b}}_{\omega} &= \mathbf{n}_{b\omega}, \tag{4.11} \\
\dot{\mathbf{b}}_{a} &= \mathbf{n}_{ba}, \tag{4.12}
\end{align}
where $\mathbf{n}_{b\omega}$ and $\mathbf{n}_{ba}$ are noise levels.
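A short sketch of this IMU measurement model, simulating the bias random walk and additive white noise of Equations (4.9)-(4.12) for a single gyroscope axis, is given below; the noise magnitudes are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(0)

dt = 0.01                     # IMU sampling period [s]
n_steps = 1000
sigma_noise = 0.005           # white measurement-noise level (illustrative)
sigma_bias_walk = 0.0002      # bias random-walk strength (illustrative)

omega_true = 0.1 * np.sin(0.5 * np.arange(n_steps) * dt)   # true angular rate [rad/s]

bias = 0.0
omega_measured = np.empty(n_steps)
for k in range(n_steps):
    # Equation (4.11): the bias evolves as a random walk driven by white noise.
    bias += sigma_bias_walk * np.sqrt(dt) * rng.standard_normal()
    # Equation (4.9): measurement = true rate + bias + white noise.
    omega_measured[k] = omega_true[k] + bias + sigma_noise * rng.standard_normal()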
4.2.3.2 State representation
Most controllers are divided into a position loop and an attitude loop. This works nicely
under the premise that rotational dynamics are quicker than translational motion. The outer
loop requires the location, p, and velocity, ν, in world coordinates. The inner attitude loop
requires the orientation, q, and the angular velocity, ω, given by the IMU. When combined
with the bias states from the IMU model, this results in the state vector shown below:
\begin{equation}
\mathbf{x} = \begin{bmatrix} \mathbf{p}^T & \boldsymbol{\nu}^T & \mathbf{q}^T & \mathbf{b}_a^T & \mathbf{b}_{\omega}^T \end{bmatrix}^T \tag{4.13}
\end{equation}
Because of the low complexity of time-discrete integration of IMU measurements, they
are frequently utilized as input for the EKF’s time-update phase (prediction). This has the
added benefit of being independent of specific vehicle dynamics and model parameters, and
avoiding having the angular rate in the state. This results in the dynamic model shown
below:
\begin{align}
\dot{\mathbf{p}} &= \boldsymbol{\nu}, \tag{4.14} \\
\dot{\boldsymbol{\nu}} &= \mathbf{C}_{(q)}\,(\mathbf{a}_m - \mathbf{b}_a - \mathbf{n}_a) + \mathbf{g}, \tag{4.15} \\
\dot{\mathbf{q}} &= \tfrac{1}{2}\,\mathbf{q} \otimes \begin{bmatrix} 0 \\ \boldsymbol{\omega}_m - \mathbf{b}_{\omega} - \mathbf{n}_{\omega} \end{bmatrix}, \tag{4.16} \\
\dot{\mathbf{b}}_a &= \mathbf{n}_{ba}, \tag{4.17} \\
\dot{\mathbf{b}}_{\omega} &= \mathbf{n}_{b\omega}. \tag{4.18}
\end{align}
Further details of the analytical expressions are provided in [245].
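For concreteness, a small sketch of the corresponding time-discrete state propagation (the EKF prediction step), ignoring the noise terms and using SciPy's rotation utilities, is given below. The gravity sign convention and the (x, y, z, w) quaternion ordering are assumptions of this sketch rather than the exact formulation of [245].

import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY = np.array([0.0, 0.0, -9.81])   # world-frame gravity (assumed convention)

def propagate_state(p, v, q, b_a, b_w, a_m, w_m, dt):
    # One dead-reckoning step of Equations (4.14)-(4.18) without the noise terms.
    # q is a body-to-world quaternion in (x, y, z, w) order.
    C = Rotation.from_quat(q).as_matrix()          # rotation body -> world
    v_new = v + (C @ (a_m - b_a) + GRAVITY) * dt   # Eq. (4.15)
    p_new = p + v * dt                             # Eq. (4.14)
    # Eq. (4.16): integrate the orientation with the bias-corrected angular rate.
    dq = Rotation.from_rotvec((w_m - b_w) * dt)
    q_new = (Rotation.from_quat(q) * dq).as_quat()
    return p_new, v_new, q_new, b_a, b_w           # biases are constant in expectation

# Example: a hovering MAV measuring only gravity and zero angular rate.
p, v, q = np.zeros(3), np.zeros(3), np.array([0.0, 0.0, 0.0, 1.0])
b_a, b_w = np.zeros(3), np.zeros(3)
p, v, q, b_a, b_w = propagate_state(p, v, q, b_a, b_w,
                                    a_m=np.array([0.0, 0.0, 9.81]),
                                    w_m=np.zeros(3), dt=0.01)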
4.2.3.3 Measurement model
The measurement equations are divided into position measurements, $\mathbf{p}_m$, and attitude measurements, $\mathbf{q}_m$, representing the IMU's measured pose with respect to the world frame:
\begin{align}
\mathbf{p}_m &= \mathbf{p} + \mathbf{n}_p, \tag{4.19} \\
\mathbf{q}_m &= \mathbf{q} \otimes \delta\mathbf{q}_n, \tag{4.20}
\end{align}
where $\delta\mathbf{q}_n$ denotes a small error rotation, and $\mathbf{n}_p$ is zero-mean white Gaussian noise. This basic model implies that the pose sensor's origin coincides with the IMU, which is not always the case. Furthermore, the pose sensor's frame of reference may not be aligned with the world frame. However, because these misalignments are frequently observable, they do not need to be calibrated beforehand. The derivations and an observability analysis can be found in [4, 254].
4.2.4 Collision avoidance and path planning
It is critical for 3D collision avoidance and path planning to represent obstacles, which
is required for collision checking efficiently. For this purpose, octree representations are
commonly employed [176]. An octree is a tree in which each node has eight offspring, making
it ideal for efficient memory storage. It is frequently used to indicate whether or not a 3D
space is occupied. Every node represents a specific portion of a 3D space; this part may be
split into eight equal-sized pieces, which are indicated as octants of this subspace. This can
be performed iteratively until the leaf nodes reach the required resolution of the represented space.
If all node octants have the same value, the node value can be changed to that value, and
the octant nodes can be removed from the tree.
Down-projecting the world onto a 2D ground plane and using the Robotic Operating
System (ROS) navigation stack is a popular approach for solving collision avoidance on
MAVs. To address this issue, a suitable sensor must be installed on the MAV to provide
an approximation of the surroundings. Gazebo offers plugins for 2D laser scanners such
as the Hokuyo [137], which might be installed on a genuine MAV. The 3D collision can
be detected directly on the disparity images, as proposed in [175]. This method has the
significant disadvantage of confining the MAV’s working region to a plane at a fixed height.
Front-facing depth cameras, such as the Kinect sensor [289], which is currently integrated into
Gazebo, provide a suitable starting point for comprehensive 3D collision avoidance. These
sensors are small and provide a wealth of information about their surroundings.
4.3 Camera model and calibration
The camera model is a mathematical principle crucial for understanding the image formation
from the 3D world to the camera coordinate system. A projective transformation camera
model can be visualized by a pinhole camera. A light ray of the 3D object passes through
a tiny hole, and the same object is inverted on the image plane. This is a fundamental
camera model where the behavior of the lens is disregarded. Modern high-resolution cameras
applicable for computer vision applications are equipped with a powerful lens to focus the light
onto the image view plane or the photoreactive sensor. Digital cameras have a charge-coupled
device (CCD) array or Complementary Metal Oxide Semiconductor (CMOS) sensors for the
image acquisition on the image view plane. In this section, a pinhole camera model is used
to derive the intrinsic and extrinsic parameters of the camera for a digital camera. Intrinsic
parameters are necessary to understand the internal parameters of the image formation and
model the radial and tangential distortion of the lens. Furthermore, extrinsic parameters
assist in estimating the 3D location of an object with reference to the camera. In this work,
extrinsic parameters are hardly needed as the images do not have a calibration rig, which is
problem-specific. Lastly, the camera calibration procedure is also discussed, and the intrinsic
parameters of the MAV and real Sony camera are tabulated. Using these parameters, the
lens distortion can be corrected.
4.3.1 A pinhole camera model
A camera is an optical device that maps a 3D scene onto a 2D image plane. In a pinhole
camera model, the camera lens represents the optical center in which the 3D point of the world
scene is projected onto the 2D image plane. Furthermore, the optical axis is perpendicular to
the plane of the lens and passes through the optical center [55, 257]. This projection from
the 3D coordinates to 2D coordinates is illustrated in Figure 4.6, where the camera aperture
is a point.
Two main parameters that govern the camera's intrinsic properties are the focal length $f$ and the principal point $O_f$. The focal length is the distance between the camera center and the image plane. The principal axis is the line that passes through the camera center, image, and focal planes. The principal point is a point on the image and focal planes where the perspective center is projected, and the line passing through this point is perpendicular to the image plane. In a pinhole camera model there are four main coordinate systems (see Figure 4.6): (a) the world coordinate system $(X_w, Y_w, Z_w)$, (b) the camera coordinate system $(X_c, Y_c, Z_c)$, (c) the image coordinate system $(X_f, Y_f)$, and (d) the pixel coordinate system $(u_i, v_i)$.
A 3D point $P(x_c, y_c, z_c)$ in the camera coordinate system is projected onto a 2D point $p(x_f, y_f)$ in the image coordinate system.
Figure 4.6: A pinhole camera model coordinate system.
Using the principles of similar triangles (see the right part of Figure 4.6), the relationship between the 3D camera coordinates and the 2D image coordinates is given by
\begin{align}
x_f &= f\,\frac{x_c}{z_c}, \tag{4.21a} \\
y_f &= f\,\frac{y_c}{z_c}. \tag{4.21b}
\end{align}
In a digital camera, the image coordinate system in metric units (e.g., millimeters or inches) must be converted into the pixel-based coordinate system. The transformation is shown in Figure 4.6, where the unit focal plane information is projected onto the image plane. $(x_s, y_s)$ are the coordinates of a point $p$ in the image coordinate system $(X_s, Y_s)$, mapped from the unit metric coordinates $(x_f, y_f)$ in the focal plane. The focal plane to pixel conversion is given by the equations
\begin{align}
x_s &= s_u\, x_f, \tag{4.22a} \\
y_s &= s_v\, y_f, \tag{4.22b}
\end{align}
where s_u and s_v are the scale factors relating world units to pixels per column and per row
(e.g., pixels per millimeter or per inch). (u_i, v_i) is the pixel coordinate system, whose origin
is located at the top-left corner of the image. The pixel coordinates, (u, v), of the point, p, are
given by,

\[
u = x_s + c_x, \qquad (4.23a)
\]
\[
v = y_s + c_y, \qquad (4.23b)
\]
where (c_x, c_y) is the principal point in the pixel coordinate system. Combining Equations (4.21)
to (4.23), the projection from the 3D camera coordinate system to the 2D pixel coordinate
system is given by,

\[
u = f_x\,\frac{x_c}{z_c} + c_x, \qquad (4.24a)
\]
\[
v = f_y\,\frac{y_c}{z_c} + c_y, \qquad (4.24b)
\]
where f_x = s_u f and f_y = s_v f are the focal lengths in pixel units. Equation (4.24) can be
expressed in homogeneous representation and matrix form as,

\[
S \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K}
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix},
\qquad (4.25)
\]
where S is a scale factor, K is called the camera matrix, and s is the skew coefficient, which is
non-zero if the image axes are not perpendicular or when non-rectangular pixels are present.

\[
K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\qquad (4.26)
\]
A wide-angle lens introduces radial distortion, and when the image plane and the lens are
not parallel, tangential distortion coefficients also need to be estimated. Equation (4.27) can be
used to correct the camera lens distortion [257].
\[
\begin{bmatrix} x_d \\ y_d \end{bmatrix}
=
\underbrace{\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right)}_{\text{Radial distortion}}
\begin{bmatrix} x'_c \\ y'_c \end{bmatrix}
+
\underbrace{\begin{bmatrix} 2 p_1 x'_c y'_c + p_2\!\left(r^2 + 2 {x'_c}^{2}\right) \\ 2 p_2 x'_c y'_c + p_1\!\left(r^2 + 2 {y'_c}^{2}\right) \end{bmatrix}}_{\text{Tangential distortion}}
\qquad (4.27)
\]

where k_1, k_2, and k_3 are the radial distortion coefficients, p_1 and p_2 are the tangential distortion
coefficients, x'_c = x_c/z_c and y'_c = y_c/z_c are the undistorted normalized coordinates, and
r^2 = x'_c^2 + y'_c^2.
Using the lens distortion model of Equation (4.27), Equation (4.24) can be rewritten as:

\[
u = f_x\, x_d + c_x, \qquad (4.28a)
\]
\[
v = f_y\, y_d + c_y. \qquad (4.28b)
\]
Now, using the world coordinate system, (x_w, y_w, z_w), let P be a 3D point. The transfor-
mation from the world to the camera coordinate system is given by a 3 x 3 rotation matrix, R,
and a 3 x 1 translation vector, t. In homogeneous coordinates, the transformation between the
world and camera coordinate systems in matrix form is given by,
\[
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
=
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix}
\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}
\qquad (4.29)
\]
Now, using Equation (4.25) and Equation (4.29), the projection of the point P from the 3D
world coordinate system to the 2D pixel coordinate system is written as,
\[
S \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic parameters}}
\underbrace{\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix}}_{\text{extrinsic parameters}}
\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}
\qquad (4.30)
\]
Generally, in the computer vision literature, Equation (4.30) is written in the short notation,

\[
S\,p = K\,[R \mid t]\,P. \qquad (4.31)
\]
Equation (4.30) consists of the intrinsic and extrinsic parameters of a pinhole camera model.
The intrinsic parameters refer to the internal geometric properties of the camera sensor and
lens, which include the focal lengths, f_x and f_y, the principal point, (c_x, c_y), and the skew
coefficient, s.
The extrinsic parameters define the position and orientation of the camera in the world
coordinate system. The transformation of a 3D point of an object from the world coordinate
system to the camera coordinate system is given by the rotation matrix, R, and the translation
vector, t. Therefore, a camera calibration procedure is required to estimate the intrinsic
and extrinsic parameters of a camera. This procedure is discussed in detail in Section 4.3.2.
4.3.2 Camera calibration
A camera calibration procedure is required to estimate the intrinsic and extrinsic parameters
of the camera. Intrinsic and extrinsic parameters can be used to map the 3D points of an
object in the world to image coordinates and find the rotation and translation between the
camera and the checkerboard images. The intrinsic calibration provides the unknown optical
properties of a camera, such as the focal lengths (f_x and f_y), the principal point (c_x and c_y),
the lens distortion coefficients (radial: k_1, k_2, and k_3; tangential: p_1 and p_2), and the skew
coefficient, which is non-zero if the image axes are not perpendicular [257]. In this research,
the MATLAB toolbox is used to perform the intrinsic and extrinsic calibration procedures.

Figure 4.7: Checkerboard pattern images used for the camera calibration procedure.
While performing the intrinsic calibration procedure, the MATLAB toolbox reads a series
of 35 checkerboard images acquired by a camera in different orientations, see Figure 4.7.
Then the program detects the grid corners of each checkerboard image using a corner detection
algorithm. For every orientation, at least four points are used to fit the intrinsic parameters.
The MATLAB toolbox uses a nonlinear optimization technique to optimize and compute the
intrinsic parameters of the camera.
The extrinsic calibration is performed to obtain the translation and rotation matrices
between the camera and the checkerboard pattern images. The origin of the camera’s
coordinate system is at its optical center, and its x and y axes define the image plane. The
intrinsic calibration procedure should be performed first, because the MATLAB toolbox requires the
intrinsic parameters in order to estimate the extrinsic parameters: the rotation matrix, R, and the
translation vector, t. The estimated transformations between the camera and the checkerboard
pattern images at different locations, and the mean reprojection error of each image, are shown
in Figure 4.8. Table 4.1 lists the camera intrinsic parameters of the MAV-mounted and real
Sony color cameras.

Figure 4.8: Camera extrinsic parameters. (a) Checkerboard location and orientation with
respect to the camera. (b) Mean reprojection error per image (overall mean error: 2.17 pixels).

Camera intrinsic parameter               MAV camera     Sony camera
Focal length f_x (pixels)                241.4268       3784.9901
Focal length f_y (pixels)                241.4268       3788.6197
Principal point c_x (pixels)             376.5000       2570.6349
Principal point c_y (pixels)             240.5000       1845.4172
Radial distortion coefficient k_1          0.0000         -0.1706
Radial distortion coefficient k_2          0.0000          0.6588
Radial distortion coefficient k_3          0.0000         -0.7091
Tangential distortion coefficient p_1      0.0000         -0.0038
Tangential distortion coefficient p_2      0.0000          0.0019
Skew coefficient s (pixels/inch)           0.0000          8.2269

Table 4.1: Camera intrinsic parameters for the MAV simulation and real Sony cameras.
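As a rough illustration of this calibration workflow, the sketch below uses MATLAB Computer Vision Toolbox functions (detectCheckerboardPoints, generateCheckerboardPoints, estimateCameraParameters). The image folder name and the 25 mm square size are assumed values for the sketch, not the settings used in this study.

```matlab
% Minimal intrinsic/extrinsic calibration sketch (assumed folder and square size).
imageFiles = dir(fullfile('calibration_images', '*.jpg'));           % hypothetical folder
imageFiles = fullfile({imageFiles.folder}, {imageFiles.name});

% Detect checkerboard corners in all calibration images.
[imagePoints, boardSize, usedImages] = detectCheckerboardPoints(imageFiles);

% Generate the corresponding world coordinates of the corners (assumed 25 mm squares).
squareSize  = 25;                                                     % millimeters
worldPoints = generateCheckerboardPoints(boardSize, squareSize);

% Estimate intrinsics, distortion coefficients, and per-image extrinsics.
I = imread(imageFiles{find(usedImages, 1)});
cameraParams = estimateCameraParameters(imagePoints, worldPoints, ...
    'ImageSize', [size(I, 1), size(I, 2)]);

% Inspect the recovered focal lengths, principal point, and distortion terms.
disp(cameraParams.IntrinsicMatrix');       % transpose gives [fx s cx; 0 fy cy; 0 0 1]
disp(cameraParams.RadialDistortion);       % [k1 k2] (or [k1 k2 k3] if requested)
disp(cameraParams.MeanReprojectionError);  % analogous to Figure 4.8(b)
```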
4.4 Methodology
4.4.1 Nearest neighbor search
Generally, for an unordered image dataset, a stitching and scene reconstruction method
must match every pair of images to find the appropriate connected-component images. Similarly,
for the structural inspection, pairwise matching is used to find the images in the previous
dataset that neighbor the current image [116, 89]. However, brute-force pairwise image
matching has a time complexity of O(n^2) (where n is the number of images in the dataset),
making it impractical for most applications when the datasets are large. Exploiting the IMU
location information for each image in the dataset, a better strategy can be devised to
efficiently search for the potential candidate matching images.

Figure 4.9: Nearest neighbor search strategy for finding the neighboring images. Red dots
are the image coordinates obtained from the IMU data. Green and blue circles are the
possible areas that engulf the neighboring images for image matching, registration, and
scene reconstruction. R1 and R2 are the radii of the engulfing circles.
It is convenient to have a global coordinate system for an area or volume in the structural
inspection. Therefore, all the necessary computations and alignment of the images will be
with respect to this global coordinate system. Furthermore, this work assumes that a MAV
or any mobile robot starts from the same landmark and performs a complete coverage path
planning and maneuvering on the pre-determined path. This is a valid assumption as it is
practical to start from the same location, and based on the structural 3D or 2D CAD model,
a pre-determined indoor or outdoor path can be computed efficiently. Additionally, in this
study, a pre-determined optimal zig-zag path was utilized. This was chosen to accomplish
the easy maneuverability and complete coverage of the inspection area. Figure 4.9 shows
the inspection area without cracks. Red dots are the image coordinates with respect to a
global coordinate system. Green and blue circles are the areas that engulf certain neighboring
images.
An efficient approach is to devise an indexing structure, such as a multi-dimensional
search tree or a hash table, to rapidly search for neighboring images near a given current
image from the previous dataset. Such indexing structures can either be built for each
image independently (which is useful if only specific potential matches are considered, e.g.,
searching for neighboring images of a particular image) or globally for all the images in a
given database (which is faster and removes the redundant iterations over images). In this
work, a nearest neighbor search based on a k-d tree is used to find the neighboring images;
it has time complexities of O(kn log n) and O(n log n) for building and searching the tree,
where k = 2 for a two-dimensional point search and n is the number of image coordinates
from the IMU data for an inspection area. Generally, the number of neighboring images, m,
is much smaller than n, so far fewer than the n^2 brute-force image pairs need to be matched.
Lastly, after finding the nearest neighboring images in the previous database, a hash table
stores the indexed information for each current image, eliminating redundant find operations.
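A minimal MATLAB sketch of this indexing strategy is given below, assuming the Statistics and Machine Learning Toolbox functions KDTreeSearcher and knnsearch. The variable names (prevXY, currXY), the choice of six neighbors, and the 1-meter radius are illustrative placeholders.

```matlab
% prevXY: n-by-2 IMU image coordinates from the previous inspection (assumed available).
% currXY: m-by-2 IMU image coordinates of the current inspection images.
kdtree = KDTreeSearcher(prevXY);               % build the 2-D k-d tree once

% For every current image, find its six nearest previous images.
[idx, dist] = knnsearch(kdtree, currXY, 'K', 6);

% Hash table (dictionary) caching the neighbor indices, so they are found only once.
neighborTable = containers.Map('KeyType', 'double', 'ValueType', 'any');
for i = 1:size(currXY, 1)
    keep = dist(i, :) <= 1.0;                  % discard neighbors farther than 1 meter
    neighborTable(i) = idx(i, keep);
end
```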
4.4.2 Multi-image registration and scene reconstruction
Multi-image registration is the process of determining the correspondence between two or
more images to align and create a seamless larger scene for understanding. By registering
two or more images, multimodal information can be fused for change detection in the scene
and for object recognition [96]. There are two types of image registration, namely
direct and feature-based [235]. Although direct stitching benefits from using all of the image
data and can yield accurate registrations in some cases, feature-based registration is faster and
robust to light intensity variation, rotation, and scaling. In addition, feature-based methods are
partially robust to affine and projective transformations and can work on unordered image sets [28].

Figure 4.10: Feature-based multi-image registration pipeline. First, feature descriptors
of the neighboring and current images are found. Second, based on the ratio threshold,
putative matches are found between the images. Third, outlier keypoints are rejected by
the RANSAC algorithm and the images are matched and selected. Fourth, bundle
adjustment optimization is used to correct the camera parameters and minimize the
drift error between the images. Fifth, the images are transformed using the projective
or affine transformation matrix. Sixth, gain compensation and multi-band blending are
used to produce the seamless reconstruction of the scene. Lastly, the reconstructed scene is
cropped to match the current view image and compared for change detection.
Figure 4.10 shows the pictorial representation of the multi-image registration pipeline.
Some of the common steps involved in image registration are provided below:
• Pre-processing: If the images contain noise, basic mean and median filters can be used
for noise removal.
• Feature selection: In multi-image registration and scene reconstruction, repeatable
features such as keypoints, lines, corners, or curves must be present in the images
to establish valid correspondences to estimate the geometric transformation between
source/reference and target images.
• Feature correspondence: Corresponding keypoints of features are present in the source
or target images or both. In addition, it provides information to align the images by
matching the feature descriptors.
• Bundle adjustment: Accumulation error is caused when the pairwise homographies do
not match accurately (e.g., ends of the panorama do not align). Bundle adjustment is
an optimization procedure to correct the camera parameters.
• Geometric transformation function: After the bundle adjustment, images are trans-
formed by a projective or affine matrix to a common coordinate system to produce a
large scene.
• Seamless blending: Multiple images have different intensities in the overlapping region
due to differences in camera exposure, the vignetting effect, parallax, and
misregistration errors. Blending the images produces a seamless panorama for scene
reconstruction.
• Resampling: After finding the transformation function, the source panorama image is
resampled or resized to the geometry of the current image. Thus enabling the detection
of changes in the scene.
4.4.2.1 Feature matching
This study uses feature-based image registration to align the source/reference and target
image to detect crack propagation. Figure 4.11 shows the source/reference image and target
image. The reference and target images were captured at time periods T_0 and T_1, respectively. Both
images have different exposure and orientation with respect to the cracks. Since the
images are of high resolution, a denoising method was not employed for noise removal.

Figure 4.11: Two images of the same crack at time periods T_0 and T_1. (a) Source/reference
image. (b) Target image.
The most desired features for image registration are points because their coordinates can
be directly used to determine the parameters of a transformation function that registers the
images. Thus, point-based feature descriptors like the Speeded Up Robust Features (SURF)
/ Scale-Invariant Feature Transform (SIFT) can be used (details are provided in Section 5.2).
For demonstration purposes, SURF keypoints are used in both the images.
Figure 4.12 shows the putative matches of the keypoints from two images after SURF
keypoints extraction. After the SURF key points are detected on the source and target
images, their correspondence between point pairs of two images needs to be found.
Figure 4.12: Putative matches of the two images.
Given a set of points in the source image, P = {p_i : i = 1, ..., n_r}, and in the target image,
Q = {q_j : j = 1, ..., n_s}, where p_i = (x_i, y_i) and q_j = (X_j, Y_j), all the index pairs (i, j) from the
two point sets need to be determined for which p_i and q_j depict the same point in the scene. This
is computationally expensive, but an efficient way to speed up this matching, based on an
approximate nearest neighbor search, is described in [185]. The matching keypoints show the
similarity between the reference and target images. The feature point descriptor vector is
64 x 1 in dimension for each keypoint. The distance of a matching pair is calculated as the
sum of squared differences [185]. In this work, all matches for which the distance ratio is
greater than 0.6 were rejected.
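A condensed MATLAB sketch of this feature matching step, assuming the Computer Vision Toolbox SURF functions, is shown below. The 0.6 ratio threshold mirrors the rejection criterion described above, while the image variable names I1 and I2 are placeholders.

```matlab
% I1: source/reference grayscale image, I2: target grayscale image (placeholders).
pts1 = detectSURFFeatures(I1);
pts2 = detectSURFFeatures(I2);

% 64-dimensional SURF descriptors at each keypoint.
[f1, vpts1] = extractFeatures(I1, pts1);
[f2, vpts2] = extractFeatures(I2, pts2);

% Putative matches; MaxRatio = 0.6 rejects ambiguous matches (distance-ratio test).
indexPairs = matchFeatures(f1, f2, 'MaxRatio', 0.6, 'Unique', true);

matched1 = vpts1(indexPairs(:, 1));
matched2 = vpts2(indexPairs(:, 2));
figure; showMatchedFeatures(I1, I2, matched1, matched2, 'montage');  % putative matches
```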
4.4.2.2 Outlier rejection and image matching
Even after the matching stage, there exist a few mismatches or outliers between the two
images. To overcome this, RANSAC [77] was used to estimate the homography matrix, H
[103]. The relation between the matches in source and the target image is given by,
\[
p_i = H\, q_j. \qquad (4.32)
\]
The RANSAC algorithm chooses four correspondences at random and estimates a candidate
homography. For each pair, the residual of Equation (4.32) is calculated, and pairs with errors
greater than a threshold are flagged as outliers. This procedure is repeated several times until
the minimum number of outliers is reached or the least total error is obtained; thus, four
correspondence points are required to estimate the homography matrix. To refine the
homography estimate, the symmetric transfer error is used as the cost function:
\[
d_{\mathrm{Sym}} = d\!\left(p_i,\, H q_j\right)^{2} + d\!\left(q_j,\, H^{-1} p_i\right)^{2}, \qquad (4.33)
\]

where d(.,.) is the Euclidean distance between the homogeneous points p_i and q_j. Figure 4.13
shows the inlier matches between the two images; there are about 120 matches after RANSAC
outlier rejection.
Figure 4.13: Inlier matches of the two images.
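As a sketch of this outlier rejection stage (not the exact implementation used here), MATLAB's estimateGeometricTransform can perform the RANSAC-based homography fit on the putative matches from the earlier snippet; matched1 and matched2 are the placeholder point sets defined there, and the parameter values follow the settings reported later in Section 4.5.3.2.2.

```matlab
% RANSAC-based projective (homography) estimation on the putative SURF matches.
[tform, inlierTarget, inlierSource] = estimateGeometricTransform( ...
    matched2, matched1, 'projective', ...
    'MaxDistance', 1.5, ...      % max reprojection distance (pixels) for an inlier
    'Confidence', 99.9, ...      % desired confidence of finding the maximum inlier set
    'MaxNumTrials', 500);        % maximum number of random trials

H = tform.T;                     % 3x3 matrix mapping target-image points to the source frame
figure; showMatchedFeatures(I1, I2, inlierSource, inlierTarget, 'montage');  % inliers only
```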
4.4.2.3 Bundle adjustment for homographies
Bundle adjustment is an optimization technique to correct the camera parameters. In addition,
it helps to minimize the accumulated drift that arises when pairwise homographies are chained
over many images while multiple global constraints are disregarded (e.g.,
the ends of the scene should match). Images with the maximum number of consistent matches are added to
the bundle adjuster one by one. The objective function is a robustified sum squared projection
error. Each feature is projected into all the images it matches, and the sum of squared
image distances is minimized with respect to the camera parameters. Given a correspondence
p_i^k <-> q_j^l (where p_i^k denotes the position of the k-th feature in image i), the residual is given by,

\[
r^{k}_{ij} = p^{k}_{i} - P^{k}_{ij}, \qquad (4.34)
\]

where P^k_ij is the projection from image j to image i of the point corresponding to p^k_i, and is
given by,

\[
P^{k}_{ij} = H_{ij}\, q^{l}_{j}, \qquad (4.35)
\]
where H_ij is the homography between the pairwise images. The error function is the sum
over all images of the robustified residual errors, given by,

\[
e = \sum_{i=1}^{n} \sum_{j \in \mathcal{I}(i)} \sum_{k \in \mathcal{F}(i,j)} \left\| r^{k}_{ij} \right\|_{2}^{2}, \qquad (4.36)
\]

where n is the number of images, I(i) is the set of images matching image i, and F(i, j) is
the set of feature matches between images i and j. This is a non-linear least-squares problem
that can be solved using the Levenberg-Marquardt algorithm. This work utilizes the
MATLAB Optimization Toolbox to solve the non-linear problem.
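The following is a minimal sketch of how such a reprojection-error minimization could be set up with lsqnonlin from the MATLAB Optimization Toolbox. The residual function, the parameterization of each homography as an 8-vector (with the last entry fixed to 1), and the data structure `matches` are illustrative assumptions rather than the exact formulation used in this work.

```matlab
% h0: initial guess, an 8-by-nPairs matrix of homography parameters (h33 fixed to 1).
% matches(k): struct with fields .p (2-by-M points in image i) and .q (2-by-M points in image j).
residualFun = @(h) stackResiduals(h, matches);
opts = optimoptions('lsqnonlin', 'Algorithm', 'levenberg-marquardt', 'Display', 'iter');
hOpt = lsqnonlin(residualFun, h0(:), [], [], opts);

function r = stackResiduals(h, matches)
% Concatenate the reprojection residuals r^k_ij = p^k_i - H_ij * q^l_j over all image pairs.
    h = reshape(h, 8, []);
    r = [];
    for k = 1:numel(matches)
        H = [h(1:3, k)'; h(4:6, k)'; h(7:8, k)' 1];          % rebuild the 3x3 homography
        q = [matches(k).q; ones(1, size(matches(k).q, 2))];   % homogeneous target points
        proj = H * q;
        proj = proj(1:2, :) ./ proj(3, :);                    % projected points in image i
        r = [r; reshape(matches(k).p - proj, [], 1)];         % residual vector, Eq. (4.34)
    end
end
```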
4.4.2.4 Gain compensation
Differences in gain or exposure between the stitched and current images cause spurious artifacts
such as dark or light patches. Therefore, it is necessary for the stitched/cropped and current
images to have similar gain, so that the difference between them does not produce false-positive
pixels due to over- or under-exposure. Generally, for crack images, the difference between
them is taken in the binary domain; even so, gain differences can cause significant false-positive
crack pixels. Thus, this photometric correction and parameter estimation is highly recommended.
An error function is defined over all the stitched and current images. The error function
is the sum of normalized gain intensity errors over all the overlapping pixels and is given by,

\[
e = \frac{1}{2} \sum_{i=1}^{n} \sum_{p_i \in R(i,j)} \left( g_i\, \bar{I}_i(p_i, q_j) - g_{cv}\, \bar{I}_{cv}(q_j) \right)^{2}, \qquad (4.37)
\]
where g_i and g_cv are the gains of the i-th image and the current view image, respectively, and
R(i, j) is the region of overlap between images i and j. The terms Ī_i and Ī_cv are approximated by
the mean of the pixel intensities in each overlapping region. The gain g_i can be estimated by
minimizing the above quadratic error function by setting its derivative to zero. Since the current
view image is unchanged, g_cv is set to unity.
4.4.2.5 Multi-band blending
After gain compensation and stitching of the images, some edges are still visible due to
vignetting, parallax effects, and misregistration errors [28]. Linear blending smears these
misregistration errors and blurs the high-frequency regions such as edges and
sharp-contrast regions (e.g., crack boundaries). To overcome this, the multi-band blending
method proposed by [31] is adapted in this study. The blending weights are initialized by,
\[
W^{i}_{\max}(x, y) =
\begin{cases}
1, & \text{if } W^{i}(x, y) = \max_{j} W^{j}(x, y), \\
0, & \text{otherwise},
\end{cases}
\qquad (4.38)
\]

where (x, y) are the pixel coordinates, and W^i_max(x, y) is 1 at the (x, y) values where image i has
the maximum weight, and 0 where some other image has a higher weight. These maximum-weight
maps are successively blurred using a Gaussian kernel to form the blending weights for each
band. A high-pass filtered image is formed by,
\[
B^{i}_{\sigma}(x, y) = I^{i}(x, y) - I^{i}_{\sigma}(x, y), \qquad (4.39)
\]
\[
I^{i}_{\sigma}(x, y) = I^{i}(x, y) * G_{\sigma}(x, y), \qquad (4.40)
\]

where G_σ(x, y) is a Gaussian kernel of standard deviation σ, and * is the convolution operator.
B^i_σ(x, y) is a difference-of-Gaussians (Laplacian-like) image that represents spatial frequencies in
the range [0, σ]. The images are blended using the blend weights (Equation (4.38)) convolved with
the Gaussian operator,
\[
W^{i}_{\sigma}(x, y) = W^{i}_{\max}(x, y) * G_{\sigma}(x, y), \qquad (4.41)
\]

where W^i_σ(x, y) is the blend weight for the band range [0, σ]. Furthermore, the lower-frequency
band-pass images are blended with blend weights for k >= 1,
\[
B^{i}_{(k+1)\sigma}(x, y) = I^{i}_{k\sigma}(x, y) - I^{i}_{(k+1)\sigma}(x, y), \qquad (4.42)
\]
\[
I^{i}_{(k+1)\sigma}(x, y) = I^{i}_{k\sigma}(x, y) * G_{\sigma'}(x, y), \qquad (4.43)
\]
\[
W^{i}_{(k+1)\sigma}(x, y) = W^{i}_{k\sigma}(x, y) * G_{\sigma'}(x, y), \qquad (4.44)
\]

where σ' = \sqrt{2k + 1}\,σ is the standard deviation of the Gaussian blurring kernel.
The final blended image is formed by combining the overlapping images linearly using the
corresponding blend weights, and is given by,
\[
I^{\mathrm{final}}_{k\sigma}(x, y) = \frac{\sum_{i=1}^{n} B^{i}_{k\sigma}(x, y)\, W^{i}_{k\sigma}(x, y)}{\sum_{i=1}^{n} W^{i}_{k\sigma}(x, y)}. \qquad (4.45)
\]

The above equation blends the high-frequency bands (small kσ) over short spatial ranges and the
low-frequency bands (large kσ) over larger ranges, thus keeping the region near the cracks
sharper. In this study, two bands, or pyramid levels, are used for the image blending.
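A compact two-band version of this blending scheme is sketched below using imgaussfilt from the Image Processing Toolbox. The band structure, the σ value, and the variable names (images, weights) are illustrative choices for the sketch rather than the exact settings of the study.

```matlab
% images:  1-by-n cell array of aligned grayscale (double) images on the composite canvas.
% weights: 1-by-n cell array of max-weight maps W^i_max from Eq. (4.38) (assumed given).
sigma  = 5;                                       % illustrative blur scale for band [0, sigma]
numLow = zeros(size(images{1})); denLow = numLow;
numHi  = numLow;                 denHi  = numLow;

for i = 1:numel(images)
    Ismooth = imgaussfilt(images{i}, sigma);             % I^i_sigma, Eq. (4.40)
    Bhigh   = images{i} - Ismooth;                       % high-frequency band, Eq. (4.39)
    Whigh   = imgaussfilt(double(weights{i}), sigma);    % blurred blend weights, Eq. (4.41)
    Wlow    = imgaussfilt(Whigh, sigma*sqrt(3));         % wider weights, sigma' = sqrt(3)*sigma

    numHi  = numHi  + Bhigh   .* Whigh;   denHi  = denHi  + Whigh;
    numLow = numLow + Ismooth .* Wlow;    denLow = denLow + Wlow;
end

% Per-band weighted averages (Eq. (4.45)), then sum the bands to form the final mosaic.
blended = numHi ./ max(denHi, eps) + numLow ./ max(denLow, eps);
```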
4.4.2.6 Image transformation and image resampling
In practical applications such as crack detection, the mobile robot or the MAV does not make
complex maneuvers that would create a nonlinear geometric difference between the images, so
the geometric difference is assumed to be negligible. Therefore, corresponding points in
the images can be related by affine or linear transformations such as scaling and rotation
(similarity transform). However, due to the different viewing points, the images can also be
related by a projective transformation.
Given the coordinates of N corresponding points in the reference and target images,
\[
\{(x_i, y_i), (X_i, Y_i) : i = 1, \ldots, N\}, \qquad (4.46)
\]

a transformation function f(x, y) with components f_x(x, y) and f_y(x, y) is one which satisfies

\[
X_i = f_x(x_i, y_i), \qquad (4.47)
\]
\[
Y_i = f_y(x_i, y_i), \quad i = 1, \ldots, N, \qquad (4.48)
\]

or, when the correspondences contain noise, one which approximately satisfies

\[
X_i \approx f_x(x_i, y_i), \qquad (4.49)
\]
\[
Y_i \approx f_y(x_i, y_i), \quad i = 1, \ldots, N. \qquad (4.50)
\]
Based on Equation (4.49), if the matching points are known, the function f_x(x_i, y_i) can
be approximated by a similarity transformation of the Cartesian coordinate system, repre-
senting global translational, rotational, and scaling differences between the two images. This
transformation is defined by,

\[
X = S\left[x \cos\theta - y \sin\theta\right] + h, \qquad (4.51)
\]
\[
Y = S\left[x \sin\theta + y \cos\theta\right] + k, \qquad (4.52)
\]
where S, θ, and (h, k) are the scaling, rotational, and translational differences between the images
(refer to Figure 4.14), respectively. Similarly, if lens and sensor nonlinearities do not exist,
the relation between two images of a rather flat scene can be described by the projective
transformation (refer to Figure 4.14), given as,

\[
X = \frac{a x + b y + c}{d x + e y + 1}, \qquad (4.53)
\]
\[
Y = \frac{f x + g y + h}{d x + e y + 1}, \qquad (4.54)
\]
where a, ..., h are the eight unknown parameters of the transformation to be estimated. To
determine them, four non-collinear corresponding points in the images should be known (each
point provides two constraints). If the correspondences contain noise, more than four correspondences
should be used with the least-squares method to obtain the transformation parameters.
When the scene is very far from the camera, the projective transformation can be
approximated by the affine transformation (refer to Figure 4.14),

\[
X = a x + b y + c, \qquad (4.55)
\]
\[
Y = f x + g y + h. \qquad (4.56)
\]
An affine transformation has six parameters, which can be determined if the coordinates of
at least three non-collinear corresponding points in the images are known (each point provides
two constraints). Further details about the above analytical expressions are available in [96].

Figure 4.14: A sample crack image shown with three different linear transformations.
After the transformation function is estimated, the target image can be mapped back to
the reference image using interpolation techniques such as bilinear, bicubic, or nearest-neighbor
interpolation (refer to Figure 4.15). A planar homography is used as the image composition
surface with respect to the current image. This produces stitched images in which straight lines
remain straight after the transformation. In contrast, a cylindrical or spherical composition
surface tends to distort the images, and straight lines become skewed.
Figure 4.15: A registered and resampled image.
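The transformation and resampling step can be sketched in MATLAB as follows, assuming the projective tform recovered earlier by RANSAC and the placeholder image names I1 and I2; imref2d/imwarp and the bicubic interpolation choice mirror the planar-homography composition and resampling described above.

```matlab
% Warp the target image onto the plane of the reference (current-view) image.
refSize   = size(I1);                                    % reference image size
outputRef = imref2d(refSize(1:2));                       % composition surface = reference plane

registered = imwarp(I2, tform, 'cubic', ...              % bicubic resampling
                    'OutputView', outputRef, ...
                    'FillValues', 0);                    % black background outside the overlap

figure; imshowpair(I1, registered, 'blend');             % quick visual check of the alignment
```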
4.4.3 Damage assessment
4.4.3.1 Damage and change detection
After registering the images that match the current image, the reconstructed and current
images need to be processed to extract the crack (damage) pixels. An autonomous concrete
crack segmentation method described in Chapters 2 and 3 can be used. However, generally, the
CrackDenseLinkNet CNN requires a transfer learning procedure to learn a new dataset for
better predictions. The gradient-based MFAT method is free from this step, but is susceptible
to false positives. Thus, to assess the generalization capability of the CrackDenseLinkNet
method on testing datasets, that method is leveraged in this chapter. In addition, the
inference/prediction time of the MFAT method for each image is a few seconds greater than
that of the CrackDenseLinkNet CNN. Figure 4.16 shows the semantic segmentation output of the
source and target images using the MFAT method for demonstration purposes.

Figure 4.16: Results of semantic segmentation of two images of the same crack at time
periods T_0 and T_1. (a) Source/reference image. (b) Source binary output crack image. (c)
Target image. (d) Target binary output crack image.
Statistical analysis of the crack’s physical properties, such as its width, length, and area,
provides inference on the change and propagation of the defect. Generally, obtaining accurate
physical properties depends on camera noise, camera viewpoint or pose, MAV location, lens
distortion, quality of the matching keypoints, and radiometric errors such as reflectivity of
the surface. Thus, inferring the change or crack propagation obtained from a single scan leads
to large uncertainty in the crack physical properties, and many sample scans are required to
reduce the error. A change detection procedure for a crack is described in this section. The
time axis, T_i, where i = 0, 1, 2, ..., n ∈ Z+, represents the change and no-change scans. T_0 is
the source, and T_n, where n = 1, 2, 3, ..., are the current inspection scans. Each of these T_i
has multiple trials of scans. The first trial of the scan T_0 can be used as the source/reference
dataset. Subsequently, the remaining trials at scan T_0 and at T_i, where i > 0, are compared
against the source dataset. The histograms of the means of the crack physical properties for the
source and current inspection datasets indicate statistical changes at time steps T_1 to T_n. It is
worth noting that when the periodicity of the inspection is large, the T_0 scans may lead to
registration errors (mismatch of the keypoints) due to the abrupt change in the scene. Thus,
it is recommended to use the scans at later time periods, T_i with i >> 0, as the source
dataset.
Crack physical properties form a stochastic and non-stationary process due to the conditions
that govern the data acquisition and registration of the images [19]. The random processes of
the crack width, length, and area (W_c, L_c, and A_c) are denoted by {X_{W_c,L_c,A_c}(t)}. The
sample means

\[
\mu^{i}_{X_{W_c,L_c,A_c}} = \frac{1}{K} \sum_{k=1}^{K} X_{W_c,L_c,A_c}[k]
\]

are the means of the crack width, length, and area, respectively, between a source and a target
scan (i.e., one sample). An ensemble average of the means of the crack width, length, and area is
given by

\[
\bar{\mu}^{i}_{X_{W_c,L_c,A_c}} = \frac{1}{N} \sum_{n=1}^{N} \mu^{i}_{X_{W_c,L_c,A_c}}[n],
\]

which converges to the true ensemble average as N → ∞. The standardized change criteria for the
source (no-change) and subsequent change samples, in terms of the means and standard
deviations, are given by,
\[
\frac{\mu^{i}_{W_c,L_c,A_c} - \mu^{0}_{W_c,L_c,A_c}}{\mu^{0}_{W_c,L_c,A_c}}, \qquad (4.57)
\]
\[
\frac{\mu^{i}_{W_c,L_c,A_c} - \mu^{0}_{W_c,L_c,A_c}}{\sigma^{0}_{W_c,L_c,A_c}}, \qquad (4.58)
\]
where μ^0_{W_c,L_c,A_c} and σ^0_{W_c,L_c,A_c} are the ensemble mean and standard deviation of the source
(no-change) dataset's crack width, length, and area, respectively, and μ^i_{W_c,L_c,A_c} is the ensemble
mean of the current inspection (change) dataset, where i = 1, 2, 3, ..., n, with n the number of
inspections. It is hard to obtain a complete scan dataset of a complex-shaped structure
in a single attempt due to occluded regions. Therefore, ensemble datasets can help achieve
large area coverage of the concrete structure and minimize the uncertainty of the crack
physical properties for an accurate assessment of the damage and its propagation. This information can be
exploited as the initial conditions for a damage mechanics model, where knowing the exact
location of the crack and its physical properties is important. The next section details the
procedure to extract the crack width, length, and area from the semantically segmented binary
images.
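As a brief numerical sketch of these change criteria, the MATLAB snippet below computes the relative and standardized changes of Equations (4.57) and (4.58) from assumed per-trial crack-width means; the array names and values are purely illustrative and are not measured results.

```matlab
% Per-trial means of crack width (mm) for the source (T0) and a later inspection (Ti);
% the values below are illustrative placeholders, not data from this study.
muT0_trials = [2.1 2.0 2.2 2.1 1.9];       % ensemble of source (no-change) trial means
muTi_trials = [2.6 2.7 2.5 2.8 2.6];       % ensemble of current inspection trial means

mu0    = mean(muT0_trials);                % ensemble mean of the source dataset
sigma0 = std(muT0_trials);                 % ensemble standard deviation of the source dataset
muI    = mean(muTi_trials);                % ensemble mean of the current inspection

relChange = (muI - mu0) / mu0;             % Eq. (4.57): relative change in mean crack width
stdChange = (muI - mu0) / sigma0;          % Eq. (4.58): change in source standard deviations
fprintf('Relative change: %.2f, standardized change: %.2f\n', relChange, stdChange);
```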
4.4.3.2 Damage quantification
To quantify the crack propagation, the crack map obtained from Chapter 3 is first
skeletonized, or thinned, using the morphological hit-or-miss method [91]. The hit-or-miss
transform is a general binary morphological operation that looks for particular patterns
of foreground and background pixels in an image. It is the essential operation of binary
morphology, since almost all other binary morphological operators can be derived from it.
Like the other binary morphological operators, it takes an input binary image and a structuring
element and produces another binary image as output.
The structuring element used in the hit-or-miss is a slight extension to the type introduced
for erosion and dilation. It can contain both foreground and background pixels, rather than
just foreground pixels (i.e., both ones and zeros). The simpler structuring element used with
erosion and dilation is often depicted containing both ones and zeros. However, in that case,
the zeros are do-not-care regions and are just used to fill out the structuring element to a
conveniently shaped kernel, usually a square. Here, do-not-care regions are shown as blanks
in the kernel in order to avoid confusion. An example of the extended kind of structuring
element is shown in Figure 4.17. Again, foreground pixels are ones, and background pixels
are zeros.
The hit-or-miss operation is performed in much the same way as other morphological
operators such as erosion or dilation, by translating the origin of the structuring element
to all points in the image and then comparing the structuring element with the underlying
image pixels. For example, suppose the foreground and background pixels in the structuring
element exactly match foreground and background pixels in the image. In that case, the
pixel underneath the origin of the structuring element is set to the foreground color. If it
does not match, then that pixel is set to the background color.
Figure 4.17: Hit-or-miss structuring elements.
The thinning of a set A (a binary image) by the structuring element B is denoted A ⊗ B and
can be defined in terms of the hit-or-miss transform:

\[
A \otimes B = A \cap \left(A \circledast B\right)^{c}, \qquad (4.59)
\]

where c denotes the binary complement, A ⊛ B = (A ⊖ D) ∩ [A^c ⊖ (W - D)], and W and D are the
neighboring window width and depth. Figure 4.18b shows the thinned crack skeleton (red
color) of one-pixel width on a black background.

Figure 4.18: Thinning operation. (a) Crack map obtained from the hierarchical hybrid filter.
(b) Thinned crack skeleton.
It is a valid assumption that the crack width varies along the direction normal to the
tangent at each pixel, as shown in Figure 4.19. The centerline method of crack width
determination is more accurate than the distance-transform or the boundary-to-boundary
method [119]. This is because the centerline and boundaries are assumed to be perfect in the
latter cases, whereas in practical situations they are not.
After obtaining the crack skeleton, the orientation of each crack pixel is determined
within a 5 x 5 neighborhood [203], as shown in Figure 4.19. There are generally two scenarios:
in the first, the region contains endpoints, and in the other, the region contains only non-end
points. Consider the non-end-point scenario in Figure 4.19b. An ellipse is fitted to the
red pixels with the blue pixel as the center. The orientation of the ellipse's major axis
with respect to the X-axis provides the center pixel's orientation. This is repeated
for all the centerline pixels. In the second scenario, where there are endpoints, the
5 x 5 neighborhood is slid inwards by constructing a region around the endpoints so that the
orientation is estimated from more pixels.

Figure 4.19: Orientation of the crack pixels. (a) Skeleton obtained from morphological
thinning. (b) Orientation of the blue pixel within the 5 x 5 neighborhood.
The connected component within the n x n neighborhood is extracted, and its orientation, θ,
is estimated using the second-order moments of the pixels, as given in Equation (4.60),

\[
\theta = \arctan\!\left( \frac{\left|\mu_{xx} - \mu_{yy}\right| + \sqrt{\left(\mu_{xx} - \mu_{yy}\right)^{2} + 4\mu_{xy}^{2}}}{2\,\mu_{xy}} \right), \qquad (4.60)
\]
where x_i and y_i are the image coordinates of the red pixels in the window, x̄ and ȳ are the
means of the image coordinates, and

\[
\mu_{xx} = \frac{\sum \left(x_i - \bar{x}\right)^{2}}{N} + \frac{1}{12},
\qquad
\mu_{yy} = \frac{\sum \left(y_i - \bar{y}\right)^{2}}{N} + \frac{1}{12},
\qquad
\mu_{xy} = \frac{\sum_{i}^{N} x_i\, y_i}{N}.
\]
After estimating θ, the normal direction is found as N_r = θ + π/2. In this study, the normals in
both directions are found and traversed, because some artifacts along the crack edges
cannot be fully removed by the morphological thinning. Lastly, the quasi-Euclidean distance
d_i between two successive points (x_{i+1}, y_{i+1}) and (x_i, y_i) is calculated using,
\[
d_i =
\begin{cases}
\left|x_{i+1} - x_i\right| + \left(\sqrt{2} - 1\right)\left|y_{i+1} - y_i\right|, & \text{if } \left|x_{i+1} - x_i\right| > \left|y_{i+1} - y_i\right|, \\
\left|y_{i+1} - y_i\right| + \left(\sqrt{2} - 1\right)\left|x_{i+1} - x_i\right|, & \text{otherwise}.
\end{cases}
\qquad (4.61)
\]
The length of the crack is given by:

\[
l = \sum_{i} d_i. \qquad (4.62)
\]
Lastly, the area of the cracks is the number of white pixels in the binary images.
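A condensed MATLAB sketch of this quantification step (thinning, length via the quasi-Euclidean metric, and area counting) is given below. It uses bwmorph on an assumed binary crack map BW; the pixel ordering is a crude simplification of the orientation-and-normal traversal described above, and crack width estimation is omitted.

```matlab
% BW: binary crack map (logical), e.g., the output of the semantic segmentation stage.
skel = bwmorph(BW, 'thin', Inf);             % one-pixel-wide skeleton (hit-or-miss thinning)

% Order the skeleton pixels roughly along the crack (simplified: sort by column index;
% a geodesic ordering along the centerline would be used in practice).
[r, c] = find(skel);
[c, order] = sort(c);  r = r(order);

% Quasi-Euclidean length, Eqs. (4.61)-(4.62): d = max(dx,dy) + (sqrt(2)-1)*min(dx,dy).
dx = abs(diff(c));  dy = abs(diff(r));
d  = max(dx, dy) + (sqrt(2) - 1) * min(dx, dy);
crackLength = sum(d);                         % crack length in pixels

crackArea = nnz(BW);                          % crack area = number of white pixels
fprintf('Length: %.1f px, area: %d px\n', crackLength, crackArea);
```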
4.5 Experimental results and discussion
4.5.1 Dataset preparation
4.5.1.1 MAV simulation synthetic dataset
Cracks at a given area change in time when external factors such as shrinkage, creep, constant
quasi-static or dynamic overloading, and environmental loads (e.g., earthquake or wind) are
applied. These factors can rapidly or slowly cause crack growth, depending on the magnitude
of the external factors. To replicate the crack propagation in the laboratory conditions,
compression test on the structural elements such as beams, columns, or slabs has to be
performed. Due to the lack of resources and monetary support, this study could not build a
small-scale structural component to grow the cracks in real time. Instead, a physics-based
computational model for a simulation of a MAV was utilized. A synthetic crack dataset was
employed in the MAV Gazebo simulator to simulate the growth in time.
A laboratory experiment was conducted using a differential drive mobile robot mounted
with an Asus Xtion Pro Live 3D Sensor to capture the synthetic cracks hand-drawn on the
laboratory floor. Generally, the cracks on the concrete surfaces are darker in comparison to
the background. Therefore, these synthetic cracks were drawn using a black marker. The
width of these cracks varied from 0.001 to 0.015 m, and their lengths ranged from about 0.076 to 0.254
m. The working distance from the camera to the laboratory floor was 0.457 m. The mobile
robot traveled along a straight line of length 3 ± 0.035 m. Crack changes in five different
scenarios were considered in this study, where the crack width and length of the existing
cracks were modified by drawing, and a few new cracks were also drawn.
When the cracks captured from the mobile robot were directly stitched using commercial
software, the quality of the stitched image was mediocre due to the distortion. Thus,
synthetic cracks along the 3 m length were grouped into three smaller sets. These group
images were stitched together to form a larger image using Adobe Photoshop commercial
software. Similarly, this operation was repeated for all five different scenarios where the
cracks change. In these images, labeled synthetic cracks were extracted using MATLAB for
all scenarios. Later, binary images consisting of synthetic cracks from respective groups were
prepared. The synthetic cracks were randomly placed, with random rotations, in a binary image
representing a 3.048 m x 3.048 m area. The coordinate transformation between the image and world
units was based on a marker of diameter 0.038 m. The cracks from different scenarios
were placed at the exact locations, replicating the crack's growth through time.

Figure 4.20: MAV-captured synthetic images. (a) Experimental images at time period T_0.
(b) Experimental images at time period T_1. (c) Experimental images at time period T_2.
(d) Experimental images at time period T_3. (e) Experimental images at time period T_4.
(f) Experimental images at time period T_5. The images follow a snake zig-zag pattern (top
row left to right, followed by the next row right to left).

These five
scenarios represent the oldest (with no cracks) to the newest (with larger and more cracks) in
a consecutive manner. These scenario images are masked and mapped onto a concrete surface
image by blending authentic concrete images using Adobe Photoshop. Lastly, the markers
were added (by determining the ratio between the image pixels and marker length in world
units) onto all the scenarios at the same location to serve as a coordinate transformation
reference in the MAV simulation (see Figure 4.9, which shows the 3.048 m x 3.048 m concrete surface
without synthetic cracks at the initial time period, T_0).
After creating six 3.048 m x 3.048 m concrete surface images with and without cracks, the
same images were overlaid on the Gazebo simulator environment floor to mimic a concrete
floor. A simulated hexacopter MAV model, the AscTec Firefly, was utilized as the sensor-carrying
vehicle to acquire the images and provide each image's location. The hexacopter flies across
the 3.048 m x 3.048 m area in a snake zig-zag path at a height of 0.400 m and an average
speed of 0.400 m/s, capturing images of the concrete floor along the path. Furthermore,
the path planning, using the ROS image view package, was developed such that the overlap between
the images is at least 50%. These images are of dimension 752 x 480 pixels, with no radial or
tangential distortion. The MAV motion was paused for 0.3 seconds while acquiring these
images to reduce motion blur. The MAV captures 63 images for each trial in the given
scenario, which are later stitched together using the proposed method. Figure 4.20 shows the
six different scenarios used in this study. The first scenario has no cracks, followed by cracks
in others. Lastly, 15 trials of the simulation were performed to create ensemble datasets of
each scenario.
4.5.1.2 Real-world dataset
Similar to the synthetic datasets, real-world datasets were constructed to demonstrate the
capability of the proposed method. In this study, three different datasets of varying crack
thickness were considered. Primarily all the images are of the outdoor concrete surface at
the University of Southern California, Los Angeles campus. The lengths of the various cracks
ranged from 1.524 to 2.133 m. Since there was no MAV to acquire the images, a human
mimicked the path traversal along the crack by using the high-resolution Sony H300 camera
mounted on a tripod. The working distance to the ground plane is 1.5 meters, and the
camera was fixed firmly to the tripod to reduce the parallax error. Therefore, the
planar homography assumption is valid, as the image plane was parallel to the ground plane.
Furthermore, care was taken to keep the tripod level during the image acquisition.

Figure 4.21: Real-world thin-sized crack datasets. (a) Experimental images at time period
T_0. (b) Experimental images at time period T_1. The images follow a zig-zag pattern (top row
left to right, followed by the next row left to right).

Figure 4.22: Real-world medium-sized crack datasets. (a) Experimental images at time
period T_0. (b) Experimental images at time period T_1. The images follow a zig-zag pattern
(top row left to right, followed by the next row left to right).

Figure 4.23: Real-world thick-sized crack datasets. (a) Experimental images at time period
T_0. (b) Experimental images at time period T_1. The images follow a zig-zag pattern (top row
left to right, followed by the next row left to right).
Generally, simulation environments have ambient uniform sunlight. Thus, the images
have consistent brightness throughout. However, in real-world outdoor scenarios, weather
conditions have a direct impact on the lighting condition. Therefore, better lighting is essential
to obtain clear and crisp images. The datasets acquired at different time periods (morning,
afternoon, or evening) consist of variations in the contrast, brightness, and gain of the color
images. Each image in the datasets is originally of size 5152 x 3864 pixels and is resampled to
644 x 483 pixels for swift computation. Figures 4.21 to 4.23 show the three datasets of the thin,
medium, and thick sized cracks, respectively, acquired at two different time periods. The
thin, medium, and thick cracks have varying widths of 0.003 - 0.010 m, 0.012 - 0.018 m, and
0.020 - 0.025 m, respectively. Due to the difficulty in maintaining a consistent orientation
while acquiring the images, there is a natural variation in the orientation mismatch between
the two sets of images. These natural variations occur while scanning large-scale structures
due to the uncontrolled movement of sensor-carrying vehicles. Therefore, these uncertainties
demonstrate the capabilities of the proposed system for different scenarios.
4.5.2 Crack images recovery after transformation and limitations
A motion model such as rigid, affine and projective is required to transform the images
onto the plane of another image. Generally, rigid motion models are sufficient to recover
the unknown orientations and displacement of the images when the transformation of the
images is rigid. However, due to the small movements in the camera or the MAV oscillations,
the camera viewing angle will be changed. Thus, inducing the effect of affine or projective
transformations in image recovery. Since the working distance (distance from the camera to
the object (e.g., concrete surface)) is relatively small while acquiring the images to estimate
the crack propagation, a projective transformation is sufficient to recover the motion model
parameters. Otherwise, an affine motion model is preferred. This study uses a feature-based
image stitching method to stitch the neighboring images at previous time periods compared
to the current image. Thus, it is necessary to know the limitations of the feature-based
registration algorithm in recovering the unknown parameters of the above motion models.
Figure 4.24 shows the Root Mean Square (RMS) error for a total of 100 samples of crack
and non-crack images of size 644 x 483 pixels when subjected to arbitrary orientations between
1° and 179° and scaling from 0.5 to 1.5 for similarity distortions. Furthermore, the shear terms
(Sh_x, Sh_y) were varied from 0.1 to 0.3 for affine distortions, and, lastly, the ratio E/F
was kept at 0.1 while the projective angle was varied from 1° to 45°. It is evident that the
feature-based detectors could recover the original configurations of the target images for the
similarity transformation and, to a certain extent, for the projective transformation. However, the
affine transformation was not recovered well in the target images because of the higher shear
parameter ratios; the correspondences between the two images did not match well
because of the larger distortion.

Figure 4.24: RMS error due to the linear transformations (similarity, affine, and projective).
4.5.3 Crack change detection and localization
4.5.3.1 A demonstration on a single real crack on the concrete surface
To demonstrate the capability and limitation of the image registration and crack segmentation
methods, the proposed system is tested on a single image pair. To measure the change
of the same crack, two images were taken at time periods T_0 and T_1. Furthermore, to find the
transformation matrix between the reference image and the target image, SURF-based keypoint
matching was performed. The reference and target images had 2404 and 3290 keypoints,
respectively. There were a total of 556 putative matches and 268 inlier matches. The
proposed MFAT crack detection algorithm was used on both images, taken at the two different
time periods. The number of iterations for isotropic diffusion was kept at 5, and the scale range was
0.7181 to 7.3443. Once the crack maps were produced, the crack width was measured for both
crack maps separately and compared.
Figure 4.25: Results from the proposed crack detection method. (a) Source crack map. (b)
Target crack map. (c) Transformed target crack map to source. (d) Aligned target crack
map to source (green: source crack pixels; yellow: transformed target crack; red: false positives).
Figure 4.26a shows the crack normals obtained after calculating the orientation of the red
pixels in Figure 4.19. In this study, it was found that a 5 x 5 neighborhood around the center
pixels produced more accurate crack orientations, because the ellipse fitted within this
neighborhood estimated the local orientation reasonably well. A larger neighborhood generally
smooths the orientation, while too small a neighborhood does not capture the orientation trend. If there are
any branch artifacts, the pixels traversed along the crack normals can hit the artifacts and
lead to overestimation of the crack width.

Figure 4.26: Crack normals and crack width variation along the centerline. (a) Crack
normals; blue and red arrows represent the positive and negative directions, respectively.
(b) Crack width variation contour along the centerline.
Figure 4.25d shows the crack registration and difference of the same cracks as green
and yellow regions. The red pixels in the image are false positives. Furthermore, the non-
overlapping region is caused when the image is transformed to the reference plane. To
overcome this, multiple images surrounding the cracks need to be considered to have good
continuity. Also, there was a false positive connected component on the right side of the target
image. Because of the response of the hierarchical hybrid filter, there are some protruding
blobs attached right next to cracks (refer to Figure 4.18). Figure 4.25 shows the results from
the proposed crack detection method. The proposed crack detection algorithm detects some
extra regions. This is due to the change in light intensity and scale while collecting the target
image, as both the images are taken at different times.
Figure 4.27 shows the comparison of the histograms, the Probability Density Function (PDF),
and the Cumulative Distribution Function (CDF) of the same crack at two different time periods.
They are very similar in shape. Due to the change in light intensity and camera angles, the
measured crack thickness is slightly different on the left edge (refer to Figure 4.25, panel (c)).

Figure 4.27: Comparison of the crack width PDF and CDF obtained from the proposed
MFAT approach for the same crack imaged at two different times. (a) Probability density
function. (b) Cumulative distribution function.

The similarity between the two crack maps is shown in Table 4.2: except for the crack
Crack thickness quantity        Source/reference crack map (pixels)   Target crack map (pixels)
Maximum width                   44.0000                               90.6667
Minimum width                   10.0000                                6.3333
Average width                   15.8637                               16.0271
Length                         559.0000                              320.0000
Standard deviation               5.5364                                9.3629
Root mean square (RMS)          16.8005                               18.5542
Correlation coefficient of the two histograms: 0.9596

Table 4.2: Crack thickness quantities and correlation coefficient of the two PDFs for the source and
target images.
length, all other quantities match reasonably well. Also, the correlation coefficient of the two
histograms is 0.9596, meaning they are very close to each other.
4.5.3.2 Multi-image datasets
4.5.3.2.1 Crack width and length estimation on a synthetic dataset Labeling
the images obtained from the MAV simulation and real-world scenario is a challenging task,
as there are thousands of samples. Thus, to understand the crack segmentation capability
of the proposed system, about 600 synthetic samples of size 644 x 493 pixels were used.
Synthetic crack seams were generated from the synthetic crack generation algorithm discussed
in Chapter 2 and dilated to replicate the crack thickness. The crack seam length was varied, and
the seams were rotated from 0° to 90° at random intervals. Also, the width of the synthetic cracks was varied
from 0 to 40 pixels. If the pixel scale is known, it can easily be converted into world units
(e.g., 1 mm/1 pixel).
Figure 4.28 shows the correlation of the synthetic crack width (ground-truth) with the
CrackDenseLinkNet output. For the crack width, the correlation coefficient is r = 0.9306,
and for the crack length, r = 0.7834 (a value close to 1 indicates the best fit). In Figure 4.28, as
the synthetic crack width increases, the width measured by the orthogonal projection is
overestimated due to the rasterizing effect, and this is evident in the correlation plot.

Figure 4.28: The correlation of the synthetic crack width (ground-truth) to the algorithm
output. (a) Crack width; the correlation coefficient is r = 0.9306. (b) Crack length; the
correlation coefficient is r = 0.7834. The correlation coefficient ranges over [-1, 1], where -1 and 1
indicate the worst and best fit, respectively.

The
conventional thinning algorithm [91] produces an artifact if the protrusions are present along
the boundary of the cracks. The Fast Marching Method (FMM) is effective in reducing the
boundary artifacts [239]. Thus, FMM was used for obtaining the centerline pixels along the
cracks. The FMM produced an average of 425 measurements along the crack length. The
crack width relative error was 52.42%, because the circular structuring element assumes a
constant radius, but in digital images the pixels are rasterized. When the circular structuring
element is used to dilate the crack seams, it leads to artifacts at the edges of the cracks. Thus,
when the radius of the structuring element is increased, the width of the artifact also increases.
This could be overcome if the cracks were treated as continuous, but in practice the images are
all in raster form. Another way is to use a high-resolution image, which smooths
the crack’s boundary and eventually minimizes this effect. Furthermore, some of the binary
crackmaps of thick cracks produced by the CrackDenseLinkNet were missing pixels, as the
CNN was trained on relatively thinner cracks. Additionally, some of the thinner cracks were
overestimated. The relative error was measured to be 9.45% for length measurement. This
shows that the thinning algorithm worked relatively well; the reason for this error is
that, after thinning, a few pixels at the edges and ends of the cracks become protruding
artifacts. It is worth noting that the relative errors are in a reasonable range for the crack extraction,
although the correlation coefficients are not perfect.
4.5.3.2.2 Synthetic time-series dataset collected from a MAV After the synthetic
time-series data collection, the succeeding datasets are organized with respect to the reference
images at time period T_0, trial one. A k-NN search was performed on the succeeding
datasets, with k = 6. Furthermore, only the nearest neighbor images within a 1-meter span were
matched to the reference image coordinates, filtering out the farthest images with no keypoint
matches. This reduces the unnecessary computational overhead in feature and image
matching. After this stage, the 945 reference images (time period 0) each have six nearest neighbor
images from the 4,725 succeeding dataset images (time periods 1 to 5, 15 trials). Due to
the higher number of octave levels, the SIFT algorithm produces a larger number of repeatable
and stable keypoints than the SURF method. However, SIFT is slightly more
computationally expensive than SURF. Therefore, for both the synthetic and real-world
datasets (presented in the next section), keypoints are extracted using the SIFT
algorithm. Lastly, the complete crack change detection, localization, and quantification
pipeline discussed in Section 4.4.2 is implemented in MATLAB. In addition, a parallel
programming paradigm is leveraged wherever the above pipeline satisfies the perfectly parallel
condition, to speed up the computations. As a result, it takes less than 15 seconds to find the
nearest neighbor images among all the succeeding datasets. This information is saved in a
dictionary file for all reference database images, so there is no need to recompute
the nearest neighbors.
For the SIFT detector, an approximate nearest neighbor search method is used to filter
the outliers. A matching threshold of 1.5% was used to select the strongest matches. This
threshold represents a percent of the distance from a perfect match. Two feature vectors
match when the distance between them is less than the matching threshold. In addition, a
ratio threshold of 0.6 was used to reject ambiguous matches; a ratio closer to 1
returns more matches, and vice versa. Similarly, for the homography calculation using
RANSAC, an inlier confidence of 99.9% was used. The maximum number of random trials was set
to 500, and the maximum distance from a point to its projection was set to 1.50 pixels. The maximum
distance specifies how far a point may lie from the projected location of its corresponding
point and still be considered an inlier; larger values of the maximum distance produce more
keypoint matches.
Figure 4.29 shows the reconstruction of the larger area using five nearest neighbor images,
current view/reference and cropped images. The black background region appears in some of
the reconstructed images when the cropped region does not contain the foreground pixels.
Usually, this happens at the edges of the cropped images. This can be mitigated when
the nearest neighbor images are increased and projected on the neighboring images that
match the reference image. However, in this study, it was noticed that increasing the nearest
neighbor images was not necessary as the black background region did not constitute losing
the foreground crack pixels. The cropping of reconstructed images is performed by matching
the keypoints of the reconstructed image against the current view/reference image and finding
the cropping offset coordinate values. A cropping operation follows this. On average, it takes
less than 0.5 seconds to match and crop a 752 x 480 image. In contrast, homography-
based cropping is highly computationally expensive. Furthermore, on close observation, the cropped
image pixel intensities are slightly smoothed out; this is due to the image warping
after the image transformation and resampling. Generally, bilinear interpolation is preferred,
but in this study bicubic interpolation is leveraged to maintain the sharpness of the image
intensities in the cropped images.

Figure 4.29: Synthetic image registration. (a) Five synthetic images registered with
reference to the current view image (colored quadrilaterals show the projective-transformed image
bounding boxes, and the red dashed rectangle is the cropped region). (b) Current view/reference
image at time period T_0. (c) Final aligned, reconstructed, and cropped image at time period T_1.
In this study, two stages of feature extraction, image matching, registration, and reconstruction are performed: on the current view/reference images and on the overall inspection images. The first
Figure 4.30: A registration of 63 complete synthetic images at trial one and its semantic segmentation results at time periods T_0 to T_5. (a). No-crack image at time period T_0. (b). Crack image at time period T_1. (c). Crack image at time period T_2. (d). Crack image at time period T_3.
Figure 4.30: A registration of 63 complete synthetic images at trial one and its semantic segmentation results at time periods T_0 to T_5. (e). Crack image at time period T_4. (f). Crack image at time period T_5.
reconstruction is required to match the neighboring images to the reference image. This produces a reconstructed image that closely matches the reference image, making it easy to find and track the changes in subsequent images at various time periods. Second, the overall
image registration is necessary to align the images from the first stage so that the cracks
align correctly to the neighboring aligned images. This reduces the overestimation of crack properties such as width, length, and area, which are crucial for the operation, maintenance, risk management, and rehabilitation decision process. Finally, individual cracks can be
back-tracked based on their locations to obtain the local information of the changes in cracks.
On the other hand, global change detection can also be performed by obtaining a complete
registered image and finding the crack properties over a large area. In this work, the latter
method is adopted to estimate the overall change to provide a probabilistic measure of the
reliability of the analysis results. Generally, condition assessment organizations prefer the overall change and, depending on its severity, can then examine the detailed change based on the individual cracks and a region of interest.
Since all the subsequent images in the time periods are aligned to the reference/source images, it is only necessary to estimate the transformation matrices once. Later, the same transformation matrices can warp and stitch the overall reconstructed image for the subsequent images. Furthermore, the individually aligned images are used to obtain the binary crackmaps instead
Figure 4.31: Probability density functions of the synthetic datasets for change detection at inspection rounds T_0 to T_5. (a). Ensemble mean of the crack width (mm). (b). Ensemble mean of the crack length (mm). (c). Ensemble mean of the crack area (mm × mm).
of the overall fully registered image. This is computationally and memory-wise inexpensive.
Lastly, the reconstructed overall binary crackmaps are based on the transformation matrices
estimated previously.
Figure 4.30 shows the crackmaps of an overall area at time periods T_0, ..., T_5 at the first
trial. A total of 63 images in each trial were used to reconstruct the overall scene for crack
change detection. In Figure 4.30a, a tiny connected component of a false positive is visible. In Figures 4.30b to 4.30f, all the synthetic cracks were segmented clearly. However, these cracks
are slightly thicker when compared to the MFAT method (MFAT results are not presented). This is because CrackDenseLinkNet was not trained on these types of images. In addition, cracks of size 1-4 mm or more are successfully segmented in these images. Based on the markers placed in the inspection area, the pixel length is converted to world units in millimeters. The marker length and height are 1.5 inches (38.1 mm), and after conversion, each pixel is 1.5875 mm in height and width (i.e., the marker spans 24 pixels in the image). It can further be noticed in Figures 4.30b to 4.30f that small specks of false-positive pixels are visible, due to the CNN not being trained on these images.
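For illustration, the following hedged sketch extracts approximate crack width, length, and area from a binary crackmap using a skeleton and distance-transform approximation together with the 1.5875 mm/pixel scale; note that the study itself estimates the width by orthogonal projection, so this is only an indicative alternative, and the file name is a placeholder.

% Hedged sketch: approximate crack width, length, and area from a binary crackmap.
pixel2mm = 1.5875;                           % mm per pixel from the 1.5-inch marker
bw   = imbinarize(im2gray(imread('crackmap.png')));  % hypothetical binary crackmap
bw   = bwareaopen(bw, 20);                   % drop tiny false-positive specks
skel = bwskel(bw);                           % centerline of the cracks
dist = bwdist(~bw);                          % distance to the nearest background pixel

widths_mm = 2 * dist(skel) * pixel2mm;       % local width ~ twice the distance at skeleton
length_mm = nnz(skel) * pixel2mm;            % approximate total crack length
area_mm2  = nnz(bw) * pixel2mm^2;            % crack area
fprintf('mean width %.2f mm, length %.1f mm, area %.1f mm^2\n', ...
        mean(widths_mm), length_mm, area_mm2);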
Figure 4.31 shows the PDFs of the synthetic datasets for change detection at inspection rounds T_0 to T_5 for crack width, length, and area. Clearly, the changes are visible in all three categories of the crack properties in comparison to the reference dataset. In Figure 4.31a, the PDFs monotonically increase for each time period; around 0.5 mm of change is observed in the subsequent time period PDFs. In contrast, Figures 4.31b and 4.31c do not display the same trend. This is because some of the cracks have ghosting effects due to the misregistration of one or two images in the overall reconstruction of the fully registered image of the scene. This could have happened due to wrong feature matches, encountering local minima in the bundle adjustment optimization, or blending issues. Table 4.3 shows the values standardized by the mean and standard deviation of the non-crack ensemble average for the change detection of Figure 4.31. The standardized values for crack width increase monotonically, whereas, for crack length and area, the values slightly cross over.

Inspection round | Crack property | Mean (μ) | Std. (σ) | Δμ/μ_0 | Δμ/σ_0
0 | Width (mm)  | 0.1058    | 0.4099  | 0         | 0
1 |             | 11.0955   | 0.0909  | 103.8394  | 26.8112
2 |             | 11.4715   | 0.0621  | 107.3927  | 27.7287
3 |             | 11.6910   | 0.1318  | 109.4670  | 28.2643
4 |             | 12.0475   | 0.1109  | 112.8349  | 29.1339
5 |             | 12.1769   | 0.0851  | 114.0581  | 29.4497
0 | Length (mm) | 0.1058    | 0.4098  | 0         | 0
1 |             | 153.7296  | 2.558   | 1451.5636 | 374.7921
2 |             | 169.3167  | 3.3649  | 1598.8430 | 412.8194
3 |             | 168.0838  | 2.6479  | 1587.1933 | 409.8115
4 |             | 165.1603  | 2.5025  | 1559.5705 | 402.6793
5 |             | 169.7038  | 2.2106  | 1602.5004 | 413.7638
0 | Area (mm^2) | 107.0497  | 37.4262 | 0         | 0
1 |             | 1445.2570 | 61.9141 | 12.5008   | 35.7558
2 |             | 1680.9803 | 64.6169 | 14.7028   | 42.05416
3 |             | 1607.6330 | 58.0854 | 14.01763  | 40.0943
4 |             | 1588.9669 | 40.1002 | 13.8432   | 39.5956
5 |             | 1578.6988 | 54.8487 | 13.7473   | 39.3212

Table 4.3: Standardized values of the non-crack and crack samples by the mean and standard deviation of the non-crack ensemble average for change detection.
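The standardization in Table 4.3 amounts to comparing each inspection round's ensemble mean against the mean and standard deviation of the non-crack (round 0) ensemble. A short sketch, using the crack-width row of the table as example input (small differences from the tabulated values are due to rounding), is given below.

% Hedged sketch of the standardization behind Table 4.3 (crack-width example).
roundMeans = [0.1058 11.0955 11.4715 11.6910 12.0475 12.1769];  % ensemble means, rounds 0..5
mu0    = roundMeans(1);      % mean of the non-crack reference round
sigma0 = 0.4099;             % std. of the non-crack ensemble (from Table 4.3)

deltaMu        = roundMeans - mu0;
changeOverMu0  = deltaMu / mu0;      % Δμ / μ_0 column
changeOverStd0 = deltaMu / sigma0;   % Δμ / σ_0 column
disp([changeOverMu0(:), changeOverStd0(:)]);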
4.5.3.2.3 Real-world time-series dataset Like the synthetic dataset’s change detec-
tion, real-world datasets of relatively thin, medium, and thick sized cracks were utilized to
detect, localize, and quantify the crack width, length, and area changes. There were only two
inspection rounds of the data for thin, medium, and thick sized cracks. The first inspection
round dataset was used as the source/reference images, followed by the succeeding dataset
as target images. All target images are projected onto the reference image plane. The
hyperparameters of feature and image matching algorithms are the same as in the previous
synthetic crack section. Figure 4.32 shows the reconstruction of the larger area using three
nearest neighbor images, current view/reference, cropped, and binary images. For both
the reference and cropped images, the binary crackmaps are segmented well by the proposed
CrackDenseLinkNet method.
Figure 4.33 shows the PDFs of the ground-truth and CrackDenseLinkNet outputs of crack width for thin, medium, and thick sized cracks. Furthermore, correlation coefficients of the PDFs are also provided for the ground-truth dataset and the corresponding inspection-round outputs at time periods T_0 and T_1 for all three types of cracks. The correlation coefficients of all the proposed method's outputs match well against the ground-truth datasets. However, in Figure 4.33a, the inspection-round peaks are smaller, as some of the thin cracks were overestimated, and due to the presence of false positives, the PDFs are slightly shifted to the right side of the plot. The inspection-round PDFs in Figure 4.33b match well against
Figure 4.32: Real-world images registration (thin cracks dataset). (a). Three concrete crack images are registered with reference to the current view image (colored quadrilateral shows projective transformed images bounding boxes and red dashed rectangle is the cropped region). (b). Current view image. (c). Final cropped image after reconstruction. (d). Binary crackmap of current view image. (e). Binary crackmap of cropped image.
Figure 4.33: Probability density functions (crack width, in pixels) of the real-world datasets for change detection at inspection rounds T_0 to T_1. (a). Thin crack (correlation coefficient, r = 0.8596, 0.8311). (b). Medium crack (correlation coefficient, r = 0.9462, 0.8497). (c). Thick crack (correlation coefficient, r = 0.8785, 0.8314).
the ground-truth PDFs, except for a slight shift towards the right side due to false positives.
Lastly, in Figure 4.33c, similar to the other two figures, there exists a slight shift towards the
right side due to the presence of false positive pixels. Additionally, the first peak on the left
side is higher than the ground-truth PDFs. This is because the CrackDenseLinkNet method
produced thinner false-positive cracks. Therefore, two peaks exist: one for the thin and the other for the thicker cracks.
Crack type | Inspection round | Accuracy | Specificity | Precision | Recall | F1-score
Thin   | 0 | 0.9907 | 0.9916 | 0.5909 | 0.8868 | 0.6904
Thin   | 1 | 0.9909 | 0.9915 | 0.5643 | 0.9253 | 0.6784
Medium | 0 | 0.9871 | 0.9886 | 0.7065 | 0.9055 | 0.7611
Medium | 1 | 0.9873 | 0.9881 | 0.6509 | 0.9374 | 0.7408
Thick  | 0 | 0.9910 | 0.9957 | 0.8758 | 0.8942 | 0.8773
Thick  | 1 | 0.9908 | 0.9950 | 0.8561 | 0.9087 | 0.8739

Table 4.4: Semantic segmentation metrics for the real-world datasets.
Table 4.4 shows the average semantic segmentation metrics over multiple large scene
reconstructions for the real-world datasets. The recall is relatively high for all three types
of cracks, around 90%, whereas the thin cracks suffered from a large number of false-positive pixels. Thus, the precision decreased drastically. Precision increases for the medium- and thick-sized cracks, respectively. Because none of the real-world datasets were used to train or transfer
learn the CrackDenseLinkNet CNN, thin cracks were overestimated in the thin and medium
crack datasets.
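For reference, the metrics in Table 4.4 can be computed per scene from the predicted and ground-truth binary crackmaps as sketched below (hypothetical file names; the averaging over multiple large scene reconstructions is omitted).

% Hedged sketch of the semantic segmentation metrics reported in Table 4.4.
pred = im2gray(imread('pred_crackmap.png')) > 0;   % predicted crackmap (placeholder)
gt   = im2gray(imread('gt_crackmap.png'))   > 0;   % ground-truth mask (placeholder)

TP = nnz( pred &  gt);   FP = nnz( pred & ~gt);
FN = nnz(~pred &  gt);   TN = nnz(~pred & ~gt);

accuracy    = (TP + TN) / (TP + TN + FP + FN);
specificity = TN / (TN + FP);
precision   = TP / (TP + FP);
recall      = TP / (TP + FN);
f1          = 2 * precision * recall / (precision + recall);
fprintf('Acc %.4f  Spec %.4f  Prec %.4f  Rec %.4f  F1 %.4f\n', ...
        accuracy, specificity, precision, recall, f1);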
4.5.4 Practical problems in crack localization
• Robot localization error: A robot should know where it is with respect to the coordinate system in which it is traveling. To achieve this, sensors such as wheel encoders, an IMU or INS, LiDAR, or GPS, or a combination of them are required for localization. Unfortunately, none of these sensors is perfectly accurate. Depending on the cost and operation (control system), they suffer from positional drift [240].
• Camera lens distortions: Cameras are equipped with lenses: firstly, to gather light onto the image sensor, and secondly, because under ideal pinhole projection only a single ray of light would otherwise reach each point in the image plane. Due to manufacturing defects, the camera lens can have distortions such as [78]:
– Radial:
∗ Barrel: Bends the light outwards along the edges of an image.
∗ Pin-cushion: Bends the light inwards along the edges of an image.
– Tangential: The image forming sensor is not parallel to the plane of a lens.
• Keypoint matching error: If there exist any geometric distortions between the two corresponding images, ideal keypoint matching becomes difficult unless a separate transform is used to correct the distortions [20]. Also, if there exists a flat-intensity region across the two corresponding images, the point-based detectors suffer a matching problem (see Figure 4.34). This happens if no local extrema are detected across multiple scales. To overcome these effects, finding correspondences based on edges is possible using Hough transforms and training on the shapes that are most likely present [180], but in
Figure 4.34: Keypoint-based method (SURF) correspondence matching error. (a). Two artificially drawn cracks on a smooth white wall at USC, KAP. The transparent red region is the overlap area between the two images. (b). Falsely matched keypoints. Keypoints on the left image are pointed towards the right image.
general, on concrete structures, stains and blobs are often present. These local features
can be used for correspondence matching.
4.6 Summary and conclusions
Cracks are a precursor to the deterioration of structures. They are caused by constant static and dynamic cyclic loading, shrinkage, creep, and corrosion of the reinforcement in the
structure. Visual inspection is still a predominant method for the non-destructive evaluation
of structures. However, this mode of inspection is subjective, tedious, and time-consuming.
Often, a trained inspector is required to climb the scaffolding of tall structures. Vision-based
methods and image processing algorithms assist in automating this inspection procedure
using state-of-the-art sensors such as color and depth cameras, inertial measurement units,
and ultrasonic sensors. The development of low-cost MAVs equipped with a high-resolution
camera and location sensors provides an opportunity to inspect the structure remotely. In
addition, the MAV can perform inspection routines in inaccessible regions of the structures.
Since the MAV can be maneuvered in any direction, a trained inspector can leverage this to
inspect different parts of the structure from various viewing angles.
In this work, an autonomous crack change detection and evolution method is developed.
This work assumes that the camera is not fixed and subjected to small viewing angle changes,
unlike previous studies. In previous works, matching was performed by pairwise comparisons against each image in the database. However, this strategy is quadratic in time complexity and is impractical if an extensive image database of a large-scale structure exists. To
overcome this, this study adopts the kd-tree nearest-neighbor search to localize the previous
database images to the current image. Furthermore, this study developed and applied a robust
methodology for detecting, locating, quantifying, and tracking evolving cracks on concrete
structures. Thus, providing a probabilistic measure of the reliability of the analysis results can
aid the prognosis damage detection models. The primary purpose of this study is to provide
a convenient tool to a trained inspector for comparing the regions of a structure at different
time periods. In order to achieve this, several trial image databases are constructed from
different time periods. Images at the previous time periods are registered and reconstructed
to produce an image that matches the current image and detects the crack change. Thus,
providing a visual means for an inspector to assess the periodic crack growth.
Among image registration methods, feature-based registration is robust to intensity change
and noise in the images. Furthermore, this method can register multiple images well with
the projective and affine motion models to a certain extent. In order to perform the change
detection and evolution of cracks, the previous database will be autonomously searched to
obtain the matching images to the current image. For this purpose, automatic keypoint
locations and feature-based descriptors are extracted. These keypoints are subject to mismatches; thus, a RANSAC-based outlier rejection algorithm is employed to obtain the inlier matching keypoints across the images. Images that match the greatest number of keypoints with
the current image are selected for the reconstruction. These images are further optimized for
better homographies using the bundle adjustment procedure. Next, the optimized images are
corrected for gain or exposure, followed by the multi-band blending to eliminate the seams
or color change artifacts between the images. Lastly, the reconstructed image is cropped to
match the current image. By taking the image difference, the changes in the crack properties
can be measured. Therefore, if the image database is periodically constructed, the evolution
of the crack growth can be monitored, and necessary action can be taken by the operation
and maintenance inspectors.
In this study, a synthetic and real-world crack dataset was constructed to test the efficacy
of the proposed system. The synthetic dataset consists of images from six different scenarios
where the crack entities change in time periods. In each of these six different scenarios,
15 trials of complete coverage images were collected. A physics-based robotic simulation
environment was utilized to acquire the synthetic concrete crack images. In addition, to
assess the proposed method’s capability, real-world datasets of relatively thin, medium, and
thick size cracks were also acquired manually at two different time periods. Overall these two
different datasets were used to facilitate the tracking of the cracks in a scene.
Experimental results show that the CrackDenseLinkNet CNN correlates crack width and
length well against the ground truth of the synthetic crack samples. Furthermore, assessing
the crack segmentation capabilities on the synthetic dataset, crack width and length relative
errors were about 52.42% and 9.45%, respectively. However, when the proposed method was
tested on the synthetic images collected from a MAV and real-world datasets, the change
detection results and semantic segmentation metrics on real-world samples were adequate,
considering that the CrackDenseLinkNet CNN was not trained on any of these datasets. Additionally,
the PDFs obtained from the proposed change detection method match well with the ground-
truth datasets acquired at two different time periods for real-world datasets. Furthermore,
the F1-scores of the real-world datasets show the promising generalization capabilities of
the CrackDenseLinkNet CNN. This shows that the proposed method can register, align the
succeeding images well to the current/reference image, and effectively segment the cracks.
Lastly, the proposed efficient approach for detecting, locating, quantifying, and tracking evolving changes of cracks on concrete surfaces, while simultaneously providing a probabilistic measure of the reliability of the crack physical properties, demonstrates promising outcomes and robustness of the method.
4.7 Future work
Due to the lack of resources for a real-world experimental setup to acquire crack propagation
and change detection datasets, in this study, a physics-based MAV simulation was proposed
to capture the synthetically generated crack images from each scan trial. Furthermore, the
proposed method can be easily adapted to analyze real-world data. Thus, based on the
available resources, a compression test experiment will be performed on a concrete beam
or slab to generate cracks at various time periods by applying a quasi-static loading. As
a result, real-world data will be acquired with similar control and operating procedure on
a real MAV. Furthermore, the real MAV will be equipped with all the sensors used in the
simulation MAV.
In real-world structural inspection, the lighting condition plays a crucial role in obtaining
the valid keypoints for matching and eventually for image stitching. Thus, the effect of various
lighting conditions will be studied. In addition, sometimes the destabilized camera can cause
motion blur. Therefore, the adverse effects of blurring on the accuracy of the localization
of images, crack segmentation, and propagation have to be studied. Also, if branches exist
174
in the cracks, the orthogonal projection method overestimates the crack width at branches.
Therefore, the branches can be separated, their widths found individually, and the results then re-combined, thus reducing the spurious crack measurements at the branches.
In this study, the crack’s physical properties are measured globally throughout the scene.
However, cracks can be labeled and tracked individually or locally based on the main strands
and branches. This is an essential topic for further study.
This study assumes the camera is parallel to the plane of the object. Due to the random
oscillation of the MAV or camera mount, parallax errors will be introduced. Furthermore, in
real-world structural inspection, the wind speed destabilizes the MAV. Therefore, it is of
prime importance to quantify this uncertainty in the localization of the MAV and its effect
on crack propagation. Lastly, it is worthwhile to study the effects of changes in the viewing angle of
the camera and noisy path in crack propagation and quantification. In addition, validating
optimal path plans for complete coverage of a structure is crucial. It is worth noting that
the physics-based simulation is an excellent fit for these uncertainty quantifications in crack
propagation. Performing these experiments on a small to large-scale structure is impractical
and time-consuming. Thus, providing the guidelines based on the above uncertainties using
a physics-based simulation will lay down a good starting point for large-scale structures’
real-world MAV inspection.
Chapter 5
Automated classification of sewer pipe defects using visual bags-of-words in closed-circuit television (CCTV) images
5.1 Introduction
Public sewer mains total between 700,000 and 800,000 miles in the United States [11]. These mains are deteriorating, as they were installed after World War II. Globally, and particularly in the United States, sewerage systems are capital intensive. An estimated $298 billion of capital investment is required to maintain, rehabilitate, and manage the nation's wastewater and stormwater systems over the next twenty years. In addition, centralized treatment systems are prevalent. Burdened with aging pipes and inadequate capacity to handle the sewer flow, they are becoming increasingly deteriorated. This leads to the discharge of an estimated 3.4 × 10^12 liters of untreated sewage each year. Since 2001, the ASCE report card has shown
no significant improvement of wastewater systems over the years, with a current grade of D+
given in 2017, indicating poor or mediocre condition [11].
Underground infrastructure like sewer collection systems fails or gets damaged over time,
apparently without signs of wear. This is due to the lack of periodic maintenance unless there
is a failure. For example, St. Louis has combined sewers (collecting stormwater and sewage)
over 125 years old. The city had 4,000 sewer collapses and an astronomical repair bill in 1981
alone [253]. It is a non-trivial task to maintain, repair, and rehabilitate the sewer system
infrastructure. In fact, of the six most capital intensive infrastructure systems, sewerage
systems are a significant player. The other five are schools, roads, water, storm drainage, and
solid waste disposal [80].
In many cases, simply digging and replacing the old pipe is still the best (or only) option; however, it is usually highly disruptive and expensive. Furthermore, disrupted sewer mains in traffic can cause much turmoil among users. Thus, there is a need and a strong incentive to find alternative rehabilitation technologies that make use of existing pipes and that provide solutions with minimum disruption while maintaining cost-effectiveness [212]. Due to the sometimes unreliable information obtained, conventional condition assessment approaches and the interpretation of sewer systems have been inadequate [256].
Ninety-five percent of the sewer systems are too small for effective manual inspection
[252]. Hence, until the appearance of Closed-Circuit TV (CCTV) sewer inspections in the
1960s, the condition of most of the sewer systems was little known. The use of CCTV and
other modern inspection techniques brought to light the poor state of the system [212]. The
quality of information obtained through a traditional CCTV inspection depends on the technician's skill and subjective judgments on the defects or Regions of Interest (ROIs). These subjective
judgments are dependent on various factors like experience, mental stress, the capability of the
equipment, distractions in the field, environmental factors, and others. Thus, the information
quality of the traditional CCTV is inadequate [256]. Other sewer pipe condition assessment
techniques are infrared thermography, sonic distance measurement, ground-penetrating radar,
and advanced systems like KARO, PIRAT, and SSET [256].
5.1.1 Review of the literature
In a Closed-Circuit-Television (CCTV)-based monitoring system, Fieguth and Sinha [75]
used digital image processing to detect the cracks in underground pipes through statistical
parameters such as sample mean, variance, and cross-correlation with the crack pixels and
their surrounding window. Although this method can extract features like cracks and, to
some extent, joint openings of the buried pipe, it failed to minimize false positives. Using
a warped tunnel vision image from a tool called Sewer Scanner and Evaluation Technology (SSET), [110, 228, 112] presented a method that segments the cracks (eliminating joints, holes, and laterals) from the buried pipeline images. This is done by using statistical filters for
crack detection followed by cleaning and linking operations. Similarly, Iyer and Sinha [111]
worked on multiple crack detection algorithms based on contrast enhancement, morphology,
non-linear filtering, and curvature evaluation of crack patterns. It is shown that morphological
characteristics such as branches, thickness, and orientation of cracks can be exploited for
their autonomous recognition. Furthermore, with the concept of Bayesian classification,
mathematical morphological processes, and segmentation by thresholding, Sinha and Fieguth
[227] studied the feature extraction methods for multiple cracks and joints in the sewer
pipeline.
Yang and Su [270] proposed a method to use the wavelet-based and co-occurrence
matrices for describing the texture of the sewer pipe images and classifying them using
neural networks and Support Vector Machine (SVM). In addition, they used the unsupervised
clustering method to train the radial basis network (RBN). However, they only achieved a 60%
classification accuracy. Yang and Su [271] proposed a method to classify sewer pipe defects
such as cracks, fractures, open joints, and broken pipes by using mathematical morphological
operators and compared the obtained shapes to ideal ones. Their accuracies were good, but they tested on only a small number of samples. In addition, sewer pipes have diverse defects, and finding
an ideal template to match the defects is not practical. Guo et al. [98] proposed a method
to classify defects based on using SIFT feature points on consecutive images by assuming
that feature matches always exist. Their results show 92% accuracy. However, in many
situations, the matches are erroneous and may not sufficiently lead to accurate detection
every time. For example, a crack with a flat intensity background could be left undetected
if a detector only is used. Guo et al. [99] proposed a method to automatically identify the
region of interest in pipe inspection videos, using a change detection-based approach. They
used image differencing to detect the changes by using a reference image of a healthy section.
Based on the output of the segmented region, a classifier classifies the region as defect or
non-defect. However, this works if the camera moves straight (fixed angle) and the sewer
pipe surfaces are clean. Otherwise, this method could lead to many false positives.
Kumar et al. [143] presented a framework that uses convolutional neural networks (CNNs)
to classify multiple defects in sewer CCTV images. Specifically, an ensemble of binary CNNs
was used to classify each image as defective (positive) or non-defective (negative) samples. In
their work, 12,000 image samples of 8 classes are used to train and test the CNN. Results
show that CNNs are promising for classification problems. However, the dataset needs to
be augmented to minimize overfitting, and CNNs are impractical for small to medium-sized
datasets. Also, it is worthwhile to note that the number of classes in sewer assessment
videos (from starting manhole to the ending manhole) is usually more than eight. Thus, the
broader labeled dataset may be required to classify the sizeable archived dataset. Cheng
and Wang [47] implemented and used a Zeiler-Fergus type of convolution neural network
for the object-detection of four sewer pipe defects. Their work examined the three different
criteria: the influence of dataset size, types of network architecture, and hyperparameters of
the network on the accuracy and computational cost. However, they ignored the effects of other types of sophisticated convolutional networks and other defect types.
Li et al. [152] adapted a residual deep convolutional neural network of 18 layers for the
classification of normal and defective sewer pipe images. A hierarchical deep learning model
was used to handle the imbalanced defective images. Furthermore, a high-level binary residual
network classifier was used to classify the normal and defective images. However, they
obtained a mediocre accuracy of 64.8%. By today's deep learning standards, this accuracy is well below an acceptable score. Similarly, Xie et al. [262] proposed a hierarchical deep learning
model for sewer pipeline defect classification. They used 40,000 images with augmentation.
Their dataset was challenging for classification and obtained an average F1 score of 84.86%
among all the classes. Zhou et al. [291] proposed a two-layer convolutional neural network
and compared it against the transfer-learned SqueezeNet model. There were 7,200 images after the augmentation. Their proposed method had an accuracy of 91%, compared to 95% for the SqueezeNet model.
Yin et al. [281] employed the YOLOv3 deep learning architecture for sewer pipe defect detection using bounding boxes. Their method used a residual network as the backbone. Their
dataset contained 3664 images with 4056 unique defects retrieved from 63 CCTV videos
and obtained a mean Average Precision (mAP) score of 85.37%. Similarly, Tan et al. [236]
proposed an improved YOLOv3 deep learning architecture where the loss function, data
augmentation, bounding box prediction, and network structure were modified and improved.
Their dataset consisted of 3000 images after augmentation, and they achieved a mAP of 92%.
Pan et al. [194] adapted a U-Net deep learning architecture for semantic segmentation of
the sewer pipe defects. They used the focal loss to mitigate the class imbalance problem of
defective and non-defective pixels. Their dataset had 1472 normal images and 2182 images
with defects and achieved a mean Intersection-Over-Union of 76.37%. Lastly, Li et al. [151]
proposed a deep learning-based object detection network for sewer pipe defects. In their method, region features and global contextual features extracted from the image were concatenated in a two-stage object detection network. Their dataset consists of 10,000 images from different sewer
pipes with different diameters and time frames. Their method achieved a mAP of 49.9%.
5.1.2 Contribution
Prior to the year 2018, most of the above methods relied on either template-based or
hand-engineered features. This study has evaluated a method called “visual bag-of-words”,
which was well-received by the object classification community in computer vision before
the popularity of the deep learning (convolutional neural networks) methods in the civil
engineering community. Unlike the prior method that used SIFT, this method uses a dense grid for object classification problems. Since SIFT and SURF detectors are invariant to the intensity changes encountered in sewer pipe CCTV images, both have been studied and compared in this study. Results show that even with a moderate-sized dataset, the results are comparable to the deep learning models.
5.1.3 Scope
Section 5.2 describes the two point-based feature detectors that are invariant to scale and
rotations. Section 5.3 gives a brief overview of the visual bag-of-words method. Next, Section 5.4 describes the experimental design and evaluation results of the two descriptors. Lastly, Sections 5.5 and 5.6 conclude the current work and present future extensions.
5.2 Point-based feature detectors
Before the era of deep learning, point-based features were extensively used for object classifi-
cation and recognition problems [169]. They require a set of images that have an unoccluded
region to extract the feature or keypoints, then descriptor matching is performed to classify
the objects. In this section, two of the most widely used keypoint descriptors will be discussed.
5.2.1 Scale-invariant feature transform (SIFT)
5.2.1.1 Scale-space extrema
A SIFT is a scale-invariant descriptor that uses a cascade filtering approach to identify
keypoint or candidate locations [169], using a scale-space kernel in the Gaussian function
[162]. The scale-space of an image is defined as,
\[ L(x, y, \sigma) = G(x, y, \sigma) \ast I(x, y), \tag{5.1} \]
where \(\ast\) is a convolution operation in x and y, \(I(x, y)\) is an input image, and
\[ G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\!\left[-(x^{2} + y^{2})/(2\sigma^{2})\right]. \]
Figure 5.1: An incremental difference-of-Gaussian. The left image shows the octaves at each scale, and the right image shows the DOG. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process is repeated.
To detect stable keypoint locations in scale-space, Lowe et al. [170] proposed using
scale-space extrema in the difference-of-Gaussian (DOG) function convolved with the image.
This function, \(D(x, y, \sigma)\), can be computed by taking the difference of two nearby scales separated by a constant multiplicative factor k, and is given by,
\[ D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) \ast I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma). \tag{5.2} \]
Also, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, \(\sigma^{2} \nabla^{2} G\), since
\[ \sigma \nabla^{2} G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}. \tag{5.3} \]
Therefore, in short notation,
\[ G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\, \sigma^{2} \nabla^{2} G. \tag{5.4} \]
Since interest points need to be found at different scales, scale-spaces are usually im-
plemented as an image pyramid. The scale space is analyzed by up-scaling the filter size
rather than iteratively reducing the image size. The scale-space is divided into octaves. An
octave represents a series of filter response maps obtained by convolving the same input
image with increasing size filters. In total, an octave encompasses a scaling factor of 2. This
is done to double the filter size to capture the effect of different scales. The initial image is
incrementally convolved with Gaussians to produce images separated by a constant factor k in scale-space, as shown in Figure 5.1. Each octave in scale-space (i.e., a doubling of σ) is divided into an integer number of intervals, s, so k = 2^{1/s}. Adjacent image scales are subtracted to produce the DOG images, as shown on the right of Figure 5.1. Once the octave is computed, the image is down-sampled by a factor of 2 by taking every second pixel in each row and column to reduce the
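A minimal sketch of one DOG octave is given below; the interval count s = 3 and the base scale of 1.6 follow Lowe's commonly cited defaults rather than values taken from this study, and the input file name is a placeholder.

% Sketch of one difference-of-Gaussian (DOG) octave.
I = im2double(im2gray(imread('pipe.png')));   % hypothetical input image
s = 3;  k = 2^(1/s);  sigma0 = 1.6;

gauss = cell(1, s + 3);
for i = 1:s + 3
    gauss{i} = imgaussfilt(I, sigma0 * k^(i - 1));   % Gaussian at scale sigma0*k^(i-1)
end
dog = cell(1, s + 2);
for i = 1:s + 2
    dog{i} = gauss{i + 1} - gauss{i};                % adjacent-scale difference
end
Inext = gauss{s + 1}(1:2:end, 1:2:end);              % twice the base scale, down-sampled for the next octave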
5.2.1.2 Detection of local extrema
To detect the local maxima and minima of D(x, y, σ), each sample point is compared with the
8-pixel neighborhood in the current image and nine neighbors in the scale above and below
(see Figure 5.2). The keypoint is selected only if it is larger than all of these neighbors or
smaller than them.
Figure 5.2: Maxima and minima of the DOG as compared to 26 pixels.
5.2.1.3 Keypoint localization
After the keypoint has been found by comparing it to its 26 neighboring pixels, the location,
scale, and ratio of principal curvatures are found to reject low-contrast pixels or pixels that are poorly localized along the edges. To accomplish this, the scale-space function, D(x, y, σ), is expanded as a Taylor series shifted so that the origin is at the sample point:
\[ D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}} \mathbf{x} + \frac{1}{2} \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \mathbf{x}, \tag{5.5} \]
where D and its derivatives are evaluated at the sample point and \(\mathbf{x} = (x, y, \sigma)^{T}\) is the offset from this point. By taking the derivative of Equation (5.5) with respect to \(\mathbf{x}\) and setting it to zero, the extremum can be found, given as,
\[ \hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}. \tag{5.6} \]
The Hessian and derivative of D are approximated by using differences of neighboring sample points, which results in a 3 × 3 linear system.
Now, substituting Equation (5.6) in Equation (5.5),
\[ D(\hat{\mathbf{x}}) = D + \frac{1}{2} \frac{\partial D^{T}}{\partial \mathbf{x}} \hat{\mathbf{x}}. \tag{5.7} \]
Equation (5.7) is useful for rejecting unstable extrema with low contrast.
A poorly defined DOG peak will have a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be computed from a 2 × 2 Hessian matrix, H, computed at the location and scale of the keypoint, given by,
\[ \mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix}. \tag{5.8} \]
Derivatives of H are estimated by the central difference method from the neighboring sample points. The eigenvalues of H are proportional to the principal curvatures of D. Let \(\lambda_{1}\) and \(\lambda_{2}\) be the eigenvalues with the larger and smaller values, respectively; then
\[ \mathrm{Tr}(\mathbf{H}) = D_{xx} + D_{yy} = \lambda_{1} + \lambda_{2}, \tag{5.9} \]
\[ \mathrm{Det}(\mathbf{H}) = D_{xx} D_{yy} - (D_{xy})^{2} = \lambda_{1} \lambda_{2}. \tag{5.10} \]
Let r be the ratio between the largest-magnitude eigenvalue and the smaller one, so \(\lambda_{1} = r\lambda_{2}\); then
\[ \frac{\mathrm{Tr}(\mathbf{H})^{2}}{\mathrm{Det}(\mathbf{H})} = \frac{(\lambda_{1} + \lambda_{2})^{2}}{\lambda_{1}\lambda_{2}} = \frac{(r\lambda_{2} + \lambda_{2})^{2}}{r\lambda_{2}^{2}} = \frac{(r + 1)^{2}}{r}. \tag{5.11} \]
Equation (5.11) depends on the ratio of the eigenvalues only. The quantity \((r + 1)^{2}/r\) is at a minimum when the two eigenvalues are equal, and it increases with r. Therefore, only the ratio of the principal curvatures needs to be checked, i.e.,
\[ \frac{\mathrm{Tr}(\mathbf{H})^{2}}{\mathrm{Det}(\mathbf{H})} < \frac{(r + 1)^{2}}{r}. \tag{5.12} \]
Further details about the analytical expressions above are available in [170].
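As a small illustration of the edge-response test in Equation (5.12), the sketch below evaluates the check at a single sample of a hypothetical DOG layer D, using Lowe's commonly cited ratio r = 10 (not necessarily the value used in this study).

% Sketch of the edge-response (principal curvature ratio) check of Equation (5.12).
I = rand(120);                                            % placeholder image
D = imgaussfilt(I, 1.6 * 2^(1/3)) - imgaussfilt(I, 1.6);  % one hypothetical DOG layer
r = 10;                                                   % assumed curvature ratio
x = 60; y = 60;                                           % hypothetical keypoint location

Dxx = D(y, x+1) - 2*D(y, x) + D(y, x-1);                  % central-difference derivatives
Dyy = D(y+1, x) - 2*D(y, x) + D(y-1, x);
Dxy = (D(y+1, x+1) - D(y+1, x-1) - D(y-1, x+1) + D(y-1, x-1)) / 4;

trH  = Dxx + Dyy;
detH = Dxx*Dyy - Dxy^2;
rejectAsEdge = (detH <= 0) || (trH^2 / detH >= (r + 1)^2 / r);  % reject keypoint if true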
5.2.1.4 Orientation assignment
The keypoint’s scale is used to select the Gaussian smoothed image, L. To ensure scale-
invariance, all computations are performed with the closest scale. For each image sample,
L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), are pre-computed using pixel differences, given below:
\[ m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^{2} + \big(L(x, y+1) - L(x, y-1)\big)^{2}}, \tag{5.13} \]
\[ \theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right). \tag{5.14} \]
The histogram of orientation is formed from the gradient orientations of sample points
within the region of the keypoint. It has 36 bins covering the 360° range of orientations. Samples
are weighted by their gradient magnitude and by a Gaussian-weighted circular window with a
σ of 1.5 times that of the scale of the keypoint. Peaks in the orientation histogram correspond
to dominant directions of local gradients. First, the highest peak in the histogram is detected,
and other local peaks within 80% are also used to create keypoints at the same location and
scale, but with different orientations. Then, a parabola is fitted to the top-3 histogram values
closest to the peak position for better accuracy.
5.2.1.5 Keypoint descriptor
The image gradient magnitudes and orientations are sampled around a keypoint location.
For rotation invariance, the descriptor coordinates and gradient orientations are rotated to
the orientation of the keypoint. A 2 × 2 descriptor has eight orientations and is sampled from an 8 × 8 sample array. The keypoint descriptor vector used in SIFT is of length 128 × 1. The full descriptor is 4 × 4, spanned over a 16 × 16 sample array. These feature descriptors can extract
meaningful information from the image or set of images for object recognition problems.
Figure 5.3: A SIFT keypoint descriptor (image gradients on the left, keypoint descriptor on the right). Here, for illustrative purposes, it is a 2 × 2 descriptor on the right, which is sampled from an 8 × 8 sample array.
5.2.2 Speeded up robust features (SURF)
SURF is a feature detector and descriptor that uses the Hessian matrix approximation to
detect the keypoints [17]. It uses integral images for the fast computation of box-type convolution filters. An integral image, \(I_{\Sigma}(\mathbf{x})\), is given by,
\[ I_{\Sigma}(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j), \tag{5.15} \]
where \(I_{\Sigma}(\mathbf{x})\) at location \(\mathbf{x} = (x, y)^{T}\) represents the sum of all pixels in the input image I within a rectangular region formed by the origin and \(\mathbf{x}\).
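The integral image in Equation (5.15) can be formed with two cumulative sums, after which any box-filter sum needs only four look-ups; a brief sketch (hypothetical file name) follows.

% Sketch of an integral image and a box-filter sum computed from it.
I    = im2double(im2gray(imread('pipe.png')));   % hypothetical input image
Isum = cumsum(cumsum(I, 1), 2);                  % integral image I_Sigma

% Sum of intensities inside the rectangle with corners (r1,c1)-(r2,c2),
% obtained with four look-ups (zero-padding handles the image border).
P  = padarray(Isum, [1 1], 0, 'pre');
r1 = 10; c1 = 15; r2 = 40; c2 = 55;              % hypothetical box
boxSum = P(r2+1, c2+1) - P(r1, c2+1) - P(r2+1, c1) + P(r1, c1);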
5.2.2.1 Hessian matrix-based interest points
The Hessian matrix is used to detect blob-like structures at locations where the determinant
is maximum. Given a point \(\mathbf{x} = (x, y)\) in an image I, the Hessian matrix \(\mathbf{H}\) is given by,
\[ \mathbf{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}, \tag{5.16} \]
where \(L_{xx}(\mathbf{x}, \sigma)\) is the convolution of the Gaussian second-order derivative \(\frac{\partial^{2}}{\partial x^{2}} G(\sigma)\) with the image I at point \(\mathbf{x}\), and similarly for \(L_{xy}(\mathbf{x}, \sigma)\) and \(L_{yy}(\mathbf{x}, \sigma)\). Further details are available in [17].
5.2.2.2 Scale space representation
Similar to the SIFT algorithm, the output of the 9 × 9 filter is considered the initial scale layer, which is referred to as scale s = 1.2 (approximating the Gaussian derivatives with σ = 1.2). Then, the pyramid of layers is obtained by filtering the image with gradually increasing mask sizes of 15 × 15 and 21 × 21, using the discrete nature of integral images and the specific structure of the filters. This is done for computational efficiency.
5.2.2.3 Interest point localization
In order to localize interest points in the image and over scales, a non-maximum suppression
in a 3 × 3 × 3 neighborhood is applied. The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space. This is done because the difference in scale between the first layers of every octave is relatively large.
5.2.2.4 Orientation assignment
A reproducible orientation for the interest points is required to be invariant to image rotation.
For this, Haar wavelet responses in the x and y directions are computed within a circular neighborhood of radius 6s around the interest point, where s is the scale. The dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window of size π/3. The horizontal and
vertical responses within the window are summed. The two summed responses then yield a
local orientation vector. The orientation of the largest vector over all the windows defines
the orientation.
5.2.2.5 Point descriptor
The descriptor describes the distribution of the intensity content within the interest point
neighborhood, similar to the gradient information extracted by SIFT [169]. SURF uses the distribution of first-order Haar wavelet responses in the x and y directions rather than the gradient. It uses integral images for speed, and the dimension of the feature descriptor is only 64 × 1. This reduces the time for feature computation and matching, and makes the descriptor more robust.
For the extraction of the descriptor, the first step consists of constructing a square region
centered around the interest point and oriented along the orientation obtained from the last
subsection. The size of this window is 20s. The region is split up regularly into smaller 4 × 4 square sub-regions to preserve the essential spatial information. Haar wavelet responses were computed for each sub-region at 5 × 5 regularly spaced sample points. Here, d_x and d_y are the Haar wavelet responses in the horizontal and vertical directions, respectively.
The wavelet responses, d_x and d_y, are summed over each sub-region and form a first set of entries in the feature vector. Also, to account for the variations, the absolute values of the responses, |d_x| and |d_y|, are considered. Hence, each sub-region has a 4 × 1 dimensional descriptor vector v, where \( \mathbf{v} = \left( \sum d_{x}, \sum d_{y}, \sum |d_{x}|, \sum |d_{y}| \right)^{T} \). Concatenating this for all 4 × 4 sub-regions results in a descriptor vector of length 64 × 1. It is mentioned that the wavelet responses are invariant to a bias in illumination, and invariance to contrast is achieved by normalizing the descriptor into a unit vector.
Bay et al. [17] mention that SURF is similar in concept to SIFT, as both focus
on the spatial distribution of gradient information. Also, SURF outperforms SIFT because
SURF integrates the gradient information within a sub-patch, whereas SIFT depends on the
orientations of the individual gradients. This makes SURF less sensitive to noise variations.
5.3 Visual bag-of-words
Visual bag-of-words is a method that was inspired by the natural language processing algorithm
called “bag-of-words”. In “bag-of-words”, to classify the text document, the frequency of
the words is counted. For example, there exists a vocabulary in English, and the histogram
of the word counts can signal some information about the document. However, an image
hardly has a pre-determined vocabulary to classify or extract meaningful information about
it. Csurka et al. [54] proposed a method to classify the images based on the codebook, or
visual vocabulary. They used interest point detectors such as SIFT or SURF descriptors to
create a feature vector in a higher-dimensional space. Moreover, the clustering algorithm
k-means was used to vector quantize and to create the visual vocabulary.
Figure 5.4: A pictorial presentation of visual bags-of-words (feature extraction, k-means clustering into a visual-word vocabulary, and visual-word vectors).
Figure 5.4 shows the overview of the visual bags-of-words. The first step is feature
extraction of a large corpus of sewer pipe images. The green ellipses on the top left denote
local feature regions. Furthermore, the blue, orange, and purple dots are points in some
feature space, R^n, using the descriptors mentioned earlier. Once the features are obtained
by non-dense interest points [54] or by dense grid points [67], a simple k-means clustering
[62] was used for the quantization and given by Equation (5.17); the size of the vocabulary k
is a parameter that defines the classification outcome.
The objective is to minimize the sum of squared Euclidean distances between points \(x_{i}\) and their nearest cluster center means, \(m_{k}\),
\[ D(X, M) = \sum_{k=1 \ldots K} \; \sum_{i=1 \ldots N} (x_{i} - m_{k})^{2}, \tag{5.17} \]
where K is the total number of clusters or vocabulary size, and N is the number of points.
Choosing the right vocabulary size involves the trade-off between discriminability and
generalizability. Usually, with a small vocabulary, the representation is not very discriminative, since dissimilar keypoints can map to the same visual word. As the vocabulary size increases, the feature becomes more discriminative; however, since similar keypoints can then map to different visual words, it also becomes less generalizable and more susceptible to noise [269].
Given a new image, the nearest visual word is identified for each of its features. This
maps the image from a high-dimensional descriptor space, R^n, to a list of word numbers. A
visual bag-of-words histogram can be used to summarize the entire image. It counts how
many times each of the visual words occurs in the image. Later, this new feature vector can
be used for classification purposes.
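A hedged sketch of this vocabulary construction and histogram encoding is given below; the placeholder training descriptors, the vocabulary size K = 500, and the image file name are illustrative assumptions rather than the study's exact settings. MATLAB's bagOfFeatures object provides a comparable grid-based pipeline; the sketch simply makes the quantization step explicit.

% Hedged sketch: k-means visual vocabulary and bag-of-words histogram encoding.
allDescriptors = rand(5000, 64);          % placeholder for SURF descriptors pooled
                                          % from the whole training corpus
K = 500;                                  % vocabulary (dictionary) size
[~, vocab] = kmeans(allDescriptors, K, 'MaxIter', 200);

% Encode one image: assign each of its descriptors to the nearest visual word
% and count word occurrences (normalized to sum to one).
img  = im2gray(imread('pipe.png'));       % hypothetical query image
pts  = detectSURFFeatures(img);
desc = extractFeatures(img, pts);
wordIdx = knnsearch(vocab, double(desc)); % nearest cluster center per descriptor
bowHist = histcounts(wordIdx, 0.5:1:K + 0.5);    % K-bin visual-word histogram
bowHist = bowHist / max(sum(bowHist), 1);        % feature vector for a classifier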
5.3.1 Pre-processing and feature extraction locations
For pre-processing, the images are normalized by the per-channel normalization to have the
data centered at the mean. The result is given by:
\[ I(u, v, R)_{new} = \frac{I(u, v, R) - m_{R}}{\sigma_{R}}, \tag{5.18} \]
\[ I(u, v, G)_{new} = \frac{I(u, v, G) - m_{G}}{\sigma_{G}}, \tag{5.19} \]
\[ I(u, v, B)_{new} = \frac{I(u, v, B) - m_{B}}{\sigma_{B}}, \tag{5.20} \]
where \(I(u, v, R)_{new}, \ldots, I(u, v, B)_{new}\) are the new normalized images in the red, green, and blue channels at each pixel location (u, v); these are later concatenated to make a color image. \(I(u, v, R), \ldots, I(u, v, B)\) are the original images in the red, green, and blue channels. The means over all the images are \(m_{R}, \ldots, m_{B}\), respectively. Lastly, \(\sigma_{R}, \ldots, \sigma_{B}\) are the standard deviations of all images in the red, green, and blue channels, respectively.
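A brief sketch of this per-channel normalization (with hypothetical channel statistics and file name) is shown below.

% Sketch of the per-channel normalization of Equations (5.18)-(5.20).
I = im2double(imread('pipe.png'));                % RGB image, values in [0, 1]
mR = 0.45; mG = 0.43; mB = 0.41;                  % hypothetical channel means
sR = 0.22; sG = 0.21; sB = 0.20;                  % hypothetical channel std. deviations

Inew = cat(3, (I(:, :, 1) - mR) ./ sR, ...
              (I(:, :, 2) - mG) ./ sG, ...
              (I(:, :, 3) - mB) ./ sB);           % normalized color image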
Figure 5.5: An image showing the feature point locations. (a). Conventional SURF detector
keypoints. The green circle denotes the scale σ, and the line indicates the feature orientation.
(b). Dense features (SIFT/SURF) are extracted at the grid (square) points.
For this study, both non-dense and dense grid points of SIFT and SURF features are used.
Figure 5.5 shows the non-dense conventional detector points (green circles) and the grid-based detector locations. The green circle denotes the scale σ, and the line inside the green circles
indicates the feature orientation.
5.4 Experimental results and discussion
Figure 5.6 shows the background of the sewer pipe images, which demonstrates the high variation of the intensity within the same class. More than 95% of the images used for the analysis were of vitrified clay pipe (VCP) material with an 8-inch diameter. A few of them have a concrete liner material after their rehabilitation. Since both SIFT and SURF descriptors are invariant to scale, rotation, and image intensity, they are used in this study.
Figure 5.6: Challenges in sewer pipe assessment. Same-class images with highly diverse intensity distributions.
5.4.1 Sewer pipe dataset
Figures 5.7 and 5.8 show the non-defective and defective samples used in this study. For the non-defective classes, there are pipe joints with laterals, pipe joints only, laterals, background, intersection pipes, and manholes. In contrast, the defective samples are classified into two classes, namely structural and operational defects. In the structural defects, the pipe joint failures and cracks are considered. The operational defects are classified as debris, encrustation, blisters, and dark spots. The NASSCO (National Association of Sewer Service Companies) Pipeline Assessment Certification Program (PACP) specifies all the codes (defect/non-defect classes).
Figure 5.7: Samples of various non-defective image classes that are trained and tested.
Figure 5.8: Operational and structural defect samples of various image classes that are trained and tested.
The size of each image is 352 × 240 pixels. The city name, material type, pipe dimension,
timestamps, and location are overlaid on each image. These details are preserved, and image
inpainting was not used to remove them due to the loss of information.
5.4.2 Training, validation and testing dataset
There are 14 classes of images, including non-defective and defective. The class distribution of
the images is given in Figure 5.9. There are about 900 images for each training class, except for dark spots, pipe joint close-ups, defective pipe joints, and intersection pipe images. Similarly, the validation set contains 100 images in each class, except for the four classes mentioned above. This makes up 14,404 images in total. Lastly, the testing set consists of 200 image samples per class, except for a few classes. The vocabulary size is one of the important parameters that was studied
here. Vocabulary size was varied from 50 to 8000. Three classifiers were considered for the
classification tasks: firstly, an artificial neural network with 80 and 20 neurons in three layers for vocabulary sizes greater than and less than 1000, respectively; secondly, a k-nearest neighbor classifier with five nearest neighbors; and lastly, an SVM with a radial basis kernel, as the dataset was not linearly separable. The descriptor extraction time for the smaller vocabulary sizes, from 50 to 1000, was less than 2-3 hours. For the larger sizes, feature extraction took more than 5 hours. In addition, the artificial neural network trained about four times faster than the other classifiers, and the SVM was the slowest.
Figure 5.9: Training, testing and validation samples distribution.
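The following is a hedged sketch of training the three classifiers described above on the bag-of-words histograms, assuming a hypothetical feature matrix X and labels y; the layer sizes reflect one plausible reading of the "80 and 20 neurons in three layers" description, and the calls are standard MATLAB Statistics and Deep Learning Toolbox routines rather than the study's exact code.

% Hedged sketch: the three classifiers trained on bag-of-words histograms.
X = rand(1200, 500);                              % hypothetical N x K histograms
y = randi(14, 1200, 1);                           % hypothetical labels for 14 classes

netSmall = patternnet([20 20 20]);                % ANN variant for vocabularies < 1000
netLarge = patternnet([80 80 80]);                % ANN variant for vocabularies >= 1000
netLarge = train(netLarge, X', full(ind2vec(double(y)')));   % inputs as columns, one-hot targets

knnModel = fitcknn(X, y, 'NumNeighbors', 5);      % k-nearest neighbor (k = 5)
svmModel = fitcecoc(X, y, 'Learners', ...         % multi-class SVM with RBF kernel
    templateSVM('KernelFunction', 'rbf', 'Standardize', true));

yPred = predict(svmModel, X);                     % example prediction call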
5.4.3 Classification results
Figure 5.10 shows the precision and recall over vocabulary (dictionary) sizes of 50 to 8000.
Figure 5.10: Classification accuracy comparison of detector- and grid-based feature points over dictionary sizes of 50 to 8000. (a). Precision. (b). Recall.
Both precision and recall behaved similarly for vocabulary sizes of 50 to 8000. The SURF-based sparse
detector performed poorly because most of the keypoints were localized towards the watermark
of the images. In contrast, SIFT-based detectors did not suffer from these issues, as the
sparse keypoints were spread uniformly throughout the images. Also, it was experimentally
found that the vocabulary sizes of 50 to 500 suffered badly as the keypoints were not coherent
and spread over portions of the images with sharp intensity extrema.
Overall, the SURF grid-based descriptor performed well, with an F1 score of 78.68%. The grid-based SURF outperforms the others because it is less sensitive to noise, since SURF integrates the gradient information within a sub-patch. In contrast, SIFT depends on the orientations of the individual gradients. A summary of the best results from Figure 5.10 is provided in Table 5.1.
Detector type  | Best classifier | Average precision | Average recall | F1-score | Time (seconds) per image
SIFT non-dense | SVM             | 0.7236            | 0.6623         | 0.6916   | 1.78
SURF non-dense | ANN             | 0.7039            | 0.6803         | 0.6919   | 1.14
SIFT dense     | ANN             | 0.6771            | 0.6589         | 0.6679   | 2.86
SURF dense     | ANN             | 0.7911            | 0.7826         | 0.7868   | 1.21

Table 5.1: Summary of the best visual bags-of-words results.
5.5 Summary and conclusions
Public sewer pipelines total between 700,000 and 800,000 miles in the United States. These
pipelines are deteriorating and require high capital for repair and maintenance. Periodic
maintenance by visual inspection is still widely used in practice. The trained inspectors are
assisted by sophisticated sensors such as CCTV cameras, and wheel encoders mounted on
mobile robots. However, visual inspection is labor-intensive and subjective to the expertise
of the inspector. The development of autonomous systems for the condition assessment of
sewer pipelines is necessary to speed up the inspection process and to cut the operation and
maintenance costs.
This work developed an autonomous supervised classification system to classify the
non-defective and defective images obtained from the sewer pipeline inspection. First, the
proposedmethodevaluatedtwowidelyusedfeaturedescriptors, SIFTandSURF,forobtaining
meaningful information from the sewer pipeline images. Later, the feature descriptors were
used to construct the visual bags-of-words model by quantizing the visual information at
feature points or grid locations.
The moderate-sized dataset consists of 14,404 images across 14 classes of non-defective and defective examples. More than 95% of the pipe material is vitrified clay pipe (VCP) with an 8-inch diameter, and the rest is concrete liner material after rehabilitation.
There are 900, 100, and 200 samples for training, validation, and testing, respectively, in all
classes. For supervised learning, three classifiers were trained and tested for classification
purposes, namely an artificial neural network, a k-nearest neighbor classifier, and an SVM. The
vocabulary sizes of 50 to 1000 took around 2-3 hours to extract the feature descriptors, and for vocabulary sizes greater than 1000, it took more than 5 hours. In addition, the training time for the artificial neural network was the fastest, and the SVM was the slowest.
The classification results, in terms of precision and recall trends, were similar for vocabulary (dictionary) sizes of 50 to 8000. The SURF-based sparse detector performed poorly due to the clustering
of the keypoints around the text in the images. In comparison, SIFT sparse keypoints were
spread uniformly throughout the images. In addition, it was found that the vocabulary sizes
of 50 to 500 suffered severely as the keypoints were spread over portions of the images with
sharp intensity extrema. Grid-based detectors performed well as the descriptors information
was obtained throughout the images. Among all the detector-classifier combinations, the SURF-based dense grid
descriptor performed well with an average F1 score of 78.68% across all the classes. In general,
SIFT sparse detector keypoints are denser than SURF. However, SIFT feature extraction is
computationally expensive.
In conclusion, it was found that even a relatively small dataset by today's standards provides decent results, comparable to the state-of-the-art methods reviewed in the literature section. One of the significant drawbacks is that datasets are not made public
in the above publications due to privacy concerns. Thus, accurate comparison to the deep
learning methods was not possible. The datasets used in this study and the ongoing work
might be made public depending on the privacy permission from the collaborators, Pro-Pipe
Company.
5.6 Future work
Recently, many other researchers have used convolutional neural networks for the classification and object detection of sewer pipe CCTV images. It was shown that for larger datasets and more classes, going deeper improves the classification accuracy [226, 104, 107]. As an ongoing
work, a single-input-multiple-output deep learning-based architecture is under development
for the video frame classification of the pipe materials, PACP codes (pipe defects), PACP
condition grading, and water-level estimation. A larger dataset consisting of 1350 videos
with more than 130 defective classes and condition grades varying from 1-5 was received
from Pro-Pipe Company, an LADWP consultant. At the current state, data analytics and data preprocessing have been completed. In addition, a Graphical User Interface (GUI)-based data labeling tool called “Framer” has been developed. Furthermore, data extraction and augmentation work are in progress. These will be followed by the implementation of deep
learning-based methods to accomplish the tasks mentioned above.
Chapter 6
An approach for change detection, localization and evolution of defects in mechanical systems by three-dimensional point cloud registration
6.1 Introduction
Structural health monitoring of civil infrastructure such as buildings, bridges, dams, and
pipeline systems using smart sensors (color and 3D cameras, mobile robots, and unmanned
aerial vehicles, to name a few) has gained increasing attention over the last two decades
[231]. Sensor data quality has improved over the last decade, and cost has
decreased drastically. With the advances in 2D color cameras and 3D scanning technologies,
the ability to obtain meaningful information has improved. In addition, the computational power
available to process extensive data has grown exponentially. Planar scanning of infrastructure using
color cameras is common in vision-based inspections. However, in low-light conditions, such cameras
suffer from poor image contrast. To overcome this, three-dimensional (3D) scanning of
infrastructure surfaces has seen a large expansion, since it adds extra-dimensional
information for structural inspection and condition assessment. Three-dimensional sensors
such as Light Detection and Ranging (LiDAR) scanners and 3D cameras (based on
structured-light and time-of-flight principles) can work even in poor lighting conditions. In
addition, the photogrammetry technique known as Structure-from-Motion (SFM) creates a
highly dense point cloud with millimeter accuracy, at the expense of computational cost. Dense
3D point clouds produced by these sensors and photogrammetry techniques have significant
potential for condition assessment and nondestructive evaluation of infrastructure. Similar
strategies are utilized in the condition assessment of mechanical systems [65, 259, 18].
Defect detection falls under the large umbrella of the change detection problem. For decades,
change detection techniques have been developed in Remote Sensing (RS) and Geographic
Information Systems (GIS). Applications of RS and GIS using Terrestrial Laser Scanning
(TLS) and Digital Elevation Models (DEM), which provide 3D positions of ground objects
for change detection, are extensively reviewed in [202]. In addition, change detection methods
are leveraged to localize and then quantify changes in construction sites and in street-object
modeling using mobile mapping systems [90, 260, 261, 201]. Geometric temporal change
segmentation of 3D objects in point clouds for environmental monitoring and spatial
reasoning is of primary importance in classification and quantification problems
[193]. Detecting structural changes in a city environment using images captured from
vehicle-mounted cameras at different traversal times is gaining popularity; these
images are converted to 3D point clouds using SFM, and a deep learning-based method
is leveraged for non-rigid registration [280]. Acquiring aerial images and producing
3D point clouds by photogrammetry to detect forest cover changes has become
a cost-effective alternative to expensive LiDAR scanners [7]. Change detection using
local information (surface normals) of point clouds of geographic entities (urban
monitoring and complex earth surface topographies) is also dealt with in [163, 255].
Defect detection on the three-dimensional surfaces of mechanical parts is challenging because
the parts are cluttered by objects and occluded by other components [33]. For example, detecting,
classifying, and quantifying loose bolts on machinery requires manual inspection. Visual
inspection is often time-consuming and labor-intensive; human operators may overlook
missing parts, and the repeatability rate is low [124]. In addition, untimely maintenance and
operational costs are among the reasons components require repair. In the manufacturing
and aerospace industries, the maintenance overhead of machinery parts can induce financial
pressure. Automatic or semi-automatic visual inspection allows defects to be detected by
image or multimodal data analysis. It ensures that product quality and production
rate are improved, and it avoids errors caused by subjectivity. This technology is beneficial
for material inspection and quality control.
This study proposes a vision-based, semi-autonomous approach for spatio-temporal change
detection, localization, and probabilistic reliability quantification of mechanical systems. To
accomplish accurate change detection, high-resolution cameras and an inexpensive off-the-shelf
ranging camera were used to produce dense 3D point clouds. These dense point
clouds were used to detect and localize spatio-temporal changes in the mechanical
systems. Manual processing and analysis of these large point clouds is labor-intensive and
time-consuming, and it carries high operational costs for specialized inspectors. Thus, a robust
distance measure to quantify the changes was incorporated. Most prior work focused only on
detecting and quantifying changes at a single moment. In this study, an evolution (time
history) of the changes along the time domain is presented. This allows greater flexibility
for visual inspectors to record specific changes in the components. The change
detection accuracy is highly dependent on the point cloud registration procedure. Previous
methods in the literature relied on the Iterative Closest Point (ICP) algorithm to achieve
this task. Generally, the ICP algorithm converges to a local minimum if the distance between
the point clouds is large (in other words, the two point clouds need to start close to each other). Thus,
a deep learning-based global registration is adopted to avoid poor registration of the
point clouds. Furthermore, it is unreliable to quantify change from a single scan and a single pair
of registered 3D point clouds, since the uncertainty in the detected changes depends on the accuracy
of the registration process. To overcome this, ensemble averaging of the Cloud-to-Cloud (C2C)
distances over multiple point cloud registrations of no-change and change samples is leveraged to
minimize the change detection uncertainty. This provides a probabilistic reliability measure on change
quantification. Lastly, the proposed method provides comprehensive visualization tools that allow
visual inspectors to assess the condition of mechanical systems in a short time, along
with the time dimension of change detection and localization.
6.1.1 Review of the literature
This section details the available literature in multiple domains of engineering and technology.
Inspection methods for mechanical, aerospace, and civil infrastructure are reviewed briefly. In
addition, methods that exploit 3D computer-aided design (CAD) models for inspection,
defect localization, and registration are also presented.
6.1.1.1 Mechanical systems
Cha et al. [33] proposed a method to classify loosened bolts using the Hough transform
and SVMs. They sampled real bolt images with loose and tight bolts
and used a Canny edge detector to extract the bolts' edges. An ellipse is fitted to the extracted
features and classified. However, their method assumes a clean and smooth background for
classification, which is not valid in real-world scenarios. Xiangxiong and Jian [259]
developed an algorithm to rapidly search for Shi-Tomasi features and track the movement
of these feature points between images. If a bolt is loosened, the feature points associated
with it exhibit a unique rotational movement pattern. Their method
works well if reliable and comparable features exist, which is usually rare in an uncontrolled
environment. Ramana et al. [206] used the Viola-Jones algorithm, trained on two datasets of
images with and without bolts, to localize all the bolts in the images. The localized bolts
were automatically cropped and binarized to calculate the exposed shank length and bolt
head dimensions. However, their method assumed a uniform background.
Erdős et al. [65] presented an algorithm to match complex scenes of mechanical
systems. They used the 3D CAD model and scanned point clouds to match pipe
features. Nguyen and Choi [188] proposed a normal-based region growing method for
segmentation and an efficient random sample consensus (RANSAC) for feature recognition
and parameter extraction of point cloud entities. In their work, a mechanical, electrical, and
plumbing plant was assessed by a distance-based deviation analysis and a comparison of geometric
parameters between on-site LiDAR-scanned point clouds and as-built 3D CAD models.
Zhou et al. [292] proposed a multiscale-Hessian-based method for the detection of scratches
and dents on automobile surfaces; however, only controlled lighting conditions were used
in their work. Zhang et al. [288] used a region-based convolutional neural network (CNN) to detect
loose and tight bolts. However, their methodology assumed a uniform background. Finally,
a recent ongoing study by Masri and Chen [174] has developed and applied, under laboratory
conditions, a robust methodology for the detection, localization, quantification, and tracking
of evolving changes in mechanical systems, while simultaneously providing a probabilistic
measure of the reliability of the analysis results.
6.1.1.2 Aircraft components
Malekzadeh et al. [171] used a CNN with a SURF-based keypoint detector to detect defects
such as dents and scratches on aircraft fuselages. However, their method did not quantify
the defects and changes. Jovančević et al. [124] proposed a method to classify and extract
information about defects such as dents, protrusions, or scratches based on local surface
properties of an aircraft fuselage. They collected 3D data from a high-resolution sensor,
and changes were defined based upon surface normals. They also assume the surfaces are
smooth with respect to curvature, which is not always true for intricate
mechanical bodies. Hitchcox and Zhao [105] proposed a graph-based random-walker image
segmentation method to segment surface voids and defects in unorganized 3D scans of
aerospace surfaces. However, only a synthetic point cloud dataset was used to study the
different attributes of their algorithm. Xie et al. [263] proposed a rivet detection method for
defect characterization on aircraft fuselages. They leveraged 3D point clouds and
multiple-structure fitting to localize the rivet geometry on scanned point clouds.
Ben Abdallah et al. [18] developed an inspection procedure that matches a 3D CAD
model and real 2D images for quality control and defect detection on aeronautical and
mechanical assemblies. In their approach, temporal changes were not considered. Abdallah
et al. [1] developed a robust method for defect detection on aircraft electrical wiring systems.
They exploited 3D point clouds from a scanner and CAD models for inspection of
the wiring components, and developed a 3D segmentation algorithm for cable
inspection. Boughrara et al. [27] adapted a deep learning approach to inspecting the complex
mechanical assemblies of an aircraft turbine engine. Their approach was to verify that the
assembly components were in the correct positions by leveraging a 3D CAD model and a 3D
scanner to obtain the point clouds. Finally, Nasser and Rabani [187] proposed a 3D point
cloud registration method for matching partial 3D point clouds to free-form CAD
models. They used correlation coefficients between the three-dimensional points along with
a pose estimated by Procrustes analysis. This initial estimate of the global transformation was
provided to the ICP algorithm to perform the fine registration. In addition, their method
is well-suited for essentially smooth and featureless point clouds.
Shen et al. [222] used deep learning-based fully convolutional networks (FCN) for
defect detection on aircraft engine parts using borescope images. However, they did not
consider the three-dimensional aspect of quantification. Mohtasham Khani et al. [183]
adapted a two-layer convolutional neural network for the detection of cracks on gas turbine
engines. Their approach exploited image processing filters to smooth the surface and deep
learning methods to classify crack and non-crack samples. Li et al. [158] adapted an
object detection CNN for the localization of cracks in aircraft components. Their
network combined depthwise separable convolutions and feature pyramids on top of the
YOLOv3 CNN architecture.
6.1.1.3 Civil infrastructures
Khaloo and Lattanzi [129] proposed a hierarchical dense Structure-from-Motion reconstruction
for the condition assessment of infrastructure. In their method, dense point clouds were
used to analyze change. However, the change detection was based on single-time-frame point
clouds. Jafari et al. [114] presented a deformation tracking method using dense point clouds.
In their approach, change analysis was conducted by registering the point clouds using the
ICP algorithm and finding the point-wise distances between them. However, the
ICP algorithm works well only if the initial distance between the point clouds is small; otherwise,
the optimization converges to a local minimum. Khaloo et al. [130] used a hierarchical
dense Structure-from-Motion algorithm to generate the point cloud of a trail bridge in
Alaska. They compared the dense point cloud with a LiDAR scan for structural inspection
and condition assessment. Khaloo et al. [131] used a combination of multiple UAV platforms
and multi-scale photogrammetry to produce a high-resolution dense point cloud of a large
gravity dam. In their work, artificial man-made defects were detected from the 3D information
with sub-millimeter accuracy. Lastly, Mohammadi et al. [181] leveraged a 3D fully convolutional
network model to segment point clouds obtained from a hurricane-damaged area. The dense SFM
point clouds were generated using a camera mounted on an unmanned aerial system.
6.1.2 Contribution
The literature mentioned earlier in this work focuses only on defect detection with respect to
the non-defective components at a single time step, except in [114], where the temporal domain
was considered. This study focuses on a class of problems concerning surface defect/damage
detection in mechanical systems. Automated surface inspection and change detection is an
essential task in many industries. In this work, a semi-autonomous method is proposed to
detect changes in mechanical systems using a photogrammetry technique known as
structure-from-motion and a structured-light-based 3D data acquisition system. Multiple
3D point clouds taken at various time steps are used to detect and localize the evolution of
the changes (i.e., the temporal axis is incorporated into the condition assessment of mechanical
systems). It was seen that the photogrammetry technique can generate accurate and
dense point clouds that can be used for change detection. After the point clouds are obtained,
the source/reference and target are registered by a feature-based deep learning method,
followed by ensemble averaging of the C2C distances of the point clouds for statistical
change detection. Primarily, the displacement of bolts on mechanical components is
studied in this work.
6.1.3 Scope
Section 6.2 provides details about the data acquisition and the mathematical underpinnings of 3D
vision cameras and photogrammetry techniques like structure-from-motion. Section 6.3
explains the data extraction pipeline. This is followed by a feature-based deep learning
method for point cloud registration, semi-autonomous change detection using an ensemble
averaging technique, and probabilistic reliability on change quantification. Section 6.4
discusses the experimental procedure and setup that were used for the evaluation of the
proposed method. Lastly, Sections 6.5 and 6.6 conclude the current work and lay down a
roadmap for future extensions.
6.2 Three-dimensional data acquisition methods
6.2.1 Vision-based camera technologies
Three-dimensional vision technologies are prevalent today compared to a decade ago, owing to
mass production for the gaming industry (Microsoft Kinect), mobile phones, and robotics
and drone enthusiasts. Standard 3D technologies are stereo vision, structured light,
and time-of-flight. Figure 6.1 shows the working principle of the three sensors.
Figure 6.1: Widely used 3D vision technologies. Left: stereo vision; middle: structured light; right: time-of-flight.
In stereo vision, two cameras are separated by a known small distance, called the baseline. Based on similar triangles,
\[
\frac{Z}{f} = \frac{x}{X_L} \qquad (6.1)
\]
\[
\frac{Z}{f} = \frac{x - b}{X_R} \qquad (6.2)
\]
\[
\frac{Z}{f} = \frac{y}{Y_L} = \frac{x}{X_L}, \qquad (6.3)
\]
where $Z$ is the distance from the point $P(x, y)$, $f$ is the focal length of the cameras, $b$ is the baseline of the system, and $X_L$ and $X_R$ are the distances between the optical axes and the projection points ($p$ and $p'$) of the left and right cameras, respectively. The depth $Z$, and the $x, y$ coordinates of the point are given by
\[
Z = \frac{bf}{X_L - X_R} \qquad (6.4)
\]
\[
Z = \frac{bf}{d} \qquad (6.5)
\]
\[
x = \frac{X_L\, Z}{f} \qquad (6.6)
\]
\[
y = \frac{Y_L\, Z}{f}, \qquad (6.7)
\]
where $d = X_L - X_R$ is the disparity (i.e., the difference in position between the corresponding points in the two images). For the structured light camera, using the same principle of similar triangles, the depth relation is
\[
\frac{D}{d} = \frac{Z}{f}, \qquad (6.8)
\]
where $D$ is the distance between the two reflected beams observed by the infrared camera. For the triangle with respect to the reference plane,
\[
\frac{D}{b} = \frac{L - Z}{L}, \qquad (6.9)
\]
where $L$ is the distance to the reference plane. Substituting for $D$ from Equation (6.8) into Equation (6.9) gives
\[
Z = \frac{L}{1 + \dfrac{L}{fb}\, d}. \qquad (6.10)
\]
Time-of-flight is based on the time light takes to return after a known pulse is emitted; the distance is measured by
\[
Z = \frac{C\, T}{2}, \qquad (6.11)
\]
where C is the speed of light, and T is the time delay.
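As a quick illustration of the three range models above, the short sketch below (not part of the experimental pipeline; all numerical parameter values are hypothetical) evaluates Equations (6.5), (6.10), and (6.11):

```python
import numpy as np

def stereo_depth(f_px, baseline_m, disparity_px):
    """Depth from disparity, Eq. (6.5): Z = b*f/d."""
    return baseline_m * f_px / disparity_px

def structured_light_depth(f_px, baseline_m, ref_dist_m, disparity_px):
    """Structured-light depth, Eq. (6.10): Z = L / (1 + (L/(f*b)) * d)."""
    return ref_dist_m / (1.0 + (ref_dist_m / (f_px * baseline_m)) * disparity_px)

def tof_depth(time_delay_s, c=3.0e8):
    """Time-of-flight depth, Eq. (6.11): Z = c*T/2."""
    return c * time_delay_s / 2.0

if __name__ == "__main__":
    # Hypothetical parameters: 580 px focal length, 7.5 cm baseline, 1.2 m reference plane.
    print(stereo_depth(f_px=580.0, baseline_m=0.075, disparity_px=np.array([20.0, 40.0])))
    print(structured_light_depth(f_px=580.0, baseline_m=0.075, ref_dist_m=1.2, disparity_px=5.0))
    print(tof_depth(time_delay_s=6.7e-9))
```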
Table 6.1 compares the different 3D sensors. Based on their sensing capability, the three
camera types are categorized as passive and active: the passive sensor is the stereo camera, and
the active ones are the structured-light and time-of-flight cameras. When using these
sensors for precision work, such as quantifying defects that are not changing abruptly in time
(dynamical systems), some of the critical items that need to be considered are depth accuracy,
image resolution, and low-light performance. Based on these criteria, the time-of-flight
sensor generally fits best, depending on cost.
Passive 3D: Stereo Vision (SV). Active 3D: Structured Light (SL) and Time of Flight (TOF).
Operation principle: SV emulates human eyes with two 2D sensors; SL detects the distortion of a visible/invisible light pattern; TOF measures the travel time of a laser pulse.
Software complexity: SV high; SL medium; TOF low.
3D point cloud generation: SV high software processing; SL medium software processing; TOF direct out of the chipset.
Measuring range: SV mid-range (depends on the spacing between the two cameras); SL very short to mid-range (depends on illumination power); TOF short to long range (depends on laser power and modulation).
Depth accuracy: SV mm to cm (difficulty with smooth surfaces); SL mm to cm; TOF mm to cm (depends on the resolution of the sensor).
Image resolution: SV camera-dependent; SL camera-dependent; TOF QVGA (320 x 240).
Scanning speed: SV medium (limited by software complexity); SL fast (limited by camera speed); TOF fast (limited by sensor speed).
Low-light performance: SV weak; SL good; TOF good.
Outdoor performance: SV good; SL weak/fair; TOF weak/fair.
Power consumption: SV low; SL medium; TOF medium/high.
Table 6.1: 3D vision technologies comparison [174].
Figure 6.2 shows the various 3D sensors that were evaluated for the quantification
task. The SoftKinetic DS 325 and DS 311 were evaluated for capturing dynamical displacements;
it was found that the SoftKinetic DS 325's resolution was low. In addition, it was used for the 3D
reconstruction of polyvinyl chloride (PVC) pipes; however, the point cloud density was too low
for quantification purposes. The Kinect V1, Kinect V2, and Riegl VZ-400 were evaluated for change
detection between two point clouds. Also, the Asus Xtion Pro was evaluated for 3D point cloud
generation and for performing SLAM. Table 6.2 shows the specifications and general details of
these sensors; based on precision, the Kinect V2 is a good candidate provided a minimum distance
of 1 meter is maintained.
Figure 6.2: Off-the-shelf commercially available 3D sensors that were evaluated in this study [174].
SoftKinetic DS 311: time of flight; color and depth; range 0.15-1 m (near mode) and 1.5-4.5 m (far mode); resolution 640 x 480 (color), 160 x 120 (depth); viewing angle 50 x 40 x 60 deg (H x V x D); frame rate 25-60 fps; dimensions 24 x 4 x 5 cm; weight <0.5 kg; indoor use; price $299; software support good.
SoftKinetic DS 325: time of flight; color and depth; range 0.15-1 m; resolution 320 x 240 (color, depth); viewing angle 74 x 58 x 87 deg (depth) and 63.2 x 49.3 x 75.2 deg (color) (H x V x D); frame rate 25-30 fps (color), 50-60 fps (depth); dimensions 10.5 x 3 x 2.3 cm; weight <0.5 kg; indoor use; price $249; software support good.
Microsoft Kinect v1: structured light; color and depth; range 0.8-4 m; resolution 640 x 480; viewing angle 57 x 43 deg; frame rate 30 fps; dimensions 12 x 3 x 2.5 in; weight <0.5 kg; indoor use; price $100; software support very good.
Asus Xtion Pro: structured light; color and depth; range 0.8-3.5 m; resolution 1280 x 1024 (color), 640 x 480 or 320 x 240 (depth); viewing angle 58 x 45 x 70 deg; frame rate 30/60 fps; dimensions 18 x 3.5 x 5 in; weight <0.5 kg; indoor use; price $149; software support very good.
Microsoft Kinect V2: time of flight; color and depth; range 0.8-4.5 m; resolution 1920 x 1080 (color), 512 x 424 (depth); viewing angle 70 x 60 deg (H x V); frame rate 30 fps; dimensions 24.9 x 6.6 x 6.7 cm; weight <1 kg; indoor use; price $199; software support very good.
Riegl VZ-400: time of flight; depth only; range 1.5-600 m; 5 mm resolution, 122,000 measurements/sec; angular resolution 0.0024 x 0.288 deg to 0.0024 x 0.5 deg; dimensions 180 x 308 mm (diameter x length); weight 9.6 kg; outdoor use; price $175,000; software support good.
Table 6.2: Specifications and general details of 3D technologies [174].
Also, the Riegl VZ-400 terrestrial laser scanner provides dense data, but at a staggering cost.
6.2.2 Structure-from-motion (a photogrammetry technique)
Structure-from-motion (SFM) is a method for estimating the locations of 3D points from
multiple images, given only a sparse set of correspondences between image features, by
simultaneously estimating both the 3D geometry (structure) and the camera pose (motion). Figure 6.3
shows the flowchart of SFM.
Figure 6.3: Overview of structure from motion (image sets; feature detection, e.g. SIFT; keypoint correspondence, e.g. ANN; keypoint filtering, e.g. RANSAC; structure from motion, e.g. bundle adjustment; triangulation for 3D points; and dense 3D point cloud).
As the first step, features are detected from two or
more corresponding images using SIFT or SURF keypoints. Keypoint correspondences are
then matched by the approximate nearest neighbor method of Section 5.2. Next, the keypoints are
filtered using the RANSAC algorithm that is described briefly in Section 4.4.2.2. This is followed
by bundle adjustment, which considers all the images at once to find the 3D geometry
(structure) and camera pose (motion). Given the rotation matrices and translation vectors
for the image pairs, dense 3D point clouds can be generated by triangulation.
6.2.2.1 Triangulation
Given the set of corresponding image locations and known camera positions, the problem of determining a point's 3D position is known as triangulation [234]. This is done by finding the 3D point $p$ that lies closest to all of the 3D rays corresponding to the 2D matching feature locations, $x_j$, observed by cameras $P_j = K_j [R_j \,|\, t_j]$, where $t_j = -R_j c_j$ and $c_j$ is the $j$-th camera center (refer to Section 4.3). The rays originate at $c_j$ in the direction $\hat{v}_j = \mathcal{N}(R_j^{-1} K_j^{-1} x_j)$.
Figure 6.4: 3D point triangulation. (a) 3D point triangulation by finding the point $p$. (b) Rotation around an axis $\hat{n}$ by an angle $\theta$.
The nearest point to $p$ on this ray, $q_j$, minimizes the distance
\[
\| c_j + d_j \hat{v}_j - p \|^2, \qquad (6.12)
\]
which has a minimum at $d_j = \hat{v}_j \cdot (p - c_j)$. Thus,
\[
q_j = c_j + (\hat{v}_j \hat{v}_j^T)(p - c_j) \qquad (6.13)
\]
\[
\phantom{q_j} = c_j + (p - c_j)_{\parallel}. \qquad (6.14)
\]
Using the notation
\[
v_{\parallel} = \hat{n}(\hat{n} \cdot v) = (\hat{n}\hat{n}^T)\, v \qquad (6.15)
\]
results in the squared distance between $p$ and $q_j$,
\[
r_j^2 = \| (I - \hat{v}_j \hat{v}_j^T)(p - c_j) \|^2 \qquad (6.16)
\]
\[
\phantom{r_j^2} = \| (p - c_j)_{\perp} \|^2. \qquad (6.17)
\]
The optimal value for $p$, which lies closest to all of the rays, can be computed as a regular least squares problem by summing over all the $r_j^2$ and finding the optimal value of $p$,
\[
p = \Bigg[ \sum_j \big(I - \hat{v}_j \hat{v}_j^T\big) \Bigg]^{-1} \Bigg[ \sum_j \big(I - \hat{v}_j \hat{v}_j^T\big)\, c_j \Bigg]. \qquad (6.18)
\]
Further details about the analytical expressions above are available in [234].
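A minimal sketch of the linear triangulation of Equation (6.18) is given below; the camera centers and viewing rays are hypothetical placeholders, not values from the experiments:

```python
import numpy as np

def triangulate_point(centers, directions):
    """Least-squares 3D point closest to a set of rays, Eq. (6.18).

    centers:    (M, 3) array of camera centers c_j.
    directions: (M, 3) array of ray directions v_j (normalized internally).
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(centers, directions):
        v = v / np.linalg.norm(v)          # ensure unit length
        P = np.eye(3) - np.outer(v, v)     # projector onto the plane normal to the ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    # Hypothetical two-view setup: cameras 10 cm apart, both looking toward the same point.
    centers = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
    target = np.array([0.02, 0.01, 1.0])
    directions = target - centers
    print(triangulate_point(centers, directions))   # recovers ~[0.02, 0.01, 1.0]
```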
6.2.2.2 Epipolar geometry
Consider Figure 6.5, which shows a 3D point $p$ being viewed from two cameras whose relative position can be encoded by a rotation $R$ and a translation $t$. Since there is no information about the camera positions, without loss of generality, the first camera can be set at the origin, $c_0 = 0$, and at a canonical orientation, $R_0 = I$. The location of point $p$ in the first image, $p_0 = d_0 \hat{x}_0$, is mapped into the second image by the transformation
\[
d_1 \hat{x}_1 = p_1 = R p_0 + t = R (d_0 \hat{x}_0) + t, \qquad (6.19)
\]
where $\hat{x}_j = K_j^{-1} x_j$ are the local ray direction vectors. Taking the cross product of both sides with $t$ gives
\[
d_1 [t]_{\times} \hat{x}_1 = d_0 [t]_{\times} R \hat{x}_0. \qquad (6.20)
\]
Figure 6.5: Epipolar geometry (epipolar plane and epipolar lines) relating the two views of point $P$.
Taking the dot product of both sides with $\hat{x}_1$ gives
\[
d_0\, \hat{x}_1^T \big([t]_{\times} R\big) \hat{x}_0 = d_1\, \hat{x}_1^T [t]_{\times} \hat{x}_1 = 0, \qquad (6.21)
\]
because the cross product matrix $[t]_{\times}$ is skew symmetric and returns 0 when pre- and post-multiplied by the same vector.
Therefore, the basic epipolar constraint is given by
\[
\hat{x}_1^T E\, \hat{x}_0 = 0, \qquad (6.22)
\]
where
\[
E = [t]_{\times} R. \qquad (6.23)
\]
Given this fundamental relationship, Equation (6.23), the camera motion encoded in the essential matrix $E$ has to be recovered. If there are $N$ corresponding measurements $(x_{i0}, x_{i1})$, $N$ homogeneous equations in the nine elements of $E = \{e_{00}, \ldots, e_{22}\}$ can be formed:
\[
x_{i0} x_{i1} e_{00} + y_{i0} x_{i1} e_{01} + x_{i1} e_{02}
+ x_{i0} y_{i1} e_{10} + y_{i0} y_{i1} e_{11} + y_{i1} e_{12}
+ x_{i0} e_{20} + y_{i0} e_{21} + e_{22} = 0, \qquad (6.24)
\]
where $x_{ij} = (x_{ij}, y_{ij}, 1)$.
Given $N \geq 8$ such equations, an estimate (up to scale) for the entries in $E$ can be computed using an SVD. Once an estimate for the essential matrix $E$ has been recovered, the direction of the translation vector, $t$, can be estimated. To estimate this direction, $\hat{t}$, note that under ideal noise-free conditions the essential matrix $E$ is singular, because $\hat{t}^T E = 0$. This singularity shows up as a singular value of 0 when an SVD of $E$ is performed,
\[
E = [\hat{t}]_{\times} R = U \Sigma V^T
  = \begin{bmatrix} u_0 & u_1 & \hat{t} \end{bmatrix}
    \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}
    \begin{bmatrix} v_0^T \\ v_1^T \\ v_2^T \end{bmatrix}. \qquad (6.25)
\]
When $E$ is computed from noisy measurements, the singular vector associated with the smallest singular value gives $\hat{t}$. The other two singular values should be similar, but are not, in general, equal to 1 because $E$ is only computed up to an unknown scale. Because $E$ is rank-deficient, only seven correspondences of the form of Equation (6.24), instead of eight, are needed to estimate this matrix. From such a set of seven homogeneous equations, a $7 \times 9$ matrix can be stacked for SVD analysis.
After $\hat{t}$ is recovered, the cross-product operator $[\hat{t}]_{\times}$ projects a vector onto a set of orthogonal basis vectors that includes $\hat{t}$, zeros out the $\hat{t}$ component, and rotates the other two by $90^{\circ}$,
\[
[\hat{t}]_{\times} = S Z R_{90} S^T
  = \begin{bmatrix} s_0 & s_1 & \hat{t} \end{bmatrix}
    \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}
    \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} s_0^T \\ s_1^T \\ \hat{t}^T \end{bmatrix}, \qquad (6.26)
\]
where $\hat{t} = s_0 \times s_1$. Using Equations (6.25) and (6.26),
\[
E = [\hat{t}]_{\times} R = S Z R_{90} S^T R = U \Sigma V^T, \qquad (6.27)
\]
from which we can conclude that $S = U$. For a noise-free essential matrix ($\Sigma = Z$),
\[
R_{90}\, U^T R = V^T, \qquad (6.28)
\]
and
\[
R = U R_{90}^T V^T. \qquad (6.29)
\]
The matrices $U$ and $V$ are not guaranteed to be rotations (their orientation can be flipped). Therefore, all four combinations of rotation matrices are produced,
\[
R = \pm\, U R_{\pm 90}^T V^T, \qquad (6.30)
\]
and the one which produces determinant $|R| = 1$ is retained. Further details about the
analytical expressions above are available in [234].
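The following sketch illustrates the linear (eight-point) estimation of $E$ and its SVD-based decomposition, in the spirit of Equations (6.24)-(6.30). The input arrays are assumed to be calibrated (K-normalized) correspondences, and a practical implementation would add coordinate normalization, RANSAC-based outlier rejection, and a cheirality test to choose among the candidate rotations:

```python
import numpy as np

def estimate_essential(x0, x1):
    """Linear (8-point) estimate of E from calibrated points, Eqs. (6.24)-(6.25).

    x0, x1: (N, 2) arrays of corresponding, K^-1-normalized image points.
    """
    u0, v0 = x0[:, 0], x0[:, 1]
    u1, v1 = x1[:, 0], x1[:, 1]
    A = np.stack([u1 * u0, u1 * v0, u1,
                  v1 * u0, v1 * v0, v1,
                  u0, v0, np.ones_like(u0)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)               # singular vector of the smallest singular value
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt   # enforce two equal non-zero singular values

def decompose_essential(E):
    """Recover t-hat and candidate rotations, Eqs. (6.26)-(6.30)."""
    U, _, Vt = np.linalg.svd(E)
    t_hat = U[:, 2]                        # direction associated with the (near) zero singular value
    R90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
    candidates = []
    for W in (R90, R90.T):                 # the +/- 90 degree choices
        for sign in (1.0, -1.0):           # the overall sign ambiguity
            R = sign * U @ W.T @ Vt
            if np.linalg.det(R) > 0:       # keep only proper rotations, |R| = 1
                candidates.append(R)
    return t_hat, candidates
```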
6.2.2.3 Bundle adjustment
The recovery process of structure and motion is commonly known in the photogrammetry (and
now computer vision) communities as bundle adjustment. The feature location measurements,
$x_{ij}$, now depend not only on the point (track) index $i$, but also on the camera pose index, $j$,
and are given by
\[
x_{ij} = f(p_i, R_j, c_j, K_j). \qquad (6.31)
\]
The 3D point positions $p_i$ are also being simultaneously updated.
The 3D point, $p_i$, is projected into a 2D measurement, $x_{ij}$, through a series of transformations, $f^{(k)}$, each of which is controlled by its own set of parameters. Figure 6.6 shows the sequence of transformations: $f_C(x)$, $f_P(x)$, $f_R(x)$, and $f_T(x)$ are the image-center, projection, rotation, and translation transformations, respectively. The flow of information for the partial derivatives, computed during a backward pass, is shown in dashed lines. The formula for the radial distortion function is $f_{RD}(x) = (1 + k_1 r^2 + k_2 r^4)\, x$.
The box on the left performs a robust comparison of the predicted and measured 2D locations, $\hat{x}_{ij}$ and $\tilde{x}_{ij}$, after re-scaling by the measurement noise covariance, $\Sigma_{ij}$.
Figure 6.6: Chained transformations for projecting a 3D point, $p_i$, into a 2D measurement, $x_{ij}$.
In more detail, this operation can be written as
\[
r_{ij} = \tilde{x}_{ij} - \hat{x}_{ij} \qquad (6.32)
\]
\[
s_{ij}^2 = r_{ij}^T\, \Sigma_{ij}^{-1}\, r_{ij} \qquad (6.33)
\]
\[
e_{ij} = \hat{\rho}(s_{ij}^2), \qquad (6.34)
\]
where $\hat{\rho}(r^2) = \rho(r)$. The corresponding Jacobians are written as
\[
\frac{\partial e_{ij}}{\partial s_{ij}^2} = \hat{\rho}'(s_{ij}^2) \qquad (6.35)
\]
\[
\frac{\partial s_{ij}^2}{\partial \hat{x}_{ij}} = -2\, \Sigma_{ij}^{-1}\, r_{ij}. \qquad (6.36)
\]
Computations of the partial derivatives and Jacobians follow the chain rule and can be adapted
to any camera configuration. Generally, a non-linear least squares solver is used to minimize the chained
representation. Further details are available in [234].
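The sketch below illustrates how the whitened reprojection residuals of Equations (6.32)-(6.33), combined with a robust (Huber) penalty in place of rho-hat, can be fed to an off-the-shelf non-linear least squares solver. The simple parameterization used here (a rotation vector plus camera center per camera, one shared known intrinsic matrix K, and an isotropic noise level sigma) is an assumption made only for this example, not the exact formulation of [234]:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, x_obs, K, sigma=1.0):
    """Whitened residuals r_ij / sigma for all observations (cf. Eqs. (6.32)-(6.33)).

    params packs [rotvec_j, c_j] for each camera followed by the 3D points p_i.
    cam_idx, pt_idx: per-observation camera and point indices; x_obs: (M, 2) pixels.
    """
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    R = Rotation.from_rotvec(cams[cam_idx, :3]).as_matrix()   # (M, 3, 3)
    c = cams[cam_idx, 3:]                                     # camera centers c_j
    p_cam = np.einsum('mij,mj->mi', R, pts[pt_idx] - c)       # R_j (p_i - c_j)
    x_proj = (K @ p_cam.T).T
    x_hat = x_proj[:, :2] / x_proj[:, 2:3]                    # perspective division
    return ((x_obs - x_hat) / sigma).ravel()

def bundle_adjust(x0, n_cams, n_pts, cam_idx, pt_idx, x_obs, K):
    # The Huber loss plays the role of the robustifier rho-hat in Eq. (6.34).
    return least_squares(reprojection_residuals, x0, loss='huber', f_scale=2.0,
                         args=(n_cams, n_pts, cam_idx, pt_idx, x_obs, K))
```

The initial guess x0 would typically come from the two-view pose recovery and triangulation steps described above.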
6.3 Methodology
The proposed approach to detect and localize the defects involves using image sensors to
acquire 2D and 3D information (refer to Figure 6.7 for the data pipeline). In addition, the sensors
used for precision work need to be calibrated. Sensors such as IMUs (Inertial Measurement
Units) are prevalent in robotics, inertial-only navigation, attitude estimation, and
visual-inertial navigation for obtaining location information. Nowadays, even smartphone
devices incorporate sophisticated sensors.
Figure 6.7: An overview of the data extraction pipeline (sensor calibration of the IMU, monocular camera, stereo camera, and 3D camera/LiDAR; raw position data, color images, depth images, and point clouds; preprocessing by denoising and contrast enhancement; data fusion via SFM and VSLAM; and final dense and non-dense point clouds).
MEMS (micro-electro-mechanical systems) technology is prevalent because the sensors
are cheap to manufacture. IMUs have tri-axial clusters of accelerometers, gyros, and often
a magnetometer. An ideal IMU's tri-axial clusters should have the same 3D orthogonal
sensitivity axes. Unfortunately, low-cost MEMS-based IMUs sometimes suffer from
inaccurate scaling, sensor axis misalignment, cross-axis sensitivities, and non-zero biases [238].
Vision-based sensors such as monocular and stereo cameras suffer from lens and
radial distortions; for accurate analysis, these need to be calibrated (refer to Chapter 4).
Figure 6.8: Overview of the registration and change detection methods for inspection. (a) Deep learning-based global registration of the point clouds (preprocessing by cropping, downsampling, and denoising; FCGF feature extraction into 32 x 1 descriptors; correspondence confidence prediction with a residual U-Net type CNN; weighted Procrustes estimation of the SE(3) transformation; gradient-based fine registration; and final ICP refinement). (b) Change detection pipeline (cloud-to-cloud (C2C) distance, C2C ensemble averaging, change criteria, and display of the changed point cloud as a heatmap).
Once the raw data is obtained, the calibration factors obtained in the calibration procedures
remove the biases. Commonly, this is followed by pre-processing, as the data is corrupted by
motion blur, noise, and camera misfocus. After the restoration, data is fused by algorithms
like SFM and Visual Simultaneous Localization and Mapping (VSLAM) to obtain a global
picture of the scene, also known as scene reconstruction. The reconstructed data can be
3D point clouds, sensor motion estimates, or 2D images from a monocular camera.
Currently, the proposed method involves calibrating the camera for obtaining the high-resolution
images (refer to Figure 6.7). The MATLAB camera calibration toolbox was used to
accomplish the calibration. In addition, a known 3D object is used as a calibration target in
order to obtain the scaling factor. A series of 2D color images are captured along a complete-coverage
(zig-zag) path with multiple viewing angles. This ensures a good overlapping
area of 40-60% and allows a large number of correspondences to be matched and extracted. This method is
also known as SFM, as the 3D structure and motion parameters of the camera are found
together by an approach known as bundle adjustment. To obtain the dense point cloud, the
proprietary software ContextCapture (a Bentley Systems reality modeling product) was used
[232]. Usually, the dense point cloud obtained from ContextCapture is dimensionless; it has
to be scaled by the scale factor obtained by measuring a known object length
in 3D. Once the 3D point clouds are scaled, a region of interest is cropped, and a point
cloud-to-point cloud change detection is calculated to localize the changes of defective parts.
This completes the data extraction pipeline, from 2D images to 3D point clouds
produced by 3D sensors. Lastly, this information is utilized in the deep learning-based point
cloud registration and decision-making algorithm to detect and localize the defects (see the
algorithm flowcharts in Figure 6.8). The details of the deep learning-based point cloud
registration and change detection are given in Sections 6.3.1 and 6.3.2, respectively.
6.3.1 Deep learning-based pairwise 3D point clouds registration
Aligning two or more point clouds is essential for tasks such as 3D reconstruction, tracking,
pose estimation, and change detection in computer vision, robotics, and civil and mechanical
engineering. A classical method, ICP, was proposed to register two point clouds, given
that the initial poses are aligned close to each other [21]. However, ICP often converges to a
local minimum and suffers from incomplete point cloud registration. Similarly, an extension of
the ICP algorithm proposed by [43] also converges to a local minimum. Furthermore, the probabilistic
point cloud registration methods proposed in [23, 186], based on the normal distribution
transform and Gaussian mixture models, are computationally expensive. Globally optimized
ICP registration methods that use the branch-and-bound optimization technique converge to
the global minimum [268]. However, the computational costs are exorbitantly high depending
on the density of the point clouds.
Traditional pairwise point cloud registration consists of two stages. First, coarse
registration, where the initial estimate of the transformation matrix is obtained from
handcrafted features extracted from local information [214, 241, 242], and Random
Sample Consensus (RANSAC) is used to refine the correspondences between the point
clouds [77]. Second, the coarse alignment is often fine-tuned by ICP variants. Recently
proposed deep learning-based feature descriptors are well-suited because feature engineering
is not required [279, 51]. As in the coarse registration, RANSAC can be used to obtain the
optimal transformation matrix, followed by ICP registration.
Recently, end-to-end registration networks have proven to be effective and faster compared
to traditional methods. Aoki et al. [10] proposed a deep learning model to register
point clouds. However, their method uses globally pooled features to encode the geometry,
which results in decreased registration accuracy. Wang and Solomon [251] proposed a deep
closest point method to register point clouds, in which the distribution of the number of points
in the source and target point clouds is fixed; in reality, however, the source and target cannot
be expected to have the same distribution. In this study, accurate registration is of paramount importance
for detecting minute changes in 3D point clouds. Thus, a deep learning-based method, Deep
Global Registration (DGR) [52], is adapted for the proposed change detection method. DGR
uses a learned feature descriptor, Fully Convolutional Geometric Features (FCGF), due to its
high accuracy and low computational cost. In addition, DGR works well without the
color information of the two point clouds. The details are presented in the following sections.
6.3.1.1 Feature extraction and correspondence prediction
A fully convolutional network extracts 32-dimensional feature vectors from the two 3D point cloud scans. The network is a U-Net type architecture with 3D convolutions and residual blocks. Each block has a kernel size, stride, and channel dimensionality. The second and third 3D convolutional blocks have skip connections to the sixth and seventh blocks. Furthermore, all the convolutions except the last layer are followed by batch normalization and a non-linearity (ReLU). The fully-convolutional features are based on a metric learning approach, in which similarity or dissimilarity is estimated using a distance measure; for example, the Euclidean distance can be used to measure the similarity between two feature vectors in feature space.
Figure 6.9: A random sampling and negative-mining strategy for contrastive and triplet losses: traditional contrastive and triplet losses vs. the FCGF hardest-contrastive and hardest-triplet losses that use the hardest negatives. Cyan circles denote positive feature vectors, and brown circles denote negatives.
The commonly used metric learning losses are modified: the contrastive loss, in which positive and negative samples are pulled together and pushed apart, respectively, and the triplet loss, in which the distance from the anchor point to the positive input is minimized while the distance to the negative input is maximized. Negative mining is incorporated to obtain the hardest-contrastive and hardest-triplet losses (see Figure 6.9). First, an anchor and a set of mining points are sampled from a 3D point cloud scan. Then, the hardest negatives are mined for both feature vectors, $f_i$ and $f_j$, in a positive pair. Next, false negatives that fall within a certain radius of the corresponding anchor are removed. Finally, the pairwise losses for the resulting quadruplet are combined to form the fully-convolutional hardest-contrastive loss:
\[
L_C = \sum_{(i,j)\in\mathcal{P}} \Bigg\{
\frac{\big[ D(f_i, f_j) - m_p \big]_+^2}{|\mathcal{P}|}
+ \lambda_n\, I(i, k_i, d_t)\, \frac{\big[ m_n - \min_{k\in\mathcal{N}} D(f_i, f_k) \big]_+^2}{|\mathcal{P}_i|}
+ \lambda_n\, I(j, k_j, d_t)\, \frac{\big[ m_n - \min_{k\in\mathcal{N}} D(f_j, f_k) \big]_+^2}{|\mathcal{P}_j|}
\Bigg\}, \qquad (6.37)
\]
where $[\,\cdot\,]_+ = \max(0, \cdot)$, $D(f_i, f_j)$ is the distance metric between features $(i, j)$, $\mathcal{P}$ is the set of all positive pairs of fully-convolutionally extracted features in a minibatch, and $\mathcal{N}$ is a random subset of fully-convolutional features in a minibatch that is used for negative mining. $I(i, k_i, d_t)$ and $I(j, k_j, d_t)$ are indicator functions that return 1 if the feature $k_i$ (respectively $k_j$) is located outside a sphere of diameter $d_t$ centered at feature $i$ (respectively $j$), and 0 otherwise, where $k_i = \operatorname{argmin}_{k\in\mathcal{N}} D(f_i, f_k)$. $m_p$ and $m_n$ are the margin distances for the similar (positive) and dissimilar (negative) pairs, respectively. $|\mathcal{P}_i| = \sum_{(i,j)\in\mathcal{P}} I(i, k_i, d_t)$ is the number of valid mined negatives for the first point cloud, and $|\mathcal{P}_j|$ is that for the second point cloud. $\lambda_n$ is a weight between the positive and negative terms, set equal to 0.5. Similarly, the triplet loss with hard negatives is given by
\[
L_T = \frac{1}{Z} \sum_{(i,j)\in\mathcal{P}} \bigg\{
I(i, k_i) \Big[ m + D(f_i, f_j) - \min_{k\in\mathcal{N}} D(f_i, f_k) \Big]_+
+ I(j, k_j) \Big[ m + D(f_i, f_j) - \min_{k\in\mathcal{N}} D(f_j, f_k) \Big]_+
\bigg\}, \qquad (6.38)
\]
where $Z = \sum_{(i,j)\in\mathcal{P}} \big( I(i, k_i) + I(j, k_j) \big)$ is a normalization constant. Equation (6.38) mines the hardest negatives for the pairs $(i, j) \in \mathcal{P}$ (see Figure 6.9), where $\mathcal{P}$ is the set of all positive pairs in the minibatch. Further details on the contrastive and triplet losses and the fully-convolutional hardest
losses are provided in [101, 218, 51].
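To make the negative-mining idea concrete, the simplified sketch below mines the hardest negative for each anchor of a positive pair and evaluates a contrastive-style loss in the spirit of Equation (6.37). It is not the FCGF implementation: the false-negative radius filter, the mining for the second (f_j) side, and the exact normalization constants are omitted, and all arrays, margins, and the lam weight are illustrative assumptions:

```python
import numpy as np

def hardest_contrastive_loss(F0, F1, pos_pairs, neg_pool, m_p=0.1, m_n=1.4, lam=0.5):
    """Simplified hardest-contrastive loss (one mining direction only).

    F0, F1:    (N0, D), (N1, D) feature arrays from the two scans.
    pos_pairs: (P, 2) integer array of positive correspondences (i, j).
    neg_pool:  1-D array of indices into F1 used as the negative-mining subset N.
    """
    fi = F0[pos_pairs[:, 0]]
    fj = F1[pos_pairs[:, 1]]

    # Positive term: pull matched features to within the margin m_p.
    d_pos = np.linalg.norm(fi - fj, axis=1)
    pos_term = np.maximum(d_pos - m_p, 0.0) ** 2

    # Hardest negatives: for each anchor f_i, the closest feature in the mining pool
    # that is not its own positive partner.
    d_neg = np.linalg.norm(fi[:, None, :] - F1[neg_pool][None, :, :], axis=2)
    is_partner = neg_pool[None, :] == pos_pairs[:, 1][:, None]
    d_neg[is_partner] = np.inf
    hardest = d_neg.min(axis=1)
    neg_term = np.maximum(m_n - hardest, 0.0) ** 2

    return pos_term.mean() + lam * neg_term.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F0, F1 = rng.normal(size=(100, 32)), rng.normal(size=(100, 32))
    pos_pairs = np.stack([np.arange(20), np.arange(20)], axis=1)
    print(hardest_contrastive_loss(F0, F1, pos_pairs, neg_pool=np.arange(100)))
```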
Given the feature sets of the two point clouds, $F_x = \{f_1^x, f_2^x, \ldots, f_{N_x}^x\}$ and $F_y = \{f_1^y, f_2^y, \ldots, f_{N_y}^y\}$, a nearest neighbor search in the feature space is performed to find the set of putative correspondences or matches,
\[
\mathcal{M} = \Big\{ \big(i,\; \operatorname*{argmin}_{j} \| f_i^x - f_j^y \| \big) \;\Big|\; i \in [1, 2, \ldots, N_x] \Big\}.
\]
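A minimal sketch of this feature-space nearest-neighbor search, assuming 32-dimensional descriptor arrays as placeholders for the learned features, is given below:

```python
import numpy as np
from scipy.spatial import cKDTree

def putative_matches(Fx, Fy):
    """Nearest-neighbor search in feature space: M = {(i, argmin_j ||f_i^x - f_j^y||)}.

    Fx: (Nx, 32) descriptors of the source scan; Fy: (Ny, 32) descriptors of the target.
    Returns an (Nx, 2) array of index pairs (i, j).
    """
    tree = cKDTree(Fy)
    _, j = tree.query(Fx, k=1)
    return np.stack([np.arange(len(Fx)), j], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Fx, Fy = rng.normal(size=(500, 32)), rng.normal(size=(600, 32))
    print(putative_matches(Fx, Fy).shape)   # (500, 2)
```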
To find the inlier matches, a CNN was used to learn the geometric structure of the correspon-
dence set. Figure 6.10 shows the residual U-Net CNN architecture with skip connections.
Figure 6.10: Six-dimensional residual U-Net type convolutional network architecture for inlier likelihood prediction (a 3D+3D coordinate input, 6D convolutions and residual blocks with 32, 64, 128, and 256 channels in the encoder, transposed 6D convolutions in the decoder, and a final convolution producing the inlier logit). The network has residual blocks between strided convolutions.
The network predicts a likelihood score in the range $[0, 1]$ for each correspondence. A correspondence point is a six-dimensional vector, $[x_i^T, y_j^T]^T \in \mathbb{R}^6$. A score greater than 0.5 indicates an inlier; otherwise, the correspondence is an outlier. The inlier correspondences are distributed on a lower-dimensional surface in $\mathbb{R}^6$. $\mathcal{P} = \big\{ (i, j) : \| T(x_i) - y_j \| < \tau,\; (i, j) \in \mathcal{M} \big\}$ is the set of inlier correspondences $(i, j)$ that align with high accuracy, up to the threshold $\tau$, under the ground-truth transformation $T$, and $\mathcal{N} = \mathcal{M} \setminus \mathcal{P}$ are the outlier correspondences. For training the network shown in Figure 6.10, a binary cross-entropy loss function was utilized, as in Equation (6.39),
\[
L_{BCE}(\mathcal{M}, T) = -\frac{1}{|\mathcal{M}|} \Bigg[ \sum_{(i,j)\in\mathcal{P}} \log p_{(i,j)} + \sum_{(i,j)\in\mathcal{N}} \log p^{C}_{(i,j)} \Bigg], \qquad (6.39)
\]
where $p_{(i,j)} \in [0, 1]$ is the likelihood prediction that the pair $(i, j)$ is an inlier, and $p^{C}_{(i,j)} = 1 - p_{(i,j)}$. This prediction is compared with the ground-truth correspondences, $\mathcal{P}$, and the loss in Equation (6.39) is minimized. $|\mathcal{M}|$ is the cardinality of the putative correspondence set.
6.3.1.2 Weighted Procrustes method
The Procrustes method is a statistical analysis technique in which shapes are matched by rigid transformations; it minimizes the mean squared error between the corresponding points with equal weights, $\frac{1}{N} \sum_{(i,j)\in\mathcal{M}} \| x_i - y_j \|^2$. In contrast, the weighted Procrustes method minimizes the weighted mean squared error, $\sum_{(i,j)\in\mathcal{M}} w_{(i,j)} \| x_i - y_j \|^2$. The weights are the inlier likelihood values for each correspondence obtained from the 6D CNN (see Figure 6.10). The weighted Procrustes method is differentiable, so the gradients can be passed through the weights during backpropagation, and it is computationally efficient for dense correspondences. The weighted Procrustes method minimizes:
\[
e^2 = e^2(R, t; w, X, Y) \qquad (6.40)
\]
\[
\phantom{e^2} = \sum_{(i,j)\in\mathcal{M}} \tilde{w}_{(i,j)} \big( y_j - (R x_i + t) \big)^2 \qquad (6.41)
\]
\[
\phantom{e^2} = \operatorname{Trace}\Big( \big(Y - R X - t\,\mathbf{1}^T\big)\, W\, \big(Y - R X - t\,\mathbf{1}^T\big)^T \Big), \qquad (6.42)
\]
where $\mathbf{1} = (1, \ldots, 1)^T$, $X = [x_1, \ldots, x_{|\mathcal{M}|}]$, and $Y = [y_{J_1}, \ldots, y_{J_{|\mathcal{M}|}}]$. $J$ is the list of indices that defines the correspondences between $x_i$ and $y_{J_i}$. $w = [w_1, \ldots, w_{|\mathcal{M}|}]$ is the weight vector and $\tilde{w} = [\tilde{w}_1, \ldots, \tilde{w}_{|\mathcal{M}|}]$ is the normalized weight vector after the non-linear transformation, $\phi$, which applies a filter. $W = \operatorname{diag}(\tilde{w})$ is the diagonal weight matrix. The weighted Procrustes method produces an initial rotation matrix, $\hat{R}$, and translation vector, $\hat{t}$. This transformation is fine-tuned by a robust loss function in a global registration procedure (next section). For a
detailed derivation of the Weighted Procrustes method, refer to [52].
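A minimal closed-form sketch of the weighted Procrustes (weighted Kabsch) solution that minimizes Equations (6.40)-(6.42) is shown below; the correspondence arrays and weights are synthetic placeholders rather than outputs of the inlier-prediction network:

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Closed-form R, t minimizing sum_i w_i ||y_i - (R x_i + t)||^2 (cf. Eqs. (6.40)-(6.42)).

    X, Y: (N, 3) corresponding points (x_i, y_{J_i}); w: (N,) non-negative weights.
    """
    w = w / w.sum()                               # normalized weights w-tilde
    mu_x, mu_y = w @ X, w @ Y                     # weighted centroids
    Xc, Yc = X - mu_x, Y - mu_y
    H = (Xc * w[:, None]).T @ Yc                  # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    t = mu_y - R @ mu_x
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R_true = R_true * np.sign(np.linalg.det(R_true))   # force a proper rotation
    t_true = np.array([0.1, -0.2, 0.3])
    Y = X @ R_true.T + t_true
    R, t = weighted_procrustes(X, Y, w=rng.uniform(0.5, 1.0, size=200))
    print(np.allclose(R, R_true, atol=1e-6), np.allclose(t, t_true, atol=1e-6))
```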
6.3.1.3 Global registration
Weighted Procrustes analysis outputs an unrefined initial transformation matrix after
scaling, translating, and rotating the inlier correspondences. In practical applications,
weighted Procrustes analysis may generate numerically degenerate solutions when the number
of inlier correspondences is insufficient, due to a low overlap ratio or noisy correspondences
between the two scans. To mitigate this adverse effect, the deep learning registration method
computes the ratio of the sum of the filtered weights to the total number of correspondences,
given by $\frac{\sum_{i=1}^{n} \phi(w_i)}{|\mathcal{M}|}$, where $\phi(\cdot)$ is the clipping function and $|\mathcal{M}|$ is the cardinality of the
correspondence set. When this ratio is less than a certain threshold, $T$, the time-consuming
but accurate RANSAC-based registration is initialized; otherwise, fine-tuning based on the
gradient method follows. Thus, DGR has a failure detection mechanism, unlike other
point cloud registration methods.
A fine-tuning module is required to minimize the registration error. To accomplish this, a
gradient-based method integrated with a robust loss function is used to refine the pose and
improve the accuracy. The global registration module initializes the pose from the weighted
Procrustes method. The cost function for the minimization problem is given by
\[
C(R, t) = \sum_{i=1}^{n} \phi(w_i)\, L\big( y_{J_i},\; R x_i + t \big), \qquad (6.43)
\]
where $w_i$ and $J_i$ are defined as in Equation (6.41) and $\phi(\cdot)$ is the filtering function given by
$\phi(w) = I[w > \tau]\, w$, which clips weights below the threshold, $\tau$, applied elementwise to
the CNN logit scores. $L(x, y)$ is the Huber loss function between $x$ and $y$. This cost function
is parameterized by the fine-tuned $R$ and $t$. For further details on the $SE(3)$ representation
and the initialization of the transformation matrix, refer to [52].
Generally, the global registration method provides coarse-to-fine registration transformation
parameters, $R$ and $t$, depending on the quality of the correspondences. Final fine registration
of the two point clouds is carried out by variants of the ICP algorithm, since the initial registration
is accurate enough for the ICP algorithm to converge to the global minimum. This study uses the
classical ICP algorithm based on the point-to-point distance to obtain the final transformation
parameters, $R_f$ and $t_f$.
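The sketch below illustrates one way to carry out a robust, gradient-based pose refinement in the spirit of Equation (6.43) using a generic least-squares solver; the rotation-vector parameterization, the Huber scale, and the threshold value are assumptions for illustration and do not reproduce the exact SE(3) optimizer of [52]:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(X, Y, w, R0, t0, tau=0.5, f_scale=0.05):
    """Robust refinement of (R, t) over filtered, weighted correspondences.

    X, Y: (N, 3) corresponding points; w: (N,) inlier likelihoods from the CNN;
    (R0, t0): initial pose, e.g. from the weighted Procrustes step above.
    """
    keep = w > tau                      # phi(w) = I[w > tau] * w, the weight filter
    Xk, Yk, wk = X[keep], Y[keep], w[keep]

    def residuals(params):
        R = Rotation.from_rotvec(params[:3]).as_matrix()
        t = params[3:]
        r = Yk - (Xk @ R.T + t)         # per-point alignment error y_Ji - (R x_i + t)
        return (np.sqrt(wk)[:, None] * r).ravel()

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), t0])
    sol = least_squares(residuals, x0, loss='huber', f_scale=f_scale)
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```

In practice, the refined pose would then be handed to a point-to-point ICP step for the final fine registration, as described above.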
6.3.2 A change detection method for 3D point clouds
Two point clouds are acquired using the Microsoft Kinect 3D sensor. Pre-processing steps
are required to make the point clouds tractable for the registration process. First, the point
clouds are cropped with a region-of-interest bounding box, followed by downsampling and
denoising. The pre-processed point clouds are then registered by the pairwise DGR method,
and the point-to-point ICP algorithm performs the final fine registration.
Figure 6.11: Data acquisition and C2C distance of the source/reference and target point clouds. (a) Point clouds acquired by the Kinect RGB-D camera; the camera is oriented with its local axes, and the registration method finds the transformation matrix that minimizes the C2C distance between the source and target points. (b) Registered point clouds; the C2C distance is the shortest distance between a point in the source and the target point cloud [174].
A 'source (or reference)
point cloud' is an initial point cloud scan obtained without any change in the system. Similarly,
a 'target point cloud' is an inspection sample obtained after the first scan; it could be a
no-change or a change sample. In the conventional point cloud registration literature, the source
is transformed to the target. However, for consistency in the inspection procedure, in
this study the target (or change) point cloud is transformed and aligned to the source point cloud
(reference). Figure 6.11a shows the data acquisition by a 3D sensor in two local coordinate
systems. These two point clouds are translated to the origin $(0, 0, 0)$ by subtracting the mean
of their $[X_{i,j}, Y_{i,j}, Z_{i,j}]^T$ coordinates, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$ index the points in the
source and target, respectively. The green and red colors represent the source and target point cloud scans
of the clamped plate. $R$ and $t$ are the transformations obtained after the DGR and final ICP
alignment (see Figure 6.11b).
6.3.2.1 Cloud-to-cloud (C2C) distance
After the alignment of the two point clouds, the minimum distance between the aligned and
target point clouds must be estimated to quantify the changes. This minimum distance
is referred to as the C2C distance between the point clouds (see Figure 6.11b). The target point
cloud changes in time; thus, it becomes the reference point cloud for change detection based
on the C2C distance. To detect the change between the two point clouds, the target, $S_t$, and the aligned,
$S_a$, the deviation between them is estimated as the Hausdorff distance, computing for each
point, $p_t$, of the target cloud, $S_t$, the distance to its nearest point, $p_a$, in the other cloud, $S_a$,
\[
d(p_t, S_a) = \min_{p'_a \in S_a} \| p_t - p'_a \|_2, \qquad (6.44)
\]
where $\|\cdot\|_2$ is the Euclidean distance and $p'_a$ is the nearest point in $S_a$ to $p_t$ in $S_t$. This distance
metric is precise compared to the average or best-fitting-plane distance between nearest
neighbors [90]. However, it is computationally slightly slower due to the k-nearest neighbor
search, where $k = 1$ for all C2C distance computations. To perform the k-nearest neighbor
search between the target and aligned point clouds, an efficient k-d tree was utilized to
partition the 3D space.
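A minimal sketch of the C2C distance computation of Equation (6.44) with a k-d tree (k = 1 nearest-neighbor search) is shown below; the point clouds are synthetic placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

def c2c_distances(target_pts, aligned_pts):
    """Point-wise C2C distance of Eq. (6.44): for every p_t in S_t, the Euclidean
    distance to its nearest neighbor in the aligned cloud S_a (k = 1 search)."""
    tree = cKDTree(aligned_pts)
    d, _ = tree.query(target_pts, k=1)
    return d

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    S_a = rng.uniform(size=(20000, 3))                        # aligned (reference) cloud
    S_t = S_a + rng.normal(scale=0.002, size=S_a.shape)       # target cloud with small noise
    d = c2c_distances(S_t, S_a)
    print(d.mean(), d.max())   # summary statistics used later for change detection
```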
6.3.2.2 Statistical change detection by ensemble averaging
Statistical analysis of the C2C distances provides inference on changes in the mechanical
system. Generally, the C2C distance is susceptible to registration errors, the camera's
sensor noise, uncertainty in the pose, and radiometric error of the mechanical system (for
example, reflectivity from a smooth steel surface). Thus, inferring the change from
a single scan leads to large uncertainty in the C2C distance, and many sample scans are
recommended. A change detection procedure is presented in Figure 6.12. The time axis
is $T_i$, where $i = 0, 1, 2, \ldots, n \in \mathbb{Z}$. $T_0$ and $T_n$, where $n = 1, 2, 3, \ldots$, are the reference and
inspection scans, respectively. All the references are without changes, while inspection samples can have
changes. The distribution of the C2C distances between two point clouds is shown as a
histogram. The histograms of the means of the reference and inspection point clouds provide
the inference of statistical changes at time steps $T_1$ to $T_n$.
Figure 6.12: Source/reference and inspection point clouds at various time steps $T_0, \ldots, T_n$; the corresponding C2C distances are depicted on the right side of the figures. (a) Two reference and a single inspection point cloud at three time steps $T_0, \ldots, T_2$. (b) Multiple reference and a single inspection point cloud from time steps $T_0, \ldots, T_i$. (c) Multiple reference and inspection point clouds from time steps $T_0, \ldots, T_n$. Blue and green histograms represent the means of the C2C distances for the reference (no-change) and inspection (change) point clouds, respectively [174].
Formally, the C2C distance is a stochastic process for an arbitrary change in the point cloud
scan. The process is non-stationary due to the conditions that govern the data acquisition and
registration of the point clouds [19].
Figure 6.13: PDF/histogram constructed by using the first point cloud as the source/reference and the remaining point clouds as targets. $\mu_0^*$ and $\mu_s^*$ are the means of all the C2C means of the no-change and change point clouds, respectively. Blue and green histograms represent the means of the C2C distances of the reference (no-change) and inspection (change) point clouds [174].
$\{X_{c2c}(t)\}$ is a random process, and $\mu_i^{X_{c2c}} = \frac{1}{K} \sum_{k=1}^{K} X_{c2c}[k]$
is the mean of the C2C distances between a source and a target point cloud (i.e., one sample).
The ensemble average of the means of the C2C distances is given by $\bar{\mu}^{X_{c2c}} = \frac{1}{N} \sum_{n=1}^{N} \mu_n^{X_{c2c}}$,
which converges to the true ensemble average as $N \rightarrow \infty$ (see Figure 6.13). The standardized change
criteria for no-change and change samples, based on the mean and standard deviation, are given by
\[
\mu_i^* = \frac{\mu_i - \mu_0}{\mu_0} \qquad (6.45)
\]
\[
\sigma_i^* = \frac{\mu_i - \mu_0}{\sigma_0}, \qquad (6.46)
\]
where $\mu_0$ and $\sigma_0$ are the ensemble mean and standard deviation of the no-change dataset, and $\mu_i$
is the ensemble mean of the $i$-th change dataset, with $i = 1, 2, 3, \ldots, n$, where $n$ is the number of change
datasets. Besides minimizing the uncertainty of the detected change, ensemble datasets can help with
large area coverage of the mechanical components.
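A short sketch of the standardized change criteria of Equations (6.45)-(6.46), computed from ensembles of per-scan C2C means, is given below; the input values are hypothetical placeholders:

```python
import numpy as np

def change_criteria(ref_scan_means, insp_scan_means):
    """Standardized change criteria of Eqs. (6.45)-(6.46).

    ref_scan_means:  per-scan means of the C2C distances for the no-change references.
    insp_scan_means: per-scan means of the C2C distances for one inspection (change) set.
    """
    mu_0 = np.mean(ref_scan_means)        # ensemble mean of the no-change dataset
    sigma_0 = np.std(ref_scan_means)      # ensemble standard deviation of the no-change dataset
    mu_i = np.mean(insp_scan_means)       # ensemble mean of the inspection dataset
    mu_star = (mu_i - mu_0) / mu_0        # Eq. (6.45)
    sigma_star = (mu_i - mu_0) / sigma_0  # Eq. (6.46)
    return mu_star, sigma_star

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    refs = rng.normal(loc=2.0, scale=0.1, size=20)   # hypothetical C2C means (mm), no change
    insp = rng.normal(loc=3.5, scale=0.2, size=20)   # hypothetical C2C means (mm), loose bolt
    print(change_criteria(refs, insp))
```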
6.4 Experimental results and discussion
6.4.1 Dataset preparation
In order to perform the change detection analysis, three datasets were prepared using the
Microsoft Kinect Version 1, which has a 3D data sensing camera. The original dataset images
are shown in Figure 6.14.
Figure 6.14: Original images of the three datasets. (a) Clamped plate with large bolts. (b) Plate with large and medium-sized bolts. (c) Car engine with alignment markers and large/medium-sized bolts [174].
Figures 6.15 to 6.17 show the 3D point clouds of a fixed plate on a clamp, a plate on the floor,
and a car engine, respectively, at various time steps. Figure 6.15 has color scans, whereas
Figures 6.16 and 6.17 do not have color information but are displayed with a color gradient.
There are 30 scans of the fixed plate for no-change and change. These
scans were produced by placing a clamped steel plate on a turn-table. For the no-change
and change datasets, 3D scans were acquired by rotating the turn-table at a small angle.
Next, there are 20 data scans for the no-change case and for the change cases with 1, 2, and 3 loose bolts.
Lastly, 20 samples are used for the engine scan. Here also, the same plate from Figure 6.16
was placed on the engine to detect changes with 1, 2, and 3 loose bolts. For all three datasets,
a region-of-interest bounding box was used to trim the surroundings of the point cloud. The average number of points in each sample of the clamped plates, the plates with large/medium sized bolts, and the engines are 15,100, 1.0000 × 10^6, and 1.1927 × 10^6 for the three 3D datasets, respectively.

Figure 6.15: Complete 3D point cloud dataset of the fixed plate. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change. (b). Change with loose bolts.
In Figure 6.15, dataset I (fixed plate), there are about seven large bolts that are bolted to a plate. In Figure 6.16, there are 8 large and 12 medium-sized bolts. Only the medium-sized bolts are loosened for the change detection process. Lastly, Figure 6.17 has the car engine, with a plate mounted on it for change detection. Here, medium-sized bolts are turned loose for change detection.
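As noted above, a region-of-interest bounding box was used to trim the surroundings of each scan; a minimal sketch of this trimming step using the Open3D library is shown below, where the file name and the bounding-box limits are illustrative placeholders.

```python
import numpy as np
import open3d as o3d

# Load one scan (hypothetical file name) and keep only the region of
# interest around the inspected component.
pcd = o3d.io.read_point_cloud("kinect_scan_000.ply")

roi = o3d.geometry.AxisAlignedBoundingBox(
    min_bound=np.array([-0.4, -0.3, 0.5]),   # meters; illustrative limits
    max_bound=np.array([0.4, 0.3, 1.5]))

pcd_trimmed = pcd.crop(roi)
print(f"{len(pcd.points)} -> {len(pcd_trimmed.points)} points after trimming")
```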
6.4.2 Change detection
The deep learning method, DGR, was used to perform the pairwise registration of the point
clouds. A pre-trained voxel size 5 cm model was utilized, as the ground-truth transformation
matrix for change detection datasets was unavailable. For the optimization, Stochastic
Gradient Descent (SGD) was used. The hyperparameters of the SGD are as follows: a
learning rate of 0.01, SGD momentum of 0.9, SGD dampening of 0.1, weight decay of 10^-4, and an exponential learning rate decay factor of 0.99. The batch size was four, and the maximum epochs were 100.

Figure 6.16: Complete 3D point cloud dataset of the plate with large/medium sized bolts. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change. (b). Change with one loose bolt. (c). Change with two loose bolts. (d). Change with three loose bolts.

Figure 6.17: Complete 3D point cloud dataset of the car engine where a plate with alignment markers is placed that includes large/medium sized bolts. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change. (b). Change with one loose bolt. (c). Change with two loose bolts. (d). Change with three loose bolts.
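For reference, the hyperparameters listed above correspond to the following PyTorch-style optimizer configuration; the stand-in model and the training loop body are placeholders, so this is a sketch of the settings rather than the DGR training script itself.

```python
import torch

model = torch.nn.Linear(3, 3)  # stand-in for the registration network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # learning rate
                            momentum=0.9,       # SGD momentum
                            dampening=0.1,      # SGD dampening
                            weight_decay=1e-4)  # weight decay
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(100):   # maximum epochs; mini-batches of four pairs per step
    # ... one training epoch would go here ...
    scheduler.step()
```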
Generally, the pairwise registration methods can work with a partial overlap ratio between
source and target point clouds. However, in this work, all the source and target models have
an overlap ratio of more than 90%, i.e., the models are complete scans. In Figures 6.18 to 6.20, all the C2C distances are w.r.t. the target point clouds.

Figure 6.18: C2C distance of the point clouds against the target point cloud for the fixed plate. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change dataset. (b). Change dataset.

This work assumes that the first point cloud
in the no-change dataset was considered a source/reference and the remaining (no-change
and change) as target point clouds. Figure 6.18 shows the C2C distances of the no-change
and change point clouds of a clamped plate dataset. For the no-change dataset, the pairwise
registration worked well on most of the samples. However, there are small misalignment
errors in two of the samples due to the incomplete registration. Three bolts (top middle,
left, and right-center) were removed in the change detection samples. Thus the difference
on these bolts was around 10 to 12 mm on average, i.e., the size of a one centimeter bolt.
A misalignment occurs in the final registration of 10 samples of the change dataset, thus
increasing the C2C distances.
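A sketch of how such a C2C heatmap can be produced after registration is given below, using Open3D's nearest-neighbor point-cloud distance and a matplotlib colormap; the file names and the identity transformation used as T_est are placeholders for the output of the registration step.

```python
import numpy as np
import open3d as o3d
import matplotlib.cm as cm

source = o3d.io.read_point_cloud("reference_scan.ply")    # no-change reference
target = o3d.io.read_point_cloud("inspection_scan.ply")   # inspection scan

T_est = np.eye(4)          # placeholder for the 4x4 transform from the registration step
source.transform(T_est)    # bring the source into the target frame

# Distance from every target point to its nearest neighbor in the source.
c2c = np.asarray(target.compute_point_cloud_distance(source))

# Color the target by the C2C distance to obtain a change heatmap.
normalized = np.clip(c2c / (c2c.max() + 1e-9), 0.0, 1.0)
target.colors = o3d.utility.Vector3dVector(cm.jet(normalized)[:, :3])
o3d.visualization.draw_geometries([target])
```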
Figure 6.19 shows the symmetrical plate example with loosened medium-sized bolts. In
the change datasets, one, two, and three bolts were loosened by 6.4 mm subsequently. In most
of the change samples, DGR was able to finely register the two point clouds and localize the
minute change of the bolts (see Figures 6.19b to 6.19d) with some exception in C2C distances
near the large bolts. However, due to the symmetry, it is not easy to solve for accurate and
complete registration. In most of the samples in Figure 6.19a, the point clouds were reversed,
and this resulted in a large difference in the C2C distance, as only a pre-trained model was
used. Similarly, the same can be seen in a few samples of loose bolt change datasets (see
Figures 6.19b to 6.19d). This happens when the non-dominant cross-section of the object
is thin (thickness of the plates in Z-direction), symmetric, and featureless. So when the
registration fails (flipping), the RMSE score will still be small and converge. In other words, when the flipping occurs, the distance among the correspondences will still be small, even though the registration method fails to find the rotations accurately due to the planarity.
Figure 6.20 shows the Honda Civic car engine example with loosened medium-sized bolts.
Similar to Figure 6.19, one, two, and three bolts were loosened by 6.4 mm subsequently in the
changed dataset. Note that the noisy bottom clusters of points were not trimmed. Thus the
C2C distances are larger due to the presence of outliers in some of the samples. In contrast
to Figures 6.18 and 6.19, the registration quality is very high due to the complex shapes and
irregularities of the engine point clouds. Visualization of the changes in the loose bolts is distinguishable, but the magnitude is faint due to the presence of the larger noisy clusters of points that increase the C2C distances (see Figures 6.20b to 6.20d). Trimming the clusters and scaling the color bar may result in better visualization. In contrast, the uncertainty change in the C2C distances is visible in Figure 6.21c.
Figure 6.19: C2C distance of the point clouds against the target point cloud for the simply supported plate. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change dataset. (b). Change with one loose bolt dataset. (c). Change with two loose bolts dataset. (d). Change with three loose bolts dataset. All the bolts are loosened by 6.4 mm.

Figure 6.20: C2C distance of the point clouds against the target point cloud for an engine. Sample numbers 1 to n from top left to right for zig-zag rows. (a). No-change dataset. (b). Change with one loose bolt dataset. (c). Change with two loose bolts dataset. (d). Change with three loose bolts dataset.
Figure 6.21 shows the Gaussian Probability Density Functions (PDFs) of the three datasets
for no-change and change samples. This study used a Gaussian distribution function due
to the number of samples in each no-change and change pool and the applicability of the
central limit theorem.

Figure 6.21: Probability density functions of the three datasets for the change detection. (a). Clamped plate. (b). Plate with large and medium sized bolts. (c). Car engine with a plate that includes alignment markers.

Some of the no-change and change Gaussian distribution curves
have a fat tail due to the misalignment in registration for a few samples. In Figure 6.21a
the difference between the no-change and change samples is distinguishable as the false
positive C2C distances are minimal in the clamped plate example. In contrast, the plate with
large bolts samples have Gaussain PDFs that have clustered means due to the incomplete
registration and larger C2C distances. In these PDFs changes are visible but the magnitude
of pure contribution of the changed C2C distances are overshadowed by the noisy C2C
distances. Lastly, in the engine example, the mean of the PDF of one loose bolt is closer to
the no-change PDF mean. However, the PDFs of two and three loose bolts have large means
and are distinguishable.
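A small sketch of how these PDFs can be obtained from the pooled C2C means is shown below; the per-sample mean values used here are illustrative stand-ins for the collected no-change and change pools.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Per-sample C2C means (mm) for one dataset; illustrative values only.
no_change_means = np.array([1.30, 1.41, 1.35, 1.48, 1.33, 1.45])
change_means    = np.array([1.52, 1.60, 1.55, 1.63, 1.58, 1.61])

x = np.linspace(1.0, 1.8, 400)
for label, means in [("No change", no_change_means),
                     ("Change - loose bolts", change_means)]:
    mu, sigma = norm.fit(means)   # maximum-likelihood Gaussian fit
    plt.plot(x, norm.pdf(x, mu, sigma),
             label=f"{label} (mu={mu:.4f}, sigma={sigma:.4f})")

plt.xlabel("Ensemble Mean (mm)")
plt.ylabel("Density")
plt.legend()
plt.show()
```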
Table 6.3 shows the standardized values of the no-change and change samples by the mean and standard deviation of the ensemble averages. Negative values appear because the no-change ensemble mean was larger than the subsequent means of the change datasets. Δμ/μ_0 and Δμ/σ_0 standardize the change in mean values. The larger values correspond to the larger change.
Dataset name             Defect: loose bolts   Mean (μ)   Std. (σ)   Δμ/μ_0    Δμ/σ_0
Clamped plate            no-change             1.3749     0.0936
                         change                1.5705     0.0514     0.1423    2.0892
Plate with large bolts   0 (no-change)         5.3284     2.6204
                         1                     1.6453     1.7210     -0.6912   -1.4056
                         2                     2.7681     2.5937     -0.4805   -0.9771
                         3                     2.3627     2.4396     -0.5566   -1.1318
Car engine               0 (no-change)         2.1320     0.3718
                         1                     2.0571     0.3344     -0.0351   -0.2015
                         2                     2.7941     0.2664     0.3105    1.7808
                         3                     2.7507     0.2877     0.2902    1.6641

Table 6.3: Standardized values of the no-change and change samples by the mean and standard deviation of the no-change ensemble average.
Standardized values can be used as the measure to confirm the changes
in infrastructures and localize them. The final decision can be made based on the analysis of
these values.
The current study is not a complete replacement for the inspection process. Instead, it
benefits human visual inspection by shortening the decision-making time. Prior information
such as the range of magnitude of the defect (e.g., the height of loose bolts from 2-8 mm)
can help minimize the false-positive C2C distances and localize the defect regions effectively.
Also, a transfer-learned or full-trained DGR can minimize registration errors.
6.4.3 Computational time analysis
All the deep learning and post-processing computations were performed on a desktop computer
using a 64-bit Ubuntu 20.04 operating system, 128 GB memory, and a 3.5 GHz 16 core
AMD Ryzen ThreadRipper 2950x processor. A single GPU Nvidia RTX 2080 Ti was used
for the registration of pairwise point clouds. The Python programs were integrated with
the MATLAB 2021a for further post-processing procedures. On average, it takes 75 seconds
to register two point clouds of a density of 1.000 × 10^6 points. Some of the computational
overhead might be due to converting point cloud data files of source and target point clouds
to text files and loading the text data and weight file of the DGR method. In addition, a negligible wall-time overhead can be caused by the MATLAB integration with the Python files.
6.4.4 Qualitative results of complex scene registration
Section 6.4.2 detailed the experimental results of the continuous-time series scans of the
mechanical parts. In this section, qualitative results of the change detection are documented.
Four complex scenes of the mechanical components utilized in this study are shown in
Figure 6.22. The point clouds in Figures 6.22a to 6.22c are produced by the photogrammetry technique, and Figure 6.22d was generated by the Kinect RGB-D camera that was mounted on a drone in robotic simulation. All these examples have source/reference (no-change) and target (change) point clouds at a single time frame.

Figure 6.22: Qualitative results of the complex mechanical scene registration. (a). Plate on floor with two loose bolts. (b). Honda Civic car engine with partial target point cloud overlap with cable displacements. (c). Full overlap of the hood region of the Toyota RAV4 car engine with steel tubes chafing. (d). Lateral side of an aircraft engine with no-change. The magenta ellipse shows the defect locations. (a)-(c) produced by photogrammetry and unscaled. (d) generated by Kinect RGB-D camera that was mounted on a drone in robotic simulation, and the units are in meters.

Figure 6.22a is a steel plate with 10 Allen key bolts, Figure 6.22b shows a Honda Civic car engine's complete hood, and the partial scan
as a reference and target point clouds, Figure 6.22c displays a complete scan of the Toyota
RAV4 car engine, and Figure 6.22d is a lateral surface point cloud scan of an aircraft engine
CAD model. Figures 6.22a to 6.22c are unscaled and unit-less, whereas Figure 6.22d is in meters. Table 6.4 shows the attributes of the qualitative dataset.

Scene type         Data generation type                       Source points (no.)   Target points (no.)
Plate              Photogrammetry                             65,536                65,536
Civic car engine   Photogrammetry                             65,536                16,384
RAV4 car engine    Photogrammetry                             1,048,576             1,048,576
Aircraft engine    Kinect RGB-D camera (in ROS simulation)    16,384                16,384

Table 6.4: Qualitative datasets attributes.
Figure 6.22a has two loose bolts where the top bolt is physically displaced by 7.62 mm
and the bottom by 5.08 mm. Figure 6.22b shows the displacement of the cable in a Honda
Civic car engine. Figure 6.22c depicts the chafing of a tube in the Toyota RAV4 engine, and Figure 6.22d displays a lateral surface point cloud scan where there is no change w.r.t. the scans. The defective regions are shown in the magenta ellipses. Although it is hidden in Figure 6.22, it is worthwhile to note that the C2C distance color map information on a target
point cloud and the reference/source point cloud are overlaid in color for all these point
clouds. To summarize, the DGR can register the point clouds of the complex scenes relatively
well. However, some misalignment exists in registration as the DGR was not transfer-learned
or fully trained on the mechanical components dataset. Lastly, some of the noisy clusters of
the point clouds (e.g., see Figure 6.22c under the hood of the Toyota RAV4 engine) suppress
the C2C magnitude of the actual defect. Thus, it requires fixed distance scanning or accurate
trimming of the point clouds to minimize the false positive C2C magnitude of noisy clusters
and better localize the defects.
6.4.5 Decimation and noise effects on the pairwise registration of
the point clouds
Point clouds produced by the 3D sensors and photogrammetry technique are highly dense,
and they are on the order of 1-100 million points. Processing these points is computationally
expensive and often leads to insufficient registration. In addition, the point clouds generated
by the sensors might be noisy. Therefore, it is necessary to quantify the registration capabilities
of the DGR method for change detection of the point clouds in the presence of decimated
and noisy points.
Figure 6.23: C2C distance variation on the first five no-change engine point clouds. (a). Points retained after decimation range from 5% to 100%. (b). The variance of the Gaussian noise ranges from 0% to 100% with a step size of 5%.
The experimental setup has the first point cloud as the source/reference point cloud and the subsequent five point clouds of the Civic engine change dataset as targets. Each of these point clouds consists of around 1.1927 × 10^6 points. The pre-trained model of the DGR weights is used for the registration. The decimation of the point clouds is achieved by random sampling with uniform probability [247]. The algorithm produces sorted indices and has a time complexity of O(n). A 5% step size of points is retained after every iteration. To create
the noisy point clouds, Equation (6.47) was used to add the Gaussian noise,
N = A_n \sqrt{\sigma^2}\, \mathcal{N}(\mu, \sigma^2),    (6.47)

Y = X + N,    (6.48)

where N is the noise, A_n = 10 is the amplitude of the Gaussian noise, σ² is the variance power of the noise (ranging from 0% to 100% at a step size of 5%), and 𝒩(·, ·) is the Gaussian random noise. Here μ = 0 and σ² = 1. X is the original point cloud, and Y is the noisy point cloud.
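The decimation and noise-injection steps of Equations (6.47) and (6.48) can be sketched as follows for an N × 3 array of points; the synthetic stand-in cloud and the function names are illustrative, while the amplitude and step sizes mirror the settings stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def decimate(points, keep_fraction):
    """Uniform random sampling; returns the retained points with sorted indices."""
    n_keep = max(1, int(keep_fraction * len(points)))
    idx = np.sort(rng.choice(len(points), size=n_keep, replace=False))
    return points[idx]

def add_gaussian_noise(points, variance_power, amplitude=10.0):
    """Eqs. (6.47)-(6.48): Y = X + A_n * sqrt(sigma^2) * N(0, 1)."""
    noise = amplitude * np.sqrt(variance_power) * rng.standard_normal(points.shape)
    return points + noise

# Sweeps mirroring Figure 6.23 (the cloud here is a synthetic stand-in).
cloud = rng.random((100_000, 3)) * 1000.0
for keep in np.arange(0.05, 1.01, 0.05):     # 5% to 100% points retained
    decimated = decimate(cloud, keep)
for var in np.arange(0.0, 1.01, 0.05):       # 0% to 100% noise variance power
    noisy = add_gaussian_noise(cloud, var)
```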
Figure 6.23 shows the decimation and Gaussian noise effects on the reference/source,
target, and combined point clouds. To quantify these effects, the C2C distance w.r.t. the
target point clouds was considered. A lower C2C distance indicates better registration of the
point clouds. Figure 6.23a shows that for all three setups, the mean C2C distance increases
as the decimation increases from 100% to 5% (note that the decimation in percentage refers
to the points retained). On average, decimation of the source/reference point cloud results
in larger C2C distances, as the reference is w.r.t. the target points. It is necessary to have denser points in the reference/source point cloud, because this leads to a smaller C2C distance after the registration. In contrast, decimation of the target point cloud yields a relatively better C2C distance, since in that case more points in the source point cloud remain relatively close to the target points. Lastly, when both source and target
point clouds are decimated, it leads to better C2C distance as the distance statistics are
well-preserved since they are uniformly sampled.
Figure 6.23b shows the effects of the addition of the zero-mean and unit variance Gaussian
noise on the source and target point clouds. For all the cases the C2C distance increases
linearly with the increased power of the noise. The same argument as mentioned above holds
here. Adding noise to the source point cloud leads to a more significant C2C distance. In addition, the noise effect on both point clouds produces a slightly larger C2C distance compared to noise on the target point cloud only, because adding noise to both point clouds creates many
uncertain points that lead to a relatively large C2C distance. Adding noise to the target
point clouds and keeping the source point clouds dense has a smaller C2C distance footprint
as the target point cloud itself is used as a reference to find the C2C distance.
In conclusion, obtaining a highly dense, coherent, and noise-free source point cloud is
recommended. Even if somewhat less dense and noisy target point clouds exist, the C2C
distances are not much affected. In addition, a decimation of 20-40% on a target point cloud
244
has minor registration errors. For the noise, an addition of around 5% to the target and
target-plus-source point clouds led to smaller C2C distance. Therefore, statistically, adding a
small percentage of the noise may result in smaller C2C distance.
6.4.6 Practical problems in data acquisition and 3D point clouds
registration
Data acquisition plays an essential role in obtaining accurate point clouds. Three-dimensional
sensors have both advantages and limitations when it comes to acquiring data. One of the
most widely used sensors is a Laser scanner (LiDAR). Laser light has important properties
such as coherence, directivity, and divergence of the beam. Ground-based LiDAR, known as
Terrestrial Laser Scanning (TLS), emits a narrow laser beam with high-frequency pulses to
the object of interest. By measuring the round-trip of the pulses (time-of-flight) between the
sensor and target, the position of each point is determined. Some of the main advantages of
a TLS are the fast data acquisition and long-ranging capability. However, spurious effects
in range or intensity data can downgrade the quality of the point cloud. One such cause is
known as “edge effect” [135, 26]. This is shown in Figure 6.24. In addition, the sharp corners
and edges are smoothed out.
Figure 6.24: Dispersion of the points at the edge of a plate known as “edge effect” caused by
a LiDAR scanner. The red boxes show the edge effect area.
There are two main reasons for the edge effect. First, when the laser beam is divided
by an object and reflected by the surrounding objects (i.e., the beam stops colliding with
the target and starts colliding with the background). Second, when the laser beam collides
with an object, a part of the signal is lost or weakened due to reflection. In other words, the
energy of the laser beam, after being reflected from the object and returning, is incorrectly
calculated. The edge effect is the difference between the recorded intensity and the expected
intensity after colliding with the object. In addition, the laser divergence, or beam divergence
- the variable diameter of the laser beam, which is proportional to the distance between the
laser scanner and the target - will contribute to the edge effect. Therefore, the quality of the
obtained point cloud is dependent on the distance. Geometrical attributes need to be known
through calibration of the LiDAR at the edges to avoid the edge effect.
Scanning the same object multiple times with the same sensor may not lead to the same
point clouds. This depends on the viewpoint, line of sight of the 3D camera, and occluded
regions of the objects. Figure 6.25 shows a typical example of the uncertainty of the data
acquisition due to the change in viewpoint case. These point clouds were generated by using the Microsoft Kinect version 1 RGB-D camera. This camera has a 3D working distance of 0.8 to 4 meters. Inherently, the 3D camera can scan deeper regions.

Figure 6.25: Uncertainty in data acquisition. While scanning a complex scene, some portions may or may not be acquired from the deeper regions. Hence, it requires large samples to cover all the possibilities. (a). The first scan of a car engine. (b). A second scan of the same car engine at a different viewpoint. Pink circles A, B, and A′, B′ show the same regions of the two scans with uncertain point clusters [174].

Although in Figure 6.25,
the car engine has a finite surface area, due to the partial scans of the occluded regions, the
scans have a noise cluster of points (known as false positives w.r.t. the second point cloud)
when compared with the second scan of the same engine. In other words, some portions may
or may not be entirely acquired from the deeper regions. Thus, minimizing the uncertainty
in data acquisition while scanning a complex scene requires large samples to cover all the
possibilities and combine the point clouds. Lastly, even a fixed distance scanning can also
reduce the uncertainty in the data acquisition.
CAD models converted to the point clouds are great resources for reference or initial
scan. These point clouds are dense, accurate, and have all the details preserved (sharp edges
and curves). For example, Figure 6.26a shows the registered point clouds obtained from the
Solidworks model and a Kinect scan. The details of the bolts of the Solidworks model are
well preserved. In contrast, the point cloud scanned by a Kinect has a geometric distortion
due to the lack of resolution of the sensor. Figure 6.26b depicts the point cloud of the Kinect
sensor only, for two different scans. It can be seen that the scanned point clouds have similar
distortion patterns; hence, it is suitable to estimate the C2C differences by comparing a
scanned point cloud to another scanned point cloud of the same sensor.
The 3D point clouds registration method, DGR, generally fails when symmetries, a low
overlap of neighboring point clouds, and repetitive scene parts are present. For example, for
the symmetry case, the plate examples in Figure 6.19 show how some of the C2C distances are
very large even after the registration. This happened due to the flipping of the plates during
the registration process. However, having the location information and markers can register
the point clouds accurately. In addition, when the overlap ratio of neighboring point clouds is
small, the correspondences between the registration pairs will be minimal, thus, minimizing
the chance of perfect registration. Therefore, it is recommended to have at least 40% for the
overlap ratio. Lastly, repetitive patterns cause the degenerate solution in registration, as it is
hard to obtain the globally consistent final reconstruction and camera poses for all fragments.
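One hedged way to check whether a pair of scans satisfies the recommended overlap before registration is to count the fraction of points whose nearest neighbor in the other cloud lies within a tolerance; a minimal k-d tree sketch is given below, and the tolerance value is an assumption that depends on the sensor and units.

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap_ratio(source_xyz, target_xyz, tol=0.01):
    """Approximate overlap: fraction of source points with a target neighbor
    closer than `tol` (same units as the point clouds)."""
    dists, _ = cKDTree(target_xyz).query(source_xyz, k=1)
    return float(np.mean(dists < tol))

# Registration is attempted only when the estimated overlap is adequate,
# e.g. overlap_ratio(src, tgt) >= 0.40, per the recommendation above.
```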
Figure 6.26: In practical field applications, the scanned point clouds (Kinect) usually contain the noise of geometry distortion due to the low resolution of the sensor. However, the scanned point clouds have similar distortion patterns; hence, it is suitable to estimate C2C differences by comparing a scanned point cloud to another scanned point cloud. (a). Kinect and Solidworks model point cloud differences. Blue is Solidworks, and red is Kinect. (b). Two Kinect scans of the same object [174].
6.5 Summary and conclusions
Defect detection on a three-dimensional surface of a mechanical system is a prime and
challenging problem. Generally, condition assessments of these systems are based on visual
inspection and carried out by trained personnel. Visual inspection is often time-consuming
and labor-intensive. In the manufacturing and aerospace industries, maintenance overhead
of the machinery parts can induce delays in production, delivery, and financial pressure.
Automatic or semi-automatic visual inspection based on computer vision methods allows for detecting defects, and localizing and quantifying them by image or multimodal data
analysis such as SFM. This reduces the errors caused by the subjectivity of human expertise.
This study proposed vision-based semi-autonomous spatio-temporal change detecting,
locating, and probabilistic reliability quantification methods for mechanical systems. To
accomplish accurate change detection and localization, dense 3D point clouds were used.
Manual processing and analysis of large numbers of the point clouds are labor-intensive,
time-consuming, and have high operational costs for specialized inspectors. Thus, a robust
‘Hausdorff’ distance measure to quantify the changes was utilized. Most of the prior work
focused on only detecting and quantifying the changes at a single time frame. In this work, a
spatio-temporal evolution (time-history) of the changes along the time domain is presented.
This allows greater flexibility for the visual inspectors to record the specific changes in the
components. Furthermore, the most widely used ICP algorithm for the 3D point clouds suffers
from converging to local minima. Thus, a deep learning-based global registration is adapted
to negate the bad registration of the point clouds. In addition, the uncertainty in changes
depends on the accuracy of the registration process. To overcome this, an ensemble averaging
of the Cloud-to-Cloud (C2C) distances over multiple point cloud registrations of no-change
and change is leveraged to minimize the change detection uncertainty, thus providing the probabilistic reliability on change quantification. Lastly, a comprehensive visualization tool
provides a heatmap for changes in the mechanical system, which helps a trained inspector to
glance through the probable defective mechanical parts quickly.
Three temporal datasets (a clamped plate, a plate with large bolts, and a Honda Civic engine) were used to detect and localize the loose bolts. The proposed method was able to accurately register the source/reference and target point clouds for the clamped plate and Civic engine examples. In contrast, some of the point clouds were flipped, and the current method failed to accurately register the large bolted plate examples due to the symmetry. However, for all three datasets, quantification by the probabilistic reliability curves made the changes distinguishable when the ensemble averages of the C2C distances were considered.
Moreover, in this study, pre-trained weights were used for the deep learning-based DGR
method. Nonetheless, transfer learning or the full training of the DGR will perform well
on challenging problems. In addition, four complex mechanical system datasets were also
tested for qualitative evaluation of the point cloud registration method, DGR. In all these
experiments, the changes such as loose bolts in a plate, displacement of the cables, and
chafing under the car engine hood are visible in the visualization module.
Point clouds produced from the SFM and close-range 3D photography are highly dense (of
order 1-100 million points) and require considerable computational resources. The decimation
of these points is necessary. It was found that decimating both source and target point clouds
leads to smaller average C2C distances of 2.14 mm, whereas decimating the source point
clouds increases the C2C distance by 4.44 mm over 0-95% of decimation. It is recommended
to decimate 5-15% to keep the smaller C2C distances. Noise effects increase linearly for
source, target, and combined point clouds. The average C2C distances are 1.85 and 5.52 mm
when noise is added to the target and source point clouds, respectively. For both decimation
and noise cases, the C2C distance reference is w.r.t the target points; thus, modifying source
point clouds has adverse effects. In a nutshell, it is essential to keep the source point cloud
dense and noise-free. In addition, a smaller presence of noise on the target point cloud has a
minimal adverse effect on the C2C distance.
Data acquisition plays an essential role in obtaining accurate point clouds. Three-
dimensional sensors such as LiDAR suffer from the “edge effect” due to the reflection of the laser
beam by the surrounding objects or loss of the signal. In addition, there exists an uncertainty
in the scanned fragments while scanning a complex scene, as some portions may or may
not be acquired from the deeper regions. Hence, it requires large samples to cover all the
possibilities. Moreover, scanning with a low-resolution sensor (Microsoft Kinect) can cause
geometric distortions compared to the CAD-generated 3D point clouds. Thus it is necessary
to scan the source (no-change) and target (change) components with the same sensor or
use high-resolution data acquisition tools. Lastly, the proposed method has a limitation
with filtering out the false point cloud registration errors that lead to more considerable
C2C distances, overshadowing the actual changes. This uncertainty can be minimized by
leveraging the color information of the 2D images and segregating the valid regions of a point
cloud based on changes in the images. Again, a 2D convolutional neural network will be an
excellent fit to localize the changes.
6.6 Future work
Generally, while scanning complex scenes, some portions may or may not be acquired from
the deeper regions. Hence, it requires large samples to cover all the possibilities. Thus,
integrating multiple scans of the same scene is essential to minimize the uncertainty in data
acquisition. Furthermore, multiple fragments need to be registered to perform multiview
registration of the point clouds of larger objects such as an aircraft engine. This ensures that a complete scene model of both the source (no-change) and the target (change) can be reconstructed for change detection, localization, and tracking the evolution of defects.
Moreover, in non-destructive testing (NDT), the most widely used metric for change
detection is Probability-of-Detection (POD), which quantifies the uncertainty in changes. The
autonomous method that extends the proposed technique can be used for defect quantification
and POD estimation. Lastly, to reduce the misalignment of the point clouds, non-uniformly
sampled point clouds can be utilized. As a result, the sharp changes in the point clouds (e.g.,
bolts, dents, and others) have more points, and the flat regions are heavily decimated. This
helps minimize the registration RMSE and register the point clouds accurately.
Chapter 7
Summary, conclusions and future work
7.1 Summary and conclusions
Civil, mechanical, and aerospace infrastructures undergo structural changes due to the applied
loads and environmental forces such as earthquakes, wind, and ocean waves. These factors
deteriorate the infrastructures during their service time. The cost of civil infrastructure in the United States alone adds up to more than $20 trillion. According to the 2021 Infrastructure
Report Card published by the American Society of Civil Engineers (ASCE), the overall US
infrastructure was graded with a GPA of C-. This shows the need for improvement of the condition of the infrastructure. Vibration-based online or offline methods of structural health
monitoring were the most widely used techniques to continuously monitor changes to the
material and geometric properties of the infrastructures. On the other hand, manual visual
inspection by trained personnel is still the main form of assessing the civil infrastructure’s
condition. However, inspection procedures are subjective, tedious, time-consuming, and
depend on the trained personnel skills. Furthermore, regular periodic manual visual inspection
is expensive due to the cost associated with the assessment procedure.
In the last two and a half decades, vision-based methods have gained the attention of the
research community due to the cost reduction of the sensors (color and 3D cameras) and the
advent of unmanned aerial vehicles such as the MAVs. Furthermore, ground-breaking work
in the artificial intelligence community and the development of faster graphical processing
units have also added a new dimension to the development of state-of-the-art algorithms in
the fields of computer vision and image processing. This study developed autonomous and
semi-autonomous methods for the condition assessment and change detection of the evolving
civil, mechanical, and aerospace infrastructures, adapting some of the cutting-edge computer
vision and image processing methodologies.
An artificial intelligence-based data-driven and supervised pattern recognition/decision
system requires the careful supervision of datasets by accurate labeling of the positive and
negative examples (crack and non-crack pixels). Constructing labeled datasets requires a
conscientious effort by a human operator, and this process is tedious, highly time-consuming,
and costly. This study proposed a synthetic crack generation algorithm to produce linear
and tortuous single-stranded cracks. In addition, elastic deformation was incorporated to
expand the synthetic crack dataset. These synthetically generated samples were used as the
training examples in the supervised learning methods and tested on real-world datasets. Crack
segmentation results with a best F1-score of 78.41% on a real-world dataset, demonstrated
the potential and promising future of the synthetic crack generation algorithm. Although
single-stranded cracks were used for the supervised training, the generalization capability of the
classifiers to segment the multi-stranded and branched real-world cracks was evident in the
results.
Generally, trained personnel draw the cracks to localize them for documentation and maintenance purposes in on-site visual inspection. To overcome this, this study proposed two
autonomous segmentation methods of cracks on concrete images. A gradient-based multiscale
fractional anisotropy tensor, and a deep convolutional neural network, CrackDenseLinkNet,
were introduced. In the MFAT method, the principal direction of the Hessian matrix is
leveraged to segment the crack pixels. Furthermore, the concrete cracks are surrounded by
textural noise; anisotropic diffusion filtering is adapted to suppress this. This work provides
the guidelines for the selection of anisotropic diffusion parameters to smooth the textural noise.
In addition, the gradient-based methods were compared against the state-of-the-art segmenta-
tion CNN, DeepCrack. Results show that the proposed anisotropic diffusion-based refinement
method improves the crack segmentation F1-score by 0.37%, 0.47%, and 0.54% for ANN,
k-NN, and SVM classifiers, respectively. Furthermore, the proposed method outperformed
the vesselness and morphological method by 4.38% and 7.00% respectively, for the F1-score
for the SVM classifier, across the datasets. A deep encoder CNN (DenseNet) and a modified LinkNet decoder network were adapted to reduce the number of learning parameters to speed
up the training process without compromising the segmentation accuracy. A combined focal
and dice loss was used to counter the class imbalance problem in the crack dataset. The
proposed CNN, CrackDenseLinkNet, outperformed the best state-of-the-art method by 2.36%
for the F1-score on average across four datasets.
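A compact PyTorch sketch of such a combined focal and dice loss for binary crack segmentation is shown below; the weighting between the two terms and the focal parameters are illustrative assumptions, not the exact values used for CrackDenseLinkNet.

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, target, alpha=0.25, gamma=2.0, w_dice=1.0, eps=1e-6):
    """Combined focal + dice loss for a binary (crack / non-crack) mask.

    logits: raw network output, shape (B, 1, H, W); target: same shape, in {0, 1}."""
    prob = torch.sigmoid(logits)

    # Focal term: down-weights easy (well-classified) pixels.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    focal = (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

    # Dice term: directly rewards overlap of predicted and true crack pixels.
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

    return focal + w_dice * dice
```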
An autonomous method for tracking the evolution of cracks can assist in predicting
the overall health and behavior of the structure by prognosis-based computational damage
models. Thus, accurate location and the crack’s physical properties can be used as the initial
condition for these damage models. Furthermore, it allows on-site inspectors to track the
evolution of cracks through time, thereby facilitating more accessible maintenance methods
and devising better risk management plans. This study proposed an efficient feature-based
image registration method for scene reconstruction and crack change tracking. In addition, the
fixed camera assumption was relaxed and the images obtained are subject to small oscillations
in the camera viewpoint. Prior literature considered the pairwise searching of images from
the database. In this work, an efficient nearest neighbor searching of the image database
using a k-d tree is adapted to speed up the image search and matching. Lastly, the probabilistic
measure of the reliability of the analysis results in detecting, locating, quantifying, and crack
evolution was introduced to aid the prognosis damage detection models.
Images in the previous database are searched based on the nearest coordinates (waypoints)
to the current image using the k-d tree. The SIFT/SURF keypoints are extracted using
the nearest neighbor images and matched against the current image. Putative matches are
rejected by the RANSAC method, and the homographies are estimated between the image
matches. Furthermore, the unrefined homographies are optimized by the bundle adjustment
technique. This is followed by the gain/exposure compensation and multi-band blending
to produce a seamless image reconstruction. The reconstructed image is matched with the
current image for the detection and segmentation of cracks. Crack physical properties such
as width, length, and area are estimated, and the probabilistic measure of the reliability of
the analysis results was used to detect the changes. A synthetic dataset with six different
scenarios of crack growth and a real-world dataset with relatively thin, medium, and thick
sized cracks was utilized to assess the capability of the proposed method. Experimentally the
proposed method correlates the crack width and length well against the ground truth of the
synthetic crack samples. Although the CrackDenseLinkNet was not fully trained or transfer
learned, the proposed method of change detection results and semantic segmentation metrics
were adequate when tested on the synthetic images collected from a MAV and real-world
datasets. Additionally, the PDFs of the proposed method match well with ground-truth
datasets acquired at two different time periods for real-world datasets. Furthermore, the
F1-scores demonstrate the promising generalization capabilities of the CrackDenseLinkNet
CNN on real-world datasets. Overall, the proposed approach can register, align the succeeding
images well to the current/reference image, and effectively segment the cracks. Lastly, it
demonstrates promising outcomes and robustness of the method to detect, localize, quantify
and track the evolving changes of cracks on concrete surfaces while simultaneously providing
a probabilistic measure of the reliability of the crack physical properties such as width, length,
and area.
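The keypoint extraction, matching, and RANSAC-based homography estimation summarized above can be sketched with OpenCV as follows; the image file names and the ratio-test threshold are illustrative, and the bundle adjustment, exposure compensation, and blending steps are omitted.

```python
import cv2
import numpy as np

ref = cv2.imread("nearest_neighbor_image.png", cv2.IMREAD_GRAYSCALE)  # from the k-d tree search
cur = cv2.imread("current_image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(ref, None)
kp2, des2 = sift.detectAndCompute(cur, None)

# Lowe's ratio test keeps the putative matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC rejects outlier correspondences while estimating the homography.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
```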
Sewer pipelines span more than 700,000 miles in the United States alone. The current
inspection standard requires a NASSCO certified inspector to inspect the pipeline using
a robot equipped with a CCTV camera. On average, the inspection company Pro-Pipe
scans roughly 5,000 miles of pipeline in a year. Thus, it is necessary to develop
autonomous systems to speed up the inspection process. Another part of this study evaluated
the feature-descriptors for the autonomous defect classification of sewer pipe CCTV image
frames using the visual-bags-of-words model. A moderate (by today’s standard) dataset of
14,404 images was constructed, and three supervised classifiers, ANN, KNN, and SVM, were
trained and tested. Results showed that the SURF-based dense grid descriptor performed
well with an average F1 score of 78.68% across all the classes. This work demonstrated the
classification capability of the visual-bags-of-words model, and it is comparable to some of
the deep learning classification networks.
Defect detection on 3D mechanical systems is a challenging task due to the cluttered
regions and occlusion of the components. This study proposed a semi-autonomous 3D
point cloud registration-based spatio-temporal change detecting, locating, and probabilistic
reliability quantification method. First, three-dimensional point clouds were produced by
using an off-the-shelf 3D camera and photogrammetry approach. These 3D point clouds were
acquired at different time periods. Then, deep learning-based global registration and ICP
methods were employed to register the point clouds globally and finely. Furthermore, the
ensemble average of the C2C distances was calculated to estimate the change detection in
images taken at different times like loose bolts, displacement, and chafing. In a laboratory
setting, experimental results were promising and showed the potential of the ensemble
averaging method for the change detection in 3D mechanical components. However, due
to the lack of labeled 3D point clouds, this work utilized pretrained weights for the deep
learning-based registration method. Using fully or transfer learned/fine-tuned weights could
have improved some registration failures in symmetric mechanical components. Lastly, the
qualitative results were impressive on the real-scale mechanical components and a 3D CAD
model sampled point clouds.
7.2 Future work
In this study, the synthetic crack generation methodology was developed to replace tedious,
manual labeling work and augment the datasets. As part of the future work, developing a
synthetic crack generation algorithm capable of producing realistic multi-stranded, branched
or surface cracks is desirable to assist the deep learning-based semantic segmentation methods.
Furthermore, exploiting the best features of both deep learning and anisotropic filtering
methods, a learnable texture suppression method that aids the crack segmentation will be
worthwhile. For a semantic segmentation CNN, developing a topology-preserving loss function
for crack segmentation would be advantageous in the future.
A simulation environment was used to test the crack evolution and tracking methodology.
In the future, a real-world experimental setup will be constructed to measure the efficacy of
the proposed system in online and offline methods. In addition, it is impractical to conduct
a large-scale experiment to come up with guidelines for unmanned aerial vehicle-based
inspections. As future work, uncertainty quantification of crack tracking and evolution due
to the stochastic nature of the errors in location, camera viewing angles, lighting condition,
motion blur, and aerial vehicle stability are interesting topics to study using the simulation
environment and models. Lastly, the orthogonal projection method overestimates the crack
width at the branch nodes. Debranching and recombining at the branches can help in
producing better crack width estimates.
As an ongoing work, a single-input-multiple-output deep learning-based architecture
is under development. Consequently, the video frame classification of the pipe materials,
PACP codes (pipe defects), a regression model for the PACP condition grading, and water-
level estimation will be accomplished using this architecture. Lastly, as a future study,
multiview registration of the large-scale mechanical component point clouds will be carried out to reconstruct the scene to detect, locate, and quantify the defects. In addition, probability-
of-detection is the most widely used metric for change detection in non-destructive testing.
Therefore, solving this challenging autonomous task using vision-based and image processing
methods is also an interesting topic for further study. Furthermore, non-uniformly sampled
point clouds can be utilized to reduce the misalignment of the point clouds by minimizing
the registration RMSE using weights for flat and gradient dominant regions.
The above studies based on adapting cutting-edge computer vision, image processing, and
machine learning methods demonstrate the potential and promising approaches in autonomous
vision-based inspection and change detection in evolving civil, mechanical, and aerospace
infrastructures. The further development of cost-effective and high-resolution sensors can be adopted to increase the precision and accuracy of the proposed methods in the future.
Bibliography
[1] Hamdi Ben Abdallah, Jean-José Orteu, Igor Jovancevic, and Benoit Dolives. Three-
dimensional point cloud analysis for automatic inspection of complex aeronautical
mechanical assemblies. Journal of Electronic Imaging, 29(4):041012, 2020.
[2] I. Abdel-Qader, O. Abudayyeh, and M.E. Kelly. Analysis of edge-detection techniques
for crack identification in bridges. Journal of Computing in Civil Engineering, 17(4):
255–263, 2003. doi: 10.1061/(ASCE)0887-3801(2003)17:4(255).
[3] Mohamed Abdelbarr, Yulu Luke Chen, Mohammad R Jahanshahi, Sami F Masri,
Wei-Men Shen, and Uvais A Qidwai. 3d dynamic displacement-field measurement for
structural health monitoring using inexpensive rgb-d based sensor. Smart materials
and structures, 26(12):125016, 2017.
[4] Markus W Achtelik. Advanced closed loop visual navigation for micro aerial vehicles.
PhD thesis, ETH Zurich, 2014.
[5] Muhammad Ali Akbar, Uvais Qidwai, and Mohammad R Jahanshahi. An evaluation
of image-based structural health monitoring using integrated unmanned aerial vehicle
platform. Structural Control and Health Monitoring, 26(1):e2276, 2019.
[6] Haifa F Alhasson, Shuaa S Alharbi, and Boguslaw Obara. 2d and 3d vascular structures
enhancement via multiscale fractional anisotropy tensor. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 0–0, 2018.
[7] Daniela Ali-Sisto and Petteri Packalen. Forest change detection by using point clouds
from dense image matching together with a lidar-derived terrain model. IEEE Journal
of Selected Topics in Applied Earth Observations and Remote Sensing, 10(3):1197–1206,
2016.
[8] Muhammad Aliakbar, Uvais Qidwai, Mohammad R Jahanshahi, Sami Masri, and Wei-
Min Shen. Progressive image stitching algorithm for vision based automated inspection.
In 2016 International Conference on Machine Learning and Cybernetics (ICMLC),
volume 1, pages 337–343. IEEE, 2016.
[9] Mohamad Alipour and Devin K Harris. Increasing the robustness of material-specific
deep learning models for crack detection across different materials. Engineering Struc-
tures, 206:110157, 2020.
[10] Yasuhiro Aoki, Hunter Goforth, Rangaprasad Arun Srivatsan, and Simon Lucey. Point-
netlk: Robust & efficient point cloud registration using pointnet. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7163–7172,
2019.
[11] ASCE. ASCE report card for america‘s infrastructure (report card), 2017. http:
//www.infrastructurereportcard.org/.
[12] ASCE. ASCE report card for america‘s infrastructure (report card), 2021. http:
//www.infrastructurereportcard.org/.
[13] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM
Transactions on graphics (TOG), volume 26, page 10. ACM, 2007.
[14] Haiying Bai, Noriko Yata, and Tomoharu Nagao. Automatic finding of optimal image
processing for extracting concrete image cracks using features actit. IEEJ Transactions
on Electrical and Electronic Engineering, 7(3):308–315, 2012. ISSN 1931-4981. doi:
10.1002/tee.21732. URL http://dx.doi.org/10.1002/tee.21732.
[15] Walid Balid, Hasan Tafish, and Hazem H Refai. Versatile real-time traffic monitoring
system using wireless smart sensors networks. In 2016 IEEE Wireless Communications
and Networking Conference, pages 1–6. IEEE, 2016.
[16] Seongdeok Bang, Hongjo Kim, and Hyoungkwan Kim. Uav-based automatic generation
of high-resolution panorama at a construction site with a focus on preprocessing for
image stitching. Automation in construction, 84:70–80, 2017.
[17] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features.
In European conference on computer vision, pages 404–417. Springer, 2006.
[18] Hamdi Ben Abdallah, Igor Jovančević, Jean-José Orteu, and Ludovic Brèthes. Auto-
matic inspection of aeronautical mechanical assemblies by matching the 3d cad model
and real 2d images. Journal of Imaging, 5(10):81, 2019.
[19] Julius S Bendat and Allan G Piersol. Random data: analysis and measurement
procedures, volume 729. John Wiley & Sons, 2011.
[20] Alexander C Berg, Tamara L Berg, and Jitendra Malik. Shape matching and object
recognition using low distortion correspondences. Citeseer, 2005.
P. J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992. doi: 10.1109/34.
121791.
[22] Sutanu Bhowmick, Satish Nagarajaiah, and Ashok Veeraraghavan. Vision and deep
learning-based algorithms to detect and quantify cracks on concrete surfaces from uav
videos. Sensors (Basel, Switzerland), 20(21), November 2020. ISSN 1424-8220. doi:
10.3390/s20216299. URL https://europepmc.org/articles/PMC7663834.
[23] Peter Biber and Wolfgang Straßer. The normal distributions transform: A new approach
to laser scan matching. volume 3, pages 2743 – 2748 vol.3, 11 2003. ISBN 0-7803-7860-1.
doi: 10.1109/IROS.2003.1249285.
[24] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
ISBN 0387310738.
[25] Michael J Black, Guillermo Sapiro, David H Marimont, and David Heeger. Robust
anisotropic diffusion. IEEE Transactions on image processing, 7(3):421–432, 1998.
[26] Fabiane Bordin, Fabrício Galhardo Müller, Elba Calesso Teixeira, Sílvia Beatriz Alves
Rolim, Francisco Manoel Wohnrath Tognoli, Luiz Gonzaga da Silveira Júnior, Mau-
rício Roberto Veronez, and Marco Scaioni. An intensity recovery algorithm (ira) for
minimizing the edge effect of lidar data. European Journal of Remote Sensing, 49(1):
301–315, 2016.
[27] Assya Boughrara, Igor Jovančević, Hamdi Ben Abdallah, Benoît Dolives, Mathieu
Belloc, and Jean-José Orteu. Inspection of mechanical assemblies based on 3d deep
learning approaches. In International Conference on Quality Control by Artificial
Vision, volume 1179407, page 16, 2021.
[28] Matthew Brown and DavidG. Lowe. Automatic panoramic image stitching using
invariant features. International Journal of Computer Vision, 74(1):59–73, 2007. ISSN
0920-5691. doi: 10.1007/s11263-006-0002-3. URL http://dx.doi.org/10.1007/
s11263-006-0002-3.
[29] Matthew Brown, David G Lowe, et al. Recognising panoramas. 2003.
[30] Lorenzo Bruzzone and Sebastiano B Serpico. An iterative technique for the detection
of land-cover transitions in multitemporal remote-sensing images. IEEE transactions
on geoscience and remote sensing, 35(4):858–867, 1997.
[31] Peter J Burt and Edward H Adelson. A multiresolution spline with application to
image mosaics. ACM Transactions on Graphics (TOG), 2(4):217–236, 1983.
[32] Eduardo Castro, Jaime S Cardoso, and Jose Costa Pereira. Elastic deformations for
data augmentation in breast cancer mass detection. In 2018 IEEE EMBS International
Conference on Biomedical & Health Informatics (BHI), pages 230–234. IEEE, 2018.
Young-Jin Cha, Kisung You, and Wooram Choi. Vision-based detection of loosened bolts
using the hough transform and support vector machines. Automation in Construction,
71:181–188, 2016.
[34] Young-Jin Cha, Wooram Choi, and Oral Büyüköztürk. Deep learning-based crack
damage detection using convolutional neural networks. Computer-Aided Civil and
Infrastructure Engineering, 32(5):361–378, 2017.
[35] Krisada Chaiyasarn, Tae-Kyun Kim, Fabio Viola, Roberto Cipolla, and Kenichi Soga.
Distortion-free image mosaicing for tunnel inspection based on robust cylindrical surface
estimation through structure from motion. Journal of Computing in Civil Engineering,
30(3):04015045, 2015.
[36] Peter C Chang, Alison Flatau, and SC Liu. Health monitoring of civil infrastructure.
Structural health monitoring, 2(3):257–267, 2003.
[37] Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representa-
tions for efficient semantic segmentation. In 2017 IEEE Visual Communications and
Image Processing (VCIP), pages 1–4. IEEE, 2017.
[38] Fu-Chen Chen and M. Jahanshahi. Arf-crack: rotation invariant deep fully convolutional
network for pixel-level crack detection. Machine Vision and Applications, 31, 2020.
[39] Fu-Chen Chen and Mohammad R Jahanshahi. Nb-cnn: Deep learning-based crack
detection using convolutional neural network and naïve bayes data fusion. IEEE
Transactions on Industrial Electronics, 65(5):4392–4400, 2017.
[40] Fu-Chen Chen and Mohammad R Jahanshahi. Real-time crack detection from nuclear
inspection videos using fully convolutional network and parametric data fusion. IEEE
Transactions on Instrumentation and Measurement, 2019.
[41] Fu-Chen Chen, Mohammad R Jahanshahi, Rih-Teng Wu, and Chris Joffe. A texture-
based video processing methodology using bayesian data fusion for autonomous crack
detection on metallic surfaces. Computer-Aided Civil and Infrastructure Engineering,
32(4):271–287, 2017.
[42] Hanshen Chen, Huiping Lin, and Minghai Yao. Improving the efficiency of encoder-
decoder architecture for pixel-level crack detection. IEEE Access, 7:186657–186670,
2019.
[43] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range
images. Image and vision computing, 10(3):145–155, 1992.
[44] Yulu Luke Chen, Mohamed Abdelbarr, Mohammad R Jahanshahi, and Sami F Masri.
Color and depth data fusion using an rgb-d sensor for inexpensive and contactless
dynamic displacement-field measurement. Structural Control and Health Monitoring,
24(11):e2000, 2017.
[45] Z. Chen, R.R. Derakhshani, C. Halmen, and J. T. Kevern. A texture-based method
for classifying cracked concrete surfaces from digital images using neural networks.
In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages
2632–2637, 07 2011. doi: 10.1109/IJCNN.2011.6033562.
[46] ZhiQiang Chen and Tara C Hutchinson. Image-based framework for concrete surface
crack monitoring and quantification. Advances in Civil Engineering, 2010, 2010.
[47] Jack CP Cheng and Mingzhu Wang. Automated detection of sewer pipe defects in closed-
circuit television images using deep learning techniques. Automation in Construction,
95:155–171, 2018.
[48] Jongseong Choi, Chul Min Yeum, Shirley J Dyke, and Mohammad R Jahanshahi.
Computer-aided approach for rapid post-event visual evaluation of a building façade.
Sensors, 18(9):3017, 2018.
[49] Wooram Choi and Young-Jin Cha. Sddnet: Real-time crack segmentation. IEEE
Transactions on Industrial Electronics, 67(9):8016–8025, 2019.
G. K. Choudhary and S. Dey. Crack detection in concrete surfaces using image processing,
fuzzy logic, and neural networks. In Advanced Computational Intelligence (ICACI),
2012 IEEE Fifth International Conference on, pages 404–411, 10 2012. doi: 10.1109/
ICACI.2012.6463195.
[51] Christopher Choy, Jaesik Park, and Vladlen Koltun. Fully convolutional geometric
features. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 8958–8966, 2019.
[52] Christopher Choy, Wei Dong, and Vladlen Koltun. Deep global registration. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 2514–2523, 2020.
[53] H-C Chung, J Liang, S Kushiyama, and M Shinozuka. Digital image processing for
non-linear system identification. International Journal of Non-linear mechanics, 39(5):
691–707, 2004.
[54] Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cédric
Bray. Visual categorization with bags of keypoints. In In Workshop on Statistical
Learning in Computer Vision, ECCV, pages 1–22, 2004.
Boguslaw Cyganek and J. Paul Siebert. An introduction to 3D computer vision techniques
and algorithms. John Wiley & Sons, 2011.
[56] Gabriel De Erausquin and Lucia Alba-Ferrara. What does anisotropy measure? insights
from increased and decreased anisotropy in selective fiber tracts in schizophrenia. Fron-
tiers in Integrative Neuroscience, 7:9, 2013. ISSN 1662-5145. doi: 10.3389/fnint.2013.
00009. URL https://www.frontiersin.org/article/10.3389/fnint.2013.00009.
[57] Jean-Marc Decitre, Michel Lemistre, and François Lepoutre. Defects localization in
metallic structures by magneto optic image processing. AIP Conference Proceedings, 509
(1):889–896, 2000. doi: http://dx.doi.org/10.1063/1.1306139. URL http://scitation.
aip.org/content/aip/proceeding/aipcp/10.1063/1.1306139.
[58] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In 2009 IEEE conference on computer vision
and pattern recognition, pages 248–255. Ieee, 2009.
[59] Jianghua Deng, Ye Lu, and Vincent Cheng-Siong Lee. Concrete crack detection with
handwriting script interferences using faster region-based convolutional neural network.
Computer-Aided Civil and Infrastructure Engineering, 35(4):373–388, 2020.
[60] Sattar Dorafshan, Robert J Thomas, and Marc Maguire. Comparison of deep convolu-
tional neural networks and edge detectors for image-based crack detection in concrete.
Construction and Building Materials, 186:1031–1045, 2018.
263
[61] Sébastien Drouyer. An’all terrain’crack detector obtained by deep learning on available
databases. Image Processing On Line, 10:105–123, 2020.
[62] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd
Edition). Wiley-Interscience, 2000. ISBN 0471056693.
[63] Cao Vu Dung et al. Autonomous concrete crack detection using deep fully convolutional
neural network. Automation in Construction, 99:52–58, 2019.
[64] Karsten Ehrig, Jürgen Goebbels, Dietmar Meinel, Olaf Paetsch, Steffen Prohaska, and
Valentin Zobel. Comparison of crack detection methods for analyzing damage processes
in concrete with computed tomography. In Proceedings of International Symposium on
Digital Industrial Radiology and Computed Tomography, DIR, Berlin, Germany, 2011.
[65] Gábor Erdős, Takahiro Nakano, and József Váncza. Adapting cad models of complex
engineering objects to measured point cloud data. CIRP Annals, 63(1):157–160, 2014.
[66] Tom Fawcett. An introduction to {ROC} analysis. Pattern Recognition Letters, 27(8):
861 – 874, 2006. ISSN 0167-8655. doi: http://dx.doi.org/10.1016/j.patrec.2005.10.010.
URL http://www.sciencedirect.com/science/article/pii/S016786550500303X.
{ROC} Analysis in Pattern Recognition.
[67] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene
categories. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), volume 2, pages 524–531. IEEE, 2005.
[68] ChunchengFeng, HuaZhang, HaoranWang, ShuangWang, andYonglongLi. Automatic
pixel-level crack detection on dam surface using deep convolutional network. Sensors,
20(7):2069, 2020.
[69] Nicola J Ferrier, Simon Rowe, and Andrew Blake. Real-time traffic monitoring. 1994.
[70] FHWA. Federal highway administration, bridge inspector’s reference manual, fhwa
nhi 12-049. Technical report, U.S. Department of Transportation, Federal Highway
Administration, 2012.
[71] FHWA. Guidelines for the installation, inspection, maintenance and repair of structural
supports for highway signs, luminaries, and traffic signals @ONLINE, June 2014. URL
http://www.fhwa.dot.gov/bridge/signinspection02.cfm. Accessed: 2014-06-01.
[72] FHWA. Federal highway administration, national bridge inventory, fhwa. Technical
report, U.S. Department of Transportation, Federal Highway Administration, 2020.
[73] FHWA. Federal highway administration, national bridge inspection standards,
fhwa–fapg 23 cfr 650c, fhwa. Technical report, U.S. Department of Transportation,
Federal Highway Administration, 2020.
264
[74] P. W. Fieguth and S. K. Sinha. Automated analysis and detection of cracks in
underground scanned pipes. In Image Processing, 1999. ICIP 99. Proceedings. 1999
International Conference on, volume 4, pages 395–399 vol.4, 1999. doi: 10.1109/ICIP.
1999.819622.
[75] Paul W. Fieguth and Sunil K. Sinha. Automated analysis and detection of cracks in
underground scanned pipes. In ICIP (4), pages 395–399, 1999. URL http://dblp.
uni-trier.de/db/conf/icip/icip1999-4.html#FieguthS99.
[76] Sayna Firoozi Yeganeh, Amir Golroo, and Mohammad R Jahanshahi. Automated
rutting measurement using an inexpensive rgb-d sensor fusion approach. Journal of
Transportation Engineering, Part B: Pavements, 145(1):04018061, 2019.
[77] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated cartography.
Communications of the ACM, 24(6):381–395, 1981.
[78] David A. Forsyth and Jean Ponce. Computer Vision - A Modern Approach, Second
Edition. Pitman, 2012. ISBN 978-0-273-76414-4.
[79] AlejandroF. Frangi, WiroJ. Niessen, KoenL. Vincken, and MaxA. Viergever. Multiscale
vessel enhancement filtering. In WilliamM. Wells, Alan Colchester, and Scott Delp,
editors, Medical Image Computing and Computer-Assisted Interventation — MIC-
CAI’98, volume 1496 of Lecture Notes in Computer Science, pages 130–137. Springer
Berlin Heidelberg, 1998. ISBN 978-3-540-65136-9. doi: 10.1007/BFb0056195. URL
http://dx.doi.org/10.1007/BFb0056195.
[80] James E. Frank and Mary Kay Falconer. The measurement of infrastructure capac-
ity: Theory, data structures, and analytics. Computers, Environment and Urban
Systems, 14(4):283 – 297, 1990. ISSN 0198-9715. doi: http://dx.doi.org/10.1016/
0198-9715(90)90003-C. URL http://www.sciencedirect.com/science/article/
pii/019897159090003C.
[81] Y. Fujita, Y. Mitani, and Yoshihiko Hamamoto. A method for crack detection on
a concrete structure. In Pattern Recognition, 2006. ICPR 2006. 18th International
Conference on, volume 3, pages 901–904, 2006. doi: 10.1109/ICPR.2006.98.
[82] Yusuke Fujita and Yoshihiko Hamamoto. A robust method for automatically detecting
cracks on noisy concrete surfaces. In Been-Chian Chien, Tzung-Pei Hong, Shyi-Ming
Chen, and Moonis Ali, editors, Next-Generation Applied Intelligence, volume 5579 of
Lecture Notes in Computer Science, pages 76–85. Springer Berlin Heidelberg, 2009.
ISBN 978-3-642-02567-9. doi: 10.1007/978-3-642-02568-6_8. URL http://dx.doi.
org/10.1007/978-3-642-02568-6_8.
[83] Yusuke Fujita and Yoshihiko Hamamoto. A robust automatic crack detection method
from noisy concrete surfaces. Machine Vision and Applications, 22(2):245–254, 2011.
265
[84] Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. Rotors—a modular
gazebo mav simulator framework. In Robot operating system (ROS), pages 595–625.
Springer, 2016.
[85] Yuqing Gao and Khalid M Mosalam. Deep transfer learning for image-based structural
damage recognition. Computer-Aided Civil and Infrastructure Engineering, 33(9):
748–768, 2018.
[86] G. Gerig, O. Kubler, R. Kikinis, and F.A. Jolesz. Nonlinear anisotropic filtering of mri
data. Medical Imaging, IEEE Transactions on, 11(2):221–232, 06 1992. ISSN 0278-0062.
doi: 10.1109/42.141646.
[87] Sindhu Ghanta, Salar Shahini Shamsabadi, Jennifer Dy, Ming Wang, and Ralf Birken. A
hessian-based methodology for automatic surface crack detection and classification from
pavement images. In Structural Health Monitoring and Inspection of Advanced Materials,
Aerospace, and Civil Infrastructure 2015, volume 9437, page 94371Z. International
Society for Optics and Photonics, 2015.
[88] Rahim Ghorbani, Fabio Matta, and Michael A Sutton. Full-field deformation measure-
ment and crack mapping on confined masonry walls using digital image correlation.
Experimental Mechanics, 55(1):227–243, 2015.
[89] Mondal Tarutal Ghosh and Jahanshahi Mohammad R. Autonomous vision-based
damage chronology for spatiotemporal condition assessment of civil infrastructure using
unmanned aerial vehicle. Smart Structures and Systems, 25(6):733–749, 06 2020.
[90] D Girardeau-Montauta, Michel Rouxa, Raphaël Marcb, and Guillaume Thibaultb.
Change detection on points cloud data acquired w ith a ground laser scanner. Laser
scanning, 2005.
[91] Rafael C. Gonzalez and Richard E. Woods. Digital image processing. Prentice Hall,
Upper Saddle River, N.J., 2008. ISBN 9780131687288 013168728X 9780135052679
013505267X.
[92] Rafael C Gonzalez and Richard E. (Richard Eugene) Woods. Digital image processing,
Rafael C. Gonzalez, University of Tennessee, Richard E. Woods, Interapptics. Pearson,
New York, NY, fourth edition. edition, 2018. ISBN 9780133356724.
[93] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.
Advances in neural information processing systems, 27:2672–2680, 2014.
[94] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[95] A. Ardeshir Goshtasby. 2-D and 3-D Image Registration: For Medical, Remote Sensing,
and Industrial Applications. Wiley-Interscience, USA, 2005. ISBN 0471649546.
[96] Arthur Ardeshir Goshtasby. 2-D and 3-D image registration: for medical, remote
sensing, and industrial applications. John Wiley & Sons, 2005.
266
[97] Benjamin A Graybeal, Brent M Phares, Dennis D Rolander, Mark Moore, and Glenn
Washer. Visual inspection of highway bridges. Journal of nondestructive evaluation, 21
(3):67–83, 2002.
[98] W. Guo, L. Soibelman, and Jr. J. H. Garrett. Automated Defect Detection in Urban
Wastewater Pipes Using Invariant Features Found in Video Images, pages1194–1203. doi:
10.1061/41020(339)121. URL https://ascelibrary.org/doi/abs/10.1061/41020%
28339%29121.
[99] W Guo, L Soibelman, and JH Garrett Jr. Visual pattern recognition supporting
defect reporting and condition assessment of wastewater collection systems. Journal of
Computing in Civil Engineering, 23(3):160–169, 2009.
[100] Carl Haas, Miroslaw Skibniewski, and Eugeniusz Budny. Robotics in civil engineering.
Computer-Aided Civil and Infrastructure Engineering, 10(5):371–381, 1995. ISSN 1467-
8667. doi: 10.1111/j.1467-8667.1995.tb00298.x. URL http://dx.doi.org/10.1111/j.
1467-8667.1995.tb00298.x.
[101] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning
an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
[102] Robert M Haralick, Karthikeyan Shanmugam, and Its’ Hak Dinstein. Textural features
for image classification. IEEE Transactions on systems, man, and cybernetics, (6):
610–621, 1973.
[103] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision.
Cambridge university press, 2003.
[104] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[105] Thomas Hitchcox and Yaoyao Fiona Zhao. Random walks for unorganized point
cloud segmentation with application to aerospace repair. Procedia Manufacturing, 26:
1483–1491, 2018.
[106] Vedhus Hoskere, Yasutaka Narazaki, Tu Hoang, and BillieF Spencer Jr. Vision-based
structuralinspectionusingmultiscaledeepconvolutionalneuralnetworks. arXiv preprint
arXiv:1805.01055, 2018.
[107] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4700–4708, 2017.
[108] Yong Huang, Haoyu Zhang, Hui Li, and Stephen Wu. Recovering compressed images
for automatic crack segmentation using generative models. Mechanical Systems and
Signal Processing, 146:107061, 2021.
267
[109] Daniel J Inman, Charles R Farrar, Vicente Lopes Junior, and Valder Steffen Junior.
Damage prognosis: for aerospace, civil and mechanical systems. John Wiley & Sons,
2005.
[110] T. Iseley. Pipeline condition assessment: Achieving uniform defect ratings. volume 1,
pages 121–132. Infra 99 Intl. Conf., 1999.
[111] Shivprakash Iyer and Sunil K. Sinha. Segmentation of pipe images for crack detection
in buried sewers. Computer-Aided Civil and Infrastructure Engineering, 21(6):395–
410, 2006. ISSN 1467-8667. doi: 10.1111/j.1467-8667.2006.00445.x. URL http:
//dx.doi.org/10.1111/j.1467-8667.2006.00445.x.
[112] Shivprakash Iyer and Sunil K Sinha. Automated condition assessment of buried sewer
pipes based on digital imaging techniques. Journal of the Indian Institute of Science,
85(5):235, 2013.
[113] Shruti Jadon. A survey of loss functions for semantic segmentation. In 2020 IEEE
Conference on Computational Intelligence in Bioinformatics and Computational Biology
(CIBCB), pages 1–7. IEEE, 2020.
[114] BahmanJafari, AliKhaloo, andDavidLattanzi. Deformationtrackingin3dpointclouds
via statistical sampling of direct cloud-to-cloud distances. Journal of Nondestructive
Evaluation, 36(4):1–10, 2017.
[115] Mohammad R Jahanshahi and Sami F Masri. A new methodology for non-contact
accurate crack width measurement through photogrammetry for automated structural
safety evaluation. Smart materials and structures, 22(3):035019, 2013.
[116] Mohammad R Jahanshahi, Sami F Masri, and Gaurav S Sukhatme. Multi-image stitch-
ing and scene reconstruction for evaluating defect evolution in structures. Structural
Health Monitoring, 10(6):643–657, 2011.
[117] Mohammad R Jahanshahi, Sami F Masri, Curtis W Padgett, and Gaurav S Sukhatme.
An innovative methodology for detection and quantification of cracks through incorpo-
ration of depth perception. Machine vision and applications, 24(2):227–241, 2013.
[118] Mohammad R Jahanshahi, Fu-Chen Chen, Adnan Ansar, Curtis W Padgett, Daniel
Clouse, and David S Bayard. Accurate and robust scene reconstruction in the presence
of misassociated features for aerial sensing. Journal of Computing in Civil Engineering,
31(6):04017056, 2017.
[119] Mohammad R Jahanshahi, Fu-Chen Chen, Chris Joffe, and Sami F Masri. Vision-based
quantitative assessment of microcracks on reactor internal components of nuclear power
plants. Structure and Infrastructure Engineering, 13(8):1013–1026, 2017.
[120] Mohammad Reza Jahanshahi. Vision-based studies for structural health monitoring
and condition assesment. PhD thesis, University of Southern California, Los Angeles
California, 2011.
268
[121] Keunyoung Jang, Namgyu Kim, and Yun-Kyu An. Deep learning–based autonomous
concrete crack evaluation through hybrid image scanning. Structural Health Monitoring,
18(5-6):1722–1737, 2019.
[122] Shang Jiang and Jian Zhang. Real-time crack assessment using deep neural networks
with wall-climbing unmanned aerial system. Computer-Aided Civil and Infrastructure
Engineering, 35(6):549–564, 2020.
[123] Sophie Jordan, Julian Moore, Sierra Hovet, John Box, Jason Perry, Kevin Kirsche,
Dexter Lewis, and Zion Tsz Ho Tse. State-of-the-art technologies for uav inspections.
IET Radar, Sonar & Navigation, 12(2):151–164, 2017.
[124] Igor Jovančević, Huy-Hieu Pham, Jean-José Orteu, Rémi Gilblas, Jacques Harvent,
Xavier Maurice, and Ludovic Brèthes. 3d point cloud analysis for detection and
characterization of defects on airplane exterior surface. Journal of Nondestructive
Evaluation, 36(4):74, 2017.
[125] Rony Kalfarisi, Zheng Yi Wu, and Ken Soh. Crack detection and segmentation using
deep learning with 3d reality mesh model for quantitative assessment and integrated
visualization. Journal of Computing in Civil Engineering, 34(3):04020010, 2020.
[126] IA Kanaeva and Yulia Aleksandrovna Ivanova. Road pavement crack detection using
deep learning with synthetic data. In 14th International Forum on Strategic Technology
(IFOST-2019), October 14-17, 2019, Tomsk, Russia:[proceedings].—Tomsk, 2019., pages
320–325, 2019.
[127] Dongho Kang, Sukhpreet S Benipal, Dharshan L Gopal, and Young-Jin Cha. Hybrid
pixel-level concrete crack segmentation and quantification across complex backgrounds
using deep learning. Automation in Construction, 118:103291, 2020.
[128] Myung Soo Kang and Yun-Kyu An. Deep learning-based automated background
removal for structural exterior image stitching. Applied Sciences, 11(8), 2021. ISSN
2076-3417. URL https://www.mdpi.com/2076-3417/11/8/3339.
[129] Ali Khaloo and David Lattanzi. Hierarchical dense structure-from-motion reconstruc-
tions for infrastructure condition assessment. Journal of Computing in Civil Engineering,
31(1):04016047, 2016.
[130] Ali Khaloo, David Lattanzi, Keith Cunningham, Rodney Dell’Andrea, and Mark Riley.
Unmanned aerial vehicle inspection of the placer river trail bridge through image-based
3d modelling. Structure and Infrastructure Engineering, 14(1):124–136, 2018.
[131] Ali Khaloo, David Lattanzi, Adam Jachimowicz, and Charles Devaney. Utilizing uav
and 3d computer vision for visual inspection of a large gravity dam. Frontiers in Built
Environment, 4:31, 2018.
[132] Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun.
A guide to convolutional neural networks for computer vision. Synthesis Lectures on
Computer Vision, 8(1):1–207, 2018.
269
[133] Byunghyun Kim and Soojin Cho. Automated vision-based detection of cracks on
concrete surfaces using a deep learning technique. Sensors, 18(10):3452, 2018.
[134] Hyunjun Kim, Eunjong Ahn, Myoungsu Shin, and Sung-Han Sim. Crack and noncrack
classification from concrete surface images using machine learning. Structural Health
Monitoring, 18(3):725–738, 2019.
[135] Przemysław Klapa and Bartosz Mitka. Edge effect and its impact upon the accuracy of
2d and 3d modelling using laser scanning. Geomatics, Landmanagement and Landscape,
2017.
[136] Jon Kleinberg and Eva Tardos. Algorithm Design. Addison-Wesley Longman Publishing
Co., Inc., USA, 2005. ISBN 0321295358.
[137] Laurent Kneip, Fabien Tâche, Gilles Caprari, and Roland Siegwart. Characterization
of the compact hokuyo urg-04lx 2d laser range scanner. In 2009 IEEE International
Conference on Robotics and Automation, pages 1447–1454. IEEE, 2009.
[138] Christian Koch, Kristina Georgieva, Varun Kasireddy, Burcu Akinci, and Paul Fieguth.
A review on computer vision based defect detection and condition assessment of concrete
and asphalt civil infrastructure. Advanced Engineering Informatics, 29(2):196–210,
2015.
[139] Si-Yu Kong, Jian-Sheng Fan, Yu-Fei Liu, Xiao-Chen Wei, and Xiao-Wei Ma. Automated
crack assessment and quantitative growth monitoring. Computer-Aided Civil and
Infrastructure Engineering, 36(5):656–674, 2021.
[140] M. Krause, J.-M. Hausherr, and W. Krenkel. (micro)-crack detection using local
radon transform. Materials Science and Engineering: A, 527(26):7126 – 7131, 2010.
ISSN 0921-5093. doi: http://dx.doi.org/10.1016/j.msea.2010.07.085. URL http://www.
sciencedirect.com/science/article/pii/S0921509310008373.
[141] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[142] Rolands Kromanis and Haida Liang. Condition assessment of structures using smart-
phones: a position independent multi-epoch imaging approach. 07 2018. 9th European
Workshop on Structural Health Monitoring.
[143] Srinath S. Kumar, Dulcy M. Abraham, Mohammad R. Jahanshahi, Tom Iseley, and
JustinStarr. Automateddefectclassificationinsewerclosedcircuittelevisioninspections
using deep convolutional neural networks. Automation in Construction, 91:273 – 283,
2018. ISSN 0926-5805. doi: https://doi.org/10.1016/j.autcon.2018.03.028. URL
https://www.sciencedirect.com/science/article/pii/S0926580517309767.
[144] Chirag Kyal, Motahar Reza, Bhumik Varu, and Shivangi Shreya. Image-based concrete
crack detection using random forest and convolution neural network. In Computational
Intelligence in Pattern Recognition, pages 471–481. Springer, 2022.
270
[145] H.M. La, R.S. Lim, B. Basily, N. Gucunski, Jingang Yi, A. Maher, F.A. Romero, and
H. Parvardeh. Autonomous robotic system for high-efficiency non-destructive bridge
deck inspection and evaluation. In Automation Science and Engineering (CASE), 2013
IEEE International Conference on, pages 1053–1058, Aug 2013. doi: 10.1109/CoASE.
2013.6653886.
[146] Tien-Thinh Le, Van-Hai Nguyen, and Minh Vuong Le. Development of deep learning
model for the recognition of cracks on concrete surfaces. Applied Computational
Intelligence and Soft Computing, 2021, 2021.
[147] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):
436–444, 2015.
[148] Donghan Lee, Jeongho Kim, and Daewoo Lee. Robust concrete crack detection using
deep learning-based semantic segmentation. International Journal of Aeronautical and
Space Sciences, 20(1):287–299, 2019.
[149] Jun S Lee, Sung Ho Hwang, Il Yoon Choi, and Yeongtae Choi. Estimation of crack
width based on shape-sensitive kernels and semantic segmentation. Structural Control
and Health Monitoring, 27(4):e2504, 2020.
[150] Taeyoung Lee, Melvin Leok, and N Harris McClamroch. Geometric tracking control of
a quadrotor uav on se (3). In 49th IEEE conference on decision and control (CDC),
pages 5420–5425. IEEE, 2010.
[151] Dawei Li, Qian Xie, Zhenghao Yu, Qiaoyun Wu, Jun Zhou, and Jun Wang. Sewer pipe
defect detection via deep learning with local and global feature fusion. Automation in
Construction, 129:103823, 2021.
[152] Duanshun Li, Anran Cong, and Shuai Guo. Sewer damage detection from imbal-
anced cctv inspection data using deep convolutional neural networks with hierarchical
classification. Automation in Construction, 101:199–208, 2019.
[153] Gang Li, Qiangwei Liu, Shanmeng Zhao, Wenting Qiao, and Xueli Ren. Automatic
crack recognition for concrete bridges using a fully convolutional neural network and
naive bayes data fusion based on a visual detection system. Measurement Science and
Technology, 31(7):075403, 2020.
[154] Gang Li, Biao Ma, Shuanhai He, Xueli Ren, and Qiangwei Liu. Automatic tunnel crack
detection based on u-net and a convolutional neural network with alternately updated
clique. Sensors, 20(3):717, 2020.
[155] Haifeng Li, Yu Liu, Longfei Fan, and Xinwei Chen. Towards robust and optimal image
stitching for pavement crack inspection and mapping. In 2017 IEEE International
Conference on Robotics and Biomimetics (ROBIO), pages 2390–2394. IEEE, 2017.
[156] Shengyuan Li and Xuefeng Zhao. Image-based concrete crack detection using convolu-
tional neural network and exhaustive search technique. Advances in Civil Engineering,
2019, 2019.
271
[157] Shengyuan Li and Xuefeng Zhao. Automatic crack detection and measurement of
concrete structure using convolutional encoder-decoder network. IEEE Access, 8:
134602–134618, 2020.
[158] Yadan Li, Zhenqi Han, Haoyu Xu, Lizhuang Liu, Xiaoqiang Li, and Keke Zhang.
Yolov3-lite: A lightweight crack detection network for aircraft structure based on
depthwise separable convolutions. Applied Sciences, 9(18):3781, 2019.
[159] Yundong Li, Hongguang Li, and Hongren Wang. Pixel-wise crack detection using deep
local pattern predictor for robot application. Sensors, 18(9):3042, 2018.
[160] Ronny Salim Lim, Hung Manh La, and Weihua Sheng. A robotic crack inspection
and mapping system for bridge deck maintenance. IEEE Transactions on Automation
Science and Engineering, 11(2):367–378, 2014.
[161] C. Harriet Linda and G. Wiselin Jiji. Crack detection in x-ray images using fuzzy
index measure. Applied Soft Computing, 11(4):3571 – 3579, 2011. ISSN 1568-4946. doi:
http://dx.doi.org/10.1016/j.asoc.2011.01.029. URL http://www.sciencedirect.com/
science/article/pii/S1568494611000391.
[162] Tony Lindeberg. Feature detection with automatic scale selection. International journal
of computer vision, 30(2):79–116, 1998.
[163] Dan Liu, Dajun Li, Meizhen Wang, and Zhiming Wang. 3d change detection using
adaptive thresholds based on local point cloud density. ISPRS International Journal of
Geo-Information, 10(3):127, 2021.
[164] Quancai Liu and Yong Liu. An approach for auto bridge inspection based on climbing
robot. In 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO),
pages 2581–2586. IEEE, 2013.
[165] Yahui Liu, Jian Yao, Xiaohu Lu, Renping Xie, and Li Li. Deepcrack: A deep hierarchical
feature learning architecture for crack segmentation. Neurocomputing, 338:139–153,
2019.
[166] Yu-Fei Liu, Soojin Cho, BF Spencer Jr, and Jian-Sheng Fan. Concrete crack assessment
using digital image processing and 3d scene reconstruction. Journal of Computing in
Civil Engineering, 30(1):04014124, 2016.
[167] Yufei Liu, Soojin Cho, Billie F Spencer Jr, and Jiansheng Fan. Automated assessment
of cracks on concrete surfaces using adaptive digital image processing. Smart Structures
and Systems, 14(4):719–741, 2014.
[168] Zhenqing Liu, Yiwen Cao, Yize Wang, and Wei Wang. Computer vision-based concrete
crack detection using u-net fully convolutional networks. Automation in Construction,
104:129–139, 2019.
[169] David G Lowe. Distinctive image features from scale-invariant keypoints. International
journal of computer vision, 60(2):91–110, 2004.
272
[170] David G Lowe et al. Object recognition from local scale-invariant features. 1999.
[171] Touba Malekzadeh, Milad Abdollahzadeh, Hossein Nejati, and Ngai-Man Che-
ung. Aircraft fuselage defect detection using deep neural networks. arXiv preprint
arXiv:1712.09213, 2017.
[172] PE Mark Moore, Ph.D. Brent Phares, Benjamin Graybeal andDennis Rolander, and
PE Glenn Washer. Reliability of visual inspection for highway bridges, volume i:
Final report. Technical report, US Department of Transportation, Federal Highway
Administration, 2001. URL https://www.fhwa.dot.gov/publications/research/
nde/pdfs/01020a.pdf.
[173] Philippe Martin and Erwan Salaün. The true role of accelerometer feedback in quadrotor
control. In 2010 IEEE international conference on robotics and automation, pages
1623–1629. IEEE, 2010.
[174] S.F. Masri and Yulu Chen. Image-based damage detection and condition assessment
for maintenance of aircraft structures. Technical report, PWICE Institute, University
of Southern California, Los Angeles, CA, July 2021.
[175] Larry Matthies, Roland Brockers, Yoshiaki Kuwata, and Stephan Weiss. Stereo vision-
based obstacle avoidance for micro air vehicles using disparity space. In 2014 IEEE
international conference on robotics and automation (ICRA), pages 3242–3249. IEEE,
2014.
[176] Donald Meagher. Geometric modeling using octree encoding. Computer graphics and
image processing, 19(2):129–147, 1982.
[177] Qipei Mei and Mustafa Gül. Multi-level feature fusion in densely connected deep-
learning architecture and depth-first search for crack segmentation on images collected
with smartphones. Structural Health Monitoring, 19(6):1726–1744, 2020.
[178] Qipei Mei, Mustafa Gül, and Md Riasat Azim. Densely connected deep neural net-
work considering connectivity of pixels for automatic crack detection. Automation in
Construction, 110:103018, 2020.
[179] Xiaokun Miao, Jing Wang, Zhengfang Wang, Qingmei Sui, Yuan Gao, and Peng Jiang.
Automatic recognition of highway tunnel defects based on an improved u-net model.
IEEE Sensors Journal, 19(23):11413–11423, 2019.
[180] Krystian Mikolajczyk, Andrew Zisserman, and Cordelia Schmid. Shape recognition
with edge-based features. In British Machine Vision Conference (BMVC’03), volume 2,
pages 779–788. The British Machine Vision Association, 2003.
[181] Mohammad Ebrahim Mohammadi, Daniel P Watson, and Richard L Wood. Deep
learning-based damage detection from aerial sfm point clouds. Drones, 3(3):68, 2019.
[182] Arun Mohan and Sumathi Poobal. Crack detection using image processing: A critical
review and analysis. Alexandria Engineering Journal, 57(2):787–798, 2018.
273
[183] Mahtab Mohtasham Khani, Sahand Vahidnia, Leila Ghasemzadeh, Y Eren Ozturk,
Mustafa Yuvalaklioglu, Selim Akin, and Nazim Kemal Ure. Deep-learning-based
crack detection with applications for the structural health monitoring of gas turbines.
Structural Health Monitoring, 19(5):1440–1452, 2020.
[184] Mark Moore, Brent Phares, Benjamin Graybeal, Dennis Rolander, and Glenn Washer.
Reliability of visual inspection for highway bridges, volume i: Final report. Technical
report, U.S. Department of Transportation, Federal Highway Administration, 2001.
[185] Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic
algorithm configuration. 2009.
[186] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift. IEEE
transactions on pattern analysis and machine intelligence, 32(12):2262–2275, 2010.
[187] Bilal Nasser and Amir Rabani. An advanced method for matching partial 3d point
clouds to free-form cad models for in-situ inspection and repair. In Automated Visual
Inspection and Machine Vision III, volume 11061, page 1106105. International Society
for Optics and Photonics, 2019.
[188] Cong Hong Phong Nguyen and Young Choi. Comparison of point cloud data and 3d cad
data for on-site dimensional inspection of industrial plant piping systems. Automation
in Construction, 91:44–52, 2018.
[189] FuTao Ni, Jian Zhang, and ZhiQiang Chen. Pixel-level crack delineation in images
with convolutional feature fusion. Structural Control and Health Monitoring, 26(1):
e2286, 2019.
[190] FuTao Ni, Jian Zhang, and ZhiQiang Chen. Zernike-moment measurement of thin-
crack width in images enabled by dual-scale deep learning. Computer-Aided Civil and
Infrastructure Engineering, 34(5):367–384, 2019.
[191] S. Oka, S. Mochizuki, H. Togo, and N. Kukutsu. A neural network algorithm for
detecting invisible concrete surface cracks in near-field millimeter-wave images. In
Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on,
pages 3801–3805, 10 2009. doi: 10.1109/ICSMC.2009.5346623.
[192] N. Otsu. A threshold selection method from gray-level histograms. Systems, Man
and Cybernetics, IEEE Transactions on, 9(1):62–66, 01 1979. ISSN 0018-9472. doi:
10.1109/TSMC.1979.4310076.
[193] Gianpaolo Palma, Paolo Cignoni, Tamy Boubekeur, and Roberto Scopigno. Detection of
geometric temporal changes in point clouds. In Computer Graphics Forum, volume 35,
pages 33–45. Wiley Online Library, 2016.
[194] Gang Pan, Yaoxian Zheng, Shuai Guo, and Yaozhi Lv. Automatic sewer pipe defect
semantic segmentation based on improved u-net. Automation in Construction, 119:
103383, 2020.
274
[195] Somin Park, Seongdeok Bang, Hongjo Kim, and Hyoungkwan Kim. Patch-based
crack detection in black box images using convolutional neural networks. Journal of
Computing in Civil Engineering, 33(3):04019017, 2019.
[196] Song Ee Park, Seung-Hyun Eem, and Haemin Jeon. Concrete crack detection and
quantification using deep learning and structured light. Construction and Building
Materials, 252:119096, 2020.
[197] Ricardo Perera, Alberto Pérez, Marta García-Diéguez, and José Zapico-Valle. Active
wireless system for structural health monitoring applications. Sensors, 17(12):2880,
2017.
[198] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic
diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(7):
629–639, 1990.
[199] M. Petrou, J. Kittler, and K.Y. Song. Automatic surface crack detection on textured
materials. Journal of Materials Processing Technology, 56(1-4):158 – 167, 1996. ISSN
0924-0136. doi: http://dx.doi.org/10.1016/0924-0136(95)01831-X. URL http://
www.sciencedirect.com/science/article/pii/092401369501831X. International
Conference on Advances in Material and Processing Technologies.
[200] Guru Prakash, Xian-Xun Yuan, Budhaditya Hazra, and Daijiro Mizutani. Toward
a big data-based approach: A review on degradation models for prognosis of critical
infrastructure. Journal of Nondestructive Evaluation, Diagnostics and Prognostics of
Engineering Systems, 4(2):021005, 2021.
[201] Rongjun Qin and Armin Gruen. 3d change detection at street level using mobile laser
scanning point clouds and terrestrial images. ISPRS Journal of Photogrammetry and
Remote Sensing, 90:23–35, 2014.
[202] Rongjun Qin, Jiaojiao Tian, and Peter Reinartz. 3d change detection–approaches and
applications. ISPRS Journal of Photogrammetry and Remote Sensing, 122:41–56, 2016.
[203] Shi Qiu, Wenjuan Wang, Shaofan Wang, and Kelvin CP Wang. Methodology for
accurate aashto pp67-10–based cracking quantification using 1-mm 3d pavement images.
Journal of Computing in Civil Engineering, 31(2):04016056, 2016.
[204] Shi Qiu, Wenjuan Wang, Shaofan Wang, and Kelvin C. P. Wang. Methodology for
accurate aashto pp67-10–based cracking quantification using 1-mm 3d pavement images.
Journal of Computing in Civil Engineering, 31(2):04016056, 2017. doi: 10.1061/(ASCE)
CP.1943-5487.0000627.
[205] Zhong Qu, Jing Mei, Ling Liu, and Dong-Yang Zhou. Crack detection of concrete
pavement with cross-entropy loss function and improved vgg16 network model. IEEE
Access, 8:54564–54573, 2020.
275
[206] Lovedeep Ramana, Wooram Choi, and Young-Jin Cha. Fully automated vision-based
loosened bolt detection using the viola–jones algorithm. Structural Health Monitoring,
18(2):422–434, 2019.
[207] Aravinda S Rao, Tuan Nguyen, Marimuthu Palaniswami, and Tuan Ngo. Vision-based
automated crack detection using convolutional neural networks for condition assessment
of infrastructure. Structural Health Monitoring, 20(4):2124–2142, 2021.
[208] Abbas Rashidi, Fei Dai, Ioannis Brilakis, and Patricio Vela. Optimized selection of
key frames for monocular videogrammetric surveying of civil infrastructure. Adv. Eng.
Inform., 27(2):270–282, April 2013. ISSN 1474-0346. doi: 10.1016/j.aei.2013.01.002.
URL https://doi.org/10.1016/j.aei.2013.01.002.
[209] Russell Reed and Robert J MarksII. Neural smithing: supervised learning in feedforward
artificial neural networks. Mit Press, 1999.
[210] Russell D. Reed and Robert J. Marks. Neural Smithing: Supervised Learning in
Feedforward Artificial Neural Networks. MIT Press, Cambridge, MA, USA, 1998. ISBN
0262181908.
[211] Yupeng Ren, Jisheng Huang, Zhiyou Hong, Wei Lu, Jun Yin, Lejun Zou, and Xiaohua
Shen. Image-based concrete crack detection in tunnels using deep fully convolutional
networks. Construction and Building Materials, 234:117367, 2020.
[212] Santiago M. Reyna, Jorge A. Vanegas, and Abdul H. Khan. Construction technologies
for sewer rehabilitation. Journal of Construction Engineering and management, 120(3):
467–487, 1994.
[213] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representa-
tions by back-propagating errors. nature, 323(6088):533–536, 1986.
[214] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms
(fpfh) for 3d registration. In 2009 IEEE international conference on robotics and
automation, pages 3212–3217. IEEE, 2009.
[215] P. Salembier. Comparison of some morphological segmentation algorithms based on
contrast enhancement. application to automatic defect detection. In 5. European Signal
Processing Conference., volume 2, pages 833–836, 1990.
[216] Tobias Schlagenhauf, Tim Brander, and Jürgen Fleischer. A stitching algorithm for
automated surface inspection of rotationally symmetric components. CIRP Journal
of Manufacturing Science and Technology, 35:169–177, 2021. ISSN 1755-5817. doi:
https://doi.org/10.1016/j.cirpj.2021.05.013. URL https://www.sciencedirect.com/
science/article/pii/S1755581721000833.
[217] Jürgen Schmidhuber. Deep learning in neural networks: An overview. CoRR,
abs/1404.7828, 2014. URL http://arxiv.org/abs/1404.7828.
276
[218] FlorianSchroff, DmitryKalenichenko, andJamesPhilbin. Facenet: Aunifiedembedding
for face recognition and clustering. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 815–823, 2015.
[219] T. Shehab-Eldeen. An automated system for detection, classification and rehabilitation
of defects in sewer pipes. PhD thesis, Concordia University, Montreal, Quebec, Canada,
2001.
[220] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic
segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):
640–651, 2017. doi: 10.1109/TPAMI.2016.2572683.
[221] Xuanjing Shen, Wei Wei, Jianwu Long, and Qingji Qian. Damaged solar cell detection
based on gray-intensity wave transformation. Procedia Engineering, 15(0):3808 – 3813,
2011. ISSN 1877-7058. doi: http://dx.doi.org/10.1016/j.proeng.2011.08.713. {CEIS}
2011.
[222] Zejiang Shen, Xili Wan, Feng Ye, Xinjie Guan, and Shuwen Liu. Deep learning based
framework for automatic damage detection in aircraft engine borescope inspection.
pages 1005–1010, 02 2019. doi: 10.1109/ICCNC.2019.8685593.
[223] Masanobu Shinozuka, Hung-Chi Chung, Makoto Ichitsubo, and Jianwen Liang. System
identification by video image processing. In Smart Structures and Materials 2001:
Smart Systems for Bridges, Structures, and Highways, volume 4330, pages 97–108.
International Society for Optics and Photonics, 2001.
[224] Mel Siegel and Priyan Gunatilake. Remote enhanced visual inspection of aircraft by a
mobile robot. 1998.
[225] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional
neural networks applied to visual document analysis. In Icdar, volume 3, 2003.
[226] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[227] Sunil K. Sinha and Paul W. Fieguth. Segmentation of buried concrete pipe images.
Automation in Construction, 15(1):47 – 57, 2006. ISSN 0926-5805. doi: 10.1016/j.
autcon.2005.02.007. URL http://www.sciencedirect.com/science/article/pii/
S0926580505000464.
[228] Sunil K. Sinha and Paul W. Fieguth. Automated detection of cracks in buried concrete
pipe images. Automation in Construction, 15(1):58 – 72, 2006. ISSN 0926-5805.
doi: 10.1016/j.autcon.2005.02.006. URL http://www.sciencedirect.com/science/
article/pii/S0926580505000452.
[229] K. Y. Song, M. Petrou, and J. Kittler. Wigner based crack detection in textured images.
In Image Processing and its Applications, 1992., International Conference on, pages
315–318, 04 1992.
277
[230] KengYew Song, Maria Petrou, and Josef Kittler. Texture crack detection. Machine
Vision and Applications, 8(1):63–75, 1995. ISSN 0932-8092. doi: 10.1007/BF01213639.
URL http://dx.doi.org/10.1007/BF01213639.
[231] Sandeep Sony, Shea Laventure, and Ayan Sadhu. A literature review of next-generation
smart sensing technology in structural health monitoring. Structural Control and Health
Monitoring, 26(3):e2321, 2019.
[232] Bentley systems. Contextcapture, a reality modeling software, 2019. URL https:
//www.bentley.com/en/products/brands/contextcapture.
[233] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 1–9, 2015.
[234] Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag,
Berlin, Heidelberg, 1st edition, 2010. ISBN 1848829345, 9781848829343.
[235] Richard Szeliski et al. Image alignment and stitching: A tutorial. Foundations and
Trends® in Computer Graphics and Vision, 2(1):1–104, 2007.
[236] Yi Tan, Ruying Cai, Jingru Li, Penglu Chen, and Mingzhu Wang. Automatic detection
of sewer defects based on improved you only look once algorithm. Automation in
Construction, 131:103912, 2021.
[237] Jinshan Tang and Yanliang Gu. Automatic crack detection and segmentation using
a hybrid algorithm for road distress analysis. In Systems, Man, and Cybernetics
(SMC), 2013 IEEE International Conference on, pages 3026–3030, 10 2013. doi:
10.1109/SMC.2013.516.
[238] David Tedaldi, Alberto Pretto, and Emanuele Menegatti. A robust and easy to
implement method for imu calibration without external equipments. In 2014 IEEE
International Conference on Robotics and Automation (ICRA), pages 3042–3049. IEEE,
2014.
[239] Alexandru Telea and Jarke J. van Wijk. An augmented fast marching method for
computing skeletons and centerlines. In Proceedings of the Symposium on Data Visuali-
sation 2002, VISSYM ’02, page 251–ff, Goslar, DEU, 2002. Eurographics Association.
ISBN 158113536X.
[240] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. 2005.
[241] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of histograms
for local surface description. In European conference on computer vision, pages 356–369.
Springer, 2010.
278
[242] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique shape context for 3d
data description. In Proceedings of the ACM workshop on 3D object retrieval, pages
57–62, 2010.
[243] Xuhang Tong, Jie Guo, Yun Ling, and Zhouping Yin. A new image-based method for
concrete bridge bottom crack detection. In 2011 international conference on image
analysis and signal processing, pages 568–571. IEEE, 2011.
[244] M. Torok, M. Golparvar-Fard, and K. Kochersberger. Image-based automated 3d
crack detection for post-disaster building assessment. Journal of Computing in Civil
Engineering, 28(5):A4014004, 2014. doi: 10.1061/(ASCE)CP.1943-5487.0000334. URL
http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000334.
[245] Nikolas Trawny and Stergios I Roumeliotis. Indirect kalman filter for 3d attitude
estimation. University of Minnesota, Dept. of Comp. Sci. & Eng., Tech. Rep, 2:2005,
2005.
[246] Du-Ming Tsai, Chih-Chieh Chang, and Shin-Min Chao. Micro-crack inspection in
heterogeneously textured solar wafers using anisotropic diffusion. Image and Vision
Computing, 28(3):491 – 501, 2010. ISSN 0262-8856. doi: http://dx.doi.org/10.1016/
j.imavis.2009.08.001. URL http://www.sciencedirect.com/science/article/pii/
S026288560900170X.
[247] Jeffrey Scott Vitter. Faster methods for random sampling. Communications of the
ACM, 27(7):703–718, 1984.
[248] Baoxian Wang, Yiqiang Li, Weigang Zhao, Zhaoxi Zhang, Yufeng Zhang, and Zhe
Wang. Effective crack damage detection using multilayer sparse feature representation
and incremental extreme learning machine. Applied Sciences, 9(3):614, 2019.
[249] Rui Wang and Youhei Kawamura. A magnetic climbing robot for steel bridge inspection.
In Proceeding of the 11th World Congress on Intelligent Control and Automation, pages
3303–3308. IEEE, 2014.
[250] Wenjuan Wang, Allen Zhang, Kelvin CP Wang, Andrew F Braham, and Shi Qiu.
Pavement crack width measurement based on laplace’s equation for continuity and
unambiguity. Computer-Aided Civil and Infrastructure Engineering, 33(2):110–123,
2018.
[251] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for
point cloud registration. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 3523–3532, 2019.
[252] Water Research Centre. Sewerage rehabilitation manual. Water Res. Ctr., Swindon,
Wiltshire, England, 1983.
[253] G. J. Weil. Remote infrared thermal sensing of sewer voids. In In Proc. of 17th Annual
Conf., on Water Resources Planning and Management Division, ASCE, Houston, Tex.,
Apr. ASCE, Apr 1990.
279
[254] Stephan M Weiss. Vision based navigation for micro helicopters. PhD thesis, ETH
Zurich, 2012.
[255] Jack G Williams, Katharina Anders, Lukas Winiwarter, Vivien Zahs, and Bernhard
Höfle. Multi-directional change detection between point clouds. ISPRS Journal of
Photogrammetry and Remote Sensing, 172:95–113, 2021.
[256] Reini Wirahadikusumah, Dulcy M Abraham, Tom Iseley, and Ravi K Prasanth. Assess-
ment technologies for sewer system rehabilitation. Automation in Construction, 7(4):259
– 270, 1998. ISSN 0926-5805. doi: http://dx.doi.org/10.1016/S0926-5805(97)00071-X.
URL http://www.sciencedirect.com/science/article/pii/S092658059700071X.
[257] Christian Wöhler. 3D computer vision: efficient methods and applications. Springer
Science & Business Media, 2012.
[258] Jongbin Won, Jong-Woong Park, Changsu Shim, and Man-Woo Park. Bridge-surface
panoramic-imagegenerationforautomatedbridge-inspectionusingdeepmatching. Struc-
tural Health Monitoring, 20(4):1689–1703, 2021.
[259] Kong Xiangxiong and Li Jian. An image-based feature tracking approach for bolt
loosening detection in steel connections. volume 10598, 2018. doi: 10.1117/12.2296609.
URL https://doi.org/10.1117/12.2296609.
[260] Wen Xiao, Bruno Vallet, and Nicolas Paparoditis. Change detection in 3d point clouds
acquired by a mobile mapping system. ISPRS Annals of Photogrammetry, Remote
Sensing and Spatial Information Sciences, 1(2):331–336, 2013.
[261] Wen Xiao, Bruno Vallet, Mathieu Brédif, and Nicolas Paparoditis. Street environ-
ment change detection from mobile laser scanning point clouds. ISPRS Journal of
Photogrammetry and Remote Sensing, 107:38–49, 2015.
[262] Qian Xie, Dawei Li, Jinxuan Xu, Zhenghao Yu, and Jun Wang. Automatic detection
and classification of sewer defects via hierarchical deep learning. IEEE Transactions on
Automation Science and Engineering, 16(4):1836–1847, 2019.
[263] Qian Xie, Dening Lu, Kunpeng Du, Jinxuan Xu, Jiajia Dai, HongHua Chen, and Jun
Wang. Aircraft skin rivet detection based on 3d point cloud via multiple structures
fitting. Computer-Aided Design, 120:102805, 2020.
[264] Renping Xie, Jian Yao, Kang Liu, Xiaohu Lu, Yahui Liu, Menghan Xia, and Qifei Zeng.
Automatic multi-image stitching for concrete bridge inspection by combining point
and line features. Automation in Construction, 90:265 – 280, 2018. ISSN 0926-5805.
doi: https://doi.org/10.1016/j.autcon.2018.02.021. URL http://www.sciencedirect.
com/science/article/pii/S0926580518301237.
[265] Xue-jun Xu and Xiao-ning Zhang. Crack detection of reinforced concrete bridge using
video image. Journal of Central South University, 20(9):2605–2613, 2013.
280
[266] Yang Xu, Yuequan Bao, Jiahui Chen, Wangmeng Zuo, and Hui Li. Surface fatigue
crack identification in steel box girder of bridges by a deep fusion convolutional neural
network based on consumer-grade camera images. Structural Health Monitoring, 18(3):
653–674, 2019.
[267] Fan Yang, Lei Zhang, Sijia Yu, Danil Prokhorov, Xue Mei, and Haibin Ling. Fea-
ture pyramid and hierarchical boosting network for pavement crack detection. IEEE
Transactions on Intelligent Transportation Systems, 21(4):1525–1535, 2019.
[268] Jiaolong Yang, Hongdong Li, Dylan Campbell, and Yunde Jia. Go-icp: A globally
optimal solution to 3d icp point-set registration. IEEE transactions on pattern analysis
and machine intelligence, 38(11):2241–2254, 2015.
[269] Jun Yang, Yu-Gang Jiang, Alexander G Hauptmann, and Chong-Wah Ngo. Evaluating
bag-of-visual-words representations in scene classification. In Proceedings of the inter-
national workshop on Workshop on multimedia information retrieval, pages 197–206.
ACM, 2007.
[270] Ming-Der Yang and Tung-Ching Su. Automated diagnosis of sewer pipe defects based
on machine learning approaches. Expert Systems with Applications, 35(3):1327–1337,
2008.
[271] Ming-Der Yang and Tung-Ching Su. Segmenting ideal morphologies of sewer pipe
defects on cctv images for automated diagnosis. Expert Systems with Applications, 36
(2):3562–3573, 2009.
[272] Qiaoning Yang and Xiaodong Ji. Automatic pixel-level crack detection for civil infras-
tructure using unet++ and deep transfer learning. IEEE Sensors Journal, 2021.
[273] Xincong Yang, Heng Li, Yantao Yu, Xiaochun Luo, Ting Huang, and Xu Yang. Auto-
matic pixel-level crack detection and measurement using fully convolutional network.
Computer-Aided Civil and Infrastructure Engineering, 33(12):1090–1109, 2018.
[274] Yuan-Sen Yang, Chiun-lin Wu, Thomas TC Hsu, Hsuan-Chih Yang, Hu-Jhong Lu,
and Chang-Ching Chang. Image analysis method for crack distribution and width
estimation for reinforced concrete structures. Automation in Construction, 91:120–132,
2018.
[275] YaoYao, Shue-TingEllenTung, andBrankoGlisic. Crackdetectionandcharacterization
techniques-anoverview. Structural Control and Health Monitoring, pagesn/a–n/a, 2014.
ISSN 1545-2263. doi: 10.1002/stc.1655. URL http://dx.doi.org/10.1002/stc.1655.
[276] Xiao-Wei Ye, Tao Jin, and Peng-Yu Chen. Structural crack detection using deep
learning–based fully convolutional networks. Advances in Structural Engineering, 22
(16):3412–3419, 2019.
[277] Chul Min Yeum, Jongseong Choi, and Shirley J Dyke. Automated region-of-interest
localization and classification for vision-based visual assessment of civil infrastructure.
Structural Health Monitoring, 18(3):675–689, 2019.
281
[278] CM Yeum, J Choi, and SJ Dyke. Autonomous image localization for visual inspection
of civil infrastructure. Smart Materials and Structures, 26(3):035051, 2017.
[279] Zi Jian Yew and Gim Hee Lee. 3dfeat-net: Weakly supervised local 3d features for point
cloud registration. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 607–623, 2018.
[280] Zi Jian Yew and Gim Hee Lee. City-scale scene change detection using point clouds.
arXiv preprint arXiv:2103.14314, 2021.
[281] Xianfei Yin, Yuan Chen, Ahmed Bouferguene, Hamid Zaman, Mohamed Al-Hussein,
and Luke Kurach. A deep learning-based framework for an automated defect detection
system for sewer pipes. Automation in construction, 109:102967, 2020.
[282] Chaobo Zhang, Chih-chen Chang, and Maziar Jamshidi. Concrete bridge surface
damage detection using a single-stage detector. Computer-Aided Civil and Infrastructure
Engineering, 35(4):389–409, 2020.
[283] Dejin Zhang, Qingquan Li, Ying Chen, Min Cao, Li He, and Bailing Zhang. An efficient
and reliable coarse-to-fine approach for asphalt pavement crack detection. Image and
Vision Computing, 57:130–146, 2017.
[284] Kaige Zhang, Yingtao Zhang, and HD Cheng. Self-supervised structure learning for
crack detection based on cycle-consistent generative adversarial networks. Journal of
Computing in Civil Engineering, 34(3):04020004, 2020.
[285] Lei Zhang, Fan Yang, Yimin Daniel Zhang, and Ying Julie Zhu. Road crack detection
using deep convolutional neural network. In 2016 IEEE international conference on
image processing (ICIP), pages 3708–3712. IEEE, 2016.
[286] Xinxiang Zhang, Dinesh Rajan, and Brett Story. Concrete crack detection using context-
aware deep semantic segmentation network. Computer-Aided Civil and Infrastructure
Engineering, 34(11):951–971, 2019.
[287] Yang Zhang and Ka-Veng Yuen. Crack detection using fusion features-based broad learn-
ing system and image processing. Computer-Aided Civil and Infrastructure Engineering,
2021.
[288] Yang Zhang, Xiaowei Sun, Kenneth J Loh, Wensheng Su, Zhigang Xue, and Xuefeng
Zhao. Autonomous bolt loosening detection using deep learning. Structural Health
Monitoring, 0(0):1475921719837509, 0. doi: 10.1177/1475921719837509. URL https:
//doi.org/10.1177/1475921719837509.
[289] Zhengyou Zhang. Microsoft kinect sensor and its effect. IEEE multimedia, 19(2):4–10,
2012.
[290] Xiukuan Zhao, Ruolin Wang, Haichang Gu, Gangbing Song, and YL Mo. Innovative
data fusion enabled structural health monitoring approach. Mathematical Problems in
Engineering, 2014, 2014.
282
[291] Qianqian Zhou, Zuxiang Situ, Shuai Teng, and Gongfa Chen. Convolutional neural
networks–based model for automated sewer defects detection and classification. Journal
of Water Resources Planning and Management, 147(7):04021036, 2021.
[292] Qinbang Zhou, Renwen Chen, Bin Huang, Chuan Liu, Jie Yu, and Xiaoqing Yu. An
automatic surface defect inspection system for automobiles using machine vision meth-
ods. Sensors, 19(3), 2019. ISSN 1424-8220. URL http://www.mdpi.com/1424-8220/
19/3/644.
[293] Zhenhua Zhu, Stephanie German, and Ioannis Brilakis. Visual retrieval of concrete crack
properties for automated post-earthquake structural safety evaluation. Automation in
Construction, 20(7):874–883, 2011.
283
Abstract
Civil, mechanical, and aerospace infrastructures are subjected to applied loads and environmental forces such as earthquakes, wind, and water waves over their operating lifespans. These factors slowly deteriorate structures during their service period, and subtle observations of substantial damage are often challenging. Owing to the cost-effectiveness of high-resolution color and depth cameras, location sensors, and Micro Aerial Vehicles (MAVs), image processing, computer vision, and robotics techniques are gaining interest in Non-Destructive Testing (NDT) and condition assessment of infrastructure. In this study, several promising vision-based and data-driven, automated and semi-automated condition assessment techniques are proposed and evaluated to detect and quantify a class of problems under the umbrella of infrastructure condition assessment.

A synthetic crack generation methodology is introduced to generate "zero-labeled" samples for training classical classifiers. These classifiers were tested on a real-world dataset, using the gradient-based hierarchical hybrid Multi-scale Fractional Anisotropy Tensor (MFAT) filter to segment the cracks. The results demonstrate the promising capabilities of the proposed synthetic crack generation method. Textural noise suppression and refinement are carried out using an anisotropic diffusion filter, and guidelines are provided for selecting its parameters. This study also presents semantic segmentation of cracks in concrete surface images using a deep Convolutional Neural Network (CNN) with fewer parameters to learn. Several illustrative examples demonstrate the capabilities of the CNN-based crack segmentation procedure. The CNN was tested on four real-world datasets, and the results show its superiority over four state-of-the-art methods.

As part of this study, an efficient and autonomous methodology for crack change detection, tracking, and evolution is introduced. Among image registration methods, feature-based registration is robust to noise, intensity changes, and a partial affine motion model. This study uses an efficient k-d tree-based nearest neighbor search, which is faster than the quadratic computational complexity of an exhaustive pairwise search. Furthermore, unlike other methods, the fixed-camera assumption is relaxed in this study. Another significant contribution is a probabilistic measure of the reliability of the analysis results that can aid prognostic damage detection models.

After the nearest neighbor search, SURF keypoints are extracted from the images in the previous database and the current one. This is followed by Random Sample Consensus (RANSAC)-based outlier rejection, bundle adjustment to refine the homographies, and gain/exposure compensation and multi-band blending for seamless image registration. Lastly, the registered image is compared with the current images to detect changes in the cracks' physical properties. To demonstrate the capabilities of the proposed method, two datasets were utilized: a real-world dataset and a synthetic dataset. The experimental results show that the proposed methodology performs well in detecting crack changes in both datasets.
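To give a concrete picture of the registration step just summarized, the following is a minimal, illustrative sketch only: it handles a single image pair with approximate nearest-neighbor descriptor matching, the Lowe ratio test, and RANSAC-based homography estimation, and it omits the bundle adjustment, gain/exposure compensation, and multi-band blending used in the actual methodology. It assumes OpenCV and NumPy; since SURF requires the opencv-contrib build, ORB is substituted here as a freely available detector, and the file paths, parameter values, and the register_pair helper are hypothetical.

# Illustrative sketch only: pairwise registration for crack change detection
# (keypoints -> k-d tree style matching -> RANSAC homography -> warping).
# ORB stands in for SURF; names and parameters are assumptions for the example.
import cv2
import numpy as np

def register_pair(prev_path, curr_path):
    prev_img = cv2.imread(prev_path, cv2.IMREAD_GRAYSCALE)
    curr_img = cv2.imread(curr_path, cv2.IMREAD_GRAYSCALE)

    # Keypoints and binary descriptors for both epochs.
    detector = cv2.ORB_create(nfeatures=4000)
    kp1, des1 = detector.detectAndCompute(prev_img, None)
    kp2, des2 = detector.detectAndCompute(curr_img, None)

    # FLANN-based approximate nearest-neighbor matching (LSH index for binary
    # descriptors) instead of an exhaustive, quadratic pairwise search.
    index_params = dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1)
    matcher = cv2.FlannBasedMatcher(index_params, dict(checks=50))
    pairs = matcher.knnMatch(des1, des2, k=2)

    # Lowe ratio test keeps only distinctive correspondences.
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    if len(good) < 4:
        raise RuntimeError("Not enough matches to estimate a homography.")

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC rejects outlier correspondences while fitting the homography.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Warp the earlier image into the current view; differencing the two
    # registered crack maps can then reveal changes in crack geometry.
    h, w = curr_img.shape
    return cv2.warpPerspective(prev_img, H, (w, h)), curr_img

In practice the full methodology described in the abstract refines such pairwise estimates jointly (bundle adjustment) and blends the registered images before comparing crack properties.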
This work also studies the condition assessment of public sewer pipelines. The visual-bags-of-words model was evaluated for classifying defective and non-defective sewer pipeline images using two feature descriptors. Three classical classifiers were trained and tested on a moderate-sized dataset of 14,404 images. The experimental results demonstrate that, given the moderate dataset size, the classification accuracy of the visual-bags-of-words model is satisfactory and comparable to deep learning methods.

Lastly, defect detection on the three-dimensional surfaces of mechanical parts is studied. A preliminary vision-based, semi-autonomous spatio-temporal method is proposed to detect, locate, and quantify defects such as loose bolts, displacements, pipe chafing, or deformation. In addition, a probabilistic reliability quantification method based on ensemble averaging of Cloud-to-Cloud (C2C) distances is introduced for mechanical systems. Several quantitative and qualitative examples illustrate the capabilities of the proposed method. The results show that the proposed method is promising and robust in registering complex shapes and in detecting and locating changes in mechanical systems.
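As a rough, non-authoritative illustration of the ensemble-averaged Cloud-to-Cloud (C2C) idea mentioned above, the sketch below computes nearest-neighbor C2C distances between two already-registered point clouds using a k-d tree and averages them over random subsamples. It assumes NumPy and SciPy; the array names, subsample sizes, and the c2c_change_score helper are illustrative rather than the dissertation's actual implementation.

# Illustrative sketch only: ensemble-averaged C2C distances between a baseline
# and a current point cloud, with the spread across trials as a crude
# reliability indicator. Names and parameters are assumptions for the example.
import numpy as np
from scipy.spatial import cKDTree

def c2c_change_score(baseline, current, n_trials=20, sample_size=5000, seed=0):
    """baseline, current: (N, 3) arrays of already-registered 3D points."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(baseline)              # nearest-neighbor index over the baseline cloud
    trial_means = []
    for _ in range(n_trials):
        # Random subsample of the current cloud for this trial.
        idx = rng.choice(len(current), size=min(sample_size, len(current)), replace=False)
        dists, _ = tree.query(current[idx], k=1)   # C2C distances for the subsample
        trial_means.append(dists.mean())
    trial_means = np.asarray(trial_means)
    # The ensemble mean indicates the overall change magnitude; the spread
    # across trials serves as a crude confidence (reliability) measure.
    return trial_means.mean(), trial_means.std()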
Asset Metadata
Creator
Aghalaya Manjunatha, Preetham (author)
Core Title
Vision-based and data-driven analytical and experimental studies into condition assessment and change detection of evolving civil, mechanical and aerospace infrastructures
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Civil Engineering
Degree Conferral Date
2022-05
Publication Date
01/06/2022
Defense Date
12/13/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
condition assessment,crack change detection,crack localization,deep learning-based crack segmentation,hybrid crack segmentation,mechanical systems defect detection and quantification,OAI-PMH Harvest,sewer pipe condition assessment,synthetic crack generation
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Masri, Sami F. (committee chair), Nakano, Aiichiro (committee member), Wellford, Carter L. (committee member)
Creator Email
aghalaya@usc.edu,preethamam@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110455243
Unique identifier
UC110455243
Legacy Identifier
etd-AghalayaMa-10331
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Aghalaya Manjunatha, Preetham
Type
texts
Source
20220112-usctheses-batch-907 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu