RANSAC-based semi-supervised learning algorithms for partially
labeled data, with application to neurological screening from
eye-tracking data
by
Po-Hsuan Huang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF NEUROSCIENCE
(NEUROSCIENCE)
May 2024
Copyright 2024 Po-Hsuan Huang
Acknowledgements
I would like to express my deepest gratitude to my advisor, Dr. Laurent Itti, for his unwavering support and invaluable guidance throughout this research project. His expertise and
mentorship were instrumental in shaping this work.
I would also like to thank my research collaborators, Dr. Chen Chang, Dr. Alma Gharib,
Dr. Munoz Douglas, and Dr. Pat Levitt, for their dedication and contributions to our collaborative efforts. Their insights and teamwork greatly enriched the quality of this research.
I would like to thank all members of Ilab, particularly Zachary Murdock, Amanda Rios,
Adam Jones, and Kiran Lekkala, for your input, inspiration, and moral support.
I would like to thank my family. I could not have finished this journey without your support.
Financial support for this study was provided by the NIH, and I am immensely thankful
for their generous funding, which made this research possible.
This work would not have been possible without the collective contributions and support
of these individuals and organizations.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1: Introduction 1
1.1 General related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Previous works on Semi-Supervised Learning . . . . . . . . . . . . . 4
1.1.2 Related work in RANSAC . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Related work in Medical ML . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Experiment series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Semi-supervised RANSAC algorithms for partially labeled data
sets: primitive investigation with synthetic 2D manifolds 10
2.1 Experiment 1.1: qualitative analysis of model robustness . . . . . . . . . . . 10
2.1.1 Exp1.2 Test the three algorithms with three different base classifiers . 15
2.1.2 Exp1.3 Test the three algorithms with various ratios of mislabeling . 18
2.1.3 Exp1.4 Visualize the decision functions after every training iteration 22
Chapter 3: Investigation of partially labeled data sets with ground-truths:
detection of early neurodegenerative diseases with Semi-supervised
RANSAC SVM with eye-tracking, free-viewing data set. 27
3.1 Experiment 2: Sex Prediction of Infants in Toxic Stress dataset . . . . . . . 27
3.1.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 4: Investigation of partially labeled data sets with weak labels.
Risk prediction with Semi-supervised RANSAC SVM for Adverse Childhood Experience induced Toxic Stress with eyetracking, free-viewing data sets 35
4.1 Experiment 3: Age group prediction of infants in Toxic Stress dataset . . . . 35
Chapter 5: Summary and Discussion 40
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Figures
1.1 A two-dimensional illustration of RANSAC robustly estimating
parameters of a linear model by rejecting outliers. The initial samples
are the two encircled orange dots on the solid line. The linear model was
fitted with the two orange dots. Then, the linear model generates tolerance
and predicts classes of the unlabeled data (marked with a dashed line). Data
points that fall within the tolerance are considered inliers (orange), and outliers (blue) are excluded. Inliers will be incorporated in the training of the
next round. The solid line will gradually align with the linear distribution
of the large unlabeled set every iteration. We repeat the procedure until the
result meets the criteria of termination. . . . . . . . . . . . . . . . . . . . . 3
1.2 Summarizing diagram of experiment series . . . . . . . . . . . . . . . . . . . 9
2.1 Exp1.1 Label Domain Analysis left: We analyze the accuracy of SSL
RANSAC2 on the whole domain of label quality defined by the test-split ratio
and mislabeling ratio. The higher the test-split ratio, the sparser the labeled
data. Therefore, more generalization power will be needed. The higher the
mislabeling, the worse the fitting will be. Therefore, some outlier detection
mechanisms will be necessary. In the upper row, we inspect the performance
of the proposed RANSAC2 in an easier label quality domain. In the lower
row, we inspect the performance of the proposed RANSAC2 in a more difficult
label quality domain. center: A three-layer multi-layer perceptron model of size 2-10-10-2 is used for evaluation. Three synthesized data distributions are tested for binary classification. The moon-shaped distribution and the two-circle distribution have distinct decision boundaries, while the overlapped distribution does not have a distinct decision boundary. Blue indicates high accuracy, and red indicates low accuracy. The three columns from left to
right show the results of supervised training with clean labels, the supervised
method trained with noisy, sparse labels, and the semi-supervised method
trained with noisy, sparse labels. The results suggest the semi-supervised
RANSAC2 improves performance in both the easy and difficult label quality
domains. right: Constrained Support vector machine classifiers are trained
in three different scenarios. The three columns from left to the right show
the results of supervised training with clean labels, the supervised method
trained with noisy, sparse labels, and the semi-supervised method trained
with noisy, sparse labels. The results showed no distinct difference between
supervised and semi-supervised training in easier cases. However, the results
slightly improved in the more difficult case. . . . . . . . . . . . . . . . . . . 14
2.2 Exp 1.2 Robustness to different datasets and base classifiers of SSL
RANSAC algorithms.(a) Three SSL RANSAC + SVM algorithms trained
with 80% of labeled data and 20% of unlabeled data, and the base classifier trained with all labels are compared. Labels are intentionally mislabeled by
10%. The SSL RANSAC1 performs the best of all three SSL algorithms in
all four tested datasets. It even outperforms the base SVM trained with all
labels on two moons and four blobs datasets. (b) Three SSL RANSAC +
KNN algorithms trained with 80% labeled data and 20% unlabeled data and
base KNN classifier trained with all labels are compared. Labels are intentionally mislabeled by 10%. The base KNN trained with all data performed
the best. SSL RANSAC 1 + KNN and SSL RANSAC 2 + KNN performed
similarly well. SSL RANSAC 3 + KNN performed the worst. (c) The three SSL RANSAC + MLP algorithms trained with 80% of labeled data and 20% of unlabeled data, and the base classifier trained with all labels, are compared. Labels are intentionally mislabeled by 10%. SSL RANSAC 1 + MLP outperformed the base MLP trained with all labeled data in two-moons, four
blobs, and two blobs datasets. SSL RANSAC 2 + MLP performed the best
in the two-circles dataset. The experiment shows SSL RANSAC 1 is the most reliable across various classifiers and datasets. . . . . . . . . . . . . . . . . . . . . . 18
2.3 Exp 1.3 Average precision scores of RANSAC algorithm with various mislabelling ratios. The bar graphs from left to right show the average
accuracy of RANSAC algorithms with 0%, 10%, and 30% mislabeling rates.
In each graph, there are four synthesized data sets evaluated. The blue bar
denotes supervised learning, the orange bar denotes RANSAC1, the green
bar denotes RANSAC2, and the red bar denotes RANSAC3. Comparing
three bar graphs, we found RANSAC1 outperforms plain SVM, RANSAC2,
and RANSAC3 as the mislabelling rate increases. RANSAC2 and RANSAC3
perform similarly or worse than the plain SVM in some synthetic data sets,
such as the two-blob dataset and the two-moon dataset. . . . . . . . . . . . 21
2.4 Exp 1.4 Iteration steps of RANSAC algorithms. From left to right:
Support vector machines, K-nearest neighbors, and multi-layer perceptron (MLP).
For each panel, five scenarios are compared. From left to right are a supervised base classifier trained with labeled data, RANSAC1 with a fraction of
labeled data, RANSAC2 with a fraction of labeled data, RANSAC3 with a
fraction of labeled data, and supervised with all labeled data. The first row
shows the initial training set. The blue and red color map shows the confidence score prediction of the whole space. It is clear from all SVM, KNN, and
MLP classifiers that RANSAC1 can form a more crisp and accurate decision
boundary and confidence score with very limited labeled data. On the other
hand, RANSAC2 and RANSAC3 show only slight improvements, agreeing
with our observation in Figure 2.2 and Figure 2.3 . . . . . . . . . . . . . . . 26
3.1 An overview of the pipeline of our proposed model for toxic stress
risk prediction. The pipeline consists of three major steps. (a) Feature extraction: The video clippets are processed to extract oculomotor features (saccade peak velocity, saccade velocity, saccade acceleration, fixation duration, and pupil size) and saliency features (color, flicker, intensity, motion, orientation) at each attention time stamp, which is defined as the end of a saccade. In total, there are 15 channels and T attention time stamps, so the size of the data is 15 × T. (b) Dimensionality reduction: The 2-dimensional features are then summarized with a histogram to reduce the size T to a fixed value. After rescaling and extremum removal, the summarized histogram reduces the temporal dimension and only represents the distribution of responsiveness to each feature. We assign 9 bins for each channel, and the total size of the feature becomes 15 × 9. (c) RANSAC classification: We deploy different RANSAC algorithms to remove outliers from the final training set. The base classifiers predict the confidence value of each label. If the confidence of a prediction is high for a sample, we include the sample and the predicted label in the training set of the next iteration. The RANSAC process is repeated after validation until the results meet the convergence criteria. . . . . . . . . . . 29
3.2 Mean average precision (mAP) of sex prediction with saliency (top)
and oculomotor features (bottom) with RANSAC1 with SVM as base
classifier across age groups from 2 months old to 24 months old. As we can
see, the plain SVM (light green line) only does slightly better than no-skill
(dashed blue line), where classes are predicted with Naive Bayes algorithm.
RANSAC1 improves the mAP significantly (blue line) for all age groups of
both feature modalities. Training RANSAC with combined labeled data of all
age groups (orange line) further improves the mAP of the 2-month-old and 6-month-old groups significantly when using saliency features. When training RANSAC1 with combined data from oculomotor features, the improvement was still significant, but mAPs were lower than those from saliency features . 33
4.1 Results of experiment 3: Average precision of prediction of age
RANSAC algorithms with SVM-based classifiers are used to make binary
classifications of ages at 2, 6, 9, 12, 18, 24 months old. Models trained with
saliency features (left panel) and oculomotor features (right panel) were both evaluated. We used 10% of the data set as the labeled set and the remaining 90% as the unlabeled set for semi-supervised training. Supervised SVM trained with all labeled data (100% SL) and with only 10% of labeled data (10% SL) were also compared. A Naive Bayes model fit with 10% of the data set was also used
as a no-skill baseline. Left panel: The graph shows the mean accuracies
of RANSAC algorithms in predicting age with saliency features. RANSAC2,
RANSAC3, SVM 10% labels, and SVM 100% labels all performed similarly
to the naive Bayes model, while RANSAC1 performs significantly better.
Right panel: The graph shows the mean accuracies of RANSAC algorithms
compared with SVM with 10% labels, SVM with all labels, and Naive Bayes
with 10% labels. RANSAC1 performs as well as SVM trained with all labels for the 2-month-old group and outperformed SVM trained with all labels for all the other age groups. However, plain SVM trained with 10% of labels performed better than RANSAC2 and RANSAC3 in the 2-month-old group, consistent with the results of previous experiments. . . . . . . . . . . 38
Abstract
Recognizing, identifying, and detecting chronic neurological disorders at an early stage with
the assistance of machine learning (ML) is expected to reduce the cost of medical treatment
significantly. One barrier, however, is that ML typically requires large amounts of accurately
labeled training data, which may not always be available. We propose three semi-supervised
learning algorithms that are robust to partially mislabeled data and that only require a few
data labels. The method combines self-training, ensemble learning, and RANSAC (random sample consensus) to iteratively propagate from a small labeled dataset to a much larger
pseudo-labeled dataset that is sufficiently large for ML training. First, we conducted qualitative and quantitative analysis of the proposed algorithms on synthesized datasets. Then,
we analyzed two eye-tracking, free-viewing datasets. The first dataset included six hundred
patients tested for five possible neurodegenerative disorders. Our approach provided 20%
better classification accuracy than 4 baseline algorithms in detection of Parkinson’s Disease.
The second dataset included 131 infants (age 2 to 24 months) tested for possible exposure to
adverse childhood experience (ACE). Because this dataset does not have objective groundtruth labels (which would require knowing whether an infant has indeed experienced abusive
or negligent behaviors), we first confirmed applicability of our algorithm by asking it to predict sex and age (for which ground-truth labels are available), which it was able to do with 90% and 42% accuracy, respectively. To then evaluate our algorithm on possible ACEs, we compared
its predictions to a composite score based on behavioral and socio-environmental data from
the infants’ parents.
Chapter 1
Introduction
Machine learning has achieved great improvements in medical fields in recent years with the advent of computer technology and the availability of large databases. For example, convolutional neural networks (CNNs) have been applied to assistive diagnosis with enormous amounts of medical image data from X-ray, fMRI, CT, PET, or ultrasound scanning. (Sarvamangala and Kulkarni, 2022) Recurrent neural networks (RNNs) have been applied to predict heart failure from substantial digital health records. (Maragatham and Devi, 2019) However, the lack of a large amount of accurate labels still plagues most medical machine learning
models. Semi-supervised learning (SSL) has regained attention in recent years due to its
ability to alleviate the label shortage and the cost of human labeling that overrun most
medical datasets. Generally speaking, semi-supervised learning trains a model with both
the small labeled dataset and a large unlabeled dataset to improve prediction accuracy. The labeled data are used to train a model, and this model is then used to make predictions on
the much larger set of unlabeled data. These predictions can be used to assign labels to
the unlabeled data, which can be incorporated into the model training process. By using
both labeled and unlabeled data, semi-supervised learning can help to increase the amount
of training data available to the model, thereby improving its accuracy and reducing the
impact of label shortages. For example, a study showed their SSL model outperformed the
state-of-the-art supervised learning models in medical image detection. (Wang et al., 2020)
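To make the generic self-training loop described above concrete, the following Python sketch illustrates pseudo-labeling with scikit-learn's SelfTrainingClassifier on a toy dataset. This is only an illustration of standard self-training, not the RANSAC-based algorithms proposed in this dissertation; the toy data, labeled fraction, and confidence threshold are arbitrary assumptions.

# Generic self-training sketch (illustration only; not the proposed SSL RANSAC method).
# The toy data, labeled fraction, and confidence threshold are arbitrary assumptions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Keep labels for roughly 10% of the samples; mark the rest as unlabeled (-1).
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

# The base classifier must expose predict_proba for confidence-based pseudo-labeling.
base = SVC(kernel="rbf", gamma="scale", probability=True)
model = SelfTrainingClassifier(base, threshold=0.8)  # accept pseudo-labels above 0.8 confidence
model.fit(X, y_partial)
print("accuracy on all points:", model.score(X, y))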
Despite its advantages, the shortage of labels is often compounded by the lack of exact ground
truths in several medical disciplines. Not having exact ground truths is common in medical
diagnosis, e.g. diagnosis of rare diseases, diagnosis of psychological disorders, and prognosis
of neurological disorders. It can be due to rare occurrences of the diseases, unclear pathology,
lack of fine-grained description of symptoms, limited availability of accurate diagnostic tools,
the subjective interpretation of symptoms by medical professionals, etc. (Rath et al., 2017)
Without an inherently definitive taxonomic boundary, the diagnostic criteria are often based on experts' consensus. Not only does there exist a considerable amount of bad labels in realistic medical datasets, but the small size of the dataset also makes the variability among patients non-negligible. The variability can be statistically described as outliers if they
are beyond the error tolerance of a machine learning model.
The noticeable presence of mislabeling and outliers in the initially labeled dataset can
be amplified during the training process of SSL, which can severely impact model efficacy.
As a result, medical datasets that rely on these diagnoses can be highly inconsistent and
inaccurate, leading to misdiagnosis and inappropriate treatment.
With an eye on these obstacles in medical machine learning, a robust SSL algorithm that
can handle a high percentage of mislabeling and outliers becomes sought after. Random Sample Consensus (RANSAC) (Fischler and Bolles, 1981) is a widely used technique that
focuses on the robust estimation of model parameters when presented with a high percentage
of outliers in a training set. For example, it has been used in image matching (Hossein-Nejad
and Nasri, 2017), point cloud registration (Koguciuk, 2017), and vanishing points estimation
(Wu et al., 2021). RANSAC was first used in fitting problems that only require a few correct
samples to determine the parameters of a model. Most shallow classifiers are sensitive to
anomalies and usually require manual data cleansing to remove outliers. However, manually
inspecting and cleansing training data can be difficult and costly. As shown in Figure 1.1,
the linear regression cost function is sensitive to outliers in the training set, which leads to an aberrant fitted line that is far from the underlying ground-truth pattern if the blue dots (outliers) are also included in the training set. By randomly selecting samples from the training set, we get the chance to obtain a clean subset free of outliers that ultimately helps us
to find the correct fit with the minimum possible standard deviation.
The operational procedure is outlined below.
Outline of RANSAC procedure
1: Randomly select the minimum number of points required to determine the model
parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the dataset fit the model within a predefined tolerance.
4: If the fraction of the number of inliers over the total number of points in the set
exceeds a predefined threshold τ, re-estimate the model parameters using all the
identified inliers.
5: Otherwise, repeat steps 1 through 4.
6: Stop when the termination criteria are met.
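A minimal numerical sketch of this outline for the line-fitting case of Figure 1.1 is given below; the tolerance, iteration count, and inlier fraction are illustrative values, not parameters used in this work.

# Minimal sketch of the RANSAC outline above for a 2D line-fitting problem
# (as in Figure 1.1); tolerance and iteration counts are illustrative.
import numpy as np

def ransac_line(x, y, n_iters=100, tol=0.5, inlier_frac=0.6, rng=None):
    rng = np.random.default_rng(rng)
    best_params, best_inliers = None, np.zeros(len(x), dtype=bool)
    for _ in range(n_iters):
        # 1) randomly pick the minimum number of points (two) that define a line
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[j] == x[i]:
            continue
        # 2) solve for the model parameters (slope and intercept)
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # 3) count points that fit within the predefined tolerance
        inliers = np.abs(y - (slope * x + intercept)) < tol
        # 4) if enough inliers, re-estimate the model using all identified inliers
        if inliers.sum() > inlier_frac * len(x) and inliers.sum() > best_inliers.sum():
            best_params = np.polyfit(x[inliers], y[inliers], deg=1)
            best_inliers = inliers
    return best_params, best_inliers

# toy data: points on a line plus gross outliers
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, 60)
y[:15] += rng.uniform(5, 15, 15)              # inject outliers
params, inliers = ransac_line(x, y, rng=1)
print("slope, intercept:", params, "| inliers kept:", inliers.sum())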
Figure 1.1: A two-dimensional illustration of RANSAC robustly estimating parameters of a linear model by rejecting outliers. The initial samples are the two
encircled orange dots on the solid line. The linear model was fitted with the two orange
dots. Then, the linear model generates tolerance and predicts classes of the unlabeled data
(marked with a dashed line). Data points that fall within the tolerance are considered inliers
(orange), and outliers (blue) are excluded. Inliers will be incorporated in the training of
the next round. The solid line will gradually align with the linear distribution of the large
unlabeled set every iteration. We repeat the procedure until the result meets the criteria of
termination.
In the identification of rare diseases or early identification of chronic diseases, we conjecture that RANSAC can be used to identify and exclude misdiagnosed cases or cases with inconsistent symptoms. By removing these outliers, RANSAC might be able to improve the
quality of the remaining data, which can be used to train and validate the robustness and
accuracy of the model.
1.1 General related work
1.1.1 Previous works on Semi-Supervised Learning
Semi-supervised learning encompasses a wide variety of techniques that extract information from unlabeled datasets to go beyond the prediction accuracy of supervised learning trained with a small labeled dataset. Some semi-supervised learning
techniques aim at extracting extra information by recognizing patterns (manifold) in the
large unlabeled data set. Some semi-supervised learning techniques focus on improving the
robustness of assigning labels with an ensemble of hypotheses. These techniques can sometimes be used together to complement one another in practice. The following categories are
some of the most commonly used techniques.
Graph-based semi-supervised learning: Graph-based models, such as Label-Propagation
proposed by Zhu and Ghahramani (Zhu and Ghahramani, 2002), formulate the relations between the data points of a dataset as a graph, with the edge weights representing the similarity between nodes. Label Propagation propagates labels from labeled nodes to unlabeled nodes by growing a minimum spanning tree and dividing the classes with a minimum cut. Laplacian SVMs (LapSVMs) (Belkin et al., 2006) find the best cut on the
manifold generated through spectral decomposition.
Transductive learning: A subset of semi-supervised learning algorithms targets transductive learning, where the test set is not held out but is used as an unlabeled working set. Normal semi-supervised SVMs use statistical risk minimization (SRM) by assuming the labeled training set is big enough and resembles the distribution of the unlabeled working set. In transductive learning, the risk is estimated at each data point in the working set, which may improve the accuracy significantly compared to SRM. (Vapnik, 1998) Transductive SVMs (TSVMs) (Joachims, 1999) and semi-supervised SVMs (S3VMs) (Bennett and Demiriz, 1998) assign labels to the working set to minimize the misclassification
error in the loss function.
Disagreement learning: Co-training and multi-view training (Blum and Mitchell, 1998) (Kumar and Daumé III, 2011) (Xu et al., 2013) utilize diverse feature domains and exchange complementary information to alleviate biases. However, the assigned pseudo labels from other domains might not always converge to an improvement if the views or feature
partitioning are not sufficiently compatible.
Self-training: Self-training models iteratively predict and evaluate the unlabeled working set to improve accuracy and reduce the required amount of labeled data. (Yarowsky, 1995)
1.1.2 Related work in RANSAC
Most semi-supervised machine learning models adopt classical fitting techniques such as
least squares, assuming the systematic errors will be 'smoothed out' as the labeled set grows. The assumption often does not hold in many practical machine learning problems, where a single outlier can destroy the estimated model parameters and the prediction accuracy. Fischler and Bolles proposed RANSAC (RANdom SAmple Consensus),
which provides a mechanism to reject outliers in the process of model estimation in location
determination problems in computer vision. (Fischler and Bolles, 1981) They demonstrated
RANSAC was able to estimate the model parameters despite the high percentage of outliers
in datasets. RANSAC has been widely used in computer vision. It is used in problems such
as point cloud registration, image matching, object detection, and video stabilization, where
a high percentage of mislabelling and outliers cannot be smoothed out.
RANSAC was also proposed as a de-noising algorithm that detects and rejects mislabeling and outliers. Debnath et al. (Debnath et al., 2015) proposed to use RANSAC with a Support Vector Machine (RANSAC SVM) to improve the robustness against outliers in
the dataset. They randomly sample multiple subsets from the training set to train multiple
SVM models. They showed that the collective decision of the ensemble of models could
detect very hard or misannotated examples in the dataset. They showed that the model
trained with de-noised labels generalizes better and performs marginally better than the
base classifier trained with clean examples.
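As a rough sketch of this random-subset ensemble idea (not the exact procedure of Debnath et al.), the function below trains several SVMs on random subsets and flags samples whose given label the ensemble rarely confirms; the number of models, subset size, and agreement threshold are arbitrary assumptions.

# Rough sketch of the random-subset SVM ensemble idea for flagging possibly
# mislabeled samples; an illustration, not the exact procedure of Debnath et al. (2015).
import numpy as np
from sklearn.svm import SVC

def flag_suspect_labels(X, y, n_models=25, subset_frac=0.5, agree_thresh=0.8, seed=0):
    rng = np.random.default_rng(seed)
    agree = np.zeros(len(y))
    for _ in range(n_models):
        idx = rng.choice(len(y), size=int(subset_frac * len(y)), replace=False)
        clf = SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx])
        agree += (clf.predict(X) == y)   # count how often each given label is confirmed
    agree /= n_models
    return agree < agree_thresh          # True for samples whose label is rarely confirmed

Samples flagged this way could then be inspected or excluded before supervised training.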
Nishida and Kurita (Nishida and Kurita, 2008) used the RANSAC algorithm to solve the scalability problem of SVM. Hyperparameter estimation of SVM is challenging in high-dimensional feature spaces due to the computational and storage requirements. They used RANSAC with SVM to find the most generalizable subset of samples to solve the problem
of hyperparameter estimation.
In general, RANSAC utilizes a self-consistency assumption either to remove outliers or to find the most generalizable examples. This is different from the continuity and smoothness assumptions underlying manifold-based methods.
1.1.3 Related work in Medical ML
There is a long history of research using supervised machine learning in prognosis, diagnosis,
treatment, and clinician workflow. Different algorithms, such as SVMs, Bayesian networks, and MLPs, have been used in computer-aided diagnosis of diseases such as heart disease, diabetes, liver disease, and hepatitis. The analysis showed there is no one-size-fits-all algorithm for all disease diagnoses. The best algorithm largely depends on the nature of the dataset, which is predominantly determined by the availability of diagnostic tools (Fatima
et al., 2017). In recent years, the inventions of deep learning, such as convolutional neural
networks, have led to a plethora of studies in imaging-based diagnosis. (Aggarwal et al.,
2021) In the prediction of psychoneurological diseases, eye-tracking-based diagnostics show great progress. Tseng et al. extracted eye-gaze signatures such as oculomotor movements and visual saliency (Tseng et al., 2013b) to predict attention-related neurological disorders. The
eye-tracking-based SVM achieved 89.6% accuracy in the classification of Parkinson's Disease (PD) versus age-matched controls, and 77.3% accuracy in the classification of attention deficit hyperactivity disorder (ADHD) versus fetal alcohol spectrum disorders (FASDs) versus control children. Zhang et al. conducted a cost-effectiveness analysis among eye-tracking-based, psychometric-based, and fMRI-based assessments in FASD diagnosis, and suggested that eye-tracking assessments can be cost-effective enough to be applied to large-scale screening. (Zhang et al., 2019) Wang et al. found increased pixel-level saliency but decreased semantic-level saliency in autism spectrum disorder (ASD) in eye-tracking free-viewing tasks. (Wang et al., 2015)
However, the availability of high-quality data remains one of the key challenges. Most
machine learning studies predict the diagnosis retrospectively. Currently, there is no direct proof that the accuracy of these models can be transferred to patient prognosis before the onset of the diseases. In many tasks, such as the prognosis of neurodegenerative diseases, obtaining the 'ground truth' may be impossible because, for instance, a necessary diagnostic test was not obtained. (Rajkomar et al., 2019) The challenges in defining and measuring diagnostic errors are now emerging as a prominent problem in prognosis because diseases and their manifestations often evolve over time. Studying diagnostic errors retrospectively can be difficult because the physician's deliberation also evolves over time. (Zwaan
and Singh, 2015)
Semi-supervised learning methods were used to overcome the problem of the small size
of well-curated datasets by learning the manifold of a large unlabeled dataset. Zhao et al. proposed using a graph-based algorithm and label propagation for the prediction of the diagnosis of Alzheimer's Disease. Their study on simulated datasets shows better sensitivity and specificity than other graph-based semi-supervised learning algorithms. However, the method does not take into account scenarios with a high chance of unknown mislabeling and a high probability of outliers. It has not been shown that the result can be applied to the prognosis of Alzheimer's Disease. (Zhao et al., 2014) A plethora of other studies attempt to
use semi-supervised learning in various domains of disease prediction. However, they typically assume the availability of a small, clean training set consisting mostly of hard labels (labels that are not only ground truth but also generalizable), that the large unlabeled dataset will smooth out the noise, and that regularization will eliminate the bias. Some even require the direct
intervention of experts through active learning. (Qu et al., 2022) These studies suggest a
need for an algorithm tailored specifically to deal with low-quality datasets consisting of a
high percentage of mislabeling and outliers.
1.1.4 Experiment series
First, we investigated the performance of the algorithms on synthesized 2D datasets with various ratios of labeled data and various ratios of mislabeling to test the robustness of the proposed algorithms. Then, we tested the three algorithms with three different base classifiers based on four different manifold propagation assumptions. Next, we tested the robustness of the SSL RANSAC algorithms against different mislabeling rates. After that, we visualized the decision functions after every training iteration to demonstrate that the algorithms perform the label propagation as we expected. In the second part, we evaluated our algorithms in an eye-tracking feature space. First, we evaluated our algorithm on a binary classification task in which the labels are inherently binary ground truths (male, female). We evaluated the average accuracy scores in a sex classification test. Second, we investigated the algorithm on a binary classification task where the labels are based on non-binary ground truths (age groups). We evaluated the average precision scores in an age classification test. Third, we evaluated the performance of the algorithm on a binary classification task where the labels are based on a gold standard derived from an expert's diagnosis. We evaluated the average precision score on the ONDRI data set. The data set consists of a population at high risk of neurodegenerative disorders. Finally, we used our algorithm to evaluate the average precision score in a binary
classification task where the labels are based on psychology survey questions. We evaluated
the average precision score on CHLA and BCH toxic stress data sets.
Figure 1.2: Summarizing diagram of experiment series
Chapter 2
Semi-supervised RANSAC algorithms for partially labeled
data sets: primitive investigation with synthetic 2D manifolds
2.1 Experiment 1.1: qualitative analysis of model robustness
2.1.0.1 Objective and Experiment Design
Outliers and mislabeling in a small dataset can have detrimental effects on machine learning
classifiers in classification tasks on realistic medical datasets. In this experiment, we investigate the robustness of our SSL RANSAC algorithm 2 in a noise space spanned by two common sources of noise. This 2D noise space analysis helps elucidate the general robustness behavior of a denoising algorithm, because the two sources of noise can have a compounding effect in realistic medical datasets. We manipulated the mislabeling rate to test the robustness of the SSL RANSAC algorithms against label noise. We also manipulated the number of initial training samples, via the split ratio of labeled to unlabeled samples, to test the SSL RANSAC algorithm's robustness against the intra-class variability of a small training set. To test the robustness of the SSL RANSAC algorithm against different degrees of data manifold separability, we designed three datasets that represent three types of manifold separability: manifolds that are linearly separable, radially separable, and inseparable. We tested the SSL RANSAC algorithms under the above-mentioned conditions with two different base classifiers to investigate how SSL RANSAC affects performance with different classifiers.
2.1.0.2 Data preparation
We created three different binary classification datasets with different shapes of data distribution. Two of the synthesized datasets contain clear 1-D manifolds. One dataset is
only partially separable. The two-moon data set consists of two U-shaped distributions,
with their openings facing each other and their legs interleaving. The two-circles data set
consists of a big circle enclosing a smaller circle. The two-blobs data set consists of two
partially overlapping distributions, making them only partially separable. A small amount
of variation is added to the distributions to make them look more realistic. The results are
shown in Figure 2.1 b. We created an array of synthesized 2D datasets by varying the mislabeling ratio between 10% and 50% in 10% intervals, and the split ratio of labeled to unlabeled data between 5% and 95% in 10% intervals, as shown in the top panel of Figure
2.1 a. The mislabeling is randomly sampled and class-balanced. In total, 300 data points
were used. We measured the average precision score of each combination of mislabeling rate
and labeled-unlabeled split ratio with five-fold cross-validation. The mean average precision
scores (mAP) are plotted as heat maps to analyze the robustness of the SSL RANSAC in
datasets of various qualities. The average precision score is defined as

AP = \sum_{n=1}^{N} (R_n - R_{n-1}) P_n

where P_n and R_n are the precision and recall at the n-th threshold, and N is the number of thresholds.
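This definition corresponds to the average precision implemented in scikit-learn; the small example below (with made-up labels and scores) shows both the library call and the explicit sum over the precision-recall curve.

# Illustrative computation of average precision; the labels and scores are made up.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true  = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.9, 0.65, 0.4, 0.8, 0.3, 0.2, 0.7])

ap = average_precision_score(y_true, y_score)

# explicit sum of (R_n - R_{n-1}) * P_n over the thresholds
precision, recall, _ = precision_recall_curve(y_true, y_score)
ap_explicit = -np.sum(np.diff(recall) * precision[:-1])
print(ap, ap_explicit)   # both give the same value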
We selected two widely used classifiers, SVM and MLP, as our base classifiers. We
chose the RBF kernel for the SVM and selected the parameters C and gamma with grid
search before training. We selected a four-layer MLP classifier with 2, 4, 4, and 2 nodes in its layers as another base classifier. We compared the results of supervised learning of the SVM and MLP trained with clean labels and with noisy labels, and the results of the SSL RANSAC algorithms trained with noisy labeled data plus unlabeled data. We predicted that the mAP of the base classifiers would decrease as the size of the training set decreases and the mislabeling rate increases. We also predicted that SSL RANSAC would show the most significant retention of mAP under extremely noisy conditions, because the base classifiers are robust only to low degrees of data variability and mislabeling.
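The grid search step mentioned above could look like the following sketch; the candidate values for C and gamma and the toy data are assumptions, since the exact grid is not specified here.

# Sketch of the grid search over C and gamma for the RBF-kernel SVM base classifier;
# the parameter grid and toy data are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      scoring="average_precision", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)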
To analyze the denoising effect of SSL RANSAC2 at a high percentage of mislabeling,
we further scrutinize the upper right quadrant of the noise space, as shown in the bottom
panel of Figure 2.1 a. We repeated the same measurement, varying the mislabeling rate from 40% to 50% in 2% intervals and the labeled-unlabeled ratio from 75% to 95% in 5% intervals, to analyze the effect of high intra-class variability. We predicted that the SSL RANSAC 2 algorithm would have a higher mAP than plain SVM and MLP classifiers trained with noisy labels.
2.1.0.3 Results
The upper left panel of Figure 2.1 c shows the mAP heat map in the lower quadrant of the noise space shown in the top panel of Figure 2.1 a. There does not seem to be a big performance drop when training with noisy data, and there seems to be no consistency in the heat map patterns. The SSL RANSAC 2 algorithm helps the MLP handle noise well in the three data distributions. In the upper right panel of Figure 2.1 c, the effect of SSL RANSAC2 was not significant when we used the SVM, as the mAP is already very high. To further examine the effect of RANSAC2, we zoomed in on the difficult scenarios where mislabeling is between 40% and 50% and the unlabeled ratio is between 75% and 95%. The bottom left of Figure 2.1 c shows that mAPs drop significantly when the unlabeled data ratio is between 85% and 95%, corresponding to between 45 and 15 labeled samples, in the two-moon dataset. In the two-circles dataset, the mAP drops significantly when the mislabeling rate is around 46% and the unlabeled ratio increases to 90%. After encapsulating the MLP with SSL RANSAC2, the mAP increased significantly across the whole heat map for both the two-moons and two-circles datasets. In the two-blobs dataset, increasing the mislabeling rate and reducing the number of labeled samples does not seem to deteriorate the mAP much. However, adding SSL RANSAC2 to the MLP deteriorated the overall performance across the noise space. This suggests that SSL RANSAC is dataset-dependent and might not work well on an inseparable dataset. In the bottom-right panel of Figure 2.1
c, we investigate the SVM model qualitatively at mislabeling rates from 40% to 50% and an unlabeled data ratio between 75% and 95%. We noticed a consistent decline in mAP when increasing the percentage of unlabeled data, while increasing the mislabeling rate from 40% to 50% only showed a mild decline in mAP for all three datasets. When combining SSL RANSAC2 with SVM, the mAP in the two-moons and two-circles datasets improved evenly across the whole noise space. SSL RANSAC does not improve the mAP of SVM in the two-blobs dataset.
2.1.0.4 Discussion
There does not seem to be a clear pattern in the decline of mAP for the MLP model when trained with the three datasets. It does poorly in the two-circles dataset, even with RANSAC and without mislabeling, suggesting the model did not converge with a limited amount of training data. When SSL RANSAC2 is applied, the MLP model seems to converge well and is robust to noise coming either from mislabeling or from high variability due to the small labeled training set. This shows SSL RANSAC2 is able to improve robustness against mislabeling and internal variability. SSL RANSAC is able to enhance the robustness of the MLP model under extremely adverse conditions, when the mislabeling rate is close to 50% and the labeled set size is only 5% of the total 300 data points. SVM models are quite robust to mislabeling and do not require a lot of labels. This can be attributed to the slack variables that penalize margin violations. The mAP of SVM only started to degrade when the number of training samples was reduced to 45. SSL RANSAC does not significantly improve the SVM until the number of labeled samples is reduced to 45, or 15%, in the two-moons and two-circles datasets. SSL RANSAC does not improve the mAP when the dataset is inherently inseparable, even with an extra 85% of unlabeled data.
Figure 2.1: Exp1.1 Label Domain Analysis left: We analyze the accuracy of SSL
RANSAC2 on the whole domain of label quality defined by the test-split ratio and mislabeling ratio. The higher the test-split ratio, the sparser the labeled data. Therefore, more
generalization power will be needed. The higher the mislabeling, the worse the fitting will
be. Therefore, some outlier detection mechanisms will be necessary. In the upper row, we
inspect the performance of the proposed RANSAC2 in an easier label quality domain. In the
lower row, we inspect the performance of the proposed RANSAC2 in a more difficult label
quality domain. center: A three-layer multi-layer perceptron model of size 2-10-10-2 is used for evaluation. Three synthesized data distributions are tested for binary classification. The moon-shaped distribution and the two-circle distribution have distinct decision boundaries, while the overlapped distribution does not have a distinct decision boundary. Blue indicates high accuracy, and red indicates low accuracy. The three columns from left
to right show the results of supervised training with clean labels, the supervised method
trained with noisy, sparse labels, and the semi-supervised method trained with noisy, sparse
labels. The results suggest the semi-supervised RANSAC2 improves performance in both
the easy and difficult label quality domains. right: Constrained Support vector machine
classifiers are trained in three different scenarios. The three columns from left to the right
show the results of supervised training with clean labels, the supervised method trained
with noisy, sparse labels, and the semi-supervised method trained with noisy, sparse labels.
The results showed no distinct difference between supervised and semi-supervised training
in easier cases. However, the results slightly improved in the more difficult case.
2.1.1 Exp1.2 Test the three algorithms with three different base
classifiers
2.1.1.1 Objective and Experiment design
An ideal RANSAC algorithm should be able to recognize various manifolds to ensure the
RANSAC algorithms work well with real medical data sets where the data manifolds may
be complex or ambiguous. To find the best of the three proposed SSL RANSAC algorithms in terms of robustness to data distribution and compatibility with commonly used machine learning classifiers, we compared the average precision scores of the three versions of the SSL RANSAC algorithm when trained with three representative classification algorithms on binary classification tasks in four different representative scenarios. We compared the mAP of each SSL RANSAC algorithm when trained with support vector machines (SVM), K-nearest neighbors (KNN), and multi-layer perceptron (MLP) as base classifiers. In addition, we also compared the three proposed SSL RANSAC models with the supervised base classifiers trained with all labeled data. The three SSL RANSAC algorithms are trained with 80% of the labeled data with the same mislabeling rate. The three base classifiers not only represent a wide range of design assumptions but are also among the best performers in various medical classification tasks. For each combination of
SSL RANSAC and base classifiers, we tested the algorithm on four synthesized 2D datasets
that cover a wide range of data manifolds. If the mAP of an SSL RANSAC algorithm is consistently better in all four dataset scenarios across all base classifiers, achieving an mAP of no less than 0.5, that SSL RANSAC algorithm is likely the most robust at recognizing manifolds of unknown geometry in multi-dimensional, noise-ridden datasets.
2.1.1.2 Data preparation
We created four different datasets with different shapes of manifold for the binary classification tasks. The two-moon dataset consists of two new-moon-shaped distributions. Each
moon distribution belongs to one class, with a 10% mislabeling rate. The distribution is illustrated in the top graph of Figure 2.1 b. The second dataset consists of two circles. Each circle belongs to one class, with a 10% mislabeling rate, as shown in the middle graph of Figure 2.1 b. The third dataset consists of two overlapping blobs. Each blob belongs to one class, with a 10% mislabeling rate, as shown in the bottom graph of Figure 2.1 b. The four-blobs dataset consists of four blobs forming two pairs of overlapping blobs. There is no overlap between the two pairs of blobs. Each pair is labeled with a 10% mislabeling rate. For the base model, the whole dataset is labeled. For the semi-supervised algorithms,
only 80% of the data is labeled.
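For reference, the four dataset shapes described above can be generated with scikit-learn's dataset utilities; the noise levels, blob centers, and the class assignment for the four-blobs case are assumptions, since the exact generation settings are not stated here.

# Sketch of generating the four synthetic 2D datasets with scikit-learn; noise
# levels, blob centers, and the four-blobs class mapping are assumptions.
from sklearn.datasets import make_moons, make_circles, make_blobs

n = 300
two_moons   = make_moons(n_samples=n, noise=0.1, random_state=0)
two_circles = make_circles(n_samples=n, noise=0.05, factor=0.5, random_state=0)
two_blobs   = make_blobs(n_samples=n, centers=[(-1, 0), (1, 0)],
                         cluster_std=1.2, random_state=0)     # partially overlapping
# Four blobs: two well-separated pairs of overlapping blobs; here each pair is
# assumed to contain one blob of each class.
X4, c4 = make_blobs(n_samples=n, centers=[(-4, -1), (-4, 1), (4, -1), (4, 1)],
                    cluster_std=1.0, random_state=0)
four_blobs = (X4, c4 % 2)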
2.1.1.3 Statistical analysis
The average precision score is computed for each trial, and we performed five-fold cross-validation. Finally, the mean average precision score is compared for each group.
2.1.1.4 Results
In Figure 2.2 a, we observed that SSL RANSAC algorithm 1 performed the best out of
the three proposed SSL RANSAC models with all four manifold scenarios when trained
with SVM base classifiers. The SSL RANSAC algorithm 1 outperformed not only the other
two SSL algorithms but also the SVM model trained with all the labels in the two-moon
dataset significantly. SSL RANSAC algorithm 1 also outperformed the two other SSL RANSAC algorithms and performed as well as the supervised SVM trained with all the labels in the two-circles dataset scenario. SSL RANSAC 1 not only outperformed the other two SSL RANSAC algorithms but also significantly outperformed the supervised SVM trained with all the labels in the four-blobs dataset scenario. SSL RANSAC algorithm 1 performed as well as SSL RANSAC algorithm 3, reaching an mAP of 1.0 in the two-blobs dataset scenario, while SSL RANSAC algorithm 2 and the supervised SVM trained with all labels reached an mAP of 0.8. Out of the three SSL RANSAC algorithms, SSL RANSAC 1 is the only one that consistently performed well across all four dataset scenarios.
In Figure 2.2 b, supervised KNN trained with all the labels is the best performer in
all four dataset scenarios. The three SSL RANSAC algorithms performed similarly in the
two-moons dataset, with mAPs around 0.5, lagging significantly behind the supervised KNN trained with all the labels. The three SSL RANSAC algorithms performed similarly in the two-circles dataset scenario, where they all resulted in an mAP of 0.5, which is close to chance. In the four-blobs dataset scenario, SSL RANSAC 1 performs the best, although far from the performance of the supervised KNN. In the two-blobs dataset scenario, SSL RANSAC 1 and SSL RANSAC 2 achieved an mAP of 1.0, matching the benchmark supervised base model.
The SSL RANSAC algorithm 3 performs the worst, with its mAP below 0.5.
In Figure 2.2 c, we compared the mAP of three MLP-based SSL RANSAC models in
four different datasets. SSL RANSAC algorithm 1 performs the best in the two-moons dataset,
with its mAP reaching almost 0.9. The other two SSL algorithms performed similarly to the
supervised MLP.
In conclusion, SSL RANSAC 1 is the most reliable SSL RANSAC algorithm of the
three. It consistently performs better than the other two SSL RANSAC algorithms across 16
combinations of base classifiers and datasets that cover a wide range of probable scenarios.
When combined with SVM or MLP classifiers, it even outperforms the base classifier trained with all labeled data. This suggests that SSL RANSAC 1 utilizes its denoising capability
to achieve better generalization.
2.1.1.5 Discussion
SSL RANSAC 1 consistently achieved mAP larger than 0.5 in the four manifold scenarios
when trained with three classifiers. SSL RANSAC 1 is the only SSL RANSAC algorithm
that provides a significant performance boost when paired with SVM base classifiers. SSL
RANSAC 1 also works well with MLP, beating the supervised MLP by a large margin
in three of the four tested datasets. SSL RANSAC 2 performed slightly better than SSL RANSAC 1 when trained with the KNN base classifier.
Figure 2.2: Exp 1.2 Robustness to different datasets and base classifiers of SSL
RANSAC algorithms. (a) Three SSL RANSAC + SVM algorithms trained with 80% of labeled data and 20% of unlabeled data, and the base classifier trained with all labels, are
compared. Labels are intentionally mislabeled by 10%. The SSL RANSAC1 performs the
best of all three SSL algorithms in all four tested datasets. It even outperforms the base
SVM trained with all labels on two moons and four blobs datasets. (b) Three SSL RANSAC
+ KNN algorithms trained with 80% labeled data and 20% unlabeled data and base KNN
classifier trained with all labels are compared. Labels are intentionally mislabeled by 10%.
The base KNN trained with all data performed the best. SSL RANSAC 1 + KNN and
SSL RANSAC 2 + KNN performed similarly well. SSL RANSAC 3 + KNN performed the
worst. (c) The three SSL RANSAC + MLP algorithms trained with 80% of labeled data and 20% of unlabeled data, and the base classifier trained with all labels, are compared. Labels are intentionally mislabeled by 10%. SSL RANSAC 1 + MLP outperformed the base MLP trained with all labeled data in the two-moons, four-blobs, and two-blobs datasets. SSL RANSAC 2 + MLP performed the best in the two-circles dataset. The experiment shows SSL RANSAC 1 is the most reliable across various classifiers and datasets.
However, most SSL RANSAC algorithms resulted in significant mAP drops compared with the supervised KNN when paired with the KNN base classifier. SSL RANSAC 3, although the most complicated algorithm of the three, performed the worst across all four datasets and all three base classifiers.
2.1.2 Exp1.3 Test the three algorithms with various ratios of mislabeling
2.1.2.1 Objective and Experiment design
It is known that many medical datasets suffer from poor label quality due to a limited understanding of the diseases. A high mislabeling rate can drastically reduce the prediction accuracy
of supervised learning algorithms. We compared the three SSL RANSAC algorithms with
the supervised algorithms to investigate their robustness to different levels of mislabeling in
different datasets. We evaluated the mAP of the three SSL RANSAC algorithms with SVM as the base classifier because it was the best performer in experiment 1.2. We evaluated the SSL RANSAC algorithms on four different 2D synthesized datasets, each representing a unique composition of manifolds. The supervised base classifier is trained with 100% of labeled data, and the SSL RANSAC algorithms are trained with 80% of labeled data along with 20% of unlabeled data. We manipulated the mislabeling rate of the labels and evaluated the performance of the SSL RANSAC algorithms at 0%, 10%, and 30% mislabeling rates. We predict our RANSAC algorithms will enhance the mAP of the base classifiers in these 2D datasets due to their ability to exclude mislabeled samples. Therefore, they can learn the manifold and generalize better than the supervised
base classifier.
2.1.2.2 Data preparation
We created four different datasets with different shapes of manifold for the binary classification tasks. The two-moon dataset consists of two new-moon-shaped distributions. Each
moon distribution belongs to one class. The distribution is illustrated in the top graph in
Figure 2.1 b. The second dataset consists of two circles. Each circle belongs to one class,
as shown in the middle graph in Figure 2.1 b. The third dataset consists of two overlapping
blobs. Each blob belongs to one class, as shown in the bottom graph in Figure 2.1 b. The
four-blobs dataset consists of four blobs with two pairs of overlapping blobs. There is no
overlap between the two pairs of blobs. For the base model, the whole dataset is labeled.
For the semi-supervised algorithms, only 80% of the data is labeled. We randomly mislabeled the ground truths at 0%, 10%, and 30% mislabeling rates in three settings.
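A small helper like the one below could inject the class-balanced random mislabeling used here; the implementation details (binary {0, 1} labels, per-class flipping) are assumptions.

# Helper sketch for injecting class-balanced random mislabeling at a given rate;
# binary {0, 1} labels are assumed.
import numpy as np

def flip_labels(y, rate, seed=0):
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        flip = rng.choice(idx, size=int(round(rate * len(idx))), replace=False)
        y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

# example: corrupt 30% of the labels in each class
y = np.array([0] * 50 + [1] * 50)
y_30 = flip_labels(y, 0.30)
print("fraction flipped:", np.mean(y_30 != y))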
2.1.2.3 Statistical analysis
We computed the average precision score for each trial and performed five-fold cross-validation. Finally, the mean average precision score is compared for each group.
2.1.2.4 Results
The results are shown in Figure 2.3. The base SVM classifier did well when there was no mislabeling in the ground truths. It was trained with all available labels and therefore performed
the best in general. Among the three proposed SSL RANSAC algorithms, SSL RANSAC
1 performed the best in general across the four datasets. It achieved mAPs larger than
0.5 across all four datasets when there was no mislabeling. The SSL RANSAC 2 and SSL RANSAC 3 algorithms performed well in certain datasets. However, their performances were not robust to changes in manifold geometry. In the four-blobs dataset, SSL RANSAC 2 and SSL RANSAC 3 only achieved an mAP of 0.5 when there was no mislabeling. This suggests they are not suitable for realistic medical datasets, where prior knowledge of the manifold is often not available. When the mislabeling rate was 10%, as shown in the middle panel of Figure 2.3, the supervised SVM classifier started to suffer from the mislabeling in the training data. SSL RANSAC 1 performed better by large margins than the base SVM classifier in the two-moon dataset, the four-blob dataset, and the two-blob dataset, even though it only had access to 80% of the data. SSL RANSAC 2 and SSL RANSAC 3 performed either on par with or worse than the base classifier in the four datasets. This suggests that these two algorithms offer very limited advantages over the supervised classifier. When the
mislabeling rate is raised to 30%, as shown in the right panel in Figure 2.3, the benefit of
employing the SSL RANSAC 1 becomes even more noticeable. The mAPs of the supervised
classifier dropped to 0.5 in three of the four datasets, suggesting it fails to learn anything in
most scenarios. The SSL RANSAC 2 and SSL RANSAC 3 did not do much better than the
supervised classifier. SSL RANSAC 1 achieved mAPs larger than 0.7 across the four datasets when the mislabeling rate was 30%, demonstrating its resilience to mislabeling in various scenarios. From the three bar graphs in Figure 2.3, it is obvious that SSL RANSAC 1 offers a great benefit to the SVM classifier's robustness against mislabeling.
Figure 2.3: Exp 1.3 Average precision scores of RANSAC algorithms with various
mislabelling ratios. The bar graphs from left to right show the average accuracy of
RANSAC algorithms with 0%, 10%, and 30% mislabeling rates. In each graph, there are four
synthesized data sets evaluated. The blue bar denotes supervised learning, the orange bar
denotes RANSAC1, the green bar denotes RANSAC2, and the red bar denotes RANSAC3.
Comparing three bar graphs, we found RANSAC1 outperforms plain SVM, RANSAC2, and
RANSAC3 as the mislabelling rate increases. RANSAC2 and RANSAC3 perform similarly
or worse than the plain SVM in some synthetic data sets, such as the two-blob dataset and
the two-moon dataset.
2.1.2.5 Discussion
The results presented in the previous section provide valuable insights into the performance
of different classifiers in the presence of mislabeling. The base SVM classifiers demonstrated
strength when the training data had no mislabeling, benefitting from access to all available
labels.
However, the introduced SSL RANSAC algorithms showed distinct behavior. SSL RANSAC
1 exhibited robust performance across various datasets, even when trained on a reduced subset of the data. This suggests its effectiveness in scenarios where mislabeling is present, and
prior knowledge of the manifold is limited.
Conversely, SSL RANSAC 2 and SSL RANSAC 3 failed to provide significant advantages
over the supervised classifier, indicating their limited utility in the contexts examined. This
points to the importance of selecting the appropriate SSL method based on the dataset’s
characteristics and the potential presence of mislabeling.
The results further emphasized the critical role of SSL RANSAC 1 in enhancing the classifier's robustness to mislabeling. As the mislabeling rate increased, its advantage became
increasingly prominent, and it outperformed both the supervised classifier and the other SSL methods.
These findings highlight the potential of SSL RANSAC 1 in addressing the challenges of mislabeled data, which is often encountered in real-world applications, including in the medical
domain.
2.1.3 Exp1.4 Visualize the decision functions after every training
iteration
2.1.3.1 Objective
The previous two experiments showed that SSL RANSAC 1 offers several advantages over the plain
base classifiers. It is compatible with various base classifiers, making it suitable for realistic
classification tasks when there is insufficient evidence to infer the best base classifier. SSL
RANSAC 1 also performed reliably well across various dataset distributions compared with the
supervised base classifier, so it serves as a good agnostic algorithm when there is no prior
knowledge of the geometry of the data manifold. Moreover, it is robust to a high degree of
mislabeling in the training data, making it suitable for many realistic medical classification tasks. However, these experiments did not show how our proposed SSL RANSAC algorithms achieve
such robustness. Theoretically, SSL RANSAC algorithms can achieve better robustness by
learning the data manifolds better than the supervised methods. We hypothesized that
SSL RANSAC algorithms learn better manifolds by (1) excluding the outliers in the labeled
training set and (2) recruiting unlabeled data into the iterative self-training process through
pseudo-labeling.
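To make these two hypothesized mechanisms concrete, the sketch below shows one way a RANSAC-style self-training loop could be written in Python with scikit-learn: fit the base classifier on random subsets of the labeled data, keep the subset with the best consensus, then repeatedly recruit high-confidence unlabeled points as pseudo-labeled inliers. The function name, subset fraction, confidence threshold, use of decision_function, and stopping rule are illustrative assumptions, not the exact SSL RANSAC 1 procedure.

import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

def ransac_self_train(base, X_lab, y_lab, X_unl,
                      n_trials=20, subset_frac=0.6,
                      conf_thresh=0.8, max_iters=5, rng=None):
    # Hypothetical RANSAC-style self-training sketch (not the exact thesis algorithm).
    rng = np.random.default_rng(rng)
    best_model, best_score, best_idx = None, -np.inf, None
    n_sub = max(2, int(subset_frac * len(y_lab)))
    # RANSAC step: sample random labeled subsets, keep the model with the best consensus
    for _ in range(n_trials):
        idx = rng.choice(len(y_lab), size=n_sub, replace=False)
        model = clone(base).fit(X_lab[idx], y_lab[idx])
        score = model.score(X_lab, y_lab)          # consensus over all labeled points
        if score > best_score:
            best_model, best_score, best_idx = model, score, idx
    # Self-training step: recruit confident unlabeled points as pseudo-labeled inliers
    X_train, y_train = X_lab[best_idx], y_lab[best_idx]
    remaining = np.ones(len(X_unl), dtype=bool)
    for _ in range(max_iters):
        conf = np.abs(best_model.decision_function(X_unl[remaining]))
        take = conf > conf_thresh                  # assumes a base with decision_function
        if not take.any():
            break                                  # converged: no new inliers found
        new_X = X_unl[remaining][take]
        new_y = best_model.predict(new_X)
        X_train = np.vstack([X_train, new_X])
        y_train = np.concatenate([y_train, new_y])
        remaining[np.flatnonzero(remaining)[take]] = False
        best_model = clone(base).fit(X_train, y_train)
    return best_model

# Example usage: model = ransac_self_train(SVC(kernel="rbf"), X_lab, y_lab, X_unl)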
2.1.3.2 Experiment design
In this experiment, we visualize the decision function at each iteration of the self-training of
the SSL RANSAC algorithms to validate our hypotheses. We evaluate the results of manifold
learning by comparing the learned decision functions with the manifold groundtruths. We
also analyze the SSL RANSAC algorithms by observing the evolution of the learned manifold
over iterations. We predict that the final decision boundary of the SSL RANSAC algorithms
should look more similar to the groundtruth manifold than that of the supervised base
classifier trained with 10% mislabeled data. Secondly, we compare how the decision boundaries
of the three SSL RANSAC algorithms change over iterations to evaluate which algorithm learns
the groundtruth manifold best. We expect the SSL RANSAC algorithms to form increasingly
accurate manifolds by recruiting inliers at each iteration. We visualize
the course of learning for three different base classifiers to verify that the SSL RANSAC
algorithms can learn the manifolds robustly.
2.1.3.3 Data preparation
We created a 2D two-moon dataset, as shown in the first row of all subplots in Figure 2.4
(denoted by Iter 0 on the y-axis). The 30 labeled samples are randomly mislabeled at a rate of 10%. Three base classifiers are evaluated in the experiment: SVM, KNN, and
MLP. The supervised base classifiers are trained with these 30 partially mislabeled samples, and the SSL
RANSAC algorithms are trained with an additional 120 unlabeled samples. Another supervised
base classifier is trained with the 30 labels without mislabeling to generate the groundtruth manifolds. The visualization of the decision boundary is created by estimating the confidence
score over the whole feature space after training with the pseudo-labeled inliers at the end of
each iteration. We use a color spectrum to represent the confidence estimate ranging from
1 to -1, where 1 represents the blue class and -1 represents the red class, as shown in Figure 2.4.
We show each algorithm's first five iterations as columns in Figure 2.4. The plots are left
blank if no more inliers can be found in the unlabeled dataset. For the supervised base
classifiers, there is only one iteration.
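For reference, the following minimal Python sketch reproduces this setup with scikit-learn: it generates a two-moon dataset, flips 10% of the 30 labels, trains a supervised SVM, and renders its decision function over the feature space in the same spirit as the color maps of Figure 2.4. The noise level and random seeds are assumptions, not the exact settings used here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=150, noise=0.2, random_state=0)   # 30 labeled + 120 unlabeled points
lab = rng.choice(len(y), size=30, replace=False)              # indices of the labeled subset
y_lab = y[lab].copy()
flip = rng.choice(30, size=3, replace=False)                  # 10% mislabeling: flip 3 of 30 labels
y_lab[flip] = 1 - y_lab[flip]

clf = SVC(kernel="rbf").fit(X[lab], y_lab)                    # supervised baseline

# Confidence map over the whole feature space, analogous to the color maps of Figure 2.4
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, np.clip(zz, -1, 1), levels=20, cmap="RdBu")
plt.scatter(X[lab, 0], X[lab, 1], c=y_lab, cmap="RdBu", edgecolors="k")
plt.title("Supervised SVM decision function, 10% mislabeled")
plt.show()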
2.1.3.4 Results
The learned manifolds of the five compared scenarios for SVM, KNN, and MLP are shown
in Figure 2.4. Each panel shows, from left to right, the decision boundary of the
base classifier trained with 10% mislabeling rate, SSL RANSAC 1, SSL RANSAC 2, SSL
RANSAC 3, and the base classifier trained with correctly labeled data. It shows that the
supervised learning reached its final decision boundary in the first iteration, while semi-supervised RANSAC algorithms 2 and 3 did not converge
until after multiple iterations. Semi-supervised RANSAC algorithm 1, although not shown in the
graph, did not start recruiting new data until a good initial subset was sampled. In the left
panel of Figure 2.4, supervised SVM did not learn an accurate two-moon manifold due to the
high proportion of mislabeling in the initial training data. The learned decision boundary
is so distorted by the 10% of the mislabeled data that it does not look like the groundtruth
manifold. It demonstrated how a small number of outliers can have a devastating effect on
a supervised classifier. SSL RANSAC 1 learned an accurate and crisp two-moon decision
boundary in the first iteration compared with the decision boundary of supervised SVM,
suggesting the random sampling mechanism removed the mislabeling from the initial training
set. The SSL RANSAC 1 converged at the first iteration, suggesting the unlabeled data did
not improve its accuracy in the second iteration. The SSL RANSAC 2 did not learn the
two-moon manifold as well as SSL RANSAC 1. However, the learned decision boundary
looks more circular than the decision boundary estimate of the supervised SVM at the first
iteration. It suggests that random sampling in RANSAC 2 helps mitigate the effect of
mislabeling. However, the semi-supervised algorithm failed to improve further after training
with the inliers at the second iteration. The manifold gravitates further into the red class,
suggesting the imbalance at the initial decision boundary estimate can have a devastating
effect that cannot be ignored. SSL RANSAC algorithm 3 did better than SSL RANSAC
2. It overcame the initial imbalance in the decision boundary estimate by eliminating bad
samples in the labeled training set. However, the algorithm seems to be more sensitive to
the initial mislabeling, and its decision boundary estimate looks more like an average of
the supervised SVM boundary and the groundtruth manifold. Its convergence criterion is also less
stable, and it took much longer to converge, although this is not shown in Figure 2.4. However,
the extra cost does not seem to pay off, since SSL RANSAC 1 achieved remarkable
results in the first iteration. In the middle panel of Figure 2.4, supervised KNN overfit the
blue class, and did not capture the groundtruth manifold well. SSL RANSAC 1 learned
the manifold very well in the first iteration, suggesting the critical importance of having a
good initial sampling. SSL RANSAC 2 looks more like the groundtruth than the supervised
KNN. However, the learned decision boundary estimate deteriorated in the second iteration.
It suggests that the self-training mechanism does not have the power to compensate for a
bad initial sampling. Unlike SSL RANSAC 1, SSL RANSAC 2 does not exclude samples in
the initial training set. SSL RANSAC 2 trained an ensemble of classifiers with a training
set containing incorrect labels and assumes the ensemble can average out the negative effect
of mislabeling. As a result, it does not completely rule out the influence of the incorrect
labels in the initial training set. SSL RANSAC 3 estimates the decision boundary that looks
more like the groundtruth compared with SSL RANSAC 2. The learned decision boundary
did not deteriorate in the second iteration. However, it did not improve in the third to fifth
iteration. This could be explained by the forward addition and backward elimination of
pseudo-labels canceling each other out. However, further study is necessary to support this
claim. Nevertheless, it is clear that SSL RANSAC 2 and SSL RANSAC 3 performed worse
than SSL RANSAC 1.
In the right panel of Figure 2.4, the supervised MLP learned the manifold better than
the SVM and KNN did. However, SSL RANSAC 1 performed the task nearly perfectly. On
the other hand, the result of SSL RANSAC 2 showed little resemblance to the groundtruth
manifold in the first two iterations. SSL RANSAC 3 learned a promising decision boundary
in the first iteration. Unfortunately, the circular pattern faded away as more incorrect inliers
were recruited during the self-training process.
Figure 2.4: Exp 1.4 Iteration steps of the RANSAC algorithms. From left to right:
support vector machine (SVM), K-nearest neighbors (KNN), and multi-layer perceptron (MLP). For each
panel, five scenarios are compared. From left to right are a supervised base classifier trained
with labeled data, RANSAC1 with a fraction of labeled data, RANSAC2 with a fraction of
labeled data, RANSAC3 with a fraction of labeled data, and a supervised classifier with all labeled data.
The first row shows the initial training set. The blue and red color map shows the confidence
score predicted over the whole space. It is clear for all of the SVM, KNN, and MLP classifiers that
RANSAC1 can form a crisper and more accurate decision boundary and confidence score with
very limited labeled data. On the other hand, RANSAC2 and RANSAC3 show only slight
improvements, agreeing with our observations in Figure 2.2 and Figure 2.3.
2.1.3.5 Discussion
Our result reaffirms the conclusion of previous experiments that SSL RANSAC 1 is the best
SSL RANSAC algorithm to counter mislabeling in the dataset. It consistently learned the
data manifold well when combined with different base classifiers. It shows that SSL RANSAC
1 is a more reliable choice than the other two algorithms since its advantage is less dependent
on the classifier itself. In fact, the visualization of the training process of SSL RANSAC 2 and
SSL RANSAC 3 reveals their design flaws, for self-training cannot average out the negative
effect without excluding the initial incorrect labels from the ensemble of classifiers. Although
they are not as good as SSL RANSAC 1, they learned the manifold better than the supervised
classifiers with the help of unlabeled data. Interestingly, the self-training process does not
seem to play an important role in the advantageous performance of SSL RANSAC 1. The
experiment suggests that most advantages come from re-sampling to exclude the bad initial
labels. The self-training does not seem to provide significant advantages when the underlying
manifold is relatively simple.
Chapter 3
Investigation of partially labeled data sets with ground-truths:
detection of early neurodegenerative diseases with Semi-supervised
RANSAC SVM with eye-tracking, free-viewing data set.
3.1 Experiment 2: Sex Prediction of Infants in Toxic Stress dataset
3.1.0.1 Objective
SSL RANSAC algorithm 1 was shown to reliably enhance robustness to high percentages
of mislabeling across various distributions and base classifiers in 2D synthesized
datasets, indicating its potential applicability to complex medical datasets whose underlying
manifolds and mislabeling rates are unknown. However, a direct evaluation of the SSL
algorithms on challenging medical datasets might be unconvincing due to the absence of a
reliable diagnosis. The challenge can be further exacerbated when there are no definitive
symptoms of the disorders. It is challenging to formulate good machine learning features for
such diagnosis due to the lack of good bio-signatures. For example, the pathologies of chronic
neurological disorders are often multifaceted, and the symptoms can be subtle at their early
stages. As a result, the signal-to-noise ratio can be poor in the dataset that targets potential
bio-signatures.
One of these datasets is the eye-tracking free-viewing toxic stress dataset. Toxic stress is
suggested as a risk factor for multiple neurological disorders, and there are plenty of studies
suggesting stress disorders can lead to the alteration of neurological development. However,
it is unclear whether the cognitive difference can be reflected in a free-viewing task where
the stimuli consist of clippets of randomly sampled natural scenes.
To further elucidate this question, we hypothesize that, given appropriately formulated features
that capture the underlying biosignatures, the dataset can reflect subtle differences in cognitive function thanks to the wide variety of stimuli. To verify the hypothesis, we
first examine the quality of the features using labels with definitive groundtruths. In this experiment, we investigate whether it is possible to classify sex with our features, as sex
differences in gazing behavior exist in adults and infants (Bayliss et al., 2005; Alexander
et al., 2009; Alexander and Wilcox, 2012). We hypothesize that SSL RANSAC 1 can improve
the accuracy of sex prediction due to its ability to include inliers in the self-training
process. The experiment is conducted on infants in several month groups to validate
the hypothesis.
3.1.1 Experiment Design
We predicted the sex of each participant in the eye-tracking free-viewing toxic stress dataset
with SSL RANSAC algorithm 1 using an SVM base classifier, and compared the results to those
of a plain SVM classifier at each month group. In addition, we evaluated the sex-prediction
accuracy at each month group with a training set that combined the data across all month groups.
The augmented training set is expected to improve the performance of SSL RANSAC algorithm
1 because more inliers are available in the training set. We also compared the three methods with
a no-skill baseline, which indicates the bottom line where predictions are randomly sampled
from the label distribution. We fit the models on both saliency and oculomotor features,
which are explained below.
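A minimal sketch of this evaluation loop is given below, assuming a feature matrix X and per-trial arrays sex and month have already been constructed (the variable names are illustrative, and sex is assumed to be coded as 0/1). Only the plain SVM and the no-skill baseline are shown explicitly; SSL RANSAC 1 and the age-combined training would plug into the same scoring call.

import numpy as np
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# X: (n_trials, n_features) feature matrix; sex, month: per-trial labels (assumed names)
def mean_ap(clf, X, y, cv=10):
    # Ten-fold cross-validated mean average precision
    return cross_val_score(clf, X, y, cv=cv, scoring="average_precision").mean()

for m in sorted(np.unique(month)):
    grp = month == m
    ap_svm = mean_ap(SVC(kernel="rbf"), X[grp], sex[grp])                          # plain SVM
    ap_noskill = mean_ap(DummyClassifier(strategy="stratified"), X[grp], sex[grp]) # no-skill
    print(f"{m} months: SVM mAP={ap_svm:.2f}, no-skill mAP={ap_noskill:.2f}")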
Figure 3.1: An overview of the pipeline of our proposed model for toxic stress
risk prediction. The pipeline consists of three major steps. (a) Feature extraction: the
video clippets are processed to extract oculomotor features (saccade peak velocity, saccade
velocity, saccade acceleration, fixation duration, and pupil size) and saliency features (color,
flicker, intensity, motion, orientation) at each attention time stamp, which is defined as the
end of a saccade. In total, there are 15 channels and T attention time stamps, so the size of
the data is 15 × T. (b) Dimensionality reduction: the 2-dimensional features are
then summarized with a histogram to reduce the size T to a fixed value. After rescaling
and extremum removal, the summarized histogram removes the temporal dimension and
represents only the distribution of responsiveness to each feature. We assign 9 bins to each
channel, and the total size of the feature becomes 15 × 9. (c) RANSAC classification:
we deploy different RANSAC algorithms to remove outliers from the final training set. The
base classifiers predict the confidence value of each label. If the confidence of a prediction
is high for a sample, we include the sample and the predicted label in the training set of
the next iteration. The RANSAC process is repeated after validation until the results meet the
convergence criteria.
3.1.1.1 Data Preparation
Eye Tracking Free-viewing Paradigm
Eye-tracking-based models have been proposed as a cost-effective, high-throughput paradigm
for neurological disease screening and have achieved remarkable progress for disorders of infants and pre-verbal young children (Wang et al., 2015; Itti, 2015). For example,
eye-tracking-based screening algorithms have been shown to reliably and efficiently detect
anomalies and to classify children with ADHD and FASD (Tseng et al., 2013a), as well as
ASD (Wang et al., 2015).
In the eye-tracking paradigm, subjects were asked to watch a sequence of short video clips
without external incentives (free-viewing task). Experimenters then collected the eye-trace
recording and used a visual attention model to compute saliency maps on every video frame,
which predicted locations more likely to attract a control participant’s attention and gaze.
Saliency values sampled at a participant’s gaze points were used to construct training features
to support classification with machine learning algorithms (Itti et al., 1998). These features
reflected the degree of agreement between saliency model predictions and where participants
looked. Additional oculomotor features can also be derived from eye movements.
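As one illustration of how such oculomotor features could be derived, the sketch below computes instantaneous gaze velocity from 500 Hz samples and returns the peak velocity of each saccade found with a simple velocity threshold. The 30 deg/s threshold and the assumption that gaze is already expressed in degrees of visual angle are illustrative choices, not the thesis's exact preprocessing.

import numpy as np

def saccade_peak_velocities(gx, gy, fs=500.0, vel_thresh=30.0):
    # Peak velocity (deg/s) of each saccade detected with a simple velocity threshold.
    # gx, gy: gaze position in degrees of visual angle sampled at fs Hz (assumed units).
    vx = np.gradient(gx) * fs                 # horizontal velocity via finite differences
    vy = np.gradient(gy) * fs                 # vertical velocity
    speed = np.hypot(vx, vy)                  # instantaneous gaze speed, deg/s
    mask = speed > vel_thresh                 # True inside candidate saccades
    # Find contiguous True runs by looking at transitions in the padded mask
    padded = np.r_[False, mask, False]
    starts = np.flatnonzero(~padded[:-1] & padded[1:])
    ends = np.flatnonzero(padded[:-1] & ~padded[1:])
    return [speed[s:e].max() for s, e in zip(starts, ends)]

# Example usage: peaks = saccade_peak_velocities(gx, gy)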
Saliency describes the local conspicuity of an image that attracts visual attention. Tseng
et al. (Tseng et al., 2013a) combined saliency features and oculomotor features to help
understand the cognitive and oculomotor behaviors of ADHD and FASD. Wang et al.
(Wang et al., 2015) and Zhang et al. (Zhang et al., 2019) incorporated top-down attention features to deepen
our understanding of the attentive behaviors of patients with Autism Spectrum Disorder.
Here, we use these features collected from infants as young as two months old. We found
they were generally quite interested in age-appropriate video clips. This is important for our
study, as these infants would not have been able to comply with any experimental protocol
that would involve following some instructions given by the experimenter.
Participants
The toxic stress data came from our collaborators at the Children’s Hospital of Los Angeles
(CHLA) and Boston Children’s Hospital (BCH). The researchers at two hospitals conducted
longitudinal studies with infant participants at ages of 2, 6, 9, 12, 18, 24, and 36 months.
There are 72 participants, each of whom took the free-viewing test during their visits at 2, 6,
9, 12, 24, and 36 months of age. Their parents are asked to self-evaluate stress levels through
standardized questionnaires during each test visit. We collected 265 trials from CHLA from
6 visits of 72 unique participants. The total number of trials is less than 432 because some
participants dropped out of the experiment. The demography of the participants is gender-balanced, and half of them are Hispanic. The 59 subjects in the BCH data set
are half African American and half other ethnicities.
Stimuli
Six blocks of audible videos were presented to the infant participants during each testing
visit. Each block of the experiment consists of a sixty-second display of a video followed by
a short break to maintain infants' attention. For each video, twenty approximately three-second-long clippets at 30 fps were stitched together. The clippets – footage arbitrarily cut
from documentaries, cartoons, and music videos – of different categories were interleaved
to engage different modalities of saliency and attention. Instantaneous gaze locations were
tracked by a table-top EyeLink 1000 eye tracker with a 500 Hz sampling rate. The heads of
the infants were not fixated.
Feature construction
We calculated pixel-level saliency values at the gaze location at the end of each saccade.
The ends of saccades instead of the starts of fixations were used because studies showed that
the ends of saccades are strongly related to shifts of attention. In contrast, fixations are
susceptible to center bias. We employ five modalities of pixel-level saliency features: color,
flicker, intensity, motion, and orientation (Itti et al., 1998). For oculomotor features, we
select saccade velocity, saccade acceleration, saccade duration, fixation duration, and pupil
size. We summarized the modalities in Table 3.1.
Both the raw saliency and oculomotor samples at each timestamp were further summarized
into a single histogram of ten bins for each saliency and oculomotor modality, after removing
the top five percentile of all samples. The top five percentile was removed because those
outliers made most of the data points fall within the first bin of the histograms, and
we want the samples to be more spread out. Finally, we discard the bottom-most bin because it
contains most of the data points, out-proportioning the remaining nine bins. As a result, the final
dimensionality was reduced to 45 for both the saliency and the oculomotor features.
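The histogram summarization just described can be sketched per channel as follows; the percentile cutoff, bin count, and dropping of the bottom bin follow the text, while the final normalization is an assumed detail.

import numpy as np

def summarize_channel(samples, n_bins=10, clip_pct=95):
    # Summarize one channel's per-timestamp samples into a histogram:
    # drop samples above the 95th percentile (the top five percent),
    # bin into 10, then discard the over-full bottom bin.
    samples = np.asarray(samples, dtype=float)
    samples = samples[samples <= np.percentile(samples, clip_pct)]
    hist, _ = np.histogram(samples, bins=n_bins)
    hist = hist[1:]                              # keep 9 bins per channel
    return hist / max(hist.sum(), 1)             # normalization is an assumption

# Five saliency (or five oculomotor) channels then yield 5 x 9 = 45 values per trial,
# matching the dimensionality reported above.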
Pixel-Level Saliency: Color, Orientation, Intensity, Motion, Flicker
Oculomotor: Saccade Peak Velocity, Saccade Velocity, Saccade Acceleration, Fixation Duration, Pupil Size
Table 3.1: The table shows the features used in our proposed model. Pixel-level saliency
is very predictive of bottom-up attention. Pixel-level saliency features are important in
classifying patients with ADHD and FASD from the healthy control group. Oculomotor features
are important in the classification of Parkinson's Disease.
Labels
The labels are the sexes of the participants. The numbers of samples of the two sexes are
roughly balanced.
3.1.1.2 Statistical Analysis
We use average precision score as our metric of prediction accuracy. We run a ten-fold
cross-validation to calculate the mean average precision and variance for each month group.
3.1.1.3 Results
Figure 3.2 shows the prediction accuracy in each age group for each algorithm.
The plain SVM (light green line) only did slightly better than the no-skill baseline (dashed blue
line), for which a Naive Bayes classifier is used. SSL RANSAC 1 improved the mAP significantly
(blue line) for all age groups and both feature modalities. Training SSL RANSAC 1 with the
combined labeled data of all age groups (orange line) further improved the mAP, to above 0.9
when using saliency features; the improvement is most pronounced in infants younger than 12
months. When training RANSAC1 with the combined data on oculomotor features, mAPs also
improved, but not as much.
Figure 3.2: Mean average precision (mAP) of sex prediction with saliency (top)
and oculomotor features (bottom), using RANSAC1 with SVM as the base classifier, across
age groups from 2 months old to 24 months old. The plain SVM (light green line) only does
slightly better than no-skill (dashed blue line), where classes are predicted with a Naive Bayes
classifier. RANSAC1 improves the mAP significantly (blue line) for all age groups and both
feature modalities. Training RANSAC1 with the combined labeled data of all age groups (orange
line) further improves the mAP of the 2-month-old and 6-month-old groups significantly when
using saliency features. When training RANSAC1 with the combined data on oculomotor
features, the improvement was still significant, but mAPs were lower than those from saliency
features.
3.1.2 Discussion
The result suggested that SSL RANSAC1 could remove heterogeneous samples (outliers)
to improve the model fitting even when mislabeling and unlabeled data were absent. That
is, SSL RANSAC1 helps select characteristic samples more representative of the class and
prevents ’outliers’ from undermining the fitting. In the case of age-combined fitting, SSL
RANSAC1 further improved the fitting by finding characteristic samples that are age-independent. The random sampling technique provides an assumption-free way of finding a
more generalizable subset of samples.
Chapter 4
Investigation of partially labeled data sets with weak labels.
Risk prediction with Semi-supervised RANSAC SVM for
Adverse Childhood Experience induced Toxic Stress with
eye-tracking, free-viewing data sets
4.1 Experiment 3: Age group prediction of infants in Toxic Stress
dataset
4.1.0.1 Objective
Experiment 2 showed that the eye-tracking free-viewing paradigm can reveal subtle differences in
the cognitive development of infants that might be difficult for human experts to differentiate. Moreover, applying SSL RANSAC algorithm 1 on top of the SVM classifier achieved
significant improvements in all month groups by excluding the outliers in the training set.
This finding suggests that SSL RANSAC can improve the robustness of classifiers on challenging
datasets that were thought impossible to classify due to their small size and high
noise-to-signal ratio, which is particularly common in medical datasets.
To further verify the validity of our claim that our SSL RANSAC algorithm can capture
subtle differences among the classes without accurate labels or feature domains, we utilize the
eye-tracking dataset used in the previous experiment to investigate whether the algorithm
can differentiate infants that are a few months apart. Although there are clear milestones
of cognitive development reflected in gazing behavior, the high variability in developmental
speed and the limitation of the eye-tracking free-viewing paradigm significantly compromise
the resolution and accuracy of such milestones. As a result, predicting the ages of the infant
participants from our eye-tracking dataset poses a challenging task that reflects the general
problem of predictive medicine.
4.1.0.2 Experiment Design
We conducted one-versus-rest binary classification on participants at 2, 6, 9, 12, 18, and
24 months old. We use the mean average precision score to evaluate the performance of age
prediction with participants from CHLA. We evaluate the mAP score of the three proposed
SSL RANSAC algorithms with SVM base classifiers, denoted RSVM1, RSVM2, and
RSVM3. The SSL algorithms were trained with 10% of the data labeled and the remaining 90%
unlabeled. We also evaluate the mAP score of a supervised SVM trained with the same 10% of
labeled data as the semi-supervised RANSAC algorithms. We evaluate the mAP of an SVM
trained with 100% of the labeled data as the performance upper bound for the SSL RANSACs,
because there is no mislabeling in this dataset. We also evaluated the mAP of the no-skill
classifier. Models fit with saliency features and with oculomotor features were evaluated separately.
4.1.0.3 Data Preparation
We used the same CHLA toxic stress datasets as the previous experiment. The researchers
at two hospitals conducted longitudinal studies with infant participants aged 2, 6, 9, 12,
18, 24, and 36 months. There are 72 participants, each of whom took the free-viewing test
during their visits at 2, 6, 9, 12, 24, and 36 months of age. The total number of trials is less
than 432 because some participants dropped out of the experiment. The demography of the
participants is gender-balanced. Half of them are Hispanic.
4.1.0.4 Statistical Analysis
We conducted ten-fold cross-validation to obtain the mean average precision score.
4.1.0.5 Results
Infants’ visual attention develops rapidly over the first year of their lives (Ross-Sheehy et al.,
2015) (H., 1990) (J., 1984), and eye-tracking data only reflect the neurological development in
restrictive conditions. Therefore, predicting the age of infants whose ages were only months
apart was not easy, even when the data were correctly labeled. In this experiment, 10% labeled
data and 90% unlabeled data were given to all RANSAC-based classifiers. SVM baselines trained
with 10% of the labeled data and with 100% of the labeled data were also presented.
Our experimental results on age prediction showed that RANSAC1-SVM (RSVM1) predicted age
better than RANSAC2-SVM (RSVM2) and RANSAC3-SVM (RSVM3) (Figure 4.1). There were
no significant differences among RSVM2, RSVM3, SVM, and the Naive Bayes model in terms of
the mAP score when fitted with saliency features. However, RSVM1 outperformed the rest on all
age classification tasks. The fact that there was no difference among the Bayes model, SVM 10%,
and SVM 100% suggests the labels by themselves were not very useful. A similar trend was found
when only oculomotor features were used. RSVM2, RSVM3, and SVM trained with 100% of the
labels all performed similarly to the Bayes model (no skill) from 6 to 24 months. Oculomotor
features were only useful for distinguishing 2-month-old infants from the older ones. The fact
that RSVM1 outperformed the SVM trained with 100% of the labels suggests RANSAC1 can
learn from weakly labeled and unlabeled data.
4.1.0.6 Discussion
The experiment demonstrated that age classification with our eye-tracking data is difficult.
Most of the classifiers, including the SVM trained with 100% of the labeled data, do not perform
better than the no-skill estimate in most month groups in the saliency feature domain, except
for SSL RANSAC algorithm 1, which outperformed all other classifiers by around 5% in mAP
score across all month groups in the saliency feature domain. This suggests that the exclusion
of outliers is crucial for accurate classification of a dataset with a high noise-to-signal ratio.
SSL RANSAC algorithm 1 consistently outperformed the SVM
model trained with all labeled data in all month groups in both saliency and oculomotor
feature domains, suggesting that the algorithm is adaptable across various manifolds in the
real-world datasets, which agrees with our findings in experiment 1. On the other hand, the
SSL RANSAC algorithm 2 and SSL RANSAC algorithm 3 are not as robust and reliable as
SSL RANSAC algorithm 1, for they do not provide notable improvement over the supervised
classifier on our eye-tracking dataset. This finding also aligns with our finding in experiment 1.
Figure 4.1: Results of experiment 3: average precision of age prediction. RANSAC
algorithms with SVM base classifiers are used to make binary classifications of age at 2, 6, 9, 12,
18, and 24 months old. Models trained with saliency features (left panel) and with oculomotor
features (right panel) were both evaluated. We used 10% of the data set as the labeled set and
the remaining 90% as the unlabeled set for semi-supervised training. Supervised SVMs trained
with all labeled data (100% SL) and with only 10% of the labeled data (10% SL) were also
compared. A Naive Bayes model fit with 10% of the data set was used as a no-skill baseline.
Left panel: mean accuracies of the RANSAC algorithms in predicting age with saliency
features. RANSAC2, RANSAC3, SVM with 10% labels, and SVM with 100% labels all
performed similarly to the Naive Bayes model, while RANSAC1 performs significantly better.
Right panel: mean accuracies of the RANSAC algorithms compared with SVM with 10%
labels, SVM with all labels, and Naive Bayes with 10% labels. RANSAC1 performs as well as
the SVM trained with all labels in the 2-month-old group and outperformed it in all other age
groups. Plain SVM trained with 10% of labels performed better than RANSAC2 and RANSAC3
in the 2-month-old group, consistent with the results of previous experiments.
The results suggest the plausibility of applying SSL RANSAC algorithm 1 to areas where
accurate groundtruths are not available, such as early detection, disease risk assessment, and
prognosis.
Chapter 5
Summary and Discussion
The difficulty and cost of obtaining high-quality annotated medical data have prevented
predictive medicine from being applied to predicting health risks at early stages.
For example, training machine learning models to detect chronic neurological diseases at
an early stage can be challenging due to the shortage of high-quality labels. The limited time
frames of most research projects for monitoring the early development of chronic neurological
diseases, which may take years to decades to develop, prevent researchers from consolidating
the necessary understanding of how these diseases progress. As a result, the
accuracy of supervised machine learning methods can deteriorate quickly with even
a small portion of mislabeling. Semi-supervised learning, on the other hand, utilizes the
structure of unlabeled data to 'propagate' labels from a small amount of labeled data to the
more accessible unlabeled data. Simply speaking, such "pseudo-labeling" effectively provides
extra free training data, which can enhance the accuracy of model predictions even
when the initial training samples are scarce and contaminated.
We proposed a RANSAC-inspired semi-supervised learning algorithm that utilizes random initial sampling and self-consensus to propagate labels in a self-training manner. Our
RANSAC method generalizes to various base classifiers such as KNN and MLP and can be
applied across a wide range of mislabeling rates when tested with synthetic data distributions.
We applied our algorithms to synthesized datasets and real free-viewing datasets with groundtruths to evaluate the RANSAC-SVM algorithms. We experimented with three RANSAC
algorithms and identified the best one based on both qualitative and quantitative studies.
The experiments showed robust, significant improvements compared with supervised algorithms. We then applied the best algorithm, i.e., SSL-RANSAC algorithm 1 with an SVM
base classifier, to the actual toxic stress data set. We validated its applicability with strong ground
truths such as the sex and age of the participants. The experiments not only support the
feasibility of SSL-RANSAC algorithm 1 on eye-tracking data, but also indicate the great benefit of
combining data across age groups. In addition, we compared RANSAC-SVM with two other
semi-supervised learning algorithms, i.e., label spreading and self-training, on the
ONDRI datasets. SSL-RANSAC, using SVM as a base classifier, performed remarkably better than the other two semi-supervised learning algorithms in classifying neurodegenerative
disorders when the appropriate feature domain was chosen.
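For context, both baseline methods are available in scikit-learn; the sketch below shows how they might be invoked on a toy problem, with unlabeled points marked as -1 per the scikit-learn convention. Hyperparameters are library defaults, not the settings used in our comparison.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
rng = np.random.default_rng(0)
labeled = rng.choice(len(y), size=20, replace=False)     # keep only 20 labels
y_partial = np.full_like(y, -1)                          # -1 marks unlabeled samples
y_partial[labeled] = y[labeled]

spreading = LabelSpreading(kernel="rbf").fit(X, y_partial)
self_training = SelfTrainingClassifier(SVC(kernel="rbf", probability=True)).fit(X, y_partial)
print((spreading.transduction_ == y).mean(), (self_training.predict(X) == y).mean())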
Finally, we analyzed the results of toxic stress prediction by comparing SSL-RANSAC
algorithm 1 with an SVM base classifier to a plain SVM. Without hard ground-truths for toxic
stress labels, we relied on a demography-based composite score metric to assign risk labels to our
participants. Studies suggest a correlation between maternal risk factor profiles and the toxic
stress response in infants (Gateau et al., 2022), and other studies suggest a causal effect of social
and environmental factors on the toxic stress of infants (Shonkoff et al., 2012). The significant
improvement in predicting composite scores not only supports a correlation between
demography-based composite scores and the saliency and oculomotor features of eye-tracking
data, but also demonstrates that we can reliably and accurately separate infant participants in
high-risk groups from those in low-risk groups. The eye-tracking-based screening paradigm has
been proposed as a high-throughput, low-cost screening tool for various neurological diseases
whose pathology and symptoms are well studied. However, this is the first time that we apply the
paradigm to a more realistic scenario, where the quality of labels is uncertain and the characteristic symptoms are poorly studied. Adverse childhood experience-induced toxic stress can
leave a life-long impact on people, and it costs the U.S. hundreds of billions of dollars
in health care and lost workforce each year. Our work suggests that our RANSAC-inspired SSL
algorithm can improve the accuracy of early recognition of populations at high risk of adverse
childhood experiences.
To summarize, our SSL-RANSAC has the following advantages. First, it is a high-level learning
scheme that can be applied to various base classifiers, chosen according to various hypotheses, to
best fit the respective problems. Second, it is specifically designed to tolerate a
high percentage of mislabeling, outliers, and general noise in the labels.
Third, it can be applied to a small dataset with only a few labels. The design of our algorithm makes it suitable for machine learning prediction of a wide range of undiagnosed diseases.
Our preliminary results show the viability of applying the algorithm to chronic neurological
disorders. It lays the groundwork for further investigation of other undiagnosed diseases that
suffer from both label quality and label quantity challenges.
Bibliography
Ravi Aggarwal, Viknesh Sounderajah, Guy Martin, Daniel SW Ting, Alan Karthikesalingam,
Dominic King, Hutan Ashrafian, and Ara Darzi. Diagnostic accuracy of deep learning in
medical imaging: a systematic review and meta-analysis. NPJ digital medicine, 4(1):65,
2021.
Gerianne M Alexander and Teresa Wilcox. Sex differences in early infancy. Child
Development Perspectives, 6(4):400–406, 2012.
Gerianne M Alexander, Teresa Wilcox, and Rebecca Woods. Sex differences in infants’ visual
interest in toys. Archives of sexual behavior, 38:427–433, 2009.
Andrew P. Bayliss, Giuseppe di Pellegrino, and Steven P. Tipper. Sex differences in eye gaze
and symbolic cueing of attention. The Quarterly Journal of Experimental Psychology
Section A, 58(4):631–650, 2005. doi: 10.1080/02724980443000124. URL https://doi.org/10.1080/02724980443000124. PMID: 16104099.
M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for
learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, 2006.
Kristin Bennett and Ayhan Demiriz. Semi-supervised support vector machines. Advances
in Neural Information processing systems, 11, 1998.
Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training.
In Proceedings of the Eleventh Annual Conference on Computational Learning Theory,
COLT’ 98, page 92–100, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 1581130570. doi: 10.1145/279943.279962. URL https://doi.org/10.1145/279943.279962.
Subhabrata Debnath, Anjan Banerjee, and Vinay Namboodiri. Adapting ransac svm to detect outliers for robust classification. Proceedings of the British Machine Vision Conference
(BMVC), pages 168.1–168.11, September 2015. doi: 10.5244/C.29.168.
Meherwar Fatima, Maruf Pasha, et al. Survey of machine learning algorithms for disease
diagnostic. Journal of Intelligent Learning Systems and Applications, 9(01):1, 2017.
Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model
fitting with applications to image analysis and automated cartography. Commun. ACM,
24(6):381–395, June 1981. ISSN 0001-0782. doi: 10.1145/358669.358692.
Kameelah Gateau, Lisa Schlueter, Lara Pierce, Barbara Thompson, Alma Gharib, Ramon
Durazo-Arvizu, Charles Nelson, and Pat Levitt. Early maternal risk factor profiles and
their relationship to toxic stress response in infants across the first year of life. 2022.
Johnson Mark H. Cortical maturation and the development of visual attention in early
infancy. Journal of Cognitive Neuroscience, 2(2):81–95, 1990. doi: 10.1162/jocn.1990.2.2.
81.
Zahra Hossein-Nejad and Mehdi Nasri. An adaptive image registration method based on sift
features and ransac transform. Computers & Electrical Engineering, 62:524–537, 2017.
L. Itti. New eye-tracking techniques may revolutionize mental health screening. Neuron, 88
(3):442–444, Nov 2015.
L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):
1254–1259, 1998.
Atkinson J. Human visual development over the first 6 months of life. a review and a
hypothesis. Hum Neurobiol., (3(2)):61–74, 1984.
Thorsten Joachims. Transductive inference for text classification using support vector machines. page 200–209, 1999.
Daniel Koguciuk. Parallel ransac for point cloud registration. Foundations of Computing
and Decision Sciences, 42(3):203–217, 2017.
Abhishek Kumar and Hal Daume III. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on International Conference on
Machine Learning, ICML’11, page 393–400, Madison, WI, USA, 2011. Omnipress. ISBN
9781450306195.
G Maragatham and Shobana Devi. Lstm model for prediction of heart failure in big data.
Journal of medical systems, 43:1–13, 2019.
K. Nishida and T. Kurita. Ransac-svm for large-scale datasets. 2008 19th International
Conference on Pattern Recognition, pages 1–4, 2008.
Linhao Qu, Siyu Liu, Xiaoyu Liu, Manning Wang, and Zhijian Song. Towards label-efficient
automatic diagnosis and analysis: a comprehensive survey of advanced deep learning-based
weakly-supervised, semi-supervised and self-supervised techniques in histopathological image analysis. Physics in Medicine & Biology, 2022.
Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in medicine. New
England Journal of Medicine, 380(14):1347–1358, 2019.
Ana Rath, Valérie Salamon, Sandra Peixoto, Virginie Hivert, Martine Laville, Berenice Segrestin, Edmund AM Neugebauer, Michaela Eikermann, Vittorio Bertele, Silvio Garattini,
et al. A systematic literature review of evidence-based clinical practice for rare diseases:
what are the perceived and real barriers for improving the evidence and how can they be
overcome? Trials, 18:1–11, 2017.
Shannon Ross-Sheehy, Sebastian Schneegans, and John P. Spencer. The infant orienting with
attention task: Assessing the neural basis of spatial attention in infancy. Infancy, 20(5):
467–506, 2015. doi: https://doi.org/10.1111/infa.12087. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/infa.12087.
DR Sarvamangala and Raghavendra V Kulkarni. Convolutional neural networks in medical
image understanding: a survey. Evolutionary intelligence, 15(1):1–22, 2022.
Jack P Shonkoff and et al. The lifelong effects of early childhood adversity and toxic stress.
Pediatrics, 129(1):e232–e246, 2012.
P. H. Tseng, I. G. Cameron, G. Pari, J. N. Reynolds, D. P. Munoz, and L. Itti. Highthroughput classification of clinical populations from natural viewing eye movements. J.
Neurol., 260(1):275–284, Jan 2013a.
Po-He Tseng, Ian GM Cameron, Giovanna Pari, James N Reynolds, Douglas P Munoz, and
Laurent Itti. High-throughput classification of clinical populations from natural viewing
eye movements. Journal of neurology, 260(1):275–284, 2013b.
Dong Wang, Yuan Zhang, Kexin Zhang, and Liwei Wang. Focalmix: Semi-supervised learning for 3d medical image detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 3951–3960, 2020.
S. Wang, M. Jiang, X. M. Duchesne, E. A. Laugeson, D. P. Kennedy, R. Adolphs, and
Q. Zhao. Atypical Visual Saliency in Autism Spectrum Disorder Quantified through
Model-Based Eye Tracking. Neuron, 88(3):604–616, Nov 2015.
Jianping Wu, Liang Zhang, Ye Liu, and Ke Chen. Real-time vanishing point detector
integrating under-parameterized ransac and hough transform. In Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV), pages 3732–3741, October 2021.
Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. CoRR,
abs/1304.5634, 2013. URL http://arxiv.org/abs/1304.5634.
David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In
33rd annual meeting of the association for computational linguistics, pages 189–196, 1995.
Chen Zhang, Angelina Paolozza, Po-He Tseng, James N. Reynolds, Douglas P. Munoz, and
Laurent Itti. Detection of children/youth with fetal alcohol spectrum disorder through
eye movement, psychometric, and neuroimaging data. Frontiers in Neurology, 10:80, 2019.
doi: 10.3389/fneur.2019.00080.
Mingbo Zhao, Rosa HM Chan, Tommy WS Chow, and Peng Tang. Compact graph
based semi-supervised learning for medical diagnosis in alzheimer’s disease. IEEE signal
processing letters, 21(10):1192–1196, 2014.
Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label
propagation. ProQuest Number: INFORMATION TO ALL USERS, 2002.
Laura Zwaan and Hardeep Singh. The challenges in defining and measuring diagnostic error.
Diagnosis, 2(2):97–103, 2015.
Abstract
Recognizing, identifying, and detecting chronic neurological disorders at an early stage with the assistance of machine learning (ML) is expected to reduce the cost of medical treatment significantly. One barrier, however, is that ML typically requires large amounts of accurately labeled training data, which may not always be available.
We propose three semi-supervised learning algorithms that are robust to partially mislabeled data and that only require a few data labels. The method combines self-training, ensemble learning, and RANSAC (random sample consensus) to iteratively propagate from a small labeled dataset to a much larger pseudo-labeled dataset that is sufficiently large for ML training. First, we conducted qualitative and quantitative analyses of the proposed algorithms on synthesized datasets. Then, we analyzed two eye-tracking, free-viewing datasets. The first dataset included six hundred patients tested for five possible neurodegenerative disorders. Our approach provided 20% better classification accuracy than 4 baseline algorithms in the detection of Parkinson's Disease. The second dataset included 131 infants (age 2 to 24 months) tested for possible exposure to adverse childhood experience (ACE). Because this dataset does not have objective ground-truth labels (which would require knowing whether an infant has indeed experienced abusive or negligent behaviors), we first confirmed the applicability of our algorithm by asking it to predict sex and age (for which ground-truth labels are available), which it was able to do with 90% and 42% accuracy. To then evaluate our algorithm on possible ACEs, we compared its predictions to a composite score based on behavioral and socio-environmental data from the infants' parents.