SENTIMENT ANALYSIS IN THE COVID-19 VACCINE WILLINGNESS AMONG STAFF
IN THE UNIVERSITY OF SOUTHERN CALIFORNIA
By
Yutong Qin
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
MAY 2023
Copyright 2023 Yutong Qin
Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
1.1 Career changes and people's attitudes towards COVID-19
1.2 Sentiment Analysis
1.2.1 LSTM (Long Short-Term Memory)
1.2.2 TF-IDF
1.2.3 Logistic Regression
1.3 Classification Matrix
Chapter 2 Methods
2.1 Data
2.2 Measures
2.3 Statistical Analysis
2.3.1 LSTM
2.3.2 TF-IDF
2.3.3 Logistic Regression
Chapter 3 Results
3.1 LSTM
3.2 TF-IDF
3.3 Logistic Regression (only)
Chapter 4 Discussion and Conclusion
References
List of Figures
FIGURE 1. LOGIC BEHIND LSTM.
FIGURE 2. CONFUSION MATRIX.
FIGURE 3. THREE MODELS COMPARISON.
FIGURE 4. LSTM MODEL.
FIGURE 5. DEMOGRAPHICS OF THE POPULATION.
FIGURE 6. MODEL FEATURES.
FIGURE 7. LOSS AND ACCURACY IN 9 EPOCHS.
FIGURE 8. ROC CURVE FOR LSTM+LR.
FIGURE 9. ROC CURVE FOR TF-IDF+LR.
FIGURE 10. ROC CURVE FOR LOGISTIC REGRESSION.
List of Tables
TABLE 1. SAMPLE SIZE COMPARISON IN TRAIN AND TEST DATA.
TABLE 2. DEMOGRAPHIC AND RELEVANT VARIABLES (N=468).
TABLE 3. LSTM+LR CLASSIFICATION REPORT.
TABLE 4. TF-IDF CLASSIFICATION REPORT.
TABLE 5. TF-IDF+LR CLASSIFICATION REPORT.
TABLE 6. LOGISTIC REGRESSION CLASSIFICATION REPORT.
TABLE 7. 3 MODELS PERFORMANCE COMPARISON.
Abstract
In January 2020, COVID-19 broke out worldwide. Because of the high mobility of people and the infectivity of the virus, the number of confirmed COVID-19 cases rose rapidly. As the epidemic developed, public opinion also changed, and to a certain extent people's behaviors reflect their attitudes and views on the development of the epidemic. Sentiment analysis of questionnaire responses can capture people's reactions to changes in national policies and to emergencies, and can help guide countermeasures in similar situations in the future. This paper uses questionnaire data from the Trojan Pandemic Research Initiative (TPRI) of the University of Southern California, collected between August and November 2021, to analyze people's behaviors and their attitudes towards COVID-19 and vaccination. Sentiment analysis with TF-IDF (Term Frequency-Inverse Document Frequency) and LSTM (Long Short-Term Memory) was used to process the text variable, and a logistic regression model was used for classification. A total of 2,906 participants enrolled in the study, and 468 completed the open-ended text question, which asked how the pandemic had changed their career goals; the outcome of interest was COVID-19 vaccination willingness. A categorical variable was created from the sentiment-analysis results to represent the meaning of the text and was used as a feature in the classification model. Comparing the accuracy of the basic logistic regression model with the models that include the text-analysis variables shows how sentiment analysis affects prediction performance; in this sample the sentiment-derived features did not substantially improve it. The results of our analysis can help us understand the relationship between public opinion and the development of the epidemic, and better comprehend the psychology of the public.
Chapter 1 Introduction
1.1 Career changes and people's attitudes towards COVID-19
COVID-19 had a profound impact on people's careers and, as a consequence, has been a major
career shock for many people [1]. The pandemic has resulted in widespread job losses, reduced
working hours, and changes in work arrangements such as remote work, which have had a
significant impact on workers' lives and well-being.
During the pandemic, due to government lockdown policies and people's worries about virus transmission, many people worked from home and had the chance to spend more time with their families. People started to pay more attention to how to achieve a better work-life balance in future work, especially after the pandemic. The pandemic gave workers the opportunity to reflect on and re-evaluate what they wanted and hoped to gain from their jobs. The new work arrangements, especially flexible ones, have disrupted traditional relationships between employees and employers, as well as typical work hours and schedules. This has also brought attention to the importance of work-life balance and how individuals relate to their work [10].
There has been a significant change in people's preferred work location since the pandemic, according to a study by the Pew Research Center (PRC). In that study, around 59% of American workers who believed that their job duties could primarily be performed from home were working from home most or all of the time. Most of these workers, around 83%, reported that they had already been working remotely even before the spread of the Omicron variant in the US [7].
1.2 Sentiment Analysis
1.2.1 LSTM (Long Short-Term Memory)
LSTM (Long Short-Term Memory) is a type of artificial neural network developed to address some of the limitations of the recurrent neural network (RNN). RNNs are neural networks that can handle sequential data, such as time series or natural language, but they suffer from several limitations, including short-term memory, vanishing gradients, and exploding gradients [3].
Figure 1. Logic behind LSTM. The network is initialized with a set of learnable parameters created from the sentence embedding, including the weights and biases of the LSTM cells. The short-term input S is passed through the network layers (A), and y is the output.
An LSTM can model the logical relationships within a sequence that a feedforward network's classification misses, thereby reducing logical inconsistencies in text caused by translational invariance. The logic behind LSTM, shown in Figure 1, is that the network is initialized with a set of learnable parameters created from the sentence embedding, including the weights and biases of the LSTM cells. The short-term input S is passed through the network layers (A). The LSTM computation is repeated for each time step in the input sequence, allowing the network to learn long-term dependencies and make accurate predictions over time. During training, the LSTM uses gradient descent to modify the weights based on the error, improving training accuracy [9].
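To make the gate computations concrete, the following is a minimal NumPy sketch of a single LSTM time step. It is illustrative only, not the implementation used in this thesis; the weights W, U and bias b would normally be learned by gradient descent rather than drawn at random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, what to add, and what to output."""
    z = W @ x_t + U @ h_prev + b                   # stacked pre-activations, shape (4 * hidden,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, and output gates
    g = np.tanh(g)                                 # candidate cell update
    c_t = f * c_prev + i * g                       # long-term cell state
    h_t = o * np.tanh(c_t)                         # short-term hidden state (the step's output)
    return h_t, c_t

# Toy usage: 3 hidden units, 2-dimensional inputs, random (untrained) parameters.
rng = np.random.default_rng(0)
hidden, d = 3, 2
W = rng.normal(size=(4 * hidden, d))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, d)):                # iterate over a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```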
1.2.2 TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that is widely used in natural language processing. It quantifies the significance of a word in a selected document by combining how often the word appears in that document with how frequently it appears across the entire corpus. Using this measure, it is possible to identify which words are most important for understanding the content of the phrases or documents in the corpus. The formula for TF-IDF is:
TF-IDF(t, d) = TF(t, d) × IDF(t)
where TF(t, d) is the number of times term t appears in document d, and IDF(t) is the inverse document frequency of term t.
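As an illustration of this formula (not code from the thesis), the sketch below computes TF-IDF for single terms on a tiny made-up corpus, using the classic IDF(t) = log(N / df(t)); note that software libraries often use smoothed variants of IDF, so exact values differ between implementations.

```python
import math
from collections import Counter

corpus = [
    "i prefer to work from home",
    "the office requires me to work onsite",
    "remote work gives a better work life balance",
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc.split())[term]               # TF(t, d): times the term appears in the document
    df = sum(term in d.split() for d in corpus)   # number of documents containing the term
    idf = math.log(len(corpus) / df)              # IDF(t); assumes the term occurs in at least one document
    return tf * idf

print(tf_idf("work", corpus[2], corpus))    # appears in every document -> weight 0
print(tf_idf("remote", corpus[2], corpus))  # appears in only one document -> higher weight
```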
1.2.3 Logistic Regression
Logistic regression is a statistical method for modeling the relationship between a dichotomous dependent variable and a set of independent variables. It models the probability that a certain class or event occurs. It does not require the independent variables to be normally distributed, but it does assume that the relationship between the log-odds of the outcome and the independent variables is linear.
The logistic regression model is widely used for binary classification in machine learning because of its simplicity and interpretability, and it is relatively robust to noise, outliers, and missing data.
1.3 Classification Matrix
Model performance evaluation is essential in machine learning, as it enables us to understand the strengths and limitations of a model when making predictions in new situations [4]. For classification algorithms, the standard tool for assessing performance is the classification matrix (confusion matrix), shown below (Figure 2).
Figure 2. Confusion Matrix.
In a confusion matrix, the true positive (TP) count is the number of correct positive predictions made by the model; in other words, the model predicted the positive class and the data label was also positive. Conversely, the true negative (TN) count is the number of cases the model correctly predicted as negative, meaning the model predicted the negative class and the true label was also negative. A false positive (FP) occurs when the model gives a positive prediction but the true label is negative, and a false negative (FN) occurs when the model gives a negative prediction but the true label is positive [2].
From these values, we can compute the precision score, recall score, accuracy, and F1-score:
Precision Score = TP / (TP + FP)
Recall Score = TP / (TP + FN)
Accuracy Score = (TP + TN) / (TP + FP + FN + TN)
F1-Score = 2 × Precision Score × Recall Score / (Precision Score + Recall Score)
The precision score measures the proportion of positively predicted labels that are actually correct. The recall score is also called sensitivity; a high sensitivity indicates that the model identifies most of the truly positive cases. An accuracy score of 1 indicates that the model's predictions are 100% correct. The F1-score is the harmonic mean of precision and recall; it ranges from 0 to 1, where 1 represents perfect precision and recall.
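For concreteness, the small sketch below (with made-up counts, not study results) computes these four scores directly from confusion-matrix counts.

```python
def classification_scores(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # also called sensitivity
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Made-up example: 40 TP, 10 FP, 5 FN, 45 TN
precision, recall, accuracy, f1 = classification_scores(tp=40, fp=10, fn=5, tn=45)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")
```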
Chapter 2 Methods
2.1 Data
This paper uses questionnaire data from the Trojan Pandemic Research Initiative (TPRI) [5] of the University of Southern California (USC), which was designed to learn about the use and effectiveness of COVID-19 vaccines. Participants were recruited via email by the COVID-19 Pandemic Research Center (CPRC). The potential participant pool included 40,600 USC students, staff, and faculty aged 18 years or older. All USC students enrolled in on-campus degree programs, as well as staff and faculty, were eligible to participate; the only exclusion criterion was age under 18, and there were no ethnic, language, or gender-based exclusion criteria. We anticipated that approximately 4,000 participants would join the cross-sectional survey (Health Cohort).
Participation was completely voluntary, and participants were asked to give biosamples (blood, saliva, urine) to test their immune response to COVID-19 vaccines at six time points over 9-12 months. Those who agreed to participate were asked to complete a baseline online survey and to take additional surveys every three months for one year; each additional survey took approximately 20 minutes to complete.
The Health Cohort survey also included additional psychosocial measures, such as potential COVID-19 exposures at home, school, or work; vaccine attitudes and hesitancy; sources of COVID-19 information; physical distancing behaviors and mask-wearing; and depression, anxiety, and stress.
This study includes 468 participants in Wave 2 of the Health Cohort study who were recruited from the current list of staff at USC in Los Angeles, California, from August to November 2021. These 468 participants completed the open-ended text question in the survey, which asked whether and how the pandemic had changed their career plans, allowing us to analyze people's behaviors and mental health as well as their attitudes towards COVID-19 and vaccination.
2.2 Measures
The outcome variable was COVID-19 vaccination willingness, assessed by the question "Have you ever got a vaccination?" and coded as 1 (Yes) and 2 (No). To maximize the power to compare across demographic groups, we stratified the sample of 468 into 24 strata (4 racial/ethnic groups [Latinx, African American, Asian/Pacific Islander/Filipino, White] × 2 gender identity groups [female, male] × 3 age groups [under 30, 30-50, 50+]). The questionnaire also included demographic features such as self-identified race and ethnicity, gender identity, age, education level, and income change due to the pandemic.
Based on the answers to the Knowledge About Preventing COVID-19 section of the questionnaire, we created a new variable called w2_knowledge_grade. This section contained 20 common-sense statements, each with a single correct answer, such as "Using hand sanitizer with at least 60% alcohol is effective in reducing the risk of infection from coronavirus." Responses ranged from very unreliable to very reliable and were coded from 1 to 7, and the correct answers were determined from news reports and scientific research. After scoring the 20 questions against the correct answers and summing the results, participants with a summed score between 10 and 20 were considered to have comprehensive knowledge of COVID-19 and were coded as 3 on the new variable. Participants with a summed score of 0-10 or 20-30 were coded as 2, representing a good understanding of the pandemic but with room to learn more. Participants with a summed score above 30 were coded as 1, meaning they had little knowledge of the pandemic.
Participants' trust in science was measured by averaging their responses to Nadelson's Trust in Science Scale [6]. The Attitudes and Behaviors section of the questionnaire contained 21 statements about people's attitudes toward scientists and scientific theories, such as "We can trust scientists to share their discoveries even if they don't like their findings," "We can trust science to find the answers that explain the natural world," and "We should trust the work of scientists." Each statement was rated on a 5-point scale from "Strongly Disagree" to "Strongly Agree" and coded from 1 to 5. We created a new variable called w2_attitudes_score based on the scores calculated from this 21-item scale.
The text variable came from the question "What has changed of your career goals because of the pandemic?" The answers mainly focused on work location: some participants expressed a wish to stay home with family and have a better work-life balance, while others mentioned that they had to earn more money or that their office required them to work onsite. We used sentiment analysis with both supervised and unsupervised machine learning algorithms to transform the text answers into values, and then created a new categorical variable from the content of the text, rating each answer as 0, 1, or 2, where 0 = remote, 1 = onsite, and 2 = both. This new categorical variable allowed us to analyze the responses more quantitatively and identify patterns related to changes in career goals due to the pandemic.
2.3 Statistical Analysis
This analysis included the 468 participants who answered the text question. We used machine learning algorithms to create a new variable based on the results of the text analysis and then included this variable in the logistic regression models. For the sentiment analysis, in order to compare the prediction performance of supervised and unsupervised machine learning algorithms, we used LSTM (Long Short-Term Memory) and TF-IDF (Term Frequency-Inverse Document Frequency) for the text analysis. Because of the large number of features (more than 100) in the questionnaire, we used the chi-squared (χ²) statistical test to measure the association between each feature and the outcome variable, and selected the 49 features with the highest test scores as the independent variables in the logistic regression model.
Figure 3. Three Models Comparison. For LSTM and TF-IDF models, the binary variable was created by the outcome of the
sentiment analysis. LR represents logistic regression.
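A minimal sketch of the chi-squared feature-selection step described above, using scikit-learn's SelectKBest on synthetic stand-in data (the real questionnaire features and outcome are not reproduced here):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(468, 120))   # stand-in for the encoded questionnaire features (non-negative)
y = rng.integers(0, 2, size=468)          # stand-in for vaccination willingness (yes/no)

selector = SelectKBest(score_func=chi2, k=49)   # keep the 49 highest-scoring features
X_49 = selector.fit_transform(X, y)
print(X_49.shape)                               # (468, 49)
```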
2.3.1 LSTM
LSTM is a supervised learning method, so the model was constructed using the manually rated variable w2_career_rate. We dropped the "both" category of w2_career_change and focused only on participants who answered work-from-home or onsite. We then filtered the answers so that only valid texts and words remained, leaving n = 438 (remote: 251, onsite: 187). We set the maximum number of features to 1,000 and used a Tokenizer to vectorize the text and convert it into sequences that the network can accept as input. After this processing, we set up the hyperparameters of the network; because the task is a binary classification and categorical cross-entropy was used as the loss, a softmax output layer was chosen. We split the dataset into training (80%) and test (20%) sets and trained for 9 epochs. Although the LSTM model has thousands of parameters and may take longer to process the data, its hidden layers help to filter out irrelevant information from the dataset and retain the most relevant information.
Figure 4. LSTM Model. The output dimension of the dense layer equals the number of classes in the classification task.
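The following is a hedged sketch of such a network in Keras, following the settings described above (vocabulary of 1,000 tokens, softmax output with categorical cross-entropy, 80/20 split, 9 epochs). The embedding dimension, LSTM size, dropout rates, and batch size are illustrative choices, and the toy texts and labels are placeholders rather than the thesis data.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned career-change answers (0 = remote, 1 = onsite).
texts = [
    "i want to keep working remotely and spend time with family",
    "the office now requires everyone to come back onsite",
    "remote work gives me a better work life balance",
    "i had to return onsite to earn more money",
] * 25
labels = to_categorical([0, 1, 0, 1] * 25, num_classes=2)

MAX_FEATURES = 1000                        # vocabulary size, as in the thesis
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts))

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

model = Sequential([
    Embedding(MAX_FEATURES, 128),          # learned word embeddings
    SpatialDropout1D(0.2),
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),
    Dense(2, activation="softmax"),        # output size = number of classes
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=9, batch_size=32, validation_data=(X_test, y_test))
print(model.evaluate(X_test, y_test))      # [loss, accuracy]
```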
2.3.2 TF-IDF
TF-IDF is an unsupervised technique that only needs the text variable as the feature for model training. The sample was the same as the one used for the LSTM. The first pre-processing step was converting all text answers to lowercase. In natural language processing, common low-information words are referred to as stop words, and the Python Natural Language Toolkit (NLTK) provides a list of English stop words. After removing the stop words from the answers, we lemmatized the remaining words using the WordNet lemmatizer. We created a new variable based on the pre-processed results, called career_change_clean, which was used for model training. To keep the model performance comparison consistent, we split the dataset into training (80%) and test (20%) sets, the same as in the LSTM model.
12
Using the training and test data, we applied TF-IDF vectorization to transform the documents into a matrix of TF-IDF features. We then converted the sparse matrix to a dense matrix and used it to build a classifier.
Table 1. Sample Size Comparison in Train and Test Data. The feature size is the same in both sets; the dataset was split into training (80%) and test (20%) sets.
Data Sample Size Feature Size
Train Data 350 534
Test Data 88 534
For model construction, we trained a Naive Bayes classifier on the training data. A standard algorithm commonly used in text categorization, the Naive Bayes classifier is straightforward and effective because it applies Bayes' theorem with a strong independence assumption between the features. We used a classification report with precision, recall, and F1-score to check the prediction performance of the TF-IDF model.
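A sketch of this pre-processing and classification pipeline using NLTK and scikit-learn follows; the example answers are invented placeholders, and the exact vectorizer settings used in the thesis may differ.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text):
    """Lowercase the answer, drop stop words, and lemmatize the remaining words."""
    tokens = [w for w in text.lower().split() if w not in stop_words]
    return " ".join(lemmatizer.lemmatize(w) for w in tokens)

# Placeholder answers and labels standing in for the 438 career-change responses.
answers = [
    "I prefer staying remote so I can be with my family",
    "My office requires everyone onsite again",
    "Remote work gives me a better work life balance",
    "I had to go back onsite to earn more money",
] * 25
labels = ["remote", "onsite", "remote", "onsite"] * 25

career_change_clean = [clean(a) for a in answers]
X_train, X_test, y_train, y_test = train_test_split(
    career_change_clean, labels, test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train).toarray()   # dense matrix, as described above
X_test_tfidf = vectorizer.transform(X_test).toarray()

clf = MultinomialNB().fit(X_train_tfidf, y_train)
print(classification_report(y_test, clf.predict(X_test_tfidf)))
```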
2.3.3 Logistic Regression
To compare the performance of a traditional statistical model with a combination of the traditional model and machine learning, we ran a multivariable logistic regression model using the manually labelled answers. The sample was the same as the one used for the LSTM. In addition, based on the predicted values from the LSTM and TF-IDF models, we created two separate binary variables and included each of them in the logistic regression, in place of the manually labelled answers, together with the other 49 features. For the logistic regression analysis, we split the dataset into training (50%) and test (50%) sets. The outcome of interest was whether people were willing to receive the COVID-19 vaccination (yes or no). The model is as follows:
log(p / (1 − p)) = α + β1x1 + ⋯ + β49x49
where p = Pr(Y = 1 | x1, …, x49) is the probability that a person is willing to receive the COVID-19 vaccination, and α, β1, …, β49 are the model parameters.
As above, we used a classification report with precision, recall, and F1-score to assess prediction performance. For a more direct view of model performance, we also plotted the ROC curve and calculated the model accuracy.
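A sketch of this logistic regression step with a 50/50 split and the standard evaluation outputs, on synthetic stand-in data (the actual selected features and outcome are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(438, 50))       # stand-in: 49 selected features plus the career-change indicator
beta = rng.normal(size=50)
y = (rng.random(438) < 1 / (1 + np.exp(-X @ beta))).astype(int)   # stand-in outcome (1 = willing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```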
Chapter 3 Results
Table 2. Demographic and Relevant Variables (N=468). The outcome variable was COVID-19 vaccination willingness,
assessed by the question: have you ever got a vaccination?
Characteristics Yes (n=443) No (n=25)
Age, years 40.08 (11.55) 39.10 (7.70)
Gender (%)
Male 25.75 24.00
Female 74.25 76.00
Race (%)
Latinx 25.85 41.00
African American 5.97 5.00
Asian/Pacific Islander/Filipino 21.73 19.00
White 46.45 35.00
Attitudes Score 3.84 (0.28) 2.46 (0.12)
Knowledge Score 18.79 (5.64) 22.34 (3.79)
Education Level (%)
High school graduate (or GED) 0.67 0.80
Some college or technical school 10.74 11.20
College graduate 28.74 34.00
Postgraduate degree 59.85 54.00
Pandemic Change on Income (%)
Increase 16.09 14.60
Decrease 33.44 30.40
No Change 50.43 55.00
Table 2 shows the characteristics of two groups: individuals who received the COVID-19 vaccination (n=443) and individuals who did not (n=25). The mean age of those who received the vaccination was 40.08 years (SD = 11.55), compared with 39.10 years (SD = 7.70) for those who did not. In terms of gender, the percentage of females was higher in both groups: 74.25% of those who received the vaccination were female and 25.75% were male, with a similar split among those who did not. The racial breakdown shows that Latinx participants made up a smaller share of the vaccinated group (25.85%) than of the unvaccinated group (41%), whereas White participants made up a larger share of the vaccinated group (46.45%) than of the unvaccinated group (35%). The mean attitudes score was 3.84 (SD = 0.28) for those who received the vaccination and 2.46 (SD = 0.12) for those who did not. The knowledge score also differed, with a mean of 18.79 (SD = 5.64) among the vaccinated compared to 22.34 (SD = 3.79) among the unvaccinated. This suggests that people who had more knowledge about the pandemic and more trust in science were more likely to receive the vaccination. As for the pandemic's impact on income, 33.44% of those who received the vaccination reported a decrease in income, compared to 30.40% of those who did not.
Figure 5. Demographics of the Population. (A) Gender, (B) Education level, (C) Race.
For the feature selection, the top 50 features are listed below (Figure 6). The highest-scoring features reflected people's concerns about vaccine safety, the convenience of receiving the vaccine, mental health, social network activities, social media impacts, education level, and career change. Among the race variables with high scores, Latinx ethnicity has been associated with a greater COVID-19 mortality rate compared to other ethnic groups, according to the HCA Healthcare Journal of Medicine [8]. One notable finding was that mental health and physical health (e.g., pregnancy) were important elements in the willingness to receive the COVID-19 vaccination.
Figure 6. Model Features. Top 50 features with highest importance scores.
3.1 LSTM
Figure 7. Loss and Accuracy in 9 epochs.
The loss and accuracy over 9 epochs show the loss decreasing and the accuracy increasing, indicating that the LSTM model trained well. Using the validation set to calculate the score and accuracy of the network, we obtained a loss score of 0.46 and an accuracy of 0.78, meaning the model classified 78% of the data correctly. The prediction accuracy for the remote class was 83%, and for the onsite class it was 75%, which is still high but slightly lower than for the remote class. Overall, with accuracies of at least 75% in both classes, these results suggest that the model performed well on this classification task.
We then calculated the accuracy of the new variable career_change_predict_lstm (N=438) against the manual labels; the result was 94%, meaning 94% of the predicted labels were correct. This new variable was then included in the logistic regression model. Based on the ROC curve for this logistic regression model, the area under the curve was 0.75.
Table 3. LSTM+LR Classification Report.
Category Precision Recall F1-score
Remote 0.97 0.99 0.98
Onsite 0.50 0.30 0.37
The classification report shows that for the remote class the precision was 0.97, meaning 97% of the remote predictions were correct; the recall (sensitivity) was 0.99; and the F1-score, the harmonic mean of precision and recall, was 0.98. For the onsite class, the precision was 0.50, meaning only 50% of the onsite predictions were correct; the recall was 0.30; and the F1-score was 0.37. In general, the model performed well for the remote class, with high precision, recall, and F1-score, but not as well for the onsite class.
Figure 8. ROC Curve for LSTM+LR. AUC is 0.75.
An AUC of 1.0 indicates a perfect classifier. An area under the curve (AUC) of 0.75 suggests that the classifier had moderate discriminatory power.
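As an illustration of how such an ROC curve and AUC can be produced with scikit-learn (using synthetic scores, not the study's predictions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=88)                          # stand-in test labels
y_score = np.clip(0.5 * y_true + 0.5 * rng.random(88), 0, 1)  # stand-in predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")                      # chance line (AUC = 0.5)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```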
3.2 TF-IDF
For the model prediction, the overall accuracy was 0.72, slightly lower than that of the LSTM algorithm. The precision was 73% for the remote class and 69% for the onsite class, indicating that the model predicted positive outcomes with moderate accuracy. The sensitivity was 0.85 for remote and 0.51 for onsite, indicating that the model correctly identified a relatively high proportion of the remote answers but a lower proportion of the onsite answers. Overall, although the model performed reasonably well on the classification task, the precision of each prediction was lower than that of the LSTM algorithm, which may indicate that further improvements can be made to the model.
Table 4. TF-IDF Classification Report.
Category Precision Recall F1-score
Remote 0.73 0.85 0.78
Onsite 0.69 0.51 0.59
Using the model to predict on the whole dataset, we obtained the new variable career_change_predict_tfidf, which would be used in the logistic regression model. To check the model's performance and compare the TF-IDF predicted values with the true values, we calculated the accuracy of the new variable career_change_predict_tfidf (N=438). The accuracy was 0.88, meaning 88% of the predictions were correct, a high value indicating that the model performed well on this classification.
Table 5. TF-IDF+LR Classification Report
Category Precision Recall F1-score
Remote 0.97 0.99 0.98
Onsite 0.25 0.12 0.17
The new variable was then included in the logistic regression model. The logistic regression model performed well for the remote class, with a high precision of 0.97 and a sensitivity of 0.99, but it did not perform well for the onsite class, with a low precision of 0.25 and a recall of 0.12. The F1-score for the remote class was 0.98, indicating a good balance between precision and recall. Hence, the model performs well for the remote class but not for the onsite class.
Figure 9. ROC Curve for TF-IDF+LR. AUC is 0.68.
For this logistic regression model, the AUC was 0.68, suggesting only modest ability to discriminate between the positive and negative classes.
3.3 Logistic Regression (only)
Table 6. Logistic Regression Classification Report.
Category Precision Recall F1-score
Remote 0.97 0.98 0.97
Onsite 0.50 0.40 0.44
According to the classification report, the sensitivity of the remote class was 0.98, while the sensitivity of the onsite class was only 0.40, suggesting that the model had a lower ability to correctly identify onsite answers. The precision was 0.97 for the remote class and 0.50 for the onsite class, much lower than for remote. The F1-score was 0.97 for the remote class and 0.44 for the onsite class.
Figure 10. ROC Curve for logistic regression. AUC is 0.81.
For the ROC curve of this logistic regression model, the area under the curve was 0.81. An AUC value between 0.8 and 0.9 is considered good, so an AUC of 0.81 suggests that the model was able to distinguish between the two classes with relatively high accuracy.
Chapter 4 Discussion and Conclusion
The results suggest that several factors may influence an individual's decision to receive or not receive the COVID-19 vaccination. Among the demographic factors, vaccination differed across racial and ethnic groups: Latinx participants were underrepresented among the vaccinated, and the Latinx community has also experienced a higher COVID-19 mortality rate than other groups. Individuals with higher levels of knowledge and trust in science were more likely to receive the vaccination, presumably because they had well-rounded knowledge and understood the importance of being vaccinated. Additionally, mental and physical health concerns, the convenience of receiving the vaccine, social network activities, and social media impacts were also found to be important factors in the willingness to receive the vaccination. From the sentiment analysis, people who tended to work onsite may have been more willing to receive the vaccination, possibly because of office or school requirements and concern for their own health when exposed to the public.
Table 7. 3 Models Performance Comparison.
Model AUC
LR only 0.81
LSTM + LR 0.75
TF-IDF + LR 0.68
For the prediction models, based on the AUC values, the logistic regression (LR) model achieved an AUC of 0.81, the highest among the three models. The model combining LSTM with logistic regression had an AUC of 0.75, lower than the performance of the logistic-regression-only model, and the model that used TF-IDF for its text features had the lowest AUC, 0.68. Compared to the traditional statistical model, the machine learning methods did not significantly increase the accuracy of prediction, although model performance may vary depending on the dataset. For this dataset the sample size was small, and the machine learning algorithms may need a larger sample to train well enough to perform better. In the comparison between LSTM and TF-IDF, however, the former clearly outperformed the latter: for this dataset, the supervised learning algorithms (logistic regression, LSTM) performed better than the unsupervised approach (TF-IDF). Supervised learning algorithms are generally better suited to tasks where there is a clear mapping between data and labels, and in this case having labeled data was essential for training the algorithm and helped it make accurate predictions. The prediction performance was poor for the onsite class because the answers in this class were more varied and less specific than the answers for the remote class. In addition, combinations of words can improve classification accuracy compared with single words; for example, "no home" performs better than "no" + "home", since the system may misinterpret the latter and assign the answer to the remote class.
One of the main limitations of this study is the small sample size. Due to the design of the questionnaire, there were more than 2,000 missing values for the text variable w2_career_change in the original dataset (N=2906). The variable w2_career_change was based on the question "What has changed of your career goals because of the pandemic?", which was a follow-up to the previous question, "Have your career goals changed because of the pandemic?" The missing data came from participants who answered "No" to the previous question or who declined to answer the follow-up. Initially, we kept the missing records in the dataset to have a larger sample for training the model, and for those who felt the pandemic had not changed their career we filled the missing answers with "There is no change on my career". However, this unbalanced the weights of the words "no" and "change": answers containing either word became more likely to be assigned to Y=2 (No), which could bias the prediction when such an answer actually expressed the opposite meaning. A larger sample size would help the machine learning algorithms learn the underlying patterns and relationships in the data more accurately.
These limitations create a potential for overfitting and may constrain model performance. In future work, data could be collected from the whole population of the Trojan Pandemic Research Initiative (TPRI) at USC, not only from staff. To increase the generalizability of the research, the sample could also be expanded to include a more diverse population, especially people with different demographic characteristics, cultural backgrounds, and income levels. In addition, the conclusions drawn here are based only on Wave 2 of the TPRI; longitudinal designs that follow participants over time should be considered to better understand whether attitudes and behaviors regarding vaccination change over time.
Based on these findings, for higher-dimensional datasets, machine learning classification based on selected features may help to improve the accuracy and efficiency of traditional statistical models. As for vaccination, it may be helpful to focus on improving public education and expanding access to correct information about the COVID-19 vaccine, especially in communities with lower vaccination rates. Using social media to share knowledge about vaccination and the pandemic, and to draw people's attention to their own health care, may also contribute to increasing the vaccination rate. As many workplaces transition back to onsite work after the pandemic, willingness to receive the COVID-19 vaccination may also increase. Overall, sentiment analysis can help governments and industries understand the relationship between public opinion and the development of an epidemic, and thus better comprehend the psychology of the public.
References
[1] Akkermans, J., Richardson, J., and Kraimer, M. L. The COVID-19 crisis as a career shock: Implications for careers and vocational behavior. Journal of Vocational Behavior 119 (2020), 103434.
[2] Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
[3] Hansun, S., Charles, V., and Gherman, T. The role of the mass vaccination programme in combating the COVID-19 pandemic: An LSTM-based analysis of COVID-19 confirmed cases. Heliyon 9, 3 (2023), e14397.
[4] Kumar, A. Accuracy, precision, recall and F1-score – Python examples. https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/#What_is_Recall_Score, 2023.
[5] Lee, R. C., Hu, H., Kawaguchi, E. S., Kim, A. E., Soto, D. W., Shanker, K., Klausner, J. D., Van Orman, S., and Unger, J. B. COVID-19 booster vaccine attitudes and behaviors among university students and staff in the United States: The USC Trojan Pandemic Research Initiative. Preventive Medicine Reports 28 (2022), 101866.
[6] Nadelson, L., Jorcyk, C., Yang, D., Jarratt Smith, M., Matson, S., Cornell, K., and Husting, V. I just don't trust them: The development and validation of an assessment instrument to measure trust in science and scientists. School Science and Mathematics 114, 2 (2014), 76–86.
[7] Parker, K., Horowitz, J. M., and Brown, A. COVID-19 pandemic continues to reshape work in America. Pew Research Center, 2022.
[8] Pedraza, L., Villela, R., Kamatgi, V., Cocuzzo, K., Correa, R., and Zylberglait Lisigurski, M. The impact of COVID-19 in the Latinx community. HCA Healthcare Journal of Medicine 3, 3 (2022), 5.
[9] Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), pp. 4580–4584.
[10] Vyas, L. "New normal" at work in a post-COVID world: work–life balance and labor markets. Policy and Society 41, 1 (2022), 155–167.