Supervised Learning Algorithms On
Factors Impacting Retweet
by
Jiahua Zhu
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of
the Requirement for the Degree
MASTER OF SCIENCE
(APPLIED MATHEMATICS)
December 2017
Contents
Abstract
1. Introduction
1.1 Research Background and Significance
1.2 Related Research
2. Data
2.1 Retweets
2.2 Data sets
3. Method
3.1 Logistic regression
3.2 Decision Tree
3.3 Support Vector Machine
3.4 Neural Network
4. Evaluation and discussion of results
References
Appendix
Appendix 1: Python code for using Twitter API
Appendix 2: Python code for cleaning the data and extracting the features.
Appendix 3: Python code for logistic regression
Appendix 4: Python code for decision tree and svm.
Appendix 5: Python code for Neural Network
Abstract
Social network services have become a viable source of information for users. A retweet is a
message posted by another user on Twitter that is reposted or forwarded, usually because the user
finds the message interesting and wants to share it with others. Retweets therefore reflect
what the Twitter community considers interesting on a global scale.
In this paper, I investigate a set of content-based features that might affect the retweetability
of tweets in a collection of 5297 Twitter messages, and I train four machine learning models
(Logistic Regression, Decision Tree, Support Vector Machine, Neural Network) to predict
whether a given tweet will be retweeted. Among the four models, the SVM achieved the highest
accuracy on the test set, 81%, while the logistic regression model provides a weight for each
feature that can guide further research.
According to the logistic regression results, usernames and hashtags are strongly related to
retweetability, tweets with exclamation marks are more likely to be retweeted than tweets with
question marks, and tweets with negative terms are more likely to be retweeted than tweets with
positive terms.
Keywords:
Retweetability, Machine learning, Interestingness, Content-based Features.
1. Introduction
1.1 Research Background and Significance
Social network services such as Facebook and Twitter have become important
communication tools for many online users. A reposted or forwarded message on Twitter is
called a retweet, and users normally forward a message if they consider it interesting and
worth sharing with others. In this paper I use retweets as a measure of popularity and apply
several machine learning methods to predict whether a message will be retweeted.
The question of what causes a message to be retweeted has been addressed frequently. Suh
et al.[11] conclude that contextual features, including the number of followers, the age of the
account, and the number of favorite tweets, influence the probability of a message being
retweeted. In this paper I explore the tweet content, which means I focus only on
features extracted from the message itself. For this purpose, I analyze a set of content-based
features on a collection of Twitter messages. The features are formed by the presence of emoticons,
URLs, hashtags, usernames, symbols, question marks and exclamation marks. I train four machine
learning models (Logistic Regression, Decision Tree, Support Vector Machine, Neural Network)
on the training set and test them on unseen messages.
In this paper I make two contributions. First, I collect 5297 tweets through the Twitter API
and separate them into two data sets: 75% form the training set and the rest form the test set.
I explore various content features that are related to retweetability.
Second, I use four machine learning models to predict whether a message will be retweeted and
compare them. With the same set of features, the SVM achieved the highest accuracy on the
test set, 81%.
1.2 Related Research
Twitter is one of the most popular social media platforms in the world. Java et al.[4] reported
that Twitter has gained notability and popularity worldwide since its creation in 2006, and found
that people use Twitter to talk about daily activities and to share information.
Liu et al.[2] proposed a research model to investigate the factors influencing users’ intention
to continue using Twitter. Their results showed that content gratifications and new technology
gratification were the two key types of gratification affecting continuance intention.
Twitter has also been used by candidates in political campaigns. Conover et al.[3] examined
several methods for predicting the political alignment of Twitter users and found that
structure in the data is strongly associated with political alignment.
Cha et al.[6] studied retweetability in relation to contextual features such as the
number of followers and the age of the account, and concluded that the number of followers
is not necessarily related to retweetability. However, Suh et al.[11] analyzed content
and contextual features from a large data set and found that factors including URLs, hashtags,
the number of followers, and the age of the account are significantly related to retweetability.
Hong et al.[7] investigated a wide spectrum of features for predicting which messages will be
retweeted, based on temporal information, metadata of messages and users, and structural
properties of the users’ social graph. The authors concluded that temporal features
have a strong effect on retweetability.
To sum up, previous work reported that retweetability is related to both contextual features
and content features. In this paper I focus only on content features. I apply four machine
learning methods to the data set and compare the resulting models.
2. Data
2.1 Retweets
A retweet is a reposted or forwarded message on Twitter. When a user finds a tweet
interesting and wants to share it with others, they can retweet it by adding a text indicator
(RT @ or via @) followed by the user name of the original author.
I want to find the features that affect the retweetability of tweets. I treat this as a binary
classification problem, so I label tweets that have been retweeted as 1 and tweets that have not
been retweeted as 0.
2.2 Data sets
I use Twitter’s API to collect a sample of tweets from September 12, 2017 to September
14, 2017. The total number of tweets collected is 5297.
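Section 1.1 states that 75% of the collected tweets form the training set and the rest form the test set, but the thesis does not show how the split was made (Appendix 2 reads two separate files). The following is only a minimal sketch of one way to do such a split; the function name `split_indices`, the shuffling and the seed are assumptions for illustration, not the author's procedure.

```python
import random


def split_indices(n_items, train_fraction=0.75, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train and test lists.

    Hypothetical helper: the thesis does not describe its splitting procedure.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_train = int(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# 5297 is the number of tweets reported in Section 2.2.
train_idx, test_idx = split_indices(5297)
print(len(train_idx), len(test_idx))
```

Shuffling before cutting avoids putting all tweets from one collection day into the same subset, which matters if tweets were saved in chronological order.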
From the data set, I parse the JSON response of each tweet and extract a set of features (Table 1)
that might be related to retweetability.
| Feature | Illustration |
| --- | --- |
| URLs | Used to indicate the location of the full text being talked about. |
| Hashtags | Used to mark specific topics. |
| Username | Used in Twitter to refer to other users directly, either for addressing a user or for talking about them. |
| Symbols | Short character sequences representing emotions. |
| Positive term | Terms expressing a positive attitude, such as ‘great’, ‘like’, ‘excellent’, ‘rock’, ‘perfect’. |
| Negative term | Terms expressing a negative attitude, such as ‘f**k’, ‘suck’, ‘fail’, ‘eww’, ‘shit’. |
| Question mark | Used to elicit responses. |
| Exclamation mark | Used in personal communication to mark strong and potentially emotional statements. |
Table 1: The features and their illustration
3. Method
3.1 Logistic regression
Logistic regression is a generalized linear model that learns a mapping from numeric
variables to the probability of a binary outcome.
The basic logistic model is as follows. The input is $X = (x_0, x_1, \ldots, x_n)$, the
corresponding parameter vector is $\Theta = (\theta_0, \theta_1, \ldots, \theta_n)$, and
$$Z = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n.$$
The sigmoid function has the set of all real numbers as its domain and returns a value between
0 and 1, so its output can be interpreted as a probability. It is defined as follows:
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Therefore, the hypothesis function of logistic regression is:
$$h_\theta(x) = \sigma(Z) = \frac{1}{1 + e^{-\theta^T x}}$$
$h_\theta(x)$ is interpreted as the probability of the dependent variable equaling a ‘success’ or ‘case’.
The cost function over $m$ training examples $(x^{(i)}, y^{(i)})$ is
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right].$$
The optimal parameters can be found by minimizing the cost function with an iterative method
such as gradient descent. Unlike linear regression, logistic regression has no closed-form
normal equation.
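To make the sigmoid, the hypothesis and the cost function above concrete, here is a minimal pure-Python sketch of batch gradient descent on the logistic cost. It is an illustration only: the toy data, the learning rate and the number of iterations are made up, and the thesis itself fits the model with scikit-learn (Appendix 3).

```python
import math


def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta, x):
    """h_theta(x) = sigmoid(theta . x); x[0] is the constant 1 for the intercept."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Average cross-entropy J(theta) over all examples."""
    total = 0.0
    for x, label in zip(X, y):
        h = predict_proba(theta, x)
        total += label * math.log(h) + (1 - label) * math.log(1 - h)
    return -total / len(X)


def gradient_step(theta, X, y, learning_rate):
    """One batch update: theta_k <- theta_k - lr * mean((h - y) * x_k)."""
    m = len(X)
    grad = [0.0] * len(theta)
    for x, label in zip(X, y):
        err = predict_proba(theta, x) - label
        for k, xk in enumerate(x):
            grad[k] += err * xk / m
    return [t - learning_rate * g for t, g in zip(theta, grad)]


# Toy data (made up): column 0 is the intercept term, column 1 a binary feature.
X = [[1, 0], [1, 0], [1, 1], [1, 1], [1, 1], [1, 0]]
y = [0, 0, 1, 1, 0, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(500):
    theta = gradient_step(theta, X, y, learning_rate=0.5)
final_cost = cost(theta, X, y)
print(initial_cost, final_cost, theta)
```

With all parameters at zero every prediction is 0.5, so the starting cost equals ln 2; each step then moves the parameters downhill on the convex cost surface.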
In this paper, I learn a mapping from the features of a tweet to the probability of the tweet
being retweeted. Let $x_{ij}$ be feature $i$ of tweet $j$, and let $\mathit{retweet}_j$ indicate
whether tweet $j$ is retweeted. The logistic regression model is as follows:
$$P(\mathit{retweet}_j) = \frac{1}{1 + e^{-(\theta_0 + \sum_i \theta_i x_{ij})}}$$
The parameter $\theta_i$ learned by logistic regression is interpreted as the log-odds for feature $i$.
The result is as follows:
| Feature | Weight |
| --- | --- |
| URLs | -0.93102488 |
| Hashtags | 0.54753442 |
| Username | 2.88957666 |
| Symbols | -0.08423995 |
| Positive term | -0.38200365 |
| Negative term | 0.52519623 |
| Question mark | -0.76547879 |
| Exclamation mark | 0.51776019 |
Table 2: The features and their weights
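Because each weight in Table 2 is on the log-odds scale, exponentiating it gives a multiplicative factor on the odds of a retweet, which some readers find easier to compare. The short sketch below does that conversion for the values in Table 2; the dictionary layout and variable names are my own, and the output is not part of the thesis results.

```python
import math

# Weights copied from Table 2.
weights = {
    "URLs": -0.93102488,
    "Hashtags": 0.54753442,
    "Username": 2.88957666,
    "Symbols": -0.08423995,
    "Positive term": -0.38200365,
    "Negative term": 0.52519623,
    "Question mark": -0.76547879,
    "Exclamation mark": 0.51776019,
}

# exp(weight) is the factor by which the odds of a retweet are multiplied
# when the binary feature is present, with the other features held fixed.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<17} weight={:+.3f}  odds ratio={:.2f}".format(name, weights[name], ratio))
```

A factor above 1 corresponds to a positive weight and a factor below 1 to a negative weight, so the ordering of features is the same as in Table 2.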
Applied to the test set, this logistic regression model reaches an accuracy of 0.797. The AUC
is 0.786 on the training set and 0.801 on the test set. I also plot the ROC curve, which
shows the tradeoff between sensitivity and specificity: the closer the curve follows
the left-hand border and then the top border of the ROC space, the more accurate the classifier.
The ROC curve is a widely used way to visualize the performance of a binary classifier,
and the AUC summarizes that performance in a single number.
Figure 1: ROC curves of training set and test set
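As an illustration of what the AUC measures, it can be computed directly as the fraction of (retweeted, not retweeted) pairs in which the retweeted tweet receives the higher predicted probability, counting ties as one half. The sketch below uses that definition on made-up scores; the function names are hypothetical, and the thesis itself uses `roc_curve` and `roc_auc_score` from scikit-learn (Appendix 3).

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def roc_point(labels, scores, threshold):
    """(false positive rate, true positive rate) when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


# Made-up labels and predicted probabilities for four tweets.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_from_scores(labels, scores))
print(roc_point(labels, scores, threshold=0.5))
```

Sweeping the threshold from 1 down to 0 and plotting the resulting points traces the ROC curve; the area under that curve equals the pairwise fraction computed by `auc_from_scores`.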
3.2 Decision Tree
A decision tree is a predictive modeling approach used in machine learning. The
goal is to create a model that predicts the value of a target variable by learning simple decision
rules inferred from the data features. Decision trees are divided into classification trees and
regression trees, depending on whether the target variable is discrete or continuous. In
this paper the target variable is whether the message is retweeted, so I use a classification tree.
The basic intuition behind a decision tree is to map out all possible decision paths in the form of
a tree. Constructing the tree involves repeatedly splitting the data; in this paper every feature
is binary, so each split has two branches.
Before splitting at each level of the tree, we evaluate all features based on their distribution
and decide which feature is the best one to split on, using information gain as the indicator:
the larger the information gain, the more significant the attribute. The most significant
attribute becomes the root. The data set is then split on the values of that attribute, and the
process is repeated on the new terminal leaves.
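The information gain of a feature is the entropy of the labels minus the weighted entropy of the labels after splitting on that feature. A minimal sketch with made-up data follows; the function names are my own, and note that the thesis builds its tree with scikit-learn rather than with code like this.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Made-up example: four tweets, label 1 = retweeted.
retweeted = [1, 1, 0, 0]
has_hashtag = [1, 1, 0, 0]    # separates the two classes perfectly
has_question = [1, 0, 1, 0]   # carries no information about the label

print(information_gain(has_hashtag, retweeted))
print(information_gain(has_question, retweeted))
```

In this toy case the tree would split on `has_hashtag` first, because that split removes all uncertainty about the label while `has_question` removes none.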
For this problem, I use the same features as in the logistic regression model and plot the
decision tree as follows:
Figure 2: Decision Tree model of eight features. (x[1]: question mark, x[2]: exclamation
mark, x[3]: positive term, x[4]: negative term, x[5]: symbols, x[6]: URL, x[7]: hashtag, x[8]:
user mention.)
Applied to the test set, this decision tree reaches an accuracy of 0.798.
3.3 Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. To apply the algorithm, we treat each data item as a point in
n-dimensional space (where n is the number of features), with the value of each feature being
the value of a particular coordinate. We then perform classification by finding the hyperplane
that best separates the two classes.
For a linearly separable data set, we select the hyperplane that maximizes the margin, that is,
the distance between the hyperplane and the nearest data points of either class.
Figure 3: Linearly separable SVM
Figure 4: Not linearly separable SVM
For a data set that is not linearly separable, no linear hyperplane can separate the two
classes. The SVM then uses a kernel function, which takes the low-dimensional input space and
transforms it into a higher-dimensional space. A linear hyperplane found in this
higher-dimensional space can separate data that are not linearly separable in the original space.
In this paper I have eight features and the data set is not linearly separable, so I use the ‘RBF’
kernel function and construct the hyperplane in the high-dimensional space. Applied to the
test set, the SVM model reaches an accuracy of 0.812.
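The RBF kernel mentioned above measures the similarity of two points as exp(-gamma * squared distance), which equals 1 for identical points and decays towards 0 as they move apart. The sketch below evaluates it for made-up 8-dimensional binary feature vectors; the value `gamma=0.5` and the example vectors are assumptions for illustration, not settings reported in the thesis.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Made-up tweets encoded as eight 0/1 features.
tweet_a = [0, 1, 0, 0, 0, 1, 1, 0]
tweet_b = [0, 1, 0, 0, 0, 1, 1, 0]   # identical to tweet_a
tweet_c = [1, 0, 1, 0, 0, 0, 0, 1]   # differs from tweet_a in six features

print(rbf_kernel(tweet_a, tweet_b))
print(rbf_kernel(tweet_a, tweet_c))
```

The SVM never computes the high-dimensional mapping explicitly; it only needs these pairwise kernel values, which is why the approach stays practical.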
3.4 Neural Network
A neural network can be used for classification and cluster analysis. In this paper, I use the
back propagation algorithm to build the classification model. Because each tweet has eight
features, the input layer contains eight units. The inputs are multiplied by the weights and
passed through an activation function to the neurons in the hidden layers. The output layer then
processes the information coming from the last hidden layer. I calculate the error,
which is the difference between the network’s output and the desired output of the training
example, and adjust the weights slightly depending on the direction of the error. Finally, I use
Python to repeat this process until the error falls within a preset range. In this model I use
three hidden layers. Applied to the test set, the model reaches an accuracy of 0.720.
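The forward pass described above can be sketched in a few lines. This is a simplified illustration, not the code of Appendix 5: bias units are omitted, and the hidden-layer sizes (five units each) are assumed because the thesis reports only that there are three hidden layers.

```python
import math
import random


def random_layer(n_in, n_out, rng):
    """Weights for one fully connected layer: one list of n_in weights per output unit."""
    return [[rng.uniform(-0.25, 0.25) for _ in range(n_in)] for _ in range(n_out)]


def forward(x, weights):
    """Propagate one input vector through fully connected tanh layers."""
    activation = list(x)
    for layer in weights:
        activation = [
            math.tanh(sum(w * a for w, a in zip(unit, activation)))
            for unit in layer
        ]
    return activation


rng = random.Random(0)
layer_sizes = [8, 5, 5, 5, 1]  # 8 inputs, three hidden layers (sizes assumed), 1 output
weights = [random_layer(n_in, n_out, rng)
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# A made-up tweet encoded as eight 0/1 features.
output = forward([0, 1, 0, 0, 0, 1, 1, 0], weights)
print(output)
```

Training then compares this output with the desired label and propagates the error backwards to adjust each layer's weights.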
4. Evaluation and discussion of results
In this paper, I train a logistic regression model, a decision tree, a support vector machine and
a neural network. For each tweet I extract eight features, and all four models can perform the
classification.
For both logistic regression and artificial neural networks, the model parameters are
determined by maximum likelihood estimation. The advantage of logistic regression is its
low model complexity.
Decision trees have the advantage that they are not black-box models and can easily be
expressed as rules. A major disadvantage of decision trees is their greedy construction
process: at each step, the combination of the single best variable and the optimal split point is
selected, whereas a multi-step lookahead that considers combinations of variables might obtain
better results.
Support vector machines build optimal separating boundaries between data sets by solving a
constrained quadratic optimization problem. A disadvantage of support vector machines is that
they do not directly give a probability of class membership.
With the same set of features, the SVM achieved the highest accuracy of the four methods on
the test set, 81.2%.
Next, I analyze the weight of each feature obtained from the logistic model.
The weight $\theta_i$ of a binary feature $i$ with possible values 0/1 learned by logistic
regression can be interpreted as the log-odds of a tweet having that feature:
$$\theta_i = \ln\left[\frac{P(\mathit{retweet}_j \mid x_{ij} = 1)}{P(\mathit{retweet}_j \mid x_{ij} = 0)}\right]$$
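The ratio above is written with probabilities. Read against the logistic model of Section 3.1, the identity that holds exactly is stated in terms of odds, with all other features held fixed. A short derivation, using the same symbols and introducing only the shorthand $\operatorname{odds}(A) = P(A) / (1 - P(A))$:

```latex
\ln\frac{P(\mathit{retweet}_j)}{1 - P(\mathit{retweet}_j)}
  = \theta_0 + \sum_k \theta_k x_{kj}
\quad\Longrightarrow\quad
\theta_i
  = \ln\frac{\operatorname{odds}(\mathit{retweet}_j \mid x_{ij} = 1)}
            {\operatorname{odds}(\mathit{retweet}_j \mid x_{ij} = 0)}
```

The first equality follows by solving the model for the exponent; the second follows by subtracting the two cases $x_{ij} = 1$ and $x_{ij} = 0$, in which every other term cancels.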
From the weights obtained above, I draw the following conclusions:
• Messages with URLs are less likely to be retweeted.
• Hashtags and usernames have a positive effect on retweetability. One possible explanation
is that tweets with usernames and hashtags are more likely to catch people’s attention. On a
social media platform everyone follows a certain number of topics, so tweets with
hashtags or usernames are easily noticed.
• Symbols in a tweet have little relation to retweetability.
• Tweets that contain negative terms are more likely to be retweeted than tweets that contain
positive terms. I think negative terms are more extreme and express emotion more
strongly.
• Tweets with exclamation marks are more likely to be retweeted, whereas tweets with question
marks are less likely to be retweeted.
References
[1] D. W. Hosmer and S. Lemeshow. Applied logistic regression. John Wiley and Sons,
2000.
[2] Liu, I. L., Cheung, C. M., & Lee, M. K. (2010, July). Understanding Twitter Usage: What
Drive People Continue to Tweet. In PACIS (p. 92).
[3] Conover, M. D., Gonçalves, B., Ratkiewicz, J., Flammini, A., & Menczer, F. (2011,
October). Predicting the political alignment of Twitter users. In 2011 IEEE Third
International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third
International Conference on Social Computing (SocialCom) (pp. 192-199). IEEE.
[4] Java, A., Song, X., Finin, T., & Tseng, B. (2007, August). Why we twitter: understanding
microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-
KDD 2007 workshop on Web mining and social network analysis (pp. 56-65). ACM.
[5] Java, A., Song, X., Finin, T., and Tseng, B. Why We Twitter: Understanding
Microblogging Usage and Communities. Proc WebKDD/SNA-KDD’07, 56-65.
[6] Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. K. (2010). Measuring user influence
in twitter: The million follower fallacy. Icwsm, 10(10-17), 30.
[7] Hong, L., Dan, O., & Davison, B. D. (2011, March). Predicting popular messages in twitter.
In Proceedings of the 20th international conference companion on World wide web (pp.
57-58). ACM.
[8] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in
Twitter: the million follower fallacy. In Proc. Int. Conf. on Weblogs and Social Media,
pages 10–17, 2010.
[9] Naveed, N., Gottron, T., Kunegis, J., & Alhadi, A. C. (2011, June). Bad news travel fast:
A content-based analysis of interestingness on Twitter. In Proceedings of the 3rd
International Web Science Conference (p. 8). ACM.
[10] D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of
retweeting on Twitter. In Hawaii Int. Conf. on System Sciences, pages 1–10, 2010.
[11] Suh, B., Hong, L., Pirolli, P., & Chi, E. H. (2010, August). Want to be retweeted? Large
scale analytics on factors impacting retweet in Twitter network. In 2010 IEEE Second
International Conference on Social Computing (SocialCom) (pp. 177-184). IEEE.
[12] M. D. Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher.
How does the data sampling strategy impact the discovery of information diffusion in
social media? In Proc. Conf. on Weblogs and Social Media, pages 34–41, 2010.
[13] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman. Influence and passivity in
social media. CoRR, abs/1008.1253, 2010.
[14] A. Pepe, H. Mao, and J. Bollen. Modeling public mood and emotion: Twitter sentiment
and socio-economic phenomena. CoRR, abs/0911.1583, 2009.
[15] S. Yardi and d. boyd, “Dynamic debates: An analysis of group polarization over time
on Twitter,” Bulletin of Science, Technology and Society, vol. 20, pp. S1–S8, 2010.
[16] M. Hall, “Correlation-based feature selection of discrete and numeric class machine
learning,” Ph.D. dissertation, University of Waikato, August 2008.
Appendix
Appendix 1: Python code for using Twitter API
import json
import os

import tweepy
from tweepy import OAuthHandler

# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are placeholders chosen for this listing.
ckey = os.environ['TWITTER_CONSUMER_KEY']
csecret = os.environ['TWITTER_CONSUMER_SECRET']
atoken = os.environ['TWITTER_ACCESS_TOKEN']
asecret = os.environ['TWITTER_ACCESS_SECRET']

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
api = tweepy.API(auth)

# Search English tweets in the collection window and append the raw JSON of
# each tweet to the output file, one tweet per line.
# Note: api.search is the method name in Tweepy 3.x; Tweepy 4.x renamed it to
# api.search_tweets.
with open("/Users/zhujiahua/Desktop/eg.txt", "a") as savefile:
    for status in tweepy.Cursor(api.search, q='a', since="2017-09-12", until="2017-09-14",
                                lang='en').items():
        savefile.write(json.dumps(status._json))
        savefile.write('\n')
Appendix 2: Python code for cleaning the data and extracting the features.
import json

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score


def process_data(tweets_data_path):
    """Read one JSON tweet per line and return the 0/1 retweet labels followed
    by eight 0/1 feature lists (one entry per tweet in each list)."""
    tweets_text = []
    tweets_entity = []
    tweets_retweet = []
    myretweet = []
    tweets_questionmark = []
    tweets_exclamationmark = []
    tweets_positive = []
    tweets_negative = []
    tweets_symbols = []
    tweets_URL = []
    tweets_hashtag = []
    tweets_usermention = []

    with open(tweets_data_path, "r") as tweets_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line)
                text = tweet['text']
                entities = tweet['entities']
                retweet_count = tweet['retweet_count']
            except (ValueError, KeyError, TypeError):
                # Skip lines that are not complete tweets. All three fields are
                # read before appending so the lists stay aligned.
                continue
            tweets_text.append(text)
            tweets_entity.append(entities)
            tweets_retweet.append(retweet_count)

    # entities: 1 if the tweet has at least one item of the given kind
    for entities in tweets_entity:
        tweets_symbols.append(1 if entities['symbols'] else 0)
        tweets_usermention.append(1 if entities['user_mentions'] else 0)
        tweets_hashtag.append(1 if entities['hashtags'] else 0)
        tweets_URL.append(1 if entities['urls'] else 0)

    # retweet: label 1 if the tweet has been retweeted at least once
    for num in tweets_retweet:
        myretweet.append(1 if num != 0 else 0)

    # text
    for text in tweets_text:
        # Tokens are split on single spaces, so '?' and '!' are counted only
        # when they stand alone, and terms must match a whole token exactly.
        mytext = text.strip().split(" ")
        questionmark = 0
        exclamationmark = 0
        positive_word = 0
        negative_word = 0
        for items in mytext:
            item = items.lower()
            if item == '?':
                questionmark += 1
            if item == '!':
                exclamationmark += 1
            if item in ['great', 'like', 'excellent', 'rock', 'perfect']:
                positive_word += 1
            if item in ['f**k', 'suck', 'fail', 'eww', 'shit']:
                negative_word += 1
        tweets_questionmark.append(1 if questionmark > 0 else 0)
        tweets_exclamationmark.append(1 if exclamationmark > 0 else 0)
        tweets_positive.append(1 if positive_word > 0 else 0)
        tweets_negative.append(1 if negative_word > 0 else 0)

    return (myretweet, tweets_questionmark, tweets_exclamationmark, tweets_positive,
            tweets_negative, tweets_symbols, tweets_URL, tweets_hashtag, tweets_usermention)


# Training set
(myretweet, tweets_questionmark, tweets_exclamationmark, tweets_positive,
 tweets_negative, tweets_symbols, tweets_URL, tweets_hashtag, tweets_usermention) = \
    process_data("/Users/zhujiahua/Desktop/train.txt")

# Test set (names carry the suffix 1)
(myretweet1, tweets_questionmark1, tweets_exclamationmark1, tweets_positive1,
 tweets_negative1, tweets_symbols1, tweets_URL1, tweets_hashtag1, tweets_usermention1) = \
    process_data("/Users/zhujiahua/Desktop/test.txt")
Appendix 3: Python code for logistic regression
logistic_model = LogisticRegression()

# Each row of X holds one feature, so X.T has one row per tweet.
X = np.array([tweets_questionmark, tweets_exclamationmark, tweets_positive,
              tweets_negative, tweets_symbols, tweets_URL, tweets_hashtag, tweets_usermention])
y = np.array(myretweet)
logistic_model.fit(X.T, y)
# Learned weights, in the same order as the rows of X
print(logistic_model.coef_)

X1 = np.array([tweets_questionmark1, tweets_exclamationmark1, tweets_positive1,
               tweets_negative1, tweets_symbols1, tweets_URL1, tweets_hashtag1, tweets_usermention1])
y1 = np.array(myretweet1)

# calculate the accuracy on the training set
predicted = logistic_model.predict(X.T)
accuracy_train = (predicted == y).mean()

# calculate the accuracy on the test set
predicted1 = logistic_model.predict(X1.T)
accuracy_test = (predicted1 == y1).mean()
print('Accuracy_train: {}'.format(accuracy_train))
print('Accuracy_test: {}'.format(accuracy_test))

# predicted probability of the positive class (retweeted)
train_probs = logistic_model.predict_proba(X.T)[:, 1]
test_probs = logistic_model.predict_proba(X1.T)[:, 1]

# calculate AUC
auc_train = roc_auc_score(y, train_probs)
auc_test = roc_auc_score(y1, test_probs)
print('Auc_train: {}'.format(auc_train))
print('Auc_test: {}'.format(auc_test))

# calculate and plot the ROC curves (index 0: false positive rate, index 1: true positive rate)
roc_train = roc_curve(y, train_probs)
roc_test = roc_curve(y1, test_probs)
plt.plot(roc_train[0], roc_train[1], color='blue', label="training set")
plt.plot(roc_test[0], roc_test[1], color='red', label="test set")
plt.legend(loc='upper left')
plt.title('ROC Curve')
plt.show()
Appendix 4: Python code for decision tree and svm.
import numpy as np
import pydotplus
from sklearn import tree
from sklearn.svm import NuSVC

X = np.array([tweets_questionmark, tweets_exclamationmark, tweets_positive,
              tweets_negative, tweets_symbols, tweets_URL, tweets_hashtag, tweets_usermention])
y = np.array(myretweet)
X1 = np.array([tweets_questionmark1, tweets_exclamationmark1, tweets_positive1,
               tweets_negative1, tweets_symbols1, tweets_URL1, tweets_hashtag1, tweets_usermention1])
y1 = np.array(myretweet1)

# Decision tree. Note: the default split criterion in scikit-learn is Gini
# impurity; pass criterion='entropy' to split on information gain.
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X.T, y)

# Accuracy: fraction of test tweets whose predicted label equals the true label
accuracy_rate = (clf.predict(X1.T) == y1).mean()
print(accuracy_rate)

# Export the fitted tree to a PDF file
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("tweet.pdf")

# Support vector machine (NuSVC uses the RBF kernel by default)
clf = NuSVC()
clf.fit(X.T, y)
print(clf)
accuracy_rate = (clf.predict(X1.T) == y1).mean()
print(accuracy_rate)
Appendix 5: Python code for Neural Network
import numpy as np


def tanh(x):
    return np.tanh(x)


# Note: fit() calls the derivative functions on layer outputs (values that have
# already passed through the activation), not on the pre-activation sums.
def tanh_deriv(x):
    return 1.0 - np.tanh(x) * np.tanh(x)


def logistic(x):
    return 1 / (1 + np.exp(-x))


def logistic_derivative(x):
    return logistic(x) * (1 - logistic(x))


class NeuralNetwork:
    def __init__(self, layers, activation='tanh'):
        """layers lists the number of units per layer, input layer first.
        At least one hidden layer is required."""
        if activation == 'logistic':
            self.activation = logistic
            self.activation_deriv = logistic_derivative
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_deriv = tanh_deriv

        # Weights start uniformly in [-0.25, 0.25); the extra row and column
        # account for the bias unit.
        self.weights = []
        for i in range(1, len(layers) - 1):
            self.weights.append(
                (2 * np.random.random((layers[i - 1] + 1, layers[i] + 1)) - 1) * 0.25)
        self.weights.append(
            (2 * np.random.random((layers[i] + 1, layers[i + 1])) - 1) * 0.25)

    def fit(self, X, y, learning_rate=0.1, epochs=10000):
        """Train with back propagation on one randomly chosen example per epoch."""
        X = np.atleast_2d(X)
        # Append a constant 1 to every example as the bias input.
        temp = np.ones([X.shape[0], X.shape[1] + 1])
        temp[:, 0:-1] = X
        X = temp
        y = np.array(y)

        for k in range(epochs):
            i = np.random.randint(X.shape[0])
            a = [X[i]]

            # forward pass: a[l + 1] is the output of layer l
            for l in range(len(self.weights)):
                a.append(self.activation(np.dot(a[l], self.weights[l])))

            # backward pass: propagate the output error through the layers
            error = y[i] - a[-1]
            deltas = [error * self.activation_deriv(a[-1])]
            for l in range(len(a) - 2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[l].T) *
                              self.activation_deriv(a[l]))
            deltas.reverse()

            # weight update
            for j in range(len(self.weights)):
                layer = np.atleast_2d(a[j])
                delta = np.atleast_2d(deltas[j])
                self.weights[j] += learning_rate * layer.T.dot(delta)

    def predict(self, x):
        """Return the network output for a single feature vector x."""
        x = np.array(x)
        temp = np.ones(x.shape[0] + 1)
        temp[0:-1] = x
        a = temp
        for l in range(0, len(self.weights)):
            a = self.activation(np.dot(a, self.weights[l]))
        return a
Asset Metadata
Creator: Zhu, Jiahua (author)
Core Title: Supervised learning algorithms on factors impacting retweet
School: College of Letters, Arts and Sciences
Degree: Master of Science
Degree Program: Applied Mathematics
Publication Date: 11/06/2017
Defense Date: 11/01/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: OAI-PMH Harvest, supervised learning algorithm, retweet, interestingness on Twitter
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lototsky, Sergey (committee chair), Mancera, Ricardo (committee member), Sacker, Robert (committee member)
Creator Email: jiahuazh@usc.edu, zhujiahua9@126.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-452195
Unique identifier: UC11265657
Identifier: etd-ZhuJiahua-5874.pdf (filename), usctheses-c40-452195 (legacy record id)
Legacy Identifier: etd-ZhuJiahua-5874.pdf
Dmrecord: 452195
Document Type: Thesis
Rights: Zhu, Jiahua
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags: supervised learning algorithm, retweet, interestingness on Twitter