BeyondURL:
Learning Meaningful Embedding Representations for Web Addresses
by
Jeong Hyun An
A Thesis Presented to the
FACULTY OF THE USC VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
COMPUTER SCIENCE
May 2022
Copyright 2022 Jeong Hyun An
TABLE OF CONTENTS
List of Tables
List of Figures
Abstract
1 Introduction
2 Related work
3 Methodology
3.1 Baseline Model
3.2 Fine-tuning
3.3 Loss Functions
4 Experimental Setup
4.1 Experimental Environment
4.2 Getting the Data
4.3 Generating Dataset
4.4 Extracting Features
5 Results
5.1 Genre Classification
5.2 URL Ranking
6 Conclusion
References
List of Tables
1 Size of the DMOZ dataset and precision, recall, f1-score of each category.
2 Although the "sentence-t5-xxl" model shows the highest performance, "all-MiniLM-L6-v2" is 284 times faster. The Web environment requires real-time processing, so "all-MiniLM-L6-v2" is suitable for our task (models whose names start with "all-" are trained on 1 billion training pairs). After embedding URLs and descriptions into the model, the performance increased by 4.8.
List of Figures
1 RoBERTa-base models outperform RoBERTa-large models on the genre classification task (trained and tested on the whole-dataset). Fine-tuned models show slightly better results.
2 Upper graphs use the eng-dataset and lower graphs use the whole-dataset. As shown in the graphs, the gap in accuracy closes as the models are trained.
3 The model trained with the MNR Loss function shows better performance on the URL Ranking task.
Abstract
Understanding the meanings of Web page URLs is inherent to users' behaviors. From the Web Intelligence perspective, it also benefits applications including phishing URL identification, Web page recommendation and focused Web crawlers. In this thesis, we develop a data-driven machine learning system for URL understanding based on state-of-the-art (SOTA) Natural Language Processing (NLP) technologies.
Our development takes the following steps to improve on the SOTA. First, we apply a sequence tagging model for word segmentation on the lexical features that are formed from multiple run-together words, so we can break them down into individual words. Second, we apply a series of language modeling technologies, including pre-trained Transformer language models and word embeddings, to represent the lexical features extracted from the URLs. Third, we compare several classification algorithms that use only the lexical features from the URL. We then take a further step by pre-training on the meta-description of each site. We evaluate the proposed approaches for URL classification on the DMOZ benchmark.
1 Introduction
Along with Hypertext and HTTP, the Uniform Resource Locator (URL) is one of the most important Web concepts. Browsers employ this technology to retrieve any Web resource that has been published. A URL is the address of a specific, unique resource on the Internet, and it is composed of the following elements: protocol (http), domain (www.usc.edu), port number (:8080), path to file (/path/to/file.html) and parameters (?key1=value1&key2=value2).
The scheme is the first part of the URL and specifies the protocol that the browser must use to request the resource (a protocol is a set method for exchanging or transferring data around a computer network).
The domain indicates the Web server that is being accessed. This is usually a domain
name, but it could also be an IP address (but this is rare as it is much less convenient).
The port is the technical "gate" used to access the Web server's resources. It is frequently omitted when the Web server grants access to its resources via the conventional HTTP ports (80 for HTTP and 443 for HTTPS).
The path to the resource is nowadays primarily an abstraction handled by Web servers, with no physical counterpart. It can be thought of as the website's folder structure, which lets a browser know which sub-directory to look in for a Web page.
The parameters are additional pieces of information passed to the Web server. They form a list of key/value pairs separated by the '&' symbol. The Web server can use those arguments to perform further tasks before returning the resource.
Each element of a valid URL carries important information about the resource the URL points to. Since domains should be memorable, it is often desirable to have a mnemonic URL formed from words relevant to the site. Hence, several terms are typically strung together into a single URL. Outside of the domain, it is usually easy to split these terms on the special characters in the URL. However, some words in the elements, especially the domain, are not separated by special characters at all.
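To make the URL anatomy above concrete, the short sketch below decomposes an example address into these elements using Python's standard urllib.parse module; the example URL and port are illustrative only.

```python
from urllib.parse import urlparse, parse_qs

# Illustrative URL built from the elements discussed above.
url = "https://www.usc.edu:8080/path/to/file.html?key1=value1&key2=value2"

parts = urlparse(url)
print(parts.scheme)            # 'https'  -> protocol
print(parts.hostname)          # 'www.usc.edu' -> domain
print(parts.port)              # 8080 -> port (omitted for the conventional 80/443)
print(parts.path)              # '/path/to/file.html' -> path to the resource
print(parse_qs(parts.query))   # {'key1': ['value1'], 'key2': ['value2']} -> parameters
```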
The remainder of the paper is organized as follows. Section 2 describes related work. Section 3 describes the proposed methodology for genre classification in real time based on word vector embedding representations. Section 4 describes the experimental setup, Section 5 presents the results, and the conclusion of the paper is discussed in Section 6.
2 Related work
This section describes the strategies chosen in previous work and how we can apply them to our tasks.
Feature extraction is the central step in genre classification of Web pages. Many types of features have been tested for classifying the genre of a Web page or for distinguishing malicious URLs from benign URLs. Lexical features, host-based features, content features and even context and popularity features have all been taken into consideration [15]. Among them, lexical features are the most frequently used for these tasks. They have been shown to give strong performance and are relatively easy to extract [12]. The lexical features retrieved from the URL string are often statistical, such as the length of the URL, the number of special characters and so on. In addition, characteristics similar to Bag-of-Words are frequently used.
Previous works extracted lexical features, namely subsets of characters, from the URL using n-grams [3, 4, 5, 11]. In this paper we focus on how word-level features can be used for genre classification. Additionally, with the word-level features extracted from the URL, we can adopt the idea of fastText (https://github.com/facebookresearch/fastText) and embed the words together with matching descriptions. With the embedded data, we can reversely retrieve the URLs that are close to a description in vector space. Such an ability can be used to rank URLs against a given query and hence to recommend Web pages of a similar genre.
Using only the URL for genre classification would substantially aid our ability to establish clear content categories, improve the efficiency of topic-focused crawlers [7] by avoiding the download of irrelevant Web pages, support the recommendation of similar Web pages and prevent access to harmful links that could result in the loss of valuables (theft of money, identity theft, loss of data, etc.).
3 Methodology
In this section we describe the methods used in the Multi-Genre Classification (MGC) experiments presented in Section 5, along with details about fine-tuning, preprocessing the data and embedding word vectors.
3.1 Baseline Model
We use RoBERTa [13, 17] as the baseline model for the genre classification task and Sentence Transformers [14] for the URL Ranking task.
RoBERTa is BERT retrained with tuned hyperparameters, training on larger data with a larger batch size and for longer, removing the next sentence prediction (NSP) objective and dynamically changing the masking pattern. Furthermore, instead of WordPiece tokenization, RoBERTa uses byte pair encoding (BPE), which is a hybrid between character-level and word-level representations.
Sentence Transformers is a Python framework for text and image embeddings. It modifies the BERT/RoBERTa network with siamese and triplet network structures. This reduces the effort for finding similar pairs considerably. Although it only allows cosine similarity for comparison, the accuracy is maintained [14].
In the experiments we compare the performance of RoBERTa-base and RoBERTa-large on the genre classification task. Both models are pretrained on English using a masked language modeling (MLM) objective. They are case-sensitive, so the capitalized characters in URLs remain as they are. For the URL Ranking task we test semantic textual similarity based on the embedded descriptions and their matching URLs. Two different loss functions are implemented in this paper: Multiple Negatives Ranking [10] (MNR) Loss and Triplet Loss.
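As a minimal sketch of how Sentence Transformers is used for the URL Ranking task, the snippet below embeds a description and a few candidate URL token strings and scores them with cosine similarity. The "all-MiniLM-L6-v2" checkpoint is one of the models compared later in Table 2; the example strings are hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

description = "Official athletics site with game schedules, scores and ticket information."
candidate_urls = [
    "www usc edu sports football schedule",    # segmented URL tokens (hypothetical)
    "www example com recipes healthy dinner",
    "www example org science physics news",
]

# Encode both sides into the same vector space.
desc_emb = model.encode(description, convert_to_tensor=True)
url_embs = model.encode(candidate_urls, convert_to_tensor=True)

# Cosine similarity between the description and every candidate URL.
scores = util.cos_sim(desc_emb, url_embs)[0]
ranked = sorted(zip(candidate_urls, scores.tolist()), key=lambda x: -x[1])
for url, score in ranked:
    print(f"{score:.3f}  {url}")
```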
Category Size Train size Test size Precision Recall F1-score
Arts 158k 126k 32k 0.8835 0.9393 0.9106
Business 171k 137k 34k 0.9022 0.9455 0.9234
Computers 75k 60k 15k 0.8976 0.9383 0.9175
Games 25k 20k 5k 0.8900 0.9368 0.9128
Health 41k 33k 8k 0.8656 0.9250 0.8943
Home 17k 14k 3k 0.8713 0.9180 0.8940
News 6k 5k 1k 0.8754 0.9350 0.9042
Recreation 69k 55k 14k 0.8810 0.9450 0.9119
Reference 40k 32k 8k 0.9022 0.9139 0.9080
Regional 899k 719k 180k 0.9506 0.9232 0.9367
Science 78k 62k 16k 0.9004 0.9344 0.9171
Shopping 60k 48k 12k 0.9182 0.9456 0.9317
Society 164k 131k 33k 0.9054 0.9381 0.9215
Sports 70k 56k 14k 0.9056 0.9433 0.9241
World 1,648k 1,318k 330k 0.6493 0.3098 0.4195
Average 235k 188k 47k 0.8799 0.8927 0.8818
Table 1: Size of the DMOZ dataset and precision, recall, f1-score of each category.
3.2 Fine-tuning
RoBERTa is trained on a text corpus of 160GB. The corpus is a union of five datasets: unpublished books, Wikipedia, news, Web text and the story-like Winograd Schema data [20]. The range of the corpus covers general and basic features of language.
In this paper we use a dataset that comes purely from the Web, so domain adaptation of the model is needed. In order to fine-tune our model we collectively gathered the descriptions of the Web pages in our DMOZ dataset. Since NSP is removed from RoBERTa, it only uses the Masked Language Model (MLM) objective for training.
MLM uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged and 10% are replaced by a randomly selected vocabulary token [13]. We load a pretrained RoBERTa model with a language modeling head on top, convert the description texts into language model training data using the classes published by Transformers, and start fine-tuning over that data.
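A minimal sketch of this fine-tuning step using the Hugging Face Transformers classes mentioned above (a RoBERTa model with a language-modeling head, a masking data collator and the Trainer). The file name and the hyperparameters are illustrative assumptions; only the 15% masking probability follows the description.

```python
from datasets import load_dataset
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Plain-text file of Web page descriptions, one per line (hypothetical path).
dataset = load_dataset("text", data_files={"train": "dmoz_descriptions.txt"})

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")  # language modeling head on top

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are selected, as in RoBERTa's MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-dmoz-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```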
3.3 Loss Functions
MNR Loss takes a batch consisting of sentence pairs as input. Consider a set of N description and URL pairs, (d_1, u_1), ..., (d_N, u_N), where d_i and u_i for i = 1, ..., N denote the i-th description and URL. All the inputs are positive pairs, i.e. matching description and URL pairs, and every (d_i, u_j) with i != j is treated as a negative pair.
Unlike MNR Loss, Triplet Loss takes triplets as input in the form (A, P, N), with f(X) denoting the sentence embedding of X. The triplet consists of an anchor (A), which is a URL, a positive input (P), which is a description matching A, and a negative input (N), which is a description from a different category than A. The loss function's goal is to minimize the distance between A and P while maximizing the distance between A and N. The loss is computed as:
L(A, P, N) = max(||f(A) − f(P)|| − ||f(A) − f(N)|| + M, 0)
Besides the inputs, Triplet Loss has a margin (M) as a hyperparameter. It sets the minimum distance between A and N, with 0 ≤ M ≤ 1. The margin must be tuned to obtain proper embedding results; in this paper the margin is set to 0.5.
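The two loss functions can be written down directly. The short PyTorch sketch below is our own illustration of the definitions above; M = 0.5 follows the text, while the scaling factor of 20 on the cosine scores for MNR Loss is an assumption rather than a value stated here. In practice the equivalent losses.MultipleNegativesRankingLoss and losses.TripletLoss from Sentence Transformers can be used instead.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.5):
    """L(A,P,N) = max(||f(A) - f(P)|| - ||f(A) - f(N)|| + M, 0), averaged over a batch."""
    d_pos = torch.norm(f_a - f_p, dim=-1)
    d_neg = torch.norm(f_a - f_n, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

def mnr_loss(desc_emb, url_emb, scale=20.0):
    """Multiple Negatives Ranking loss: (d_i, u_i) are positives,
    every (d_i, u_j) with i != j in the batch is treated as a negative."""
    # N x N cosine-similarity matrix between descriptions and URLs.
    scores = F.cosine_similarity(desc_emb.unsqueeze(1), url_emb.unsqueeze(0), dim=-1) * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy pushes each description toward its own URL (the diagonal).
    return F.cross_entropy(scores, labels)
```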
4 Experimental Setup
In this section we describe the frameworks, the data and the features used in the experiments.
4.1 Experimental Environment
Classifying a Web page into one of several genres based on its URL is a basic NLP task, and there are several frameworks and models that we can use. Although Convolutional Neural Networks (CNNs) have shown promise for text classification [23], the Transformer [21] has shown greater robustness than CNNs [6, 18]. Our task does not include images, but we train our model with word tokens rather than statistical features. This leads to the hypothesis that the self-attention mechanism (the attention mechanism in the Transformer that lets each element of a sequence interact with the others and determine which elements should receive greater attention) will outperform other mechanisms. Simply put, a self-attention layer aggregates global information from the entire input sequence to update each component of the sequence. Moreover, Transformers initially outperformed other techniques on text classification [17][13]. We implement Transformer models to learn a URL embedding for genre classification.
We use the PyTorch environment in this paper for the following reasons (PyTorch is an open source machine learning library for Python, based on Torch and developed by Facebook's AI research group; it is used for NLP applications and is known for its simplicity, ease of use, flexibility, memory efficiency and dynamic computational graphs). Compared to Keras, it can handle large datasets while maintaining high performance, which is what the Web environment requires in order to process large data in real time. Also, our baseline models come from HuggingFace, and the majority of their models are PyTorch exclusive.
For models we chose RoBERTa [13] over BERT [8], DistilBERT [16] and XLNet [22]. These models have been compared in various environments and on various datasets. In many cases RoBERTa outperformed the other models, and it has been trained on the largest amount of data. As a reminder, our task is to classify URLs, which are ubiquitous and hence span many genres. If the model is trained on large data it is more likely to cover all the fields that URLs touch.
Figure 1: RoBERTa-base models outperform RoBERTa-large models on the genre classification task (trained and tested on the whole-dataset). Fine-tuned models show slightly better results.
4.2 Getting the Data
In machine learning a model's performance is highly dependent on the data, and a lot of training data is essential for high performance. In this paper, the data originally comes from the Open Directory Project (https://dmoz-odp.org). The data has 15 main categories and multiple subcategories, which lead to 1,031,722 categories in total. It would be very hectic for our model to classify URLs into such specific categories, and there is no need to classify a URL into a category that contains only one or two URLs. We therefore discarded the subcategories and merged 3,527,779 unique URLs into the 15 main categories. The dataset can be found on the official DMOZ page (https://dmoztools.net/docs/en/rdf.html).
Other than DMOZ we also obtained data from the Canadian Institute for Cybersecurity (https://www.unb.ca/cic/datasets/url-2016.html) and the Malicious URL dataset from Kaggle (https://www.kaggle.com/sid321axn/malicious-urls-dataset). These datasets contain 68,908 and 651,190 unique URLs, respectively, and have 4 main categories: three malicious categories and one benign. They are used for simple performance comparisons with other works.
4.3 Generating Dataset
Non-English languages are found in two of the 15 DMOZ categories: "Regional" and "World." The "Regional" category is for sites that are in English. Regional: Europe: Spain, for example, would list sites about Spain in English but may contain some non-English words. The "World" category, on the other hand, is for sites that are not in English. World/Español/Regional/España, for example, would list Spanish-language sites about Spain with Spanish-language descriptions. For genre classification we use RoBERTa as our baseline model, which is trained on English only, so the two non-English categories might have a minor effect on the classification result. As shown in Table 1, the performance on "Regional" is reasonable considering its data size. However, despite having the largest amount of data, performance on the "World" category is the lowest.
Figure 2: Upper graphs use the eng-dataset and lower graphs use the whole-dataset. As shown in the graphs, the gap in accuracy closes as the models are trained.
To check how this could affect our experiment, we made two separate datasets: one that includes the two categories (named whole-dataset hereafter) and one that excludes the non-English categories (named eng-dataset hereafter).
For each category we split the data in an 8:2 ratio, 80% of the data for training and 20% for testing. RoBERTa takes a set of pairs as input, typically a label with a text; our input data is therefore a pair of a URL and its genre label, as sketched below.
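A minimal sketch of this per-category 8:2 split and of the (URL, genre label) input pairs, using a stratified split from scikit-learn as an approximation of splitting each category separately; the file and column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table: one URL per row with its top-level DMOZ category.
df = pd.read_csv("dmoz_urls.csv")  # columns: url, category

# 80/20 split, stratified so every category keeps the same 8:2 ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42)

# RoBERTa input pairs: (URL text, genre label).
train_pairs = list(zip(train_df["url"], train_df["category"]))
print(len(train_pairs), "training pairs,", len(test_df), "test URLs")
```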
4.4 Extracting Features
All alphabet-based languages have white space in their writing system, which can easily be treated as the basic word boundary delimiter. URLs consist of several kinds of characters: alphabetic characters, special characters, numerical characters and even Unicode characters, but not white space (the encoded %20 is not treated as white space), so there are no explicit word boundary delimiters in URLs. Although distinguishing words in a string without white space is a fairly easy task for humans (e.g. thisisastringwithoutthewhitespace), it is relatively hard for machines. We can convert URLs into sets of separate word tokens by applying word segmentation techniques.
As mentioned in the previous paragraph, URLs consist of several types of characters, so we have to segment the URL not only on white space but also on punctuation to obtain adequate word tokens. To do this we use Universal Word Segmentation (UWS) [19]. Briefly, the segmenter adopts bidirectional recurrent neural networks with conditional random field inference (BiRNN-CRF) as its fundamental framework for word segmentation. It introduces an extra tag X for marking word boundaries that are not indicated by white space between multi-word tokens. After training the neural networks with the TensorFlow [1] library, the F1 score for English segmentation reached 99.71.
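Before the UWS segmenter is applied, the URL can be split on its punctuation to obtain coarse candidate tokens. The sketch below shows this first step with a simple regular expression; segmenting run-together words such as domain names is left to the BiRNN-CRF model and is not reproduced here.

```python
import re
from urllib.parse import urlparse, unquote

def coarse_tokens(url):
    """Split a URL on punctuation and digits to get word-like candidate tokens.
    Run-together words (e.g. inside the domain) still need a word-segmentation model."""
    parts = urlparse(url if "://" in url else "http://" + url)
    text = " ".join(filter(None, [parts.hostname, parts.path, parts.query]))
    text = unquote(text)                       # decode %20 and friends
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

print(coarse_tokens("https://www.usc.edu/path/to/file.html?key1=value1"))
# ['www', 'usc', 'edu', 'path', 'to', 'file', 'html', 'key', 'value']
```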
5 Results
This section presents the steps we took for the experiments along with the results we derived from them.
Figure 3: The model trained with the MNR Loss function shows better performance on the URL Ranking task.
5.1 Genre Classification
We visualize the performance of the models on the genre classification task for comparison. The comparison is done in two settings: one compares the models themselves, and the other compares the datasets that we generated.
The genre classification task classifies a given URL into one of 15 different categories, so accuracy is the evaluation metric for this task. As a sub-task, we also run binary classification on each category; there we adopt precision, recall and F1 score for evaluation. The baseline models are fine-tuned with the descriptions collected in Section 4.2. Training and testing are also carried out on the original versions of the models for comparison.
In Figure 1, despite its smaller size, RoBERTa-base outperforms RoBERTa-large by 2-3%. Based on this, RoBERTa-base is used for our further tasks and compared with other pretrained models. In Figure 2, the difference between the upper and lower graphs is whether non-English data are included or not. As shown in the graphs, the models without non-English data show higher accuracy at the beginning, but after enough training the gap closes.
5.2 URL Ranking
For testing we embed a description and 15 URLs, one from each category, as input. After calculating similarity in vector space, we rank the URLs based on their scores and see how well our models perform at finding the URLs relevant to the description.
Semantic textual similarity is a task that measures the degree of semantic equivalence between two given texts. It is common to use a regression function on text embeddings to compute a similarity score [2, 9]. However, following all the computation steps often takes a long time. Finding the most similar pair in a collection of 10k texts on such a network requires about 50 million inference computations, N · (N − 1)/2 where N is the number of texts, which can take up to 65 hours. This can be reduced to about 5 seconds by using Sentence Transformers [14].
Model Name              Sentence Embeddings  Semantic Search  Avg. Perform.  Encoding Speed  Mean Reciprocal Rank (100k)
sentence-t5-xxl         70.88                54.40            62.64          50              0.4506
all-roberta-large-v1    70.23                53.05            61.64          800             0.2839
all-mpnet-base-v2       69.57                54.69            62.34          2,800           0.4434
all-MiniLM-L6-v2        68.03                48.07            58.05          14,200          0.4315
tuned-all-MiniLM-L6-v2  -                    -                -              -               0.4798
tuned-RoBERTa           -                    -                -              -               0.3547
Table 2: Although the "sentence-t5-xxl" model shows the highest performance, "all-MiniLM-L6-v2" is 284 times faster. The Web environment requires real-time processing, so "all-MiniLM-L6-v2" is suitable for our task (models whose names start with "all-" are trained on 1 billion training pairs). After embedding URLs and descriptions into the model, the performance increased by 4.8.
MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} (1 / rank_i)
The Mean Reciprocal Rank (MRR) column in Table 2 shows the results of the ranking task. All the models are retrained with the segmented URLs along with their matching descriptions. With the URLs and descriptions embedded, a set consisting of a description and 15 different URLs, one from each category, is given to the model on each iteration. The model calculates the cosine similarity between the description and the URLs and ranks them by score. The rank of the URL that belongs to the same category as the description is then used in the equation above, with |Q| = 100k.
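A minimal sketch of this evaluation loop: for each query description, the 15 candidate URLs are scored by cosine similarity and the reciprocal rank of the correct-category URL is accumulated; the variable names are illustrative.

```python
import torch
from sentence_transformers import util

def mean_reciprocal_rank(model, queries):
    """queries: list of (description, candidate_urls, correct_index) triples.
    Each query has 15 candidate URLs, one per category."""
    reciprocal_ranks = []
    for description, candidate_urls, correct_index in queries:
        desc_emb = model.encode(description, convert_to_tensor=True)
        url_embs = model.encode(candidate_urls, convert_to_tensor=True)
        scores = util.cos_sim(desc_emb, url_embs)[0]
        # Rank candidates from most to least similar; rank is 1-based.
        order = torch.argsort(scores, descending=True)
        rank = (order == correct_index).nonzero(as_tuple=True)[0].item() + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```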
In Figure 3 the difference made by the loss functions can be seen. The model with MNR Loss outperforms the model with Triplet Loss by approximately 30%. The result can be explained by the inputs that each model takes. MNR Loss takes only positive pairs each time and treats the other in-batch inputs as negatives. On the other hand, a single negative input is given to Triplet Loss each time, where the negative inputs are data from other categories. By definition, MNR Loss therefore learned all the data from other categories as negatives, while Triplet Loss learned only the selected data as negatives; hence, on this multi-genre classification task, MNR Loss is more suitable.
6 Conclusion
The purpose of this paper was to examine how far we can make use of the meanings contained in texts. As mentioned in Section 2, there are many methods we could use to boost performance. However, we wanted to stay within the field of NLP and see how well the tuned SOTA models could perform.
The results of the genre classification task have shown that it is possible to classify Web pages based on word-level lexical features purely from the URL. Although there are some limitations on extracting ample features for classification, we could see some possibility of overcoming this problem by fine-tuning our models on metadata from the Web pages. The MRR scores in Table 2 tell us that more complicated tasks will encounter even greater limitations. In particular, URL shortening makes the task almost impossible, because the only part that distinguishes such a URL from others is the hash value that comes after the domain.
The URL Ranking results suggest that producing related recommendations for a given URL is unlikely to work well; it is challenging to obtain reliable performance. One strategy is to first classify the genre of the given URL and then rank only the URLs that fall into the same genre as the target page. Following such a strategy would improve MRR scores significantly, but it would rather be exploiting the high accuracy of genre classification than improving the performance of URL Ranking itself.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-Scale Machine Learning. USENIX Association, Savannah, GA, November 2016.
[2] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic Textual Similarity. Association for Computational Linguistics, Atlanta, Georgia, USA, June 2013.
[3] Eda Baykan, Monika Henzinger, Ludmila Marian, and Ingmar Weber. Purely URL-
based topic classification. In Proceedings of the 18th international conference on World
wide web - WWW ’09, page 1109, Madrid, Spain, 2009. ACM Press.
[4] Rohit Bharadwaj, Ashutosh Bhatia, Laxmi Divya Chhibbar, Kamlesh Tiwari, and Ankit Agrawal. Is this URL Safe: Detection of Malicious URLs Using Global Vector for Word Representation. In 2022 International Conference on Information Networking (ICOIN), pages 486–491, Jeju-si, Korea, Republic of, January 2022. IEEE.
[5] Aashlesha Bhingarde and Deepali Vora. Effective Genre Classification - Understand-
ing Url And Webpage Attributes For Classification. International Journal of Recent
Technology and Engineering, 8(2S11):2011–2016, November 2019.
[6] Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, and Andreas Veit. Understanding Robustness of Transformers for Image Classification. arXiv:2103.14586 [cs], October 2021. arXiv: 2103.14586.
[7] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1623–1640, May 1999.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv:1810.04805 [cs], May 2019. arXiv: 1810.04805.
[9] Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Johnathan Weese. UMBC-EBIQUITY-CORE: Semantic Textual Similarity Systems. Association for Computational Linguistics, June 2013.
[10] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient Natural Language Response Suggestion for Smart Reply. arXiv:1705.00652 [cs], May 2017. arXiv: 1705.00652.
[11] Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using URL
features. In Proceedings of the 14th ACM international conference on Information and
knowledge management - CIKM ’05, page 325, Bremen, Germany, 2005. ACM Press.
[12] Hung Le, Quang Pham, Doyen Sahoo, and Steven C. H. Hoi. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection. arXiv:1802.03162 [cs], March 2018. arXiv: 1802.03162.
[13] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly
Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs], July 2019. arXiv:
1907.11692.
[14] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084 [cs], August 2019. arXiv: 1908.10084.
[15] Doyen Sahoo, Chenghao Liu, and Steven C. H. Hoi. Malicious URL Detection using
Machine Learning: A Survey. arXiv:1701.07179 [cs], August 2019. arXiv: 1701.07179.
[16] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs],
February 2020. arXiv: 1910.01108.
[17] Zein Shaheen, Gerhard Wohlgenannt, and Erwin Filtz. Large Scale Legal Text Classification Using Transformer Models. arXiv:2010.12871 [cs], October 2020. arXiv: 2010.12871.
[18] Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the Adversarial Robustness of Vision Transformers. arXiv:2103.15670 [cs], October 2021. arXiv: 2103.15670.
[19] Yan Shao, Christian Hardmeier, and Joakim Nivre. Universal Word Segmentation: Implementation and Interpretation. arXiv:1807.02974 [cs], July 2018. arXiv: 1807.02974.
[20] Trieu H. Trinh and Quoc V. Le. A Simple Method for Commonsense Reasoning.
arXiv:1806.02847 [cs], September 2019. arXiv: 1806.02847.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[22] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 [cs], January 2020. arXiv: 1906.08237.
[23] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau. A C-LSTM Neural Network for Text Classification. arXiv:1511.08630 [cs], November 2015. arXiv: 1511.08630.