ALLEVIATING THE NOISY DATA PROBLEM USING RESTRICTED BOLTZMANN MACHINES
by
Ankith Mohan
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(COMPUTER SCIENCE)
December 2020
Copyright 2020 Ankith Mohan
Dedication
I dedicate my dissertation work to my loving parents, whose words of encouragement and push for tenacity
ring in my ears. Without your love and support this dissertation would not have been possible.
Acknowledgements
I wish to thank my committee members who were more than generous with their expertise and precious
time. A special thanks to Prof. Aiichiro Nakano and Prof. Emilio Ferrara for their countless hours of
reflecting, reading, encouraging, and most of all patience throughout the entire process. Thank you Prof.
Kristina Lerman for agreeing to serve on my committee. Their excitement and willingness to provide
feedback made the completion of this research an enjoyable experience. I would lastly like to thank Dr.
Jeremy Liu for providing me with the opportunity to work with restricted Boltzmann machines and limited
Boltzmann machines, which inspired this project.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Notations
1.2 Aims and Contributions
1.3 Structure of the thesis

Chapter 2: Background
2.1 Deep neural networks
2.2 Information theoretic view of deep neural networks
2.3 Restricted Boltzmann Machines
2.3.1 Contrastive divergence
2.3.2 Contrastive divergence on latent variable models
2.3.3 Energy-based models
2.3.4 Boltzmann machine
2.3.5 Restricted Boltzmann Machine
2.3.5.1 Bernoulli-Bernoulli RBM
2.3.5.2 Gaussian-Bernoulli RBM
2.3.6 Content-addressable memory
2.4 node2vec
2.4.1 Neighbourhood sampling strategy (S)
2.5 Graph neural networks
2.5.1 Graph Convolutional Networks
2.5.2 GraphSAGE

Chapter 3: Denoising pipeline
3.1 Motivation
3.2 Methodology
3.2.1 Trained deep neural network, Φ
3.2.2 Denoising pipeline, Φ_i
3.3 Experiments
3.3.1 ogbn-arxiv: Paper Citation Network
3.3.2 Noisy data for ogbn-arxiv
3.4 Results

Chapter 4: Neural network-based denoising pipeline
4.1 MLP
4.2 node2vec
4.2.1 Corrupted node feature matrix (X_c)
4.2.1.1 Corrupted adjacency matrix (A_c)
4.2.1.2 Blanked out adjacency matrix (A_z)
4.2.2 Blanked out node feature matrix (X_z)
4.2.2.1 Corrupted adjacency matrix (A_c)
4.2.2.2 Blanked out adjacency matrix (A_z)
4.2.2.3 Intuitions regarding the performance of the layers

Chapter 5: Graph neural network-based denoising pipeline
5.1 GCN
5.1.0.1 Corrupted node feature matrix
5.1.0.2 Blanked out node feature matrix
5.1.1 GraphSAGE
5.1.1.1 Corrupted node feature matrix
5.1.2 Blanked out node feature matrix
5.1.2.1 Intuitions regarding the performance of the layers

Chapter 6: Conclusions and Future work
6.1 Future work

Bibliography

List of Tables

1.1 List of notations
List of Figures

(In the figure captions and plot legends, NN denotes P(Φ^NN(·)), NN: x denotes P(Φ^NN_0(·)), and NN: z_i denotes P(Φ^NN_i(·)), for NN ∈ {MLP, n2v, GCN, SAGE}.)

2.1 Boltzmann machine represented as a complete undirected graph with N visible units, M hidden units, a visible and a hidden bias.

2.2 Restricted Boltzmann machine represented as a bipartite graph with N visible units, M hidden units, a visible and a hidden bias.

3.1 Denoising pipeline

4.1 Prediction accuracy of Φ^MLP_i(X_c^[n_X], ·), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}, and Φ^MLP(X_c^[n_X], ·). P(Φ^MLP_1(X_c^[n_X], ·)) outperforms the rest while P(Φ^MLP_2(X_c^[n_X], ·)) deteriorates.

4.2 Prediction accuracy of Φ^MLP_i(X_z^[n_X], ·), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}, and Φ^MLP(X_z^[n_X], ·). P(Φ^MLP_0(X_z^[n_X], ·)) outperforms the rest, closely followed by P(Φ^MLP_1(X_z^[n_X], ·)).

4.3 Prediction accuracy of Φ^n2v_i(X_c^[n_X], A_c^[0]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}, and Φ^n2v(X_c^[n_X], A_c^[0]). As n_X increases, the gap between P(Φ^n2v_i(X_c^[n_X], A_c^[0])), i = {0, 1}, and P(Φ^n2v_i(X_c^[n_X], A_c^[0])), i = {2, 3}, widens. At n_X > 50, P(Φ^n2v_2(X_c^[n_X], A_c^[0])) begins to deteriorate.

4.4 Prediction accuracy of Φ^n2v_i(X_z^[n_X], A_c^[0]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. When n_X > 70, Φ^n2v_1(X_z^[n_X], A_c^[0]) nosedives to collapse.

5.1 Prediction accuracy of Φ^GCN_i(X_c^[n_X], A_c^[0]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^GCN_0(X_c^[n_X], A_c^[0])) performs the best until P(Φ^GCN_3(X_c^[n_X], A_c^[0])), which does not decrease as quickly as the other layers, catches up.

5.2 Prediction accuracy of Φ^GCN_i(X_c^[n_X], A_c^[40]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. When n_X ≥ 40, P(Φ^GCN_1(X_c^[n_X], A_c^[40])) begins to collapse, and when n_X ≥ 70, P(Φ^GCN_3(X_c^[n_X], A_c^[40])) shows a resurgence.

5.3 Prediction accuracy of Φ^GCN_i(X_c^[n_X], A_c^[60]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^GCN_0(X_c^[n_X], A_c^[60])) decreases gradually, P(Φ^GCN_1(X_c^[n_X], A_c^[60])) consistently performs poorly, and P(Φ^GCN_3(X_c^[n_X], A_c^[60])) starts out poorly but decreases much more gradually than the rest when n_X > 40.

5.4 Prediction accuracy of Φ^GCN_i(X_c^[n_X], A_c^[80]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. Observe the sudden rise in P(Φ^GCN_3(X_c^[n_X], A_c^[80])) when n_X > 60.

5.5 Prediction accuracy of Φ^GCN(X_z^[n_X], A_c^[0]). P(Φ^GCN_0(X_z^[n_X], A_c^[0])) shows a constant decrease in performance. P(Φ^GCN(X_z^[n_X], A_c^[0])), P(Φ^GCN_0(X_z^[n_X], A_c^[0])) and P(Φ^GCN_1(X_z^[n_X], A_c^[0])) show similar performance. P(Φ^GCN_3(X_z^[n_X], A_c^[0])) performs poorly.

5.6 Prediction accuracy of Φ^GCN(X_z^[n_X], A_c^[80]). P(Φ^GCN_1(X_z^[n_X], A_c^[80])) and P(Φ^GCN_3(X_z^[n_X], A_c^[80])) start with a poor performance but begin to improve, with P(Φ^GCN_1(X_z^[n_X], A_c^[80])) even managing to outperform the rest when n_X > 70.

5.7 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_c^[0]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^SAGE_1(X_c^[n_X], A_c^[0])) outperforms the rest, closely followed by P(Φ^SAGE_0(X_c^[n_X], A_c^[0])).

5.8 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_c^[40]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. When n_X ≥ 50, we observe the following performance hierarchy on (X_c^[n_X], A_c^[40]): P(Φ^SAGE_0) ≥ P(Φ^SAGE_2) ≥ P(Φ^SAGE) ≥ P(Φ^SAGE_3) ≥ P(Φ^SAGE_1).

5.9 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_c^[60]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. The performance of the layers starts to differ considerably from previous observations.

5.10 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_c^[70]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^SAGE_3(X_c^[n_X], A_c^[70])) starts to decrease when n_X > 40 and P(Φ^SAGE_2(X_c^[n_X], A_c^[70])) makes a bowl at n_X ≈ 50.

5.11 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_z^[30]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. We observe a deterioration in the performance of P(Φ^SAGE_3(X_c^[n_X], A_z^[30])) when n_X > 50.

5.12 Prediction accuracy of Φ^SAGE_i(X_c^[n_X], A_z^[50]). The following hierarchy becomes apparent on (X_c^[n_X], A_z^[50]): P(Φ^SAGE_1), P(Φ^SAGE_0), P(Φ^SAGE_2), P(Φ^SAGE), P(Φ^SAGE_3).

5.13 Prediction accuracy of Φ^SAGE_i(X_z^[n_X], A_c^[0]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^SAGE_0(X_z^[n_X], A_c^[0])) outperforms the rest. P(Φ^SAGE_1(X_z^[n_X], A_c^[0])) shows a bomb-like trajectory. P(Φ^SAGE_2(X_z^[n_X], A_c^[0])), P(Φ^SAGE(X_z^[n_X], A_c^[0])) and P(Φ^SAGE_3(X_z^[n_X], A_c^[0])) follow a more sub-linear path with negative slope.

5.14 Prediction accuracy of Φ^SAGE_i(X_z^[n_X], A_c^[40]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. P(Φ^SAGE_0(X_z^[n_X], A_c^[40])) and P(Φ^SAGE_1(X_z^[n_X], A_c^[40])) become more sub-linear with negative slope, while the rest of the layers collapse when n_X > 40.

5.15 Prediction accuracy of Φ^SAGE_i(X_z^[n_X], A_c^[90]), i = {0, 1, 2, 3}, n_X = {0, 10, …, 100}. When n_X > 30, P(Φ^SAGE_2(X_z^[n_X], A_c^[90])), P(Φ^SAGE(X_z^[n_X], A_c^[90])) and P(Φ^SAGE_3(X_z^[n_X], A_c^[90])) collapse. P(Φ^SAGE_1(X_z^[n_X], A_c^[90])) collapses when n_X > 50, while n_X > 60 causes P(Φ^SAGE_0(X_z^[n_X], A_c^[90])) to collapse.
Abstract
We propose a model-agnostic pipeline to denoise data by exploiting the content-addressable memory property of restricted Boltzmann machines. Although this pipeline can be used to deal with noise in any dataset, it is particularly effective for graph datasets. The proposed pipeline requires a neural network that is already trained for the machine learning task on data which is free from any form of corruption or incompleteness. We show that our approach can increase the prediction accuracy by up to 40% for some cases of noise in the ogbn-arxiv dataset. We have also created an R Shiny interactive web application for better understanding of the results, which can be found at: https://ankithmo.shinyapps.io/denoiseRBM/.
Chapter 1
Introduction
Graphs are a ubiquitous data structure, employed extensively in almost every field of study. This is due to their ability to efficiently store complex information in a simple manner that is amenable to mathematical analysis. Social, information, biological, ecological and recommendation networks are just a few examples of the fields that can be readily modeled as graphs, which capture interactions between individuals.
However, graphs are not only useful as structured knowledge repositories, they also play a pivotal role
in modern machine learning. Many machine learning applications seek to make predictions or identify
patterns using graph-structured data as feature information. For example, predicting the role of a person
in a collaboration network, recommending new content to a user in a social network, or predicting new
applications of molecules, all of which can be represented as graphs.
A very commonly occurring problem in almost every one of these fields is the problem of noisy data. Noisy data can be loosely defined as data which has a significant proportion of corruption or incompleteness. Combating these problems has led to the development of several methods that are commonly placed under the term denoising models. [3] provides a summary of some of the commonly used denoising models for images. The noisy data problem becomes particularly acute when dealing with graphs, because the distortions are possible not only in the properties of the nodes but also in the existence of edges and their properties.

There has recently been a surge in methods under the term graph neural networks (GNNs) [15, 2, 16, 8, 19, 5] for deep learning on graphs and other irregular structures, also known as geometric deep learning. We propose a denoising pipeline for alleviating the noisy data problem, particularly effective for graphs, by exploiting the property of restricted Boltzmann machines (RBMs) that enables them to act as a content-addressable memory, and the ability of GNNs to encode a combination of the node and the graph information in a concise manner. Figure 3.1 illustrates the proposed pipeline. This denoising pipeline can work with any deep neural network (DNN) trained on a downstream machine learning (ML) task with noise-free training data.
1.1 Notations
We will abide by the notations provided in table 1.1 which will be explained as they appear. This list is
provided for ease in referring back to them.
We will briefly describe the pipeline in the following steps, while a detailed explanation is provided in chapter 3.
1. We assume that we are provided with a deep neural network, Φ, that has already been trained on our prediction task.
2. Given Φ, we first choose the hidden layer i whose representation z_i we find most informative for the denoising task.

3. We train an RBM on z_i, denoted by RBM-z_i. The denoising pipeline with this RBM is denoted as Φ_i.

4. Finally, we pass the noisy data through Φ_i to obtain z_i^[n], where n is the amount of noise in the data. z_i^[n] is considered to be a noisy estimate of its true denoised representation. This estimate is fed to RBM-z_i to obtain a denoised reconstruction z̃_i^[n], which is then fed to the layer immediately succeeding layer i, and we continue processing it through Φ to obtain the prediction.

Notation      Explanation
Φ             Trained deep neural network
X             Node feature matrix
A             Adjacency matrix
z_0           Data presented to the input layer
z_i           Representation learnt by the i-th hidden layer of Φ
X_c           Corrupted node feature matrix
A_c           Corrupted adjacency matrix
X_z           Blanked out node feature matrix
A_z           Blanked out adjacency matrix
n_X           Amount of distortion in the node feature matrix
n_A           Amount of distortion in the adjacency matrix
data^[n]      Input data with n% of noise
z_i^[n]       Representation computed by the i-th layer of Φ during inference with data^[n]
RBM-z_i       RBM trained on the i-th layer representation
Φ_i           Denoising pipeline that uses RBM-z_i
z̃_i^[n]       Reconstruction of z_i^[n] by RBM-z_i
Ŷ             Output predictions

Table 1.1: List of notations
We will attempt to make this clearer with the following example with respect to figure 3.1. Suppose we find that the second hidden layer has the most informative representation for our problem. Then, we would extract z_2 from Φ and train RBM-z_2 on these embeddings. This would give us Φ_2. Finally, we would pass the noisy data (with n% of noise) into Φ_2, obtain a noisy estimate z_2^[n], feed this to RBM-z_2, obtain a sample z̃_2^[n], pass this through the BN_2 layer following z_2, and down the rest of Φ to obtain the prediction Ŷ.
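The steps above can be summarized in the short sketch below. It is a minimal illustration rather than the exact implementation; the helper names (extract_hidden, reconstruct, forward_from_layer) are hypothetical placeholders for the corresponding operations of Φ and RBM-z_i.

    # Minimal sketch of the denoising pipeline (hypothetical helper names).
    def denoise_and_predict(phi, rbm_i, noisy_data, i):
        # 1. Run the trained DNN on the noisy data and keep the representation
        #    produced by hidden layer i: a noisy estimate z_i^[n].
        z_noisy = phi.extract_hidden(noisy_data, layer=i)        # assumed helper
        # 2. Let RBM-z_i pull the noisy estimate towards the closest stored
        #    pattern on its energy surface (one or more Gibbs steps).
        z_denoised = rbm_i.reconstruct(z_noisy)                  # assumed helper
        # 3. Feed the reconstruction back into the network at the layer
        #    immediately after layer i and finish the forward pass.
        return phi.forward_from_layer(z_denoised, start=i + 1)   # assumed helper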
1.2 Aims and Contributions

The thesis has two major aims.

1. To propose a pipeline that can effectively denoise data.
2. To analyze the denoising pipeline with various deep neural networks, either using multi-layer perceptrons or using graph neural networks. We will describe the behaviour of the pipeline when the data has a combination of corruption and incompleteness in the node feature matrix and the adjacency matrix, as the amount of distortion is increased from 0% to 100%.

1.3 Structure of the thesis

The thesis is structured in the following manner.

The second chapter introduces the background material which forms the basis for the succeeding chapters.

The third chapter describes the motivation and design of the denoising pipeline. This is followed by a description of the experiment that we perform to test the effectiveness of our pipeline.

The fourth chapter discusses the performance of our denoising pipeline when various neural networks are used.

In the fifth chapter we look at the performance of the denoising pipeline when it uses graph neural networks.

In the concluding chapter we describe the best performing layers depending on the distortions of the node feature matrix and the adjacency matrix.
Chapter 2
Background
In this chapter, we will provide a brief discussion of the concepts that are important for the chapters which
follow.
2.1 Deep neural networks
Suppose that we are given the input X, a high-dimensional variable which is a low-level representation of the data, and the desired output Y, which has significantly lower dimensionality (that of the predicted class). At this point, most of the entropy of X is not very informative of Y, i.e., the features in X that are relevant to our prediction task are highly distributed and difficult to extract. Linear separability between these features is possible only when they are close to conditional independence given the output classification. Since we cannot simply assume conditional independence for a general data distribution, we need the data to undergo representational changes, up to a linear transformation, that can decouple the inputs.

Deep neural networks (DNNs) [11] are successful in obtaining such representational changes through sequential processing of the input data, where each hidden layer acts as the input for the next layer. This layered structure of the DNN generates a Markov chain of successive intermediate representations between the input and the output layers that can construct higher-level distributed representations.
2.2 Information theoretic view of deep neural networks
Given the true joint distribution of X and Y, denoted by P(X, Y), the relevant information is defined as their mutual information I(X; Y), assuming statistical dependence between X and Y. Here, Y implicitly determines both the relevant and irrelevant features in X. This means that an optimal representation of X would capture these relevant features and compress them by dismissing the irrelevant information that does not contribute to the prediction of Y.
2.3 Restricted Boltzmann Machines
This section will introduce the RBM that uses contrastive divergence learning in a manner that in our
opinion simplifies its understanding.
2.3.1 Contrastive divergence
Suppose we want to model the probability of a datapoint x using a function f(x; θ), where θ is the set of parameters used by our model:

P(x; θ) = f(x; θ) / Z(θ)  (2.1)

where Z(θ) is the partition function, defined as:

Z(θ) = ∫_x f(x; θ) dx  (2.2)

One way to learn the model parameters is by maximizing the probability of a training set of the data X = {x_1, …, x_D}:

P(X; θ) = ∏_{d=1}^{D} f(x_d; θ) / Z(θ)  (2.3)

This is equivalent to learning these parameters by minimizing the negative log of the likelihood P(X; θ), which we will refer to as the loss function:

L(X; θ) = −(1/D) log P(X; θ)
        = −(1/D) log ∏_{d=1}^{D} f(x_d; θ) / Z(θ)
        = −(1/D) Σ_{d=1}^{D} log [f(x_d; θ) / Z(θ)]
        = log Z(θ) − (1/D) Σ_{d=1}^{D} log f(x_d; θ)  (2.4)
The gradient of the loss function w.r.t. θ is:

∂L(X; θ)/∂θ = ∂ log Z(θ)/∂θ − (1/D) Σ_{d=1}^{D} ∂ log f(x_d; θ)/∂θ
            = ∂ log Z(θ)/∂θ − E_{x∈X}[∂ log f(x; θ)/∂θ]  (2.5)

Since the first term of (2.5) is computationally intractable, we reformulate it as:

∂ log Z(θ)/∂θ = (1/Z(θ)) ∂Z(θ)/∂θ
              = (1/Z(θ)) ∂/∂θ ∫_x f(x; θ) dx
              = (1/Z(θ)) ∫_x ∂f(x; θ)/∂θ dx
              = (1/Z(θ)) ∫_x f(x; θ) ∂ log f(x; θ)/∂θ dx
              = ∫_x [f(x; θ)/Z(θ)] ∂ log f(x; θ)/∂θ dx
              = ∫_x P(x; θ) ∂ log f(x; θ)/∂θ dx
              = E_{x∼P(x;θ)}[∂ log f(x; θ)/∂θ]  (2.6)
Using (2.6), (2.5) becomes:

∂L(X; θ)/∂θ = E_{x∼P(x;θ)}[∂ log f(x; θ)/∂θ] − E_{x∈X}[∂ log f(x; θ)/∂θ]  (2.7)

Equation (2.7) shows that we can now numerically approximate the first term by drawing samples from our proposed distribution P(x; θ), while the second term is straightforward to compute given X.

However, we still cannot directly sample from P(x; θ) since we do not know Z(θ). This indicates that we require several cycles of Markov chain Monte Carlo (MCMC) sampling to transform our training data into data drawn from our proposed distribution. This is possible because MCMC sampling involves only the ratio of two probabilities, P(x'; θ)/P(x; θ), thus cancelling out Z(θ).

Let X_N denote the training data transformed using N cycles of MCMC sampling and X_0 = X; then Equation (2.7) can be written as:

∂L(X; θ)/∂θ = E_{x∈X_N}[∂ log f(x; θ)/∂θ] − E_{x∈X_0}[∂ log f(x; θ)/∂θ]  (2.8)

This still leaves us with the problem that too many MCMC cycles are needed to compute an accurate gradient.

[1] showed that only a few MCMC cycles are sufficient to calculate a good approximation of the gradient. This is because, after just a few iterations, the data will have moved from the target distribution (the training data) towards the proposed distribution, giving us an idea of the direction in which the proposed distribution should move to better model the training data.

Empirically, it is observed that just one MCMC cycle is sufficient for the algorithm to converge to the answer.

This is commonly referred to as contrastive divergence (CD-k), where k denotes the number of MCMC cycles:

∂L(X; θ)/∂θ = E_{x∈X_k}[∂ log f(x; θ)/∂θ] − E_{x∈X_0}[∂ log f(x; θ)/∂θ]  (2.9)

Equation (2.9) gives the following parameter learning rule:

θ_{t+1} = θ_t + η (E_{x∈X_0}[∂ log f(x; θ)/∂θ] − E_{x∈X_k}[∂ log f(x; θ)/∂θ])  (2.10)

where η is the learning rate [18].
2.3.2 Contrastive divergence on latent variable models

Consider the same problem as in Section 2.3.1, but in a model which has both observed and latent variables. Observed variables are tasked with learning the true data distribution, while latent variables learn the underlying features of the observed variables, thereby increasing the expressiveness of the model. To model the probability of a datapoint x, the function in this case will be f(x, h; θ), where h denotes the latent variables and θ is the parameter set of the model. The probability can be represented as:

P(x, h; θ) = f(x, h; θ) / Z(θ)  (2.11)

where Z(θ) is defined as:

Z(θ) = ∫_x ∫_h f(x, h; θ) dh dx  (2.12)

Marginal distributions can be defined as:

P(x; θ) = ∫_h P(x, h; θ) dh = (1/Z(θ)) ∫_h f(x, h; θ) dh  (2.13)

P(h; θ) = ∫_x P(x, h; θ) dx = (1/Z(θ)) ∫_x f(x, h; θ) dx  (2.14)

Conditional distributions become:

P(x|h; θ) = P(x, h; θ) / P(h; θ) = f(x, h; θ) / ∫_x f(x, h; θ) dx  (2.15)

P(h|x; θ) = P(x, h; θ) / P(x; θ) = f(x, h; θ) / ∫_h f(x, h; θ) dh  (2.16)

Given the training set X = {x_1, …, x_D}, the probability of the training set is:

P(X; θ) = ∏_{d=1}^{D} (1/Z(θ)) ∫_h f(x_d, h; θ) dh  (2.17)

The loss function (Equation 2.4) becomes:

L(X; θ) = log Z(θ) − (1/D) Σ_{d=1}^{D} log ∫_h f(x_d, h; θ) dh  (2.18)

The gradient of this loss function w.r.t. θ is:

∂L(X; θ)/∂θ = ∂ log Z(θ)/∂θ − E_{x∈X}[∂/∂θ log ∫_h f(x, h; θ) dh]  (2.19)

The first term becomes:

∂ log Z(θ)/∂θ = E_{x,h∼P(x,h;θ)}[∂ log f(x, h; θ)/∂θ]  (2.20)

From Equation (2.20), Equation (2.19) becomes:

∂L(X; θ)/∂θ = E_{x,h∼P(x,h;θ)}[∂ log f(x, h; θ)/∂θ] − E_{x∈X}[∂/∂θ log ∫_h f(x, h; θ) dh]  (2.21)

Reasoning along the same lines as in Section 2.3.1, we can write the parameter learning rule as:

θ_{t+1} = θ_t + η (E_{x∈X_0}[∂/∂θ log ∫_h f(x, h; θ) dh] − E_{x∈X_k, h∼P(h|x;θ)}[∂ log f(x, h; θ)/∂θ])  (2.22)
2.3.3 Energy-based models

Energy-based models (EBMs) are generative latent variable models that can learn the underlying data distribution from a sample dataset by minimizing an energy function E(x, h; θ), where x and h denote the set of observed and latent random variables respectively, and θ is the set of parameters used by the model. Once trained, these can produce other datasets that also match the data distribution, hence a generative model.

The joint probability distribution is given by

P(x, h; θ) = exp(−E(x, h; θ)) / Z(θ)  (2.23)

where the partition function Z(θ) is given by

Z(θ) = ∫_x ∫_h exp(−E(x, h; θ)) dh dx  (2.24)

Therefore, we can write the marginal and conditional probabilities as:

P(x; θ) = ∫_h P(x, h; θ) dh = (1/Z(θ)) ∫_h exp(−E(x, h; θ)) dh  (2.25)

P(h; θ) = ∫_x P(x, h; θ) dx = (1/Z(θ)) ∫_x exp(−E(x, h; θ)) dx  (2.26)

P(x|h; θ) = P(x, h; θ) / P(h; θ) = exp(−E(x, h; θ)) / ∫_x exp(−E(x, h; θ)) dx  (2.27)

P(h|x; θ) = P(x, h; θ) / P(x; θ) = exp(−E(x, h; θ)) / ∫_h exp(−E(x, h; θ)) dh  (2.28)

Since our task is to learn the data distribution from the training data X, our problem reduces to the same as Section 2.3.2 with f(x, h; θ) = exp(−E(x, h; θ)).

The derivative of our loss function, Equation (2.21), becomes:

∂L(X; θ)/∂θ = −E_{x,h∼P(x,h;θ)}[∂E(x, h; θ)/∂θ] − E_{x∈X}[∂/∂θ log ∫_h exp(−E(x, h; θ)) dh]  (2.29)

The second term can be written as:

∂/∂θ log ∫_h exp(−E(x, h; θ)) dh = [∫_h ∂/∂θ exp(−E(x, h; θ)) dh] / [∫_h exp(−E(x, h; θ)) dh]
  = −[∫_h exp(−E(x, h; θ)) ∂E(x, h; θ)/∂θ dh] / [∫_h exp(−E(x, h; θ)) dh]
  = −∫_h P(h|x; θ) ∂E(x, h; θ)/∂θ dh      (using 2.28)
  = −E_{h∼P(h|x;θ)}[∂E(x, h; θ)/∂θ]  (2.30)

Therefore, Equation (2.29) becomes:

∂L(X; θ)/∂θ = −E_{x,h∼P(x,h;θ)}[∂E(x, h; θ)/∂θ] + E_{x∈X, h∼P(h|x;θ)}[∂E(x, h; θ)/∂θ]  (2.31)

This makes our parameter learning rule:

θ_{t+1} = θ_t + η (E_{x∈X_0, h∼P(h|x;θ)}[−∂E(x, h; θ)/∂θ] − E_{x∈X_k, h∼P(h|x;θ)}[−∂E(x, h; θ)/∂θ])  (2.32)
2.3.4 Boltzmann machine

A Boltzmann machine (BM) is an energy-based model where the observed and the latent variables can take either continuous or discrete values. Since a BM can be considered a stochastic recurrent neural network, the observed variables are interchangeably called visible units (v), while the latent variables are called hidden units (h). The model has the parameters θ = (b_v, W_v, W, W_h, b_h). The energy function is defined as:

E(v, h; θ) = −b_v^T v − v^T W_v v − v^T W h − h^T W_h h − b_h^T h  (2.33)

Figure 2.1 illustrates a BM with N visible units and M hidden units.

Figure 2.1: Boltzmann machine represented as a complete undirected graph with N visible units, M hidden units, a visible and a hidden bias.

The gradients of Equation (2.33) w.r.t. θ become:

∂E(v, h; θ)/∂b_v = −v
∂E(v, h; θ)/∂W_v = −v v^T
∂E(v, h; θ)/∂W = −v h^T
∂E(v, h; θ)/∂W_h = −h h^T
∂E(v, h; θ)/∂b_h = −h  (2.34)

Our parameter learning rule, Equation (2.32), becomes:

b_{v,t+1} = b_{v,t} + η (E_{v∈X_0}[v] − E_{v∈X_k}[v])
W_{v,t+1} = W_{v,t} + η (E_{v∈X_0}[v v^T] − E_{v∈X_k}[v v^T])
W_{t+1} = W_t + η (E_{v∈X_0, h∼P(h|v;θ)}[v h^T] − E_{v∈X_k, h∼P(h|v;θ)}[v h^T])
W_{h,t+1} = W_{h,t} + η (E_{v∈X_0, h∼P(h|v;θ)}[h h^T] − E_{v∈X_k, h∼P(h|v;θ)}[h h^T])
b_{h,t+1} = b_{h,t} + η (E_{v∈X_0, h∼P(h|v;θ)}[h] − E_{v∈X_k, h∼P(h|v;θ)}[h])  (2.35)
2.3.5 Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is a variant of a BM where the observed variables and the latent variables form a bipartite graph. This restriction makes the hidden units conditionally independent given the visible units, and vice versa. The parameters are θ = (b_v, W, b_h).

Due to the independence imposed on the visible and the hidden units, Equation (2.33) reduces to:

E(v, h; θ) = −b_v^T v − v^T W h − b_h^T h  (2.36)

Figure 2.2: Restricted Boltzmann machine represented as a bipartite graph with N visible units, M hidden units, a visible and a hidden bias.

The parameter learning rule becomes:

b_{v,t+1} = b_{v,t} + η (E_{v∈X_0}[v] − E_{v∈X_k}[v])
W_{t+1} = W_t + η (E_{v∈X_0, h∼P(h|v;θ)}[v h^T] − E_{v∈X_k, h∼P(h|v;θ)}[v h^T])
b_{h,t+1} = b_{h,t} + η (E_{v∈X_0, h∼P(h|v;θ)}[h] − E_{v∈X_k, h∼P(h|v;θ)}[h])  (2.37)

Depending on the distribution of the data that has to be modelled, the visible and hidden units can be designed to take any desired distribution. We will discuss two commonly used RBMs, one to model binary data and the other for real-valued data.
2.3.5.1 Bernoulli-Bernoulli RBM

A Bernoulli-Bernoulli RBM (BB-RBM) is one where both the visible and hidden units are Bernoulli distributed. The definition of the energy function stays the same as in Equation (2.36), with v ∈ {0, 1}^|V| and h ∈ {0, 1}^|H|:

E(v, h; θ) = −b_v^T v − v^T W h − b_h^T h
           = −Σ_i b_{v,i} v_i − Σ_{ij} v_i W_{ij} h_j − Σ_j b_{h,j} h_j  (2.38)

Since our visible units can take only binary values, the conditional distribution (2.27) can be rewritten as:

P(v_i = 1 | h; θ) = exp(−E(v_i = 1, h; θ)) / [exp(−E(v_i = 1, h; θ)) + exp(−E(v_i = 0, h; θ))]
                  = 1 / (1 + exp(−E(v_i = 0, h; θ)) / exp(−E(v_i = 1, h; θ)))
                  = 1 / (1 + exp(−(b_{v,i} + Σ_j W_{ij} h_j)))      (the terms not involving v_i cancel)
                  = g(b_{v,i} + Σ_j W_{ij} h_j)  (2.39)

Here, g denotes the sigmoid function. Therefore,

P(v = 1 | h; θ) = g(b_v + W h)  (2.40)

Similarly, we can show that:

P(h = 1 | v; θ) = g(b_h + W^T v)  (2.41)

The parameter learning rule remains the same as in Equation (2.37).
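As a concrete illustration of Equations (2.37) and (2.39)-(2.41), the following is a minimal NumPy sketch of one CD-1 update for a BB-RBM. It is a didactic example rather than the implementation used later in this thesis.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(V0, W, b_v, b_h, lr=0.01, rng=np.random.default_rng(0)):
        """One CD-1 step on a batch V0 of binary visible vectors (shape D x N)."""
        # Positive phase: P(h = 1 | v) = g(b_h + W^T v), Eq. (2.41)
        ph0 = sigmoid(V0 @ W + b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # One Gibbs step: reconstruct v (Eq. 2.40), then recompute hidden probabilities
        pv1 = sigmoid(h0 @ W.T + b_v)
        ph1 = sigmoid(pv1 @ W + b_h)
        D = V0.shape[0]
        # Updates, Eq. (2.37): data statistics minus model (reconstruction) statistics
        W += lr * (V0.T @ ph0 - pv1.T @ ph1) / D
        b_v += lr * (V0 - pv1).mean(axis=0)
        b_h += lr * (ph0 - ph1).mean(axis=0)
        return W, b_v, b_h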
2.3.5.2 Gaussian-Bernoulli RBM

A Gaussian-Bernoulli RBM (GB-RBM) has Gaussian-distributed visible units and Bernoulli-distributed hidden units. With v ∈ R^|V|, h ∈ {0, 1}^|H| and θ = (b_v, σ, W, b_h), the energy function can be written as:

E(v, h; θ) = Σ_i (v_i − b_{v,i})^2 / (2σ_i^2) − Σ_{ij} (v_i/σ_i) W_{ij} h_j − Σ_j b_{h,j} h_j
           = (1/2) ((v − b_v)/σ)^T ((v − b_v)/σ) − (v/σ)^T W h − b_h^T h  (2.42)

For the sake of brevity, we will not derive the conditional distributions here; the reader is encouraged to refer to [12] for the derivations. The conditional distributions can be written as:

P(v_i = x | h; θ) = (1/√(2πσ_i^2)) exp(−(1/(2σ_i^2)) (x − b_{v,i} − σ_i Σ_j W_{ij} h_j)^2)
                  = N(v_i; b_{v,i} + σ_i Σ_j W_{ij} h_j, σ_i^2)  (2.43)

P(h_j = 1 | v; θ) = g(b_{h,j} + Σ_i (v_i/σ_i) W_{ij})  (2.44)

The gradients of the energy function, Equation (2.42), w.r.t. θ become (element-wise over the visible units):

∂E(v, h; θ)/∂b_v = −(v − b_v)/σ^2
∂E(v, h; θ)/∂W = −(v/σ) h^T
∂E(v, h; θ)/∂b_h = −h
∂E(v, h; θ)/∂σ = −(v − b_v)^2/σ^3 + (v/σ^2) ∘ (W h)  (2.45)

Our parameter learning rule becomes:

b_{v,t+1} = b_{v,t} + η (E_{v∈X_0}[(v − b_{v,t})/σ_t^2] − E_{v∈X_k}[(v − b_{v,t})/σ_t^2])
W_{t+1} = W_t + η (E_{v∈X_0, h∼P(h|v;θ)}[(v/σ_t) h^T] − E_{v∈X_k, h∼P(h|v;θ)}[(v/σ_t) h^T])  (2.46)
b_{h,t+1} = b_{h,t} + η (E_{v∈X_0, h∼P(h|v;θ)}[h] − E_{v∈X_k, h∼P(h|v;θ)}[h])
σ_{t+1} = σ_t + η (E_{v∈X_0, h∼P(h|v;θ)}[(v − b_{v,t})^2/σ_t^3 − (v/σ_t^2) ∘ (W h)] − E_{v∈X_k, h∼P(h|v;θ)}[(v − b_{v,t})^2/σ_t^3 − (v/σ_t^2) ∘ (W h)])  (2.47)
2.3.6 Content-addressable memory

An RBM can be considered a bipartite, stochastic and generative version of a Hopfield network [6] that uses annealed Gibbs sampling instead of gradient descent. This means that we can extend the property of being a content-addressable memory to an RBM too [10]. By a content-addressable memory system, we mean that the network is designed to store a number of patterns so that they can be retrieved from noisy or partial cues. It does this by creating an energy surface whose minima represent each of the patterns. The noisy and partial cues are states of the system which are close to these minima. As the network evolves, it slides from the noisy pattern down the energy surface into the closest minimum, which represents the closest stored pattern. For example, if we train an RBM on a set of images and then present the network with either a portion of one of the images (a partial cue) or an image degraded with noise (a noisy cue), sampling from the system will attempt to reconstruct one of the stored images.
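To make the content-addressable behaviour concrete, the small sketch below queries a trained BB-RBM (weights W and biases b_v, b_h as in the previous listing, whose sigmoid helper it reuses) with a noisy or partial cue; even a single up-down Gibbs pass moves the cue towards the closest stored pattern. This is a minimal sketch, not the exact procedure used later in the thesis.

    def reconstruct(v_cue, W, b_v, b_h, n_gibbs=1):
        """Slide a noisy or partial cue down the RBM's energy surface."""
        v = v_cue.copy()
        for _ in range(n_gibbs):
            h = sigmoid(v @ W + b_h)      # infer hidden causes of the cue
            v = sigmoid(h @ W.T + b_v)    # project back to the visible layer
        return v                          # reconstruction near the closest stored pattern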
2.4 node2vec

In node2vec [4], the goal is to learn a function f : V → R^d which maximizes the objective function (2.48), where N_S(u) ⊂ V gives the network neighbourhood of u obtained using neighbourhood sampling strategy S, for every source node u ∈ V:

max_f Σ_{u∈V} log P(N_S(u) | f(u))  (2.48)

Here we make two assumptions:

1. The likelihood of observing a neighbourhood node is independent of observing any other neighbourhood node given the feature representation of the source:

P(N_S(u) | f(u)) = ∏_{n_i ∈ N_S(u)} P(n_i | f(u))  (2.49)

2. A source node and a neighbourhood node have a symmetric effect over each other in feature space, which gives us the following:

P(n_i | f(u)) = exp(f(n_i) · f(u)) / Σ_{v∈V} exp(f(v) · f(u))  (2.50)

From equation (2.50), equation (2.49) becomes:

P(N_S(u) | f(u)) = ∏_{n_i ∈ N_S(u)} exp(f(n_i) · f(u)) / Σ_{v∈V} exp(f(v) · f(u))

log P(N_S(u) | f(u)) = Σ_{n_i ∈ N_S(u)} f(n_i) · f(u) − log Σ_{v∈V} exp(f(v) · f(u))  (2.51)
Using Equation (2.51), our objective function becomes:

max_f Σ_{u∈V} [ −log Σ_{v∈V} exp(f(v) · f(u)) + Σ_{n_i ∈ N_S(u)} f(n_i) · f(u) ]  (2.52)

The first term in equation (2.52) is computationally intractable and hence is approximated using negative sampling. Equation (2.52) is optimized using stochastic gradient ascent over the model parameters defining the features f.
2.4.1 Neighbourhood sampling strategy (S)

There are two extreme strategies:

1. Breadth-first search (BFS): This focuses only on the immediate neighbours and is therefore effective in embedding structurally equivalent nodes.

2. Depth-first search (DFS): Here the focus is on neighbours at increasing distances, which is efficient in embedding nodes that exhibit homophily.

Since most nodes exhibit a varying amount of both homophily and structural equivalence with respect to other nodes, we require a biased random walk approach that can explore neighbourhoods in a BFS as well as a DFS manner.

Suppose that we start from a source node c_0 = u. For a random walk of fixed length l, the walk transitions according to:

P(c_i = x | c_{i−1} = v) = π_vx / Z  if (v, x) ∈ E, and 0 otherwise

where π_vx is the unnormalized transition probability between v and x, and Z is the normalizing constant.

This second-order random walk has two parameters, p and q, which guide the walk. If we consider a walk that just traversed from t to v and now has to choose a successor node x, then the unnormalized transition probability is given by π_vx = α_pq(t, x) · w_vx, where

α_pq(t, x) = 1/p  if d_tx = 0
           = 1    if d_tx = 1
           = 1/q  if d_tx = 2

Here w_vx is the weight of edge (v, x) and d_tx is the shortest path distance between t and x. Since d_tx ∈ {0, 1, 2}, p and q are sufficient to guide the random walk in the trade-off between BFS and DFS.

• p, the return parameter: this controls the likelihood of immediately revisiting a node. p > max(q, 1) means that the walk is less likely to revisit a sampled node, whereas p < min(q, 1) implies that the walk is more likely to revisit it.

• q, the in-out parameter: this differentiates between inward and outward nodes. The random walk is biased towards the nodes close to t when q > 1, while it tries to move away from t when q < 1.

The node2vec algorithm can be summarized as follows (a sketch of the biased transition step is given after the list):

1. Start from every node u ∈ V and perform r random walks, each of length l.

2. At every step of the walk, sample the next node based on the transition probability π_vx (which can be pre-computed).

3. Optimize the objective function (2.52).
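The following sketch shows how the second-order transition probabilities π_vx = α_pq(t, x) · w_vx can be computed for one walk step. It is a minimal illustration of the p/q logic, not the pre-computed, optimized implementation used in practice; graph is assumed to map each node to its set of neighbours and weights to map each edge to its weight.

    import random

    def alpha_pq(graph, t, x, p, q):
        """Search bias alpha_pq(t, x) based on the distance d_tx in {0, 1, 2}."""
        if x == t:              # d_tx = 0: returning to the previous node
            return 1.0 / p
        if t in graph[x]:       # d_tx = 1: x is also a neighbour of t
            return 1.0
        return 1.0 / q          # d_tx = 2: moving further away from t

    def next_node(graph, weights, t, v, p, q, rng=random.Random(0)):
        """Sample the successor of v, given that the walk arrived at v from t."""
        nbrs = list(graph[v])
        pi = [alpha_pq(graph, t, x, p, q) * weights.get((v, x), 1.0) for x in nbrs]
        total = sum(pi)
        return rng.choices(nbrs, weights=[w / total for w in pi], k=1)[0]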
2.5 Graph neural networks

Graph neural networks (GNNs) are message passing networks where nodes aggregate information from their local neighbours iteratively. At every iteration or search depth, the nodes incrementally gain more and more information from further reaches of the graph. In this section we will look at the two GNNs that will be used in the denoising pipeline. These methods are continuous, parametrized and differentiable approximations of the Weisfeiler-Lehman (WL) isomorphism test [13].
2.5.1 Graph Convolutional Networks

Graph convolutional networks (GCNs) [8] encode the graph structure directly using the following layer-wise propagation rule:

Z^[l+1] = D̃^{−1/2} Ã D̃^{−1/2} H^[l] W^[l]  (2.53)

H^[l+1] = σ(Z^[l+1])  (2.54)

where Ã = A + I_N, D̃ is the diagonal degree matrix with D̃_ii = Σ_j Ã_ij, W^[l] is the trainable weight matrix for layer l, H^[l] is the matrix of activations in the l-th layer with H^[0] = X, and σ denotes a non-linear function. This layer propagation rule naturally leads to a graph-based deep neural network model. The neural network is trained by minimizing a task-specific loss function. This model is particularly useful for semi-supervised or transductive learning.
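A minimal PyTorch sketch of the propagation rule (2.53)-(2.54), written for a dense adjacency matrix for clarity; practical implementations use sparse operations and are not reproduced here.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, out_dim, bias=False)  # W^[l]

        def forward(self, H, A):
            # A_tilde = A + I_N and its symmetric degree normalization
            A_tilde = A + torch.eye(A.size(0), device=A.device)
            d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
            A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
            Z = A_hat @ self.W(H)      # Eq. (2.53)
            return torch.relu(Z)       # Eq. (2.54), here with sigma = ReLU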
2.5.2 GraphSAGE

GraphSAGE [5] also encodes the graph structure of the network, but keeps the embedding of the neighbourhood distinct from that of the node. This enables GraphSAGE to be applied to inductive learning problems. The layer propagation rule in this case can be written as:

h^[l+1]_{N(v)} = AGGREGATE^[l+1]({h^[l]_u, ∀u ∈ N(v)})  (2.55)

z^[l+1]_v = W^[l+1] · CONCAT(h^[l]_v, h^[l+1]_{N(v)})
h^[l+1]_v = σ(z^[l+1]_v)  (2.56)

h^[l+1]_v = h^[l+1]_v / ||h^[l+1]_v||_2,  ∀v ∈ V  (2.57)

In this algorithm, we first aggregate the representations of the neighbouring nodes using AGGREGATE, which can be any aggregation function that is trainable, operates on unordered node neighbourhoods, and maintains high representational capacity. Then, an affine transformation of the concatenation of the neighbourhood embedding and the node's own embedding is performed before being passed to a non-linear activation, followed by normalization.

This propagation rule also leads to a graph-based deep neural network which can be trained by minimizing a task-specific loss function. GraphSAGE can be described as a generalization of GCN to the task of inductive learning. This generalization comes from the use of an aggregation function over the embeddings of the neighbouring nodes, concatenated with the embedding of the node itself.
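A corresponding sketch of the propagation rule (2.55)-(2.57) with mean aggregation, again written densely for clarity and under the same assumptions as the GCN sketch above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SAGELayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(2 * in_dim, out_dim)  # acts on CONCAT(h_v, h_N(v))

        def forward(self, H, A):
            deg = A.sum(dim=1, keepdim=True).clamp(min=1)
            h_neigh = (A @ H) / deg                      # mean AGGREGATE, Eq. (2.55)
            z = self.W(torch.cat([H, h_neigh], dim=1))   # Eq. (2.56)
            h = torch.relu(z)
            return F.normalize(h, p=2, dim=1)            # Eq. (2.57)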
Chapter 3
Denoising pipeline
In this chapter we will look into the motivation behind our pipeline and provide a detailed explanation, followed by an experiment to test its performance.
3.1 Motivation
Problem statement: Given a clean training dataset and a noisy test dataset for a prediction task, our goal is to denoise the test data.

A deep neural network owes its performance to its ability to learn the characteristics of the training dataset to an extent that generalizes to test data. So, if we have a trained deep neural network, it can be used to make predictions on noise-free test data.

We are then left with the question of denoising the test data. Section 2.3.6 showed that an RBM can act as a content-addressable memory which can retrieve patterns from noisy or partial cues of those patterns. This means that if we train an RBM with the training data and pose it with a test datapoint, that datapoint will evolve towards its nearest minimum. When we sample from the RBM, we get this evolved form. We can extend this to say that if we pose this trained RBM with a noisy test datapoint, the noisy example should also evolve towards its nearest minimum. If we sample the RBM, the resulting reconstruction should effectively denoise the datapoint.
However, this associative-memory property works only for the properties of nodes or the properties of edges. We cannot use it to recover the adjacency matrix, because the adjacency matrix of most real-world datasets is far too sparse for an RBM to learn any useful information from it.

We know from section 2.5 that we need the right neighbourhood for a node so that it can aggregate the information from this neighbourhood. If we can obtain the correct aggregated information for a node, then we are not really interested in its exact edge connectivity. This leads us to the following hypothesis. Suppose we are given a trained deep neural network model, Φ. We take an encoded representation from Φ and train an RBM on this representation. We then feed our noisy test data to Φ and obtain a noisy estimate of the true denoised representation. If we pass this noisy estimate to the RBM and sample from it, the obtained reconstruction will be a denoised version of the noisy estimate.

However, this brings up another question. A neural network model usually contains several layers, while a graph-based neural network model usually has 2 or 3 GNN layers. Which of these layers' representations should we use for our denoising task? Although it would seem that reconstructions from the representations learnt by the input layer or the earlier hidden layers should perform better than those from the deeper layers, we intend to investigate how well each of the layers performs on our denoising task.

Once we have denoised the embedding from one of these layers, the denoised embedding in itself is not very useful. Therefore we feed this reconstruction back to Φ by passing it to the layer immediately succeeding the chosen layer. When we continue processing it through Φ, we should end up with a representation that is better at making predictions about the node than the representation without any reconstruction.

3.2 Methodology

In this section we integrate the ideas discussed in section 3.1 into one denoising pipeline, described in figure 3.1, that is capable of denoising the data.
Figure 3.1: Denoising pipeline. The trained DNN Φ processes data through NN_1 → z_1 → BN_1 → ReLU_1 → dropout → NN_2 → z_2 → BN_2 → ReLU_2 → dropout → NN_3 → z_3 → log_softmax → Ŷ. At each representation z_i (i = 0, 1, 2, 3), the pipeline may divert the noisy estimate z_i^[n] through RBM-z_i to obtain the reconstruction z̃_i^[n], which is fed back into Φ at the layer following z_i; otherwise z_i^[n] is passed on unchanged.
3.2.1 Trained deep neural network, Φ

In this portion of the pipeline, any deep neural network that is trained for our downstream prediction task can be placed. We describe a deep neural network containing three neural networks NN_i, i = {1, 2, 3}. data denotes our input data, which could be a combination of the node feature matrix X and the adjacency matrix A. z_0 denotes the data presented to the input layer, and z_i, i = {1, 2, 3}, indicates the representation learnt by the i-th hidden layer in Φ. Except for the final layer, each of the neural networks is followed by a batch normalization layer (BN), a rectified linear unit layer (ReLU) and a dropout layer. Our output prediction Ŷ is the logarithm of the softmax function applied to the final layer representation z_3. This deep neural network is by no means restrictive.

We will denote the deep neural network trained on our clean training set by Φ^NN. We will use z_0^[n] interchangeably with data^[n] to indicate data with n% noise. We will use Φ^NN(data^[n]) to indicate that data^[n] is fed to the trained pipeline Φ^NN. Φ^NN(data^[n]) gives us z_1^[n], z_2^[n] and z_3^[n], which are the noisy estimates at the corresponding hidden layers.
3.2.2 Denoising pipeline, Φ_i

Here we describe the part of the pipeline which is responsible for denoising the data based on z_i^[n], i = {0, 1, 2, 3}. First we pick a layer i whose embeddings we consider most informative for our denoising task. We use RBM-z_i to denote the RBM trained on layer i's embeddings. We know that z_i^[n] is a noisy estimate of some true representation z_i that we would get if the data were not distorted. We feed z_i^[n] to RBM-z_i and obtain a sample z̃_i^[n]. This reconstruction should have evolved towards its closest minimum in the energy surface learnt by RBM-z_i, and therefore z̃_i^[n] is closer to the true representation underlying z_i^[n]. When z̃_i^[n] is passed back into the ML pipeline at the layer succeeding layer i, and then allowed to progress through the rest of the pipeline, we should get the true prediction Ŷ.
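In code, Φ_i amounts to training RBM-z_i once on the clean layer-i embeddings and then taking an extra hop through it during inference. The sketch below is a minimal outline under the same assumptions as the sketch in Chapter 1: the helper names (extract_hidden, fit, reconstruct, forward_from_layer) are hypothetical stand-ins for the corresponding operations of Φ and the RBM.

    def build_denoiser(phi, clean_data, i, make_rbm):
        """Train RBM-z_i on clean layer-i embeddings and return the pipeline phi_i."""
        z_clean = phi.extract_hidden(clean_data, layer=i)    # assumed helper
        rbm_i = make_rbm(z_clean.shape[1])
        rbm_i.fit(z_clean)                                   # CD training, Section 2.3.1

        def phi_i(noisy_data):
            z_noisy = phi.extract_hidden(noisy_data, layer=i)
            z_tilde = rbm_i.reconstruct(z_noisy)             # denoised estimate of z_i
            return phi.forward_from_layer(z_tilde, start=i + 1)

        return phi_i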
3.3 Experiments

For analyzing the performance of the denoising pipeline, we choose the ogbn-arxiv dataset from the Open Graph Benchmark dataset collection [7]. We choose this dataset because it contains around 100000 nodes and over a million edges, a size that fits into a standard 11 GB GPU.
3.3.1 ogbn-arxiv: Paper Citation Network

The ogbn-arxiv dataset is a directed graph representing the citation network between all Computer Science (CS) ARXIV papers indexed by MAG [17]. Each node is an ARXIV paper and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract.

Given a paper, our task is to predict which of the 40 subject areas of the ARXIV CS papers this paper can be classified into.

The papers are split according to their publication dates. The papers published until 2017 form the training set. The validation set contains papers published in 2018, while the papers published since 2019 are placed in the test set.
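For reference, the dataset and its time-based split can be obtained with the OGB package roughly as follows (a sketch; package versions may differ in detail):

    from ogb.nodeproppred import PygNodePropPredDataset

    dataset = PygNodePropPredDataset(name="ogbn-arxiv")
    split_idx = dataset.get_idx_split()   # node indices for "train", "valid", "test"
    graph = dataset[0]                    # PyG Data object: x (128-dim features), edge_index, y
    train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]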
The following neural networks are used to gauge performance on the dataset:

• MLP: A multi-layer perceptron (MLP) predictor that uses just the node feature matrix X as input. This does not incorporate any of the graph information. It learns weight W and bias b parameters and computes the representation as z = WX + b.

• node2vec: An MLP predictor that uses a concatenation of the node feature matrix and the node2vec embeddings (section 2.4).

• GCN: Full-batch graph convolutional network (section 2.5.1).

• GraphSAGE: Full-batch GraphSAGE (section 2.5.2) with the mean pooling variant and a simple skip connection to preserve central node features.

The pipeline used for these baselines is the same as described in figure 3.1. Therefore, we will be replacing NN by these neural network algorithms while evaluating the respective pipeline's performance. We maintain the same fixed hidden dimensionality of 256, a fixed number of three layers, and a dropout ratio of 0.5 as described in the OGB paper [7].
3.3.2 Noisy data for ogbn-arxiv

We will assume that we are given a combination of (1, 3), (1, 4), (2, 3) or (2, 4) from the possibilities listed below:

1. Corrupted node feature matrix, X_c: This will occur when the title and the abstract contain words that are confusing or misleading. These words may be chosen, intentionally or otherwise, to showcase the paper as something it is not.

2. Partial node feature matrix, X_z: If the title or abstract is very sparingly or vaguely worded, the generated node feature vectors may contain a lot of zeroes.

3. Corrupted adjacency matrix, A_c: Due to tricks used by journals to boost their impact factor, or for other reasons, a paper might have links to others that are not really necessary.

4. Partial adjacency matrix, A_z: A paper can have very few citations.

We maintain the same node property prediction task as was assigned for the original dataset, but we have the added constraint of having to resolve the noisy data before proceeding with the prediction task. For this analysis, we assume that the training set is clean but the validation and test sets are noisy. For n = {0, 10, …, 100}, where n is the percentage of distortion in either the node feature matrix or the adjacency matrix, we prepare the noisy data as follows (a code sketch is given after the list):
1. X_c^[n]: We corrupt the node feature vectors by adding n% of noise, drawn from a uniform random distribution on [0, 1], to the original node feature vectors.

2. X_z^[n]: For the partial node feature vectors, we generate them by blanking out n% of the entries in the original node feature vectors.

3. A_c^[n]: For n% of the nodes, the adjacency matrix is corrupted by replacing the true neighbour of a node in the validation set with a random node in the training set. For the nodes in the test set, we replace the true neighbour with a node in either the training set or the validation set.

4. A_z^[n]: Blanking out the adjacency matrix is achieved by eliminating n% of the edges.
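The feature distortions and the edge blanking can be produced with simple tensor operations, as in the sketch below. It is illustrative rather than the exact generation script, and it assumes the uniform noise is applied to an n% fraction of the feature entries.

    import torch

    def corrupt_features(X, n):
        """X_c^[n]: add uniform [0, 1) noise to a fraction n/100 of the entries."""
        mask = torch.rand_like(X) < n / 100.0
        return X + mask * torch.rand_like(X)

    def blank_features(X, n):
        """X_z^[n]: zero out a fraction n/100 of the entries."""
        mask = torch.rand_like(X) < n / 100.0
        return X * (~mask)

    def blank_edges(edge_index, n):
        """A_z^[n]: drop a fraction n/100 of the edges."""
        keep = torch.rand(edge_index.size(1)) >= n / 100.0
        return edge_index[:, keep]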
We will use X when we have to refer to both X_c and X_z; the same approach is followed for A.

We use P to denote the prediction accuracy of any of the pipelines. We will compare P(Φ^NN(X^[n], A^[n])), P(Φ^NN_0(X^[n], A^[n])), P(Φ^NN_1(X^[n], A^[n])), P(Φ^NN_2(X^[n], A^[n])) and P(Φ^NN_3(X^[n], A^[n])) for each of the above mentioned neural network algorithms and each of the possible combinations of noisy data, for n = {0, 10, …, 100}.
We have observed that RBM-z_0 with 4096 hidden units, and the RBMs corresponding to the other layers with 1024 hidden units each, are the most efficient for learning the representations, when each of these is trained for 50000 epochs with one step of contrastive divergence (section 2.3.1).
3.4 Results
To enable a clearer understanding of the results, we have hosted them in an interactive R Shiny application which can be found at https://ankithmo.shinyapps.io/denoiseRBM/. The application provides the user with the following options for both the validation and the test sets:
Sidebar panel

• Choose "Corrupted" or "Incomplete" for both X and A.
• Change the amount of distortion in A.

Main panel

• Playground showing Φ^NN(X^[n_X], A^[n_A]), NN ∈ {MLP, n2v, GCN, SAGE}, n_X = {0, …, 100}, where n_A is a user-defined input. The user can add the Φ^NN_i(X^[n_X], A^[n_A]) of choice from the "Choose reconstructions" dropdown box.
• Plot containing P(Φ^NN_i(X^[n_X], A^[n_A])), i = {0, 1, 2, 3}, and P(Φ^NN(X^[n_X], A^[n_A])), with a separate tab for each possible value taken by NN.
• Ability to isolate specific performance lines by double-clicking on the corresponding legend entry.
• Performance lines can be removed by single-clicking on the respective legend entry.
The reader is encouraged to load the web application alongside this thesis to view the detailed results.
We will explain the denoising pipelines for each of the neural network algorithms in the following chapters. We would like to note that, for convenience of plotting, we will use NN to denote P(Φ^NN(X^[n], A^[n])), NN: x for P(Φ^NN_0(X^[n], A^[n])), NN: z_1 for P(Φ^NN_1(X^[n], A^[n])), NN: z_2 for P(Φ^NN_2(X^[n], A^[n])) and NN: z_3 to indicate P(Φ^NN_3(X^[n], A^[n])).

Chapter 4
Neural network-based denoising pipeline
In this chapter we discuss the performance of Φ^MLP and Φ^n2v. For the sake of brevity, we will only depict the results that exhibit observations requiring some explanation. The reader is encouraged to load the R Shiny web application (section 3.4) alongside this thesis.
4.1 MLP
In this section, we consider
MLP
. We usen
X
andn
A
to denote the amount of distortion in the node
feature matrix and the adjacency matrix respectively.
A general observation is that the performance when dealing with a blanked out node feature matrix is usually higher and degrades more gradually than in the case of a corrupted node feature matrix.
Corrupted node feature matrix (X_c): We observe from figure 4.1 that when n_X < 40, the order of performance is as follows: P(MLP_0(X_c^{[n_X]})), P(MLP_1(X_c^{[n_X]})), P(MLP_2(X_c^{[n_X]})), P(MLP(X_c^{[n_X]})), P(MLP_3(X_c^{[n_X]})). When n_X ≥ 40, the ordering becomes the following: P(MLP_1(X_c^{[n_X]})), P(MLP_0(X_c^{[n_X]})), P(MLP(X_c^{[n_X]})), P(MLP_3(X_c^{[n_X]})), P(MLP_2(X_c^{[n_X]})). This leads us to two important observations as n_X increases: 1. P(MLP_1(X_c^{[n_X]})) usually outperforms the rest, which results in a 6% improvement in prediction accuracy when compared with P(MLP(X_c^{[n_X]})); 2. P(MLP_2(X_c^{[n_X]})) deteriorates.
Figure 4.1: Prediction accuracy of MLP_i(X_c^{[n_X]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}, and MLP(X_c^{[n_X]}). Observe that P(MLP_1(X_c^{[n_X]})) outperforms the rest while P(MLP_2(X_c^{[n_X]})) deteriorates. We use the following terms interchangeably: 1. MLP with P(MLP(X_c^{[n_X]})), 2. MLP:x with P(MLP_0(X_c^{[n_X]})), 3. MLP:z_1 with P(MLP_1(X_c^{[n_X]})), 4. MLP:z_2 with P(MLP_2(X_c^{[n_X]})), 5. MLP:z_3 with P(MLP_3(X_c^{[n_X]})).
Blanked node feature matrix (X_z): Figure 4.2 shows us that the performance of the various layers abides by the following hierarchy: 1. P(MLP_0(X_z^{[n_X]})) outperforms the rest, reflecting an increase of up to 15% in prediction accuracy when compared with P(MLP(X_z^{[n_X]})); 2. P(MLP_1(X_z^{[n_X]})) follows; 3. P(MLP_2(X_z^{[n_X]})), which deteriorates when n_X ≥ 40; 4. P(MLP(X_z^{[n_X]})); 5. P(MLP_3(X_z^{[n_X]})).
4.2 node2vec
Here, we look at n2v.
4.2.1 Corrupted node feature matrix (X_c)
4.2.1.1 Corrupted adjacency matrix (A_c)
We observe that the performance can be ordered in the following manner: 1. P(n2v_1(X_c^{[n_X]}, A_c^{[n_A]})), which gives an improvement over P(n2v(X_c^{[n_X]}, A_c^{[n_A]})) ranging from 8% (in the case of figure 4.3) down to 2% as n_A increases; 2. P(n2v_0(X_c^{[n_X]}, A_c^{[n_A]})); 3. P(n2v(X_c^{[n_X]}, A_c^{[n_A]})), which gets closer to P(n2v_0(X_c^{[n_X]}, A_c^{[n_A]})) with increasing n_X; 4. P(n2v_2(X_c^{[n_X]}, A_c^{[n_A]})), which deteriorates when n_X ≥ 60; 5. P(n2v_3(X_c^{[n_X]}, A_c^{[n_A]})).
Figure 4.2: Prediction accuracy of MLP_i(X_z^{[n_X]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}, and MLP(X_z^{[n_X]}). We observe that P(MLP_0(X_z^{[n_X]})) outperforms the rest, closely followed by P(MLP_1(X_z^{[n_X]})). We use the following terms interchangeably: 1. MLP with P(MLP(X_z^{[n_X]})), 2. MLP:x with P(MLP_0(X_z^{[n_X]})), 3. MLP:z_1 with P(MLP_1(X_z^{[n_X]})), 4. MLP:z_2 with P(MLP_2(X_z^{[n_X]})), 5. MLP:z_3 with P(MLP_3(X_z^{[n_X]})).
4.2.1.2 Blanked out adjacency matrix (A_z)
When n_A ≥ 20, we usually observe a performance hierarchy that can be written as: 1. P(n2v_0(X_c^{[n_X]}, A_z^{[n_A]})), which gives an increase of up to 11% when compared with P(n2v(X_c^{[n_X]}, A_z^{[n_A]})); 2. P(n2v_1(X_c^{[n_X]}, A_z^{[n_A]})); 3. P(n2v(X_c^{[n_X]}, A_z^{[n_A]})); 4. P(n2v_2(X_c^{[n_X]}, A_z^{[n_A]})), which shows deterioration when n_X ≥ 60; 5. P(n2v_3(X_c^{[n_X]}, A_z^{[n_A]})).
4.2.2 Blanked out node feature matrix (X_z)
4.2.2.1 Corrupted adjacency matrix (A_c)
We observe from figure 4.4 that: 1. P(n2v_0(X_z^{[n_X]}, A^{[n_A]})) shows a relatively gradual negative linear trend; 2. P(n2v_1(X_z^{[n_X]}, A^{[n_A]})) follows, but exhibits a bomb-like trajectory that becomes more linear; 3. P(n2v_2(X_z^{[n_X]}, A^{[n_A]})), P(n2v(X_z^{[n_X]}, A^{[n_A]})) and P(n2v_3(X_z^{[n_X]}, A^{[n_A]})) have a negative linear trend that becomes sharper.
Figure 4.3: Prediction accuracy of n2v_i(X_c^{[n_X]}, A_c^{[0]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}, and n2v(X_c^{[n_X]}, A_c^{[0]}). As n_X increases we observe that the gap between P(n2v_i(X_c^{[n_X]}, A_c^{[0]})), i ∈ {0, 1}, and P(n2v_i(X_c^{[n_X]}, A_c^{[0]})), i ∈ {2, 3}, widens. At n_X > 50, P(n2v_2(X_c^{[n_X]}, A_c^{[0]})) begins to deteriorate. We use the following terms interchangeably: 1. n2v with P(n2v(X_c^{[n_X]}, A_c^{[0]})), 2. n2v:x with P(n2v_0(X_c^{[n_X]}, A_c^{[0]})), 3. n2v:z_1 with P(n2v_1(X_c^{[n_X]}, A_c^{[0]})), 4. n2v:z_2 with P(n2v_2(X_c^{[n_X]}, A_c^{[0]})), 5. n2v:z_3 with P(n2v_3(X_c^{[n_X]}, A_c^{[0]})).
When comparing P(n2v_0(X_z^{[n_X]}, A_c^{[n_A]})) to P(n2v(X_z^{[n_X]}, A_c^{[n_A]})), we observe an improvement of 22% (in the case of figure 4.4) all the way up to 38% in some cases.
4.2.2.2 Blanked out adjacency matrix (A_z)
We observe the same trend as exhibited in figure 4.4, but with a generally higher accuracy. Here, there is an improvement of up to 33% when P(n2v(X_z^{[n_X]}, A_z^{[n_A]})) is compared to P(n2v_0(X_z^{[n_X]}, A_z^{[n_A]})).
4.2.2.3 Intuitions regarding the performance of the layers
Here, NN_i, i ∈ {0, 1, 2, 3}, denote the MLP layers, and DNN will be used to refer to either MLP or node2vec.
Figure 4.4: Prediction accuracy of n2v_i(X_z^{[n_X]}, A_c^{[0]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that when n_X > 70, n2v_1(X_z^{[n_X]}, A_c^{[0]}) nosedives to collapse. We use the following terms interchangeably: 1. n2v with P(n2v(X_z^{[n_X]}, A_c^{[0]})), 2. n2v:x with P(n2v_0(X_z^{[n_X]}, A_c^{[0]})), 3. n2v:z_1 with P(n2v_1(X_z^{[n_X]}, A_c^{[0]})), 4. n2v:z_2 with P(n2v_2(X_z^{[n_X]}, A_c^{[0]})), 5. n2v:z_3 with P(n2v_3(X_z^{[n_X]}, A_c^{[0]})).
Training phase: z_0 contains the most information for our prediction task, but it is stored in a format that is highly distributed and difficult to extract. This makes it harder for RBM-z_0 to store these patterns in deeper energy minima.
z_1 is the result of passing z_0 through NN_1, which transforms it into a higher dimensionality, thus making it more linearly separable. In this format, z_1 is easier for RBM-z_1 to store in deeper minima on the energy contour.
When z_1 is passed through ReLU_1, we are able to discard a lot of information that is irrelevant to our prediction task. After feeding this to NN_2, the remaining relevant information is made more linearly separable with no change in dimensionality. This representation is the hardest for RBM-z_2 to store in deeper minima on the energy surface.
Feeding z_2 through ReLU_2 and NN_3 results in a representation that has the least information needed for the prediction task, in a maximally linearly separable form with dimensionality equal to the number of predicted classes. This is relatively easier for RBM-z_3 to store in deeper energy minima.
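To make the z_i concrete, the sketch below shows a three-layer network that exposes z_0 (its input) and the pre-activation outputs z_1, z_2, z_3 of NN_1, NN_2 and NN_3; the layer names and sizes are illustrative. After training on the clean training set, these stored activations are the patterns that RBM-z_0, ..., RBM-z_3 memorise.

import torch

class MLP3(torch.nn.Module):
    # Three-layer network that exposes z_0 (input) and the pre-activation
    # outputs z_1, z_2, z_3 of NN_1, NN_2, NN_3; sizes are illustrative
    def __init__(self, d_in, d_hid, n_classes):
        super().__init__()
        self.nn1 = torch.nn.Linear(d_in, d_hid)
        self.nn2 = torch.nn.Linear(d_hid, d_hid)
        self.nn3 = torch.nn.Linear(d_hid, n_classes)

    def forward(self, x):
        z0 = x
        z1 = self.nn1(z0)                    # higher-dimensional, more linearly separable
        z2 = self.nn2(torch.relu(z1))        # ReLU_1 discards task-irrelevant information
        z3 = self.nn3(torch.relu(z2))        # logits, one dimension per predicted class
        return z3, (z0, z1, z2, z3)

# The (z0, z1, z2, z3) collected on the clean training set are the patterns
# that RBM-z_0, ..., RBM-z_3 are trained to store in deep energy minima.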
Inference phase (a code sketch of the layer-wise denoising step is given after this list):
1. Denoise z_0: z_0^{[n]} is harder to denoise than z_1^{[n]} as n increases, but denoising is still slightly effective because of the large amount of information still available.
2. Denoise z_1: The transformation in NN_1 makes it easier to spot the noise in z_1^{[n]} and reconstruct a denoised version. This makes DNN_1 overall more effective in denoising when compared with the others.
3. Denoise z_2: ReLU_1 can discard information confounded with noise in z_1^{[n]}. When this is compounded with the fact that RBM-z_2 finds it harder to store information in deeper minima, P(DNN_2) can collapse as n increases.
4. Denoise z_3: ReLU_2 can discard noise-confounded information in z_2^{[n]}. The ease with which the information is stored by RBM-z_3 prevents the performance from collapsing. However, this is not very effective in denoising.
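The sketch below illustrates the layer-wise denoising step described in this list, reusing the MLP3 and RBM sketches from earlier: forward the noisy input up to z_i^{[n]}, let RBM-z_i produce a reconstruction through a visible-hidden-visible pass, and finish the forward pass from that point. Treating real-valued activations with a Bernoulli-style pass is a simplification, so this is an illustration of the pipeline rather than the exact reconstruction rule used in the experiments.

import torch

def denoise_at_layer(model, rbms, x_noisy, i, gibbs_steps=1):
    # DNN_i: forward the noisy input up to z_i, let RBM-z_i reconstruct it,
    # then finish the forward pass from layer i with the reconstruction
    _, zs = model(x_noisy)                   # model is the clean-trained MLP3 sketch
    z = zs[i]
    for _ in range(gibbs_steps):
        z = rbms[i].p_v(torch.bernoulli(rbms[i].p_h(z)))   # crude visible-hidden-visible pass
    if i == 0:
        z = model.nn1(z)
    if i <= 1:
        z = model.nn2(torch.relu(z))
    if i <= 2:
        z = model.nn3(torch.relu(z))
    return z                                  # logits whose accuracy gives P(DNN_i)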
When the adjacency matrix is blanked out (A_z), the sparsity from A_z^{[n_A]} can propagate to the information conveyed by X_c, thereby affecting the accuracy of deeper layers in the network as n_A increases. This indicates that n2v_0(X_c^{[n_X]}, A_z^{[n_A]}) is more effective in denoising than n2v_i(X_c^{[n_X]}, A_z^{[n_A]}), i ∈ {1, 2, 3}.
Blanking out of features: When the node feature matrix is blanked out (X_z), it is easier to denoise because the entries are zero where features are missing and the rest of the features are kept intact. However, this advantage gets weaker as we progress deeper into the neural network, due to the transformations. When we are also dealing with distortions in the adjacency matrix, as in the case of node2vec, we believe that the sparsity of information in the node feature matrix can propagate to the information conveyed by the distorted adjacency matrix. In z_0, the sparsity in the node feature matrix is distinct from the distortions in the adjacency matrix, and RBM-z_0 can deal with the two independently.
However, once z_0 goes through a transformation, the sparsity and the distortions can combine to make it harder for RBM-z_1 to denoise. At small values of n_X, RBM-z_1 is still able to produce a good reconstruction, but as n_X becomes too large, this starts to severely affect the capability of RBM-z_1. This leads to the bomb-like trajectory depicted by P(n2v_1(X_z^{[n_X]}, A^{[n_A]})) in figure 4.4.
Deeper layers have too little information to work with and therefore exhibit poor performance.
Chapter 5
Graph neural network-based denoising pipeline
In this chapter we will go through the performance of GCN and SAGE. For concision, we will only be illustrating the results which require some explanation, and we therefore encourage the reader to view the R Shiny application (section 3.4) alongside this thesis for further details.
A general observation is that the performance when dealing with a blanked out node feature matrix is usually higher and degrades more gradually than in the case of a corrupted node feature matrix.
5.1 GCN
Here, we look at GCN.
5.1.0.1 Corrupted node feature matrix
Full adjacency matrix: We observe in figure 5.1 that when n_X ≤ 50, P(GCN_i(X_c^{[n_X]}, A_c^{[0]})), i ∈ {0, 1, 2}, exhibit similar performance, while P(GCN_3(X_c^{[n_X]}, A_c^{[0]})) performs worse than these. At n_X = 50, we observe the following hierarchy in performances: P(GCN_0(X_c^{[50]}, A_c^{[0]})) ≥ P(GCN_1(X_c^{[50]}, A_c^{[0]})) ≥ P(GCN_2(X_c^{[50]}, A_c^{[0]})) ≥ P(GCN(X_c^{[50]}, A_c^{[0]})) ≥ P(GCN_3(X_c^{[50]}, A_c^{[0]})). Beyond this point, P(GCN_3(X_c^{[n_X]}, A_c^{[0]})) begins to exhibit some rather interesting behaviour. At n_X = 60, P(GCN_3(X_c^{[60]}, A_c^{[0]})) begins to recover. When n_X = 70, P(GCN_3(X_c^{[70]}, A_c^{[0]})) reaches the performance of P(GCN_1(X_c^{[70]}, A_c^{[0]})). Beyond this value of n_X, we see that P(GCN_3(X_c^{[n_X]}, A_c^{[0]})) outperforms all of the other layers.
Figure 5.1: Prediction accuracy of GCN_i(X_c^{[n_X]}, A_c^{[0]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. We observe that P(GCN_0(X_c^{[n_X]}, A_c^{[0]})) performs the best until P(GCN_3(X_c^{[n_X]}, A_c^{[0]})), which does not decrease as quickly as the other layers, overtakes it. We use the following terms interchangeably: 1. GCN with P(GCN(X_c^{[n_X]}, A_c^{[0]})), 2. GCN:x with P(GCN_0(X_c^{[n_X]}, A_c^{[0]})), 3. GCN:z_1 with P(GCN_1(X_c^{[n_X]}, A_c^{[0]})), 4. GCN:z_2 with P(GCN_2(X_c^{[n_X]}, A_c^{[0]})), 5. GCN:z_3 with P(GCN_3(X_c^{[n_X]}, A_c^{[0]})).
Corrupted adjacency matrix: When n_A ≤ 40, the behaviour exhibited by P(GCN_3(X_c^{[n_X]}, A_c^{[0]})) when n_X ≥ 50 in figure 5.1 is not reflected in this case. When n_A = 40, a rise in P(GCN_3(X_c^{[n_X]}, A_c^{[40]})) is observed when n_X ≥ 70.
At n_A = 50, we observe that all of the layers completely collapse when n_X ≥ 50. When n_A = 60, figure 5.3 shows that P(GCN_0(X_c^{[n_X]}, A_c^{[60]})) decreases gradually, P(GCN_1(X_c^{[n_X]}, A_c^{[60]})) consistently performs poorly, and P(GCN_3(X_c^{[n_X]}, A_c^{[60]})) starts out poorly but decreases far more slowly than the others when n_X > 40.
At n_A = 70, P(GCN_0(X_c^{[n_X]}, A_c^{[70]})) decreases rapidly and P(GCN_2(X_c^{[n_X]}, A_c^{[70]})) exhibits a slower descent than the former. Here, P(GCN_1(X_c^{[n_X]}, A_c^{[70]})) and P(GCN_3(X_c^{[n_X]}, A_c^{[70]})) show very random behaviour. When n_A = 80, we observe from figure 5.4 a behaviour that is very similar to the case when n_A = 70, except that although P(GCN_1(X_c^{[n_X]}, A_c^{[80]})) starts out low, it descends more slowly than P(GCN_2(X_c^{[n_X]}, A_c^{[80]})). We observe that P(GCN_3(X_c^{[n_X]}, A_c^{[80]})) makes an unexpected increase when n_X > 60.
Figure 5.2: Prediction accuracy of GCN_i(X_c^{[n_X]}, A_c^{[40]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. We observe that when n_X ≥ 40, P(GCN_1(X_c^{[n_X]}, A_c^{[40]})) begins to collapse, and when n_X ≥ 70, P(GCN_3(X_c^{[n_X]}, A_c^{[40]})) shows a resurgence. We use the following terms interchangeably: 1. GCN with P(GCN(X_c^{[n_X]}, A_c^{[40]})), 2. GCN:x with P(GCN_0(X_c^{[n_X]}, A_c^{[40]})), 3. GCN:z_1 with P(GCN_1(X_c^{[n_X]}, A_c^{[40]})), 4. GCN:z_2 with P(GCN_2(X_c^{[n_X]}, A_c^{[40]})), 5. GCN:z_3 with P(GCN_3(X_c^{[n_X]}, A_c^{[40]})).
Figure 5.3: Prediction accuracy of GCN_i(X_c^{[n_X]}, A_c^{[60]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that P(GCN_0(X_c^{[n_X]}, A_c^{[60]})) decreases gradually, P(GCN_1(X_c^{[n_X]}, A_c^{[60]})) consistently performs poorly, and P(GCN_3(X_c^{[n_X]}, A_c^{[60]})) starts out poorly but decreases much more gradually than the rest when n_X > 40. We use the following terms interchangeably: 1. GCN with P(GCN(X_c^{[n_X]}, A_c^{[60]})), 2. GCN:x with P(GCN_0(X_c^{[n_X]}, A_c^{[60]})), 3. GCN:z_1 with P(GCN_1(X_c^{[n_X]}, A_c^{[60]})), 4. GCN:z_2 with P(GCN_2(X_c^{[n_X]}, A_c^{[60]})), 5. GCN:z_3 with P(GCN_3(X_c^{[n_X]}, A_c^{[60]})).
Figure 5.4: Prediction accuracy of GCN_i(X_c^{[n_X]}, A_c^{[80]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe the sudden rise in P(GCN_3(X_c^{[n_X]}, A_c^{[80]})) when n_X > 60. We use the following terms interchangeably: 1. GCN with P(GCN(X_c^{[n_X]}, A_c^{[80]})), 2. GCN:x with P(GCN_0(X_c^{[n_X]}, A_c^{[80]})), 3. GCN:z_1 with P(GCN_1(X_c^{[n_X]}, A_c^{[80]})), 4. GCN:z_2 with P(GCN_2(X_c^{[n_X]}, A_c^{[80]})), 5. GCN:z_3 with P(GCN_3(X_c^{[n_X]}, A_c^{[80]})).
When n_A has a value of 90 or 100, the accuracy is too low to be of any practical significance.
When comparing P(GCN_0(X_c^{[n_X]}, A_c^{[n_A]})) to P(GCN(X_c^{[n_X]}, A_c^{[n_A]})), an improvement ranging from 12% (in the case of figure 5.1) down to 4% is observed as n_A increases.
Blanked out adjacency matrix: We observe a behaviour similar to the one explained in the previous section concerning the corrupted adjacency matrix, except that the layers are more robust to blanking out than to corruption in the adjacency matrix. In particular, we do not observe the collapse when n_A = 50, and in general we see that P(GCN_0(X_c^{[n_X]}, A_z^{[n_A]})) distinctly outperforms the others while P(GCN_3(X_c^{[n_X]}, A_z^{[n_A]})) shows the lowest performance.
Overall, we see an improvement of up to 16% when the prediction accuracy of P(GCN(X_c^{[n_X]}, A_z^{[n_A]})) is compared to that of P(GCN_0(X_c^{[n_X]}, A_z^{[n_A]})).
5.1.0.2 Blanked out node feature matrix
Full adjacency matrix: Figure 5.5 shows us that P(GCN(X_z^{[n_X]}, A_c^{[0]})) outperforms all of the rest. P(GCN_1(X_z^{[n_X]}, A_c^{[0]})) and P(GCN_2(X_z^{[n_X]}, A_c^{[0]})) perform on the same level as P(GCN(X_z^{[n_X]}, A_c^{[0]})). P(GCN_3(X_z^{[n_X]}, A_c^{[0]})) consistently performs lower than these. The most interesting observation is that P(GCN_0(X_z^{[n_X]}, A_c^{[0]})) shows a constant decrease in performance.
Figure 5.5: Prediction accuracy of GCN(X_z^{[n_X]}, A_c^{[0]}). Observe that P(GCN_0(X_z^{[n_X]}, A_c^{[0]})) shows a constant decrease in performance. P(GCN(X_z^{[n_X]}, A_c^{[0]})), P(GCN_0(X_z^{[n_X]}, A_c^{[0]})) and P(GCN_1(X_z^{[n_X]}, A_c^{[0]})) show similar performance. P(GCN_3(X_z^{[n_X]}, A_c^{[0]})) performs poorly. We use the following terms interchangeably: 1. GCN with P(GCN(X_z^{[n_X]}, A_c^{[0]})), 2. GCN:x with P(GCN_0(X_z^{[n_X]}, A_c^{[0]})), 3. GCN:z_1 with P(GCN_1(X_z^{[n_X]}, A_c^{[0]})), 4. GCN:z_2 with P(GCN_2(X_z^{[n_X]}, A_c^{[0]})), 5. GCN:z_3 with P(GCN_3(X_z^{[n_X]}, A_c^{[0]})).
Corrupted adjacency matrix: When n_A ≤ 50, we observe a performance that is similar to that in figure 5.5. At n_A = 60, P(GCN_1(X_z^{[n_X]}, A_c^{[60]})) starts performing worse than P(GCN_i(X_z^{[n_X]}, A_c^{[60]})), i ∈ {2, 3}, and P(GCN(X_z^{[n_X]}, A_c^{[60]})). However, P(GCN_3(X_z^{[n_X]}, A_c^{[60]})) significantly outperforms the others when n_X > 50. The case n_A = 70 shows very similar behaviour. When n_A = 80, we observe from figure 5.6 that P(GCN_1(X_z^{[n_X]}, A_c^{[80]})) and P(GCN_3(X_z^{[n_X]}, A_c^{[80]})) start with a poor performance but this begins to improve, where P(GCN_1(X_z^{[n_X]}, A_c^{[80]})) even manages to outperform the rest when n_X > 70.
At n_A of 90 or 100, the accuracy is too low to be of any practical significance.
Figure 5.6: Prediction accuracy of GCN(X_z^{[n_X]}, A_c^{[80]}). Observe that P(GCN_1(X_z^{[n_X]}, A_c^{[80]})) and P(GCN_3(X_z^{[n_X]}, A_c^{[80]})) start with a poor performance but this begins to improve, where P(GCN_1(X_z^{[n_X]}, A_c^{[80]})) even manages to outperform the rest when n_X > 70. We use the following terms interchangeably: 1. GCN with P(GCN(X_z^{[n_X]}, A_c^{[80]})), 2. GCN:x with P(GCN_0(X_z^{[n_X]}, A_c^{[80]})), 3. GCN:z_1 with P(GCN_1(X_z^{[n_X]}, A_c^{[80]})), 4. GCN:z_2 with P(GCN_2(X_z^{[n_X]}, A_c^{[80]})), 5. GCN:z_3 with P(GCN_3(X_z^{[n_X]}, A_c^{[80]})).
When comparing P(GCN_2(X_z^{[n_X]}, A_c^{[n_A]})) to P(GCN(X_z^{[n_X]}, A_c^{[n_A]})), we observe no improvement in the case of figure 5.5, with an improvement of up to 5% in some rather rare instances.
Blanked out adjacency matrix: We observe the same behaviour as described in the case when the adjacency matrix was corrupted, but with a lower overall accuracy. We also observe that P(GCN_0(X_z^{[n_X]}, A_z^{[n_A]})) does not show as sharp a decrease.
In this case, we see an improvement of up to 2% in some very rare cases when P(GCN(X_z^{[n_X]}, A_z^{[n_A]})) is compared to P(GCN_1(X_z^{[n_X]}, A_z^{[n_A]})).
5.1.1 GraphSAGE
Here we will discuss SAGE.
5.1.1.1 Corrupted node feature matrix
Full adjacency matrix: We observe from figure 5.7 that when n_X is either 0 or 10, P(SAGE_i(X_c^{[n_X]}, A_c^{[0]})), i ∈ {0, 1, 2, 3}, and P(SAGE(X_c^{[n_X]}, A_c^{[0]})) show very similar performance. At n_X = 20, we observe that the performance of P(SAGE_3(X_c^{[n_X]}, A_c^{[0]})) starts to decrease. When n_X = 30, the gap between the performances of P(SAGE_0(X_c^{[n_X]}, A_c^{[0]})) and P(SAGE_2(X_c^{[n_X]}, A_c^{[0]})) widens. For n_X with a value of 40 or 50, we see that P(SAGE_1(X_c^{[n_X]}, A_c^{[0]})) outperforms the others. For n_X ∈ {60, 70, 80, 90}, it is observed that P(SAGE_2(X_c^{[n_X]}, A_c^{[0]})) starts deteriorating. When n_X = 100, P(SAGE_0(X_c^{[n_X]}, A_c^{[0]})) is observed to beat P(SAGE_1(X_c^{[n_X]}, A_c^{[0]})).
Figure 5.7: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_c^{[0]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that P(SAGE_1(X_c^{[n_X]}, A_c^{[0]})) outperforms the rest, closely followed by P(SAGE_0(X_c^{[n_X]}, A_c^{[0]})). We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_c^{[0]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_c^{[0]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_c^{[0]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_c^{[0]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_c^{[0]})).
Corrupted adjacency matrix: Except for the cases we specifically mention, we find that most percentages of corruption in the adjacency matrix result in observations that vary only slightly from figure 5.7. When n_A = 20, we make two observations: 1. P(SAGE_0(X_c^{[n_X]}, A_c^{[20]})) outperforms the rest at n_X > 50; 2. P(SAGE_2(X_c^{[n_X]}, A_c^{[20]})) starts to deteriorate at n_X > 40. At n_A = 40, we see that while n_X < 50 the performance is similar to figure 5.7. When n_X ≥ 50, we observe the following hierarchy in performance in figure 5.8: P(SAGE_0(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_2(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_3(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_1(X_c^{[n_X]}, A_c^{[40]})).
Figure 5.8: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_c^{[40]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. When n_X ≥ 50, we observe the following performance hierarchy: P(SAGE_0(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_2(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_3(X_c^{[n_X]}, A_c^{[40]})) ≥ P(SAGE_1(X_c^{[n_X]}, A_c^{[40]})). We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_c^{[40]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_c^{[40]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_c^{[40]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_c^{[40]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_c^{[40]})).
For n_A = 60, we observe from figure 5.9 the following interesting behaviour when n_X > 40: 1. P(SAGE_1(X_c^{[n_X]}, A_c^{[60]})) sometimes outperforms the rest; 2. P(SAGE_0(X_c^{[n_X]}, A_c^{[60]})) slows down and makes a gradual return; 3. P(SAGE_2(X_c^{[n_X]}, A_c^{[60]})) plateaus; 4. although P(SAGE(X_c^{[n_X]}, A_c^{[60]})) initially plateaus, it improves at n_X > 80, even outperforming P(SAGE_1(X_c^{[n_X]}, A_c^{[60]})); 5. P(SAGE_3(X_c^{[n_X]}, A_c^{[60]})) consistently performs poorly.
At n_A = 70, figure 5.10 shows the following hierarchy: 1. P(SAGE_0(X_c^{[n_X]}, A_c^{[70]})), P(SAGE_1(X_c^{[n_X]}, A_c^{[70]})) and P(SAGE(X_c^{[n_X]}, A_c^{[70]})) exhibit similar performance; 2. P(SAGE_2(X_c^{[n_X]}, A_c^{[70]})) deteriorates at n_X > 40 and then makes a comeback; 3. P(SAGE_3(X_c^{[n_X]}, A_c^{[70]})) shows a linear downward trend when n_X > 60.
Figure 5.9: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_c^{[60]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that the performance of the layers starts to differ considerably from previous observations. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_c^{[60]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_c^{[60]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_c^{[60]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_c^{[60]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_c^{[60]})).
Figure 5.10: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_c^{[70]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that P(SAGE_3(X_c^{[n_X]}, A_c^{[70]})) starts to decrease when n_X > 40 and P(SAGE_2(X_c^{[n_X]}, A_c^{[70]})) forms a bowl around n_X = 50. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_c^{[70]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_c^{[70]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_c^{[70]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_c^{[70]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_c^{[70]})).
When n_A is 90 or 100, we find that at n_X > 40 the accuracy falls to or below 10%, which is not of great practical significance.
When comparing P(SAGE_1(X_c^{[n_X]}, A_c^{[n_A]})) to P(SAGE(X_c^{[n_X]}, A_c^{[n_A]})), we observe an improvement ranging from 9% (in the case of figure 5.7) up to 13% as n_A increases.
Blanked out adjacency matrix: In this case we usually observe performance trends that are similar to those when the adjacency matrix was corrupted, but here the resulting accuracy is higher. When n_A = 30, we observe from figure 5.11 that P(SAGE_0(X_c^{[n_X]}, A_z^{[30]})) and P(SAGE_1(X_c^{[n_X]}, A_z^{[30]})) show similar performance, which is greater than that of P(SAGE_2(X_c^{[n_X]}, A_z^{[30]})) and P(SAGE(X_c^{[n_X]}, A_z^{[30]})). This is followed by P(SAGE_3(X_c^{[n_X]}, A_z^{[30]})), the worst performing one, which deteriorates when n_X > 50.
Figure 5.11: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_z^{[30]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. We observe a deterioration in the performance of P(SAGE_3(X_c^{[n_X]}, A_z^{[30]})) when n_X > 50. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_z^{[30]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_z^{[30]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_z^{[30]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_z^{[30]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_z^{[30]})).
At n_A = 40, we find that P(SAGE_0(X_c^{[n_X]}, A_z^{[40]})) outperforms the others when n_X > 60. When n_A = 50, the following hierarchy becomes apparent in figure 5.12: P(SAGE_1(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_0(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_2(X_c^{[n_X]}, A_z^{[50]})), P(SAGE(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_3(X_c^{[n_X]}, A_z^{[50]})).
Figure 5.12: Prediction accuracy of SAGE_i(X_c^{[n_X]}, A_z^{[50]}). Observe that the following hierarchy becomes apparent: P(SAGE_1(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_0(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_2(X_c^{[n_X]}, A_z^{[50]})), P(SAGE(X_c^{[n_X]}, A_z^{[50]})), P(SAGE_3(X_c^{[n_X]}, A_z^{[50]})). We use the following terms interchangeably: 1. SAGE with P(SAGE(X_c^{[n_X]}, A_z^{[50]})), 2. SAGE:x with P(SAGE_0(X_c^{[n_X]}, A_z^{[50]})), 3. SAGE:z_1 with P(SAGE_1(X_c^{[n_X]}, A_z^{[50]})), 4. SAGE:z_2 with P(SAGE_2(X_c^{[n_X]}, A_z^{[50]})), 5. SAGE:z_3 with P(SAGE_3(X_c^{[n_X]}, A_z^{[50]})).
In this case, we see an improvement of up to 13% when P(SAGE(X_c^{[n_X]}, A_z^{[n_A]})) is compared to P(SAGE_1(X_c^{[n_X]}, A_z^{[n_A]})).
5.1.2 Blanked out node feature matrix
Full adjacency matrix: Figure 5.13 shows that P(SAGE_0(X_z^{[n_X]}, A_c^{[0]})) outperforms the rest. This is followed by P(SAGE_1(X_z^{[n_X]}, A_c^{[0]})), which shows a bomb-like trajectory. Finally, we have P(SAGE_2(X_z^{[n_X]}, A_c^{[0]})), P(SAGE(X_z^{[n_X]}, A_c^{[0]})) and P(SAGE_3(X_z^{[n_X]}, A_c^{[0]})), which follow a more sub-linear path with negative slope.
Corrupted adjacency matrix: When n_A < 40, we observe a performance that is not very far from that in figure 5.13. At n_A = 40, figure 5.14 depicts that P(SAGE_0(X_z^{[n_X]}, A_c^{[40]})) and P(SAGE_1(X_z^{[n_X]}, A_c^{[40]})) become more sub-linear with negative slope, while the rest of the layers collapse when n_X > 40.
Figure 5.15 shows some interesting observations when n_A = 90. When n_X > 30, P(SAGE_2(X_z^{[n_X]}, A_c^{[90]})), P(SAGE(X_z^{[n_X]}, A_c^{[90]})) and P(SAGE_3(X_z^{[n_X]}, A_c^{[90]})) collapse. P(SAGE_1(X_z^{[n_X]}, A_c^{[90]})) collapses when n_X > 50, while n_X > 60 causes P(SAGE_0(X_z^{[n_X]}, A_c^{[90]})) to collapse.
Figure 5.13: Prediction accuracy of SAGE_i(X_z^{[n_X]}, A_c^{[0]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. Observe that P(SAGE_0(X_z^{[n_X]}, A_c^{[0]})) outperforms the rest. P(SAGE_1(X_z^{[n_X]}, A_c^{[0]})) shows a bomb-like trajectory. P(SAGE_2(X_z^{[n_X]}, A_c^{[0]})), P(SAGE(X_z^{[n_X]}, A_c^{[0]})) and P(SAGE_3(X_z^{[n_X]}, A_c^{[0]})) follow a more sub-linear path with negative slope. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_z^{[n_X]}, A_c^{[0]})), 2. SAGE:x with P(SAGE_0(X_z^{[n_X]}, A_c^{[0]})), 3. SAGE:z_1 with P(SAGE_1(X_z^{[n_X]}, A_c^{[0]})), 4. SAGE:z_2 with P(SAGE_2(X_z^{[n_X]}, A_c^{[0]})), 5. SAGE:z_3 with P(SAGE_3(X_z^{[n_X]}, A_c^{[0]})).
Figure 5.14: Prediction accuracy of SAGE_i(X_z^{[n_X]}, A_c^{[40]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. P(SAGE_0(X_z^{[n_X]}, A_c^{[40]})) and P(SAGE_1(X_z^{[n_X]}, A_c^{[40]})) become more sub-linear with negative slope, while the rest of the layers collapse when n_X > 40. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_z^{[n_X]}, A_c^{[40]})), 2. SAGE:x with P(SAGE_0(X_z^{[n_X]}, A_c^{[40]})), 3. SAGE:z_1 with P(SAGE_1(X_z^{[n_X]}, A_c^{[40]})), 4. SAGE:z_2 with P(SAGE_2(X_z^{[n_X]}, A_c^{[40]})), 5. SAGE:z_3 with P(SAGE_3(X_z^{[n_X]}, A_c^{[40]})).
Figure 5.15: Prediction accuracy of SAGE_i(X_z^{[n_X]}, A_c^{[90]}), i ∈ {0, 1, 2, 3}, n_X ∈ {0, 10, ..., 100}. When n_X > 30, P(SAGE_2(X_z^{[n_X]}, A_c^{[90]})), P(SAGE(X_z^{[n_X]}, A_c^{[90]})) and P(SAGE_3(X_z^{[n_X]}, A_c^{[90]})) collapse. P(SAGE_1(X_z^{[n_X]}, A_c^{[90]})) collapses when n_X > 50, while n_X > 60 causes P(SAGE_0(X_z^{[n_X]}, A_c^{[90]})) to collapse. We use the following terms interchangeably: 1. SAGE with P(SAGE(X_z^{[n_X]}, A_c^{[90]})), 2. SAGE:x with P(SAGE_0(X_z^{[n_X]}, A_c^{[90]})), 3. SAGE:z_1 with P(SAGE_1(X_z^{[n_X]}, A_c^{[90]})), 4. SAGE:z_2 with P(SAGE_2(X_z^{[n_X]}, A_c^{[90]})), 5. SAGE:z_3 with P(SAGE_3(X_z^{[n_X]}, A_c^{[90]})).
When n_A = 100, we find the accuracy to be too low for any practical significance when n_X ≥ 50.
When comparing P(SAGE_0(X_z^{[n_X]}, A_c^{[n_A]})) to P(SAGE(X_z^{[n_X]}, A_c^{[n_A]})), we observe an improvement of 40% in the case of figure 5.13, and an improvement of up to 34% in some other cases.
Blanked out adjacency matrix: Overall we find that the performance in this case is more robust than when we are dealing with a corrupted adjacency matrix. When n_A is between 30 and 50, we observe that P(SAGE_1(X_z^{[n_X]}, A_z^{[n_A]})) deteriorates when n_X > 60. Additionally, when n_A ≥ 50, we find that P(SAGE_2(X_z^{[n_X]}, A_z^{[n_A]})), P(SAGE_3(X_z^{[n_X]}, A_z^{[n_A]})) and P(SAGE(X_z^{[n_X]}, A_z^{[n_A]})) have very similar performance.
In this case, we see an improvement of up to 38% in some cases when P(SAGE(X_z^{[n_X]}, A_z^{[n_A]})) is compared to P(SAGE_0(X_z^{[n_X]}, A_z^{[n_A]})).
5.1.2.1 Intuitions regarding the performance of the layers
In general, a graph neural network lets a node borrow information from all the nodes in its neighbourhood and transform it into a higher dimension, thus making it more linearly separable. Here x_i denotes the embedding of node i, N_i indicates the neighbours of node i, d_i stands for the degree of node i, and W, W_1, W_2 denote the learnable parameters.
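As a reference point for the mixing argument made below, here is a dense sketch of the standard GCN and GraphSAGE-mean updates in this notation. The thesis's Equations (2.54) and (2.58) are the authoritative forms; the normalisation and the use of a sum rather than a concatenation here may differ in detail.

import torch

def gcn_update(X, A, W):
    # Standard GCN propagation (dense sketch): a node's own features and its
    # neighbours' features are summed under the same weight matrix W, so they mix
    A_hat = A + torch.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = torch.diag(A_hat.sum(1).pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W       # symmetrically normalised

def sage_update(X, A, W1, W2):
    # GraphSAGE with mean aggregation (dense sketch): the node's own term (X W1)
    # and the aggregated neighbourhood term stay separate, so noise mixes less
    deg = A.sum(1, keepdim=True).clamp(min=1.0)
    neigh_mean = (A @ X) / deg                           # mean over N_i
    return X @ W1 + neigh_mean @ W2

# X: (num_nodes, d) float feature matrix, A: dense 0/1 float adjacency matrix,
# W, W1, W2: learnable weight matrices as introduced above.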
Training phase: z_0 is the same as that in section 4.2.2.3, and therefore we offer the same explanation for it here.
When z_0 is fed to NN_1, each node borrows information from its neighbourhood and transforms it to a higher dimension, resulting in z_1. This should make it easier for RBM-z_1 to store z_1 in deeper energy minima.
By passing z_1 through ReLU_1, we discard some of the information in the 1-hop neighbourhood embeddings that is irrelevant to our prediction task. Feeding it to NN_2 gets each node to borrow information from its neighbours, making it more linearly separable. This makes it slightly harder for RBM-z_2 to store in deeper minima on the energy surface, but the features are still highly informative in most cases.
Feeding z_2 through ReLU_2 discards irrelevant information from the 2-hop neighbourhood embeddings; passing it through NN_3 adds in information from the 3-hop neighbourhood and converts it to a maximally linearly separable form with dimensionality equal to the number of predicted classes. This format is easier for RBM-z_3 to store in deeper energy minima.
Inference phase
1. Denoise z_0: We offer the same explanation as in section 4.2.2.3.
2. Denoise z_1: NN_1 causes noise from the neighbours of each node to enter that node's features. If we use GCN layers, then from Equation (2.54) we observe that the noise-confounded information in the node gets mixed in with the noise-confounded information borrowed from the neighbouring nodes. This makes it even harder to denoise the resulting embeddings. On the other hand, if we use GraphSAGE layers, then Equation (2.58) shows that these are easier to denoise because the node's information and the neighbourhood's information do not mix to the same extent. This explains why P(SAGE_1(X_c^{[n_X]}, A^{[n_A]})) usually performs better than P(SAGE_0(X_c^{[n_X]}, A^{[n_A]})), but P(GCN_0(X_c^{[n_X]}, A^{[n_A]})) outperforms the rest.
3. Denoise z_2: ReLU_1 can discard relevant information confounded with noise in z_1^{[n]}. If a sufficient number of nodes lose such relevant information, then passing it through NN_2 will leave the nodes with little relevant information. This affects the denoising ability as n increases.
4. Denoise z_3: Denoising from z_3^{[n]} has the same issues as denoising from z_2^{[n]}, but is slightly harder because we now have to recover noise from nodes in the 3-hop neighbourhood, with the risk that a lot of relevant information is lost through the two ReLU layers. This suggests that P(GCN_3) can perform poorly in most cases.
Blanking out of features: When we are dealing with a blanked out node feature matrix (X_z), we believe that this sparsity of information, coupled with distortions in the adjacency matrix, makes it harder for GraphSAGE to rely both on the features of the node and on the features from the neighbourhood of this node. This becomes particularly severe at high values of n_X and is hardly a problem at lower values of n_X, leading to a bomb-like trajectory. The effect on deeper layers is more pronounced and therefore reflects in poor performance.
Chapter 6
Conclusions and Future work
We would like to conclude the thesis by summarizing the best performing layers for each combination of distortions.
For the case of a corrupted node feature matrix and a corrupted adjacency matrix, we observe that n2v_0, n2v_1, SAGE_0 and SAGE_1 usually perform the best, depending on the values of n_X and n_A. When n_X > 70, in some cases MLP, MLP_1 and GCN_3 outperform the rest.
When dealing with a corrupted node feature matrix and a blanked out adjacency matrix, we observe that GCN, SAGE_0, GCN_3 and GCN_1 usually outperform the rest, depending on the values of n_X and n_A. In rare cases we observe that n2v_0 and MLP_0 take the lead.
For the case of a blanked node feature matrix and a corrupted adjacency matrix, we find that SAGE_0, SAGE_1, n2v_0, n2v_1 and GCN_0 usually show the best performance, depending on the values of n_X and n_A. In a couple of cases, we see MLP_0 taking the lead.
In the case of a blanked node feature matrix and a blanked adjacency matrix, we observe that GCN and GCN_1 perform the best for most values of n_X and n_A. Very rarely do we find GCN_2, SAGE_0 and MLP_0 outperforming the rest.
We have provided short videos of these best performing layers to show how increasing the distortion of the node feature matrix and the adjacency matrix affects the performance of the layers. These can be viewed at the following links:
• Corrupted node feature matrix and corrupted adjacency matrix: https://bit.ly/2CNpoWN.
• Corrupted node feature matrix and blanked out adjacency matrix: https://bit.ly/31mUO0B.
• Blanked out node feature matrix and corrupted adjacency matrix: https://bit.ly/2BKry9l.
• Blanked out node feature matrix and blanked out adjacency matrix: https://bit.ly/387zBck.
For comparing the predictions of the other layers when dealing with various combinations of the
distortions of the node feature matrix and the adjacency matrix, the reader is encouraged to interact with
the playground available in the Shiny application described in section 3.4.
6.1 Future work
We find that some of the observations in figure 5.1 and some in section 5.1.0.2 depict performance trends that are counter to our intuitions. We would like to further investigate these trends.
We hope to extend these denoising tasks to other features related to nodes and edges, such as distortions in the node position matrix, edge feature matrix, edge position matrix, etc. We would also like to work on other datasets.
Generally, GB-RBMs are not robust to noise, as they assume a diagonal Gaussian as the conditional distribution over the visible units. This means that the log probability assigned to a noisy outlier would be very low, and classification accuracy tends to be poor for noisy, out-of-sample test cases. We intend to investigate RBMs whose visible units have other distributions better suited to our denoising task.
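For reference, the diagonal Gaussian conditional referred to here is, in the common parameterisation of the Gaussian-Bernoulli RBM (the exact convention adopted in section 2.3.5.2 may differ in how the variances are handled),

p(v_i \mid \mathbf{h}) = \mathcal{N}\!\left( v_i \;\middle|\; b_i + \sigma_i \sum_j w_{ij} h_j,\ \sigma_i^2 \right),

so the log-probability of a visible value falls off quadratically with its distance from the conditional mean, which is why a single noisy outlier can be assigned a very low likelihood.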
Robust restricted Boltzmann machines (RoBM), proposed by [14], have been shown to be robust to corruptions in the training set and are capable of accurately dealing with occlusions and noise by using multiplicative gating to induce a scale mixture of Gaussians over pixels. RoBMs have been successfully used in image denoising and inpainting. We would like to incorporate RoBMs into our denoising pipeline, which would eliminate the need for a DNN previously trained on a clean training set.
[9] describes how a limited Boltzmann machine implemented on an adiabatic quantum computer can be used to train deep neural networks. The paper also describes approaches such as neuromorphic computing and high-performance computing methods. We foresee the use of such methods in our denoising pipeline.
Bibliography
[1] Miguel A. Carreira-Perpinan and Geoffrey E. Hinton. "On contrastive divergence learning." In: AISTATS. Vol. 10. Citeseer, 2005, pp. 33–40.
[2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. "Cluster-GCN". In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (July 2019). doi: 10.1145/3292500.3330925.
[3] Linwei Fan, Fan Zhang, Hui Fan, and Caiming Zhang. "Brief review of image denoising techniques". In: Visual Computing for Industry, Biomedicine, and Art 2.1 (2019), p. 7.
[4] Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. 2016. arXiv: 1607.00653 [cs.SI].
[5] Will Hamilton, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs". In: Advances in Neural Information Processing Systems. 2017, pp. 1024–1034.
[6] John J. Hopfield. "Neural networks and physical systems with emergent collective computational abilities". In: Proceedings of the National Academy of Sciences 79.8 (1982), pp. 2554–2558.
[7] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. "Open graph benchmark: Datasets for machine learning on graphs". In: arXiv preprint arXiv:2005.00687 (2020).
[8] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. 2016. arXiv: 1609.02907 [cs.LG].
[9] Jeremy Liu, Federico M. Spedalieri, Ke-Thia Yao, Thomas E. Potok, Catherine Schuman, Steven Young, Robert Patton, Garrett S. Rose, and Gangotree Chamka. "Adiabatic quantum computation applied to deep learning networks". In: Entropy 20.5 (2018), p. 380.
[10] K. Nagatani and M. Hagiwara. "Restricted Boltzmann machine associative memory". In: 2014 International Joint Conference on Neural Networks (IJCNN). 2014, pp. 3745–3750.
[11] Michael A. Nielsen. Neural Networks and Deep Learning. Vol. 2018. Determination Press, San Francisco, CA, 2015.
[12] Jan Schlüter. Restricted Boltzmann Machine Derivations. Tech. rep. TR-2014-13. Österreichisches Forschungsinstitut für Artificial Intelligence (OFAI), Vienna, Austria, 2014.
[13] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. "Weisfeiler-Lehman graph kernels." In: Journal of Machine Learning Research 12.9 (2011).
[14] Y. Tang, R. Salakhutdinov, and G. Hinton. "Robust Boltzmann Machines for recognition and denoising". In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 2264–2271.
[15] Kiran K. Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based Graph Neural Network for Semi-supervised Learning. 2018. arXiv: 1803.03735 [stat.ML].
[16] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. 2017. arXiv: 1710.10903 [stat.ML].
[17] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. "Microsoft Academic Graph: When experts are not enough". In: Quantitative Science Studies 1.1 (2020), pp. 396–413.
[18] Oliver Woodford. "Notes on contrastive divergence". In: Department of Engineering Science, University of Oxford, Tech. Rep. (2006).
[19] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? 2018. arXiv: 1810.00826 [cs.LG].