Evaluation of Sequential Hypothesis Tests for Cross Validation of
Learning Models Using Big Data
by
Mohammad Reza Rajati
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(Statistics)
August 2015
Copyright 2015 Mohammad Reza Rajati
Dedicated to my parents, Ali Rajati and Nahid Hakami-Kermani, the best teachers I
have ever had.
Acknowledgements
The heavenly spheres which in this domain reside,
Have bewildered the wise, thinking far and wide;
Behold and don’t lose the trail of wisdom,
For the price of wisdom is to reel to every side."
Omar Khayyam (Translated from Persian by Mehdi Amin Razavi)
First and foremost, I would like to express my highest gratitude to my Committee Chair, Dr. Jay Bartroff, for introducing me to the topic of this thesis and advising me during its completion.
I would also like to thank my Thesis Committee members, Profs. Larry Goldstein and Brian Keith Jenkins, for their valuable time and feedback.
I started taking courses on Statistics when I was working on my PhD in Electrical Engineering
to fulfill the requirement of taking a minor in an outside department. After a while, I fell in
love with Statistics, mainly due to the excellent education that I received in the Department of
Mathematics at USC. I should like to thank Profs. Richard Arratia, Jay Bartroff, Jerry M. Mendel,
Robert Scholtz, and Jianfeng Zhang for what they taught me about probability, random processes,
statistics, and estimation.
Special thanks go to Prof. Lotfi A. Zadeh of UC Berkeley, the father of fuzzy logic, who has been a true inspiration to me. He is one of the people who encouraged me to go deeper into the field of probability and statistics, with statements like the following:
“I am glad to hear that you are taking advanced courses relating to probability theory. It is very important for you to develop a high level of expertise in probability theory and its applications.”
Last, but not least, I would like to extend my highest gratitude to my beloved family for their
continuous and unconditional support and love through these many tough years of separation
from them. My father is the most important source of inspiration and wisdom in my life. I have
inherited his passion for poetry, languages, philosophy, and knowledge through both genes and
pedagogy. My mother is a unique example of a soft-hearted person with exceptional patience and
vision, to whom I owe even my ability to read and write! I would not be where I am without
their teaching and guidance, and I cannot find proper words for thanking them for all they have
generously granted me. My beloved siblings, Ahmad Reza and Sepideh have always been very
supportive and kind to me, and I hereby thank them.
Mohammad Reza Rajati
Los Angeles, CA,
June 2015
Table of Contents

Abstract

1 Prologue
  1.1 Introduction: Sequential Hypothesis Testing for Machine Learning Using Big Data

2 Sequential Hypothesis Testing for Fast Cross Validation
  2.1 Introduction: k-Fold Cross Validation
  2.2 Sequential Hypothesis Testing for Fast Cross Validation

3 Evaluation of Statistical Tests in Cross Validation of Learning Models over Big Data
  3.1 Alternative Statistical Tests to Cochran's Q Test for Fast Cross Validation with Sequential Testing
  3.2 Alternative Statistical Tests to Friedman's Test for Fast Cross Validation with Sequential Testing

4 Epilogue: Conclusions and Future Work

A Non-parametric Tests for Evaluation of Classification Models
  A.1 Cochran Q Test
  A.2 The F-Test for Evaluation of Classifiers
  A.3 Testing Using Wald Statistics

B Non-parametric Tests for Evaluation of Regression Models
  B.1 The F-Test
  B.2 The Friedman Test
  B.3 Friedman Aligned Ranks Test
  B.4 Quade Test

Bibliography
List of Figures

3.1  Fast Cross Validation for SVM and noisy sine data with d = 5 and v = 0.04
3.2  Fast Cross Validation for SVM and noisy sine data with d = 5 and v = 0.25
3.3  Fast Cross Validation for SVM and noisy sine data with d = 50 and v = 0.04
3.4  Fast Cross Validation for SVM and noisy sine data with d = 50 and v = 0.25
3.5  Fast Cross Validation for SVM and noisy sine data with d = 100 and v = 0.04
3.6  Fast Cross Validation for SVM and noisy sine data with d = 100 and v = 0.25
3.7  Fast Cross Validation for SVM and breast cancer data
3.8  Fast Cross Validation for SVM and banana data
3.9  Fast Cross Validation for SVM and bank note authentication data
3.10 Fast Cross Validation for SVM and Pima Indian diabetes data
3.11 Fast Cross Validation for SVM and image data
3.12 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 2 and v = 0.01
3.13 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 2 and v = 0.04
3.14 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 3 and v = 0.01
3.15 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 3 and v = 0.04
3.16 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 4 and v = 0.01
3.17 Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 4 and v = 0.04
3.18 Fast Cross Validation for Kernel Ridge Regression on housing data
3.19 Fast Cross Validation for Kernel Ridge Regression on power plant data
3.20 Fast Cross Validation for Kernel Ridge Regression on airfoil data
3.21 Fast Cross Validation for Kernel Ridge Regression on concrete strength data
3.22 Fast Cross Validation for Kernel Ridge Regression on Auto Mpg data
3.23 Fast Cross Validation for Kernel Ridge Regression on yacht hydrodynamics data

(Each figure shows: a) time for each of the tests; b) accuracies (classification) or MSEs (regression) for each of the tests; c) ratios of the time of the baseline algorithm to each of the alternatives; d) differences between the errors of each algorithm and the baseline.)
List of Tables

3.1 Means and medians of time ratios for each of the alternative hypothesis tests for classification of noisy sine data
3.2 Properties of the binary classification data sets
3.3 Means and medians of time ratios for each of the alternative hypothesis tests for classification of benchmark data
3.4 Variations of the CVST algorithm for regression
3.5 Means and medians of time ratios for each of the alternative hypothesis tests for regression of noisy sinc data
3.6 Properties of the benchmark regression data sets
3.7 Means and medians of time ratios for each of the alternative hypothesis tests for regression of benchmark data
Abstract
In this thesis, we examine the application of various sequential hypothesis tests to fast cross
validation for model selection. Fast cross validation can be utilized to select hyperparameters of
classifiers and regression models, when dealing with very large amounts of data. We examine the
performance of the F-test as well as the Wald test versus the Cochran test for classification tasks.
We also examine the performance of the F-test, Friedman Aligned Ranks test, and the Quade test
versus the Friedman test for selection of hyperparameters of regression models. We demonstrate
our results on synthetic and real data sets. We show that replacing the Cochran test with the F-test in classification tasks does not diminish the performance of fast cross validation with sequential hypothesis testing, whereas the Wald test yields longer runtimes for the fast cross validation algorithm without actually improving its accuracy. For regression tasks, replacing both Friedman's and Cochran's tests with the F-test yields slightly better runtimes, although for one of the data sets this comes at the cost of higher error.
Chapter 1
Prologue
1.1 Introduction: Sequential Hypothesis Testing for Machine Learning Using Big Data
SEQUENTIAL hypothesis testing involves “any statistical test procedure which gives a specific rule, at any stage of the experiment (at the n-th trial for each integral value of n), for making one of the following three decisions: (1) to accept the hypothesis being tested (null hypothesis), (2) to reject the null hypothesis, (3) to continue the experiment by making an additional observation. Thus, such a test procedure is carried out sequentially. On the basis of the first trial, one of the three decisions mentioned above is made. If the first or the second decision is made, the process is terminated. If the third decision is made, a second trial is performed. Again, on the basis of the first two trials one of the three decisions is made and if the third decision is reached a third trial is performed, etc. This process is continued until either the first or the second decision is made.
An essential feature of the sequential test, as distinguished from the current [=non-sequential] test procedure, is that the number of observations required by the sequential test is not predetermined, but is a random variable due to the fact that at any stage of the experiment the decision of terminating the process depends on the results of the observations previously made.” [42]
Since sequential hypothesis testing decides at each step whether the data gathered so far suffice to accept or reject the null hypothesis, it is an appealing framework for state-of-the-art machine learning tasks: it can be employed to determine the "optimal" amount of data needed for testing a hypothesis, which in turn can reduce the computational power needed to run algorithms that deal with very large amounts of data, a setting that has become increasingly common in recent years and is well known as the "Big Data" paradigm [9, 29-31].
Cross validation is a method for selecting hyperparameters of machine learning algorithms, but it is computationally expensive, and this cost becomes even more important when dealing with big data. Sequential hypothesis testing can be used to reduce the computational burden of cross validation, as we will see in the sequel. This thesis is devoted to the evaluation of various statistical tests for cross validation with sequential hypothesis testing.
Chapter 2
Sequential Hypothesis Testing for Fast Cross Validation
2.1 Introduction: k-Fold Cross Validation
THIS chapter is devoted to a description of the application of sequential hypothesis testing [10, 39, 40, 42, 43] to the selection of learning parameters (also called hyperparameters [6, 7]) for training over big data sets using k-fold cross validation [3-5, 24, 35, 36].
k-fold cross validation involves partitioning a dataset D = {(x_j, y_j), j = 1, ..., N} ⊂ X × Y into k subsets D_1, D_2, ..., D_k of equal (or nearly equal) size. It is assumed that the data are drawn i.i.d. from a probability distribution P on X × Y. A model y = f(x) is trained using D_i, i = 1, 2, ..., k (thus trained k times) and tested on D \ D_i. Assuming that the error of the model on the test subset is J_i, the following estimate of the expected error of the model E[J(f(X), Y)] is derived:

\hat{J}_{cv} = \frac{1}{k} \sum_{i=1}^{k} J_i \qquad (2.1)
2.1 Examples of such parameter configurations (hyperparameters) include the standard deviation of Gaussian kernels in various kernel learning methods, or the parameter that controls the steepness of the logistic function in neural networks or logistic regression tasks.
For classification problems, J is usually chosen as the misclassification error of the classifier, and for regression problems, it is usually chosen as the mean squared error (MSE) of the model.
Assume that there is a set C of r parameter configurations, C = {θ_1, θ_2, ..., θ_r}, for training a model y = f(x; θ_c), θ_c ∈ C, and one has to select a parameter configuration θ* with which the model has the best performance:

\theta^{*} = \operatorname*{argmin}_{\theta \in C} E[J(f(X; \theta), Y)] \qquad (2.2)

k-fold cross validation can be used to obtain an estimate of the error of the model associated with each parameter configuration θ ∈ C, so that the parameter configuration with the best performance can be determined:

\hat{\theta} = \operatorname*{argmin}_{\theta \in C} \hat{J}_{cv}(\theta) \qquad (2.3)

where \hat{J}_{cv}(θ) is the cross validation estimate of the error of the model that was trained using θ ∈ C.
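For concreteness, a minimal Python sketch of the estimate (2.1) and the selection rule (2.3) is given below. It follows the convention above of training on the fold D_i and testing on the rest; the evaluate closure and the configs grid are illustrative stand-ins, not the setup used in this thesis.

```python
import numpy as np

def cv_error(evaluate, X, y, k=10, seed=0):
    """k-fold estimate J_cv of eq. (2.1): partition the data into k folds and,
    following the text, train on fold D_i and test on the remaining points.
    `evaluate(X_tr, y_tr, X_te, y_te)` returns the test error J_i of one fold."""
    idx = np.random.default_rng(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):
        rest = np.setdiff1d(idx, fold)
        errors.append(evaluate(X[fold], y[fold], X[rest], y[rest]))
    return float(np.mean(errors))

# Selection rule (2.3): keep the configuration with the smallest estimate.
# `make_evaluator(theta)` is a hypothetical factory wrapping one configuration.
# best_theta = min(configs, key=lambda theta: cv_error(make_evaluator(theta), X, y))
```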
When the number of parameter configurations is large and one is dealing with big data, using k-fold cross validation can impose a huge computational burden on the learning algorithms. A general idea for improving such computationally intensive tasks is to use sequential hypothesis testing for algorithm configuration [22].
2.2 Sequential Hypothesis Testing for Fast Cross Validation
In [25], a method for using sequential hypothesis testing for model selection with cross validation was proposed. The procedure involves at most S − 1 steps, where the s-th step involves training models on a small subset containing n = s·Δ data points, where Δ = N/S is the number of data points that are added to the training set at each step, and then testing them on the remaining N − n = N − s·Δ data points. The selection process starts training with all of the parameters in C and tests for the existence of underperforming parameters in each step, in order to eliminate them. In particular, a pointwise performance matrix P_p is established whose rows represent parameter configurations and whose columns represent the N − n members of the test data set (Algorithm 2.1). In other words, the [c, j] entry of P_p represents the error of the model that was trained using the c-th parameter configuration and tested on the j-th data point in {d_1, ..., d_{N−n}}, where d_j represents (x_j, y_j). Algorithm 2.2 is used to determine the parameter configurations with the best performance in each step, using the Friedman test [14, 15] for regression tasks and Cochran's Q test [11] for classification tasks. Algorithm 2.3 is used to determine which configurations have to be removed, according to their performance in the previous steps and the current step. Algorithm 2.4 is used to stop the algorithm, using the Cochran test. Further discussion of those algorithms can be found in [25].

In the following chapters, we evaluate the performance of the fast cross validation algorithm when Friedman's test and Cochran's Q test are replaced with alternative non-parametric tests. Those non-parametric statistical tests are discussed in Appendices A and B.

2.2 All of the algorithms presented in this chapter were adopted from [25].
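For concreteness, the following minimal sketch shows how the pointwise performance matrix P_p can be assembled; the train and loss helpers are hypothetical stand-ins, not the implementation of [25].

```python
import numpy as np

def pointwise_performance(train, loss, X, y, configs, n):
    """Pointwise performance matrix P_p: row c holds the losses of the model
    trained with configuration c on the first n points, evaluated on each of
    the remaining N - n points. `train(X_tr, y_tr, theta)` returns a model and
    `loss(model, X_te, y_te)` returns one loss value per test point."""
    Pp = np.full((len(configs), len(X) - n), np.nan)
    for c, theta in enumerate(configs):
        model = train(X[:n], y[:n], theta)
        Pp[c, :] = loss(model, X[n:], y[n:])
    return Pp
```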
Algorithm 2.1 Cross Validation with Sequential Hypothesis Testing
 1: function CVST(d_1, ..., d_N; S; C; α; β_l; α_l; w_stop)
 2:   Δ ← N/S
 3:   n ← Δ
 4:   ∀ s ∈ {1, ..., S−1}, c ∈ {1, 2, ..., r}: T_S[c, s] ← 0
 5:   ∀ s ∈ {1, ..., S−1}, c ∈ {1, 2, ..., r}: P_S[c, s] ← NA
 6:   ∀ c ∈ {1, 2, ..., r}: isActive[c] ← true
 7:   for s ← 1 to S do
 8:     ∀ j ∈ {1, ..., N−n}, c ∈ {1, 2, ..., r}: P_p[c, j] ← NA
 9:     for c ← 1 to r do
10:       if isActive[c] then
11:         f ← TRAIN(d_1, ..., d_n; θ_c)
12:         ∀ j ∈ {1, ..., N−n}: P_p[c, j] ← J(f(x_{n+j}; θ_c), y_{n+j})
13:         P_S[c, s] ← (1/(N−n)) Σ_{j=1}^{N−n} P_p[c, j]
14:     index_top ← TOPCONFIGURATIONS(P_p, α)
15:     T_S[index_top, s] ← 1
16:     for c ← 1 to r do
17:       if isActive[c] and ISFLOPCONFIGURATION(T_S[c, 1:s], s, S, β_l, α_l) then
18:         isActive[c] ← false
19:     if SIMILARPERFORMANCE(T_S[isActive, max(s−w_stop+1, 1):s], α) then
20:       break
21:     n ← n + Δ
22:   return SELECTWINNER(P_S, isActive, w_stop, s)
Algorithm 2.2 Iterative Testing for Finding Top Configurations
 1: function TOPCONFIGURATIONS(P_p, α)
 2:   ∀ c ∈ {1, 2, ..., r}: P_m[c] ← (1/(N−n)) Σ_{j=1}^{N−n} P_p[c, j]
 3:   index_sort ← SORTINDEXDECREASING(P_m)
 4:   P̃_p ← P_p[index_sort, :]
 5:   K ← number of configurations c with P_m[c] ≠ NA
 6:   α_e ← α/K
 7:   for k = 2 : K do
 8:     if classification then
 9:       p ← COCHRANQTEST(P̃_p[1:k, :])
10:     else
11:       p ← FRIEDMANTEST(P̃_p[1:k, :])
12:     if p ≤ α_e then
13:       break
14:   return index_sort[1 : k−1]
Algorithm 2.3 Detecting Flop Configurations Using Sequential Testing
 1: function ISFLOPCONFIGURATION(T, s, S, β_l, α_l)
 2:   π_0 ← 0.5;  π_1 ← 1 − (1/2)(1 − β_l/α_l)^{1/S}
 3:   a ← log( β_l/(1−α_l) ) / ( log(π_1/π_0) − log((1−π_1)/(1−π_0)) )
 4:   b ← log( (1−π_0)/(1−π_1) ) / ( log(π_1/π_0) − log((1−π_1)/(1−π_0)) )
 5:   return Σ_{i=1}^{s} T_i ≤ a + b·s
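Lines 2-4 of Algorithm 2.3 compute Wald's SPRT acceptance line for the Bernoulli trace of a configuration. A minimal sketch follows; because the exact expression for π_1 above is a reconstruction, the sketch takes π_1 as an explicit argument with a purely illustrative default.

```python
import math

def is_flop_configuration(T, s, pi1=0.9, beta_l=0.01, alpha_l=0.1, pi0=0.5):
    """Drop a configuration whose cumulative number of 'top' finishes in its
    binary trace T[0..s-1] falls on or below the SPRT line a + b*s."""
    denom = math.log(pi1 / pi0) - math.log((1 - pi1) / (1 - pi0))
    a = math.log(beta_l / (1 - alpha_l)) / denom
    b = math.log((1 - pi0) / (1 - pi1)) / denom
    return sum(T[:s]) <= a + b * s

# Illustrative use: a configuration that never reached the top group.
# is_flop_configuration([0, 0, 0, 0], s=4)
```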
Algorithm 2.4 Test whether Remaining Configurations Have Similar Performance
 1: function SIMILARPERFORMANCE(T_S, α)
 2:   p ← COCHRANQTEST(T_S)
 3:   return p ≥ α
Algorithm 2.5 Select the Winning Configuration among the Remaining Configurations
 1: function SELECTWINNER(P_S, isActive, w_stop, s)
 2:   ∀ j ∈ {1, 2, ..., s}, c ∈ {1, 2, ..., r}: R_S[c, j] ← ∞
 3:   for j = 1 : s do
 4:     for c = 1 : r do
 5:       if isActive[c] then
 6:         R_S[c, j] ← RANKIN(P_S[c, j], P_S[:, j])
 7:   ∀ c ∈ {1, 2, ..., r}: M_S[c] ← ∞
 8:   for c = 1 : r do
 9:     if isActive[c] then
10:      M_S[c] ← (1/w_stop) Σ_{j=max(s−w_stop+1, 1)}^{s} R_S[c, j]
11:  return WHICHMIN(M_S)
Chapter 3
Evaluation of Statistical Tests in Cross Validation of
Learning Models over Big Data
IN this chapter, we computationally evaluate the performance and runtime of the fast cross validation algorithm that was presented in Chapter 2, when the Friedman test and Cochran's Q test are replaced with various alternative tests.
3.1 Alternative Statistical Tests to Cochran’s Q Test for Fast Cross
Validation with Sequential Testing
Some alternatives to Cochran's test [11] have been proposed in the literature [8]. Using the F-test as an alternative to Cochran's test was suggested by Cochran himself in [11]. These tests are described in Appendix A.
In this section, we examine the time and performance of fast cross validation for parameter selection in classification. The classification model that we use is a Support Vector Machine (SVM) [18, 41]. We try to determine the penalty parameter of the SVM and the spread parameter σ of the Gaussian kernels.
First, we examine the method on synthetic data. We use the following noisy sine data, which
comprises the sign of a sinusoid function contaminated by Gaussian noise:
y = \operatorname{sgn}(\sin(x) + \varepsilon), \quad \varepsilon \sim N(0, v), \; x \in [0, 2\pi d], \; v \in \{0.04, 0.25\}, \; d \in \{5, 50, 100\} \qquad (3.1)
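A minimal sketch of generating this data set follows; reading v in (3.1) as the variance of the noise ε is an assumption.

```python
import numpy as np

def noisy_sine(n, d, v, seed=None):
    """Noisy sine data of eq. (3.1): the sign of a sinusoid plus Gaussian
    noise, with inputs drawn uniformly from [0, 2*pi*d]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 2.0 * np.pi * d, size=n)
    y = np.sign(np.sin(x) + rng.normal(0.0, np.sqrt(v), size=n))
    return x.reshape(-1, 1), y

# Example: the d = 5, v = 0.04 setting of Figure 3.1.
# X, y = noisy_sine(11000, d=5, v=0.04, seed=0)
```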
We generated 11000 data points for each model and ran various 10-step Cross Validation via Sequential Testing (CVST) algorithms on a randomly selected subset of 1000 data points, then tested the SVM with the parameter configuration selected by CVST on the remaining 10000 data points. The parameters for all of the CVSTs were α = 0.05, β_l = 0.01, α_l = 0.1, and w_stop = 3. We recorded the test error and the time consumed for selecting the parameter configuration. We chose log10 of the penalty parameter in {0.2000, 0.1789, 0.1578, 0.1367, 0.1156, 0.0944, 0.0733, 0.0522, 0.0311, 0.0100} to have 10 values for the penalty parameter, and log10 σ ∈ {−3, −2.9, ..., 3} to have 61 values for σ; hence we had a total of 610 parameter configurations, each pairing one of the 10 penalty values with one of the 61 values of σ. We compared the test errors and machine runtimes for various versions of the CVST algorithm. We obtained those CVST algorithms by replacing the Cochran test in both Algorithms 2.2 and 2.4 with the F-test and with the Wald test. The Wald test involves matrix inversion, which is computationally intensive; hence we did not expect it to improve the algorithm's runtime. The objective of using the Wald test was to observe whether it could improve the accuracy of the fast cross validation algorithm, even at the cost of speed. Figs. 3.1-3.6 show the box plots (see footnote 3.1) that compare the time and performance of the alternative statistical tests with those of the original CVST algorithm on the synthetic data obtained from (3.1). The means and medians of the ratio of the runtime of fast cross validation with the Cochran Q test to that of fast cross validation with the alternative tests (the F-test and the Wald test) are shown in Table 3.1 for all combinations of v and d, and the means and medians that are larger than 1 are bolded. Although the improvement is not considerable, the F-test performs at least as well as the Cochran test, and, as can be seen in Figs. 3.1-3.6, its test errors are in the same range as the errors of the fast cross validation algorithm with Cochran's test. On the other hand, fast cross validation using the Wald test needs significantly larger computation times, while its error performance is sometimes slightly worse and sometimes slightly better than that of the original fast cross validation algorithm.

3.1 On each box, the central mark is the 50th percentile (i.e., the median), the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points that are not outliers, and the outliers are plotted individually.
In order to test the algorithm on real data, we used data sets from the UCI Machine Learning Repository [27] and the Machine Learning Data Repository [1]. Namely, we used the following data sets for binary classification: breast cancer, banana, bank note authentication, Pima Indian diabetes, and image. We used approximately 50% of the data for cross validation and selecting the best parameter configuration, and the rest for testing. For each data set, we chose 61 points for log10 σ in [log10 σmin, log10 σmax] uniformly, and 10 points for log10 of the penalty parameter over the corresponding range, uniformly. Those limits were selected by inspecting the data and training some SVMs with various parameters, to determine which ranges are meaningful. The number of data points used for training and testing, the parameter ranges, and the number of input attributes for each data set are summarized in Table 3.2. We ran the fast cross validation algorithm 50 times using the F-test and the Wald test. The results are shown in Figs. 3.7-3.11 using box plots. The means and medians of the corresponding time ratios are summarized in Table 3.3, and the means and medians that are larger than 1 are bolded. The F-test shows a rather significant improvement in time over the Cochran test for the breast cancer data set and the Pima Indian diabetes data set. In the other cases, although the improvement is not that significant, one can at least claim that the F-test works as effectively as the Cochran Q test. The Wald test increases the times significantly, and overall its error performance is slightly worse than that of the Cochran Q test.
Figure 3.1: Fast Cross Validation for SVM and noisy sine data with d = 5 and v = 0.04. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Table 3.1: Means and medians of time ratios for each of the alternative hypothesis tests for classification of noisy sine data

                  d = 5                        d = 50                       d = 100
           v = 0.04      v = 0.25      v = 0.04      v = 0.25      v = 0.04      v = 0.25
           mean  median  mean  median  mean  median  mean  median  mean  median  mean  median
F        1.0618 1.0076 1.0320 0.9992 1.0372 1.0053 1.1046 1.0896 1.0703 1.0091 1.0877 1.0809
Wald     0.0301 0.0297 0.0298 0.0297 0.0271 0.0265 0.0295 0.0296 0.0272 0.0268 0.0301 0.0301
Figure 3.2: Fast Cross Validation for SVM and noisy sine data with d = 5 and v = 0.25. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Table 3.2: Properties of the binary classification data sets

Data set               Attributes  Train  Test  log10 σ (min, max)  log10 penalty (min, max)
Breast cancer               9       130    133      (-3, 0)              (0.01, 0.2)
Banana                      2      2650   2650      (-3, 3)              (0.01, 0.2)
Bank notes                  4       680    692      (-3, 3)              (0.01, 2)
Pima Indians diabetes       8       380    388      (-3, 3)              (0.01, 2)
Image                      18      1040   1046      (-1, 3)              (0.01, 0.2)
Figure 3.3: Fast Cross Validation for SVM and noisy sine data with d = 50 and v = 0.04. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Table 3.3: Means and medians of time ratios for each of the alternative hypothesis tests for classification of benchmark data

        Breast cancer     Banana          Bank note       Pima Indian diabetes   Image
        mean   median     mean   median   mean   median   mean   median          mean   median
F      2.0255  1.8315    1.0231  1.0139  1.0318  0.9999  1.4573  1.0217         1.0448  0.9886
Wald   0.0418  0.0367    0.0355  0.0351  0.0131  0.0131  0.0551  0.0378         0.0335  0.0326
Figure 3.4: Fast Cross Validation for SVM and noisy sine data with d = 50 and v = 0.25. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.5: Fast Cross Validation for SVM and noisy sine data with d = 100 and v = 0.04. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.6: Fast Cross Validation for SVM and noisy sine data with d = 100 and v = 0.25. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.7: Fast Cross Validation for SVM and breast cancer data. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.8: Fast Cross Validation for SVM and banana data. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.9: Fast Cross Validation for SVM and bank note authentication data. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.10: Fast Cross Validation for SVM and Pima Indian diabetes data. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
Figure 3.11: Fast Cross Validation for SVM and image data. a) Time for each of the tests. b) Accuracies for each of the tests. c) Ratio of the time of the Cochran algorithm to the F-test algorithm, and of the Cochran algorithm to the Wald-test algorithm. d) Difference between the accuracies of each of the algorithms and the Cochran algorithm.
3.2 Alternative Statistical Tests to Friedman’s Test for Fast Cross
Validation with Sequential Testing
Various alternatives to Friedman's test [14, 15] have been proposed in the literature [2, 13]. Using the F-test as an alternative to Friedman's test was studied in [12, 23]. The Friedman Aligned Ranks test [19, 33] and the Quade test [34] have also been proposed as alternatives to Friedman's test. These tests are described in Appendix B.
In this section, we examine the time and performance of fast cross validation for parameter selection in regression. The regression model that we use is Kernel Ridge Regression (KRR) [17], an extension of the linear ridge regression method [20, 21] using kernel methods. We try to determine the ridge parameter λ and the spread parameter σ of the Gaussian kernels.
First, we examine the method on synthetic data. We use the following noisy sinc data, which
comprises a sinc function contaminated by a high-frequency sinusoid and Gaussian noise:
y = \operatorname{sinc}(4x) + \frac{\sin(15x)}{5} + \varepsilon, \quad \varepsilon \sim N(0, v), \; x \in [-\pi, \pi], \; v \in \{0.01, 0.04\}, \; d \in \{2, 3, 4\} \qquad (3.2)
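A minimal sketch of generating this data set follows; the normalization sinc(t) = sin(t)/t and the reading of v as the noise variance are assumptions, and d, which does not appear explicitly in (3.2) as printed, is omitted.

```python
import numpy as np

def noisy_sinc(n, v, seed=None):
    """Noisy sinc data of eq. (3.2): a sinc function plus a high-frequency
    sinusoid and Gaussian noise, with inputs drawn uniformly from [-pi, pi].
    np.sinc(z) = sin(pi*z)/(pi*z), so np.sinc(4x/pi) equals sin(4x)/(4x)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.pi, np.pi, size=n)
    y = (np.sinc(4.0 * x / np.pi) + np.sin(15.0 * x) / 5.0
         + rng.normal(0.0, np.sqrt(v), size=n))
    return x.reshape(-1, 1), y
```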
We generated 11000 data points for each model and ran various 10-step CVST algorithms on a randomly selected subset of 1000 data points, then tested the KRR with the parameter configuration selected by CVST on the remaining 10000 data points. The parameters for all of the CVSTs were α = 0.05, β_l = 0.01, α_l = 0.1, and w_stop = 3. We recorded the test error and the time consumed for selecting the parameter configuration. We chose log10 λ ∈ {−7, −6, ..., 2} to have 10 values for λ, and log10 σ ∈ {−3, −2.9, ..., 3} to have 61 values for σ; hence we had a total of 610 parameter configurations, each pairing one of the 10 values of λ with one of the 61 values of σ. We compared the test errors and machine runtimes for various versions of the CVST algorithm. We obtained those CVST algorithms by replacing the Friedman test in Algorithm 2.2 with the F, Friedman Aligned Ranks, and Quade tests, and by replacing the Cochran test in Algorithm 2.4 with the F-test. The CVST algorithms that we compared are summarized in Table 3.4.

Figs. 3.12-3.17 show the box plots that compare the time and performance of the alternative statistical tests with those of the original CVST algorithm on the synthetic data obtained from (3.2). The means and medians of the ratio of the runtime of the original Friedman-Cochran algorithm to that of fast cross validation with the alternative tests are shown in Table 3.5 for all combinations of v and d. To indicate which algorithms show the largest time-ratio mean and median, the largest mean and median for each combination of d and v are shown in bold. From Table 3.5, one can observe that substituting both the Friedman and Cochran tests with the F-test gives the largest means and medians in most cases. Although the improvement in algorithm time is not significant using the F-test, one can claim that it works at least as well as the original CVST algorithm that uses the Friedman test and the Cochran test on synthetic data. Note that the MSEs associated with all of the algorithms in Figs. 3.12-3.17 are almost in the same range.
In order to test the algorithm on real data, we used data sets from the UCI Machine Learning Repository [27] and the Machine Learning Data Repository [1]. Namely, we used the following data sets for regression: Boston housing, combined cycle power plant, airfoil self-noise, concrete strength, Auto Mpg, and yacht hydrodynamics. We used approximately 50% of the data for cross validation and selecting the best parameter configuration, and the rest for testing. For each data set, we chose 61 points for log10 σ in [log10 σmin, log10 σmax] uniformly, and 10 points for log10 λ in [log10 λmin, log10 λmax] uniformly. Those limits were selected by inspecting the data and training some kernel ridge regression models with various parameters, to determine which ranges are meaningful. The number of data points used for training and testing, the ranges log10 σmin, log10 σmax, log10 λmin, log10 λmax, and the number of input attributes for each data set are summarized in Table 3.6. We ran the fast cross validation algorithm 50 times using the various alternative tests. The results are shown in Figs. 3.18-3.23 using box plots. The means and medians of the corresponding time ratios are summarized in Table 3.7, and the largest means and medians are shown in bold. The ad-hoc test that replaces both Friedman's and Cochran's tests with the F-test shows some improvement in time in most cases. Substituting the Friedman test with the Friedman Aligned Ranks test improves the runtime of the algorithm on the combined cycle power plant and airfoil self-noise data sets. The errors of the algorithms over all data sets are almost the same, except on the housing and yacht hydrodynamics data sets. On the housing data set, the time improvement with the ad-hoc F-test method is significant; however, the errors of the ad-hoc F-test are significantly higher than those of the other algorithms. On the yacht hydrodynamics data set, on the other hand, the ad-hoc F-test gives reasonable errors, while the median ratio of the time of the original CVST to that of the ad-hoc F-test CVST is more than 1.5.
Table 3.4: Variations of the CVST algorithm for regression

Algorithm              TOPCONFIGURATIONS        SIMILARPERFORMANCE
Friedman-Cochran [25]  Friedman                 Cochran
Friedman-F             Friedman                 F
Ad-hoc F-F             F                        F
F-Aligned-Cochran      Friedman Aligned Ranks   Cochran
Quade-Cochran          Quade                    Cochran
Figure 3.12: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 2 and v = 0.01. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.13: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 2 and v = 0.04. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.14: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 3 and v = 0.01. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.15: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 3 and v = 0.04. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.16: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 4 and v = 0.01. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.17: Fast Cross Validation for Kernel Ridge Regression on noisy sinc data with d = 4 and v = 0.04. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Table 3.5: Means and medians of time ratios for each of the alternative hypothesis tests for regression of noisy sinc data

                        d = 2                        d = 3                        d = 4
                 v = 0.01      v = 0.04      v = 0.01      v = 0.04      v = 0.01      v = 0.04
                 mean  median  mean  median  mean  median  mean  median  mean  median  mean  median
Friedman-F      0.9997 0.9991 1.0015 1.0007 0.9996 1.0001 1.0354 1.0195 1.0199 1.0085 1.0244 1.0087
Ad-hoc F-F      1.0514 1.0239 1.1890 1.0120 1.0800 1.0111 1.2172 1.0320 1.0754 1.0184 1.1299 1.0139
F-Aligned-Cochran 1.0087 1.0237 1.0572 1.0021 1.0017 1.0011 1.0189 1.0020 1.0218 1.0171 1.0699 1.0144
Quade-Cochran   1.0172 1.0254 0.9742 0.9995 1.0054 0.9953 1.0218 1.0163 1.0344 1.0200 1.0674 1.0015
Table 3.6: Properties of the benchmark regression data sets

Data set             Attributes  Train  Test  log10 σ (min, max)  log10 λ (min, max)
Boston Housing           13       250    256      (-3, 3)             (-7, 2)
Power Plant               5      4780   4788      (-3, 3)             (-1, 0.2)
Airfoil                   5       750    753      (-3, 3)             (-7, 0)
Concrete Strength         8       500    530      (-3, 3)             (1, 2)
Auto Mpg                  7       190    202      (-3, 3)             (-7, 2)
Yacht Hydrodynamics       6       150    158      (-3, 3)             (-0.6, 1)
Table 3.7: Means and medians of time ratios for each of the alternative hypothesis tests for regression of benchmark data

                   Boston Housing  Power Plant    Airfoil        Concrete       Auto Mpg       Yacht
                   mean   median   mean   median  mean   median  mean   median  mean   median  mean   median
Friedman-F        1.0253  1.0094  1.0089  1.0007 1.0241  1.0073 1.0069  1.0016 1.0095  0.9859 1.0260  1.0287
Ad-hoc F-F        1.8428  1.8365  1.0222  1.0237 1.0204  1.0039 1.0822  1.0240 1.1703  1.0694 1.8227  1.7431
F-Aligned-Cochran 1.1040  1.0353  1.0298  1.0350 1.0385  1.0304 1.0430  1.0191 1.0332  1.0217 1.1627  1.0499
Quade-Cochran     1.1133  1.0355  1.0280  1.0322 1.0345  1.0063 1.0286  1.0159 1.0531  1.0145 1.0267  0.9632
Figure 3.18: Fast Cross Validation for Kernel Ridge Regression on housing data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.19: Fast Cross Validation for Kernel Ridge Regression on power plant data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.20: Fast Cross Validation for Kernel Ridge Regression on airfoil data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.21: Fast Cross Validation for Kernel Ridge Regression on concrete strength data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.22: Fast Cross Validation for Kernel Ridge Regression on Auto Mpg data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Figure 3.23: Fast Cross Validation for Kernel Ridge Regression on yacht hydrodynamics data. a) Time for each of the tests. b) MSEs for each of the tests. c) The ratios of the time of the Friedman-Cochran algorithm to each of the other algorithms. d) Difference between the MSEs of each of the algorithms and the Friedman-Cochran algorithm.
Chapter 4
Epilogue: Conclusions and Future Work
In this thesis, we examined the application of different sequential hypothesis tests to fast cross validation. We showed, using both synthetic and real data sets, that replacing the Cochran test with the F-test does not diminish the performance of the fast cross validation algorithm for binary classification tasks. This conforms to Cochran's conjecture about the validity of using an F-test for binary data, at least in the context of using sequential analysis for fast cross validation. On the other hand, the Wald test improves neither the time nor the accuracy of the fast cross validation algorithm.
We also demonstrated that an ad-hoc F-test, which replaces both the Friedman and Cochran tests with the F-test, can be slightly more effective than the original CVST algorithm in cross validation over synthetic and real data. Using the Friedman Aligned Ranks test instead of the Friedman test also showed some improvement in the runtime of the fast cross validation algorithm.
Future work should be devoted to applying the sequential hypothesis testing methods studied in this thesis to variable selection, as well as to early stopping of resampling methods.
Appendix A
Non-parametric Tests for Evaluation of Classification
Models
THIS chapter briefly describes the non-parametric tests that can be used for the evaluation of classification models. For simplicity, we avoid using the language of analysis of variance: instead of "treatment" we use "parameter configuration", and instead of "observation" we use "performance", because the performance of a parameter configuration on a data point is the observation in the framework of this thesis. Note that for a classification problem, the performance of the classifier is binary: it is 0 in case of correct classification and 1 in case of incorrect classification.
A.1 Cochran Q Test
The Cochran Q test [11] is an analogue of Friedman's test for the case in which the observations are binary. Therefore, it can test whether the parameter configurations of a classifier have equal expected performance [28].
The test statistic for the Cochran test is:

Q = r(r-1) \, \frac{\sum_{c=1}^{r} \left( p_{c\cdot} - p_{\cdot\cdot}/r \right)^{2}}{\sum_{i=1}^{k} p_{\cdot i} \, (r - p_{\cdot i})} \qquad (A.1)

where p_{ci} is the performance of the c-th parameter configuration observed in the i-th step, p_{c\cdot} = \sum_{i=1}^{k} p_{ci}, p_{\cdot i} = \sum_{c=1}^{r} p_{ci}, and p_{\cdot\cdot} = \sum_{c=1}^{r} \sum_{i=1}^{k} p_{ci}. Here r and k are respectively the number of parameter configurations and the number of steps performed so far.

Q in (A.1) has a χ² distribution with r − 1 degrees of freedom under the null hypothesis that all the parameter configurations are expected to have similar performance.
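A minimal sketch of computing Q and its p-value follows, using SciPy's chi-squared survival function.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(P):
    """Cochran's Q test of (A.1) on a binary performance matrix P with one row
    per parameter configuration (r rows) and one column per step (k columns)."""
    P = np.asarray(P)
    r, _ = P.shape
    row = P.sum(axis=1)   # p_{c.}
    col = P.sum(axis=0)   # p_{.i}
    tot = P.sum()         # p_{..}
    Q = r * (r - 1) * np.sum((row - tot / r) ** 2) / np.sum(col * (r - col))
    return Q, chi2.sf(Q, r - 1)
```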
A.2 The F-Test for Evaluation of Classifiers
Cochran recommends using the F-test even when the observations are binary [11]:
If the data had been measured variables that appeared normally distributed, instead
of a collection of 1’s and 0’s, the F-test would be almost automatically applied as
the appropriate method. Without having looked into the matter, I had once or twice
suggested to research workers that the F-test might serve as an approximation even
when the table consists of 1’s and 0’s. As a testimony to the modern teaching of
statistics, this suggestion was received with incredulity, the objection being made
that the F-test requires normality, and that a mixture of 1’s and 0’s could not by any
stretch of the imagination be regarded as normally distributed. The same workers
raised no objection to a \chi^2 test, not having realized that both tests require to some
extent an assumption of normality, and that it is not obvious whether F or \chi^2 is more
sensitive to the assumption. Inclusion of the F-test is also worth while in view of the
widespread interest in the application of the analysis of variance to non-normal data.
Per Cochran’s recommendation, the use of the F-test for testing the hypothesis of similarity of
classifiers (binary observations) was studied in [26, 28, 32, 38, 44]. The details of the F-test are
given in Section B.1.
A.3 Testing Using Wald Statistics
Bhapkar [8] suggests that the following Wald statistic be used as an alternative to the Cochran Q
statistic:

W = \left(\sum_{i=1}^{c}\sum_{j=1}^{c} a_{ij}\, p_{i\cdot}\, p_{j\cdot}\right) - \frac{\left(\sum_{i=1}^{c}\sum_{j=1}^{c} p_{i\cdot}\, a_{ij}\right)^{2}}{\sum_{i=1}^{c}\sum_{j=1}^{c} a_{ij}}    (A.2)
where:

[a_{ij}] = \left[T_{ij} - p_{i\cdot}\, p_{j\cdot}\, k\right]^{-1}    (A.3)
and

T_{ij} =
\begin{cases}
p_{i\cdot} + p_{j\cdot} & i \neq j \\
p_{i\cdot} & i = j
\end{cases}    (A.4)
W has a \chi^2 distribution with c-1 degrees of freedom under the null hypothesis that all the
parameter configurations act similarly.
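The following Python sketch implements (A.2)-(A.4) exactly as reconstructed above; since the
extraction of these formulas is partly ambiguous, it should be read as a sketch of our reading of the
formulas rather than a reference implementation of Bhapkar's test, and the function name bhapkar_w
is ours.

import numpy as np
from scipy.stats import chi2

def bhapkar_w(p):
    # p: c x k array of 0/1 performances; rows are the c parameter
    # configurations compared by the test, columns are the k steps.
    c, k = p.shape
    p_dot = p.sum(axis=1)                          # row totals p_{i.}
    t = np.add.outer(p_dot, p_dot).astype(float)   # p_{i.} + p_{j.} for i != j
    np.fill_diagonal(t, p_dot)                     # p_{i.} on the diagonal, (A.4)
    a = np.linalg.inv(t - np.outer(p_dot, p_dot) * k)               # (A.3)
    w = p_dot @ a @ p_dot - (p_dot @ a.sum(axis=1)) ** 2 / a.sum()  # (A.2)
    return w, chi2.sf(w, c - 1)                    # p-value with c-1 df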
Appendix B
Non-parametric Tests for Evaluation of Regression
Models
WE briefly describe the non-parametric tests that can be used for the evaluation of regression
and classification problems. For simplicity, we avoid the language of analysis of variance: instead
of treatment, we use parameter configuration, and instead of observation, we use performance,
because the performance of the parameter configurations on each of the data points is the
observation in the framework of this thesis.
B.1 The F-Test
The F-test [37] is a statistical test for the hypothesis that the means of a set of normally distributed
populations, all having the same variance, are equal. Therefore, it can be used to see whether
the expected performances of the parameter configurations in multiple steps are equal. The test
statistic for the F-test is:
F = \frac{\sum_{c=1}^{r}\sum_{i=1}^{k}\left(p_{c\cdot} - p_{\cdot\cdot}\right)^{2} / (r-1)}{\sum_{c=1}^{r}\sum_{i=1}^{k}\left(p_{ci} - p_{c\cdot}\right)^{2} / \left(r(k-1)\right)}    (B.1)
where p_{ci} is the observed performance of the c-th parameter configuration in the i-th step of the
algorithm, r is the number of parameter configurations, and k is the number of data points on which
the performance of the algorithm was tested. Moreover, p_{c\cdot} = \frac{1}{k}\sum_{i=1}^{k} p_{ci} and p_{\cdot\cdot} = \frac{1}{rk}\sum_{c=1}^{r}\sum_{i=1}^{k} p_{ci}.
The statistic in (B.1) has an F distribution with r-1 and r(k-1) degrees of freedom under
the null hypothesis that the expected performance of all of the parameter configurations is equal.
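As a quick illustration, (B.1) can be computed as follows in Python. This is a sketch under the
assumption that the performances are stored in an r x k NumPy array; the function name f_test is
ours.

import numpy as np
from scipy.stats import f

def f_test(p):
    # p: r x k array of performances; rows are configurations, columns steps.
    r, k = p.shape
    p_c = p.mean(axis=1, keepdims=True)   # configuration means p_{c.}
    p_all = p.mean()                      # grand mean p_{..}
    ms_between = k * np.sum((p_c - p_all) ** 2) / (r - 1)   # numerator of (B.1)
    ms_within = np.sum((p - p_c) ** 2) / (r * (k - 1))      # denominator of (B.1)
    stat = ms_between / ms_within
    return stat, f.sf(stat, r - 1, r * (k - 1))   # statistic and p-value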
It is well known that the F-test is sensitive to heteroscedasticity, i.e., inequality of the
variances, and to non-normality. One therefore has to assume that the performances of the parameter
configurations are normally distributed and have equal variances, which are strong assumptions.
For this reason, non-parametric statistical tests [16] such as the Friedman test (and its
generalizations) were devised. Since such tests use the ranks of the data points instead of their
values, they are distribution-free. We study the Friedman test, the Friedman Aligned Ranks test, and
the Quade test in the sequel.
B.2 The Friedman Test
The Friedman test [14, 15] involves calculating a statistic based on the ranks of the performances of
the parameter configurations observed in each step of the CVST algorithm.
Assume:

R_{\cdot c} = \frac{1}{k}\sum_{i=1}^{k} R(p_{ci}), \quad c = 1, 2, \ldots, r    (B.2)

X_{\mathrm{Friedman}} = \frac{12k}{r(r+1)}\left[\sum_{c=1}^{r} R_{\cdot c}^{2} - \frac{r(r+1)^{2}}{4}\right]    (B.3)
where R(p_{ci}) is the rank of the performance of the c-th parameter configuration among the r
performances observed in step i, r is the number of parameter configurations, and k is the number of
data points on which the performance of the algorithm was examined, i.e., the number of observed
performances. X_{\mathrm{Friedman}} has a \chi^2 distribution with r-1 degrees of freedom.
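A minimal sketch of (B.2)-(B.3) in Python follows; friedman_stat is our name, and the SciPy routine
scipy.stats.friedmanchisquare computes the same statistic (up to tie corrections) and can serve as a
cross-check, with each row of the array passed as one sample.

import numpy as np
from scipy.stats import chi2, rankdata

def friedman_stat(p):
    # p: r x k array of performances; ranks are taken within each step (column).
    r, k = p.shape
    ranks = np.apply_along_axis(rankdata, 0, p)   # R(p_ci) within each column
    r_c = ranks.mean(axis=1)                      # mean ranks R_{.c} of (B.2)
    stat = 12 * k / (r * (r + 1)) * (np.sum(r_c ** 2) - r * (r + 1) ** 2 / 4)
    return stat, chi2.sf(stat, r - 1)             # p-value with r-1 df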
B.3 Friedman Aligned Ranks Test
In the Friedman Aligned Ranks test [19, 33], a variant of the Friedman test, first the average
performance observed from all of the configurations at each step is calculated:

p_{\cdot i} = \frac{1}{r}\sum_{c=1}^{r} p_{ci}    (B.4)
Then the observed performances are centered around zero, as:

p'_{ci} = p_{ci} - p_{\cdot i}    (B.5)
Then, the p'_{ci} are ranked:

\hat{R}_{ci} = R(p'_{ci})    (B.6)
where \hat{R}_{ci} = R(p'_{ci}) is the rank of p'_{ci} among all rk of the centered observations, i.e., the
aligned rank of p_{ci}.
The aligned ranks test computes the following statistic:

X_{\mathrm{AR}} = \frac{(r-1)\left[\sum_{c=1}^{r} \hat{R}_{c\cdot}^{2} - (rk^{2}/4)(rk+1)^{2}\right]}{\left[rk(rk+1)(2rk+1)/6\right] - (1/r)\sum_{i=1}^{k} \hat{R}_{\cdot i}^{2}}    (B.7)
where \hat{R}_{c\cdot} = \sum_{i=1}^{k} \hat{R}_{ci} and \hat{R}_{\cdot i} = \sum_{c=1}^{r} \hat{R}_{ci}. X_{\mathrm{AR}} has a \chi^2 distribution with r-1 degrees of
freedom under the null hypothesis that the expected performance of all of the parameter
configurations is equal.
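The aligned-ranks computation can be sketched in Python as follows, again assuming an r x k array of
performances; friedman_aligned is our name, and the degrees of freedom follow the form of (B.7)
given above.

import numpy as np
from scipy.stats import chi2, rankdata

def friedman_aligned(p):
    # p: r x k array; center each step (column), then rank all rk values jointly.
    r, k = p.shape
    aligned = p - p.mean(axis=0)              # (B.4)-(B.5)
    r_hat = rankdata(aligned).reshape(r, k)   # aligned ranks of (B.6)
    r_c = r_hat.sum(axis=1)                   # configuration rank totals
    r_i = r_hat.sum(axis=0)                   # step rank totals
    num = (r - 1) * (np.sum(r_c ** 2) - (r * k ** 2 / 4) * (r * k + 1) ** 2)
    den = r * k * (r * k + 1) * (2 * r * k + 1) / 6 - np.sum(r_i ** 2) / r
    return num / den, chi2.sf(num / den, r - 1)   # statistic and p-value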
B.4 Quade Test
In the Quade test, the rankings computed at each step are scaled by the range of the performances
observed within that step; hence, the Quade test uses weighted rankings. Assume that R(p_{ci}) is the
rank of the performance of the c-th parameter configuration on the i-th data point. The range of the
performances observed within step i is the difference between the largest and the smallest
performances within that step:

\Delta_i = \max_{1 \le c \le r} p_{ci} - \min_{1 \le c \le r} p_{ci}    (B.8)
Assume that the rank of the range of the i-th step is R(\Delta_i), and let:

Q_{ci} = R(\Delta_i)\left[R(p_{ci}) - \frac{r+1}{2}\right]    (B.9)
Also:

Q_{c\cdot} = \sum_{i=1}^{k} Q_{ci}    (B.10)
Also, define:

A = k(k+1)(2k+1)\, r(r+1)(r-1)/72    (B.11)

B = \frac{1}{k}\sum_{c=1}^{r} Q_{c\cdot}^{2}    (B.12)
The test statistic is:

F_Q = \frac{(k-1)B}{A - B}    (B.13)
The test statistic F_Q has an F distribution with r-1 and (k-1)(r-1) degrees of freedom under
the null hypothesis that the expected performance of all of the parameter configurations is equal.
When A = B, the p-value is (1/r!)^{k-1}.
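Finally, a sketch of (B.8)-(B.13) in Python; quade_stat is our name, and (B.11) is used in its
no-ties form as written above.

import numpy as np
from scipy.stats import f, rankdata

def quade_stat(p):
    # p: r x k array of performances; rows are configurations, columns steps.
    r, k = p.shape
    ranks = np.apply_along_axis(rankdata, 0, p)   # R(p_ci) within each step
    ranges = p.max(axis=0) - p.min(axis=0)        # Delta_i of (B.8)
    r_delta = rankdata(ranges)                    # R(Delta_i) across the k steps
    q = r_delta * (ranks - (r + 1) / 2)           # (B.9)
    q_c = q.sum(axis=1)                           # (B.10)
    a = k * (k + 1) * (2 * k + 1) * r * (r + 1) * (r - 1) / 72   # (B.11)
    b = np.sum(q_c ** 2) / k                      # (B.12)
    stat = (k - 1) * b / (a - b)                  # (B.13)
    return stat, f.sf(stat, r - 1, (r - 1) * (k - 1))   # statistic and p-value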
BIBLIOGRAPHY
[1] Machine Learning Data Repository. [Online]. Available: mldata.org
[2] S. A. Amanchi, “Applied nonparametric statistical tests to compare evolutionary and swarm
intelligence approaches,” Master’s thesis, North Dakota State University, 2014.
[3] S.-I. Amari, N. Murata, K.-R. Müller, M. Finke, and H. H. Yang, “Asymptotic statistical
theory of overtraining and cross-validation,” IEEE Transactions on Neural Networks, vol. 8,
no. 5, pp. 985–996, 1997.
[4] S. Arlot and A. Celisse, “A survey of cross-validation procedures for model selection,”
Statistics Surveys, vol. 4, pp. 40–79, 2010.
[5] Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of k-fold cross-
validation,” The Journal of Machine Learning Research, vol. 5, pp. 1089–1105, 2004.
[6] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” The Journal
of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
[7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter opti-
mization,” in Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.
[8] V. P. Bhapkar, “On the comparison of proportions in matched samples,” Sankhyā: The Indian
Journal of Statistics, Series A, pp. 341–356, 1973.
[9] B. Brown, M. Chui, and J. Manyika, “Are you ready for the era of ‘big data’?” McKinsey
Quarterly, vol. 4, pp. 24–35, 2011.
[10] H. Chernoff, Sequential analysis and optimal design. SIAM, 1972.
[11] W. G. Cochran, “The comparison of percentages in matched samples,” Biometrika, pp. 256–
266, 1950.
[12] W. Conover and R. L. Iman, “On some alternative procedures using ranks for the analysis
of experimental designs,” Communications in Statistics-Theory and Methods, vol. 5, no. 14,
pp. 1349–1368, 1976.
[13] J. Derrac, S. García, D. Molina, and F. Herrera, “A practical tutorial on the use of nonpara-
metric statistical tests as a methodology for comparing evolutionary and swarm intelligence
algorithms,” Swarm and Evolutionary Computation, vol. 1, no. 1, pp. 3–18, 2011.
[14] M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis
of variance,” Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701,
1937.
[15] ——, “A comparison of alternative tests of significance for the problem of m rankings,” The
Annals of Mathematical Statistics, vol. 11, no. 1, pp. 86–92, 1940.
[16] J. D. Gibbons and S. Chakraborti, Nonparametric statistical inference. Springer, 2011.
[17] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed.
Springer, 2009.
[18] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[19] J. Hodges and E. L. Lehmann, “Rank methods for combination of independent experiments
in analysis of variance,” The Annals of Mathematical Statistics, vol. 33, no. 2, pp. 482–497,
1962.
[20] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal
problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[21] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The
Annals of Statistics, pp. 1171–1220, 2008.
[22] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for
general algorithm configuration,” in Learning and Intelligent Optimization. Springer, 2011,
pp. 507–523.
[23] R. L. Iman and J. M. Davenport, “Approximations of the critical region of the Friedman
statistic,” Communications in Statistics-Theory and Methods, vol. 9, no. 6, pp. 571–595,
1980.
[24] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model se-
lection,” in Proceedings of International Joint Conference on Artificial Intelligence, vol. 14,
no. 2, 1995, pp. 1137–1145.
[25] T. Krueger, D. Panknin, and M. Braun, “Fast cross-validation via sequential testing,” arXiv
preprint arXiv:1206.2248, 2012.
[26] K. J. Levy and S. C. Narula, “An empirical comparison of several methods for testing the
equality of dependent proportions,” Communications in Statistics-Simulation and Computa-
tion, vol. 5, no. 4, pp. 189–195, 1976.
[27] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available:
http://archive.ics.uci.edu/ml
[28] S. W. Looney, “A statistical technique for comparing the accuracies of several classifiers,”
Pattern Recognition Letters, vol. 8, no. 1, pp. 5–9, 1988.
[29] C. Lynch, “Big data: How do your data grow?” Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[30] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big
data: The next frontier for innovation, competition, and productivity,” McKinsey Global
Institute, Tech. Rep., May 2011.
[31] A. McAfee and E. Brynjolfsson, “Big data: the management revolution,” Harvard Business
Review, no. 90, pp. 60–6, 2012.
[32] J. L. Myers, J. V. DiCecco, J. B. White, and V. M. Borden, “Repeated measurements of
dichotomous variables: Q and F tests,” Psychological Bulletin, vol. 92, no. 2, p. 517, 1982.
[33] T. W. O’Gorman, “A comparison of the F-test, Friedman’s test, and several aligned rank tests
for the analysis of randomized complete blocks,” Journal of Agricultural, Biological, and
Environmental Statistics, vol. 6, no. 3, pp. 367–378, 2001.
[34] D. Quade, “Using weighted rankings in the analysis of complete blocks with additive block
effects,” Journal of the American Statistical Association, vol. 74, no. 367, pp. 680–683,
1979.
[35] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” in Encyclopedia of Database Sys-
tems. Springer, 2009, pp. 532–538.
[36] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of k-fold cross valida-
tion in prediction error estimation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 32, no. 3, pp. 569–575, 2010.
[37] G. A. Seber and A. J. Lee, Linear regression analysis, 2nd ed. John Wiley & Sons, 2003.
[38] P. Seeger and A. Gabrielsson, “Applicability of the Cochran Q test and the F test for statistical
analysis of dichotomous data for dependent samples,” Psychological Bulletin, vol. 69, no. 4,
p. 269, 1968.
[39] D. Siegmund, Sequential analysis: Tests and confidence intervals. Springer Science &
Business Media, 1985.
[40] K. S. Fu, Sequential methods in pattern recognition and machine learning. Academic
Press, 1968.
[41] V. N. Vapnik, Statistical learning theory. Wiley, New York, 1998, vol. 1.
[42] A. Wald, Sequential analysis. Wiley, 1947.
[43] A. Wald and J. Wolfowitz, “Optimum character of the sequential probability ratio test,” The
Annals of Mathematical Statistics, pp. 326–339, 1948.
[44] B. J. Winer, D. R. Brown, and K. M. Michels, Statistical principles in experimental design.
McGraw-Hill New York, 1971, vol. 2.
Abstract
In this thesis, we examine the application of various sequential hypothesis tests to fast cross validation for model selection. Fast cross validation can be utilized to select hyperparameters of classifiers and regression models when dealing with very large amounts of data. We examine the performance of the F-test as well as the Wald test versus the Cochran test for classification tasks. We also examine the performance of the F-test, the Friedman Aligned Ranks test, and the Quade test versus the Friedman test for the selection of hyperparameters of regression models. We demonstrate our results on synthetic and real data sets. We show that replacing the Cochran test in classification tasks with the F-test does not diminish the performance of fast cross validation with sequential hypothesis testing, whereas a Wald test yields higher times for the fast cross validation algorithm without actually improving its accuracy. For regression tasks, replacing both Friedman’s and Cochran’s tests with the F-test yields slightly better times for the algorithms, although for one of the data sets it comes with higher inaccuracy of the algorithm.