Essays on Factors in High-Dimensional Regression Settings
by
Jeong Sang Yoo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ECONOMICS)
August 2023
Copyright 2023 Jeong Sang Yoo
Acknowledgments
I want to thank everyone for their support. Some of the professors I would like to mention are Hashem Pesaran, Roger Moon, Wayne Ferson, Robert Dekle, David Zeke, Cheng Hsiao, Geert Ridder, Pablo Kurlat, Caroline Betts, Romain Ranciere, Steven Sapra, Maggie Switek, and Jeff Nugent. I also greatly appreciate the support of my family and friends.
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Chapter 1: Introduction 1
2 Chapter 2: Factor Strengths in High-Dimensional Settings with Applications to Finance and Macroeconomics 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Estimation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Estimation of Factor Strengths Assuming a Model . . . . . . . . . . . 9
2.2.2 Estimator for Factor Strength in High-Dimensional Settings . . . . . 10
2.2.2.1 The Factor Strength Estimator . . . . . . . . . . . . . . . . 10
2.2.2.2 Lasso and Adaptive Lasso . . . . . . . . . . . . . . . . . . . 12
2.2.2.3 OCMT and GOCMT . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Simulation Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Monte Carlo Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Empirical Applications to Finance and Macroeconomics . . . . . . . . . . . . 28
2.4.1 Identifying risk factors in high dimensional settings . . . . . . . . . . 28
2.4.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1.2 Factor Models for Individual Securities . . . . . . . . . . . . 31
2.4.1.3 Estimates of Factor Strengths . . . . . . . . . . . . . . . . . 35
2.4.2 R-squared Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.3 Applications to Other Settings . . . . . . . . . . . . . . . . . . . . . . 47
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 Chapter 3: Dominant Species in the Factor Zoo 55
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2 OCMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.3 Algorithm for Finding R-squared Maximizing Factors . . . . . . . . . 64
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.1 $PR^2_{adj}$ for Different Models . . . . . . . . . . . . . . . . . . . . 70
3.4.2 R-Squared Maximizing Variables for All Models . . . . . . . . . . . . 72
3.4.3 Robustness Check for the Dimension Reduction Technique . . . . . . 74
3.4.4 Areas of Future Research . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4 Chapter 4: Time Varying Effects of Oil Price Shocks 83
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Time Varying Shocks using Rolling SVARs . . . . . . . . . . . . . . . . . . . 86
4.2.1 Econometric Specification for SVAR . . . . . . . . . . . . . . . . . . . 86
4.2.2 Impulse Responses from Rolling SVARs . . . . . . . . . . . . . . . . . 88
4.2.3 SVARs Controlling for Economic Conditions . . . . . . . . . . . . . . 92
4.2.3.1 Variables Representing Economic Conditions . . . . . . . . . 92
4.2.3.2 Impulse Response Functions . . . . . . . . . . . . . . . . . . 94
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Bibliography 97
A Appendix to Chapter 2 100
List of Tables
2.1 Factor Strength Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Factor Strength Estimates for Pseudo-Factors . . . . . . . . . . . . . . . . . 25
2.3 Factor Strength Estimates for Noise-Factors . . . . . . . . . . . . . . . . . . 26
2.4 Summary statistics of factor strength estimates for the 162 factors . . . . . . 37
3.1 Factors in Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Illustration of OCMT-selection-percentages . . . . . . . . . . . . . . . . . . . 67
3.3 $PR^2_{adj}$ maximizing factors . . . . . . . . . . . . . . . . . . . . . . 73
A.1 Strength Estimates using Principal Components . . . . . . . . . . . . . . . . 101
A.2 Summary statistics of factor strength estimates for the 162 factors when GOCMT was used for the strength estimator using various numbers of principal components as control variables . . . . . . . . . . . . . . . . . . . . . . . 103
A.3 Strength Estimates of Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures
2.1 Number of Firms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Number of Unobserved Common Factors . . . . . . . . . . . . . . . . . . . . 33
2.3 Strength Estimate for Market Return Factor . . . . . . . . . . . . . . . . . . 38
2.4 Strength estimates of factors with highest estimates when GOCMT(3PC) was
used for the strength estimator . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5 Strength estimates of factors with highest estimates when GOCMT(Market)
was used for the strength estimator . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Strength estimates of factors with highest estimates when Lasso was used for
the strength estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7 Pooled R-squared using the CAPM and one-factor models using factors with
highest estimates when GOCMT(3PC) was used for the strength estimator . 48
2.8 Pooled R-squared using the CAPM and one-factor models using factors with
highest estimates when GOCMT(Market) was used for the strength estimator 49
2.9 Pooled R-squared using the CAPM and one-factor models using factors with
highest estimates when Lasso was used for the strength estimator . . . . . . 50
3.1 Correlation Amongst the Factors . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Number of Unobserved Common Factors Amongst 55 Factors . . . . . . . . 69
3.3 Size of OCMT Panel-Approximating Set . . . . . . . . . . . . . . . . . . . . 71
3.4 $R^2_{adj,pooled}$ using different Factor Models . . . . . . . . . . . . . . 72
3.5 Robustness Check 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 Robustness Check 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.7 $PR^2_{adj}$ maximizing factors using strongest factors identified by different variable selection methods . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 Spot Oil Price over Years . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Rolling Impulse Responses of Variables to an Orthogonalized Oil Price Shock. 89
4.3 Rolling Impulse Responses of Variables to an Orthogonalized Oil Price Shock
when Controlling for Stock Market Volatility and Credit Spread . . . . . . . 95
Abstract
This thesis brings together three research papers. While the three papers are not directly related to one another, future research may allow them to benefit from one another. The first paper, in Chapter 2, which is the job market paper, attempts to measure the strengths of observable factors in high-dimensional regression settings. In other words, given a large number of potential factors, the paper proposes an estimator that measures the strengths of all the factors in the data. Variable selection algorithms such as OCMT, Lasso, and their variants are used to handle the large data set when estimating factor strengths. Given that researchers are beginning to recognize the presence of non-strong factors, and that data sets are becoming larger, the estimator should be useful in high-dimensional regression settings. The precision of the estimator is validated through extensive Monte Carlo experiments. It is shown that for factors with strengths between 0.5 and 1, the estimator is fairly precise. Factors with strengths below 0.5 cannot be distinguished from purely random variables, since the strength estimator returns estimates near 0.5 even for random variables that should not play any role in explaining $y_{it}$. When the estimator is applied to a large set of asset pricing factors, it is shown that when the $y_{it}$ are individual US stock returns, the Market factor is the only strong one, while most of the remaining factors have strength estimates between 0.5 and 0.8. This finding is important, since studies in the asset pricing literature are beginning to show that factor strengths matter for the precision of risk premia estimates when one uses two-pass regression methods.
The second paper is an empirical paper that attempts to find a k-factor model that maximizes pooled R-squared when the $y_{it}$ are 96 portfolio returns sorted on size and value and the factors are 56 asset pricing factors proposed in the literature to explain the cross section of expected stock returns. Examining 10-year rolling samples from July 1972 to December 2017, and examining various factor models, such as three- and four-factor models, it is shown that when the Market factor is used as a control variable, the pooled R-squared maximizing factors chosen most often across the rolling samples are size- and valuation-related factors, such as SMB, HML, and the Cash-Flow-to-Price factor. It should be noted that further analyses are needed to validate the empirical results. Perhaps another contribution of this study is introducing an algorithm to find the pooled R-squared maximizing factors when the researcher is faced with a large set of potential factors.
The final paper is an empirical paper that shows how the effects of oil price shocks change over time. Looking at 10-year rolling samples from January 1970 to December 2017, the study computes the effects of orthogonalized oil price shocks on monthly changes in industrial production, a US stock market index, and the US CPI. The study shows that the effects of oil price shocks may depend on the condition of the economy; for instance, the effects are more pronounced during the 2008 recession.
Chapter 1: Introduction
With the abundance of data, large factor models are becoming more prevalent. Here, not only is the number of y variables large, but the number of potential factors is also large. Take the example of asset pricing factors or the number of macroeconomic variables available. We now have more than 300 asset pricing factors proposed in the literature (Harvey et al., 2015; Harvey et al., 2019), and there are well over 100 macroeconomic variables, e.g., consumption, employment, price inflation (Stock and Watson, 2002; 2015). Without a strong theoretical model, a researcher facing these large sets of potential factors would find it very difficult to figure out which subset of factors yields the best factor model for explaining the y variables. Here, potential factors refer to the large number of variables from which researchers can design a factor model to explain the set of y variables.
There can be many different ways to deal with a large number of potential factors. One may look for a set of factors that affect the largest number of y variables, or one may look for a set of factors that maximize the R-squared of the panel of y variables. The strength of a factor tells us how many y variables are affected by it; that is, a stronger factor affects more y variables than a weaker one. The first study presented in this dissertation, in Chapter 2, deals with factor strengths. In particular, given a large set of potential factors, the study proposes a method to estimate strengths for all the factors in the set, no matter the size of the set. It is a direct extension of
the study by Bailey et al. (2021), which first introduced the estimator for factor strengths. The estimator in Bailey et al. (2021) can be used when a factor model is already assumed, such as a one-factor model or a three-factor model. When faced with a large set of factors, and when one wants to avoid making any assumption on the composition of a factor model, one can use the estimator proposed in Chapter 2 of this dissertation. By using variable selection algorithms, such as Lasso and GOCMT, this new estimator returns strength estimates for all the potential factors without requiring one to assume a particular factor model. It should be noted that the original estimator proposed in Bailey et al. (2021) still allows one to deal with a large set of potential factors. For instance, if there are $K_n$ potential factors, one can still obtain strength estimates for each of them by assuming $K_n$ one-factor models. Thus, the strength estimates using the new estimator presented in this dissertation can be compared to the ones obtained from the estimator in Bailey et al. (2021).
Why would we want to know strength estimates? One reason was already discussed in the paragraph above: when a theoretical model is absent and one is facing a large set of potential factors, knowing factor strengths gives a good idea as to which variables are the important ones in a panel data setting. Another reason relates to the fact that the literature is beginning to recognize the existence of factors that are not strong, for instance, weak factors that affect only a small number of y variables in the data. Onatski (2012) has shown that the well-known PC estimator is inconsistent when weak factors are included in the data. In the asset pricing literature, studies are beginning to show how the presence of weak factors may affect the precision of risk premia estimates in two-pass procedures (Anatolyev and Mikusheva, 2022; Pesaran and Smith, 2021). These studies suggest that, given a set of factors or variables, it is beneficial to know whether the data include factors that are not strong, and the strength estimator by Bailey et al. (2021) and the one proposed in Chapter 2 of this dissertation provide strength estimates of the factors in the data.
In addition to knowing the factor strengths, examining pooled R-squared can also be useful when dealing with a large set of potential factors. Pooled R-squared is defined as

$$PR^2 = 1 - \frac{SSR_{pooled}}{SST_{pooled}}, \quad \text{where } SSR_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2 \text{ and } SST_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} \left(y_{it} - \bar{y}_{iT}\right)^2,$$

$T$ represents the time series dimension, $n$ represents the number of y variables in the panel, $\bar{y}_{iT} = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, and $u_{it}$ represents the error term of the factor model corresponding to $y_{it}$. The second study in this dissertation, shown in Chapter 3, deals with finding the set of factors that maximize $PR^2$ when one is facing a large set of potential factors. For instance, one may be interested in identifying a three-factor model that maximizes $PR^2$. Rather than going through all possible combinations of three-factor models from the large set of potential factors, I propose an algorithm that significantly reduces the computation time of the search for $PR^2$-maximizing factor models. The algorithm uses a variable selection algorithm called OCMT to eliminate the unimportant variables. By focusing on the remaining variables that are deemed important, I then search for the factor model that maximizes $PR^2$. The dimension reduction by OCMT allows for an efficient search for the $PR^2$-maximizing variables. As an empirical exercise, Chapter 3 uses 96 portfolio stock returns as the y variables and 56 asset pricing factors as the large set of x variables. Across 10-year rolling samples, it is shown that SMB and valuation-related asset pricing factors are the ones most often chosen as the $PR^2$-maximizing variables.
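To make the $PR^2$ computation concrete, here is a minimal Python sketch, assuming the panel of y variables and the factor-model residuals are already arranged as T x n arrays (the function name is illustrative):

```python
import numpy as np

def pooled_r2(Y, U):
    """Pooled R-squared: Y is a T x n array of y_it, and U is the
    T x n array of factor-model residuals u_it."""
    sst = ((Y - Y.mean(axis=0)) ** 2).sum()  # SST_pooled, demeaning each unit
    ssr = (U ** 2).sum()                     # SSR_pooled
    return 1.0 - ssr / sst
```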
The final paper in this dissertation, shown in Chapter 4, is not directly related to the two studies mentioned above. It is an empirical paper that examines how the effects of oil price shocks have changed over time. The study uses 10-year rolling samples and orthogonalized oil price shocks. The other variables in the structural VAR (SVAR) are industrial production, inflation, and a US stock market index. All variables are transformed to be stationary; that is, they represent monthly rates of change. The results indicate that the effects of oil price shocks appear to be magnified during times of economic stress.
Chapter 2: Factor Strengths in High-Dimensional Settings with Applications to Finance and Macroeconomics
2.1 Introduction
Factors are used in finance and macroeconomics to explain a large set of $y_{it}$. For instance, in the asset pricing literature, three or five asset pricing factors are commonly used to explain individual stock returns, and in macroeconomics, a set of unobserved factors is believed to affect a large set of macroeconomic variables (Stock and Watson, 2015; Harvey and Liu, 2019). In both cases, most studies are either silent about the strength of factors or implicitly assume the factors to be strong; that is, if $n$ represents the cross-sectional dimension, $T$ the time series dimension, and $K$ the number of factors in the factor model, the $K$ factors are assumed to affect all $n$ of the $y_{it}$, whether these are $n$ stock returns or wage aggregates at the county level.
Assuming that the factors in the model are strong, however, may be unrealistic, given that researchers are beginning to show that not all factors are strong and, more importantly, that incorrectly assuming factor strengths may lead to erroneous conclusions in various settings (Onatski, 2012; Pesaran and Smith, 2021; Anatolyev and Mikusheva, 2022). Recognizing the significance of knowing the strength of factors, Bailey, Kapetanios, and Pesaran (2021, BKP hereafter) provide a method to estimate factor strengths. Their estimator requires one to assume a particular factor model, such as a one-factor model or a three-factor model. When a researcher faces a large number of potential factors, however, it would be beneficial to estimate factor strengths without having to assume a particular factor model. Here, potential factors refer to the large number of variables from which researchers can design a factor model to explain the set of $y_{it}$.
This paper contributes to the literature by proposing a method to estimate factor strengths in high-dimensional regression settings where not only $n$ and $T$ are large, but the number of potential factors is also large. The estimator introduced in this study is a direct extension of the estimator originally proposed in BKP (2021). Given a set of factors, which is fixed across $n$, BKP (2021) introduce an estimator that returns a value between 0 and 1, representing a continuous degree of factor strength, with 1 indicating the maximum. The main difference between the estimator in BKP (2021) and the one introduced here is that the factors are fixed in BKP (2021), so one only obtains strength estimates for the $K$ factors of each assumed model. In contrast, by using variable selection algorithms, the estimator in this paper takes into account the set of all potential factors in the data and obtains strength estimates for all of them, no matter how large the set may be. Another way to think about the difference is that one must assume a particular $K$-factor model when using the estimator of BKP (2021), while the estimator presented in this study does not require any assumption about the composition of the factor model. In other words, one simply takes the potential factors as they are and estimates factor strengths for all of them. When faced with a large
data set, such as macroeconomic variables or asset pricing factors, not assuming a particular factor model yields factor strength estimates that can provide valuable insights when compared to the strength estimates obtained using the estimator in BKP (2021).
More formally, let $K$ represent the number of factors in the factor model that the researcher assumes to affect a set of $y_{it}$. These $K$ factors will be called the signal-factors in this study. Specifically, signal-factors affect at least one $y_{it}$ in the data, and no other factors are needed to explain the set of $y_{it}$ in a factor model if all the signal-factors are included. Now, let $K_n$ represent the large number of potential factors available to the researcher. $K_n > K$, since the large set of potential factors includes the signal-factors along with many other factors that are not signal-factors, namely pseudo-factors and noise-factors. Pseudo-factors are the ones that affect none of the $y_{it}$ but are nonetheless correlated with the signal-factors, and noise-factors are purely random variables. The setting in this paper allows $K_n$ to grow along with $n$ and $T$. In standard factor models, $K$ and the factors themselves are chosen a priori by the researcher, which means that in order to consider other factors in the data, one must assume different factor models. The estimator presented in this study simply takes all $K_n$ potential factors and provides factor strength estimates without requiring any assumption on the factor model.
To briefly explain how the estimator operates, a variable selection algorithm is applied to the large set of potential factors for each $y_{it}$. Thus, with $n$ number of $y_{it}$, the variable selection algorithm is used $n$ times. The intuition is that a factor selected by the variable selection algorithm many times over the $n$ number of $y_{it}$ is considered to be stronger than one selected fewer times. Just like the estimator presented in BKP (2021), the estimator presented in this study returns a value between 0 and 1, with 1 indicating the maximum factor strength. A factor with true strength of 1 is referred to as a strong factor, or strong signal-factor; those with strengths between 0.5 and 1 are referred to as semi-strong factors, or semi-strong signal-factors; and those below 0.5 are referred to as weak factors, or weak signal-factors. These are different from pseudo-factors or noise-factors, which do not affect any $y_{it}$. The algorithms used in this paper to obtain factor strength estimates are Lasso, Adaptive Lasso, One Covariate at a Time Multiple Testing (OCMT), and Generalized OCMT (GOCMT). Adaptive Lasso and GOCMT are variants of Lasso and OCMT, respectively. OCMT is a variable selection algorithm introduced in recent years, and its performance has been shown to be comparable to that of Lasso.^1
The small sample properties of the estimator are tested extensively using Monte Carlo experiments that consider various settings where the estimator is applied to a large number of $y_{it}$ and a large number of potential factors. The estimator is expected to estimate factor strengths correctly when it does not know which factors are signal-factors and which are non-signal-factors, that is, when it does not know which are the important factors that affect the set of $y_{it}$. This setting is increasingly common in practice, where researchers face large data sets. The experiments were also designed such that the distributions of the parameters matched the empirical distributions of asset pricing factors as closely as possible.^2
The experiments show that, with large enough $n$ and $T$, and if common correlations are adequately controlled for, the factor strength estimates are fairly precise when the true factor strengths lie between 0.5 and 1. For a weak signal-factor with strength 0.45, and for pseudo-factors, the estimates come out too high, and they increase with $n$. Even for noise-factors, estimates can be as high as 0.5. These findings suggest that one should focus on factors with strength estimates sufficiently above 0.5; those with estimates below or near 0.5 are too weak to be distinguished from purely random factors.
For an empirical exercise, the estimator is applied to 162 factors that have appeared in studies in the asset pricing literature. Harvey and Liu (2019) document that, over the years, more than 300 potential asset pricing factors have been proposed to explain the cross section of expected stock returns. For these observed factors, the literature is either silent about factor strength or implicitly assumes that they are strong, i.e., given a K-factor model, all K factors affect all the stock returns or portfolios. When the factor strength estimator is applied to the 162 factors, it is found that only the Market Return factor has a strength estimate close to 1 for most of the rolling samples, while most of the remaining factors have strength estimates between 0.5 and 0.8. These findings suggest that many factors are semi-strong and, more importantly, that one should take caution when estimating risk premia via the two-pass regression method, since studies are beginning to show that factor strengths matter for the precision of risk premia estimates (Pesaran and Smith, 2021; Anatolyev and Mikusheva, 2022).

^1 Sections 2.2.2.2 and 2.2.2.3 explain the four algorithms used in this paper in detail.
^2 To obtain the empirical distribution, a five-factor model was used on individual stock returns in rolling-window regressions. Specific details on obtaining the empirical distributions are provided in Section 2.3.1.
Further analysis is carried out to see if there is a relationship between the R-squared statistic and factor strength. In particular, I examine the pooled $R^2$, or $PR^2 = 1 - (SSR_{pooled})/(SST_{pooled})$, where $SSR_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2$ and $SST_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n}(r_{it} - \bar{r}_{iT})^2$. Pesaran and Smith (2021) show that, given a particular factor model in which only one of the factors is strong and the others are not, the pooled $R^2$, or $PR^2$, will be dominated by the strong factor, and the contribution of the other factors to $PR^2$ will vanish for large $n$. I test this theoretical prediction by comparing the $PR^2$ from the Capital Asset Pricing Model (CAPM) with those of other one-factor models using factors with strength estimates less than 1. Focusing on one-factor models, it is shown that $PR^2$ is largest for the CAPM, where the Market excess return is the only factor. No other one-factor model produces a $PR^2$ greater than the one observed when the CAPM is used. The intuition is that given a factor that is strong, such as the Market factor, one would expect the $PR^2$ to be highest, since the Market Return factor affects far more stock returns than any other factor in the data. The results support the theoretical prediction in Pesaran and Smith (2021) that the strong factor should have the largest contribution to $PR^2$.
2.2 Estimation Strategy
I will first explain how factor strengths are estimated using the method in BKP (2021).
In the next section, I then consider a high-dimensional setting with many potential factors,
and introduce the main estimator of this paper.
2.2.1 Estimation of Factor Strengths Assuming a Model
In BKP (2021), the very first step in estimating factor strength is to assume a particular factor model, such as the $K$-factor model shown below in equation 2.1:

$$y_{it} = a_i + c_i' z_t + \sum_{k=1}^{K} \beta_{ik} f_{kt} + u_{it}, \quad \text{for } i = 1, 2, \dots, n, \quad (2.1)$$
where $n$ represents the number of $y_{it}$; $f_{kt}$, $k = 1, 2, \dots, K$, are the factors assumed by the researcher to explain $y_{it}$; $a_i$ represents unit-specific effects; $u_{it} \sim IID(0, \sigma_i^2)$ is the error term for $y_{it}$; $\beta_{ik}$ is the loading on factor $k$ for $y_{it}$; and $c_i' z_t$ refers to the control variables and their coefficients. Given such a model, the strength of factor $k$ is measured by estimating $\alpha_k$ in the following statements.
$$|\beta_{ik}| > c \;\; \text{a.s. for } i = 1, 2, \dots, [n^{\alpha_k}], \quad (2.2)$$
$$|\beta_{ik}| = 0 \;\; \text{a.s. for } i = [n^{\alpha_k}]+1, [n^{\alpha_k}]+2, \dots, n,$$
where $c$ is some number greater than 0. If $\alpha_k = 1$, its maximum value, then $\beta_{ik}$ will be non-zero for all $y_{it}$. In this case, factor $k$ can be labeled a strong factor. If $\alpha_k < 1$, however, then $n - n^{\alpha_k}$ of the betas will be 0, in which case factor $k$ is semi-strong or weak. Following BKP (2021), this paper labels a factor semi-strong if its $\alpha_k$ is greater than 1/2 and weak if its $\alpha_k$ is less than 1/2. More formally, $\alpha_k$ measures the rate at which the number of non-zero betas rises with $n$. To estimate $\alpha_k$, first consider the proportion of statistically significant betas, given by the following equation.
$$\hat{\pi}_k = \frac{1}{n} \sum_{i=1}^{n} 1\left[\,|t_{ik}| > c_p(n)\,\right], \quad (2.3)$$
where $t_{ik}$ refers to the t-statistic of the loading on factor $k$ in regression 2.1, and the critical value function is given by

$$c_p(n) = \Phi^{-1}\left(1 - \frac{p}{2n^c}\right), \quad (2.4)$$
where $p$ represents the nominal size of the test and $c$ represents some non-negative number. If $c = 0$, we have the usual critical value, and if $c > 0$, the critical value accounts for the multiple testing problem. Using the estimated proportion from equation 2.3, BKP (2021) provide the following equation to estimate the strength of factor $k$:

$$\hat{\alpha}_k = 1 + \frac{\ln(\hat{\pi}_k)}{\ln(n)} \quad (2.5)$$
In the case where $\hat{\pi}_k = 0$, we set $\hat{\alpha}_k = 0$. This original estimator by BKP (2021) can be used when the researcher is concerned with a particular factor model, such as a one- or three-factor model. The estimator that I propose in this paper does not require one to assume a particular factor model, and instead simultaneously considers all available potential factors known to the researcher.
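As an illustration of equations 2.3 through 2.5, the following minimal Python sketch computes the BKP strength estimate from the $n$ loading t-statistics of an assumed factor model (the function name and the default values of p and c are illustrative; they mirror the p = 0.10 and δ = 1/4 used for GOCMT later in the paper):

```python
import numpy as np
from scipy.stats import norm

def bkp_strength(t_stats, p=0.10, c=0.25):
    """Strength estimate for one factor from the t-statistics of its
    loadings across the n equations of an assumed factor model."""
    n = len(t_stats)
    c_p = norm.ppf(1 - p / (2 * n**c))            # critical value, eq. (2.4)
    pi_hat = np.mean(np.abs(t_stats) > c_p)       # significant fraction, eq. (2.3)
    if pi_hat == 0:
        return 0.0                                # convention when pi_hat = 0
    return 1 + np.log(pi_hat) / np.log(n)         # eq. (2.5)
```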
2.2.2 Estimator for Factor Strength in High-Dimensional Settings
2.2.2.1 The Factor Strength Estimator
The new estimator proposed in this paper relies on variable selection algorithms to estimate factor strengths in a high-dimensional setting. The four variable selection algorithms considered are Lasso, Adaptive Lasso, OCMT, and GOCMT. Consider again the following regression model with $n$ number of $y_{it}$ that depend on $K$ factors and possibly a vector of control variables $z_t$:

$$y_{it} = a_i + c_i' z_t + \sum_{k=1}^{K} \beta_{ik} f_{kt} + u_{it}, \quad \text{for } i = 1, 2, \dots, n. \quad (2.6)$$
In addition, assume that there is a large number of potential factors and the researcher does not know, and does not want to assume, which $K$ factors are the signal-factors, i.e., the ones that belong in the equation above. Let $\mathcal{A} = \{x_{1t}, x_{2t}, \dots, x_{K_n t}\}$ represent the active set, that is, the large set of potential factors available to the researcher. The active set $\mathcal{A}$ thus contains the $K$ signal-factors, $\{f_{1t}, f_{2t}, \dots, f_{Kt}\}$, and also includes pseudo-factors and/or noise-factors. Again, pseudo-factors are the ones that are not part of the data generating process (DGP) but are nonetheless correlated with the signal-factors, and noise-factors are purely random variables. Therefore $x_{qt}$, for $q = 1, 2, \dots, K_n$, can be a signal-factor, a pseudo-factor, or a noise-factor, and the researcher does not know which ones are the signal-factors. The subscript $n$ on $K_n$ indicates that $K_n$ depends on $n$, the cross-sectional dimension, since both are assumed to rise together at a particular rate. If one is considering all the factors presented in the asset pricing literature, then $K_n$ would be as large as 300. $K_n$ is thus much larger than $K$, the notation usually used for the number of signal-factors, or the ones that are part of the data generating process. Without knowing or assuming which are the signal-factors, one cannot use the estimator in BKP (2021). Even when one is willing to assume a factor model, the researcher may choose a set of factors from $\mathcal{A}$ that differs from the signal-factors; in such cases, the factor strength estimates may differ from the true factor strengths.
For this reason, for each $y_{it}$, the estimator presented in this study applies a variable selection algorithm to the active set in order to determine which factors have non-zero betas, and by doing so it obtains strength estimates for all the potential factors in the data. After the variable selection algorithm has been applied to each of the $n$ number of $y_{it}$, the estimator computes the fraction of times each factor has been selected. Intuitively, a factor that is selected more often across the $n$ number of $y_{it}$ is estimated to be a stronger factor. More formally, denote by $S_{i,method}$ the set of factors selected by one of the four variable selection methods for equation $i$, or $y_{it}$. Then the factor strength estimate for $x_{qt}$, denoted $\hat{\alpha}_q$, is computed in the following way:

$$\hat{\pi}_{q,method} = \frac{\sum_{i=1}^{n} I[x_q \in S_{i,method}]}{n}, \quad (2.7)$$

$$\hat{\alpha}_{q,method} = 1 + \frac{\ln(\hat{\pi}_{q,method})}{\ln(n)}, \quad (2.8)$$

for $q = 1, 2, \dots, K_n$ and method $\in$ {OCMT, GOCMT, Lasso, ALasso}. In equation 2.7, $I[x_q \in S_{i,method}]$ refers to an indicator function that takes on a value of 1 if the argument inside the function is satisfied.
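A minimal sketch of equations 2.7 and 2.8, assuming the per-equation selection step has already been run and its output stored as a list of selected index sets (names are illustrative):

```python
import numpy as np

def strength_estimates(selections, K_n):
    """Turn per-unit selection sets into strength estimates.
    selections is a length-n list; selections[i] is the set of
    factor indices chosen for unit i by some selection method."""
    n = len(selections)
    counts = np.zeros(K_n)
    for sel in selections:
        for q in sel:
            counts[q] += 1
    pi_hat = counts / n                                   # eq. (2.7)
    alpha_hat = np.zeros(K_n)                             # 0 when pi_hat = 0
    nz = pi_hat > 0
    alpha_hat[nz] = 1 + np.log(pi_hat[nz]) / np.log(n)    # eq. (2.8)
    return alpha_hat
```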
2.2.2.2 Lasso and Adaptive Lasso
I briefly discuss the variable selection algorithms used in this study, beginning with Lasso and Adaptive Lasso. Lasso is the most widely used variable selection method in the literature today. Lasso sets some of the coefficients to zero, thereby enabling one to do model selection when faced with high-dimensional data. Consider again $y_{it}$ and the active set $\mathcal{A} = \{x_{1t}, x_{2t}, \dots, x_{K_n t}\}$; Lasso solves the following minimization problem:

$$\hat{\beta} = \arg\min_{\beta} \left\| y - \sum_{q=1}^{K_n} x_q \beta_q \right\|^2 + \lambda \sum_{q=1}^{K_n} |\beta_q| \quad (2.9)$$

Zou (2006), however, showed that in some settings Lasso may fail to satisfy the oracle properties, and proposed Adaptive Lasso, where coefficients are penalized using different weights, as shown below in equation 2.10:

$$\hat{\beta} = \arg\min_{\beta} \left\| y - \sum_{q=1}^{K_n} x_q \beta_q \right\|^2 + \lambda \sum_{q=1}^{K_n} w_q |\beta_q| \quad (2.10)$$

The weights, for instance, can be the inverse of the OLS beta coefficients obtained from a first stage.
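For concreteness, here is a minimal sketch of the Lasso and a two-stage Adaptive Lasso selection step for a single y series, assuming scikit-learn is available; the cross-validated choice of λ and the OLS first stage for the weights are illustrative choices, not the exact tuning used in this paper:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_selection(X, y):
    """Indices of factors with non-zero Lasso coefficients for one
    y series; lambda is chosen by 5-fold cross-validation."""
    fit = LassoCV(cv=5).fit(X, y)
    return set(np.flatnonzero(fit.coef_))

def adaptive_lasso_selection(X, y, eps=1e-8):
    """Two-stage Adaptive Lasso: rescale each column by a first-stage
    OLS coefficient, so the effective penalty on beta_q is
    lambda * |beta_q| / |beta_ols_q|, then run Lasso."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = np.abs(beta_ols) + eps            # larger first-stage beta => lighter penalty
    fit = LassoCV(cv=5).fit(X * w, y)     # selection is invariant to the rescaling
    return set(np.flatnonzero(fit.coef_))
```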
2.2.2.3 OCMT and GOCMT
Just like Lasso and Adaptive Lasso, loosely speaking, OCMT is designed to pick out the important x variables for a single $y_t$. Specifically, it picks out signals, or x variables with non-zero beta coefficients, and it also picks out pseudo-signals, or those correlated with the signals. Given an active set $\mathcal{A} = \{x_{1t}, x_{2t}, \dots, x_{K_n t}\}$ and a $y_t$, OCMT runs the following regression:

$$y_{it} = a_i + c_i' z_t + \beta_{iq} x_{qt} + u_{it}, \quad \text{for } i = 1, 2, \dots, n \text{ and } q = 1, \dots, K_n, \quad (2.11)$$

where $a_i$ and $z_t$ refer to the intercept and control variables, respectively. It thus runs regressions one at a time, using only one variable from the active set in each, and then uses the t-statistics to determine whether or not each variable is important. When variables in the active set are strongly correlated, or if they share one or more common factors, the common factors should be controlled for before applying the OCMT procedure; this procedure, known as GOCMT, has been shown to be effective (Sharifvaghefi, 2022). If strong correlation is controlled for, then as the number of variables in the active set and $T$ tend to infinity, OCMT picks out the signals and those correlated with the signals with probability 1.
There are several ways to control for strong correlations. One is to include control variables when applying OCMT; these would be variables that the researcher knows with certainty to be important for the y variable. One can also estimate unobserved common factors and use them as control variables before applying OCMT. The process in which control variables are used to purge strong correlations is called GOCMT. Specifically, equation 2.12 is used to filter out common factors:

$$x_{qt} = v_q + d_q' g_t + \zeta_{qt}, \quad (2.12)$$

where $g_t$ represents the vector of control variables or principal components used to filter out common factors amongst the variables in the active set. Principal components are estimated from the active set, and the number of principal components can be determined by the methods of Onatski (2010) or Bai and Ng (2002).
After filtering out the principal components or control variables, the new active set is $\mathcal{A}^z = \{x^g_{1t}, x^g_{2t}, \dots, x^g_{K_n t}\}$, where $x^g_{qt} = \zeta_{qt}$; that is, the variables in the active set are now the residuals from equation 2.12. GOCMT refers to the case when OCMT is applied to this new active set. After selection by GOCMT, the factor strength is estimated by equations 2.7 and 2.8.
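A minimal sketch of a single OCMT pass and its GOCMT variant, under simplifying assumptions: only the first pass of the multi-stage OCMT procedure is shown, the critical value applies equation 2.4 with the number of candidate covariates in place of n, and the estimated principal components are used both to filter the active set and as controls (all function names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ocmt_selection(X, y, Z=None, p=0.10, delta=0.25):
    """First pass of OCMT for one y series: regress y on each candidate
    x_q (plus intercept and controls Z) one at a time; keep variables
    whose t-statistic clears the multiple-testing critical value."""
    T, K_n = X.shape
    cp = norm.ppf(1 - p / (2 * K_n**delta))
    base = np.ones((T, 1)) if Z is None else np.column_stack([np.ones(T), Z])
    selected = set()
    for q in range(K_n):
        W = np.column_stack([base, X[:, q]])
        beta, *_ = np.linalg.lstsq(W, y, rcond=None)
        resid = y - W @ beta
        s2 = resid @ resid / (T - W.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])
        if abs(beta[-1] / se) > cp:
            selected.add(q)
    return selected

def gocmt_selection(X, y, n_pc=3, p=0.10, delta=0.25):
    """GOCMT: filter the first n_pc principal components out of the
    active set (eq. 2.12) and run OCMT on the residuals."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    G = U[:, :n_pc] * S[:n_pc]                     # estimated common factors
    resid_X = Xc - G @ np.linalg.lstsq(G, Xc, rcond=None)[0]
    return ocmt_selection(resid_X, y, Z=G, p=p, delta=delta)
```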
2.3 Monte Carlo
In this section, I test the factor strength estimator presented in this study using simulations under different settings. I generate a large number of $y_{it}$ using a five-factor model and a total of 155 potential factors, of which five are signal-factors and the rest are pseudo-factors and noise-factors. The main goal is to verify that the estimator correctly estimates factor strengths when it does not know which variables are signal-factors, pseudo-factors, or noise-factors. In other words, to the estimator, the 155 generated variables are simply potential factors.
2.3.1 Simulation Design
The data generating process is given by equation 2.13 below. Parameters such as the means and variances of the factors are set to match the ones obtained when a five-factor model is fit to US stock returns.^3 Of course, the empirical distribution could have been obtained using different data, such as wages for the set of $y_{it}$ and macroeconomic variables for the potential factors, but I argue that the Monte Carlo results are not sensitive to parameter calibrations based on different data. Moreover, in the empirical section, I estimate factor strengths using stock returns and a large set of potential asset pricing factors. This is why, for the Monte Carlo experiments, I use parameters obtained from the five-factor model with stock returns as the y variables.

^3 In particular, I use a five-factor model consisting of Market, Leverage, Book-to-Market, Size, and Momentum factors. Except for the Market factor, other factors could have been chosen, and the Monte Carlo results would not change much. The five-factor model was estimated on a rolling-sample basis, with each rolling sample spanning ten years, and parameter estimates were averaged across the rolling samples. The parameters of the pseudo-factors and noise-factors, such as means and variances, were obtained by averaging estimates of the means and variances across 158 asset pricing factors taken from the asset pricing literature.
DGP: I assume that returns are generated according to a five-factor model, as shown in the equation below:

$$r_{it} = d_i + \sum_{k=1}^{5} \beta_{ik} f_{kt} + \epsilon_{it}, \quad (2.13)$$

where $i = 1, 2, \dots, n$ ($n$ = number of $y_{it}$), $t = 1, 2, \dots, T$, and $k = 1, 2, \dots, 5$; $d_i \sim N(0, 0.0056^2)$; $\epsilon_{it} \sim N(0, \sigma_i^2)$; and $\sigma_i^2$ is set such that the R-squared takes a particular value for each stock return. For the R-squared of the $n$ equations, I use three different distributions with different means: $R_i^2 \sim IIDU(0.1, 0.4)$, $R_i^2 \sim IIDU(0.1, 0.6)$, and $R_i^2 \sim IIDU(0.1, 0.8)$. Setting the R-squared is equivalent to setting the variances of the error terms, and the results will show that the precision of the estimator increases with higher R-squared. The large set of potential factors is generated using the three equations below:

$$f_{kt} = \phi_{fk}\, g_{1t} + \lambda_{fk}\, g_{2t} + q_{fkt}, \quad (2.14)$$
$$s_{ct} = \phi_{sc}\, g_{1t} + \lambda_{sc}\, g_{2t} + q_{sct}, \quad (2.15)$$
$$\eta_{jt} = q_{\eta jt}, \quad (2.16)$$

where $f_{kt}$, $k = 1, 2, \dots, 5$, are the five signal-factors; $s_{ct}$, $c = 1, 2, \dots, 50$, are the 50 pseudo-factors; and $\eta_{jt}$, $j = 1, 2, \dots, 100$, are the 100 noise-factors. This setting makes $K_n = 155$ and $K = 5$. Except for the 100 noise-factors, the remaining 55 factors (5 signal-factors and 50 pseudo-factors) are correlated with one another. I generate this correlation using two variables, $g_{1t}$ and $g_{2t}$. As shown in the equations above, $g_{1t}$ and $g_{2t}$ are $IIDN(0, 1)$ variables that serve as common factors amongst the signal-factors and the pseudo-factors. The coefficients on $g_{1t}$ and $g_{2t}$ determine the pair-wise correlations amongst the signal-factors and pseudo-factors, and these correlations are set to mimic the ones observed in the empirical distribution.
For the error terms above, $q_{fkt} \sim IIDN(\mu_{fk}, \sigma^2_{qfk})$, $q_{sct} \sim IIDN(\mu_{sc}, \sigma^2_{qsc})$, and $q_{\eta jt} \sim IIDN(\mu_{\eta j}, \sigma^2_{q\eta j})$. Since $q_{\eta jt}$ is a pure noise factor, I set $q_{\eta jt} \sim IIDN(0, 1)$. I set the coefficients on the two common factors and the first two moments of the error terms of the signal-factors and pseudo-factors such that the means and variances of the signal-factors and pseudo-factors match the ones observed in the empirical distribution. The pair-wise correlations amongst the five signal-factors and 50 pseudo-factors are also matched to the ones seen in the empirical distribution.

Specifically, let $\rho_{fk1}$ and $\rho_{fk2}$, for $k = 1, 2, \dots, 5$, represent the correlations between the five signal-factors and $g_{1t}$ and $g_{2t}$, respectively. I set $\{\rho_{f11}, \rho_{f21}, \rho_{f31}, \rho_{f41}, \rho_{f51}\}$ = {-0.44, 0.65, -0.06, -0.53, 0.24} and $\{\rho_{f12}, \rho_{f22}, \rho_{f32}, \rho_{f42}, \rho_{f52}\}$ = {0.40, -0.61, -0.71, -0.61, 0.21}. For the pseudo-factors, their correlations with $g_{1t}$ and $g_{2t}$, denoted $\rho_{sc1}$ and $\rho_{sc2}$, are randomly drawn from uniform distributions, that is, $\rho_{sc1} \sim IIDU(-0.9, 0.9)$ and $\rho_{sc2} \sim IIDU(-0.7, 0.7)$ for $c = 1, 2, \dots, 50$. These correlations were obtained from the empirical distribution when correlation coefficients were estimated between each of the 162 asset pricing factors and the two common factors. The common factors were estimated using the first two principal components of the 158 asset pricing factors.^4

Let $\mu_{fk}$ and $\sigma^2_{fk}$, $k = 1, 2, \dots, 5$, represent the means and variances of the five signal-factors. I set $\{\mu_{f1}, \mu_{f2}, \mu_{f3}, \mu_{f4}, \mu_{f5}\}$ = {0.68, 0.29, 1.03, 0.20, 1.54} and $\{\sigma^2_{f1}, \sigma^2_{f2}, \sigma^2_{f3}, \sigma^2_{f4}, \sigma^2_{f5}\}$ = {4.5², 4.7², 4.8², 3.08², 8.9²}. For the pseudo-factors and noise-factors, I set $\mu_{sc} = \mu_{\eta j} = 0.5$ and $\sigma^2_{sc} = \sigma^2_{\eta j} = 3.5^2$ for $c = 1, 2, \dots, 50$ and $j = 1, 2, \dots, 100$. The reason for setting the mean and variance to 0.5 and 3.5² is that these are the mean monthly return and the mean variance of monthly returns, respectively, of the 158 factors in the data.

^4 I used Onatski's method to estimate the number of common factors amongst the 158 asset pricing factors for each rolling sample. The number of strong factors was estimated to be around three. To keep the setting simple, I took two as the number of common factors for the data generating process.
Let $\alpha_{fk}$ denote the factor strength of the signal-factor $f_{kt}$; I set $\{\alpha_{f1}, \alpha_{f2}, \alpha_{f3}, \alpha_{f4}, \alpha_{f5}\}$ = {1, 0.9, 0.8, 0.7, 0.45}. These factor strengths were not obtained from the empirical distribution, but rather were assigned these particular values in order to better examine how the precision of the estimator responds to varying degrees of factor strength. For the betas, $\beta_{ik} \sim IIDN(\mu_{\beta k}, \sigma^2_{\beta k})$ for $n^{\alpha_{fk}}$ of the stock returns, and $\beta_{ik} = 0$ for the remaining $n - n^{\alpha_{fk}}$. Thus, a higher factor strength means that the signal-factor affects more y variables. The betas are randomly shuffled across $i$, meaning that the zero and non-zero betas are assigned across $i$ in a random fashion. Since a five-factor model is used for the DGP, the factor strengths and the betas are only defined for the five signal-factors. For the specific means and variances of the betas, I set $\{\mu_{\beta 1}, \mu_{\beta 2}, \mu_{\beta 3}, \mu_{\beta 4}, \mu_{\beta 5}\}$ = {1.05, -0.011, 0.082, -0.014, -0.041} and $\{\sigma^2_{\beta 1}, \sigma^2_{\beta 2}, \sigma^2_{\beta 3}, \sigma^2_{\beta 4}, \sigma^2_{\beta 5}\}$ = {0.47², 0.62², 0.60², 0.54², 0.18²}.

Finally, I consider $n \in \{200, 500, 1000\}$ and $T \in \{120, 180, 240\}$. These combinations, together with the three different R-squared distributions, result in a total of 27 combinations of 155 factor strength estimates for each type of variable selection method used by the estimator. The three different R-squared cases refer to drawing the R-squared for equation 2.13 from IIDU(0.1, 0.4), IIDU(0.1, 0.6), and IIDU(0.1, 0.8); in the main table of results, the means of these distributions are reported so the reader knows which distribution was used. The variable selection algorithms employed for the estimator are Lasso, Adaptive Lasso, and GOCMT using various numbers of principal components.
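The following Python sketch mirrors the structure of the DGP in equations 2.13 through 2.16 under simplified, illustrative parameters (unit-variance errors and uniform loadings on the common factors in place of the calibrated moments described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(n=500, T=180, r2_high=0.6):
    """Simplified sketch of the five-signal-factor DGP: 5 signal-factors,
    50 pseudo-factors, 100 noise-factors (K_n = 155)."""
    alphas = np.array([1.0, 0.9, 0.8, 0.7, 0.45])          # signal strengths
    g = rng.standard_normal((T, 2))                        # common factors g1, g2
    F = g @ rng.uniform(-0.7, 0.7, (2, 5)) + rng.standard_normal((T, 5))
    S = g @ rng.uniform(-0.7, 0.7, (2, 50)) + rng.standard_normal((T, 50))
    H = rng.standard_normal((T, 100))                      # pure noise-factors
    X = np.column_stack([F, S, H])                         # active set

    B = np.zeros((n, 5))
    for k, a in enumerate(alphas):
        m = int(np.floor(n**a))                            # n^alpha non-zero betas
        idx = rng.permutation(n)[:m]                       # shuffled across i
        B[idx, k] = rng.standard_normal(m)

    signal = F @ B.T                                       # T x n systematic part
    R2 = rng.uniform(0.1, r2_high, n)                      # target R-squared per unit
    sig2 = signal.var(axis=0) * (1 - R2) / R2              # implied error variances
    Y = signal + rng.standard_normal((T, n)) * np.sqrt(sig2)
    return Y, X
```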
2.3.2 Monte Carlo Results
Table 2.1 shows Monte Carlo results using three different variable selection methods: Lasso, Adaptive Lasso, and GOCMT controlling for the first three principal components (GOCMT(3PC)). GOCMT using the first and the first two principal components is shown in Table A.1 in the Appendix. The first three principal components were estimated from the 155 factors generated according to equations 2.14, 2.15, and 2.16. The active set for the different variable selection methods is also the 155 generated factors, which include the five signal-factors, 50 pseudo-factors, and 100 noise-factors. Each estimate shown in Table 2.1 is the mean over $R = 2000$ replications. Specifically, $\hat{\alpha}_{fk} = \frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{fkr}$, for $k = 1, 2, \dots, 5$, are the strength estimates for the five signal-factors. For the 50 pseudo-factors, Table 2.1 reports a single strength estimate, the mean of the strength estimates across the 50 pseudo-factors and 2000 replications. Specifically, let $\hat{\alpha}_{sc}$, for $c = 1, 2, \dots, 50$, represent the strength estimate for the $c$th pseudo-factor $s_{ct}$; the mean of the estimates for the pseudo-factors is then $\bar{\hat{\alpha}}_s = \sum_{c=1}^{50}\frac{1}{50}\left(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{scr}\right)$. Likewise, for the noise-factors, Table 2.1 reports a single strength estimate, the mean of the strength estimates across the 100 noise-factors and 2000 replications. Let $\hat{\alpha}_{\eta j}$, for $j = 1, 2, \dots, 100$, be the strength estimate for the $j$th noise-factor $\eta_{jt}$; the mean of the estimates for the noise-factors is then $\bar{\hat{\alpha}}_\eta = \sum_{j=1}^{100}\frac{1}{100}\left(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{\eta jr}\right)$.

The estimates in Table 2.1 show that the estimator is fairly precise regardless of which particular method is used. The R-squared distribution closest to the one seen in the empirical data is the medium case, in which R-squared is drawn from IIDU(0.1, 0.6) with a mean of 0.35. For this reason, the discussion of the estimates focuses on the cases where the R-squared statistics were drawn from IIDU(0.1, 0.6), which I will informally call the medium R-squared case.

For the first signal-factor, all three methods yield estimates close to the true value of 1. All three methods slightly underestimate the true value, but 0.97 is the lowest estimate in the medium R-squared case.

For the second signal-factor, all three methods underestimate the true value by approximately 10%. For Lasso and Adaptive Lasso, the estimates closest to the true value of 0.9 are 0.85 and 0.83, respectively. For GOCMT(3PC), 0.83 is the estimate closest to 0.9.
Table 2.1: Factor Strength Estimates

                               Lasso               Adaptive Lasso      GOCMT(3PC)
Estimates      R̄²     n\T    120   180   240     120   180   240     120   180   240
α̂_f1 (1)      0.25    200    0.97  0.98  0.98    0.96  0.97  0.98    0.96  0.97  0.98
                       500    0.98  0.98  0.99    0.97  0.98  0.98    0.97  0.98  0.99
                      1000    0.98  0.99  0.99    0.97  0.98  0.99    0.97  0.98  0.99
               0.35    200    0.98  0.99  0.99    0.97  0.98  0.98    0.97  0.98  0.99
                       500    0.98  0.99  0.99    0.98  0.98  0.99    0.98  0.99  0.99
                      1000    0.98  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
               0.45    200    0.98  0.99  0.99    0.98  0.98  0.99    0.98  0.99  0.99
                       500    0.99  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
                      1000    0.99  0.99  0.99    0.98  0.99  0.99    0.99  0.99  0.99
α̂_f2 (0.9)    0.25    200    0.76  0.80  0.83    0.71  0.76  0.79    0.69  0.73  0.76
                       500    0.80  0.83  0.85    0.76  0.80  0.82    0.74  0.77  0.79
                      1000    0.82  0.84  0.86    0.78  0.82  0.84    0.76  0.79  0.81
               0.35    200    0.81  0.84  0.85    0.77  0.81  0.83    0.74  0.77  0.79
                       500    0.83  0.85  0.87    0.80  0.83  0.85    0.77  0.80  0.82
                      1000    0.85  0.86  0.88    0.82  0.84  0.86    0.79  0.82  0.83
               0.45    200    0.83  0.85  0.87    0.81  0.83  0.85    0.77  0.80  0.82
                       500    0.85  0.87  0.88    0.83  0.85  0.86    0.80  0.82  0.84
                      1000    0.86  0.88  0.88    0.84  0.86  0.87    0.82  0.84  0.85
α̂_f3 (0.8)    0.25    200    0.76  0.78  0.79    0.72  0.75  0.77    0.73  0.75  0.77
                       500    0.78  0.79  0.80    0.75  0.77  0.78    0.75  0.77  0.79
                      1000    0.79  0.80  0.81    0.76  0.78  0.79    0.76  0.78  0.79
               0.35    200    0.78  0.80  0.80    0.75  0.77  0.79    0.76  0.78  0.79
                       500    0.79  0.80  0.81    0.77  0.78  0.79    0.77  0.79  0.80
                      1000    0.80  0.81  0.81    0.78  0.79  0.80    0.78  0.79  0.80
               0.45    200    0.79  0.80  0.81    0.77  0.78  0.79    0.77  0.79  0.80
                       500    0.80  0.81  0.81    0.78  0.79  0.80    0.79  0.80  0.80
                      1000    0.81  0.81  0.82    0.79  0.80  0.80    0.79  0.80  0.80
α̂_f4 (0.7)    0.25    200    0.60  0.64  0.66    0.54  0.58  0.61    0.55  0.58  0.61
                       500    0.64  0.66  0.68    0.59  0.62  0.64    0.60  0.62  0.64
                      1000    0.66  0.68  0.69    0.61  0.64  0.66    0.63  0.65  0.66
               0.35    200    0.64  0.67  0.68    0.59  0.63  0.65    0.59  0.61  0.64
                       500    0.67  0.69  0.70    0.63  0.65  0.67    0.63  0.65  0.67
                      1000    0.68  0.70  0.70    0.64  0.66  0.68    0.66  0.67  0.68
               0.45    200    0.66  0.68  0.69    0.63  0.65  0.67    0.62  0.64  0.66
                       500    0.68  0.70  0.71    0.65  0.67  0.68    0.65  0.67  0.68
                      1000    0.69  0.71  0.71    0.66  0.68  0.69    0.68  0.68  0.69
α̂_f5 (0.45)   0.25    200    0.54  0.55  0.55    0.47  0.48  0.49    0.50  0.51  0.52
                       500    0.58  0.58  0.59    0.51  0.51  0.51    0.54  0.55  0.55
                      1000    0.61  0.61  0.61    0.53  0.53  0.53    0.57  0.57  0.57
               0.35    200    0.56  0.56  0.57    0.49  0.49  0.50    0.51  0.52  0.52
                       500    0.59  0.59  0.59    0.52  0.52  0.53    0.55  0.55  0.55
                      1000    0.61  0.61  0.61    0.54  0.54  0.54    0.57  0.57  0.57
               0.45    200    0.57  0.57  0.57    0.50  0.51  0.51    0.52  0.52  0.52
                       500    0.60  0.60  0.60    0.53  0.53  0.53    0.55  0.55  0.55
                      1000    0.62  0.62  0.62    0.55  0.55  0.55    0.57  0.57  0.56
ᾱ̂_s           0.25    200    0.42  0.42  0.43    0.26  0.25  0.25    0.36  0.34  0.33
                       500    0.51  0.51  0.51    0.37  0.36  0.36    0.46  0.45  0.44
                      1000    0.55  0.55  0.55    0.43  0.42  0.42    0.52  0.50  0.50
               0.35    200    0.43  0.44  0.44    0.26  0.26  0.26    0.35  0.33  0.33
                       500    0.52  0.52  0.52    0.37  0.37  0.37    0.45  0.44  0.43
                      1000    0.56  0.56  0.56    0.43  0.43  0.43    0.51  0.50  0.49
               0.45    200    0.44  0.44  0.45    0.28  0.27  0.28    0.33  0.32  0.31
                       500    0.52  0.52  0.52    0.38  0.38  0.38    0.44  0.43  0.42
                      1000    0.57  0.56  0.56    0.44  0.43  0.43    0.50  0.48  0.48
ᾱ̂_η           0.25    200    0.44  0.43  0.43    0.29  0.26  0.26    0.35  0.33  0.32
                       500    0.52  0.51  0.51    0.40  0.38  0.37    0.45  0.44  0.43
                      1000    0.57  0.56  0.56    0.46  0.44  0.44    0.51  0.50  0.49
               0.35    200    0.45  0.44  0.44    0.29  0.28  0.29    0.34  0.32  0.31
                       500    0.53  0.52  0.52    0.40  0.39  0.39    0.44  0.43  0.42
                      1000    0.58  0.57  0.57    0.46  0.45  0.45    0.50  0.49  0.48
               0.45    200    0.46  0.45  0.45    0.31  0.30  0.31    0.32  0.30  0.29
                       500    0.54  0.53  0.53    0.41  0.41  0.41    0.43  0.41  0.40
                      1000    0.58  0.57  0.57    0.47  0.46  0.46    0.49  0.47  0.47

Notes: Factor strength estimates for the five signal-factors, 50 pseudo-factors, and 100 noise-factors. $\hat{\alpha}_{fk}$, for $k = 1, 2, \dots, 5$, are the estimates for the five signal-factors. For the pseudo-factors and noise-factors, the means of the estimates are shown: $\bar{\hat{\alpha}}_s$ refers to the mean of strength estimates across the 50 pseudo-factors, and $\bar{\hat{\alpha}}_\eta$ to the mean across the 100 noise-factors. For the five signal-factors, the true strengths are in parentheses next to each factor. $\bar{R}^2$ refers to the mean of $R_i^2$ for equation 2.13 used in the DGP. All estimates are means over 2000 replications: $\hat{\alpha}_{fk} = \frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{fkr}$ for $k = 1, \dots, 5$ with $R = 2000$; $\bar{\hat{\alpha}}_s = \sum_{c=1}^{50}\frac{1}{50}(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{scr})$ is the mean of estimates for the pseudo-factors; and $\bar{\hat{\alpha}}_\eta = \sum_{j=1}^{100}\frac{1}{100}(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{\eta jr})$ is the mean of estimates for the noise-factors. GOCMT uses $\delta = 1/4$ and $p = 0.10$. $R_i^2$ for equation 2.13 is drawn from three distributions, IIDU(0.1, 0.4), IIDU(0.1, 0.6), and IIDU(0.1, 0.8), whose means $\bar{R}^2$ are reported in the table. GOCMT(3PC) controls for the first three principal components of the 155 generated factors.
There is an explanation for the underestimation of the second signal-factor. As explained in the previous section, the five-factor model was used to calibrate the parameters of the variables in the Monte Carlo experiments. For the beta of the second signal-factor, the Leverage factor was used to calibrate the mean and variance of the loading, which were -0.011 and 0.62², respectively. All else equal, a higher mean and/or variance of the beta yields more precise estimates of factor strength. In other words, if one were to set the variance of the beta to 1, then all three methods would yield estimates close to 0.9, the true value of the factor strength.

In other studies, when conducting Monte Carlo experiments using asset-pricing-factor-mimicking variables, researchers often set the mean and variance of the beta to 0 and 1, respectively. I could have done the same, but in order to follow the empirical distribution as closely as possible, I used the exact empirical distribution of the beta on the Leverage factor. There are other asset pricing factors, such as the Tail Risk Beta factor, that would have yielded more precise estimates, primarily because of a higher mean and/or variance of the beta. Future revisions of this paper could perhaps use the Tail Risk Beta factor. For now, it is sufficient to note that the underestimation of the second signal-factor does not imply a shortcoming of the estimator. It rather hints that either a different factor, such as the Tail Risk Beta factor, should have been used for setting the parameters, or an IIDN(0,1) distribution should simply have been assigned to the beta of the second signal-factor, which is the norm other studies have followed when conducting Monte Carlo experiments.

For the third and fourth signal-factors, the estimates are close to their true values for all three methods, especially when $n$ is large. For the fifth signal-factor, whose strength was set to 0.45, all three methods tend to overestimate the strength. In particular, for the case where $n = 500$, $T = 240$, and R-squared is drawn from IIDU(0.1, 0.6), Lasso returns an estimate of 0.59, while GOCMT(3PC) returns an estimate of 0.57, both well above the true strength of 0.45.

For the pseudo-factors and the noise-factors, Lasso and GOCMT(3PC) yield similar estimates, ranging between 0.33 and 0.5. For the estimator to be truly precise, the strength estimates should be close to 0, since none of these variables contribute to the data generating process. Estimates between 0.33 and 0.5 thus suggest that the variable selection algorithms pick up some of these variables as important purely by chance. For the pseudo-factors and noise-factors, the estimates obtained using Adaptive Lasso are the most precise, since they are the lowest. Comparing Lasso and GOCMT(3PC), the estimates obtained using GOCMT(3PC) are lower and therefore more precise than those from Lasso. These results suggest that one should focus on factors with strength estimates above 0.5, since it is difficult to differentiate signal-factors from non-signal-factors when strength estimates come out below 0.5.
Overall, there are notable systematic patterns in the estimates. First, Adaptive Lasso estimates are always lower than the estimates from Lasso, and Lasso estimates are generally lower than the estimates from GOCMT. Second, for all three methods, strength estimates increase with $n$ for any level of $T$, $R^2$, and true factor strength. Third, for all three methods, as $T$ increases, strength estimates become more precise and approach their true values for any level of $n$, $R^2$, and true factor strength. In particular, where the methods tend to underestimate factor strength, the estimates increase with $T$; where they tend to overestimate it, as for the pseudo-factors and noise-factors, the estimates decrease with larger $T$. Finally, for all three methods, as $R^2$ increases, strength estimates increase for any level of $T$ and $n$. This pattern is slightly different for the pseudo-factors and the noise-factors: for Lasso and Adaptive Lasso, their strength estimates increase with larger $R^2$, meaning the estimates become less precise, whereas for GOCMT(3PC), the strength estimates become smaller, and thus more precise, with larger $R^2$ for any level of $T$ and $n$.
Table A in the Appendix shows the estimates under the same Monte Carlo setting but using different numbers of principal components for OCMT. The results align with expectations. Since the Monte Carlo design used two common factors to generate correlations amongst the covariates, controlling for two or more principal components yields fairly precise estimates. The estimates using two and three principal components are almost identical, implying that controlling for two principal components was enough, and that controlling for more than two is also harmless. As expected, when only one principal component is controlled for, the estimates are much higher than the true values due to the residual correlations amongst the signal-factors and pseudo-factors. In empirical applications, the number of unobserved common factors is not known, and one would need to rely on methods such as those of Bai and Ng (2002) and Onatski (2010) to estimate it.
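To make this last step concrete, the following is a minimal sketch, in Python, of the Bai and Ng (2002) $IC_{p1}$ criterion referenced above. It is an illustrative implementation under my own naming conventions, not the exact code used in this study.

import numpy as np

def bai_ng_ic1(X, kmax=8):
    """Estimate the number of common factors in a T x N panel X via IC_p1."""
    T, N = X.shape
    X = X - X.mean(axis=0)                       # demean each series
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    ic = np.empty(kmax + 1)
    for k in range(kmax + 1):
        if k == 0:
            resid = X
        else:
            F = U[:, :k] * s[:k]                 # T x k principal components
            L = Vt[:k, :]                        # k x N loadings
            resid = X - F @ L                    # rank-k approximation error
        V = (resid ** 2).mean()                  # average squared residual
        penalty = k * ((N + T) / (N * T)) * np.log(N * T / (N + T))
        ic[k] = np.log(V) + penalty              # IC_p1 of Bai and Ng (2002)
    return int(np.argmin(ic))                    # estimated number of factors

# Illustration on a synthetic panel with two true common factors:
rng = np.random.default_rng(0)
F = rng.standard_normal((240, 2))
L = rng.standard_normal((2, 155))
X = F @ L + rng.standard_normal((240, 155))
print(bai_ng_ic1(X))                             # typically prints 2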
Examining only the mean of strength estimates over the 50 pseudo-factors and 100 noise-factors may lead one to overlook important information. Table 2.2 and Table 2.3 therefore report strength estimates for three individual pseudo-factors and three individual noise-factors, respectively. In particular, Table 2.2 reports estimates for the 1st, 25th, and 50th pseudo-factors, while Table 2.3 reports estimates for the 1st, 50th, and 100th noise variables. The 50 pseudo-factors are ordered by the degree of their correlations with the signal-factors. As the previous section explains, the degree of correlation between the five signal-factors and the 50 pseudo-factors is governed by their correlations with the two common factors. Since the five signal-factors are all correlated with the two common factors, a pseudo-factor will be more related to the signal-factors if it has a greater degree of correlation with the two common factors. The correlations between the pseudo-factors and the two common factors were randomly generated according to uniform distributions. To quantify the extent to which the 50 pseudo-factors are related to the five signal-factors, I regress each of the 50 pseudo-factors on the five signal-factors, which results in a total of 50 regressions along with 50 $R^2$ statistics, and I order the 50 pseudo-factors in terms of these $R^2$ statistics. After the ordering, $\hat{\alpha}_{s_c r}$ denotes the factor strength estimate for the cth pseudo-factor in the rth replication, and $\hat{\alpha}_{s_c} = \frac{1}{R}\sum_{r=1}^{R} \hat{\alpha}_{s_c r}$ is the final strength estimate of the cth pseudo-factor amongst the 50 pseudo-factors generated. Since the pseudo-factors were ordered in terms of $R^2$ for each replication, $\hat{\alpha}_{s_1}$ corresponds to the strength estimate for the 1st pseudo-factor, the one with the highest mean $R^2$ across the 2,000 replications. In other words, $\hat{\alpha}_{s_1}$ and $\hat{\alpha}_{s_{50}}$ correspond to the pseudo-factors that are related to the five signal-factors the most and the least, respectively. Let $\bar{R}^2_{c,pseudo}$ represent the mean of $R^2$ across the 2,000 replications for the cth pseudo-factor, where each $R^2$ corresponds to the regression with the cth pseudo-factor as the $y_{it}$ and the five signal-factors as the regressors. Then $\bar{R}^2_{1,pseudo}$, $\bar{R}^2_{25,pseudo}$, and $\bar{R}^2_{50,pseudo}$ are 0.77, 0.32, and 0.025, respectively.
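As an illustration of this ordering step, the following sketch regresses each pseudo-factor on the five signal-factors and ranks the pseudo-factors by the resulting $R^2$. The array names (pseudo, a T x 50 matrix, and signals, a T x 5 matrix) are hypothetical placeholders for one replication's simulated data.

import numpy as np

def r2_on_signals(pseudo, signals):
    """R-squared of each pseudo-factor regressed on the signal-factors."""
    T = signals.shape[0]
    X = np.column_stack([np.ones(T), signals])     # intercept + 5 signals
    r2 = np.empty(pseudo.shape[1])
    for c in range(pseudo.shape[1]):
        y = pseudo[:, c]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2[c] = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return r2

# Order pseudo-factors from most to least related to the signal-factors:
# order = np.argsort(-r2_on_signals(pseudo, signals))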
Table 2.2 shows the three strength estimates, $\hat{\alpha}_{s_1}$, $\hat{\alpha}_{s_{25}}$, and $\hat{\alpha}_{s_{50}}$. Looking at the results obtained from GOCMT(3PC), one can see that the strength estimates decrease as the pseudo-factors become less correlated with the signal-factors. The highest estimate for the 1st pseudo-factor is 0.54, and the highest estimate for the 50th pseudo-factor is 0.45. It should be noted that if the three principal components were able to control for the two common factors completely, then one would not observe differences between the estimates across the 50 pseudo-factors.
This implies that the three principal components were not able to completely capture the two common factors driving the correlations amongst the five signal-factors and the 50 pseudo-factors. One likely reason is that the correlations between the two common factors and the 50 pseudo-factors were drawn from uniform distributions that included 0, making the relation between the common factors and some pseudo-factors negligible. Estimating the common factors via principal components may therefore not have been entirely adequate, since the two common factors did not affect all 50 pseudo-factors.
In general, the strength estimates obtained from GOCMT(3PC) suggest that the three pseudo-factors are weak factors or noise-factors, with the highest estimate being 0.54 and most estimates lying below 0.50.
Table 2.2: Factor Strength Estimates for Pseudo-Factors

                                            Lasso              Adaptive Lasso         GOCMT(3PC)
factor                  $\bar{R}^2$  n/T   120   180   240     120   180   240     120   180   240
$\hat{\alpha}_{s_1}$       0.25      200   0.37  0.39  0.40    0.22  0.23  0.22    0.39  0.38  0.38
                                     500   0.47  0.49  0.49    0.33  0.34  0.33    0.48  0.47  0.48
                                    1000   0.52  0.53  0.54    0.40  0.40  0.40    0.54  0.53  0.53
                           0.35      200   0.39  0.40  0.42    0.22  0.22  0.22    0.39  0.38  0.38
                                     500   0.48  0.49  0.50    0.33  0.33  0.33    0.49  0.47  0.49
                                    1000   0.53  0.54  0.54    0.39  0.39  0.39    0.54  0.53  0.54
                           0.45      200   0.40  0.41  0.42    0.22  0.22  0.23    0.38  0.37  0.39
                                     500   0.49  0.50  0.50    0.33  0.33  0.33    0.48  0.48  0.48
                                    1000   0.54  0.54  0.55    0.39  0.39  0.39    0.54  0.53  0.54
$\hat{\alpha}_{s_{25}}$    0.25      200   0.44  0.44  0.44    0.27  0.26  0.26    0.36  0.34  0.34
                                     500   0.52  0.52  0.52    0.38  0.37  0.36    0.46  0.44  0.44
                                    1000   0.57  0.56  0.56    0.45  0.43  0.43    0.52  0.51  0.50
                           0.35      200   0.45  0.45  0.45    0.27  0.27  0.27    0.35  0.34  0.33
                                     500   0.53  0.53  0.53    0.39  0.38  0.38    0.45  0.44  0.43
                                    1000   0.57  0.57  0.57    0.44  0.43  0.44    0.51  0.50  0.49
                           0.45      200   0.46  0.46  0.46    0.29  0.29  0.29    0.34  0.33  0.31
                                     500   0.54  0.53  0.53    0.39  0.39  0.39    0.44  0.43  0.42
                                    1000   0.58  0.58  0.57    0.45  0.45  0.44    0.50  0.48  0.48
$\hat{\alpha}_{s_{50}}$    0.25      200   0.43  0.42  0.43    0.28  0.26  0.26    0.32  0.31  0.31
                                     500   0.52  0.51  0.51    0.40  0.37  0.37    0.43  0.42  0.42
                                    1000   0.56  0.56  0.56    0.45  0.44  0.43    0.49  0.48  0.48
                           0.35      200   0.44  0.44  0.44    0.28  0.28  0.28    0.30  0.29  0.28
                                     500   0.52  0.52  0.52    0.39  0.39  0.39    0.41  0.41  0.40
                                    1000   0.57  0.56  0.56    0.45  0.45  0.45    0.48  0.47  0.47
                           0.45      200   0.45  0.44  0.45    0.30  0.30  0.30    0.27  0.26  0.26
                                     500   0.53  0.52  0.53    0.41  0.40  0.40    0.39  0.38  0.38
                                    1000   0.57  0.57  0.57    0.46  0.46  0.46    0.45  0.45  0.45

Notes: Factor strength estimates for three pseudo-factors. Estimates are shown for $s_{1t}$, $s_{25t}$, and $s_{50t}$, i.e., the 1st, 25th, and 50th pseudo-factors amongst the total of 50 pseudo-factors generated. $\hat{\alpha}_{s_c}$, for c = 1, 25, 50, represent the strength estimates shown in the table. The pseudo-factors are ordered in terms of their degree of correlation with the five signal-factors, $f_{1t}, f_{2t}, \ldots, f_{5t}$, where the degree of correlation is represented by the $R^2$ of the regression when the cth pseudo-factor, $s_{ct}$, is regressed on the five signal-factors with an intercept. $R^2_1$, $R^2_{25}$, and $R^2_{50}$ are 0.77, 0.32, and 0.025, respectively, meaning that the 1st pseudo-factor ($s_{1t}$) is most related to the five signal-factors and the 50th pseudo-factor ($s_{50t}$) is least related. $\bar{R}^2$ shown in the table refers to the mean of $R^2_i$ for equation 2.13 used for the DGP. All estimates are means over 2,000 replications: $\hat{\alpha}_{s_c} = \frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{s_c r}$ for c = 1, 25, 50, with R = 2000. GOCMT uses δ = 1/4 and p = 0.10. $R^2_i$ for equation 2.13 is drawn from three different distributions: IIDU(0.1,0.4), IIDU(0.1,0.6), and IIDU(0.1,0.8); the means, $\bar{R}^2$, are reported in the table. GOCMT(3PC) controls for the first three principal components of the 155 factors generated.
Table 2.3: Factor Strength Estimates for Noise-Factors

                                              Lasso              Adaptive Lasso         GOCMT(3PC)
factor                    $\bar{R}^2$  n/T   120   180   240     120   180   240     120   180   240
$\hat{\alpha}_{\eta_1}$      0.25      200   0.44  0.42  0.43    0.29  0.26  0.26    0.35  0.33  0.32
                                       500   0.53  0.51  0.51    0.40  0.38  0.38    0.45  0.44  0.43
                                      1000   0.57  0.56  0.56    0.46  0.44  0.44    0.51  0.50  0.49
                             0.35      200   0.45  0.44  0.44    0.29  0.28  0.28    0.33  0.32  0.31
                                       500   0.53  0.52  0.52    0.40  0.39  0.40    0.44  0.43  0.42
                                      1000   0.57  0.57  0.57    0.46  0.45  0.45    0.50  0.49  0.48
                             0.45      200   0.46  0.45  0.45    0.31  0.30  0.31    0.32  0.30  0.28
                                       500   0.54  0.53  0.53    0.41  0.41  0.41    0.43  0.41  0.40
                                      1000   0.58  0.57  0.57    0.47  0.46  0.46    0.48  0.47  0.46
$\hat{\alpha}_{\eta_{50}}$   0.25      200   0.44  0.43  0.43    0.29  0.26  0.26    0.35  0.33  0.32
                                       500   0.52  0.51  0.51    0.40  0.38  0.37    0.45  0.44  0.43
                                      1000   0.57  0.56  0.56    0.46  0.44  0.44    0.51  0.50  0.49
                             0.35      200   0.45  0.44  0.44    0.29  0.28  0.29    0.34  0.32  0.31
                                       500   0.53  0.52  0.52    0.40  0.39  0.39    0.44  0.43  0.42
                                      1000   0.57  0.57  0.57    0.46  0.45  0.45    0.50  0.49  0.48
                             0.45      200   0.46  0.45  0.45    0.31  0.30  0.31    0.32  0.30  0.29
                                       500   0.54  0.53  0.53    0.41  0.41  0.41    0.43  0.41  0.41
                                      1000   0.58  0.57  0.57    0.47  0.46  0.46    0.49  0.47  0.47
$\hat{\alpha}_{\eta_{100}}$  0.25      200   0.44  0.43  0.43    0.28  0.26  0.26    0.35  0.33  0.32
                                       500   0.52  0.51  0.51    0.40  0.38  0.37    0.45  0.44  0.43
                                      1000   0.57  0.56  0.56    0.46  0.44  0.44    0.51  0.50  0.49
                             0.35      200   0.45  0.44  0.44    0.29  0.28  0.29    0.34  0.32  0.31
                                       500   0.53  0.52  0.52    0.40  0.39  0.40    0.44  0.43  0.42
                                      1000   0.57  0.57  0.57    0.46  0.45  0.45    0.50  0.49  0.48
                             0.45      200   0.46  0.45  0.45    0.31  0.30  0.31    0.32  0.30  0.29
                                       500   0.54  0.53  0.53    0.41  0.41  0.41    0.43  0.41  0.41
                                      1000   0.58  0.57  0.57    0.47  0.46  0.46    0.49  0.48  0.46

Notes: Factor strength estimates for three noise-factors. In particular, estimates are shown for $\eta_{1t}$, $\eta_{50t}$, and $\eta_{100t}$, i.e., the 1st, 50th, and 100th noise-factors amongst the total of 100 noise-factors generated. $\hat{\alpha}_{\eta_j}$, for j = 1, 50, 100, represent the strength estimates shown in the table. The noise-factors are purely random variables. $\bar{R}^2$ shown in the table refers to the mean of $R^2_i$ for equation 2.13 used for the DGP. All estimates are means over 2,000 replications: $\hat{\alpha}_{\eta_j} = \frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{\eta_j r}$ for j = 1, 50, 100, with R = 2000. GOCMT uses δ = 1/4 and p = 0.10. $R^2_i$ for equation 2.13 is drawn from three different distributions: IIDU(0.1,0.4), IIDU(0.1,0.6), and IIDU(0.1,0.8); the means, $\bar{R}^2$, are reported in the table. GOCMT(3PC) controls for the first three principal components of the 155 factors generated.
Since GOCMT(3PC) is intended to capture all correlations between the signal-factors and the pseudo-factors, the estimator using GOCMT(3PC) should have yielded strength estimates that are close to the ones for the noise-factors. Thus, if one is willing to accept that the upper boundary for the strength estimates for the noise-factors is around 0.5, as suggested by the results in Table 2.1, then one can take the estimator using GOCMT(3PC) to be fairly reliable in estimating strengths for the pseudo-factors. However, in empirical applications, if correlations between the signal-factors and pseudo-factors are not adequately controlled for, then the estimator using GOCMT may yield estimates that are higher than the true values.
Turning to the estimates obtained from Lasso, one can observe that the estimates are a bit higher than the ones from GOCMT(3PC). Interestingly, compared to the 25th and the 50th pseudo-factors, the strength estimates are lowest for the 1st pseudo-factor, which has the highest degree of correlation with the five signal-factors. The estimates for the 25th and the 50th pseudo-factors are very close to one another, and both are higher than the ones for the 1st pseudo-factor. This appears to imply that, for Lasso, estimates are more precise for the pseudo-factors that are more related to the signal-factors, and precision decreases as the pseudo-factor becomes more like a noise-factor. Across the three pseudo-factors, the highest estimate is 0.58 when Lasso is used. Although the estimates from Lasso are the highest compared to the other methods, they still remain close to 0.5, which appears to be the upper boundary for strength estimates of purely random variables. Comparing across the three variable selection methods, for the pseudo-factors, the estimates from Adaptive Lasso are the most precise since they are the lowest.
Table 2.3 shows estimates for three noise-factors amongst the 100 noise-factors. Since the noise-factors are purely random variables, there should not be notable differences in the estimates across the 100 noise-factors. The results in Table 2.3 show that the estimates are indeed very close to one another across the three noise-factors. Across the different variable selection methods and across different combinations of n, T, and $R^2$, the estimates for the three noise-factors range from 0.26 to 0.57, with most of them lying below 0.5. Moreover, they are also close to the estimates shown in Table 2.2 for the 50th pseudo-factor, which had almost no relation to the five signal-factors.
2.4 Empirical Applications to Finance and Macroeconomics
2.4.1 Identifying risk factors in high-dimensional settings
In this section, I estimate factor strengths of 162 asset pricing factors using the estimator proposed in this study. The 162 asset pricing factors are factors that were shown in many studies to explain the cross section of expected returns.$^{5}$ Using the estimator on this large number of asset pricing factors is a good application for a number of reasons. First, the problem involves a panel data set with a large number of $y_{it}$, or stock returns, and a large number of potential factors, namely the 162 factors. This type of data structure with large n, T, and $K_n$ is exactly what the estimator requires in order for one to obtain reliable estimates of factor strength. For instance, if one only has a small number of factors available, or if one chooses to work only with a small number of factors supported by some theory, then one may choose to use the factor strength estimator originally proposed in BKP (2021). Researchers, however, are increasingly facing data where not only n and T are large, but the number of potential factors is also large, hence the need for variable selection algorithms such as Lasso and OCMT.
Second, estimating factor strengths may help to answer one of the most sought-after questions in the asset pricing literature, which is to find out which factors can independently explain the cross-section of stock returns. Over the years, the asset pricing literature has presented more than 300 factors to explain the cross section of expected stock returns (Harvey, 2019), and due to the large number of factors available, including all of them in a regression may not be feasible. This is especially true when using two-step regression methods to estimate risk premia, where the time dimension is often restricted to 5 to 10 years, which translates into 60 to 120 monthly observations. Using the method presented in this study, one can obtain factor strength estimates for all the factors presented in the literature, which will give one an idea as to which factors affect a larger number of stock returns, that is, which factors are stronger. If one is willing to argue that a risk factor should affect a relatively large number of stock returns in time series regressions, then knowing factor strengths will allow the researcher to eliminate the factors that are not strong enough when using the two-step regression methods. Doing so will reduce the set of potential factors, thereby allowing one to include all the factors in the reduced set in the time series regressions of the two-step regression methods. Moreover, Pesaran and Smith (2021) have shown that in order to obtain more precise estimates of risk premia using the two-step regression methods, one needs a larger number of stock returns, or y variables, if weaker factors are included in the factor model. This implies that knowing the strength of factors will be beneficial, since the researcher will know which factors will yield reliable risk premia estimates.

$^{5}$ It should be noted that in the Monte Carlo experiments, I used 158 asset pricing factors to calibrate the parameters. That is because the 158 factors used in the Monte Carlo experiments do not include SMB, HML, RMW, and CMA, the four factors from the well-known Fama-French five-factor model. However, this should not be a concern since, as was already noted in previous sections, the Monte Carlo results are most likely not sensitive to small changes in parameter calibrations. Adding or omitting four factors out of 158 will have a small and insignificant effect on the parameters used to design the Monte Carlo experiments shown in the previous section.
2.4.1.1 Data
The time period examined is from January 1977 to December 2020. I use 10-year rolling samples, and the sample is rolled forward by one month at a time. This gives a total of 409 rolling samples between January 1977 and December 2020. As an example, the first rolling sample spans January 1977 to December 1986, which gives a total of 120 months. Thus, the first strength estimate is obtained using that 10-year span, the second estimate is obtained using the period from February 1977 to January 1987, and so forth until December 2020.
For the asset pricing factors, I use the data compiled by Andrew Chen and Tom Zimmermann in their Open Source Asset Pricing website (Chen and Zimmermann, 2022). This data gives a total of 158 asset pricing factors, including the Market factor. According to the authors' descriptions, these asset pricing factors were constructed using the same characteristics defined in previous studies published in the literature. More specifically, these factors are long-short portfolio returns based on characteristics shown to predict the cross section of expected returns. Although this data contains factors sorted on size and book-to-market ratios, it does not contain the well-known factors in the Fama-French five-factor model (Fama and French, 2015). For this reason, I add SMB, HML, RMW, and CMA to the 158 asset pricing factors, bringing the total number of asset pricing factors in the data to 162.
For US publicly listed firms, I use those listed on NYSE and Nasdaq. The firms listed
on AMEX are not included since most of them are small or microcaps.
In order to better capture the relationship between risk factors and stock returns, it is argued that medium to large firms would be preferred, since they are less vulnerable to non-risk related forces such as pump-and-dump schemes. In each 10-year rolling sample, I first retain only the firms with monthly returns available for that 10-year span. Using these firms, I then classify those with market capitalization greater than that of the median NYSE stock as big, those with market capitalization between the median and the 20th percentile as small, and those below the 20th percentile as microcaps. This classification is used in work such as Hou, Xue, and Zhang (2015) and Fama and French (2015). I only retain the firms with market capitalization larger than the 20th percentile defined by NYSE firms. The market capitalization for the firms in each rolling sample is calculated as the average monthly market capitalization over that 10-year sample. Following the literature, only ordinary shares from CRSP are included and shares are unique for each year and month, that is, there are no repeats.
Figure 2.1: Number of Firms
Notes: Number of firms in each 10-year rolling sample. Sample period is from January 1977 to
December 2020. Thus, the first value on the graph corresponds to the 10-year sample from January
1977 to December 1986. For US publicly listed firms, firms listed on Nasdaq and NYSE are used
and those on AMEX are not included since most of them are small or microcaps. In each 10-year
rolling sample, only the firms with monthly returns available for the 10-year period are retained.
Following the classification used in work such as Hou, Xue, and Zhang (2015) and Fama and French
(2015), only the firms with market capitalization larger than the 20th percentile defined by NYSE
firms are retained. The market capitalization for the firms in each rolling sample is calculated as
the average monthly market capitalization for that period. Following the literature, only ordinary
shares from CRSP are included.
The number of firms for each rolling sample is shown in Figure 2.1. The first value in the graph, which is slightly below 1,000, corresponds to the 10-year sample that spans January 1977 to December 1986. Although the number of firms in the data fluctuates, especially over the first half of the sample, the figure shows that a good approximation for the average number of firms in a 10-year sample is around 1,000.
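As a hedged illustration of this size screen, the sketch below filters one 10-year window of CRSP-style data. The DataFrame and column names (crsp, permno, ret, mktcap, exchange) are hypothetical placeholders, not the exact names used in this study.

import pandas as pd

def size_screen(crsp):
    """Apply the full-history and NYSE-20th-percentile screens to one window."""
    # keep only firms with a full 120 months of returns in the window
    counts = crsp.groupby("permno")["ret"].count()
    full = counts[counts == 120].index
    df = crsp[crsp["permno"].isin(full)]
    # average monthly market cap over the window, per firm
    avg_cap = df.groupby("permno")["mktcap"].mean()
    # 20th percentile breakpoint defined by NYSE firms only
    nyse = df[df["exchange"] == "NYSE"]["permno"].unique()
    cutoff = avg_cap[avg_cap.index.isin(nyse)].quantile(0.20)
    # retain firms above the microcap breakpoint
    keep = avg_cap[avg_cap > cutoff].index
    return df[df["permno"].isin(keep)]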
2.4.1.2 Factor Models for Individual Securities
For the empirical analysis, I assume the following K-factor asset pricing model:

$$r_{it} - r_{ft} = a_i + \sum_{k=1}^{K} \beta_{ik} f_{kt} + \epsilon_{it}, \quad \text{for } i = 1, 2, \ldots, n_\tau, \qquad (2.17)$$
where $r_{it} - r_{ft}$ represents the excess stock return, $f_{kt}$, for k = 1, 2, ..., K, represents the K factors assumed to explain stock returns, $n_\tau$ represents the number of securities in the 10-year rolling samples from January 1977 to December 2020, and $\tau = 1, \ldots, 409$, since there are a total of 409 10-year rolling samples. Given such a factor model, BKP (2021) estimate the K factor strengths by looking at the statistical significance of the beta estimates. In contrast, given the large number of potential factors (162), I do not assume a particular factor model. I instead use the estimator proposed in this study to determine which are the important variables, and ultimately obtain strength estimates using the entire set of 162 potential asset pricing factors. If a particular factor is selected by a variable selection algorithm, this is analogous to that factor being statistically significant in BKP (2021). For estimates using OCMT, I use two different methods to control for the strong correlations amongst the factors. The first method is to use the Market excess return as a control variable, since it is already widely accepted as an important asset pricing factor in the literature. Assuming that the Market factor is one of the strong common factors amongst the remaining 161 factors in the data, the OCMT procedure using the Market as a control variable can be called GOCMT, as described in Section 2.2.2.3. For the second method, I use principal components estimated from the 162 factors as proxies for the common factors and control for them when using OCMT; this procedure is also called GOCMT. Figure 2.2 shows the number of common factors estimated from the 162 factors in the data. The method introduced in Onatski (2012) estimates an average of around three unobserved common factors across the rolling samples, while the method in Bai and Ng (2002) estimates an average of around seven. According to Onatski (2012), the methods in Bai and Ng (2002) severely overestimate the number of unobserved common factors when the residuals in the factor model are weakly correlated.
Figure 2.2: Number of Unobserved Common Factors
Notes: Number of common factors estimated amongst the 162 asset pricing factors via two methods for each 10-year rolling sample from January 1977 to December 2020. One method was introduced in Onatski (2010), and the other in Bai and Ng (2002). For the method in Bai and Ng (2002), $IC_{p1}$ was employed.
I argue that weak factors must be present amongst the 162 factors, and I therefore estimate three principal components and control for them when using OCMT to estimate factor strengths. To differentiate between the two methods, the first will be referred to as GOCMT(Market), and the second as GOCMT(3PC). For Lasso and Adaptive Lasso, all of the 162 factors are considered in the error minimization problem, and thus the active set for these two methods is simply the original set of 162 factors available in the data. In total, I show four estimates of factor strengths using the four methods.
Estimate 1 (GOCMT(3PC)): The strength estimator presented in this study uses GOCMT with three principal components as control variables. The first three principal components are estimated from the 162 factors in the data. For each excess return, $r_{it} - r_{ft}$, OCMT is applied to the active set $\mathcal{A}^z = \{x^z_{1t}, x^z_{2t}, \ldots, x^z_{162,t}\}$, where $x^z_{qt}$, for q = 1, 2, ..., 162, represents the set of factors after filtering out the three principal components using the methods shown in Section 2.2.2.3. The estimator then uses equations 2.7 and 2.8 to arrive at strength estimates.

Estimate 2 (GOCMT(Market)): The strength estimator presented in this study uses GOCMT with the Market factor as a control variable. For each excess return, $r_{it} - r_{ft}$, OCMT is applied to the active set $\mathcal{A}^z = \{x^z_{2t}, x^z_{3t}, \ldots, x^z_{162,t}\}$, where $x^z_{qt}$, for q = 2, 3, ..., 162, represents the set of factors after filtering out the Market factor using the methods shown in Section 2.2.2.3. Since the Market factor, $f_{1t}$, is a control variable, it is not included in the active set. The estimator then uses equations 2.7 and 2.8 to arrive at strength estimates.

Estimate 3 (Lasso): The strength estimator presented in this study uses Lasso. For each excess return, $r_{it} - r_{ft}$, Lasso is applied to the active set $\mathcal{A} = \{x_{1t}, x_{2t}, \ldots, x_{162,t}\}$. The estimator then uses equations 2.7 and 2.8 to arrive at strength estimates.

Estimate 4 (Adaptive Lasso): The strength estimator presented in this study uses Adaptive Lasso. For each excess return, $r_{it} - r_{ft}$, Adaptive Lasso is applied to the active set $\mathcal{A} = \{x_{1t}, x_{2t}, \ldots, x_{162,t}\}$. The estimator then uses equations 2.7 and 2.8 to arrive at strength estimates.
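To fix ideas, the following is a minimal sketch of Estimate 3 under the assumption that the strength estimate takes the BKP-style form $\hat{\alpha}_k = 1 + \ln(\hat{\pi}_k)/\ln(n)$, where $\hat{\pi}_k$ is the fraction of the n regressions in which factor k is selected; the exact formulas used here are those in equations 2.7 and 2.8, and the function and variable names below are illustrative.

import numpy as np
from sklearn.linear_model import LassoCV

def strength_estimates(R, F):
    """R: T x n excess returns; F: T x K candidate factors."""
    T, n = R.shape
    K = F.shape[1]
    selected = np.zeros(K)
    for i in range(n):
        fit = LassoCV(cv=5).fit(F, R[:, i])     # tune the penalty by CV
        selected += (fit.coef_ != 0)            # count non-zero loadings
    pi_hat = selected / n                       # fraction of returns affected
    with np.errstate(divide="ignore"):
        alpha = 1 + np.log(pi_hat) / np.log(n)  # BKP-style strength exponent
    return np.where(pi_hat > 0, alpha, 0.0)     # strength 0 if never selected

The GOCMT variants differ only in that each factor and each return is first residualized on the control variables (the Market factor or the principal components) before the selection step.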
It should be noted that when measuring factor strengths for potential asset pricing factors, a slightly different definition of factor strength is needed if one is concerned about risk premia estimation via two-step regression methods. Pesaran and Smith (2021) point out that, for this purpose, the strength of factor k is defined by $\sum_{i=1}^{n}(\beta_{ik} - \bar{\beta}_k)^2 = \ominus(n^{\alpha_k})$. In other words, a factor is deemed stronger if more of its betas deviate from their mean. This definition is more appropriate when one is using the two-step regression method to estimate the risk premia of factors, because even if the loadings are all non-zero across the n excess returns, there needs to be a sufficient degree of variation around the mean for the risk factor to be priced. Pesaran and Smith (2021) show that when the Market factor is used as a control variable, the other coefficient estimates automatically become deviations from their means, in which case one can use $\sum_{i=1}^{n}\beta_{ik}^2 = \ominus(n^{\alpha_k})$ to measure the strength of factors, which is the definition used in this study. Since strength estimates using GOCMT(3PC) do not use the Market Return factor as a control variable, those estimates may simply represent non-zero factor loadings, and do not inform one about how many of the loadings differ from their means.
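As a simple illustration of this order condition (my own example, not taken from Pesaran and Smith, 2021): suppose that, with the Market factor controlled for, the first $\lfloor n^{\alpha_k} \rfloor$ securities have loadings $\beta_{ik} = c \neq 0$ on factor k and the remaining loadings are zero. Then

$$\sum_{i=1}^{n} \beta_{ik}^{2} = c^{2}\,\lfloor n^{\alpha_k} \rfloor = \ominus\left(n^{\alpha_k}\right), \qquad \text{so} \qquad \alpha_k \approx \frac{\ln\left(\sum_{i=1}^{n}\beta_{ik}^{2}/c^{2}\right)}{\ln n},$$

which makes explicit how the strength exponent $\alpha_k$ measures the share of securities whose loadings deviate from zero.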
2.4.1.3 Estimates of Factor Strengths
To get an overall picture of strength estimates across the four different methods, Table 2.4 shows summary statistics of the estimates. Since the Market Return factor is the most recognized factor in the literature, its statistics are separated from those of the remaining 161 asset pricing factors. It should be noted that these statistics were obtained from all estimates across the 409 rolling samples. For instance, the statistics for the Market Return factor were obtained from its 409 strength estimates, and the statistics for the remaining 161 factors were obtained using 161 × 409, or 65,849, strength estimates. Thus, the mean strength estimate for the Market Return factor is the mean of the 409 estimates, and the mean for the remaining 161 factors is the mean of the 65,849 estimates. The other statistics were obtained in a similar fashion.
Focusing first on the Market Return factor, the estimates from the three variable selection methods show that the maximum values are very near 1, while the minimum values are near 0.9. The means of the strength estimates using GOCMT(3PC), Lasso, and Adaptive Lasso are 0.968, 0.964, and 0.955, respectively. One can see that the estimates are in general lower for Lasso compared to those from GOCMT(3PC), while the variance of the strength estimates from Lasso is much higher than that from GOCMT(3PC). Table A.2 in the Appendix shows summary statistics of factor strength estimates when GOCMT uses different numbers of principal components as control variables; specifically, it shows strength estimates when GOCMT uses two, three, and four principal components. One can see that the strength estimates decline when more principal components are used for GOCMT when estimating factor strengths. However, the strength estimates are not too different from one another when different numbers of principal components are used. Future work could estimate factor strengths using a different number of principal components for each rolling sample, as Figure 2.2 suggests that the number of unobserved common factors varies across the rolling samples.
Figure 2.3 gives a more detailed picture of the strength estimates of the Market Return factor over the rolling samples. It appears that factor strength slumps during financial turmoil, e.g., when the data begins to include the dot-com bubble burst and the 2008 financial crisis. This suggests that during these turbulent times, some individual stock returns become decoupled from the Market Return factor. During normal periods of financial stability, the strength estimates for the Market Return factor remain near 1, which suggests that one can consider the Market Return factor to be the strong factor. In future work, it would be interesting to find out which particular types of firms become decoupled from the Market Return factor. Defensive stocks are one type of firm that can become decoupled from overall market conditions. During periods of financial stability when the overall market is going up, the stock prices of these firms will also likely go up, although at a much slower pace compared to, for instance, growth firms. However, during times of financial instability, defensive stocks may be affected less, or may even appreciate in price. For this reason, if a particular 10-year rolling sample includes periods of both bull and bear markets, then the correlation between the Market Return factor and defensive stocks may become closer to zero, which would contribute to a lower strength estimate for the Market factor. Another improvement to this study would be to estimate factor strengths on an annual basis using daily stock returns. This would provide a more granular and thus more precise picture of how strength estimates evolve over time.
For the remaining 161 factors, the summary statistics are very different. Comparing estimates obtained using the different methods, the means of the estimates for these factors range from 0.55 to 0.7, which is much lower than the means for the Market Return factor.
Table 2.4: Summary statistics of factor strength estimates for the 162 factors
Minimum, maximum, mean, and standard deviation of strength estimates are shown for the Market Return factor and the remaining 161 factors. For the Market Return factor, the summary statistics are obtained using the 409 estimates across the 10-year rolling samples from January 1977 to December 2020. For the remaining 161 factors, summary statistics are obtained using all 161 × 409 strength estimates across the 409 rolling samples. The set of $y_{it}$ are U.S. stock returns obtained from CRSP (details about the stock returns can be found in Section 2.4.1.1). GOCMT(3PC) refers to the case where the first three principal components of the 162 factors are controlled for when using OCMT for the strength estimator. GOCMT(Market) refers to the case where the Market Return factor is controlled for when using OCMT for the strength estimator; thus, a strength estimate of the Market Return factor using GOCMT(Market) is not available. Lasso and Adaptive Lasso refer to the cases where the estimator uses Lasso and Adaptive Lasso, respectively.

                         Market Return Factor                        Remaining 161 Factors
             GOCMT(3PC)   Lasso   Adaptive Lasso   GOCMT(Market)   GOCMT(3PC)   Lasso   Adaptive Lasso
Minimum        0.923      0.895       0.876            0.263          0.161      0           0
Maximum        0.997      0.997       0.994            0.916          0.982      0.850       0.830
Mean           0.968      0.964       0.955            0.697          0.673      0.599       0.550
Std. Dev.      0.019      0.031       0.036            0.103          0.119      0.149       0.160
Thus, on average, one can only label these factors as semi-strong factors. Across the four different variable selection methods, the maximum of the estimates ranges from 0.83 to 0.982. The minimum of the estimates ranges from 0 to 0.263, indicating that there are weak factors in some of the rolling samples, that is, factors with strength estimates close to, or even lower than, those of purely random variables. To recall, the Monte Carlo results in Table 2.1 showed that for the case where n = 1,000, T = 120, and $\bar{R}^2$ = 0.35, which is the setting most similar to the one in this empirical exercise, the strength estimates for the noise-factors ranged from 0.46 to 0.58 across the three variable selection methods. Therefore, given a particular 10-year rolling sample, factors with estimates close to or lower than 0.5 can be weak signal-factors, pseudo-factors, or simply factors that are close to being noise-factors.
Table A.3 in the Appendix provides a more detailed picture of the strength estimates for the factors in the data.
Figure 2.3: Strength Estimate for Market Return Factor
Notes: Factor strength estimates for the Market Return factor across the 10-year rolling samples from January 1977 to December 2020. Since GOCMT(Market) uses the Market excess return factor as a control variable, strength estimates for the Market Return factor are only available when the strength estimator uses GOCMT(3PC), Lasso, or Adaptive Lasso. GOCMT(3PC) refers to the case where the estimator uses the first three principal components as control variables when using OCMT.
In Table A.3, for each factor, the mean of strength estimates across the 409 rolling samples is computed. Since the mean strength estimate for the Market Return factor is already reported in Table 2.4, Table A.3 shows the mean estimates for the remaining 161 factors. The factors are ordered from highest to lowest strength estimate as obtained using GOCMT(3PC). When the estimates are averaged across the rolling samples, the Frazzini-Pedersen Beta factor has the highest strength estimate under GOCMT(3PC). For this factor, the estimate using GOCMT(Market) is also high, at 0.809. Strangely, the estimate using Lasso gives this factor a rather low strength of 0.610. This discrepancy between strength estimates obtained from GOCMT and from Lasso is a challenge for this empirical study. In other words, there are a number of other factors that are identified as semi-strong when the strength estimator uses GOCMT(3PC) and GOCMT(Market), yet the same factors are identified as weak factors when Lasso is used. For instance, the well-known Size factor has strength estimates of 0.797 and 0.706 when GOCMT(3PC) and GOCMT(Market) are used, but its strength estimate is only 0.447 when Lasso is used. The Price factor is identified as a semi-strong factor when GOCMT is used, but it has the lowest estimate of 0.133 when Lasso is used. Overall, many factors can be identified as semi-strong, with estimates between 0.5 and 0.8, but there are also weak signal-factors, pseudo-factors, or noise-factors with estimates near or below 0.5. This is especially true for the strength estimates obtained with Lasso, since these tend to be lower than those from GOCMT. All these differences between the estimates using GOCMT and Lasso imply that one should take caution before interpreting the results. Finally, one should also note that the factors with strength estimates close to 0.5 or even lower should not be ignored, since some of them may still carry risk premia in the cross section of expected returns. Nonetheless, it would be difficult to argue that such factors with low strength estimates are risk factors representing pervasive forces affecting the economy.
To better examine how factor strengths change over time, Figures 2.4, 2.5, and 2.6 display strength estimates across the rolling samples for some of the factors in the data. In particular, Figure 2.4 shows strength estimates for the 10 factors with the highest estimates when GOCMT(3PC) is used for the strength estimator. The list excludes the Market Return factor, since its estimates over time were already shown in Figure 2.3. The 10 factors are Frazzini-Pedersen Beta, CAPM beta, Days with zero trades 3, Tail risk beta, Volume Variance, Volume to market equity, Past trading volume, Days with zero trades 1, HML, and Book leverage (annual). For Figure 2.4, one should note that the y-scales on the left and on the right are different: the one on the left corresponds to the strength estimates when GOCMT(3PC) and GOCMT(Market) are used, and the one on the right corresponds to the strength estimates when Lasso and Adaptive Lasso are used. The reason for the separation, as can be seen from the graphs, is that the estimates obtained using GOCMT tend to be higher than those obtained using Lasso. For the estimates obtained using GOCMT(Market) and GOCMT(3PC), one can see that, except for the Book leverage (annual) factor, the estimates rise after the 10-year rolling samples begin to include the year 2000, which is around the point in time where the strength estimate for the Market Return factor begins to fall, as shown in Figure 2.3. This pattern appears to suggest that the strength of anomalies begins to rise when the strength of the Market Return factor begins to fall. In other words, the anomalies' strength in explaining stock returns appears to rise as stock returns begin to decouple from the Market Return factor. As for the strength estimates obtained using Lasso and Adaptive Lasso, the two sets of estimates closely follow one another, with the Adaptive Lasso estimates lying below the Lasso estimates. Apart from this, for the strength estimates when Lasso is used, one cannot see any other noticeable pattern across time for the ten factors.
Figure 2.5 shows strength estimates for the 10 factors with the highest estimates when GOCMT(Market) is used for the strength estimator. The 10 factors are SMB, HML, Book leverage (annual), CAPM beta, Idiosyncratic risk (3-factor), Idiosyncratic risk (AHT), Idiosyncratic risk, Frazzini-Pedersen Beta, Market leverage, and Volume to market equity. In contrast to the 10 factors in Figure 2.4, it is not easy to see a noticeable pattern in the strength estimates across the rolling samples.
Figure 2.6 shows strength estimates for the 10 factors with the highest estimates when Lasso is used for the strength estimator. The y-scales for the Lasso and Adaptive Lasso estimates in Figure 2.6 differ from those in Figures 2.4 and 2.5: the y-scale ranges from 0.5 to 1 in Figure 2.6, since the Lasso and Adaptive Lasso estimates for these 10 factors are higher than the ones seen in the previous figures.
Figure 2.4: Strength estimates of factors with highest estimates when GOCMT(3PC) was used for the strength estimator
Notes: Factor strength estimates for the 10 factors with the highest strength estimates when the estimates were obtained and ordered using GOCMT(3PC). The 10 graphs show strength estimates for the 10 factors when GOCMT(3PC), GOCMT(Market), Lasso, and Adaptive Lasso were used for the strength estimator. GOCMT(3PC) and GOCMT(Market) refer to strength estimates when three principal components and the Market Return factor, respectively, were used as control variables when using OCMT. The principal components were estimated from the 162 factors. Each estimate was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020. The black y-ticks on the left correspond to strength estimates when GOCMT is used, and the blue y-ticks on the right correspond to strength estimates when Lasso and Adaptive Lasso are used for the strength estimator.
Figure 2.5: Strength estimates of factors with highest estimates when GOCMT(Market) was used for the strength estimator
Notes: Factor strength estimates for the 10 factors with the highest strength estimates when the estimates were obtained and ordered using GOCMT(Market). The 10 graphs show strength estimates for the 10 factors when GOCMT(3PC), GOCMT(Market), Lasso, and Adaptive Lasso were used. GOCMT(3PC) and GOCMT(Market) refer to strength estimates when three principal components and the Market Return factor, respectively, were used as control variables when using OCMT. The principal components were estimated from the 162 factors. Each estimate was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020. The black y-ticks on the left correspond to strength estimates when GOCMT is used, and the blue y-ticks on the right correspond to strength estimates when Lasso and Adaptive Lasso are used for the strength estimator.
Figure 2.6: Strength estimates of factors with highest estimates when Lasso was used for the strength estimator
Notes: Factor strength estimates for the 10 factors with the highest strength estimates when the estimates were obtained and ordered using Lasso. The 10 graphs show strength estimates for the 10 factors when GOCMT(3PC), GOCMT(Market), Lasso, and Adaptive Lasso were used. GOCMT(3PC) and GOCMT(Market) refer to strength estimates when three principal components and the Market Return factor, respectively, were used as control variables when using OCMT. The principal components were estimated from the 162 factors. Each estimate was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020. The black y-ticks on the left correspond to strength estimates when GOCMT is used, and the blue y-ticks on the right correspond to strength estimates when Lasso and Adaptive Lasso are used for the strength estimator.
Looking at the strength estimates obtained using Lasso and Adaptive Lasso, it is clear that they range from 0.7 to 0.8, and in all 10 graphs they appear to be fairly stable across the rolling samples. For the strength estimates obtained using GOCMT(Market) and GOCMT(3PC), it is not clear whether there is any clear pattern across the rolling samples.
Comparing the strength estimates obtained using the four different variable selection methods, there is no single factor that is chosen by all methods to be one of the 10 factors with the highest strength estimates; in other words, no factor appears in all of Figures 2.4, 2.5, and 2.6. This is concerning, since it implies that without knowing which variable selection method yields a precise estimate of factor strength, one would not be able to know which estimates to rely on. One definitive conclusion one can draw is that the Market Return factor is the only factor one can be sure has a factor strength close to 1 for most of the rolling samples, and it is therefore the only factor that can be considered the strong factor. For the other factors, although it is difficult to identify their rank in terms of strength, one can conclude that many factors' strengths lie between 0.5 and 0.8, and there also appear to be a number of weak signal-factors, pseudo-factors, or noise-factors with strengths below 0.5 when Lasso and Adaptive Lasso are used for the strength estimator. One possible way to rank the factors in terms of strength is to examine the R-squared statistic for each factor. Comparing a factor that is strong or semi-strong with one that is weak, one would expect the mean R-squared statistic across the set of $y_{it}$ to be higher when the stronger factor is used to explain the y variables in a panel data setting. The next section addresses this topic and examines the relationship between factor strengths and the R-squared statistic.
2.4.2 R-squared Analysis
Pesaran and Smith (2019) show how factor strengths and the pooled R-squared statistic are related. The pooled R-squared statistic is defined as the R-squared statistic obtained by accounting for variation in the panel regression setting as a whole. In particular, consider the following K-factor model and its mean over the time series dimension:

$$r_{it} = d_i + \sum_{k=1}^{K} \beta_{ik} f_{kt} + \epsilon_{it}, \qquad (2.18)$$
Then, $PR^2$ is given by

$$PR^2 = 1 - \frac{(nT)^{-1}\sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2}{(nT)^{-1}\sum_{t=1}^{T}\sum_{i=1}^{n}\left(r_{it} - \bar{r}_{iT}\right)^2}, \qquad (2.19)$$
where $\bar{r}_{iT} = T^{-1}\sum_{t=1}^{T} r_{it}$. Equation 2.19 shows that the pooled R-squared statistic, $PR^2$, is analogous to the $R^2$ of an individual time series regression. The difference is that the SSR and SST take into account all $n \times T$ errors and all of the $y_{it}$'s deviations from their means. Thus, $PR^2$ is different from simply taking the mean of the n $R^2$ statistics obtained from the n time series regressions. The authors further show that under reasonable assumptions and large n and T, $PR^2$ becomes
=
n
−1
P
n
i=1
β
′
i
Σ
ft
β
i
/¯ σ
2
n
+o
p
(1)
1 +n
−1
P
n
i=1
β
′
i
Σ
ft
β
i
/¯ σ
2
n
+o
p
(1)
(2.20)
where $\bar{\sigma}_n^2$ is the average variance of $\epsilon_{it}$ across the n $y_{it}$, $\beta_i$ is the vector of factor loadings corresponding to asset i, and $\Sigma_{ft}$ is the variance matrix of the factors. Equation 2.20 states that $PR^2$ depends on the factor loadings, or the strengths of the factors included in a model. Specifically, as n becomes large, the factors with strengths less than 1 will not contribute to $PR^2$. This can be seen through Equation 2.21 below, which shows that if one assumes $f_{1t}$ is the only strong factor in the model, then only $\beta_{i1}$ is relevant for $PR^2$:
$$PR^2 \to \frac{\left(\frac{\mathrm{Var}(f_{1t})}{\bar{\sigma}_n^2}\right)\left[\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n}\beta_{i1}^2\right]}{1 + \left(\frac{\mathrm{Var}(f_{1t})}{\bar{\sigma}_n^2}\right)\left[\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n}\beta_{i1}^2\right]}. \qquad (2.21)$$
In this section, I validate this theoretical relationship by estimating $PR^2$ using the asset pricing factors and stock returns that were used in the previous section. In particular, I compare $PR^2$ across different one-factor models and check whether there is a notable difference between the $PR^2$ obtained from the CAPM and from other one-factor models. The theoretical predictions in Pesaran and Smith (2021) rely on the true factor loadings, while the one-factor models that I use may yield factor loading estimates that differ from the true values due to omitted variable bias. Nonetheless, I begin with one-factor models, since inspecting them is an important first step toward gaining insight into the relationship between $PR^2$ and factor strengths.
As was shown in the previous section, the Market Return factor is the only factor with strength estimates near 1 across the 10-year rolling samples, and the upper boundary for the strength estimates of all the other factors appears to be around 0.8. This implies that one would expect the $PR^2$ from the CAPM to be greater than that obtained from any other one-factor model. Figures 2.7, 2.8, and 2.9 show the estimates of $PR^2$ under various one-factor models. Figure 2.7 shows 10 graphs of $PR^2$ using 10 one-factor models, where the 10 factors are the ones estimated to have the highest strength estimates when GOCMT(3PC) was used for the strength estimator. Figure 2.8 shows the corresponding 10 graphs for the factors with the highest strength estimates under GOCMT(Market), and Figure 2.9 shows the corresponding graphs for Lasso. Thus, the factors in Figures 2.7, 2.8, and 2.9 are the same as the ones shown in Figures 2.4, 2.5, and 2.6. In each of the 10 graphs in all three figures, the $PR^2$ obtained from the CAPM is also shown, in order to compare the $PR^2$ obtained from the various one-factor models with that obtained from the CAPM. For all the graphs, each estimate of $PR^2$ corresponds to a 10-year rolling sample, and the entire time period examined is from January 1977 to December 2020.
Looking at the graphs in Figure 2.7, the first notable feature is that the $PR^2$ values obtained from the one-factor models are all lower than the $PR^2$ obtained from the CAPM, which is to be expected since the Market Return factor has the highest strength estimates. Equation 2.21 shows that the larger n and T are, the more pronounced this difference between the $PR^2$ from the CAPM and from other one-factor models will become. The results in Figure 2.8 are similar to those in Figure 2.7, i.e., the $PR^2$ obtained from the CAPM is far greater, for most of the rolling samples, than the $PR^2$ obtained from all other one-factor models. Finally, one can observe a similar pattern in the graphs in Figure 2.9. It should be noted that there are some one-factor models with $PR^2$ very close to the $PR^2$ from the CAPM for some of the 10-year rolling samples. These are, for instance, the one-factor models using the Frazzini-Pedersen Beta, CAPM beta, Days with zero trades 3, Tail risk beta, Volume Variance, Volume to market equity, and Days with zero trades 1 factors. This implies that, compared to other factors, these factors are able to explain stock returns just as well as the Market Return factor for some rolling samples. This in turn suggests that, in addition to factor strengths, one can perhaps focus on $PR^2$ to pick out the important factors. If a factor has both a high strength estimate and a relatively high $PR^2$, then it should demand more attention than factors with low strength estimates and/or low $PR^2$. This is especially true since, as shown in previous sections, the rank ordering of the factors in terms of strength estimates is difficult to identify, because the variable selection algorithms yielded strength estimates that differ from one another.
2.4.3 Applications to Other Settings
While the estimator proposed in this study has been applied to identifying risk factors in the asset pricing literature, there are many other settings where the estimator may be useful. In particular, whenever a researcher faces a panel data set with a large number of potential x variables, the estimator can be used to estimate the strengths of all the x variables available. McCracken and Ng (2016) have compiled data that includes 187 macroeconomic variables. Suppose one obtains a panel data set with county-level wages in the United States.
Figure 2.7: Pooled R-squared using the CAPM and one-factor models using factors with highest estimates when GOCMT(3PC) was used for the strength estimator
Notes: Pooled R-squared statistic obtained using the CAPM and 10 one-factor models. Pooled $R^2$, or $PR^2 = 1 - SSR_{pooled}/SST_{pooled}$, where $SSR_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2$ and $SST_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n}(r_{it} - \bar{r}_{iT})^2$. For each graph, the bold line represents $PR^2$ from the CAPM and the dashed line represents $PR^2$ from a one-factor model; the factor used in each one-factor model is shown as the title above the graph. Each factor is one of the 10 factors shown to have the highest strength estimates when GOCMT(3PC) was used to estimate strengths. GOCMT(3PC) refers to the case where OCMT uses the first three principal components, estimated from the 162 factors, as control variables. Each $PR^2$ was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020.
Figure 2.8: Pooled R-squared using the CAPM and one-factor models using factors with highest estimates when GOCMT(Market) was used for the strength estimator
Notes: Pooled R-squared statistic obtained using the CAPM and 10 one-factor models. Pooled $R^2$, or $PR^2 = 1 - SSR_{pooled}/SST_{pooled}$, where $SSR_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2$ and $SST_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n}(r_{it} - \bar{r}_{iT})^2$. For each graph, the bold line represents $PR^2$ from the CAPM and the dashed line represents $PR^2$ from a one-factor model; the factor used in each one-factor model is shown as the title above the graph. Each factor is one of the 10 factors shown to have the highest strength estimates when GOCMT(Market) was used to estimate strengths. GOCMT(Market) refers to the case where OCMT uses the Market Return factor as a control variable. Each $PR^2$ was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020.
Figure 2.9: Pooled R-squared using the CAPM and one-factor models using factors with highest estimates when Lasso was used for the strength estimator
Notes: Pooled R-squared statistic obtained using the CAPM and 10 one-factor models. Pooled $R^2$, or $PR^2 = 1 - SSR_{pooled}/SST_{pooled}$, where $SSR_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n} u_{it}^2$ and $SST_{pooled} = \sum_{t=1}^{T}\sum_{i=1}^{n}(r_{it} - \bar{r}_{iT})^2$. For each graph, the bold line represents $PR^2$ from the CAPM and the dashed line represents $PR^2$ from a one-factor model; the factor used in each one-factor model is shown as the title above the graph. Each factor is one of the 10 factors shown to have the highest strength estimates when Lasso was used to estimate strengths. Each $PR^2$ was obtained using 10-year rolling samples and the entire time period is from January 1977 to December 2020.
This setting presents a large panel as well as a large number of potential x variables, namely the 187 macroeconomic variables. One can then apply the estimator proposed in this study to estimate factor strengths for the 187 macroeconomic variables. Doing so may give important hints as to which macroeconomic variables affect wage aggregates at the county level.
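A hedged sketch of such an application, reusing the strength_estimates function from the earlier Lasso sketch on synthetic stand-ins for the wage panel and the macro variables (all names and dimensions here are hypothetical):

import numpy as np

# Hypothetical dimensions: T months, 187 macro variables, n counties.
rng = np.random.default_rng(1)
T, K, n = 240, 187, 500
M = rng.standard_normal((T, K))                  # candidate macro variables
W = M[:, :3] @ rng.standard_normal((3, n)) + \
    rng.standard_normal((T, n))                  # wages driven by 3 of them

# alpha_hat = strength_estimates(W, M)           # from the earlier sketch
# ranking = np.argsort(-alpha_hat)               # macro variables by strength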
2.5 Conclusion
In a panel data setting, one would benefit greatly from knowing the strengths of the factors included in the factor model. For instance, knowing factor strengths offers more granular information about the factors than one would obtain simply by running a panel regression, such as pooled OLS. Knowing factor strengths informs one how many of the $y_{it}$ each factor affects; that is, for each factor, one can know how many of the n factor loadings are non-zero.
One can also estimate strengths for sub-sections of the data. For instance, given stock returns as the set of $y_{it}$, one can split the data according to different industries and estimate the strengths of the factors for each of those industries. In the asset pricing literature, factor strengths determine the precision of risk premia estimates when one uses two-pass regression methods.
Bailey et al. (2021) offered a method to estimate factor strengths when one is given a particular factor model, such as the Fama-French three- or five-factor models. This study extends the estimator presented by Bailey et al. (2021) to also account for high-dimensional regression settings. Considering a large number of potential factors is important because researchers are increasingly facing larger data. By using variable selection algorithms such as Lasso, Adaptive Lasso, and OCMT, the estimator proposed in this study provides estimates of factor strengths no matter how large the number of potential factors given to the researcher. It also does not require one to make any assumption about the composition of the true factor model.
The estimator is validated by extensive Monte Carlo experiments designed to mimic the empirical setting as closely as possible. In particular, a five-factor model was used to explain a large number of y variables, and the parameters used for the data generating process are taken from the Fama-French five-factor model, with US stock returns as the set of $y_{it}$. Results from the experiments show that the estimator is precise when the signals' true strengths are around 0.7 or above. Specifically, for the four signal-factors with strengths 1, 0.9, 0.8, and 0.7, the estimator is precise for all variable selection methods employed. For the fifth signal-factor with true strength set to 0.45, the estimates overestimate the true value for all the selection methods. For instance, the lowest estimates are 0.54 when using Lasso, 0.47 when using Adaptive Lasso, and 0.50 when using GOCMT(3PC), while the true strength is 0.45.
For the pseudo-factors and noise-factors, the estimates appear to be similar to one another. For the pseudo-factors, focusing on the case where the $R^2$ is set at 0.35, the estimates range from 0.43 to 0.56 when using Lasso, 0.26 to 0.43 when using Adaptive Lasso, and 0.33 to 0.51 when using GOCMT(3PC). Since pseudo-factors do not contribute to the data generating process, their true factor strengths should be close to those of noise-factors when correlations are held constant. In other words, their factor strengths should be non-zero only because the variable selection algorithms pick up pseudo-factors simply due to chance. For the noise-factors, focusing on the case where the $R^2$ is set at 0.35, the estimates range from 0.44 to 0.58 when using Lasso, 0.28 to 0.46 when using Adaptive Lasso, and 0.31 to 0.50 when using GOCMT(3PC). These results suggest that for factors with strength estimates around 0.5 or lower, one cannot tell whether the factor is a weak signal-factor with low strength, or is simply a pseudo-factor or a noise-factor. The results also imply that in empirical work, if one's objective is to identify the factors that actually affect some number of the $y_{it}$, one should focus on those with strength estimates near 0.7 or higher, that is, semi-strong factors and strong factors are the ones to take notice of.
For an empirical application, the estimator is applied to the so-called factor zoo, or the large number of asset pricing factors introduced in studies over the years. The data consists of 162 factors, and the goal is to estimate strengths for all the factors using US publicly listed stock returns as the y variables. Using 10-year rolling samples from January 1977 to December 2020, it is shown that only the Market factor can be considered a strong factor, with strength estimates near 1 for most of the rolling samples. All the other factors are semi-strong at best, that is, the strength estimates lie between 0.5 and 0.8 for many factors. When the strength estimator uses Lasso and Adaptive Lasso, there are also factors with strength estimates lower than 0.5. In contrast to the Monte Carlo results, the empirical exercise shows that estimates differ noticeably depending on which variable selection method is used. In general, the estimates when GOCMT(3PC) is used are higher than the ones when Lasso and Adaptive Lasso are used, and the variance of estimates amongst the factors is larger when Lasso and Adaptive Lasso are used for the strength estimator. The most notable difference is the ordering of the factors in terms of strength estimates. In other words, some factors are identified as semi-strong when GOCMT(3PC) is used, but the same factors are identified as weak factors when Lasso and Adaptive Lasso are used for the strength estimator. This discrepancy of estimates implies that without knowing which variable selection method is the correct one to use, it is difficult to rely on the strength estimates. The only conclusion one can draw from the empirical exercise is that the Market Return factor can be considered the strong factor, while the remaining factors consist of semi-strong factors, weak factors, and noise-factors.
The asset pricing factors are further examined to see if there is a relationship between the pooled $R^2$ and factor strengths. In particular, the pooled $R^2$ from the CAPM is compared to the pooled $R^2$ of various one-factor models using factors different from the Market Return factor. By comparing the pooled $R^2$, or $PR^2$, of various one-factor models, it is shown that, just as Pesaran and Smith (2021) predicted, the $PR^2$ obtained from the CAPM is noticeably larger than the pooled $R^2$ obtained from other one-factor models. This result was expected since the Market Return factor was shown to be the only strong factor, while the others were semi-strong at best. The results in this exercise imply that one should focus more on the factors with relatively high $PR^2$. This is especially true since different variable selection methods led to different factor strength estimates. In other words, if one's goal is to find the important factors affecting the set of $y_{it}$, it is recommended that one focuses on factors with both high $PR^2$ and high strength estimates.
There are numerous ways in which future work can improve this study. The most urgent issue is how to settle the discrepancies in factor strength estimates obtained using different variable selection methods. Since precise estimates of factor strengths depend on knowing which factors have non-zero factor loadings, the precision of strength estimates ultimately depends on the precision of the variable selection methods. If the differences in strength estimates across the variable selection methods cannot be resolved, then one can perhaps think about which methods are more reliable in different settings. One can also focus only on factors with relatively high strength estimates for all the variable selection methods used, such as the Market Return factor. Future work can also consider more elaborate settings in the Monte Carlo experiments, such as allowing for serial correlation in the error terms, and allowing for semi-strong and weak common factors amongst the potential factors in the active set. For empirical work, other data can be considered. Any panel data setting with a large number of x variables merits the use of the estimator proposed in this study.
Chapter 3: Dominant Species in the Factor Zoo
3.1 Introduction
For many years, the Fama French three-factor model (FF3) has been the benchmark
model to explain stock returns (Fama and French, 1993; Fama and French, 2015; Hou, Xue,
and Zhang, 2015). The FF3 consists of three variables: market portfolio net of risk free
rate, return difference between small and large firms, and return difference between high
value and low value firms. Fama and French argue that these variables represent some sort
of unobservable risk in the underlying economy. For instance, during recessions, smaller
firms are perceived to be more risky, and thus investors require higher returns to hold them.
Whatever the case may be, the striking feature of the FF3 model is its ability to generate an exceptionally high R-squared statistic when the y variables are portfolios of stock returns.
Over the years, the literature has presented many factors, over 300 according to Harvey, Liu, and Zhu (2015). It would be reasonable to assume that some factors will likely be redundant if the entire set is considered. One may also conjecture that there can be other combinations of factors that may yield a higher R-squared than the FF3 model. Also, the R-squared from such factors can change over time (Linnainmaa and Roberts, 2016).
The purpose of this study is to address the issues stated above. While the literature is mostly concerned with the alphas from the factors, I examine the R-squared statistics of time series regressions from various factor models. Examining time series regressions will inform which factors are likely to be the ones that represent some unobserved forces affecting the economy. Specifically, using a dimension reduction technique called One Covariate Multiple Testing (OCMT) (Chudik, Kapetanios, and Pesaran, 2018), I see which combinations of factors amongst the set of all available factors yield the highest R-squared statistic when $y_{it}$ is one of the 96 stock portfolios sorted on size and value. In particular, I examine the adjusted pooled $R^2$ of the panel of portfolio returns (henceforth called $PR^2_{adj}$), which is given by the following set of equations:
$$PR^2_{adj} = 1 - \left(\frac{SSR}{T-k-1}\right) \Big/ \left(\frac{SST}{T-1}\right), \qquad (3.1)$$
$$SSR = \sum_{i=1}^{96}\sum_{t=1}^{T} e_{it}^2, \qquad (3.2)$$
$$SST = \sum_{i=1}^{96}\sum_{t=1}^{T} \left(y_{it} - \bar{y}_i\right)^2, \qquad (3.3)$$
where $y_{it}$, for i = 1, 2, ..., n, represents one of the 96 portfolio returns, $e_{it}$ represents the residuals from the factor model with $y_{it}$ as the portfolio return, and $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, with T = 120 for the 120 monthly observations, since $PR^2_{adj}$ is computed for each rolling sample; k represents the total number of factors in the factor model. For the different k-factor models considered, this $PR^2_{adj}$ is computed for each rolling sample. Because different factor models are compared, I show $PR^2_{adj}$ instead of $PR^2$. It should be noted that $PR^2_{adj}$ is different from the average of $R^2_{adj}$ across the n $y_{it}$, although one should expect the two measures to be highly correlated with one another. It should also be noted that, in Chapter 2, it was shown that stronger factors contribute more to $PR^2$, which means that one should also expect stronger factors to contribute more in increasing $PR^2_{adj}$. Although strengths of factors are not directly measured in this study, the process by which the $PR^2_{adj}$ maximizing factors are selected involves working with a set of factors that can be thought to have the highest strengths.
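For concreteness, a minimal sketch of how $PR^2_{adj}$ in equations (3.1)-(3.3) can be computed from a panel of returns and factor-model residuals is given below; the function name and array layout are illustrative assumptions, not the original estimation code.

```python
import numpy as np

def pooled_adj_r2(Y, E, k):
    """Adjusted pooled R-squared, PR^2_adj, as in equations (3.1)-(3.3).

    Y : (T, n) array of portfolio returns y_it (n = 96 in this study)
    E : (T, n) array of residuals e_it from the k-factor time series regressions
    k : number of factors in the factor model
    """
    T, _ = Y.shape
    ssr = np.sum(E ** 2)                        # SSR = sum_i sum_t e_it^2
    sst = np.sum((Y - Y.mean(axis=0)) ** 2)     # SST = sum_i sum_t (y_it - ybar_i)^2
    return 1.0 - (ssr / (T - k - 1)) / (sst / (T - 1))
```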
For the factors, I examine 56 factors, and eventually, I hope to expand this set to include many more factors known in the literature. Since the Market factor is already known to be an important variable, I always include it as a control variable in all analyses, which means I search for the $PR^2_{adj}$ maximizing k-factor model within the 55 remaining factors. I examine multiple models, e.g., three-factor to six-factor models. In contrast to many related studies, I look at rolling samples to better capture the changing nature of anomalies and their effects on stock returns. Each rolling sample spans 120 months, and the entire sample begins in July 1972 and ends in December 2017.
In order to find combinations of factors that maximize $PR^2_{adj}$, I first reduce the dimension of the set of available factors. Without dimension reduction, the computation time for the search for the combination of $PR^2_{adj}$ maximizing factors would be too large. Taking the basic three-factor model as an example, if I have a total of 55 factors, going over all possible two-factor combinations means computing 26,235 regressions for each one of the 427 rolling samples used in this study. The reason for searching over two-factor combinations is that one of the factors, the Market factor, is always included in this study. If the study later considers more factors and even macroeconomic variables, possibly more than 300, then searching for the best combination of factors would require very large computing power.
For the dimension reduction, I obtain the approximating set using OCMT, controlling for the Market factor. The approximating set is estimated by OCMT; it contains signals, or variables that affect $y_{it}$, and it also contains pseudo-signals, or those that do not affect $y_{it}$ but are nonetheless correlated with the signals. The ones not included in the approximating set are called noise variables, and they are regarded as purely random variables. It has been shown in the original paper that as the number of variables and observations tend to infinity, the probability that OCMT picks out the approximating set tends to 1. In regards to factors in a panel data setting, throughout this study, signal-factors refer to factors that affect at least one $y_{it}$, for i = 1, 2, ..., n. Pseudo-factors refer to factors that do not affect any $y_{it}$, but are nonetheless correlated with the signal-factors. Thus, if the signal-factors are not controlled for in a panel data setting, pseudo-factors may be erroneously picked up as signal-factors. Finally, in this study, I define noise-factors as those that do not affect any $y_{it}$ and that are not correlated with the signal-factors, that is, they are purely random variables.
Using the approximating set obtained for each $y_{it}$, which is smaller than the original set, I can then search for the combination of factors that maximizes $PR^2_{adj}$. As an example, suppose that for a particular period, the average number of factors in OCMT's approximating set is 10 (the average is taken across the portfolio returns). In this case, I only focus on the 10 factors that were chosen the most number of times by OCMT across the n portfolio returns. Using this reduced set of 10 factors, I finally pick out the k-factor model that maximizes $PR^2_{adj}$. Needless to say, focusing only on 10 factors, instead of considering all 55 factors, is more efficient.
Using this procedure to pick out the $PR^2_{adj}$ maximizing factors, I consider three and four-factor models with the Market always being the control variable. The SMB factor is always chosen as one of the $PR^2_{adj}$ maximizing factors for all rolling samples. After SMB, valuation-type factors were chosen as $PR^2_{adj}$ maximizing factors most frequently across the rolling samples. These valuation-type factors are, for instance, the HML, Cashflow to price, and Sales to price factors.
I then add statistical factors into the analyses. I apply OCMT to each $y_{it}$, now controlling for the Market factor and two principal components. The two principal components were estimated using the 55 factors (excluding the Market factor). Controlling for principal components, or common factors, the correlations amongst the 55 factors become much smaller, and thus OCMT is able to get rid of noise-factors more efficiently. This means that the total number of regressions needed to find the k-factor model that maximizes $PR^2_{adj}$ is significantly lower. After applying OCMT using the three control variables, I look for three-factor and four-factor models that maximize $PR^2_{adj}$. These three-factor and four-factor models always include the Market factor as one of the factors. Using this procedure, the results are similar to the case where OCMT did not use principal components. Specifically, SMB is again chosen 100 percent of the time for both models. Valuation-type factors are also chosen as $PR^2_{adj}$ maximizing factors. Some of the factors that were chosen most frequently, next to SMB, include the Sales to price, Earnings to price, Cashflow to price, and HML factors.
Using the factors with the highest percentages, that is, the ones chosen the most number of times across the $y_{it}$, $PR^2_{adj}$ for the entire 96 portfolios is obtained on a rolling sample basis using various k-factor models. $PR^2_{adj}$ is high across time periods, ranging from 0.76 to 0.88. Controlling for the Market, the fact that SMB is the most frequently chosen factor, and its ability to generate high $PR^2_{adj}$, confirm the FF3 model's superior performance in explaining portfolio returns. Of course, we already knew of the FF3 model's superiority, but this paper is the first, to my knowledge, to confirm its performance even against many other factors in terms of $PR^2_{adj}$.
To confirm the validity of the dimension reduction procedure, I perform various robustness checks. If the procedure using variable selection algorithms correctly picks out the important factors that explain the 96 portfolio returns, then there should be notable differences in $PR^2_{adj}$ between models using the $PR^2_{adj}$ maximizing factors within the approximating set and models using factors excluded from the approximating set. By using the factors that are selected by OCMT with the lowest percentages across the 96 portfolios, I confirm that the $PR^2_{adj}$ is indeed much lower than the one observed when one uses the $PR^2_{adj}$ maximizing factors chosen from the approximating set. I also calculate the best possible $PR^2_{adj}$ for various factor models. For instance, if I am considering a three-factor model, then after controlling for the Market factor, I search for the combination of two factors from the entire data that yields the highest $PR^2_{adj}$. Comparing against this best possible scenario, I show that the highest possible $PR^2_{adj}$ obtained from factor models using OCMT's approximating set is not much different from the highest possible $PR^2_{adj}$ using the entire data. Shortly put, the dimension reduction procedure laid out in this paper is shown to be effective in reducing the set of factors so that one is able to pick out the $PR^2_{adj}$ maximizing k-factor model efficiently.
3.2 Literature
In the literature, there have been papers that attempt to address similar questions. Harvey, Liu, and Zhu (2015) and Harvey and Liu (2019) argue that there needs to be some order imposed on the factor zoo, and they examine the average return of high minus low portfolio returns sorted on various types of characteristics. Specifically, their main concern is that some of the anomalies could have been discovered simply by chance, and they thus apply techniques from the Multiple Testing literature to control for various error rates, such as the False Positive rate. They suggest that the t-statistic hurdle should be set at a higher level, such as 3, and that if this higher hurdle is applied, only a fraction of the 300+ anomalies pass the statistical significance test.
While Harvey, Liu, and Zhu (2015) look at the statistical significance of average returns of long-short portfolios, Green, Hand, and Zhang (2017) use cross sectional regressions to see which of the firm characteristics can predict stock returns. From a set of about 100 firm characteristics, FM regressions showed that about 12 of them can independently predict one period ahead stock returns. Some of them are the Book to Market ratio, the Momentum characteristic, and Earnings related characteristics. Lettau and Pelger (2018) examine portfolios similar to the ones used in this study, and the authors estimate latent factors and loadings such that the estimated factor model explains the portfolios the most. Their objective function also includes a penalty term to minimize cross sectional pricing error. Simply put, their method considers both the time series and cross sectional relationships, and estimates latent factors and loadings that minimize the loss function. They conclude that the latent factors that overlap mostly with characteristics are, for instance, market, value, profitability, momentum, and reversal.
In line with the three papers mentioned above, there are a number of other papers addressing similar questions. Most of them are concerned with finding variables that minimize cross sectional pricing error (Feng, Giglio, and Xiu, 2020). This is because asset pricing theory boils down to a cross sectional relationship between expected returns and risk. In contrast, this paper focuses, for now, on finding factors that can explain the 96 portfolio returns the most in terms of $PR^2_{adj}$ using only time series regressions. Specifically, for various numbers of k, I find out which k-factor models maximize $PR^2_{adj}$. Although the question may be simple, the results will offer valuable insights. As noted in the literature, risk factors should be able to explain portfolio or stock returns both in time series and cross sectional settings. If a factor does not explain returns well in time series regressions, then it may mean that the factor does not represent some underlying forces in the economy. In other words, the first hurdle for a set of factors to be risk factors would be to generate a high enough $PR^2_{adj}$ when the factors are used in a factor model to explain returns. Thus, factors that are consistently chosen to maximize $PR^2_{adj}$ across rolling samples should be taken seriously as candidates for risk factors.
3.3 Empirical Analysis
3.3.1 Data
The 96 stock portfolios are obtained from 100 portfolios sorted on size and value.
These are obtained straight from the Fama French website. The reason I only use 96 out of 100
is simply because of missing values. Portfolios are first sorted by size into 10 deciles. Then,
for each of the 10 size portfolios, portfolios are further sorted on book to market ratios.
The list of 56 asset pricing factors is given in Table 3.1. These are factor returns, that is, differences in returns between high return and low return portfolios sorted on some particular characteristic. For instance, the famous SMB is short for "Small Minus Big" and it is the return of a portfolio that is long on small firms and short on large firms. These have to be constructed manually. These 56 factors were obtained from Professor Wayne Ferson and Professor Juhani Linnainmaa.¹ All of these are anomalies documented in the literature. In other words, given the particular time period examined in the original paper, the average of the return difference between the high and low deciles was shown to be statistically significant.
¹ Professor Wayne Ferson is from Marshall Business School at USC and Professor Juhani Linnainmaa is from Tuck Business School at Dartmouth College.
Table 3.1: Factors in Data
Notes: List of 56 asset pricing factors (long-short portfolio returns) used in the data.
MKT Sustainable growth Market beta
IA Z-score Firm age
ROE Industry-adjusted CAPX growth Idiosyncratic volatility
SMB Sales-inventory growth Long-term reversals
HML Investment-to-capital Maximum daily return
RMW Investment growth rate Momentum
CMA Investment to assets Intermediate momentum
Accruals QMJ Profitability Nominal price
Asset growth Distress risk Heston-Sadka seasonality
Cashflow to price Operating profitability Short-term reversals
Change in asset turnover Cash-based profitability High-volume return premium
Earnings to price Return on assets Debt issuance
Enterprise multiple Return on equity Total external financing
Gross profitability 52-week high Sales to price
Growth in inventory Amihud’s illiquidity One-year share issuance
Piotroski’s F-score O-score Five-year share issuance
Abnormal investment Profit margin Net operating assets
Leverage Industry concentration Net working capital changes
M/B and accruals Sales growth
3.3.2 OCMT
Since I’m using OCMT for variable selection, this section gives its description, along
with comparison to other selection methods. Lasso and its variants are the popular tools to
reduce dimensions or pick out the important variables when faced with large set of potential
covariates, such as the factor zoo. But in general, it tends to pick too many variable. Thus,
while its TPR (true positive rate) may be high, FPR (false positive rate) is at the same
1
Professor Wayne Ferson is from Marshall Business School at USC and Professor Juhani Linnainmaa is
from Tuck Business School at Dartmouth College
62
time, not negligible. It is shown in Monte Carlo that OCMT tends to pick less number
of variables, and thus is able to reduce dimensions further compared to Lasso (Chudik et
al., 2018). Of course, there is always the possibility that an important variable is left out
from the OCMT’s approximating set. It is shown in the study by Chudik et al., (2018) that
OCMT’s performance is comparable to that of Lasso and its variants. Specifically, when the
goal of the study is to minimize forecast error, or to find variable that maximize R-squared,
the approximating set picked out by OCMT performs just as well as other methods, if not
better. Furthermore, computation for OCMT is much simpler and faster than Lasso. The
ability of OCMT to reduce dimensions with reliable accuracy, and its short computation time
makes it an ideal method to be employed in a setting like in this paper where dimension
reduction is a necessity due to the large number of factors and rolling samples considered
in the study. Future work may consider Lasso and its variants, but for now, this study will
only utilize OCMT and its variants for variable selection algorithms.
For a brief description of how OCMT operates, consider a set of data with K potential variables, where K can be a very large number. This set of K variables is called the active set, denoted as $\{x_{1t}, x_{2t}, ..., x_{Kt}\}$, where $x_{it}$ can be a signal, a pseudo-signal, or a noise variable. Given $y_t$ and the active set $\{x_{1t}, x_{2t}, ..., x_{Kt}\}$, OCMT runs the following K regressions,
$$y_t = \alpha_i + c_i' z_t + \beta_i x_{it} + u_{it}, \quad \text{for } i = 1, 2, ..., K, \qquad (3.4)$$
where $\alpha_i$ is the intercept and $z_t$ is the vector of control variables. OCMT then selects variable $x_{it}$ if the t-statistic of $\beta_i$ is greater than a critical value, where the critical value is defined as
$$c_p(n, \delta) = \Phi^{-1}\left(1 - \frac{p}{2f(K, \delta)}\right), \qquad (3.5)$$
where p refers to the size of the test, $f(K, \delta) = K^{\delta}$, and $\delta^* > \delta$ in the second stage. For this paper, I set p = 0.01, $\delta = 1$, and $\delta^* = 2$. The variables chosen by OCMT make up the approximating set, and it includes signals and pseudo-signals. The ones not included in the approximating set are those identified as noise variables by OCMT.
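As a concrete illustration, the sketch below implements one testing pass of the selection rule in equations (3.4) and (3.5): each candidate is regressed on y one at a time (plus controls), and is retained if the t-statistic on its coefficient exceeds $c_p(n, \delta)$. It is a simplified single-stage sketch; it omits the iterative second stage with $\delta^*$, and the function names are illustrative rather than the actual code used in this study.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

def ocmt_critical_value(K, p=0.01, delta=1.0):
    """Critical value c_p(n, delta) = Phi^{-1}(1 - p / (2 K^delta)), eq. (3.5)."""
    return norm.ppf(1.0 - p / (2.0 * K ** delta))

def ocmt_one_pass(y, X, Z=None, p=0.01, delta=1.0):
    """One OCMT testing pass: regress y on each x_i (plus controls Z, eq. 3.4)
    and keep x_i when |t-stat of beta_i| exceeds the critical value."""
    T, K = X.shape
    cp = ocmt_critical_value(K, p, delta)
    selected = []
    for i in range(K):
        cols = [X[:, i]] if Z is None else [X[:, i], Z]
        fit = sm.OLS(y, sm.add_constant(np.column_stack(cols))).fit()
        if abs(fit.tvalues[1]) > cp:     # t-statistic on x_i
            selected.append(i)
    return selected
```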
3.3.3 Algorithm for Finding R-squared Maximizing Factors
The goal in this study is to find out which combinations of factors maximize $PR^2_{adj}$, using the approximating set identified by OCMT, when one is faced with a large number of potential factors. I lay out below the specific steps that I follow in order to find the $PR^2_{adj}$ maximizing factors. It should be noted that I am not assuming that the $PR^2_{adj}$ maximizing factors are the signal-factors for the set of $y_{it}$ in the data, although it is likely that they are signal-factors. In other words, I am not making any assumption that the $PR^2_{adj}$ maximizing factors make up the true factor model. The only claim that I make is that the $PR^2_{adj}$ maximizing factors found using OCMT are indeed identical or very similar to the factors that maximize $PR^2_{adj}$ even when one does not use OCMT. This claim is tested in a later section as a robustness check. Finally, for all the results, I include the Market factor as a control variable when running OCMT, which means that the number of all potential factors, which I call the active set, is now 55, instead of 56. The Market factor is always controlled for since the literature unambiguously agrees that it is the most important factor, at least in time series regressions. The specific steps of the algorithm are laid out below:
Step 1: For each rolling sample, using the Market factor as a control variable, apply OCMT for each $y_{it}$ and the active set containing the 55 factors. Each $y_{it}$, in our case, would be each portfolio return from the 96 portfolios. Thus, for each rolling sample, one would have 96 approximating sets identified by OCMT. OCMT uses the following regression,
$$y_j = \alpha_{ij} + c_{ij}' z_t + \theta_{ij} x_i + \epsilon_{ij}, \qquad (3.6)$$
where j = 1, 2, ..., 96 and i = 1, 2, ..., 55. For a better description of the variables, note that $y_j$ is one of the 96 portfolio returns, $x_i$ is one of the 55 factors (excluding the Market), $\alpha_{ij}$ is the intercept, and $z_t$ is the vector of control variables, with the Market factor always included as a control variable. For another case, I use the Market factor and the first two principal components as control variables. The first two principal components are estimated from the set of 55 asset pricing factors that excludes the Market factor. The reason for including principal components is to reduce the correlations amongst the 55 factors. OCMT is more effective at reducing dimensions and picking out signals if correlations between signals and other variables are low. To give an example, Figure 3.1 on the next page shows the histogram of the absolute values of pairwise correlation coefficients amongst the 55 factors. The histogram on the left shows the correlation estimates when only the Market factor is controlled for, and the histogram on the right shows the correlation estimates when the Market and two principal components are used as control variables. As one can see by comparing the two histograms, the absolute values of the pairwise correlations become lower when one also accounts for the two principal components. The lower degree of correlation amongst the 55 factors reduces the size of the approximating set, which in turn significantly reduces the computation time for finding the $PR^2_{adj}$ maximizing k-factor model from the approximating set.
Step 2: For each rolling sample, compute the percentage of times each factor in the approximating set was chosen across the 96 portfolios. These percentages will be called OCMT-selection-percentages. In a simpler case where one examines only one $y_{it}$, or portfolio return, there would not be any ambiguity about the approximating set chosen by OCMT. An example of the OCMT-selection-percentages is shown in Table 3.2 on the next page. The results in Table 3.2 were obtained using the Market as the control variable, and the time period used is the first rolling sample, i.e., the first 10 years in the sample. The left column of Table 3.2 shows results from OCMT only for the 1st portfolio from the 96 portfolios used as $y_{it}$. A 1 indicates that the variable was chosen to be in the approximating set and a 0 indicates that OCMT has identified the variable as a noise variable.
Figure 3.1: Correlation Amongst the Factors
Notes: Histogram of the absolute values of pairwise correlations amongst the 55 asset pricing factors. The figure on the left corresponds to the case where the Market factor is controlled for when computing the pairwise correlations, and the graph on the right corresponds to the case where the Market factor and two principal components are controlled for when computing the pairwise correlations.
If one is using the 1st portfolio as the only $y_{it}$, then one can simply collect the factors with 1s, and then find the k-factor model that maximizes $PR^2_{adj}$. However, in the case of many $y_{it}$, 96 in our case, there would be 96 different approximating sets for each given rolling sample. For this reason, I compute the percentage of times each factor was chosen across the 96 $y_{it}$. The right side of Table 3.2 shows this result, again, for the first rolling sample. Taking the average across the 96 portfolios, the SMB factor is chosen 71.88 percent of the time, and thus 71.88 would be the OCMT-selection-percentage for SMB for the first rolling sample. This means that for the first rolling sample, the OCMT procedure has chosen SMB to be included in the approximating set for 69 portfolios. The right column orders the variables from highest to lowest OCMT-selection-percentages. The next step would be to use the right column of Table 3.2, or the OCMT-selection-percentages, and determine where the cutoff should be made to create an approximating set for the panel of $y_{it}$.
Table 3.2: Illustration of OCMT-selection-percentages
Notes: Illustration of results from OCMT using the Market factor as a control variable. The results correspond to the first rolling sample. Details explaining the results shown in this Table can be found in the text.
Selection (MKT control)      OCMT-selection-percentages (MKT control)
IA 0            SMB 71.88%                   Cashflow to price 47.92%
ROE 0           Nominal price 71.88%         Sales Growth 47.00%
SMB 1           Sales to price 68.75%        O-score 45.83%
HML 0           Profit margin 66.67%         Idiosyncratic volatility 40.63%
RMW 1           Leverage 63.54%              Cash-based profitability 38.54%
CMA 0           Long-term reversals 62.50%   Asset growth 36.46%
Accruals 0      HML 59.38%                   Industry-adjusted CAPX growth 35.42%
.               ROE 58.33%                   CMA 33.33%
.               Earnings to price 56.25%     RMW 32.29%
Sales Growth 1  Amihud's illiquidity 48.96%  Enterprise multiple 31.25%
                                             M/B and accruals 31.25%
As an example, if one decides that the five factors with the highest OCMT-selection-percentages should comprise the approximating set of the entire panel data, then SMB, Nominal Price, Sales to Price, Profit Margin, and Leverage would be chosen to be in the approximating set for the panel of $y_{it}$, and from this set of five factors, one would finally search for the $PR^2_{adj}$ maximizing k-factor model.
Step 3: For each rolling sample, pick out the first $\tau$ factors with the highest OCMT-selection-percentages. As explained already, OCMT-selection-percentages refer to the percentage of times each factor was chosen to be in the approximating set by OCMT across the n $y_{it}$. In this particular study, it refers to the percentage of times each factor was chosen by OCMT across the 96 portfolio returns. $\tau$ is a hyper-parameter that one has to set using some guidelines. I let $\tau$ be the average size of the approximating set. First, let $\bar{\omega}$ represent the mean number of factors included in the approximating sets chosen by OCMT. In other words, if $\omega_{i,model}$ represents the number of variables in the approximating set chosen by OCMT when OCMT is applied using $y_{it}$ and the 55 factors, then the mean is computed as $\frac{1}{n}\sum_{i=1}^{n}\omega_{i,model}$, where i = 1, 2, ..., n, n is the number of $y_{it}$, and model specifies which control variables were used, e.g., either the Market factor or the Market factor together with two principal components. Thus, $\bar{\omega} = \frac{1}{n}\sum_{i=1}^{n}\omega_{i,model}$. Set $\tau = \bar{\omega}$.
After the $\tau$ factors with the highest OCMT-selection-percentages are selected for each rolling sample, these $\tau$ factors make up a new approximating set for the panel of $y_{it}$. I will refer to this new approximating set as the panel approximating set. For instance, if the average size of the approximating set is five for a given rolling sample, or $\tau = 5$, then looking at the right column in Table 3.2, SMB, Nominal Price, Sales to Price, Profit Margin, and Leverage would be chosen to be included in the panel approximating set. From this panel approximating set, I then search for the k-factor model that yields the highest $PR^2_{adj}$ for the panel of 96 $y_{it}$.
Step 4: Let $\zeta$ equal the number of factors that one wishes to include in the factor model (excluding the control variables). If $\bar{\omega} > \zeta$, then for each rolling sample, from the panel approximating set computed in Step 3, choose the $\zeta$ factors that maximize $PR^2_{adj}$. If $\bar{\omega} \leq \zeta$, then simply select the $\zeta$ factors with the highest OCMT-selection-percentages computed in Step 3. In the end, choosing the $\zeta$ factors yields the $PR^2_{adj}$ maximizing k-factor model. A sketch of Steps 2-4 is given below.
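The following sketch illustrates Steps 2-4: computing the OCMT-selection-percentages, forming the panel approximating set, and exhaustively searching it for the $PR^2_{adj}$ maximizing combination. It reuses the `pooled_adj_r2` sketch from Section 3.1; the function names, the 0/1 selection matrix input, and the residual-producing helper `fit_fn` are illustrative assumptions rather than the actual dissertation code.

```python
import itertools
import numpy as np

def panel_approximating_set(sel, tau):
    """sel: (n, K) 0/1 matrix with sel[i, j] = 1 if OCMT selected factor j
    for portfolio i. Returns the tau factors with the highest
    OCMT-selection-percentages (Steps 2 and 3)."""
    pct = sel.mean(axis=0)                        # OCMT-selection-percentages
    return list(np.argsort(pct)[::-1][:int(tau)])

def best_k_factor_model(Y, F, mkt, candidates, zeta, fit_fn):
    """Step 4: search the candidate set for the zeta factors that, together
    with the Market factor, maximize PR^2_adj. fit_fn(Y, X) should return
    the (T, n) residual matrix from regressing each y_it on X."""
    best_combo, best_pr2 = None, -np.inf
    for combo in itertools.combinations(candidates, zeta):
        X = np.column_stack([mkt] + [F[:, j] for j in combo])
        E = fit_fn(Y, X)
        pr2 = pooled_adj_r2(Y, E, k=zeta + 1)     # eq. (3.1); Market counted
        if pr2 > best_pr2:
            best_combo, best_pr2 = combo, pr2
    return best_combo, best_pr2
```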
3.4 Results
Using the steps laid out in the previous section, I now show results for various types of factor models. All the results are based on 10-year rolling samples, or 120 monthly observations, and the entire time period spans from July 1972 to December 2017. Figure 3.2 first shows the number of unobserved common factors amongst the 55 factors detected by the methods introduced in Onatski (2010). Since OCMT performs better when the common factors amongst the covariates are controlled for, one first needs to determine the number of unobserved common factors present amongst the covariates, which is the 55 factors in our case. Figure 3.2 shows that across the rolling samples, the number of unobserved common factors fluctuates between zero and five, but the average number of common factors appears to be around 2.

Figure 3.2: Number of Unobserved Common Factors Amongst 55 Factors
Notes: Number of unobserved common factors estimated using the methods introduced in Onatski (2010). The unobserved common factors are estimated from the 55 factors excluding the Market factor.

For that reason, in this study, whenever principal components are used as control variables for OCMT, the first two principal components estimated from the 55 asset pricing factors are used.
Figure 3.3 shows the mean number of variables included in the approximating set chosen by OCMT under two different scenarios. One scenario is when the Market factor is the only control variable, and the other is when the control variables are the Market factor and the two principal components of the 55 factors. As described in the previous section, the mean number of factors included in the 96 approximating sets is needed to identify the panel approximating set. Looking at the figure, for the case using the Market factor as the only control variable, there is a wide fluctuation in the size of the approximating set across the rolling samples. In the first rolling sample, the average number of variables chosen by OCMT is around 14. The highest appears to be around 16, and the number drops sharply in recent periods. The sharp decline in the size of the approximating set may be consistent with the phenomenon shown in the literature that anomalies have begun to fade away in recent years. It should also be noticed that the size of the approximating set becomes larger when the 10-year rolling samples begin to include the dot com bubble and its burst beginning around the year 2000. This is because during financial turmoil, stock returns generally become detached from the overall Market, and instead become more sensitive to factors related to anomalies, such as the SMB and Price to earnings factors. This in turn increases the size of the OCMT approximating sets because, other than the Market, the remaining 55 factors in the data are related to anomalies. While further examination of these phenomena would be interesting, for the purpose of this study, the main focus will be on the mean size of the approximating set, since I will be using the factors within the panel approximating set to find the $PR^2_{adj}$ maximizing k-factor model. Finally, looking at the results when using the two principal components, one can see that the mean size of the approximating set over the rolling samples becomes significantly smaller. This was expected since the principal components are expected to control for correlations amongst the factors, which means that OCMT will likely select fewer factors in each approximating set.
3.4.1 $PR^2_{adj}$ for Different Models
Using the size of the panel approximating set shown in Figure 3.3, which I use to set $\tau$, I pick the $\tau$ factors with the highest OCMT-selection-percentages. This is done for each 10-year rolling sample. Using this set containing $\tau$ factors, I then search for the $PR^2_{adj}$ maximizing k-factor model. If I am looking at a 3-factor model, since the Market is already included, I search for the two factors that yield the highest $PR^2_{adj}$. The same holds for the model that uses principal components when applying OCMT, since the principal components are only used to aid OCMT in identifying the panel approximating set, that is, they are not used as part of the k-factor model when identifying the $PR^2_{adj}$ maximizing k-factor model. It should be noted that whenever $\tau$ came out to be less than $\zeta$, I ignored the panel approximating set and simply selected the $\zeta$ factors with the highest selection percentages. I performed this exercise for 3 and 4 factor models when the Market factor is the only control variable, and also for 3 and 4 factor models when the two principal components were used to aid OCMT in identifying the panel approximating set.
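A hypothetical driver tying the earlier sketches together over the 10-year rolling samples is given below; it reuses `panel_approximating_set`, `best_k_factor_model`, and the residual helper `fit_fn` from the sketch in Section 3.3.3, and the per-portfolio selection routine `select_fn` is an assumed stand-in for the OCMT runs.

```python
WINDOW = 120  # months per 10-year rolling sample

def rolling_pr2(Y_all, F_all, mkt_all, zeta, select_fn, fit_fn):
    """For each rolling sample: run OCMT per portfolio (select_fn), form the
    panel approximating set, and record the PR^2_adj maximizing zeta-factor
    model (plus the Market factor)."""
    results = []
    for start in range(Y_all.shape[0] - WINDOW + 1):
        sl = slice(start, start + WINDOW)
        Y, F, mkt = Y_all[sl], F_all[sl], mkt_all[sl]
        sel = select_fn(Y, F, mkt)                # (n, K) 0/1 OCMT selections
        # tau = mean approximating set size; fall back to zeta if smaller (Step 4)
        tau = max(int(round(sel.sum(axis=1).mean())), zeta)
        cands = panel_approximating_set(sel, tau)
        results.append(best_k_factor_model(Y, F, mkt, cands, zeta, fit_fn))
    return results
```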
Figure 3.3: Size of OCMT Panel-Approximating Set
Notes: Mean number of factors selected by OCMT when using the Market factor as a control variable. The mean is computed using the number of selected variables across the y variables. Each estimate of the mean is computed for each rolling sample.
Figure 3.4 shows the $PR^2_{adj}$ statistics for various factor models across the rolling samples when the factors in the model were the $PR^2_{adj}$ maximizing factors identified using the algorithm explained in this study. Figure 3.4 shows some clear patterns. First, the $PR^2_{adj}$ are all about the same for the different models, and more importantly, the statistics are all fairly high across the rolling samples. The highest appears to be around 0.88 and the lowest appears to be around 0.75. The fact that the statistics are all high supports the use of OCMT, since this implies that OCMT's panel approximating set did not exclude important variables that contribute to $PR^2_{adj}$. It is also interesting to note that when OCMT used principal components to reduce the size of the approximating sets, the $PR^2_{adj}$ is not much different from the $PR^2_{adj}$ from the models that did not use principal components.
Figure 3.4: $R^2_{adj,pooled}$ using different Factor Models
Notes: $PR^2_{adj}$ using different factor models. $PR^2_{adj}$ is given by equations (3.1)-(3.3). The 3 and 4 factor models with the label (OCMT) refer to the $PR^2_{adj}$ maximizing models when OCMT only used the Market factor as a control variable. The 3 and 4 factor models with the label (GOCMT) refer to the $PR^2_{adj}$ maximizing models when OCMT used the Market factor and two principal components when identifying the approximating sets. When computing $PR^2_{adj}$, the Market factor is always one of the factors in all the $PR^2_{adj}$ maximizing models. Each estimate of $PR^2_{adj}$ corresponds to one 10-year rolling sample, and the entire time period spans from July 1972 to December 2017.
This means that, when using OCMT, employing principal components to reduce the size of the approximating sets is a reliable and efficient method, since doing so shrinks the approximating sets while still retaining the important factors that contribute the most to $PR^2_{adj}$.
3.4.2 R-Squared Maximizing Variables for All Models
While Figure 3.4 shows the maximum $PR^2_{adj}$ when OCMT's panel approximating set is used, it does not inform as to which particular factors were selected as the ones that maximize $PR^2_{adj}$. Table 3.3 shows the list of factors that maximized $PR^2_{adj}$ for each of the four
Table 3.3: $PR^2_{adj}$ maximizing factors
Notes: Fraction of times each factor was chosen as one of the $PR^2_{adj}$ maximizing factors using different models. The fraction is calculated across the rolling samples, that is, the fraction of times each factor is selected over the 427 rolling samples. The estimate for the Market factor is not reported since it is always used as a control variable. The 3 and 4 factor models with the label (GOCMT) refer to the $PR^2_{adj}$ maximizing models when OCMT used the Market factor and two principal components when identifying the approximating sets. The 3 and 4 factor models without the label (GOCMT) only use the Market factor as a control variable when identifying the approximating sets. The entire time period spans from July 1972 to December 2017. Each factor is a long-short portfolio return based on characteristics shown to predict the cross section of expected returns, where the characteristics' names are shown in the Table.
Market & 2 Factors Market & 2 Factors (GOCMT)
SMB 1 SMB 1
HML 0.499 Sales to price 0.431
Cashflow to price 0.283 Earnings to price 0.265
Sales to price 0.101 Cashflow to price 0.234
Leverage 0.037 HML 0.026
Earnings to price 0.035 Idiosyncratic volatility 0.023
Gross profitability 0.023 Return on assets 0.012
Distress risk 0.021 Distress risk 0.005
Market beta 0.002
Maximum daily return 0.002
Market & 3 Factors Market & 3 Factors (GOCMT)
SMB 1 SMB 1
HML 0.789 Sales to price 0.527
Cashflow to price 0.323 Earnings to price 0.330
Sales to price 0.211 Cashflow to price 0.269
Distress risk 0.159 HML 0.211
Earnings to price 0.122 Idiosyncratic volatility 0.124
Idiosyncratic volatility 0.108 Net working capital changes 0.112
Nominal price 0.080 Return on assets 0.110
Leverage 0.052 Profit margin 0.042
Cash-based profitability 0.035 Piotroski’s F-score 0.040
Gross profitability 0.030 Enterprise multiple 0.037
O-score 0.021 Distress risk 0.033
Profit margin 0.019 Return on equity 0.026
Operating profitability 0.019 Firm age 0.026
Market beta 0.014 Nominal price 0.026
Growth in inventory 0.005 Change in asset turnover 0.019
QMJ Profitability 0.005 Net operating assets 0.014
RMW 0.002 Market beta 0.014
Enterprise multiple 0.002 RMW 0.012
Industry concentration 0.002 Amihud’s illiquidity 0.009
Firm age 0.002 Operating profitability 0.007
Growth in inventory 0.005
Gross profitability 0.002
Industry concentration 0.002
Maximum daily return 0.002
models that were considered. The numbers in Table 3.3 refer to the percentage of times, across the entire 427 rolling samples, that the factor was chosen to maximize $PR^2_{adj}$. It should be noted that this percentage is different from the OCMT-selection-percentages. As a reminder, OCMT-selection-percentages are only used to identify the panel approximating set, which is then used to identify the $PR^2_{adj}$ maximizing factors. Looking at the percentages in Table 3.3, SMB is always chosen as one of the factors irrespective of the model. Compared to other factors, HML is also chosen frequently under the models not using principal components. When principal components are added, HML gets chosen much less frequently, but it is still one of the five factors chosen most frequently. Overall, next to SMB, the valuation-type factors are chosen most frequently, and they are the HML, Sales to price, Earnings to price, and Cashflow to price factors.
3.4.3 Robustness Check for the Dimension Reduction Technique
Figure 3.4 and Table 3.3 both showed the main results of this study, which were to show which factors are the $PR^2_{adj}$ maximizing factors and to plot the $PR^2_{adj}$ when the $PR^2_{adj}$ maximizing factors are used in various factor models. As a reminder, choosing the combinations of $PR^2_{adj}$ maximizing factors was aided by a dimension reduction technique called OCMT. In short, from a large data set consisting of 55 asset pricing factors, I eliminated the factors that were identified as noise-factors. This reduction enabled a much more efficient search for the $PR^2_{adj}$ maximizing factors. However, the results in Figure 3.4 and Table 3.3 are only reliable given that the panel approximating set correctly eliminated the noise-factors. In other words, if the panel approximating set identified some factors as noise-factors when in fact those factors are signal-factors, or those that contribute significantly to $PR^2_{adj}$, then the main results shown in the previous section would not be valid. To validate the dimension reduction procedure introduced in this study, this section performs some robustness checks. In particular, I compute $PR^2_{adj}$ using the factors with the lowest OCMT-selection-percentages and also compute the best possible $PR^2_{adj}$. For the dimension reduction technique used in this
study to be reliable, the $PR^2_{adj}$ using the factors with the lowest OCMT-selection-percentages should be much lower than the ones in Figure 3.4. In addition, there should not be much of a difference in $PR^2_{adj}$ between the best possible case and the ones in Figure 3.4.
I first compute the $PR^2_{adj}$ of 3 and 4-factor models using the Market factor and the factors with the lowest OCMT-selection-percentages. Figure 3.5 shows plots of $PR^2_{adj}$ using 3 and 4-factor models when the Market factor is a control variable. The 3 and 4-factor models with the label "(OCMT)" refer to factor models where OCMT's panel approximating set was used to pick out the $PR^2_{adj}$ maximizing factors. Thus, the plots of $PR^2_{adj}$ for these models are the same as the ones seen in Figure 3.4. The 3 and 4-factor models with the label "Weakest" refer to the factor models that include the Market factor and the factors estimated to have the lowest OCMT-selection-percentages. The difference is very clear, that is, the plots of $PR^2_{adj}$ using the factors with the lowest OCMT-selection-percentages are much lower than the plots of $PR^2_{adj}$ from the models that used the OCMT panel approximating set. This implies that the factors with the lowest OCMT-selection-percentages, or those excluded from the panel approximating set, are not important factors, at least with respect to $PR^2_{adj}$.
Figure 3.6 now shows plots of $PR^2_{adj}$ when the dimension reduction by OCMT was not used. The 3 and 4-factor models with the label "unrestricted" refer to the models that did not use OCMT when searching for the $PR^2_{adj}$ maximizing factors. For these unrestricted models, the only assumption was to use the Market factor as a control variable. The remaining factors were the $PR^2_{adj}$ maximizing factors chosen from the entire set of 55 factors. Thus, the unrestricted models generate the highest possible $PR^2_{adj}$. The main idea is to compare the plots from these models against the ones from the models using the OCMT approximating sets. Market & 2 Factors (OCMT) and Market & 3 Factors (OCMT), respectively, refer to the 3 and 4-factor models used to compute $PR^2_{adj}$ when OCMT was used. The plots in Figure 3.6 show that the $PR^2_{adj}$ from the 4 different models are very close to one another.
Figure 3.5: Robustness Check 1
Notes: $PR^2_{adj}$ using different factor models. $PR^2_{adj}$ is given by equations (3.1)-(3.3). Market & 2 Factors (OCMT) and Market & 3 Factors (OCMT), respectively, refer to the 3 and 4-factor models used to compute $PR^2_{adj}$ when OCMT was used for dimension reduction to identify the $PR^2_{adj}$ maximizing factors. The Market factor was the only control variable, and the remaining factors are the $PR^2_{adj}$ maximizing variables chosen using the OCMT approximating sets. Market & 2 Factors (Weakest) and Market & 3 Factors (Weakest), respectively, refer to the 3 and 4-factor models used to compute $PR^2_{adj}$ when the Market factor was the only control variable, and the remaining factors are the ones with the lowest OCMT-selection-percentages. Given a particular rolling sample, OCMT-selection-percentages refer to the fraction of times each factor was chosen to be in the approximating set across the n $y_{it}$. Each estimate of $PR^2_{adj}$ corresponds to one 10-year rolling sample, and the entire time period spans from July 1972 to December 2017.
These results imply that using OCMT for dimension reduction is reliable. Specifically, OCMT's approximating sets do contain the factors that are important to portfolio returns, at least in regards to $PR^2_{adj}$.
3.4.4 Areas of Future Research
This study sought to find the k-factor asset pricing models that maximize $PR^2_{adj}$ for a panel of $y_{it}$ consisting of 96 portfolio returns sorted on size and book-to-market valuation.
Figure 3.6: Robustness Check 2
Notes: $PR^2_{adj}$ using unrestricted models and models aided by OCMT. Market & 2 Factors (OCMT) and Market & 3 Factors (OCMT), respectively, refer to the 3 and 4-factor models used to compute $PR^2_{adj}$ when OCMT was used for dimension reduction to choose the $PR^2_{adj}$ maximizing factors. The Market factor was the only control variable, and the remaining factors are the $PR^2_{adj}$ maximizing variables chosen from OCMT's panel approximating set. The unrestricted models use the $PR^2_{adj}$ maximizing factors chosen from the entire set of factors available in the data. In other words, dimension reduction is not used, and the $PR^2_{adj}$ maximizing factors were chosen by computing all possible $PR^2_{adj}$ using all the factors available in the data. The Market factor is always included as one of the control variables. Each estimate of $PR^2_{adj}$ corresponds to one 10-year rolling sample, and the entire time period spans from July 1972 to December 2017.
In doing so, the study also introduced an algorithm that efficiently searched for the $PR^2_{adj}$ maximizing factors, and it was shown that the algorithm performed well. Future research can also dive into ways to improve the algorithm. In particular, rather than searching for the $PR^2_{adj}$ maximizing factors within the panel approximating set, one can simply take the k strongest factors as the ones to maximize $PR^2_{adj}$. This new algorithm would be more efficient than the one introduced in this study, since using the panel approximating set would not even be necessary. Of course, the question is whether or not the strongest factors are indeed the ones that maximize $PR^2_{adj}$. The graphs in Figure 3.7 show that taking the strongest
factors does indeed generate $PR^2_{adj}$ that are comparable to the ones seen in Figure 3.4 and Figure 3.6.
The $PR^2_{adj}$ for the various k-factor models shown in the graphs in Figure 3.7 were all computed using the Market factor as one of the factors. The remaining factors in the k-factor models were the factors estimated to have the highest strength estimates when the strengths were estimated using the method introduced in Chapter 2 of this dissertation. The difference between the $PR^2_{adj}$ in the three graphs in Figure 3.7 is that the strength estimator used three different variable selection algorithms. The $PR^2_{adj}$ in the first graph (the one on the top) in Figure 3.7 was computed when the strength estimator used OCMT with the Market factor as a control variable. The $PR^2_{adj}$ in the second graph (the one in the middle) was computed when the strength estimator used OCMT with the Market and two principal components as control variables. These two principal components are the first two principal components estimated from the 55 factors. These principal components were not used in the k-factor models when computing $PR^2_{adj}$. Finally, the $PR^2_{adj}$ in the last graph (the one on the bottom) was computed when the strength estimator used Lasso. Since the Market factor is always selected as the factor with the highest strength estimate for all rolling 10-year samples, it is always included in the k-factor models, just like it was included in the models for the results in the other two graphs.
All three graphs in Figure 3.7 show $PR^2_{adj}$ using a one-factor model (CAPM), a two-factor model, and a three-factor model, for each 10-year rolling sample. With the addition of the strongest factors, $PR^2_{adj}$ increases, as one would expect. This is different from what was observed in Figure 3.5, where it was shown that $PR^2_{adj}$ remains low for a three-factor model using the Market and the two weakest factors. Moreover, the graphs in Figure 3.7 show that $PR^2_{adj}$ remains high, or comparable to the ones shown in Figures 3.4 and 3.6. This suggests that choosing the strongest factors may be a plausible method when one is trying to identify the $PR^2_{adj}$ maximizing factors.
Figure 3.7: $PR^2_{adj}$ maximizing factors using strongest factors identified by different variable selection methods
Notes: $PR^2_{adj}$ for each 10-year rolling sample computed using different k-factor models with the strongest factors identified by different variable selection algorithms. In the graphs, Market means that the only factor is the Market factor. The k-factor models referenced with (OCMT) mean that the strongest factor(s) included in the model are the ones identified by the strength estimator when the strength estimator uses OCMT with the Market factor as a control variable to compute the approximating sets. The k-factor models referenced with (GOCMT) mean that the strongest factor(s) included in the model are the ones identified by the strength estimator when the strength estimator uses OCMT with the Market factor and the first two principal components to compute the approximating sets. The principal components are estimated from the 55 factors in the data (excluding the Market factor). The k-factor models referenced with (Lasso) mean that the strongest factor(s) included in the model are the ones identified by the strength estimator when the strength estimator uses Lasso. Each estimate of $PR^2_{adj}$ corresponds to one 10-year rolling sample, and the entire time period spans from July 1972 to December 2017.
Comparing the $PR^2_{adj}$ from the three different graphs in Figure 3.7, one can see that the $PR^2_{adj}$ appear to be a bit higher for the two graphs on the bottom, which suggests that, at least for the data used in this study, there is a more direct relationship between $PR^2_{adj}$ and the strengths of factors when the strength estimator employs GOCMT and Lasso.
3.5 Discussion
The Fama French 3-factor model and its newer variants with momentum and profitability factors have been the benchmark model to beat when trying to explain stock returns or anomalies. But with so many factors that have been discovered thus far in the literature, one may question whether there can be other combinations of factors that can better explain portfolio returns in the time series dimension. By looking at the adjusted R-squared of the panel of $y_{it}$, or $PR^2_{adj}$, and by using a large data set containing 55 factors, this study answered that question. Simply put, the main goal of this study was to find the combinations of factors that maximize $PR^2_{adj}$ when the Market factor is always included as one of the factors in the factor model. The search for the $PR^2_{adj}$ maximizing factors was aided by a dimension reduction technique introduced in this study. In particular, by using the variable selection method called OCMT, I only focused on the factors that OCMT deemed important, thereby reducing the size of the set of factors considerably.
Regardless of the type of factor model, across the rolling samples, SMB is always chosen as one of the $PR^2_{adj}$ maximizing factors. HML is also one of the factors chosen frequently. This confirms the superiority of the FF3 factor model, at least in the time series dimension. Overall, next to SMB, the factors chosen frequently to maximize $PR^2_{adj}$ are valuation-type factors, such as the HML, Cashflow to price, Earnings to price, and Sales to price factors. This implies that fluctuations in these factors represent some shifts in sentiment in various parts of the economy, since they are the ones that best explain portfolio returns. This also implies that when there are changes in the outlook of the economy, investors respond by making changes to their portfolios using some valuation metrics as their guide. For instance, when the economy looks to be fragile, investors may respond by allocating more into safer assets, such as stocks of larger firms and/or stocks with lower price to earnings ratios.
Considering various factor models over the rolling samples, plots of $PR^2_{adj}$ were shown for factor models that include the $PR^2_{adj}$-maximizing factors chosen from the OCMT approximating sets. The $PR^2_{adj}$ for the four different models are about the same across the rolling samples, ranging from about 0.76 to 0.88. The fact that $PR^2_{adj}$ is quite high for the various factor models suggests that reducing dimensions using OCMT is reliable.
In order to validate the dimension reduction technique involving OCMT, for each rolling sample I also show plots of $PR^2_{adj}$ using the factors that are selected by OCMT the least number of times across the 96 portfolio returns. For the dimension reduction technique to be reliable, the $PR^2_{adj}$ obtained from models using these weak factors should be much lower than the $PR^2_{adj}$ obtained from models using the $PR^2_{adj}$-maximizing factors chosen from the approximating sets. It is shown that, indeed, $PR^2_{adj}$ is much lower when using the factors that were selected by OCMT the least number of times. In other words, this result suggests that the OCMT approximating sets correctly identify which factors are not deemed important, at least with respect to $PR^2_{adj}$. In another robustness check, I show plots of the highest $PR^2_{adj}$ possible by searching for the $PR^2_{adj}$-maximizing factors over the entire set of 55 factors. For the dimension reduction algorithm to be reliable, the highest possible $PR^2_{adj}$ and the $PR^2_{adj}$ using the OCMT approximating sets should not be noticeably different from one another. It is shown that for the different factor models, the difference in the two $PR^2_{adj}$ is negligible.
Looking ahead, there are many improvements to be made. This paper only considers a small set of 55 factors, and thus future work can include many more factors. Macroeconomic variables can also be included. Doing so will increase the size of the data considerably, but using OCMT to reduce dimensions will alleviate concerns about large computations. Also, it is advisable to use individual stock returns as $y_{it}$, instead of the 96 portfolio returns used in this study. Because these 96 portfolios were sorted on Size and Value, one could argue that, simply by construction, SMB and HML were bound to be found among the $PR^2_{adj}$-maximizing factors. Using individual stock returns would eliminate any concern arising from such bias.
Chapter 4: Time Varying Effects of Oil Price Shocks
4.1 Introduction
Oil price shocks are important factors affecting the economy. Major exporters of petroleum, including Iran and Saudi Arabia, depend heavily on oil, and since a large proportion of their GDP is comprised of oil revenues, oil price shocks greatly affect their economies. Even for non-major oil exporters, oil price shocks affect consumers on the demand side, thereby affecting those economies as well. Recognizing the significance of oil price shocks, the oil price has been closely watched for decades. One notable example of how oil affects economies can be seen from the oil supply shocks of the 1970s and how they preceded recessions. This is part of the reason why researchers such as Hamilton (2009, 2011) argue that oil price shocks contribute significantly to recessions. However, such arguments have been questioned in recent years, especially because of some periods when oil price increases were not followed by recessions. These periods can be seen in Figure 4.1, which shows movements of the oil price over time, with shaded areas indicating recessionary periods.
Figure 4.1: Spot Oil Price over Years
Notes: Spot crude (West Texas Intermediate) oil price. Shaded areas indicate U.S. recessions; the graph is obtained directly from the FRED website.
As can easily be seen from Figure 4.1, there are three episodes of oil price spikes that were not followed by recessions: in 1996, 2006, and 2011. So one is left to wonder whether fluctuations in the oil price have significant effects on the economy. Perhaps oil price shocks did have significant effects in earlier years, but the effects have diminished in recent years. In other words, one can hypothesize that oil price shocks have time varying effects. Indeed, several studies find that the effects of oil price shocks do vary across time periods (Baumeister et al., 2012; Blanchard et al., 2007). By splitting samples and using rolling bivariate regressions, Blanchard and Gali (2007) showed that the effects of oil price shocks have diminished over the years. They point to better monetary policy and alternative energy sources as potential reasons for the diminishing effects of oil price shocks on the economy. Using a Bayesian VAR with time varying parameters, Baumeister et al. (2012) also show that the effects of shocks have diminished over the years.
The goal of this paper is to investigate these time varying effects of oil price shocks and the factors affecting them, but with different methodologies and a focus on different time periods. Following the recursive, orthogonalized structure that Kilian and Park (2009) have used, I use an SVAR (structural vector autoregression) to investigate whether the magnitude of the effects of oil price shocks varies across time periods, using 10-year rolling samples with monthly observations. I find that the effects of orthogonalized oil price shocks on the US economy did not necessarily diminish over the years; rather, they increase considerably when the rolling samples begin to include the 2008 financial crisis. This is true for all three variables examined: the real industrial production growth rate, real value-weighted stock returns, and the inflation rate. This result appears to suggest that the effects of oil price shocks depend on the financial conditions in the economy, that is, the negative effects of oil price shocks become more pronounced during times when the economy is more fragile. This hypothesis is confirmed when the effects of oil price shocks are examined controlling for variables that proxy for financial conditions. Specifically, I control for the volatility of the stock market and the credit spread. It is shown that when these two variables are held fixed, the negative effects of oil price shocks are considerably reduced for the rolling samples that include the financial crisis. This is true for all three response variables, which are the CPI, the industrial production index, and the stock market index. Overall, when economic conditions are held constant, the effects of oil price shocks remain relatively stable over the entire set of rolling samples. All of these results suggest that the effects of oil price shocks did not necessarily diminish over the years, and thus one should still consider oil price shocks to be important, especially during times when the economy is fragile.
These results support the findings of Kilian and Vigfusson (2015). To capture the nonlinear relationship between oil price shocks and the US output growth rate, they use a nonlinear model to show that the effects of oil price shocks on GDP did not necessarily decline over time, and that the effects on GDP are larger when the economy is fragile. While these authors used nonlinear models to capture the nonlinear relationship, I use a structural VAR to capture the relationship between oil price shocks and macroeconomic variables. Furthermore, Kilian and Vigfusson (2015) only look at the effects of oil price shocks on the US GDP growth rate, while I also include other variables in the analyses. Specifically, I analyze the effects of oil price shocks on US CPI (inflation), US industrial production, and the US stock market.
4.2 Time Varying Shocks using Rolling SVARs
4.2.1 Econometric Specification for SVAR
In order to find out how the effects of oil price shocks vary across time periods, researchers have relied on various methods, such as splitting the sample into smaller time periods, or using rolling regressions or VARs. I examine 10-year rolling samples using an SVAR with orthogonalized innovations to oil price shocks. For the orthogonalization of the errors, I closely follow the econometric specifications used by Kilian and Park (2009), so that the results from this study can be compared to theirs, since their paper has been widely recognized. In their paper, they used a Cholesky decomposition to isolate the effects of oil price shocks and showed that oil price increases due to precautionary demand for oil are the primary factors damaging the economy, particularly the US stock market. Precautionary demand for oil refers to buyers or investors piling up oil stocks because they fear that prices will increase in the near future. In contrast, oil price increases due to heightened economic activity were shown not to be detrimental. The main difference between the specification in Kilian and Park (2009) and the one from this study is that I do not include oil production in the SVAR, which means that an oil price shock in this study does not refer only to a precautionary oil demand shock; rather, it is some combination of an oil supply shock and a precautionary demand shock. Another difference is that, in addition to the baseline structural VAR, I carry out another analysis to see how the results from the baseline model change when one controls for variables representing economic conditions. The equation below lays out the baseline model:
$$y_t = \alpha + \sum_{j=1}^{12} \Pi_j y_{t-j} + u_t, \qquad (4.1)$$
where $y_t$ represents the vector of endogenous variables, $\alpha$ is the vector of intercepts, $\Pi_j$ is the (4 × 4) matrix of coefficients on the j-th lag of the endogenous variables, and $u_t$ represents the vector of error terms. The ordering of the variables in the model is the following:

$$y_t = \begin{pmatrix} \text{inflation rate} \\ \text{rate of change of real industrial production} \\ \text{real US stock return} \\ \text{rate of change of real oil price} \end{pmatrix}. \qquad (4.2)$$
In the baseline model shown above, the errors will be orthogonalized to separate the oil price shocks from other factors. For the rate of change of the real oil price, I deflate the West Texas Intermediate spot crude oil price using the US CPI index. For real stock returns, I subtract the US inflation rate from the value-weighted returns of publicly listed firms. The inflation rate is derived from the US CPI index. Data for the CPI index, industrial production, and the oil price are all obtained from Federal Reserve Economic Data (FRED), and US stock returns are obtained from the Center for Research in Security Prices (CRSP). They are all monthly observations, and the entire sample covers January 1980 to December 2017. Since the variables in the structural VAR need to be stationary, the four variables are all expressed in rates of change. While Kilian and Park (2009) use 24 lags, I allow 12 months of lags.
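As an illustration of the data construction just described, the short sketch below assembles the four endogenous series from monthly inputs. It is a hedged sketch: the FRED series mnemonics (CPIAUCSL, INDPRO, WTISPLC), the function name, and the column names are assumptions, and the CRSP value-weighted return series is assumed to be already loaded.

```python
import numpy as np
import pandas as pd

# Assumed inputs: monthly Series indexed by date.
# cpi, indpro, wti are assumed FRED downloads (e.g., CPIAUCSL, INDPRO, WTISPLC);
# vwretd is the CRSP monthly value-weighted return (in decimals).

def build_endog(cpi, indpro, wti, vwretd):
    inflation = np.log(cpi).diff()        # inflation rate from the CPI index
    ip_growth = np.log(indpro).diff()     # rate of change of industrial production
    real_ret = vwretd - inflation         # real US stock return
    real_oil = np.log(wti / cpi).diff()   # rate of change of the real oil price
    endog = pd.concat([inflation, ip_growth, real_ret, real_oil], axis=1)
    endog.columns = ["inflation", "ip_growth", "stock_ret", "oil_price"]
    # the column ordering matches equation (4.2), with the oil price last
    return endog.loc["1980-01":"2017-12"].dropna()
```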
US industrial production growth is included to capture changes in aggregate demand for commodities. Without controlling for aggregate demand, oil price shocks in the model would carry the effects of demand shocks, which would complicate the results from the impulse responses. To capture fluctuations in aggregate demand, Kilian and Park (2009) use data on "single voyage bulk dry cargo ocean shipping freight rates." While that variable is more appropriate for controlling for aggregate demand for commodities, or specifically demand for oil, I argue that using US industrial production data also suffices to represent shifts in aggregate demand for commodities, as changes in US industrial production should be positively correlated with aggregate demand for commodities.
By applying a Cholesky decomposition, a structural representation of the VAR model in equation 4.1 is the following:

$$A y_t = A\alpha + \sum_{j=1}^{12} A\Pi_j y_{t-j} + A u_t, \qquad (4.3)$$
where A represents a lower triangular matrix obtained from the Cholesky decomposition. The ordering of the orthogonalized shocks $Au_t$ implies that shocks to the US inflation rate contemporaneously affect the other variables, which are industrial production, the stock market, and the oil price. Next, shocks to industrial production are assumed to contemporaneously affect stock returns and the oil price, while affecting inflation with a lag. Shocks to the stock market contemporaneously affect the oil price, and affect industrial production and inflation with a lag. Finally, since the rate of change of the oil price is placed last in the ordering, it is assumed that shocks to the oil price affect all the other variables with a lag. More importantly, shocks to the oil price are not correlated with shocks to any other variable in the model. Thus, it should be noted that, because of the orthogonalization, a shock to the oil price means an unexpected increase in the real oil price due to an oil supply shock and/or an increase in precautionary demand for oil. In other words, an oil price shock in the model is a negative shock to the US economy, and thus one would expect industrial production and the stock market to react negatively, while the inflation rate would be expected to react positively, at least in the short run, to a positive oil price shock. The next section reports graphs with impulse responses obtained from the rolling SVARs.
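Before turning to the results, the following sketch shows one way to implement the rolling estimation just described using the statsmodels VAR class. It is an illustrative sketch rather than the code behind the figures; the 24-period horizon, the monthly stepping of the windows, and the function name are assumptions.

```python
from statsmodels.tsa.api import VAR

def rolling_orth_irfs(endog, window=120, lags=12, horizon=24):
    """Orthogonalized IRFs to a one-std-dev oil price shock, per 120-month window.

    endog: DataFrame ordered as in equation (4.2), with the oil price last,
    so the Cholesky factor of the residual covariance matches equation (4.3).
    """
    irfs = {}
    for start in range(len(endog) - window + 1):
        sample = endog.iloc[start : start + window]
        res = VAR(sample).fit(lags)   # reduced-form VAR(12), equation (4.1)
        irf = res.irf(horizon)
        # orth_irfs[h, i, j]: response of variable i at horizon h to a
        # one-standard-deviation orthogonalized shock in variable j
        irfs[sample.index[0]] = irf.orth_irfs[:, :, -1]  # oil shock is last
    return irfs
```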
4.2.2 Impulse Responses from Rolling SVARs
Figure 4.2 shows the responses of the variables to a one standard deviation increase in the rate of change of the real price of oil.
The first graph on top shows the impulse responses of the US CPI to a one standard deviation increase in the orthogonalized oil price shock. Since the estimates are obtained for each 10-year rolling sample, each impulse response corresponding to a time period specified on the x-axis is estimated using 10 years, or 120 months, of data.
Figure 4.2: Rolling Impulse Responses of Variables to an Orthogonalized Oil Price Shock.
Notes: These graphs show the impulse responses of each variable to a one standard deviation increase in the rate of change of the real oil price for each 10-year rolling sample. The first graph corresponds to the consumer price index, the second graph in the middle corresponds to US industrial production, and the last graph on the bottom corresponds to the value-weighted US stock market index. The z-scales represent the percentage deviation of the response variables from their initial levels before the oil price shock, the y-scales represent months after the initial oil price shock, and the x-scales represent each 10-year rolling sample. The ordering of the variables for the recursive structure is 1) inflation rate, 2) rate of change of real industrial production, 3) real US value-weighted stock return, and 4) rate of change of real oil price. Although the graphs show percentage changes in the levels of the response variables, the structural VAR uses rates of change of the response variables in order for them to be stationary. The entire sample period spans from January 1980 to December 2017. The first impulse response on the graphs corresponds to the 10-year sample from January 1980 to December 1989.
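Since the VAR is estimated in rates of change while the figures plot percentage deviations of the levels, the level responses can be recovered by cumulating the growth-rate IRFs. A one-line sketch of that conversion, continuing the hypothetical rolling_orth_irfs example above:

```python
import numpy as np

def level_responses(oil_irf: np.ndarray) -> np.ndarray:
    """Convert growth-rate IRFs (horizon+1, n_vars) to % deviations of levels."""
    return np.cumsum(oil_irf, axis=0) * 100.0  # cumulate growth rates, scale to %
```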
As an example, the first impulse response of the US CPI (the leftmost one on the x-axis) is obtained using the sample from January 1980 to December 1989, which represents 120 months of data. The z-scale represents the percentage deviation of the US CPI index from its initial level before the oil price shock; for the US CPI in the graph, the range of this deviation is from -0.5% to 0.5%. We can see that in the initial months after the oil price shock, the CPI index reacts positively, and its response slowly dissipates in the following months. It is interesting to note that around the period when the 10-year rolling sample begins to include the data for the year 2008 and subsequent years, the response of the inflation rate to the oil price shock becomes larger than the responses observed in earlier years. In particular, the positive reaction of the CPI is much more pronounced, and moreover, the positive reaction is followed by disinflation in the months after the initial shock. This is likely due to the fact that the economy was fragile during the 2008 recession, so negative shocks such as oil price shocks would have had more damaging effects on the economy, which in turn would have led the economy to face disinflationary pressures. In short, the CPI reacts positively in the short run and then comes back down in the following months. However, the graph also suggests that when the economy is fragile, the reaction of the CPI is more pronounced, and in some periods the CPI level ends up below its initial level after the oil price shock.
The second graph corresponds to the responses of the US industrial production index across the 10-year rolling samples. Again, the z-scale represents the percentage deviation of the level of the industrial production index from its initial level before the oil price shock. According to the findings in the literature about the time varying effects of oil price shocks, we should observe diminishing effects over time, that is, industrial production should dip below zero by a smaller magnitude in the more recent samples. One can see that the results appear to be consistent with the findings in the literature up to about the year 2000 (the impulse response at year 2000 in the graph corresponds to the sample period from 2000 to 2010); that is, the drop in industrial production becomes less severe over the years up to the 2000 mark. Using quarterly data and 10-year rolling regressions, Blanchard et al. (2007) examined the period between 1970 and 2005 to show that variables such as US GDP, the inflation rate, and employment responded less to oil price shocks over the years, which is consistent with the findings in Figure 4.2. However, it is clear from the graphs that after about the year 2000 mark, the effects of oil price shocks become large again; that is, the drop in real industrial production becomes much more pronounced towards the end of the rolling samples. These drastic effects may also be attributed to the 2008 recession playing a role.
Similar patterns can be observed for US real stock returns. Looking at the very first impulse response, estimated from the first 10-year rolling sample (1980 through the end of 1989), the real US stock market index drops upon a one standard deviation increase in the real price of oil. Over the next rolling samples, the severity of the drop across the months appears to lessen, or to remain about the same. However, approximately around the year 2000 mark and afterwards, the drop in the US stock market becomes much more pronounced than in the previous rolling samples.
In summary, these impulse responses from SVARs using 10-year rolling samples show that the effects of oil price shocks appear to be stable or weakening in the earlier years, which is consistent with the findings in the literature. What is interesting is that the responses of the variables to an oil price shock begin to increase in the latter years, when the sample period begins to include the 2008 financial crisis. These results differ from the ones in the current literature that argue for diminishing effects of oil price shocks. While the effects of oil price shocks may have diminished when focusing on the earlier years, the analyses using 10-year rolling samples show that the effects of oil price shocks increased considerably in the latter years. I now turn to providing a potential explanation as to why we observe systematic differences in the effects of oil price shocks across time periods.
Referring to the diminishing effects of oil price shocks, Blanchard et al. (2007) suggested that one of the reasons was the improvement of monetary policy, that is, the US central bank learned to manage the economy better, enabling the economy to be more resilient to oil price shocks. Another potential explanation offered was that the US and other countries were becoming more reliant on alternative energy sources. Since we observed in the previous section that the effects of oil price shocks actually increased again in recent years, the alternative energy explanation may not be the primary factor driving the size of the effects of oil price shocks, or at least not yet. If we believe that better monetary policy helped the US economy become more resilient to oil price shocks, then either monetary policy became less effective, or some series of events magnified the effects of oil price shocks during the latter years, or perhaps it was simply a combination of the two. I take the view that some events in the latter period (2007-2017) were responsible for the increase in the effects of oil price shocks.
The fact that the responses of all three variables became larger in the latter years, when the data begin to include the 2008 financial crisis, hints that economic instability may be the factor driving the magnitude of the effects of oil price shocks. Intuitively, when the condition of the economy is not stable, any type of shock negatively affecting the economy, such as an oil price shock, can be expected to have more damaging effects. Here, I am not arguing that a sudden oil price shock is the cause of economic crises or recessions, but rather that when the economy is already fragile due to some other shocks, additional shocks such as oil price shocks will damage the economy to a greater extent. To test this hypothesis, I perform the same rolling SVARs that were analyzed in the previous section, but now controlling for variables that proxy for economic conditions.
4.2.3 SVARs Controlling for Economic Conditions
4.2.3.1 Variables Representing Economic Conditions
There are a number of variables that represent the level of economic instability. It is well known that during crises, the volatilities of economic variables tend to rise. Thus, using standard deviations of macroeconomic variables, such as that of output, is one option. The VIX also captures how fragile the financial market is perceived to be. However, data on the VIX are only available starting from 1986. Also, since daily observations of US output are not available, the standard deviation of output at a monthly frequency cannot be constructed. This is why I use the standard deviation of daily value-weighted US stock returns as one of the variables representing economic conditions. Specifically, I use the following equation:
$$\text{volatility}_t = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad (4.4)$$

where $x$ denotes daily value-weighted US stock returns and $n$ is the number of days in each month.
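Equation 4.4 is just the sample standard deviation of the daily returns within each calendar month. A minimal sketch of that computation, assuming pandas and a daily return series indexed by date (the function name is hypothetical):

```python
import pandas as pd

def monthly_volatility(daily_ret: pd.Series) -> pd.Series:
    """Sample standard deviation (ddof=1) of daily returns within each month,
    as in equation (4.4)."""
    return daily_ret.resample("M").std(ddof=1)
```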
In addition to this variable representing volatility, another indicator that informs about the condition of the economy is the credit spread, or the spread between corporate bond yields and treasury rates. It is well documented that during times of economic stress, the spread between these two rates widens, since corporate bonds are deemed too risky while investors pour into safe assets such as treasury bonds. I thus use the corporate spread, along with the standard deviation of stock returns, to represent the conditions of the economy. The intuition is that high levels of stock market volatility and credit spread are usually observed during times of economic stress. Daily value-weighted US stock returns are obtained from CRSP, and the credit spread is obtained from the FRED website. For the credit spread, I use Moody's Seasoned Baa Corporate Bond yield minus the Federal Funds Rate. I use these two variables as exogenous variables in the rolling SVARs specified in equation 4.1. The specific econometric specification is the following:
$$y_t = \alpha + \sum_{j=1}^{12} \Pi_j y_{t-j} + \Phi x_t + u_t, \qquad (4.5)$$
where $y_t$ still represents the vector of endogenous variables previously considered for the rolling SVAR, $x_t$ is the vector containing the stock market return volatility and the credit spread, and $\Phi$ is the matrix of their coefficients. Thus, $\Pi_j$, the coefficients on $y_{t-j}$, will differ from the ones in equation 4.1 if the two variables representing economic conditions are correlated with the lags of the endogenous variables. The coefficients in equation 4.3 will also differ. In other words, if the effects of oil price shocks actually depend on economic conditions, then one would observe different impulse response functions. In fact, if oil price shocks become more severe during times of economic stress, then one would expect the negative effects of oil price shocks to be milder in this new analysis. Finally, one should note that this new analysis may suffer from reverse causality, because changes in the endogenous variables may affect the two variables representing economic conditions. Future work can consider ways to work around the reverse causality issue. The orthogonalization process and the ordering of the endogenous variables for equation 4.5 remain the same as in equation 4.3.
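In terms of implementation, the only change relative to the baseline is passing the two condition variables as exogenous regressors. Continuing the earlier hypothetical statsmodels sketch (the column names and function name are again assumptions):

```python
from statsmodels.tsa.api import VAR

def fit_var_with_conditions(endog, exog, lags=12, horizon=24):
    """VAR(12) with stock market volatility and the credit spread as
    exogenous controls, as in equation (4.5)."""
    # exog: DataFrame with columns such as ["volatility", "credit_spread"],
    # aligned on the same monthly index as endog
    res = VAR(endog, exog=exog).fit(lags)
    return res.irf(horizon).orth_irfs[:, :, -1]  # responses to the oil shock
```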
4.2.3.2 Impulse Response Functions
Figure 4.3 shows the responses of the same three variables to a one standard deviation increase in the rate of change of the real oil price. The three graphs in the left column are the same ones that were already shown in Figure 4.2, while the three graphs in the right column show the impulse responses when controlling for the variables representing economic conditions. The two sets of graphs are shown side by side in order to better compare how economic conditions affect the severity of the effects of oil price shocks.
Looking first at the rolling responses of the CPI, we can see that the response of the CPI is much milder when economic conditions are held constant. In other words, inflation is much less sensitive to oil price shocks when we control for the variables that proxy for the conditions of the economy. The same pattern can be observed for US industrial production and the stock market index. Specifically, the negative effects of oil price shocks are significantly reduced for many of the rolling samples for US industrial production, especially during the latter periods when the data include the 2008 financial crisis. Looking at the final graph, for the US stock market, it is also clear that the responses of the US stock market index to oil price shocks are reduced, which again is most visible in the latter periods.
Figure 4.3: Rolling Impulse Responses of Variables to an Orthogonalized Oil Price Shock when Controlling for Stock Market Volatility and Credit Spread
Notes: These graphs show the impulse responses of each variable to a one standard deviation increase in the rate of change of the real oil price for each 10-year rolling sample. The first graph corresponds to the consumer price index, the second graph in the middle corresponds to US industrial production, and the last graph on the bottom corresponds to the value-weighted US stock market index. The z-scales represent the percentage deviation of the response variables from their initial levels before the oil price shock. The ordering of the variables for the recursive structure is 1) inflation rate, 2) rate of change of real industrial production, 3) real US value-weighted stock return, and 4) rate of change of real oil price. The three graphs in the left column are the same ones that were already shown in Figure 4.2, while the three graphs in the right column show the impulse responses controlling for the variables representing economic conditions. The entire sample period spans from January 1980 to December 2017. The first impulse response on the graphs corresponds to the 10-year sample from January 1980 to December 1989.
All these results suggest that when the economy is fragile, all the variables are affected more by oil price shocks. For the CPI, the positive as well as the negative responses become more pronounced, and for industrial production and the stock market, the negative responses become worse. Another way to state these results is that when economic conditions are controlled for, the size of the effects of oil price shocks becomes much more stable over time, which is what one would expect if all time-varying factors affecting the magnitude of oil price shocks were held constant.
4.3 Discussion
This paper examined how the effects of oil price shocks on the US economy varied across different time periods. To do so, SVARs were estimated on 10-year rolling samples. The results using SVARs and orthogonalized oil price shocks suggest that there were periods in recent years when the effects of oil price shocks were considerably large, which contradicts some of the findings in the literature that the effects of oil price shocks have diminished over the years. Closer examination of the effects of oil price shocks indicates that there may be some time-varying factors affecting how the economy responds to oil price shocks.
Hypothesizing that the effects of oil price shocks depend on economic conditions, this study repeated the same analyses while controlling for economic conditions, namely equity market volatility and the credit spread. Controlling for these factors significantly reduces the magnitude of the effects of oil price shocks on all three response variables, which are the CPI, industrial production, and the stock market. Altogether, these results suggest that one should be cautious in accepting the claim that the effects of oil price shocks have diminished in recent years, since the effects of oil price shocks appear to be more pronounced during times when the economy is fragile.
REFERENCES
Anatolyev, S. and A. Mikusheva (2022). Factor models with many assets: Strong factors, weak factors, and the two-pass procedure. Journal of Econometrics 229(1), 103-126.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70(1), 191-221.
Bailey, N., M. H. Pesaran, and L. V. Smith (2019). A multiple testing approach to the regularisation of sample correlation matrices. Journal of Econometrics 208(2), 507-534.
Bailey, N., G. Kapetanios, and M. H. Pesaran (2021). Measurement of factor strength: Theory and practice. Journal of Applied Econometrics 36(5), 587-613.
Chen, A. and T. Zimmermann (2022). Open source cross-sectional asset pricing. Critical Finance Review 27(2), 207-264.
Chudik, A. and M. H. Pesaran (2016). Theory and practice of GVAR modeling. Journal of Economic Surveys 30(1), 165-197.
Chudik, A., G. Kapetanios, and M. H. Pesaran (2018). A one covariate at a time, multiple testing approach to variable selection in high-dimensional linear regression models. Econometrica 86(4), 1479-1512.
Chudik, A., M. H. Pesaran, and E. Tosetti (2011). Weak and strong cross section dependence and estimation of large panels. The Econometrics Journal 14(1), C45-C90.
Fama, E. F. and K. R. French (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33(1), 3-56.
Fama, E. F. and K. R. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116(1), 1-22.
Fama, E. F. and J. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy 81(3), 607-636.
Feng, G., S. Giglio, and D. Xiu (2020). Taming the factor zoo: A test of new factors. Journal of Finance, forthcoming.
Green, J., J. R. M. Hand, and X. F. Zhang (2017). The characteristics that provide independent information about average U.S. monthly stock returns. The Review of Financial Studies 30(12), 4389-4436.
Harvey, C. R., Y. Liu, and H. Zhu (2015). ... and the cross-section of expected returns. The Review of Financial Studies 29(1), 5-68.
Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with Sparsity. CRC Press.
Hou, K., C. Xue, and L. Zhang (2015). Digesting anomalies: An investment approach. The Review of Financial Studies 28(3), 650-750.
Jurado, K., S. C. Ludvigson, and S. Ng (2015). Measuring uncertainty. American Economic Review 105(3), 1177-1216.
Lettau, M. and M. Pelger (2018). Factors that fit the time series and cross-section of stock returns. Working paper.
Linnainmaa, J. and M. Roberts (2016). The history of the cross section of stock returns. The Review of Financial Studies 31(7), 2606-2649.
McCracken, M. and S. Ng (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics 34(4), 574-589.
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92(4), 1004-1016.
Onatski, A. (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168(2), 244-258.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74(4), 967-1012.
Pesaran, M. H. and R. Smith (2021). Factor strengths, pricing errors, and estimation of risk premia. CESifo Working Paper No. 8947.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory 13(3), 341-360.
Sharifvaghefi, M. (2022). Variable selection in linear regressions with many highly correlated covariates. SSRN.
Stock, J. and M. Watson (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97(460), 147-162.
Stock, J. and M. Watson (2015). Factor models in macroeconomics. In J. Taylor and H. Uhlig (Eds.), Handbook of Macroeconomics, Vol. 2. North Holland.
Uematsu, Y. and T. Yamagata (2019). Estimation of weak factor models. ISER Discussion Paper No. 1053.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.
Appendix A: Appendix to Chapter 2
Table A.1: Strength Estimates using Principal Components

                                     GOCMT(1PC)          GOCMT(2PC)          GOCMT(3PC)
Estimates              $\bar{R}^2$  n\T   120   180   240     120   180   240     120   180   240
$\hat{\alpha}_{f1}$ (1)    0.25     200   0.96  0.98  0.98    0.96  0.97  0.98    0.96  0.97  0.98
                                    500   0.97  0.98  0.99    0.97  0.98  0.99    0.97  0.98  0.99
                                    1000  0.98  0.98  0.99    0.97  0.98  0.99    0.97  0.98  0.99
                           0.35     200   0.97  0.98  0.99    0.97  0.98  0.99    0.97  0.98  0.99
                                    500   0.98  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
                                    1000  0.98  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
                           0.45     200   0.98  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
                                    500   0.98  0.99  0.99    0.98  0.99  0.99    0.98  0.99  0.99
                                    1000  0.99  0.99  0.99    0.99  0.99  0.99    0.99  0.99  0.99
$\hat{\alpha}_{f2}$ (0.9)  0.25     200   0.84  0.88  0.89    0.69  0.73  0.76    0.69  0.73  0.76
                                    500   0.86  0.89  0.91    0.74  0.77  0.79    0.74  0.77  0.79
                                    1000  0.87  0.90  0.92    0.76  0.79  0.81    0.76  0.79  0.81
                           0.35     200   0.87  0.90  0.92    0.73  0.77  0.79    0.74  0.77  0.79
                                    500   0.89  0.92  0.93    0.77  0.80  0.82    0.77  0.80  0.82
                                    1000  0.90  0.92  0.94    0.79  0.82  0.83    0.79  0.82  0.83
                           0.45     200   0.89  0.91  0.93    0.77  0.80  0.82    0.77  0.80  0.82
                                    500   0.90  0.93  0.94    0.80  0.82  0.84    0.80  0.82  0.84
                                    1000  0.92  0.93  0.94    0.82  0.84  0.85    0.82  0.84  0.85
$\hat{\alpha}_{f3}$ (0.8)  0.25     200   0.84  0.87  0.89    0.73  0.75  0.77    0.73  0.75  0.77
                                    500   0.85  0.88  0.90    0.75  0.77  0.79    0.75  0.77  0.79
                                    1000  0.86  0.89  0.91    0.76  0.78  0.79    0.76  0.78  0.79
                           0.35     200   0.86  0.89  0.91    0.76  0.77  0.79    0.76  0.78  0.79
                                    500   0.88  0.91  0.92    0.77  0.79  0.80    0.77  0.79  0.80
                                    1000  0.89  0.91  0.93    0.78  0.79  0.80    0.78  0.79  0.80
                           0.45     200   0.88  0.91  0.93    0.77  0.79  0.80    0.77  0.79  0.80
                                    500   0.89  0.92  0.93    0.78  0.80  0.80    0.79  0.80  0.80
                                    1000  0.90  0.93  0.94    0.79  0.80  0.81    0.79  0.80  0.80
Notes: Factor strength estimates for the five signals, 50 pseudo-signals, and 100 noise variables. $\hat{\alpha}_{f_i}$, for i = 1, 2, ..., 5, are the estimates for the 5 signals. For the pseudo-signals and noise variables, the means of the estimates are shown, i.e., $\bar{\hat{\alpha}}_s$ refers to the mean of strength estimates across the 50 pseudo-signals, and $\bar{\hat{\alpha}}_\eta$ refers to the mean of strength estimates across the 100 noise variables. True values of the strengths are in parentheses next to each factor. $\bar{R}^2$ shown in the table refers to the mean of $R^2_i$ for equation 2.13 used for the DGP. All estimates are means over R = 2,000 replications: $\hat{\alpha}_{f_k} = \frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{f_k r}$ for k = 1, 2, ..., 5 are the estimates for the five signals; $\bar{\hat{\alpha}}_s = \frac{1}{50}\sum_{c=1}^{50}\left(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{s_c r}\right)$ is the mean of the estimates for the pseudo-signals; and $\bar{\hat{\alpha}}_\eta = \frac{1}{100}\sum_{j=1}^{100}\left(\frac{1}{R}\sum_{r=1}^{R}\hat{\alpha}_{\eta_j r}\right)$ is the mean of the estimates for the noise variables. GOCMT uses δ = 1/4 and p = 0.10. $R^2_i$ for equation 2.13 is drawn from three different distributions: IIDU(0.1, 0.4), IIDU(0.1, 0.6), and IIDU(0.1, 0.8); the means, $\bar{R}^2$, are reported in the table. GOCMT(1PC), GOCMT(2PC), and GOCMT(3PC) control for the first, the first two, and the first three principal components, respectively, of the 155 factors generated.
Table A.1 continued

                                      GOCMT(1PC)          GOCMT(2PC)          GOCMT(3PC)
Estimates               $\bar{R}^2$  n\T   120   180   240     120   180   240     120   180   240
$\hat{\alpha}_{f4}$ (0.7)   0.25     200   0.79  0.83  0.86    0.55  0.58  0.61    0.55  0.58  0.61
                                     500   0.81  0.85  0.88    0.60  0.62  0.64    0.60  0.62  0.64
                                     1000  0.83  0.87  0.89    0.63  0.65  0.66    0.63  0.65  0.66
                            0.35     200   0.83  0.87  0.89    0.59  0.61  0.64    0.59  0.61  0.64
                                     500   0.85  0.89  0.91    0.63  0.65  0.67    0.63  0.65  0.67
                                     1000  0.86  0.89  0.91    0.66  0.67  0.68    0.66  0.67  0.68
                            0.45     200   0.85  0.89  0.91    0.62  0.64  0.66    0.62  0.64  0.66
                                     500   0.87  0.90  0.92    0.65  0.67  0.68    0.65  0.67  0.68
                                     1000  0.88  0.91  0.93    0.68  0.68  0.69    0.68  0.68  0.69
$\hat{\alpha}_{f5}$ (0.45)  0.25     200   0.54  0.57  0.60    0.50  0.51  0.52    0.50  0.51  0.52
                                     500   0.59  0.61  0.63    0.55  0.55  0.55    0.54  0.55  0.55
                                     1000  0.61  0.64  0.65    0.57  0.57  0.57    0.57  0.57  0.57
                            0.35     200   0.57  0.60  0.63    0.51  0.52  0.52    0.51  0.52  0.52
                                     500   0.61  0.63  0.66    0.55  0.55  0.55    0.55  0.55  0.55
                                     1000  0.63  0.66  0.68    0.57  0.57  0.57    0.57  0.57  0.57
                            0.45     200   0.59  0.62  0.65    0.52  0.52  0.52    0.52  0.52  0.52
                                     500   0.62  0.65  0.68    0.55  0.55  0.55    0.55  0.55  0.55
                                     1000  0.64  0.68  0.70    0.57  0.57  0.56    0.57  0.57  0.56
$\bar{\hat{\alpha}}_s$ (pseudo) 0.25 200   0.57  0.62  0.64    0.36  0.34  0.33    0.36  0.34  0.33
                                     500   0.64  0.67  0.70    0.46  0.45  0.44    0.46  0.45  0.44
                                     1000  0.67  0.70  0.73    0.52  0.50  0.50    0.52  0.50  0.50
                            0.35     200   0.61  0.65  0.68    0.35  0.33  0.33    0.35  0.33  0.33
                                     500   0.66  0.70  0.73    0.45  0.44  0.43    0.45  0.44  0.43
                                     1000  0.70  0.73  0.75    0.51  0.50  0.49    0.51  0.50  0.49
                            0.45     200   0.63  0.67  0.70    0.33  0.32  0.31    0.33  0.32  0.31
                                     500   0.68  0.72  0.74    0.44  0.42  0.42    0.44  0.43  0.42
                                     1000  0.71  0.74  0.77    0.50  0.48  0.48    0.50  0.48  0.48
$\bar{\hat{\alpha}}_\eta$ (noise) 0.25 200 0.34  0.32  0.31    0.35  0.33  0.32    0.35  0.33  0.32
                                     500   0.45  0.43  0.42    0.45  0.44  0.43    0.45  0.44  0.43
                                     1000  0.50  0.49  0.49    0.51  0.50  0.49    0.51  0.50  0.49
                            0.35     200   0.32  0.31  0.30    0.34  0.32  0.31    0.34  0.32  0.31
                                     500   0.43  0.42  0.41    0.44  0.43  0.42    0.44  0.43  0.42
                                     1000  0.49  0.48  0.47    0.50  0.49  0.48    0.50  0.49  0.48
                            0.45     200   0.30  0.28  0.28    0.32  0.30  0.29    0.32  0.30  0.29
                                     500   0.41  0.40  0.39    0.43  0.41  0.40    0.43  0.41  0.40
                                     1000  0.47  0.46  0.45    0.49  0.47  0.47    0.49  0.47  0.47
Table A.2: Summary statistics of factor strength estimates for the 162 factors when GOCMT is used for the strength estimator with various numbers of principal components as control variables
Market Return Factor
GOCMT(2PC) GOCMT(3PC) GOCMT(4PC)
Minimum 0.913 0.923 0.866
Maximum 0.997 0.997 0.996
Mean 0.976 0.968 0.943
Std. Dev. 0.022 0.019 0.032
Remaining 161 Factors
GOCMT(2PC) GOCMT(3PC) GOCMT(4PC)
Minimum 0.102 0.161 0.282
Maximum 0.988 0.982 0.977
Mean 0.687 0.673 0.640
Std. Dev. 0.138 0.119 0.094
Notes: Minimum, maximum, mean, and standard deviation of strength estimates are shown for the Market
Return factor and the remaining 161 factors. For the Market Return factor, the summary statistics are
obtained by using the 409 estimates across the 10-year rolling samples from January 1977 to December 2020.
For the remaining 161 factors, summary statistics are obtained by using all 162 strength estimates across
the 409 rolling samples. The y variables are U.S. stock returns obtained from CRSP (details about the stock
returns can be found in section 2.4.1.1). GOCMT(2PC), GOCMT(3PC), GOCMT(4PC) refer to the cases
where the first two, three, and four principal components, respectively, of 162 factors are controlled for when
using OCMT for the strength estimator.
Table A.3: Strength Estimates of Factors

Factors  GOCMT(3PC)  GOCMT(Market)  Lasso  Adaptive Lasso
Frazzini-Pedersen Beta 0.870 0.809 0.610 0.547
CAPM beta 0.867 0.823 0.470 0.396
Days with zero trades 3 0.859 0.735 0.630 0.566
Tail risk beta 0.859 0.799 0.800 0.756
Volume Variance 0.851 0.740 0.669 0.607
Volume to market equity 0.849 0.803 0.604 0.540
Past trading volume 0.832 0.633 0.522 0.453
Days with zero trades 1 0.820 0.696 0.605 0.546
HML 0.815 0.824 0.610 0.562
Book leverage (annual) 0.813 0.824 0.602 0.560
Amihud's illiquidity 0.811 0.672 0.650 0.589
Size 0.797 0.706 0.447 0.381
Days with zero trades 2 0.792 0.710 0.578 0.507
Sales-to-price 0.792 0.793 0.448 0.380
Market leverage 0.790 0.806 0.311 0.246
Total assets to market 0.790 0.778 0.312 0.244
Momentum (12 month) 0.778 0.677 0.383 0.322
SMB 0.777 0.827 0.745 0.713
Book to market using most recent ME 0.776 0.733 0.263 0.186
Predicted div yield next month 0.775 0.708 0.765 0.725
Net Operating Assets 0.766 0.753 0.632 0.587
Price 0.757 0.747 0.133 0.088
Gross profits / total assets 0.754 0.759 0.761 0.735
Short Interest 0.751 0.713 0.735 0.685
Taxable income to income 0.747 0.750 0.704 0.667
Book to market using December ME 0.746 0.765 0.465 0.401
Industry concentration (sales) 0.745 0.760 0.650 0.603
Notes: Mean of factor strength estimates across the rolling samples, represented by $\bar{\hat{\alpha}}_{q,method} = \frac{1}{409}\sum_{\tau=1}^{409}\hat{\alpha}_{q\tau,method}$ for q = 2, 3, ..., 162, where $\hat{\alpha}_{q\tau,method}$ is the factor strength estimate of the qth factor in the active set in the τth rolling sample, and method ∈ {GOCMT(3PC), GOCMT(Market), Lasso, Adaptive Lasso}. The window for each rolling sample is 10 years, and there are a total of 409 rolling samples from January 1977 to December 2020. GOCMT(3PC) refers to the case where OCMT is applied to the active set $\mathcal{A}^z = \{x^z_{1t}, x^z_{2t}, \ldots, x^z_{162,t}\}$, where $x^z_{qt}$, for q = 1, 2, ..., 162, represents the qth factor after filtering out the three principal components using the methods shown in Section 2.2.3. GOCMT(Market) refers to the case where OCMT is applied to the active set $\mathcal{A}^z = \{x^z_{2t}, x^z_{3t}, \ldots, x^z_{162,t}\}$, where $x^z_{qt}$, for q = 2, 3, ..., 162, represents the qth factor after filtering out the Market excess return factor, $x_{1t}$. For the estimates under Lasso and Adaptive Lasso, the active set is $\mathcal{A} = \{x_{1t}, x_{2t}, \ldots, x_{162,t}\}$, i.e., the original data consisting of 162 factors.
Table A.3 continued

Factors  GOCMT(3PC)  GOCMT(Market)  Lasso  Adaptive Lasso
Share turnover volatility 0.745 0.712 0.692 0.643
Operating leverage 0.743 0.793 0.686 0.644
Change in net operating assets 0.742 0.673 0.369 0.307
CMA 0.741 0.749 0.642 0.601
Inst Own and Idio Vol 0.739 0.706 0.697 0.657
Composite equity issuance 0.738 0.755 0.714 0.676
Net debt to price 0.737 0.745 0.634 0.584
Debt Issuance 0.736 0.783 0.704 0.661
Momentum and LT Reversal 0.735 0.604 0.610 0.564
Momentum (6 month) 0.735 0.689 0.477 0.419
Net external financing 0.734 0.746 0.392 0.328
Total accruals 0.733 0.687 0.363 0.303
Advertising Expense 0.732 0.755 0.603 0.559
Cash-based operating profitability 0.727 0.735 0.731 0.688
Operating profitability R&D adjusted 0.725 0.723 0.684 0.637
Industry concentration (equity) 0.725 0.736 0.618 0.569
Intangible return using BM 0.724 0.793 0.519 0.462
Cash-flow to price variance 0.723 0.759 0.299 0.231
Industry Momentum 0.722 0.615 0.648 0.603
Momentum without the seasonal part 0.718 0.689 0.520 0.470
Employment growth 0.716 0.643 0.475 0.421
Convertible debt indicator 0.716 0.691 0.682 0.629
Change in equity to assets 0.713 0.675 0.222 0.151
Change in ppe and inv/assets 0.711 0.703 0.490 0.438
Analyst Value 0.710 0.765 0.630 0.584
Industry concentration (assets) 0.708 0.734 0.625 0.578
Efficient frontier index 0.708 0.645 0.587 0.534
Inst Own and Turnover 0.705 0.744 0.699 0.660
Firm age based on CRSP 0.704 0.757 0.709 0.668
Momentum based on FF3 residuals 0.704 0.651 0.636 0.590
RMW 0.703 0.759 0.637 0.594
Share issuance (5 year) 0.699 0.656 0.647 0.592
Cash to assets 0.697 0.785 0.450 0.382
Intangible return using CFtoP 0.693 0.790 0.528 0.469
Asset growth 0.693 0.675 0.376 0.324
Earnings Surprise 0.693 0.603 0.671 0.629
Inst Own and Market to Book 0.692 0.734 0.670 0.627
Enterprise Multiple 0.689 0.744 0.534 0.483
Initial Public Offerings 0.689 0.752 0.576 0.527
Net equity financing 0.683 0.757 0.389 0.327
Cash Productivity 0.683 0.786 0.364 0.289
Change in Taxes 0.681 0.611 0.667 0.625
Earnings-to-Price Ratio 0.679 0.764 0.725 0.688
Table A.3 continued

Factors  GOCMT(3PC)  GOCMT(Market)  Lasso  Adaptive Lasso
Share repurchases 0.677 0.794 0.480 0.425
Net income / book equity 0.676 0.759 0.516 0.459
Change in long-term investment 0.676 0.751 0.677 0.639
Earnings streak length 0.676 0.647 0.646 0.605
Long-run reversal 0.672 0.697 0.304 0.240
Growth in book equity 0.672 0.666 0.328 0.256
Equity Duration 0.672 0.795 0.658 0.616
Analyst Optimism 0.671 0.800 0.628 0.574
Brand capital investment 0.671 0.708 0.583 0.541
Growth in long term operating assets 0.669 0.709 0.734 0.696
Idiosyncratic risk (3 factor) 0.669 0.819 0.507 0.447
Organizational capital 0.668 0.712 0.764 0.736
Earnings surprise of big firms 0.667 0.660 0.699 0.657
Change in capex (two years) 0.667 0.650 0.585 0.536
Return on assets (qtrly) 0.667 0.713 0.375 0.318
Earnings Forecast to price 0.665 0.765 0.579 0.532
O Score 0.664 0.699 0.661 0.623
Change in net financial assets 0.660 0.628 0.665 0.622
Change in current operating assets 0.660 0.674 0.513 0.456
Off season reversal years 6 to 10 0.659 0.634 0.685 0.644
Change in capex (three years) 0.659 0.664 0.523 0.473
Intangible return using EP 0.658 0.733 0.563 0.513
Operating Cash flows to price 0.653 0.769 0.259 0.212
52 week high 0.652 0.753 0.348 0.260
Revenue Growth Rank 0.651 0.659 0.694 0.650
Coskewness 0.648 0.653 0.761 0.730
Operating profits / book equity 0.648 0.763 0.438 0.386
Investment to revenue 0.648 0.677 0.726 0.690
Revenue Surprise 0.645 0.615 0.674 0.634
Idiosyncratic risk 0.644 0.809 0.562 0.506
Net debt financing 0.643 0.592 0.682 0.639
Real estate holdings 0.638 0.669 0.733 0.698
Momentum in high volume stocks 0.637 0.673 0.637 0.588
Off season reversal years 16 to 20 0.633 0.654 0.722 0.686
Book-to-market and accruals 0.632 0.690 0.597 0.549
Cash flow to market 0.631 0.738 0.607 0.564
Percent Operating Accruals 0.630 0.692 0.689 0.648
Leverage component of BM 0.630 0.764 0.604 0.550
R&D over market cap 0.629 0.757 0.505 0.451
Firm Age - Momentum 0.627 0.634 0.663 0.619
Share issuance (1 year) 0.626 0.712 0.544 0.487
Earnings announcement return 0.626 0.617 0.679 0.640
Medium-run reversal 0.622 0.682 0.534 0.486
Change in financial liabilities 0.617 0.624 0.651 0.603
Exchange Switch 0.615 0.654 0.681 0.632
Table A.3 continued

Factors  GOCMT(3PC)  GOCMT(Market)  Lasso  Adaptive Lasso
Payout Yield 0.615 0.652 0.669 0.626
Off season long-term reversal 0.615 0.686 0.510 0.461
Tangibility 0.614 0.703 0.586 0.539
Change in current operating liabilities 0.613 0.717 0.615 0.564
Idiosyncratic risk (AHT) 0.613 0.816 0.488 0.430
Unexpected R&D increase 0.612 0.733 0.646 0.605
Sin Stock (selection criteria) 0.611 0.672 0.738 0.702
Growth in advertising expenses 0.609 0.589 0.698 0.659
Composite debt issuance 0.609 0.635 0.701 0.658
Sales growth over overhead growth 0.606 0.626 0.716 0.679
Short term reversal 0.603 0.649 0.661 0.617
Net Payout Yield 0.602 0.782 0.449 0.388
Off season reversal years 11 to 15 0.602 0.630 0.709 0.669
Real dirty surplus 0.601 0.699 0.473 0.420
Enterprise component of BM 0.600 0.683 0.680 0.639
Volume Trend 0.597 0.604 0.577 0.521
Intermediate Momentum 0.594 0.655 0.684 0.645
Order backlog 0.594 0.696 0.702 0.664
Maximum return over month 0.592 0.789 0.614 0.560
Return skewness 0.590 0.707 0.635 0.591
Mohanram G-score 0.587 0.685 0.690 0.653
Accruals 0.586 0.668 0.686 0.638
Inventory Growth 2 0.586 0.629 0.609 0.562
Intangible return using Sale2P 0.585 0.727 0.373 0.306
Earnings consistency 0.585 0.649 0.652 0.611
Abnormal Accruals 0.583 0.644 0.684 0.643
Spinoffs 0.581 0.687 0.686 0.647
Sales growth over inventory growth 0.576 0.546 0.692 0.651
Change in order backlog 0.576 0.575 0.698 0.658
Dividend Initiation 0.576 0.698 0.631 0.586
Inventory Growth 1 0.575 0.666 0.610 0.565
Dividend seasonality 0.571 0.583 0.725 0.687
Change in Asset Turnover 0.570 0.540 0.705 0.663
Idiosyncratic skewness (3F model) 0.569 0.680 0.632 0.589
Piotroski F-score 0.566 0.623 0.714 0.679
Change in Net Noncurrent Op Assets 0.566 0.550 0.680 0.640
Return seasonality years 11 to 15 0.566 0.510 0.714 0.675
Return seasonality last year 0.557 0.534 0.695 0.652
Change in Net Working Capital 0.549 0.599 0.698 0.657
Return seasonality years 2 to 5 0.548 0.578 0.719 0.680
Return seasonality years 16 to 20 0.546 0.539 0.705 0.667
Change in capital inv (ind adj) 0.545 0.616 0.673 0.629
Return seasonality years 6 to 10 0.544 0.544 0.697 0.658
Industry return of big firms 0.539 0.533 0.700 0.661
R&D ability 0.531 0.527 0.731 0.694
Share Volume 0.528 0.609 0.719 0.679
Abstract
Studies are beginning to show that not all factors are strong, and that factor strengths have important implications for various results in finance and macroeconomics. In this paper, I provide a method to estimate factor strengths in a high-dimensional regression setting where the cross-sectional dimension (n), the time-series dimension (T), and the number of available potential covariates (K(n)) are all allowed to be large. Instead of assuming a particular K-factor model, the estimator considers all K(n) potential covariates available to the researcher, and by using variable selection algorithms, one obtains factor strength estimates for all K(n) variables, which is much larger than K. Assuming K to be fixed may not be appropriate because researchers are increasingly examining large sets of covariates, such as the more than 300 potential factors in the asset pricing literature. The small sample properties of the estimator are tested extensively through Monte Carlo experiments, and the estimator is shown to be precise given large enough n and T. As an empirical exercise, factor strengths are estimated for 162 potential asset pricing factors, and it is shown that only the Market factor can be considered strong, while the remaining ones are semi-strong at best. This finding implies that one should be aware of factor strengths when estimating risk premia using two-pass regression methods, since studies have shown that factor strengths matter for the precision of risk premia estimates. After the estimation of factor strengths, it is shown that the pooled R-squared is higher if the factor model includes factors with higher strengths.