A Survey on the Computational Hardness of Linear-Structured Markov Decision Processes
by
Chuhuan Huang
A Thesis Presented to the
FACULTY OF THE USC DANA AND DAVID DORNSIFE COLLEGE OF
LETTERS, ARTS AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF ARTS
(APPLIED MATHEMATICS)
May 2023
Copyright 2023 Chuhuan Huang
Acknowledgements
This survey paper and the related research were conducted under the supervision of Professor Steven Heilman, and funded by a Graduate Teaching Assistantship, Department of Mathematics, University of Southern California, for the 2022-2023 academic year.

The author would like to thank Professor Steven Heilman, Professor Jianfeng Zhang, and Professor Remigijus Mikulevicius, the members of his thesis committee, for their patient mentoring and advice early in his career in Mathematics and during the application for his Ph.D. program.

The author would like to thank his family, father Xuegong Huang and mother Qian Cheng, for their care and support since November 1997. Though they may sometimes be hard to communicate with, their love indeed exists in a strong sense.

The author would like to thank his friends, both here in California and back in the People's Republic of China, for their mental support and inspiration.

The author would like to thank a philosopher from Philadelphia, who was known for his mentality and resilience. He inspired the author so much in every stage of life and every aspect of the author's life philosophy: something greater will.

In loving memory of his mother's mother and the philosopher.
Table of Contents

Acknowledgements
Abstract
Chapter 1: Introduction
Chapter 2: Preliminaries
    2.1 Reinforcement learning and Markov decision processes
    2.2 Complexity theory terminologies and computationally hard problems
Chapter 3: The Computational-Statistical Gaps of the Linear $Q^*/V^*$ Class
    3.1 The Sharp Result
    3.2 The Proof of the Reduction Proposition
    3.3 High-Rank Remark
Chapter 4: Linear MDPs Are Both Computationally and Statistically Efficient
    4.1 The Algorithm and its Time Complexity
    4.2 Correctness and Sample Complexity
    4.3 Remarks and Discussions
Chapter 5: Toward the Analogous Result for Linear Mixture MDPs
Bibliography
Appendix: Technical Details
Abstract

The curse of dimensionality in reinforcement learning leads to the function approximation technique, which approximates the (action) value function. Unfortunately, this approximation technique makes provability much more challenging in general. In [5], the authors proved that there exists a reinforcement learning algorithm that is both statistically and computationally efficient, by assuming linearity of the transition probability and the reward function. Therefore, it is natural to ask: what is the minimal requirement on the environment/dynamics to ensure the existence of an RL algorithm that is provably both statistically and computationally efficient?

Driven by this question, this thesis surveys the results of [6] and [5], along with some other related results. We will also discuss some analogous work that could be done for the Linear Mixture MDPs, introduced in [8], [1], [17], which is a main focus of our future work.
Chapter 1
Introduction
Suppose that in a virtual world, a warrior wants to strengthen a piece of his equipment X, to make it more powerful and triple its market price. To strengthen this equipment successfully, he has to elevate the numerical attribute of X from 50 to 110 in precisely 7 steps. In each step, he can choose among 3 different types of tools to elevate the attribute, but different tools have different random effects on X and different prices. How should he choose the tool in each step to maximize his gain, even facing uncertainty and randomness? This is a typical question considered in the context of reinforcement learning.
Since the 1956 workshop at Dartmouth, Machine Learning (ML), as a major field of Artificial Intelligence (AI), has been growing rapidly. This field initially aimed at training a machine to predict a general pattern (the model) from limited observed data. The machine is said to be "learning" throughout a "training process." Some examples of recent progress include: advancements in machines playing games such as Go, namely AlphaGo's superhuman performance against Lee Sedol in 2016, together with the landmark literature [11]; the invention of the Transformer and the attention mechanism [16] in 2017; and the game-changing chatbot storm started by GPT-4 [9]. Artificial Intelligence and Machine Learning and their applications have become a worldwide influential phenomenon in every aspect of social and economic life.
Throughout their history, different machine learning paradigms have emerged [10], including supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning is a machine learning paradigm where the model is trained on manually labeled data (the input data set is called the training set). Upon finishing a successful training process, the machine should be able to predict the label of unlabeled data outside of the training set. Two typical jobs of supervised learning are classification and regression, corresponding to discrete labels and continuous labels, respectively.

Unsupervised learning is a machine learning paradigm where the model is trained on an unlabeled training set, and upon finishing the training process, the machine should be able to partition the input data properly based on its intrinsic features. A typical job of unsupervised learning is clustering, where, upon successful training, the model should partition the input data set into several subsets based on their intrinsic attributes.
In between the previous two paradigms, semi-supervised learning is a machine learning paradigm where the model is trained on a dataset that is largely unlabeled, except for a small labeled portion. In practice, large-scale data labeling is generally expensive, so semi-supervised learning alleviates this issue and combines the advantage of accuracy in supervised learning with the cost-efficiency of unsupervised learning.

Supervised learning, semi-supervised learning, and unsupervised learning are listed in decreasing order of their dependence on human guidance. However, humans are not, and never were, the only guidance for the learning process; the environment is also a natural guidance, which leads to our main interest, reinforcement learning.
Reinforcement learning (RL) [10][14] is a machine learning paradigm where the model (typically called the agent) is trained through a sequence of interactions with the surrounding environment: in each step, the agent observes the current environment, takes an action, and gets numerical feedback, called a reward, directly from the environment. In contrast to the previous paradigms, the agent receives a numerical reward instead of a manual label, and the environment changes in reaction to the agent's action. The typical goal of reinforcement learning is to train the model so that the agent is able to choose a proper action in each round of interaction to maximize the total reward collected from the environment.
Consider the example in the opening paragraph. The warrior could realize that there are only finitely many possible values of the attribute (called states) he could reach, and he has only finitely many choices of actions. If he knew the distribution of each tool's effect, i.e. the probability of sending a given value to another (called the transition probability), he could apply Dynamic Programming (DP) to maximize his expected revenue. Unfortunately, this distribution is typically unknown to the warrior. In this case, he could apply the Monte Carlo method [10] to obtain an empirical distribution and then apply DP. However, the sampling process in the Monte Carlo method is usually expensive in terms of the time and monetary cost required, yet in order to get an empirical distribution close enough to the true distribution, the warrior has to sample as much as he can. To overcome this dilemma between accuracy and cost, it is sensible for the warrior to apply SARSA [14].
In SARSA [14], instead of estimating the transition probability for every state-action pair, the model directly estimates the reward to be collected, $Q(s,a)$, given the initial state $s$ and initial action $a$; moreover, it updates this estimate via the Bellman equation, using the current state $s$ and chosen action $a$, the reward $r$ collected in this step, and the next state $s'$ and next action $a'$ after taking action $a$. The action-choosing policy selects the action that maximizes the current estimate. Under this implementation, the total amount of computation (time complexity) decreases, but it is still exponential in the size of the state space ($|S|$) and the action space ($|A|$), which makes larger-scale implementation impractical.
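To make the tabular update concrete, the following is a minimal Python sketch of SARSA. The environment interface (`env.reset()`, `env.step(a)`, `env.actions`), the step size `alpha`, the discount `gamma`, and the ε-greedy exploration rate are illustrative assumptions of ours, not part of [14]; only the update rule itself is the standard SARSA rule sketched above.

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular SARSA; assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)                      # Q[(state, action)], initialized to 0

    def policy(s):                              # epsilon-greedy w.r.t. the current estimate
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)       # collect reward r, observe next state s'
            a_next = policy(s_next)             # choose the next action a' with the same policy
            # SARSA update: move Q(s,a) toward r + gamma * Q(s',a')
            target = r + gamma * (0.0 if done else Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```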
The curse of dimensionality in reinforcement learning [5] refers to the fact that the computational and sample complexity (number of samples) of learning algorithms tends to grow exponentially with the number of dimensions or features of the state space; hence it becomes much more difficult to compute the exact value function $V$ (the expected reward to be collected, as a function of the initial state) and to learn a good policy based on the actual $V$.

A typical way to overcome this curse, in theory and in practice, is to apply (value) function approximation, where the (action) value function is parameterized and estimated during training, typically via deep neural networks. For example, the Deep Q-Network (DQN) [7] is a modified Q-Learning in which the action value function $Q$ is parameterized by a deep neural network, and this network is trained by Stochastic Gradient Descent, where the loss function is defined as the $L_2$ distance between the current estimate and the scaled next estimate plus the immediate reward [7].
Even though the function approximation technique is applied in practice to accommodate large-scale $|S|$ and $|A|$, the amount of computation involved in training is still intimidating. For example, the world-famous AlphaZero [12] went through a 13-day training, even with 5000 tensor processing units (TPUs); the DOTA 2 bots trained by OpenAI Five took 10 months in real time, with 128,000 GPUs [2].
Function approximation unleashes the power of RL by enlarging the effective state and action spaces, but its provability is technically challenging: the neighborhood of most of the states remains unvisited during training episodes, impairing the reliability of the estimates of the value functions [14]. To deal with this challenge, a typical approach is to assume some linearity in the structure of the dynamics, hoping to facilitate the theoretical proofs. In [5], by assuming linearity of the transition probability and the reward function with respect to some known feature maps, the authors proved that there exists a reinforcement learning algorithm that is both statistically and computationally efficient (the number of samples and the amount of computation required are polynomial in some intrinsic structural quantity), in the function approximation setting.
Thus, as our main interest, a natural question is: what is the minimal requirement on the environment/dynamics to ensure the existence of an RL algorithm that is provably both statistically and computationally efficient?

Driven by the natural question above, the discussion of statistical efficiency, or sample efficiency, has become a main body of work in the reinforcement learning theory community. In [3], by assuming linearity of the optimal action value function and the optimal value function with respect to some known feature, sample efficiency is shown to be possible. However, recently, in [6], following the same assumption (linear optimal value function and action value function with respect to some known feature), the authors first exhibited a gap between statistical efficiency and computational efficiency. That is, under the assumed linearity, sample efficiency is possible [3], yet there is no randomized algorithm that solves the proposed RL problem in polynomial time, where polynomial is with respect to the known feature dimension $d$.
This thesis will survey the results of [6] and [5] along with some other related results. We will review the necessary background in Chapter 2; then we will review the computational hardness of linear $Q^*/V^*$ MDPs from [6] in Chapter 3; in Chapter 4, we will go over the main idea and the structure of [5] for linear MDPs, showing the existence of a mathematically provable RL algorithm that is both computationally and statistically efficient. Further, in Chapter 5, we will discuss some analogous work that could be done for the Linear Mixture MDPs, introduced in [8], [1], [17], which is a main focus of our future work. We also attach an appendix for some technical details.
Chapter 2
Preliminaries
In this chapter, we review some preliminaries that are used throughout this thesis.
2.1 Reinforcement learning and Markov decision processes
In reinforcement learning (RL), the agent interacts with its surroundings, gets feedback, and learns to adjust its strategy to survive and thrive. A typical choice of the underlying framework is the Markov decision process [14].

We work on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$, where $\Omega$ is the sample space, $\mathcal{F}$ is the $\sigma$-algebra on $\Omega$, $\mathbb{F}$ is the filtration, and $\mathbb{P}$ is the probability measure on $\Omega$.
Definition 2.1. A (discrete-time, episodic, uniformly-bounded-reward, time-homogeneous) Markov Decision Process (MDP) $M$ is a stochastic process defined by a tuple $(S, A, p(\cdot\mid\cdot,\cdot), R, H)$, where

• $H \in \mathbb{Z}_+$ is the time horizon, or the length of each episode.
• $S$ is the non-empty set of all possible states, called the State Space, and there is a distinguished terminal state in $S$, denoted $s_\perp$. $S$ is equipped with its own $\sigma$-algebra $\mathcal{F}_S$ that contains all singletons $\{s\}$, $\forall s \in S$.
• $A$, the Action Space, is the set of all possible actions, also equipped with its own $\sigma$-algebra $\mathcal{F}_A$ that contains all singletons $\{a\}$, $\forall a \in A$.
• $p: \mathcal{F}_S \times S \times A \to [0,1]$, defined by $p(B\mid s,a) := \mathbb{P}(S_{t+1} \in B \mid S_t = s, A_t = a)$ for every $(B, s, a) \in \mathcal{F}_S \times S \times A$, is the transition kernel, where
    – $S: \{0,1,2,\ldots,H\} \times \Omega \to S$ is the state process;
    – $A: \{0,1,2,\ldots,H\} \times \Omega \to A$ is the action process;
    – the dynamics satisfy the Markov property, i.e. for any bounded measurable function $f$,
      $$\mathbb{E}[f(S_{t+1}) \mid \sigma(S_t, A_t, S_{t-1}, \ldots, S_0, A_0)] = \mathbb{E}[f(S_{t+1}) \mid \sigma(S_t, A_t)]. \tag{2.1}$$
  Note that here we assume $p$ does not depend on $t$, i.e. the transition is assumed to be time-homogeneous, and we sometimes abuse notation by writing $p(s'\mid s,a) := p(\{s'\}\mid s,a)$.
• $R: S\times A \to \Delta([0,1])$, defined by $R(s,a) := R_{s,a}$, is the (immediate, stochastic) reward, where $\Delta([0,1])$ is the space of all probability measures over $[0,1]$. For all $(s,a) \in S\times A$, we can also define the expected reward $r: S\times A \to [0,1]$ by
  $$r(s,a) := \int_{[0,1]} x \, R_{s,a}(dx).$$
In an episode, we illustrate the agent-environment interactions as follows: the agent starts by observing some initial state $s_0 \in S$ of the environment. At each step $t$, the agent observes that the environment is in state $s_t$; he takes an action $a_t$ and collects a reward sampled from the distribution corresponding to $R_{s_t,a_t}$; the environment reacts to his action and transitions from the current state $s_t$ to a next state, following the transition $p(\cdot\mid s_t,a_t)$. The agent continues to interact with the environment until he observes $s_\perp$ or $t = H$. We shall denote
$$\tau := \inf\{t \in \mathbb{Z}_+ : S_t = s_\perp\} \wedge H$$
the first time that the interaction stops. We call this sequence of states from $s_0$ to $s_\tau$ a trajectory.
A reinforcement learning problem (RL problem) is, given an MDP $M$, to find an action-choosing function (called a policy) for the agent, used upon observing the current state of the environment, that maximizes the expected reward collected along the trajectory. Sometimes we abuse terminology and use the phrases "solving an RL problem" and "solving an MDP" interchangeably.
More rigorously, a (deterministic) policy is a function $\pi: S \to A$ (we write $\pi \in A^S$, where $A^S := \{f: S \to A\}$), and we define the state value function $V^\pi: S \to \mathbb{R}$, the expected reward to be collected along the path given a starting state $s \in S$ under policy $\pi$:
$$V^\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} r(S_t, \pi(S_t)) \,\Big|\, S_0 = s\Big],$$
where $S_1, A_1, \ldots, S_\infty, A_\infty$ are obtained by executing $\pi$ in $M$, i.e. $A_t = \pi(S_t)$, $\forall t \in [0,\infty) \cap \mathbb{Z}$.

We also define the state-action value function $Q^\pi: S\times A \to \mathbb{R}$, the expected reward to be collected along the path given a starting state $s$ and initial action $a$, under policy $\pi$:
$$Q^\pi(s,a) := r(s,a) + \mathbb{E}\Big[\sum_{t=1}^{\infty} r(S_t, \pi(S_t)) \,\Big|\, S_0 = s, A_0 = a\Big].$$

We remark that, given a fixed policy $\pi \in A^S$, the sequence $(S_t)_{t\ge 0}$ is a Markov process. Therefore, by applying the Markov property, we obtain the Bellman equation [14] for the action value function $Q^\pi$ and the value function $V^\pi$.
Lemma 2.1 (Bellman equation). Given an MDP $M = (S, A, p, R, H)$ and a policy $\pi$, for any $s \in S$, $a \in A$, $V^\pi$ and $Q^\pi$ satisfy the following equation:
$$Q^\pi(s,a) = r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s'),$$
where $\mathbb{E}_{s' \sim p(\cdot\mid s,a)} f(s') := \int_S f(s')\, p(ds'\mid s,a)$.
We conclude this section with our natural interest in
$$V^*(s) := \sup_{\pi \in A^S} V^\pi(s) \quad\text{and}\quad Q^*(s,a) := \sup_{\pi \in A^S} Q^\pi(s,a),$$
the optimal state value function and the optimal state-action value function, respectively.

Similarly to the Bellman equation, we have the Bellman optimality equation [14]:

Lemma 2.2 (Bellman optimality equation). Given an MDP $M = (S, A, p, R, H)$, for any $s \in S$, $a \in A$, $V^*$ and $Q^*$ satisfy the following equations:
$$V^*(s) = \sup_{a \in A} Q^*(s,a), \qquad Q^*(s,a) = r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s').$$

The optimal policy $\pi^*: S \to A$ is a policy such that $V^{\pi^*} = V^*$ and $Q^{\pi^*} = Q^*$.
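When $S$ and $A$ are finite and $p$, $r$ are known, Lemma 2.2 directly suggests the classical backward-induction (value iteration) computation. The following is a minimal sketch under those assumptions (the array layout and names are ours); it is included only to make the recursion concrete, not as an algorithm from the surveyed papers.

```python
import numpy as np

def backward_value_iteration(p, r, H):
    """p[s, a, s2] = transition kernel, r[s, a] = expected reward, H = horizon.
    Returns V[h, s] and a greedy policy pi[h, s] from the Bellman optimality equation."""
    S, A, _ = p.shape
    V = np.zeros((H + 1, S))              # convention: V[H] = 0 (no reward after the horizon)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r + p @ V[h + 1]              # Q*(s,a) = r(s,a) + E_{s' ~ p(.|s,a)} V*(s')
        V[h] = Q.max(axis=1)              # V*(s) = sup_a Q*(s,a)
        pi[h] = Q.argmax(axis=1)          # a greedy (hence optimal) action
    return V, pi
```

This explicit enumeration over all $(s,a)$ is exactly what becomes infeasible for large state spaces, which motivates the linear-structure assumptions surveyed in the later chapters.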
2.2 Complexity theory terminologies and computationally hard problems
We first recall some complexity concepts from theoretical computer science [13].

A decision problem is a computational problem in which, for any input, there are only two possible outputs (yes or no). We say a decision problem is in the RP class if it can be solved by a randomized algorithm in polynomial time; a decision problem is in the NP class if its solution can be verified by an algorithm in polynomial time. We further say it is NP-complete if every problem in NP is polynomial-time reducible to it.

It is well known that the RP class is a subset of the NP class, i.e. every decision problem in the RP class is in the NP class, while whether the reverse inclusion is true remains a major open problem in the theoretical computer science community.
Now we move to a more concrete problem: the UNIQUE-3-SAT problem. We begin with some prerequisite terminology.

A Boolean expression, or propositional logic formula, is a sequence of Boolean variables, the operators AND (conjunction, denoted $\wedge$)*, OR (disjunction, denoted $\vee$), NOT (negation, denoted $\neg$), and parentheses. For example, $(x_1 \vee x_2 \vee x_3 \vee \neg x_4) \wedge x_5$ is a Boolean expression. A formula is said to be satisfiable if it can be made TRUE by assigning Boolean values (i.e. TRUE, FALSE) to its variables. The Boolean Satisfiability Problem, or SAT, is to check whether a given input formula is satisfiable.

* Please distinguish the notation $\wedge$ for the minimum of two numbers from $\wedge$ for AND based on context, i.e. it means AND when it joins Boolean variables and it means minimum otherwise.
A literal is either a variable or the negation of a variable. A clause is a parenthesized disjunction of literals (or a single literal). A formula is said to be in 3-conjunctive normal form, or 3-CNF, if it is a conjunction of clauses (or a single clause), where each clause has length 3, i.e. is a clause whose three literals are connected by two $\vee$'s. An example of a 3-CNF formula is $(x_1 \vee x_2 \vee x_3) \wedge (\neg x_1 \vee x_2 \vee x_3)$. The UNIQUE-3-SAT problem is the SAT problem where the input 3-CNF formula is promised to have either 0 or 1 satisfying assignments.
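As a concrete illustration of this terminology, here is a small Python sketch; the encoding of a clause as a triple of signed integers ($+i$ for $x_i$, $-i$ for $\neg x_i$) is our own convention, not taken from [6] or [13].

```python
from itertools import product

# (x1 v x2 v x3) AND (not x1 v x2 v x3), with +i for x_i and -i for NOT x_i
formula = [(1, 2, 3), (-1, 2, 3)]

def satisfies(assignment, formula):
    """assignment[i] is the Boolean value of x_i; a clause holds if any of its literals does."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in formula)

def brute_force_sat(formula, v):
    """Exhaustive check over all 2^v assignments -- exponential time; Theorem 2.3 says this
    cannot be improved to poly(v) time by a randomized algorithm (unless NP = RP), even
    under the UNIQUE promise."""
    return any(satisfies(dict(enumerate(bits, start=1)), formula)
               for bits in product([False, True], repeat=v))

print(satisfies({1: True, 2: True, 3: False}, formula))   # True
print(satisfies({1: True, 2: False, 3: False}, formula))  # False: the second clause fails
print(brute_force_sat(formula, 3))                        # True
```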
Also, we frequently make statements [6] of the form "a randomized algorithm $\mathcal{A}$ solves a problem in time $t$ with error probability $p$", by which we mean that after running $O(t)$ steps, $\mathcal{A}$ solves the problem with probability at least $1-p$; here $O$ is the big-O notation. More specifically, for a SAT problem, it returns YES on a satisfiable formula with probability at least $1-p$, and it returns NO on an unsatisfiable formula with probability 1. For a reinforcement learning problem, with probability at least $1-p$, it returns an $\varepsilon$-optimal policy $\pi$, i.e. such that $\sup_{s\in S} |V^\pi(s) - V^*(s)| < \varepsilon$. We typically use $\mathrm{poly}(n)$ as an abbreviation for "polynomial in $n$".
We finish this section by pointing out a main result related to the UNIQUE-3-SAT problem. It is well known that the UNIQUE-3-SAT problem is NP-complete by the Cook-Levin Theorem [13]; further, [15] states a stronger result:

Theorem 2.3 (Valiant-Vazirani, 1985 [15][6]). Unless NP = RP, no randomized algorithm can, with probability at least $7/8$, solve UNIQUE-3-SAT with $v$ variables in $\mathrm{poly}(v)$ time.
Chapter 3
The Computational-Statistical Gaps of the Linear $Q^*/V^*$ Class
In this chapter, we review a recent article [6] from the reinforcement learning theory community that first reveals a sharp computational-statistical gap in reinforcement learning.

Before we specify this gap, we first introduce a classification of MDPs:
Definition 3.1 ([6]). We say an MDP $M$ satisfies the linear $Q^*/V^*$ condition if there exist $d \in \mathbb{N}$, known feature maps $\psi: S \to \mathbb{R}^d$, $\tilde\psi: S\times A \to \mathbb{R}^d$, and unknown* fixed $\theta, \omega \in \mathbb{R}^d$ such that for all $s \in S$, $a \in A$,
$$V^*(s) = \langle \psi(s), \omega\rangle, \qquad Q^*(s,a) = \langle \tilde\psi(s,a), \theta\rangle.$$
We shall call this class of MDPs the linear $Q^*/V^*$ class, and refer to $d$ as the feature dimension.

A basic assumption is that, if we try to use an algorithm to solve the RL problem associated with a linear $Q^*/V^*$ MDP, the transition $p$, the features $\psi$, $\tilde\psi$, and the immediate reward $R$ are assumed accessible to this algorithm. Each call to the functions above is assumed to take constant runtime; input parsing and output computation for these functions take $\mathrm{poly}(d)$ time. In this sense, the transition $p$ and the immediate reward $R$, sampled from its probability measure, are known.

* By known, we mean it is at most polynomial-time computable, while constant time is possible but not required. By unknown, we mean we do not know whether there is a randomized algorithm to compute it in polynomial time.
Now, with the notion above in mind, we may specify the computational-statistical gap in reinforcement learning: MDPs in the linear $Q^*/V^*$ class, as long as $|A|$ is a fixed finite constant, are statistically efficient to solve, as presented in [3], meaning they merely require $\mathrm{poly}(d)$ samples; yet they are computationally hard to solve, i.e. unless NP = RP, no randomized algorithm can solve them in $\mathrm{poly}(d)$ time [6].

For the rest of this chapter, we review the computational hardness of this class of MDPs. We first specify the RL problem of major interest: LINEAR-3-RL [6].

A LINEAR-3-RL problem is an RL problem where the MDP $M$ satisfies the following properties:

• $|S| < \infty$.
• It has deterministic transitions, i.e. given $(s,a) \in S\times A$, there exists a unique $s' \in S$ such that $p(s'\mid s,a) = 1$, and $p(\cdot\mid s,a) = 0$ elsewhere. We accordingly define $P: S\times A \to S$ by $P(s,a) = s'$ if $p(s'\mid s,a) = 1$.
• $|A| = 3$, i.e. it has 3 actions.
• It satisfies the linear $Q^*/V^*$ condition.
• It has horizon $H = O(d)$, where $d$ is the feature dimension in the linear $Q^*/V^*$ condition.

A LINEAR-3-RL instance is viewed as a tuple (oracle $M$, goal $G$), where $M$ is an MDP satisfying the properties stated above and the goal $G$ is to find a policy $\pi$ such that $\sup_{s\in S} |V^\pi - V^*| < \tfrac{1}{4}$; we say the LINEAR-3-RL instance is solved by finding such a policy.
3.1 The Sharp Result

We now present the major computational hardness result for reinforcement learning in [6].

Theorem 3.1 (D. Kane et al., 2022 [6]). Unless NP = RP, no randomized algorithm can solve, with probability at least $9/10$, LINEAR-3-RL with feature dimension $d$ in $\mathrm{poly}(d)$ time.

The punchline of the proof of this theorem is the following reduction, which connects the computational hardness of solving the given MDP to the well-known computational hardness of solving the UNIQUE-3-SAT problem.

Proposition 3.2 ([6]). Suppose $q \ge 1$. If LINEAR-3-RL with feature dimension $d$ can be solved in time $d^q$ with probability at least $9/10$, then UNIQUE-3-SAT with $v$ variables can be solved in time $\mathrm{poly}(v)$ with probability at least $7/8$.

Together with the well-established computational hardness result for UNIQUE-3-SAT, Theorem 2.3, this reduction implies the computational hardness of LINEAR-3-RL, Theorem 3.1.
3.2 The Proof of the Reduction Proposition

In this section, we sketch the proof of the reduction and leave the technical details to the appendix for reference.

We divide the proof into several steps:

1. We construct an MDP $M_\varphi$, based on the given 3-CNF formula $\varphi$, which happens to satisfy the linear $Q^*/V^*$ condition.
2. We construct an algorithm $\mathcal{A}_{\mathrm{SAT}}$ that solves UNIQUE-3-SAT using an algorithm $\mathcal{A}_{\mathrm{RL}}$ that solves $M_\varphi$, such that by solving $M_\varphi$, i.e. the RL algorithm returning a good policy, $\mathcal{A}_{\mathrm{SAT}}$ solves the UNIQUE-3-SAT problem by finding the satisfying assignment.
More specifically, given a 3-CNF formula $\varphi$ that has $v$ variables and $O(v)$ clauses in total, with $w^* \in \{-1,1\}^v$ the unique satisfying assignment, we construct $M_\varphi$ as follows (a minimal code sketch of the deterministic dynamics appears after this list):

• Each state is $s = (l, w)$, where $l \in \mathbb{N}$ is the depth and $w \in \{-1,1\}^v$ is the associated assignment. $M_\varphi$ starts at $s_0 := (0, w_0)$, where $w_0 := (1,1,\ldots,1) \in \{-1,1\}^v$.
• The deterministic dynamics are defined as follows:
    – For a non-satisfying state $s$ with assignment $w \ne w^*$ and $l < H$, consider the first unsatisfied clause, with variables $x_{i_1}, x_{i_2}, x_{i_3}$. Under action $j \in \{1,2,3\}$, state $s$ transitions to $s'$ with assignment $(w'_1, \ldots, w'_v)$, where $w'_k = -w_k$ if $k = i_j$ and $w'_k = w_k$ elsewhere, i.e. the $i_j$-th bit of the assignment is flipped, and the depth becomes $l' = l+1$.
    – For a non-satisfying state $s$ with $w \ne w^*$ and $l = H$, or a satisfying state $s$ with $w = w^*$, every action $j \in \{1,2,3\}$ takes $s$ to $s_\perp$.
• The reward is zero everywhere except on states with a satisfying assignment or states in the last layer $H$. In both cases, for $s = (l,w)$ and any $a \in \{1,2,3\}$, $R(s,a) \sim \mathrm{Ber}(g(l,w))$, where $\mathrm{Ber}(p)$ is the Bernoulli distribution with parameter $p$ and
  $$g(l,w) := \left(1 - \frac{\mathrm{dist}(w, w^*) + l}{H + v}\right)^r.$$
  Here $r$ is a parameter† that we will specify later, and $\mathrm{dist}(\cdot,\cdot)$ is the Hamming distance.

† The notation $r$ denotes either this parameter or the expected reward, but this should not be confusing, as the meaning is easily distinguishable from the context.
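Since the dynamics above are deterministic, they are easy to sketch in code. The following minimal Python illustration of the bit-flipping transition uses our own clause encoding (a clause as a triple of signed indices, an assignment as a $\pm 1$ tuple) and is not taken from [6]; the reward is omitted because sampling it would require knowing $w^*$.

```python
def first_unsatisfied_clause(w, clauses):
    """w is a +/-1 assignment (w[i-1] is the value of variable i); literal +i is satisfied
    when w[i-1] == +1 and literal -i when w[i-1] == -1."""
    for clause in clauses:
        if not any(w[abs(lit) - 1] == (1 if lit > 0 else -1) for lit in clause):
            return clause
    return None                                        # w satisfies the formula

def transition(state, action, clauses, H):
    """Deterministic transition P((l, w), j): flip the j-th variable of the first
    unsatisfied clause; satisfying states and states at depth H go to the terminal state."""
    l, w = state
    clause = first_unsatisfied_clause(w, clauses)
    if clause is None or l == H:
        return "terminal"                              # stands in for s_bottom
    i = abs(clause[action - 1])                        # the index i_j whose bit is flipped
    w_next = tuple(-x if k == i - 1 else x for k, x in enumerate(w))
    return (l + 1, w_next)

# Example: (not x1 v not x2 v not x3) AND (x1 v x2 v x3), starting from the all-ones assignment.
clauses = [(-1, -2, -3), (1, 2, 3)]
print(transition((0, (1, 1, 1)), 1, clauses, H=8))     # flips x1: (1, (-1, 1, 1))
```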
Now we claim that $M_\varphi$ satisfies the linear $Q^*/V^*$ condition, and hence the associated RL problem is a LINEAR-3-RL problem.

Claim 3.1. Given an action $a$ and a state $s = (l,w)$, i.e. a state at level $l$ with assignment $w$:

i) We have $V^*(s) = g(l,w)$;

ii) Assuming $v$ is large enough, we can construct feature maps $\psi: S \to \mathbb{R}^d$, $\tilde\psi: S\times A \to \mathbb{R}^d$ with $d \le 2v^r$ that involve nothing beyond information from $s$ and $a$, and $\theta \in \mathbb{R}^d$ that involves nothing beyond information from $w^*$, such that $V^*$ and $Q^*$ are linear with respect to the features $\psi$ and $\tilde\psi$, i.e.
$$Q^*(s,a) = \langle \theta, \tilde\psi(s,a)\rangle, \qquad V^*(s) = \langle \theta, \psi(s)\rangle.$$
Now suppose we are given a randomized algorithm $\mathcal{A}_{\mathrm{RL}}$ for the LINEAR-3-RL problem; we can construct a randomized algorithm $\mathcal{A}_{\mathrm{SAT}}$ for the UNIQUE-3-SAT problem based on $\mathcal{A}_{\mathrm{RL}}$. Note that the runtime of $\mathcal{A}_{\mathrm{RL}}$ accumulates with each call that accesses the transition kernel and the features, and we may assume these calls take constant time to compute. However, to sample the reward it is necessary to know $w^*$ in advance, which cannot be computed efficiently. Therefore, we instead replace the MDP $M_\varphi$ with a simulator $\overline{M}_\varphi$.

$\overline{M}_\varphi$ has the same transition kernel and features, but it always assigns zero reward on the last layer‡, i.e. the simulator $\overline{M}_\varphi$ assigns zero reward except on the satisfying state. By our previous assumption on the polynomial runtime of the reward, transition, and features, each call to $\overline{M}_\varphi$ takes $\mathrm{poly}(d)$ time to execute.

‡ By the last layer, we mean states $s = (l,w)$ with $l = H$ and $w \ne w^*$.
We construct $\mathcal{A}_{\mathrm{SAT}}$ for UNIQUE-3-SAT as follows. On an input 3-CNF formula $\varphi$:

• $\mathcal{A}_{\mathrm{SAT}}$ calls $\mathcal{A}_{\mathrm{RL}}$, but each call to $M_\varphi$ is replaced with a call to $\overline{M}_\varphi$.
• Suppose $\mathcal{A}_{\mathrm{RL}}$ returns a policy such that the interaction between the agent and the environment ends at a state $s = (l,w)$. Then, if $w$ satisfies the formula, $\mathcal{A}_{\mathrm{SAT}}$ returns YES. Otherwise, it returns NO.
To see its correctness, we claim the following:

Claim 3.2. Suppose $r > 1$ and $H = v^r$. If $\mathcal{A}_{\mathrm{RL}}$ returns a policy $\pi$ such that
$$\sup_{s \in S} |V^\pi(s) - V^*(s)| < \tfrac{1}{4},$$
then, if $\varphi$ is satisfiable, $\mathcal{A}_{\mathrm{SAT}}$ returns YES; otherwise, it returns NO.

Claim 3.3. Suppose $r \ge 2$ and $H = v^r$. Suppose $\mathcal{A}_{\mathrm{RL}}$ with access to $M_\varphi$ runs for $v^{r^2/4}$ time steps and, with probability $\tfrac{9}{10}$, returns a policy $\pi$ such that
$$\sup_{s \in S} |V^\pi(s) - V^*(s)| < \tfrac{1}{4}.$$
Then, still running $v^{r^2/4}$ time steps, $\mathcal{A}_{\mathrm{RL}}$ with access to $\overline{M}_\varphi$ will return a policy $\pi$ such that $\sup_{s \in S} |V^\pi(s) - V^*(s)| < \tfrac{1}{4}$ with probability at least $\tfrac{7}{8}$.
We remark that the policy $\pi$ appears twice in Claim 3.3, and the two occurrences need not be the same policy.
Together, by Claim 3.2 and Claim 3.3, if there is an algorithm $\mathcal{A}_{\mathrm{RL}}$ such that, with probability at least $9/10$, it solves LINEAR-3-RL with feature dimension $d = 2v^r$ in $v^{r^2/4}$ time steps, then the constructed algorithm $\mathcal{A}_{\mathrm{SAT}}$ can solve, with probability at least $7/8$, the UNIQUE-3-SAT problem in time $v^{r^2/4}\,\mathrm{poly}(d)$ (each call to $\overline{M}_\varphi$ takes $\mathrm{poly}(d)$ time, so we multiply by this factor).

At this moment, we choose
$$r = 8q,$$
and $q \ge 1$ implies $r \ge 2$; further, we observe
$$d^q \le v^{r^2/4} \tag{3.1}$$
and
$$\mathrm{poly}(d)\, v^{r^2/4} = \mathrm{poly}(v), \tag{3.2}$$
where the inequality (3.1) is given by
$$v^{r^2/4} = (v^r)^{r/4} \ge d^{r/8} = d^q$$
when $v$ is large enough.

In conclusion, if there is an algorithm $\mathcal{A}_{\mathrm{RL}}$ that, with probability at least $9/10$, solves LINEAR-3-RL with feature dimension $d$ in time $d^q$, then by choosing $v$ such that $d = 2v^r$, $\mathcal{A}_{\mathrm{RL}}$, with probability at least $9/10$, solves LINEAR-3-RL with feature dimension $d$ in time $v^{r^2/4}$, and hence, by Claim 3.2 and Claim 3.3, there is a randomized algorithm $\mathcal{A}_{\mathrm{SAT}}$ that, with probability at least $\tfrac{7}{8}$, solves the $v$-variable UNIQUE-3-SAT problem in time $v^{r^2/4}\,\mathrm{poly}(d) = \mathrm{poly}(v)$, i.e. the reduction proposition (Proposition 3.2) [6] is proved.
3.3 High-Rank Remark

We finish this section by remarking that, if we fix a deterministic policy $\pi \in A^S$ in the MDP associated with the LINEAR-3-RL problem, the transition matrix of the corresponding Markov chain is a $2^v \times 2^v$ matrix and it is high-rank, i.e. the rank of the transition matrix is of order higher than polynomial in $d$.

To see this, suppose $N < 2^v/v$ is the number of next states reachable from the $2^v$ initial states. Then, by the pigeonhole principle, there exists a next state that can be reached from more than $v$ states, which is impossible: since each transition flips exactly one of the $v$ bits, any given next state has at most $v$ possible predecessors. Then, since there are at least $2^v/v$ next states and the transition is deterministic, the rows of the transition matrix contain at least $2^v/v$ distinct standard unit vectors, and hence the rank is at least $2^v/v$. Now, as we set $v = (d/2)^{1/r}$, the rank is at least $2^{(d/2)^{1/r}}/(d/2)^{1/r}$, which is not of polynomial order.
Chapter 4
Linear MDPs Are Both Computationally and Statistically Efficient

In the previous chapter, we reviewed the established computational-statistical gap for the linear $Q^*/V^*$ class. To find the boundary, or say the equivalent conditions, for the computational-statistical gap, in this chapter we shift to another class of MDPs with a different notion of linear structure, presented in [5]; it turns out this condition ensures both computational and statistical efficiency, i.e. there is no computational-statistical gap for the following class of MDPs.
Definition 4.1 ([5]). We say an MDP $M$ is a linear MDP if there exist $d \in \mathbb{N}$, a known feature map $\phi: S\times A \to \mathbb{R}^d$, an unknown vector-valued measure $\mu: \mathcal{F}_S \to \mathbb{R}^d$, and an unknown fixed $\theta \in \mathbb{R}^d$, such that for all $(s,a) \in S\times A$,
$$p(\cdot\mid s,a) = \langle \phi(s,a), \mu(\cdot)\rangle, \qquad r(s,a) = \langle \phi(s,a), \theta\rangle.$$
By scaling, we may, without loss of generality, assume $\|\phi(s,a)\| \le 1$ for all $(s,a) \in S\times A$, and $\max\{\|\mu(S)\|, \|\theta\|\} \le \sqrt{d}$.
We remark that, if we try to use an algorithm to solve an RL problem associated with a linear MDP, we assume that, except for the feature map, the algorithm has NO DIRECT access to the associated reward function $R$ and transition $P$; the reward actually received in each step is still accessible and takes constant time to receive. That is, we assume that for all $(s,a) \in S\times A$, $\phi(s,a)$ is computable in polynomial time with respect to the feature dimension $d$, while the transition and reward are parameterized by the known feature map and the unknown parameters $\mu$ and $\theta$.
Remark 4.1. We also remark that linear MDPs are in the linear $Q^\pi$ class. That is, for any $\pi \in A^S$, the action value function $Q^\pi$ and the optimal action value function $Q^*$ are also linear in the feature map.

To see this, we apply the Bellman optimality equation (Lemma 2.2), so we have, for any $(s,a) \in S\times A$,
$$\begin{aligned}
Q^*(s,a) &= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s') \\
&= r(s,a) + \int_S V^*(s')\, p(ds'\mid s,a) \\
&= \langle \phi(s,a), \theta\rangle + \int_S V^*(s')\, \langle \phi(s,a), \mu(ds')\rangle \\
&= \Big\langle \phi(s,a),\; \theta + \int_S V^*(s')\, \mu(ds')\Big\rangle.
\end{aligned} \tag{4.1}$$
By defining $\tilde\theta := \theta + \int_S V^*(s')\,\mu(ds')$, we see that $Q^*$ is linear in the feature map. The proof for $Q^\pi$ is similar.
4.1 The Algorithm and its Time Complexity

From now on, we assume $|A| < \infty$ and that the reward $R$ is deterministic, so $r$ coincides* with $R$.

In the rest of the chapter, we review the literature [5] and present both the computational and the statistical efficiency results for linear MDPs, by first introducing the algorithm and the main result in [5]†, which directly implies both polynomial runtime and polynomial sample complexity.

Before we introduce the algorithm, we first add a significant notion: the total (expected) regret.
Definition 4.2. Suppose we repeat the interactions of the agent and the environment for $K$ episodes, and in each episode $k \in \{1,2,\ldots,K\}$ the agent applies the policy $\pi_k$. We define the total (expected) regret as
$$\mathrm{Regret}(K) := \sum_{k=1}^K \Big[ V^*(s_0^k) - V^{\pi_k}(s_0^k) \Big],$$
where $s_0^k$ is the starting state in the $k$-th episode.

Now we present the algorithm LSVI-UCB, Least-Squares Value Iteration with Upper-Confidence Bounds.

In Algorithm 1, each episode consists of a backward and a forward loop over all steps. In the first loop, $(w_h, \Lambda_h)$ are updated to build the $Q$ function. Note that we start this loop from $h = H$, as we need to build the $Q$ function backward, as in dynamic programming. In the second loop, the agent greedily chooses the action,
$$a_h^k \leftarrow \arg\sup_{a \in A} Q(s_h^k, a),$$
to approach the maximum of the $Q$ function built in the first loop.

* In [Jin et al., 2019], it is remarked that the following results readily generalize to stochastic rewards.
† In our review, we still assume time-homogeneity of the MDP for notational simplicity; the result for the general time-inhomogeneous setting is proved in [5].
Algorithm 1: Least-Squares Value Iteration with Upper-Confidence Bounds (LSVI-UCB), [5]

1: for episode $k = 1, \ldots, K$ do
2:    Initialize the starting state $s_1^k$ by sampling from a given distribution on $S$.
3:    for step $h = H, \ldots, 1$ do
4:        $\Lambda_h \leftarrow \lambda I + \sum_{i=1}^{k-1} \phi(s_h^i, a_h^i)\, \phi(s_h^i, a_h^i)^\top$.
5:        $w_h \leftarrow \Lambda_h^{-1} \sum_{i=1}^{k-1} \phi(s_h^i, a_h^i)\, \big[\sup_{a \in A} Q(s_{h+1}^i, a) + r(s_h^i, a_h^i)\big]$.
6:        $Q(\cdot,\cdot) \leftarrow \min\big\{H,\; \beta\sqrt{\phi(\cdot,\cdot)^\top \Lambda_h^{-1} \phi(\cdot,\cdot)} + w_h^\top \phi(\cdot,\cdot)\big\}$.
7:    end for
8:    for step $h = 1, \ldots, H$ do
9:        $a_h^k \leftarrow \arg\sup_{a \in A} Q(s_h^k, a)$.
10:       The agent takes action $a_h^k$ and transitions to $s_{h+1}^k$ according to the transition $p(\cdot\mid s_h^k, a_h^k)$.
11:    end for
12: end for

We remark on a boundary case: in lines 4-5 the summation runs from 1 to 0 when $k = 1$, and we resolve this by assigning $w_h$ the value $0$ and $\Lambda_h$ the value $\lambda I$. In this case, there is no actual update happening in line 6.
This Least-Squares Value Iteration is inspired by the classical Value Iteration (VI) algorithm. In classical VI, the $Q$ function is updated following the Bellman optimality equation (Lemma 2.2),
$$Q(s,a) \leftarrow r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} \sup_{a \in A} Q(s', a), \qquad \forall (s,a) \in S\times A.$$
In practice, this update may be hard to implement, because it is impossible to iterate over all $(s,a) \in S\times A$ when $|S| = \infty$. However, recall that by Remark 4.1 we can parameterize $Q^*(s,a) = w^\top \phi(s,a)$ for a parameter $w \in \mathbb{R}^d$, together with the known feature map $\phi$, so it is natural to consider the following $L_2$-regularized least-squares problem ($L_2$-LS) at the $k$-th episode and the $h$-th step:
$$\arg\min_{\tilde w \in \mathbb{R}^d} \sum_{i=1}^{k-1} \Big[ r(s_h^i, a_h^i) + \sup_{a \in A} Q(s_{h+1}^i, a) - \tilde w^\top \phi(s_h^i, a_h^i) \Big]^2 + \lambda \|\tilde w\|_2^2,$$
and hence we naturally apply the solution of this $L_2$-LS when updating the parameter $w_h$, namely in lines 4-5, where $\Lambda_h$ is exactly the normal matrix appearing in the solution.

Moreover, to encourage exploration, the authors add a UCB bonus term
$$\beta\sqrt{\phi(\cdot,\cdot)^\top \Lambda_h^{-1} \phi(\cdot,\cdot)},$$
where
$$\frac{1}{\phi(\cdot,\cdot)^\top \Lambda_h^{-1} \phi(\cdot,\cdot)}$$
is, naively, the effective number of samples along the direction $\phi$ that the agent has observed up to step $h$, and the uncertainty along the direction $\phi$ is naively represented by the bonus term.

Now we first analyze the time complexity of Algorithm 1 and postpone the correctness of the algorithm until we finish the presentation of the main result, which implies both the correctness and the sample efficiency at the same time.

In Algorithm 1, the Sherman-Morrison formula allows us to compute $\Lambda_h^{-1}$ in $O(d^2)$ time, so the time complexity of Algorithm 1 largely depends on the time complexity of computing $\sup_{a \in A} Q(s_{h+1}^i, a)$ for all $i \in [k]$. For each step, this takes $O(d^2 |A| K)$ time, where the $K$ term comes from the summation over $i \le k \le K$, the $|A|$ term comes from comparing each action to find $\max Q$, and the $d^2$ term comes from evaluating $Q(s_{h+1}^i, a)$, as it involves the previous computations, which are of order $d^2$, e.g. computing $\Lambda_h$ and $\Lambda_h^{-1}$. That is, in total, the computational complexity is $O(d^2 |A| K T)$.
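To make lines 4-6 and the complexity accounting concrete, here is a minimal Python sketch of a single LSVI-UCB episode under the assumptions above (finite action set, a feature function `phi(s, a)` returning a length-$d$ vector, deterministic rewards, and an `env` with `reset`/`step`). It is our illustrative reading of Algorithm 1, not the implementation of [5], and for brevity it inverts $\Lambda_h$ from scratch, which costs $O(kd^2)$ per step; the $O(d^2)$ accounting above relies on maintaining $\Lambda_h^{-1}$ with rank-one Sherman-Morrison updates instead.

```python
import numpy as np

def lsvi_ucb_episode(env, phi, actions, H, d, history, lam=1.0, beta=1.0):
    """Run one episode. history[h] is the list of past (s, a, r, s_next) tuples at step h,
    accumulated over earlier episodes; beta should be set as in Theorem 4.2 (c*d*H*sqrt(iota))."""
    w = [None] * (H + 2)
    Lam_inv = [None] * (H + 2)

    def Q(h, s, a):                       # line 6: truncated optimistic estimate
        if h > H:                         # convention: Q_{H+1} = 0 (no reward after the horizon)
            return 0.0
        x = phi(s, a)
        return min(H, w[h] @ x + beta * np.sqrt(x @ Lam_inv[h] @ x))

    for h in range(H, 0, -1):             # backward pass, lines 3-7
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next) in history[h]:
            x = phi(s, a)
            Lam += np.outer(x, x)                                          # line 4
            target += x * (r + max(Q(h + 1, s_next, b) for b in actions))  # line 5
        Lam_inv[h] = np.linalg.inv(Lam)
        w[h] = Lam_inv[h] @ target

    s = env.reset()                       # forward pass, lines 8-11
    for h in range(1, H + 1):
        a = max(actions, key=lambda b: Q(h, s, b))                         # line 9: greedy action
        s_next, r, _ = env.step(a)
        history[h].append((s, a, r, s_next))                               # data for later episodes
        s = s_next
```

Note that when no data has been collected yet ($k = 1$), the sums are empty, so $\Lambda_h = \lambda I$ and $w_h = 0$, matching the boundary case remarked after Algorithm 1.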
4.2 Correctness and Sample Complexity

Now we present the main result: the $\sqrt{T}$-regret bound of LSVI-UCB, where $T := KH$ is the total number of steps.

Theorem 4.2 (Jin et al., 2019 [5]). Given a linear MDP $M$, there exists an absolute constant $c > 0$ such that for any fixed $p \in (0,1)$, if we set $\lambda := 1$ and $\beta := c\, dH\sqrt{\iota}$ in Algorithm 1 with $\iota := \log(2dT/p)$, then with probability $1-p$ the total regret of LSVI-UCB is at most $O(\sqrt{d^3 H^3 T \iota^2})$, where $O(\cdot)$ hides only absolute constants.
To see the correctness of Algorithm 1 implied by Theorem 4.2, we follow a discussion similar to Section 3.1 of [4]. Given that
$$\sum_{k=1}^K \big[ V^*(s_1) - V^{\pi_k}(s_1) \big] \le C\, T^{\frac{1}{2}}$$
with probability $1-p$, where $C := C'\sqrt{d^3 H^3 \iota^2}$ and $C'$ is an absolute constant, then, under the condition that the previous inequality holds, by choosing $\pi$ uniformly among $\pi_k$ for $k = 1,2,\ldots,K$, we have
$$V^*(s_1) - V^{\pi}(s_1) \le 8CH\,T^{-\frac{1}{2}} = 8CH^{\frac{1}{2}}\,\frac{1}{\sqrt{K}}$$
with (conditional) probability at least $7/8$. This is true since, if $V^*(s_1) - V^{\pi}(s_1) > 8CH\,T^{-\frac{1}{2}}$ with probability greater than $1/8$, then
$$\sum_{k=1}^K \big[ V^*(s_1) - V^{\pi_k}(s_1) \big] > \frac{K}{8}\cdot 8CH\,T^{-\frac{1}{2}} = C\,T^{\frac{1}{2}},$$
and we derive a contradiction. Therefore, given $\varepsilon > 0$, we want to choose $K$ large enough such that
$$8CH^{\frac{1}{2}}\,\frac{1}{\sqrt{K}} < \varepsilon.$$
By plugging in $C = C'\sqrt{d^3 H^3 \iota^2}$, our $K$ should satisfy
$$8C'\sqrt{d^3 H^3}\,\Big[\log\Big(\frac{2dH}{p}\Big) + \log(K)\Big]\,\frac{1}{\sqrt{K}} < \varepsilon.$$
Noticing that $\lim_{K\to\infty} \frac{\log(K)}{\sqrt{K}} = 0$, we may choose $K$ large enough to further satisfy
$$8C'\sqrt{d^3 H^3}\,\frac{\log(K)}{\sqrt{K}} < \frac{\varepsilon}{2}$$
and
$$8C'\sqrt{d^3 H^3}\,\log\Big(\frac{2dH}{p}\Big)\,\frac{1}{\sqrt{K}} < \frac{\varepsilon}{2}. \tag{4.2}$$
Inequality (4.2) finally allows us to choose $K = \tilde O(d^3 H^3/\varepsilon^2)$, where $\tilde O(\cdot)$ hides absolute constants and logarithmic polynomial terms, and hence
$$V^*(s_1) - V^{\pi}(s_1) < \varepsilon,$$
i.e. $\pi$ is an $\varepsilon$-optimal policy.

That is, by running Algorithm 1 for $K = \tilde O(d^3 H^3/\varepsilon^2)$ episodes and drawing $HK = \tilde O(d^3 H^4/\varepsilon^2)$ samples, with probability $\tfrac{7}{8}(1-p)$, Algorithm 1 learns an $\varepsilon$-optimal policy $\pi$ satisfying $V^*(s_1) - V^{\pi}(s_1) < \varepsilon$, where the policy $\pi$ is chosen uniformly from $\{\pi_1, \ldots, \pi_K\}$ and each $\pi_k$ was generated based on the $Q$ function at the corresponding episode.

Together, it is established that there is no computational-statistical gap for linear MDPs.
4.3 Remarks and Discussions

Remark 4.3. We first remark that, given a linear MDP $M$ with $|S| < \infty$, if we fix a deterministic policy $\pi$, the transition matrix $P^\pi$ of the corresponding Markov chain $\{S_t\}$ is low-rank factorizable, i.e. it can be written as a product of matrices of rank at most $d$.

To see this, denote $\tilde\phi(s) := \phi(s, \pi(s))$. Then
$$\Phi := \big[\, \tilde\phi(s_1)\ \ \tilde\phi(s_2)\ \cdots\ \tilde\phi(s_{|S|}) \,\big]$$
is a $d\times|S|$ matrix of rank at most $d$, since $\phi: S\times A \to \mathbb{R}^d$, and
$$M := \big[\, \mu(s_1)\ \ \mu(s_2)\ \cdots\ \mu(s_{|S|}) \,\big]$$
is also a $d\times|S|$ matrix of rank at most $d$, since $\mu: \mathcal{F}_S \to \mathbb{R}^d$. Then
$$P^\pi = \Phi^\top M.$$

We finish this section by pointing out that the relationship between the computational efficiency of an MDP and the rank of the transition matrix of the corresponding Markov chain is, to the best of our knowledge, still unknown. We have presented examples: linear MDPs are computationally efficient and low-rank, while linear $Q^*/V^*$ MDPs are computationally hard and high-rank.
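The factorization in Remark 4.3 is easy to verify numerically on a toy example. The sketch below uses the standard observation (not specific to [5]) that any finite tabular MDP is a linear MDP with $d = |S||A|$ one-hot features, and checks $P^\pi = \Phi^\top M$ for a random MDP and a random deterministic policy; of course, here $d \ge |S|$, so the rank bound is only informative when a genuinely low-dimensional feature map is available.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3
d = S * A

# Random tabular MDP: p[s, a, :] is a probability distribution over next states.
p = rng.random((S, A, S))
p /= p.sum(axis=2, keepdims=True)

def phi(s, a):                           # one-hot feature phi(s, a) in R^{|S||A|}
    e = np.zeros(d); e[s * A + a] = 1.0; return e

# mu({s2}) stacks p(s2 | s, a) over the (s, a) index, so p(.|s,a) = <phi(s,a), mu(.)>.
M = p.reshape(d, S)                      # column s2 of M is mu({s2})

pi = rng.integers(0, A, size=S)          # a fixed deterministic policy
Phi = np.stack([phi(s, pi[s]) for s in range(S)], axis=1)   # d x |S|, columns phi~(s)
P_pi = np.array([p[s, pi[s]] for s in range(S)])            # |S| x |S| chain transition matrix

print(np.allclose(P_pi, Phi.T @ M))      # True: P^pi = Phi^T M
```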
Chapter 5
Toward the Analogous Result for Linear Mixture MDPs

In the previous chapters, we presented two classes of MDPs that are both statistically efficient but differ in their computational efficiency: linear $Q^*/V^*$ MDPs [6] are computationally hard, while linear MDPs [5] are computationally efficient. We remarked on an intrinsic difference in their transition matrices: linear MDPs have a low-rank transition matrix, while linear $Q^*/V^*$ MDPs have a high-rank transition matrix. Therefore, it is natural to consider the relationship between the computational efficiency of an MDP and the rank of the transition matrix of the corresponding Markov chain.

On our way to investigating this relationship, we present another class of MDPs, introduced in [8], [1], [17], in the following.
Definition 5.1. We say an MDP $M$ is a linear mixture MDP if there exist $d \in \mathbb{N}$, known feature maps $\phi: S\times S\times A \to \mathbb{R}^d$ and $\psi: S\times A \to \mathbb{R}^d$, and an unknown fixed $\theta \in \mathbb{R}^d$, such that for all $(s,a) \in S\times A$, $s' \in S$,
$$p(s'\mid s,a) = \langle \phi(s', s, a), \theta\rangle, \qquad r(s,a) = \langle \psi(s,a), \theta\rangle.$$
From [3], linear mixture MDPs are statistically efficient, and by applying an argument similar to Remark 4.3, the corresponding transition matrices of linear mixture MDPs also enjoy a low-rank factorization. In fact, given a fixed $\pi \in A^S$, by denoting $\Theta \in \mathbb{R}^{d\times|S|^2}$ with $\Theta := \theta\,\mathbf{1}^\top$ and $\Psi \in \mathbb{R}^{d\times|S|^2}$ with columns $\Psi_{ss'} := \phi(s', s, \pi(s))$, we have
$$P^\pi = \Theta^\top \Psi,$$
where $\operatorname{rank}(\Theta) = \operatorname{rank}(\theta\,\mathbf{1}^\top) = 1$ and $\operatorname{rank}(\Psi) \le d$. However, it is not clear whether the $Q^*$ of a linear mixture MDP is linear in a known feature map, where by known we still mean that there is an algorithm to compute it in polynomial time. To see this, observing from the Bellman optimality equation, we have
$$\begin{aligned}
Q^*(s,a) &= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s') \\
&= \langle \psi(s,a), \theta\rangle + \int_S V^*(s')\, p(ds'\mid s,a) \\
&= \Big\langle \psi(s,a) + \int_S V^*(s')\, \phi(ds', s, a),\; \theta\Big\rangle.
\end{aligned} \tag{5.1}$$
This linear form seems attractive. However, we typically have no a priori information on the computational cost of $V^*$ or $Q^*$ themselves, and it is not natural to assume they are polynomial-time computable in the first place. The fact that $Q^*$ is not clearly a linear function of a known feature map suggests that a fundamentally different approach is required to either prove or disprove the computational efficiency of linear mixture MDPs, as we recall from the previous chapter that the linear parameterization of $Q^*$ for linear MDPs [5] played a vital role in updating the weights in the Least-Squares Value Iteration algorithm.
Our future work will continue to investigate whether linear mixture MDPs are computationally efficient, or whether there is a reduction from a well-known computationally hard problem to linear mixture MDPs, which would imply that linear mixture MDPs are in general computationally hard. To construct such a reduction, if possible, we will try to consider the computational problem NASH, which is to find a Nash equilibrium in a $d$-player game. This consideration follows the following "logic": the existence of a mixed Nash equilibrium is implied by the Brouwer fixed point theorem, while the action value function is a fixed point of the Bellman operator; if there is a polynomial-time reduction from the updating step of the Bellman operator to the updating step in the Brouwer fixed point theorem, we are "good".
Bibliography

[1] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F. Yang. "Model-based reinforcement learning with value-targeted regression". In: arXiv preprint arXiv:2006.01107 (2020).

[2] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Christopher Hesse, Rafal Jozefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Ponde de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. "Dota 2 with large scale deep reinforcement learning". In: CoRR (2019).

[3] Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. "Bilinear Classes: A Structural Framework for Provable Generalization in RL". In: arXiv preprint arXiv:2103.10897 (2021).

[4] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. "Is Q-Learning Provably Efficient?" In: Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper_files/paper/2018/file/d3b1fb02964aa64e257f9f26a31f72cf-Paper.pdf.

[5] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. "Provably Efficient Reinforcement Learning with Linear Function Approximation". In: arXiv preprint arXiv:1907.05388 (2019).

[6] Daniel Kane, Sihan Liu, Shachar Lovett, and Gaurav Mahajan. "Computational-Statistical Gaps in Reinforcement Learning". In: arXiv preprint arXiv:2202.05444 (2022).

[7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (2015), pp. 529–533.

[8] Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. "Sample complexity of reinforcement learning using linearly combined model ensembles". In: Conference on Artificial Intelligence and Statistics. 2020.

[9] OpenAI. "GPT-4 Technical Report". In: arXiv preprint arXiv:2303.08774 (2023).

[10] Xipeng Qiu. Neural Networks and Deep Learning. Publishing House of Electronics Industry, 2019. ISBN: 978-7-111-64968-7.

[11] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the game of Go with deep neural networks and tree search". In: Nature 529.7587 (2016), pp. 484–489. DOI: 10.1038/nature16961.

[12] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. "Mastering the game of Go without human knowledge". In: Nature 550.7676 (2017), pp. 354–359. DOI: 10.1038/nature24270.

[13] Michael Sipser. Introduction to the Theory of Computation. 3rd ed. Cengage Learning, 2012. ISBN: 978-1-133-18779-0.

[14] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018. ISBN: 978-0-262-19398-6.

[15] L. G. Valiant and V. V. Vazirani. "NP is as easy as detecting unique solutions". In: Theoretical Computer Science 47 (1986), pp. 85–93. ISSN: 0304-3975. DOI: https://doi.org/10.1016/0304-3975(86)90135-0.

[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need". In: CoRR abs/1706.03762 (2017). arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.

[17] Dongruo Zhou, Jiafan He, and Quanquan Gu. "Provably efficient reinforcement learning for discounted MDPs with feature mapping". In: International Conference on Machine Learning. PMLR, 2021, pp. 12793–12802.
Appendix: Technical Details
Proof of Lemma 2.1. Note that
$$\begin{aligned}
Q^\pi(s,a) &= r(s,a) + \mathbb{E}\Big[\sum_{t=1}^{\infty} r(S_t, \pi(S_t)) \,\Big|\, S_0 = s, A_0 = a\Big] \\
&= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)}\, \mathbb{E}\Big[\sum_{t=1}^{\infty} r(S_t, \pi(S_t)) \,\Big|\, S_1 = s', S_0 = s\Big] \\
&= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)}\, \mathbb{E}\Big[\sum_{t=1}^{\infty} r(S_t, \pi(S_t)) \,\Big|\, S_1 = s'\Big] \qquad \text{by the Markov property (2.1)} \\
&= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s'). \qquad (2)
\end{aligned}$$
Proof of Lemma 2.2. Note that whenever the optimal policy $\pi^*$ exists, the second equation holds automatically by plugging $\pi^*$ into Lemma 2.1.

More generally, when the existence of $\pi^*$ is not assumed: fix $(s,a) \in S\times A$ and observe
$$\begin{aligned}
Q^*(s,a) &:= \sup_{\pi} Q^\pi(s,a) \\
&= \sup_{\pi} \Big[ r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s') \Big] \\
&= r(s,a) + \sup_{\pi} \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s') \\
&\le r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} \sup_{\pi} V^\pi(s') \\
&= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s'). \qquad (3)
\end{aligned}$$
To see the other direction, recall that we set a finite time horizon, so $V^*(s') = \sup_\pi V^\pi(s') \le H$ for all $s' \in S$, and then, by the dominated convergence theorem, for every $\varepsilon > 0$ there exists $\pi \in A^S$ such that
$$\mathbb{E}_{s' \sim p(\cdot\mid s,a)}\Big[ V^*(s') - V^\pi(s') \Big] = \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s') - \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s') < \varepsilon.$$
Hence, we have
$$\begin{aligned}
Q^*(s,a) &\ge Q^\pi(s,a) \\
&= r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^\pi(s') \\
&> r(s,a) + \mathbb{E}_{s' \sim p(\cdot\mid s,a)} V^*(s') - \varepsilon, \qquad (4)
\end{aligned}$$
that is, we have established the second equation.

To show $V^*(s) = \sup_{a \in A} Q^*(s,a)$, note that for any deterministic policy $\pi: S \to A$,
$$V^\pi(s) = Q^\pi(s, \pi(s)),$$
and, by $V^*(s) := \sup_\pi V^\pi(s)$, we are left to show
$$\sup_{\pi} Q^\pi(s, \pi(s)) = \sup_{a \in A} Q^*(s,a).$$
To show the "$\le$" direction, notice that
$$Q^\pi(s, \pi(s)) \le Q^*(s, \pi(s)) \le \sup_{a \in A} Q^*(s,a), \qquad (5)$$
and the direction follows by taking the supremum over $\pi$ on both sides of the inequality.

To show the other direction, if $V^*(s) < \sup_{a \in A} Q^*(s,a)$ for some $s \in S$, then there exists $a_0 \in A$ such that
$$Q^*(s, a_0) > V^*(s),$$
and moreover there exists $\pi \in A^S$ such that
$$Q^\pi(s, a_0) > V^*(s).$$
Therefore, we choose a modified policy $\bar\pi: S \to A$ such that for all $x \in S$,

• $\bar\pi(s) = a_0$;
• $Q^\pi(x, \bar\pi(x)) \ge Q^\pi(x, \pi(x)) = V^\pi(x)$.

We are left to show $V^{\bar\pi}(s) \ge Q^\pi(s, a_0)$. But this is given by the policy improvement theorem [14], together with the Bellman equation (Lemma 2.1). More precisely, we have
$$\begin{aligned}
V^{\bar\pi}(s) - Q^\pi(s, a_0) &= Q^{\bar\pi}(s, \bar\pi(s)) - Q^\pi(s, a_0) \\
&= \mathbb{E}_{s' \sim p(\cdot\mid s, a_0)} \big[ V^{\bar\pi}(s') - V^\pi(s') \big] \\
&\ge 0, \qquad (6)
\end{aligned}$$
since $V^{\bar\pi}(s') \ge V^\pi(s')$ for all $s' \in S$ by the policy improvement theorem. Hence $V^{\bar\pi}(s) \ge Q^\pi(s, a_0) > V^*(s)$, contradicting the definition of $V^*$, and the first equation follows.
Proof of Claim 3.1 i). This claim is shown in two steps.

1. We construct a policy $\pi$ such that $V^\pi(s) = g(l,w)$;
2. Given any other policy $\pi' \in A^S$, we show $V^{\pi'}(s) \le g(l,w)$.

In step 1, we construct the policy $\pi$ as follows: for every non-satisfying state $s$, i.e. $w \ne w^*$, $\pi(s) := a$ is an action such that the Hamming distance $\mathrm{dist}(w, w^*)$ from the current assignment $w$ to the satisfying assignment $w^*$ is decreased by 1. Note that we can always find such an action, because, by definition, all clauses are satisfied by the satisfying assignment.

Therefore, from a state $s = (l,w)$, together with the transition $P$ and the policy $\pi$, the next state $s' := P(s, \pi(s)) = (l+1, w_1)$ satisfies either

1. $w_1 = w^*$; or
2. $w_1$ is on the optimal path (at a level later than $w$), i.e. $\mathrm{dist}(w, w^*) = \mathrm{dist}(w, w_1) + \mathrm{dist}(w_1, w^*)$.

In both cases,
$$V^\pi(s) = \Big(1 - \frac{\mathrm{dist}(w, w_1) + l + \mathrm{dist}(w_1, w^*)}{H + v}\Big)^r = g(l,w).$$

In step 2: for any other policy $\pi'$ that leads $s$ to end at a state $s' = (l', w')$ (that is, either it is in the last layer, $l' = H$, or it reaches the satisfying assignment, $w' = w^*$), we have
$$V^{\pi'}(s) = \Big(1 - \frac{\mathrm{dist}(w', w^*) + l'}{H + v}\Big)^r \le \Big(1 - \frac{\mathrm{dist}(w, w') + l + \mathrm{dist}(w', w^*)}{H + v}\Big)^r \le g(l,w), \qquad (7)$$
where $l' - l \ge \mathrm{dist}(w, w')$ contributes to the first inequality and the triangle inequality
$$\mathrm{dist}(w, w^*) \le \mathrm{dist}(w, w') + \mathrm{dist}(w', w^*)$$
contributes to the second.
Proof of Claim 3.1 ii). This claim is shown in two steps.

1. We can write $V^*(s)$ as a polynomial in $w$ and $w^*$, of degree at most $r$.
2. We construct $\psi$ and $\tilde\psi$ directly from the polynomial in the previous step.

In step 1, we observe that
$$\mathrm{dist}(w, w^*) = \frac{v - \langle w, w^*\rangle}{2},$$
and $g(l,w)$ can automatically be written as a polynomial in $\mathrm{dist}(w, w^*)$ of degree $r$.

For the second step, we construct the vector $\theta$ as follows:

• Let $\mathcal{S}$ be the collection of all multisets $S \subseteq [v]$ with $|S| \le r$;
• for each $S \in \mathcal{S}$, let $\theta_S := \prod_{i \in S} w^*_i$;
• let $\theta := (\theta_S)_{S \in \mathcal{S}}$.

That is, each coordinate $\theta_S$ is a monomial in the coordinates of $w^*$, e.g. $w^*_i w^*_j w^*_k$, where $w^*_i, w^*_j, w^*_k$ are the $i$-th, $j$-th, and $k$-th coordinates of $w^*$, respectively.

Then, note that
$$g(l,w) = \Big(1 - \frac{\mathrm{dist}(w, w^*) + l}{H + v}\Big)^r = \Big(1 - \frac{\frac{v - \langle w, w^*\rangle}{2} + l}{H + v}\Big)^r$$
and $\langle w, w^*\rangle = \sum_{i=1}^v w_i w^*_i$. We set $\psi(s)$ to be the corresponding coefficient vector (each coordinate corresponds to a $\theta_S$), and hence $V^*(s) = \langle \theta, \psi(s)\rangle$ follows. Note that there are at most $\sum_{i=0}^r v^i \le 2v^r$ (when $v$ is large enough) coefficients (for a multiset $S \subseteq [v]$ with $|S| = i$, there are at most $v^i$ possible choices), so we may set the feature dimension to $d = 2v^r$.

Lastly, $\tilde\psi(s,a) := \psi(P(s,a))$, as $P$ is deterministic, and the linear representation of $Q^*$ follows from $Q^*(s,a) = V^*(P(s,a))$.
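The construction above can be checked numerically for small $v$ and $r$. The sketch below (our own illustration, not code from [6]) expands $g(l,w)$ by the binomial theorem, indexing the coordinates by ordered index tuples rather than multisets, which only enlarges the dimension but keeps the same separation: `theta` depends only on $w^*$ and `psi` only on $(l,w)$.

```python
import itertools
from math import comb
import numpy as np

v, r, l = 4, 2, 1
H = v ** r
rng = np.random.default_rng(0)
w_star = rng.choice([-1, 1], size=v)     # stands in for the (unknown) satisfying assignment
w = rng.choice([-1, 1], size=v)          # assignment of the current state s = (l, w)

def g(l, w):
    dist = (v - w @ w_star) / 2          # Hamming distance via the inner-product identity
    return (1 - (dist + l) / (H + v)) ** r

# g(l, w) = (c0 + c1 <w, w*>)^r, expanded into monomials of the coordinates of w*.
c0 = 1 - (v / 2 + l) / (H + v)
c1 = 1 / (2 * (H + v))
theta, psi = [], []
for j in range(r + 1):
    for idx in itertools.product(range(v), repeat=j):
        theta.append(np.prod(w_star[list(idx)]))                                  # only w*
        psi.append(comb(r, j) * c0 ** (r - j) * c1 ** j * np.prod(w[list(idx)]))  # only (l, w)

print(np.isclose(np.dot(theta, psi), g(l, w)))   # True: V*(s) = <theta, psi(s)>
```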
Proof of Claim 3.2. That $\mathcal{A}_{\mathrm{SAT}}$ always returns NO on an unsatisfiable formula is clear. We are left to consider satisfiable formulas.

Let $\varphi$ be a satisfiable formula. In the MDP oracle $M_\varphi$, we make the following observation: assuming large enough $v$ and $r$, the rewards collected on the last layer and the optimal value function $V^*$ are separated.

More specifically,

• the reward is comparatively small on the last layer:
$$\mathbb{E}[R] = \Big(1 - \frac{\mathrm{dist}(w, w^*) + H}{H + v}\Big)^r \le \Big(\frac{v}{H + v}\Big)^r \le v^{r - r^2} < \frac{1}{4};$$
• $V^*$ is comparatively large:
$$V^* = \Big(1 - \frac{l}{H + v}\Big)^r \ge \Big(1 - \frac{v}{H + v}\Big)^r = \Big(1 + \frac{v}{v^r}\Big)^{-r} \ge 1 + (-r)\frac{v}{v^r} \ge \frac{1}{2},$$
where $l \le v$ contributes to the first inequality (the action choice follows the proof of Claim 3.1, and $l \le v$ follows), the second follows from Bernoulli's inequality, and the third is elementary calculus bounding.

Therefore, if $\mathcal{A}_{\mathrm{RL}}$ returns a policy $\pi$ that produces a large value function $V^\pi$, i.e.
$$\sup_{s \in S} |V^\pi - V^*| < \frac{1}{4},$$
then the policy has to lead to a satisfying state, and $\mathcal{A}_{\mathrm{SAT}}$ will return YES by its construction.
Proof of Claim 3.3. Let

• $\mathbb{P}_{M_\varphi}$ be the joint probability measure on the generated policies and observed rewards when $\mathcal{A}_{\mathrm{RL}}$ has access to $M_\varphi$, and $\mathbb{P}_{\overline{M}_\varphi}$ the joint probability measure when $\mathcal{A}_{\mathrm{RL}}$ has access to the simulator $\overline{M}_\varphi$;
• $R_i$ be the reward collected along the $i$-th trajectory (recall that reward is only granted on the last layer or on the satisfying state);
• $T$ denote, after finishing running $\mathcal{A}_{\mathrm{RL}}$ with access to $M_\varphi$, the number of trajectories (each within the episode length $H$).

Since, by assumption, $\mathcal{A}_{\mathrm{RL}}$ runs $v^{r^2/4}$ steps, we have
$$T \le v^{r^2/4}.$$
As in the proof of Claim 3.2, assuming large enough $r$ and $v$, the expected reward in the MDP $M_\varphi$ is comparatively small on the last layer,
$$\mathbb{E}[R] \le v^{-r^2/2},$$
and states on the last layer are visited by $\mathcal{A}_{\mathrm{RL}}$ at most $v^{r^2/4}$ times, so with high probability the rewards on the last layer are all zero.

More precisely, with large enough $r, v$ still assumed,
$$\mathbb{P}_{M_\varphi}(\{R_i = 0, \forall i \in [T]\}) = 1 - \mathbb{P}_{M_\varphi}(\{\exists i \in [T], R_i \ne 0\}) \ge 1 - v^{-r^2/4} \ge \frac{4}{5}, \qquad (8)$$
where the first inequality in (8) follows from
$$\mathbb{P}_{M_\varphi}(\{\exists i \in [T], R_i \ne 0\}) = \mathbb{P}_{M_\varphi}\Big(\bigcup_{i \in [T]} \{R_i \ne 0\}\Big) \le \sum_{i \in [T]} \mathbb{P}_{M_\varphi}(\{R_i \ne 0\}) \le T\, v^{-r^2/2} \le v^{r^2/4}\, v^{-r^2/2} = v^{-r^2/4} \qquad (9)$$
and
$$\mathbb{P}_{M_\varphi}(\{R_i \ne 0\}) = \mathbb{E}_{M_\varphi}[R_i] \le v^{-r^2/2}$$
by the proof of Claim 3.2 and the nature of the Bernoulli distribution, and the second inequality in (8) is a general bound assuming $v$ is large enough.

Let

• $S_{\mathcal{A}_{\mathrm{RL}}, M_\varphi}$ (or $S_{\mathcal{A}_{\mathrm{RL}}, \overline{M}_\varphi}$) be the event that $\mathcal{A}_{\mathrm{RL}}$, with access to $M_\varphi$ (or $\overline{M}_\varphi$), after running at most $v^{r^2/4}$ steps, returns a policy $\pi$ that satisfies $\sup_{s \in S} |V^\pi - V^*| < \tfrac{1}{4}$ ($S$ is short for "succeeds");
• $V_{[T]}$ be the event that the reward collected on the last layer is zero for all trajectories in $[T]$ ($V$ is short for "void").

It is assumed that
$$\mathbb{P}_{M_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, M_\varphi}) = \frac{9}{10},$$
and from the previous reasoning we know
$$\mathbb{P}_{M_\varphi}(V_{[T]}) \ge \frac{4}{5}.$$
Therefore, together we have
$$\mathbb{P}_{M_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, M_\varphi} \mid V_{[T]}) = \frac{\mathbb{P}_{M_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, M_\varphi} \cap V_{[T]})}{\mathbb{P}_{M_\varphi}(V_{[T]})} \ge \frac{\frac{9}{10} - \frac{1}{5}}{\frac{4}{5}} = \frac{7}{8}, \qquad (10)$$
where the inequality follows by
$$\mathbb{P}(A \cap B) = 1 - \mathbb{P}(A^c \cup B^c) \ge 1 - [\mathbb{P}(A^c) + \mathbb{P}(B^c)].$$
We remark that, since the only difference between $M_\varphi$ and the simulator $\overline{M}_\varphi$ is on the last layer, the marginal distributions of $\mathbb{P}_{M_\varphi}$ and $\mathbb{P}_{\overline{M}_\varphi}$ on policies, conditioned on $V_{[T]}$, coincide. That is, we have
$$\mathbb{P}_{M_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, M_\varphi} \mid V_{[T]}) = \mathbb{P}_{\overline{M}_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, \overline{M}_\varphi} \mid V_{[T]}).$$
Since $\mathbb{P}_{\overline{M}_\varphi}(V_{[T]}) = 1$ by construction of the simulator, we conclude that
$$\mathbb{P}_{\overline{M}_\varphi}(S_{\mathcal{A}_{\mathrm{RL}}, \overline{M}_\varphi}) \ge \frac{7}{8},$$
i.e. with probability at least $7/8$, $\mathcal{A}_{\mathrm{RL}}$, with access to the simulator $\overline{M}_\varphi$, after running at most $v^{r^2/4}$ steps, returns a policy $\pi$ that satisfies $\sup_{s \in S} |V^\pi - V^*| < \tfrac{1}{4}$.